I have this list:
list="aa bb cc dd ee ff ab cd ef"
What I'm trying so far:
$ grep -o "^[^cd]*" <<<"$list"
aa bb
Expected output:
$ grep -o "^[^cd]*" <<<"$list"
aa bb cc dd ee ff ab
Assumptions:
truncate list at the substring matching cd
We can use parameter substitution to extract the desired substring without calling a separate binary (e.g., grep, sed).
Expanding OP's sample string to include a 2nd cd:
list="aa bb cc dd ee ff ab cd ef cd xx yy"
                           ^^    ^^
NOTE: for display purposes I'm adding a pair of colons to highlight the start/end of the result
Truncating the string at the first occurrence of cd:
$ echo ":${list/cd*/}:"
:aa bb cc dd ee ff ab :
                     ^ trailing space
Truncating the string at the first occurrence of cd (include the leading space):
$ echo ":${list/ cd*/}:"
:aa bb cc dd ee ff ab:
An alternative that truncates the string at the first occurrence of cd:
$ echo ":${list%% cd*}:"
:aa bb cc dd ee ff ab:
Truncating the string at the last occurrence of cd:
$ echo ":${list% cd*}:"
:aa bb cc dd ee ff ab cd ef:
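The expansions above differ only in whether the shortest or the longest matching suffix is removed. A minimal sketch that puts the two side by side and also shows the no-match case, assuming bash; the helper name truncate_at is hypothetical and not part of the answer:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the answer): cut a string at the first
# or last occurrence of " <word>", using only parameter expansion.
truncate_at() {
    local str=$1 word=$2 which=${3:-first}
    if [ "$which" = last ]; then
        printf '%s\n' "${str% "$word"*}"    # % removes the shortest matching suffix
    else
        printf '%s\n' "${str%% "$word"*}"   # %% removes the longest matching suffix
    fi
}

list="aa bb cc dd ee ff ab cd ef cd xx yy"
first=$(truncate_at "$list" cd first)
last=$(truncate_at "$list" cd last)
nomatch=$(truncate_at "$list" zz)   # pattern not found: string comes back unchanged

echo ":$first:"
echo ":$last:"
echo ":$nomatch:"
```

Quoting "$word" inside the expansion makes it match literally, so a word containing * or ? is not treated as a glob.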
Bash substitution, your solution is perfect! I just wonder why you used: ":${list/cd*/}:" instead of: "${list/cd*/}"
If you're using GNU grep, you can do:
$ grep -P -o '^.*?(?=cd)' <<<"$list"
aa bb cc dd ee ff ab
This uses the -P flag for Perl-compatible regular expressions [sic!]. This gives us two features:
(?=cd) looks ahead, so we match from the beginning of the line until the last character before the first occurrence of cd.
.*? makes grep perform a non-greedy match, so it matches up to the first and not the last cd.
Note that your example output doesn't have the trailing space. If you want to exclude the space, you can add it to the pattern:
$ grep -P -o '^.*?(?= cd)' <<<"$list"
aa bb cc dd ee ff ab
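To see what the non-greedy quantifier changes, here is a small sketch that runs the greedy and the lazy form on a string with two occurrences of cd. It assumes GNU grep built with PCRE support; the sample string is made up for illustration:

```shell
#!/usr/bin/env bash
# Assumes GNU grep built with PCRE support (needed for -P).
list="aa cd bb cd cc"

# Greedy .* runs up to the last " cd"; lazy .*? stops at the first one.
greedy=$(grep -P -o '^.*(?= cd)' <<<"$list")
lazy=$(grep -P -o '^.*?(?= cd)' <<<"$list")

echo "greedy: $greedy"
echo "lazy:   $lazy"
```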
Delete the longest suffix cd.* if one exists:
$ sed 's/cd.*$//g' <<<"$list"
aa bb cc dd ee ff ab
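The sed command above leaves the space that precedes cd in its output. A possible variant, offered as a sketch rather than as the answerer's own command, also removes any spaces directly before cd:

```shell
#!/usr/bin/env bash
list="aa bb cc dd ee ff ab cd ef"

# ' *' also consumes the spaces before cd. The g flag is not needed,
# because .* already runs to the end of the line.
trimmed=$(sed 's/ *cd.*//' <<<"$list")

echo ":$trimmed:"   # colons make any leftover trailing space visible
```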
You can’t; you need lookahead. (Thanks to @ilkkachu for pointing out the mistake I made earlier.)
If you had any grep command like
grep -o -E '<pattern>' that worked here, then it would have to treat these strings differently:
abccce → abccc
abcccd → abcc
The grep command would have to build a DFA that can land on a final state after reading either abcc or abccc. It would then have to return abccc both times, because that's the longest prefix where it lands on a final state. But then it would give the wrong answer for abcccd.
-P (perl expression).
-o isn't portable, so if you're using grep -o that is already far less portable than grep's -E option which, unlike -o, is part of POSIX.
cd, that ^(c[^d]|[^c])* fails on an input containing ccd, as the [^d] will accept the second c. e.g. echo 'abc ccd foo' | grep -o -E '^(c[^d]|[^c])*' will just print the whole input
cd, there's the additional issue that a lone c at the end will be dropped, as the c[^d] requires something after a c. e.g. echo 'abc c' | grep -o -E '^(c[^d]|[^c])*' will print just abc
abccce is abccc, but the right answer to abcccd is abcc.
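The two failing inputs quoted in the comments above can be reproduced directly. This sketch assumes a grep that supports -o (GNU grep does) and wraps the results in colons so that a trailing space is visible:

```shell
#!/usr/bin/env bash
# Assumes a grep that supports -o (GNU grep does).
pattern='^(c[^d]|[^c])*'

# c[^d] accepts the second c of "ccd", so the cd is never detected.
ccd=$(echo 'abc ccd foo' | grep -o -E "$pattern")

# A lone c at the end of the line has nothing after it for [^d] to match.
lone=$(echo 'abc c' | grep -o -E "$pattern")

echo ":$ccd:"
echo ":$lone:"
```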
If the point is to get the subset of a list that runs from the first element to the one before the first occurrence of a given value, then I'd do instead (in zsh, not bash):
list=(
first second '3rd with spaces' '4th with cd and
newline' $'5th\0with\0nuls\0' '' etc etc cd more elements
)
subset=( "${(@)list[1,list[(i)cd]-1]}" )
typeset -p1 subset
Which here gives:
typeset -a subset=(
first
second
'3rd with spaces'
$'4th with cd and\nnewline'
$'5th\C-@with\C-@nuls\C-@'
''
etc
etc
)
Which allows you to work with arbitrary lists of elements.
You can simplify it to subset=( $list[1,list[(i)cd]-1] ) if you don't have to account for empty elements (quotes combined with the @ parameter expansion flag are there to preserve those empty elements like in the Bourne shell's "$@"). \C-@ is a representation of the NUL aka \0 aka ^@ character, \n (same as \C-J) a representation of newline aka LF aka \12.
$list[(i)cd] (or just list[(i)cd] inside an arithmetic expression like here) expands to the index of the first element matching the cd pattern. If there's no match, it expands to 1 plus the size of the list, so $list[1,that-1] would give you the whole list.
To modify the list in-place, you could also perform this array slice assignment:
list[(r)cd,-1]=()
Where the (r) subscript flag is for reverse-subscripting to refer to the first element that matches the pattern.
Then:
$ typeset -p1 list
typeset -a list=(
first
second
'3rd with spaces'
$'4th with cd and\nnewline'
$'5th\C-@with\C-@nuls\C-@'
''
etc
etc
)
In bash (or in zsh, from which bash copied the array=() and array+=() syntax, though not most of the other array operators, the rest of bash's array design being modeled more on that of ksh), you can always use a loop:
list=(
first second '3rd with spaces' '4th with cd and
newline' "bash doesn't support nuls" '' etc etc cd more elements
)
subset=()
for i in "${list[@]}"; do
[ "$i" = cd ] && break
subset+=( "$i" )
done
typeset -p subset
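A variation on the same loop, as a sketch only: find the index of the first cd element, then take an array slice. It assumes bash and a non-sparse array (indices 0 to n-1); when there is no match the index defaults to the array length, so the whole list is kept, much like the zsh (i) subscript flag described earlier.

```shell
#!/usr/bin/env bash
list=( first second '3rd with spaces' '' etc etc cd more elements )

# Find the index of the first element equal to cd; default to the array
# length so that "no match" keeps the whole list.
idx=${#list[@]}
for i in "${!list[@]}"; do
    if [ "${list[i]}" = cd ]; then
        idx=$i
        break
    fi
done

# Offset and length are arithmetic contexts, so idx needs no $.
subset=( "${list[@]:0:idx}" )
typeset -p subset
```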
In your approach, your $list is not really a list as that's a scalar variable which holds a single string value, not an array variable.
[^cd] matches any character other than c or d, so [^cd]* stops before the first character that is either c or d.
If we assume that $list contains a list of space separated words that cannot contain spaces or newlines, then assuming GNU grep built with PCRE2 (or formerly PCRE) support (-o is already a GNU extension anyway), you could do:
$ list='aa bb abcde cc dd ee ff ab cd ef'
$ LC_ALL=C grep -Po '(?x) ^ .*? (?= [ ]* (?<! [^ ] ) cd (?! [^ ] ) )' <<< "$list"
aa bb abcde cc dd ee ff ab
Where:
(?x) enables PCRE2_EXTENDED, which allows us to add some spacing in the regexp to improve legibility (slightly).
^ matches at the start of the subject. For grep, that's the line.
. matches any single character. With LC_ALL=C, that's any single byte, regardless of whether the bytes would form valid characters in the user's locale.
*? is the non-greedy version of *, which matches 0 or more of the preceding atom (.), but as few as possible.
(?= regexp ) is a positive look-ahead operator: it matches at a given spot provided what follows matches the regexp.
(?<! [^ ] ) and (?! [^ ] ) are respectively negative look-behind and look-ahead operators. Here we want a cd that is neither preceded nor followed by a non-space character, as we don't want to match on the cd in abcde, for instance.
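Under the same assumption (space-separated words that contain no spaces or newlines), the whole-word match can also be sketched without PCRE by splitting the string in bash. This is an alternative technique, not the grep approach above, and unlike grep it collapses runs of spaces into one:

```shell
#!/usr/bin/env bash
list='aa bb abcde cc dd ee ff ab cd ef'

# Split on whitespace, then stop at the first word that is exactly cd.
# abcde is kept because the comparison is against the whole word.
read -r -a words <<<"$list"
kept=()
for w in "${words[@]}"; do
    [ "$w" = cd ] && break
    kept+=( "$w" )
done

result="${kept[*]}"   # joins with the first character of IFS (a space by default)
echo "$result"
```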