Revisions to grep capture from beginning until first 2 chars found

added 484 characters in body

edited May 26 at 7:12

In bash (which copied the array=() and array+=() syntax from zsh, but few of its other array operators, the rest of its array design being modelled on that of ksh), or in zsh itself, you can always use a loop:

list=(
  first second '3rd with spaces' '4th with cd and
newline' "bash doesn't support nuls" '' etc etc cd more elements
)
subset=()
for i in "${list[@]}"; do
  [ "$i" = cd ] && break
  subset+=( "$i" )
done
typeset -p subset

In your approach, $list is not really a list: it is a scalar variable holding a single string value, not an array variable.

$ list='aa bb abcde cc dd ee ff ab cd ef'
$ LC_ALL=C grep -Po '(?x) ^ .*? (?= [ ]* (?<! [^ ] ) cd (?! [^ ] ) )' <<< "$list"
aa bb abcde cc dd ee ff ab
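If the scalar really is a list of space-separated words with no embedded spaces or newlines, one possible way to get a real array out of it in bash is to split it with read -a and then reuse the same loop. This is only a sketch under that assumption; the variable names words and w are made up for the illustration.

```shell
#!/usr/bin/env bash
list='aa bb abcde cc dd ee ff ab cd ef'

# Split the scalar on spaces into a real array (assumes words contain
# neither spaces nor newlines). IFS is only changed for the read command.
IFS=' ' read -r -a words <<< "$list"

subset=()
for w in "${words[@]}"; do
  [ "$w" = cd ] && break   # stop at the first element that is exactly "cd"
  subset+=( "$w" )
done

# Join the elements with spaces for display.
printf '%s\n' "${subset[*]}"
```

Because the comparison is done per element, abcde is not mistaken for cd.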

added 1518 characters in body

edited May 26 at 5:52

Stéphane Chazelas

In your approach, $list is not really a list: it is a scalar variable holding a single string value, not an array variable.

[^cd] matches any character other than c or d, so [^cd]* stops before the first character that is either c or d.
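The same effect can be seen without grep. Shell patterns use bracket expressions much like regexps do, so as a rough illustration (using bash parameter expansion rather than the grep command from the question), cutting the string at the first character that is c or d gives:

```shell
#!/usr/bin/env bash
list='aa bb abcde cc dd ee ff ab cd ef'

# ${var%%pattern} removes the longest suffix matching the pattern.
# [cd]* is "a c or a d, followed by anything", so the cut happens at the
# first c or d found in the string, here the c of abcde.
prefix=${list%%[cd]*}

printf '<%s>\n' "$prefix"   # prints <aa bb ab>
```

The result stops inside abcde, because a bracket expression describes a set of single characters, not the two-character string cd.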

If we assume that $list contains a list of space separated words that cannot contain spaces or newlines, then assuming GNU grep built with PCRE2 (or formerly PCRE) support (-o is already a GNU extension anyway), you could do:

$ list='aa bb abcde cc dd ee ff ab cd ef'
$ LC_ALL=C grep -Po '(?x) ^ .*? (?= (?<! [^ ] ) cd (?! [^ ] ) )' <<< "$list"
aa bb abcde cc dd ee ff ab

Where:

(?x) enables PCRE2_EXTENDED which allows us to add some spacing in the regexp to improve legibility (slightly).

^ matches at the start of the subject. For grep, that's the line.

. matches any single character. With LC_ALL=C, that's any single byte, regardless of whether those bytes would form valid characters in the user's locale.

*? is the non-greedy version of *, which matches 0 or more of the preceding atom (.), but as few as possible.

(?= regexp ) is a positive look-ahead operator. It matches at a given position provided what follows matches the regexp.

(?<! [^ ] ) and (?! [^ ]) are respectively negative look behind and look ahead operators. Here we want a cd that is neither preceded nor followed by non-space characters as we don't want to match on the cd in abcde for instance.
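Where grep is not built with PCRE support, the "cd as a whole word" condition can be approximated in bash alone. The following is a different technique from the look-around operators above, not a translation of them: it pads the string with a space on each side so that every word, including the first and last, is delimited by spaces, then cuts at the first " cd ". It assumes words are separated by exactly one space; word, padded and prefix are names chosen for the sketch.

```shell
#!/usr/bin/env bash
list='aa bb abcde cc dd ee ff ab cd ef'
word=cd   # the value to stop at (hypothetical variable name)

# Pad with spaces so that the first and last words are also space-delimited.
padded=" $list "

if [[ $padded == *" $word "* ]]; then
  # Remove the longest suffix starting at the first " cd ".
  prefix=${padded%%" $word "*}
  # Drop the leading space added by the padding.
  prefix=${prefix#" "}
else
  # No whole-word match: keep everything.
  prefix=$list
fi

printf '<%s>\n' "$prefix"
```

Here abcde is skipped because the c in it is not preceded by a space, which is the same condition the negative look-behind expresses.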

answered May 26 at 5:30

Stéphane Chazelas

If the point is to get the subset of a list that runs from the first element to the one before the first occurrence of a given value, then I'd do instead (in zsh, not bash):

list=(
  first second '3rd with spaces' '4th with cd and
newline' $'5th\0with\0nuls\0' '' etc etc cd more elements
)
subset=( "${(@)list[1,list[(i)cd]-1]}" )
typeset -p1 subset

Which here gives:

typeset -a subset=(
  first
  second
  '3rd with spaces'
  $'4th with cd and\nnewline'
  $'5th\C-@with\C-@nuls\C-@'
  ''
  etc
  etc
)

Which allows you to work with arbitrary lists of elements.

You can simplify it to subset=( $list[1,list[(i)cd]-1] ) if you don't have to account for empty elements (quotes combined with the @ parameter expansion flag are there to preserve those empty elements like in the Bourne shell's "$@"). \C-@ is a representation of the NUL aka \0 aka ^@ character, \n (same as \C-J) a representation of newline aka LF aka \12.

$list[(i)cd] (or just list[(i)cd] inside an arithmetic expression like here) expands to the index of the first element matching the cd pattern. If there's no match, it expands to 1 plus the size of the list, so $list[1,that-1] would give you the whole list.
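bash has no equivalent of the (i) subscript flag, but the "index of first match, or one past the end if there is none" behaviour can be imitated with a small helper. This is a sketch only: first_index is a hypothetical function name, it compares strings exactly instead of matching a pattern, and it returns 0-based offsets as bash arrays are 0-indexed (zsh's are 1-indexed). It also assumes a non-sparse array.

```shell
#!/usr/bin/env bash
list=(first second '3rd with spaces' '' etc etc cd more elements)

# first_index VALUE ELEMENT...
# Prints the 0-based position of the first element equal to VALUE,
# or the number of elements if there is no such element.
first_index() {
  local needle=$1 n=0 e
  shift
  for e in "$@"; do
    [ "$e" = "$needle" ] && break
    n=$((n + 1))
  done
  printf '%s\n' "$n"
}

idx=$(first_index cd "${list[@]}")

# Offset 0, length idx: everything before the first match.
subset=( "${list[@]:0:idx}" )
typeset -p subset
```

When the value is absent, the helper prints the size of the array, so the slice yields the whole list, mirroring what the zsh expression does.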

To modify the list in-place, you could also perform this array slice assignment:

list[(r)cd,-1]=()

Where the (r) subscript flag is for reverse-subscripting to refer to the first element that matches the pattern.

Then:

$ typeset -p1 list
typeset -a list=(
  first
  second
  '3rd with spaces'
  $'4th with cd and\nnewline'
  $'5th\C-@with\C-@nuls\C-@'
  ''
  etc
  etc
)
