grep capture from beginning until first 2 chars found

Question

I have this list:

list="aa bb cc dd ee ff ab cd ef"

What I've tried so far:

$ grep -o "^[^cd]*" <<<"$list"
aa bb

Expected output:

$ grep -o "^[^cd]*" <<<"$list"
aa bb cc dd ee ff ab

markp-fuso · Accepted Answer · 2025-05-26 00:55:53Z

7

Assumptions:

truncate list at the substring matching cd

We can use parameter expansion (a.k.a. parameter substitution) to extract the desired substring without the need to call a separate binary (e.g., grep, sed).

Expanding OP's sample string to include a 2nd cd:

list="aa bb cc dd ee ff ab cd ef cd xx yy"
                           ^^    ^^

NOTE: for display purposes I'm adding a pair of colons to highlight the start/end of the result

Truncating the string at the first occurrence of cd:

$ echo ":${list/cd*/}:"
:aa bb cc dd ee ff ab :
                     ^     trailing space

Truncating the string at the first occurrence of cd (including the leading space):

$ echo ":${list/ cd*/}:"
:aa bb cc dd ee ff ab:

An alternative that truncates the string at the first occurrence of cd:

$ echo ":${list%% cd*}:"
:aa bb cc dd ee ff ab:

Truncating the string at the last occurrence of cd:

$ echo ":${list% cd*}:"
:aa bb cc dd ee ff ab cd ef:
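
If this is needed in more than one place, the %% form can be wrapped in a small function. This is only a sketch: the name truncate_at is made up for illustration, and it assumes the marker should be matched literally (quoting it inside the expansion turns off glob matching).

```shell
# Hypothetical helper (the name is not from the answer): print $1 cut off
# just before the first occurrence of the literal string $2.
truncate_at() {
  # %% removes the longest suffix matching the pattern, i.e. everything
  # from the first occurrence of $2 onwards. Quoting "$2" disables globbing.
  printf '%s\n' "${1%%"$2"*}"
}

list="aa bb cc dd ee ff ab cd ef cd xx yy"
first=$(truncate_at "$list" " cd")   # cut at the first " cd"
none=$(truncate_at "$list" " zz")    # no match: the string is left unchanged
echo ":$first:"
echo ":$none:"
```

When the marker does not occur, the expansion has nothing to remove and the whole string is printed.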

edited May 26 at 0:55

answered May 25 at 22:18

markp-fuso


I almost forgot about the Bash substitution, your solution is perfect! I just wonder why you used: ":${list/cd*/}:" instead of: "${list/cd*/}"

Zero
– Zero

2025-05-26 06:39:24 +00:00
Commented May 26 at 6:39
1

@Zero, they noted they added a pair of colons for display purposes (to make the final space more visible)

ilkkachu
– ilkkachu

2025-05-26 06:48:13 +00:00
Commented May 26 at 6:48


wobtax · Accepted Answer · 2025-05-26 14:06:48Z

7

With Perl-compatible regular expressions

If you're using GNU grep, you can do:

$ grep -P -o '^.*?(?=cd)' <<<"$list"
aa bb cc dd ee ff ab

This uses the -P flag for Perl-compatible regular expressions [sic!]. This gives us two features:

The (?=cd) looks ahead, so we match from the beginning of the line until the last character before the first occurrence of cd.
The question mark in *? makes grep perform a non-greedy match, so it matches up to the first and not the last cd.

Note that your example output doesn't have the trailing space. If you want to exclude the space, you can add it to the pattern:

$ grep -P -o '^.*?(?= cd)' <<<"$list"
aa bb cc dd ee ff ab
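
To see what the non-greedy quantifier changes, the two forms can be compared on a string that contains cd twice. This is a sketch that assumes GNU grep built with PCRE support; the extended sample string and the variable names are only for illustration.

```shell
# Assumes GNU grep built with PCRE support (needed for -P).
list="aa bb cc dd ee ff ab cd ef cd xx yy"

# Greedy .* runs to the LAST position that is followed by "cd".
greedy=$(printf '%s\n' "$list" | grep -P -o '^.*(?=cd)')
# Non-greedy .*? stops at the FIRST position that is followed by "cd".
lazy=$(printf '%s\n' "$list" | grep -P -o '^.*?(?=cd)')

# Colons make the trailing space visible.
echo ":$greedy:"
echo ":$lazy:"
```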

With sed instead of grep

Delete the longest suffix cd.* if one exists:

$ sed 's/cd.*$//g' <<<"$list"
aa bb cc dd ee ff ab
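
The sed output above still ends with the space that preceded cd. A possible variant, sketched here on the assumption that spaces directly before cd should be dropped as well:

```shell
list="aa bb cc dd ee ff ab cd ef cd xx yy"
# " *" also consumes any spaces directly before the first "cd";
# the g flag is not needed because .* already runs to the end of the line.
trimmed=$(printf '%s\n' "$list" | sed 's/ *cd.*//')
# Colons show that no trailing space is left.
echo ":$trimmed:"
```

Because sed picks the leftmost match and .* extends it to the end of the line, this cuts at the first cd even when there are several.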

What about basic or extended regular expressions?

You can’t; you need lookahead. (Thanks to @ilkkachu for pointing out the mistake I made earlier.)

If you had any grep command like grep -o -E '<pattern>' that worked here, then it would have to treat these strings differently:

abccce → abccc
abcccd → abcc

The grep command would have to build a DFA that can land on a final state after reading either abcc or abccc. It would then have to return abccc both times, because that's the longest prefix where it lands on a final state. But then it would give the wrong answer for abcccd.
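
A quick way to see the abcccd case, assuming a grep that supports -o. The ERE below is only an illustrative attempt at "longest prefix that contains no cd"; it is not taken from the answer, and the variable names are made up.

```shell
s="abcccd"
# Hypothetical ERE (not from the answer): longest prefix containing no "cd".
no_cd='^([^c]|c+[^cd])*c*'
ere=$(printf '%s\n' "$s" | grep -o -E "$no_cd")
# What was actually asked for: everything before the first "cd".
want=${s%%cd*}
echo "ERE match: $ere"
echo "wanted:    $want"
```

The ERE keeps the c that belongs to cd, because without lookahead it has no way to give that character back.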

edited May 26 at 14:06

answered May 25 at 20:04

wobtax


Thanks! Your solution does work, but since you mentioned it, I would like to see a more portable solution: I need to achieve this without -P (Perl-compatible regular expressions).

Zero
– Zero

2025-05-25 20:21:29 +00:00
Commented May 25 at 20:21
3

@Zero why? Grep isn't a good tool for this, so any solution using grep is going to be inelegant, as you said. So why insist on grep? If you have constraints on what you can use, please edit your question and explain them. By the way, -o isn't portable, so if you're using grep -o that is already far less portable than grep's -E option which, unlike -o, is part of POSIX.

terdon
– terdon ♦

2025-05-25 23:13:46 +00:00
Commented May 25 at 23:13
1

note that, if the intent is to stop at the first cd, that ^(c[^d]|[^c])* fails on an input containing ccd, as the [^d] will accept the second c. e.g. echo 'abc ccd foo' | grep -o -E '^(c[^d]|[^c])*' will just print the whole input

ilkkachu
– ilkkachu

2025-05-26 06:46:31 +00:00
Commented May 26 at 6:46
1

Also, if the intent is to print the whole input in case there is no cd, there's the additional issue that a lone c at the end will be dropped, as the c[^d] requires something after a c. e.g. echo 'abc c' | grep -o -E '^(c[^d]|[^c])*' will print just abc

ilkkachu
– ilkkachu

2025-05-26 06:53:12 +00:00
Commented May 26 at 6:53
1

@ilkkachu D’oh! You’re so right, and now I’m thinking it’s not even possible to extract that prefix just with a true regular expression and no lookahead—because the right answer to abccce is abccc, but the right answer to abcccd is abcc.

wobtax
– wobtax

2025-05-26 13:45:11 +00:00
Commented May 26 at 13:45


Stéphane Chazelas · Accepted Answer · 2025-05-26 07:12:42Z

If the point is to get the subset of a list that runs from the first element to the one before the first occurrence of a given value, then I'd do instead (in zsh, not bash):

list=(
  first second '3rd with spaces' '4th with cd and
newline' $'5th\0with\0nuls\0' '' etc etc cd more elements
)
subset=( "${(@)list[1,list[(i)cd]-1]}" )
typeset -p1 subset

Which here gives:

typeset -a subset=(
  first
  second
  '3rd with spaces'
  $'4th with cd and\nnewline'
  $'5th\C-@with\C-@nuls\C-@'
  ''
  etc
  etc
)

Which allows you to work with arbitrary lists of elements.

You can simplify it to subset=( $list[1,list[(i)cd]-1] ) if you don't have to account for empty elements (quotes combined with the @ parameter expansion flag are there to preserve those empty elements like in the Bourne shell's "$@"). \C-@ is a representation of the NUL aka \0 aka ^@ character, \n (same as \C-J) a representation of newline aka LF aka \12.

$list[(i)cd] (or just list[(i)cd] inside an arithmetic expression like here) expands to the index of the first element matching the cd pattern. If there's no match, it expands to 1 plus the size of the list, so $list[1,that-1] would give you the whole list.

To modify the list in-place, you could also perform this array slice assignment:

list[(r)cd,-1]=()

Where the (r) subscript flag is for reverse-subscripting to refer to the first element that matches the pattern.

Then:

$ typeset -p1 list
typeset -a list=(
  first
  second
  '3rd with spaces'
  $'4th with cd and\nnewline'
  $'5th\C-@with\C-@nuls\C-@'
  ''
  etc
  etc
)

In bash (or in zsh, from which bash copied the array=() and array+=() syntax but not most of the other array operators, the rest of bash's array design being inspired more by ksh's), you can always use a loop:

list=(
  first second '3rd with spaces' '4th with cd and
newline' "bash doesn't support nuls" '' etc etc cd more elements
)
subset=()
for i in "${list[@]}"; do
  [ "$i" = cd ] && break
  subset+=( "$i" )
done
typeset -p subset
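
The index-then-slice idea from the zsh version can be approximated in bash. The sketch below is bash only; the helper name first_index and the global variable idx are made up for illustration. Like zsh's (i) flag, it falls back to "one past the end" when nothing matches, so the slice is then the whole list.

```shell
# bash only: arrays are not available in plain POSIX sh.
# Hypothetical helper: set $idx to the 0-based position of the first element
# equal to $1 among the remaining arguments, or to their count if none match.
first_index() {
  local needle=$1 elem
  shift
  idx=0
  for elem in "$@"; do
    [ "$elem" = "$needle" ] && return 0
    idx=$((idx + 1))
  done
  return 0
}

list=(first second '3rd with spaces' '' etc etc cd more elements)
first_index cd "${list[@]}"
subset=( "${list[@]:0:idx}" )   # elements 0 .. idx-1, empty ones preserved
typeset -p subset
```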

In your approach, your $list is not really a list as that's a scalar variable which holds a single string value, not an array variable.

[^cd] matches any character other than c or d, so [^cd]* stops before the first character that is either c or d.
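
A small demonstration of that behaviour, using the string from the question. The shell-pattern line is an equivalent added for comparison, not something from the question.

```shell
list="aa bb cc dd ee ff ab cd ef"
# [^cd]* consumes characters until the first "c" OR "d", here the c of "cc".
by_grep=$(printf '%s\n' "$list" | grep -o '^[^cd]*')
# Same effect with a shell pattern: drop everything from the first c or d.
by_shell=${list%%[cd]*}
# Colons make the trailing space visible.
echo ":$by_grep:"
echo ":$by_shell:"
```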

If we assume that $list contains a list of space separated words that cannot contain spaces or newlines, then assuming GNU grep built with PCRE2 (or formerly PCRE) support (-o is already a GNU extension anyway), you could do:

$ list='aa bb abcde cc dd ee ff ab cd ef'
$ LC_ALL=C grep -Po '(?x) ^ .*? (?= [ ]* (?<! [^ ] ) cd (?! [^ ] ) )' <<< "$list"
aa bb abcde cc dd ee ff ab

Where:

(?x) enables PCRE2_EXTENDED which allows us to add some spacing in the regexp to improve legibility (slightly).
^ matches at the start of the subject. For grep, that's the line.
. matches any single character. With LC_ALL=C, that's any single byte, regardless of whether they would form part of valid characters or not in the user's locale.
*? is the non-greedy version of *, which matches 0 or more of the preceding atom (.), but as few as possible.
(?= regexp ) is a positive look-ahead operator, so it matches at a given spot provided what follows matches the regexp.
(?<! [^ ] ) and (?! [^ ]) are respectively negative look behind and look ahead operators. Here we want a cd that is neither preceded nor followed by non-space characters as we don't want to match on the cd in abcde for instance.
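
Under the same assumption (words separated by single spaces), the whole-word test can also be sketched without PCRE by padding the string with spaces and using parameter expansion. The function name truncate_at_word is hypothetical, and this is a different technique from the grep command above, not a drop-in replacement for it.

```shell
# Hypothetical helper: print the part of $1 that precedes the first
# space-delimited word equal to $2 (all of $1 if there is no such word).
# The body runs in a subshell ( ... ) so its variables stay private.
truncate_at_word() (
  padded=" $1 "                 # pad so every word has a space on both sides
  head=${padded%%" $2 "*}       # cut at the first " word " (longest suffix)
  if [ "$head" = "$padded" ]; then
    printf '%s\n' "$1"          # no whole-word match: keep the input as is
  else
    printf '%s\n' "${head# }"   # drop the leading pad space
  fi
)

list='aa bb abcde cc dd ee ff ab cd ef'
result=$(truncate_at_word "$list" cd)
echo ":$result:"
```

The cd inside abcde is skipped because it has no space on either side.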
