
I'm not talking about the -o option. POSIX says:

The search for a matching sequence starts at the beginning of a string and stops when the first sequence matching the expression is found, where "first" is defined to mean "begins earliest in the string". If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. For example, the BRE "bb*" matches the second to fourth characters of the string "abbbc", and the ERE "(wee|week)(knights|night)" matches all ten characters of the string "weeknights".

I want to verify what is said in POSIX and in this tutorial, regTutorialSite:

A POSIX-compliant engine will still find the leftmost match. If you apply Set|SetValue to Set or SetValue once, it will match Set.

How do I "apply once"? When I run grep -o, the result is two strings, Set and SetValue, not just the one leftmost match. That is, I read one thing, but in practice I get something else. So how can I see which string was actually matched by the regex?

(Perhaps the question was formulated incorrectly or could have been better)

  • 3
    Is this just for self-education about grep, or do you want to actually use the extracted matches in a script? If the former, then self-education is great. If the latter, then you'll be much better off using a language like awk or perl, which are designed for exactly that kind of task; trying to do it in shell with grep and command substitution will be slow and awkward. E.g. perl can return ALL matches in an array, making it easy to iterate over the results rather than repeating the search multiple times. Commented Aug 2 at 4:05
  • 1
    So, why does it matter if regular grep finds the first possible match on the line, or one of the possible other matches? The default operation is defined to just print the line, regardless of what exactly matched. On the other hand, is there any reason to doubt that grep's internal matching logic finds the first match, as usual? If it didn't, grep would need to have a different regex engine from everything else, and for no real use, since it prints just the whole line anyway. Commented Aug 2 at 8:00
  • 1
    As for grep -o, I think it's defined to print all matches, so again, there's no conflict between what you saw and what is documented. Grep isn't the same thing as the regex engine in the C library, and it's not reasonable to expect a description of the library functions would describe accurately the behavior of grep. Commented Aug 2 at 8:03
  • 1
    @Mark a regular expression library provides the raw functionality - that's its purpose, to provide regex capability for other programs to use. A program such as grep, which uses that library, may (and almost certainly will) expand on what the library does, like providing convenience features for users. grep, for example, can show all matches on a line (-o), count the matching lines in a file (-c), show the line number of matches (-n), and more - none of these features are mentioned in the docs for the regex library, because they don't belong there. They belong in the docs for grep. Commented Aug 18 at 11:15
  • 1
    Since you've been given the task of implementing grep, you'll need to understand regexes themselves, and whichever library you plan to use that provides regex capabilities (unless part of that task is to implement the regex functions yourself). Even if you're required to implement everything from scratch yourself, studying existing libraries/implementations to figure out how they work and how they solved the problems you're going to run into will be useful. Commented Aug 18 at 11:19

1 Answer


grep is named after the g/re/p command of the ed editor. It's about printing the lines that match the given regular expression.

Which portion of the line matches is therefore not relevant.

The GNU implementation has added these two extensions over the standard:

  • -o that prints all the non-empty matches of the regexp
  • --color that highlights all the matches

But both match the regexp differently from grep without them: they carry on looking for more matches after the first, in a manner similar to that of ed's s/pattern/<&>/g command (the g flag being the key here).
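A small illustration of that "carry on looking" behaviour, assuming a grep that supports the non-standard -o option (GNU grep, for instance). The sample line and the helper name all_matches are made up for this sketch:

```shell
# Hypothetical helper: print every non-empty ERE match found on stdin,
# one per line. Relies on the non-standard -o option.
all_matches() {
  grep -oE "$1"
}

# Made-up sample line with two places where the pattern can match.
printf 'Set or SetValue\n' | all_matches 'Set|SetValue'
# With GNU grep this prints two lines:
#   Set
#   SetValue
```

Without -o, the same grep would simply print the whole line once; the two output lines come from -o resuming the search after each match, not from the regexp matching twice "at once".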

No grep implementation that I know of has a way to output the one and only match that grep without -o/--color matches on.

You'd need to use other tools such as sed, awk or perl.

For instance, to see what grep regex matches on, you could do:

sed -n 's/regex/<&>/p'

That prints the matching lines with <...> around the matched portion. To print only the matched portion:

sed -n '
  /regex/ {
    s//\
&\
/
    s/^.*\n\(.*\)\n.*$/\1/p
  }'
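As a quick check of the two sed commands above, here is a sketch that plugs in the BRE bb* from the POSIX text quoted in the question. The helper names mark_first and only_first and the sample line are hypothetical, chosen only for illustration:

```shell
# Hypothetical helper: print matching lines with <...> around the first match.
mark_first() {
  sed -n 's/bb*/<&>/p'
}

# Hypothetical helper: print only the first match of each matching line.
# The first s command puts the match on a line of its own inside the
# pattern space; the second keeps only what sits between the two newlines.
only_first() {
  sed -n '
    /bb*/ {
      s//\
&\
/
      s/^.*\n\(.*\)\n.*$/\1/p
    }'
}

# Made-up sample line with two runs of b.
printf 'abbbc abbc\n' | mark_first   # prints: a<bbb>c abbc
printf 'abbbc abbc\n' | only_first   # prints: bbb
```

Only the first run of b is reported, and all three b characters of it are included, which is the leftmost-then-longest behaviour POSIX describes.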

For grep -E regex:

awk 'match($0, "regex") {print substr($0, RSTART, RLENGTH)}'

(awk regexps are similar to those supported by grep with -E; alternatively, use the same approach as above with sed -E where supported)
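For example, with the Set|SetValue pattern from the question, the awk approach can be tried like this. The helper name first_ere_match and the sample line are assumptions made for this sketch:

```shell
# Hypothetical helper: print the first ERE match on each matching line.
# match() sets RSTART and RLENGTH to the position and length of the match.
first_ere_match() {
  awk 'match($0, "Set|SetValue") {print substr($0, RSTART, RLENGTH)}'
}

# Made-up sample line: the pattern could match in two places.
printf 'Set or SetValue\n' | first_ere_match
# Prints: Set
```

Unlike grep -o, this reports a single match per line: the one that begins earliest.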

For grep -P regex:

perl -lne 'print $& if /regex/'

(grep -P, which originated in GNU grep, is for perl regular expressions, but it goes through PCRE2 (formerly PCRE), whose regexps are not fully equivalent to perl's; ast-open grep has its own variant of perl-like regular expressions.)
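A short sketch of the perl variant, again with the pattern from the question. The helper name first_perl_match and the sample line are made up; the sample is arranged so that both alternatives can match at the very start of the line:

```shell
# Hypothetical helper: print what perl matched ($&) on each matching line.
first_perl_match() {
  perl -lne 'print $& if /Set|SetValue/'
}

# Made-up sample line where Set and SetValue both match at position 0.
printf 'SetValue or Set\n' | first_perl_match
# Prints: Set
```

perl tries the alternatives in order and stops at the first one that succeeds, so it reports Set here, whereas the POSIX rule quoted in the question asks for the longest match at that position (SetValue). That is one reason to pick the tool whose regexp flavour matches the grep option being investigated.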

With perl regexps (in grep implementations that support both -P and -o, such as GNU grep when built with optional PCRE2 support), you can also do:

grep -Po '^.*?\K(?:regex)'

The ^.*? matches as few characters as possible starting at the start of the line, which prevents the regex from matching more than once. \K marks the start of what's to be Kept (and output with -o) from that. The (?:...) grouping is there in case the regex contains an alternation operator (as in a|b); it is used rather than the capturing variant ((...)) in case the regex has some \1 or (?1)... operators.
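To see the effect of the ^.*?\K prefix, here is a sketch that assumes a grep with both -P and -o (GNU grep built with PCRE2 support). The helper name first_pcre_match and the sample line are hypothetical:

```shell
# Hypothetical helper: print only the first match of each matching line.
# Assumes grep supports both -P and -o.
first_pcre_match() {
  grep -Po '^.*?\K(?:Set|SetValue)'
}

# Made-up sample line where a plain grep -o would report two matches.
printf 'Set or SetValue\n' | first_pcre_match
# Prints: Set
```

Because every match attempt has to begin at the start of the line, the search that -o resumes after the first match cannot succeed again, so only one match per line is printed.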

Beware that it doesn't report empty matches.
