An hour ago I asked a similar question about regular expressions with the grep command. Pardon me if the preferred choice would have been to post in the same thread; if so, I will do that next time.

It might seem like basic syntax, but I'm trying to understand how regular expression pattern matching works, and the results I get seem to contradict the manual I'm reading (most likely I'm not interpreting the material properly).

A file contains the following list of words:

mael@mael-HP:~/repertoireVide$ cat MySQLServ
remembré
emmuré
emmené
dilemmes
jumeaux
écrémage
emmena
emmailloter
flemmard

The following command gives this output:

mael@mael-HP:~/repertoireVide$ grep -r 'emm*[a-f].[^ta]$'
MySQLServ:remembré
MySQLServ:emmené
MySQLServ:flemmard

I'm wondering why grep does not match the word 'emmailloter', since 'emmailloter':

  1. contains 'em'
  2. contains a character in [a-f] afterwards: 'a'
  3. 'i' fulfills the '.' component
  4. does not end with either the character 't' or 'a'

Thanks.

  • They are different questions, so should remain separate. However I would recommend finding a good reg ex practice program or web site, and practice practice practice. Start with the basics ., *, +, ?, then add ^ and $, then add more. Commented Oct 26, 2019 at 8:59
  • @ctrl-alt-delor Note that most websites that provide regular expression tests implement Perl-like regular expressions and not POSIX regular expressions. GNU grep understands these with -P, but if you want portability, POSIX basic and extended regular expressions are more often used with Unix command line tools. Commented Oct 26, 2019 at 9:06

1 Answer

The word emmailloter contains much more than i between the bits matched by [a-f] and [^ta]$. The . pattern only ever matches a single character, so if you want to match multiple characters between emma and r at the end, you will have to allow for multiple characters:

emm*[a-f]..*[^ta]$

With grep -E (enabling extended regular expressions), ..* could be written .+, i.e. "match at least one character". The expression ..* reads as "match a character, and then possibly more characters". In the same way, emm* could be replaced by em+, i.e. "e followed by at least one m" if using grep -E.

This would match the string

blop-emmmmmmmmma-blarg-b
     ^^^^^^^^^^^^^^^^^^^
     1111111111233333334

1: emm*
2: [a-f]
3: ..*
4: [^ta]$

(the ^ characters above mark the matched part, and the digits show which piece of the expression matched each character), for example, and also emmailloter:

emmailloter
^^^^^^^^^^^
11123333334

Testing:

$ grep -E 'emm*[a-f].+[^ta]$' MySQLServ
remembré
emmené
emmailloter
flemmard
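
As a quick check that the basic and extended spellings select the same lines, here is a minimal sketch. The word list is an ASCII-only subset of the file, chosen purely for illustration so that the result does not depend on the locale, and the variable names are made up:

```shell
# Hypothetical ASCII-only sample (an assumption, not the original file).
words=$(printf '%s\n' emmailloter flemmard emmena dilemmes jumeaux)

# Basic regular expression: "one character, then possibly more".
bre_out=$(printf '%s\n' "$words" | grep 'emm*[a-f]..*[^ta]$')

# Extended regular expression: the same idea written with +.
ere_out=$(printf '%s\n' "$words" | grep -E 'em+[a-f].+[^ta]$')

printf 'BRE:\n%s\n' "$bre_out"
printf 'ERE:\n%s\n' "$ere_out"
```

Both commands should print emmailloter and flemmard for this sample: emmena ends in a, dilemmes has nothing left for the final bracket expression once .+ has taken the s, and jumeaux contains no em at all.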

Note that for the word remembré, the match will be

remembré
 ^^^^^^^
 1123334

not

remembré
   ^^^^^
   11234
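
The leftmost-match behaviour can also be seen with grep -o, which prints only the matched part of each line. This is a sketch under two assumptions: -o is available (it is a GNU and BSD extension, not POSIX), and the ASCII string remembre stands in for remembré to keep the example independent of the locale. The variable names are only illustrative:

```shell
# ASCII stand-in for "remembré" (an assumption made for portability).
word=remembre

# With .+ the match starts at the first "em", i.e. as far left as possible.
with_plus=$(printf '%s\n' "$word" | grep -oE 'emm*[a-f].+[^ta]$')

# With a single . the first "em" cannot lead to a match, so the second one is used.
single_dot=$(printf '%s\n' "$word" | grep -oE 'emm*[a-f].[^ta]$')

printf '%s\n%s\n' "$with_plus" "$single_dot"
```

This should print emembre followed by embre, mirroring the two diagrams above.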

One way to visualise the matches is with sed:

$ sed -n -E 's/(emm*)([a-f])(.+)([^ta]$)/(\1)(\2)(\3)(\4)/p' MySQLServ
r(em)(e)(mbr)(é)
(emm)(e)(n)(é)
(emm)(a)(illote)(r)
fl(emm)(a)(r)(d)

This will only print matching lines, with each matched part of the regular expression in parentheses. This also assumes that you are using a sed implementation that can be used to match French characters and that the locale environment variables are properly set up for doing that.

Compare this with what your original expression produces:

$ sed -n -E 's/(emm*)([a-f])(.)([^ta]$)/(\1)(\2)(\3)(\4)/p' MySQLServ
rem(em)(b)(r)(é)
(emm)(e)(n)(é)
fl(emm)(a)(r)(d)
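
To convince yourself that the single . is what excludes emmailloter, you can count the characters it would have to cover. The following is only a sketch: it uses a POSIX interval expression, \{6\}, to demand exactly six characters (the illote part) between the a and the final r, and the variable names are made up for the example:

```shell
word=emmailloter

# Original pattern: exactly one character between [a-f] and the last one.
# grep -c prints the number of matching lines; "|| true" only keeps grep's
# non-zero exit status (no match) from aborting a script run with set -e.
one=$(printf '%s\n' "$word" | grep -c 'emm*[a-f].[^ta]$' || true)

# Interval expression: exactly six characters ("illote") in between.
six=$(printf '%s\n' "$word" | grep -c 'emm*[a-f].\{6\}[^ta]$' || true)

printf 'one character: %s\nsix characters: %s\n' "$one" "$six"
```

The first count should be 0 and the second 1. In practice ..* (or .+ with grep -E) is the more useful form, since it does not tie the pattern to one particular word length.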
  • I'd add that the author's doubt probably comes from thinking of * as the shell glob operator. In regular expressions, * (and other repetition operators such as + and ?) always applies to the preceding expression. Commented Oct 26, 2019 at 13:53
  • @Spidey, except that here it seems more like that they were expecting the dot . to match multiple characters (namely the illote in emmailloter) Commented Oct 26, 2019 at 16:37
