Revisions to Why does my regular expression work in X but not in Y?

added 288 characters in body

edited Jan 6, 2022 at 12:30

Extended regular expressions are codified by the POSIX standard. Their major advantage over BRE is regularity: all standard operators are bare punctuation characters, and a backslash before a punctuation character always quotes it. It is the syntax used by awk, grep -E or egrep, BSD (and GNU and soon POSIX) sed -E (formerly sed -r in GNU sed), and the =~ operator of bash, ksh93, yash and zsh¹. This syntax provides the following features:

¹ Unless the rematchpcre option is enabled in zsh, in which case =~ uses PCREs there. ksh93's extended regexps also support some of perl's extended operators such as the look-around ones.

Extended regular expressions are codified by the POSIX standard. Their major advantage over BRE is regularity: all standard operators are bare punctuation characters, a backslash before a punctuation character always quotes it. It is the syntax used by awk, grep -E or egrep, GNU sed -r, and bash's =~ operator. This syntax provides the following features:

add more information about PCRE, in particular lookahead/lookbehind

edited Feb 13, 2019 at 23:23

Gilles 'SO- stop being evil'

PCRE are extensions of ERE, originally introduced by Perl and adopted by GNU grep -P and many modern tools and programming languages, usually via the PCRE library. See the Perl documentation for nice formatting with examples. Not all features of the latest version of Perl are supported by PCRE (e.g. Perl code execution is only supported in Perl). See the PCRE manual for a summary of supported features. The main additions to ERE are:

(?:…) is a non-capturing group: like (…), but does not count for backreferences.

(?=FOO)BAR (lookahead) matches BAR, but only if there is also a match for FOO starting at the same position. This is most useful to anchor a match without including the following text in the match: foo(?=bar) matches foo but only if it's followed by bar.

(?!FOO)BAR (negative lookahead) matches BAR, but only if there is not also a match for FOO starting at the same position. For example (?!foo)[a-z]+ matches any lowercase word that does not start with foo; [a-z]+(?![0-9]) matches any lowercase word that is not followed by a digit (so in foo123, it matches fo but not foo).

(?<=FOO)BAR (lookbehind) matches BAR, but only if it is immediately preceded by a match for FOO. FOO must have a known length (you can't use repetition operators such as *). This is most useful to anchor a match without including the preceding text in the match: (?<=^| )foo matches foo but only if it's preceded by a space or by the beginning of the string.

(?<!FOO)BAR (negative lookbehind) matches BAR, but only if it is not immediately preceded by a match for FOO. FOO must have a known length (you can't use repetition operators such as *). This is most useful to anchor a match without including the preceding text in the match: (?<![a-z])foo matches foo but only if it is not preceded by a lowercase letter.
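
The lookaround operators above can be tried out with a short sketch. It uses Python's re module, which is not PCRE itself but is assumed here to behave the same way for these particular constructs; the function name lookaround_demo and the sample strings are made up for illustration. The lookbehind example uses (?<= )foo rather than (?<=^| )foo because Python requires every alternative inside a lookbehind to have the same width.

```python
import re


def lookaround_demo():
    """Return the result of each lookaround example on a small sample string."""
    return {
        # (?:...) groups without capturing, so only (c) is reported as a group.
        "non_capturing": re.match(r"(?:ab)+(c)", "ababc").groups(),
        # Lookahead: foo is matched only where bar follows; bar is not consumed.
        "lookahead": re.findall(r"foo(?=bar)", "foobar foobaz"),
        # Negative lookahead: the engine backtracks from "foo" to "fo",
        # because "foo" is followed by a digit and "fo" is not.
        "negative_lookahead": re.search(r"[a-z]+(?![0-9])", "foo123").group(),
        # Lookbehind: start offsets of foo immediately preceded by a space.
        "lookbehind": [m.start() for m in re.finditer(r"(?<= )foo", "foo afoo foo")],
        # Negative lookbehind: start offsets of foo not preceded by a lowercase letter.
        "negative_lookbehind": [m.start() for m in re.finditer(r"(?<![a-z])foo", "foo xfoo")],
    }


if __name__ == "__main__":
    for name, result in lookaround_demo().items():
        print(name, result)
```

In each case the text inspected by the lookaround is left out of the match, which is what makes these operators useful as anchors.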

PCRE are extensions of ERE, originally introduced by Perl and adopted by many modern tools and programming languages, usually via the PCRE library. See the Perl documentation for nice formatting with examples. Not all features of the latest version of Perl are supported by PCRE (e.g. Perl code execution is only supported in Perl), see the PCRE manual for a summary of supported features.

added 150 characters in body

edited Dec 18, 2017 at 14:57

Stéphane Chazelas

\| for alternation: foo\|bar matches foo or bar.
\? (short for \{0,1\}) and \+ (short for \{1,\}) match the preceding character or subexpression at most 1 time, or at least 1 time respectively.
\n matches a newline, \t matches a tab, etc.
\w matches any word constituent (short for [_[:alnum:]] but with variation when it comes to localisation) and \W matches any character that isn't a word constituent.
\< and \> match the empty string only at the beginning or end of a word respectively; \b matches either, and \B matches where \b doesn't.
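
As a rough illustration of the word-related escapes, here is a sketch in Python. This is an assumption-laden stand-in: Python's re module is not a BRE engine, it has no \< or \>, and it writes alternation and repetition without backslashes, but its \w, \W, \b and \B are assumed to behave like the GNU extensions listed above. The function name word_demo and the sample strings are invented for the example.

```python
import re


def word_demo():
    """Show \\b, \\B, \\w and \\W on small sample strings."""
    text = "cat concat cats"
    return {
        # \b on both sides: only the standalone word matches.
        "whole_word": re.findall(r"\bcat\b", text),
        # \B on the left: cat must start inside a word, so only the one in "concat" qualifies.
        "inside_word": [m.start() for m in re.finditer(r"\Bcat\b", text)],
        # \w+ collects runs of word constituents (letters, digits, underscore).
        "words": re.findall(r"\w+", text),
        # \W+ matches runs of everything else; the underscore stays part of a word.
        "split": re.split(r"\W+", "a_b, c-d"),
    }


if __name__ == "__main__":
    for name, result in word_demo().items():
        print(name, result)
```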

^ and $ match only at the beginning and end of a line.
. matches any character (or any character except a newline).
[…] matches any one character listed inside the brackets (character set). Complementation with an initial ^ and ranges work like in BRE (see above). Character classes can be used but are missing from a few implementations. Modern implementations also support equivalence classes and collating elements. A backslash inside brackets quotes the next character in some but not all implementations; use \\ to mean a backslash for portability.
(…) is a syntactic group, for use with * or \DIGIT replacements.
| for alternation: foo|bar matches foo or bar.
*, + and ? match the preceding character or subexpression a number of times: 0 or more for *, 1 or more for +, 0 or 1 for ?.
Backslash quotes the next character if it is not alphanumeric.
{m,n} matches the preceding character or subexpression between m and n times (missing from some implementations); n or m can be omitted, and {m} means exactly m.
Some common extensions as in BRE: \DIGIT backreferences (notably absent in awk except in the busybox implementation where you can use $0 ~ "(...)\\1"); special characters \n, \t, etc.; word boundaries \b and \B, word constituents \w and \W, …
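
The core ERE operators listed above can be exercised with a small sketch. It is written in Python, whose re module is Perl-style rather than a POSIX ERE implementation (for instance, POSIX alternation picks the longest match while Python picks the first alternative that succeeds), so treat it as an approximation that happens to agree with ERE on these simple patterns. The function name ere_demo and the sample strings are assumptions made for the example.

```python
import re


def ere_demo():
    """Exercise alternation, ?, {m,n}, backslash quoting and a backreference."""
    return {
        # | for alternation.
        "alternation": re.findall(r"foo|bar", "foo bar baz"),
        # ? makes the preceding character optional (0 or 1 times).
        "optional": [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colouur")],
        # {2,3} accepts between 2 and 3 repetitions.
        "interval": [bool(re.fullmatch(r"a{2,3}", s)) for s in ("a", "aa", "aaa", "aaaa")],
        # A backslash before a punctuation character quotes it: \+ is a literal plus.
        "quoted_plus": re.findall(r"a\+b", "a+b aab"),
        # \1 refers back to whatever the first group matched (a common extension).
        "backreference": re.search(r"(ab)\1", "xababy").group(),
    }


if __name__ == "__main__":
    for name, result in ere_demo().items():
        print(name, result)
```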

\| for alternation: foo\|bar matches foo or bar.
\? (short for \{0,1\} and \+ (short for \{1,\}) match the preceding character or subexpression at most 1 time, or at least 1 time respectively.
\n matches a newline, \t matches a tab, etc.
\w matches any word constituent and \W matches any character that isn't a word constituent.
\< and \> match the empty string only at the beginning or end of a word respectively; \b matches either, and \B matches where \b doesn't.

^ and $ match only at the beginning and end of a line.
. matches any character (or any character except a newline).
[…] matches any one character listed inside the brackets (character set). Complementation with an initial ^ and ranges work like in BRE (see above). Character classes can be used but are missing from a few implementations. Modern implementations also support equivalence classes and collating elements. A backslash inside brackets quotes the next character in some but not all implementations; use \\ to mean a backslash for portability.
(…) is a syntactic group, for use with * or \DIGIT replacements.
| for alternation: foo|bar matches foo or bar.
*, + and ? matches the preceding character or subexpression a number of times: 0 or more for *, 1 or more for +, 0 or 1 for ?.
Backslash quotes the next character if it is not alphanumeric.
{m,n} matches the preceding character or subexpression between m and n times (missing from some implementations); n or m can be omitted, and {m} means exactly m.
Some common extensions as in BRE: \DIGIT backreferences (notably absent in awk); special characters \n, \t, etc.; word boundaries \b and \B, word constituents \b and \B, …
