
Regular Expressions: A Primer

Updated October 27, 2014.

A regular expression is a pattern that describes a set of strings. Regular expressions operate on strings, so an integer must first be converted to a string, usually with str(), before it can be matched. Regular expressions, or regex, use escaped letters and special symbols to match a wide range of strings according to certain syntax rules. Python's special terms for regular expressions are summarized below with some examples:

  • "." Any character except a newline.
    • letters 'a' through 'z' and 'A' through 'Z'
    • any digits and symbols
    • tab ('\t')
  • "^" The start of the string. This is not the first character of the string but the invisible boundary which precedes the string. So, in the string 'cartwheel', the term '^' would match the location immediately before the 'c'.
  • "$" The end of the string, or just before a newline at the end of the string. This is not the last character of the string but the invisible boundary which follows it. In the string 'cartwheel', the term '$' would match the location immediately following the 'l'.
  • "*" 0 or more instances of the preceding pattern.
    • 'cart.*' would match 'cartwheel', 'cartridge', 'cart567', and any other string that begins with the four characters 'cart'.
    • '.*wheel' would match 'cartwheel', 'backwheel', 'frontwheel', '4-wheel', and any other string that ends with the five characters of 'wheel'.
    • 'c.*l' would match 'cartwheel', 'control', 'cancel', 'c5a-f67l', and any other string that begins with 'c' and ends with 'l'.
  • "+" 1 or more instances of the preceding pattern. This is often used in conjunction with square brackets.
    • 'c[art]+' matches 'c' followed by one or more instances of either 'a', 'r', or 't'.
  • "?" 0 or 1 instances of the preceding pattern.
    • 'ca?t' matches 'ct' and 'cat' only: the '?' makes the 'a' optional. To allow any single optional character in that position, use 'ca.?t', which matches 'cat', 'cart', and 'cast'.
  • "*?", "+?", "??" Match as few repetitions of the term preceding the operator as possible (non-greedy). The plain forms '*', '+', and '?' are greedy and match as many as possible.
  • "{m}" Matches exactly m instances of the preceding pattern.
  • "{m,n}" Matches from m to n instances of the preceding pattern, as many as possible.
  • "{m,n}?" Matches from m to n instances of the preceding pattern, as few as possible.
  • "\" Escapes special characters or signals a special sequence (like octal if the next character is 0).
    • newline character: '\n'
    • tab character: '\t'
  • "[]" Indicates a set of characters for a single position in the regex
    • 'd[aou]' matches 'da', 'do', and 'du'.
    • '200[0-9]' matches all numbers from '2000' through '2009'. Note that this matches them as a string literal, not as an integer.
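  • "|" Matches either the expression on the left of the pipe or the expression on the right. Inside square brackets the pipe loses this meaning and is matched as a literal '|', so alternatives are grouped with parentheses instead.
    • '(d|c)og' matches 'dog' or 'cog'. The set '[dc]og' does the same.
    • '(d|c)(a|o)(g|t)' matches any of the following: dog, dag, dot, dat, cog, cag, cot, and cat.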
  • "(...)" Indicates a grouping for the regex.
    • '(ca[rtp])' matches 'car', 'cat', and 'cap'. The text matched by the group is also saved and can be accessed later, saving one the effort of repeating it.
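  • "(?iLmsux)" Sets one or more flags for the regex. Each letter stands for a flag: 'i' ignore case, 'L' locale dependent, 'm' multi-line, 's' dot matches newline, 'u' Unicode dependent, 'x' verbose.
  • "(?:...)" A non-capturing group: the regex is grouped, but the text it matches is not saved.
  • "(?P<name>...)" A group whose matched text is saved under the name 'name' for later usage.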
  • "(?P=name)" Recalls the text matched by the regex named 'name'
  • "(?#...)" A comment/remark. The parentheses and their contents are ignored.
  • "(?=...)" Lookahead. Matches if ... matches at this position, without consuming any of the string. 'cart(?=wheel)' matches 'cart' only when it is followed by 'wheel'.
  • "(?!...)" Negative lookahead. Matches if ... does not match at this position. 'cart(?!wheel)' matches 'cart' only when it is not followed by 'wheel'.
  • "(?<=...)" Lookbehind. Matches if the current position is immediately preceded by a match for ... '(?<=cart)wheel' matches 'wheel' only when it is preceded by 'cart'.
  • "(?<!...)" Negative lookbehind. Matches if the current position is not immediately preceded by a match for ...
  • "\A" Matches the start of the string. This is similar to '^', above.
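  • "\b" Matches the empty string that forms the boundary at the beginning or end of a word.
    • "\bwheel" will match 'wheel' but not the 'wheel' in 'cartwheel'. In Python source, write the pattern as a raw string (r'\bwheel'), because '\b' in an ordinary string literal is a backspace character.
  • "\B" Matches the empty string, but only where it is not at the beginning or end of a word.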
  • "\d" Matches any single decimal digit, that is, the characters 0 through 9. It matches one digit at a time, not a whole number.
  • "\D" Matches any character that is not a decimal digit.
  • "\s" Matches any whitespace character like a blank space, tab, and the like.
  • "\S" Matches any non-whitespace character. This is the inverse of '\s', above.
  • "\w" Matches any alphanumeric character and the underscore: a through z, A through Z, 0 through 9, and '_'.
  • "\W" Matches any non-alphanumeric character. Examples for this include '&', '$', '@', etc.
  • "\Z" Matches the end of the string. This is similar to '$', above.
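
The quantifier, set, and alternation terms in the list above can be tried out with Python's re module. The following is a minimal sketch rather than a definitive reference: the helper name whole_match and the sample strings are illustrative assumptions, and re.fullmatch requires Python 3.4 or later.

```python
import re

def whole_match(pattern, text):
    """Return True when the entire text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text) pairs drawn from the terms in the list above.
examples = [
    (r"cart.*", "cartwheel"),   # '*': zero or more of any character
    (r"c[art]+", "cart"),       # '+': one or more characters from the set
    (r"ca?t", "ct"),            # '?': the 'a' is optional
    (r"ca?t", "cart"),          # ...but no other character may take its place
    (r"200[0-9]", str(2005)),   # integers must be converted to strings first
    (r"\d{4}", "2016"),         # '{m}': exactly four digits
    (r"a{2,3}", "aaaa"),        # '{m,n}': four is outside the range 2 to 3
    (r"(d|c)og", "cog"),        # '|': alternation inside a group
    (r"[d|c]og", "|og"),        # inside '[]' the pipe is a literal character
]

results = {(p, t): whole_match(p, t) for p, t in examples}
for (p, t), ok in results.items():
    print(f"{p!r:12} against {t!r:12} -> {ok}")

# Greedy '*' takes as much as it can; non-greedy '*?' takes as little as it can.
greedy = re.match(r"<.*>", "<b>bold</b>").group()   # '<b>bold</b>'
lazy = re.match(r"<.*?>", "<b>bold</b>").group()    # '<b>'
```

Patterns are written as raw strings (the r prefix) so that backslashes reach the regex engine unchanged.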
Further discussion about Python's regular expression syntax may be found in the documentation for Python's re module.
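
The grouping, lookaround, and boundary terms from the list above are easier to follow when seen in action. The sketch below uses only functions from Python's re module; the variable names and sample strings are illustrative assumptions, not part of the original primer.

```python
import re

# "(...)": the text matched by the group is saved and can be retrieved.
captured = re.search(r"(ca[rtp])", "a red car").group(1)   # 'car'

# "(?P<name>...)" and "(?P=name)": find a word that is repeated.
doubled = re.search(r"\b(?P<word>\w+) (?P=word)\b", "it was the the best")
doubled_word = doubled.group("word")                        # 'the'

# "(?:...)": groups without saving, so only 'wheel' is captured here.
saved = re.search(r"(?:front|back)(wheel)", "backwheel").groups()

# Lookahead and lookbehind test the surrounding text without consuming it.
before_wheel = re.findall(r"\w+(?=wheel)", "cartwheel pinwheel wheelbarrow")
not_wheel = re.findall(r"cart(?!wheel)\w*", "cartwheel cartridge cartoon")
after_cart = re.findall(r"(?<=cart)\w+", "cartwheel cartridge")
lone_wheel = re.findall(r"(?<!cart)wheel", "cartwheel pinwheel")

# "\b": word boundary. 'wheel' inside 'cartwheel' does not start a word.
boundary_hit = re.search(r"\bwheel", "a wheel") is not None
boundary_miss = re.search(r"\bwheel", "cartwheel") is None

# "^" follows line starts in multi-line mode; "\A" never does.
caret_lines = re.findall(r"^\w+", "cart\nwheel", re.MULTILINE)
start_only = re.findall(r"\A\w+", "cart\nwheel", re.MULTILINE)

# "$" also matches just before a trailing newline; "\Z" does not.
dollar_hit = re.search(r"wheel$", "wheel\n") is not None
z_miss = re.search(r"wheel\Z", "wheel\n") is None

# "(?i)": an inline flag, here ignoring case.
inline_flag = re.fullmatch(r"(?i)cart", "CART") is not None

print(captured, doubled_word, saved)
print(before_wheel, not_wheel, after_cart, lone_wheel)
print(caret_lines, start_only)
```

re.search scans the string for the first match, while re.findall returns every non-overlapping match, which makes the effect of each lookaround easy to inspect.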