Metacharacters inside [ ]:
Here’s a complete list of the metacharacters; their meanings will be discussed in the rest of this HOWTO.
. ^ $ * + ? { } [ ] \ | ( )
Metacharacters are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature.
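But \ and ] still have special meaning inside a class: ] closes the class unless it is escaped, and \ still introduces escapes such as \], \\ and \n: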
>>> re.match(r'[\]a]+', 'aa]]]')
<_sre.SRE_Match object; span=(0, 5), match='aa]]]'>
>>> re.match(r'[\\n]+', '\\n')
<_sre.SRE_Match object; span=(0, 2), match='\\n'>
>>> re.match(r'[\n]+', '\n')
<_sre.SRE_Match object; span=(0, 1), match='\n'>
>>> re.match(r'[\n]+', '\n\n')
<_sre.SRE_Match object; span=(0, 2), match='\n\n'>
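Two more characters behave specially inside a class depending on where they appear: ^ negates the class only when it comes first, and - forms a range only between two characters. The following is a small sketch of these rules; the sample strings and variable names are made up for illustration.

```python
import re

# '$' is an ordinary character inside a class.
dollar_hits = re.findall(r'[akm$]', 'make $5')

# '^' negates the class only when it is the first character.
not_five = re.findall(r'[^5]', '5^5')
five_or_caret = re.findall(r'[5^]', '5^5')

# '-' forms a range between two characters, but is literal when placed last.
a_to_c = re.findall(r'[a-c]', 'a-c b')
a_c_or_dash = re.findall(r'[ac-]', 'a-c b')

print(dollar_hits)    # ['m', 'a', 'k', '$']
print(not_five)       # ['^']
print(five_or_caret)  # ['5', '^', '5']
print(a_to_c)         # ['a', 'c', 'b']
print(a_c_or_dash)    # ['a', '-', 'c']
```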
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>> import re
>>> p = re.compile(r'ABC\-001') # \- is an escaped hyphen: the RE engine treats it as a literal '-', so the pattern matches the text ABC-001
>>> p.match('ABC-001')
<_sre.SRE_Match object; span=(0, 7), match='ABC-001'>
>>> p.match(r'ABC\-001')
>>> p.match('ABC\\-001')
>>>
>>> re.split(r'[\s\,\:]+', 'a, b:: c d')
['a', 'b', 'c', 'd']
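The sessions above can be checked with a short script. It is only a sketch (the variable names are illustrative): it shows that the backslash is still part of the pattern string, that the compiled pattern matches only the text with a plain hyphen, and that the escapes before , and : in the split pattern are optional.

```python
import re

pattern = r'ABC\-001'
p = re.compile(pattern)

# The raw string keeps the backslash, so the pattern has 8 characters.
pattern_length = len(pattern)

# The RE engine reads \- as a literal hyphen.
matches_plain = p.match('ABC-001') is not None
matches_with_backslash = p.match('ABC\\-001') is not None

# ',' and ':' are not metacharacters, so they need no escape inside a class.
parts = re.split(r'[\s,:]+', 'a, b:: c d')

print(pattern_length)          # 8
print(matches_plain)           # True
print(matches_with_backslash)  # False
print(parts)                   # ['a', 'b', 'c', 'd']
```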
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
* + ? {m,n} greedy repetition
Repetitions such as * are greedy; when repeating a RE, the matching engine will try to repeat it as many times as possible. If later portions of the pattern don’t match, the matching engine will then back up and try again with fewer repetitions.
A step-by-step example will make this more obvious. Let’s consider the expression a[bcd]*b. This matches the letter 'a', zero or more letters from the class [bcd], and finally ends with a 'b'. Now imagine matching this RE against the string abcbd.
Step | Matched | Explanation |
---|---|---|
1 | a | The a in the RE matches. |
2 | abcbd | The engine matches [bcd]* , going as far as it can, which is to the end of the string. |
3 | Failure | The engine tries to match b , but the current position is at the end of the string, so it fails. |
4 | abcb | Back up, so that [bcd]* matches one less character. |
5 | Failure | Try b again, but the current position is at the last character, which is a 'd' . |
6 | abc | Back up again, so that [bcd]* is only matching bc . |
7 | abcb | Try b again. This time the character at the current position is 'b' , so it succeeds. |
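The outcome of the steps in the table can be confirmed with a short sketch. The capturing group in the last pattern is added here only to expose what [bcd]* ends up holding after backtracking; the variable names are illustrative.

```python
import re

text = 'abcbd'

# Without the trailing b, the greedy [bcd]* runs to the end of the string.
greedy_only = re.match(r'a[bcd]*', text)

# With the trailing b, the engine backs up until b can match.
full = re.match(r'a[bcd]*b', text)

# A capturing group shows what [bcd]* holds once the match succeeds.
grouped = re.match(r'a([bcd]*)b', text)

print(greedy_only.group())  # abcbd
print(full.group())         # abcb
print(full.span())          # (0, 4)
print(grouped.group(1))     # bc
```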
Compiling Regular Expressions
Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.
re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations. We’ll go over the available settings later, but for now a single example will do:
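A minimal sketch of passing a flag, assuming re.IGNORECASE as the flag of interest; the pattern and the test string are chosen here for illustration.

```python
import re

# Compile the same pattern with and without the case-insensitive flag.
p_plain = re.compile('ab*')
p_nocase = re.compile('ab*', re.IGNORECASE)

plain_result = p_plain.match('ABBBc')
nocase_result = p_nocase.match('ABBBc')

print(plain_result)            # None
print(nocase_result.group())   # ABBB
```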
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The Backslash Plague
The RE is passed to re.compile() as a string. REs are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them.
As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python’s usage of the same character for the same purpose in string literals.
Let’s say you want to write a RE that matches the string \section, which might be found in a LaTeX file. To figure out what to write in the program code, start with the desired string to be matched. Next (escape metacharacters), you must escape any backslashes and other metacharacters by preceding them with a backslash, resulting in the string \\section. This is the string that must be passed to re.compile(). However, to express it as a Python string literal, both backslashes must be escaped again.
Characters | Stage |
---|---|
\section | Text string to be matched |
\\section | Escaped backslash for re.compile() |
"\\\\section" | Escaped backslashes for a string literal |
In short, to match a literal backslash, one has to write '\\\\' as the RE string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. In REs that feature backslashes repeatedly, this leads to lots of repeated backslashes and makes the resulting strings difficult to understand.
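The two levels of escaping can be seen by counting characters. This is a sketch using ordinary (non-raw) string literals; the variable names are illustrative.

```python
import re

text = '\\section'       # the text to match: one backslash + 'section'
regex = '\\\\section'    # what the RE engine receives: \\section

print(len(text))    # 8
print(len(regex))   # 9
print(regex)        # \\section

# With both levels of escaping, the RE matches the text.
escaped_twice = re.match(regex, text)
print(escaped_twice.group() == text)   # True

# With only one level, the RE engine sees \section, i.e. \s followed by 'ection'.
escaped_once = re.match('\\section', text)
print(escaped_once)                    # None
```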
The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.
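A short sketch that checks these claims about raw strings directly; the comparisons use the same strings that appear in the notes, and the variable names are illustrative.

```python
import re

# r"\n" keeps the backslash; "\n" is a single newline character.
raw_len = len(r"\n")
plain_len = len("\n")

# A raw string and its doubly escaped regular form are the same string.
same_section = r"\\section" == "\\\\section"
same_backref = r"\w+\s+\1" == "\\w+\\s+\\1"

# The raw form matches the LaTeX-style text \section.
m = re.match(r"\\section", "\\section")

print(raw_len, plain_len)   # 2 1
print(same_section)         # True
print(same_backref)         # True
print(m.span())             # (0, 8)
```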
Regular string | Raw string |
---|---|
"ab*" | r"ab*" |
"\\\\section" | r"\\section" |
"\\w+\\s+\\1" | r"\w+\s+\1" |
Backreferences:
Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise. Remember that Python’s string literals also use a backslash followed by numbers to allow including arbitrary characters in a string, so be sure to use a raw string when incorporating backreferences in a RE.
For example, the following RE detects doubled words in a string.
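A minimal sketch of such an RE. The sample sentence and the names p and m are chosen here for illustration.

```python
import re

# \b(\w+) captures a word, \s+ skips whitespace, \1 requires the same word again.
p = re.compile(r'\b(\w+)\s+\1\b')
m = p.search('Paris in the the spring')

print(m.group())    # the the
print(m.group(1))   # the

# Without the raw string, Python itself would turn '\1' into the character chr(1).
print('\1' == chr(1))   # True
```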
Backreferences like this aren’t often useful for just searching through a string — there are few text formats which repeat data in this way — but you’ll soon find out that they’re very useful when performing string substitutions.