Referenced from
1. https://docs.python.org/3.4/howto/regex.html
2. https://fishc.com.cn/thread-57073-1-3.html
Simple pattern
Matching characters
Most letters and characters will simply match themselves. However, there are some exceptions (metacharacters) to this rule.
Here is the list of metacharacters:
. ^ $ * + ? { } [ ] \ | ( )
[] are used for specifying a character class, which is a set of characters that you wish to match.
Metacharacters (except \) are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature.
You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class. A '^' anywhere else in the class has no special meaning and simply matches the '^' character. For example, [^5] will match any character except '5', while [5^] will match either '5' or '^'.
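As a minimal sketch of the two class forms described above, the snippet below uses Python's re module. The sample strings and variable names are made up for illustration.

```python
import re

# [akm$] matches any single 'a', 'k', 'm', or a literal '$'.
in_class = re.findall(r"[akm$]", "make $5")

# [^5] matches any single character except '5'.
not_five = re.findall(r"[^5]", "1525")

# A '^' that is not the first character of the class is a literal caret.
caret_literal = re.findall(r"[5^]", "5^6")

print(in_class)       # ['m', 'a', 'k', '$']
print(not_five)       # ['1', '2']
print(caret_literal)  # ['5', '^']
```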
\d | Matches any decimal digit; this is equivalent to the class [0-9].
\D | Matches any non-digit character; this is equivalent to the class [^0-9].
\s | Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S | Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w | Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W | Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or ',' or '.'.
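The shorthand sequences can be tried out with a short sketch like the one below. The input strings are invented for illustration and contain only ASCII characters.

```python
import re

text = "Room 42, floor_3. OK"

# \d picks out individual decimal digits.
digits = re.findall(r"\d", text)

# \w+ picks out runs of letters, digits, and underscores.
words = re.findall(r"\w+", text)

# [\s,.]+ treats any run of whitespace, commas, or periods as a separator.
parts = re.split(r"[\s,.]+", "a, b.c  d")

print(digits)  # ['4', '2', '3']
print(words)   # ['Room', '42', 'floor_3', 'OK']
print(parts)   # ['a', 'b', 'c', 'd']
```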
The final metacharacter in this section is '.'. It matches anything except a newline character, and there’s an alternate mode (re.DOTALL) where it will match even a newline. '.' is often used where you want to match “any character”.
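The difference between the default mode and re.DOTALL can be seen in a small sketch. The two-line sample string is an assumption chosen for illustration.

```python
import re

text = "ab\ncd"

# By default '.' skips the newline between the two lines.
default_mode = re.findall(r".", text)

# With re.DOTALL, '.' matches the newline as well.
dotall_mode = re.findall(r".", text, flags=re.DOTALL)

print(default_mode)  # ['a', 'b', 'c', 'd']
print(dotall_mode)   # ['a', 'b', '\n', 'c', 'd']
```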
Repeating things
The first metacharacter for repeating things that we’ll look at is *. * doesn’t match the literal character *; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.
Another repeating metacharacter is +, which matches one or more times. Pay careful attention to the difference between * and +; * matches zero or more times, so whatever’s being repeated may not be present at all, while + requires at least one occurrence. For example, ca+t will match cat (1 a) and caaat (3 a’s), but won’t match ct.
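To compare * and + side by side, here is a minimal sketch. The pattern ca*t is added here only as a counterpart to ca+t, and the candidate list is made up for illustration.

```python
import re

star = re.compile(r"ca*t")  # zero or more 'a' characters
plus = re.compile(r"ca+t")  # one or more 'a' characters

candidates = ["ct", "cat", "caaat"]

# fullmatch requires the whole string to fit the pattern.
star_hits = [s for s in candidates if star.fullmatch(s)]
plus_hits = [s for s in candidates if plus.fullmatch(s)]

print(star_hits)  # ['ct', 'cat', 'caaat']
print(plus_hits)  # ['cat', 'caaat']
```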
There are two more repeating qualifiers. The question mark character, ?, matches either once or zero times; you can think of it as marking something as being optional. For example, home-?brew matches either homebrew or home-brew.
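The optional hyphen can be checked with a short sketch. The third test string, home--brew, is an extra case added here to show that ? allows at most one occurrence.

```python
import re

pattern = re.compile(r"home-?brew")

# Map each candidate string to whether the whole string matches.
results = {
    s: bool(pattern.fullmatch(s))
    for s in ["homebrew", "home-brew", "home--brew"]
}

print(results)
# {'homebrew': True, 'home-brew': True, 'home--brew': False}
```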
The most complicated repeated qualifier is {m,n}, where m and n are decimal integers. This qualifier means there must be at least m repetitions, and at most n. For example, a/{1,3}b will match a/b, a//b, and a///b. It won’t match ab, which has no slashes, or a////b, which has four.
You can omit either m or n; in that case, a reasonable value is assumed for the missing value. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity. Strictly speaking, the upper bound is an implementation limit of roughly 2 billion repetitions, but that might as well be infinity.
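The bounded and open-ended forms of {m,n} can be sketched as follows. The helper name matches and the patterns with an omitted bound are illustrative choices, not taken from the referenced pages.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [s for s in candidates if compiled.fullmatch(s)]

candidates = ["ab", "a/b", "a//b", "a///b", "a////b"]

bounded = matches(r"a/{1,3}b", candidates)      # between 1 and 3 slashes
at_least_two = matches(r"a/{2,}b", candidates)  # n omitted: 2 or more
at_most_two = matches(r"a/{,2}b", candidates)   # m omitted: 0 to 2

print(bounded)       # ['a/b', 'a//b', 'a///b']
print(at_least_two)  # ['a//b', 'a///b', 'a////b']
print(at_most_two)   # ['ab', 'a/b', 'a//b']
```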