2. Regular expressions
Regular expressions (regexes) are patterns that match character strings.
The four main concepts of regex mirror the four types of structure in imperative programming languages.
- Matching: /cat/
  Sequence: i = 2; j = 3;
- Memoization: /(pattern)/
  Assignment: i = 2;
- Alternation: /cat|dog/
  Selection: if A:
                 do thing;
             else:
                 do other thing;
- Repetition: /(cat)*/
  Loop: while True:
            i += 1;
2.1 Matching
2.1.1 The foundation of regex is literal matching:
/knowledge/
- each character matches itself.
- matches are case sensitive.
- whitespace is significant:
–/over priced/ won’t match “overpriced”
- substrings are uninterpreted: they are not assumed to be whole words or to carry any specific meaning.
–/lane/ will match “planet”
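The rules for literal matching can be checked with a short sketch in Python's re module. The /.../ delimiters used in these notes are not Python syntax, so the patterns are written as raw strings; the helper name found is made up for this illustration.

```python
import re

def found(pattern, text):
    """Return True if pattern matches anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

print(found(r"knowledge", "acknowledge"))   # True: matched as a substring
print(found(r"knowledge", "Knowledge"))     # False: matching is case sensitive
print(found(r"over priced", "overpriced"))  # False: the space must be matched too
print(found(r"lane", "planet"))             # True: not restricted to whole words
```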
2.1.2 The wildcard . is the most basic metacharacter
- matches any single character (except a newline); good for crossword puzzles:
/.n.wl.d../
acknowledge
acknowledged
.
.
.
2.1.3 The anchors ^ and $ match the start and end of a line or string, respectively.
/^.n.wl.d..$/
knowledge
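A minimal sketch of the crossword pattern with and without anchors, again assuming Python's re module. The three-word list is made up for the illustration; a real crossword solver would read from a dictionary file.

```python
import re

words = ["knowledge", "acknowledge", "acknowledged"]

# Without anchors the pattern may match anywhere inside a word.
unanchored = [w for w in words if re.search(r".n.wl.d..", w)]

# With ^ and $ the nine positions must cover the whole word.
anchored = [w for w in words if re.search(r"^.n.wl.d..$", w)]

print(unanchored)  # ['knowledge', 'acknowledge', 'acknowledged']
print(anchored)    # ['knowledge']
```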
2.2 Alternation
- the | metacharacter expresses alternation or disjunction
/a|b|c/ matches "a", "b", or "c"
/cat|dog/ matches "cat" or "dog"
/\$(US|AU|CD)/ matches "$US", "$AU" or "$CD"
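The alternation examples can be tried out as follows. This is a sketch assuming Python's re module; the helper matches and the sample strings are made up. The last two lines contrast the currency pattern with and without its parentheses.

```python
import re

def matches(pattern, text):
    """Return the text of every non-overlapping match (hypothetical helper)."""
    return [m.group() for m in re.finditer(pattern, text)]

pets = matches(r"cat|dog", "a cat chased a dog")
grouped = matches(r"\$(US|AU|CD)", "$US 5, AU 7, $CD 9")
ungrouped = matches(r"\$US|AU|CD", "$US 5, AU 7, $CD 9")

print(pets)       # ['cat', 'dog']
print(grouped)    # ['$US', '$CD']: the "$" is required before every code
print(ungrouped)  # ['$US', 'AU', 'CD']: the "$" belongs to the first alternative only
```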
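The | character has low precedence, so the parentheses in the last example are necessary: without them, /\$US|AU|CD/ would match "$US", "AU" or "CD". The same difference separates:
/ed|ing$/ which matches "ed" anywhere, or "ing" at the end of a line
/(ed|ing)$/ which matches "ed" or "ing" at the end of a line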
2.3 Repetition
- *: zero or more of the preceding element
- ?: zero or one of the preceding element
- +: one or more of the preceding element
- {n}: exactly n of the preceding element
- {m,n}: between m and n (inclusive) of the preceding element
- {n,}: n or more of the preceding element
- {,m}: up to m of the preceding element
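Each quantifier in the list above can be tested against a few strings. The sketch below assumes Python's re module and uses re.fullmatch so that the whole string has to match; the helper name whole is made up. Python accepts the {,m} form, but some other engines read it as literal text, so it is worth checking before relying on it elsewhere.

```python
import re

def whole(pattern, text):
    """Return True if pattern matches all of text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

star = [whole(r"ab*", s) for s in ("a", "ab", "abbb")]              # [True, True, True]
plus = [whole(r"ab+", s) for s in ("a", "ab", "abbb")]              # [False, True, True]
optional = [whole(r"colou?r", s) for s in ("color", "colour", "colouur")]  # [True, True, False]
exact = [whole(r"a{3}", s) for s in ("aa", "aaa", "aaaa")]          # [False, True, False]
between = [whole(r"a{2,3}", s) for s in ("a", "aa", "aaa", "aaaa")] # [False, True, True, False]
at_least = [whole(r"a{2,}", s) for s in ("a", "aa", "aaaaa")]       # [False, True, True]
at_most = [whole(r"a{,2}", s) for s in ("", "aa", "aaa")]           # [True, True, False]
```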
2.4 Character classes
- /[Kk]nowledge/
- /[aeiou]/ is equivalent to /a|e|i|o|u/ or /(a|e|i|o|u)/
- /^\$[0-9]+/
- /^[A-Z][a-z]*/
- / [A-Za-z]+ /
Note also that inside [ ] most metacharacters lose their special meaning and stand for themselves. For example, in some languages the class [\$] matches “\” or “$”.
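The character-class examples can be run as a short sketch, assuming Python's re module and made-up sample strings. The last line shows that Python is not one of the languages where a backslash inside a class is literal: there, [\$] matches only "$".

```python
import re

either_case = re.findall(r"[Kk]nowledge", "Knowledge is knowledge")
price = re.search(r"^\$[0-9]+", "$42 each").group()
capitalised = re.search(r"^[A-Z][a-z]*", "Hello world").group()
spaced_word = re.search(r" [A-Za-z]+ ", "a tiny word").group()

# "$" and "." need no escaping inside a class.
literal_meta = re.findall(r"[$.]", "cost: $3.50")

# In Python the backslash still escapes inside a class, so only "$" is found.
python_escape = re.findall(r"[\$]", r"a\b$c")

print(either_case)    # ['Knowledge', 'knowledge']
print(price)          # $42
print(capitalised)    # Hello
print(literal_meta)   # ['$', '.']
print(python_escape)  # ['$']
```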
2.5 Negative classes
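A second use of the ^ metacharacter is to negate character classes: when it is the first character inside the brackets, the class matches any character that is not listed. [^A-Za-z] matches any non-alphabetic character.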
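A brief sketch of the two roles of ^, assuming Python's re module and made-up sample strings: outside the brackets it anchors, and as the first character inside them it negates.

```python
import re

# Every character that is not an ASCII letter.
non_letters = re.findall(r"[^A-Za-z]", "a1 b-2")

# Replace each run of non-letters with a single space.
letters_only = re.sub(r"[^A-Za-z]+", " ", "rock&roll, 1955!")

# Anchor plus negated class: the line starts with a non-letter.
starts_with_non_letter = re.search(r"^[^A-Za-z]", "9 lives") is not None
starts_with_letter = re.search(r"^[^A-Za-z]", "nine lives") is not None

print(non_letters)   # ['1', ' ', '-', '2']
print(letters_only)  # 'rock roll ' (note the trailing space)
```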
2.6 Named classes
- [0-9] = [[:digit:]] = \d
- [a-zA-Z0-9_] = [[:word:]] = \w
- [\ \t\r\n\f] = [[:space:]] = \s
and so do their negations:
- [^0-9] = [^[:digit:]] = \D
- [^a-zA-Z0-9_] = [^[:word:]] = \W
- [^\ \t\r\n\f] = [^[:space:]] = \S
Note: which named character classes are available, and how they are written, depends on the software you use.
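As one concrete case of that dependence, here is a sketch assuming Python's re module, which provides the backslash forms (\d, \w, \s and their negations) but not the POSIX bracket names. It also shows that for ordinary Python strings \d is wider than [0-9] unless the re.ASCII flag is given. The sample strings are made up.

```python
import re

digits = re.findall(r"\d+", "room 42, floor 7")    # ['42', '7']
words = re.findall(r"\w+", "snake_case x2!")       # ['snake_case', 'x2']
fields = re.split(r"\s+", "a  b\tc\nd")            # ['a', 'b', 'c', 'd']
non_digits = re.findall(r"\D+", "a1b22c")          # ['a', 'b', 'c']

# ARABIC-INDIC DIGIT THREE is a Unicode decimal digit but is not in [0-9].
arabic_three = "\u0663"
unicode_digit = re.fullmatch(r"\d", arabic_three) is not None          # True
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None  # False
```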
2.7 Back-references or memoization
Placing a pattern in parentheses stores the text it matched, much like a variable.
The first stored match is named \1, the second \2, and so on. Sadly, there is no way of operating on stored matches, but they can be referred to later in the same pattern.
ex:
/([a-zA-Z]+) +\1/
matches:
az az
azd azd
/([a-zA-Z])\1/
matches:
aa
bb
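Both back-reference patterns can be tried in a short sketch, assuming Python's re module and made-up sample text. Note that re.findall returns the contents of the group, not the whole match, when the pattern contains exactly one group.

```python
import re

# A word followed by one or more spaces and the same word again.
repeat = re.search(r"([a-zA-Z]+) +\1", "it was was late")
whole_match = repeat.group()    # 'was was'
stored_word = repeat.group(1)   # 'was', the text remembered as \1

# A letter immediately followed by the same letter.
doubled = re.findall(r"([a-zA-Z])\1", "bookkeeper")  # ['o', 'k', 'e']

# \1 must repeat the same text, not merely match the same pattern again.
different_words = re.search(r"([a-zA-Z]+) +\1", "az bd") is None  # True
```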
2.8 Putting it all together
Now we can parse the regex from earlier on:
^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
- ^[A-Z0-9._%+-]+: at the start of the line, match one or more of these characters
- @: followed by an “@”
- [A-Z0-9.-]+: followed by one or more of these characters
- \.: followed by a literal dot
- [A-Z]{2,4}$: followed by 2–4 upper case letters, and then end of line
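The pattern can be exercised as below. This is a sketch assuming Python's re module; the helper looks_like_email and the sample addresses are made up. Since the classes list upper case letters only, the sketch passes re.IGNORECASE by default so that lower case addresses also match. The pattern is a simplification and rejects some valid addresses, for example those whose final part is longer than four letters.

```python
import re

EMAIL = r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$"

def looks_like_email(text, flags=re.IGNORECASE):
    """Return True if text matches the simplified pattern (hypothetical helper)."""
    return re.search(EMAIL, text, flags) is not None

print(looks_like_email("jane.doe@example.com"))           # True
print(looks_like_email("JANE@EXAMPLE.ORG"))               # True
print(looks_like_email("jane@example"))                   # False: no dot and final part
print(looks_like_email("jane@example.museum"))            # False: final part has 6 letters
print(looks_like_email("jane.doe@example.com", flags=0))  # False: lower case without the flag
```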