2. String processing via regular expressions

Article link: https://blog.csdn.net/Saul_M/article/details/98505933

2. Regular expressions

Regular expressions (regex) are patterns that match character strings.

The four main concepts of regex mirror the four types of structure in imperative programming languages. Each regex concept below is followed by its programming counterpart.

Matching: /cat/
corresponds to sequence:
				i = 2; j = 3;

Memoization: /(pattern)/
corresponds to assignment:
				i = 2;

Alternation: /cat|dog/
corresponds to selection:
				if A:
					do thing;
				else:
					do other thing;

Repetition: /(cat)*/
corresponds to a loop:
				while True:
					i += 1;

2.1 Matching

2.1.1 The foundation of regex is literal matching:

				/knowledge/

Each character matches itself.
Matches are case sensitive.
Whitespace is significant:
–/over priced/ won’t match “overpriced”
Patterns match substrings and are otherwise uninterpreted; they are not assumed to be whole words or to have any specific semantics:
–/lane/ will match “planet”
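
The rules above can be checked with a minimal sketch using Python's standard re module. The helper name found is hypothetical and chosen only for this illustration; re.search scans the whole string, which is why substrings match.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

print(found(r"knowledge", "knowledge"))     # each character matches itself
print(found(r"knowledge", "Knowledge"))     # case sensitive, so no match
print(found(r"over priced", "overpriced"))  # the space is significant, so no match
print(found(r"lane", "planet"))             # "lane" occurs inside "planet"
```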

2.1.2 The wildcard . is the most basic metacharacter

It matches any single character (except a newline), which makes it good for crossword puzzles:

				/.n.wl.d../
				acknowledge
				acknowledged
				...

2.1.3 The anchors ^ and $ match the start and end of a line or string, respectively.

				/^.n.wl.d..$/
				knowledge
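
As a rough illustration of the wildcard and the anchors, the sketch below runs both patterns over a small, hand-picked word list with Python's re module. The word list and variable names are assumptions made for the example.

```python
import re

words = ["knowledge", "acknowledge", "acknowledged"]

# Unanchored: the pattern may match anywhere inside the string.
unanchored = [w for w in words if re.search(r".n.wl.d..", w)]

# Anchored: the nine-character pattern must span the whole string.
anchored = [w for w in words if re.search(r"^.n.wl.d..$", w)]

print(unanchored)  # all three words contain a matching substring
print(anchored)    # only "knowledge" matches from start to end

# The dot does not match a newline unless the re.DOTALL flag is given.
dot_vs_newline = re.search(r"a.b", "a\nb") is None
dot_with_dotall = re.search(r"a.b", "a\nb", re.DOTALL) is not None
print(dot_vs_newline, dot_with_dotall)
```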

2.2 Alternation

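The | metacharacter expresses alternation (disjunction).

/a|b|c/ matches "a", "b", or "c"

/cat|dog/ matches "cat" or "dog"

/\$(US|AU|CD)/ matches "$US", "$AU" or "$CD"

The | character has low precedence, so the parentheses in the last example are necessary: without them, /\$US|AU|CD/ would match "$US", "AU" or "CD". The same difference separates /ed|ing$/, which matches "ed" anywhere or "ing" at the end of the line, from /(ed|ing)$/, which matches "ed" or "ing" only at the end of the line.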

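The effect of the low precedence of | can be seen in a short sketch with Python's re module. The word list is an assumption chosen so that one word ("editor") contains "ed" without ending in it.

```python
import re

words = ["walked", "walking", "editor", "ringing"]

# Without parentheses the pattern reads as: "ed" anywhere, OR "ing" at the end.
loose = [w for w in words if re.search(r"ed|ing$", w)]

# With parentheses the anchor applies to both alternatives.
strict = [w for w in words if re.search(r"(ed|ing)$", w)]

print(loose)   # includes "editor", because "ed" appears at its start
print(strict)  # excludes "editor", because it ends in neither "ed" nor "ing"
```
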
2.3 Repetition

*: zero or more of the preceding element
?: zero or one of the preceding element
+: one or more of the preceding element
{n}: exactly n of the preceding element
{m,n}: between m and n (inclusive) of the preceding element
{n,}: n or more of the preceding element
{,m}: up to m of the preceding element (not accepted by every regex engine)
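
The quantifiers listed above can be exercised with a small sketch in Python, whose re module accepts all seven forms, including {,m}. The helper full and the test strings are hypothetical and exist only for this illustration; re.fullmatch requires the pattern to cover the entire string.

```python
import re

def full(pattern, text):
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text) pairs; in each pattern the quantifier applies to "b" or "(cat)".
checks = [
    (r"ab*", "a"),          # * : zero b's is acceptable
    (r"ab?", "abb"),        # ? : at most one b, so two b's fail
    (r"ab+", "a"),          # + : at least one b is required
    (r"ab{2}", "abb"),      # {2} : exactly two b's
    (r"ab{1,3}", "abbbb"),  # {1,3} : four b's is too many
    (r"ab{2,}", "abbbb"),   # {2,} : two or more b's
    (r"ab{,2}", "a"),       # {,2} : zero to two b's
    (r"(cat)*", "catcat"),  # the quantifier applies to the whole group
]

outcomes = [full(pattern, text) for pattern, text in checks]
for (pattern, text), ok in zip(checks, outcomes):
    print(pattern, repr(text), ok)
```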

2.4 Character classes

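A character class, written in square brackets, matches any single one of the characters it lists; ranges such as A-Z are allowed.

/[Kk]nowledge/ matches "Knowledge" or "knowledge"
/[aeiou]/ is equivalent to /a|e|i|o|u/ or /(a|e|i|o|u)/
/^\$[0-9]+/ matches a dollar sign followed by one or more digits at the start of a line
/^[A-Z][a-z]*/ matches a capital letter followed by zero or more lower-case letters at the start of a line
/ [A-Za-z]+ / matches a run of letters with a space on either side

Observe also that within [ ], most metacharacters lose their special meaning and stand for themselves. For example, in some languages, the class [\$] matches “\” or “$”; in others the backslash still acts as an escape, and the class matches only “$”.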
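
A brief sketch of these classes in Python follows; the sample strings are assumptions made for the example. Note that Python is one of the languages in which a backslash keeps its escaping role inside a class, so [\$] matches only a dollar sign there.

```python
import re

both_cases = re.findall(r"[Kk]nowledge", "Knowledge is knowledge")
price = re.search(r"^\$[0-9]+", "$150 each").group()
first_word = re.search(r"^[A-Z][a-z]*", "Hello World").group()
spaced_word = re.findall(r" [A-Za-z]+ ", "a word here")

print(both_cases, price, first_word, spaced_word)

# In Python, [\$] is an escaped "$" and matches only the dollar sign.
sample = r"\$"  # two characters: a backslash and a dollar sign
only_dollar = re.findall(r"[\$]", sample)

# To match either character, the backslash itself must be escaped.
backslash_or_dollar = re.findall(r"[\\$]", sample)

print(only_dollar, backslash_or_dollar)
```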

2.5 Negative classes

A second use of the ^ metacharacter is to negate character classes: placed immediately after the opening bracket, it inverts the class. [^A-Za-z] matches any non-alphabetic character.
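
A minimal sketch of a negated class in Python; the input strings are made up for the illustration.

```python
import re

# [^A-Za-z] matches any single character that is not an ASCII letter.
non_letters = re.findall(r"[^A-Za-z]", "a1 b!")

# A common use: delete everything except letters.
letters_only = re.sub(r"[^A-Za-z]", "", "R2-D2 & C-3PO")

# Anywhere but the first position, ^ is an ordinary character inside a class.
literal_caret = re.findall(r"[a^]", "a^b")

print(non_letters, letters_only, literal_caret)
```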

2.6 Named classes

Frequently used classes have names:

[0-9] = [[:digit:]] = \d
[a-zA-Z0-9_] = [[:word:]] = \w
[\ \t\r\n\f] = [[:space:]] = \s

and so do their negations:

[^0-9] = [^[:digit:]] = \D
[^a-zA-Z0-9_] = [^[:word:]] = \W
[^\ \t\r\n\f] = [^[:space:]] = \S

Note: which named character classes are available, and how they are written, depends on the software you use.
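
As one concrete case of that dependence, Python's re module offers the backslash shorthands but not the [[:digit:]] style names. The sketch below passes the re.ASCII flag, because without it \d and \w also match non-ASCII digits and letters in Python 3 strings. The sample text is an assumption.

```python
import re

text = "Room 42,\tfloor_3!"

digits = re.findall(r"\d+", text, re.ASCII)      # same as [0-9]+
words = re.findall(r"\w+", text, re.ASCII)       # same as [a-zA-Z0-9_]+
chunks = re.findall(r"\S+", text)                # runs of non-whitespace
non_digits = re.findall(r"\D+", text, re.ASCII)  # same as [^0-9]+

print(digits, words, chunks, non_digits)

# The shorthands agree with the explicit classes on this input.
same_digits = digits == re.findall(r"[0-9]+", text)
same_words = words == re.findall(r"[a-zA-Z0-9_]+", text)
print(same_digits, same_words)
```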

2.7 Back-references or memoization

Placing a pattern in parentheses causes the text it matched to be stored, much like a variable.

The first stored match is referred to as \1, the nth as \n. Sadly, there is no way of operating on stored matches, but they can be accessed for subsequent matching: \1 matches exactly the same text that the first group matched.

Examples:
		/([a-zA-Z]+) +\1/
		matches:
					az az
					azd azd

		/([a-zA-Z])\1/
		matches:
					aa
					bb
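
The two examples can be tried out with a short Python sketch; the variable names and the extra test strings are assumptions. It also shows a consequence of substring matching: the unanchored pattern finds "is is" inside "this is", which a word boundary (\b) prevents.

```python
import re

doubled = re.compile(r"([a-zA-Z]+) +\1")

match = doubled.search("azd azd")
whole, stored = match.group(0), match.group(1)
print(whole, stored)  # the whole match, and the text stored as \1

# Different words: \1 must repeat the stored text exactly, so no match.
no_match = doubled.search("az by") is None

# Substrings count: "is" (the tail of "this") is followed by " is".
accidental = doubled.search("this is fine").group(0)

# Word boundaries restrict the match to whole repeated words.
whole_words = re.compile(r"\b([a-zA-Z]+) +\1\b")
fixed = whole_words.search("this is fine") is None

# Doubled letters: group(0) is the full match, e.g. "aa".
pairs = [m.group(0) for m in re.finditer(r"([a-zA-Z])\1", "aa bb ab")]

print(no_match, accidental, fixed, pairs)
```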

2.8 Putting it all together

Now we can parse the regex from earlier on:

    ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$

^[A-Z0-9._%+-]+: start of line, then one or more of these characters
@: followed by an “@”
[A-Z0-9.-]+: followed by one or more of these characters
\.: followed by a literal dot
[A-Z]{2,4}$: followed by 2–4 upper case letters, and then end of line
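
A sketch of how this pattern might be used in Python. Because the classes list only upper case letters, the example assumes the re.IGNORECASE flag so that lower case addresses also match; the function name looks_like_email and the sample addresses are hypothetical.

```python
import re

# re.IGNORECASE is an assumption: the pattern itself lists only A-Z.
EMAIL = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)

def looks_like_email(text):
    """Return True if the text fits the pattern; this is not full validation."""
    return EMAIL.search(text) is not None

for candidate in [
    "john.smith@example.com",  # fits the pattern
    "JOHN@EXAMPLE.COM",        # fits even without the flag
    "john@example",            # no dot followed by 2-4 letters
    "john@example.museum",     # final part is longer than 4 letters
    " john@example.com",       # leading space is not in the first class
]:
    print(repr(candidate), looks_like_email(candidate))
```

The rejected "museum" address shows that the {2,4} limit is a simplification rather than a rule that real addresses follow.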