2. String processing via regular expressions

2. Regular expressions

Regular expressions (regex) are patterns that match character strings.

The four main concepts of regex mirror the four types of structure in imperative programming languages.

  • Matching: /cat/
    corresponds to Sequence: i = 2; j = 3;
  • Memoization: (pattern)
    corresponds to Assignment: i = 2;
  • Alternation: /cat|dog/
    corresponds to Selection:
        if A:
            do thing;
        else:
            do other thing;
  • Repetition: /(cat)*/
    corresponds to Loop:
        while True:
            i += 1;

2.1 Matching

2.1.1 The foundation of regex is literal matching:
				/knowledge/
  • each character matches itself.
  • matches are case sensitive.
  • whitespace is significant:
    – /over priced/ won’t match “overpriced”
  • substrings are uninterpreted; they are not assumed to be whole words or to have any specific semantics:
    – /lane/ will match “planet”
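
The three rules above can be checked directly. The following is a minimal sketch using Python's re module (the helper name found is only for illustration and is not part of the original notes):

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

print(found(r"knowledge", "acknowledge"))   # True: each character matches itself
print(found(r"knowledge", "Knowledge"))     # False: matching is case sensitive
print(found(r"over priced", "overpriced"))  # False: the space in the pattern is significant
print(found(r"lane", "planet"))             # True: a substring need not be a whole word
```
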
2.1.2 The wildcard . is the most basic metacharacter:
  • it matches any single character (except a newline), which makes it good for crossword puzzles:
				/.n.wl.d../
    matches inside:
				acknowledge
				acknowledged
				...
2.1.3 The anchors ^ and $ match the start and end of a line or string, respectively:
				/^.n.wl.d..$/
				knowledge
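
To see what the anchors change, the sketch below runs the wildcard pattern with and without ^ and $ over a small word list. The word list is made up for illustration; Python's re module is assumed.

```python
import re

words = ["knowledge", "acknowledge", "acknowledged"]

unanchored = re.compile(r".n.wl.d..")   # may match anywhere inside the word
anchored = re.compile(r"^.n.wl.d..$")   # must span the whole word

unanchored_hits = [w for w in words if unanchored.search(w)]
anchored_hits = [w for w in words if anchored.search(w)]

print(unanchored_hits)  # ['knowledge', 'acknowledge', 'acknowledged']
print(anchored_hits)    # ['knowledge']
```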

2.2 Alternation

  • the | metacharacter expresses alternation or disjunction
/a|b|c/ matches "a", "b", or "c"

/cat|dog/ matches "cat" or "dog"

/\$(US|AU|CD)/ matches  "$US", "$AU" or "$CD"

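The | character has low precedence, so the parentheses in /\$(US|AU|CD)/ are necessary: without them, /\$US|AU|CD/ would match “$US”, “AU” or “CD”. The same difference separates /ed|ing$/, which matches “ed” anywhere or “ing” at the end of the line, from /(ed|ing)$/, which matches either suffix only at the end of the line.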

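The precedence of | is easiest to see on a word that contains “ed” but does not end in it. This is a small sketch with Python's re module; the test words are chosen only for illustration.

```python
import re

loose = re.compile(r"ed|ing$")     # "ed" anywhere, OR "ing" at the end
tight = re.compile(r"(ed|ing)$")   # "ed" or "ing", each only at the end

for word in ["walked", "walking", "editor"]:
    print(word, bool(loose.search(word)), bool(tight.search(word)))
# walked True True
# walking True True
# editor True False
```
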
2.3 Repetition

  • *: zero or more of the preceding element
  • ?: zero or one of the preceding element
  • +: one or more of the preceding element
  • {n}: exactly n of the preceding element
  • {m,n}: between m and n (inclusive) of the preceding element
  • {n,}: n or more of the preceding element
  • {,m}: up to m of the preceding element (not supported by every regex flavour)
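
Each quantifier can be tried out on short strings. The sketch below assumes Python's re module and uses re.fullmatch so that the whole string must be consumed; the helper name full is only for illustration.

```python
import re

def full(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(full(r"(cat)*", ""))         # True: * allows zero repetitions
print(full(r"(cat)*", "catcat"))   # True
print(full(r"colou?r", "color"))   # True: ? makes the "u" optional
print(full(r"colou?r", "colour"))  # True
print(full(r"a+", ""))             # False: + needs at least one "a"
print(full(r"a{3}", "aaa"))        # True
print(full(r"a{2,4}", "aaaaa"))    # False: five is more than four
print(full(r"a{2,}", "aaaaa"))     # True
print(full(r"a{,2}", "aaa"))       # False: Python accepts {,m}, and three is more than two
```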

2.4 Character classes

  • /[Kk]nowledge/
  • /[aeiou]/ is equivalent to /a|e|i|o|u/ or /(a|e|i|o|u)/
  • /^\$[0-9]+/
  • /^[A-Z][a-z]*/
  • / [A-Za-z]+ /

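Observe also that within [ and ], most metacharacters lose their special meaning and stand for themselves; for example, [.$] matches “.” or “$”. How the backslash is treated depends on the flavour: in POSIX bracket expressions it is literal, so the class [\$] matches “\” or “$”, whereas in Perl-style flavours \$ is simply an escaped “$”.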
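
The class examples above can be run as follows. This is a sketch in Python, whose re module follows the Perl-style convention, so a backslash inside a class still escapes; the sample strings are invented for illustration.

```python
import re

either_case = re.findall(r"[Kk]nowledge", "Knowledge is knowledge")
vowels = re.findall(r"[aeiou]", "regex")
price = re.search(r"^\$[0-9]+", "$100 off").group()
capitalised = re.search(r"^[A-Z][a-z]*", "Hello World").group()
literal_meta = re.findall(r"[.$]", "a.b$c")        # . and $ are literal inside a class
escaped_dollar = re.findall(r"[\$]", r"a\b$c")     # in Python, [\$] matches only "$"

print(either_case)     # ['Knowledge', 'knowledge']
print(vowels)          # ['e', 'e']
print(price)           # $100
print(capitalised)     # Hello
print(literal_meta)    # ['.', '$']
print(escaped_dollar)  # ['$']
```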

2.5 Negative classes

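A second use of the ^ metacharacter is to negate character classes: when ^ is the first character inside the brackets, the class matches any character not listed. For example, [^A-Za-z] matches any non-alphabetic character.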
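
A short sketch of both roles of ^, assuming Python's re module and made-up input strings:

```python
import re

text = "R2-D2 & C-3PO!"

# Inside brackets, a leading ^ negates the class.
non_letters = re.findall(r"[^A-Za-z]", text)
letters_only = re.sub(r"[^A-Za-z]", "", text)

# Outside brackets ^ is still an anchor, so both uses can appear in one pattern.
leading_non_letter = re.findall(r"^[^A-Za-z]", "42 cats")

print(non_letters)         # ['2', '-', '2', ' ', '&', ' ', '-', '3', '!']
print(letters_only)        # RDCPO
print(leading_non_letter)  # ['4']
```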

2.6 Named classes

Frequently used classes have shorthand names:

  • [0-9] = [[:digit:]] = \d
  • [a-zA-Z0-9_] = [[:word:]] = \w
  • [\ \t\r\n\f] = [[:space:]] = \s

and so do their negations:

  • [^0-9] = [^[:digit:]] = \D
  • [^a-zA-Z0-9_] = [^[:word:]] = \W
  • [^\ \t\r\n\f] = [^[:space:]] = \S

Note: which named character classes are available, and how they are written, depends on the software you use.
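
As one concrete case of that dependence, Python's standard re module provides the backslash shorthands but not the POSIX bracket names such as [[:digit:]]. For str patterns the shorthands are Unicode-aware by default, and the re.ASCII flag restricts them to ASCII. The sketch below uses an invented sample string.

```python
import re

sample = "Room 42,\tfloor_3\n"

digits = re.findall(r"\d", sample, re.ASCII)
words = re.findall(r"\w+", sample, re.ASCII)
spaces = re.findall(r"\s", sample, re.ASCII)
non_digit_runs = re.findall(r"\D+", "a1b22c", re.ASCII)

print(digits)          # ['4', '2', '3']
print(words)           # ['Room', '42', 'floor_3']
print(spaces)          # [' ', '\t', '\n']
print(non_digit_runs)  # ['a', 'b', 'c']
```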

2.7 Back-references or memoization

Placing a pattern in parentheses leads to the match being stored as a variable.

The first stored match has the name \1, the second \2, and so on. Sadly, there is no way of operating on stored matches, but they can be accessed for subsequent matching.

Examples:
		/([a-zA-Z]+) +\1/
		matches:
					az az
					azd azd

		/([a-zA-Z])\1/
		matches:
					aa
					bb
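
The two patterns above behave the same way in Python's re module, which also lets the stored group be reused in a replacement string. A minimal sketch, with test strings chosen only for illustration:

```python
import re

doubled_word = re.compile(r"([a-zA-Z]+) +\1")   # a word, spaces, then the same word
doubled_letter = re.compile(r"([a-zA-Z])\1")    # a letter followed by itself

same_word = bool(doubled_word.search("az az"))
different_word = bool(doubled_word.search("az bz"))
pairs = doubled_letter.findall("bookkeeper")
collapsed = doubled_word.sub(r"\1", "it is is fine")

print(same_word)       # True
print(different_word)  # False: \1 must repeat exactly what the group matched
print(pairs)           # ['o', 'k', 'e']
print(collapsed)       # it is fine
```

Because substrings are uninterpreted, doubled_word would also match the “is is” inside “this is”; adding word boundaries (\b) is one way to avoid that.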

2.8 Putting it all together

Now we can parse the regex from earlier on:

    ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
  • ^[A-Z0-9._%+-]+: from the start of the line, match one or more of these characters
  • @: followed by an “@”
  • [A-Z0-9.-]+: followed by one or more of these characters
  • \.: followed by a literal dot
  • [A-Z]{2,4}$: followed by 2–4 upper case letters, and then end of line
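
As written, the pattern only accepts upper case letters, so in practice it is usually combined with a case-insensitive flag. The sketch below assumes Python's re module and re.IGNORECASE; the addresses are hypothetical test values.

```python
import re

pattern = r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$"

strict = re.compile(pattern)                   # upper case only, as written
relaxed = re.compile(pattern, re.IGNORECASE)   # assumption: ignore case

print(bool(strict.search("JANE.DOE@EXAMPLE.COM")))   # True
print(bool(strict.search("jane.doe@example.com")))   # False: lower case is not in the classes
print(bool(relaxed.search("jane.doe@example.com")))  # True
print(bool(relaxed.search("jane.doe@example")))      # False: no dot followed by 2-4 letters
print(bool(relaxed.search("jane@example.museum")))   # False: final part is longer than 4 letters
```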