regex--python

Metacharacters inside [ ]:

Here’s a complete list of the metacharacters; their meanings will be discussed in the rest of this HOWTO.
. ^ $ * + ? { } [ ] \ | ( )
Metacharacters are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature.
But \ and ] still have special meaning inside a class (as do ^ at the very start and - between two characters):
>>> re.match(r'[\]a]+', 'aa]]]')
<_sre.SRE_Match object; span=(0, 5), match='aa]]]'>
>>> re.match(r'[\\n]+', '\\n')
<_sre.SRE_Match object; span=(0, 2), match='\\n'>
>>> re.match(r'[\n]+', '\n')
<_sre.SRE_Match object; span=(0, 1), match='\n'>
>>> re.match(r'[\n]+', '\n\n')
<_sre.SRE_Match object; span=(0, 2), match='\n\n'>
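To check which characters stay special inside a character class, a small sketch like the one below may help. The sample strings and variable names are made up for illustration; only re.findall() from the standard library is used.

```python
import re

# '$' is an ordinary character inside a class.
dollar = re.findall(r'[akm$]', 'make $5')        # ['m', 'a', 'k', '$']

# '^' negates the class only when it comes first; elsewhere it is literal.
negated = re.findall(r'[^ab]', 'abc^')           # ['c', '^']
literal_caret = re.findall(r'[ab^]', 'abc^')     # ['a', 'b', '^']

# '-' forms a range between two characters; placed last, it is literal.
in_range = re.findall(r'[a-c]', 'a-d')           # ['a']
literal_hyphen = re.findall(r'[ac-]', 'a-d')     # ['a', '-']

# ']' must be escaped (or placed first) to be part of the class.
bracket = re.findall(r'[\]a]', 'a]b')            # ['a', ']']
```

So the rule of thumb from these notes holds for most metacharacters, with \, ], a leading ^ and a middle - as the exceptions.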

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

>>> import re
>>> p = re.compile(r'ABC\-001')  # \- is an escaped hyphen: it matches a literal '-', so the pattern matches the text ABC-001
>>> p.match('ABC-001')
<_sre.SRE_Match object; span=(0, 7), match='ABC-001'>
>>> p.match(r'ABC\-001')
>>> p.match('ABC\\-001')
>>> 

>>> re.split(r'[\s\,\:]+', 'a,  b::  c d')
['a', 'b', 'c', 'd']
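The two sessions above suggest that a backslash before a punctuation character simply makes it literal. A minimal sketch to confirm this, assuming the same inputs as above (the '1+1=2' sample and the variable names are illustrative only):

```python
import re

escaped = re.compile(r'ABC\-001')
plain = re.compile(r'ABC-001')

# Outside a character class '-' is not special, so both patterns match the same text.
same = escaped.match('ABC-001').group() == plain.match('ABC-001').group()   # True

# The subject string here contains a real backslash, which the pattern does not allow.
no_match = escaped.match(r'ABC\-001')                                       # None

# Inside a class, ',' and ':' need no backslash either.
with_escapes = re.split(r'[\s\,\:]+', 'a,  b::  c d')
without_escapes = re.split(r'[\s,:]+', 'a,  b::  c d')

# re.escape() builds a pattern that matches arbitrary text literally.
literal = re.compile(re.escape('1+1=2'))
found = literal.search('so 1+1=2 holds').group()                            # '1+1=2'
```

When the text to match comes from user input, re.escape() is probably safer than escaping characters by hand.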



---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

* + ? {m,n} greedy repetition

Repetitions such as * are greedy; when repeating a RE, the matching engine will try to repeat it as many times as possible. If later portions of the pattern don’t match, the matching engine will then back up and try again with fewer repetitions.

A step-by-step example will make this more obvious. Let’s consider the expression a[bcd]*b. This matches the letter 'a', zero or more letters from the class [bcd], and finally ends with a 'b'. Now imagine matching this RE against the string abcbd.

Step  Matched  Explanation
1     a        The a in the RE matches.
2     abcbd    The engine matches [bcd]*, going as far as it can, which is to the end of the string.
3     Failure  The engine tries to match b, but the current position is at the end of the string, so it fails.
4     abcb     Back up, so that [bcd]* matches one fewer character.
5     Failure  Try b again, but the current position is at the last character, which is a 'd'.
6     abc      Back up again, so that [bcd]* is only matching bc.
7     abcb     Try b again. This time the character at the current position is 'b', so it succeeds.
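The end result of this backtracking can be checked directly. The sketch below uses the same pattern and string as the walkthrough; the non-greedy form *? is added only as a contrast and is not part of the example above.

```python
import re

text = 'abcbd'

# Greedy: [bcd]* first grabs 'bcbd', then backs up until the final b can match.
greedy = re.match(r'a[bcd]*b', text)
greedy_text = greedy.group()     # 'abcb'
greedy_span = greedy.span()      # (0, 4)

# Non-greedy contrast: [bcd]*? takes as few characters as possible.
lazy_text = re.match(r'a[bcd]*?b', text).group()   # 'ab'
```

The trailing 'd' is never part of the match, because no b follows it.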
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Compiling Regular Expressions

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

>>> import re
>>> p = re.compile('ab*')
>>> p
re.compile('ab*')

re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations. We’ll go over the available settings later, but for now a single example will do:

>>> p = re.compile('ab*', re.IGNORECASE)
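A short sketch of what the flag changes, using the same 'ab*' pattern; the test strings are made up for illustration:

```python
import re

case_sensitive = re.compile('ab*')
case_insensitive = re.compile('ab*', re.IGNORECASE)

lower = case_sensitive.match('abbbc').group()      # 'abbb'
upper_default = case_sensitive.match('ABBB')       # None: case matters by default
upper_ignore = case_insensitive.match('ABBB').group()   # 'ABBB'
```

Compiling once and reusing the pattern object is convenient when the same RE is applied to many strings.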

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The Backslash Plague

The RE is passed to re.compile() as a string. REs are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them.

As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python’s usage of the same character for the same purpose in string literals.

Let’s say you want to write a RE that matches the string \section, which might be found in a LaTeX file. To figure out what to write in the program code, start with the desired string to be matched.

Next, you must escape any backslashes and other metacharacters by preceding them with a backslash. This gives \\section, which is the string that must be passed to re.compile(). However, to express this as a Python string literal, both backslashes must be escaped again.

Characters      Stage
\section        Text string to be matched
\\section       Escaped backslash for re.compile()
"\\\\section"   Escaped backslashes for a string literal

In short, to match a literal backslash, one has to write '\\\\' as the RE string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. In REs that feature backslashes repeatedly, this leads to lots of repeated backslashes and makes the resulting strings difficult to understand.

The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.

Regular String   Raw string
"ab*"            r"ab*"
"\\\\section"    r"\\section"
"\\w+\\s+\\1"    r"\w+\s+\1"
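The equivalence between the regular and raw forms can be verified in a few lines. This is only a sketch; the LaTeX line and variable names are invented for the example.

```python
import re

regular = "\\\\section"     # four backslashes in source, two in the actual string
raw = r"\\section"          # two backslashes in source, two in the actual string

same_pattern = regular == raw          # True
pattern_length = len(raw)              # 9: two backslashes plus 'section'

# r"\n" is backslash + 'n'; "\n" is a single newline character.
raw_len, cooked_len = len(r"\n"), len("\n")   # (2, 1)

# The RE \\section matches one literal backslash followed by 'section'.
latex_line = r"\section{Introduction}"
matched = re.match(raw, latex_line).group()
```

Here matched is the eight-character text \section, which is what the original goal was.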
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Backreferences:

Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise. Remember that Python’s string literals also use a backslash followed by numbers to allow including arbitrary characters in a string, so be sure to use a raw string when incorporating backreferences in a RE.

For example, the following RE detects doubled words in a string.

>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'

Backreferences like this aren’t often useful for just searching through a string — there are few text formats which repeat data in this way — but you’ll soon find out that they’re very useful when performing string substitutions.
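As a rough illustration of both points (why the raw string matters, and how a backreference helps in a substitution), here is a minimal sketch. The trailing \b in the pattern is my own addition, not part of the RE above, so that text like 'the theory' is not treated as a doubled word.

```python
import re

# In a regular string literal, "\1" is an octal escape for chr(1), not a backreference.
octal_escape = "\1" == "\x01"      # True
raw_backref = r"\1" == "\\1"       # True: a backslash followed by '1'

# Assumed variant of the doubled-word RE, with \b moved outside the group and added at the end.
doubled = re.compile(r'\b(\w+)\s+\1\b')

# In the replacement, \1 refers to the text captured by group 1.
fixed = doubled.sub(r'\1', 'Paris in the the spring')
untouched = doubled.sub(r'\1', 'the theory')
```

With this input, fixed is 'Paris in the spring' and untouched stays 'the theory'.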


