Positive lookahead
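import re

# (?=pattern) is a positive lookahead assertion.
# The pattern below rejects an address that has only one of its two angle brackets.
address = re.compile(
    r'''
    # A name is required and must be followed by whitespace
    ((?P<name>
       ([\w.,]+\s+)*[\w.,]+
     )
     \s+
    )

    # Lookahead: the angle brackets must come as a pair or not at all;
    # a single bracket is rejected.
    (?= (<.*>$)         # remainder wrapped in a pair of angle brackets
        |
        ([^<].*[^>]$)   # remainder not wrapped in angle brackets
    )

    <?                  # optional opening angle bracket

    (?P<email>
      [\w\d.+-]+        # user name
      @
      ([\w\d.]+\.)+     # domain name prefix
      (com|org|edu)     # allowed top-level domains
    )

    >?                  # optional closing angle bracket
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    'First Last <first.last@example.com>',
    'No Brackets first.last@example.com',
    'Open Bracket <first.last@example.com',
    'Close Bracket first.last@example.com>',
]

for candidate in candidates:
    print('Candidate:', candidate)
    match = address.search(candidate)
    if match:
        print('  Name :', match.group('name'))
        print('  Email:', match.group('email'))
    else:
        print('  No match')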
Output
Candidate: First Last <first.last@example.com>
  Name : First Last
  Email: first.last@example.com
Candidate: No Brackets first.last@example.com
  Name : No Brackets
  Email: first.last@example.com
Candidate: Open Bracket <first.last@example.com
  No match
Candidate: Close Bracket first.last@example.com>
  No match
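To see what the lookahead contributes on its own, the assertion can be pulled out of the address pattern and applied directly to the text that follows the name. This is only a minimal sketch: `bracket_check` and `brackets_ok` are hypothetical names introduced for illustration and are not part of the original example.

```python
import re

# The same lookahead as in the address pattern, isolated (hypothetical name).
bracket_check = re.compile(r'(?=(<.*>$)|([^<].*[^>]$))')


def brackets_ok(rest):
    """Return True if rest has both angle brackets or neither (hypothetical helper)."""
    return bracket_check.match(rest) is not None


for rest in ['<first.last@example.com>',
             'first.last@example.com',
             '<first.last@example.com',
             'first.last@example.com>']:
    print('%-26s %s' % (rest, brackets_ok(rest)))

# A lookahead is zero-width: it succeeds without consuming any text.
zero_width = bracket_check.match('first.last@example.com').span()
print(zero_width)
```

Only the first two inputs pass the check, and the reported span is empty. Because the assertion consumes nothing, the address pattern can still go on to match the optional `<` and the email itself right after it.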
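About lookahead and lookbehind
Given the string: foobarbarfoo
bar(?=bar) finds the first bar (the bar that is followed by another bar).
bar(?!bar) finds the second bar (the bar that is not followed by another bar).
(?<=foo)bar finds the first bar (the bar that is preceded by foo).
(?<!foo)bar finds the second bar (the bar that is not preceded by foo).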
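The four cases above can be checked with a short script. This is a sketch that simply runs each pattern against foobarbarfoo with `re.search` and records where it matched; the `patterns` and `spans` names are made up for the illustration.

```python
import re

text = 'foobarbarfoo'   # the first bar spans 3-6, the second bar spans 6-9

patterns = [
    r'bar(?=bar)',    # bar followed by bar
    r'bar(?!bar)',    # bar not followed by bar
    r'(?<=foo)bar',   # bar preceded by foo
    r'(?<!foo)bar',   # bar not preceded by foo
]

spans = {}
for pattern in patterns:
    match = re.search(pattern, text)
    spans[pattern] = match.span()
    print('%-12s -> %s at %s' % (pattern, match.group(), match.span()))
```

Every match is exactly three characters long: the text examined by the assertion is never part of the match, only a condition on where the match may occur.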
The following explanation comes from Stack Overflow:
Look ahead Positive(?=)
Find expression A where expression B follows
A(?=B)
Look ahead Negative(?!)
Find expression A where expression B does not follow
A(?!B)
Look behind Positive(?<=)
Find expression A where expression B precedes
(?<=B)A
Look behind Negative(?<!)
Find expression A where expression B does not precede
(?<!B)A
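As a concrete reading of the A/B forms, the sketch below takes A to be a number and B to be a currency or percent sign. The sample sentence and the variable names are invented for the illustration.

```python
import re

text = 'pay $25 for 3 items at 40% off'   # made-up sample sentence

# A(?=B): digits followed by a percent sign
followed_by_percent = re.findall(r'\d+(?=%)', text)

# A(?!B): whole numbers not followed by a percent sign
not_followed_by_percent = re.findall(r'\b\d+\b(?!%)', text)

# (?<=B)A: digits preceded by a dollar sign
preceded_by_dollar = re.findall(r'(?<=\$)\d+', text)

# (?<!B)A: whole numbers not preceded by a dollar sign
not_preceded_by_dollar = re.findall(r'(?<!\$)\b\d+', text)

print(followed_by_percent)       # ['40']
print(not_followed_by_percent)   # ['25', '3']
print(preceded_by_dollar)        # ['25']
print(not_preceded_by_dollar)    # ['3', '40']
```

The `\b` anchors in the two negative forms matter. Without them, `\d+(?!%)` would backtrack and match the "4" of "40%", because "4" is followed by "0" rather than by a percent sign.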
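Atomic groups
Note: an atomic group is a special non-capturing group that the engine never backtracks into once it has matched; it can be used to improve regular expression performance.
Without an atomic group: /\b(engineer|engrave|end)\b/
When this pattern is applied to "engineering", the engine first matches "engineer", but the \b that follows requires a word boundary and the next character is "i", so this attempt fails. The engine then backtracks and tries the next alternative, engrave: it matches "eng", the rest does not line up, and the attempt fails. Finally it tries "end", which fails as well. Look closely and you will see that once engineer has matched and the failure occurs at the word boundary, "engrave" and "end" can no longer succeed.
The text at this position is already known to read "engineer", which neither of the other two words can match, so the engine should not waste effort on further attempts.
With an atomic group: /\b(?>engineer|engrave|end)\b/
The group is matched only once: when the \b after engineer fails, the engine does not backtrack into the group to try the other alternatives, and the match at this position fails immediately.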
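Python's `re` module accepts the `(?>...)` syntax starting with Python 3.11, so the comparison can be tried directly. The sketch below is hedged accordingly: it guards the atomic pattern behind a version check, and the sample sentence and the `find_words` helper are hypothetical. Both patterns find the same words here, because the atomic group only removes backtracking that could never have succeeded. The last two lines use a deliberately different pattern to show a case where refusing to backtrack does change the result.

```python
import re
import sys

WORDS = r'engineer|engrave|end'

# Ordinary group: after \b fails, the engine backtracks into the group
# and tries the remaining alternatives.
plain = re.compile(r'\b(' + WORDS + r')\b')

# Atomic groups need Python 3.11+; on older versions this stays None.
atomic = None
if sys.version_info >= (3, 11):
    atomic = re.compile(r'\b(?>' + WORDS + r')\b')


def find_words(pattern, text):
    """Return every match of pattern in text (hypothetical helper)."""
    return [m.group() for m in pattern.finditer(text)]


sample = 'engineering an engrave tool to the end'
plain_hits = find_words(plain, sample)
print(plain_hits)

if atomic is not None:
    print(find_words(atomic, sample))
    # Once the atomic group has accepted "a", it may not go back and try "ab".
    print(re.search(r'(?:a|ab)c', 'abc'))   # matches 'abc'
    print(re.search(r'(?>a|ab)c', 'abc'))   # None
```

The ordering of alternatives inside an atomic group therefore matters: an earlier alternative that is a prefix of a later one can hide the later one entirely.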
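Practice code
import re

# Non-capturing group: the digit is part of the match.
look_ahead = re.compile(r'python(?:2|3)')
# Positive lookahead: "python" only when followed by 2; the 2 is not consumed.
look_ahead_pattern = re.compile(r'python(?=2)')
# Negative lookahead: "python" only when not followed by 2.
look_ahead_not_pattern = re.compile(r'python(?!2)')

text = 'pythonic python2 python3'

def print_info(re_obj, text=text):
    """Print every match of re_obj in text with its start and end offsets."""
    for match in re_obj.finditer(text):
        print(match.group(), end=' ')
        print('start is %d, end is %d' % (match.start(), match.end()))
    print()

print_info(look_ahead)
print_info(look_ahead_pattern)
print_info(look_ahead_not_pattern)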
Output
python2 start is 9, end is 16
python3 start is 17, end is 24

python start is 9, end is 15

python start is 0, end is 6
python start is 17, end is 23