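1.3.9 Self-Referencing Expressions
Matched values can also be used in later parts of an expression. For example, the email example from earlier can be updated to match only addresses composed of the person's first and last names by including backreferences to those groups. The easiest way to do this is to refer to a previously matched group by its ID number, using \num.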
import re
address = re.compile(
r'''
# The regular name
(\w+) # First name
\s+
(([\w.]+)\s+)? # Optional middle name or initial
(\w+) # Last name
\s+
<
# The address: first_name.last_name@domain.tld
(?P<email>
\1 # First name
\.
\4 # Last name
@
([\w\d.]+\.)+ # Domain name prefix
(com|org|edu) # Limit the allowed top-level domains.
)
>
''',
re.VERBOSE | re.IGNORECASE)
candidates = [
u'First Last <first.last@example.com>',
u'Different Name <first.last@example.com>',
u'First Middle Last <first.last@example.com>',
u'First M. Last <first.last@example.com>',
]
for candidate in candidates:
    print('Candidate:', candidate)
    match = address.search(candidate)
    if match:
        print(' Match name:', match.group(1), match.group(4))
        print(' Match email:', match.group(5))
    else:
        print(' No match')
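Although the syntax is simple, creating backreferences by numeric ID has a few drawbacks. From a practical standpoint, the groups must be renumbered whenever the expression changes, and every reference may need to be updated. Another drawback is that the standard backreference syntax \n can create only 99 references, because a three-digit ID is interpreted as an octal character value rather than a group reference. Of course, if an expression has more than 99 groups, being unable to refer to all of them is the smaller problem: there are more serious maintenance issues.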
Output:
Candidate: First Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: Different Name <first.last@example.com>
 No match
Candidate: First Middle Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: First M. Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
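The renumbering drawback of numeric backreferences can be seen in a much smaller case. The following is a minimal sketch, not part of the email example: the doubled-word pattern and the optional "label:" prefix group are hypothetical and exist only to show how adding a group shifts the numbering.

```python
import re

text = 'this is is a test'

# \1 must repeat whatever group 1 captured, so this finds a doubled word.
doubled = re.compile(r'\b(\w+)\s+\1\b')
print(doubled.search(text).group(1))

# Hypothetical change: an optional "label:" group is added in front.
# The word is now group 2, but the stale \1 still points at group 1,
# which never takes part in the match, so the search fails.
stale = re.compile(r'(\w+:\s*)?\b(\w+)\s+\1\b')
print(stale.search(text))

# The reference has to be renumbered by hand to restore the behavior.
fixed = re.compile(r'(\w+:\s*)?\b(\w+)\s+\2\b')
print(fixed.search(text).group(2))
```

Every numeric reference after the inserted group needs the same manual update, which is the maintenance cost described above.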
Python's expression parser includes an extension that uses (?P=name) to refer to the value of a named group matched earlier in the expression.
import re
address = re.compile(
r'''
# The regular name
(?P<first_name>\w+)
\s+
(([\w.]+)\s+)? # Optional middle name or initial
(?P<last_name>\w+)
\s+
<
# The address: first_name.last_name@domain.tld
(?P<email>
(?P=first_name)
\.
(?P=last_name)
@
([\w\d.]+\.)+ # Domain name prefix
(com|org|edu) # Limit the allowed top-level domains.
)
>
''',
re.VERBOSE | re.IGNORECASE)
candidates = [
u'First Last <first.last@example.com>',
u'Different Name <first.last@example.com>',
u'First Middle Last <first.last@example.com>',
u'First M. Last <first.last@example.com>',
]
for candidate in candidates:
    print('Candidate:', candidate)
    match = address.search(candidate)
    if match:
        print(' Match name:', match.groupdict()['first_name'], end=' ')
        print(match.groupdict()['last_name'])
        print(' Match email:', match.groupdict()['email'])
    else:
        print(' No match')
The address expression is compiled with the IGNORECASE flag turned on, because proper names are normally capitalized while email addresses usually are not.
Output:
Candidate: First Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: Different Name <first.last@example.com>
 No match
Candidate: First Middle Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: First M. Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
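The effect of IGNORECASE on a named backreference can be isolated in a short sketch. The pattern and the sample text here are hypothetical and much simpler than the address expression; they only show that (?P=name) compares case-insensitively when the flag is set.

```python
import re

# Hypothetical pattern: a word followed by the same word again.
pattern = r'(?P<word>\w+)\s+(?P=word)'
text = 'First first'

# Without IGNORECASE the backreference must repeat the exact characters.
strict = re.compile(pattern)
print(strict.search(text))

# With IGNORECASE, "first" is accepted as a repeat of "First".
relaxed = re.compile(pattern, re.IGNORECASE)
print(relaxed.search(text).group('word'))
```

This is why "First Last" can be matched against "first.last" in the address example.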
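Another mechanism for using backreferences in an expression is to choose a different pattern depending on whether an earlier group matched. The email pattern can be corrected so that the angle brackets are required if a name is present and not required if the email address stands alone. The syntax for testing whether a group has matched is (?(id)yes-expression|no-expression), where id is the group name or number, yes-expression is the pattern used if the group has a value, and no-expression is the pattern used otherwise.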
import re
address = re.compile(
r'''
^
# A name is made up of letters, and may include "."
# for title abbreviations and middle initials.
(?P<name>
([\w.]+\s+)*[\w.]+
)?
\s*
# Email addresses are wrapped in angle brackets, but
# only if a name is found.
(?(name)
# Remainder wrapped in angle brackets because
# there is a name
(?P<brackets>(?=(<.*>$)))
|
# Remainder does not include angle brackets without name
(?=([^<].*[^>]$))
)
# Look for a bracket only if the look-ahead assertion
# found both of them.
(?(brackets)<|\s*)
# The address itself: username@domain.tld
(?P<email>
[\w\d.+-]+ # Username
@
([\w\d.]+\.)+ # Domain name prefix
(com|org|edu) # Limit the allowed top-level domains.
)
# Look for a bracket only if the look-ahead assertion
# found both of them.
(?(brackets)>|\s*)
$
''',
re.VERBOSE)
candidates = [
u'First Last <first.last@example.com>',
u'No Brackets first.last@example.com',
u'Open Bracket <first.last@example.com',
u'Close Bracket first.last@example.com>',
u'no.brackets@example.com',
]
for candidate in candidates:
    print('Candidate:', candidate)
    match = address.search(candidate)
    if match:
        print(' Match name:', match.groupdict()['name'])
        print(' Match email:', match.groupdict()['email'])
    else:
        print(' No match')
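This version of the email address parser uses two tests. If the name group matches, the look-ahead assertion requires both angle brackets and sets up the brackets group. If name does not match, the assertion requires that the rest of the text not be surrounded by angle brackets. Later, if the brackets group is set, the actual pattern-matching code consumes the angle brackets in the input using literal patterns; otherwise, it consumes any whitespace.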
Output:
Candidate: First Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: No Brackets first.last@example.com
 No match
Candidate: Open Bracket <first.last@example.com
 No match
Candidate: Close Bracket first.last@example.com>
 No match
Candidate: no.brackets@example.com
 Match name: None
 Match email: no.brackets@example.com
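The conditional syntax is easier to follow in a stripped-down form. The sketch below is a simplified, hypothetical pattern with no name handling: group 1 captures an optional opening bracket, and the conditional then requires either a closing bracket or the end of the string.

```python
import re

# Group 1 is an optional "<". If it matched, ">" must follow the address;
# if it did not, the string must end right after the address.
bracketed = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

samples = [
    '<user@host.com>',
    'user@host.com',
    '<user@host.com',
    'user@host.com>',
]
results = {s: bool(bracketed.match(s)) for s in samples}
for sample, ok in results.items():
    print(sample, '->', 'match' if ok else 'no match')
```

Only the first two samples match. The full parser above adds look-ahead assertions so the decision can depend on the name group instead of on the bracket itself.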