1.3.7.4 Verbose Expression Syntax
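The compact format of regular expression syntax can become a hindrance as expressions grow more complicated. As the number of groups in an expression increases, it takes more work to keep track of why each element is needed and how exactly the parts of the expression interact. Named groups help mitigate these issues, but a better solution is to use verbose mode expressions, which allow comments and extra whitespace to be embedded in the pattern. A pattern for validating email addresses illustrates how verbose mode makes regular expressions easier to work with. The first version recognizes addresses that end in one of three top-level domains: .com, .org, or .edu.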
import re
# Compact form: username, "@", dotted domain parts, then the top-level domain.
# A raw string keeps the backslashes in \w, \d, and \. intact.
address = re.compile(r'[\w\d.+-]+@([\w\d.]+\.)+(com|org|edu)')
candidates = [
    u'first.last@example.com',
    u'first.last+category@gmail.com',
    u'valid-address@mail.example.com',
    u'not-valid@example.foo',
]
for candidate in candidates:
    match = address.search(candidate)
    print('{:<30} {}'.format(
        candidate, 'Matches' if match else 'No match'))
This expression is already complex. It contains several character classes, groups, and repetition expressions.
Output:
first.last@example.com         Matches
first.last+category@gmail.com  Matches
valid-address@mail.example.com Matches
not-valid@example.foo          No match
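To make the difficulty concrete, the following sketch inspects what the numbered groups of the compact pattern capture for one of the candidates. The pattern is the one shown above; only the inspection code is added for illustration. Nothing in the result says which group is the domain prefix and which is the top-level domain, so the reader has to work that out from the pattern itself.

```python
import re

# Same compact pattern as above, written as a raw string.
address = re.compile(r'[\w\d.+-]+@([\w\d.]+\.)+(com|org|edu)')

match = address.search('valid-address@mail.example.com')

# group(0) is the whole match; groups() returns the numbered groups.
print(match.group(0))   # valid-address@mail.example.com
print(match.groups())   # ('mail.example.', 'com')
```

The first group holds the text matched by the last repetition of the domain prefix, and the second holds the top-level domain.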
Converting the expression to a more verbose format makes it easier to extend.
import re
# re.VERBOSE ignores the whitespace and the "#" comments inside the pattern.
address = re.compile(
    r'''
    [\w\d.+-]+       # Username
    @
    ([\w\d.]+\.)+    # Domain name prefix
    (com|org|edu)    # TODO: support more top-level domains
    ''',
    re.VERBOSE)
candidates = [
    u'first.last@example.com',
    u'first.last+category@gmail.com',
    u'valid-address@mail.example.com',
    u'not-valid@example.foo',
]
for candidate in candidates:
    match = address.search(candidate)
    print('{:<30} {}'.format(
        candidate, 'Matches' if match else 'No match'))
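The expression matches the same inputs, but in this extended format it is easier to read. The comments also help identify the different parts of the pattern so that it can be extended to match more inputs.
Output: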
first.last@example.com         Matches
first.last+category@gmail.com  Matches
valid-address@mail.example.com Matches
not-valid@example.foo          No match
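One consequence of verbose mode is worth a small illustration: because whitespace and "#" comments are ignored, a literal space or "#" in the pattern has to be escaped or placed in a character class. The sketch below uses made-up patterns and input strings (not part of the email example) to show the difference.

```python
import re

# In verbose mode the space between the words is ignored,
# so this pattern is equivalent to 'FirstLast'.
ignored = re.compile(r'First Last', re.VERBOSE)

# Escaping the space keeps it significant.
escaped = re.compile(r'First\ Last', re.VERBOSE)

# Inside a character class, "#" is literal; the trailing "#" starts a comment.
hash_pattern = re.compile(r'[#]\d+   # a literal hash, then digits', re.VERBOSE)

print(bool(ignored.search('First Last')))        # False
print(bool(ignored.search('FirstLast')))         # True
print(bool(escaped.search('First Last')))        # True
print(hash_pattern.search('ticket #42').group(0))  # #42
```

Using \s or [ ] instead of the escaped space works the same way and is often easier to spot when reading the pattern.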
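This expanded version parses inputs that include a person's name and email address, as might appear in an email header. The name comes first, followed by the email address enclosed in angle brackets (< and >).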
import re
address = re.compile(
    r'''
    # A name is made up of letters, and may include "."
    # for title abbreviations and middle initials.
    ((?P<name>
       ([\w.,]+\s+)*[\w.,]+)
       \s*
       # Email addresses are wrapped in angle
       # brackets < >, but only if a name is
       # found, so keep the start bracket in this
       # group.
       <
    )? # The entire name is optional.
    # The address itself: username@domain.tld
    (?P<email>
      [\w\d.+-]+       # Username
      @
      ([\w\d.]+\.)+    # Domain name prefix
      (com|org|edu)    # Limit the allowed top-level domains.
    )
    >? # Optional closing angle bracket.
    ''',
    re.VERBOSE)
candidates = [
    u'first.last@example.com',
    u'first.last+category@gmail.com',
    u'valid-address@mail.example.com',
    u'not-valid@example.foo',
    u'First Last <first.last@example.com>',
    u'No Brackets first.last@example.com',
    u'First Last',
    u'First Middle Last <first.last@example.com>',
    u'First M. Last <first.last@example.com>',
    u'<first.last@example.com>',
]
for candidate in candidates:
    print('Candidate:', candidate)
    match = address.search(candidate)
    if match:
        print(' Name:', match.groupdict()['name'])
        print(' Email:', match.groupdict()['email'])
    else:
        print(' No match')
As with other programming languages, the ability to insert comments into verbose regular expressions helps with their maintainability. This final version includes implementation notes for future maintainers, along with whitespace that separates the groups from each other and highlights their nesting level.
Output:
Candidate: first.last@example.com
 Name: None
 Email: first.last@example.com
Candidate: first.last+category@gmail.com
 Name: None
 Email: first.last+category@gmail.com
Candidate: valid-address@mail.example.com
 Name: None
 Email: valid-address@mail.example.com
Candidate: not-valid@example.foo
 No match
Candidate: First Last <first.last@example.com>
 Name: First Last
 Email: first.last@example.com
Candidate: No Brackets first.last@example.com
 Name: None
 Email: first.last@example.com
Candidate: First Last
 No match
Candidate: First Middle Last <first.last@example.com>
 Name: First Middle Last
 Email: first.last@example.com
Candidate: First M. Last <first.last@example.com>
 Name: First M. Last
 Email: first.last@example.com
Candidate: <first.last@example.com>
 Name: None
 Email: first.last@example.com
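As a usage sketch, the name-and-address pattern can be wrapped in a small helper that returns the two named groups directly. The function name parse_address is hypothetical and not part of the original example; the pattern repeats the one above so the block stands on its own.

```python
import re

address = re.compile(
    r'''
    # Optional display name, followed by the opening bracket.
    ((?P<name>
       ([\w.,]+\s+)*[\w.,]+)
       \s*
       <
    )?

    # The address itself: username@domain.tld
    (?P<email>
      [\w\d.+-]+       # Username
      @
      ([\w\d.]+\.)+    # Domain name prefix
      (com|org|edu)    # Allowed top-level domains
    )

    >?                 # Optional closing angle bracket
    ''',
    re.VERBOSE)


def parse_address(text):
    """Return (name, email) for the first address found, or None.

    Hypothetical helper for illustration only.
    """
    match = address.search(text)
    if match is None:
        return None
    # Named groups can be read by name instead of by position.
    return match.group('name'), match.group('email')


print(parse_address('First M. Last <first.last@example.com>'))
print(parse_address('first.last@example.com'))
print(parse_address('First Last'))
```

Reading the groups by name keeps the calling code stable if more groups are later added to the pattern, which is one reason named groups and verbose mode combine well.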