Chapter 1: Text - re: Regular Expressions - Search Options (4)

学习中的编程老菜鸟

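1.3.7.4 Verbose Expression Syntax
As expressions grow more complex, the compact regular expression format can become a hindrance. As the number of groups increases, it takes more work to keep track of why each element is needed and how the parts of the expression interact. Named groups help mitigate these issues, but a better solution is verbose mode, which allows comments and extra whitespace to be embedded in the pattern. A pattern for validating email addresses illustrates how verbose mode makes regular expressions easier to work with. The first version recognizes addresses that end in one of three top-level domains: .com, .org, or .edu.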

import re

# Use a raw string so the backslashes reach the regex engine unchanged.
address = re.compile(r'[\w\d.+-]+@([\w\d.]+\.)+(com|org|edu)')
candidates = [
    u'first.last@example.com',
    u'first.last+category@gmail.com',
    u'valid-address@mail.example.com',
    u'not-valid@example.foo',
    ]

for candidate in candidates:
    match = address.search(candidate)
    print('{:<30} {}'.format(
        candidate, 'Matches' if match else 'No match'))

This expression is already complex: it contains several character classes, groups, and repetition expressions.
Output:

first.last@example.com         Matches
first.last+category@gmail.com  Matches
valid-address@mail.example.com Matches
not-valid@example.foo          No match

Converting the expression to a more verbose format makes it easier to extend.

import re

address = re.compile(
    r'''
    [\w\d.+-]+      # Username
    @
    ([\w\d.]+\.)+   # Domain name prefix
    (com|org|edu)   # TODO: support more top-level domains
    ''',
    re.VERBOSE)

candidates = [
    u'first.last@example.com',
    u'first.last+category@gmail.com',
    u'valid-address@mail.example.com',
    u'not-valid@example.foo',
    ]

for candidate in candidates:
    match = address.search(candidate)
    print('{:<30} {}'.format(
        candidate, 'Matches' if match else 'No match'))

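This expression matches the same inputs, but the expanded format is easier to read. The comments also help identify the different parts of the pattern, so that it can be extended to match more inputs.
Output: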

first.last@example.com         Matches
first.last+category@gmail.com  Matches
valid-address@mail.example.com Matches
not-valid@example.foo          No match
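
One practical consequence of verbose mode is worth checking by hand: whitespace in the pattern is ignored unless it is escaped or placed inside a character class. The following is a minimal sketch, not part of the original example; the names compact, verbose, loose, and strict are illustrative. It first confirms that the compact and verbose forms agree on a few sample inputs, then shows what happens to a literal space.

```python
import re

# Compact and verbose spellings of the same pattern from this section.
compact = re.compile(r'[\w\d.+-]+@([\w\d.]+\.)+(com|org|edu)')
verbose = re.compile(
    r'''
    [\w\d.+-]+      # Username
    @
    ([\w\d.]+\.)+   # Domain name prefix
    (com|org|edu)   # Allowed top-level domains
    ''',
    re.VERBOSE)

samples = [
    'first.last@example.com',
    'valid-address@mail.example.com',
    'not-valid@example.foo',
]

# Both forms should agree on every sample.
agree = all(
    bool(compact.search(s)) == bool(verbose.search(s))
    for s in samples
)
print('Same results:', agree)

# Caveat: in verbose mode an unescaped space in the pattern is ignored.
loose = re.compile(r'First Last', re.VERBOSE)     # behaves like 'FirstLast'
strict = re.compile(r'First[ ]Last', re.VERBOSE)  # [ ] keeps the space

print(bool(loose.search('First Last')))   # False
print(bool(loose.search('FirstLast')))    # True
print(bool(strict.search('First Last')))  # True
```

The same rule applies to #, which starts a comment in verbose mode unless it is escaped or appears inside a character class.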

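This expanded version parses input that includes a person's name and an email address, as might appear in an email header. The name comes first, followed by the email address enclosed in angle brackets (< and >).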

import re

address = re.compile(
    r'''
    # A name is made up of letters, and may include "."
    # for title abbreviations and middle initials.
    ((?P<name>
        ([\w.,]+\s+)*[\w.,]+)
        \s*
        # Email addresses are wrapped in angle
        # brackets <>, but only if a name is
        # found, so keep the start bracket in this
        # group.
        <
    )? # The entire name is optional.

    # The address itself: username@domain.tld
    (?P<email>
    [\w\d.+-]+   # Username
    @
    ([\w\d.]+\.)+    # Domain name prefix
    (com|org|edu)    # Limit the allowed top-level domains.
    )

    >? # Optional closing angle bracket.
    ''',
    re.VERBOSE)

candidates = [
    u'first.last@example.com',
    u'first.last+category@gmail.com',
    u'valid-address@mail.example.com',
    u'not-valid@example.foo',
    u'First Last <first.last@example.com>',
    u'No Brackets first.last@example.com',
    u'First Last',
    u'First Middle Last <first.last@example.com>',
    u'First M. Last <first.last@example.com>',
    u'<first.last@example.com>',
    ]

for candidate in candidates:
    print('Candidate:', candidate)
    match = address.search(candidate)
    if match:
        print(' Name:', match.groupdict()['name'])
        print(' Email:', match.groupdict()['email'])
    else:
        print(' No match')

As with other programming languages, the ability to insert comments into verbose regular expressions helps with their maintainability. This final version includes implementation notes for future maintainers, plus whitespace that separates the groups from each other and highlights their nesting level.
Output:

Candidate: first.last@example.com
 Name: None
 Email: first.last@example.com
Candidate: first.last+category@gmail.com
 Name: None
 Email: first.last+category@gmail.com
Candidate: valid-address@mail.example.com
 Name: None
 Email: valid-address@mail.example.com
Candidate: not-valid@example.foo
 No match
Candidate: First Last <first.last@example.com>
 Name: First Last
 Email: first.last@example.com
Candidate: No Brackets first.last@example.com
 Name: None
 Email: first.last@example.com
Candidate: First Last
 No match
Candidate: First Middle Last <first.last@example.com>
 Name: First Middle Last
 Email: first.last@example.com
Candidate: First M. Last <first.last@example.com>
 Name: First M. Last
 Email: first.last@example.com
Candidate: <first.last@example.com>
 Name: None
 Email: first.last@example.com
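
To reuse the pattern elsewhere, it could be wrapped in a small helper. The sketch below is an assumption rather than part of the original example: the function name parse_address and the constant ADDRESS are hypothetical, and the pattern is the same one shown above with shortened comments. Because it uses search(), it returns the first address found anywhere in the text, not a strict validation of the whole string.

```python
import re

# Same pattern as in the section above; ADDRESS is an illustrative name.
ADDRESS = re.compile(
    r'''
    ((?P<name>
        ([\w.,]+\s+)*[\w.,]+)
        \s*
        <
    )?                       # The entire name is optional.
    (?P<email>
        [\w\d.+-]+           # Username
        @
        ([\w\d.]+\.)+        # Domain name prefix
        (com|org|edu)        # Allowed top-level domains
    )
    >?                       # Optional closing angle bracket.
    ''',
    re.VERBOSE)


def parse_address(text):
    """Return (name, email) for the first address found, or None."""
    match = ADDRESS.search(text)
    if match is None:
        return None
    # group('name') is None when the optional name part did not match.
    return match.group('name'), match.group('email')


print(parse_address('First M. Last <first.last@example.com>'))
print(parse_address('first.last@example.com'))
print(parse_address('First Last'))
```

Returning None for inputs without an address keeps the caller's check as simple as the if match test in the loop above.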