Chapter 1: Text - re: Regular Expressions - Self-Referencing Expressions

Latest recommended article published on 2022-06-01 22:22:17

学习中的编程老菜鸟

Category: Python Standard Library


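1.3.9 Self-Referencing Expressions
Matched values can also be used in later parts of an expression. For example, the earlier email example can be updated to match only addresses composed of the person's first and last names by including back-references to those groups. The easiest way to do this is to refer to a previously matched group by its ID number with \num.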

import re

address = re.compile(
    r'''
    # The regular name
    (\w+)      # First name
    \s+
    (([\w.]+)\s+)?   # Optional middle name or initial
    (\w+)            # Last name

    \s+

    <

    # The address: first_name.last_name@domain.tld
    (?P<email>
    \1          # First name
    \.
    \4          # Last name
    @
    ([\w\d.]+\.)+    # Domain name prefix
    (com|org|edu)    # Limit the allowed top-level domains.
    )

    >
    ''',
    re.VERBOSE | re.IGNORECASE)

candidates = [
    u'First Last <first.last@example.com>',
    u'Different Name <first.last@example.com>',
    u'First Middle Last <first.last@example.com>',
    u'First M. Last <first.last@example.com>',
    ]

for candidate in candidates:
    print('Candidate:',candidate)
    match = address.search(candidate)
    if match:
        print(' Match name:',match.group(1),match.group(4))
        print(' Match email:',match.group(5))
    else:
        print(' No match')

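Although the syntax is simple, creating back-references by numeric ID has a few drawbacks. From a practical standpoint, the groups must be renumbered whenever the expression changes, and every reference may need to be updated. Another drawback is that only 99 references can be made with the standard back-reference syntax \n, because a three-digit ID is interpreted as an octal character value rather than a group reference. Of course, if an expression has more than 99 groups, being unable to refer to all of them is the lesser problem; it points to more serious maintenance issues.
Output: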

Candidate: First Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: Different Name <first.last@example.com>
 No match
Candidate: First Middle Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: First M. Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
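
As a smaller illustration of numbered back-references, the sketch below uses a hypothetical doubled-word pattern and helper name (`doubled`, `find_doubled_words`) that are not part of the email example. It also checks the octal caveat mentioned earlier: an escape made of three octal digits is read as a character code, not a group number.

```python
import re

# Hypothetical pattern: a word followed by the same word again.
# \1 refers back to whatever text group 1 actually matched.
doubled = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)


def find_doubled_words(text):
    """Return each word that appears twice in a row in text."""
    return [m.group(1) for m in doubled.finditer(text)]


print(find_doubled_words('This is is a test of the the parser'))

# Three octal digits are treated as a character code, not a group
# reference: \101 is octal for 65, the letter "A".
octal = re.compile(r'\101')
print(bool(octal.fullmatch('A')))
```

The pattern compiles even though it defines no group 101, which is why this form cannot be used to reach groups beyond 99.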

Python's expression parser includes an extension that uses (?P=name) to refer to the value of a named group matched earlier in the expression.

import re

address = re.compile(
    r'''

    # The regular name
    (?P<first_name>\w+)
    \s+
    (([\w.]+)\s+)?      # Optional middle name or initial
    (?P<last_name>\w+)

    \s+

    <

    # The address: first_name.last_name@domain.tld
    (?P<email>
     (?P=first_name)
     \.
     (?P=last_name)
     @
     ([\w\d.]+\.)+   # Domain name prefix
     (com|org|edu)   # Limit the allowed top-level domains.
    )

    >
    ''',
    re.VERBOSE | re.IGNORECASE)

candidates = [
    u'First Last <first.last@example.com>',
    u'Different Name <first.last@example.com>',
    u'First Middle Last <first.last@example.com>',
    u'First M. Last <first.last@example.com>',
    ]

for candidate in candidates:
    print('Candidate:',candidate)
    match = address.search(candidate)
    if match:
        print(' Match name:',match.groupdict()['first_name'],end=' ')
        print(match.groupdict()['last_name'])
        print(' Match email:',match.groupdict()['email'])
    else:
        print(' No match')

The address expression is compiled with the IGNORECASE flag on, because proper names are normally capitalized while email addresses usually are not.
Output:

Candidate: First Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: Different Name <first.last@example.com>
 No match
Candidate: First Middle Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: First M. Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
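
A shorter sketch of the same (?P=name) mechanism may help isolate it from the rest of the email pattern. The quoted-string pattern and the names `quoted`, `extract_quoted`, and `repeated` are assumptions made up for this illustration. The last two lines show why the IGNORECASE flag matters: a back-reference compares text exactly unless the flag is set.

```python
import re

# Hypothetical pattern: a quoted string whose closing quote must be the
# same character that opened it.
quoted = re.compile(
    r'''
    (?P<quote>['"])   # Opening quote, remembered by name
    (?P<body>.*?)     # Shortest possible body
    (?P=quote)        # Same character the quote group matched
    ''',
    re.VERBOSE)


def extract_quoted(text):
    """Return the body of every consistently quoted string in text."""
    return [m.group('body') for m in quoted.finditer(text)]


print(extract_quoted('say "it\'s fine" or \'a "b" c\''))

# With IGNORECASE the back-reference also ignores case, so "First"
# and "first" count as the same word.
repeated = re.compile(r'(?P<word>\w+)-(?P=word)', re.IGNORECASE)
print(bool(repeated.fullmatch('First-first')))
```

Because the reference is by name, adding or removing groups elsewhere in the pattern does not require renumbering anything.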

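The other mechanism for using back-references in expressions chooses a different pattern based on whether a previous group matched. The email pattern can be corrected so that the angle brackets are required if a name is present, but not required if the email address stands alone. The syntax for testing whether a group has matched is (?(id)yes-expression|no-expression), where id is the group name or number, yes-expression is the pattern to use if the group has a value, and no-expression is the pattern to use otherwise.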

import re

address = re.compile(
    r'''
    ^

    # A name is made up of letters, and may include "."
    # for title abbreviations and middle initials.
    (?P<name>
       ([\w.]+\s+)*[\w.]+
    )?
    \s*

    # Email addresses are wrapped in angle brackets, but
    # only if a name is found.
    (?(name)
     # Remainder wrapped in angle brackets because
     # there is a name
     (?P<brackets>(?=(<.*>$)))
     |
     # Remainder does not include angle brackets without name
     (?=([^<].*[^>]$))
    )

    # Look for a bracket only if the look-ahead assertion
    # found both of them.
    (?(brackets)<|\s*)

    # The address itself: username@domain.tld
    (?P<email>
     [\w\d.+-]+    # Username
     @
     ([\w\d.]+\.)+   # Domain name prefix
     (com|org|edu)   # Limit the allowed top-level domains.
    )

    # Look for a bracket only if the look-ahead assertion
    # found both of them.
    (?(brackets)>|\s*)

    $
    ''',
    re.VERBOSE)

candidates = [
    u'First Last <first.last@example.com>',
    u'No Brackets first.last@example.com',
    u'Open Bracket <first.last@example.com',
    u'Close Bracket first.last@example.com>',
    u'no.brackets@example.com',
    ]

for candidate in candidates:
    print('Candidate:',candidate)
    match = address.search(candidate)
    if match:
        print(' Match name:',match.groupdict()['name'])
        print(' Match email:',match.groupdict()['email'])
    else:
        print(' No match')

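This version of the email address parser uses two tests. If the name group matches, then the look-ahead assertion requires both angle brackets and sets up the brackets group. If name does not match, the assertion requires that the rest of the text not be wrapped in angle brackets. Later, if the brackets group is set, the actual pattern-matching code consumes the brackets in the input using literal patterns; otherwise, it consumes any blank space.
Output: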

Candidate: First Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: No Brackets first.last@example.com
 No match
Candidate: Open Bracket <first.last@example.com
 No match
Candidate: Close Bracket first.last@example.com>
 No match
Candidate: no.brackets@example.com
 Match name: None
 Match email: no.brackets@example.com
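
The conditional syntax can be easier to follow in a stripped-down form. The sketch below is a deliberately simplified pattern, not the parser above: the names `bracketed` and `parse_address` are hypothetical, the address part is intentionally loose, and the no-expression branch is omitted, which the syntax allows.

```python
import re

# Simplified pattern: the closing ">" is required only when group 1
# (the optional opening "<") took part in the match.
bracketed = re.compile(r'^(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)$')


def parse_address(text):
    """Return the bare address, or None if the brackets are unbalanced."""
    match = bracketed.match(text)
    return match.group(2) if match else None


for candidate in ['<user@host.com>', 'user@host.com',
                  '<user@host.com', 'user@host.com>']:
    print(candidate, '->', parse_address(candidate))
```

Only the first two candidates produce an address; the two with a single bracket are rejected because the conditional ties the closing bracket to the opening one.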
