Advanced Regular Expression Techniques (Python Edition)

Latest recommended article published on 2024-04-08 10:08:40

linuxchyu


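Regular expressions are the Swiss Army knife for searching information for specific patterns. They are a huge toolbox, and some of their features are often overlooked or underused. Today I will show you some advanced uses of regular expressions.

For example, here is a regular expression we might use to detect US phone numbers: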

 
    r'^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$'

We can add comments and whitespace to make it more readable.

 
    r'^'
    r'(1[-\s.])?'   # optional '1-', '1.' or '1 '
    r'(\()?'        # optional opening parenthesis
    r'\d{3}'        # the area code
    r'(?(2)\))'     # if there was an opening parenthesis, close it
    r'[-\s.]?'      # optionally followed by '-', '.' or whitespace
    r'\d{3}'        # first 3 digits
    r'[-\s.]?'      # optionally followed by '-', '.' or whitespace
    r'\d{4}$'       # last 4 digits
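Python's re module also offers the re.VERBOSE flag, which ignores whitespace in the pattern and allows # comments inside it, so the annotated pattern can live in a single string. The following is only a sketch of that alternative; the names PHONE and is_valid are illustrative and do not come from the original article.

```python
import re

# Same pattern as above, written with re.VERBOSE so that whitespace and
# "#" comments inside the pattern string are ignored by the regex engine.
PHONE = re.compile(r"""
    ^
    (1[-\s.])?      # optional '1-', '1.' or '1 ' prefix
    (\()?           # optional opening parenthesis
    \d{3}           # area code
    (?(2)\))        # close the parenthesis only if one was opened
    [-\s.]?         # optional separator
    \d{3}           # first three digits
    [-\s.]?         # optional separator
    \d{4}           # last four digits
    $
""", re.VERBOSE)


def is_valid(number):
    """Return True if number looks like a US phone number (hypothetical helper)."""
    return PHONE.match(number) is not None


for candidate in ["123 555 6789", "(123-555-6789"]:
    print(candidate, is_valid(candidate))
```

Compiling the pattern once and reusing it also avoids rebuilding the pattern string on every loop iteration.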

Let's put it into a code snippet:

 
    import re

    numbers = ["123 555 6789",
               "1-(123)-555-6789",
               "(123-555-6789",
               "(123).555.6789",
               "123 55 6789"]

    for number in numbers:
        pattern = re.match(r'^'
                           r'(1[-\s.])?'   # optional '1-', '1.' or '1 '
                           r'(\()?'        # optional opening parenthesis
                           r'\d{3}'        # the area code
                           r'(?(2)\))'     # if there was an opening parenthesis, close it
                           r'[-\s.]?'      # optionally followed by '-', '.' or whitespace
                           r'\d{3}'        # first 3 digits
                           r'[-\s.]?'      # optionally followed by '-', '.' or whitespace
                           r'\d{4}$',      # last 4 digits
                           number)
        if pattern:
            print('{0} is valid'.format(number))
        else:
            print('{0} is not valid'.format(number))

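The output:

    123 555 6789 is valid
    1-(123)-555-6789 is valid
    (123-555-6789 is not valid
    (123).555.6789 is valid
    123 55 6789 is not valid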

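Regular expressions are a great feature of Python, but debugging them is hard, and it is easy to get a regular expression wrong.

Fortunately, Python can print the parse tree of a regular expression if you pass the re.DEBUG flag (which is just the integer 128) to re.compile or re.match.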

 
    import re

    numbers = ["123 555 6789",
               "1-(123)-555-6789",
               "(123-555-6789",
               "(123).555.6789",
               "123 55 6789"]

    for number in numbers:
        pattern = re.match(r'^'
                           r'(1[-\s.])?'   # optional '1-', '1.' or '1 '
                           r'(\()?'        # optional opening parenthesis
                           r'\d{3}'        # the area code
                           r'(?(2)\))'     # if there was an opening parenthesis, close it
                           r'[-\s.]?'      # optionally followed by '-', '.' or whitespace
                           r'\d{3}'        # first 3 digits
                           r'[-\s.]?'      # optionally followed by '-', '.' or whitespace
                           r'\d{4}$',      # last 4 digits
                           number, re.DEBUG)
        if pattern:
            print('{0} is valid'.format(number))
        else:
            print('{0} is not valid'.format(number))

Parse tree (followed by the program's output):

    at at_beginning
    max_repeat 0 1
      subpattern 1
        literal 49
        in
          literal 45
          category category_space
          literal 46
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 0 1
      subpattern 2
        literal 40
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 3 3
      in
        category category_digit
    max_repeat 0 2147483648
      in
        category category_space
    subpattern None
      groupref_exists 2
        literal 41
        None
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 0 1
      in
        literal 45
        category category_space
        literal 46
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 3 3
      in
        category category_digit
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 0 1
      in
        literal 45
        category category_space
        literal 46
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 4 4
      in
        category category_digit
    at at_end
    max_repeat 0 2147483648
      in
        category category_space
    123 555 6789 is valid
    1-(123)-555-6789 is valid
    (123-555-6789 is not valid
    (123).555.6789 is valid
    123 55 6789 is not valid

Greedy versus non-greedy

Before I explain this concept, I would like to show an example first. We want to find anchor tags in a piece of HTML text:

 
    import re

    html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>'

    m = re.findall(r'<a.*>.*</a>', html)
    if m:
        print(m)

The result is as expected:

    ['<a href="http://pypix.com" title="pypix">Pypix</a>']

Let's change the input and add a second anchor tag:

    import re

    html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
            'Hello <a href="http://example.com" title="example">Example</a>')

    m = re.findall(r'<a.*>.*</a>', html)
    if m:
        print(m)

At first glance the result looks right again. But don't be fooled! Once two anchor tags appear on the same line, the pattern no longer works correctly:

    ['<a href="http://pypix.com" title="pypix">Pypix</a>Hello <a href="http://example.com" title="example">Example</a>']

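This time the pattern matched the first opening tag, the last closing tag, and everything in between, producing one match instead of two separate ones. This is because the default matching mode is "greedy".

In greedy mode, quantifiers (such as * and +) match as many characters as possible.

When you add a question mark after the quantifier (.*?), it becomes "non-greedy" and matches as few characters as possible.

    import re

    html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
            'Hello <a href="http://example.com" title="example">Example</a>')

    m = re.findall(r'<a.*?>.*?</a>', html)
    if m:
        print(m)

Now the result is correct:

    ['<a href="http://pypix.com" title="pypix">Pypix</a>', '<a href="http://example.com" title="example">Example</a>']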
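To see the difference in isolation, here is a small side-by-side comparison on a made-up string. It also shows a negated character class ([^<]*), a commonly used alternative to the non-greedy quantifier when the text between the tags cannot itself contain a tag. The sample text and variable names are illustrative only.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the end of the string, then backtracks to the LAST </b>.
greedy = re.findall(r"<b>.*</b>", text)

# Non-greedy: .*? stops at the FIRST </b> it can reach.
lazy = re.findall(r"<b>.*?</b>", text)

# Negated class: [^<]* can never cross a "<", so it cannot run past a tag.
negated = re.findall(r"<b>[^<]*</b>", text)

print(greedy)   # ['<b>one</b> and <b>two</b>']
print(lazy)     # ['<b>one</b>', '<b>two</b>']
print(negated)  # ['<b>one</b>', '<b>two</b>']
```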

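Lookahead and lookbehind assertions

A lookahead assertion checks what follows the current match position without making it part of the match. This is easier to explain with an example.

The following pattern first matches foo, then checks whether it is followed by bar: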

 
    import re

    strings = ["hello foo",       # returns False
               "hello foobar"]    # returns True

    for string in strings:
        pattern = re.search(r'foo(?=bar)', string)
        if pattern:
            print('True')
        else:
            print('False')

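This may not seem very useful, since we could simply check for foobar directly. However, a lookahead can also be negative. The following example matches foo if and only if it is not followed by bar.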

 
    import re

    strings = ["hello foo",       # returns True
               "hello foobar",    # returns False
               "hello foobaz"]    # returns True

    for string in strings:
        pattern = re.search(r'foo(?!bar)', string)
        if pattern:
            print('True')
        else:
            print('False')
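What makes a lookahead different from simply writing foobar is that it is zero-width: the text it checks is not consumed and does not become part of the match. The short sketch below illustrates this with made-up input strings.

```python
import re

# The lookahead requires "bar" to follow, but only "foo" is matched.
m = re.search(r"foo(?=bar)", "hello foobar")
print(m.group())  # foo
print(m.span())   # (6, 9)

# Because the lookahead consumes nothing, a substitution touches only "foo".
replaced = re.sub(r"foo(?!bar)", "X", "foo foobar foobaz")
print(replaced)   # X foobar Xbaz
```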

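Lookbehind assertions work similarly, but they look at what precedes the current match position. Use (?<= for a positive lookbehind and (?<! for a negative one.

The following pattern matches a bar that is not preceded by foo.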

 
    import re

    strings = ["hello bar",       # returns True
               "hello foobar",    # returns False
               "hello bazbar"]    # returns True

    for string in strings:
        pattern = re.search(r'(?<!foo)bar', string)
        if pattern:
            print('True')
        else:
            print('False')
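For completeness, here is a sketch of the positive form, (?<=...), next to the negative one. Note that Python's re module requires the pattern inside a lookbehind to have a fixed width. The sample strings and variable names are invented for illustration.

```python
import re

text = "apple $15, pear 20, plum $7"

# Positive lookbehind: digits that are immediately preceded by a dollar sign.
dollar_amounts = re.findall(r"(?<=\$)\d+", text)
print(dollar_amounts)  # ['15', '7']

# Negative lookbehind: replace "bar" unless "foo" comes right before it.
no_foo = re.sub(r"(?<!foo)bar", "BAR", "foobar bazbar bar")
print(no_foo)  # foobar bazBAR BAR
```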

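Conditional (if-then-else) patterns

Regular expressions provide conditional tests. In Python the format is:

(?(id/name)then|else)

The condition is the number (or name) of a capturing group defined earlier in the pattern, and it is true if that group took part in the match. The else branch is optional.

For example, we can use this regular expression to check for matching opening and closing angle brackets: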

 
    import re

    strings = ["<pypix>",    # returns True
               "<foo",       # returns False
               "bar>",       # returns False
               "hello"]      # returns True

    for string in strings:
        pattern = re.search(r'^(<)?[a-z]+(?(1)>)$', string)
        if pattern:
            print('True')
        else:
            print('False')

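In the example above, 1 refers to the group (<), which may also match nothing because it is followed by a question mark. The closing angle bracket is required if and only if that group matched.

In some other regex engines, such as PCRE, the condition can also be a lookaround assertion, written (?(?=regex)then|else). Python's re module accepts only a group number or name.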
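The example above uses only the "then" branch. The sketch below adds an "else" branch and refers to the group by name instead of by number. The pattern is adapted from the e-mail example in the Python re documentation; the group name lt and the sample strings are illustrative.

```python
import re

# If the optional "<" was matched, require a closing ">";
# otherwise require the end of the string.
EMAIL = re.compile(r"(?P<lt><)?(\w+@\w+(?:\.\w+)+)(?(lt)>|$)")

samples = [
    "<user@host.com>",   # opening and closing bracket
    "user@host.com",     # no brackets at all
    "<user@host.com",    # missing closing bracket
    "user@host.com>",    # stray closing bracket
]

results = {s: bool(EMAIL.match(s)) for s in samples}
for sample, ok in results.items():
    print(sample, ok)
```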

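Non-capturing groups

Groups, enclosed in parentheses, are captured into a numbered array so they can be referenced later when needed. But we can also choose not to capture them.

Let's look at a very simple example first: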

 
   
    import re

    string = 'Hello foobar'

    pattern = re.search(r'(f.*)(b.*)', string)

    print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
    print("b* => {0}".format(pattern.group(2)))  # prints b* => bar
       
 
 

Now let's make a small change and add another group (H.*) in front:

 
   
    import re

    string = 'Hello foobar'

    pattern = re.search(r'(H.*)(f.*)(b.*)', string)

    print("f* => {0}".format(pattern.group(1)))  # prints f* => Hello  (group 1 is now 'Hello ')
    print("b* => {0}".format(pattern.group(2)))  # prints b* => foo  (group 2 is now 'foo')
       
 
 

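The array of groups has changed, and depending on how our code uses these values, the script may stop working correctly. We would now have to find every place in the code that uses the group array and adjust the indexes accordingly. If we are not really interested in the contents of the newly added group, we can make it non-capturing, like this: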

 
   
    import re

    string = 'Hello foobar'

    pattern = re.search(r'(?:H.*)(f.*)(b.*)', string)

    print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
    print("b* => {0}".format(pattern.group(2)))  # prints b* => bar
       
 
 

By adding ?: at the start of the group, we no longer capture it in the group array, so the other values in the array do not need to move.
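Non-capturing groups are especially handy when parentheses are needed only for grouping, for example around an alternation. The following sketch, with a made-up file name, compares the groups produced by the capturing and non-capturing forms.

```python
import re

# Parentheses around the alternation are required for grouping,
# but with ?: the extension is not stored as a group.
non_capturing = re.search(r"(\w+)\.(?:jpg|png|gif)$", "photo.png")
print(non_capturing.groups())  # ('photo',)

# Without ?: the extension becomes group 2.
capturing = re.search(r"(\w+)\.(jpg|png|gif)$", "photo.png")
print(capturing.groups())      # ('photo', 'png')
```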

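Named groups

Like the previous example, this is another way to keep us from falling into that trap. We can give groups names and then refer to them by name instead of by array index. The format is (?P<name>pattern). We can rewrite the previous example like this: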

 
   
    import re

    string = 'Hello foobar'

    pattern = re.search(r'(?P<fstar>f.*)(?P<bstar>b.*)', string)

    print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
    print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar
       
 
 

Now we can add another group without affecting how the existing groups are referenced:

 
   
    import re

    string = 'Hello foobar'

    pattern = re.search(r'(?P<hi>H.*)(?P<fstar>f.*)(?P<bstar>b.*)', string)

    print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
    print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar
    print("h* => {0}".format(pattern.group('hi')))     # prints h* => Hello
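Named groups can also be read all at once with groupdict() and referenced in a replacement string with \g<name>. The sketch below uses a made-up date pattern; the names DATE, parts and reordered are illustrative.

```python
import re

DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# groupdict() returns every named group as a dictionary.
m = DATE.match("2024-04-08")
parts = m.groupdict()
print(parts)  # {'year': '2024', 'month': '04', 'day': '08'}

# In a replacement string, \g<name> refers to a named group.
reordered = DATE.sub(r"\g<day>/\g<month>/\g<year>", "2024-04-08")
print(reordered)  # 08/04/2024
```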
       
 
 

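Using callback functions

In Python, re.sub() accepts a callback function as the replacement in a regular expression substitution.

Let's look at this example, an e-mail template: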

 
    import re

    template = ("Hello [first_name] [last_name], "
                "Thank you for purchasing [product_name] from [store_name]. "
                "The total cost of your purchase was [product_price] plus [ship_price] for shipping. "
                "You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. "
                "Sincerely, "
                "[store_manager_name]")

    # assume dic has all the replacement data
    # such as dic['first_name'] dic['product_price'] etc...
    dic = {
        "first_name": "John",
        "last_name": "Doe",
        "product_name": "iphone",
        "store_name": "Walkers",
        "product_price": "$500",
        "ship_price": "$10",
        "ship_days_min": "1",
        "ship_days_max": "5",
        "store_manager_name": "DoeJohn"
    }

    # non-greedy, so that each [placeholder] is matched on its own
    result = re.compile(r'\[(.*?)\]')
    # replaces only the first placeholder, with a hard-coded value
    print(result.sub('John', template, count=1))

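Notice that every replacement has something in common: each is enclosed in a pair of square brackets. We can capture them all with a single regular expression and handle the specific replacements in a callback function.

So using a callback function is a better approach: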

 
    import re

    template = ("Hello [first_name] [last_name], "
                "Thank you for purchasing [product_name] from [store_name]. "
                "The total cost of your purchase was [product_price] plus [ship_price] for shipping. "
                "You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. "
                "Sincerely, "
                "[store_manager_name]")

    # assume dic has all the replacement data
    # such as dic['first_name'] dic['product_price'] etc...
    dic = {
        "first_name": "John",
        "last_name": "Doe",
        "product_name": "iphone",
        "store_name": "Walkers",
        "product_price": "$500",
        "ship_price": "$10",
        "ship_days_min": "1",
        "ship_days_max": "5",
        "store_manager_name": "DoeJohn"
    }

    def multiple_replace(dic, text):
        # build one alternation such as \[first_name\]|\[last_name\]|...
        pattern = "|".join(re.escape("[" + key + "]") for key in dic)
        # the callback strips the brackets and looks the key up in dic
        return re.sub(pattern, lambda m: dic[m.group()[1:-1]], text)

    print(multiple_replace(dic, template))
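A variation on the same idea is to use one generic pattern with a capturing group and let the callback decide what to do with each key. The sketch below leaves unknown placeholders untouched instead of raising KeyError; the function name fill_template and its behavior for missing keys are assumptions made for illustration, not part of the original article.

```python
import re


def fill_template(template, values):
    """Replace every [key] in template with values[key] (hypothetical helper).

    Placeholders whose key is missing from values are left as they are.
    """
    def replace(match):
        key = match.group(1)                   # text between the brackets
        return values.get(key, match.group(0))  # fall back to the original text

    return re.sub(r"\[(\w+)\]", replace, template)


filled = fill_template("Hello [first_name] [unknown]", {"first_name": "John"})
print(filled)  # Hello John [unknown]
```

Because the replacement is a function, its return value is inserted literally, so values such as "$500" need no escaping.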

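Don't reinvent the wheel

Perhaps even more important is knowing when not to use regular expressions. In many cases you can find alternative tools.

Parsing [X]HTML

An answer on Stack Overflow gives a brilliant explanation of why we should not use regular expressions to parse [X]HTML.

You should use an HTML parser instead, and Python has many options:

ElementTree is part of the standard library
BeautifulSoup is a popular third-party library
lxml is a full-featured, fast, C-based library

The latter two handle even malformed HTML gracefully, which is a blessing given the large number of ugly sites out there.

An ElementTree example: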

 
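    from xml.etree import ElementTree

    # ElementTree expects well-formed markup (for example XHTML)
    tree = ElementTree.parse('filename.html')
    # './/h1' searches the whole tree, not just the root's direct children
    for element in tree.findall('.//h1'):
        print(ElementTree.tostring(element))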
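To tie this back to the anchor-tag example from the greedy matching section, here is a sketch that collects link targets with html.parser from the standard library. This module is not mentioned in the original article and is swapped in here only because it needs no third-party install; the class name LinkCollector is illustrative.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
        'Hello <a href="http://example.com" title="example">Example</a>')

collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['http://pypix.com', 'http://example.com']
```

Unlike the regular expression, the parser does not care whether both anchors sit on the same line.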