Python's Hidden Regular Expression Gems

There are many terrible modules in the Python standard library, but the Python re module is not one of them. While it's old and has not been updated in many years, it is, I would argue, one of the best regular expression modules of all dynamic languages.

What I always found interesting about that module is that Python is one of the few dynamic languages which does not have language-integrated regular expression support. However, while it lacks syntax and interpreter support for it, it makes up for it with one of the better designed core systems from a pure API point of view. At the same time it's very bizarre. For instance the parser is written in pure Python, which has some odd consequences if you ever try to trace Python while importing: you will discover that 90% of your time is probably spent in one of re's support modules.

Old But Proven

The regex module in Python is really old by now and one of the constants in the standard library. Ignoring Python 3, it has not really evolved since its inception other than gaining basic unicode support at one point. To this date it has broken member enumeration (have a look at what dir() returns on a regex pattern object).

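To illustrate what broken means here, a sketch from a Python 2 session (newer Python 3 interpreters have since fixed this): the data attributes work, but they do not show up in dir().

>>> import re
>>> p = re.compile('')
>>> 'pattern' in dir(p)
False
>>> p.pattern  # the attribute exists anyway
''
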
However, one of the nice things about it being old is that it does not change between Python versions and is very reliable. Not once did I have to adjust something because the regex module changed. Given how many regular expressions I'm writing in Python, this is good news.

One of the interesting quirks about its design is that its parser and compiler are written in Python but the matcher is written in C. This means we can pass the internal structures of the parser into the compiler and bypass the regex parsing entirely, if we feel like it. Not that this is documented. But it still works.

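For instance, something along these lines works (sre_parse and sre_compile are undocumented internals, so consider this a sketch for the interpreters of the time):

>>> import sre_parse, sre_compile
>>> tree = sre_parse.parse(r'\d+')        # parse to the internal tree
>>> pattern = sre_compile.compile(tree)   # compile the tree directly
>>> pattern.match('42').group()
'42'
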
There are, however, many other things about the regular expression system that are undocumented or badly documented, so I want to give some examples of why the regex module in Python is pretty cool.

Iterative Matching

The best feature of the regex system in Python is without a doubt that it makes a clear distinction between matching and searching, something that not many other regular expression engines do. In particular, when you perform a match you can provide an index to offset the matching, but the matching itself will be anchored to that position.

Concretely, this means you can do something like this:

>>> import re
>>> pattern = re.compile('bar')
>>> string = 'foobar'
>>> pattern.match(string) is None
True
>>> pattern.match(string, 3)
<_sre.SRE_Match object at 0x103c9a510>

This is immensely useful for building lexers, because you can continue to use the special ^ symbol to indicate the beginning of a line or of the entire string. We just need to increase the index to match further. It also means we do not have to slice up the string ourselves, which saves a ton of memory allocations and string copying in the process (not that Python is particularly good at that anyways).

In addition to matching, Python can also search, which means it will skip ahead until it finds a match:

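A short sketch of what that looks like, continuing the session from above (the object address is of course machine-specific):

>>> pattern.search(string, 1)
<_sre.SRE_Match object at 0x103c9a578>
>>> _.start()
3
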
Not Matching is also Matching

A particularly common problem is that the absence of a match is expensive to handle in Python. Think of writing a tokenizer for a wiki-like language (markdown, for instance). Between the tokens that indicate formatting, there is a lot of text that also needs handling. So when we match some wiki syntax between all the tokens we care about, we have more tokens which need handling. So how do we skip to those?

One method is to compile a bunch of regular expressions into a list and then to try them one by one. If none matches, we skip a character ahead:

import re

rules = [
    ('bold', re.compile(r'\*\*')),
    ('link', re.compile(r'\[\[(.*?)\]\]')),
]

def tokenize(string):
    pos = 0
    last_end = 0
    while 1:
        if pos >= len(string):
            break
        for tok, rule in rules:
            match = rule.match(string, pos)
            if match is not None:
                start, end = match.span()
                # Emit the raw text we skipped over before this token.
                if start > last_end:
                    yield 'text', string[last_end:start]
                yield tok, match.group()
                last_end = pos = match.end()
                break
        else:
            # No rule matched here; advance a single character.
            pos += 1
    if last_end < len(string):
        yield 'text', string[last_end:]

This is not a particularly beautiful solution, and it's also not very fast. The more mismatches we have, the slower we get, as we only advance one character at a time, and that loop runs in interpreted Python. We are also quite inflexible at the moment in how we handle this. For each token we only get the matched text, so if groups are involved we would have to extend this code a bit.

So is there a better method to do this? What if we could indicate to the regular expression engine that we want it to scan for any of a number of regular expressions?

This is where it gets interesting. Fundamentally this is what we do when we write a regular expression with sub-patterns: (a|b). This will search for either a or b. So we could build one humongous regular expression out of all the expressions we have, and then match against that. The downside is that we will eventually get super confused about all the groups involved.

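A quick sketch of that confusion with the bold/link rules from above: standalone, the link target is group 1, but in a combined pattern it shifts to group 3.

>>> import re
>>> link = re.compile(r'\[\[(.*?)\]\]')
>>> link.search('see [[target]]').group(1)
'target'
>>> combined = re.compile(r'(\*\*)|(\[\[(.*?)\]\])')
>>> combined.search('see [[target]]').group(3)
'target'
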
Enter The Scanner

This is where things get interesting. For the last 15 years or so, there has been a completely undocumented feature in the regular expression engine: the scanner. The scanner is a property of the underlying SRE pattern object with which the engine keeps matching: after finding one match, it picks up where it left off to find the next one. There even exists an re.Scanner class (also undocumented) which is built on top of the SRE pattern scanner and gives this a slightly higher-level interface.

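You can see the underlying machinery at work without re.Scanner. A minimal sketch (undocumented API, so details may vary between interpreter versions):

>>> import re
>>> sc = re.compile(r'\w+').scanner('a b c')
>>> sc.search().group()
'a'
>>> sc.search().group()
'b'
>>> sc.search().group()
'c'
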
Unfortunately, the scanner as it exists in the re module is not very useful for making the 'not matching' part faster, but looking at its source code reveals how it's implemented: on top of the SRE primitives.

The way it works is that it accepts a list of regular expression and callback tuples. For each match it invokes the callback with the match object and then builds a result list out of it. When we look at how it's implemented, we see that it manually creates SRE pattern and subpattern objects internally. (Basically it builds a larger regular expression without having to parse it.) Armed with this knowledge we can extend this:

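The extended Scanner is along the following lines. Note that this is a reconstruction leaning on undocumented sre_parse / sre_compile internals that have shifted in later Python 3 releases (SUBPATTERN entries grew extra fields in 3.8, for instance), so treat it as a sketch for the interpreters of the time:

from sre_parse import Pattern, SubPattern, parse
from sre_compile import compile as sre_compile
from sre_constants import BRANCH, SUBPATTERN


class Scanner(object):

    def __init__(self, rules, flags=0):
        pattern = Pattern()
        pattern.flags = flags
        pattern.groups = len(rules) + 1

        self.rules = [name for name, _ in rules]

        # Build one big (rule1|rule2|...) branch directly out of the
        # parsed sub-patterns, skipping string parsing for the combined
        # expression entirely.
        self._scanner = sre_compile(SubPattern(pattern, [
            (BRANCH, (None, [SubPattern(pattern, [
                (SUBPATTERN, (group, parse(regex, flags, pattern))),
            ]) for group, (_, regex) in enumerate(rules, 1)]))
        ])).scanner

    def scan(self, string, skip=False):
        sc = self._scanner(string)

        match = None
        # match() anchors at the current position; search() skips ahead
        # over unlexable input instead of stopping.
        for match in iter(sc.search if skip else sc.match, None):
            yield self.rules[match.lastindex - 1], match

        if not skip and (match is None or match.end() < len(string)):
            raise EOFError(match.end() if match is not None else 0)
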
So how do we use this? Like this:

scanner = Scanner([
    ('whitespace', r'\s+'),
    ('plus', r'\+'),
    ('minus', r'-'),
    ('mult', r'\*'),
    ('div', r'/'),
    ('num', r'\d+'),
    ('paren_open', r'\('),
    ('paren_close', r'\)'),
])

for token, match in scanner.scan('(1 + 2) * 3'):
    print(token, match.group())

In this form it will raise an EOFError in case it cannot lex something, but if you pass skip=True then it skips over the unlexable parts, which is perfect for building things like wiki syntax lexers.

Scanning with Holes

When we skip, we can use match.start() and match.end() to figure out which parts we skipped over. So here is the first example adjusted to do exactly that:

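A sketch of how that adjustment might look with the Scanner from above (reconstructed, using the bold/link rules from the first example):

scanner = Scanner([
    ('bold', r'\*\*'),
    ('link', r'\[\[(.*?)\]\]'),
])

def tokenize(string):
    pos = 0
    for rule, match in scanner.scan(string, skip=True):
        start, end = match.span()
        # Everything skipped over between matches is plain text.
        if start > pos:
            yield 'text', string[pos:start]
        yield rule, match.group()
        pos = end
    # Trailing text after the final match.
    if pos < len(string):
        yield 'text', string[pos:]
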
Fixing up Groups

One annoying thing is that our group indexes are not local to our own regular expression but to the combined one. This means if you have a rule like (a|b) and you want to access that group by index, it will be wrong. This would require a bit of extra engineering: a class that wraps the SRE match object with a custom one that adjusts the indexes and group names. If you are curious about that, I made a more complex version of the above solution that implements a proper match wrapper in a github repository, together with some samples of what you can do with it.

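To give a rough idea of the direction (a purely hypothetical sketch, not the actual code from that repository): because each rule's outermost group sits at a known index in the combined pattern, a wrapper only needs to offset group lookups.

class LocalMatch(object):
    """Hypothetical wrapper exposing rule-local group indexes."""

    def __init__(self, match, offset):
        self._match = match
        # Index of the rule's outermost group in the combined pattern.
        self._offset = offset

    def group(self, idx=0):
        # Local group 0 is the rule's own match; local group n maps to
        # combined group offset + n.
        return self._match.group(self._offset + idx)

    def span(self, idx=0):
        return self._match.span(self._offset + idx)
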
Translated from: https://www.pybloggers.com/2015/11/pythons-hidden-regular-expression-gems/
