Advanced Features of Regular Expressions

shhgs

Tags: regular expressions, search, lambda, python, regex, table

Article link: https://blog.csdn.net/shhgs/article/details/43881

non-greedy

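Regular expression quantifiers are greedy by default. That is, when several matches are possible from the same starting position, the regex takes the longest one. Because "." also matches "<" and ">", the ".+" in a pattern such as "<.+>" runs all the way to the second-to-last character of the string. This is where non-greedy matching comes in[1]. Non-greedy means that the regex picks the shortest possible match instead.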

    >>> import re
    >>> txt = "<tag>Hello World</tag>"
    >>> pat = r'<.+?>'
    >>> re.match(pat, txt).group()
    '<tag>'

Here is another example:

    >>> txt = "abcdddd"
    >>> def show_groups(pat, txt):
    ...     # print every captured group, one per line
    ...     for s in re.match(pat, txt).groups():
    ...         print(s)
    ...
    >>> show_groups(r"(a\w+)", txt)
    abcdddd
    >>> show_groups(r"(a\w+?)", txt)
    ab
    >>> show_groups(r"(a\w*?)", txt)
    a

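Once a quantifier is non-greedy and nothing follows it in the pattern, "a\w*?" behaves just like "a", and "a\w+?" just like "a\w", so there is little point in using it that way. Its main use is matching paired delimiters such as brackets or quotes. For the tag example above, the regular expression should therefore be r'<.+?>'.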
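To make the contrast concrete, here is a minimal sketch that runs the greedy and the non-greedy pattern against the same string. The quoted sample text is a made-up illustration, not part of the original example.

```python
import re

txt = "<tag>Hello World</tag>"

# Greedy: ".+" runs to the end, then backs off just enough for the final ">".
greedy = re.match(r"<.+>", txt).group()

# Non-greedy: ".+?" stops at the first ">" that lets the pattern succeed.
lazy = re.match(r"<.+?>", txt).group()

# Same idea with paired quotes (hypothetical sample text).
quoted = re.findall(r'"(.+?)"', 'say "hi" and "bye"')

print(greedy)  # <tag>Hello World</tag>
print(lazy)    # <tag>
print(quoted)  # ['hi', 'bye']
```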

Single-line and Multi-line

Single-line and multi-line modes are both about "\n", and the way "\n" is handled in turn affects how "^" and "$" match. In short, in single-line mode "." can match "\n":

    >>> match = lambda pat, s: re.match(pat, s).group()
    >>> match(".", "\n")
    Traceback (most recent call last):
      ...
    >>> match("(?s).", "\n")
    '\n'
    >>> match("(?m).", "\n")
    Traceback (most recent call last):
      ...
    >>> search = lambda pat, s: re.search(pat, s).group()
    >>> txt = """
    ... 123abc
    ... def456
    ... """
    >>> search(r"(?s)\w$", txt)
    '6'
    >>> pat = "(?s)^.+$"
    >>> search(pat, txt)
    '\n123abc\ndef456\n'
    >>> search(r"(?m)\w$", txt)
    'c'
    >>> search('(?m)^.+$', txt)
    '123abc'

The main job of multiline mode is to let "\n" mark the start and end of a line. By default, "\n" is treated as an ordinary character, one that "\s" can match.

    >>> search("^[a-zA-Z].*$", "Hello World!\n")
    'Hello World!'
    >>> search("^[a-zA-Z].*$", "\nHello World!\n")
    Traceback (most recent call last):
      ...
    >>> search(r"^\s*[a-zA-Z].*$", "\nHello World!\n")
    '\nHello World!'
    >>> search("(?m)^[a-zA-Z].*$", "\nHello World!\n")
    'Hello World!'

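To sum up, multiline and single-line modes both concern "\n": in multi-line mode, "^" also matches right after a "\n" and "$" right before one; in single-line mode, "." can match "\n". The way the regular expression works changes accordingly. The two modes are not mutually exclusive, though.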

    >>> match('.', '\n')
    Traceback (most recent call last):
      ...
    >>> match('(?s).', '\n')
    '\n'
    >>> search('(?m)^.+$', '\nhello\nworld\n')
    'hello'
    >>> search('^.+$', '\nhello\nworld\n')
    Traceback (most recent call last):
      ...

The first pair shows that a regular expression is not single-line by default; the second pair shows that it is not multiline by default either.
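Since the two modes are independent, they can also be switched on together. The sketch below uses the flag constants re.M and re.S instead of the inline "(?m)" and "(?s)" forms; the sample string and the patterns are only illustrations.

```python
import re

txt = "\nhello\nworld\n"

# Multiline only: "^" and "$" anchor at every line boundary.
lines = re.findall(r"^\w+$", txt, re.M)

# Default: "." refuses to cross the newline between the two words.
no_dotall = re.search(r"hello.world", txt)

# Single-line (DOTALL) only: "." now matches the newline.
dotall = re.search(r"hello.world", txt, re.S).group()

# Both at once: "^" anchors after the first newline, "." crosses the second.
both = re.findall(r"^hello.world$", txt, re.M | re.S)

print(lines)      # ['hello', 'world']
print(no_dotall)  # None
print(repr(dotall))
print(both)
```

Without re.M the last pattern finds nothing, because "^" can then only match at the very start of the string, where the first character is a newline.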

Backreference

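Once a group has been captured with parentheses, the rest of the pattern can refer back to it with \i, where i is the group number. The backreference matches the same text that the group matched.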

    >>> txt = "<tag>Hello World</tag>"
    >>> pat = r"<(.+?)>(.+)</\1>"
    >>> m = re.match(pat, txt)
    >>> m is not None
    True
    >>> for s in m.groups():
    ...     print(s)
    ...
    tag
    Hello World

Another example: suppose you want to capture three identical letters in a row:

    >>> pat = r'([a-zA-Z])\1\1'

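Note that a pattern with backreferences should always be written as a raw string; otherwise Python processes the backslash as a string escape before the regex engine ever sees it.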
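The following sketch shows the three-identical-letters pattern at work and what goes wrong without the raw-string prefix. The test words are made up for illustration.

```python
import re

triple = re.compile(r"([a-zA-Z])\1\1")

# "kkk" is the first run of three identical letters.
found = triple.search("bookkkeeper").group()

# Two identical letters are not enough.
missing = triple.search("aab")

# Without the r prefix, "\1" becomes the single character chr(1),
# so the pattern no longer contains a backreference at all.
non_raw = re.search("([a-zA-Z])\1\1", "kkk")

print(found)              # kkk
print(missing)            # None
print("\1" == chr(1))     # True
print(non_raw)            # None
```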

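When an expression contains several groups, backreference numbering quickly becomes confusing. In that case, (?:...) keeps a group out of the numbering, so it cannot be the target of a backreference.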

    >>> txt = "<h1>Title</h1>"
    >>> pat = r'<(.+?)>(?:.+?)</\1>'
    >>> m = re.match(pat, txt)
    >>> m.groups()
    ('h1',)

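If the expression is really complicated, consider naming the groups. The syntax is a bit more involved: a group is named with "(?P<name>...)", referenced inside the pattern with "(?P=name)", and referenced in a replacement string with "\g<name>". After a successful match, the substring can also be retrieved by group name.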

    >>> txt = "<h1>Title</h1>"
    >>> pat = r"<(?P<tag>.+?)>(?P<content>.+)</(?P=tag)>"
    >>> m = re.match(pat, txt)
    >>> print(m.group())
    <h1>Title</h1>
    >>> print(m.group("tag"))
    h1
    >>> re.sub(pat, r'<anytag>\g<content></anytag>', txt)
    '<anytag>Title</anytag>'
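As a small addition, named groups can also be read all at once with groupdict(). The sketch below uses a slightly stricter pattern (\w+ for the tag name), which is an assumption made for this illustration rather than the pattern from the example above.

```python
import re

# Assumed variant of the pattern: the tag name is restricted to \w+.
pat = re.compile(r"<(?P<tag>\w+)>(?P<content>.+?)</(?P=tag)>")

m = pat.match("<h1>Title</h1>")
groups = m.groupdict()

# \g<content> pulls the named group into the replacement string.
renamed = pat.sub(r"<h2>\g<content></h2>", "<h1>Title</h1>")

# (?P=tag) insists on the same tag name, so mismatched tags do not match.
mismatch = pat.match("<h1>Title</h2>")

print(groups)    # {'tag': 'h1', 'content': 'Title'}
print(renamed)   # <h2>Title</h2>
print(mismatch)  # None
```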

assertion

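Python's re module has two kinds of assertion: lookahead and lookbehind. A lookahead assertion serves the part of the regular expression in front of it, so what it asserts concerns the text that follows. Lookbehind works the same way in the opposite direction^[1].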

    >>> def re_show(pat, s):
    ...     # wrap every match in braces so it is easy to spot
    ...     print(re.compile(pat, re.M).sub(r"{\g<0>}", s.rstrip()))
    ...
    >>> txt = "Micheal Jordan and Micheal Jackson"
    >>> pat = "(Micheal)(?= Jordan)"
    >>> re_show(pat, txt)
    {Micheal} Jordan and Micheal Jackson
    >>> pat = r"(Micheal)(?=\s+(?i:jordan))"
    >>> re_show(pat, txt)
    {Micheal} Jordan and Micheal Jackson

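Judging from this example alone, a lookahead assertion looks just like a group except that it does not take part in the match. Its main use is actually in substitution.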

    >>> txt = "Micheal Jordan and Micheal Jackson"
    >>> pat = "(Micheal)(?= Jackson)"
    >>> re.sub(pat, "Phil", txt)
    'Micheal Jordan and Phil Jackson'
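Why the lookahead matters for substitution becomes clearer when it is compared with a pattern that consumes the surname. This is a minimal sketch on the same sample sentence; the variable names are only illustrative.

```python
import re

txt = "Micheal Jordan and Micheal Jackson"

# The lookahead checks for " Jackson" but does not include it in the match.
m = re.search(r"Micheal(?= Jackson)", txt)

# With the lookahead, only the first name is replaced.
kept = re.sub(r"Micheal(?= Jackson)", "Phil", txt)

# Without it, the surname is part of the match and is replaced too.
consumed = re.sub(r"Micheal Jackson", "Phil", txt)

print(m.group())  # Micheal
print(m.span())   # (19, 26)
print(kept)       # Micheal Jordan and Phil Jackson
print(consumed)   # Micheal Jordan and Phil
```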

Lookahead assertions come in positive and negative forms. The examples so far are all positive; the negative form is very similar, except that the symbol becomes "?!".

    >>> txt = "Bill Clinton, Bill Joy, Bill Gates"
    >>> pat = "(Bill)(?! Joy)"
    >>> re_show(pat, txt)
    {Bill} Clinton, Bill Joy, {Bill} Gates
    >>> re.sub(pat, "William", txt)
    'William Clinton, Bill Joy, William Gates'

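A lookbehind assertion is very similar to a lookahead assertion, but it has one major restriction: its pattern must match strings of a fixed length, so variable-length quantifiers such as "*", "+", or "?" are not allowed. The positive form is "?<=" and the negative form is "?<!".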

    >>> txt = "CPython, JPython, and Python.NET"
    >>> pat_positive = "(?<=J)(Python)"
    >>> pat_negative = "(?<!J)(Python)"
    >>> re_show(pat_positive, txt)
    CPython, J{Python}, and Python.NET
    >>> re_show(pat_negative, txt)
    C{Python}, JPython, and {Python}.NET
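The fixed-length restriction can be checked directly. In the sketch below, the patterns with "J+" and "J{2}" are made-up test cases; the point is only that Python's re rejects the first when compiling and accepts the second.

```python
import re

txt = "CPython, JPython, and Python.NET"

# Positive and negative lookbehind on the sample text.
found = re.findall(r"(?<=J)Python", txt)
replaced = re.sub(r"(?<!J)Python", "Py", txt)

# A variable-length lookbehind is rejected at compile time.
try:
    re.compile(r"(?<=J+)Python")
    rejected = False
except re.error as exc:
    rejected = True
    print("rejected:", exc)

# A fixed repeat count has a fixed length, so it is accepted.
fixed = re.search(r"(?<=J{2})Python", "JJPython").group()

print(found)     # ['Python']
print(replaced)  # CPy, JPython, and Py.NET
print(fixed)     # Python
```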

Tips

Here are a few small examples.

    tag = lambda s: r"(?s)<(?P<tag>%s)\b(?P<attributes>.*?)>(?P<content>.+?)</%s>" % (s, s)

tag generates a regular expression that matches an element. To get a regex that matches h1 elements, just call tag('h1'). A demonstration:

    >>> tg = tag('anytag')
    >>> txt = r"""<anytag a="yes" b="no">Hello world</anytag>"""
    >>> m = re.match(tg, txt)
    >>> m is not None
    True
    >>> m.group('tag')
    'anytag'
    >>> m.group('attributes')
    ' a="yes" b="no"'
    >>> m.group('content')
    'Hello world'

    tag2 = lambda s: r"(?s)<%s\b.*?>(.+?)</%s>" % (s, s)
    GrabEntries = lambda s, tg: re.findall(tag2(tg), s)

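One more example. A table contains many tr elements, and each tr contains many td elements. It would be handy to have a function that takes a table element and returns a list of its tr contents; GrabEntries is that function. This blog has trouble displaying a real table, so the example is slightly modified: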

    >>> txt = """
    ... <table a="yes" b="no">
    ... <tr c='good' d='bad'>
    ... \t<td>123</td>
    ... \t<td>abc</td>
    ... </tr>
    ... <tr e='red' f='white'>
    ... \t<td>red</td>
    ... \t<td>white</td>
    ... </tr>
    ... </table>
    ... """
    >>> GrabEntries(txt, 'tr')
    ['\n\t<td>123</td>\n\t<td>abc</td>\n', '\n\t<td>red</td>\n\t<td>white</td>\n']
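The same idea can be applied twice to go from rows down to cells. The sketch below is a hypothetical rewrite of the helper as ordinary functions (tag_pattern and grab_entries are names invented here), with an attribute part that may be empty. It assumes elements of the same name are not nested, which a regex like this cannot handle.

```python
import re

def tag_pattern(name):
    """Build a regex for <name ...>content</name> (hypothetical helper)."""
    safe = re.escape(name)
    # (?s) lets "." cross newlines; ".*?" allows an empty attribute list.
    return r"(?s)<%s\b.*?>(.+?)</%s>" % (safe, safe)

def grab_entries(text, name):
    """Return the content of every <name> element found in text."""
    return re.findall(tag_pattern(name), text)

# Made-up sample table, shortened from the example above.
table = (
    "<table a='yes'>\n"
    "<tr c='good'>\n\t<td>123</td>\n\t<td>abc</td>\n</tr>\n"
    "<tr e='red'>\n\t<td>red</td>\n\t<td>white</td>\n</tr>\n"
    "</table>\n"
)

rows = grab_entries(table, "tr")
cells = [grab_entries(row, "td") for row in rows]

print(len(rows))  # 2
print(cells)      # [['123', 'abc'], ['red', 'white']]
```

The "\b" after the tag name is what stops "tr" from being confused with a longer name that merely starts with the same letters.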

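^[1] It seems my earlier understanding was wrong. I used to think that "lookahead assertion" meant the regular expression makes an assertion about the part in front of it. It turns out that the construct itself is the assertion, and "lookahead" means that only the front part of the expression takes part in the match. In other words, the assertion is about the part that follows.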