This pattern is simply meant to grab everything in a string up to the first potential sentence boundary in my data:
[^\.?!\r\n]*
Output:
>>> pattern = re.compile(r"([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!") # Actual source snippet, not a personal comment about Australians. :-)
>>> print matches
['Australians go hard', '', '', '', '']
From the Python documentation:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
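Now, if the string is scanned left to right and the * operator is greedy, it makes perfect sense that the first match returned is the whole string up to the exclamation marks. However, once that portion has been consumed, I can't see how the pattern produces an empty match exactly four times, presumably by scanning the string leftward after the "d". I do understand that the * operator means this pattern can match the empty string, but I just don't see how it would do so more than once between the trailing "d" of the letters and the leading "!" of the punctuation.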
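One way to check where each match actually falls is to ask for match positions instead of just the matched text. The following is a minimal sketch, assuming Python 3 syntax (the transcripts here are Python 2) and using `finditer`, which yields match objects carrying `start()` and `end()` offsets; the variable names `text` and `spans` are only illustrative.

```python
import re

pattern = re.compile(r"([^\.?!\r\n]*)")
text = "Australians go hard!!!"

# Collect (start, end, matched text) for every match, in scan order.
spans = [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(text)]

for start, end, matched in spans:
    print(start, end, repr(matched))
```

If the empty matches report offsets at or after 19 (the index of the first "!"), they come from the positions around the exclamation marks and the end of the string rather than from anywhere to the left of the "d".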
Adding a ^ anchor has the following effect:
>>> pattern = re.compile(r"^([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!")
>>> print matches
['Australians go hard']
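Since this eliminates the empty-string matches, it would seem to indicate that those empty matches were occurring before the leading "A" of the string. But that seems to contradict the documentation's statement that matches are returned in the order found (a match before the leading "A" should have come first), and, again, the fact that there are exactly four empty matches puzzles me.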
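The anchored result can be compared against the unanchored one by listing where each match starts. This is a hedged sketch in Python 3 syntax; the names `anchored`, `unanchored`, and the two `*_starts` lists are made up for illustration. Without the `re.MULTILINE` flag, `^` matches only at the very start of the string, so any candidate match beginning at a later offset is rejected no matter what it would have matched.

```python
import re

text = "Australians go hard!!!"
anchored = re.compile(r"^([^\.?!\r\n]*)")
unanchored = re.compile(r"([^\.?!\r\n]*)")

# Start offset of every match each pattern produces, in scan order.
anchored_starts = [m.start() for m in anchored.finditer(text)]
unanchored_starts = [m.start() for m in unanchored.finditer(text)]

print(anchored_starts)
print(unanchored_starts)
```

Comparing the two lists shows which start offsets the anchor removes, which is a more direct test than inferring the positions from the disappearance of the empty strings.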