Python Regular Expressions: The regex Module

满腹的小不甘

Category: Python. Tags: python, regular expressions

Article link: https://blog.csdn.net/qq_27586341/article/details/106007708

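1. Two behaviours for compatibility with the re module

2. Case-insensitive matches in Unicode

3. Flags

4. Groups

5. Other features

Reference: regex 2020.5.7, the extension module's official page

The regex module is a regular expression implementation that is backwards-compatible with the standard re module but offers additional functionality.

The re module's zero-width match behaviour changed in Python 3.7, and this module follows that behaviour when compiled for Python 3.7.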

1. Two behaviours for compatibility with the re module

Version 0 (old behaviour, compatible with the re module):

Please note that the re module’s behaviour may change over time, and I’ll endeavour to match that behaviour in version 0.
- Indicated by the VERSION0 or V0 flag, or (?V0) in the pattern.
- Zero-width matches are not handled correctly in the re module before Python 3.7. The behaviour in those earlier versions is:
  - .split won’t split a string at a zero-width match.
  - .sub will advance by one character after a zero-width match.
- Inline flags apply to the entire pattern, and they can’t be turned off.
- Only simple sets are supported.
- Case-insensitive matches in Unicode use simple case-folding by default.

Version 1 (new behaviour, possibly different from the re module):

- Indicated by the VERSION1 or V1 flag, or (?V1) in the pattern.
- Zero-width matches are handled correctly.
- Inline flags apply to the end of the group or pattern, and they can be turned off.
- Nested sets and set operations are supported.
- Case-insensitive matches in Unicode use full case-folding by default.

If no version is specified, the regex module defaults to regex.DEFAULT_VERSION.
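
As a small sketch of the version 1 rule that inline flags run only to the end of the group or pattern and can be turned off again, the snippet below requests version 1 explicitly with (?V1). The sample strings and variable names are made up for illustration.

```python
import regex

# (?i) switches case-insensitivity on, (?-i) switches it off again,
# so only the "a" is matched case-insensitively.
turned_off_ok = regex.match(r"(?V1)(?i)a(?-i)b", "Ab")
turned_off_fail = regex.match(r"(?V1)(?i)a(?-i)b", "AB")

# Inside a group, the flag stops applying at the end of that group.
group_ok = regex.match(r"(?V1)(?:(?i)a)b", "Ab")
group_fail = regex.match(r"(?V1)(?:(?i)a)b", "AB")

print(turned_off_ok is not None, turned_off_fail is None)
print(group_ok is not None, group_fail is None)
```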

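2. Case-insensitive matches in Unicode

The regex module supports both simple and full case-folding for case-insensitive matches in Unicode. Full case-folding can be turned on with the FULLCASE or F flag, or with (?f) in the pattern. Note that this flag affects how the IGNORECASE flag works; the FULLCASE flag does not itself turn on case-insensitive matching.

In version 0 behaviour, the flag is off by default.
In version 1 behaviour, the flag is on by default.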
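
The difference between simple and full case-folding is easiest to see with the German sharp s. This is only an illustrative sketch: the word and the variable names are chosen for the example, and the version is selected explicitly in each call.

```python
import regex

word = "stra\N{LATIN SMALL LETTER SHARP S}e"  # "strasse" written with a sharp s

# Version 1: full case-folding is on by default, so "ss" can match the sharp s.
full_v1 = regex.match(r"(?iV1)strasse", word)

# Version 0: simple case-folding by default, so the same pattern does not match.
simple_v0 = regex.match(r"(?iV0)strasse", word)

# Version 0 with FULLCASE requested explicitly through the flags argument.
full_v0 = regex.match(
    r"strasse", word,
    flags=regex.IGNORECASE | regex.FULLCASE | regex.VERSION0,
)

print(full_v1.span())         # (0, 6)
print(simple_v0)              # None
print(full_v0 is not None)
```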

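3. Flags

There are 2 kinds of flag: scoped and global. Scoped flags can apply to only part of a pattern and can be turned on or off; global flags apply to the entire pattern and can only be turned on.

Scoped flags: FULLCASE, IGNORECASE, MULTILINE, DOTALL, VERBOSE, WORD.

Global flags: ASCII, BESTMATCH, ENHANCEMATCH, LOCALE, POSIX, REVERSE, UNICODE, VERSION0, VERSION1.

If none of the ASCII, LOCALE or UNICODE flags is specified, the default is UNICODE when the pattern is a Unicode string and ASCII when it is a bytestring.

The ENHANCEMATCH flag makes fuzzy matching attempt to improve the fit of the next match that it finds.
The BESTMATCH flag makes fuzzy matching search for the best match instead of the next match.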

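4. Groups

All capture groups have a group number, starting from 1. Groups with the same group name have the same group number, and groups with different group names have different group numbers.

The same name can be used by more than one group, with later captures "overwriting" earlier ones. All of the captures of the group are available from the captures method of the match object.

Group numbers are reused across the different branches of a branch reset, e.g. (?|(first)|(second)) has only group 1. If capture groups have different group names then they will, of course, have different group numbers, e.g. (?|(?P<foo>first)|(?P<bar>second)) has group 1 ("foo") and group 2 ("bar").

The regex (\s+)(?|(?P<foo>[A-Z]+)|(\w+)) (?P<foo>[0-9]+) has 2 groups: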

(\s+) is group 1.
(?P<foo>[A-Z]+) is group 2, also called “foo”.
(\w+) is group 2 because of the branch reset.
(?P<foo>[0-9]+) is group 2 because it’s called “foo”.
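
The numbering rules above can be checked on compiled patterns. The following is a minimal sketch that reuses the patterns from this section; the input string " ABC 123" is an assumption made for the example.

```python
import regex

plain = regex.compile(r"(?|(first)|(second))")
named = regex.compile(r"(?|(?P<foo>first)|(?P<bar>second))")

print(plain.groups)            # 1: both branches share group 1
print(dict(named.groupindex))  # different names get different numbers

# Two groups named "foo" share one group number; captures() keeps every capture.
m = regex.match(r"(\s+)(?|(?P<foo>[A-Z]+)|(\w+)) (?P<foo>[0-9]+)", " ABC 123")
foo_captures = m.captures("foo")
foo_last = m.group("foo")
print(foo_captures, foo_last)
```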

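5. Other features

Each entry below gives a pattern or method, followed by its description.

Word start, word end and word boundary positions

regex uses \m for the start of a word and \M for the end of a word.

\b is a word boundary, but it cannot distinguish a start position from an end position.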
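
To make the distinction concrete, the sketch below lists the positions at which each of the three zero-width assertions matches. The sample text is arbitrary.

```python
import regex

text = "hello big world"

# Positions where each assertion matches.
word_starts = [m.start() for m in regex.finditer(r"\m", text)]
word_ends = [m.start() for m in regex.finditer(r"\M", text)]
boundaries = [m.start() for m in regex.finditer(r"\b", text)]

print(word_starts)  # [0, 6, 10]
print(word_ends)    # [5, 9, 15]
print(boundaries)   # [0, 5, 6, 9, 10, 15]
```

\b reports both kinds of position, while \m and \M each report only one.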

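(?flags-flags:...) scoped

(?flags-flags) global

Scoped control:

(?i:...) turns case-insensitive matching on, and (?-i:...) turns it off.

Several flags can be written side by side, as in (?is-f:...): flags to the left of the minus sign are turned on, and flags to the right are turned off.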

>>> regex.search(r"(?i:good)", "GOOD")
<regex.Match object; span=(0, 4), match='GOOD'>

Global control:

(?si-f)good
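
A scoped flag affects only the subpattern it wraps. As a minimal sketch (the pattern and test strings are invented for illustration), only "good" is matched case-insensitively here:

```python
import regex

pattern = r"(?i:good) morning"

# The scoped flag covers "good" only; " morning" stays case-sensitive.
scoped_match = regex.fullmatch(pattern, "GOOD morning")
scoped_miss = regex.fullmatch(pattern, "GOOD MORNING")

print(scoped_match is not None)  # True
print(scoped_miss is None)       # True
```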

lookaround

Lookarounds are supported as the test in conditional patterns:

>>> regex.match(r'(?(?=\d)\d+|\w+)', '123abc')
<regex.Match object; span=(0, 3), match='123'>
>>> regex.match(r'(?(?=\d)\d+|\w+)', 'abc123')
<regex.Match object; span=(0, 6), match='abc123'>

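This is not quite the same as putting a lookaround in the first branch of a pair of alternatives:

>>> print(regex.match(r'(?:(?=\d)\d+\b|\w+)', '123abc')) # if branch 1 fails, branch 2 is tried
<regex.Match object; span=(0, 6), match='123abc'>
>>> print(regex.match(r'(?(?=\d)\d+\b|\w+)', '123abc')) # if branch 1 fails, branch 2 is not tried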
None

(?p)

POSIX matching (leftmost longest)

Normal matching:
>>> regex.search(r'Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 2), match='Mr'>
>>> regex.search(r'one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 7), match='oneself'>

POSIX matching:
>>> regex.search(r'(?p)Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 3), match='Mrs'>
>>> regex.search(r'(?p)one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 17), match='oneselfsufficient'>

[[a-z]--[aeiou]]

V0: simple sets, compatible with the re module

V1: nested sets with extended functionality; this set contains 'a'-'z' but excludes "a", "e", "i", "o", "u"

e.g.:

regex.search(r'(?V1)[[a-z]--[aeiou]]+', 'abcde')

or

regex.search(r'[[a-z]--[aeiou]]+', 'abcde', flags=regex.V1)

<regex.Match object; span=(1, 4), match='bcd'>

(?(DEFINE)...)

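Defines named groups for later use. If there is no group called "DEFINE", then ... is ignored, except that any groups defined within it can still be called, and the normal rules for numbering groups still apply.

e.g.: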

>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant) (?&item)', '5 elephants')
<regex.Match object; span=(0, 11), match='5 elephants'>

# Fixed patterns at both ends, with other content in between
>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant)[\u4E00-\u9FA5](?&item)', '123哈哈dog')
<regex.Match object; span=(0, 8), match='123哈哈dog'>

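\K

Keeps the part of the match that follows the position of \K and discards the part before it. Capture groups are not affected.

>>> m = regex.search(r'(\w\w\K\w\w\w)', 'abcdef')
>>> m
<regex.Match object; span=(2, 5), match='cde'>
>>> m[0]    # keeps 'cde', discards 'ab'
'cde'
>>> m[1]
'abcde'

>>> m = regex.search(r'(?r)(\w\w\K\w\w\w)', 'abcdef')
>>> m
<regex.Match object; span=(1, 3), match='bc'>
>>> m[0]    # reversed: keeps 'bc', discards 'def'
'bc'
>>> m[1]
'bcdef'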
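
One place where \K tends to be handy is substitution: the text before \K must be present but is left out of the match, so it is not replaced. The following sketch assumes a made-up query string and a made-up mask.

```python
import regex

# "token=" is required for the match, but only the part after \K is replaced.
masked = regex.sub(r"token=\K\w+", "***", "token=abc123&user=bob")

print(masked)  # token=***&user=bob
```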

(?r) reverse search

>>> regex.findall(r".", "abc")
['a', 'b', 'c']
>>> regex.findall(r"(?r).", "abc")
['c', 'b', 'a']

Note: the result of a reverse search is not necessarily the reverse of a forward search.

>>> regex.findall(r"..", "abcde")
['ab', 'cd']
>>> regex.findall(r"(?r)..", "abcde")
['de', 'bc']

expandf

Subscripts can be used to get every capture of a repeated group:

>>> m = regex.match(r"(\w)+", "abc")
>>> m.expandf("{1}")    # same as "{1[-1]}": later captures overwrite earlier ones, so {1} is 'c'
'c'
>>> m.expandf("{1[0]} {1[1]} {1[2]}")
'a b c'
>>> m.expandf("{1[-1]} {1[-2]} {1[-3]}")
'c b a'

With a named group:
>>> m = regex.match(r"(?P<letter>\w)+", "abc")
>>> m.expandf("{letter}")
'c'
>>> m.expandf("{letter[0]} {letter[1]} {letter[2]}")
'a b c'
>>> m.expandf("{letter[-1]} {letter[-2]} {letter[-3]}")
'c b a'

>>> m = regex.match(r"(\w+) (\w+)", "foo bar")
>>> m.expandf("{0} => {2} {1}")
'foo bar => bar foo'

>>> m = regex.match(r"(?P<word1>\w+) (?P<word2>\w+)", "foo bar")
>>> m.expandf("{word2} {word1}")
'bar foo'

The same works with match objects returned by search().

capturesdict()

groupdict()

captures()

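capturesdict() is a combination of groupdict() and captures():

groupdict(): returns a dict with key = group name, value = the last value matched

captures(): returns a list of all the values matched

capturesdict(): returns a dict with key = group name, value = a list of all the values matched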

>>> m = regex.match(r"(?:(?P<word>\w+) (?P<digits>\d+)\n)+", "one 1\ntwo 2\nthree 3\n")
>>> m.groupdict()
{'word': 'three', 'digits': '3'}

>>> m.captures("word")
['one', 'two', 'three']

>>> m.captures("digits")
['1', '2', '3']
>>> m.capturesdict()

{'word': ['one', 'two', 'three'], 'digits': ['1', '2', '3']}

Ways to access groups

(1) By subscript or slice:
>>> m = regex.search(r"(?P<before>.*?)(?P<num>\d+)(?P<after>.*)", "pqr123stu")
>>> m["before"]
'pqr'
>>> len(m)
4
>>> m[:]
('pqr123stu', 'pqr', '123', 'stu')

(2) By group("name"):
>>> m.group('num')

'123'

(3) By group number:
>>> m.group(0)

'pqr123stu'

>>> m.group(1)

'pqr'

subf

subfn

subf and subfn are alternatives to sub and subn respectively. When passed a replacement string, they treat it as a format string.

>>> regex.subf(r"(\w+) (\w+)", "{0} => {2} {1}", "foo bar")
'foo bar => bar foo'
>>> regex.subf(r"(?P<word1>\w+) (?P<word2>\w+)", "{word2} {word1}", "foo bar")
'bar foo'
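
The examples above only show subf. As a complementary sketch, subfn takes the same format-string replacement and, like subn, also returns the number of substitutions. The input string is made up for the example.

```python
import regex

# Swap each "key=value" pair into "value=key" and count the replacements.
swapped, count = regex.subfn(r"(\w+)=(\w+)", "{2}={1}", "a=1 b=2")

print(swapped)  # 1=a 2=b
print(count)    # 2
```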

partial

Partial matches: match, search, fullmatch and finditer all support partial matching, enabled with the partial keyword argument. Match objects have a partial attribute, which is True for a partial match and False for a complete match.

>>> regex.search(r'\d{4}', '12', partial=True)
<regex.Match object; span=(0, 2), match='12', partial=True>
>>> regex.search(r'\d{4}', '123', partial=True)
<regex.Match object; span=(0, 3), match='123', partial=True>
>>> regex.search(r'\d{4}', '1234', partial=True)    # complete match: no partial=True in the repr
<regex.Match object; span=(0, 4), match='1234'>
>>> regex.search(r'\d{4}', '12345', partial=True)
<regex.Match object; span=(0, 4), match='1234'>
>>> regex.search(r'\d{4}', '12345', partial=True).partial    # complete match
False
>>> regex.search(r'\d{4}', '145', partial=True).partial    # partial match
True
>>> regex.search(r'\d{4}', '1245', partial=True).partial    # complete match
False
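
A typical use of partial matching is checking text while it is still being typed. The helper below is a hypothetical sketch, not part of the regex module: the function name classify, its default pattern and the three labels are all assumptions made for illustration.

```python
import regex

def classify(text, pattern=r"\d{4}-\d{2}"):
    """Label text as 'complete', 'incomplete' or 'invalid' for the pattern."""
    m = regex.fullmatch(pattern, text, partial=True)
    if m is None:
        return "invalid"          # can never become a match
    return "incomplete" if m.partial else "complete"

print(classify("2020"))      # incomplete
print(classify("2020-05"))   # complete
print(classify("20x0"))      # invalid
```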

(?P<name>)

Duplicate group names are allowed; later captures overwrite earlier ones.

Optional groups:
>>> # Both groups capture, the second capture 'overwriting' the first.
>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", "first or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['first', 'second']

>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", " or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['second']

>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", "first or ")
>>> m.group("item")
'first'
>>> m.captures("item")
['first']

Mandatory groups:
>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", "first or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['first', 'second']

>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", " or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['', 'second']

>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", "first or ")
>>> m.group("item")
''
>>> m.captures("item")
['first', '']

detach_string

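A match object holds a reference to the string that was searched, via its string attribute. The detach_string method "detaches" that string, making it available for garbage collection, which might save valuable memory if the string is very large.

>>> m = regex.search(r"\w+", "Hello world")
>>> print(m.group())
Hello
>>> print(m.string)
Hello world
>>> m.detach_string()
>>> print(m.group())
Hello
>>> print(m.string)
None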

(?0)、(?1)、(?2)

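(?R) or (?0) tries to match the entire regex recursively.
(?1), (?2), etc., try to match the relevant capture group (group 1, group 2). (Tarzan|Jane) loves (?1) is equivalent to (Tarzan|Jane) loves (?:Tarzan|Jane).
(?&name) tries to match the named capture group.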

>>> regex.match(r"(Tarzan|Jane) loves (?1)", "Tarzan loves Jane").groups()
('Tarzan',)
>>> regex.match(r"(Tarzan|Jane) loves (?1)", "Jane loves Tarzan").groups()
('Jane',)

>>> m = regex.search(r"(\w)(?:(?R)|(\w?))\1", "kayak")
>>> m.group(0, 1, 2)
('kayak', 'k', None)
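
Recursion is what makes nested structures matchable. As an illustrative sketch (the pattern and input are assumptions, not taken from the module's documentation), (?R) can find balanced parenthesised spans:

```python
import regex

# An opening parenthesis, then any mix of non-parenthesis characters and
# nested balanced groups (the recursive call), then a closing parenthesis.
balanced = regex.compile(r"\((?:[^()]|(?R))*\)")

found = balanced.findall("f(a(b)c) + g(d) + h(")

print(found)  # ['(a(b)c)', '(d)']
```

The trailing unclosed "(" is not reported, since no balanced span starts there.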

Fuzzy matching

There are three types of error:

Insertion: "i"
Deletion: "d"
Substitution: "s"

In addition, "e" indicates any type of error.

Examples:

foo match “foo” exactly
(?:foo){i} match “foo”, permitting insertions
(?:foo){d} match “foo”, permitting deletions
(?:foo){s} match “foo”, permitting substitutions
(?:foo){i,s} match “foo”, permitting insertions and substitutions
(?:foo){e} match “foo”, permitting errors

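If a certain type of error is specified, then any type not specified is not permitted. In the following examples the item is omitted and only the fuzziness is written: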

{d<=3} permit at most 3 deletions, but no other types
{i<=1,s<=2} permit at most 1 insertion and at most 2 substitutions, but no deletions
{1<=e<=3} permit at least 1 and at most 3 errors
{i<=2,d<=2,e<=3} permit at most 2 insertions, at most 2 deletions, at most 3 errors in total, but no substitutions

It’s also possible to state the costs of each type of error and the maximum permitted total cost.

Examples:

{2i+2d+1s<=4} each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4
{i<=1,d<=1,s<=1,2i+2d+1s<=4} at most 1 insertion, at most 1 deletion, at most 1 substitution; each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4

You can also add a test to perform on a character that is substituted or inserted.

Examples:

{s<=2:[a-z]} at most 2 substitutions, which must be in the character set [a-z].
{s<=2,i<=3:\d} at most 2 substitutions, at most 3 insertions, which must be digits.

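By default, fuzzy matching searches for the first match that meets the given constraints. The ENHANCEMATCH flag, (?e), makes it attempt to improve the fit (i.e. reduce the number of errors) of the match that it has found.

The BESTMATCH flag makes it search for the best match instead.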

regex.search("(dog){e}", "cat and dog")[1] returns "cat" because that matches "dog" with 3 errors (an unlimited number of errors is permitted).
regex.search("(dog){e<=1}", "cat and dog")[1] returns " dog" (with a leading space) because that matches "dog" with 1 error, which is within the limit.
regex.search("(?e)(dog){e<=1}", "cat and dog")[1] returns "dog" (without a leading space) because the fuzzy search first matches " dog" with 1 error, which is within the limit, and the (?e) then makes it attempt a better fit.

Match objects have a fuzzy_counts attribute, which gives the total number of substitutions, insertions and deletions:

>>> # A 'raw' fuzzy match:
>>> regex.fullmatch(r"(?:cats|cat){e<=1}", "cat").fuzzy_counts
(0, 0, 1)
>>> # 0 substitutions, 0 insertions, 1 deletion.

>>> # A better match might be possible if the ENHANCEMATCH flag used:
>>> regex.fullmatch(r"(?e)(?:cats|cat){e<=1}", "cat").fuzzy_counts
(0, 0, 0)
>>> # 0 substitutions, 0 insertions, 0 deletions.

Match objects also have a fuzzy_changes attribute, which gives a tuple of the positions of the substitutions, insertions and deletions:

>>> m = regex.search('(fuu){i<=2,d<=2,e<=5}', 'anaconda foo bar')
>>> m
<regex.Match object; span=(7, 10), match='a f', fuzzy_counts=(0, 2, 2)>
>>> m.fuzzy_changes
([], [7, 8], [10, 11])
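
The rule that unspecified error types are not permitted can be seen in a small sketch. The words used here are arbitrary; fullmatch is used so that the whole input has to be accounted for.

```python
import regex

# Only substitutions are allowed: "hallo" differs from "hello" by one.
one_sub = regex.fullmatch(r"(?:hello){s<=1}", "hallo")

# "helo" would need a deletion, which {s<=1} does not permit.
no_deletion = regex.fullmatch(r"(?:hello){s<=1}", "helo")

# With {e<=1} any single error is allowed, so the deletion is accepted.
any_error = regex.fullmatch(r"(?:hello){e<=1}", "helo")

print(one_sub.fuzzy_counts)    # (1, 0, 0)
print(no_deletion)             # None
print(any_error.fuzzy_counts)  # (0, 0, 1)
```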

\L<name>

Named lists

The old approach:

p = regex.compile(r"first|second|third|fourth|fifth")

If the list is large, parsing the resulting regex can take considerable time, and care must also be taken that the strings are properly escaped and properly ordered, for example, "cats" before "cat".

The new approach: the order does not matter, and the items are treated as a set.

>>> option_set = ["first", "second", "third", "fourth", "fifth"]
>>> p = regex.compile(r"\L<options>", options=option_set)

The named_lists attribute:

>>> print(p.named_lists)
# Python 3
{'options': frozenset({'fifth', 'first', 'fourth', 'second', 'third'})}
# Python 2
{'options': frozenset(['fifth', 'fourth', 'second', 'third', 'first'])}
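
Because the items of a named list are taken literally, there is no need to escape them. The sketch below uses a made-up list containing regex metacharacters to illustrate this.

```python
import regex

# Items containing characters that would normally need escaping.
words = ["a.b", "c++", "cat"]
p = regex.compile(r"\L<words>", words=words)

literal_dot = p.fullmatch("a.b")   # the "." is a literal dot
wildcard_dot = p.fullmatch("axb")  # so it does not act as a wildcard
plus_signs = p.fullmatch("c++")    # "+" is literal too

print(literal_dot is not None, wildcard_dot is None, plus_signs is not None)
```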

Set operators

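Sets and nested sets

Version 1 behaviour only

Set operators have been added, and a set can contain nested sets.

The operators, in order of increasing precedence, are:

|| for union ("x||y" means "x or y")
~~ (double tilde) for symmetric difference ("x~~y" means "x or y, but not both")
&& for intersection ("x&&y" means "x and y")
-- (double dash) for difference ("x--y" means "x but not y")

Implicit union, i.e. simple juxtaposition as in [ab], has the highest precedence. Thus [ab&&cd] is the same as [[a||b]&&[c||d]].

e.g.: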

[ab] # Set containing ‘a’ and ‘b’
[a-z] # Set containing ‘a’ .. ‘z’
[[a-z]--[qw]] # Set containing ‘a’ .. ‘z’, but not ‘q’ or ‘w’
[a-z--qw] # Same as above
[\p{L}--QW] # Set containing all letters except ‘Q’ and ‘W’
[\p{N}--[0-9]] # Set containing all numbers except ‘0’ .. ‘9’
[\p{ASCII}&&\p{Letter}] # Set containing all characters which are ASCII and letter
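
A few of these operators in action, as a minimal sketch with made-up input strings. Each pattern selects version 1 explicitly with (?V1), since set operations are only available there.

```python
import regex

# Difference: lowercase letters that are not vowels.
consonants = regex.findall(r"(?V1)[[a-z]--[aeiou]]", "regex module")

# Intersection: characters that are both ASCII and letters.
ascii_letters = regex.findall(r"(?V1)[\p{ASCII}&&\p{Letter}]", "a1-\u00e9 B")

# Symmetric difference: in exactly one of the two ranges.
either_not_both = regex.findall(r"(?V1)[[a-c]~~[b-d]]", "abcd")

print(consonants)       # ['r', 'g', 'x', 'm', 'd', 'l']
print(ascii_letters)    # ['a', 'B']
print(either_not_both)  # ['a', 'd']
```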

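Start and end indexes

Match objects have additional methods that return information about all the successful matches of a repeated capture group. These methods are: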

matchobject.captures([group1, ...])
matchobject.starts([group])
matchobject.ends([group])
matchobject.spans([group])

>>> m = regex.search(r"(\w{3})+", "123456789")
>>> m.group(1)
'789'
>>> m.captures(1)
['123', '456', '789']
>>> m.start(1)
6
>>> m.starts(1)
[0, 3, 6]
>>> m.end(1)
9
>>> m.ends(1)
[3, 6, 9]
>>> m.span(1)
(6, 9)
>>> m.spans(1)
[(0, 3), (3, 6), (6, 9)]

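\G is a search anchor. It matches at the position where each search started or continued, and can be used for contiguous matches or in negative variable-length lookbehinds to limit how far back the lookbehind goes: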

>>> regex.findall(r"\w{2}", "abcd ef")
['ab', 'cd', 'ef']
>>> regex.findall(r"\G\w{2}", "abcd ef")
['ab', 'cd']

(?|...|...) branch reset

Capture group numbers are reused across all the alternatives, but groups with different names have different group numbers.

>>> regex.match(r"(?|(first)|(second))", "first").groups()
('first',)
>>> regex.match(r"(?|(first)|(second))", "second").groups()
('second',)

Note: there is only one group.

Timeout

The matching methods and functions support timeouts. The timeout (in seconds) applies to the entire operation:

>>> from time import sleep
>>>
>>> def fast_replace(m):
...     return 'X'
...
>>> def slow_replace(m):
...     sleep(0.5)
...     return 'X'
...
>>> regex.sub(r'[a-z]', fast_replace, 'abcde', timeout=2)
'XXXXX'
>>> regex.sub(r'[a-z]', slow_replace, 'abcde', timeout=2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python37\lib\site-packages\regex\regex.py", line 276, in sub
    endpos, concurrent, timeout)
TimeoutError: regex timed out
