A Detailed Summary of re (Regular Expressions)

IT运维大爆炸（运维开发）

Article link: https://blog.csdn.net/yunweimao/article/details/106688046

1、Importing the regular expression module

To use regular expressions in Python 3, you must first import the re module.

 import re  # import the regular expression module

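2、The main method, match(), which matches from the start of the string, left to right

 result = re.match(pattern, string)
 # pattern is the rule to check against
 # string is the string to be checked
 >>> import re
 >>> print(re.match('www', 'www.runoob.com').span())  # the pattern matches at the start
 (0, 3)
 >>> print(re.match('com', 'www.runoob.com'))  # the pattern does not match at the start
 None
 # If result is not None, its group() method extracts the matched text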
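
Because match() returns None when the pattern does not match at the start of the string, calling group() on the result without a check can raise AttributeError. A minimal sketch of a guarded call follows; the helper name `match_prefix` is made up for this illustration and is not part of the re module.

```python
import re

def match_prefix(pattern, text):
    """Return the text matched at the start of `text`, or None if nothing matches."""
    result = re.match(pattern, text)
    if result is None:          # no match at the start of the string
        return None
    return result.group()       # the whole matched substring

print(match_prefix('www', 'www.runoob.com'))  # www
print(match_prefix('com', 'www.runoob.com'))  # None
```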

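3、Regular expression matching rules

1. Single-character matching rules

 Character   Function
 .           Matches any single character (except \n)
 []          Matches any one of the characters listed inside []
 \d          Matches a digit, i.e. 0-9
 \D          Matches a non-digit, i.e. any character that is not a digit
 \s          Matches a whitespace character, e.g. space, \t, \n
 \S          Matches a non-whitespace character (the inverse of \s)
 \w          Matches a word character: a-z, A-Z, 0-9, _
 \W          Matches a non-word character (the inverse of \w)
 Note: for str patterns in Python 3, \d and \w also match other Unicode digits and letters unless the re.ASCII flag is set.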
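
The single-character rules can be tried out on a short string. The snippet below is only a sketch; the sample string and variable names are arbitrary choices for this illustration.

```python
import re

sample = "a1 _\tZ"   # letter, digit, space, underscore, tab, letter

digits = re.findall(r'\d', sample)      # digit characters
spaces = re.findall(r'\s', sample)      # whitespace characters
words = re.findall(r'\w', sample)       # word characters
listed = re.findall(r'[aZ]', sample)    # only the characters listed in []
dot_newline = re.match(r'.', '\n')      # . does not match a newline by default

print(digits)       # ['1']
print(spaces)       # [' ', '\t']
print(words)        # ['a', '1', '_', 'Z']
print(listed)       # ['a', 'Z']
print(dot_newline)  # None
```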

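2. Quantifier rules

 Character   Function
 *           Matches the preceding character zero or more times (it may be absent or repeated any number of times)
 +           Matches the preceding character one or more times (it must appear at least once)
 ?           Matches the preceding character zero or one time (either once or not at all)
 {m}         Matches the preceding character exactly m times
 {m,}        Matches the preceding character at least m times
 {m,n}       Matches the preceding character from m to n times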
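
To see how the quantifiers differ, one option is to apply each of them to the same input. This is a small sketch; the input 'abbb' and the `results` dictionary are chosen only for this illustration.

```python
import re

# Each pattern is 'a' followed by a differently quantified 'b'.
patterns = [r'ab*', r'ab+', r'ab?', r'ab{2}', r'ab{2,}', r'ab{1,2}']

results = {}
for pattern in patterns:
    m = re.match(pattern, 'abbb')
    results[pattern] = m.group()   # quantifiers are greedy: they take as much as allowed
    print(pattern, '->', results[pattern])
```

With this input, `ab?` stops after one b, `ab{2}` and `ab{1,2}` stop after two, and the unbounded forms `ab*`, `ab+` and `ab{2,}` consume all three.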

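3. Example: checking whether a mobile phone number follows the rules (ignoring boundary issues)

 # First, be clear about the rules for a mobile number
 # 1. all digits   2. length 11   3. the first digit is 1   4. the second digit is one of 3, 5, 6, 7, 8
 >>> import re
 >>> pattern = r"1[35678]\d{9}"
 >>> phoneStr = "18230092223"
 >>> result = re.match(pattern, phoneStr)
 >>> result.group()
 '18230092223'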
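
The example above ignores boundaries, so match() also accepts a longer string whose first 11 characters look like a phone number. One possible way to handle this is re.fullmatch(), which requires the whole string to match. The names `PHONE_PATTERN` and `is_valid_phone` below are hypothetical, and the rule set is the simplified one from this article, not a complete description of real phone numbers.

```python
import re

# Simplified rule from the article: 1, then one of 3/5/6/7/8, then nine more digits.
PHONE_PATTERN = re.compile(r"1[35678]\d{9}")

def is_valid_phone(text):
    """True only if the entire string is an 11-digit number following the rule."""
    return PHONE_PATTERN.fullmatch(text) is not None

print(is_valid_phone("18230092223"))    # True
print(is_valid_phone("182300922234"))   # False: 12 digits, one too many
print(is_valid_phone("12230092223"))    # False: second digit is 2

# match() alone only checks the start, so the 12-digit string still matches.
print(PHONE_PATTERN.match("182300922234") is not None)  # True
```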

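4、Boundaries

1. Meaning of the characters

 Character   Function
 ^           Matches the start of the string
 $           Matches the end of the string
 \b          Matches a word boundary
 \B          Matches a non-word boundary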

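2. Example: boundaries (write a pattern that matches "ho ve r" inside a string)

 # Define a rule that matches "ho ve r"
 # 1. Starts with word characters: ^\w
 # 2. Has whitespace in the middle: \s
 # 3. \b can be read in two ways:
 # '\b', without the r prefix, is treated by the interpreter as the escape for the backspace character;
 # r'\b', with the r prefix, is not escaped by the interpreter, so \b reaches the regex engine and means a word boundary.
 # 4. ve is delimited by a word boundary on each side
 >>> import re
 >>> text = "dasdho ve rgsdf"
 >>> pattern = r"^\w+\s\bve\b\sr"
 >>> result = re.match(pattern, text)
 >>> result.group()
 'dasdho ve r'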
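
The difference between '\b' and r'\b' is easy to check directly. The following sketch compares the two spellings; the variable names are arbitrary.

```python
import re

text = "ho ve r"

raw = re.search(r"\bve\b", text)     # \b reaches the regex engine: word boundary
not_raw = re.search("\bve\b", text)  # "\b" is a backspace character in a normal string

print(raw.group())               # ve
print(not_raw)                   # None: the text contains no backspace characters
print(len("\b"), len(r"\b"))     # 1 2

# \B is the opposite: "ve" inside "hover" has word characters on both sides.
inside = re.search(r"\Bve\B", "hover")
print(inside.group())                    # ve
print(re.search(r"\bve\b", "hover"))     # None
```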

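5、Grouping

1. Meaning of the characters

 Character      Function
 |              Matches either the expression on the left or the one on the right
 (ab)           Treats the characters in parentheses as a group
 \num           Refers back to the string matched by group number num
 (?P<name>...)  Gives the group a name
 (?P=name)      Refers back to the string matched by the group named name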
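
As an illustration of the table above, the sketch below uses a numbered backreference and a named backreference to require that a closing tag repeats the opening tag. The HTML-like strings and the group name `tag` are made up for this example; real HTML should not be parsed this way.

```python
import re

# \1 must match exactly what group 1 matched.
numbered = re.match(r"<(\w+)>.*</\1>", "<h1>title</h1>")
mismatch = re.match(r"<(\w+)>.*</\1>", "<h1>title</h2>")
print(numbered.group(1))   # h1
print(mismatch)            # None: the closing tag differs from the opening tag

# The same idea with a named group and (?P=name).
named = re.match(r"<(?P<tag>\w+)>.*</(?P=tag)>", "<p>text</p>")
print(named.group('tag'))  # p

# | picks whichever alternative matches.
either = re.match(r"cat|dog", "dog food")
print(either.group())      # dog
```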

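2. Matching numbers from 0 to 100

 # Match numbers from 0 to 100
 # First: a regular expression matches from left to right
 # On analysis, 0-100 can be split into three parts
 # 1. 0 "0$"
 # 2. 100 "100$"
 # 3. 1-99 "[1-9]\d{0,1}$"
 # Putting them together:
 >>> import re
 >>> pattern = r"0$|100$|[1-9]\d{0,1}$"
 >>> result = re.match(pattern, "27")
 >>> result.group()
 '27'
 >>> result = re.match(pattern, "212")
 >>> result.group()
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 AttributeError: 'NoneType' object has no attribute 'group'
 # Folding 0 into the 1-99 case, the pattern can be shortened to: pattern = r"100$|[1-9]?\d$"
 # (the last digit must stay mandatory; with \d{0,1} the pattern would also match an empty string)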
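
A range check like this can also be wrapped in a small function that returns a boolean instead of raising on a failed match. The sketch below uses re.fullmatch() in place of the trailing $ anchors; the names `RANGE_PATTERN` and `in_0_to_100` are hypothetical.

```python
import re

# 100, or an optional leading 1-9 followed by exactly one more digit (0-99).
RANGE_PATTERN = re.compile(r"100|[1-9]?\d")

def in_0_to_100(text):
    """True if `text` is the decimal form of an integer from 0 to 100, without leading zeros."""
    return RANGE_PATTERN.fullmatch(text) is not None

for candidate in ["0", "27", "100", "212", "05", ""]:
    print(repr(candidate), in_0_to_100(candidate))
```

Here "0", "27" and "100" are accepted, while "212", "05" and the empty string are rejected.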

3. Matching after a given string (lookbehind)

 # (?<=abc)def does not start searching at a; it finds d and then looks back. You will probably want search() rather than match():
 >>> import re
 >>> m = re.search('(?<=abc)def', 'abcdef')
 >>> m.group(0)
 'def'
 # Search for a word that follows a hyphen
 >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
 >>> m.group(0)
 'egg'
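
The reason search() is the better fit can be shown by running both functions on the same input. A short sketch, with variable names chosen only for this illustration:

```python
import re

pattern = r"(?<=abc)def"

with_match = re.match(pattern, "abcdef")    # tries only at index 0, where nothing precedes
with_search = re.search(pattern, "abcdef")  # scans forward until the lookbehind succeeds

print(with_match)            # None
print(with_search.span())    # (3, 6): the lookbehind text is not part of the match

after_hyphen = re.search(r"(?<=-)\w+", "spam-egg")
print(after_hyphen.start())  # 5: the match begins right after the hyphen
```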

4. With re.split(), if the pattern contains capturing parentheses, the text of all groups is also included in the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list

 >>> re.split(r'\W+', 'Words, words, words.')
 ['Words', 'words', 'words', '']
 >>> re.split(r'(\W+)', 'Words, words, words.')
 ['Words', ', ', 'words', ', ', 'words', '.', '']
 >>> re.split(r'\W+', 'Words, words, words.', maxsplit=1)
 ['Words', 'words, words.']
 >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
 ['0', '3', '9']
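
As a further sketch of maxsplit and capturing groups, consider splitting a "key = value" line where the value itself may contain the separator. The sample line and variable names are invented for this example.

```python
import re

line = "key = a = b"

all_parts = re.split(r"\s*=\s*", line)                  # split on every '='
first_only = re.split(r"\s*=\s*", line, maxsplit=1)     # split once, keep the rest intact
with_sep = re.split(r"\s*(=)\s*", line, maxsplit=1)     # the captured '=' is kept in the list

print(all_parts)    # ['key', 'a', 'b']
print(first_only)   # ['key', 'a = b']
print(with_sep)     # ['key', '=', 'a = b']
```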

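5. If the separator contains capturing groups and it matches at the start of the string, the result begins with an empty string. The same holds for the end of the string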

 >>> re.split(r'(\W+)', '...words, words...')
 ['', '...', 'words', ', ', 'words', '...', '']

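6. That way, separator components always appear at the same relative positions in the result list. Empty matches of the pattern also split the string, but only when they are not adjacent to a previous empty match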

 >>> re.split(r'\b', 'Words, words, words.')
 ['', 'Words', ', ', 'words', ', ', 'words', '.']
 >>> re.split(r'\W*', '...words...')
 ['', '', 'w', 'o', 'r', 'd', 's', '', '']
 >>> re.split(r'(\W*)', '...words...')
 ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

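7. Match.group() returns one or more subgroups of the match. If a group is contained in a part of the pattern that matched multiple times, the last match is returned

 >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")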
 >>> m.group(0)       # The entire match
 'Isaac Newton'
 >>> m.group(1)       # The first parenthesized subgroup.
 'Isaac'
 >>> m.group(2)       # The second parenthesized subgroup.
 'Newton'
 >>> m.group(1, 2)    # Multiple arguments give us a tuple.
 ('Isaac', 'Newton')

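8. If the regular expression uses the `(?P<name>...)` syntax, the groupN arguments may also be strings naming a group. If a string argument is not defined as a group name in the pattern, an `IndexError` is raised

 >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
 >>> m.group('first_name')
 'Malcolm'
 >>> m.group('last_name')
 'Reynolds'
 # Named groups can also be referred to by their index
 >>> m.group(1)
 'Malcolm'
 >>> m.group(2)
 'Reynolds'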
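
The IndexError case mentioned above can be reproduced with a group name that the pattern does not define. In this sketch the name `middle_name` is deliberately absent from the pattern, and the `raised` flag exists only for the demonstration.

```python
import re

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")

raised = False
try:
    m.group('middle_name')        # not a group name in this pattern
except IndexError as exc:
    raised = True
    print('IndexError:', exc)

print(raised)                     # True
```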

9. If a group matches multiple times, only the last match is accessible

 # Capture the last two characters
 >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
 >>> m.group(1)                        # Returns only the last match.
 'c3'

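10. Indexing a match object, `m[g]`, is identical to `m.group(g)`. This allows easier access to an individual group of a match

 # Each \w+ matches a run of word characters; the pattern captures two groups
 >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")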
 >>> m[0]       # The entire match
 'Isaac Newton'
 >>> m[1]       # The first parenthesized subgroup.
 'Isaac'
 >>> m[2]       # The second parenthesized subgroup.
 'Newton'

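11. `Match.groups(default=None)` returns a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to `None`

 >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")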
 >>> m.groups()
 ('24', '1632')

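12. If we make the decimal point and everything after it optional, not all groups may participate in the match. Such groups default to `None` unless the default argument is given

 >>> m = re.match(r"(\d+)\.?(\d+)?", "24")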
 >>> m.groups()      # Second group defaults to None.
 ('24', None)
 >>> m.groups('0')   # Now, the second group defaults to '0'.
 ('24', '0')

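13. `Match.groupdict(default=None)` returns a dictionary containing all the named subgroups of the match, keyed by group name. The default argument is used for groups that did not participate in the match; it defaults to `None`

 >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")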
 >>> m.groupdict()
 {'first_name': 'Malcolm', 'last_name': 'Reynolds'}
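
The default argument of groupdict() works the same way as for groups(). A small sketch, assuming a pattern with an optional fractional part; the group names `whole` and `frac` are invented for this illustration.

```python
import re

# The fractional part, including the dot, is optional.
m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")

without_default = m.groupdict()      # groups that did not participate are None
with_default = m.groupdict('0')      # ... or the supplied default

print(without_default)   # {'whole': '24', 'frac': None}
print(with_default)      # {'whole': '24', 'frac': '0'}
```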

14. This example uses the start() and end() positions of a match to remove remove_this from an email address

 >>> email = "tony@tiremove_thisger.net"
 >>> m = re.search("remove_this", email)
 >>> email[:m.start()] + email[m.end():]
 'tony@tiger.net'
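
The same slicing can be written with Match.span(), which returns the start and end positions as a tuple. The sketch below also compares the result with re.sub(), a different technique that is not used in the example above and is shown only as a cross-check.

```python
import re

email = "tony@tiremove_thisger.net"
m = re.search("remove_this", email)

start, end = m.span()                  # same values as m.start() and m.end()
cleaned = email[:start] + email[end:]  # cut the matched slice out of the string

print((start, end))   # (7, 18)
print(cleaned)        # tony@tiger.net

# Cross-check with re.sub(), which replaces every occurrence with an empty string.
print(re.sub("remove_this", "", email) == cleaned)  # True
```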

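15. `findall()` matches all occurrences of a pattern, not just the first one as `search()` does. For example, a writer who wants to find all the adverbs in a piece of text might use `findall()` as follows

 >>> text = "He was carefully disguised but captured quickly by police."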
 >>> re.findall(r"\w+ly", text)
 ['carefully', 'quickly']
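
One detail worth knowing is that findall() returns the group contents instead of the whole match when the pattern contains a capturing group. The sketch below uses the same sentence; capturing the stem before "ly" is just an illustrative choice.

```python
import re

text = "He was carefully disguised but captured quickly by police."

whole_words = re.findall(r"\w+ly", text)    # no group: whole matches are returned
stems = re.findall(r"(\w+)ly", text)        # one group: only the group text is returned

print(whole_words)   # ['carefully', 'quickly']
print(stems)         # ['careful', 'quick']
```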

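6、Finding all adverbs and their positions

If you need more information about each match, finditer() is useful: it yields match objects instead of strings. Continuing the previous example, a writer who wants to find all the adverbs and their positions in the text could use finditer() as follows

 >>> text = "He was carefully disguised but captured quickly by police."
 >>> for m in re.finditer(r"\w+ly", text):
 ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
 07-16: carefully
 40-47: quickly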
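
If the positions are needed later instead of being printed right away, the match objects from finditer() can be collected into a list. A minimal sketch, where the tuple layout (start, end, word) is an arbitrary choice:

```python
import re

text = "He was carefully disguised but captured quickly by police."

# One (start, end, word) tuple per match, in order of appearance.
adverbs = [(m.start(), m.end(), m.group(0)) for m in re.finditer(r"\w+ly", text)]
print(adverbs)   # [(7, 16, 'carefully'), (40, 47, 'quickly')]

# The stored positions can be used to slice the original text again.
for start, end, word in adverbs:
    print(text[start:end] == word)   # True
```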