Regular Expressions in Python 3, Part 2

纸球_o

Category: Python | Tags: regular expressions, python

Article link: https://blog.csdn.net/qq_39527601/article/details/99428449


1. Groups

A subexpression enclosed in parentheses is called a group. Quantifiers such as *, +, ?, and {m,n} can be applied to a group to repeat its contents.
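To make the idea of repeating a whole group concrete, here is a minimal sketch. The sample strings and variable names are invented for illustration and do not come from the original post.

```python
import re

# (ab)+ repeats the whole group "ab", not just the letter "b"
m = re.match(r"(ab)+", "ababab")
whole = m.group(0)   # the full text matched by the pattern
last = m.group(1)    # a repeated group keeps only its last repetition

# {3} applied to a group: three "digits plus dot" chunks, then final digits
is_dotted = re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", "192.168.0.1") is not None

print(whole, last, is_dotted)
```

Note that a quantified group reports only the text of its final repetition, which is why `group(1)` here is a single `ab`.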

Example: given an HTML tag content="<a>你是谁？</a>", extract the tag name and the text inside it.

>>> import re
>>> content = "<a>你是谁？</a>"
>>> re.match(r"<(\w+)>(.*)</\1>",content).group(1)   # group 1 is (\w+), group 2 is (.*); \1 refers back to the text group 1 matched
'a'
>>> re.match(r"<(\w+)>(.*)</\1>",content).group(2)
'你是谁？'

A group is wrapped in (); elsewhere in the pattern, \n refers to the text matched by the n-th pair of parentheses.
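A backreference must match exactly the text its group captured, not merely the same sub-pattern. The sketch below illustrates this with two invented strings (they are assumptions for the example, not taken from the post).

```python
import re

# \1 must repeat the very same text that group 1 captured
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
word = doubled.group(1)

# Mismatched closing tag: group 1 captured "a", but the text closes with "b"
mismatch = re.match(r"<(\w+)>(.*)</\1>", "<a>text</b>")

print(word, mismatch)
```

The second call returns None because `</\1>` can only match `</a>` once group 1 has captured `a`.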

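Groups are specified with (), and the match object records the start and end index of the text each group matched. Passing a group number to group(), start(), end(), or span() retrieves that information. Groups are numbered from 0; group 0 always exists and stands for the whole match.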

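group() returns a subgroup of the match by index or by name; group(0) returns the whole match.
start() returns the index at which the substring matched by the group starts.
end() returns the index just past the end of the substring matched by the group.
span() returns the 2-tuple (m.start(group), m.end(group)).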
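The four methods above can be tried on one match object, as in this small sketch. The named-group syntax `(?P<tag>...)` is an addition for illustration (the post only mentions that group() accepts a name), and the sample string is made up.

```python
import re

# (?P<tag>...) gives group 1 a name so it can be fetched by index or by name
m = re.match(r"<(?P<tag>\w+)>(.*)</\1>", "<a>hello</a>")

tag_by_index = m.group(1)
tag_by_name = m.group("tag")
body_start = m.start(2)   # index of the first character of group 2
body_end = m.end(2)       # index just past the last character of group 2
body_span = m.span(2)     # the same two numbers as a tuple
whole_span = m.span()     # no argument means group 0, the whole match

print(tag_by_index, tag_by_name, body_span, whole_span)
```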

Exercise: extract the text inside the tags of <html><a>z你好啊</a></html><html><a>H你好呀</a></html>.

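Method 1:
>>> content = "<html><a>z你好啊</a></html><html><a>H你好呀</a></html>"
>>> re.match(".*<a>(.*)</a>.*",content).group(1)  # only one group in the pattern; the greedy leading .* pushes it to the last <a> element
'H你好呀'
>>> re.match(".*<a>(.*)</a>.*",content).group(2)  # there is no group 2, so this raises
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group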

Method 2:
>>> re.findall("<a>(.*)</a>",content)    # wrong result: the greedy .* runs from the first <a> to the last </a>
['z你好啊</a></html><html><a>H你好呀']

Method 3:
>>> re.match(".*<a>(.*)</a>.*<a>(.*)</a>.*",content).group(1)   # works, but tedious: with many such tags the pattern becomes painfully long
'z你好啊'
>>> re.match(".*<a>(.*)</a>.*<a>(.*)</a>.*",content).group(2)
'H你好呀'

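2. Greedy and non-greedy matching

Quantifiers in Python regular expressions have two matching modes: greedy and non-greedy (also called lazy).

Greedy mode (the default in Python): a quantifier matches as much text as possible while still letting the overall pattern match.

Non-greedy mode: a quantifier matches as little text as possible while still letting the overall pattern match.

Switching: appending ? to *, ?, +, or {m,n} turns that quantifier from greedy into non-greedy.

So the exercise above can be solved like this: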

Method 4:
>>> re.findall("<a>(.*?)</a>",content)
['z你好啊', 'H你好呀']
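The difference between the two modes is easiest to see on a tiny input. This is a minimal sketch with an invented string `aaaa`; each pair uses the same quantifier with and without the trailing ?.

```python
import re

text = "aaaa"

greedy_plus = re.match(r"a+", text).group()        # takes every "a" it can
lazy_plus = re.match(r"a+?", text).group()         # stops after the required one
greedy_range = re.match(r"a{2,4}", text).group()   # goes up to the maximum, 4
lazy_range = re.match(r"a{2,4}?", text).group()    # stops at the minimum, 2
lazy_star = re.match(r"a*?", text).group()         # zero repetitions are enough

print(greedy_plus, lazy_plus, greedy_range, lazy_range, repr(lazy_star))
```

A lazy quantifier still expands when the rest of the pattern requires it, which is why `(.*?)` in Method 4 grows until it reaches the nearest `</a>`.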

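3. Functions in re (some were covered in the previous post, so not all are described in detail here):

split() # splits the string at every match of the pattern and returns a list
finditer() # finds all matches and returns an iterator
sub() # replaces matched text with the replacement you supply (useful for data cleaning)
compile() # defines the pattern first and builds a regular expression object, which is then used to match data. Note: the optional second argument (flags) changes how the pattern is interpreted.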

split()

>>> re.split(r"\.","www.baidu.com")    # raw string, so that \. reaches re unchanged
['www', 'baidu', 'com']
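A few variations on split() may be useful; this sketch assumes nothing beyond the standard `re.split` signature, and the last input string is invented for the example.

```python
import re

# maxsplit limits how many cuts are made
parts_limited = re.split(r"\.", "www.baidu.com", maxsplit=1)

# A capturing group in the pattern keeps the separators in the result
parts_with_sep = re.split(r"(\.)", "www.baidu.com")

# A character class lets several different separators be handled at once
parts_multi = re.split(r"[,;\s]+", "a, b;c  d")

print(parts_limited, parts_with_sep, parts_multi)
```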

finditer()

>>> s=re.finditer("[a-z]","ZhangLi")

>>> next(s)
<_sre.SRE_Match object; span=(1, 2), match='h'>
>>> next(s)
<_sre.SRE_Match object; span=(2, 3), match='a'>
>>> next(s)
<_sre.SRE_Match object; span=(3, 4), match='n'>
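Instead of calling next() by hand, the iterator is usually consumed in a loop or comprehension. A minimal sketch, using `[a-z]+` (one or more letters, a small change from the pattern above) so that each match is a run of lowercase letters:

```python
import re

# Each item yielded by finditer() is a match object
found = [(m.group(), m.span()) for m in re.finditer(r"[a-z]+", "ZhangLi")]

# An iterator can be consumed only once
it = re.finditer(r"[a-z]+", "ZhangLi")
first_pass = [m.group() for m in it]
second_pass = [m.group() for m in it]   # already exhausted

print(found, first_pass, second_pass)
```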

sub()

>>> content          
'<html><a>z你好啊</a></html><html><a>H你好呀</a></html>'
>>> re.sub(r"</?\w+>","",content)      # strip every tag, leaving only the text
'z你好啊H你好呀'
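sub() accepts more than a fixed replacement string. The sketch below shows a backreference in the replacement, a function as the replacement, the count argument, and subn(); all sample strings are invented for illustration.

```python
import re

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-20")

# The replacement can be a function that receives each match object
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 5 pears")

# count limits how many matches are replaced
first_only = re.sub(r"</?\w+>", "", "<a>x</a>", count=1)

# subn() also reports how many replacements were made
cleaned, n = re.subn(r"</?\w+>", "", "<a>x</a>")

print(swapped, doubled, first_only, cleaned, n)
```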

compile()

>>> p=re.compile(r"[A-Z]\d+")        # define a pattern: one uppercase letter followed by digits
>>> p.match("A10")             # returns a match object covering 'A10'

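① The second argument of compile():

➢ re.I (IGNORECASE): ignore case; the name in parentheses is the full spelling
➢ re.M (MULTILINE): multi-line mode; ^ and $ also match at the start and end of each line
➢ re.S (DOTALL): . matches any character, including a newline
➢ re.L (LOCALE): locale-dependent matching; not recommended (in Python 3 it works only with bytes patterns)
➢ re.U (UNICODE): \w \W \s \S \d \D follow the Unicode character properties; in Python 3 this is the default for str patterns
➢ re.X (VERBOSE): verbose mode; the pattern string may span multiple lines, whitespace in it is ignored, and comments can be added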
>>> c=r'''
... [A-Z]\d+
... '''
>>> p=re.compile(c,re.X)
>>> p.match("A00")
<_sre.SRE_Match object; span=(0, 3), match='A00'>
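The effect of the more common flags can be checked with a short sketch like the one below. The sample strings are made up; flags are combined with the | operator.

```python
import re

# re.I: case is ignored
ignore_case = re.findall(r"python", "Python PYTHON python", re.I)

# re.M: ^ matches at the start of every line, not only at the start of the string
line_starts = re.findall(r"^\w+", "one 1\ntwo 2", re.M)

# re.S: without it, . does not match a newline
no_dotall = re.match(r"a.b", "a\nb")
with_dotall = re.match(r"a.b", "a\nb", re.S)

# re.X: whitespace and comments inside the pattern are ignored
combined = re.compile(r"""
    ^[a-z]+   # a word at the start of a line
    \s\d      # one whitespace character, then one digit
""", re.I | re.M | re.X)
combined_hits = combined.findall("One 1\nTWO 2")

print(ignore_case, line_starts, no_dotall, with_dotall, combined_hits)
```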

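② compile() compiles a pattern string into a regular expression object, whose methods such as match() can then be used. An excerpt of the help() output for such an object: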

 |  findall(self, /, string=None, pos=0, endpos=9223372036854775807, *, source=None)
 |      Return a list of all non-overlapping matches of pattern in string.
 |
 |  finditer(self, /, string, pos=0, endpos=9223372036854775807)
 |      Return an iterator over all non-overlapping matches for the RE pattern in string.
 |
 |      For each match, the iterator returns a match object.
 |
 |  fullmatch(self, /, string=None, pos=0, endpos=9223372036854775807, *, pattern=None)
 |      Matches against all of the string
 |
 |  match(self, /, string=None, pos=0, endpos=9223372036854775807, *, pattern=None)
 |      Matches zero or more characters at the beginning of the string.
 |
 |  scanner(self, /, string, pos=0, endpos=9223372036854775807)
 |
 |  search(self, /, string=None, pos=0, endpos=9223372036854775807, *, pattern=None)
 |      Scan through string looking for a match, and return a corresponding match object instance.
 |
 |      Return None if no position in the string matches.
 |
 |  split(self, /, string=None, maxsplit=0, *, source=None)
 |      Split string by the occurrences of pattern.
 |
 |  sub(self, /, repl, string, count=0)
 |      Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
 |
 |  subn(self, /, repl, string, count=0)
 |      Return the tuple (new_string, number_of_subs_made) found by replacing the leftmost non-overlapping occurrences of pattern with the replacement repl.
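As a rough illustration of the pos and endpos parameters that appear in these signatures, here is a small sketch using a compiled pattern. The text and variable names are assumptions made for the example.

```python
import re

p = re.compile(r"[A-Z]\d+")
text = "A10 B20 c30 D40"

first = p.search(text).group()           # scanning starts at index 0
from_pos = p.search(text, 4).group()     # pos=4: scanning starts at "B20"
all_hits = p.findall(text)               # "c30" is skipped: lowercase letter
windowed = p.findall(text, 0, 7)         # endpos=7: only "A10 B20" is examined

# fullmatch() succeeds only if the whole string matches the pattern
exact = p.fullmatch("A10") is not None
partial = p.fullmatch("A10x") is not None

print(first, from_pos, all_hits, windowed, exact, partial)
```

Compiling once and reusing the object is mainly a convenience when the same pattern is applied many times.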
