Python's re module


凯凯恺恺恺恺凯凯

Article link: https://blog.csdn.net/weixin_42832313/article/details/106418099

The re module

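At its core, a regular expression (RE) is a small, highly specialized programming language embedded in Python and made available through the re module. A regular expression pattern is compiled into a series of bytecodes, which are then executed by a matching engine written in C.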

I. Special characters in regular expressions

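\w      Matches a letter (including Chinese characters), a digit, or an underscore
\W      Matches any character that is not a letter, digit, or underscore
\s      Matches any whitespace character
\S      Matches any non-whitespace character
\d      Matches a digit
\D      Matches a non-digit
\A      Matches only at the start of the string
\Z      Matches only at the very end of the string (Python spells this \Z; unlike $, it does not match before a trailing newline)
\n      Matches a newline
\t      Matches a tab

.       Matches any character except a newline; with the re.DOTALL flag it matches any character, newline included
[...]   Matches any one character in the set
[^...]  Matches any one character not in the set
*       Matches 0 or more of the item to its left (greedy)
+       Matches 1 or more of the item to its left (greedy)
?       Matches 0 or 1 of the item to its left (greedy); placed after another quantifier, it makes that quantifier non-greedy
{n}     Matches exactly n repetitions of the preceding expression
{n,m}   Matches n to m repetitions of the preceding expression (greedy)
a|b     Matches a or b
()      Matches the expression inside the parentheses and also defines a group
^       Matches the start of the string
$       Matches the end of the string, or just before a trailing newline

[]      Matches any one character listed inside the brackets
[^]     Matches any one character not listed inside the brackets
[-]     Matches any single character in the given range
{n,}    Matches the preceding item at least n times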
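A few rows of the table can be checked directly. The snippet below is only an illustrative sketch; the sample strings are made up for the demonstration and are not from the original article.

```python
import re

# For str patterns, \w matches Unicode word characters, Chinese included
words = re.findall(r'\w+', 'hello 世界_1 !')
print(words)  # ['hello', '世界_1']

# With re.ASCII, \w is restricted to [a-zA-Z0-9_]
ascii_words = re.findall(r'\w+', 'hello 世界_1 !', flags=re.ASCII)
print(ascii_words)  # ['hello', '_1']

# . skips newlines unless re.DOTALL is given
no_dotall = re.findall(r'a.b', 'a\nb')
with_dotall = re.findall(r'a.b', 'a\nb', flags=re.DOTALL)
print(no_dotall, with_dotall)  # [] ['a\nb']

# $ also matches just before a trailing newline; \Z does not
dollar = re.search(r'end$', 'the end\n') is not None
big_z = re.search(r'end\Z', 'the end\n') is not None
print(dollar, big_z)  # True False
```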

1. Metacharacters . ^ $ * + ? { }

import re
 
ret=re.findall('a..in','helloalvin')
print(ret)#['alvin']
 
ret=re.findall('^a...n','alvinhelloawwwn')
print(ret)#['alvin']
 
ret=re.findall('a...n$','alvinhelloawwwn')
print(ret)#['awwwn']
 
ret=re.findall('abc*','abcccc')# greedy, c repeated [0,+oo) times
print(ret)#['abcccc']
 
ret=re.findall('abc+','abccc')# c repeated [1,+oo) times
print(ret)#['abccc']

ret=re.findall('abc?','abccc')# c repeated [0,1] times
print(ret)#['abc']
 
ret=re.findall('abc{1,4}','abccc')
print(ret)#['abccc'] greedy

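Note: the quantifiers above (*, +, ? and {n,m}) are all greedy, meaning they match as much as possible. Appending ? to a quantifier makes it lazy, so it matches as little as possible.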

ret=re.findall('abc*?','abcccccc')
print(ret)#['ab']
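The difference between greedy and lazy matching is easier to see on a longer string. This is a small sketch with a made-up HTML fragment, not an example from the original article:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs on to the last '>' in the string
greedy = re.findall(r'<.*>', html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: .*? stops at the first '>' it can reach
lazy = re.findall(r'<.*?>', html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
```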

2. Metacharacters: character sets []

#--------------------------------------------character sets []
ret=re.findall('a[bc]d','acd')
print(ret)#['acd']
 
ret=re.findall('[a-z]','acd')
print(ret)#['a', 'c', 'd']
 
ret=re.findall('[.*+]','a.cd+')# . * + lose their special meaning inside a set
print(ret)#['.', '+']
 
# Characters that keep a special meaning inside a set: - ^ \
 
ret=re.findall('[1-9]','45dha3')
print(ret)#['4', '5', '3']
 
ret=re.findall('[^ab]','45bdha3')
print(ret)#['4', '5', 'd', 'h', '3']
 
ret=re.findall(r'[\d]','45bdha3')
print(ret)#['4', '5', '3']

3. Metacharacters: the escape character \

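A backslash followed by a metacharacter removes its special meaning, for example \.
A backslash followed by certain ordinary characters gives them a special meaning, for example \d

\d matches any decimal digit; for ASCII text this is equivalent to the class [0-9].
\D matches any non-digit character; equivalent to [^0-9].
\s matches any whitespace character; for ASCII text this is equivalent to [ \t\n\r\f\v].
\S matches any non-whitespace character; equivalent to [^ \t\n\r\f\v].
\w matches any alphanumeric character; with the re.ASCII flag this is equivalent to [a-zA-Z0-9_]. Without the flag, str patterns also match Unicode letters and digits.
\W matches any non-alphanumeric character; with re.ASCII this is equivalent to [^a-zA-Z0-9_].
\b matches a word boundary, that is, the empty position between a \w character and a \W character (such as a space, & or #) or the start or end of the string.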

ret=re.findall('I\b','I am LIST')# '\b' in a normal string is the backspace character
print(ret)#[]
ret=re.findall(r'I\b','I am LIST')# the raw string passes \b to the regex engine
print(ret)#['I']

Let's look at the backslash more closely, starting with the two examples below:

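#-----------------------------eg1:
import re
# The subject is written as r'abc\le' so that it clearly contains a literal backslash.
# Both 'c\l' and 'c\\l' hand the regex c\l to the engine, which current Python 3
# versions reject with re.error (bad escape \l), so those two calls are left out.
ret=re.findall('c\\\\l',r'abc\le')# the regex is c\\l: c, a literal backslash, l
print(ret)#['c\\l']
ret=re.findall(r'c\\l',r'abc\le')# the same regex written as a raw string
print(ret)#['c\\l']
 
#-----------------------------eg2:
# \b is used here because in a normal string literal it is the backspace character (ASCII 8)
m = re.findall('\bblow', 'blow')
print(m)#[]
m = re.findall(r'\bblow', 'blow')
print(m)#['blow']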
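To make the raw-string point concrete, the sketch below compares what the two literal forms actually contain before the regex engine ever sees them. The sample text is invented for illustration.

```python
import re

# In a normal literal, '\b' is a single character: backspace (ASCII 8)
plain_len, plain_code = len('\b'), ord('\b')
print(plain_len, plain_code)  # 1 8

# In a raw literal, r'\b' is two characters: a backslash and the letter b
raw_len = len(r'\b')
print(raw_len)  # 2

# Only the raw form reaches the regex engine as a word-boundary assertion
plain = re.findall('\bcat', 'cat concat')
raw = re.findall(r'\bcat', 'cat concat')
print(plain, raw)  # [] ['cat']
```

Writing every pattern as a raw string avoids having to think about two layers of escaping.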


4. Metacharacters: grouping ()

m = re.findall(r'(ad)+', 'add')
print(m)#['ad']
 
ret=re.search(r'(?P<id>\d{2})/(?P<name>\w{3})','23/com')
print(ret.group())#23/com
print(ret.group('id'))#23

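import re
aa = re.findall("(abc)+","abcabcabcef")  # abc is treated as one unit; findall returns the group
bb = re.findall("(?:abc)+","abcabcabcef") # ?: makes the group non-capturing, so the whole match is returned

print(aa)#['abc']
print(bb)#['abcabcabc']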

5. Metacharacters: alternation |

ret=re.search(r'(ab)|\d','rabhdg8sd')
print(ret.group())#ab
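Two properties of | are worth seeing side by side: alternatives are tried from left to right, and | binds more loosely than concatenation. The following is a minimal sketch with invented sample strings:

```python
import re

# Alternatives are tried left to right; the first one that matches wins
first_short = re.search(r'a|ab', 'ab').group()
first_long = re.search(r'ab|a', 'ab').group()
print(first_short, first_long)  # a ab

# A group limits the choice to a single character position
both = re.findall(r'gr(?:a|e)y', 'gray grey')
print(both)  # ['gray', 'grey']

# Without the group, the pattern reads as 'gra' OR 'ey'
loose = re.findall(r'gra|ey', 'gray grey')
print(loose)  # ['gra', 'ey']
```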

II. Commonly used functions in the re module

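search()
Scans the string from the start and stops at the first match. The search method of a compiled regex object also accepts a start position and an end position. It returns a match object.

When the pattern matches, search returns a match object (re.Match in Python 3.7+, displayed as SRE_Match in older versions). The difference between search and match is that match only matches at the beginning of the string, while search can find a match at any position. What they have in common is that both return a match object on success and None on failure. Note that search only finds the first match: even if the string contains several matches, only the first is returned. To get all of them, the simplest option is findall; finditer works as well.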

import re
s = '''bottle\nbag\nbig\napple'''
regex = re.compile('b')
result = regex.search(s,1)
print(1,result)# scans from position 1 and returns the first match found
regex = re.compile('^b',re.M)
result = regex.search(s)
print(2,result)# returns as soon as a match is found, multiline or not
result = regex.search(s,8)
print(3,result)# the b of big

# Output on Python 3.7+ (older versions print _sre.SRE_Match instead of re.Match):
# 1 <re.Match object; span=(7, 8), match='b'>
# 2 <re.Match object; span=(0, 1), match='b'>
# 3 <re.Match object; span=(11, 12), match='b'>
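The relationship between match, search, findall and finditer described above can be summarized on one string. This is a small sketch; the sample text is an assumption chosen for the demonstration.

```python
import re

text = 'id=42, id=7'

# match only looks at the start of the string, which is not a digit here
at_start = re.match(r'\d+', text)
print(at_start)  # None

# search returns only the first match, wherever it is
first = re.search(r'\d+', text)
print(first.group(), first.span())  # 42 (3, 5)

# findall returns every matching substring
all_numbers = re.findall(r'\d+', text)
print(all_numbers)  # ['42', '7']

# finditer yields a match object per match, so positions are available too
spans = [m.span() for m in re.finditer(r'\d+', text)]
print(spans)  # [(3, 5), (10, 11)]
```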

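match()

Matching starts at the first character of the string; a match object is returned if the pattern matches there, otherwise None.
The match method is similar to the str.startswith method, but more powerful and expressive because it works with regular expressions. It matches against the beginning of the string.
If the pattern matches, a match object is returned.
If the pattern does not match, None is returned, so for a plain prefix check it is used almost exactly like startswith.
For example, suppose we want to check whether a string starts with what, or whether it starts with digits.

The match method of a compiled regex object also accepts a start position and an end position, and returns a match object.
Signature: re.match(pattern, string, flags=0)
regex.match(string[, pos[, endpos]])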

import re
test = '''bottle\nbag'''
regex = re.compile('b.+')
matcher = regex.match(test,1)
print(matcher)# None: position 1 is 'o', and match does not scan forward
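The pos and endpos arguments of a compiled pattern's match method can be explored with the same test string. The sketch below is illustrative; the pattern b\w+ is chosen here for the demonstration and is not from the original article.

```python
import re

text = 'bottle\nbag'          # 'bag' starts at index 7
regex = re.compile(r'b\w+')

# pos=7: matching is attempted exactly at index 7
m = regex.match(text, 7)
print(m.group(), m.span())    # bag (7, 10)

# endpos=9: the string is treated as if it ended at index 9
m_short = regex.match(text, 7, 9)
print(m_short.group())        # ba

# '^' still refers to the real start of the string, not to pos
anchored = re.compile(r'^b\w+').match(text, 7)
print(anchored)               # None
```

Note that pos is not the same as slicing: with a slice, ^ would match at the start of the sliced string.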

import re
s_true = "what is a boy"
s_false = "What is a boy"
re_obj = re.compile("what")
  
print(re_obj.match(string=s_true))
# <re.Match object; span=(0, 4), match='what'>
  
print(re_obj.match(string=s_false))
# None
  
s_true = "123what is a boy"
s_false = "what is a boy"
 
re_obj = re.compile(r"\d+")
  
print(re_obj.match(s_true))
# <re.Match object; span=(0, 3), match='123'>
  
print(re_obj.match(s_true).start())
# 0

print(re_obj.match(s_true).end())
# 3

print(re_obj.match(s_true).string)
# 123what is a boy

print(re_obj.match(s_true).group())
# 123
print(re_obj.match(s_false))
# None

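findall()

The findall method searches the string for the pattern and returns all matching substrings as a list.
If nothing in the text matches the pattern, it returns an empty list.
If one substring matches, it returns a list with one element. So whatever the outcome, the result of findall can be iterated over directly without error, which means fewer special cases to handle and simpler code.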

# re.findall() returns all substrings that match the pattern
import re
re_str = "hello this is python 2.7.13 and python 3.4.5"
pattern = r"python [0-9]\.[0-9]\.[0-9]"
res = re.findall(pattern=pattern,string=re_str)
print(res)  
# ['python 2.7.1', 'python 3.4.5']

pattern = r"python [0-9]\.[0-9]\.[0-9]{2,}"
res = re.findall(pattern=pattern,string=re_str)
print(res)
# ['python 2.7.13']

pattern = r"python[0-9]\.[0-9]\.[0-9]{2,}"
res = re.findall(pattern=pattern,string=re_str)
print(res) 
# []
  
# re.findall() returns a list: the matched substrings if there are any, otherwise an empty list

re_str = "hello this is python 2.7.13 and Python 3.4.5" 
pattern = r"python [0-9]\.[0-9]\.[0-9]"
res = re.findall(pattern=pattern,string=re_str,flags=re.IGNORECASE)
print(res)  
# ['python 2.7.1', 'Python 3.4.5'] 
# The flag flags=re.IGNORECASE makes the match case-insensitive

finditer()

finditer returns an iterator; iterating over it yields one match object per match. If nothing matches, the iterator is simply empty.

import re 
re_str = "what is a different between python 2.7.14 and python 3.5.4"
re_obj = re.compile(r"\d{1,}\.\d{1,}\.\d{1,}")
  
for i in re_obj.finditer(re_str):
  print(i)
  
# <re.Match object; span=(35, 41), match='2.7.14'>
# <re.Match object; span=(53, 58), match='3.5.4'>

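split()

The split function of the re module does the same job as the split method of Python strings: both break a string into a list of substrings. The difference is that re.split can use a regular expression as the separator.
In the example below, the string is split on a period, space, colon, or exclamation mark, and the result is a list.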

import re 
re_str = "what is a different between python 2.7.14 and python 3.5.4 USA:NewYork!Zidan.FRA"  
re_obj = re.compile("[. :!]")
  
print(re_obj.split(re_str))
# ['what', 'is', 'a', 'different', 'between', 'python', '2', '7', '14', 'and', 'python', '3', '5', '4', 'USA', 'NewYork', 'Zidan', 'FRA']
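Beyond the basic call, re.split has two behaviours that str.split lacks: a maxsplit limit combined with a pattern, and keeping the separators when the pattern contains a capturing group. A minimal sketch with a made-up input string:

```python
import re

# Split on a comma or semicolon followed by optional whitespace
parts = re.split(r'[,;]\s*', 'a, b;c,d')
print(parts)  # ['a', 'b', 'c', 'd']

# maxsplit limits the number of splits; the remainder is returned as is
limited = re.split(r'[,;]\s*', 'a, b;c,d', maxsplit=1)
print(limited)  # ['a', 'b;c,d']

# A capturing group in the pattern keeps the separators in the result
with_seps = re.split(r'([,;])\s*', 'a, b;c')
print(with_seps)  # ['a', ',', 'b', ';', 'c']
```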

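sub()

Replaces the matched substrings and returns the resulting string.
The sub function of the re module is similar to the replace method of strings, except that sub accepts a regular expression, so it covers a much wider range of use cases.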

import re 
re_str = "what is a different between python 2.7.14 and python 3.5.4"
re_obj = re.compile(r"\d{1,}\.\d{1,}\.\d{1,}")
  
print(re_obj.sub("a.b.c",re_str,count=1))
# what is a different between python a.b.c and python 3.5.4
  
print(re_obj.sub("a.b.c",re_str,count=2))
# what is a different between python a.b.c and python a.b.c
  
print(re_obj.sub("a.b.c",re_str))
# what is a different between python a.b.c and python a.b.c
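The replacement passed to sub does not have to be a fixed string: it can refer back to groups, or it can be a function that receives each match object. The sketch below reuses the version-number idea from the example above; the helper name bump_major is hypothetical and introduced only for illustration.

```python
import re

text = 'python 2.7.14 and python 3.5.4'

# Backreferences \1 \2 \3 in the replacement reorder the captured groups
swapped = re.sub(r'(\d+)\.(\d+)\.(\d+)', r'\3.\2.\1', text)
print(swapped)  # python 14.7.2 and python 4.5.3


def bump_major(match):
    """Hypothetical helper: increase the major version number by one."""
    major, rest = match.group(1), match.group(2)
    return f'{int(major) + 1}{rest}'


# A function as the replacement is called once per match
bumped = re.sub(r'(\d+)((?:\.\d+){2})', bump_major, text)
print(bumped)  # python 3.7.14 and python 4.5.4
```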

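compile()

Compiles a regular expression. The resulting object can then be used for search, match, findall, and finditer, which saves time when the pattern is reused.

re.compile(pattern, flags=0)
pattern is the regular expression string and flags are the options. Compiling first is recommended for patterns that are used repeatedly: the compiled result is kept, so the same pattern does not have to be compiled again the next time. Note that the module-level functions also cache recently compiled patterns, so the gain depends on the workload.

# We generally use Python's regex module through compiled patterns. On large volumes of data this can improve performance; readers can measure the difference themselves.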

import re
re_str = "hello this is python 2.7.13 and Python 3.4.5"
re_obj = re.compile(pattern = r"python [0-9]\.[0-9]\.[0-9]",flags=re.IGNORECASE)
res = re_obj.findall(re_str)
print(res)
# ['python 2.7.1', 'Python 3.4.5']
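A typical way to use a compiled pattern is to build it once and apply it to many inputs. The following is a minimal sketch under assumed data: the name VERSION and the sample lines are invented for illustration.

```python
import re

# Compile once, at module level, and reuse for every line
VERSION = re.compile(r'python (\d+)\.(\d+)', flags=re.IGNORECASE)

lines = ['Python 3.4 is old', 'no version here', 'python 2.7 is older']

found = []
for line in lines:
    m = VERSION.search(line)
    if m:  # search returns None when a line has no match
        found.append((int(m.group(1)), int(m.group(2))))

print(found)  # [(3, 4), (2, 7)]

# The compiled object remembers its source pattern and flags
print(VERSION.pattern)
has_ignorecase = bool(VERSION.flags & re.IGNORECASE)
print(has_ignorecase)  # True
```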

Summary of the common functions in the re module:

import re
#1
re.findall('a','alvin yuan')    # returns all matches, collected in a list
#2
re.search('a','alvin yuan').group()  # scans the string for the pattern until the first match and returns an object holding the match information;
                                     # call group() on that object to get the matched string. If nothing matches, search returns None.
 
#3
re.match('a','abc').group()     # same as search, but only matches at the start of the string
 
#4
ret=re.split('[ab]','abcd')     # first split on 'a' to get '' and 'bcd', then split '' and 'bcd' on 'b'
print(ret)#['', '', 'cd']
 
#5
ret=re.sub(r'\d','abc','alvin5yuan6',count=1)
print(ret)#alvinabcyuan6
ret=re.subn(r'\d','abc','alvin5yuan6')   # subn also returns the number of replacements
print(ret)#('alvinabcyuanabc', 2)
 
#6
obj=re.compile(r'\d{3}')
ret=obj.search('abc123eeee')
print(ret.group())#123

import re
ret=re.finditer(r'\d','ds3sy4784a')
print(ret)        #<callable_iterator object at 0x10195f940>
 
print(next(ret).group())#3
print(next(ret).group())#4

Note:

import re
 
ret=re.findall(r'www\.(baidu|oldboy)\.com','www.oldboy.com')
print(ret)#['oldboy']     when the pattern has a capturing group, findall returns the group contents; to get the whole match, make the group non-capturing
 
ret=re.findall(r'www\.(?:baidu|oldboy)\.com','www.oldboy.com')
print(ret)#['www.oldboy.com']
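The same rule extends to patterns with more than one group, where findall returns tuples. When both the whole match and the groups are needed, finditer is one option. A small sketch, using an invented two-domain input:

```python
import re

text = 'www.baidu.com www.oldboy.com'

# Two capturing groups: findall returns a list of tuples, one per match
pairs = re.findall(r'(www)\.(baidu|oldboy)\.com', text)
print(pairs)  # [('www', 'baidu'), ('www', 'oldboy')]

# finditer gives access to the whole match and to each group
matches = [(m.group(0), m.group(1))
           for m in re.finditer(r'www\.(baidu|oldboy)\.com', text)]
print(matches)  # [('www.baidu.com', 'baidu'), ('www.oldboy.com', 'oldboy')]
```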

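import re

# (?P=tag_name) and \1 are backreferences: the closing tag must repeat the opening tag name
print(re.findall(r"<(?P<tag_name>\w+)>\w+</(?P=tag_name)>","<h1>hello</h1>"))#['h1']
print(re.search(r"<(?P<tag_name>\w+)>\w+</(?P=tag_name)>","<h1>hello</h1>").group())#<h1>hello</h1>
print(re.search(r"<(\w+)>\w+</\1>","<h1>hello</h1>").group())#<h1>hello</h1>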

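# Find all the integers
import re

# The first alternative consumes decimals without capturing them, so they show up as ''
ret=re.findall(r"-?\d+\.\d*|(-?\d+)","1-2*(60+(-40.35/5)-(-4*3))")
ret=[x for x in ret if x]   # drop every empty entry, not just the first

print(ret)#['1', '-2', '60', '5', '-4', '3']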

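Data captured by the parenthesized parts of a pattern is stored in groups.

The match and search functions return a match object; findall returns a list of strings; finditer yields match objects one at a time.

If the pattern contains groups and there is a match, the results are available on the match object:
1 group(N) returns the corresponding group; 1 to N are the groups, and 0 returns the whole matched string
2 If named groups are used, group('name') retrieves a group by name
3 groups() returns all groups
4 groupdict() returns all named groups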
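The four access methods listed above can be tried on a single match object. This is a minimal sketch; the date pattern and the sample string are assumptions chosen for the demonstration.

```python
import re

# Two named groups and one unnamed group
m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})', 'date: 2023-09-11')

whole = m.group(0)          # the whole matched string
print(whole)                # 2023-09-11

first = m.group(1)          # group by number (named groups are numbered too)
print(first)                # 2023

month = m.group('month')    # group by name
print(month)                # 09

all_groups = m.groups()     # every group, named or not
print(all_groups)           # ('2023', '09', '11')

named = m.groupdict()       # only the named groups
print(named)                # {'year': '2023', 'month': '09'}
```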
