Python Regular Expressions Explained: Common Functions and Common Rules

Z_阳

Category: Python Basics. Tags: python, regular expressions, strings

Article link: https://blog.csdn.net/Z_love_u/article/details/110927648

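Table of contents

Regular expression syntax:

Single-character matching rules

Matching multiple characters

Start, end, and alternation syntax

Escape characters and raw strings

Common functions in the re module

match

group

findall

sub

split

compile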

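1. Single-character matching (using the match function)

Match a specific character	write the character itself
Match any character (except '\n')	.
Match any digit	\d
Match any non-digit	\D
Match a whitespace character (\n, \t, \r, space, and also \f and \v)	\s
Match a non-whitespace character	\S
Match a-z, A-Z, digits, and underscore, the same characters allowed in Python variable names (for str patterns in Python 3 this also covers Unicode letters and digits)	\w
Match the opposite of \w	\W
Character set: the match succeeds if any one item inside the brackets matches	[]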

Code walkthrough

import re
# Match a specific character
text = 'python'
# re.match only tries at the start of the string; if the first character does not match, it returns None
res = re.match('p', text)
print('match a single character:', res.group())

# . matches any character (except '\n')
text = 'python'
res = re.match('.', text)
print('any character ".":', res.group())

# \d matches any digit
text = '1python'
res = re.match(r'\d', text)
print(r'any digit "\d":', res.group())

# \D matches any non-digit
text = 'python'
res = re.match(r'\D', text)
print(r'any non-digit "\D":', res.group())

# \s matches a whitespace character (\n, \t, \r, or a space)
text = '\npython'
res = re.match(r'\s', text)
print(r'whitespace "\s":', repr(res.group()))

# \S matches a non-whitespace character
text = 'python'
res = re.match(r'\S', text)
print(r'non-whitespace "\S":', res.group())

# \w matches a-z, A-Z, digits, and underscore
# (the same characters allowed in Python variable names)
text = '_python'
res = re.match(r'\w', text)
print(r'word character "\w":', res.group())

# \W matches the opposite of \w
text = '+_python'
res = re.match(r'\W', text)
print(r'non-word character "\W":', res.group())

# [] is a character set: the match succeeds if any one item inside the brackets matches
# Match one character out of several
text = 'python'
res = re.match('[py]', text)
print('character set [py]:', res.group())

# \d written as a character set
text = '3python'
res = re.match('[0-9]', text)
print('digit as a set [0-9]:', res.group())

# \D written as a character set
text = 'python'
res = re.match('[^0-9]', text)
print(r'"\D" as a set [^0-9]:', res.group())

# \s written as a character set
text = '\t+python'
res = re.match('[\n\t\r ]', text)
print(r'"\s" as a set [\n\t\r ]:', repr(res.group()))

# \S written as a character set
text = '-python'
res = re.match('[^\n\t\r ]', text)
print(r'"\S" as a set [^\n\t\r ]:', res.group())

# \w written as a character set
# The range must be A-Z (A-z would also cover [ \ ] ^ _ `), and \w includes the underscore
text = '456_python'
res = re.match('[0-9a-zA-Z_]', text)
print(r'"\w" as a set [0-9a-zA-Z_]:', res.group())

# \W written as a character set
text = '+python'
res = re.match('[^0-9a-zA-Z_]', text)
print(r'"\W" as a set [^0-9a-zA-Z_]:', res.group())
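One pitfall with hand-written character sets is that a range follows code-point order, so `[A-z]` also covers the six punctuation characters that sit between `Z` and `a` in ASCII. The following is a minimal sketch comparing it with `[A-Za-z]`; the variable names `between`, `loose`, and `strict` are illustrative only.

```python
import re

# The six ASCII characters between 'Z' (0x5A) and 'a' (0x61): [ \ ] ^ _ `
between = '[\\]^_`'

# [A-z] accidentally accepts every one of them; [A-Za-z] accepts none.
loose = [ch for ch in between if re.match(r'[A-z]', ch)]
strict = [ch for ch in between if re.match(r'[A-Za-z]', ch)]

print('accepted by [A-z]:   ', loose)
print('accepted by [A-Za-z]:', strict)
```

This is why an equivalent of `\w` is better written as `[0-9a-zA-Z_]`, with the underscore listed explicitly.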

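2. Multi-character matching

*	match the preceding character 0 or more times
+	match the preceding character 1 or more times
?	match the preceding character 0 or 1 time
{m}/{m,n}	match the preceding character exactly m times / between m and n times
*?/+?/??	make the quantifier non-greedy (match as few characters as possible)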

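Detailed code examples

import re
# *: match the preceding character 0 or more times
# If characters match, as many as possible are returned.
# If none match there is no error: the match succeeds with an empty string.
text = 'python'
res = re.match(r'\w*', text)
print('zero or more (*):', res.group())


# +: match the preceding character 1 or more times
# At least one character must match. If none does, re.match returns None,
# and calling res.group() on it raises AttributeError.
text = 'python'
res = re.match(r'\w+', text)
print('one or more (+):', res.group())


# ?: match the preceding character 0 or 1 time
# Returns one character if it matches, otherwise an empty string; no error either way.
text = '+python'
res = re.match(r'\w?', text)
print('zero or one (?):', repr(res.group()))


# {m}: match the preceding character exactly m times
# If fewer than m characters match, re.match returns None.
text = 'python'
res = re.match(r'\w{3}', text)
print('exactly m times ({m}):', res.group())


# {m,n}: match the preceding character between m and n times
# Matching is greedy: if n or more suitable characters are available, n of them are returned.
text = 'python'
res = re.match(r'\w{2,4}', text)
print('between m and n times ({m,n}):', res.group())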
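Because `re.match` returns `None` when nothing matches, calling `.group()` directly can raise `AttributeError`. Below is a small sketch of one way to guard against that; `first_match` is a hypothetical helper introduced here for illustration, not part of the `re` module.

```python
import re

def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    m = re.match(pattern, text)
    return m.group() if m else None

# '*' may match zero characters, so this is a successful match of ''.
print(repr(first_match(r'\w*', '+python')))
# '+' needs at least one word character, so there is no match at all.
print(repr(first_match(r'\w+', '+python')))
# '{2,4}' is greedy and takes as many as allowed.
print(repr(first_match(r'\w{2,4}', 'python')))
```

The distinction between an empty match (`''`) and no match (`None`) is the practical difference between `*` and `+`.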

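3. Regex practice

import re
# Note: re.match only anchors the start of the string, so extra trailing characters are not rejected here.
# Validate a mobile number: starts with 1, the second digit is one of 3, 4, 5, 7, 8, followed by any 9 digits
text = '13293820912'
res = re.match(r'1[34578]\d{9}', text)
print('mobile number:', res.group())


# Validate an email address: the name consists of digits, letters, and underscores, then an @ sign, then the domain
text = 'strawberry@123.com'
res = re.match(r'\w+@[0-9a-z]+\.[a-z]+', text)
print('email:', res.group())


# Validate a URL: a protocol such as http, https, or ftp, then a colon and two slashes, then any non-whitespace characters
text = 'https://www.baidu.com/'
# To use "or" conditions inside a pattern, wrap them in () and separate the alternatives with |
res = re.match(r'(http|https|ftp)://\S+', text)
print('url:', res.group())


# Validate an ID card number: 18 characters in total; the first 17 are digits, the last is a digit or a lowercase x or uppercase X
text = '12345678998765429x'
res = re.match(r'\d{17}[\dxX]', text)
print('ID card number:', res.group())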
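For validation, note that `re.match` checks only a prefix of the string. A hedged sketch of the difference, using the standard `re.fullmatch` (available since Python 3.4); the constant `PHONE` and the sample string `too_long` are made up for the illustration:

```python
import re

PHONE = r'1[34578]\d{9}'
too_long = '132938209120000'  # 15 digits, not a valid 11-digit number

# match: succeeds because the first 11 characters fit the pattern
prefix_ok = re.match(PHONE, too_long) is not None
# fullmatch: fails because the whole string must fit the pattern
whole_ok = re.fullmatch(PHONE, too_long) is not None
valid_ok = re.fullmatch(PHONE, '13293820912') is not None

print(prefix_ok, whole_ok, valid_ok)
```

Appending `$` to the pattern is an alternative way to get a similar effect with `re.match`.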

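4. Start (^), end ($), and non-greedy matching

import re
# ^: starts with ...
# re.match always begins at the start of the string, so ^ is redundant here
text = 'hello world'
res = re.match('^hello', text)
print('match ^ :', res.group())
# re.search scans the whole string, but ^ still pins the match to the start
text = 'hello world'
res = re.search('^hello', text)
print('search ^ :', res.group())
text = 'hello world'
res = re.search('world', text)
print('search world:', res.group())
# 'world' is not at the start, so search returns None (calling .group() on None would raise AttributeError)
text = 'hello world'
res = re.search('^world', text)
print('search ^world:', res.group() if res else None)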

# $: ends with ...

text = 'hello world'
res = re.search('world$', text)
print('search $:', res.group())

# The pattern ^$ leaves no room for any character between start and end, so it matches an empty string
text = ''
res = re.search('^$', text)
print('search ^$:', repr(res.group()))

|: match any one of several strings or expressions
It works just as it reads: separate the alternative expressions with a vertical bar.
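A short sketch of alternation in practice. The sample strings are invented for the example, and `(?:...)` is the standard non-capturing group syntax, used here so that `findall` returns whole matches rather than group contents.

```python
import re

text = 'I like apples, bananas and cherries'

# Alternatives are tried left to right at each position.
fruits = re.findall(r'apples|cherries', text)

# Without parentheses, | splits the whole pattern: 'gra' OR 'ey'.
unscoped = re.findall(r'gra|ey', 'gray grey')

# Inside a group, only the single letter alternates.
scoped = re.findall(r'gr(?:a|e)y', 'gray grey')

print(fruits)
print(unscoped)
print(scoped)
```

The second and third calls show why alternatives that belong to only part of a pattern should be wrapped in parentheses.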

Greedy and non-greedy modes

# Greedy vs. non-greedy:
text = '12345'
res = re.search(r'\d+?', text)
print('non-greedy (a ? placed after another quantifier):', res.group())
text = '12345'
res = re.search(r'\d+', text)
print('greedy:', res.group())
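To see the effect of the trailing `?` on each quantifier side by side, here is a small sketch that runs greedy and non-greedy variants against the same input. The input `'aaaa'` and the names `patterns` and `results` are arbitrary choices for the demonstration.

```python
import re

text = 'aaaa'
patterns = [r'a*', r'a*?', r'a+', r'a+?', r'a?', r'a??', r'a{2,3}', r'a{2,3}?']

# Map each pattern to what it matches at the start of 'aaaa'.
results = {p: re.match(p, text).group() for p in patterns}

for p, matched in results.items():
    print(f'{p:10} -> {matched!r}')
```

Each greedy form takes as many characters as its quantifier allows, while the non-greedy form stops at the minimum.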

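Case studies

# Case 1: extract the opening HTML tag
text = '<h1>This is a heading</h1>'
res = re.search('<.+>', text)
print('without ? after +, matching is greedy:', res.group())
res = re.search('<.+?>', text)
print('with ? after +, matching is non-greedy:', res.group())
# Case 2: check whether a string is a number between 0 and 100
#  e.g. 1, 99, 100, 0
text = '12'
res = re.search(r'0|[1-9]\d?|100', text)
print('search looks for a match anywhere in the string:', res.group())
# '01' is rejected: 0$ requires the string to end right after 0, and the other branches cannot start with 0
text = '01'
res = re.match(r'0$|[1-9]\d?|100', text)
print('match only tries at the start of the string:', res.group() if res else None)
# Final version: every branch is anchored with $, so '1000' is rejected
text = '1000'
res = re.match(r'0$|[1-9]\d?$|100$', text)
print('each branch must end the string (0, 1-99, or 100):', res.group() if res else None)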
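The same 0 to 100 check can be sketched more compactly with `re.fullmatch`, which removes the need to anchor each branch separately. The function name `is_0_to_100` is hypothetical, and the pattern is one possible formulation rather than the original author's.

```python
import re

def is_0_to_100(s):
    """True if s is the decimal text of an integer from 0 to 100, without leading zeros."""
    # Either 100, or an optional leading 1-9 followed by one digit (covers 0-9 and 10-99).
    return re.fullmatch(r'100|[1-9]?\d', s) is not None

for sample in ['0', '1', '12', '99', '100', '01', '1000', '']:
    print(repr(sample), is_0_to_100(sample))
```

Here `'01'`, `'1000'`, and the empty string are rejected, while `'0'` through `'100'` are accepted.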

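5. Escape characters and raw strings

import re

text = 'apple price $99, orange price  $8'
# To find the prices (the numbers after $), use findall
# It returns a list
res = re.findall('$\\d+', text)
print(res)
# The result above is empty: $ is a special character in regex meaning "end of string",
# so no digits can follow it.
# To treat $ as an ordinary character, escape it with a backslash.
# The r prefix alone does not help: it only stops Python from processing backslash escapes
# in the string literal and leaves regex special characters such as $ untouched.
res = re.findall(r'\$\d+', text)
print('escaped with a backslash:', res)
res = re.findall(r'$\d+', text)
print('r prefix only, $ not escaped:', res)


# Another case: the text itself contains a backslash, and we want to match it
# How a regex string is parsed:
# 1. Python first parses the string literal
# 2. The resulting string is then parsed by the regex engine
text = r'\cabs 3'  # we want to match \c
# Escaping with backslashes only
# '\\\\c' -> (Python level) -> \\c -> (regex level) -> \c
res = re.match('\\\\c', text)
print(res.group())
# Using a raw string
res = re.match(r'\\c', text)
print(res.group())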
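The two parsing levels can be checked directly. The sketch below compares the plain and raw spellings of the same pattern and uses the standard `re.escape` function to build the regex-level escaping automatically; the variable names are illustrative.

```python
import re

# Python level: both literals produce the same 3-character string  \ \ c
plain = '\\\\c'
raw = r'\\c'
same = plain == raw
print(same, len(raw))

# Regex level: re.escape adds the backslashes needed to match text literally.
pattern = re.escape(r'\c')
matched = re.match(pattern, r'\cabs 3').group()
print(pattern, matched)

# The same helper works for $.
prices = re.findall(re.escape('$') + r'\d+', 'apple price $99, orange price  $8')
print(prices)
```

`re.escape` is handy when the text to match comes from user input or another variable rather than being typed into the pattern by hand.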

6. Groups

group()/group(0)	return the entire match
group(n)	return the text matched by the n-th group
groups()	return all groups as a tuple

import re

text = 'apple price : $100, banana price : $55'
# .+ matches one or more of any character, \$ matches a literal $, \d+ matches one or more digits
res = re.search(r'.+\$\d+.+\$\d+', text)
print('group()', res.group())
# To pull out just the \$\d+ parts of that pattern, use groups:
# wrap each part in () and pass 1, 2, ... to group() to get that group's text
res = re.search(r'.+(\$\d+).+(\$\d+)', text)
print('group(n)', res.group(2), res.group(1), sep='\n')
# group(0) returns the entire match
print('group(0)', res.group(0))
# groups() returns every group as a tuple
print('groups()', res.groups())
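Groups also change what `re.findall` returns, which is often the simplest way to extract just the captured parts. A brief sketch using the same sample text; the second pattern, with its literal ` price : ` portion, is tailored to this string only.

```python
import re

text = 'apple price : $100, banana price : $55'

# One group: findall returns only the group's text for each hit.
amounts = re.findall(r'\$(\d+)', text)

# Several groups: each hit becomes a tuple with one entry per group.
pairs = re.findall(r'(\w+) price : \$(\d+)', text)

print(amounts)
print(pairs)
```

Without any groups, `findall` returns the whole matched text for each hit instead.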

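7. Common re functions

findall	find all matches; returns a list
sub	replace text that matches a pattern
split	split a string by a pattern
compile	compile a regular expression
re.VERBOSE	flag that allows comments inside a regular expression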

import re

text = 'apple price  $99, orange price  $88'
# findall: find all matches; returns a list
res = re.findall(r'\$\d+', text)
print('findall():', res)

# sub: replace text that matches a pattern
text = 'every cloud has a silver lining, yes'
# Plain Python replacement is str.replace(old, new)
# sub() is more powerful: one pattern can replace several different substrings at once
res = re.sub(' ', '|', text)
print('sub:', res)
res = re.sub(' |,', '$', text)
print('sub (with a regex):', res)

# Exercise: strip the tags from an HTML fragment
div_html = '''
<div id="info">
        <span><span class="pl">导演</span>: <span class="attrs"><a href="/celebrity/1313751/" rel="v:directedBy">郭敬明</a></span></span><br>
        <span><span class="pl">编剧</span>: <span class="attrs"><a href="/celebrity/1313751/">郭敬明</a></span></span><br>
        <span class="actor"><span class="pl">主演</span>: <span class="attrs"><span><a href="/celebrity/1274608/" rel="v:starring">赵又廷</a> / </span><span><a href="/celebrity/1333341/" rel="v:starring">邓伦</a> / </span><span><a href="/celebrity/1275649/" rel="v:starring">王子文</a> / </span><span><a href="/celebrity/1339442/" rel="v:starring">春夏</a> / </span><span><a href="/celebrity/1352236/" rel="v:starring">汪铎</a> / </span><span style="display: none;"><a href="/celebrity/1419858/" rel="v:starring">孙晨竣</a> / </span><span style="display: none;"><a href="/celebrity/1337019/" rel="v:starring">徐开骋</a> / </span><span style="display: none;"><a href="/celebrity/1393182/" rel="v:starring">欧米德</a> / </span><span style="display: none;"><a href="/celebrity/1399147/" rel="v:starring">芦展翔</a> / </span><span style="display: none;"><a href="/celebrity/1348671/" rel="v:starring">雎晓雯</a> / </span><span style="display: none;"><a href="/celebrity/1449120/" rel="v:starring">王倾</a></span><a href="javascript:;" class="more-actor" title="更多主演">更多...</a></span></span><br>
        <span class="pl">类型:</span> <span property="v:genre">爱情</span> / <span property="v:genre">奇幻</span><br>
        
        <span class="pl">制片国家/地区:</span> 中国大陆<br>
        <span class="pl">语言:</span> 汉语普通话<br>
        <span class="pl">上映日期:</span> <span property="v:initialReleaseDate" content="2020-12-25(中国大陆)">2020-12-25(中国大陆)</span> / <span property="v:initialReleaseDate" content="2021-02-05(美国网络)">2021-02-05(美国网络)</span><br>
        <span class="pl">片长:</span> <span property="v:runtime" content="132">132分钟</span><br>
        <span class="pl">又名:</span> 阴阳师(上)<br>
        <span class="pl">IMDb链接:</span> <a href="https://www.imdb.com/title/tt11454718" target="_blank" rel="nofollow">tt11454718</a><br>

</div>
'''
res = re.sub(r'<.+?>', '', div_html)
print('plain text:', res)

# split: split a string by a pattern
# More flexible than str.split: a regex can specify several separators
text = 'I have been reading ,books of old'
res = re.split(' |,', text)
print('split:', res)


# compile: compile a regular expression
# Compile once, then reuse the compiled pattern; worthwhile when a pattern is used thousands of times
text = 'The π : 3.1415'
r = re.compile(r'\d+\.?\d+')
res = r.search(text)
print('using compile:', res.group())

# Regular expressions can carry comments with re.VERBOSE: write the pattern as a multi-line string
# and pass re.VERBOSE as the last argument
text = 'The π : 3.1415'
r = re.compile(r'''
\d+ # one or more digits
\.? # a decimal point, 0 or 1 time
\d+ # one or more digits
''', re.VERBOSE)
res = r.search(text)
print('compile with comments:', res.group())

# The commented pattern can also be passed straight to search
text = 'The π : 3.1415'
r = re.search(r'''
\d+ # one or more digits
\.? # a decimal point, 0 or 1 time
\d+ # one or more digits
''', text, re.VERBOSE)

print('search with comments:', r.group())
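A compiled pattern exposes the same operations as methods, and `sub` and `split` accept optional `count` and `maxsplit` arguments. The sketch below is illustrative only: the `NUMBER` pattern is an alternative formulation that also accepts single-digit integers (the `\d+\.?\d+` pattern needs at least two digits), and the sample sentences are made up.

```python
import re

# Digits with an optional fractional part: matches '7', '2.718', '3.1415'.
NUMBER = re.compile(r'\d+(?:\.\d+)?')

text = 'pi is 3.1415, e is 2.718, and 7 is an integer'
numbers = NUMBER.findall(text)
first_masked = NUMBER.sub('#', text, count=1)   # replace only the first hit

sentence = 'I have been reading ,books of old'
with_empty = re.split(r' |,', sentence)         # ' ,' produces an empty string
cleaned = re.split(r'[ ,]+', sentence)          # a run of separators counts once
limited = re.split(r'[ ,]+', sentence, maxsplit=2)

print(numbers)
print(first_masked)
print(with_empty)
print(cleaned)
print(limited)
```

Treating a run of separators as one delimiter with `[ ,]+` avoids the empty string that appears when a space is immediately followed by a comma.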
