11. Regular Expressions

Aislli

Category: Python. Tags: python, regular expressions

Article link: https://blog.csdn.net/Aislli/article/details/81169861


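1. Common syntax
\d: a digit
\w: a letter, digit, or underscore
\s: a whitespace character (space, tab, newline, etc.)
.: any character except a newline
[0-9a-zA-Z_]: a digit 0 to 9, a lowercase letter, an uppercase letter, or an underscore
A|B: A or B (for example, (P|p)ython matches 'Python' or 'python')
\: the escape character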

*: zero or more
+: one or more
?: zero or one
{n}: exactly n
{n,m}: between n and m

^: start of the line; ^\d means the line must start with a digit
$: end of the line; \d$ means the line must end with a digit

For example, a regex that matches 010-123456 is: \d{3}-\d{3,8}

To match a mobile phone number: 1[358]\d{9}
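
The two patterns above can be checked with a minimal sketch like the following. It uses Python's `re.fullmatch` so that the whole string has to match. The helper names `is_landline` and `is_mobile` and the sample numbers are made up for illustration.

```python
import re

# Patterns taken from the text above.
LANDLINE = re.compile(r'\d{3}-\d{3,8}')
MOBILE = re.compile(r'1[358]\d{9}')


def is_landline(text):
    """True if text is exactly 3 digits, a hyphen, then 3 to 8 digits."""
    return LANDLINE.fullmatch(text) is not None


def is_mobile(text):
    """True if text is exactly 1, then one of 3/5/8, then 9 more digits."""
    return MOBILE.fullmatch(text) is not None


print(is_landline('010-123456'))  # True
print(is_landline('010-12'))      # False: only 2 digits after the hyphen
print(is_mobile('13812345678'))   # True
print(is_mobile('12812345678'))   # False: second digit is not 3, 5, or 8
```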

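2. Python's re module
Python string literals also use \ as their escape character, so take care when writing a regular expression. For example, the ordinary literal 'abc\\-123' actually holds the string abc\-123, because \\ collapses to a single backslash. If you do not want backslashes to be processed, put r in front of the literal: r'abc\-123' is kept exactly as written. It is therefore a good idea to prefix every pattern with r, so that you never have to think about Python's own escaping.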

s = 'ABC\\-001'
s1 = r'ABC\\-001'
print(s, s1)
---
ABC\-001 ABC\\-001

General usage:

import re

s1 = 'abc-123456'
m1 = r'^[a-z]{3}-\d{3,6}$'
match = re.match(m1, s1)
print(match)
if match:
    print('yes')
else:
    print('no')
---
<_sre.SRE_Match object; span=(0, 10), match='abc-123456'>
yes

If the match succeeds, re.match returns a match object; otherwise it returns None.
split:

s2 = 'a,;   b;  c  d'
split = re.split(r'[\s\,\;]+', s2)
print(split)
---
['a', 'b', 'c', 'd']

Groups:

s1 = 'abc-123456'
m1 = r'^([a-z]{3})-(\d{3,6})$'
match = re.match(m1, s1)
print(match.group(0))
print(match.group(1))
print(match.group(2))
---
abc-123456
abc
123456
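
Groups can also be given names, which is often easier to read than numeric indexes. The sketch below reuses the same sample string. The group names `prefix` and `number` are chosen here for illustration and are not part of the original example.

```python
import re

s1 = 'abc-123456'
# (?P<name>...) defines a named group; it is still reachable by number too.
pattern = r'^(?P<prefix>[a-z]{3})-(?P<number>\d{3,6})$'
match = re.match(pattern, s1)

if match:
    print(match.group('prefix'))   # abc
    print(match.group('number'))   # 123456
    print(match.group(1))          # abc (same group, accessed by index)
    print(match.groupdict())       # {'prefix': 'abc', 'number': '123456'}
```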

Greedy matching:
Regular expressions are greedy by default, meaning they match as many characters as possible:

s3 = '12300'
re_match = re.match(r'(\d+)(0*)', s3)
print(re_match.groups())
---
('12300', '')

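The leading \d+ is greedy and consumes all the digits, so the second group 0* matches only an empty string. To let the second group capture the trailing zeros, add ? after \d+ to make it non-greedy: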

s3 = '12300'
re_match = re.match(r'^(\d+?)(0*)$', s3)
print(re_match.groups())
---
('123', '00')
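
The same greedy versus non-greedy difference shows up with `.*`. The following is a small sketch on a made-up HTML-like string (the variable `html` and its content are illustrative only).

```python
import re

html = '<b>one</b> and <b>two</b>'

# Greedy: .* runs to the last </b>, so the two tags are merged into one match.
greedy = re.findall(r'<b>.*</b>', html)
# Non-greedy: .*? stops at the first </b> it can reach.
lazy = re.findall(r'<b>.*?</b>', html)

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```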

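Compiling
Before matching, the re module first compiles the pattern. If the same pattern will be used for many matches, you can precompile it once for performance and then match with the compiled object, for example when checking several email addresses: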

re_compile = re.compile(r'^(\w+)@(\w+)(\.[a-zA-Z]+)+$')
print(re_compile.match('Aislli@163.com').groups())
print(re_compile.match('12345@qq.com').groups())
---
('Aislli', '163', '.com')
('12345', 'qq', '.com')

3. Other uses of regular expressions
3.1 The difference between findall and finditer
When the pattern contains no groups at all

import re

# Other uses of regular expressions
content = '''http://www.baidu.com/bg1.jgp
http://www.baidu.com/bg2.png
http://www.baidu.com/bg3.bmp
'''
# findall with 0 groups
rem = re.compile(r'http\:\/\/.+?/\w+\.[a-zA-Z0-9]{3}?')
findall = rem.findall(content)
print('findall:', findall, type(findall), type(findall[0]))

# finditer with 0 groups
finditer = rem.finditer(content)
for item in finditer:
    print('finditer:', item, item.group(0))
---
findall: ['http://www.baidu.com/bg1.jgp', 'http://www.baidu.com/bg2.png', 'http://www.baidu.com/bg3.bmp'] <class 'list'> <class 'str'>
finditer: <_sre.SRE_Match object; span=(0, 28), match='http://www.baidu.com/bg1.jgp'> http://www.baidu.com/bg1.jgp
finditer: <_sre.SRE_Match object; span=(29, 57), match='http://www.baidu.com/bg2.png'> http://www.baidu.com/bg2.png
finditer: <_sre.SRE_Match object; span=(58, 86), match='http://www.baidu.com/bg3.bmp'> http://www.baidu.com/bg3.bmp

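When the pattern contains one group, findall returns only what that group matched. With finditer, .group(0) gives the text matched by the whole pattern and .group(1) gives the text matched by the first group.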

# findall with 1 group
rem = re.compile(r'(http\:\/\/.+?)/\w+\.[a-zA-Z0-9]{3}?')
findall = rem.findall(content)
print('findall:', findall, type(findall), type(findall[0]))

# finditer with 1 group
finditer = rem.finditer(content)
print(finditer)
for item in finditer:
    print('finditer:', item, item.groups(), item.group(0), item.group(1))
---
findall: ['http://www.baidu.com', 'http://www.baidu.com', 'http://www.baidu.com'] <class 'list'> <class 'str'>
<callable_iterator object at 0x0000000002989860>
finditer: <_sre.SRE_Match object; span=(0, 28), match='http://www.baidu.com/bg1.jgp'> ('http://www.baidu.com',) http://www.baidu.com/bg1.jgp http://www.baidu.com
finditer: <_sre.SRE_Match object; span=(29, 57), match='http://www.baidu.com/bg2.png'> ('http://www.baidu.com',) http://www.baidu.com/bg2.png http://www.baidu.com
finditer: <_sre.SRE_Match object; span=(58, 86), match='http://www.baidu.com/bg3.bmp'> ('http://www.baidu.com',) http://www.baidu.com/bg3.bmp http://www.baidu.com

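With two groups: findall still returns all results in a single list, but when the pattern has more than one group, each match becomes a tuple of the group contents in that list. finditer works the same way as in the one-group case.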

import re

# Other uses of regular expressions
content = '''http://www.baidu.com/bg1.jgp
http://www.baidu.com/bg2.png
http://www.baidu.com/bg3.bmp
'''
# Match the URL and the image type in a block of text
# findall
rem = re.compile(r'(http\:\/\/.+?)/\w+\.([a-zA-Z0-9]{3}?)')
findall = rem.findall(content)
print('findall:', findall, type(findall), type(findall[0]))

# finditer
finditer = rem.finditer(content)
print(finditer)
for item in finditer:
    print('finditer:', item, item.groups(), item.group(0), item.group(1), item.group(2))
---
findall: [('http://www.baidu.com', 'jgp'), ('http://www.baidu.com', 'png'), ('http://www.baidu.com', 'bmp')] <class 'list'> <class 'tuple'>
<callable_iterator object at 0x00000000029BC080>
finditer: <_sre.SRE_Match object; span=(0, 28), match='http://www.baidu.com/bg1.jgp'> ('http://www.baidu.com', 'jgp') http://www.baidu.com/bg1.jgp http://www.baidu.com jgp
finditer: <_sre.SRE_Match object; span=(29, 57), match='http://www.baidu.com/bg2.png'> ('http://www.baidu.com', 'png') http://www.baidu.com/bg2.png http://www.baidu.com png
finditer: <_sre.SRE_Match object; span=(58, 86), match='http://www.baidu.com/bg3.bmp'> ('http://www.baidu.com', 'bmp') http://www.baidu.com/bg3.bmp http://www.baidu.com bmp

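Note: as soon as the pattern contains a group, findall returns only the group contents. If you still want findall to return the text matched by the whole pattern, you can wrap the entire pattern in parentheses as one outer group; the whole match is then included in the result.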
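
A minimal sketch of that outer-group trick, using a shortened copy of the sample text. It also shows an alternative that is not in the original article: non-capturing groups `(?:...)`, which group without capturing, so findall goes back to returning the whole match as a plain string. The variable names `outer` and `non_capturing` are illustrative.

```python
import re

content = 'http://www.baidu.com/bg1.jgp\nhttp://www.baidu.com/bg2.png\n'

# Outer group: group 1 is the whole match, groups 2 and 3 are the inner groups.
outer = re.compile(r'((http://.+?)/\w+\.([a-zA-Z0-9]{3}))')
print(outer.findall(content))
# [('http://www.baidu.com/bg1.jgp', 'http://www.baidu.com', 'jgp'),
#  ('http://www.baidu.com/bg2.png', 'http://www.baidu.com', 'png')]

# Non-capturing groups: nothing is captured, so findall returns whole matches.
non_capturing = re.compile(r'(?:http://.+?)/\w+\.(?:[a-zA-Z0-9]{3})')
print(non_capturing.findall(content))
# ['http://www.baidu.com/bg1.jgp', 'http://www.baidu.com/bg2.png']
```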

3.2 search

import re

content = '''sadf123http://www.baidu.com/bg1.jgpsadfsdaf234http://www.baidu.com/bg1.jgpfdsa
'''
# Check whether any part of the text satisfies the pattern
rem = re.compile(r'http\:\/\/.+?/.+?\.[a-zA-Z]{3}')
search = rem.search(content)
print(search, type(search), search.group(0))
---
<_sre.SRE_Match object; span=(7, 35), match='http://www.baidu.com/bg1.jgp'> <class '_sre.SRE_Match'> http://www.baidu.com/bg1.jgp
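
For contrast, here is a small sketch of how search differs from match on similar input: match only tries at the very start of the string, while search scans forward for the first position where the pattern fits. The shortened `content` string is made up for this illustration.

```python
import re

content = 'sadf123http://www.baidu.com/bg1.jgpsadf'
rem = re.compile(r'http://.+?/.+?\.[a-zA-Z]{3}')

# match anchors at position 0, where the text is 'sadf123...', so it fails.
at_start = rem.match(content)
print(at_start)  # None

# search scans the string and returns the first match it finds.
found = rem.search(content)
print(found.group(0), found.span())  # http://www.baidu.com/bg1.jgp (7, 35)
```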

3.3 Extracting the content you want from a string

import re

content = '''
asdfadsfsadfsda-sadfsdaf11http://ws.stream.qqmusic.qq.com/C100001jlxUj2UEa33.m4a?fromtag=466646Aaf11http://ws.stream.qqmusic.qq.com/C100001jlxUj2UEa33.m4a?fromtag=466
'''
rex = r'http\:\/\/ws\.stream.+?\.m4a\?.+?(46)+?'
re_compile = re.compile(rex)
search = re_compile.search(content)
rehttp = re.compile(r'http://ws\.stream.+?/(\w+\.m4a)\?.+?46')
# We need the resource's file name, so a group is convenient; but once a group exists, findall returns only the group contents
# findall = rehttp.findall(content)
# print(findall)

finditer = rehttp.finditer(content)
for item in finditer:
    print('url: %s       fileName: %s' % (item.group(0), item.group(1)))
---
url: http://ws.stream.qqmusic.qq.com/C100001jlxUj2UEa33.m4a?fromtag=46       fileName: C100001jlxUj2UEa33.m4a
url: http://ws.stream.qqmusic.qq.com/C100001jlxUj2UEa33.m4a?fromtag=46       fileName: C100001jlxUj2UEa33.m4a