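I. What Is a Regular Expression
A regular expression is a matching rule: you use it to search a large text for the strings you want, or to check whether a string fits the rule.
What regular expressions are used for
1. Finding the strings you want in a large body of text
2. Validating whether input is legal
Pros and cons of regular expressions
Pros: they improve productivity and save code
Cons: they are complex and hard to read
II. Basic Usage of the re Module
search and match both look for the first match. match only tries at the beginning of the string; if the pattern does not match there, match returns None.
# The following code runs in the Python 3 interactive shell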
>>> import re
>>> result = re.search("sanchuang","hello world, this is sanchuang")
>>> result
<re.Match object; span=(21, 30), match='sanchuang'>
>>> result = re.match("sanchuang","hello world,sanchuang")
>>> result
>>> result = re.match("sanchuang","sanchuanghello world,sanchuang")
>>> result
<re.Match object; span=(0, 9), match='sanchuang'>
Both search and match return a match object on success.
>>> result = re.search("world","hello world")
>>> type(result)
<class 're.Match'>
>>> dir(result)
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', ......]
>>> result.end()
11
>>> result.start()
6
>>> result.group()
'world'
>>>
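Because search and match return None when nothing matches, calling group() on the result without checking can fail. A minimal sketch of the usual guard follows; the helper name first_match is made up for illustration and is not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the matched text and its span, or None when nothing matches."""
    result = re.search(pattern, text)
    if result is None:
        # Calling result.group() here would raise AttributeError.
        return None
    return result.group(), result.span()

print(first_match("world", "hello world"))   # ('world', (6, 11))
print(first_match("python", "hello world"))  # None
```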
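findall and finditer find every match. re.findall returns a list of the matched strings; re.finditer returns an iterator that yields a match object for each match.
# The following code runs in the Python 3 interactive shell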
>>> import re
>>> msg = "I love python python1 python2"
>>> re.findall("python",msg)
['python', 'python', 'python']
>>> re.finditer("python",msg)
<callable_iterator object at 0x7fc6254f8cc0>
>>> for i in re.finditer("python",msg):
...     print(i)
...     print(i.group())
...
<re.Match object; span=(7, 13), match='python'>
python
<re.Match object; span=(14, 20), match='python'>
python
<re.Match object; span=(22, 28), match='python'>
python
>>>
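Since finditer yields match objects rather than strings, it is the one to reach for when positions are needed. A small sketch that reuses the same msg and collects the span of every match:

```python
import re

msg = "I love python python1 python2"

# Each item yielded by finditer is a match object, so span() is available.
spans = [m.span() for m in re.finditer("python", msg)]
print(spans)  # [(7, 13), (14, 20), (22, 28)]
```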
Regex substitution with re.sub
>>> import re
>>> msg = "I love python"
>>> result = re.sub("py","PY",msg)
>>> result
'I love PYthon'
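re.sub also accepts a count argument, and the replacement string can refer back to captured groups. The sketch below shows both; the sample strings are invented for illustration.

```python
import re

msg = "I love python python"

# count limits how many occurrences are replaced (0, the default, means all).
first = re.sub("py", "PY", msg, count=1)
print(first)  # I love PYthon python

# \1 and \2 in the replacement refer to the first and second captured groups.
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")
print(swapped)  # host@user
```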
Compiling a pattern with re.compile
>>> msg = "I love python python"
>>> reg = re.compile("py") # build a pattern object first, then call findall on it; same result as re.findall("py", msg)
>>> reg.findall(msg)
['py', 'py']
>>>
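Compiling pays off mainly when the same pattern is applied to many strings. A rough sketch, with a made-up list of lines:

```python
import re

# Compile once, then reuse the pattern object on many strings.
reg = re.compile(r"py\w*")

lines = ["I love python", "pypy is fast", "no match here"]
results = [reg.findall(line) for line in lines]
print(results)  # [['python'], ['pypy'], []]
```

The pattern object offers the same methods as the module-level functions (search, match, findall, finditer, sub), just without the pattern argument.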
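III. Basic Regex Matching
[] matches any one character from the set inside the brackets
[A-Za-z0-9_] matches any letter, digit, or underscore
[^] putting ^ first inside the brackets negates the set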
ret = re.findall("[A-Za-z0-9_]","hdkaH_HFKJ_3873_")
print(ret)
['h', 'd', 'k', 'a', 'H', '_', 'H', 'F', 'K', 'J', '_', '3', '8', '7', '3', '_']
msg = "a3aab45c2"
ret = re.findall("[a-z][0-9]",msg)
print(ret)
['a3', 'b4', 'c2']
ret = re.findall("[^A-Za-z0-9_]","hdka.H_HFKJ_3873__")
print(ret)
['.']
Alternation with |
msg = "xyf hzj lzw ly sh"
ret = re.findall("sh|hzj",msg)
print(ret)
print(re.search("sh|hzj",msg))
['hzj', 'sh']
<re.Match object; span=(4, 7), match='hzj'>
. is a placeholder for any single character except a newline; it has this meaning only outside square brackets
ret = re.findall("p.thon","Python python pgthon p thon p\nthon")
print(ret)
['python', 'pgthon', 'p thon']
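Shorthand escapes
\A matches the start of the string
\b word boundary
\B non-word boundary
\w matches any word character
\W matches any non-word character
\d matches a digit
\D matches a non-digit
\s matches a whitespace character
\S matches a non-whitespace character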
ret = re.findall(r"\Aworld","hello world")
print(ret)
ret = re.findall(r"\Aworld","world hello world")
print(ret)
[]
['world']
ret = re.findall(r"world\b","world123 world# worldabc 123world")
print(ret)
ret = re.findall(r"\bworld\b","world123 world# worldabc 123world")
print(ret)
ret = re.findall(r"\Bworld","world123 world# worldabc 123world")
print(ret)
['world', 'world']
['world']
['world']
ret = re.findall(r"\w","dhak中文$%@#_123")
print(ret)
ret = re.findall(r"\W","dhak$%@#_123")
print(ret)
['d', 'h', 'a', 'k', '中', '文', '_', '1', '2', '3']
['$', '%', '@', '#']
ret = re.findall(r"\d","kh3hd382yt8aha")
print(ret)
ret = re.findall(r"\D","kh3hd382yt8aha")
print(ret)
['3', '3', '8', '2', '8']
['k', 'h', 'h', 'd', 'y', 't', 'a', 'h', 'a']
ret = re.findall(r"\s","dhaj djaj s a ")
print(ret)
ret = re.findall(r"\S","dhaj djaj s a ")
print(re.findall("[^ ]","dhaj djaj s a "))
print(ret)
[' ', ' ', ' ', ' ']
['d', 'h', 'a', 'j', 'd', 'j', 'a', 'j', 's', 'a']
['d', 'h', 'a', 'j', 'd', 'j', 'a', 'j', 's', 'a']
Start and end anchors: ^ matches at the start of the string, $ matches at the end
ret = re.findall("^python","hello python")
print(ret)
ret = re.findall("^python","python hello python \npython")
print(ret)
ret = re.findall("python$","hello python")
print(ret)
[]
['python']
['python']
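IV. Repetition
Quantifiers ? * + {n,m}
? matches the preceding character (or group) 0 or 1 times
+ matches the preceding character (or group) 1 or more times
* matches the preceding character (or group) any number of times, including 0
{n,m} matches the preceding character (or group) from n to m times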
ret = re.findall("py?","py p pyython")
print(ret)
ret = re.findall("py+","py p pyython")
print(ret)
ret = re.findall("py*", "py p pyython")
print(ret)
ret = re.findall("py{2,4}","py pyy pyyy pyyyy pyyyyyy")
print(ret)
['py', 'p', 'py']
['py', 'pyy']
['py', 'p', 'pyy']
['pyy', 'pyyy', 'pyyyy', 'pyyyy']
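The braces also have an exact form {n} and an open-ended form {n,}. A short sketch on an invented string, using \b so that each run of letters is treated as a whole word:

```python
import re

text = "a aa aaa aaaa"

# {2} means exactly two repetitions.
exact = re.findall(r"\ba{2}\b", text)
print(exact)  # ['aa']

# {3,} means three or more repetitions.
at_least = re.findall(r"\ba{3,}\b", text)
print(at_least)  # ['aaa', 'aaaa']
```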
Greedy and non-greedy matching
Greedy: .* matches as many characters as possible
Non-greedy: .*? matches as few characters as possible
msg = "<div>test</div>bb<div>test2</div>"
ret = re.findall("<div>.*</div>",msg)
print(ret)
msg = "<div>test</div>bb<div>test2</div>"
ret = re.findall("<div>.*?</div>",msg)
print(ret)
['<div>test</div>bb<div>test2</div>']
['<div>test</div>', '<div>test2</div>']
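The trailing ? works on the other quantifiers too, and a negated character class is a common alternative to a lazy .*?. The sketch below illustrates both; the short test strings are assumptions chosen for the example.

```python
import re

# Adding ? after +, * or {n,m} makes that quantifier lazy as well.
lazy_plus = re.findall(r"py+?", "pyyy")
lazy_star = re.findall(r"py*?", "pyyy")
lazy_range = re.findall(r"py{2,4}?", "pyyyy")
print(lazy_plus, lazy_star, lazy_range)  # ['py'] ['p'] ['pyy']

# [^<]* cannot run past the next "<", so it gives the same result as .*? here.
msg = "<div>test</div>bb<div>test2</div>"
tags = re.findall(r"<div>[^<]*</div>", msg)
print(tags)  # ['<div>test</div>', '<div>test2</div>']
```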
V. Groups
With groups you get not only the whole match but also each individual group. Use () to create a group.
ret = re.search(r"(\d{3})-(\d{3})-(\d{3})","abc123-465-789aaa")
print(ret.group())
print(ret.group(0))
print(ret.group(1))
print(ret.group(2))
print(ret.group(3))
123-465-789
123-465-789
123
465
789
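Groups can also be given names with (?P<name>...), which is often easier to read than numeric indexes. A sketch on the same input; the group names area, mid and last are invented for illustration.

```python
import re

ret = re.search(
    r"(?P<area>\d{3})-(?P<mid>\d{3})-(?P<last>\d{3})",
    "abc123-465-789aaa",
)

# groups() returns all captured groups as a tuple.
print(ret.groups())      # ('123', '465', '789')
# A named group can be fetched by name or by number.
print(ret.group("mid"))  # 465
# groupdict() maps each group name to its text.
print(ret.groupdict())   # {'area': '123', 'mid': '465', 'last': '789'}
```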
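Capturing group: (pattern)   Non-capturing group: (?:pattern)
A capturing group both groups and captures: the text it matches is saved and given an index starting from 1.
When the pattern contains capturing groups, findall returns only the contents of the groups.
A non-capturing group groups without capturing: the text it matches is not saved and gets no index.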
msg = "hello sc1 hello sc1"
# msg = "hello sc1 hello2 sc2"
print(re.search(r"(\w+)\s(\w+)\s(\w+)\s(\w+)",msg).group())
hello sc1 hello sc1
print(re.search(r"(\w+)\s(\w+)\s\1\s\2",msg).group())
hello sc1 hello sc1
print(re.findall(r"(\w+)\s(\w+)\s\1\s\2",msg))
[('hello', 'sc1')]
msg2 = "a1 a2 a1 a1"
print(re.findall(r"(?:\w+)\s(\w+)\s\1",msg2))
['a1']
msg2 = "a1 a2 a2 a1"
print(re.findall(r"(?:\w+)\s(\w+)\s\1",msg2))
['a2']
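How findall's return value depends on the groups in the pattern is easiest to see side by side. A sketch with an invented string of dates:

```python
import re

msg = "2024-01-15 2025-12-31"

# No groups: findall returns the whole matches.
whole = re.findall(r"\d{4}-\d{2}-\d{2}", msg)
print(whole)  # ['2024-01-15', '2025-12-31']

# One capturing group: findall returns only that group's text.
years = re.findall(r"(\d{4})-\d{2}-\d{2}", msg)
print(years)  # ['2024', '2025']

# (?:...) is skipped, so the only capturing group left is the month.
months = re.findall(r"(?:\d{4})-(\d{2})-\d{2}", msg)
print(months)  # ['01', '12']
```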
VI. Regex Flags
msg = """
PYTHON
python
"""
# re.I makes matching case-insensitive
ret = re.findall(r"python",msg,re.I)
print(ret)
['PYTHON', 'python']
# Without multi-line mode, ^ and $ treat the whole string as one unit
# In multi-line mode (re.M), ^ and $ match at the start and end of each line
ret = re.findall(r"^python$",msg,re.I|re.M)
print(ret)
['PYTHON', 'python']
msg = """
zhang shao han
zhang yi xing
zhang xue you
zhang jie
zhang fei
xie na
he jiong
zhang heng
"""
ret = re.findall(r"^zhang\s[a-z]+$",msg,re.M)
print(ret)
['zhang jie', 'zhang fei', 'zhang heng']
# With re.S, . matches any character, including a newline
ret = re.findall(r".+",msg,re.S)
print(ret)
['\nzhang shao han\nzhang yi xing\nzhang xue you\nzhang jie\nzhang fei\nxie na\nhe jiong\nzhang heng\n']
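To see what each flag actually changes, it helps to run the same patterns without it. The sketch below does that on a two-line string and also shows the inline form (?i), which behaves like passing re.I.

```python
import re

msg = "\nPYTHON\npython\n"

# Without re.M, ^ only matches at the very start of the string, so nothing is found.
no_multiline = re.findall(r"^python$", msg, re.I)
print(no_multiline)  # []

# Without re.S, . stops at each newline, so every line is a separate match.
per_line = re.findall(r".+", msg)
print(per_line)  # ['PYTHON', 'python']

# An inline flag at the start of the pattern replaces the flags argument.
inline = re.findall(r"(?i)python", msg)
print(inline)  # ['PYTHON', 'python']
```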
VII. Regex Assertions
Regex assertions are divided into lookahead and lookbehind assertions.
There are four forms in total:
(?=pattern) zero-width positive lookahead assertion
(?!pattern) zero-width negative lookahead assertion
(?<=pattern) zero-width positive lookbehind assertion
(?<!pattern) zero-width negative lookbehind assertion
s = "sc1 hello sc2 hello"
# match a hello that is followed by " sc2"
print(re.findall(r"hello(?= sc2)",s))
# match a hello that is not followed by " sc2"
print(re.findall(r"hello(?! sc2)",s))
# match a hello that is preceded by "sc2 "
print(re.findall(r"(?<=sc2 )hello",s))
# match a hello that is not preceded by "sc2 "
print(re.findall(r"(?<!sc2 )hello",s))
['hello']
['hello']
['hello']
['hello']
msg = """
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:0c:29:1b:68:1a brd ff:ff:ff:ff:ff:ff
inet 192.168.0.204/24 brd 192.168.0.255 scope global noprefixroute ens33
valid_lft forever preferred_lft forever
inet6 fe80::20c:29ff:fe1b:681a/64 scope link
valid_lft forever preferred_lft forever
"""
print(re.findall(r"(?<=inet ).+(?=/)",msg))
['192.168.0.204']
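The same idea extends to the other fields in that output. A sketch that pulls out the MAC address and the IPv6 address from a shortened copy of the text; note that in Python's re module a lookbehind pattern must have a fixed length, which "link/ether " and "inet6 " both do.

```python
import re

msg = """
link/ether 00:0c:29:1b:68:1a brd ff:ff:ff:ff:ff:ff
inet6 fe80::20c:29ff:fe1b:681a/64 scope link
"""

# Lookbehind: take the non-space run that comes right after "link/ether ".
mac = re.findall(r"(?<=link/ether )\S+", msg)
print(mac)  # ['00:0c:29:1b:68:1a']

# Lookbehind plus lookahead: hex digits and colons between "inet6 " and "/".
ipv6 = re.findall(r"(?<=inet6 )[0-9a-f:]+(?=/)", msg)
print(ipv6)  # ['fe80::20c:29ff:fe1b:681a']
```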