Python | Learn Regular Expressions in One Article

What regular expressions are for

1. Filtering text (data mining): specify a matching rule and check whether it occurs within a larger text string.
2. Validation: use a regular expression to confirm that the data you received is what you expected.

Pros and cons of regular expressions

• Pros: more productive, less code
• Cons: complex and hard to read

re module basics: match and search (find the first match)

re.search
• Searches the string for a match
• Takes a regular expression and a string, and returns the first match found.
• If no match is found at all, re.search returns None

>>> import re
>>> rest=re.search(r'sanle','hello sanle')
>>> print(rest)
<_sre.SRE_Match object; span=(6, 11), match='sanle'>
>>> type(rest)
<class '_sre.SRE_Match'>

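re.match
• Looks for a match only at the start of the string
• Takes a regular expression and a string, starts matching at the first character of the string, and returns the first match found.
• If the start of the string does not match the regular expression, the match fails and re.match returns None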

>>> rest=re.match(r'sanle','hello sanle')
>>> print(rest)
None
>>> type(rest)
<class 'NoneType'>
>>> rest=re.match(r'sanle','sanle sanle hello sanle')
>>> print(rest)
<_sre.SRE_Match object; span=(0, 5), match='sanle'>
>>> type(rest)
<class '_sre.SRE_Match'>
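
As a small illustration of the difference, the following sketch runs both functions on the same string and checks for None before touching the result. The variable names are illustrative only.

```python
import re

text = "hello sanle"

# re.search scans the whole string; re.match only tries position 0.
found_anywhere = re.search(r"sanle", text)
found_at_start = re.match(r"sanle", text)

# Both return None on failure, so test the result before calling methods on it.
search_span = found_anywhere.span() if found_anywhere else None
match_span = found_at_start.span() if found_at_start else None

print(search_span)  # (6, 11)
print(match_span)   # None
```

Calling .span() or .group() directly on a failed match raises AttributeError, which is why the guard is worth the extra line.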

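re module basics: raw strings

The r in r'sanle' stands for raw (a raw string).

• Unlike a normal string, a raw string does not treat the \ character as the start of an escape sequence

• Raw strings are very common and useful for writing regular expressions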

>>> rest=re.search('\\tsanle','hello\\tsanle')
>>> print(rest)
None
>>> rest=re.search(r'\\tsanle','hello\\tsanle')
>>> print(rest)
<_sre.SRE_Match object; span=(5, 12), match='\\tsanle'>
>>> re.search('\\\\tsanle','hello\\\\tsanle')
<_sre.SRE_Match object; span=(6, 13), match='\\tsanle'>
>>> re.search(r'\\\\tsanle','hello\\\\tsanle')
<_sre.SRE_Match object; span=(5, 13), match='\\\\tsanle'>
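
To make the two layers of escaping easier to see, here is a minimal sketch that compares a normal and a raw string literal by length and then uses both as patterns. The names normal and raw are just for illustration.

```python
import re

normal = "\tsanle"   # Python turns backslash-t into a single TAB character
raw = r"\tsanle"     # the backslash and the 't' stay as two characters

print(len(normal))  # 6
print(len(raw))     # 7

# Used as patterns, both match TAB + "sanle": in the raw case it is the
# regex engine, not the Python parser, that interprets \t.
tab_text = "hello\tsanle"
normal_hit = re.search(normal, tab_text) is not None
raw_hit = re.search(raw, tab_text) is not None
print(normal_hit, raw_hit)  # True True

# To match a literal backslash the regex needs \\ ; a raw string keeps it readable.
literal = re.search(r"\\tsanle", "hello\\tsanle").group()
print(literal)  # \tsanle (a literal backslash followed by t)
```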

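re module basics: the match object

match.group(default=0): returns the matched string.

• Groups exist because a regular expression can be split into subgroups, each of which captures only part of the match.

• 0 is the default argument and means the whole match; n means the n-th group

match.start()

• start returns the index in the original string where the match begins

match.end()

• end returns the index in the original string where the match ends (one past the last matched character)

match.groups()

• groups returns a tuple containing the strings of all subgroups, from group 1 up to the last group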

>>> msg="It's rainning cats and dogs"
>>> match=re.search(r'cats',msg)
>>> print(match)
<_sre.SRE_Match object; span=(14, 18), match='cats'>
>>> print(match.group())
cats
>>> print(match.start())
14
>>> print(match.end())
18
>>> print(match.groups())
()
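
The session above uses a pattern without groups, so groups() is empty. As a hedged sketch of how these methods behave once groups are present, the example below uses a made-up sentence and pattern with two capturing groups.

```python
import re

msg = "It's raining cats and dogs"
m = re.search(r"(cats) and (dogs)", msg)

whole = m.group()        # same as m.group(0): the entire match
span = m.span()          # (start, end) of the entire match
parts = m.groups()       # one string per capturing group
second = (m.start(2), m.end(2))  # start/end also accept a group number

print(whole)   # cats and dogs
print(span)    # (13, 26)
print(parts)   # ('cats', 'dogs')
print(second)  # (22, 26)

# Slicing the original string with start/end gives back the matched text.
print(msg[m.start():m.end()] == whole)  # True
```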

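re module basics: findall

findall and finditer: finding multiple matches

re.findall

• Finds all matches and returns them as a list of strings

re.finditer

• Finds all matches and returns an iterator of match objects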

>>> rest=re.findall(r'sanle','hello sanle sanlee sanlee')
>>> print(rest)
['sanle', 'sanle', 'sanle']
>>> msg="It's rainning cats and dogs"
>>> re.findall('a',msg)
['a', 'a', 'a']
>>> re.finditer('a',msg)
<callable_iterator object at 0x7f06f13bc5f8>
# msg="aaaaaa"
# result=re.finditer("a",msg)
# for i in result:
#     print(i)
#     print(i.group())
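
A runnable take on the commented-out loop, as a minimal sketch: because finditer yields match objects rather than strings, each item also carries its position. The positions list is only there to collect the results.

```python
import re

msg = "aaaaaa"
positions = []

# Each item is a match object, so both the text and its index are available.
for m in re.finditer("a", msg):
    positions.append((m.start(), m.group()))

print(positions)
# [(0, 'a'), (1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'a')]
```

finditer is the better fit when the location of each match matters or when the input is large, since matches are produced one at a time.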

Regex substitution

re.sub('pattern','replacement','string')
• Replaces every part of string that matches pattern with replacement

print(re.sub("python","Python","I am learning python3"))
print(re.sub("python","Python","I am learning python3 python"))
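
The two calls above replace every occurrence. As a sketch of two further options that re.sub offers, the example below limits the number of replacements with count and passes a function as the replacement. The sample strings are made up for illustration.

```python
import re

text = "I am learning python3 python"

all_replaced = re.sub("python", "Python", text)
first_only = re.sub("python", "Python", text, count=1)  # stop after one replacement

# The replacement may be a function: it receives each match object and
# returns the string to put in its place.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")

print(all_replaced)  # I am learning Python3 Python
print(first_only)    # I am learning Python3 python
print(doubled)       # 6 apples, 20 pears
```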

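re module basics: compile

Features of compiled regular expressions:

• A complex regular expression can be reused.

• A compiled pattern is more convenient to call because the pattern argument is omitted.

• The re module caches the regular expressions it compiles on the fly, so in most cases using compile brings no great performance advantage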

msg1="hello world"
msg2="i am learning python"
msg3="sanle"
print(re.findall("python",msg1))
print(re.findall("python",msg2))
print(re.findall("python",msg3))

reg = re.compile("python")  # compile the regular expression into a pattern object
print(reg.findall(msg1))
print(reg.findall(msg2))
print(reg.findall(msg3))
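
A compiled pattern exposes the same operations as the module-level functions. The following sketch, using an illustrative list of messages, shows search and sub called on the pattern object itself.

```python
import re

reg = re.compile(r"python")
messages = ["hello world", "i am learning python", "sanle"]

# The pattern object has its own search/findall/sub methods,
# so the pattern string is written only once.
hits = [msg for msg in messages if reg.search(msg)]
replaced = reg.sub("regex", "python and python")

print(hits)         # ['i am learning python']
print(replaced)     # regex and regex
print(reg.pattern)  # python
```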

Basic regular expressions

1. Character classes []: ranges follow character code (code point) order

ret1=re.findall("python","Python on python")
print(ret1)
ret2=re.findall("[Pp]ython","Python on python")
print(ret2)
ret3=re.findall("[A-Za-z0-9-]","abc123ABCD--")
print(ret3)
ret4=re.findall("[a-zA-Z0-9-]","abc123ABCD--")
print(ret4)
# A-z follows code order, so it also covers the characters between Z and a, including \
ret5=re.findall(r"[A-z0-9\-]","abc123ABCD--\\")
print(ret5)

Output:

['python']
['Python', 'python']
['a', 'b', 'c', '1', '2', '3', 'A', 'B', 'C', 'D', '-', '-']
['a', 'b', 'c', '1', '2', '3', 'A', 'B', 'C', 'D', '-', '-']
['a', 'b', 'c', '1', '2', '3', 'A', 'B', 'C', 'D', '-', '-', '\\']

2. Negated character classes [^...]

ret6=re.findall("[^A-Z]c","Ac111crc#c")
print(ret6)
ret7=re.findall("[^A-Z][0-9]","Ac121crc#c")
print(ret7)

Output:

['1c', 'rc', '#c']
['c1', '21']

3. Alternation (|)

msg="welcome to changsha,welcome to hunan"
rest=re.findall("changsha|hunan",msg)
print(rest)

Output:

['changsha', 'hunan']

4. The "." placeholder: matches any single character except \n

rest2=re.findall("p.thon","Pythonpthon p thon p-thon p\nthon")
print(rest2)

Output:

['p thon', 'p-thon']

5. Matching the start and end: ^ and $

rest3=re.findall("^python","python hello pyth3on1")
print(rest3)
rest4=re.findall("python$","pyth3on hello python")
print(rest4)

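Output:

['python']
['python']

Shorthand sequences

\d	Matches a digit, i.e. 0-9
\D	Matches a non-digit
\s	Matches whitespace, e.g. space, tab, newline
\S	Matches a non-whitespace character
\w	Matches a word character, i.e. a-z, A-Z, 0-9, _
\W	Matches a non-word character
\A	Matches the start of the string
\b	Word boundary: matches the empty string, but only at the start or end of a word
\B	Non-word boundary: matches the empty string, but only where there is no start or end of a word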
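
The table is easier to remember with a quick experiment. The sketch below tries several of the sequences on made-up sample strings.

```python
import re

text = "Order 66: cat scattered 3 cats"

digits = re.findall(r"\d+", text)
spaces = re.findall(r"\s", "a b\tc\n")
words = re.findall(r"\w+", "snake_case, x9!")
whole_word = re.findall(r"\bcat\b", text)  # "cat" as a standalone word
inside = re.findall(r"\Bcat\B", text)      # "cat" with word characters on both sides
at_start = re.findall(r"\Acat", "cat cat") # only at the very start of the string

print(digits)      # ['66', '3']
print(spaces)      # [' ', '\t', '\n']
print(words)       # ['snake_case', 'x9']
print(whole_word)  # ['cat']  (the standalone word)
print(inside)      # ['cat']  (the one inside "scattered")
print(at_start)    # ['cat']
```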

Repetition

1. ? matches the preceding item 0 or 1 times

ret=re.findall("py?","python p pyy ps")
print(ret)

Output:

['py', 'p', 'py', 'p']

2. * matches the preceding item any number of times (0 to n)

ret=re.findall("py*","python p pyy ps")
print(ret)

Output:

['py', 'p', 'pyy', 'p']

3. + matches the preceding item at least once

ret=re.findall("py+","python p pyy ps")
print(ret)

Output:

['py', 'pyy']

4. {n}: n is a non-negative integer. Matches exactly n times.

ret=re.findall("py{2}","python p pyy ps pyyyy")
print(ret)

Output:

['pyy', 'pyy']

5. {n,}: n is a non-negative integer. Matches at least n times.

ret=re.findall("py{2,}","python p pyy ps pyyyy")
print(ret)

Output:

['pyy', 'pyyyy']

6. {n,m}: matches the preceding item between n and m times, at least n and at most m

ret=re.findall("py{2,4}","python p pyy ps pyyyy")
print(ret)

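Output:

['pyy', 'pyyyy']

Greedy and non-greedy modes

Greedy mode: *, +, ? and {n,m} are all greedy; they match as long a string as possible

Non-greedy mode: stops as soon as a match is found, matching as short a string as possible (+? *? ?? {2,4}?)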


msg="helloooooo,I am sanchuang,123"
print(re.findall("lo{3,}",msg))
print(re.findall("lo{3,}?",msg))
print(re.findall("lo*?",msg))
print(re.findall("lo?",msg))
print(re.findall("lo??",msg))
msg="cats and dogs , cats1 and dog1"
print(re.findall("cats.*s",msg))
print(re.findall("cats.*?s",msg))

Output:

['loooooo']
['looo']
['l', 'l']
['l', 'lo']
['l', 'l']
['cats and dogs , cats']
['cats and dogs']
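
A common place where the difference matters is text with repeated delimiters. The sketch below uses a made-up HTML-like string to compare the two modes.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the end and backs up to the LAST '>'.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the FIRST '>' after each '<'.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```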

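Grouping

With grouping you can retrieve not only the whole match but also each individual group; use () to create a group

1. Capturing groups

The group method of a match object takes 0 as its default argument, which returns the whole matched string;
an argument n (n>0) returns what the n-th group matched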

msg="tel:173-7572-2991"
ret=re.search(r"(\d{3})-(\d{4})-(\d{4})",msg)
# ret1=re.search(r"\d{3}-\d{4}-\d{4}",msg)
print(ret.groups())
print(ret.group())
print(ret.group(1))
print(ret.group(2))
print(ret.group(3))

Output:

('173', '7572', '2991')
173-7572-2991
173
7572
2991

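2. Referencing groups (backreferences)

Capturing groups: whatever a group matches is kept in memory temporarily and given an index starting from 1,
so a capturing group can be referenced later in the pattern with \1, \2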

ret = re.search(r"(\d{3})-(\d{4})-\2","173-7572-7572")
print(ret.group())
ret = re.search(r"(\d{3})-(\d{4})-\1","173-7572-173")
print(ret.group())

Output:

173-7572-7572
173-7572-173
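
One typical use of backreferences is spotting accidental repetition. The following is a minimal sketch on a made-up sentence: it finds doubled words and then collapses them with re.sub.

```python
import re

text = "this is is a test of of the the parser"

# (\w+) captures a word; \1 requires the very same word to follow after a space.
repeated = re.findall(r"\b(\w+) \1\b", text)

# In the replacement string, \1 inserts what group 1 captured.
deduped = re.sub(r"\b(\w+)( \1\b)+", r"\1", text)

print(repeated)  # ['is', 'of', 'the']
print(deduped)   # this is a test of the parser
```

The \b anchors keep the "is" inside "this" from pairing with the following "is".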

3. Non-capturing groups (?:regex)

Groups without capturing: the matched content is not kept in memory, so it cannot be referenced with a backreference

ret = re.search(r"(?:\d{3})-(\d{4})-\1","173-7572-7572")
print(ret.group(1))

Output:

7572

When the pattern contains capturing groups, findall returns only the contents of the capturing groups

ret = re.findall(r"(?:\d{3})-(\d{4})-\1","173-7572-7572")
print(ret)

Output:

['7572']

Exercise

msg="comaa@126.comyy@bb.comcombb@qq.comxx@163.com"
Find the email addresses at 126.com, qq.com, and 163.com

Solution

msg="comaa@126.comyy@bb.comcombb@qq.comxx@163.com"
print(re.findall(r"(?:\.com)?(\w+@(?:126|qq|163)\.com)",msg))

Output:

['comaa@126.com', 'combb@qq.com', 'xx@163.com']

4. Named groups

import re
ret=re.search(r'(?P<first>\d{3})-\d{3}-(?P<last>\d{3})',"321-123-231")
print(ret.group())
print(ret.groups())
print(ret.groupdict())
ret=re.findall(r'(?P<first>\d{3})-\d{3}-(?P<last>\d{3})',"321-123-231")
print(ret)

Output:

321-123-231
('321', '231')
{'first': '321', 'last': '231'}
[('321', '231')]
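
Named groups can also be referenced by name, both in a replacement string and inside the pattern. The sketch below assumes a hypothetical date pattern (date_re) and sample strings chosen only for illustration.

```python
import re

# Hypothetical date pattern, used only for illustration.
date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

m = date_re.search("released 2024-03-15")
year = m.group("year")   # look a group up by name
month = m["month"]       # match objects also support indexing (Python 3.6+)

# \g<name> refers to a named group inside a replacement string.
reordered = date_re.sub(r"\g<day>/\g<month>/\g<year>", "released 2024-03-15")

# (?P=name) refers back to a named group inside the pattern itself.
paired = re.search(r"<(?P<tag>\w+)>.*?</(?P=tag)>", "<b>bold</b> <i>x</b>")

print(year, month)     # 2024 03
print(reordered)       # released 15/03/2024
print(paired.group())  # <b>bold</b>
```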

Common regex flags

 re.I    re.IGNORECASE, makes matching case-insensitive
 re.M    re.MULTILINE, multi-line matching; affects ^ and $
 re.S    re.DOTALL, makes . match every character, including newline

import re
ret=re.findall("^python$","Python",re.I)
print(ret)
ret=re.findall("^python$","Python\npython",re.I)
print(ret)
ret=re.findall("^python$","Python\npython",re.I|re.M)
print(ret)

Output:

['Python']
[]
['Python', 'python']

# Case-insensitive and multi-line matching

msg="""
python
python
Python
"""
print(re.findall("^python$",msg,re.M|re.I))
print(re.findall(".+",msg,re.S))

Output:

['python', 'python', 'Python']
['\npython\npython\nPython\n']
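
To isolate what each flag changes, here is a small sketch on a made-up two-line string: re.M only affects the anchors, and re.S only affects the dot.

```python
import re

text = "first line\nsecond line"

# Without re.M, ^ matches only at the start of the whole string.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.M)

# Without re.S, "." refuses to match the newline between the two lines.
dot_default = re.search(r"line.second", text)
dot_dotall = re.search(r"line.second", text, re.S)

print(starts_default)      # ['first']
print(starts_multiline)    # ['first', 'second']
print(dot_default)         # None
print(dot_dotall.group() == "line\nsecond")  # True
```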

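Inline flags

(?imx) Turns on the optional flags i, m, or x for the entire regular expression; place it at the start of the pattern (Python 3.11+ rejects it anywhere else).
(?imx: re) Turns on the optional flags i, m, or x only for the expression inside the parentheses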

import re
ret=re.findall("(?i)^python$","Python")
print(ret)
ret=re.findall("(?i)^python$","Python\npython")
print(ret)
ret=re.findall("(?im)^python$","Python\npython")
print(ret)

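Output:

['Python']
[]
['Python', 'python']

An inline flag can apply to just part of the pattern. The space after (?i:hello) in the patterns that follow is a literal space that matches the space in the input; it is not part of the flag syntax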

ret=re.findall("(?i:hello) Python","Hello python")
print(ret)
ret=re.findall("(?i:hello) python","Hello python")
print(ret)

Output:

[]
['Hello python']

Lookaround assertions

Regex assertions come in two kinds: lookahead and lookbehind.
Together they take four forms:
• (?=pattern) zero-width positive lookahead assertion
• (?!pattern) zero-width negative lookahead assertion
• (?<=pattern) zero-width positive lookbehind assertion
• (?<!pattern) zero-width negative lookbehind assertion

1. Zero-width positive lookahead

import re
s='a reguler expression'
print(re.findall(r're(?=guler)',s))
s='a reguller expression'
print(re.findall(r're(?=guler)',s))

Output:

['re']
[]

2. Zero-width negative lookahead

import re
s='a reguler expression'
print(re.findall(r're(?!guler)',s))
s='a reguller expression'
print(re.findall(r're(?!guler)',s))

Output:

['re']
['re', 're']

3. Zero-width positive lookbehind

import re
s='a reguler expression'
print(re.findall(r'(?<=re)guler',s))
s='a reguller expression'
print(re.findall(r'(?<=re)guler',s))

Output:

['guler']
[]

4. Zero-width negative lookbehind

import re
s='a reguler expression'
print(re.findall(r'(?<!re)guler',s))
s='a reguller expression'
print(re.findall(r'(?<!re)expression',s))

Output:

[]
['expression']
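
Because assertions are zero-width, they are handy for picking out a value by its surroundings without including those surroundings in the result. The sketch below uses a made-up price string. The last pattern also excludes a preceding digit so that a match cannot start in the middle of a number.

```python
import re

prices = "apple $30, pear 45 units, plum $7"

# Positive lookbehind: digits that come right after a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", prices)

# Positive lookahead: digits that are followed by " units".
units = re.findall(r"\d+(?= units)", prices)

# Negative lookbehind: digits preceded by neither '$' nor another digit.
# Without \d in the class, the '0' of "$30" would match on its own.
plain = re.findall(r"(?<![$\d])\d+", prices)

print(dollars)  # ['30', '7']
print(units)    # ['45']
print(plain)    # ['45']
```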

缘来是黎
