Python Notes (5): Regular Expressions

han_stars

Article link: https://blog.csdn.net/han_stars/article/details/100087724

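Regular Expressions

I. Overview
1. Concept
A regular expression is a text pattern that describes one or more strings to match when searching text.
2. Typical use cases

Data validation
Text scanning
Text extraction
Text replacement
Text splitting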
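
As a rough illustration, the five use cases above map onto functions of Python's standard re module roughly as follows. The sample strings and variable names are made up for this sketch.

```python
import re

text = "Tom is 8, Mike is 13"  # sample data, assumed for illustration

# Validation: does the whole string have the form YYYY-MM-DD?
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2024-08-10") is not None

# Scanning: find the first number anywhere in the text
first_number = re.search(r"\d+", text).group()

# Extraction: collect every number
numbers = re.findall(r"\d+", text)

# Replacement: mask every number
masked = re.sub(r"\d+", "N", text)

# Splitting: break the text at commas followed by optional whitespace
parts = re.split(r",\s*", text)

print(is_date, first_number, numbers, masked, parts)
```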

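3. Syntax

Literals
a. Ordinary characters, which match themselves
b. Characters that must be escaped with a backslash to be matched literally: \ ^ $ . | ? * + ( ) [ ] { }
Metacharacters, which carry a special meaning instead of matching themselves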
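
A small sketch of why escaping matters, using "+" as the example metacharacter. The sample text is an assumption chosen so that the two patterns give different results.

```python
import re

text = "1+1=2 and 11"  # sample data, assumed for illustration

# Unescaped: "+" is a quantifier, so the pattern means "one or more 1s, then a 1"
as_metachar = re.findall(r"1+1", text)

# Escaped: "\+" is a literal plus sign
as_literal = re.findall(r"1\+1", text)

print(as_metachar)  # ['11']
print(as_literal)   # ['1+1']
```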

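4. Matching
Single characters and predefined character classes

. any character except \n (with re.S it matches \n as well)
\d a digit, equivalent to [0-9]
\D a non-digit, equivalent to [^0-9]
\s a whitespace character: space, \t, \n, \r, \f, \v
\S a non-whitespace character, [^ \t\n\r\f\v]
\w a word character, [a-zA-Z0-9_] (the underscore is included)
\W a non-word character, [^a-zA-Z0-9_]
For str patterns in Python 3, \d, \s and \w also cover Unicode digits, whitespace and letters unless re.ASCII is set.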
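
The character classes above can be tried out with a short sketch like the one below. The sample string is an assumption, picked to contain an underscore, a tab, a space, and a newline.

```python
import re

sample = "id_7\tok 42\n"  # sample data, assumed for illustration

digits = re.findall(r"\d", sample)    # every single digit
words = re.findall(r"\w+", sample)    # runs of word characters, underscore included
spaces = re.findall(r"\s", sample)    # the tab, the space and the newline
dots = re.findall(r".", "a\nb")       # "." skips the newline by default

print(digits)  # ['7', '4', '2']
print(words)   # ['id_7', 'ok', '42']
print(spaces)  # ['\t', ' ', '\n']
print(dots)    # ['a', 'b']
```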

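Alternation
| matches either the expression on its left or the one on its right
Quantifiers (how many times a character, metacharacter, or character set repeats)

? 0 or 1 time
* 0 or more times
+ 1 or more times
Specific counts
1 {n,m} between n and m times
2 {n} exactly n times
3 {n,} at least n times
4 {,m} at most m times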
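
A minimal sketch of the quantifiers listed above. The test strings are assumptions; re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# ? : the "u" may appear zero or one time
optional = bool(re.fullmatch(r"colou?r", "color"))

# * : zero occurrences of "o" are acceptable
zero_ok = bool(re.fullmatch(r"go*gle", "ggle"))

# + : at least one "o" is required, so this fails
one_needed = bool(re.fullmatch(r"go+gle", "ggle"))

# {n} : exactly four digits between word boundaries
exact = re.findall(r"\b\d{4}\b", "12 2024 123456")

# {n,m} : two to four digits between word boundaries
ranged = re.findall(r"\b\d{2,4}\b", "1 12 123 1234 12345")

print(optional, zero_ok, one_needed)  # True True False
print(exact)                          # ['2024']
print(ranged)                         # ['12', '123', '1234']
```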

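Greedy and non-greedy
Greedy (the default): the quantifier matches as much text as possible
Non-greedy: the quantifier matches as little text as possible; add ? after the quantifier (*?, +?, ??, {n,m}?)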
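
The difference is easiest to see on text with repeated delimiters. The following sketch assumes a small HTML-like sample string.

```python
import re

html = "<b>one</b><b>two</b>"  # sample data, assumed for illustration

# Greedy: .* runs to the last </b> it can reach
greedy = re.findall(r"<b>.*</b>", html)

# Non-greedy: .*? stops at the first </b> it can reach
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```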

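Boundary matching

^ start of the string (with re.M, also the start of each line)
$ end of the string, or just before a trailing newline (with re.M, also the end of each line)
\b word boundary
\B non-word boundary
\A start of the input
\Z end of the input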
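
The anchors above can be compared side by side in a short sketch. The two-line sample text is an assumption, chosen so that "cat" appears as a whole word, inside a longer word, and at the start of a second line.

```python
import re

text = "cat concat\ncat"  # sample data, assumed for illustration

whole_words = re.findall(r"\bcat\b", text)       # "cat" as a standalone word
inside_words = re.findall(r"\Bcat", text)        # "cat" preceded by a word character
string_start = re.findall(r"^cat", text)         # without re.M, ^ is only the string start
line_starts = re.findall(r"^cat", text, re.M)    # with re.M, ^ is every line start
input_start = re.findall(r"\Acat", text, re.M)   # \A ignores re.M

print(whole_words)   # ['cat', 'cat']
print(inside_words)  # ['cat']
print(string_start)  # ['cat']
print(line_starts)   # ['cat', 'cat']
print(input_start)   # ['cat']
```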

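II. Regular expressions in Python
1. RegexObject (the compiled pattern object returned by re.compile())

.findall() finds all non-overlapping matches
Returns a list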

>>> import re
>>> text = "Tom is 8 years old.Make is 13 years old.\\auther"
>>> pattern = re.compile(r'\d+')  # raw string keeps the backslash intact
>>> pattern.findall(text)
['8', '13']
>>> pattern = re.compile(r'\\auther')  # regex \\ matches one literal backslash
>>> pattern.findall(text)
['\\auther']
>>> pattern = re.compile(r'[A-Z]\w+')
>>> pattern.findall(text)
['Tom', 'Make']

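.match(string[, pos[, endpos]]) matches only at the start of the string (or at pos, if given)
Returns a MatchObject (printed as re.Match in Python 3.7+, _sre.SRE_Match in older versions), or None if there is no match

>>> import re
>>> pattern = re.compile(r'<html>')
>>> text = '<html><body></head><body></body></html>'
>>> pattern.match(text)
<re.Match object; span=(0, 6), match='<html>'>

.search(string[, pos[, endpos]]) scans the string and returns the first match at any position
Returns a MatchObject, or None if there is no match

>>> import re
>>> p2 = re.compile(r'<body>')
>>> text = '<html><body></head><body></body></html>'
>>> p2.search(text)
<re.Match object; span=(6, 12), match='<body>'>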
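
To make the difference between .match() and .search() concrete, and to show what the optional pos and endpos arguments do, here is a small sketch. The sample text and variable names are assumptions.

```python
import re

pattern = re.compile(r"\d+")
text = "Tom is 23"  # sample data; the digits sit at indexes 7 and 8

# match() only tries at the start, and the string does not begin with a digit
at_start = pattern.match(text)

# search() scans forward until it finds a match
anywhere = pattern.search(text).group()

# pos=7 moves the starting point, so match() now succeeds
from_pos = pattern.match(text, 7).group()

# endpos=8 hides everything from index 8 onward
cut_short = pattern.search(text, 0, 8).group()

print(at_start)   # None
print(anywhere)   # 23
print(from_pos)   # 23
print(cut_short)  # 2
```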

.finditer() finds all non-overlapping matches
Returns an iterator that yields MatchObject elements

>>> import re
>>> text = "Tom is 23 years old. Mike is 11 years old!"
>>> p1 = re.compile(r'\d+')
>>> it = p1.finditer(text)
>>> for m in it:
...     print(m)
...
<re.Match object; span=(7, 9), match='23'>
<re.Match object; span=(29, 31), match='11'>

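2. MatchObject (match object)

.group()
With 0 or no argument, returns the whole match; with a group number or name, returns the text captured by that group
.groups()
Returns a tuple containing all subgroups
.start()
Returns the start index of the match, or of the given group
.end()
Returns the end index of the match, or of the given group
.span()
Returns a (start, end) tuple for the match, or for the given group
.groupdict()
Returns a dictionary that maps each group name to its captured text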

>>> import re
>>> text = "Tom is 23 years old. Mike is 11 years old!"
>>> p1 = re.compile(r'(\d+).*?(\d+)')
>>> i = p1.search(text)
>>> i
<re.Match object; span=(7, 31), match='23 years old. Mike is 11'>
>>> i.group()
'23 years old. Mike is 11'
>>> i.group(1)  # text captured by group 1
'23'
>>> i.group(2)
'11'
>>> i.start(1)
7
>>> i.end(1)
9
>>> i.groups()
('23', '11')
>>> i.span(1)
(7, 9)
>>> i.groupdict()  # empty because no group has a name
{}

>>> import re
>>> text = "Tom is 23 years old. Mike is 11 years old!"
>>> p1 = re.compile(r'(\w+) (\w+)')
>>> p1.findall(text)
[('Tom', 'is'), ('23', 'years'), ('Mike', 'is'), ('11', 'years')]
>>> it = p1.finditer(text)
>>> for m in it:
...     print(m.group())
...
Tom is
23 years
Mike is
11 years

3. Groups
1) Extract information from the matched text
2) Create a sub-expression that a quantifier can apply to

>>> import re
>>> re.search('ab+c', 'ababc')
<re.Match object; span=(2, 5), match='abc'>
>>> re.search('(ab)+c', 'ababc')
<re.Match object; span=(0, 5), match='ababc'>

3) Limit the scope of alternation

>>> re.search(r'cent(er|re)', 'centre')
<re.Match object; span=(0, 6), match='centre'>

4) Reuse text captured earlier in the pattern (backreference)

>>> re.search(r'(\w+) \1', 'centre centre')
<re.Match object; span=(0, 13), match='centre centre'>

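5) Named groups
(?P<name>pattern)

>>> import re
>>> text = 'Tom:13'
>>> p2 = re.compile(r'(?P<name>\w+):(?P<age>\d+)')
>>> n = p2.search(text)
>>> n.group('age')
'13'

Referencing a named group

On the match object: n.group('name')
Inside the pattern: (?P=name)
Inside the replacement string: \g<name>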
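
The three ways of referencing a named group can be sketched together as follows. The group names (word, first, last) and the sample strings are assumptions made for this illustration.

```python
import re

# Inside the pattern: (?P=word) must repeat whatever the group "word" captured
doubled = re.compile(r"(?P<word>\w+) (?P=word)")
m = doubled.search("say hello hello world")

# On the match object: look the group up by name
repeated = m.group("word")

# Inside the replacement string: \g<name> inserts the named group's text
swapped = re.sub(r"(?P<first>\w+) (?P<last>\w+)", r"\g<last>, \g<first>", "Ada Lovelace")

# groupdict() maps every group name to its captured text
fields = re.search(r"(?P<name>\w+):(?P<age>\d+)", "Tom:13").groupdict()

print(repeated)  # hello
print(swapped)   # Lovelace, Ada
print(fields)    # {'name': 'Tom', 'age': '13'}
```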

4. Applications
1) String operations

.split(string, maxsplit=0) splits a string at each match of the pattern

>>> import re
>>> text = 'Tom is 24 years old. Mike is 22 years old. Jame is 12 years old'
>>> p = re.compile(r'\.')
>>> p.split(text)
['Tom is 24 years old', ' Mike is 22 years old', ' Jame is 12 years old']
>>> re.split(r'\.', text)
['Tom is 24 years old', ' Mike is 22 years old', ' Jame is 12 years old']
>>> re.split(r'(\.)', text)  # a capturing group keeps the separators in the result
['Tom is 24 years old', '.', ' Mike is 22 years old', '.', ' Jame is 12 years old']
>>> re.split(r'(\.)', text, maxsplit=1)
['Tom is 24 years old', '.', ' Mike is 22 years old. Jame is 12 years old']

.sub(repl, string, count=0) replaces matches in a string; the module-level form is re.sub(pattern, repl, string, count=0)

>>> text = 'Tom is 24 years old.'
>>> re.sub(r'\W', ',', text)
'Tom,is,24,years,old,'
>>> text = 'Tom is *24* years old.'
>>> re.sub(r'\*(.*?)\*', r'<strong>\g<1></strong>', text)  # in the replacement string, refer to a group with \g<number> or \g<name>
'Tom is <strong>24</strong> years old.'
>>> ords = 'ORD000\nORD001\nORD002'
>>> re.sub(r'([A-Z]+)(\d+)', r'\g<2>\g<1>', ords)
'000ORD\n001ORD\n002ORD'

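.subn(repl, string, count=0) replaces matches and returns a (new_string, number_of_substitutions) tuple; the module-level form is re.subn(pattern, repl, string, count=0)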
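
A brief sketch of the return value of subn, using the module-level form. The sample sentence is an assumption.

```python
import re

text = "Tom is 24, Mike is 22"  # sample data, assumed for illustration

# Replace every number and report how many replacements were made
new_text, count = re.subn(r"\d+", "#", text)

# count=1 limits the work to the first match
limited = re.subn(r"\d+", "#", text, count=1)

print(new_text, count)  # Tom is #, Mike is # 2
print(limited)          # ('Tom is #, Mike is 22', 1)
```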
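2) Compilation flags
Change the default behavior of a pattern
re.I ignore case
re.M multi-line mode: ^ and $ also match at the start and end of each line
re.S makes "." match every character, including \n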
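
The effect of each flag can be checked with a short sketch such as the one below. The three-line sample text is an assumption; flags are combined with the | operator.

```python
import re

text = "Python\npython\nPYTHON"  # sample data, assumed for illustration

# re.I: letter case is ignored
ignore_case = re.findall(r"python", text, re.I)

# Without re.M, ^ and $ refer to the whole string, so nothing fits
single_line = re.findall(r"^p\w+$", text, re.I)

# With re.M, ^ and $ apply to every line
multi_line = re.findall(r"^p\w+$", text, re.I | re.M)

# "." does not cross a newline unless re.S is set
dot_default = re.search(r"Python.python", text)
dot_all = re.search(r"Python.python", text, re.S)

print(ignore_case)      # ['Python', 'python', 'PYTHON']
print(single_line)      # []
print(multi_line)       # ['Python', 'python', 'PYTHON']
print(dot_default)      # None
print(dot_all.group())  # the first two lines, newline included
```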
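3) Module-level operations
re.purge() clears the regular expression cache
re.escape() escapes the special characters in a string so that it matches literally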

>>> re.findall(r'^', '^python^')
['']
>>> re.findall(re.escape(r'^'), '^python^')
['^', '^']
