Python and Regular Expressions

aaaaaaxin

Category: python笔记 (Python notes); tags: python, regular expressions, programming languages

Article link: https://blog.csdn.net/qq_41121485/article/details/126783800


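Most programming languages support regular expressions, and the syntax is largely the same across them, but there are small differences: each language has a few matching rules of its own.
The first issue is the backslash. Regular expressions use '\' as the escape character, so to match a literal '\' in the text the pattern itself needs two backslashes. In a language whose string literals also use '\' for escaping, that pattern has to be written with four backslashes: '\\\\'.
Python's raw strings solve this problem. A regular expression that matches a single '\' can be written as r'\\'.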
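To make the backslash point concrete, here is a minimal sketch that compares the two spellings of the same pattern. The variable names (plain, raw, path) are only illustrative.

```python
import re

# A regex that matches one literal backslash is the two-character pattern \\ .
plain = '\\\\'   # ordinary string literal: four backslashes in the source
raw = r'\\'      # raw string literal: two backslashes in the source

# Both literals produce exactly the same two-character pattern.
same_pattern = (plain == raw)
print(same_pattern)   # True
print(len(raw))       # 2

# The text C:\temp contains a single backslash.
path = 'C:\\temp'
found = re.findall(raw, path)
print(len(found))     # 1
```

The raw string does not change what the regex engine sees; it only removes one layer of escaping from the source code.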
Python supports regular expressions through the re module.
Some of its commonly used functions are listed below:
1. re.compile(pattern[, flags])
2. re.match(pattern, string[, flags])
3. re.search(pattern, string[, flags])
4. re.split(pattern, string[, maxsplit])
5. re.findall(pattern, string[, flags])
6. re.finditer(pattern, string[, flags])
7. re.sub(pattern, repl, string[, count])
8. re.subn(pattern, repl, string[, count])

re.compile
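re.compile(pattern[, flags]) compiles a regular expression string into a Pattern object. For example, re.compile(r'\d+') produces a Pattern object that matches digits; it can then be passed to the functions below for further searching.
Several of these functions take a flags parameter. flags sets the matching mode, and values can be combined with the bitwise OR operator '|' so that they take effect together, for example re.I|re.M.
re.I: ignore case.
re.M: multi-line mode; changes the behavior of '^' and '$' so that they also match at the start and end of each line.
re.S: dot-matches-all mode; changes the behavior of '.' so that it also matches a newline.
re.L: makes the predefined character classes \w \W \b \B depend on the current locale. In Python 3 this flag can only be used with bytes patterns.
re.U: makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode. In Python 3 this is already the default for str patterns.
re.X: verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added.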
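The flags above can be tried out with a short sketch like the following. The sample strings and the variable names are made up for illustration.

```python
import re

# re.I: case-insensitive matching
ignore_case = re.findall(r'hello', 'Hello HELLO', re.I)
print(ignore_case)          # ['Hello', 'HELLO']

# re.M: ^ and $ also match at the start and end of every line
per_line = re.findall(r'^\w+$', 'one\ntwo', re.M)
print(per_line)             # ['one', 'two']

# re.S: . also matches a newline
without_s = re.search(r'a.b', 'a\nb')
with_s = re.search(r'a.b', 'a\nb', re.S)
print(without_s)            # None
print(with_s is not None)   # True

# re.X combined with re.I using |: whitespace and comments
# inside the pattern are ignored.
version = re.compile(r"""
    v        # a literal v (upper or lower case because of re.I)
    (\d+)    # major version number
    \.       # a literal dot
    (\d+)    # minor version number
""", re.X | re.I)
parts = version.search('release V3.11').groups()
print(parts)                # ('3', '11')
```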

re.match(pattern, string[,flags])
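re.match tries to match pattern starting at the beginning of string (the text to be matched) and keeps matching forward. If it meets a character that cannot be matched, or reaches the end of string before the pattern is complete, it returns None immediately; otherwise it returns a Match object.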

import re

# re.match
# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')

# Match the text with re.match and get the result; None is returned when there is no match
result = re.match(pattern, '192abc')

if result:
    print(result.group())
else:
    print('match failed 1')

result = re.match(pattern, 'abc192')
if result:
    print(result.group())
else:
    print('match failed 2')

Output:
192
match failed 2

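In the code above, when matching the string 192abc, match starts at the beginning of the string, matches 192 and returns a Match object; the captured value can be retrieved with group(). When matching abc192, the beginning of the string does not fit the regular expression, so None is returned immediately.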

re.search(pattern, string[,flags])
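The search function is very similar to match. The difference is that match() only matches at the start of string, while search() scans the whole string looking for a match. match() returns a result only when the match succeeds at the start position of string;
if the match would succeed anywhere other than the start, match() returns None. The object returned by search() has the same methods and attributes as the one returned by match().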

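# re.search()
# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')
# Match the text with re.search and get the result; None is returned when there is no match
result = re.search(pattern, 'abc192edf')
if result:
    print(result.group())
else:
    print('match failed 3')

Output: 192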

re.split(pattern, string[,maxsplit])
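Splits string at every substring that matches pattern and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not given, every match is used as a split point.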

# re.split()
pattern = re.compile(r'\d+')
print(re.split(pattern, 'A1B2C3D4'))

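Output: ['A', 'B', 'C', 'D', '']

The trailing empty string appears because the text ends with a match ('4'), so nothing is left after the last split point.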
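The example above does not use maxsplit, so here is a small sketch of it on the same text. Filtering out empty strings is just one possible way to tidy the result, not something the re module does for you.

```python
import re

pattern = re.compile(r'\d+')
text = 'A1B2C3D4'

# Stop after two splits; the rest of the string is returned unsplit.
limited = pattern.split(text, maxsplit=2)
print(limited)      # ['A', 'B', 'C3D4']

# Drop empty pieces, such as the one produced by the trailing '4'.
non_empty = [part for part in pattern.split(text) if part]
print(non_empty)    # ['A', 'B', 'C', 'D']
```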

re.findall(pattern, string[,flags])
Searches the whole string and returns all matching substrings as a list.

# re.findall()
pattern = re.compile(r'\d+')
print(re.findall(pattern, 'A1B2C3D4'))

Output: ['1', '2', '3', '4']

re.finditer(pattern, string[,flags])
Searches the whole string and returns an iterator over the Match objects of all matches.

# re.finditer()
pattern = re.compile(r'\d+')
matchiter = re.finditer(pattern, 'A1B2C3D4')
for match in matchiter:
    print(match.group())

Output:
1
2
3
4

re.sub(pattern, repl, string[,count])
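Returns the string obtained by replacing every matching substring in string with repl. When repl is a string, it can reference groups with \id, \g<id> or \g<name>; \0 cannot be used for the whole match (write \g<0> instead).
When repl is a function, it should accept a single argument (a Match object) and return the replacement string (group references in the returned string are not expanded). count specifies the maximum number of replacements; if it is not given, all matches are replaced.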

# re.sub()
# Reference groups by name
pattern = re.compile(r'(?P<word1>\w+) (?P<word2>\w+)')
s = 'i say, hello world, one two!'
print(re.sub(pattern, r'\g<word2> \g<word1>', s))
# Reference groups by number
pattern = re.compile(r'(\w+) (\w+)')
print(re.sub(pattern, r'\2 \1', s))


def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()


print(re.sub(pattern, func, s))

Output:
say i, world hello, two one!
say i, world hello, two one!
I Say, Hello World, One Two!
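The count parameter and the \g<id> form are described above but not exercised, so the following is a small sketch of both. The digits pattern and the variable names are made up for illustration.

```python
import re

s = 'i say, hello world, one two!'
pattern = re.compile(r'(\w+) (\w+)')

# count=1: only the first match is replaced.
first_only = re.sub(pattern, r'\2 \1', s, count=1)
print(first_only)   # say i, hello world, one two!

# \g<0> refers to the whole match.
wrapped = re.sub(pattern, r'[\g<0>]', s)
print(wrapped)      # [i say], [hello world], [one two]!

# \g<1>0 means "group 1 followed by the character 0".
digits = re.compile(r'(\d)')
padded = digits.sub(r'\g<1>0', 'a1b2')
print(padded)       # a10b20

# \10 is read as a reference to group 10, which this pattern does not have.
rejected = False
try:
    digits.sub(r'\10', 'a1b2')
except re.error as exc:
    rejected = True
    print('rejected:', exc)
```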

re.subn(pattern, repl, string[,count])
Returns the tuple (result of sub(), number of replacements made).

# re.subn()
s = 'i say, hello world!'
p = re.compile(r'(\w+) (\w+)')
print(p.subn(r'\2 \1', s))


def fund(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(p.subn(fund, s))

Output:
('say i, world hello!', 2)
('I Say, Hello World!', 2)

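The seven functions above search and match through the re module. How do we extract the values they capture? That is what the Match object is for. We have already used its group() method; it has other attributes and methods as well.
Attributes of the Match object:

    string: the text used for matching

    re: the Pattern object used for matching

    pos: the index in the text at which the regular expression starts searching.

    endpos: the index in the text at which the regular expression stops searching.

    lastindex: the index (group number) of the last captured group. None if no group was captured

    lastgroup: the name of the last captured group. None if that group has no name or if no group was captured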

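Methods of the Match object:

    group([group1, ...]): returns the string captured by one or more groups; when several arguments are given, the result is returned as a tuple. group1 can be a number or a name. Number 0 stands for the whole matched substring, and calling it with no arguments returns group(0). A group that captured nothing returns None, and a group that captured several times returns its last capture.

    groups([default]): returns the strings captured by all groups as a tuple. Equivalent to calling group(1, 2, ...). default is the value substituted for groups that captured nothing; it defaults to None

    groupdict([default]): returns a dictionary whose keys are the names of the named groups and whose values are the substrings those groups captured; groups without a name are not included. default is the value substituted for groups that captured nothing; it defaults to None

    start([group]): returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0

    end([group]): returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0

    span([group]): returns (start(group), end(group))

    expand(template): substitutes the matched groups into template and returns the result. template can reference groups with \id, \g<id> or \g<name>, but \0 cannot be used. \id and \g<id> are equivalent, but \10 is read as group 10; to express \1 followed by the character '0', you have to write \g<1>0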

import re
pattern = re.compile(r'(\w+) (\w+) (?P<word>.*)')
match = pattern.match('I love you!')
print(match.string)
print(match.re)
print(match.pos)
print(match.endpos)
print(match.lastindex)
print(match.lastgroup)

print(match.group(1, 2))
print(match.groups())
print(match.groupdict())
print(match.start(2))
print(match.end(2))
print(match.span(2))
print(match.expand(r'\2 \1 \3'))

Output:
I love you!
re.compile('(\\w+) (\\w+) (?P<word>.*)')
0
11
3
word
('I', 'love')
('I', 'love', 'you!')
{'word': 'you!'}
2
6
(2, 6)
love I you!
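The example above does not cover the default argument, groups that capture nothing, or groups that capture more than once. The following sketch fills that in; the pattern with an optional named group is hypothetical and chosen only for illustration.

```python
import re

# Hypothetical pattern: the second, named group is optional.
pattern = re.compile(r'(\w+)(?: (?P<second>\w+))?')
m = pattern.match('hello')

print(m.groups())               # ('hello', None)
print(m.groups('-'))            # ('hello', '-')
print(m.groupdict('n/a'))       # {'second': 'n/a'}
print(m.span('second'))         # (-1, -1): the group did not take part
print(m.lastindex, m.lastgroup) # 1 None

# A group that captures several times keeps only its last capture.
repeated = re.match(r'(\w)+', 'abc')
print(repeated.group(0), repeated.group(1))   # abc c

# In expand(), \g<1>0 means "group 1 followed by the character 0".
pair = re.match(r'(\w+) (\w+)', 'I love')
expanded = pair.expand(r'\g<1>0 \2')
print(expanded)                 # I0 love
```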