Python 21-Day Learning Challenge, Day 2: A Summary of Regular Expressions

Apathfinder

Category: Python Study Notes. Tags: python, regular expressions, learning

Article link: https://blog.csdn.net/kevinlegion/article/details/126144357

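Event: CSDN 21-Day Learning Challenge

About the author: Hi everyone, I'm Apathfinder, currently a university student majoring in software engineering. I use this blog to record what I learn along the way.
Home page: Apathfinder

Column: Python Learning

Preface: This article summarizes regular expressions and introduces some of their features.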

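Contents

Main text

I. Regular expression objects

1. Regex objects and their functions

1. re.compile

2. re.match

3. re.search

4. re.findall

5. re.finditer

6. re.sub

7. re.split

2. Match objects and their usage

1. group(num)

2. groups

II. Regular expression pattern matching (only a few example patterns)

1. Grouping with parentheses

2. Matching multiple alternatives with the pipe

3. Optional matching with the question mark

4. Matching zero or more times with the star

5. Matching one or more times with the plus

6. The wildcard

III. Greedy and non-greedy matching

IV. Regular expression modifiers: optional flags

V. Summary of regular expression patterns

VI. Regular expression character classes

Closing remarks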

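Main text

I. Regular expression objects

1. Regex objects and their functions

1. re.compile

All of Python's regular expression functions live in the re module. Import it with: import re

Passing a string to re.compile() returns a regular expression object. For example, create a Regex object (RegexObject) that matches phone numbers:

phonumregex = re.compile(r'\d{3}-\d{3}-\d{4}')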
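The sketch below shows one way the compiled object might be used: its methods mirror the module-level functions, so the same pattern can be reused without being passed in again. The sample text and the variable names are assumptions made for illustration.

```python
import re

# Compile once, then reuse the same pattern object for several calls.
phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

text = 'call me 333-444-5555 today'  # hypothetical sample text

first = phone_regex.search(text)               # method on the compiled object
same = re.search(r'\d{3}-\d{3}-\d{4}', text)   # module-level equivalent

print(first.group())  # 333-444-5555
print(same.group())   # 333-444-5555
```

Compiling is mostly a convenience when one pattern is applied many times; both forms find the same match.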

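2. re.match

Syntax: re.match(pattern, string, flags=0)

Parameters: pattern is the regular expression to match, string is the string to be searched, and flags controls how matching is performed, for example whether it is case-sensitive or multi-line. (The same applies to the functions below.)

re.match tries to match the pattern at the start of the string. On success it returns a Match object; if the pattern does not match at the start of the string, match() returns None.

import re
print(re.match('hello', 'hello,world!'))

print(re.match('hello', 'hello,world!').span())    # (start, end) indices of the match
print(re.match('hello', 'hello,world!').start())   # start index of the match
print(re.match('hello', 'hello,world!').end())     # end index of the match
print(re.match('hello', 'hello,world!').group())   # the matched string
print(re.match('hello', 'hello,world!').groups())  # tuple of all captured groups (empty here: the pattern has no groups)


print(re.match('world', 'hello,world!'))    # None: 'world' is not at the start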

3. re.search

Syntax: re.search(pattern, string, flags=0)

re.search scans the whole string and returns the first successful match.

import re

print(re.search('world', 'hello,world!'))
print(re.search('hello', 'hello,world!'))
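To make the difference between match and search concrete, here is a minimal sketch on the same sample string. re.fullmatch is not covered in this article; it is included only as an assumption about what a reader might want to compare against, since it requires the pattern to cover the entire string.

```python
import re

text = 'hello,world!'

at_start = re.match('world', text)    # None: 'world' is not at index 0
anywhere = re.search('world', text)   # found further into the string
whole = re.fullmatch('hello', text)   # None: the pattern must cover the whole string

print(at_start)         # None
print(anywhere.span())  # (6, 11)
print(whole)            # None
```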

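4. re.findall

Syntax: re.findall(pattern, string, flags=0) or pattern.findall(string[, pos[, endpos]])

pos: optional, the position in the string where the search starts. Defaults to 0. (Available only on the compiled pattern's method.)

endpos: optional, the position in the string where the search ends. Defaults to the length of the string. (Available only on the compiled pattern's method.)

findall finds every substring that the regular expression matches and returns them as a list. If the pattern contains more than one group, it returns a list of tuples; if nothing matches, it returns an empty list.

Matching phone numbers without groups

import re
phonumregex = re.compile(r'\d{3}-\d{3}-\d{4}')    # no groups

print(phonumregex.findall('call me 333-444-5555 today, call me 888-999-0000 tomorrow!'))    # list of strings

Matching phone numbers with groups

phonumregex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')    # with groups

print(phonumregex.findall('call me 333-444-5555 today, call me 888-999-0000 tomorrow!'))    # list of tuples

No match found

phonumregex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

print(phonumregex.findall('call me 33s-444-5555 today, call me w88-999-0000 tomorrow!'))    # returns an empty list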

5. re.finditer

Syntax: re.finditer(pattern, string, flags=0)

Like findall, finditer finds every substring that the regular expression matches, but it returns them as an iterator of Match objects.

import re
it = re.finditer(r'\d{3}-\d{3}-\d{4}', "yesterday called 999-999-8888,call me 333-444-5555 today, call me 888-999-0000 tomorrow!")
for match in it:
    print(match.group())
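Because finditer yields Match objects rather than plain strings, position information is available for each hit. The following is a small sketch of that idea; the sample text and the name found are made up for illustration.

```python
import re

text = 'call 333-444-5555 today, 888-999-0000 tomorrow'  # hypothetical sample

# Collect each matched number together with the index where it starts.
found = [(m.group(), m.start())
         for m in re.finditer(r'\d{3}-\d{3}-\d{4}', text)]

print(found)  # [('333-444-5555', 5), ('888-999-0000', 25)]
```

With findall the same call would return only the two strings, without their positions.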

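6. re.sub

Syntax: re.sub(pattern, repl, string, count=0, flags=0)

pattern: the regular expression pattern string.
repl: the replacement string; it may also be a function.
string: the original string to search and replace in.
count: the maximum number of replacements to make. The default, 0, replaces every match.
flags: the matching flags used when the pattern is compiled, given as an integer (a combination of the re flag constants).

Regular expressions can not only find text patterns but also replace them with new text.

(1) Basic usage (repl is a string)

import re
hidename = re.compile(r'(wo)?man \w+')
mo1 = hidename.sub('xxx', 'man number_1 give the money to man number_2.')
mo2 = hidename.sub('rich man', 'woman number_1 give the money to woman number_2.')
print(mo1)
print(mo2)


# or

mo1 = re.sub(r'(wo)?man \w+', 'xxx', 'man number_1 give the money to man number_2.')
mo2 = re.sub(r'(wo)?man \w+', 'rich man', 'woman number_1 give the money to woman number_2.')
print(mo1)
print(mo2)

(2) Advanced usage (repl is a function)

# Subtract 2 from each matched digit
def minus(matched):
    value = int(matched.group('value'))
    return str(value - 2)

# Named s rather than str: rebinding str would break the str() call inside minus()
s = 'A2f4g6F8e9'
print(re.sub(r'(?P<value>\d)', minus, s))

As the output shows, every digit in the input string has been reduced by 2.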
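Two further features of re.sub may be worth a quick sketch: a backreference such as \1 inside repl, which re-inserts the text captured by a group, and the count parameter described above. The sample sentence and the masking format are assumptions chosen for illustration.

```python
import re

text = 'man number_1 gives the money to man number_2.'  # hypothetical sample

# \1 in repl stands for whatever the first group captured (here, one letter).
masked = re.sub(r'man (\w)\w*', r'man \1***', text)

# count=1 replaces only the first match and leaves the rest untouched.
once = re.sub(r'man \w+', 'xxx', text, count=1)

print(masked)  # man n*** gives the money to man n***.
print(once)    # xxx gives the money to man number_2.
```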

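7. re.split

The split method splits the string at every substring the pattern matches and returns the pieces as a list.

Syntax: re.split(pattern, string[, maxsplit=0, flags=0])

maxsplit: the number of splits to perform. maxsplit=1 splits once; the default, 0, places no limit on the number of splits.

import re
mo1 = re.split(r'\W+', 'abc, def, ghi')
print(mo1)
mo2 = re.split(r'(\W+)', 'abc, def, ghi ')    # with a capturing group, the separators are returned as well
print(mo2)
mo3 = re.split(r'\W+', 'abc, def, ghi', 1)    # split only once
print(mo3)
mo4 = re.split(r'lmn*', 'abc, def, ghi')  # the pattern matches nowhere, so the string is returned unsplit
print(mo4)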
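A common use of re.split is splitting on several different delimiters at once, which str.split cannot do in one call. The sketch below assumes a made-up input line and delimiter set; it also shows that a separator at the very start or end of the string produces empty strings, which can be filtered out.

```python
import re

line = 'apple, banana;cherry  date'  # hypothetical input

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r'[,;\s]+', line)
print(parts)  # ['apple', 'banana', 'cherry', 'date']

# Leading or trailing separators yield empty strings; drop them if unwanted.
raw = re.split(r'[,;\s]+', ' apple, banana ')
cleaned = [p for p in raw if p]
print(raw)      # ['', 'apple', 'banana', '']
print(cleaned)  # ['apple', 'banana']
```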

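2. Match objects and their usage

The search() method of a Regex object scans the string passed to it for the first place where the regular expression matches. It returns None if nothing is found, and a Match object (MatchObject) if a match is found.

We can use the Match object's group(num) or groups() methods to retrieve the matched text.

1. group(num)

import re
mo = re.search(r'(\d{3})-(\d{3}-\d{4})', 'call me at 711-888-8888 today')
print(mo.group())     # the whole match
print(mo.group(0))    # same as group()
print(mo.group(1))    # first group
print(mo.group(2))    # second group

Called without an argument, group() uses group number 0, which is the entire match.

2. groups

Returns a tuple containing the strings of all groups, from group 1 up to the last group in the pattern.

mo = re.search(r'(\d{3})-(\d{3}-\d{4})', 'call me at 711-888-8888 today,')
print(mo.groups())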
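Groups can also be given names with the (?P<name>...) syntax, the same syntax used in the re.sub function example earlier. The sketch below applies it to the phone-number pattern; the group names area and number are assumptions chosen for illustration.

```python
import re

mo = re.search(r'(?P<area>\d{3})-(?P<number>\d{3}-\d{4})',
               'call me at 711-888-8888 today')

print(mo.group('area'))  # 711
print(mo.groups())       # ('711', '888-8888')
print(mo.groupdict())    # {'area': '711', 'number': '888-8888'}
```

Named groups are still numbered, so mo.group(1) and mo.group('area') return the same text.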

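II. Regular expression pattern matching (only a few example patterns)

1. Grouping with parentheses

Suppose you want to separate the area code from the rest of a phone number. You can use parentheses to create groups, (\d{3})-(\d{3}-\d{4}), and then use the Match object's group method to retrieve the text of a single group.

import re
phonumregex = re.compile(r'(\d{3})-(\d{3}-\d{4})')
mo = phonumregex.search('call me 333-444-5555 today')
print(mo.group(1))    # area code
print(mo.group(2))    # rest of the number
print(mo.group())     # whole match
print(mo.group(0))    # whole match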

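2. Matching multiple alternatives with the pipe

The character '|' is called the pipe. Use it when you want to match either one of two phone numbers.

When both phone numbers appear in the string, search returns the one that occurs first. If only one of them appears, that one is returned. If neither is found, search returns None.

import re
phonumregex = re.compile(r'666-666-6666|888-888-8888')
mo = phonumregex.search('call me666-666-6666 or 888-888-8888today')
print(mo.group())

mo = phonumregex.search('call me 888-888-8888 or 666-666-6666today')
print(mo.group())


mo = phonumregex.search('call me 7ss-888-8888 or 666-666-6666today')
print(mo.group())

mo = phonumregex.search('call me 7ss-888-8888 or s66-666-6666today')
print(mo)    # None: neither number matches, so calling mo.group() here would raise AttributeError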

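3. Optional matching with the question mark

Sometimes the pattern you want to match is optional, meaning the match should succeed whether or not that piece of text is present. The character '?' marks the group before it as optional.

import re
moneyregex = re.compile(r'i have (apple)?juice')
mo = moneyregex.search('i have applejuice')
print(mo.group())

mo = moneyregex.search('i have juice')
print(mo.group())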

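4. Matching zero or more times with the star

The group before the star * may occur any number of times in the text: it may be absent, or it may be repeated n times.

import re
mo1 = re.search(r'that (wo)*man', 'that man is my hero')
mo2 = re.search(r'that (wo)*man', 'that wowoman is my hero')
print(mo1.group())
print(mo2.group())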

5. Matching one or more times with the plus

The group before the plus + must occur at least once in the text.

import re
mo1 = re.search(r'that (wo)+man', 'that man is my hero')    # no 'wo' present, so this is None
mo2 = re.search(r'that (wo)+man', 'that wowoman is my hero')

if mo1 is None:
    print("you are right")
print(mo2.group())

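6. The wildcard

In a regular expression, the period character '.' is called the wildcard. It matches any character except a newline.

import re
mo = re.findall(r'.at', 'the cat in the hat sat on the apple the same as the cat in the hat')
print(mo)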

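III. Greedy and non-greedy matching

Python's regular expressions are greedy by default: when a match is ambiguous, they match the longest string possible. A non-greedy match takes the shortest string possible. (To make a match non-greedy, add the character '?' after the closing curly brace in the regular expression.)

import re

greedyregex1 = re.compile(r'(hello){2,3}')  # greedy
greedyregex2 = re.compile(r'(hello){2,3}?') # non-greedy
mo1 = greedyregex1.search('hellohellohello')
mo2 = greedyregex2.search('hellohellohello')
print(mo1.group())    # hellohellohello
print(mo2.group())    # hellohello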
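The same '?' suffix also works after the other quantifiers (*, + and ?), not only after curly braces. A minimal sketch, assuming a made-up tag-like string:

```python
import re

text = '<b>bold</b> and <i>italic</i>'  # hypothetical sample

# Greedy: .+ runs as far as it can, up to the last '>' in the string.
greedy = re.search(r'<.+>', text).group()

# Non-greedy: .+? stops at the first '>' that lets the match succeed.
lazy = re.search(r'<.+?>', text).group()

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
```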

IV. Regular expression modifiers: optional flags
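The flags argument mentioned in the function signatures above takes constants defined in the re module. As a rough sketch, here are three commonly used ones, re.I (ignore case), re.M (multi-line) and re.S (dot matches newline); the sample text is an assumption, and this is not a complete list of flags.

```python
import re

text = 'Hello\nworld'  # hypothetical two-line sample

# re.I (re.IGNORECASE): letters match regardless of case.
ignore_case = re.findall('hello', text, re.I)

# re.M (re.MULTILINE): ^ and $ also match at the start and end of each line.
first_words_default = re.findall(r'^\w+', text)
first_words_multiline = re.findall(r'^\w+', text, re.M)

# re.S (re.DOTALL): '.' also matches a newline.
dot_default = re.search(r'Hello.world', text)
dot_all = re.search(r'Hello.world', text, re.S)

print(ignore_case)            # ['Hello']
print(first_words_default)    # ['Hello']
print(first_words_multiline)  # ['Hello', 'world']
print(dot_default)            # None
print(dot_all.group() == text)  # True
```

Several flags can be combined with the | operator, for example re.I | re.M.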

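V. Summary of regular expression patterns

Regular expression patterns
Pattern	Description
^	Matches the start of the string.
$	Matches the end of the string.
.	Matches any character except a newline. When the re.DOTALL flag is specified, it matches any character, including a newline.
[...]	A set of characters, listed individually: [amk] matches 'a', 'm', or 'k'.
[^...]	Any character not in the set: [^abc] matches any character other than a, b, and c.
re*	Matches 0 or more repetitions of the preceding expression.
re+	Matches 1 or more repetitions of the preceding expression.
re?	Matches 0 or 1 occurrence of the preceding expression (greedy).
re{n}	Matches exactly n repetitions of the preceding expression. For example, "o{2}" cannot match the "o" in "Bob", but it matches the two o's in "food".
re{n,}	Matches n or more repetitions of the preceding expression. For example, "o{2,}" cannot match the "o" in "Bob", but it matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".
re{n,m}	Matches from n to m repetitions of the preceding expression (greedy).
a|b	Matches a or b.
(re)	Matches the expression inside the parentheses and also defines a group.
(?imx)	Turns on the optional flags i, m, or x inline. In Python, a flag group written this way applies to the whole pattern and must be placed at its start.
(?-imx)	Turns off the optional flags i, m, or x. Python's re module accepts this only in the scoped form (?-imx:re).
(?:re)	Like (...), but does not define a group.
(?imx:re)	Turns on the optional flags i, m, or x inside the parentheses only.
(?-imx:re)	Turns off the optional flags i, m, or x inside the parentheses only.
(?#...)	A comment.
(?=re)	Positive lookahead assertion. It succeeds if the contained expression matches at the current position and fails otherwise. The assertion consumes no characters: once it has been tested, the rest of the pattern is still matched from the same position.
(?!re)	Negative lookahead assertion. The opposite of the positive assertion: it succeeds when the contained expression does not match at the current position.
(?>re)	An independent (atomic) group, matched without backtracking into it. Python's re module supports it from version 3.11.
\w	Matches a letter, digit, or underscore.
\W	Matches any character that is not a letter, digit, or underscore.
\s	Matches any whitespace character; equivalent to [ \t\n\r\f\v].
\S	Matches any non-whitespace character.
\d	Matches any digit; equivalent to [0-9].
\D	Matches any non-digit.
\A	Matches the start of the string.
\Z	Matches the end of the string. In Python's re module it matches only at the very end; it is $ that also matches just before a trailing newline.
\z	Matches the end of the string. Older versions of Python's re module do not accept \z; use \Z there.
\G	Matches the position where the previous match ended. Python's re module does not support it.
\b	Matches a word boundary, that is, the position between a word character and a non-word character. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B	Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\n, \t, etc.	Match a newline, a tab, and so on.
\1...\9	Matches the content of the n-th group.
\10	Matches the content of group 10 if that group exists. Some regex flavors otherwise read it as an octal character code.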
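A few of the less familiar entries in the table, the lookahead assertions and the numbered backreferences, are easier to follow with a short example. The sketch below is illustrative only; the sample strings and variable names are assumptions.

```python
import re

# (?=...) positive lookahead: require '@' to follow, but do not consume it.
users = re.findall(r'\w+(?=@)', 'alice@example.com bob@example.org')

# (?!...) negative lookahead: whole numbers that are not followed by '%'.
plain_numbers = re.findall(r'\b\d+\b(?!%)', '50% of 200 items')

# \1 backreference: the same word repeated, separated by one space.
repeated = re.search(r'\b(\w+) \1\b', 'this is is a test').group()

print(users)          # ['alice', 'bob']
print(plain_numbers)  # ['200']
print(repeated)       # is is
```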

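VI. Regular expression character classes

Regular expression character classes
Example	Matches
python	Matches "python".
[Pp]ython	Matches "Python" or "python".
rub[ye]	Matches "ruby" or "rube".
[aeiou]	Matches any one of the letters inside the brackets.
[0-9]	Matches any digit. Same as [0123456789].
[a-z]	Matches any lowercase letter.
[A-Z]	Matches any uppercase letter.
[a-zA-Z0-9]	Matches any letter or digit.
[^aeiou]	Matches any character other than the letters a, e, i, o, u.
[^0-9]	Matches any character other than a digit.
.	Matches any single character except "\n". To match any character including '\n', use the re.DOTALL flag or a pattern such as '[\s\S]'. (Inside brackets a period is a literal period, so '[.\n]' matches only a period or a newline.)
\d	Matches a digit character. Equivalent to [0-9].
\D	Matches a non-digit character. Equivalent to [^0-9].
\s	Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S	Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\w	Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]' for ASCII text.
\W	Matches any non-word character. Equivalent to '[^A-Za-z0-9_]' for ASCII text.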
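To see several of these classes in action, here is a short sketch using re.findall. The sample strings are assumptions made for illustration, and the last two lines check how a period behaves inside square brackets.

```python
import re

text = 'Python 3.11, ruby_2'  # hypothetical sample

either_case = re.findall(r'[Pp]ython', 'Python and python')
digits = re.findall(r'\d', text)
words = re.findall(r'\w+', text)
non_digits = re.findall(r'[^0-9]', 'a1b2')

# Inside [...] a period is literal, so this finds only '.' and '\n'.
dot_in_class = re.findall(r'[.\n]', 'a.\nb')
# [\s\S] matches any character at all, including a newline.
any_char = re.findall(r'[\s\S]', 'a\nb')

print(either_case)   # ['Python', 'python']
print(digits)        # ['3', '1', '1', '2']
print(words)         # ['Python', '3', '11', 'ruby_2']
print(non_digits)    # ['a', 'b']
print(dot_in_class)  # ['.', '\n']
print(any_char)      # ['a', '\n', 'b']
```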