Regular Expressions in Python: The re Module, with Worked Examples and Tips

Article link: https://blog.csdn.net/m0_62413155/article/details/136353616

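I. Regular Expressions

1. What a regular expression is:

A regular expression is a pattern, built from literal characters and special symbols, that describes a set of strings.

2. Regular expression syntax:

Patterns are built from metacharacters and special sequences, and are used to match, search for, and replace text within a string.

3. Purpose:

Matching, searching for, and replacing string patterns.

4. Advantages:

A single pattern expresses text-processing logic compactly, and matching is usually fast because CPython's re engine is implemented in C. Patterns that backtrack heavily can still be slow.

5. Disadvantages:

Readability is poor; complex patterns are hard to read and maintain.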

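II. Flags, Metacharacters, and Special Sequences

① r : the raw-string prefix, as in r"\d". Python leaves the backslashes in the literal untouched, so they reach the regex engine as written.

② . : matches any single character except a newline (any character at all when the re.S flag is set).

③ \d : matches a digit.

④ \D : matches a non-digit.

⑤ \s : matches a whitespace character (space, tab, newline, and so on).

⑥ \S : matches a non-whitespace character.

⑦ \w : matches a word character: a letter, a digit, or an underscore.

⑧ \W : matches any character that is not a letter, digit, or underscore.

⑨ re.I : flag for case-insensitive matching.

⑩ re.M : multiline flag. With it, ^ and $ also match at the start and end of each line. It does not ignore whitespace.

⑪ \b : matches a word boundary, that is, the zero-width position at the start or end of a word. It does not consume whitespace. For example, r'\bhello\b' matches the standalone word "hello".

⑫ \B : matches a position that is not a word boundary.

⑬ ^ : matches the start of the string (and the start of each line with re.M).

⑭ $ : matches the end of the string, or just before a trailing newline (and the end of each line with re.M).

⑮ [abcdefg] : matches exactly one character from the set.

⑯ [^abcdefg] : matches one character that is not in the set.

⑰ () : grouping (a capturing group).

⑱ | : alternation ("or"), as in a|b. Inside square brackets | is a literal character, so [a|b] matches "a", "|", or "b".

⑲ \1, \2 : backreferences. \1 matches the same text that the first capturing group matched, \2 the second, and so on.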
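The entries above can be checked directly in the interpreter. The following is a minimal sketch; the sample strings are made up for illustration and are not from the original article.

```python
import re

text = "Hello World\nhello python_3"

# \d and \w select character categories.
digits = re.findall(r"\d", text)    # ['3']
words = re.findall(r"\w+", text)    # ['Hello', 'World', 'hello', 'python_3']

# re.I makes the match case-insensitive.
hellos = re.findall(r"hello", text, flags=re.I)    # ['Hello', 'hello']

# Without re.M, ^ anchors only at the start of the whole string;
# with re.M it also anchors right after each newline.
first_only = re.findall(r"^\w+", text)              # ['Hello']
each_line = re.findall(r"^\w+", text, flags=re.M)   # ['Hello', 'hello']

# \b is a zero-width word boundary, so the "cat" inside "concatenate" is skipped.
cats = re.findall(r"\bcat\b", "cat concatenate cat.")   # ['cat', 'cat']

print(digits, words, hellos, first_only, each_line, cats)
```

Note that re.M changes only where ^ and $ may match; it has no effect on how whitespace is treated.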

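Repetition

① * : zero or more occurrences.

② + : one or more occurrences.

③ ? : zero or one occurrence.

④ .* : greedy by default, that is, it matches as much as possible.

⑤ .*? : non-greedy (lazy), that is, it matches as little as possible.

⑥ {n} : exactly n occurrences.

⑦ {m,n} : between m and n occurrences, inclusive. Write it without a space; with a space it is not read as a quantifier.

⑧ .*?\b : lazily matches up to the nearest word boundary.

⑨ ^...$ : anchors the pattern at both ends, so it has to match the whole string.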
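The difference between greedy and lazy quantifiers is easiest to see side by side. This is an illustrative sketch with made-up input strings, not part of the original article.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last ">" in the string, giving one long match.
greedy = re.findall(r"<.*>", html)    # ['<b>bold</b> and <i>italic</i>']

# Lazy: .*? stops at the first ">" it can, giving one match per tag.
lazy = re.findall(r"<.*?>", html)     # ['<b>', '</b>', '<i>', '</i>']

# ? makes the preceding item optional.
optional = re.findall(r"colou?r", "color colour")   # ['color', 'colour']

# {n} is exact, {m,n} is a range (greedy, so it prefers the upper bound).
exact = re.findall(r"\d{3}", "12345")      # ['123']
bounded = re.findall(r"\d{2,3}", "12345")  # ['123', '45']

print(greedy, lazy, optional, exact, bounded)
```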

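III. Common Functions in the re Module

① match : matches from the start of the string. It returns a Match object on success and None on failure.

Note: call group() on the Match object to get the matched text.

import re  # the later snippets assume this import as well

r = re.match(r".\w{8}", "#Hello_word", re.I)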
print(type(r), r)
if r:
    print(r.group())
# <class 're.Match'> <re.Match object; span=(0, 9), match='#Hello_wo'>
# #Hello_wo

② fullmatch : the whole string, from start to end, has to match the pattern. It returns a Match object or None.

r = re.fullmatch(r"\d\w\d{3}", "2d456")
print(r, type(r))
if r:
    print(r.group())
# <re.Match object; span=(0, 5), match='2d456'> <class 're.Match'>
# 2d456

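③ search : scans the whole string and returns a Match object for the first hit, or None if there is none.

# search: scan the whole string; return a Match for the first hit, or None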
r = re.search(r"\d", "a2b3c")
print(r, type(r))
if r:
    print(r.group())
# <re.Match object; span=(1, 2), match='2'> <class 're.Match'>
# 2
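Since match, fullmatch, and search differ only in where the pattern is allowed to match, running all three on the same input makes the distinction concrete. A small sketch with an assumed sample string:

```python
import re

s = "abc123"
pattern = r"\d+"

at_start = re.match(pattern, s)        # None: s does not start with a digit
anywhere = re.search(pattern, s)       # finds '123' at span (3, 6)
whole = re.fullmatch(pattern, s)       # None: "abc" is not covered by \d+
whole_ok = re.fullmatch(pattern, "123")  # matches: the entire string is digits

print(at_start, anywhere, whole, whole_ok)
```

All three return None on failure, so the result should be tested before calling group().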

④ findall : scans the whole string and returns all non-overlapping matches as a list.

r = re.findall(r"\s\S", "he1l2lo wor\tld\n")
print(type(r), r)
# <class 'list'> [' w', '\tl']

r = re.findall(r"a*b", "aaabcccabc")
print(r, type(r))
# ['aaab', 'ab'] <class 'list'>

⑤ finditer : finds all matches and returns an iterator. Each element is a Match object, so group() can be called on it.

# finditer: find all matches; returns an iterator of Match objects
r = re.finditer(r"\d", "1a2b3c")
print(r, type(r))
for e in r:
    print(e, e.group())
    
# <callable_iterator object at 0x0000021913CBCEE0> <class 'callable_iterator'>
# <re.Match object; span=(0, 1), match='1'> 1
# <re.Match object; span=(2, 3), match='2'> 2
# <re.Match object; span=(4, 5), match='3'> 3
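Because finditer yields Match objects rather than plain strings, it is the natural choice when the position of each hit matters. A brief sketch, using a made-up sentence:

```python
import re

text = "order 66, aisle 7, shelf 123"

# Collect the matched text together with its start and end offsets.
hits = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

for value, start, end in hits:
    # Slicing the original string with the offsets gives back the match.
    print(value, start, end, text[start:end] == value)
```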

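⑥ split : splits the string at each match and returns a list. The maximum number of splits can be set.

# split: returns a list; maxsplit limits the number of splits
r = re.split(r"\d", "a2b3c4", maxsplit=2)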
print(r, type(r))
# ['a', 'b', 'c4'] <class 'list'>
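One detail of re.split that is easy to miss: if the pattern contains a capturing group, the text matched by the group is kept in the result. A short sketch with an assumed input string:

```python
import re

line = "apple, banana;cherry"

# Plain pattern: the separators are dropped.
parts = re.split(r"[,;]\s*", line)          # ['apple', 'banana', 'cherry']

# Capturing group: each separator is returned between the pieces.
with_seps = re.split(r"([,;])\s*", line)    # ['apple', ',', 'banana', ';', 'cherry']

print(parts, with_seps)
```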

⑦ sub : replaces matches and returns a str.

# sub: returns a str; count limits the number of replacements
r = re.sub(r"\d", "+", "a2b3c4", count=2)
print(r, type(r))
# a+b+c4 <class 'str'>

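⑧ subn : replaces matches and returns a tuple (new string, number of replacements). The maximum number of replacements can be set.

# subn: returns (new string, number of replacements); count limits the replacements
r = re.subn(r"\d", "+", "a2b3c4", count=2)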
print(r, type(r))
# ('a+b+c4', 2) <class 'tuple'>
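The replacement argument of sub and subn may also be a function that receives each Match object and returns the replacement text. A minimal sketch; the helper name double is hypothetical and chosen only for this example.

```python
import re

def double(match: re.Match) -> str:
    """Return the matched number multiplied by two, as text."""
    return str(int(match.group()) * 2)

result = re.sub(r"\d+", double, "3 apples and 10 pears")
print(result)  # 6 apples and 20 pears

# subn accepts the same callable and also reports how many replacements were made.
new_text, n = re.subn(r"\d+", double, "3 apples and 10 pears")
print(new_text, n)
```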

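IV. The re Module in Practice

① Matching an email address:

# Match an email address (a simplified pattern, not a full RFC-compliant check)
email = "example@example.com"
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'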
if re.match(pattern, email):
    print("Valid email address")
else:
    print("Invalid email address")
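Because re.match anchors only at the start, a string with extra text after a valid address still passes the check above. One possible tightening, sketched here under the assumption that the same simplified pattern is acceptable, is to compile the pattern once and use fullmatch. The names EMAIL_RE and is_valid_email are hypothetical.

```python
import re

# Simplified pattern for illustration; real-world address validation is looser
# and stricter in ways this pattern does not capture.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def is_valid_email(address: str) -> bool:
    """Return True only if the entire string looks like an email address."""
    return EMAIL_RE.fullmatch(address) is not None

print(is_valid_email("example@example.com"))           # True
print(is_valid_email("example@example.com and more"))  # False
print(is_valid_email("no-at-sign.example.com"))        # False

# For comparison, match() accepts the string with trailing text.
print(EMAIL_RE.match("example@example.com and more") is not None)  # True
```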

② Extracting the text from HTML tags:

html = "<p>Hello, <strong>world</strong></p>"
pattern = r'<[^>]+>'  # matches one tag: "<", then one or more characters other than ">", then ">"
result = re.sub(pattern, '', html)  # remove every HTML tag
print(result)  # Output: Hello, world
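To pull out the content of one specific tag instead of stripping every tag, a capturing group with a lazy quantifier can be used. This is only a sketch for simple, well-formed snippets; regular expressions are not a general HTML parser.

```python
import re

html = "<p>Hello, <strong>world</strong></p>"

# (.*?) captures as little as possible between the opening and closing tag.
strong_text = re.findall(r"<strong>(.*?)</strong>", html)   # ['world']
paragraph = re.findall(r"<p>(.*?)</p>", html)               # ['Hello, <strong>world</strong>']

print(strong_text, paragraph)
```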

③ Finding a specific word:

text = "Python is a popular programming language. I love Python!"
pattern = r'\bPython\b'
matches = re.findall(pattern, text)
print(matches)  # Output: ['Python', 'Python']

④ Rewriting date formats in text:

# Rewrite the date format in the text
# \b: a word boundary, so the date cannot be glued to adjacent letters, digits, or underscores.
# (\d{4}): the parentheses form a capturing group; \d{4} matches four digits, the year.
text = "Today's date is 2024-02-28. Tomorrow will be 2024-03-01."
pattern = r'\b(\d{4})-(\d{2})-(\d{2})\b'
replacement = r'\2/\3/\1'
result = re.sub(pattern, replacement, text)
print(result)  # Output: Today's date is 02/28/2024. Tomorrow will be 03/01/2024.
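Numbered references such as \2/\3/\1 become hard to follow as patterns grow. Named groups are an alternative; the sketch below rewrites the same date pattern with the group names year, month, and day, which are chosen here for illustration.

```python
import re

text = "Today's date is 2024-02-28."
pattern = r"\b(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\b"

m = re.search(pattern, text)
if m:
    print(m.group("year"))   # 2024
    print(m.groupdict())     # {'year': '2024', 'month': '02', 'day': '28'}

# In the replacement string, \g<name> refers to a named group.
result = re.sub(pattern, r"\g<month>/\g<day>/\g<year>", text)
print(result)  # Today's date is 02/28/2024.
```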

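⑤ The * quantifier also produces empty matches:

# \d* matches zero or more digits, so wherever no digit starts it matches the empty string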
r = re.findall(r"\d*", "1a2b3c")
print(r, type(r))
# ['1', '', '2', '', '3', '', ''] <class 'list'>

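⑥ The same effect with a literal character:

# At each position a* matches a run of a's if there is one; otherwise it matches the empty string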
r = re.findall(r"a*", "aaa1a2b3ca")
print(r, type(r))
# ['aaa', '', 'a', '', '', '', '', 'a', ''] <class 'list'>

⑦ An optional prefix before a required character:

# a is optional (zero or more), but b is required
r = re.findall(r"a*b", "aaa1a2b3cab")
print(r, type(r))
# ['b', 'ab'] <class 'list'>

⑧ Greedy .* takes the whole string in one match, followed by an empty match at the end:

r = re.findall(r".*", "aaa1a2b3cab")
print(r, type(r))
# ['aaa1a2b3cab', ''] <class 'list'>

⑨ Lazy .*? on its own alternates between empty matches and single characters:

r = re.findall(r".*?", "aaa1a2b3cab")
print(r, type(r))
# ['', 'a', '', 'a', '', 'a', '', '1', '', 'a', '', '2', '', 'b', '', '3', '', 'c', '', 'a', '', 'b', ''] <class 'list'>

⑩ Bounded repetition with {m,n}:

r = re.findall(r"\d{2,4}", "12345678910")  # note: no space inside {2,4}
print(r, type(r))
# ['1234', '5678', '910'] <class 'list'>

⑪ Between the ^ and $ anchors, even a lazy .*? has to stretch to the end of the string:

r = re.search(r"^a.*?d$", "abababababd")
print(r, type(r))
if r:
    print(r.group())
# <re.Match object; span=(0, 11), match='abababababd'> <class 're.Match'>
# abababababd

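⑫ Lazy matching up to the next whitespace character:

# .*? is lazy: it matches as few characters as possible before the next whitespace character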
r = re.findall(r".*?\s", "hello world i am chinese")
print(r, type(r))
# ['hello ', 'world ', 'i ', 'am '] <class 'list'>

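⑬ re.M does not change this pattern:

# re.M only changes where ^ and $ match; it does not ignore whitespace,
# so the result here is the same with or without the flag
r = re.findall(r".*?\s", "hello world i\n am chinese", re.M)
print(r, type(r))
# ['hello ', 'world ', 'i\n', ' ', 'am '] <class 'list'>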

⑭ Character set:

r = re.findall(r"[abcd]", "abcdefg")
print(r)  # ['a', 'b', 'c', 'd']

⑮ Negated character set:

r = re.findall(r"[^abcd]", "abcdefg")
print(r)  # ['e', 'f', 'g']

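⑯ Finding repeated characters with a backreference:

# \1 matches the same text that group 1 just matched
pattern = r"(.)\1+"
text = "Helloooo Worlddd!!"
matches = re.findall(pattern, text)
print(matches)  # ['l', 'o', 'd', '!']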

⑰ findall with several groups:

# When the pattern contains groups, findall returns only the groups' contents
r = re.findall(r"(\d)(\d)a\1\2", "12a12")
print(r)  # [('1', '2')]

⑱ Alternation inside a group, combined with a backreference:

r = re.findall(r"(\d\d|ab)cd\1", "13cd13-12cd24-ab-acb")
print(r)    # ['13']
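When the full matched text is wanted even though the pattern contains groups, two common options are to read group(0) from finditer, or to use a non-capturing group (?:...) where no backreference is needed. A short sketch reusing the same sample string:

```python
import re

text = "13cd13-12cd24-ab-acb"

# group(0) is always the whole match, regardless of capturing groups.
full = [m.group(0) for m in re.finditer(r"(\d\d|ab)cd\1", text)]   # ['13cd13']

# (?:...) groups without capturing, so findall returns the whole match.
non_capturing = re.findall(r"(?:\d\d|ab)cd", text)                 # ['13cd', '12cd']

print(full, non_capturing)
```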