Introduction to Python Modules: Regular Expressions with re

CL.LIANG

Last modified 2023-08-18 01:25:54

First published 2023-08-16 05:51:54

Article link: https://blog.csdn.net/Liang_Cailei/article/details/132310322

Author: LIANG, Southampton, UK.

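When it comes to processing text data, regular expressions (usually abbreviated "regex" or "re") are a very powerful tool for pattern matching, searching, and replacing within text. Python ships with a built-in re module that provides functions and classes for working with regular expressions.

In particular, regular expressions are very handy when extracting data from text files and converting it to floating-point numbers. The commonly used syntax and functions are summarized below.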
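
As a first taste of the use case just mentioned, here is a minimal sketch that pulls numbers out of a line of text and converts them to float. The sample line and the pattern are assumptions made for illustration, and the pattern uses non-capturing groups (?:...), a feature of the re module not otherwise covered in this article.

```python
import re

# Hypothetical log line; its format is an assumption for illustration.
line = "step=12 loss=0.0345 lr=1e-3 temp=-4.5"

# Optional sign, digits with an optional fraction, optional exponent.
# (?:...) groups without capturing, so findall returns whole matches.
float_pattern = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"

values = [float(token) for token in re.findall(float_pattern, line)]
print(values)  # [12.0, 0.0345, 0.001, -4.5]
```

This pattern is deliberately simple: it does not accept forms such as ".5" or "5.", so it may need adjusting for real data.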

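1. Character matching

1.1 Special characters

.: matches any single character except a newline.
\d: matches any digit character.
\w: matches any letter, digit, or underscore.
\s: matches any whitespace character (space, tab, newline, and so on).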
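
The special characters above can be tried out with a short sketch like the following. The sample string is made up for illustration.

```python
import re

# Sample string containing a letter, digits, a space, an underscore and a tab.
text = "a1 _b\t2"

digits = re.findall(r"\d", text)
word_chars = re.findall(r"\w", text)
spaces = re.findall(r"\s", text)
a_plus_any = re.findall(r"a.", text)

print(digits)      # ['1', '2']
print(word_chars)  # ['a', '1', '_', 'b', '2']
print(spaces)      # [' ', '\t']
print(a_plus_any)  # ['a1']
```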

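1.2 Repetition

*: matches the preceding element (a character, a character class, or a group) 0 or more times.
+: matches the preceding element 1 or more times.
?: matches the preceding element 0 or 1 time.
{n}: matches the preceding element exactly n times.
{n,m}: matches the preceding element at least n and at most m times. Do not put a space after the comma: Python treats {n, m} as literal text, not as a quantifier.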
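
A small sketch of the quantifiers listed above, using made-up sample strings. Note that quantifiers are greedy by default, which is why the last two patterns take as many digits as they are allowed to.

```python
import re

star = re.findall(r"ab*", "a ab abbb")              # b repeated 0 or more times
plus = re.findall(r"ab+", "a ab abbb")              # b repeated 1 or more times
optional = re.findall(r"colou?r", "color colour")   # u is optional
exact = re.findall(r"\d{3}", "12 345 6789")         # exactly 3 digits
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")   # 2 to 3 digits

print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['color', 'colour']
print(exact)     # ['345', '678']
print(bounded)   # ['22', '333', '444']
```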

1.3 Character classes

[abc]: matches any one of a, b, or c.
[^abc]: matches any single character other than a, b, or c.

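1.4 Boundaries

^: matches the start of the string.
$: matches the end of the string (or just before a newline at the very end).
\b: matches a word boundary.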
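
The boundary markers above can be illustrated with a short sketch. The sample text is an assumption chosen so that "cat" appears as a whole word, inside another word, and as a prefix.

```python
import re

text = "cat concat cats"

whole_word = re.findall(r"\bcat\b", text)  # "cat" as a standalone word
word_start = re.findall(r"\bcat", text)    # "cat" at the start of a word
at_start = re.findall(r"^cat", text)       # "cat" at the start of the string
at_end = re.findall(r"cats$", text)        # "cats" at the end of the string

print(whole_word)  # ['cat']
print(word_start)  # ['cat', 'cat']
print(at_start)    # ['cat']
print(at_end)      # ['cats']
```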

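1.5 Grouping

(): Parentheses create a capture group. The pattern still has to match as a whole, but the text matched by each group is captured separately and can be retrieved on its own. This is easier to see with an example: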

import re

pattern = r"(\w+)\s(\d+)"  # a word, one whitespace character, then a number; each part is captured
text = "There are 3 apples and 5 bananas."
matches = re.findall(pattern, text)
print(matches)  # Output: [('are', '3'), ('and', '5')]
# **********************************************************
pattern = r"\w+\s\w+\s\d+"  # two words followed by a number, without groups
text = "There are 3 apples and 5 bananas."
matches = re.findall(pattern, text)
print(matches)  # Output: ['There are 3', 'apples and 5']

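The example shows the difference between using groups and not using them: with groups, re.findall returns one tuple per match containing only the captured pieces; without groups, it returns each full match as a single string.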
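
Groups can also be read from a single match object. The following minimal sketch reuses the grouped pattern and sentence from the example above with re.search, which returns only the first match.

```python
import re

match = re.search(r"(\w+)\s(\d+)", "There are 3 apples and 5 bananas.")

if match:
    whole = match.group(0)   # the full match
    word = match.group(1)    # first group: the word
    number = match.group(2)  # second group: the number
    print(whole)             # are 3
    print(word)              # are
    print(number)            # 3
    print(match.groups())    # ('are', '3')
```

Group 0 is always the full match, and the numbered groups follow the order of their opening parentheses.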

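1.6 More on character classes

In Python's re module, square brackets [] create a character class, which matches any one of the characters it contains. A character class can list individual characters or character ranges to specify the set of allowed characters.

Matching single characters: list individual characters in the class, and it matches any one of them.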

import re

pattern = r"[aeiou]"  # any one vowel
text = "apple"
matches = re.findall(pattern, text)
print(matches)  # Output: ['a', 'e']

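Matching a character range: use a hyphen - to specify a range of characters.

import re

pattern = r"[a-z]"  # any lowercase letter; re.IGNORECASE below also admits uppercase
text = "Hello"
matches = re.findall(pattern, text, re.IGNORECASE)
print(matches)  # Output: ['H', 'e', 'l', 'l', 'o']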

Excluding characters: put ^ at the start of the class to match anything except the listed characters.

import re

pattern = r"[^0-9]"  # any one non-digit character
text = "abc123"
matches = re.findall(pattern, text)
print(matches)  # Output: ['a', 'b', 'c']

Combining ranges: several ranges can be placed in one class to build a more complex match.

import re

pattern = r"[A-Za-z0-9]"  # any one letter or digit
text = "Hello123"
matches = re.findall(pattern, text)
print(matches)  # Output: ['H', 'e', 'l', 'l', 'o', '1', '2', '3']

The syntax covered above handles the vast majority of tasks. For more, see the official documentation: Regular Expression Syntax.

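2. Using re

re is Python's built-in module for regular expression operations. Its commonly used functions are described below.

2.1 re.match(pattern, string, flags=0)

Tries to match the pattern at the beginning of the string. If the pattern matches there, a match object is returned; otherwise None is returned.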

import re

pattern = r"hello"
text = "hello, world!"
match_result = re.match(pattern, text, re.IGNORECASE)
if match_result:
    print("Match found:", match_result.group())
else:
    print("No match")

2.2 re.search(pattern, string, flags=0)

Scans the string for the first location where the pattern matches. Returns a match object on success, otherwise None.

import re

pattern = r"world"
text = "Hello, world!"
search_obj = re.search(pattern, text)
if search_obj:
    print("Match found:", search_obj.group())
else:
    print("No match")
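
The difference between re.match and re.search is easiest to see side by side. This short sketch applies the same pattern to the same text as above.

```python
import re

text = "Hello, world!"

from_start = re.match(r"world", text)  # only tries position 0
anywhere = re.search(r"world", text)   # scans the whole string

print(from_start)        # None
print(anywhere.group())  # world
print(anywhere.start())  # 7
```

re.match returns None here because "world" does not appear at the very beginning of the string.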

2.3 re.findall(pattern, string, flags=0)

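Returns a list of all non-overlapping matches in the string. If the pattern contains capture groups, the list holds the group contents instead of the full matches.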

import re

pattern = r"\d+"
text = "There are 123 apples and 456 bananas."
matches = re.findall(pattern, text)
print(matches)  # Output: ['123', '456']

2.4 re.finditer(pattern, string, flags=0)

Returns an iterator that yields a match object for each non-overlapping match.

import re

pattern = r"\d+"
text = "There are 123 apples and 456 bananas."
match_iter = re.finditer(pattern, text)
for match in match_iter:
    print("Match found:", match.group())
# Output:
# Match found: 123
# Match found: 456
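
Because re.finditer yields match objects rather than plain strings, it also gives access to the position of each match. A minimal sketch using the same text:

```python
import re

text = "There are 123 apples and 456 bananas."

# Collect the matched text together with its start and end index.
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]
print(spans)  # [('123', 10, 13), ('456', 25, 28)]
```

The end index is exclusive, so text[10:13] gives back '123'.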

2.5 re.fullmatch(pattern, string, flags=0)

Tries to match the entire string against the pattern. Returns a match object on success, otherwise None.

import re

pattern = r"\d+"
text = "123"
full_match_obj = re.fullmatch(pattern, text)
if full_match_obj:
    print("Full match:", full_match_obj.group())
else:
    print("No match")
# Output: Full match: 123
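
One common use of re.fullmatch is input validation. The sketch below wraps it in a small helper; the name is_integer is hypothetical and introduced only for this illustration.

```python
import re

def is_integer(text):
    """Return True only if the whole string consists of digits."""
    return re.fullmatch(r"\d+", text) is not None

# re.match accepts a matching prefix, re.fullmatch does not.
prefix = re.match(r"\d+", "123abc")

print(prefix.group())        # 123
print(is_integer("123"))     # True
print(is_integer("123abc"))  # False
```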

2.6 re.split(pattern, string, maxsplit=0, flags=0)

Splits the string at each occurrence of the pattern and returns the list of resulting substrings.

import re

pattern = r"\s"
text = "Hello world!"
split_result = re.split(pattern, text)
print(split_result)
# Output: ['Hello', 'world!']
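
A few more variations on re.split, sketched with made-up input: splitting on several delimiters at once, limiting the number of splits with maxsplit, and keeping the delimiters by wrapping the pattern in a capture group.

```python
import re

text = "a, b; c,d"

# Split on a comma or semicolon followed by optional whitespace.
parts = re.split(r"[,;]\s*", text)
# Stop after the first split.
first_only = re.split(r"[,;]\s*", text, maxsplit=1)
# A capture group in the pattern keeps the delimiters in the result.
with_delims = re.split(r"(\s)", "Hello world!")

print(parts)        # ['a', 'b', 'c', 'd']
print(first_only)   # ['a', 'b; c,d']
print(with_delims)  # ['Hello', ' ', 'world!']
```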

2.7 re.sub(pattern, repl, string, count=0, flags=0)

Replaces the occurrences of the pattern in the string with the replacement repl. Use count to limit the number of replacements.

import re

pattern = r"\d+"
text = "I have 3 apples and 5 bananas."
new_text = re.sub(pattern, "X", text)
print(new_text)
# Output: I have X apples and X bananas.
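
The replacement passed to re.sub does not have to be a fixed string. The following sketch, on the same sentence, shows the count argument, backreferences to groups (\1, \2), and a function as repl.

```python
import re

text = "I have 3 apples and 5 bananas."

# Replace only the first number.
first_only = re.sub(r"\d+", "X", text, count=1)
# Swap number and word using backreferences to the two groups.
swapped = re.sub(r"(\d+) (\w+)", r"\2: \1", text)
# Compute each replacement from the match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)

print(first_only)  # I have X apples and 5 bananas.
print(swapped)     # I have apples: 3 and bananas: 5.
print(doubled)     # I have 6 apples and 10 bananas.
```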

2.8 re.subn(pattern, repl, string, count=0, flags=0)

Like re.sub(), but returns a tuple containing the new string and the number of replacements made.

import re

pattern = r"\d+"
text = "I have 3 apples and 5 bananas."
new_text, num_replacements = re.subn(pattern, "X", text)
print(new_text)
# Output: I have X apples and X bananas.
print("Number of replacements:", num_replacements)
# Output: Number of replacements: 2