Practical Regular Expression Techniques: Efficient Text Processing in Python!

Latest recommended article published on 2024-10-08 15:02:33

程序员喵哥

Article link: https://blog.csdn.net/GitHub_miao/article/details/135558704

📚 Personal website: ipengtao.com

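Regular expressions are a powerful text-processing tool: they let you describe text patterns flexibly in order to search, match, and manipulate strings. In Python, the built-in re module provides regular expression support.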

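Literal Characters in Regular Expressions

Literal characters are the ordinary characters in a pattern; each one matches exactly itself in the target string. For example, the regular expression "Python" matches the substring "Python" exactly. Matching is case-sensitive by default, so the same pattern does not match "python".

import re

text = "Python is a powerful language."
pattern = re.compile(r"Python")
match = pattern.search(text)
if match:
    print("Match found:", match.group())  # Match found: Python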
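
One caveat with literal matching: characters such as . or + are metacharacters, so a search string that contains them has to be escaped before it is used as a pattern. The following is a minimal sketch using re.escape(); the sample text and the search term are made up for illustration.

```python
import re

text = "Price: 3.50 USD (was 3x50)"
term = "3.50"  # hypothetical search term containing the metacharacter "."

# Unescaped, "." matches any character, so "3x50" is matched as well.
loose = re.findall(term, text)

# re.escape() backslash-escapes metacharacters so the term is matched literally.
strict = re.findall(re.escape(term), text)

print(loose)   # ['3.50', '3x50']
print(strict)  # ['3.50']
```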

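Special Characters in Regular Expressions

Some characters have a special meaning inside a regular expression. The most common ones are listed below.

.: matches any single character except a newline.

text = "apple, banana, cherry"
pattern = re.compile(r".a.a")
matches = pattern.findall(text)
print("Matches:", matches)  # Matches: ['bana']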

*: matches the preceding element zero or more times.

text = "ab, aab, aaab, aaaab"
pattern = re.compile(r"a*b")
matches = pattern.findall(text)
print("Matches:", matches)  # Matches: ['ab', 'aab', 'aaab', 'aaaab']

+: matches the preceding element one or more times.

text = "ab, aab, aaab, aaaab"
pattern = re.compile(r"a+b")
matches = pattern.findall(text)
print("Matches:", matches)  # Matches: ['ab', 'aab', 'aaab', 'aaaab']

?: matches the preceding element zero or one time.

text = "color, colour"
pattern = re.compile(r"colou?r")
matches = pattern.findall(text)
print("Matches:", matches)  # Matches: ['color', 'colour']

[]: matches any one of the characters inside the brackets.

text = "apple, banana, cherry"
pattern = re.compile(r"[bc]ana")
matches = pattern.findall(text)
print("Matches:", matches)  # Matches: ['bana']

|: alternation; matches either the expression on its left or the one on its right.

text = "apple, banana, cherry"
pattern = re.compile(r"apple|cherry")
matches = pattern.findall(text)
print("Matches:", matches)  # Matches: ['apple', 'cherry']
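
Because | has the lowest precedence of these operators, it splits the whole pattern unless it is wrapped in a group. The sketch below contrasts an ungrouped and a grouped alternation on sample words chosen for illustration. It uses (?:...), a non-capturing group, so that findall() still returns whole matches.

```python
import re

text = "gray, grey, gra, ey"

# Ungrouped: the two alternatives are the whole halves "gra" and "ey".
ungrouped = re.findall(r"gra|ey", text)

# Grouped: only the single letter between "gr" and "y" varies.
grouped = re.findall(r"gr(?:a|e)y", text)

print(ungrouped)  # ['gra', 'ey', 'gra', 'ey']
print(grouped)    # ['gray', 'grey']
```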

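Common Regular Expression Operations

1. Matching a String

The search() method scans the text and returns a match object for the first matching substring, or None if there is no match:

text = "Python is a powerful language. Python is also easy to learn."
pattern = re.compile(r"Python")
match = pattern.search(text)
if match:
    print("Match found:", match.group())  # Match found: Python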
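
The re module offers two related methods that are easy to confuse with search(): match() only tries the start of the string, and fullmatch() requires the pattern to cover the entire string. A small sketch of the difference, using a made-up sample sentence:

```python
import re

text = "Learn Python today"
pattern = re.compile(r"Python")

found = pattern.search(text)      # scans the whole string
at_start = pattern.match(text)    # only tries position 0
whole = pattern.fullmatch(text)   # must cover the entire string

print(found.span())  # (6, 12)
print(at_start)      # None
print(whole)         # None
print(pattern.fullmatch("Python") is not None)  # True
```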

2. Finding All Matches

The findall() method returns all non-overlapping matches in the text as a list:

text = "apple, banana, cherry, date, banana"
pattern = re.compile(r"ba\w+")
matches = pattern.findall(text)
print("Matches:", matches)  # Matches: ['banana', 'banana']
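
Two details of findall() are worth knowing: when the pattern contains capturing groups it returns the groups rather than the whole match, and it never reports positions. The sketch below, on an illustrative key=value string, shows both points and uses finditer() to get at the match objects.

```python
import re

text = "apple=3, banana=12, cherry=7"

# With two capturing groups, findall() returns a list of tuples.
pairs = re.findall(r"(\w+)=(\d+)", text)
print(pairs)  # [('apple', '3'), ('banana', '12'), ('cherry', '7')]

# finditer() yields match objects, which also expose positions.
spans = [(m.group(1), m.start()) for m in re.finditer(r"(\w+)=(\d+)", text)]
print(spans)  # [('apple', 0), ('banana', 9), ('cherry', 20)]
```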

3. Replacing Text

The sub() method replaces every match in the text:

text = "Hello, my name is Alice. Hello, my name is Bob."
pattern = re.compile(r"Alice|Bob")
new_text = pattern.sub("John", text)
print("Text after replacement:", new_text)
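
The replacement passed to sub() does not have to be a fixed string. As a sketch on made-up dates, it can refer back to captured groups, or it can be a function that computes the replacement for each match (mask_year is a hypothetical helper name):

```python
import re

text = "Due dates: 2023-01-11 and 2024-12-05."

# Backreferences \1, \2, \3 reuse the captured groups in the replacement.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
print(reordered)  # Due dates: 11/01/2023 and 05/12/2024.

# A function receives each match object and returns the replacement string.
def mask_year(match):
    return "****" + match.group(0)[4:]

masked = re.sub(r"\d{4}-\d{2}-\d{2}", mask_year, text)
print(masked)  # Due dates: ****-01-11 and ****-12-05.
```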

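4. Grouping and Capturing

Parentheses create groups, and the match object's group() method retrieves what each group captured:

text = "Date: 2023-01-11"
pattern = re.compile(r"Date: (\d{4}-\d{2}-\d{2})")
match = pattern.search(text)
if match:
    print("Match found:", match.group(1))  # Match found: 2023-01-11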
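
When a pattern has several groups, numbering them by position gets hard to read. Python's (?P<name>...) syntax gives groups names instead. A minimal sketch that splits the same date into parts (the group names year, month and day are chosen here for illustration):

```python
import re

text = "Date: 2023-01-11"
pattern = re.compile(r"Date: (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

match = pattern.search(text)
if match:
    print(match.group("year"))  # 2023
    print(match.groupdict())    # {'year': '2023', 'month': '01', 'day': '11'}
    print(match.groups())       # ('2023', '01', '11')
```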

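5. Greedy vs. Non-Greedy Matching

By default, quantifiers are greedy: they match as many characters as possible. To make a quantifier non-greedy, append ? to it:

text = "This is a <em>sample</em> text."
pattern = re.compile(r"<.*>")
match = pattern.search(text)
if match:
    print("Greedy match:", match.group())  # Greedy match: <em>sample</em>

pattern = re.compile(r"<.*?>")
match = pattern.search(text)
if match:
    print("Non-greedy match:", match.group())  # Non-greedy match: <em>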
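
A common alternative to a non-greedy quantifier is a negated character class that simply cannot run past the closing delimiter. The sketch below shows that both forms find the same tags in this sample:

```python
import re

text = "This is a <em>sample</em> text."

# Non-greedy: stop at the first ">" that lets the match succeed.
lazy = re.findall(r"<.*?>", text)

# Negated class: match anything that is not ">", then the ">".
negated = re.findall(r"<[^>]*>", text)

print(lazy)     # ['<em>', '</em>']
print(negated)  # ['<em>', '</em>']
```

The two are not always interchangeable: . does not match a newline by default, while [^>] does, so they can differ on tags that span several lines.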

Use Cases

Regular expressions are widely used in text processing. The sections below cover some common uses, each with example code.

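1. Data Extraction

Extracting Phone Numbers

Regular expressions can pull phone numbers out of free text. For example, to extract the phone numbers from the following text:

import re

text = "Contact us: the phone numbers are 123-456-7890 and 987-654-3210."
pattern = re.compile(r"\d{3}-\d{3}-\d{4}")
matches = pattern.findall(text)
print("Phone numbers:", matches)  # Phone numbers: ['123-456-7890', '987-654-3210']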
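
Real text rarely uses a single separator. As a rough sketch, assuming the numbers keep the 3-3-4 digit shape but may be separated by a hyphen, a dot or a space, the pattern can be loosened and the results normalized to digits only (the sample numbers are invented):

```python
import re

text = "Call 123-456-7890, 987.654.3210 or 555 010 9999."

# Assumption: 3-3-4 digit groups separated by "-", "." or a space.
pattern = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")

# Strip every non-digit (\D) to get a uniform representation.
normalized = [re.sub(r"\D", "", number) for number in pattern.findall(text)]
print(normalized)  # ['1234567890', '9876543210', '5550109999']
```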

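Extracting Email Addresses

Regular expressions can also extract email addresses:

import re

# \b needs a non-word character (here a space or ".") next to each address.
text = "Please send mail to email@example.com or contact@company.com."
pattern = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
matches = pattern.findall(text)
print("Email addresses:", matches)  # Email addresses: ['email@example.com', 'contact@company.com']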
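
One pitfall with \b in Python 3: str patterns use Unicode rules, so CJK characters count as word characters. If an address sits directly next to Chinese text with no space, \b finds no boundary there and the pattern matches nothing. A sketch of the problem and one possible fix, the re.ASCII flag:

```python
import re

# Sample with addresses directly adjacent to CJK characters (no spaces).
text = "请发送邮件至email@example.com或contact@company.com。"
email = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"

# Default (Unicode) mode: CJK characters count as \w, so there is no
# boundary between "至" and "e", and nothing is found.
unicode_mode = re.findall(email, text)
print(unicode_mode)  # []

# re.ASCII restricts \w to [a-zA-Z0-9_], which restores the boundaries.
ascii_mode = re.findall(email, text, re.ASCII)
print(ascii_mode)  # ['email@example.com', 'contact@company.com']
```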

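2. Text Replacement

Replacing Text

Regular expressions can replace a specific pattern throughout a text in one pass. For example, to replace every "color" in the text with "colour":

import re

text = "color, colorful, coloring"
pattern = re.compile(r"color")
new_text = pattern.sub("colour", text)
print("Text after replacement:", new_text)  # Text after replacement: colour, colourful, colouring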
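
The pattern above also rewrites "color" inside longer words. If only the standalone word should change, word boundaries can restrict the match. The sketch below also shows a case-insensitive variant; british is a hypothetical helper that preserves a leading capital.

```python
import re

text = "color, colorful, coloring, Color"

# \b...\b restricts the match to the standalone word.
whole_word = re.sub(r"\bcolor\b", "colour", text)
print(whole_word)  # colour, colorful, coloring, Color

# re.IGNORECASE also matches "Color"; the function keeps the original capital.
def british(match):
    word = match.group(0)
    return "Colour" if word[0].isupper() else "colour"

any_case = re.sub(r"\bcolor\b", british, text, flags=re.IGNORECASE)
print(any_case)  # colour, colorful, coloring, Colour
```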

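3. Input Validation

Validating a Date Format

Regular expressions can check whether a date is written in the expected format. For example, to check that a date has the YYYY-MM-DD shape (this checks the shape only, so a string such as 2023-13-45 also passes):

import re

date = "2023-01-11"
pattern = re.compile(r"^\d{4}-\d{2}-\d{2}$")
if pattern.match(date):
    print("Date format is valid")
else:
    print("Date format is invalid")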
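
A pattern like this accepts impossible dates such as 2023-13-45, and because $ also matches just before a trailing newline, it accepts "2023-01-11\n" as well. One possible way to tighten the check, sketched here with a hypothetical helper is_valid_date, is to combine fullmatch() with datetime.strptime() from the standard library:

```python
import re
from datetime import datetime

DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")

def is_valid_date(value):
    """Return True only for a real calendar date written as YYYY-MM-DD."""
    # fullmatch() requires the whole string to fit, with no trailing newline.
    if not DATE_SHAPE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

print(is_valid_date("2023-01-11"))    # True
print(is_valid_date("2023-13-45"))    # False
print(is_valid_date("2023-01-11\n"))  # False
```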

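Validating Password Strength

Regular expressions can also check password strength, for example by requiring at least one uppercase letter, one lowercase letter, and one digit:

import re

password = "P@ssw0rd123"
# Each (?=...) is a lookahead: it checks a condition without consuming characters.
pattern = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).+$")
if pattern.match(password):
    print("Password meets the strength requirements")
else:
    print("Password does not meet the strength requirements")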
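
A single lookahead pattern answers only yes or no. If the user should be told which rule failed, one option is to test each rule with its own small pattern. This is a sketch of that design, not the only way to do it; RULES and missing_requirements are names invented for this example.

```python
import re

# Each rule is checked separately so the failing rules can be reported.
RULES = {
    "lowercase letter": re.compile(r"[a-z]"),
    "uppercase letter": re.compile(r"[A-Z]"),
    "digit": re.compile(r"\d"),
}

def missing_requirements(password):
    """Return the names of the rules the password does not satisfy."""
    return [name for name, rule in RULES.items() if not rule.search(password)]

print(missing_requirements("P@ssw0rd123"))  # []
print(missing_requirements("password"))     # ['uppercase letter', 'digit']
```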

4. Log Analysis

Regular expressions are very useful for parsing log files and extracting the relevant fields. For example, to extract the client IP address and the access time from a log line:

import re

log = "192.168.1.1 - - [11/Jan/2023:12:34:56] \"GET /page.html\" 200 1234"
pattern = re.compile(r"(\d+\.\d+\.\d+\.\d+) - - \[([^]]+)\]")
match = pattern.search(log)
if match:
    ip_address = match.group(1)
    access_time = match.group(2)
    print("IP address:", ip_address)
    print("Access time:", access_time)
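
In practice a log has many lines, and some of them will not fit the pattern. The following sketch extends the same idea to several lines, using named groups and a Counter to tally requests per IP. The log lines are hypothetical and are assumed to follow the same layout as the single-line example; lines that do not fit are skipped.

```python
import re
from collections import Counter

# Hypothetical log lines in the same layout as the single-line example.
log_lines = [
    '192.168.1.1 - - [11/Jan/2023:12:34:56] "GET /page.html" 200 1234',
    '192.168.1.2 - - [11/Jan/2023:12:35:10] "GET /missing" 404 321',
    '192.168.1.1 - - [11/Jan/2023:12:36:02] "GET /index.html" 200 987',
    'malformed line',
]

LINE = re.compile(
    r'(?P<ip>\d+\.\d+\.\d+\.\d+) - - \[(?P<time>[^]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+)'
)

hits = Counter()
errors = []
for line in log_lines:
    match = LINE.match(line)
    if match is None:
        continue  # skip lines that do not fit the assumed layout
    hits[match.group("ip")] += 1
    if match.group("status").startswith("4"):
        errors.append((match.group("ip"), match.group("request")))

print(hits.most_common())  # [('192.168.1.1', 2), ('192.168.1.2', 1)]
print(errors)              # [('192.168.1.2', 'GET /missing')]
```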