Getting Started with Python Regular Expressions (the re Module): A Complete Guide to Text Matching and Extraction

Latest recommended article published on 2025-05-20 18:44:14

梦幻南瓜

Category: python  Tags: python, regular expressions

Article link: https://blog.csdn.net/ybg8912/article/details/148068887

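I. Basic Concepts of Regular Expressions

1. What Is a Regular Expression

A regular expression (regex) is a special syntax for describing string patterns. It can be used to:

Check whether a string matches a given pattern
Extract specific parts of a text
Replace content in a text
Split strings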
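
The four uses listed above can be sketched with the standard `re` module. The sample sentence and the order-code format (`A-1024`) are made up for illustration and are not from any real system.

```python
import re

# Hypothetical sample text, invented for this sketch
text = "Order A-1024 shipped on 2023-10-10; order B-2048 is pending."

# 1. Check: does the text contain a date in YYYY-MM-DD form?
has_date = re.search(r"\d{4}-\d{2}-\d{2}", text) is not None

# 2. Extract: collect every order code (uppercase letter, hyphen, digits)
order_codes = re.findall(r"[A-Z]-\d+", text)

# 3. Replace: mask the digits of each order code, keeping the letter
masked = re.sub(r"([A-Z])-\d+", r"\1-****", text)

# 4. Split: break the text on ';' followed by optional whitespace
parts = re.split(r";\s*", text)

print(has_date)     # True
print(order_codes)  # ['A-1024', 'B-2048']
print(masked)       # Order A-**** shipped on 2023-10-10; order B-**** is pending.
print(parts)        # ['Order A-1024 shipped on 2023-10-10', 'order B-2048 is pending.']
```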

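2. Where Regular Expressions Are Used

Scenario	Example	What the regex does
Data validation	Validating email addresses and URLs	Checks that the format is valid
Log analysis	Extracting error codes	Pattern matching and extraction
Text processing	Bulk text replacement	Find and replace
Web scraping	Extracting web page content	Structured data extraction
Data cleaning	Standardizing date formats	Format normalization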

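II. Core Functions of Python's re Module

1. Overview of Common Functions

Function	Description	Return value
re.match()	Matches only at the start of the string	Match object or None
re.search()	Scans the whole string for the first match	Match object or None
re.findall()	Finds all non-overlapping matches	List
re.finditer()	Finds all non-overlapping matches	Iterator of Match objects
re.sub()	Replaces matches	The string after replacement
re.split()	Splits the string by the pattern	List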

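2. Comparing the Functions

Feature	match	search	findall	finditer
Match scope	Start of string only	Whole string	Whole string	Whole string
Result	First match	First match	All matched text (group contents if the pattern has groups)	All Match objects
Memory efficiency	High	High	Low (on large texts)	High
Typical use	Validating a format	Finding the first occurrence	Extracting everything	Processing matches one by one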
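
The differences in the table are easiest to see on one input. The following is a minimal sketch; the sample string is invented for illustration.

```python
import re

text = "error 404, error 500"
pattern = r"error (\d+)"

# match() only succeeds when the pattern matches at position 0
m_ok = re.match(pattern, text)    # matches 'error 404'
m_none = re.match(r"\d+", text)   # None: the text starts with 'e', not a digit

# search() scans forward and returns the first match anywhere in the string
s = re.search(r"\d+", text)

# findall() returns strings; with one capturing group it returns the group contents
codes = re.findall(pattern, text)

# finditer() yields Match objects lazily, so positions are available too
spans = [(m.group(1), m.span()) for m in re.finditer(pattern, text)]

print(m_ok.group(0))        # error 404
print(m_none)               # None
print(s.group(), s.span())  # 404 (6, 9)
print(codes)                # ['404', '500']
print(spans)                # [('404', (0, 9)), ('500', (11, 20))]
```

Because `finditer()` produces one Match object at a time, it avoids building the full result list that `findall()` creates, which is why the table rates it higher for large texts.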

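III. Regular Expression Syntax in Detail

1. Basic Metacharacters

Metacharacter	Description	Example	Matches
.	Any character except a newline	a.c	abc, aac
^	Start of the string (start of each line with re.MULTILINE)	^abc	Strings beginning with abc
$	End of the string (end of each line with re.MULTILINE)	abc$	Strings ending with abc
*	Zero or more times	ab*c	ac, abc, abbc
+	One or more times	ab+c	abc, abbc
?	Zero or one time	ab?c	ac, abc
{m,n}	From m to n times	a{2,4}b	aab, aaab, aaaab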
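
As a quick check of the table, the sketch below runs several of its examples through `re.fullmatch()`, together with a few counter-examples that are not in the table, and shows how `re.MULTILINE` changes `^`. The `cases` list and the `lines` string are illustrative only.

```python
import re

# (pattern, candidate, expected) triples; the False rows are added counter-examples
cases = [
    (r"a.c", "abc", True),
    (r"a.c", "a\nc", False),    # '.' does not match a newline unless re.DOTALL is set
    (r"ab*c", "ac", True),
    (r"ab+c", "ac", False),     # '+' needs at least one 'b'
    (r"ab?c", "abbc", False),   # '?' allows at most one 'b'
    (r"a{2,4}b", "aaab", True),
    (r"a{2,4}b", "ab", False),  # only one 'a', but at least two are required
]

results = [bool(re.fullmatch(p, s)) for p, s, _ in cases]
for (p, s, expected), got in zip(cases, results):
    print(f"{p!r:12} {s!r:8} -> {got}")

# '^' anchors at the start of the string by default, and at every line start
# once re.MULTILINE is passed
lines = "abc one\nxyz abc\nabc two"
default_hits = re.findall(r"^abc", lines)
multiline_hits = re.findall(r"^abc", lines, re.MULTILINE)
print(len(default_hits), len(multiline_hits))  # 1 2
```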

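2. Character Classes and Special Sequences

Pattern	Description	Example	Matches
[abc]	Any one of a, b, or c	[abc]123	a123, b123
[^abc]	Any character other than a, b, or c	[^abc]123	d123, 1123
\d	A digit	\d+	123, 0
\D	A non-digit	\D+	abc, @#$
\s	A whitespace character	\s+	spaces, tabs
\S	A non-whitespace character	\S+	abc, 123
\w	A word character (letter, digit, or underscore)	\w+	abc, 你好
\W	A non-word character	\W+	@#$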
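
The sketch below applies some of these classes to one string. It also shows that for `str` patterns `\w` is Unicode-aware by default, which is why it matches 你好, and that the `re.ASCII` flag restricts it to `[a-zA-Z0-9_]`. The sample string is invented for illustration.

```python
import re

sample = "ID: a123, 你好 world_42 @#$\t!"

digits = re.findall(r"\d+", sample)
words = re.findall(r"\w+", sample)                  # Unicode-aware by default
ascii_words = re.findall(r"\w+", sample, re.ASCII)  # ASCII letters, digits, '_' only
symbols = re.findall(r"[^\w\s]+", sample)           # neither word nor whitespace

print(digits)       # ['123', '42']
print(words)        # ['ID', 'a123', '你好', 'world_42']
print(ascii_words)  # ['ID', 'a123', 'world_42']
print(symbols)      # [':', ',', '@#$', '!']
```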

3. Groups and Capturing

import re

# Basic (numbered) groups
pattern = r'(\d{3})-(\d{3,8})'
m = re.match(pattern, '010-12345')
print(m.groups())  # ('010', '12345')

# Named groups
pattern = r'(?P<area>\d{3})-(?P<num>\d{3,8})'
m = re.match(pattern, '010-12345')
print(m.groupdict())  # {'area': '010', 'num': '12345'}
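
Beyond `groups()` and `groupdict()`, a Match object exposes each group by index or by name. The sketch below reuses the phone-number pattern from above and adds a non-capturing group `(?:...)` for contrast. The variable names are chosen for this example.

```python
import re

pattern = re.compile(r"(?P<area>\d{3})-(?P<num>\d{3,8})")
m = pattern.match("010-12345")

whole = m.group(0)              # the entire match
area_by_index = m.group(1)      # groups are numbered from 1
area_by_name = m.group("area")  # named groups can also be read by name
num_span = m.span("num")        # (start, end) position of a group

# (?:...) groups the sub-pattern without creating a capture
m2 = re.match(r"(?:\d{3})-(\d{3,8})", "010-12345")

# match() returns None on failure, so test the result before calling .group()
m3 = pattern.match("abc")

print(whole, area_by_index, area_by_name, num_span)  # 010-12345 010 010 (4, 9)
print(m2.groups())                                   # ('12345',)
print(m3)                                            # None
```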

IV. Regular Expressions in Practice

1. Data Validation

Email validation:

def validate_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.fullmatch(pattern, email))

print(validate_email('test@example.com'))  # True
print(validate_email('invalid.email'))     # False
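
When the same check runs many times, the pattern can be compiled once and reused. The sketch below does that and runs a handful of extra inputs through it. The names `EMAIL_RE` and `is_valid_email` are hypothetical. The pattern is a pragmatic approximation rather than a full implementation of the email address grammar, so it accepts some questionable inputs such as consecutive dots.

```python
import re

# Hypothetical module-level constant: compiled once, reused on every call.
# fullmatch() already anchors both ends, so ^ and $ are not needed here.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_valid_email(email):
    """Return True if the whole string fits the simplified email pattern."""
    return EMAIL_RE.fullmatch(email) is not None

samples = [
    "test@example.com",
    "user+tag@mail.example.org",
    "invalid.email",     # no '@'
    "user@localhost",    # no dot-separated top-level domain
    "user@example.c",    # top-level domain shorter than two letters
    "a..b@example.com",  # accepted, although consecutive dots are questionable
]
checked = {s: is_valid_email(s) for s in samples}
for address, ok in checked.items():
    print(f"{address:28} {ok}")
```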

URL validation:

def validate_url(url):
    # One '*' on the path part is enough; nesting it as (...*)* risks catastrophic backtracking
    pattern = r'^(https?://)?([\da-z.-]+)\.([a-z.]{2,6})([/\w .-]*)/?$'
    return bool(re.fullmatch(pattern, url))

print(validate_url('https://www.example.com/path'))  # True
print(validate_url('example.com'))                  # True

2. Data Extraction

Extracting IP addresses from a log:

log = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0800] "GET / HTTP/1.1" 200 2326'
pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
ips = re.findall(pattern, log)
print(ips)  # ['127.0.0.1']
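
The pattern above checks only the shape of an address, so it would also accept something like `999.1.1.1`. One possible refinement, sketched here under the assumption that only well-formed IPv4 addresses are wanted, is to pass each candidate to the standard library's `ipaddress.IPv4Address`. The names `IPV4_CANDIDATE` and `extract_ipv4` are hypothetical.

```python
import ipaddress
import re

# Four dot-separated runs of 1 to 3 digits, bounded so longer digit runs are not cut
IPV4_CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def extract_ipv4(text):
    """Return the candidates in text that ipaddress accepts as IPv4 addresses."""
    valid = []
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)  # raises ValueError for octets above 255
        except ValueError:
            continue
        valid.append(candidate)
    return valid

log = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0800] "GET / HTTP/1.1" 200 2326'
print(extract_ipv4(log))                          # ['127.0.0.1']
print(extract_ipv4("bad 999.1.1.1 ok 10.0.0.8"))  # ['10.0.0.8']
```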

Extracting links from HTML:

html = '<a href="https://example.com">Link</a><img src="image.png">'
pattern = r'(?:href|src)="([^"]+)"'
urls = re.findall(pattern, html)
print(urls)  # ['https://example.com', 'image.png']
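
The regex above works for attributes written with double quotes, but real-world HTML also uses single quotes and other spacing. As an alternative technique (not the regex approach shown above), the following sketch uses the standard library's `html.parser.HTMLParser`. The class name `LinkCollector` is hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes from start tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)

def collect_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.urls

html = '<a href="https://example.com">Link</a><img src="image.png">'
print(collect_links(html))                       # ['https://example.com', 'image.png']
print(collect_links("<a href='page.html'>x</a>"))  # ['page.html']
```

The second call uses single quotes, a case the double-quote regex would miss.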

3. Text Cleaning and Replacement

Standardizing date formats:

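text = "Dates: 2023-10-10, 10/10/2023, 2023年10月10日"
# Put '-' first inside [...] so it is a literal hyphen rather than a range operator
pattern = r'(\d{4})[-/年](\d{1,2})[-/月](\d{1,2})日?'
repl = r'\1年\2月\3日'
result = re.sub(pattern, repl, text)
# 10/10/2023 is left unchanged: the pattern expects the four-digit year first
print(result)  # Dates: 2023年10月10日, 10/10/2023, 2023年10月10日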
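
`re.sub()` also accepts a function as the replacement, which helps when the output needs more than reordered groups. The sketch below zero-pads the month and day and emits ISO-style `YYYY-MM-DD`. The target format and the names `DATE_RE` and `to_iso` are assumptions made for this example. The hyphen is placed first in each character class so that it is read literally.

```python
import re

# Year first, then month and day, separated by '-', '/', or the Chinese date markers
DATE_RE = re.compile(r"(\d{4})[-/年](\d{1,2})[-/月](\d{1,2})日?")

def to_iso(match):
    """Build a zero-padded YYYY-MM-DD string from the three captured groups."""
    year, month, day = match.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"

text = "2023-1-5, 2023/10/10, 2023年3月7日"
normalized = DATE_RE.sub(to_iso, text)
print(normalized)  # 2023-01-05, 2023-10-10, 2023-03-07
```

The pattern does not require the two separators to agree, so a mixed form such as `2023-10/10` is normalized as well. Tighten the pattern if that is not wanted.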

Removing extra whitespace:

text = "This  text   has  a lot    of extra  whitespace"
result = re.sub(r'\s+', ' ', text).strip()
print(result)  # This text has a lot of extra whitespace
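
Because `\s` also matches newlines, the one-liner above merges multi-line text into a single line. If line breaks should survive, one option is to collapse only spaces and tabs within each line, as in this sketch. The function name `squeeze_spaces` is hypothetical.

```python
import re

def squeeze_spaces(text):
    """Collapse runs of spaces and tabs within each line, keeping the line breaks."""
    cleaned = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(cleaned)

raw = "first   line \n\t second\t\tline  "
print(repr(squeeze_spaces(raw)))                # 'first line\nsecond line'
print(repr(re.sub(r"\s+", " ", raw).strip()))  # 'first line second line'
```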
