Python Advanced 05: A Practical Guide to Regular Expressions

License: CC 4.0 BY-SA

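1. Regular Expression Basics

1.1 What Is a Regular Expression

A regular expression (regex) is a pattern that describes a set of strings. Python's re module uses such patterns to search, match, and replace text.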

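import re  # Python's regular expression module

text = "Python is an interpreted, object-oriented programming language"
pattern = r"object-oriented"  # the r prefix makes this a raw string, so backslashes are passed to re unchanged

# Search for the first match
match = re.search(pattern, text)
if match:
    print(f"Match found from position {match.start()} to {match.end()}")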

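1.2 Comparison of Common re Functions

Function	Description	Return value	Typical use
`re.match()`	Matches only at the start of the string	Match object or None	Validating string format from the start
`re.search()`	Finds the first match anywhere in the string	Match object or None	Looking for a single occurrence of a pattern
`re.findall()`	Finds all non-overlapping matches	List of matched strings (or of groups, if the pattern contains groups)	Extracting everything that fits a pattern
`re.finditer()`	Finds all non-overlapping matches	Iterator of Match objects	When match positions are needed
`re.sub()`	Replaces matches	New string	Bulk replacement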
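
To make the differences in the table concrete, here is a minimal sketch that runs all five functions against the same input. The sample string and the variable names are illustrative assumptions, not part of the original article.

```python
import re

text = "cat bat rat"
pattern = r"[a-z]at"  # one lowercase letter followed by "at"

# re.match only tries the pattern at position 0
at_start = re.match(pattern, text)       # matches "cat"
not_at_start = re.match(r"bat", text)    # None: the string does not start with "bat"

# re.search scans forward and stops at the first hit
first_bat = re.search(r"bat", text)      # matches "bat" at span (4, 7)

# re.findall returns the matched strings, re.finditer yields Match objects
all_hits = re.findall(pattern, text)
hits_with_pos = [(m.group(), m.start()) for m in re.finditer(pattern, text)]

# re.sub returns a new string with every match replaced
masked = re.sub(pattern, "***", text)

print(at_start.group(), not_at_start, first_bat.span())  # cat None (4, 7)
print(all_hits)        # ['cat', 'bat', 'rat']
print(hits_with_pos)   # [('cat', 0), ('bat', 4), ('rat', 8)]
print(masked)          # *** *** ***
```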

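2. Regular Expression Syntax in Detail

2.1 Basic Metacharacters

import re

# The dot (.) matches any single character except a newline
print(re.findall(r"p.thon", "python pYthon p@thon"))  # ['python', 'pYthon', 'p@thon']

# The star (*) matches the preceding element zero or more times
print(re.findall(r"ab*c", "ac abc abbc abbbc"))  # ['ac', 'abc', 'abbc', 'abbbc']

# The plus (+) matches the preceding element one or more times
print(re.findall(r"ab+c", "ac abc abbc"))  # ['abc', 'abbc']

# The question mark (?) matches the preceding element zero or one time
print(re.findall(r"colou?r", "color colour"))  # ['color', 'colour']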
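
One detail about the dot is easy to miss: by default it does not match a newline. The sketch below shows the default behaviour next to re.DOTALL, a standard flag of the re module that this section does not otherwise cover. The sample text is an assumption chosen for illustration.

```python
import re

text = "a\nb a-b"

# By default the dot stops at newlines, so only "a-b" matches
default_hits = re.findall(r"a.b", text)

# With re.DOTALL the dot also matches "\n"
dotall_hits = re.findall(r"a.b", text, re.DOTALL)

print(default_hits)  # ['a-b']
print(dotall_hits)   # ['a\nb', 'a-b']
```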

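2.2 Quantifiers and Boundaries

# Exact repetition count
print(re.findall(r"\d{3}", "123 4567 89 012"))  # ['123', '456', '012']

# Repetition range (the leftover "5" is a single digit, so it does not match)
print(re.findall(r"\d{2,4}", "1 12 123 1234 12345"))  # ['12', '123', '1234', '1234']

# Anchors
print(re.findall(r"^\d+", "123abc"))  # ['123'] - digits at the start
print(re.findall(r"\w+$", "abc123 "))  # [] - the trailing space prevents a match at the end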
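
Besides ^ and $, the re module also offers the word boundary \b, which the email pattern later in this article relies on. A minimal sketch, with an invented sample sentence, of how \b restricts a match to whole words:

```python
import re

text = "cat concat catalog cat."

# Without boundaries, "cat" is also found inside longer words
loose = re.findall(r"cat", text)

# \b requires a transition between a word character and a non-word character
whole_words = re.findall(r"\bcat\b", text)
starts = [m.start() for m in re.finditer(r"\bcat\b", text)]

print(len(loose))    # 4
print(whole_words)   # ['cat', 'cat']
print(starts)        # [0, 19]
```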

2.3 Character Classes and Groups

# Character sets
print(re.findall(r"[aeiou]", "Hello World"))  # ['e', 'o', 'o']
print(re.findall(r"[A-Za-z]", "Python3"))  # ['P', 'y', 't', 'h', 'o', 'n']

# Extracting values with capturing groups
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Date: 2023-05-15")
if match:
    print(f"Year: {match.group(1)}, Month: {match.group(2)}, Day: {match.group(3)}")

# Non-capturing group (?:...): groups without capturing, so findall returns only (\w+)
print(re.findall(r"(?:www\.)?(\w+)\.com", "www.baidu.com google.com"))
# ['baidu', 'google']
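
Numbered groups become hard to read as patterns grow. As a possible alternative, the sketch below uses named groups, written (?P<name>...), on the same date format, and also shows that re.findall returns tuples once a pattern has several groups. The group names and sample strings are assumptions made for this example.

```python
import re

date_pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"

match = re.search(date_pattern, "Date: 2023-05-15")
parts = match.groupdict()     # all named groups as a dict
year = match.group("year")    # access a single group by name

# With more than one group, findall returns one tuple per match
pairs = re.findall(r"(\d{4})-(\d{2})", "2023-05 2024-11")

print(parts)  # {'year': '2023', 'month': '05', 'day': '15'}
print(year)   # 2023
print(pairs)  # [('2023', '05'), ('2024', '11')]
```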

3. Special Sequences and Escaping

3.1 Common Special Sequences

# \d matches digits
print(re.findall(r"\d+", "Phone: 12345, ZIP: 100000"))  # ['12345', '100000']

# \w matches word characters (letters, digits, and the underscore)
print(re.findall(r"\w+", "user_name@example.com"))  
# ['user_name', 'example', 'com']

# \s matches whitespace characters
print(re.split(r"\s+", "Python  is\tgreat\nfun"))  
# ['Python', 'is', 'great', 'fun']
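
In Python 3, \w and \d are Unicode-aware for str patterns, which matters when processing Chinese text. The following sketch contrasts the default behaviour with the re.ASCII flag. The sample strings are assumptions, and the full-width digits are written as escapes so the example does not depend on the file encoding.

```python
import re

text = "name_1 名字"
fullwidth_digits = "\uff11\uff12\uff13"  # full-width "123"

# Default (Unicode) mode: \w matches CJK characters, \d matches full-width digits
unicode_words = re.findall(r"\w+", text)
unicode_digits = re.findall(r"\d+", fullwidth_digits)

# re.ASCII restricts \w to [a-zA-Z0-9_] and \d to [0-9]
ascii_words = re.findall(r"\w+", text, re.ASCII)
ascii_digits = re.findall(r"\d+", fullwidth_digits, re.ASCII)

print(unicode_words)  # ['name_1', '名字']
print(ascii_words)    # ['name_1']
print(len(unicode_digits), len(ascii_digits))  # 1 0
```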

3.2 Escaping Special Characters

# Match a literal dot
print(re.findall(r"\d+\.\d+", "Pi is approximately 3.14159"))  # ['3.14159']

# Match a literal dollar sign
print(re.findall(r"\$\d+", "Price: $100, discount: 50%"))  # ['$100']
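
When the text to match comes from a variable rather than being typed by hand, escaping each metacharacter manually is error-prone. A minimal sketch using re.escape, with an invented literal and sentence, of how a string can be searched for verbatim:

```python
import re

literal = "1+1=2 (really?)"
text = "He wrote 1+1=2 (really?) on the board."

# Used directly, +, ( ) and ? are interpreted as regex syntax and nothing matches
naive_hits = re.findall(literal, text)

# re.escape backslash-escapes the metacharacters so the string matches itself
escaped_hits = re.findall(re.escape(literal), text)

print(naive_hits)    # []
print(escaped_hits)  # ['1+1=2 (really?)']
```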

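4. Advanced Techniques

4.1 Greedy vs. Non-Greedy Matching

# Greedy matching (the default): .* consumes as much as possible
print(re.findall(r"<.*>", "<div><p>Hello</p></div>"))  
# ['<div><p>Hello</p></div>']

# Non-greedy matching (append ?): .*? consumes as little as possible
print(re.findall(r"<.*?>", "<div><p>Hello</p></div>"))  
# ['<div>', '<p>', '</p>', '</div>']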
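
The ? suffix works on every quantifier, not only on *. The sketch below, using made-up inputs, applies it to a bounded quantifier and also shows a negated character class, a commonly used alternative to .*? for delimiter-bounded text.

```python
import re

html = "<div><p>Hello</p></div>"

lazy_tags = re.findall(r"<.*?>", html)
# [^>]* can never run past the closing bracket, so no lazy quantifier is needed
negated_tags = re.findall(r"<[^>]*>", html)

# A lazy bounded quantifier takes the minimum count that still allows a match
greedy_digits = re.findall(r"\d{2,4}", "12345")
lazy_digits = re.findall(r"\d{2,4}?", "12345")

print(lazy_tags == negated_tags)  # True
print(greedy_digits)  # ['1234']
print(lazy_digits)    # ['12', '34']
```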

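4.2 Lookahead and Lookbehind Assertions

# Positive lookahead (?=...): match only if followed by ":"
print(re.findall(r"\w+(?=:)", "name:Zhang San, age:30"))  # ['name', 'age']

# Negative lookahead (?!...): three digits not followed by another digit
print(re.findall(r"\d{3}(?!\d)", "123 1234 12345"))  # ['123', '234', '345']

# Positive lookbehind (?<=...): digits preceded by "$"
print(re.findall(r"(?<=\$)\d+", "Price: $100, discount: $50"))  # ['100', '50']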
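
The fourth assertion, negative lookbehind (?<!...), completes the set. The sketch below uses an invented sentence to pick out numbers that are not dollar amounts, and shows a pitfall: the assertion is checked at every position, so the pattern can still start matching in the middle of a number unless digits are excluded as well.

```python
import re

text = "price: $100, quantity: 50, discount: $20"

# Naive attempt: only forbids "$" directly before the match,
# so the regex simply starts one digit later inside "$100" and "$20"
naive = re.findall(r"(?<!\$)\d+", text)

# Also forbidding a preceding digit pins the match to the start of a number
not_prices = re.findall(r"(?<![$\d])\d+", text)

print(naive)       # ['00', '50', '0']
print(not_prices)  # ['50']
```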

4.3 Flags

# Ignore case
print(re.findall(r"python", "Python is great", re.IGNORECASE))  # ['Python']

# Multiline mode: ^ matches at the start of every line
text = "first line\nsecond line\nthird line"
print(re.findall(r"^\w+", text, re.MULTILINE))  # ['first', 'second', 'third']

# Verbose mode (allows whitespace and comments inside the pattern)
pattern = re.compile(r"""
    \d{3}    # area code
    -?       # optional separator
    \d{8}    # phone number
""", re.VERBOSE)
print(pattern.findall("Phone: 010-12345678, 02187654321"))  # ['010-12345678', '02187654321']
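
Flags can be combined with the bitwise OR operator, or written inline at the very start of the pattern. A minimal sketch with a hypothetical log excerpt (the log lines and names are assumptions for illustration):

```python
import re

log = "Error: disk full\nerror: retry\nWARNING: low memory"

# Combine flags with |: case-insensitive and line-by-line anchoring
error_line = re.compile(r"^error:\s*(.+)$", re.IGNORECASE | re.MULTILINE)
messages = error_line.findall(log)

# The same flags written inline: (?i) for IGNORECASE, (?m) for MULTILINE
inline_messages = re.findall(r"(?im)^error:\s*(.+)$", log)

print(messages)         # ['disk full', 'retry']
print(inline_messages)  # ['disk full', 'retry']
```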

5. Practical Examples

5.1 Data Extraction

# Extract email addresses
text = "Contact: support@example.com, sales@test.org"
emails = re.findall(r"\b[\w.-]+@[\w.-]+\.\w+\b", text)
print(emails)  # ['support@example.com', 'sales@test.org']

# Extract the text between HTML tags
html = "<h1>Title</h1><p>Paragraph text</p>"
print(re.findall(r"<.*?>(.*?)</.*?>", html))  # ['Title', 'Paragraph text']
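
The tag pattern above accepts any closing tag, even one that does not match the opening tag. One way to tighten it is a backreference, \1, which must repeat exactly what group 1 captured. This is only a sketch for simple, non-nested markup with attribute-free tags; the sample strings are assumptions.

```python
import re

html = "<h1>Title</h1><p>Paragraph text</p>"
broken = "<h1>Title</p>"

# \1 forces the closing tag name to equal the opening tag name
paired = re.findall(r"<(\w+)>(.*?)</\1>", html)

# The loose pattern accepts the mismatched pair, the strict one rejects it
loose_on_broken = re.findall(r"<.*?>(.*?)</.*?>", broken)
strict_on_broken = re.findall(r"<(\w+)>(.*?)</\1>", broken)

print(paired)            # [('h1', 'Title'), ('p', 'Paragraph text')]
print(loose_on_broken)   # ['Title']
print(strict_on_broken)  # []
```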

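5.2 Data Validation

def validate_phone(phone):
    # Optional +86 prefix, then 1, a digit from 3 to 9, and nine more digits.
    # re.fullmatch already requires the whole string to match, so ^ and $ are not needed.
    pattern = r"(?:\+86)?1[3-9]\d{9}"
    return bool(re.fullmatch(pattern, phone))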

print(validate_phone("13800138000"))  # True
print(validate_phone("+8613800138000"))  # True
print(validate_phone("02812345678"))  # False
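
A note on why re.fullmatch is a reasonable choice for validation: without re.MULTILINE, $ also matches just before a trailing newline, so a pattern anchored with ^...$ and run through re.match accepts input that ends in "\n". The sketch below illustrates this with the digits-only part of the phone pattern. The input value is an assumption.

```python
import re

candidate = "13800138000\n"  # a valid-looking number with a trailing newline

# $ matches before the final "\n", so this check passes
loose = bool(re.match(r"^1[3-9]\d{9}$", candidate))

# fullmatch must consume the entire string, including the newline, so it fails
strict = bool(re.fullmatch(r"1[3-9]\d{9}", candidate))

print(loose)   # True
print(strict)  # False
```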

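5.3 Text Cleaning

def clean_text(text):
    # Replace every non-word character with a space (Chinese characters are kept;
    # in Python 3, \w already matches them, so the explicit range is only a safeguard)
    text = re.sub(r"[^\w\u4e00-\u9fff]", " ", text)
    # Collapse runs of whitespace afterwards, so the spaces introduced above are merged too
    text = re.sub(r"\s+", " ", text)
    return text.strip()

dirty_text = "这是一些  标点符号！@#，还有123数字。"
print(clean_text(dirty_text))  # "这是一些 标点符号 还有123数字"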
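
re.sub can do more than insert a fixed string: the replacement may refer to captured groups, or be a function that receives each Match object. A minimal sketch of both forms, with invented inputs and a hypothetical helper named double:

```python
import re

# Group references in the replacement: reorder MM/DD/YYYY into YYYY-MM-DD
iso_date = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\1-\2", "05/15/2023")

# A callable replacement computes each substitution from the match
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

print(iso_date)  # 2023-05-15
print(doubled)   # 6 apples and 20 pears
```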

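Exercises

Write regular expressions that validate the following:
- A strong password (at least 8 characters, containing uppercase letters, lowercase letters, and digits)
- A mainland China ID card number (18 characters; the last one may be X)
- A date in YYYY-MM-DD format

Extract all URLs from the following text:
Visit our website at https://www.example.com or http://test.org/page?q=123,
or check out ftp://files.example.com

Write a function that uses a regular expression to convert camelCase names to snake_case:
camel_to_snake("getUserName")  # returns "get_user_name"
camel_to_snake("HTTPRequest")  # returns "http_request"

Use regular expressions to parse a simple SQL query:
parse_sql("SELECT name, age FROM users WHERE age > 20 LIMIT 10")
# should return:
# {
#     'select': ['name', 'age'],
#     'from': 'users',
#     'where': 'age > 20',
#     'limit': '10'
# }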