[A Must-Read for Beginners] Python Regular Expressions: A Complete Cheat Sheet in One Article

Latest recommended article published on 2024-10-03 09:02:12

6个q

Article link: https://blog.csdn.net/zuiliwangmeng/article/details/141970285

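Preface

Hey, everyone! Today we'll step into the world of Python regular expressions and see how to use them to process text data. Regular expressions are a powerful text-matching tool, widely used for data cleaning, text parsing, and similar tasks. Follow along and pick up the basics!

I. What Is a Regular Expression?

A regular expression (regex for short) is a pattern used to match strings. It can be used to search for, replace, or validate specific patterns in text. Python supports regular expressions through the re module in its standard library.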

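II. Basic Concepts

1. Metacharacters

Metacharacters are special characters in a regular expression that define the pattern:
.: matches any single character except a newline
^: matches the start of the string
$: matches the end of the string (or just before a trailing newline)
*: matches the preceding element zero or more times
+: matches the preceding element one or more times
?: matches the preceding element zero or one time
{n}: matches the preceding element exactly n times
{n,}: matches the preceding element at least n times
{n,m}: matches the preceding element at least n times, but no more than m times
[]: character set, matches any one of the characters inside the brackets
(): capturing group, used to extract the matched part
|: alternation, matches the pattern on its left or the pattern on its right
\d: matches a digit
\D: matches any character that is not a digit
\s: matches a whitespace character (space, tab, newline, and so on)
\S: matches any character that is not whitespace
\w: matches a letter, digit, or underscore
\W: matches any character that is not a letter, digit, or underscore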
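
To see a few of these metacharacters working together, here is a small sketch. The sample strings and variable names are made up for illustration.

```python
import re

# Quantifier {n}: exactly three digits, a hyphen, then exactly four digits.
phone = re.search(r"\d{3}-\d{4}", "call 555-1234 now")
print(phone.group())  # 555-1234

# Anchors, a character set, and the + quantifier:
# the whole string must consist of lowercase letters only.
print(bool(re.search(r"^[a-z]+$", "hello")))   # True
print(bool(re.search(r"^[a-z]+$", "hello1")))  # False

# Alternation with an optional trailing "s" on either side of the |.
animals = re.findall(r"cats?|dogs?", "cats and a dog")
print(animals)  # ['cats', 'dog']
```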

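2. Pattern Objects

Use re.compile(pattern) to create a pattern object that can be reused later.

3. Match Objects

Functions such as re.match(pattern, string) return a match object describing the result when the pattern matches, and None when it does not.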
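
As a quick sketch of what a match object carries, the example below reads the matched text and its position. The input strings are illustrative only.

```python
import re

m = re.search(r"\d+", "order 42 shipped")

# A successful search returns a match object; a failed one returns None.
if m is not None:
    print(m.group())  # 42 (the matched text)
    print(m.start())  # 6 (index where the match begins)
    print(m.end())    # 8 (index just past the match)
    print(m.span())   # (6, 8)

no_match = re.search(r"\d+", "no digits here")
print(no_match)  # None
```

Checking for None before calling group() avoids an AttributeError when nothing matches.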

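III. Worked Examples

1. Matching a String

Use re.match(pattern, string) to match at the beginning of a string.

import re

# Match at the start of the string
# (re.match already anchors at the start, so ^ is optional here)
pattern = r"^hello"
string = "hello world"

# Use the match function
result = re.match(pattern, string)

if result:
    print("Match succeeded")  # This branch runs: prints Match succeeded
else:
    print("Match failed")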

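2. Searching a String

Use re.search(pattern, string) to scan the whole string for the first match.

import re

# Search the string
pattern = r"world"
string = "hello world"

# Use the search function
result = re.search(pattern, string)

if result:
    print("Match succeeded")  # This branch runs: prints Match succeeded
else:
    print("Match failed")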
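
The difference between match and search is easiest to see side by side. The following sketch also includes re.fullmatch, which the article does not cover, purely for comparison.

```python
import re

text = "hello world"

at_start = re.match(r"world", text)    # None: "world" is not at position 0
anywhere = re.search(r"world", text)   # found further along the string
whole = re.fullmatch(r"hello", text)   # None: the pattern must cover the entire string

print(at_start)         # None
print(anywhere.span())  # (6, 11)
print(whole)            # None
```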

3. Replacing Text

Use re.sub(pattern, repl, string) to replace every match in a string.

import re

# Replace each run of digits
pattern = r"\d+"
string = "123 hello 456 world"

# Use the sub function
result = re.sub(pattern, "X", string)

print(result)  # Prints X hello X world
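
re.sub has a few options worth knowing about. The sketch below shows the count argument, a backreference in the replacement string, and re.subn; the variable names are just for illustration.

```python
import re

text = "123 hello 456 world"

# count=1 limits the replacement to the first match only.
first_only = re.sub(r"\d+", "X", text, count=1)
print(first_only)  # X hello 456 world

# \1 in the replacement refers to the text captured by the first group.
wrapped = re.sub(r"(\d+)", r"<\1>", text)
print(wrapped)  # <123> hello <456> world

# re.subn also reports how many replacements were made.
replaced, n = re.subn(r"\d+", "X", text)
print(replaced, n)  # X hello X world 2
```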

4. Splitting a String

Use re.split(pattern, string) to split a string wherever the pattern matches.

import re

# Split on runs of whitespace
pattern = r"\s+"
string = "hello world"

# Use the split function
result = re.split(pattern, string)

print(result)  # Prints ['hello', 'world']
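
re.split becomes more useful than str.split once the separator varies. Here is a small sketch with made-up input showing mixed delimiters, maxsplit, and what happens when the pattern contains a capturing group.

```python
import re

# Split on commas or semicolons, along with any surrounding whitespace.
fields = re.split(r"\s*[,;]\s*", "apple, banana;cherry ,  date")
print(fields)  # ['apple', 'banana', 'cherry', 'date']

# maxsplit=1 stops after the first split.
head_tail = re.split(r"\s+", "one two three", maxsplit=1)
print(head_tail)  # ['one', 'two three']

# If the pattern contains a capturing group, the separators are kept in the result.
with_seps = re.split(r"(-)", "a-b-c")
print(with_seps)  # ['a', '-', 'b', '-', 'c']
```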

5. Finding All Matches

Use re.findall(pattern, string) to get a list of all matches in a string.

import re

# Find every run of digits
pattern = r"\d+"
string = "123 hello 456 world"

# Use the findall function
result = re.findall(pattern, string)

print(result)  # Prints ['123', '456']
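
One thing that often surprises beginners is that the shape of the findall result depends on how many capturing groups the pattern has. A short sketch with illustrative data:

```python
import re

text = "alice=30, bob=25"

# No groups: each result is the full matched text.
full = re.findall(r"\w+=\d+", text)
print(full)  # ['alice=30', 'bob=25']

# One group: each result is that group's text.
ages = re.findall(r"\w+=(\d+)", text)
print(ages)  # ['30', '25']

# Several groups: each result is a tuple with one entry per group.
pairs = re.findall(r"(\w+)=(\d+)", text)
print(pairs)  # [('alice', '30'), ('bob', '25')]
```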

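6. Iterating over Matches

Use re.finditer(pattern, string) to get an iterator that yields a match object for each match.

import re

# Iterate over every run of digits
pattern = r"\d+"
string = "123 hello 456 world"

# Use the finditer function
for match in re.finditer(pattern, string):
    print(match.group())  # Prints 123, then 456 (one per line)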

7. Capturing Groups

Use () to define a capturing group and extract the part of the text it matched.

import re

# Capturing group
pattern = r"(\d+)"
string = "123 hello 456 world"

# Use the match function
result = re.match(pattern, string)

if result:
    print(result.group(1))  # Prints 123
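
With more than one group, numbering and naming start to matter. The sketch below uses named groups, written (?P<name>...), on a date string chosen for illustration.

```python
import re

m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2024-10-03")

if m:
    print(m.group(0))        # 2024-10-03 (the whole match)
    print(m.group(1))        # 2024 (groups are numbered from 1)
    print(m.group("month"))  # 10 (lookup by name)
    print(m.groups())        # ('2024', '10', '03')
    print(m.groupdict())     # {'year': '2024', 'month': '10', 'day': '03'}
```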

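8. Non-Capturing Groups

Use (?:...) to define a non-capturing group, which groups part of the pattern without extracting what it matched.

import re

# Non-capturing group
pattern = r"(?:\d+)"
string = "123 hello 456 world"

# Use the match function
result = re.match(pattern, string)

if result:
    # The pattern has no capturing group, so only group(0) (the whole match) exists
    print(result.group(0))  # Prints 123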
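
The effect of a non-capturing group is clearer when it sits next to a capturing one. A minimal sketch, using a made-up input string:

```python
import re

text = "ab12 cd34"

capturing = re.findall(r"([a-z]+)(\d+)", text)
print(capturing)  # [('ab', '12'), ('cd', '34')]

# (?:...) still groups the letters for matching, but only the digits are captured.
non_capturing = re.findall(r"(?:[a-z]+)(\d+)", text)
print(non_capturing)  # ['12', '34']
```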

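9. Character Sets

Use [] to define a character set, which matches any one of the characters inside the brackets.

import re

# Character set
pattern = r"[abc]"
string = "abc def"

# Use the match function
result = re.match(pattern, string)

if result:
    print(result.group(0))  # Prints a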
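
Character sets also support ranges such as a-z and negation with a leading ^. The following sketch shows both on an illustrative string.

```python
import re

text = "Room B-12, floor 3!"

# A range of uppercase letters.
upper = re.findall(r"[A-Z]", text)
print(upper)  # ['R', 'B']

# Runs of digits.
numbers = re.findall(r"[0-9]+", text)
print(numbers)  # ['12', '3']

# A leading ^ negates the set: anything except letters, digits, and whitespace.
punctuation = re.findall(r"[^a-zA-Z0-9\s]", text)
print(punctuation)  # ['-', ',', '!']
```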

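10. Greedy Matching

By default, quantifiers match as many characters as possible. This is called greedy matching.

import re

# Greedy matching: .* runs to the last "b" it can reach
pattern = r"a.*b"
string = "axxbxxb"

# Use the match function
result = re.match(pattern, string)

if result:
    print(result.group(0))  # Prints axxbxxb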

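11. Non-Greedy Matching

Add ? after a quantifier to make it non-greedy, so that it matches as few characters as possible.

import re

# Non-greedy matching: .*? stops at the first "b"
pattern = r"a.*?b"
string = "axxbxxb"

# Use the match function
result = re.match(pattern, string)

if result:
    print(result.group(0))  # Prints axxb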
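
A common place where greedy versus non-greedy matters is text with repeated delimiters. Here is a sketch using a small, made-up HTML fragment.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs from the first "<" to the last ">".
greedy = re.findall(r"<.*>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the nearest ">".
lazy = re.findall(r"<.*?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
```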

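12. Flags

Use flags to change how a regular expression matches. For example, re.IGNORECASE makes matching case-insensitive.

import re

# Flags
pattern = r"hello"
string = "Hello world"

# Use the match function with a flag
result = re.match(pattern, string, re.IGNORECASE)

if result:
    print("Match succeeded")  # This branch runs: prints Match succeeded
else:
    print("Match failed")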
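
Flags can be combined with the | operator. The sketch below pairs re.IGNORECASE with re.MULTILINE, a flag the article does not cover, on some invented log text.

```python
import re

text = "Error: disk full\nwarning: low memory\nERROR: timeout"

# re.MULTILINE lets ^ and $ match at the start and end of every line;
# re.IGNORECASE makes "error" match regardless of case.
errors = re.findall(r"^error: (.*)$", text, re.IGNORECASE | re.MULTILINE)
print(errors)  # ['disk full', 'timeout']
```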

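IV. Advanced Usage

1. Matching Across Lines

Use the re.DOTALL flag so that . also matches newline characters in multi-line text.

import re

# Matching across lines: the . must match the newline between the two words
pattern = r"hello.world"
string = """hello
world"""

# Use the match function (without re.DOTALL this would return None)
result = re.match(pattern, string, re.DOTALL)

if result:
    print("Match succeeded")  # This branch runs: prints Match succeeded
else:
    print("Match failed")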
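
A typical use of re.DOTALL is pulling out a block of text that spans several lines. A minimal sketch, assuming a made-up snippet of markup:

```python
import re

text = "<p>line one\nline two</p>"

# Without re.DOTALL, . cannot cross the newline, so nothing matches.
without_flag = re.search(r"<p>(.*?)</p>", text)
print(without_flag)  # None

# With re.DOTALL, the group captures both lines.
with_flag = re.search(r"<p>(.*?)</p>", text, re.DOTALL)
print(repr(with_flag.group(1)))  # 'line one\nline two'
```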

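2. Compiling Patterns

Use re.compile(pattern, flags) to compile a pattern into a reusable pattern object. The re module already caches recently used patterns, so the main benefit is cleaner code when the same pattern is used many times, with a possible small speed gain.

import re

# Compile the pattern once, with its flags
pattern = re.compile(r"hello", re.IGNORECASE)
string = "Hello world"

# Use the pattern object's match method
result = pattern.match(string)

if result:
    print("Match succeeded")  # This branch runs: prints Match succeeded
else:
    print("Match failed")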
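
Compiling pays off most when one pattern is applied to many strings. The sketch below reuses a single pattern object; the name word_re and the sample lines are illustrative.

```python
import re

# Compile once, then reuse the same pattern object across many strings.
word_re = re.compile(r"\bpython\b", re.IGNORECASE)

lines = ["I like Python.", "pythonic code", "PYTHON rocks"]
hits = [line for line in lines if word_re.search(line)]
print(hits)  # ['I like Python.', 'PYTHON rocks']

# Pattern objects expose the same operations as the module-level functions.
shortened = word_re.sub("Py", "python and Python")
print(shortened)        # Py and Py
print(word_re.pattern)  # \bpython\b
```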

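3. Replacement Callback Functions

When calling re.sub(pattern, repl, string), the repl argument can be a function that receives each match object and returns the replacement text.

import re

# Replacement callback: return the matched text in uppercase
def replace_func(match):
    return match.group().upper()

# Match runs of lowercase letters, since upper() has no effect on digits
pattern = r"[a-z]+"
string = "123 hello 456 world"

# Use the sub function with the callback
result = re.sub(pattern, replace_func, string)

print(result)  # Prints 123 HELLO 456 WORLD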
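
A callback is most useful when the replacement has to be computed from the match. As a sketch, the hypothetical double function below doubles every number in the string.

```python
import re

def double(match):
    # match.group() is the matched digit string; the callback must return a str.
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "123 hello 456 world")
print(doubled)  # 246 hello 912 world
```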