Python Regular Expressions

实相无相

Tags: python, regular expressions

Article link: https://blog.csdn.net/weixin_46121540/article/details/129646295

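1. What Is a Regular Expression

A regular expression is a small language for describing patterns in strings. It uses specific characters and symbols to define search and replace rules, which makes it possible to match and process text quickly and precisely.

A regular expression is built from ordinary characters (letters, digits, most punctuation) and metacharacters (such as . * + ? [ ] ( ) ^ $ | and \). Metacharacters have a special meaning: they stand for a particular pattern or operation rather than for themselves.

In Python, regular expressions are handled by the re module. It provides functions and methods for compiling a pattern, matching a string, finding substrings, splitting a string, and more.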
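The operations listed above can be sketched in a few lines. This is only an illustrative overview; the sample text and variable names are made up for the example.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "apple, banana; cherry"

word = re.compile(r"[a-z]+")         # compile a pattern once for reuse
first = word.match(text)             # match only at the start of the string
found = word.search("42 apples")     # find the first match anywhere
all_words = word.findall(text)       # list every non-overlapping match
parts = re.split(r"[,;]\s*", text)   # split on a comma or semicolon plus spaces

print(first.group())   # apple
print(found.group())   # apples
print(all_words)       # ['apple', 'banana', 'cherry']
print(parts)           # ['apple', 'banana', 'cherry']
```

Compiling is optional: the module-level functions such as re.findall accept the pattern string directly, as the rest of this article does.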

2. Basic Syntax

Ordinary characters

In a regular expression, most ordinary characters simply match themselves. For example:

import re

# Match the word "hello" at the start of the string
pattern = "hello"
string = "hello world"
result = re.match(pattern, string)
print(result.group())  # prints: hello

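Metacharacters

Metacharacters are symbols with a special meaning in a pattern. To match one of them literally, escape it with a backslash or place it inside a character set.

2.1 Character sets []

Square brackets define a character set, which matches any single character listed inside them. For example: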

import re

# Match "hello" or "Hello": [hH] matches either "h" or "H"
pattern = "[hH]ello"
string1 = "Hello world"
string2 = "Hi there"
result1 = re.match(pattern, string1)
result2 = re.match(pattern, string2)
print(result1.group())  # prints: Hello
print(result2)  # prints: None, because "Hi there" does not start with "hello" or "Hello"

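Inside square brackets, a hyphen - denotes a range: [a-z] matches any one lowercase letter from a to z, [A-Z] matches any one uppercase letter from A to Z, and [0-9] matches any one digit from 0 to 9.

Several ranges and single characters can be combined by writing them next to each other, with no separator. For example, [a-zA-Z] matches any one letter of either case, and [abc123] matches any one of a, b, c, 1, 2, 3. A comma inside the brackets is not a separator but a literal character, so [a-z,A-Z] also matches a comma.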
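A short sketch of ranges in practice, including what a comma inside a set actually does. The sample strings are invented for the example.

```python
import re

text = "Room B12, floor 3"   # hypothetical sample text

lower = re.findall(r"[a-z]", "aBc")            # lowercase letters only
letters = re.findall(r"[a-zA-Z]+", text)       # two ranges written side by side
digits = re.findall(r"[0-9]+", text)           # runs of digits
with_comma = re.findall(r"[a-z,A-Z]", "a,B")   # the comma is matched literally

print(lower)        # ['a', 'c']
print(letters)      # ['Room', 'B', 'floor']
print(digits)       # ['12', '3']
print(with_comma)   # ['a', ',', 'B']
```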

To match a square bracket itself, escape it with a backslash \.

import re

# Match a literal "[", one or more word characters, and a literal "]"
pattern = r"\[\w+\]"
string1 = "[test]"
string2 = "[test]]"
result1 = re.match(pattern, string1)
result2 = re.match(pattern, string2)
print(result1.group())  # prints: [test]
print(result2.group())  # prints: [test]

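Note: inside square brackets most metacharacters (for example . * + ?) lose their special meaning and do not need to be escaped. The exceptions are ], \, a ^ in first position (which negates the set), and a - between two characters (which forms a range).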
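The behaviour of metacharacters inside a set can be checked with a small sketch like the following (the input strings are arbitrary examples). It shows that . and + are literal inside brackets, while a leading ^ negates the set and ] still has to be escaped.

```python
import re

# Inside a set, "." and "+" are ordinary characters
ops = re.findall(r"[.+]", "1.5+2")

# A leading "^" negates the set: match anything that is not a digit
non_digits = re.findall(r"[^0-9]", "a1b2")

# "[" and "]" are escaped here so they are treated as set members
brackets = re.findall(r"[\[\]]", "x[0]")

print(ops)          # ['.', '+']
print(non_digits)   # ['a', 'b']
print(brackets)     # ['[', ']']
```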

2.2 The wildcard .

A dot . matches any single character except the newline \n. For example:

import re

# Match "h", any one character, then "llo"
pattern = "h.llo"
string1 = "hello world"
string2 = "hallo world"
result1 = re.match(pattern, string1)
result2 = re.match(pattern, string2)
print(result1.group())  # prints: hello
print(result2.group())  # prints: hallo

Note: a dot matches exactly one character. To match several characters, combine it with a repetition operator.
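As a quick illustration of that note, the sketch below (with made-up inputs) compares a bare dot with a dot followed by +, and also shows that a dot does not match a newline by default.

```python
import re

one = re.match(r"h.llo", "heello")     # "." consumes exactly one character
many = re.match(r"h.+llo", "heello")   # ".+" consumes one or more characters
newline = re.match(r"a.b", "a\nb")     # "." does not match "\n" by default

print(one)            # None
print(many.group())   # heello
print(newline)        # None
```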

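2.3 The escape character \

A backslash \ has two roles. Placed before a metacharacter, it removes the special meaning, so \. matches a literal dot. Placed before certain ordinary letters, it forms a special sequence such as \d. For example: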

import re

# Match one or more digits with the special sequence \d
pattern = r"\d+"
string = "12345"
result = re.match(pattern, string)
print(result.group())  # prints: 12345

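Common special sequences include:

\d: matches any digit (equivalent to [0-9] for ASCII text).
\D: matches any non-digit character (equivalent to [^0-9] for ASCII text).
\w: matches any letter, digit, or underscore (equivalent to [a-zA-Z0-9_] for ASCII text).
\W: matches any character that is not a letter, digit, or underscore (equivalent to [^a-zA-Z0-9_] for ASCII text).
\s: matches any whitespace character (space, tab \t, newline \n, and so on).
\S: matches any non-whitespace character.

With str patterns in Python 3, \d and \w also match digits and letters from other scripts unless the re.ASCII flag is set.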
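The sequences above can be tried side by side on one string. This is a small sketch; the sample text is invented for the example.

```python
import re

text = "id_7 = 42\t(ok)"   # hypothetical sample containing a tab

digits = re.findall(r"\d", text)       # single digits
words = re.findall(r"\w+", text)       # runs of letters, digits, underscores
spaces = re.findall(r"\s", text)       # whitespace characters
non_space = re.findall(r"\S+", text)   # runs of non-whitespace characters

print(digits)      # ['7', '4', '2']
print(words)       # ['id_7', '42', 'ok']
print(spaces)      # [' ', ' ', '\t']
print(non_space)   # ['id_7', '=', '42', '(ok)']
```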

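Note: Python string literals have their own backslash escapes, so in an ordinary string every backslash of a pattern has to be doubled, as in "\\w+". Writing the pattern as a raw string, r"\w+", avoids the doubling and is the usual choice.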

import re

# Match a run of word characters; the raw string keeps the backslash as written
pattern = r"\w+"
string = "hello_world"
result = re.match(pattern, string)
print(result.group())  # prints: hello_world
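To make the escaping rule concrete, the following sketch compares a doubled backslash with a raw string, and shows why "\b" without the r prefix is risky. The variable names are illustrative only.

```python
import re

escaped = "\\d+"   # ordinary string: the backslash is doubled
raw = r"\d+"       # raw string: the backslash is written once

same = escaped == raw                      # both spell the same three characters
numbers = re.findall(raw, "a1b22")
backspace = "\b" == "\x08"                 # in an ordinary string, "\b" is a backspace
raw_len = len(r"\b")                       # in a raw string it stays backslash + "b"

print(same)        # True
print(numbers)     # ['1', '22']
print(backspace)   # True
print(raw_len)     # 2
```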

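2.4 Anchors and boundaries

Anchors and boundaries tie a pattern to a position rather than to a character. Common ones include:

^: the pattern must start at the beginning of the string.
$: the pattern must end at the end of the string.
\b: a word boundary, that is, the position between a word character (\w) and a non-word character (\W), or between a word character and the start or end of the string.
\B: a non-word boundary, that is, any position that is not a word boundary, such as between two word characters or between two non-word characters.

For example: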

import re

# Match a string that starts with "hello"
pattern1 = "^hello.*$"
string1 = "hello world"
string2 = "world hello"
result1 = re.match(pattern1, string1)
result2 = re.match(pattern1, string2)
print(result1.group())  # prints: hello world
print(result2)  # prints: None

# Match from the first word that begins with "is" up to the last word boundary
pattern2 = r"\bis.*\b"
string3 = "this is a test case for regex expression in python."
result3 = re.findall(pattern2, string3)
print(result3)  # prints: ['is a test case for regex expression in python']

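Notes:

\b matches only where a word character (letter, digit, or _) sits on one side and a non-word character, or the start or end of the string, sits on the other.

In r"\bis.*\b", the r prefix makes the literal a raw string: Python leaves the backslashes untouched instead of interpreting \b as a backspace character, so the regex engine receives \b and treats it as a word boundary. Raw strings are therefore the convenient way to write regular expressions.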
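The difference between \b and \B is easiest to see by looking at where "is" is found in a sentence. The sketch below uses an invented sentence and records the start index of each match.

```python
import re

text = "this is his list"   # hypothetical sample text

# "is" as a whole word: boundary on both sides
whole = [m.start() for m in re.finditer(r"\bis\b", text)]
# "is" strictly inside a word: no boundary on either side ("list")
inside = [m.start() for m in re.finditer(r"\Bis\B", text)]
# "is" at the end of a longer word ("this", "his")
ending = [m.start() for m in re.finditer(r"\Bis\b", text)]

print(whole)    # [5]
print(inside)   # [13]
print(ending)   # [2, 9]
```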

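3. Repetition Operators

Repetition operators specify how many times the preceding element may occur. Common ones include:

*: the preceding element occurs zero or more times.
+: the preceding element occurs one or more times.
?: the preceding element occurs zero times or once.
{m}: the preceding element occurs exactly m times.
{m,n}: the preceding element occurs from m to n times.

For example: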

import re

# Match "he", then zero or more "l", then "o"
pattern1 = "hel*o"
string1 = "hello world"
string2 = "heo world"
result1 = re.match(pattern1, string1)
result2 = re.match(pattern1, string2)
print(result1.group())  # prints: hello
print(result2.group())  # prints: heo

# Find every run of one or more digits
pattern2 = r"\d+"
string3 = "123abc456def789ghi0jkl"
result3 = re.findall(pattern2, string3)
print(result3)  # prints: ['123', '456', '789', '0']
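The examples above cover * and +. A similar sketch for ?, {m}, and {m,n} might look like this (the inputs are made up; \b is used so that only whole numbers of the required length match).

```python
import re

# "u?" makes the "u" optional
optional = re.findall(r"colou?r", "color colour")
# exactly four digits, delimited by word boundaries
exact = re.findall(r"\b\d{4}\b", "7 2024 123456")
# two or three digits, delimited by word boundaries
ranged = re.findall(r"\b\d{2,3}\b", "7 42 123 2024")

print(optional)   # ['color', 'colour']
print(exact)      # ['2024']
print(ranged)     # ['42', '123']
```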

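Notes:

Keep the following in mind when using repetition operators:

*, +, ? and {m,n} are greedy: they match as many characters as possible. Append ? to the operator (*?, +?, ??, {m,n}?) for a non-greedy match.
In {m,n}, omitting n ({m,}) means at least m times, and omitting m ({,n}) means at most n times.
A repetition operator applies only to the element immediately before it (a single character, a character set, or a group). To repeat a longer sequence, wrap it in a group, as in (ab)+.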
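Greedy and non-greedy matching, and the open-ended forms of {m,n}, can be compared with a sketch like the one below. The tag-like sample string is only an illustration, not a recommendation to parse HTML with regular expressions.

```python
import re

markup = "<b>bold</b><i>it</i>"   # hypothetical sample text

greedy = re.findall(r"<.+>", markup)    # ".+" grabs as much as possible
lazy = re.findall(r"<.+?>", markup)     # ".+?" stops at the first ">"

at_least = re.findall(r"a{2,}", "a aa aaaa")   # two or more "a"
at_most = re.fullmatch(r"a{,2}", "aaa")        # at most two "a": three do not fit

print(greedy)     # ['<b>bold</b><i>it</i>']
print(lazy)       # ['<b>', '</b>', '<i>', '</i>']
print(at_least)   # ['aa', 'aaaa']
print(at_most)    # None
```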

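4. Grouping and Capturing

Groups split a regular expression into sub-expressions that can be repeated, combined, or extracted individually. The common forms are:

4.1 Plain groups ()

A plain group wraps a sub-expression in parentheses (). For example: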

import re

# Match "hello" or "hi", followed by a space and the word "world"
pattern = "(hello|hi) world"
string1 = "hello world"
string2 = "hi there"
result1 = re.match(pattern, string1)
result2 = re.match(pattern, string2)
print(result1.group())  # prints: hello world
print(result2)  # prints: None

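A plain group changes precedence and limits the scope of alternation or repetition. It also captures the text it matched, which can be retrieved by number with group(1), group(2), and so on.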
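In Python's re module, plain parentheses capture by default. A minimal sketch of reading the captured text back by group number (the pattern and input are illustrative):

```python
import re

m = re.match(r"(hello|hi) (\w+)", "hello world")

whole = m.group(0)    # the entire match
first = m.group(1)    # text captured by the first pair of parentheses
second = m.group(2)   # text captured by the second pair
both = m.groups()     # all captured groups as a tuple

print(whole)    # hello world
print(first)    # hello
print(second)   # world
print(both)     # ('hello', 'world')
```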

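4.2 Named groups (?P<name>...)

A named group gives a sub-expression a name, and the captured text can then be looked up under that name, much like a key in a dictionary. In the (?P<name>...) syntax, name is the group name. For example: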

import re

# Match a date in YYYY-MM-DD format and capture year, month, and day by name
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
string = "2020-11-25"
result = re.match(pattern, string)
print(result.group())  # prints: 2020-11-25
print(result.group("year"))  # prints: 2020
print(result.group("month"))  # prints: 11
print(result.group("day"))  # prints: 25

A named capture is retrieved by passing its name to the group method.
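Named groups can also be collected in one call. The sketch below assumes the same year, month, and day group names used in the date example and reads them with the Match object's groupdict method.

```python
import re

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
m = re.match(pattern, "2020-11-25")

parts = m.groupdict()    # every named group as a dictionary
by_number = m.group(1)   # named groups still have a number

print(parts)       # {'year': '2020', 'month': '11', 'day': '25'}
print(by_number)   # 2020
```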

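4.3 Non-capturing groups (?:...)

A non-capturing group works like a plain group but does not capture what it matches. It is mainly used to avoid the overhead of capturing or to keep the group numbering simple. For example: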

import re

# Use a non-capturing group to repeat "word + whitespace" four times, then one more word
pattern = r"\b(?:\w+\s){4}\w+\b"
string = "this is a test case for regex expression in python."
result = re.findall(pattern, string)
print(result)  # prints: ['this is a test case', 'for regex expression in python']

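A non-capturing group serves only to change precedence or to limit the scope of repetition; its content cannot be retrieved with group().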
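One place where the choice between the two kinds of group visibly matters is re.findall, which returns the captured group instead of the whole match when the pattern contains a capturing group. A small sketch with an invented input:

```python
import re

text = "ab abab"   # hypothetical sample text

# With a capturing group, findall returns what the group captured last
capturing = re.findall(r"(ab)+", text)
# With a non-capturing group, findall returns the whole match
non_capturing = re.findall(r"(?:ab)+", text)

print(capturing)       # ['ab', 'ab']
print(non_capturing)   # ['ab', 'abab']
```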

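5. Replacing and Searching

Replacing and searching are the two most common regular expression operations. In Python they are performed with the sub and search functions of the re module.

5.1 Replacing

The sub function replaces every match with a given string. Its signature is:

re.sub(pattern, repl, string, count=0, flags=0)

Here pattern is the regular expression, repl is the replacement string, string is the original text, count is the maximum number of replacements (the default 0 replaces all matches), and flags sets the matching mode, as in re.compile.

For example: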

import re

# Replace the words "hello" and "hi" with "greeting"
pattern = r"(hello|hi)"
string = "hello world, hi there"
result = re.sub(pattern, "greeting", string)
print(result)  # prints: greeting world, greeting there
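The count and flags parameters of re.sub can be illustrated with a short sketch. The input sentence is invented, and both parameters are passed by keyword.

```python
import re

text = "hello world, hi there, Hello again"   # hypothetical sample text

# Replace only the first match
first_only = re.sub(r"hello|hi", "greeting", text, count=1)
# Replace all matches, ignoring case
ignore_case = re.sub(r"hello|hi", "greeting", text, flags=re.IGNORECASE)

print(first_only)    # greeting world, hi there, Hello again
print(ignore_case)   # greeting world, greeting there, greeting again
```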

Note: to use captured groups in repl, refer to them by number with \1, \2, and so on (numbering starts at 1), or by name with \g<name>.
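A minimal sketch of group references in the replacement string, first by number and then by name. The inputs and the date format are chosen only for illustration.

```python
import re

# Swap two words using numbered references
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

# Reorder a date using named references
date_pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
reordered = re.sub(date_pattern, r"\g<day>/\g<month>/\g<year>", "2020-11-25")

print(swapped)     # world hello
print(reordered)   # 25/11/2020
```

The replacement strings are written as raw strings for the same reason as patterns: otherwise "\1" would be read by Python as an escape sequence.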

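5.2 Searching

The search function scans a string for the first substring that matches the pattern and returns a Match object, or None if nothing matches. Its signature is:

re.search(pattern, string, flags=0)

Here pattern is the regular expression, string is the text to search, and flags sets the matching mode, as in re.compile.

For example: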

import re

# Find the first run of digits in a sentence and print it
pattern = r"\d+"
string = "this is a test case 20210825 for regex expression in python."
result = re.search(pattern, string)
print(result.group())  # prints: 20210825
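Because search returns None when nothing matches, calling group() on the result without a check can raise an AttributeError. The sketch below (with an invented input) also contrasts search with match, which only looks at the start of the string.

```python
import re

text = "order 66 shipped"   # hypothetical sample text

at_start = re.match(r"\d+", text)    # match: the string does not start with a digit
anywhere = re.search(r"\d+", text)   # search: scans the whole string

print(at_start)   # None
if anywhere is not None:
    print(anywhere.group())   # 66
    print(anywhere.span())    # (6, 8)

missing = re.search(r"\d+", "no digits here")
print(missing)   # None
```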

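6. Summary

This article introduced the basic syntax and common operations of regular expressions in Python. Keep the following points in mind: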