Revisiting Python Regular Expressions (the re Module)

Latest recommended article published on 2024-05-07 09:54:58

knighthood2001


Article link: https://blog.csdn.net/knighthood2001/article/details/124539352

1. Preface

2. Character Matching

3. Functions Explained

3.1 match() and fullmatch()

3.1.1 match(pattern, string, flags=0)

3.1.2 fullmatch(pattern, string, flags=0)

3.2 search(), findall() and finditer()

3.2.1 search(pattern, string, flags=0)

3.2.2 findall(pattern, string, flags=0)

3.2.3 finditer(pattern, string, flags=0)

3.3 sub() and subn()

3.3.1 sub(pattern, repl, string, count=0, flags=0)

3.3.2 subn(pattern, repl, string, count=0, flags=0)

3.4 split()

3.4.1 split(pattern, string, maxsplit=0, flags=0)

3.5 compile(), purge() and template()

compile(pattern, flags=0)

purge()

template(pattern, flags=0)

4. Summary of Common Regular Expressions

1. Preface

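Regular expressions are available in many programming languages, so the skill is widely transferable. They are concise, elegant, and powerful. For many practical tasks, regular expressions can greatly improve development efficiency and program quality. Jiang Tao, the founder of CSDN, experienced the power of this tool early on, when he was developing professional software products.

A regular expression is a formal notation for describing the structural pattern of strings. In its early days the notation was limited to describing regular text, hence the name "regular expression". As research progressed, the capabilities of regular expressions went well beyond the traditional mathematical limits, and they became a powerful practical tool. There are two reasons for this:

① Regular expressions operate on strings or, more abstractly, on sequences of objects, and this is precisely the fundamental data structure of today's computer systems. Most of the work we do with computers comes down to operations on such sequences, so regular expressions are widely applicable.

② Unlike most other techniques, regular expressions are very good at describing structure. In a computer, it is structure that organizes undifferentiated bytes into a wide variety of software objects, which are then combined into powerful software systems. Describing the structure therefore amounts to describing the system.

These two points are what make regular expressions so powerful, so everyone should know at least something about them. This article covers regular expressions in Python.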

2. Character Matching

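Character	Function
.	Matches any single character except a newline (\n)
^	Matches the start of the string
$	Matches the end of the string, or just before the newline at the end of the string
*	Matches 0 or more (greedy) repetitions of the preceding RE. Greedy means it matches as many repetitions as possible.
+	Matches 1 or more (greedy) repetitions of the preceding RE
?	Matches 0 or 1 (greedy) occurrences of the preceding RE
*?, +?, ??	Non-greedy versions of the previous three special characters (*, +, ?)
{m,n}	Matches from m to n repetitions of the preceding RE
{m,n}?	Non-greedy version of the above
\	Either escapes a special character or signals a special sequence
[]	Denotes a set of characters. Matches any one of the characters listed inside []
|	A|B creates an RE that matches either A or B
(...)	Matches the RE inside the parentheses. The contents can be retrieved or matched later in the string.
(?aiLmsux)	Sets the A, I, L, M, S, U, or X flag for the RE
(?:...)	Non-capturing version of regular parentheses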

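Note on greedy versus non-greedy matching: quantifiers in Python are greedy by default and always try to match as many characters as possible. Appending ? to a quantifier turns greedy matching off.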
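The difference between greedy and non-greedy quantifiers is easiest to see side by side. The following is a minimal sketch; the sample string and the variable names are made up for illustration and do not come from the article.

```python
import re

# Hypothetical sample text, chosen only to illustrate the difference.
html = "<b>one</b><b>two</b>"

# Greedy: .* runs as far as it can, so the match ends at the last "</b>".
greedy = re.findall(r"<b>.*</b>", html)

# Non-greedy: .*? stops at the first "</b>" it can reach, giving two matches.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```

The only change between the two patterns is the ? after the * quantifier.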

Special sequences consist of "\" and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character. (Since Python 3.6, an unknown escape made of "\" and an ASCII letter raises an error instead.)

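Sequence	Function
\number	Matches the contents of the group with the same number
\A	Matches only at the start of the string
\Z	Matches only at the end of the string
\b	Matches the empty string, but only at the start or end of a word
\B	Matches the empty string, but not at the start or end of a word
\d	Matches any decimal digit; equivalent to the set [0-9] in bytes patterns or in string patterns with the ASCII flag. In string patterns without the ASCII flag, it matches the whole range of Unicode digits. In short: \d matches a digit, i.e. 0-9
\D	Matches any non-digit character; equivalent to [^\d]
\s	Matches any whitespace character; equivalent to [ \t\n\r\f\v] in bytes patterns or in string patterns with the ASCII flag. In string patterns without the ASCII flag, it matches the whole range of Unicode whitespace characters. In short: \s matches whitespace such as spaces, tabs and newlines
\S	Matches any non-whitespace character; equivalent to [^\s]
\w	Matches any alphanumeric character; equivalent to [a-zA-Z0-9_] in bytes patterns or in string patterns with the ASCII flag. In string patterns without the ASCII flag, it matches the range of Unicode alphanumeric characters (letters plus digits plus underscore). With LOCALE, it matches the set [0-9_] plus the characters defined as letters for the current locale. In short: \w matches a word character, i.e. a-z, A-Z, 0-9 and _
\W	Matches the complement of \w, i.e. any non-word character
\\	Matches a literal backslash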
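A few of the sequences in the table can be tried directly. This is a small sketch with an invented sample string; it shows \b, \d and a numbered backreference, and is not an exhaustive tour of the table.

```python
import re

# Hypothetical sample text for illustration.
text = "cat concat 42 cats"

# \b anchors at word boundaries, so only the standalone word "cat" matches.
whole_word = re.findall(r"\bcat\b", text)

# \d+ matches runs of decimal digits.
numbers = re.findall(r"\d+", text)

# \1 must match the same text that group 1 captured: a doubled character.
doubled = re.search(r"(\w)\1", "hello")

print(whole_word)       # ['cat']
print(numbers)          # ['42']
print(doubled.group())  # ll
```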

Some of the functions in this module take flags as optional parameters (see the table below):

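Flag	Name	Description
A	ASCII	For string patterns, makes \w, \W, \b, \B, \d, \D match the corresponding ASCII character categories (rather than the whole Unicode categories, which is the default). For bytes patterns, this flag is the only available behaviour and needn't be specified
I	IGNORECASE	Performs case-insensitive matching
L	LOCALE	Makes \w, \W, \b, \B dependent on the current locale
M	MULTILINE	"^" matches the beginning of lines (after a newline) as well as the start of the string. "$" matches the end of lines (before a newline) as well as the end of the string
S	DOTALL	"." matches any character at all, including the newline
X	VERBOSE	Ignores whitespace and comments for nicer looking REs
U	UNICODE	For compatibility only. Ignored for string patterns (it is the default), and forbidden for bytes patterns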
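As a rough illustration of how the flags in the table change matching, the sketch below applies re.I, re.M and re.X to invented sample data. Flags can be combined with the | operator.

```python
import re

# Hypothetical three-line sample text.
text = "Hello\nhello\nHELLO"

# IGNORECASE: letter case is ignored.
ignore_case = re.findall(r"hello", text, re.I)

# MULTILINE: ^ and $ also match at the start and end of each line.
multiline = re.findall(r"^hello$", text, re.M)

# Flags are combined with the bitwise OR operator.
combined = re.findall(r"^hello$", text, re.I | re.M)

# VERBOSE: whitespace and comments inside the pattern are ignored.
phone_pattern = re.compile(r"""
    \d{3}   # first three digits
    -       # separator
    \d{4}   # last four digits
""", re.X)
phone = phone_pattern.search("call 555-1234 now").group()

print(ignore_case)  # ['Hello', 'hello', 'HELLO']
print(multiline)    # ['hello']
print(combined)     # ['Hello', 'hello', 'HELLO']
print(phone)        # 555-1234
```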

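Key point: re.S is often used with findall().

When the string contains newline characters (\n):

① Without re.S, "." does not match a newline, so a single match cannot extend across lines. Matching effectively happens within each line, and if one line has no match the search continues on the next line.

② With re.S, "." also matches newlines, so the regular expression treats the string as a whole and a match can span several lines.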

import re
a = """hello
hello
helloworld
12321hello"""
b = re.findall(r"hello.*1",a)
print(b)
# []
c = re.findall(r"hello.*1", a, re.S)
print(c)
# ['hello\nhello\nhelloworld\n12321']

3. Functions Explained

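The main functions of the re module are covered in the subsections below.

Notes:

① Most functions in the re module accept a flags argument, which holds the matching flags of the regular expression. It can specify, for example, whether matching ignores case or whether multi-line matching is used.

② When writing regular expressions we usually use a raw string (prefix the string literal with r). In a raw string, Python does not process backslash escapes, so every character keeps its literal meaning. Regular expressions contain many metacharacters and places that need escaping; without a raw string each backslash \ has to be written as \\, which is inconvenient both to write and to read.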
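The effect of the r prefix can be checked with a short experiment. This is only a sketch; the sample path and strings are invented for illustration.

```python
import re

# "\n" is one newline character; r"\n" is a backslash followed by "n".
lengths = (len("\n"), len(r"\n"))

# Matching one literal backslash: four backslashes without r, two with r.
path = r"C:\temp"
plain = re.findall("\\\\", path)
raw = re.findall(r"\\", path)

# Without r, "\b" is a backspace character, not the word-boundary sequence.
with_raw = re.findall(r"\bcat\b", "a cat")
without_raw = re.findall("\bcat\b", "a cat")

print(lengths)       # (1, 2)
print(plain == raw)  # True
print(with_raw)      # ['cat']
print(without_raw)   # []
```

The last two lines show why forgetting the prefix can fail silently: the pattern is still valid, it just means something else.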

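3.1 match() and fullmatch()

The flags argument is usually omitted.

3.1.1 match(pattern, string, flags=0)

Parameters: pattern is the regular expression to match, string is the string to match against, and flags is explained above.

    """Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found."""

Note: if the start of the string does not match, the result is None even if later text would match. If the start does match, any characters after the match are ignored.

This makes the following example easy to understand: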

import re
a = re.match(r"hello", "helloworld!")
b = re.match(r"world", "helloworld!")
print(a)
# <re.Match object; span=(0, 5), match='hello'>
print(a.group())
# hello
print(b)
# None

re.match() returns a Match object (shown below), not the matched text.

<re.Match object; span=(0, 5), match='hello'>

To get the matched text, call group() on the Match object.
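Because match() returns None on failure, calling group() directly on the result can raise AttributeError. The sketch below shows one way to guard against that and to read capture groups; the helper name parse_year_month and the sample strings are hypothetical.

```python
import re

def parse_year_month(text):
    """Return details of a leading YYYY-MM prefix, or None if there is none."""
    m = re.match(r"(\d{4})-(\d{2})", text)
    if m is None:  # no match at the start of the string
        return None
    return {
        "whole": m.group(),   # same as m.group(0): the entire match
        "year": m.group(1),   # first capture group
        "month": m.group(2),  # second capture group
        "span": m.span(),     # (start, end) indices of the match
    }

result = parse_year_month("2022-05-02 release")
missing = parse_year_month("release 2022-05-02")

print(result)   # {'whole': '2022-05', 'year': '2022', 'month': '05', 'span': (0, 7)}
print(missing)  # None
```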

3.1.2 fullmatch(pattern, string, flags=0)

    """Try to apply the pattern to all of the string, returning
    a Match object, or None if no match was found."""

import re
a = re.fullmatch(r"hello", "helloworld!")
b = re.fullmatch(r"helloworld!", "helloworld!")
c = re.fullmatch(r"world", "helloworld!")
print(a)
# None
print(b)
# <re.Match object; span=(0, 11), match='helloworld!'>
print(b.group())
# helloworld!
print(c)
# None

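Likewise, the print() calls above show that fullmatch() matches from the start of the string to its end and returns a Match object only if the whole string matches. In other words, trailing characters matter too: if they are not matched, the result is None.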
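This behaviour makes fullmatch() a natural fit for input validation. A minimal sketch, assuming a six-digit code that must not start with 0; the function name is_postal_code is made up for this example.

```python
import re

def is_postal_code(value):
    """Return True only if the entire string is six digits, the first not 0."""
    return re.fullmatch(r"[1-9]\d{5}", value) is not None

# match() accepts a valid prefix followed by junk; fullmatch() does not.
prefix_only = re.match(r"[1-9]\d{5}", "310000abc") is not None

print(prefix_only)                  # True
print(is_postal_code("310000abc"))  # False
print(is_postal_code("310000"))     # True
```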

3.2 search(), findall() and finditer()

The most important difference between search()/findall() and match()/fullmatch() is that search() and findall() are not anchored to the start or end of the string: a match anywhere in the string counts as a success.

3.2.1 search(pattern, string, flags=0)

    """Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found."""

import re
a = re.search("el", "helloworld!")
b = re.findall(r"l", "helloworld!")
print(a)
# <re.Match object; span=(1, 3), match='el'>
print(a.group())
# el
print(b)
# ['l', 'l', 'l']
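Note that search() stops at the first match it finds. The following sketch, using an invented sentence, contrasts it with match() and shows how to read the position of the hit.

```python
import re

# Hypothetical sample text containing two numbers.
text = "order 66 shipped, order 77 pending"

first = re.search(r"\d+", text)   # scans forward and returns the first match
at_start = re.match(r"\d+", text) # only tries position 0, which is a letter

print(first.group())               # 66
print(first.start(), first.end())  # 6 8
print(at_start)                    # None
```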

3.2.2 findall(pattern, string, flags=0)

    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""

As the example in 3.2.1 shows, findall() finds every non-overlapping match and returns them as a list.
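The docstring's remark about capturing groups is worth a quick demonstration, since the shape of the result changes with the number of groups. A small sketch with made-up data:

```python
import re

# Hypothetical key=value text.
text = "alice=30, bob=25"

# No groups: a list of the full matches.
no_group = re.findall(r"\w+=\d+", text)

# One group: a list of that group's text only.
one_group = re.findall(r"\w+=(\d+)", text)

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w+)=(\d+)", text)

print(no_group)    # ['alice=30', 'bob=25']
print(one_group)   # ['30', '25']
print(two_groups)  # [('alice', '30'), ('bob', '25')]
```

If grouping is needed only for structure, (?:...) keeps the full match in the result.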

3.2.3 finditer(pattern, string, flags=0)

    """Return an iterator over all non-overlapping matches in the
    string.  For each match, the iterator returns a Match object.

    Empty matches are included in the result."""

import re
a = "hello world hello world python"
b = re.finditer(r"he", a)
print(b)
# <callable_iterator object at 0x00000280EDC46518>
print(type(b))
# <class 'callable_iterator'>
print("-"*50)
for i in b:
    print(i)
    print(i.group())
    print("*"*30)

Here callable_iterator is the type of the iterator object that finditer() returns.
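Since each item produced by finditer() is a Match object, positions are available as well as text, which findall() does not provide. A brief sketch reusing the same sample string:

```python
import re

text = "hello world hello world python"

# Collect the matched text together with its start and end index.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r"he", text)]

print(positions)  # [('he', 0, 2), ('he', 12, 14)]
```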

3.3 sub() and subn()

3.3.1 sub(pattern, repl, string, count=0, flags=0)

Parameters: pattern is the regular expression to match; repl is the replacement (a string or a callable); string is the string to process; count is the maximum number of replacements, counted from the left. The default of 0 replaces every occurrence, and 1 replaces only the leftmost one. flags is explained above.

    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the Match object and must return
    a replacement string to be used."""

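import re
a = re.sub("php", "python", "php is the best language in the world---php")
print(a)
# python is the best language in the world---python
b = re.sub("php", "python", "php is the best language in the world---php", count=0)
print(b)
# python is the best language in the world---python
c = re.sub("php", "python", "php is the best language in the world---php", count=1)
print(c)
# python is the best language in the world---php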
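The docstring also mentions backslash escapes in repl and callable replacements. The sketch below illustrates both; the date string, the sentence and the helper name double are invented for the example.

```python
import re

# Backreferences in repl: \1, \2, \3 refer to the captured groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2022-05-02")

def double(match):
    """Receive the Match object and return the replacement text."""
    return str(int(match.group()) * 2)

# Callable repl: called once per match.
doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

print(swapped)  # 02/05/2022
print(doubled)  # 6 apples and 20 pears
```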

3.3.2 subn(pattern, repl, string, count=0, flags=0)

    """Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl.  number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the Match object and must
    return a replacement string to be used."""

import re
a = re.subn("php", "python", "php is the best language in the world---php,php,php,php,php")
print(a)
# ('python is the best language in the world---python,python,python,python,python', 6)
b = re.subn("php", "python", "php is the best language in the world---php,php,php,php,php", count=0)
print(b)
# ('python is the best language in the world---python,python,python,python,python', 6)
c = re.subn("php", "python", "php is the best language in the world---php,php,php,php,php", count=1)
print(c)
# ('python is the best language in the world---php,php,php,php,php', 1)
d = re.subn("php", "python", "php is the best language in the world---php,php,php,php,php", count=3)
print(d)
# ('python is the best language in the world---python,python,php,php,php', 3)

The example shows that the return value is a tuple made up of the new string and the number of replacements performed.

3.4 split()

3.4.1 split(pattern, string, maxsplit=0, flags=0)

    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list."""

In short: split() splits the string and returns a list. maxsplit is the maximum number of splits, counted from the left.

import re
a = "hello world hello world hello world"
b = re.split(r" ", a)
print(b)
# ['hello', 'world', 'hello', 'world', 'hello', 'world']
c = re.split(r" ", a, maxsplit=2)
print(c)
# ['hello', 'world', 'hello world hello world']
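Where re.split() goes beyond str.split() is in splitting on a pattern and, with capturing parentheses, keeping the separators. A short sketch with invented inputs:

```python
import re

# Split on any run of commas, semicolons or whitespace.
parts = re.split(r"[,;\s]+", "a, b;c  d")

# A capturing group makes the separators part of the result.
with_sep = re.split(r"(-)", "2022-05-02")

# maxsplit limits the number of splits; the rest stays in the last element.
limited = re.split(r"-", "2022-05-02", maxsplit=1)

print(parts)     # ['a', 'b', 'c', 'd']
print(with_sep)  # ['2022', '-', '05', '-', '02']
print(limited)   # ['2022', '05-02']
```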

3.5 compile(), purge() and template()

compile(pattern, flags=0)

"Compile a regular expression pattern, returning a Pattern object."
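A compiled Pattern object offers the same operations as the module-level functions, as methods, with the pattern and flags fixed at compile time. The sketch below assumes a simple word pattern; the name WORD and the sample strings are illustrative only.

```python
import re

# Compile once, then reuse the Pattern object.
WORD = re.compile(r"[a-z]+", re.I)

found = WORD.findall("Hello World 42")  # like re.findall(pattern, string, flags)
at_start = WORD.match("42 abc")         # anchored at the start: no match here
anywhere = WORD.search("42 abc")        # scans the whole string
masked = WORD.sub("#", "ab 1 cd")       # like re.sub(pattern, repl, string)

print(found)             # ['Hello', 'World']
print(at_start)          # None
print(anywhere.group())  # abc
print(masked)            # # 1 #
```

Compiling is mostly a readability and reuse choice: a named pattern can be defined in one place and applied in many.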

purge()

"Clear the regular expression caches"

template(pattern, flags=0)

"Compile a template pattern, returning a Pattern object"

Note: template() is undocumented. It was deprecated in Python 3.11 and removed in Python 3.13, so it should not be used in new code.

4. Summary of Common Regular Expressions

Non-negative integer: ^\d+$
Positive integer: ^[0-9]*[1-9][0-9]*$
Non-positive integer: ^((-\d+)|(0+))$
Negative integer: ^-[0-9]*[1-9][0-9]*$
Integer: ^-?\d+$
Non-negative float: ^\d+(\.\d+)?$
Positive float: ^(([0-9]+\.[0-9]*[1-9][0-9]*)|([0-9]*[1-9][0-9]*\.[0-9]+)|([0-9]*[1-9][0-9]*))$
Non-positive float: ^((-\d+(\.\d+)?)|(0+(\.0+)?))$
Negative float: ^(-((positive float pattern)))$
English letters: ^[A-Za-z]+$
Uppercase English letters: ^[A-Z]+$
Lowercase English letters: ^[a-z]+$
English letters and digits: ^[A-Za-z0-9]+$
Letters, digits and underscores: ^\w+$
E-mail address: ^[\w-]+(\.[\w-]+)*@[\w-]+(\.[\w-]+)+$
URL: ^[a-zA-Z]+://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?$
Or: ^http:\/\/[A-Za-z0-9]+\.[A-Za-z0-9]+[\/=\?%\-&_~`@[\]\':+!]*([^<>\"\"])*$
Postal code: ^[1-9]\d{5}$
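Chinese characters: ^[\u0391-\uFFE5]+$
Telephone number: ^((\(\d{2,3}\))|(\d{3}\-))?(\(0\d{2,3}\)|0\d{2,3}-)?[1-9]\d{6,7}(\-\d{1,4})?$
Mobile phone number: ^((\(\d{2,3}\))|(\d{3}\-))?13\d{9}$
Double-byte characters (including Chinese characters): [^\x00-\xff]
Leading and trailing whitespace: (^\s*)|(\s*$) (like the trim function in VBScript)
HTML tags: <(.*)>.*<\/\1>|<(.*) \/>
Blank lines: \n[\s| ]*\r
Extract hyperlinks from text: (h|H)(r|R)(e|E)(f|F) *= *('|")?(\w|\\|\/|\.)+('|"| *|>)?
Extract e-mail addresses from text: \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
Extract image links from text: (s|S)(r|R)(c|C) *= *('|")?(\w|\\|\/|\.)+('|"| *|>)?
Extract IP addresses from text: (\d+)\.(\d+)\.(\d+)\.(\d+)
Extract Chinese mobile phone numbers from text: (86)*0*13\d{9}
Extract Chinese landline numbers from text: (\(\d{3,4}\)|\d{3,4}-|\s)?\d{8}
Extract Chinese phone numbers from text (mobile and landline): (\(\d{3,4}\)|\d{3,4}-|\s)?\d{7,14}
Extract Chinese postal codes from text: [1-9]{1}(\d+){5}
Extract floating-point numbers (decimals) from text: (-?\d*)\.?\d+
Extract any number from text: (-?\d*)(\.\d+)?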
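IP address: (\d+)\.(\d+)\.(\d+)\.(\d+)
Telephone area code: ^0\d{2,3}$
Tencent QQ number: ^[1-9]*[1-9][0-9]*$
Account name (starts with a letter, 5-16 characters, letters, digits and underscores allowed): ^[a-zA-Z][a-zA-Z0-9_]{4,15}$
Chinese characters, English letters, digits and underscores: ^[\u4e00-\u9fa5_a-zA-Z0-9]+$
Regular expression matching Chinese characters: [\u4e00-\u9fa5]
Matching double-byte characters (including Chinese characters): [^\x00-\xff]
Regular expression matching blank lines: \n[\s| ]*\r
Regular expression matching HTML tags: <(.*)>.*<\/\1>|<(.*) \/>
SQL statement: ^(select|drop|delete|create|update|insert).*$
Regular expression matching leading and trailing whitespace: (^\s*)|(\s*$)
Regular expression matching e-mail addresses: \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*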
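As a rough sketch of how patterns from this list might be used for validation in Python, the example below wraps a few of them in a lookup table. The dictionary PATTERNS, its keys and the helper is_valid are hypothetical names introduced here; the patterns themselves are copied from the list above.

```python
import re

# A few patterns from the list above, keyed by made-up names.
PATTERNS = {
    "integer": r"^-?\d+$",
    "postal_code": r"^[1-9]\d{5}$",
    "account": r"^[a-zA-Z][a-zA-Z0-9_]{4,15}$",
}

def is_valid(kind, value):
    """Return True if the whole of value matches the named pattern."""
    return re.fullmatch(PATTERNS[kind], value) is not None

print(is_valid("integer", "-42"))         # True
print(is_valid("postal_code", "012345"))  # False
print(is_valid("account", "user_01"))     # True

# $ also matches just before a trailing newline, so match() accepts "123\n"
# while fullmatch() rejects it.
loose = re.match(r"^\d+$", "123\n") is not None
strict = re.fullmatch(r"^\d+$", "123\n") is not None
print(loose, strict)                      # True False
```

Using fullmatch() here is a design choice: it avoids the trailing-newline case that ^...$ with match() lets through.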