Python Regular Expressions [Standard Library: re]

半斗烟草

Article link: https://blog.csdn.net/qq_40494873/article/details/120797490

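I. Introduction to the re module

The re module is part of the Python standard library. Reading the re.py source is well worth the time; in the version quoted here, the module's public names are listed in __all__: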

__all__ = [
    "match", "fullmatch", "search", "sub", "subn", "split",
    "findall", "finditer", "compile", "purge", "template", "escape",
    "error", "A", "I", "L", "M", "S", "X", "U",
    "ASCII", "IGNORECASE", "LOCALE", "MULTILINE", "DOTALL", "VERBOSE",
    "UNICODE",
]

__version__ = "2.2.1"

The module docstring in re.py documents the regular expression syntax and the API in detail:

r"""Support for regular expressions (RE).

This module provides regular expression matching operations similar to
those found in Perl.  It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves.  You can
concatenate ordinary characters, so last matches the string 'last'.

The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
    "$"      Matches the end of the string or just before the newline at
             the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
    "\\"     Either escapes special characters or signals a special sequence.
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
    "|"      A|B, creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
    (?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE (see below).
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name)     Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.
    (?<=...) Matches if preceded by ... (must be fixed length).
    (?<!...) Matches if not preceded by ... (must be fixed length).
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                       the (optional) no pattern otherwise.

The special sequences consist of "\\" and a character from the list
below.  If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode digits.
    \D       Matches any non-digit character; equivalent to [^\d].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode whitespace characters.
    \S       Matches any non-whitespace character; equivalent to [^\s].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.

This module exports the following functions:
    match     Match a regular expression pattern to the beginning of a string.
    fullmatch Match a regular expression pattern to all of a string.
    search    Search a string for the presence of a pattern.
    sub       Substitute occurrences of a pattern found in a string.
    subn      Same as sub, but also return the number of substitutions made.
    split     Split a string by the occurrences of a pattern.
    findall   Find all occurrences of a pattern in a string.
    finditer  Return an iterator yielding a match object for each match.
    compile   Compile a pattern into a RegexObject.
    purge     Clear the regular expression cache.
    escape    Backslash all non-alphanumerics in a string.

Some of the functions in this module takes flags as optional parameters:
    A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D
                   match the corresponding ASCII character categories
                   (rather than the whole Unicode categories, which is the
                   default).
                   For bytes patterns, this flag is the only available
                   behaviour and needn't be specified.
    I  IGNORECASE  Perform case-insensitive matching.
    L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
    M  MULTILINE   "^" matches the beginning of lines (after a newline)
                   as well as the string.
                   "$" matches the end of lines (before a newline) as well
                   as the end of the string.
    S  DOTALL      "." matches any character at all, including the newline.
    X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
    U  UNICODE     For compatibility only. Ignored for string patterns (it
                   is the default), and forbidden for bytes patterns.
                   
This module also defines an exception 'error'.

"""

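II. The re module in detail

1. Introduction to regular expressions

I like to divide regular expression syntax into four categories: character symbols, position symbols, quantity symbols, and grouping symbols.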

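Character symbols:

Symbol	Description
.	Matches any character except a newline
\d	Matches one digit; equivalent to [0-9] for bytes patterns or with the ASCII flag (str patterns otherwise match any Unicode decimal digit)
\D	Matches any non-digit character; the complement of \d
\s	Matches any whitespace character; equivalent to [ \t\n\r\f\v] in ASCII mode
\S	Matches any non-whitespace character; the complement of \s
\w	Matches one letter, digit, or underscore; equivalent to [a-zA-Z0-9_] in ASCII mode
\W	Matches any character that is not a letter, digit, or underscore; the complement of \w
\	Escape character: the character that follows loses its special meaning. For example, \. matches only a literal "." and no longer matches any character
[]	Character set: matches any single character listed in the set
|	Alternation: a|b matches either a or b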
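The character symbols above can be tried out with a short sketch. The sample string and variable names are made up for illustration; the last few lines show that, in a str pattern, \d is wider than [0-9] unless re.ASCII is set.

```python
import re

text = "Order #42: 3 apples, 1.5 kg\tshipped"

digits = re.findall(r"\d", text)            # each single digit
words = re.findall(r"\w+", text)            # runs of letters, digits, underscore
chunks = re.findall(r"\S+", text)           # runs of non-whitespace
literal_dot = re.findall(r"\.", text)       # escaped dot: a real "." only
any_char = re.findall(r"1.5", "1.5 1x5")    # unescaped dot: any character
vowels = re.findall(r"[aeiou]", "apples")   # character set
either = re.findall(r"kg|apples", text)     # alternation

# In a str pattern, \d also matches non-ASCII decimal digits
# (here FULLWIDTH DIGIT THREE) unless re.ASCII is set.
fullwidth_three = "\uff13"
unicode_digit = re.fullmatch(r"\d", fullwidth_three) is not None
ascii_digit = re.fullmatch(r"\d", fullwidth_three, re.ASCII) is not None

print(digits, words, chunks)
print(literal_dot, any_char, vowels, either)
print(unicode_digit, ascii_digit)
```

All patterns are written as raw strings (r"...") so that Python does not interpret the backslashes before the re module sees them.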

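Position symbols:

Symbol	Description
^	Matches the start of the string; in multiline mode, also the start of each line
$	Matches the end of the string (or just before a trailing newline); in multiline mode, also the end of each line
\A	Matches only at the start of the string, regardless of multiline mode
\Z	Matches only at the end of the string, regardless of multiline mode
\b	Matches the empty string at the start or end of a word
\B	Matches the empty string anywhere except at the start or end of a word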
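A minimal sketch of how the position symbols behave, with and without multiline mode. The two-line sample text is invented for the example.

```python
import re

text = "first line\nsecond line"

start_default = re.findall(r"^\w+", text)                 # start of string only
start_multi = re.findall(r"^\w+", text, re.MULTILINE)     # start of every line
end_default = re.findall(r"\w+$", text)                   # end of string only
end_multi = re.findall(r"\w+$", text, re.MULTILINE)       # end of every line

# \A and \Z ignore multiline mode.
abs_start = re.findall(r"\A\w+", text, re.MULTILINE)
abs_end = re.findall(r"\w+\Z", text, re.MULTILINE)

# \b requires a word boundary, \B forbids one.
whole_word = re.findall(r"\bcat\b", "cat concat cat.")
inside_word = re.findall(r"\Bcat", "cat concat")

print(start_default, start_multi, end_default, end_multi)
print(abs_start, abs_end, whole_word, inside_word)
```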

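Quantity symbols:

Symbol	Description
*	Matches the preceding expression 0 or more times (greedy)
+	Matches the preceding expression 1 or more times (greedy)
?	Matches the preceding expression 0 or 1 times (greedy)
*?, +?, ??	Non-greedy versions of the three above: match as few repetitions as possible
{m,n}	Matches the preceding expression from m to n times (greedy)
{m,n}?	Matches the preceding expression from m to n times, non-greedy: as few repetitions as possible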
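The difference between greedy and non-greedy quantifiers is easiest to see side by side. The following is a small sketch on made-up inputs.

```python
import re

html = "<b>one</b><b>two</b>"

greedy = re.findall(r"<b>.*</b>", html)    # .* grabs as much as it can
lazy = re.findall(r"<b>.*?</b>", html)     # .*? stops at the first </b>

star_allows_zero = re.fullmatch(r"ab*", "a") is not None   # * accepts 0 repetitions
plus_needs_one = re.fullmatch(r"ab+", "a") is None         # + needs at least 1

optional = re.findall(r"colou?r", "color colour")          # ? means 0 or 1
bounded = re.findall(r"\d{2,4}", "1 12 123 12345")         # 2 to 4 digits, greedy
bounded_lazy = re.findall(r"\d{2,4}?", "12345")            # 2 to 4 digits, as few as possible

print(greedy)
print(lazy)
print(star_allows_zero, plus_needs_one, optional, bounded, bounded_lazy)
```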

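Grouping symbols:

Symbol	Description
(...)	A group. Groups capture by default, so the text they match can be retrieved separately. Each group gets an index starting at 1, assigned in the order of the opening "("
(?aiLmsux)	Inline flags: each letter in aiLmsux turns on one flag (see the flag table below)
(?:...)	Non-capturing group: it is skipped when group indexes are assigned
(?P<name>...)	Named group: its content can be retrieved by index or by name
(?P=name)	Named backreference: matches the same text that the earlier group called name matched
(?#...)	A comment; it has no effect on the rest of the expression
(?=...)	Positive lookahead: the text to the right of the current position must match the enclosed expression
(?!...)	Negative lookahead: the text to the right of the current position must not match the enclosed expression
(?<=...)	Positive lookbehind: the text to the left of the current position must match the enclosed expression
(?<!...)	Negative lookbehind: the text to the left of the current position must not match the enclosed expression
(?(id/name)yes|no)	Conditional: if the group with the given id or name matched, the yes pattern is used; otherwise the (optional) no pattern is used
\number	Numbered backreference: matches the same text that the group with index number captured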
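The grouping constructs above can be sketched as follows. All sample strings and variable names are invented for the example, and the conditional pattern is a simplified illustration rather than a real e-mail validator.

```python
import re

# Named and numbered capturing groups.
date = re.search(r"(?P<year>\d{4})\.(?P<month>\d{2})\.(?P<day>\d{2})", "by 2021.10.15")
year_by_index = date.group(1)
month_by_name = date.group("month")
all_parts = date.groups()

# (?:...) groups without capturing, so findall returns only the (\d) group.
non_capturing = re.findall(r"(?:ab)+(\d)", "abab1 ab2")

# Backreferences: \1 by number, (?P=q) by name.
doubled = re.findall(r"\b(\w+) \1\b", "the the cat sat sat")
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group(0)

# Lookarounds check context without consuming it.
kilos = re.findall(r"\d+(?= kg)", "3 kg, 5 lb, 12 kg")
dollars = re.findall(r"(?<=\$)\d+", "cost $30 or 25 euros")
not_dollars = re.findall(r"(?<!\$)\b\d+", "cost $30 or 25 euros")

# Conditional: a closing ">" is required only if an opening "<" was captured.
addr = re.compile(r"(<)?\w+@\w+\.\w+(?(1)>)")
bracketed_ok = addr.fullmatch("<a@b.com>") is not None
bare_ok = addr.fullmatch("a@b.com") is not None
unbalanced_ok = addr.fullmatch("<a@b.com") is not None

print(year_by_index, month_by_name, all_parts, non_capturing)
print(doubled, quoted, kilos, dollars, not_dollars)
print(bracketed_ok, bare_ok, unbalanced_ok)
```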

2. re module API

The API at a glance:

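API	Purpose
compile	Compiles a pattern into a reusable pattern object
template	Compiles a "template" pattern; it is undocumented and I have not figured out what it is for (it was deprecated in Python 3.11)

match	Matches only at the start of the string; if that fails, it gives up
fullmatch	Checks whether the whole string, from start to end, matches the pattern
search	Scans every possible position in the string until it finds the first match, or fails once the whole string has been searched
findall	Finds all non-overlapping matches and returns them as a list
finditer	Finds all non-overlapping matches and returns an iterator of match objects

sub	Searches the whole string and replaces every match with the given replacement
subn	Same as sub, but returns a tuple (new_string, number_of_substitutions)
split	Splits the string at every match; maxsplit limits how many splits are made

purge	Clears the regular expression cache. Prefer creating patterns with compile so they can be reused explicitly
escape	Escapes special characters in a string so it can be used as a literal pattern (before Python 3.7: everything except ASCII letters, digits, and '_')
error	The exception raised for invalid patterns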
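As a quick orientation, here is a sketch that runs one compiled pattern through most of the functions in the table. The sample text and the names number, text, and bad_pattern_rejected are illustrative only.

```python
import re

number = re.compile(r"\d+")     # compile once, reuse the pattern object
text = "room 101, floor 3"

at_start = number.match(text)               # None: text does not start with a digit
first = number.search(text).group()         # first match anywhere in the string
whole = number.fullmatch("101").group()     # the entire string must match
every = number.findall(text)
spans = [m.span() for m in number.finditer(text)]

replaced = number.sub("#", text)
replaced_with_count = number.subn("#", text)
pieces = number.split(text)

# escape() turns arbitrary text into a pattern that matches it literally.
raw = "1+1=2?"
literal_match = re.fullmatch(re.escape(raw), raw) is not None

# Invalid patterns raise re.error.
try:
    re.compile("(")
    bad_pattern_rejected = False
except re.error:
    bad_pattern_rejected = True

print(at_start, first, whole, every, spans)
print(replaced, replaced_with_count, pieces)
print(literal_match, bad_pattern_rejected)
```

The module-level functions (re.search(pattern, text) and so on) accept either a pattern string or a compiled pattern, so the article's style of passing a compiled regex to re.match works as well.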

The functions above accept an optional flags argument that selects the matching mode. The available flags are:

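Short name	Full name	Effect
I	IGNORECASE	Case-insensitive matching
L	LOCALE	Locale-dependent matching: \w, \W, \b, \B follow the rules of the current locale
M	MULTILINE	Multiline mode; changes the behavior of ^ and $
S	DOTALL	"." matches any character at all, including the newline (which it does not match by default)
X	VERBOSE	Verbose mode: whitespace and # comments inside the pattern are ignored
U	UNICODE	Unicode matching rules; this is already the default for str patterns in Python 3
A	ASCII	Restricts \w, \W, \b, \B, \d, \D to ASCII. Python 3 strings are Unicode, so without this flag these classes match Unicode characters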
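The effect of each flag can be checked with a short sketch. The inputs are invented for illustration; flags can be combined with | and can also be set inline with (?i) and friends.

```python
import re

ignore_case = re.findall(r"python", "Python PYTHON python", re.IGNORECASE)

dot_default = re.search(r"a.b", "a\nb")                     # "." skips newlines
dot_all = re.search(r"a.b", "a\nb", re.DOTALL) is not None  # now it matches them

multi = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

# VERBOSE lets the pattern carry whitespace and comments.
phone = re.compile(r"""
    (\d{3})   # prefix
    -         # separator
    (\d{4})   # line number
""", re.VERBOSE)
phone_parts = phone.search("call 555-1234").groups()

# ASCII narrows \w to [a-zA-Z0-9_]; the default also accepts accented letters.
word_unicode = re.findall(r"\w+", "caf\u00e9")
word_ascii = re.findall(r"\w+", "caf\u00e9", re.ASCII)

combined = re.findall(r"^python$", "Python\nPYTHON", re.I | re.M)
inline = re.findall(r"(?i)python", "Python")

print(ignore_case, dot_default, dot_all, multi, phone_parts)
print(word_unicode, word_ascii, combined, inline)
```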

III. re code examples

# -*- coding:utf-8 -*-

import re

# Walking through the re API
s = 'The launch of shenzhou 13 manned spacecraft was a complete success! 12 by 2021.10.15' 

s1 = '''
<div class="contson" id="contson16e9c75bf6d1">
    <span style="color:#B00815;">男儿何不带吴钩，收取关山五十州。</span><br>请君暂上凌烟阁，若个书生万户侯？ 
</div>
'''
s2 = '''2021.10.16, Apple Apach
phone: 0041-123-158-7710(intel), 158-7777-2455(china)
email: caontcat-123@cnte.com,caontcat-123@163.com
'''

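regex = re.compile(r'\w')  # compile: build a reusable pattern object (raw string keeps the backslash intact)
re_match = re.match(regex, s)  # matches once, at the start of the string only; returns None on failure
print(type(re_match))  # <class 're.Match'> (shown as <class '_sre.SRE_Match'> before Python 3.7)
print(re_match)  # <re.Match object; span=(0, 1), match='T'>: position and matched text
print(re_match.group(0))  # group(0) returns the matched text: 'T'
print(re_match.span())  # position of the match: (0, 1)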

regex = re.compile(r'.*')
re_fullmatch = re.fullmatch(regex, s)  # the pattern must match the entire string, from start to end
print(re_fullmatch.span())  # (0, len(s))
print(re_fullmatch.group(0))  # the whole string

regex = re.compile(r'\d')
re_search = re.search(regex, s)  # scans the whole string and returns the first match found
print(re_search)  # <re.Match object; span=(23, 24), match='1'>
print(re_search.group(0))  # '1'

regex = re.compile(r'\d+')
re_findall = re.findall(regex, s)  # scans the whole string and returns all matches as a list
print(re_findall)  # ['13', '12', '2021', '10', '15'] (a list; \d+ is greedy)

regex = re.compile(r'\d+')
re_finditer = re.finditer(regex, s)  # scans the whole string and returns an iterator of match objects
print(type(re_finditer))  # <class 'callable_iterator'>
for x in re_finditer:
    print(x)  # first iteration: <re.Match object; span=(23, 25), match='13'>
    print(x.group())  # in turn: 13, 12, 2021, 10, 15

regex = re.compile(r'\d+')
re_sub = re.sub(regex, 'xxx', s)  # scans the whole string and replaces every match with the given string
print(re_sub)  # The launch of shenzhou xxx manned spacecraft was a complete success! xxx by xxx.xxx.xxx

regex = re.compile(r'\d+')
re_subn = re.subn(regex, 'XXX', s)  # like sub, but returns a tuple (new_string, number_of_substitutions)
print(re_subn)  # ('The launch of shenzhou XXX manned spacecraft was a complete success! XXX by XXX.XXX.XXX', 5)
re_subn = re.subn(regex, 'XXX', s, count=2)  # count limits how many matches are replaced
print(re_subn)  # ('The launch of shenzhou XXX manned spacecraft was a complete success! XXX by 2021.10.15', 2)

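regex = re.compile(r'\d+')
re_split = re.split(regex, s)  # splits the string at every match
print(re_split)  # 5 matches give 6 pieces; the last one is '' because s ends with a match
re_split = re.split(regex, s, maxsplit=2)  # maxsplit limits the number of splits
print(re_split)  # 3 pieces; the last one keeps the remaining digits: ' by 2021.10.15'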

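re.purge()  # clears the regular expression cache; prefer compile() so patterns can be reused explicitly

# regex_template = re.template('%d+')  # undocumented; apparently some kind of pattern template, purpose unclear

# re.escape('2021.10.15')  # escapes regex special characters in a string (takes a str, not a compiled pattern)

e = re.error  # the exception class raised for invalid patterns
print(e)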
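The listing above defines s1 and s2 but never uses them. As a possible continuation, the sketch below pulls a few fields out of them. The patterns are deliberately simple illustrations written for these two samples; they are not general-purpose HTML, phone, or e-mail parsers.

```python
import re

s1 = '''
<div class="contson" id="contson16e9c75bf6d1">
    <span style="color:#B00815;">男儿何不带吴钩，收取关山五十州。</span><br>请君暂上凌烟阁，若个书生万户侯？
</div>
'''
s2 = '''2021.10.16, Apple Apach
phone: 0041-123-158-7710(intel), 158-7777-2455(china)
email: caontcat-123@cnte.com,caontcat-123@163.com
'''

# Text inside the first <span>: non-greedy .*? stops at the first </span>.
span_text = re.search(r"<span[^>]*>(.*?)</span>", s1).group(1)

# Drop every tag and keep only the text.
plain_text = re.sub(r"<[^>]+>", "", s1).strip()

# Date parts from the first line of s2.
date_parts = re.search(r"(\d{4})\.(\d{2})\.(\d{2})", s2).groups()

# Phone numbers with their label in parentheses: two groups per match.
phones = re.findall(r"(\d[\d-]+\d)\((\w+)\)", s2)

# E-mail addresses.
emails = re.findall(r"[\w.-]+@[\w.-]+\.\w+", s2)

print(span_text)
print(plain_text)
print(date_parts, phones, emails)
```

When a pattern contains more than one capturing group, findall returns a list of tuples, which is why phones comes back as (number, label) pairs.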

