01 Regular Expressions

a_small_python

Category: Python Basics

Article link: https://blog.csdn.net/a_small_python/article/details/79307192

Regular expressions are a powerful tool and technique for string processing.

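1. Common regular expression metacharacters:

Code    Description
.       Matches any single character except a newline
\w      Matches a letter, digit, or underscore (for str patterns in Python 3, any Unicode word character, including Chinese characters)
\s      Matches any whitespace character
\d      Matches a digit; equivalent to [0-9] for ASCII text (str patterns in Python 3 also match other Unicode decimal digits unless re.ASCII is set)
\b      Matches the start or end of a word
^       Matches the start of the string (or of each line with re.M); the text after ^ must appear there
$       Matches the end of the string (or of each line with re.M); the text before $ must appear there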
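The table above can be tried out with a short script. This is only a minimal sketch: the sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text containing a newline
text = "Order 66 shipped\nto user_01"

any_char = re.findall(r"O.d", text)      # '.' matches the 'r' in 'Ord'
no_newline = re.findall(r"d.to", text)   # '.' does not match '\n', so nothing is found
words = re.findall(r"\w+", text)         # runs of letters, digits, and underscores
spaces = re.findall(r"\s", text)         # spaces and the newline
digits = re.findall(r"\d+", text)        # runs of digits
whole_word = re.findall(r"\bto\b", text) # 'to' as a standalone word
starts = re.search(r"^Order", text) is not None    # anchored at the start of the string
ends = re.search(r"user_01$", text) is not None    # anchored at the end of the string

print(any_char, no_newline, words, spaces, digits, whole_word, starts, ends)
```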

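Common negated metacharacters
Code    Description
\W      Matches any character that is not a letter, digit, underscore, or Chinese character
\S      Matches any character that is not whitespace
\D      Matches any character that is not a digit
\B      Matches a position that is not the start or end of a word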
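A small sketch of the negated classes, using made-up sample strings. Each negated class matches exactly what its lowercase counterpart rejects.

```python
import re

non_digits = re.findall(r"\D", "a1b2")         # everything that is not a digit
non_word = re.findall(r"\W", "a1 b2_c!")       # the space and '!' are not word characters
non_space = re.findall(r"\S+", "a1 b2_c!")     # runs of non-whitespace characters

# \B requires a non-boundary position, so 'at' only matches inside a longer word
inner_at = [m.start() for m in re.finditer(r"\Bat", "cat at bat")]

print(non_digits, non_word, non_space, inner_at)
```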

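Common quantifiers (placed after the item they repeat)
Code    Description
*       Repeat zero or more times
+       Repeat one or more times
?       Repeat zero or one time
{n}     Repeat exactly n times
{n,}    Repeat n or more times
{n,m}   Repeat between n and m times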
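The quantifiers can be compared side by side on the same input. The strings below are invented for this sketch.

```python
import re

letters = "a ab abbb"
star = re.findall(r"ab*", letters)       # 'a' followed by zero or more 'b'
plus = re.findall(r"ab+", letters)       # 'a' followed by one or more 'b'
optional = re.findall(r"ab?", letters)   # 'a' followed by at most one 'b'

numbers = "7 42 123 4567"
exactly_3 = re.findall(r"\b\d{3}\b", numbers)     # exactly three digits
at_least_2 = re.findall(r"\b\d{2,}\b", numbers)   # two or more digits
two_to_three = re.findall(r"\b\d{2,3}\b", numbers)  # two or three digits

print(star, plus, optional, exactly_3, at_least_2, two_to_three)
```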

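Additional syntax:
|   Matches the expression before or after the |
\   Escapes the character that follows it
[]  Matches any single character listed inside the []
-   Inside [], denotes a range
()  Treats the content inside the () as a single unit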
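The following sketch exercises each of the additional constructs once. The inputs are hypothetical.

```python
import re

either = re.findall(r"cat|dog", "cat dog bird")      # alternation
literal_dot = re.findall(r"\.", "a.b.c")             # escaped '.' matches only a real dot
vowels = re.findall(r"[aeiou]", "regex")             # any one character from the set
ranges = re.findall(r"[a-c0-2]", "abcd0123")         # ranges inside a set

# With (), the quantifier applies to the whole group 'ab'
grouped = re.fullmatch(r"(ab)+", "ababab") is not None
# Without (), '+' applies only to 'b', so 'ababab' is not matched
ungrouped = re.fullmatch(r"ab+", "ababab") is not None

print(either, literal_dot, vowels, ranges, grouped, ungrouped)
```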

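2. Main functions of the re module
You can either call the re module functions directly on strings, or compile a pattern into a regular expression object and then call that object's methods.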

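compile(pattern[, flags])             create a pattern object
search(pattern, string[, flags])      search the whole string for the pattern; return a match object or None
match(pattern, string[, flags])       match the pattern at the start of the string; return a match object or None
findall(pattern, string[, flags])     return a list of all matches of the pattern in the string
split(pattern, string[, maxsplit=0])  split the string at the matches of the pattern
sub(pat, repl, string[, count=0])     replace the matches of pat in the string with repl (all of them when count is 0)
escape(string)                        escape all special regular expression characters in the string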
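The difference between search and match is easy to miss, so here is a minimal sketch with a made-up string. re.fullmatch, which is not in the list above, is included only for contrast.

```python
import re

text = "hello world"

at_start = re.match(r"hello", text)       # succeeds: 'hello' is at the start
not_at_start = re.match(r"world", text)   # None: match() only tries the start
anywhere = re.search(r"world", text)      # succeeds: search() scans the whole string
whole = re.fullmatch(r"hello", text)      # None: the pattern must cover the entire string

print(at_start.group(), not_at_start, anywhere.span(), whole)
```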

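The flags argument can be any combination of the following values, joined with '|':
re.I  ignore case
re.L  make the predefined classes \w \W \b \B \s \S depend on the current locale (in Python 3 this flag is accepted only with bytes patterns)
re.M  multi-line mode: ^ and $ also match at the start and end of each line
re.S  make the metacharacter '.' match any character, including a newline
re.U  make the predefined classes \w \W \b \B \s \S \d \D follow the Unicode character properties (already the default for str patterns in Python 3)
re.X  verbose mode: the pattern can span several lines, whitespace in it is ignored, and # comments are allowed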
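A short sketch of the most commonly used flags, including a combination with '|'. The sample strings and the phone-number pattern are invented for illustration.

```python
import re

ignore_case = re.findall(r"python", "Python PYTHON python", re.I)

lines = "one\ntwo\nthree"
first_only = re.findall(r"^\w+", lines)          # ^ matches only at the start of the string
every_line = re.findall(r"^\w+", lines, re.M)    # ^ matches at the start of each line

dot_default = re.search(r"a.b", "a\nb")          # None: '.' skips the newline
dot_all = re.search(r"a.b", "a\nb", re.S)        # matches: '.' now includes the newline

# Verbose mode: whitespace is ignored and comments are allowed
phone = re.compile(r"""
    \d{3}   # prefix
    -       # separator
    \d{4}   # line number
""", re.X)
number = phone.search("call 555-1234").group()

combined = re.findall(r"^b\w*", "Bob\nbill", re.I | re.M)

print(ignore_case, first_only, every_line, dot_default, dot_all.group(), number, combined)
```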

3. Using the re module functions directly
>>> import re

>>> text = 'alpha. beta....gamma delta'

>>> re.split(r'[\. ]+', text)
['alpha', 'beta', 'gamma', 'delta']

>>> re.split(r'[\. ]+', text, maxsplit=2)  # split at most 2 times
['alpha', 'beta', 'gamma delta']
>>> re.split(r'[\. ]+', text, maxsplit=1)  # split at most 1 time
['alpha', 'beta....gamma delta']

>>> pat = '[a-zA-Z]+'

>>> re.findall(pat, text)  # find all words
['alpha', 'beta', 'gamma', 'delta']

>>> pat = '{name}'

>>> text = 'Dear {name}...'

>>> re.sub(pat, 'Mr.Dong', text)  # string replacement
'Dear Mr.Dong...'

>>> s = 'a s d'

>>> re.sub('a|s|d', 'good', s)  # string replacement
'good good good'

>>> re.escape('http://www.python.org')  # escape special characters (Python 3.7+ output; older versions also escape ':' and '/')
'http://www\\.python\\.org'

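4. Using regular expression objects
First compile the regular expression with the re module's compile() function to obtain a regular expression object, then process strings with the methods that object provides.

Reusing a compiled regular expression object avoids looking up (or recompiling) the pattern on every call, which can speed up string processing when the same pattern is applied many times.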
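As a minimal sketch of the compile-once idea, the pattern below is compiled a single time and then reused for every line. The list of lines and the names are hypothetical. The module-level functions give the same results, and since re keeps an internal cache of compiled patterns, the actual speed difference depends on the workload.

```python
import re

# Compile once, outside the loop
word_pattern = re.compile(r"[a-zA-Z]+")

lines = ["alpha beta", "gamma", "42"]  # made-up input

# Reuse the compiled object for every line
counts = [len(word_pattern.findall(line)) for line in lines]

# Equivalent result with the module-level function
counts_module = [len(re.findall(r"[a-zA-Z]+", line)) for line in lines]

print(counts, counts_module)
```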

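part1

① The match(string[, pos[, endpos]]) method of a regular expression object searches at the start of the string or at the given position; the pattern must occur exactly at the start or at that position.

② The search(string[, pos[, endpos]]) method of a regular expression object searches the whole string or the given range.

③ The findall(string[, pos[, endpos]]) method of a regular expression object finds all substrings that match the regular expression and returns them as a list.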
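The optional pos and endpos arguments are not used in the examples of this section, so here is a small sketch of how they limit the searched range. The string and names are made up.

```python
import re

pattern = re.compile(r"\d+")
text = "a1b22c333"   # indexes: a=0, 1=1, b=2, 2=3, 2=4, c=5, 3=6, 3=7, 3=8

no_match = pattern.match(text)              # None: the string starts with 'a'
from_1 = pattern.match(text, 1).group()     # matching starts at index 1
from_2 = pattern.search(text, 2).group()    # searching starts at index 2
window = pattern.findall(text, 2, 7)        # only text[2:7] == 'b22c3' is examined
head = pattern.findall(text, 0, 4)          # only text[0:4] == 'a1b2' is examined

print(no_match, from_1, from_2, window, head)
```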

>>> import re
>>> example = 'ShanDong Institute of Business and Technology'
>>> pattern = re.compile(r'\bB\w+\b')  # words starting with B
>>> pattern.findall(example)
['Business']
>>> pattern = re.compile(r'\w+g\b')  # words ending with g
>>> pattern.findall(example)
['ShanDong']
>>> pattern = re.compile(r'\b[a-zA-Z]{3}\b')  # words exactly 3 letters long
>>> pattern.findall(example)
['and']
>>> pattern.match(example)  # anchored at the start of the string, so it fails and returns None
>>> pattern.search(example)  # scans the whole string, so it succeeds
<re.Match object; span=(31, 34), match='and'>
>>> pattern = re.compile(r'\b\w*a\w*\b')  # all words containing the letter a
>>> pattern.findall(example)
['ShanDong', 'and']

part2
Methods for replacing string content:
⑤ sub(repl, string[, count = 0]) --> newstring
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
⑥ subn(repl, string[, count = 0]) --> (newstring, number of subs)
Return the tuple (new_string, number_of_subs_made) found by replacing the leftmost non-overlapping occurrences of pattern with the replacement repl.
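A brief sketch of subn, and of passing a function as repl, on a single made-up line. The callable form is part of the standard sub API: it receives each match object and returns the replacement text.

```python
import re

pattern = re.compile(r"\bb\w*\b", re.I)
text = "Beautiful is better than ugly."

# subn returns both the new string and the number of replacements made
new_text, n_subs = pattern.subn("*", text)

# count limits how many matches are replaced
first_only = pattern.sub("*", text, count=1)

# repl can be a function of the match object
shouted = pattern.sub(lambda m: m.group().upper(), text)

print(new_text, n_subs)
print(first_only)
print(shouted)
```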

>>> example = '''Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.'''
>>> pattern = re.compile(r'\bb\w*\b', re.I)
>>> print(pattern.sub('*', example))  # replace every word starting with 'b' or 'B' with '*'
* is * than ugly.
Explicit is * than implicit.
Simple is * than complex.
Complex is * than complicated.
Flat is * than nested.
Sparse is * than dense.
Readability counts.

>>> print(pattern.sub('*', example, 1))  # replace only the first match
* is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
>>> pattern = re.compile(r'\bb\w*\b')
>>> print(pattern.sub('*', example, 1))  # replace the first word starting with lowercase 'b' with '*'
Beautiful is * than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.

part3
Splitting strings:
⑦ split(string[, maxsplit = 0]) --> list

>>> import re
>>> example=r'one two three four,five.six,,seven7eight'
>>> pattern = re.compile(r'[\s,.\d]+')  # separators may repeat
>>> pattern.split(example)
['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight']

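5. Subpatterns and match objects
Parentheses () define a subpattern: the content inside the () is treated as a single unit. For example, '(red)+' matches 'redred', 'redredred', and other repetitions of 'red'.
The match and search methods of a regular expression object return a match object on success.

The main methods of a match object are group(), groups(), groupdict(), start(), end(), span(), and so on.

① group([group1, …]):
Returns the string captured by one or more groups; with several arguments, the result is a tuple. group1 can be a group number or a group name. Number 0 stands for the entire matched substring, and calling group() with no argument returns group(0). A group that captured nothing returns None; a group that captured several times returns the last captured substring.
② groups([default]):
Returns the strings captured by all groups as a tuple, equivalent to calling group(1, 2, …, last). default is the value used for groups that captured nothing; it defaults to None.
③ groupdict([default]):
Returns a dictionary whose keys are the names of the named groups and whose values are the substrings they captured; unnamed groups are not included. default has the same meaning as above.
④ start([group]):
Returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.
⑤ end([group]):
Returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.
⑥ span([group]):
Returns (start(group), end(group)).
⑦ expand(template):
Substitutes the captured groups into template and returns the result. template can refer to groups with \id, \g<id>, or \g<name>. \0 is not a group reference (it is read as an octal escape); use \g<0> for the entire match. \id and \g<id> are equivalent, but \10 is read as group 10; to express \1 followed by the character '0', write \g<1>0.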
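The match object methods listed above can be seen together in one sketch. The date pattern, the group names, and the sample strings are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: year-month with an optional day, all named groups
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})(?:-(?P<day>\d{2}))?")

m = pattern.search("Released on 2018-02-11.")

whole = m.group()                 # same as m.group(0): the entire match
pair = m.group(1, 2)              # several groups at once, returned as a tuple
day = m.group("day")              # access by name
all_groups = m.groups()
named = m.groupdict()
position = m.span()               # (start, end) of the entire match
month_pos = (m.start("month"), m.end("month"))
reordered = m.expand(r"\g<day>/\g<month>/\g<year>")

# When the optional day is absent, its group is None unless a default is given
m2 = pattern.search("2018-02")
missing_day = m2.group("day")
with_default = m2.groups("01")

print(whole, pair, day, all_groups, named, position, month_pos, reordered)
print(missing_day, with_default)
```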

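Extended subpattern syntax:

(?P<groupname>...): gives the subpattern the name groupname
(?iLmsux): sets matching flags; several letters can be combined, and each letter has the same meaning as the corresponding compile flag
(?:...): matches the subexpression without capturing it
(?P=groupname): matches the same text as the earlier subpattern named groupname
(?#...): a comment
(?=...): placed after an expression; matches only if the text that follows matches ..., which is not included in the result
(?!...): placed after an expression; matches only if the text that follows does not match ..., which is not included in the result
(?<=...): placed before an expression; the lookbehind counterpart of (?=...), it matches only if the text immediately before matches ...
(?<!...): placed before an expression; the lookbehind counterpart of (?!...), it matches only if the text immediately before does not match ...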
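To make the extended syntax concrete, the sketch below uses each construct once. All sample strings are invented. Note that an inline flag such as (?i) has to be at the very start of the pattern in recent Python versions.

```python
import re

# Named group plus backreference: find a doubled word
doubled = re.search(r"(?P<word>\b\w+)\s+(?P=word)\b", "this is is a test").group()

# Capturing vs non-capturing groups change what findall returns
capturing = re.findall(r"(ab)+c", "ababc abc")        # returns the group content
non_capturing = re.findall(r"(?:ab)+c", "ababc abc")  # returns the whole match

# Lookahead: the '@' or '%' is checked but not included in the result
before_at = re.findall(r"\w+(?=@)", "alice@example.com bob@test.org")
not_percent = re.findall(r"\b\d+\b(?!%)", "10% 20 30%")

# Lookbehind: the '$' is checked but not included in the result
after_dollar = re.findall(r"(?<=\$)\d+", "cost $15 or 20")
not_after_dollar = re.findall(r"(?<!\$)\b\d+", "cost $15 or 20")

# Inline flag and inline comment
inline_flag = re.findall(r"(?i)python", "Python PYTHON")
with_comment = re.search(r"\d+(?#one or more digits)", "a12").group()

print(doubled, capturing, non_capturing, before_at, not_percent)
print(after_dollar, not_after_dollar, inline_flag, with_comment)
```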