Python Basics: Regular Expressions

Latest recommended article published 2021-07-22 16:06:37

十一姐

Category: PythonKnowledge

Article link: https://blog.csdn.net/weixin_43411585/article/details/91891846

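Chapter 7: Pattern Matching with Regular Expressions

Reference: the regular expression (re module) section of the official Python documentation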

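1. Overview

No.	Pattern	Meaning
1	?	Matches the preceding group zero or one time
2	*	Matches the preceding group zero or more times
3	+	Matches the preceding group one or more times
4	{n}	Matches the preceding group exactly n times
5	{n,}	Matches the preceding group n or more times
6	{,m}	Matches the preceding group zero to m times
7	{n,m}	Matches the preceding group at least n and at most m times
8	{n,m}? or *? or +?	Performs a non-greedy match of the preceding group
9	^spam	The string must begin with spam
10	spam$	The string must end with spam
11	.	Matches any character except a newline
12	\d, \w, and \s	Match a digit, a word character, and a whitespace character, respectively
13	\D, \W, and \S	Match any character that is not a digit, not a word character, and not a whitespace character, respectively
14	[abc]	Matches any one character inside the brackets (such as a, b, or c)
15	[^abc]	Matches any one character that is not inside the brackets
16	re.I	Case-insensitive matching
17	sub()	Replaces matched text
18	re.DOTALL	Makes . match every character, including a newline
19	re.VERBOSE	Ignores whitespace and comments inside the regex string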

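2. Matching with Regex objects

Pass the string you want to search to a Regex object's search() method. It returns a Match object if the pattern is found and None otherwise. Calling the Match object's group() method returns the text that actually matched.
Putting r before the opening quote marks the string as a raw string, in which backslashes are not treated as escape characters.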

import re
# Three digits, three digits, four digits, separated by hyphens
phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())
>>>>> Phone number found: 415-555-4242
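
Because search() returns None when nothing matches, calling group() on the result without checking it first raises an AttributeError. A minimal sketch of a guard, assuming a hypothetical helper named find_phone that is not part of the original article:

```python
import re

# A phone-number pattern: 3 digits, 3 digits, 4 digits, separated by hyphens.
phone_num_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

def find_phone(text):
    """Return the first phone number in text, or None if there is none."""
    mo = phone_num_regex.search(text)
    # search() returns None on failure, so test the result before calling group().
    return mo.group() if mo is not None else None

found = find_phone('My number is 415-555-4242.')
missing = find_phone('No number here.')
print(found)    # 415-555-4242
print(missing)  # None
```

Returning None rather than raising keeps the caller's check as simple as the one search() itself requires.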

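3. Grouping with parentheses

The first pair of parentheses in a regex string is group 1, and the second pair is group 2.
Passing the integer 1 or 2 to the Match object's group() method returns the corresponding part of the matched text.
Passing 0 or no argument to group() returns the entire matched text.
To retrieve all groups at once, use the groups() method.
To match literal parentheses, escape them as \( and \).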

phoneNumRegex = re.compile(r'(\d{3})-(\d{3}-\d{4})')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Whole match: {}, group 1: {}, group 2: {}'.format(mo.group(), mo.group(1), mo.group(2)))
>>>>> Whole match: 415-555-4242, group 1: 415, group 2: 555-4242
print('All groups:', mo.groups())
>>>>> All groups: ('415', '555-4242')
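
Escaping literal parentheses is mentioned above without an example. The following sketch, using an illustrative phone-number format chosen for this example, combines escaped parentheses (matched literally) with unescaped ones (which still create groups):

```python
import re

# \( and \) match literal parentheses; the unescaped ( ) pairs still create groups.
paren_phone_regex = re.compile(r'(\(\d{3}\)) (\d{3}-\d{4})')
mo = paren_phone_regex.search('My phone number is (415) 555-4242.')

area_code = mo.group(1)
main_number = mo.group(2)
print(area_code)    # (415)
print(main_number)  # 555-4242
```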

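4. Matching multiple alternatives with |

The | character is called a pipe. Use it when you want to match any one of several expressions. When more than one alternative occurs in the searched text, search() returns the one that occurs first.
To match a literal pipe character, escape it as \|.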

heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
print(mo1.group())
>>>>> Batman
mo2 = heroRegex.search('Tina Fey and Batman.')
print(mo2.group())
>>>>> Tina Fey
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
print(mo.group())
>>>>> Batmobile
print(mo.group(1))
>>>>> mobile

5. Optional matching with ?

The ? matches the group that precedes it zero or one time.
To match a literal question mark, escape it as \?.

batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())
>>>>> Batman
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())
>>>>> Batwoman
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo1 = phoneRegex.search('My number is 415-555-4242')
print(mo1.group())
>>>>> 415-555-4242
mo2 = phoneRegex.search('My number is 555-4242')
print(mo2.group())
>>>>> 555-4242

6. Matching zero or more times with *

To match a literal asterisk, escape it as \*.

batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())
>>>>> Batman
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())
>>>>> Batwoman
mo3 = batRegex.search('The Adventure of Batwowowowoman')
print(mo3.group())
>>>>> Batwowowowoman

7. Matching one or more times with +

To match a literal plus sign, escape it as \+.

batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1)
>>>>> None
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())
>>>>> Batwoman
mo3 = batRegex.search('The Adventure of Batwowowowoman')
print(mo3.group())
>>>>> Batwowowowoman

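8. Matching a specific number of repetitions with {}

{3} matches exactly 3 times, {3,5} matches 3 to 5 times, {3,} matches 3 or more times, and {,5} matches 0 to 5 times.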

haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
print(mo1.group())
>>>>> HaHaHa
mo2 = haRegex.search('Ha')
print(mo2 == None)
>>>>> True
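
The open-ended forms {n,} and {,m} are described above but not demonstrated. A short sketch, with variable names chosen only for illustration:

```python
import re

at_least_three = re.compile(r'(Ha){3,}')  # three or more repetitions
up_to_two = re.compile(r'(Ha){,2}')       # zero to two repetitions

long_match = at_least_three.search('HaHaHaHaHa').group()
too_short = at_least_three.search('HaHa')
capped = up_to_two.search('HaHaHa').group()
empty = up_to_two.search('xyz').group()

print(long_match)   # HaHaHaHaHa
print(too_short)    # None
print(capped)       # HaHa
print(repr(empty))  # ''
```

Note that {,2} allows zero repetitions, so it still succeeds on 'xyz' with an empty match instead of returning None.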

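9. Greedy and non-greedy matching

Python regular expressions are greedy by default: in ambiguous situations they match the longest string possible.
The ? has two meanings in regular expressions: it either marks a non-greedy match or marks an optional group. The two meanings are entirely unrelated.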

greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
print(mo1.group())
>>>>> HaHaHaHaHa
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
print(mo2.group())
>>>>> HaHaHa

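10. The findall() method

findall() returns a list of strings, one per match. If the regular expression contains two or more groups, findall() returns a list of tuples instead, each tuple holding the group strings of one match. With exactly one group, it returns a list of that group's strings.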

phoneNumRegex = re.compile(r'\d{3}-\d{4}-\d{4}')
mo = phoneNumRegex.findall('Cell:415-5555-9999 Work:212-5554-0000')
print(mo)
>>>>> ['415-5555-9999', '212-5554-0000']
phoneNumRegex = re.compile(r'(\d{3})-(\d{3})-(\d\d\d\d)')
mo = phoneNumRegex.findall('Cell:415-555-9999 Work:212-555-0000')
print(mo)
>>>>> [('415', '555', '9999'), ('212', '555', '0000')]
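
The return shape of findall() depends on how many groups the pattern contains. The sketch below compares zero, one, and two groups on the same text, using the module-level re.findall() function for brevity:

```python
import re

text = 'Cell:415-555-9999 Work:212-555-0000'

# No groups: a list of the full matches.
no_groups = re.findall(r'\d{3}-\d{3}-\d{4}', text)
# One group: a list of strings holding only that group's text.
one_group = re.findall(r'(\d{3})-\d{3}-\d{4}', text)
# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\d{3})-(\d{3}-\d{4})', text)

print(no_groups)   # ['415-555-9999', '212-555-0000']
print(one_group)   # ['415', '212']
print(two_groups)  # [('415', '555-9999'), ('212', '555-0000')]
```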

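11. Character classes

\d : any digit from 0 to 9
\D : any character that is not a digit from 0 to 9
\w : any letter, digit, or the underscore character (think of it as matching "word" characters)
\W : any character that is not a letter, digit, or underscore
\s : a space, tab, or newline (think of it as matching "whitespace" characters)
\S : any character that is not a space, tab, or newline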

xmasRegex = re.compile(r'\d+\s\w+')
print(xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge'))
>>>>> ['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids', '7 swans', '6 geese', '5 rings', '4 birds', '3 hens', '2 doves', '1 partridge']

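12. Making your own character classes

Define your own character class with square brackets.
A hyphen (-) denotes a range of letters or digits; for example, the character class [a-zA-Z0-9] matches all lowercase letters, uppercase letters, and digits.
Inside square brackets, the usual regex symbols such as ., *, ?, and () are not interpreted, so they do not need a backslash escape.
A caret (^) right after the opening bracket negates the class; the second example below matches every character that is not a vowel.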

vowelRegex = re.compile(r'[aeiouAEIOU]')
print(vowelRegex.findall('RoboCop eats baby food. BABY FOOD'))
>>>>> ['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']
vowelRegex = re.compile(r'[^aeiouAEIOU]')
print(vowelRegex.findall('RoboCop eats baby food. BABY FOOD'))
>>>>> ['R', 'b', 'C', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', ' ', 'B', 'B', 'Y', ' ', 'F', 'D']
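
The range syntax and the literal treatment of symbols inside brackets can be illustrated with a short sketch. The sample strings are made up for this example:

```python
import re

# Ranges: runs of lowercase letters, uppercase letters, or digits.
alnum_regex = re.compile(r'[a-zA-Z0-9]+')
words = alnum_regex.findall('user_42 logged-in at 09:30')
print(words)  # ['user', '42', 'logged', 'in', 'at', '09', '30']

# Inside brackets, ., ?, * and + are literal characters and need no backslash.
punct_regex = re.compile(r'[.?*+]')
puncts = punct_regex.findall('1+1=2? Yes. 2*3=6.')
print(puncts)  # ['+', '?', '.', '*', '.']
```

The underscore, hyphen, and colon split the first string because none of them falls inside the listed ranges.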

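13. Starting with ^ and ending with $

^ at the start of a regex means the match must occur at the beginning of the searched text.
$ at the end of a regex means the searched string must end with this pattern.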

# r'^Hello' matches strings that begin with 'Hello'
beginsWithHello = re.compile(r'^Hello')
print(beginsWithHello.search('Hello world!'))
>>>>> <re.Match object; span=(0, 5), match='Hello'>
print(beginsWithHello.search('He said hello Hello'))
>>>>> None
# r'\d$' matches strings that end with a digit from 0 to 9
endWithNumber = re.compile(r'\d$')
print(endWithNumber.search('Your number is 42'))
>>>>> <re.Match object; span=(16, 17), match='2'>
print(endWithNumber.search('Your number is forty two'))
>>>>> None
# r'^\d+$' matches strings that consist only of digits from beginning to end
wholeStringIsNum = re.compile(r'^\d+$')
print(wholeStringIsNum.search('1234567890'))
>>>>> <re.Match object; span=(0, 10), match='1234567890'>
print(wholeStringIsNum.search('12345xy67890'))
>>>>> None

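14. The wildcard character .

. matches any character except a newline. It matches exactly one character, which is why 'flat' in the example below yields only 'lat'.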

atRegex = re.compile(r'.at')
print(atRegex.findall('The cat in the hat sat on the flat mat.'))
>>>>> ['cat', 'hat', 'sat', 'lat', 'mat']

15. Matching everything with .*

.* matches any run of characters except newlines, in greedy mode.
.*? does the same in non-greedy mode.

greedyRegex = re.compile(r'<.*>')
print(greedyRegex.search('<To serve man> for dinner.>').group())
>>>>> <To serve man> for dinner.>
nongreedyRegex = re.compile(r'<.*?>')
print(nongreedyRegex.search('<To serve man> for dinner.>').group())
>>>>> <To serve man>

16. Matching newlines with the dot character

Passing re.DOTALL as the second argument to re.compile() makes the dot match every character, including newlines.

newlineRegex = re.compile(r'.*', re.DOTALL)
print(newlineRegex.search('Serve the public trust.\nProtect the innocent.').group())
>>>>> Serve the public trust.
>>>>> Protect the innocent.
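
To make the effect of re.DOTALL visible, the following sketch runs the same pattern with and without the flag on a two-line string:

```python
import re

text = 'Serve the public trust.\nProtect the innocent.'

# Without re.DOTALL the dot stops at the first newline.
no_newline = re.compile(r'.*').search(text).group()
# With re.DOTALL the dot also matches the newline, so the whole text matches.
with_newline = re.compile(r'.*', re.DOTALL).search(text).group()

print(repr(no_newline))    # 'Serve the public trust.'
print(repr(with_newline))  # 'Serve the public trust.\nProtect the innocent.'
```

Printing with repr() shows the newline as \n instead of breaking the output across two lines.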

17. Case-insensitive matching with re.I

Passing re.I (short for re.IGNORECASE) to re.compile() makes matching case-insensitive.

robocop = re.compile(r'robocop', re.I)
print(robocop.search('Robocop is part man,part machine,all cop.').group())
>>>>> Robocop

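18. Replacing strings with the sub() method

The sub() method of a Regex object takes two arguments: the first is the replacement string that takes the place of each match, and the second is the string to search with the regular expression.
In the replacement string, \1, \2, \3 and so on insert the text of group 1, 2, 3, and so on.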

namesRegex = re.compile(r'Agent \w+')
print(namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.'))
>>>>> CENSORED gave the secret documents to CENSORED.
# \1 in the replacement inserts the text of group 1 (here, the first letter of the name)
agentNamesRegex = re.compile(r'Agent (\w)\w*')
print(agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve'))
>>>>> A**** told C**** that E****
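
Several group references can be combined in one replacement string. As an illustration with a made-up sentence, this sketch reorders ISO-style dates by referring to three groups:

```python
import re

# Group 1 is the year, group 2 the month, group 3 the day.
date_regex = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

# The raw string keeps \3, \2 and \1 as group references rather than escape characters.
reordered = date_regex.sub(r'\3/\2/\1', 'Released on 2019-06-13, updated on 2021-07-22.')
print(reordered)  # Released on 13/06/2019, updated on 22/07/2021.
```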

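19. Ignoring whitespace and comments with re.VERBOSE

Passing re.VERBOSE makes the regex ignore whitespace and comments inside the pattern string. Combine several flags with the | operator.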

re.compile(r'foo ', re.VERBOSE)
# Ignore case, and also ignore whitespace and comments in the pattern
re.compile(r'foo ', re.I | re.VERBOSE)
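
re.VERBOSE is most useful with a triple-quoted pattern spread over several lines. A minimal sketch, reusing the optional-area-code phone pattern from section 5 with one comment per part:

```python
import re

phone_regex = re.compile(r'''
    (\d{3}-)?   # optional area code followed by a hyphen
    \d{3}       # first three digits
    -           # separator
    \d{4}       # last four digits
''', re.VERBOSE)

with_area = phone_regex.search('Call 415-555-4242 today').group()
without_area = phone_regex.search('Call 555-4242 today').group()
print(with_area)     # 415-555-4242
print(without_area)  # 555-4242
```

In verbose mode, whitespace that should be matched literally has to be written as \s or escaped with a backslash.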

20. Project: phone number and e-mail address extractor
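
The original gives only the title of this project. As a rough sketch of what such an extractor might look like, the code below assumes simplified patterns (phone numbers such as 555-4242 with an optional 415- area code, and a loose e-mail pattern) and a hypothetical helper named extract_contacts that works on an ordinary string. None of these names or patterns come from the original article.

```python
import re

# Assumed phone format: optional 3-digit area code, then 3 digits, a hyphen, 4 digits.
phone_regex = re.compile(r'''
    (?:\d{3}-)?     # optional area code (non-capturing, so findall returns full matches)
    \d{3}-\d{4}     # main number
''', re.VERBOSE)

# Deliberately loose e-mail pattern; real addresses can be more varied.
email_regex = re.compile(r'''
    [a-zA-Z0-9._%+-]+   # user name
    @                   # at sign
    [a-zA-Z0-9.-]+      # domain name
    \.[a-zA-Z]{2,}      # top-level domain
''', re.VERBOSE)

def extract_contacts(text):
    """Return (phone numbers, e-mail addresses) found in text."""
    return phone_regex.findall(text), email_regex.findall(text)

sample = ('Call 415-555-4242 or 555-0000, or write to '
          'alice@example.com and bob.smith@mail.example.org.')
phones, emails = extract_contacts(sample)
print(phones)  # ['415-555-4242', '555-0000']
print(emails)  # ['alice@example.com', 'bob.smith@mail.example.org']
```

Using a non-capturing group (?:...) for the area code keeps findall() returning whole phone numbers instead of only the group text.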

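21. Project: strong password detection

"""
Write a function that uses regular expressions to make sure the password string it is passed is strong. A strong password is defined as one that is at least 8 characters long, contains both uppercase and lowercase characters, and has at least one digit. You may need to test the string against multiple regex patterns to validate its strength.
"""
import re

def ch_pw(text):
    """Return True if text meets all of the strong-password rules above."""
    has_lower = re.search(r'[a-z]', text)   # at least one lowercase letter
    has_upper = re.search(r'[A-Z]', text)   # at least one uppercase letter
    has_digit = re.search(r'\d', text)      # at least one digit
    return len(text) >= 8 and bool(has_lower and has_upper and has_digit)

text = input('Enter a password: ')
if ch_pw(text):
    print('Strong password')
else:
    print('Weak password: it must be at least 8 characters long and contain uppercase letters, lowercase letters, and at least one digit!')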

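22. Project: regex version of strip()

import re
def fn(str_temp, char=r'\s'):
    # char is used as a regex pattern; by default it matches whitespace.
    # Remove every repetition of char at the start or at the end of the string.
    str_regex = re.compile(r'^({})*|({})*$'.format(char, char))
    s = str_regex.sub('', str_temp)
    return s
print(fn(' spam bacon  '))
>>>>> spam bacon
print(fn('SpamSpamBaconSpamEggsSpamSpam', 'Spam'))
>>>>> BaconSpamEggs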