Chapter 7: Pattern Matching with Regular Expressions

Yang5527

Article link: https://blog.csdn.net/Yang5527/article/details/107143356

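7.1 Finding Patterns of Text with Regular Expressions

Four steps for creating and using a regular expression object in Python
(1) Import the regular expression module with import re.

(2) Create a Regex object with the re.compile() function.
① Pass re.compile() a string value (use a raw string) that represents the regular expression.
– Putting an r before the first quotation mark marks the string as a raw string, so backslashes are not treated as escape characters.
② The function returns a Regex pattern object.

(3) Pass the string you want to search to the Regex object's search() method, which looks for the first match of the regular expression.
① If a match is found, it returns a Match object holding the text of the first match in the searched string.
② If no match is found, it returns None.

(4) Call the Match object's group() method to get the actual matched text as a string.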

>>> import re
# Pass the desired pattern to re.compile() and store the resulting Regex object in phoneNumRegex
>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> mo = phoneNumRegex.search('My number is 415-555-4242')
>>> print('Phone number found: ' + mo.group())
Phone number found: 415-555-4242
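
The four steps above can be sketched as a small script. Because search() returns None when nothing matches, calling group() directly on its result would raise an AttributeError, so the sketch checks for None first. The helper name find_phone_number is hypothetical and was introduced only for this illustration.

```python
import re

# Steps 1-2: import re and compile the pattern from a raw string.
phone_num_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone_number(text):
    """Hypothetical helper: return the first phone number in text, or None."""
    # Step 3: search() returns a Match object, or None if nothing matches.
    mo = phone_num_regex.search(text)
    if mo is None:
        return None
    # Step 4: group() returns the matched text as a string.
    return mo.group()

print(find_phone_number('My number is 415-555-4242'))  # 415-555-4242
print(find_phone_number('No number here'))             # None
```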

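7.2 More Pattern Matching with Regular Expressions

Grouping with parentheses
(1) Adding parentheses creates groups in the regular expression.
– (\d\d\d)-(\d\d\d-\d\d\d\d)
(2) Use the Match object's group() method to get the matched text of a single group.
(3) Use the groups() method to get all the groups at once.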

# Create the groups
>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> mo = phoneNumRegex.search('My number is 415-555-4242')

# The group() Match object method
# Get the matched text of each group separately
>>> mo.group(1)
'415'

>>> mo.group(2)
'555-4242'

# group(0) and group() both return the entire matched text
>>> mo.group(0)
'415-555-4242'
>>> mo.group()
'415-555-4242'

# The groups() method
# Get all the groups at once
>>> mo.groups()
('415', '555-4242')
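
Since groups() returns a tuple, its values can be unpacked in a single assignment. A minimal sketch, with variable names chosen only for this example:

```python
import re

phone_num_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_num_regex.search('My number is 415-555-4242')

# groups() returns a tuple, so it can be unpacked in one assignment.
area_code, main_number = mo.groups()
print(area_code)    # 415
print(main_number)  # 555-4242
```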

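Matching multiple groups with the pipe
(1) The | character is called a pipe; it matches one of several expressions. When more than one alternative occurs in the text, the first occurrence is returned.
(2) Combining the pipe with grouping parentheses lets you specify several alternative patterns for the regular expression to match.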

>>> heroRegex = re.compile(r'Batman|Tina Fey')
>>> mo1 = heroRegex.search('Batman and Tina Fey')
>>> mo1.group()
'Batman'
>>> mo2 = heroRegex.search('Tina Fey and Batman')
>>> mo2.group()
'Tina Fey'
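
The combination of a pipe and grouping parentheses mentioned in point (2) might look like the sketch below, where the shared prefix is written once and the group lists the alternative endings. The sample words are illustrative only.

```python
import re

# The prefix 'Bat' is written once; the group lists the alternative endings.
bat_regex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = bat_regex.search('Batmobile lost a wheel')

print(mo.group())   # Batmobile (the whole match)
print(mo.group(1))  # mobile    (only the part matched inside the group)
```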

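Optional matching with the question mark
(1) The ? character marks the group before it as optional in the pattern.
(2) In the regular expression, the (wo)? part means that wo is an optional group: the matched text contains wo zero times or one time.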

>>> batRegex = re.compile(r'Bat(wo)?man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'

>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'

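Matching zero or more times with the star
(1) * means "match zero or more times": the group before the star can occur any number of times in the text, including not at all.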

>>> batRegex = re.compile(r'Bat(wo)*man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'

>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'

>>> mo3 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo3.group()
'Batwowowowoman'

Matching one or more times with the plus
(1) + means "match one or more times": the group before the plus must appear at least once.

>>> batRegex = re.compile(r'Bat(wo)+man')
>>> mo1 = batRegex.search('The Adventures of Batwoman')
>>> mo1.group()
'Batwoman'

>>> mo2 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo2.group()
'Batwowowowoman'

>>> mo3 = batRegex.search('The Adventures of Batman')
>>> mo3 == None
True
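
To see the three quantifiers side by side, the sketch below runs ?, *, and + against the same inputs and records which ones each accepts. The sample list and the results dictionary are names made up for this comparison.

```python
import re

samples = ['Batman', 'Batwoman', 'Batwowoman']
patterns = {'?': r'Bat(wo)?man', '*': r'Bat(wo)*man', '+': r'Bat(wo)+man'}

results = {}
for symbol, pattern in patterns.items():
    regex = re.compile(pattern)
    # Record which samples each quantifier accepts.
    results[symbol] = [s for s in samples if regex.search(s) is not None]

print(results['?'])  # ['Batman', 'Batwoman']
print(results['*'])  # ['Batman', 'Batwoman', 'Batwowoman']
print(results['+'])  # ['Batwoman', 'Batwowoman']
```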

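Matching specific repetitions with curly braces
(1) The braces can hold a single number, such as {3}, or a range with a minimum and a maximum, such as {3,5}.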

>>> haRegex = re.compile(r'(Ha){3}')
>>> mo1 = haRegex.search('HaHaHa')
>>> mo1.group()
'HaHaHa'

>>> mo2 = haRegex.search('Ha')
>>> mo2 == None
True
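
The transcript above only exercises a single number. As a sketch of the range forms, the example below also uses an open-ended minimum ({3,}) and an omitted minimum ({,2}, which Python's re module treats as zero to two). The variable names are illustrative.

```python
import re

text = 'HaHaHaHaHa'  # five repetitions of 'Ha'

exactly_three = re.compile(r'(Ha){3}')
three_to_five = re.compile(r'(Ha){3,5}')
three_or_more = re.compile(r'(Ha){3,}')
up_to_two = re.compile(r'(Ha){,2}')

print(exactly_three.search(text).group())  # HaHaHa
print(three_to_five.search(text).group())  # HaHaHaHaHa
print(three_or_more.search(text).group())  # HaHaHaHaHa
print(up_to_two.search(text).group())      # HaHa
```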

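7.3 Greedy and Non-greedy Matching

Python's regular expressions are greedy by default: in ambiguous situations they match the longest string possible.
The non-greedy version of the curly braces, written with a question mark after the closing brace, matches the shortest string possible.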

# Greedy form of the curly braces
>>> greedyHaRegex = re.compile(r'(Ha){3,5}')
>>> mo1 = greedyHaRegex.search('HaHaHaHaHa')
>>> mo1.group()
'HaHaHaHaHa'

# Non-greedy form of the curly braces
>>> nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
>>> mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
>>> mo2.group()
'HaHaHa'

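7.4 The findall() Method

1. The search() method returns a Match object containing only the first match in the searched string; findall() returns every match.
2. Called on a regular expression with no groups, findall() returns a list of the matched strings.
3. Called on a regular expression with groups, findall() returns a list of tuples of strings (one string per group).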

# The search() method
>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> mo = phoneNumRegex.search('Cell: 415-555-9999 Work:212-555-0000')
>>> mo.group()
'415-555-9999'

# The findall() method
# Without groups
>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work:212-555-0000')
['415-555-9999', '212-555-0000']
# With groups
>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work:212-555-0000')
[('415', '555', '9999'), ('212', '555', '0000')]

7.5 Character Classes

(Figure not available: the source contains only an image placeholder here.)
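
The figure for this section did not survive in the text. As a hedged stand-in, the sketch below exercises the standard shorthand character classes of Python's re module (\d, \w, \s, and the uppercase negation \D). Whether the original figure listed exactly these classes is an assumption, and the sample sentence is invented for illustration.

```python
import re

text = '12 drummers, 11 pipers'

# \d matches a digit, \s a whitespace character,
# and \w a letter, digit, or underscore.
count_and_noun = re.compile(r'\d+\s\w+')
print(count_and_noun.findall(text))  # ['12 drummers', '11 pipers']

# The uppercase forms negate the class: \D matches any non-digit.
non_digits = re.compile(r'\D+')
print(non_digits.findall(text))  # [' drummers, ', ' pipers']
```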

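7.6 Making Your Own Character Classes

You can define your own character class with square brackets [].
– For example, [aeiouAEIOU] matches any vowel, uppercase or lowercase.
– A hyphen denotes a range of letters or digits: for example, [a-zA-Z0-9] matches all uppercase and lowercase letters and digits.
Placing a caret ^ just after the opening bracket of a character class makes a negative character class, which matches every character that is not in the class.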
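
A minimal sketch of the three forms described above (an explicit set, a range, and a negated class). The sample strings are made up for this example.

```python
import re

# Explicit set: any vowel, uppercase or lowercase.
vowel_regex = re.compile(r'[aeiouAEIOU]')
print(vowel_regex.findall('RoboCop eats baby food. BABY FOOD.'))
# ['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

# Ranges: runs of letters and digits; '_' and '@' are not in the class.
alnum_regex = re.compile(r'[a-zA-Z0-9]+')
print(alnum_regex.findall('user_42@example'))  # ['user', '42', 'example']

# Negated class: every character that is NOT a vowel.
non_vowel_regex = re.compile(r'[^aeiouAEIOU]')
print(non_vowel_regex.findall('Regex'))  # ['R', 'g', 'x']
```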

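7.7 The Caret and Dollar Sign Characters

A caret ^ at the start of a regular expression means the match must occur at the beginning of the searched text.
A dollar sign $ at the end of a regular expression means the string must end with this pattern.
Using ^ and $ together means the entire string must match the pattern.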

# The caret ^
>>> beginsWithHello = re.compile(r'^Hello')
>>> beginsWithHello.search('Hello world!')
<re.Match object; span=(0, 5), match='Hello'>
>>> beginsWithHello.search('He said Hello.') == None
True

# The dollar sign $
>>> endsWithNumber = re.compile(r'\d$')
>>> endsWithNumber.search('Your number is 42')
<re.Match object; span=(16, 17), match='2'>
>>> endsWithNumber.search('Your number is forty two.') == None
True

# Using ^ and $ together
>>> wholeStringIsNum = re.compile(r'^\d+$')
>>> wholeStringIsNum.search('1234567890')
<re.Match object; span=(0, 10), match='1234567890'>
>>> wholeStringIsNum.search('12345xyz7890.') == None
True
>>> wholeStringIsNum.search('12 34567890.') == None
True

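7.8 The Wildcard Character

– In a regular expression, the . (dot) character is called a wildcard.
– It matches any character except a newline, and each dot matches exactly one character.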

>>> atRegex = re.compile(r'.at')
>>> atRegex.findall('The cat in the hat sat on the flat mat.')
['cat', 'hat', 'sat', 'lat', 'mat']

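Matching everything with dot-star
① .* stands for any text.
② Dot-star is greedy: it matches as much text as possible.
③ To match any text in non-greedy mode, use dot, star, and a question mark (.*?).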

# Dot-star in greedy mode
>>> nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
>>> mo = nameRegex.search('First Name: Al Last Name: Sweigart')
>>> mo.group(1)
'Al'
>>> mo.group(2)
'Sweigart'

# Non-greedy mode
>>> nongreedyRegex = re.compile(r'<.*?>')
>>> mo = nongreedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man>'

# Greedy mode
>>> greedyRegex = re.compile(r'<.*>')
>>> mo = greedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man> for dinner.>'

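Matching newlines with the dot character
① Passing re.DOTALL as the second argument to re.compile() makes the dot match all characters, including the newline character.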

>>> noNewlineRegex = re.compile('.*')
>>> noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
'Serve the public trust.'

>>> newlineRegex = re.compile('.*', re.DOTALL)
>>> newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
'Serve the public trust.\nProtect the innocent.\nUphold the law.'

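Case-insensitive matching: pass re.IGNORECASE or re.I to re.compile().
Managing complex regular expressions: pass re.VERBOSE to re.compile().
Combining re.DOTALL, re.IGNORECASE, and re.VERBOSE: join the flags with the bitwise-or operator |.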

>>> someRegex = re.compile('foo', re.DOTALL | re.IGNORECASE | re.VERBOSE)
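
The notes name re.IGNORECASE and re.VERBOSE without showing them in use. The sketch below is one possible illustration: the first pattern ignores letter case, and the second uses verbose mode, where whitespace and # comments inside the pattern are ignored so a long pattern can be split across commented lines. The patterns and sample strings are assumptions made for this example.

```python
import re

# re.I (short for re.IGNORECASE) makes letters match regardless of case.
robocop = re.compile(r'robocop', re.I)
print(robocop.search('ROBOCOP protects the innocent.').group())  # ROBOCOP

# re.VERBOSE ignores whitespace and # comments inside the pattern.
phone_regex = re.compile(r'''
    (\d{3})   # area code
    -         # separator
    (\d{3})   # first three digits
    -         # separator
    (\d{4})   # last four digits
    ''', re.VERBOSE)
print(phone_regex.search('Call 415-555-4242 today').groups())
# ('415', '555', '4242')
```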

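7.9 Substituting Strings with the sub() Method

1. Regular expressions can replace a text pattern with new text.
2. The first argument to sub() is the string that replaces each match found; the second argument is the string to be searched with the regular expression.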

>>> nameRegex = re.compile(r'Agent \w+')
>>> nameRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
'CENSORED gave the secret documents to CENSORED.'