Python Basics Study Notes (4)

Cacra

Category: Python

Article link: https://blog.csdn.net/u014465934/article/details/79449452

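Notes on Automate the Boring Stuff with Python (《Python编程快速上手》)

Pattern Matching with Regular Expressions in Python

A regular expression, regex for short, is a way of describing a text pattern.

The basic steps for matching with a regular expression:

Import the regular expression module with import re.
Create a Regex object with re.compile() (remember to use a raw string).
Pass the string you want to search to the Regex object's search() method. It returns a Match object, or None if nothing matches.
Call the Match object's group() method to get the text that was actually matched.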

import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())
>> Phone number found: 415-555-4242
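
Because search() returns None when the pattern is not found, calling group() on the result directly can raise AttributeError. The following is a minimal sketch of a guard against that; the helper name find_phone and the sample strings are made up for illustration and are not from the book.

```python
import re

phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone(text):
    """Return the first phone number found in text, or None if there is none."""
    mo = phone_regex.search(text)
    # search() returns None on failure, so check before calling group().
    return mo.group() if mo is not None else None

found = find_phone('My number is 415-555-4242.')
missing = find_phone('No number here.')
print(found)    # 415-555-4242
print(missing)  # None
```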

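1. Grouping with parentheses

Adding parentheses creates "groups" in the regular expression, for example (\d\d\d)-(\d\d\d-\d\d\d\d). You can then call the Match object's group() method with a group number, group(1), group(2), and so on, to get the text matched by that group.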

import re
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
mo.group(1)
>> '415'
mo.group(2)
>> '555-4242'

# Get all groups at once
mo.groups()
>> ('415', '555-4242')

# Multiple assignment
areaCode, mainNumber = mo.groups()
print(areaCode)
>> 415
print(mainNumber)
>> 555-4242

2. Matching multiple groups with the pipe

The | character is called a "pipe". Use it when you want to match any one of several expressions.

#1. If both alternatives occur in the text, the one that appears first is returned
import re
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey')
mo1.group()
>> 'Batman'

#2. To match any of 'Batman', 'Batmobile', 'Batcopter' and 'Batbat', write the shared prefix Bat once and put the alternatives in parentheses.
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
mo.group()
>> 'Batmobile'
mo.group(1)
>> 'mobile'
mo.group(2)
>> IndexError: no such group

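The call mo.group() returns the full matched text 'Batmobile', while mo.group(1) returns only the part matched inside the first parenthesized group, 'mobile'.

To match a literal pipe character, escape it with a backslash: \| .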
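
A short sketch of the difference between a bare pipe and an escaped one. The patterns and sample strings here are invented for illustration.

```python
import re

# A bare | means "either side"; \| matches a literal pipe character.
either_regex = re.compile(r'cat|dog')
literal_regex = re.compile(r'cat\|dog')

either = either_regex.search('cat|dog').group()
literal = literal_regex.search('cat|dog').group()
no_match = literal_regex.search('cat and dog')

print(either)    # cat
print(literal)   # cat|dog
print(no_match)  # None
```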

3. Matching with ?, *, + and { }

? : zero or one time
* : zero or more times
+ : one or more times
{ } : a specific number of times

#1. ? marks the preceding group as optional: the regex matches whether or not that text is present.
import re
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()
>> 'Batman'

#2. * means "match zero or more times": the group before the star may be absent or repeated any number of times.
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batwowowoman')
mo1.group()
>> 'Batwowowoman'

#3. + means "match one or more times".
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwowowoman')
mo1.group()
>> 'Batwowowoman'

mo2 = batRegex.search('The Adventures of Batman')
mo2 == None
>> True

#4. { } matches a specific number of repetitions. It also accepts a range, inclusive at both ends.
# (Ha){3} matches 'HaHaHa'
# (Ha){3,5} matches 'HaHaHa', 'HaHaHaHa' and 'HaHaHaHaHa'

haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
mo1.group()
>> 'HaHaHa'

haRegex = re.compile(r'Hello(Ha){3}')
mo1 = haRegex.search('HelloHa')
print(mo1 == None)
>> True
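
The inclusive range can be checked directly. This sketch uses the Pattern object's fullmatch() method, which is not mentioned in the notes; it succeeds only when the entire string fits the pattern, which makes the upper bound visible.

```python
import re

ha_regex = re.compile(r'(Ha){3,5}')

# fullmatch() succeeds only if the whole string fits the pattern,
# so strings with fewer than 3 or more than 5 repetitions are rejected.
results = {n: ha_regex.fullmatch('Ha' * n) is not None for n in range(1, 7)}
print(results)
# {1: False, 2: False, 3: True, 4: True, 5: True, 6: False}
```

With search() instead of fullmatch(), six repetitions would still produce a match, because search() is satisfied by the first five.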

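4. Greedy and non-greedy matching

Python's regular expressions are "greedy" by default: when a match is ambiguous, they match the longest string possible.

A "non-greedy" match is written by putting a question mark after the closing brace.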

# Greedy matching
import re
greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
mo1.group()
>> 'HaHaHaHaHa'

# Non-greedy matching
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
mo2.group()
>> 'HaHaHa'

# After a plain group, ? makes the group optional. After a closing brace it only marks non-greedy matching,
# so (Ha){3}? still requires three 'Ha' and 'Hello' alone does not match.
HaRegex = re.compile(r'Hello(Ha){3}?')
mo1 = HaRegex.search('Hello')
print(mo1.group())
>> AttributeError: 'NoneType' object has no attribute 'group'
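
The two roles of the question mark can be compared side by side. This is a small sketch; the variable names and the {1,3} range are chosen only for illustration.

```python
import re

optional_regex = re.compile(r'Hello(Ha)?')         # ? after a group: the group is optional
greedy_regex = re.compile(r'Hello(Ha){1,3}')       # greedy: take as many repeats as allowed
nongreedy_regex = re.compile(r'Hello(Ha){1,3}?')   # ? after }: take as few repeats as allowed

text = 'HelloHaHaHa'
optional = optional_regex.search('Hello').group()
greedy = greedy_regex.search(text).group()
nongreedy = nongreedy_regex.search(text).group()

print(optional)   # Hello
print(greedy)     # HelloHaHaHa
print(nongreedy)  # HelloHa
```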

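5. The findall() method

search() returns a Match object containing only the first match found in the searched string.

findall() returns a list of strings containing every match in the searched string, as long as the regular expression has no groups.

If the regular expression contains groups, findall() returns a list of tuples instead, one tuple per match and one string per group.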

#1. search() returns the first match
import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('Cell:415-555-9999 Work:212-555-0000')
mo.group()
>> '415-555-9999'

#2. Without groups, findall() returns a list of strings
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phoneNumRegex.findall('Cell:415-555-9999 Work:212-555-0000')
>> ['415-555-9999', '212-555-0000']

#3. With groups, findall() returns a list of tuples
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
phoneNumRegex.findall('Cell:415-555-9999 Work:212-555-0000')
>> [('415', '555', '9999'), ('212', '555', '0000')]
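
When the pattern has groups but the full numbers are still wanted, the tuples can be joined back together. This is one possible approach, sketched with made-up variable names.

```python
import re

phone_regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
parts = phone_regex.findall('Cell:415-555-9999 Work:212-555-0000')

# Each tuple holds the three groups of one match; rejoin them with hyphens.
numbers = ['-'.join(p) for p in parts]

print(parts)    # [('415', '555', '9999'), ('212', '555', '0000')]
print(numbers)  # ['415-555-9999', '212-555-0000']
```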

6. Character classes and making your own

# For example, \d+\s\w+ matches one or more digits (\d+), then one whitespace character (\s), then one or more letter, digit or underscore characters (\w+)
xmasRegex = re.compile(r'\d+\s\w+')
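
A quick way to see what \d+\s\w+ picks out is to run it with findall(). The sample sentence below is an invented one, and xmas_regex is just a local name for the same pattern.

```python
import re

xmas_regex = re.compile(r'\d+\s\w+')

# Each match is: digits, one whitespace character, then a run of word characters.
gifts = xmas_regex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids')
print(gifts)
# ['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids']
```

The commas are left out of each match because \w does not include punctuation.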

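Making your own character classes:

1. Sometimes you want to match a set of characters but the shorthand character classes (\d, \w, \s and so on) are too broad. You can define your own character class with square brackets. For example, the character class [aeiouAEIOU] matches any vowel, upper or lower case.

2. You can also use a hyphen to give a range of letters or digits. For example, the character class [a-zA-Z0-9] matches all lowercase letters, uppercase letters and digits.

3. Placing a caret (^) right after the opening square bracket gives a "negative character class", which matches every character that is not in the class.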

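#1. A vowel character class
import re
vowelRegex = re.compile(r'[aeiouAEIOU]')
vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')
>> ['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

#2. A negative character class: the ^ goes inside the brackets. It matches everything that is not a vowel, including the space.
consonantRegex = re.compile(r'[^aeiouAEIOU]')
consonantRegex.findall('Hello World')
>> ['H', 'l', 'l', ' ', 'W', 'r', 'l', 'd']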
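
The range form [a-zA-Z0-9] can be tried the same way. In this sketch the + quantifier and the sample string are additions for illustration, so that whole runs of letters and digits come back rather than single characters.

```python
import re

alnum_regex = re.compile(r'[a-zA-Z0-9]+')    # runs of letters and digits
other_regex = re.compile(r'[^a-zA-Z0-9]')    # any single character outside those ranges

text = 'user_42 logged in at 10:30!'
tokens = alnum_regex.findall(text)
others = other_regex.findall(text)

print(tokens)  # ['user', '42', 'logged', 'in', 'at', '10', '30']
print(others)  # ['_', ' ', ' ', ' ', ' ', ':', '!']
```

Unlike \w, this class does not include the underscore, which is why 'user_42' is split in two.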

7. The caret, the dollar sign and the wildcard

The caret and the dollar sign

re.compile(r'^Hello')   # the string must begin with Hello
re.compile(r'Hello$')   # the string must end with Hello
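
The two anchors can be exercised as follows. Combining them, as in ^\d+$, requires the whole string to fit the pattern. The variable names and test strings are illustrative only.

```python
import re

begins_with_hello = re.compile(r'^Hello')
ends_with_hello = re.compile(r'Hello$')
whole_string_digits = re.compile(r'^\d+$')

starts = begins_with_hello.search('Hello world') is not None
starts_late = begins_with_hello.search('He said Hello') is not None
ends = ends_with_hello.search('He said Hello') is not None
all_digits = whole_string_digits.search('1234567890') is not None
mixed = whole_string_digits.search('12345xyz67890') is not None

print(starts, starts_late, ends, all_digits, mixed)
# True False True True False
```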

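Wildcard: the . (dot) character matches exactly one character, and that character can be anything except a newline. In the example below, findall() returns a list of strings; 'flat' yields 'lat' because the dot matches only a single character.

import re
atRegex = re.compile(r'.at')
atRegex.findall('The cat in the hat sat on the flat mat.')
>> ['cat', 'hat', 'sat', 'lat', 'mat']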

Passing re.DOTALL makes the dot match every character, including the newline.

newlineRegex = re.compile(r'.*', re.DOTALL)
newlineRegex.search('Save the public trust. \nProtect the innocent. \nUphold the law.').group()
>> 'Save the public trust. \nProtect the innocent. \nUphold the law.'
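
For contrast, here is the same search with and without re.DOTALL. The variable names are illustrative.

```python
import re

text = 'Save the public trust.\nProtect the innocent.\nUphold the law.'

# Without re.DOTALL the dot stops at the first newline.
no_newline = re.compile(r'.*').search(text).group()
# With re.DOTALL the dot also matches newlines, so .* consumes the whole string.
with_newline = re.compile(r'.*', re.DOTALL).search(text).group()

print(repr(no_newline))    # 'Save the public trust.'
print(with_newline == text)  # True
```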

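8. Matching everything with dot-star

.* means any character (except a newline) repeated zero or more times. It is greedy, so it always matches as much text as possible. (The . stands for any character at that position.)

import re
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: AL Last Name: Sweigart')
mo.group(1)
>> 'AL'
mo.group(2)
>> 'Sweigart'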

Non-greedy .*? versus greedy .*

#1. Non-greedy .*? stops at the first closing angle bracket
nongreedyRegex = re.compile(r'<.*?>')
mo = nongreedyRegex.search('<To serve man> for dinner.>')
mo.group()
>> '<To serve man>'
#2. Greedy .* runs to the last closing angle bracket
greedyRegex = re.compile(r'<.*>')
mo = greedyRegex.search('<To serve man> for dinner.>')
mo.group()
>> '<To serve man> for dinner.>'

9. Case-insensitive matching

Pass re.IGNORECASE or re.I as the second argument to re.compile() to make the regular expression case-insensitive.

import re
robocop = re.compile(r'robocop', re.I)
robocop.search('RoboCop is part man, part machine, all cop.').group()
>> 'RoboCop'

10. Replacing strings with the sub() method

Regular expressions can not only find text patterns but also replace them with new text.

import re
namesRegex = re.compile(r'Agent \w+')
namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
>> 'CENSORED gave the secret documents to CENSORED.'

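Notes:
1. \w matches letters, digits and the underscore, so Agent \w+ matches only as far as the end of Alice; the space that follows is not a \w character.
2. The Regex object's sub() method takes two arguments. The first is the replacement string that takes the place of each match. The second is the string to be searched. sub() returns the string with the replacements applied.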
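
A sketch of the argument order, plus the optional count argument of Pattern.sub(), which the notes do not cover. The variable names are illustrative.

```python
import re

names_regex = re.compile(r'Agent \w+')
message = 'Agent Alice gave the secret documents to Agent Bob.'

# First argument: the replacement text. Second argument: the string to search.
censored_all = names_regex.sub('CENSORED', message)
# count=1 limits the substitution to the first match only.
censored_first = names_regex.sub('CENSORED', message, count=1)

print(censored_all)    # CENSORED gave the secret documents to CENSORED.
print(censored_first)  # CENSORED gave the secret documents to Agent Bob.
print(message)         # the original string is unchanged
```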

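11. Managing complex regular expressions

Matching complicated text patterns can require long, hard-to-read regular expressions. You can ease this by passing re.VERBOSE to re.compile(), which tells it to ignore whitespace and comments inside the regular expression string.

import re
phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

# Use ''' ''' to spread the regular expression over several lines and add comments, like this:

phoneRegex = re.compile(r'''
    ((\d{3}|\(\d{3}\))?      # area code
    (\s|-|\.)?               # separator
    \d{3}                    # first 3 digits
    (\s|-|\.)                # separator
    \d{4}                    # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?   # extension
    )''', re.VERBOSE)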
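
To check that a one-line pattern and its re.VERBOSE layout behave the same, both can be run over a few inputs. This sketch assumes the phone pattern is meant to accept an optional area code with or without parentheses and an optional extension; the sample numbers are invented.

```python
import re

compact = re.compile(
    r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

verbose = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code, with or without parentheses
    (\s|-|\.)?                      # separator
    \d{3}                           # first 3 digits
    (\s|-|\.)                       # separator
    \d{4}                           # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?   # optional extension
    )''', re.VERBOSE)

samples = ['415-555-4242', '(415) 555-4242', '555-4242 x123', '415.555.4242 ext 99']
compact_hits = [compact.search(s).group() for s in samples]
verbose_hits = [verbose.search(s).group() for s in samples]

print(compact_hits == verbose_hits)  # True
print(verbose_hits)
```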

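12. Combining re.IGNORECASE, re.DOTALL and re.VERBOSE

re.IGNORECASE (or re.I) makes matching case-insensitive.
re.DOTALL makes the . (dot) character match newlines as well.
re.VERBOSE ignores whitespace and comments inside the regular expression string.

Combine them with the | operator, which here is Python's bitwise OR and not the regex pipe.

import re
someRegexValue = re.compile(r'foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)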
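
A small sketch in which all three flags affect the result. The pattern and text are invented for illustration.

```python
import re

pattern = re.compile(r'''
    begin   # matched case-insensitively because of re.IGNORECASE
    .*      # spans newlines because of re.DOTALL
    end     # closing word
    ''', re.IGNORECASE | re.DOTALL | re.VERBOSE)

text = 'BEGIN line one\nline two END'
combined = pattern.search(text).group()
# The same pattern without any flags is case-sensitive and stops at newlines.
plain = re.compile(r'begin.*end').search(text)

print(combined == text)  # True
print(plain)             # None
```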
