33. Regular Expressions in Python

bai666ai

Last modified 2022-02-21 12:55:06

First published 2022-02-20 13:40:39

Article link: https://blog.csdn.net/bai666ai/article/details/123030009

Video course 《Python编程的术与道：Python语言入门》 (an introduction to the Python language): https://edu.csdn.net/course/detail/27845

Regular Expressions (RegEx)

A RegEx, or regular expression, is a sequence of characters that forms a search pattern.

A RegEx can be used to check whether a string contains the specified search pattern.

[Figure: Python regular expression matching syntax]

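Assertions

In regular-expression terms, an assertion is a test that either matches or does not match. Since every regular expression produces one of those two results, any regular expression can loosely be called an assertion.

A related concept, used for boundary matching, is the zero-width assertion. An ordinary pattern such as \d+ (one or more digits) matches text that has a length. Some assertions, such as ^ and $ (the start and end of the string, or of each line in MULTILINE mode), match only a position, so the matched text has length 0. These are called zero-width assertions.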
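The difference between an ordinary match and a zero-width match can be seen by looking at the span of each result. The snippet below is a small sketch; the sample strings are made up for illustration.

```python
import re

text = "The rain in Spain"

# An ordinary pattern consumes characters, so its match has a length.
digits = re.search(r"\d+", "room 101")
print(digits.span(), repr(digits.group()))   # (5, 8) '101'

# ^ and $ match positions only, so the matched text is empty.
start = re.search(r"^", text)
end = re.search(r"$", text)
print(start.span(), repr(start.group()))     # (0, 0) ''
print(end.span(), repr(end.group()))         # (17, 17) ''
```

In both zero-width cases the start and end of the span are equal, which is what "length 0" means here.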

The RegEx Module

Python has a built-in module called re for working with regular expressions.

Import the re module:

import re

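RegEx Functions

The re module provides a set of functions for searching a string for matches:

Function	Description
findall	Returns a list containing all matches
search	Returns a match object if there is a match anywhere in the string
split	Returns a list in which the string has been split at each match
sub	Replaces one or more matches with a string

The findall() Function

The findall() function returns a list containing all matches.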

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']

exampleString = '''
Jessica is 15 years old, and Daniel is 27 years old.
Edward is 97 years old, and his grandfather, Oscar, is 102. 
'''

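A raw string is created by putting the character r before the opening quote of an ordinary string literal. In a raw string, Python does not process backslash escape sequences, so a sequence such as \d reaches the re module unchanged. This is why regular expression patterns are usually written as raw strings.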

ages = re.findall(r'\d{1,3}',exampleString)
names = re.findall(r'[A-Z][a-z]*',exampleString)

print(ages)
print(names)

['15', '27', '97', '102']
['Jessica', 'Daniel', 'Edward', 'Oscar']
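The raw-string note above matters most for escapes that Python itself understands, such as \b. The following sketch (the sample text is chosen for illustration) compares a normal string with a raw string:

```python
import re

# In a normal string "\b" is the backspace character (one character);
# in a raw string it stays as a backslash followed by b (two characters).
print(len("\b"), len(r"\b"))   # 1 2

text = "no class at all"

# Without the r prefix the pattern contains backspace characters,
# not word boundaries, so nothing matches.
print(re.search("\bclass\b", text))            # None
print(re.search(r"\bclass\b", text).span())    # (3, 8)

# Doubling the backslashes gives the same pattern as the raw string.
print("\\bclass\\b" == r"\bclass\b")           # True
```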

txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

[]

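The search() Function

The search() function searches the string for a match and returns a match object if one is found.

If there is more than one match, only the first one is returned: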

txt = "The rain in Spain"
x = re.search("ai", txt)
print(x)

<re.Match object; span=(5, 7), match='ai'>

txt = "The rain in Spain"
x = re.search(r"\s", txt)
print(x)

print("The first white-space character is located in position:", x.start())

<re.Match object; span=(3, 4), match=' '>
The first white-space character is located in position: 3

txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)

None

text = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', text)
# The if statement after search() tests whether it succeeded
if match:
    print('found', match.group())  # 'found word:cat'
else:
    print('did not find')

found word:cat

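\b Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters (in Python, underscores count as well), so the end of a word is marked by whitespace or any other non-word character.
\B Another zero-width assertion, the opposite of \b: it matches only where the current position is not at a word boundary.

The following example matches the string class only when it is a complete word; it does not match when class is contained inside another word.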

txt = "no class at all"
x = re.search(r"\bclass\b", txt)
print(x)

<re.Match object; span=(3, 8), match='class'>

p = re.compile(r'\bclass\b')
print(p.search('no class at all'))

<re.Match object; span=(3, 8), match='class'>

print(p.search('the declassified algorithm'))

None

print(p.search('one subclass is'))

None

In the examples below, er\b matches the er in never but not the er in verb.

er\B matches the er in verb but not the er in never.

str1 = 'never do it!'
x = re.search("er\\b", str1)
print(x)
str2 = 'verb words!'
x = re.search("er\\b", str2)
print(x)

<re.Match object; span=(3, 5), match='er'>
None

str1 = 'never do it!'
x = re.search(r"er\B", str1)
print(x)
str2 = 'verb words!'
x = re.search(r"er\B", str2)
print(x)

None
<re.Match object; span=(1, 3), match='er'>

The split() Function

The split() function returns a list in which the string has been split at each match:

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']

The maxsplit parameter limits the number of splits:

Example: split the string only at the first occurrence:

txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)

['The', 'rain in Spain']

The sub() Function

The sub() function replaces the matches with text of your choice:

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)  # Replace every whitespace character with the digit 9
print(x)

The9rain9in9Spain

The count parameter limits the number of replacements:

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)  # Replace the first 2 occurrences
print(x)

The9rain9in Spain

Match Object

A match object holds information about the search and its result.

Note: if there is no match, None is returned instead of a match object.

txt = "The rain in Spain"
x = re.search("ai", txt)
print(x) #this will print an object

<re.Match object; span=(5, 7), match='ai'>

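A match object has attributes and methods for retrieving information about the search and its result:

span() returns a tuple containing the start and end positions of the match.
string is an attribute holding the string that was passed to the function.
group() returns the part of the string that matched.

Example: print the position (start and end) of the first match. The regular expression looks for a word that starts with an uppercase "S":

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())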

(12, 17)

Example: print the string that was passed to the function:

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)

The rain in Spain

Example: print the part of the string that matched. The regular expression looks for a word that starts with an uppercase "S":

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())

Spain

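Email Address Matching Example

Suppose we want to find the email address in the string purple alice-b@python.com monkey dishwasher. We will use it as a running example to demonstrate more regular expression features.

text = 'purple alice-b@python.com monkey dishwasher'
match = re.search(r'\w+@\w+', text)
if match:
    print(match.group())  # 'b@python'

b@python

In this example the search does not return the whole email address, because \w matches neither the - nor the . in the address. The regular expression features below fix this.

Square Brackets

Square brackets denote a set of characters, so [abc] matches a or b or c. Codes such as \w and \s also work inside square brackets. The one exception is the dot (.), which inside brackets means just a literal dot.

match = re.search(r'[\w.-]+@[\w.-]+', text)
if match:
    print(match.group())  # 'alice-b@python.com'

alice-b@python.com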

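More square-bracket features: a dash indicates a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put it last, e.g. [abc-]. A ^ at the start of a set inverts it, so [^ab] means any character except a or b.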
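The square-bracket rules above can be tried out with findall(). This is a small sketch; the sample strings are invented for illustration and are not from the original article.

```python
import re

text = "cab-42 Zed"

# A range: lowercase letters only.
lower = re.findall(r"[a-z]", text)
print(lower)        # ['c', 'a', 'b', 'e', 'd']

# A dash placed last is a literal dash, not a range.
with_dash = re.findall(r"[abc-]", text)
print(with_dash)    # ['c', 'a', 'b', '-']

# A leading ^ inverts the set: anything except a and b.
negated = re.findall(r"[^ab]", "abcab d")
print(negated)      # ['c', ' ', 'd']

# Inside brackets the dot is a literal dot.
dots = re.findall(r"[.]", "a.b")
print(dots)         # ['.']
```

Note that the negated set also matches the space, because a space is a character other than a and b.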

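Group Extraction

The "group" feature of a regular expression lets you pick out parts of the matching text. Suppose we want to extract the username and the host separately from the email address. To do this, add parentheses ( ) around the username and the host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. The parentheses do not change what the pattern matches; instead they establish logical "groups" inside the match text. On a successful search, match.group(1) is the match text corresponding to the first left parenthesis, and match.group(2) is the text corresponding to the second left parenthesis. Plain match.group() is still the whole match text as usual.

text = 'purple alice-b@python.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    print(match.group())   # 'alice-b@python.com' (the whole match)
    print(match.group(1))  # 'alice-b' (the username, group 1)
    print(match.group(2))  # 'python.com' (the host, group 2)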

alice-b@python.com
alice-b
python.com

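Options

The re functions take options that modify the behavior of the pattern match. The option flag is added as an extra argument to search(), findall(), and so on, e.g. re.search(pat, str, re.IGNORECASE).

IGNORECASE Ignore upper/lowercase differences when matching, so 'a' matches both 'a' and 'A'.
DOTALL Allow the dot (.) to match any character, including a newline. Normally it matches anything but a newline.
MULTILINE Within a string made of many lines, allow ^ and $ to match the start and end of each line. Normally ^ and $ match only the start and end of the whole string.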
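The effect of each of the three flags above can be seen by running the same pattern with and without it. This is a minimal sketch; the two-line sample string is an assumption made for illustration.

```python
import re

text = "first line\nSecond line"

# IGNORECASE: 'second' also matches 'Second'.
plain_case = re.findall(r"second", text)
ignore_case = re.findall(r"second", text, flags=re.IGNORECASE)
print(plain_case, ignore_case)     # [] ['Second']

# DOTALL: . also matches the newline, so .* can run across lines.
plain_dot = re.search(r"first.*Second", text)
dot_all = re.search(r"first.*Second", text, flags=re.DOTALL)
print(plain_dot)                   # None
print(repr(dot_all.group()))       # 'first line\nSecond'

# MULTILINE: ^ matches at the start of every line, not just the string.
plain_caret = re.findall(r"^\w+", text)
multi_caret = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(plain_caret, multi_caret)    # ['first'] ['first', 'Second']
```

Flags can be combined with the | operator, for example flags=re.IGNORECASE | re.MULTILINE.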

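Greedy vs. Non-Greedy Matching

When a regular expression contains a quantifier that allows repetition (such as * or +), it matches as many characters as it can while still allowing the overall match to succeed. This is called greedy matching. Sometimes we need the opposite: matching as few characters as possible. To switch a quantifier from greedy to non-greedy, append a question mark to it.

Example 1: a.*b matches the longest stretch that starts with a and ends with b. For 'abbbabbbacedb' the match is the whole string. With the non-greedy pattern a.*?b, the matches are 'ab', 'ab', and 'acedb'.

text = 'abbbabbbacedb'
# greedy
print(re.findall(r'a.*b', text))
# non-greedy
print(re.findall(r'a.*?b', text))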

['abbbabbbacedb']
['ab', 'ab', 'acedb']

Example 2: match all the URLs in a string

url_str = 'www.taobao.com//https://www.baidu.com www.qq.com '
# greedy
url_list = re.findall('www.*com',url_str)
print(url_list)
# non-greedy
url_list = re.findall('www.*?com',url_str)
print(url_list)

['www.taobao.com//https://www.baidu.com www.qq.com']
['www.taobao.com', 'www.baidu.com', 'www.qq.com']
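In the patterns above the dots are not escaped, so each . matches any character rather than a literal dot. A stricter variant is sketched below; the pattern name strict and the test string 'wwwXfoo-com' are made up for illustration, and the pattern is not meant as a general URL matcher.

```python
import re

url_str = 'www.taobao.com//https://www.baidu.com www.qq.com '

# Unescaped, . matches any character, so this also accepts non-URLs.
loose_hits = re.findall(r'www.*?com', 'wwwXfoo-com')
print(loose_hits)     # ['wwwXfoo-com']

# Escaped dots are literal, and \S keeps a match from crossing whitespace.
strict = r'www\.\S*?\.com'
strict_hits = re.findall(strict, 'wwwXfoo-com')
print(strict_hits)    # []

url_list = re.findall(strict, url_str)
print(url_list)       # ['www.taobao.com', 'www.baidu.com', 'www.qq.com']
```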

Substitution

The re.sub(pat, replacement, str) function searches the given string for all instances of the pattern and replaces them.

The replacement string can include '\1', '\2', and so on, which refer to the text of group(1), group(2), and so on from the original match.

The following example searches for all email addresses and changes them to keep the user (\1, i.e. group(1)) but use python.com as the host, replacing group(2).

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
# re.sub(pat, replacement, str) returns a new string with all replacements;
# \1 is group(1) and \2 is group(2) in the replacement
print(re.sub(r'([\w.-]+)@([\w.-]+)', r'\1@python.com', text))

purple alice@python.com, blah monkey bob@python.com blah dishwasher
