[Automate the Boring Stuff with Python] Chapter 7: Pattern Matching with Regular Expressions

今岁成蹊

Article link: https://blog.csdn.net/HPP_CSDN/article/details/103586034

Automate the Boring Stuff with Python: Practical Programming for Total Beginners (2nd Edition)
Written by Al Sweigart.
The second edition is available on 2019.10.29

7.1 Finding Patterns of Text Without Regular Expressions

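Suppose you want to find a phone number in a string. You know the pattern: three digits, a hyphen, three digits, a hyphen, and four digits. For example: 415-555-4242.
A function named isPhoneNumber() checks whether a string matches this pattern and returns True or False.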

def isPhoneNumber(text):
	if len(text) != 12:
		return False
	for i in range(0, 3):
		if not text[i].isdecimal():
			return False
	if text[3] != '-':
		return False
	for i in range(4, 7):
		if not text[i].isdecimal():
			return False
	if text[7] != '-':
		return False
	for i in range(8, 12):
		if not text[i].isdecimal():
			return False
	return True

print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242')) # prints True
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi')) # prints False

Add code to find this phone number pattern within a longer string.

message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
for i in range(len(message)):
	chunk = message[i:i+12]
	if isPhoneNumber(chunk):
		print('Phone number found: ' + chunk)
print('Done')

Running the program above produces the following output:

Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done

7.2 Finding Patterns of Text with Regular Expressions

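Regular expressions, called regexes for short, are descriptions of a pattern of text. For example, \d in a regex stands for a digit character, that is, any single numeral from 0 to 9. In Python, the regex \d\d\d-\d\d\d-\d\d\d\d matches the phone number pattern from the previous section.
Regular expressions can be much more sophisticated than this. For example, adding a 3 in curly braces ({3}) after a pattern means "match this pattern three times". So the regex \d{3}-\d{3}-\d{4} also matches the correct phone number format.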

Creating Regex Objects

All of the regex functions in Python are in the re module.

import re

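Passing a string value that represents your regular expression to re.compile() returns a Regex pattern object (or simply, a Regex object).
To create a Regex object that matches the phone number pattern: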

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

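Passing Raw Strings to re.compile()
Escape characters in Python use the backslash (\). Putting an r before the first quote of a string value marks the string as a raw string, in which backslashes are not treated as escape characters.
Because regular expressions use backslashes so often, it is more convenient to pass raw strings to re.compile() than to type extra backslashes. Typing r'\d\d\d-\d\d\d-\d\d\d\d' is much easier than typing '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'.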
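As a small sanity check (a sketch only; the variable names are illustrative and not from the book), the raw string and the doubled-backslash string can be compared directly:

```python
# A raw string and its escaped equivalent contain exactly the same characters.
raw_pattern = r'\d\d\d-\d\d\d-\d\d\d\d'
escaped_pattern = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'
print(raw_pattern == escaped_pattern)  # True

# In a raw string, backslash-n stays as two characters instead of one newline.
raw_length = len(r'\n')
normal_length = len('\n')
print(raw_length, normal_length)  # 2 1
```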

Matching Regex Objects

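A Regex object's search() method searches the string it is passed for any match to the regex. It returns None if the pattern is not found in the string, and a Match object if it is found. Match objects have a group() method that returns the actual matched text from the searched string.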

mo = phoneNumRegex.search('My number is 415-555-4242.') # mo is a Match object
print('Phone number found: ' + mo.group())  # prints Phone number found: 415-555-4242

Review of Regular Expression Matching

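① Import the regex module with import re.
② Create a Regex object with the re.compile() function. (Remember to use a raw string.)
③ Pass the string you want to search to the Regex object's search() method. It returns a Match object, or None if nothing matches.
④ Call the Match object's group() method to get a string of the actual matched text.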
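The steps above can be sketched as one small function. This is only an illustration: find_phone and phone_regex are hypothetical names, the shorter \d{3} form of the pattern is used, and the None check covers the case where search() finds nothing.

```python
import re  # step 1: import the regex module

# Step 2: compile the pattern (raw string, curly-brace shorthand).
phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

def find_phone(text):
    """Return the first phone number found in text, or None if there is none."""
    mo = phone_regex.search(text)  # step 3: search the string
    if mo is None:                 # search() returns None when nothing matches
        return None
    return mo.group()              # step 4: get the matched text

print(find_phone('My number is 415-555-4242.'))  # 415-555-4242
print(find_phone('No number here.'))             # None
```

Calling mo.group() without the None check would raise an AttributeError on text that contains no match.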

A web-based regex tester can show you how a regex matches a piece of text you enter: http://regexpal.com/.

7.3 More Pattern Matching with Regular Expressions

Grouping with Parentheses

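Suppose you want to separate the area code from the rest of the phone number. Adding parentheses creates groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d). You can then use the group() Match object method to grab the matching text from just one group.
The first set of parentheses in a regex string is group 1, and the second set is group 2. Passing the integer 1 or 2 to group() returns different parts of the matched text. Passing 0 or nothing to group() returns the entire matched text.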

>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> mo = phoneNumRegex.search('My number is 415-555-4242.')
>>> mo.group(1)
'415'
>>> mo.group(2)
'555-4242'
>>> mo.group(0)
'415-555-4242'
>>> mo.group()
'415-555-4242'

To retrieve all the groups at once, use the groups() method; note the plural form of the name.

>>> mo.groups()
('415', '555-4242')
>>> areaCode, mainNumber = mo.groups()
>>> print(areaCode)
415
>>> print(mainNumber)
555-4242

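Parentheses have a special meaning in regular expressions, but what do you do if you need to match a parenthesis in your text? For instance, the phone numbers you are trying to match may have the area code set in parentheses. In that case, escape the ( and ) characters with a backslash.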

>>> phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
>>> mo = phoneNumRegex.search('My phone number is (415) 555-4242.')
>>> mo.group(1)
'(415)'
>>> mo.group(2)
'555-4242'

The \( and \) escape sequences in the raw string passed to re.compile() match actual parenthesis characters.

Matching Multiple Groups with the Pipe

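The | character is called a pipe. Use it anywhere you want to match one of several expressions.
When more than one of the alternatives occurs in the searched string, the first occurrence of matching text is returned as the Match object.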

>>> heroRegex = re.compile(r'Batman|Tina Fey')
>>> mo1 = heroRegex.search('Batman and Tina Fey.')
>>> mo1.group()
'Batman'

>>> mo2 = heroRegex.search('Tina Fey and Batman.')
>>> mo2.group()
'Tina Fey'

Note: the findall() method can find all matching occurrences.
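For instance, a minimal sketch (hero_regex and all_heroes are illustrative names) showing that findall() returns both alternatives where search() returns only the first:

```python
import re

hero_regex = re.compile(r'Batman|Tina Fey')

# search() stops at the first match; findall() collects every match in order.
first_hero = hero_regex.search('Batman and Tina Fey.').group()
all_heroes = hero_regex.findall('Batman and Tina Fey.')

print(first_hero)  # Batman
print(all_heroes)  # ['Batman', 'Tina Fey']
```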

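You can also use the pipe to match one of several patterns as part of a regex. For example, say you want to match any of the strings 'Batman', 'Batmobile', 'Batcopter', and 'Batbat'. Since all of these strings start with Bat, it would be nice to specify that prefix only once. This can be done with parentheses.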

>>> batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
>>> mo = batRegex.search('Batmobile lost a wheel')
>>> mo.group()
'Batmobile'
>>> mo.group(1)
'mobile'

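By using the pipe character and grouping parentheses, you can specify several alternative patterns for the regex to match.
If you need to match an actual pipe character, escape it with a backslash, like \|.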

Optional Matching with the Question Mark (?)

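Sometimes the pattern you want to match is optional. That is, the regex should find a match whether or not that bit of text is there. The ? character means: match the group preceding this question mark zero or one time.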

>>> batRegex = re.compile(r'Bat(wo)?man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'

>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'

Using the earlier phone number example, you can make the regex look for phone numbers that do or do not have an area code.

>>> phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
>>> mo1 = phoneRegex.search('My number is 415-555-4242')
>>> mo1.group()
'415-555-4242'

>>> mo2 = phoneRegex.search('My number is 555-4242')
>>> mo2.group()
'555-4242'

If you need to match an actual question mark character, escape it with \?.

Matching Zero or More with the Star (*)

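The * character (called the star or asterisk) means "match zero or more". That is, the group preceding the star can occur any number of times in the text: it can be completely absent or repeated over and over.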

>>> batRegex = re.compile(r'Bat(wo)*man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'

>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'

>>> mo3 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo3.group()
'Batwowowowoman'

If you need to match an actual star character, prefix the star in the regex with a backslash: \*.

Matching One or More with the Plus (+)

The + character means "match one or more". That is, the group preceding the plus must appear at least once in the text.

>>> batRegex = re.compile(r'Bat(wo)+man')
>>> mo1 = batRegex.search('The Adventures of Batwoman')
>>> mo1.group()
'Batwoman'

>>> mo2 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo2.group()
'Batwowowowoman'

>>> mo3 = batRegex.search('The Adventures of Batman')
>>> mo3 == None
True

If you need to match an actual plus sign character, prefix the plus sign in the regex with a backslash: \+.

Matching Specific Repetitions with Curly Braces

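If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly braces. For example, the regex (Ha){3} matches the string 'HaHaHa' but not 'HaHa', since the latter has only two repeats of the (Ha) group.
Instead of one number, you can specify a range by writing a minimum, a comma, and a maximum in between the curly braces. For example, the regex (Ha){3,5} matches 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.
You can also leave out the first or second number in the curly braces to leave the minimum or maximum unbounded. For example, (Ha){3,} matches three or more instances of the (Ha) group, while (Ha){,5} matches zero to five instances. Curly braces can help make your regular expressions shorter.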

>>> haRegex = re.compile(r'(Ha){3}')
>>> mo1 = haRegex.search('HaHaHa')
>>> mo1.group()
'HaHaHa'

>>> mo2 = haRegex.search('Ha')
>>> mo2 == None
True
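The range forms can be tried the same way. The following is a small sketch with made-up sample text and illustrative variable names; it relies on the default greedy behavior, so each pattern takes as many repetitions as its upper bound allows.

```python
import re

six_has = 'HaHaHaHaHaHa'  # 'Ha' repeated six times

bounded = re.compile(r'(Ha){3,5}').search(six_has).group()
at_least_three = re.compile(r'(Ha){3,}').search(six_has).group()
at_most_five = re.compile(r'(Ha){,5}').search(six_has).group()
# {,5} also allows zero repetitions, so it matches the empty string here.
zero_times = re.compile(r'(Ha){,5}').search('Hello').group()

print(bounded)           # HaHaHaHaHa (stops at five)
print(at_least_three)    # HaHaHaHaHaHa (all six)
print(at_most_five)      # HaHaHaHaHa (stops at five)
print(repr(zero_times))  # ''
```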

7.4 Greedy and Non-greedy Matching

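Since (Ha){3,5} can match three, four, or five instances of Ha in the string 'HaHaHaHaHa', why does the Match object's group() call return 'HaHaHaHaHa' instead of one of the shorter possibilities?
Python's regular expressions are greedy by default, which means that in ambiguous situations they match the longest string possible. The non-greedy (also called lazy) version of the curly braces, which matches the shortest string possible, has a question mark after the closing curly brace.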

>>> greedyHaRegex = re.compile(r'(Ha){3,5}')
>>> mo1 = greedyHaRegex.search('HaHaHaHaHa')
>>> mo1.group()
'HaHaHaHaHa'

>>> nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
>>> mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
>>> mo2.group()
'HaHaHa'

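Note that the question mark has two meanings in regular expressions: declaring a non-greedy match or flagging an optional group. These two meanings are entirely unrelated.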

7.5 The findall() Method

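In addition to the search() method, Regex objects also have a findall() method. findall() returns the strings of every match in the searched string.
search(), by contrast, returns a Match object for only the first matched text in the searched string.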

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
>>> mo.group()
'415-555-9999'

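findall() does not return a Match object but a list of strings, as long as there are no groups in the regex. Each string in the list is a piece of the searched text that matched the regex.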

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
['415-555-9999', '212-555-0000']

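If there are groups in the regex, findall() returns a list of tuples. Each tuple represents one found match, and its items are the strings matched by each group in the regex.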

>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
[('415', '555', '9999'), ('212', '555', '0000')]

7.6 Character Classes

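Table 7-1: Shorthand Codes for Common Character Classes

Shorthand character class	Represents
\d	Any numeric digit from 0 to 9
\D	Any character that is not a numeric digit from 0 to 9
\w	Any letter, numeric digit, or the underscore character (think of this as matching "word" characters)
\W	Any character that is not a letter, numeric digit, or the underscore character
\s	Any space, tab, or newline character (think of this as matching "space" characters)
\S	Any character that is not a space, tab, or newline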

>>> xmasRegex = re.compile(r'\d+\s\w+')
>>> xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids', '7 swans', '6 geese', '5 rings', '4 birds', '3 hens', '2 doves', '1 partridge']

7.7 Making Your Own Character Classes

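You can define your own character class using square brackets. For example, the character class [aeiouAEIOU] matches any vowel, both lowercase and uppercase.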

>>> vowelRegex = re.compile(r'[aeiouAEIOU]')
>>> vowelRegex.findall('Robocop eats baby food. BABY FOOD.')
['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

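You can also include ranges of letters or numbers by using a hyphen. For example, the character class [a-zA-Z0-9] matches all lowercase letters, uppercase letters, and numbers.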

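Note that inside the square brackets, the normal regular expression symbols are not interpreted as such. This means you do not need to escape the ., *, ?, or () characters with a preceding backslash. For example, the character class [0-5.] matches the digits 0 to 5 and a period; you do not need to write it as [0-5\.].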
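A short sketch of both points, using made-up sample strings and illustrative names: a range-based class, and a class containing an unescaped period that behaves the same as the escaped form.

```python
import re

# Ranges: runs of lowercase letters, uppercase letters, and digits.
alnum_regex = re.compile(r'[a-zA-Z0-9]+')
alnum_runs = alnum_regex.findall('user_42 @ home!')
print(alnum_runs)  # ['user', '42', 'home'] (the underscore is not in the class)

# Inside brackets the period is literal, escaped or not.
plain_class = re.compile(r'[0-5.]')
escaped_class = re.compile(r'[0-5\.]')
plain_hits = plain_class.findall('pi is 3.14159')
escaped_hits = escaped_class.findall('pi is 3.14159')
print(plain_hits)                  # ['3', '.', '1', '4', '1', '5']
print(plain_hits == escaped_hits)  # True
```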

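By placing a caret character (^) just after the character class's opening bracket, you can make a negative character class. A negative character class matches all the characters that are not in the character class.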

>>> consonantRegex = re.compile(r'[^aeiouAEIOU]')
>>> consonantRegex.findall('Robocop eats baby food. BABY FOOD.')
['R', 'b', 'c', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']

7.8 The Caret and Dollar Sign Characters

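You can use the caret symbol (^) at the start of a regex to indicate that a match must occur at the beginning of the searched text. Likewise, you can put a dollar sign ($) at the end of the regex to indicate that the string must end with this regex pattern. You can use ^ and $ together to indicate that the entire string must match the regex.
For example, the regex r'^Hello' matches strings that begin with 'Hello'.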

>>> beginsWithHello = re.compile(r'^Hello')
>>> beginsWithHello.search('Hello world!')
<re.Match object; span=(0, 5), match='Hello'>
>>> beginsWithHello.search('He said hello.') == None
True

The regex r'\d$' matches strings that end with a numeric character.

>>> endsWithNumber = re.compile(r'\d$')
>>> endsWithNumber.search('Your number is 42')
<re.Match object; span=(16, 17), match='2'>
>>> endsWithNumber.search('Your number is forty two.') == None
True

The regex r'^\d+$' matches strings that consist entirely of one or more numeric characters from beginning to end.

>>> wholeStringIsNum = re.compile(r'^\d+$')
>>> wholeStringIsNum.search('1234567890')
<re.Match object; span=(0, 10), match='1234567890'>
>>> wholeStringIsNum.search('12345xyz67890') == None
True
>>> wholeStringIsNum.search('12 34567890') == None
True

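The last two search() calls in the example above show that the entire string must match the regex when both ^ and $ are used.
The mnemonic "Carrots cost dollars" is a reminder that the caret comes first and the dollar sign comes last.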
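One detail not covered in these notes: in Python, $ also matches just before a single newline at the very end of the string. The sketch below (illustrative names; fullmatch() is a Regex object method in the standard re module that the chapter does not discuss) contrasts the two behaviors.

```python
import re

anchored = re.compile(r'^\d+$')
digits_only = re.compile(r'\d+')

# Both approaches accept a string made only of digits.
plain_ok = anchored.search('12345') is not None
full_ok = digits_only.fullmatch('12345') is not None

# With a trailing newline, $ still matches (just before the newline),
# while fullmatch() requires the whole string, newline included, to match.
newline_anchored = anchored.search('12345\n') is not None
newline_full = digits_only.fullmatch('12345\n') is not None

print(plain_ok, full_ok)               # True True
print(newline_anchored, newline_full)  # True False
```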

7.9 The Wildcard Character (.)

The . (dot) character in a regular expression is called a wildcard and matches any single character except a newline.

>>> atRegex = re.compile(r'.at')
>>> atRegex.findall('The cat in the hat sat on the flat mat.')
['cat', 'hat', 'sat', 'lat', 'mat']

To match an actual dot, escape the dot with a backslash: \..

Matching Everything with Dot-Star

You can use the dot-star (.*) to stand in for "any text".

>>> nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
>>> mo = nameRegex.search('First Name: Al Last Name: Sweigart')
>>> mo.group(1)
'Al'
>>> mo.group(2)
'Sweigart'

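The dot-star uses greedy mode: it always tries to match as much text as possible. To match any text in a non-greedy fashion, use the dot, star, and question mark (.*?). As with curly braces, the question mark tells Python to match in a non-greedy way.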

>>> nongreedyRegex = re.compile(r'<.*?>')
>>> mo = nongreedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man>'

>>> greedyRegex = re.compile(r'<.*>')
>>> mo = greedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man> for dinner.>'

Matching Newlines with the Dot Character

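The dot-star matches everything except a newline. By passing re.DOTALL as the second argument to re.compile(), you can make the dot character match all characters, including the newline character.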

>>> noNewlineRegex = re.compile('.*')
>>> noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
'Serve the public trust.'

>>> newlineRegex = re.compile('.*', re.DOTALL)
>>> newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
'Serve the public trust.\nProtect the innocent.\nUphold the law.'

7.10 Review of Regex Symbols

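? matches zero or one of the preceding group.
* matches zero or more of the preceding group.
+ matches one or more of the preceding group.
{n} matches exactly n of the preceding group.
{n,} matches n or more of the preceding group.
{,m} matches 0 to m of the preceding group.
{n,m} matches at least n and at most m of the preceding group.
{n,m}? or *? or +? performs a non-greedy match of the preceding group.
^spam means the string must begin with spam.
spam$ means the string must end with spam.
. matches any character except newline characters.
\d, \w, and \s match a digit, word, or space character, respectively.
\D, \W, and \S match anything except a digit, word, or space character, respectively.
[abc] matches any character between the brackets (such as a, b, or c).
[^abc] matches any character that is not between the brackets.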

7.11 Case-Insensitive Matching

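Normally, regular expressions match text with the exact casing you specify. But sometimes you care only about matching the letters, regardless of whether they are uppercase or lowercase. To make a regex case-insensitive, pass re.IGNORECASE or re.I as the second argument to re.compile().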

>>> robocop = re.compile(r'robocop', re.I)
>>> robocop.search('Robocop is part man, part machine, all cop.').group()
'Robocop'

>>> robocop.search('ROBOCOP protects the innocent.').group()
'ROBOCOP'

>>> robocop.search('Al, why does your programming book talk about robocop so much?').group()
'robocop'

7.12 Substituting Strings with the sub() Method

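Regular expressions can not only find text patterns but also substitute new text in place of those patterns. The sub() method of Regex objects takes two arguments. The first is the string that replaces each match. The second is the string to search in. sub() returns a string with the substitutions applied.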

>>> namesRegex = re.compile(r'Agent \w+')
>>> namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
'CENSORED gave the secret documents to CENSORED.'

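Sometimes you may need to use the matched text itself as part of the substitution. In the first argument to sub(), you can type \1, \2, \3, and so on, to mean "enter the text of group 1, 2, 3, and so on, in the substitution".
For example, say you want to censor the names of secret agents by showing only the first letter of each name.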

>>> agentNamesRegex = re.compile(r'Agent (\w)\w*')
>>> agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
'A**** told C**** that E**** knew B**** was a double agent.'

7.13 Managing Complex Regexes

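Matching complicated text patterns may require long, convoluted regular expressions. You can mitigate this by telling re.compile() to ignore whitespace and comments inside the regular expression string. This "verbose mode" is enabled by passing re.VERBOSE as the second argument to re.compile().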

A hard-to-read regular expression:

phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext.)\s*\d{2,5})?)')

The same regular expression spread over multiple lines, with comments:

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)

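The previous example uses the triple-quote syntax (''') to create a multiline string, so that the regular expression definition can be spread over many lines and made much more legible.
The comment rules inside the regular expression string are the same as in regular Python code: the # symbol and everything after it to the end of the line are ignored. Also, the extra spaces inside the multiline string for the regular expression are not considered part of the text pattern to be matched.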
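To see that verbose mode changes only how the pattern is written, here is a minimal sketch with a deliberately smaller pattern than the one above (compact, verbose, and sample are illustrative names, and the sample text is made up). Both regexes produce the same groups.

```python
import re

compact = re.compile(r'(\d{3})-(\d{4})')
verbose = re.compile(r'''
    (\d{3})   # first 3 digits
    -         # separator
    (\d{4})   # last 4 digits
    ''', re.VERBOSE)

sample = 'Office: 555-4242'
compact_groups = compact.search(sample).groups()
verbose_groups = verbose.search(sample).groups()

print(compact_groups)  # ('555', '4242')
print(verbose_groups)  # ('555', '4242')
```

Because whitespace is ignored in verbose mode, a pattern that must match an actual space needs \s or an escaped space.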

7.14 Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE

You can combine re.IGNORECASE, re.DOTALL, and re.VERBOSE using the pipe character (|), which in this context is known as the bitwise or operator.

For a regular expression that is case-insensitive and whose dot character also matches newlines, call re.compile() like this:

>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL)

To use all three options in the second argument:

>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)

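This syntax is a little old-fashioned and originates from early versions of Python. The details of the bitwise operators are beyond the scope of the book; see the resources at http://nostarch.com/automatestuff/ for more information.
You can also pass other options as the second argument. They are uncommon, but you can read more about them in the same resources.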
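The effect of combining flags can be checked with a pattern that needs both of them to match. This is only a sketch; the pattern, sample text, and variable names are made up for illustration.

```python
import re

text = 'FOO\nbar'  # uppercase FOO, then a newline, then bar

no_flags = re.compile(r'foo.bar')
ignore_case_only = re.compile(r'foo.bar', re.IGNORECASE)
both_flags = re.compile(r'foo.bar', re.IGNORECASE | re.DOTALL)

# Without DOTALL the dot cannot match the newline, so these two fail.
print(no_flags.search(text))          # None
print(ignore_case_only.search(text))  # None

# With both flags, case is ignored and the dot matches the newline.
matched = both_flags.search(text).group()
print(repr(matched))                  # 'FOO\nbar'
```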

7.15 Project: Phone Number and Email Address Extractor

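Task: find every phone number and email address in a long web page or document.
What the program does: searches the text on the clipboard for all phone numbers and email addresses, then replaces the clipboard contents with what it found.
How to use it: press CTRL-A to select all the text, press CTRL-C to copy it to the clipboard, and then run the program.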

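It is a good idea to start with a high-level plan of what the program needs to do and to think about the actual code later.
For now, stick to broad strokes. For example, the phone number and email address extractor needs to do the following:
① Get the text off the clipboard;
② Find all phone numbers and email addresses in the text;
③ Paste them onto the clipboard.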

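Now you can start thinking about how this might work in code. The code will need to do the following:
① Use the pyperclip module to copy and paste strings;
② Create two regexes, one for matching phone numbers and the other for matching email addresses;
③ Find all matches of both regexes;
④ Neatly format the matched strings into a single string to paste;
⑤ Display some kind of message if no matches were found in the text.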

Step 1: Create a Regex for Phone Numbers

#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

import pyperclip, re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?    # extension
    )''', re.VERBOSE)

# TODO: Create email regex.

# TODO: Find matches in clipboard text.

# TODO: Copy results to the clipboard.

Step 2: Create a Regex for Email Addresses

# Create email regex.
emailRegex = re.compile(r'''(
	[a-zA-Z0-9._%+-]+      # username
	@                      # @ symbol
	[a-zA-Z0-9.-]+         # domain name
	(\.[a-zA-Z]{2,4})      # dot-something
	)''', re.VERBOSE)

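The format for email addresses has a lot of odd rules. This regular expression will not match every possible valid email address, but it will match almost any typical email address you encounter.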

Step 3: Find All Matches in the Clipboard Text

# Find matches in clipboard text.
text = str(pyperclip.paste())
matches = []
for groups in phoneRegex.findall(text):
	phoneNum = '-'.join([groups[1], groups[3], groups[5]])
	if groups[8] != '':
		phoneNum += ' x' + groups[8]
	matches.append(phoneNum)
for groups in emailRegex.findall(text):
	matches.append(groups[0])

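Each match is a tuple, and each tuple contains a string for every group in the regex. Because both phoneRegex and emailRegex (defined in steps 1 and 2) wrap the whole pattern in an outer pair of parentheses, the item at index 0 of each tuple is the entire matched text. For email addresses that item is appended as is; for phone numbers the items at indexes 1, 3, 5, and 8 (area code, first three digits, last four digits, and extension) are reassembled into a single standard format.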
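To make the tuple indexes concrete, the sketch below runs the phone regex from step 1 on a made-up sample string instead of the clipboard (sample, all_groups, and rebuilt are illustrative names) and prints each tuple before rebuilding the numbers the same way the program does.

```python
import re

# Same phone regex as in step 1.
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?    # extension
    )''', re.VERBOSE)

sample = 'Call 415-555-1234 x99 or (800) 420-7240.'  # made-up text
all_groups = phoneRegex.findall(sample)
for groups in all_groups:
    print(groups)  # one tuple per match, index 0 is the whole matched text

# Rebuild each number from indexes 1, 3, 5 and the optional extension at 8.
rebuilt = []
for groups in all_groups:
    number = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        number += ' x' + groups[8]
    rebuilt.append(number)
print(rebuilt)  # ['415-555-1234 x99', '(800)-420-7240']
```

Groups that did not take part in a match show up as empty strings, which is why the extension is tested against ''.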

Step 4: Join the Matches into a String for the Clipboard

# Copy results to the clipboard.
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches)) # print any matches you find to the terminal.
else:
    print('No phone numbers or email addresses found.')

Running the Program

The complete script is as follows:

#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

import pyperclip, re

# Create phone regex.
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?    # extension
    )''', re.VERBOSE)

# Create email regex.
emailRegex = re.compile(r'''(
	[a-zA-Z0-9._%+-]+      # username
	@                      # @ symbol
	[a-zA-Z0-9.-]+         # domain name
	(\.[a-zA-Z]{2,4})      # dot-something
	)''', re.VERBOSE)

# Find matches in clipboard text.
text = str(pyperclip.paste())
matches = []
for groups in phoneRegex.findall(text):
	phoneNum = '-'.join([groups[1], groups[3], groups[5]])
	if groups[8] != '':
		phoneNum += ' x' + groups[8]
	matches.append(phoneNum)
for groups in emailRegex.findall(text):
	matches.append(groups[0])

# Copy results to the clipboard.
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches)) # print any matches you find to the terminal.
else:
    print('No phone numbers or email addresses found.')

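Open your web browser to the No Starch Press contact page at http://www.nostarch.com/contactus.htm, press CTRL-A to select all the text on the page, press CTRL-C to copy it to the clipboard, and then run the program. The output looks like this: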

Copied to clipboard:
800-420-7240
415-863-9900
415-863-9950
info@nostarch.com
media@nostarch.com
academic@nostarch.com
help@nostarch.com

Ideas for Similar Programs

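Identifying patterns of text (and possibly substituting them with the sub() method) has many different potential applications.
① Find website URLs that begin with http:// or https://.
② Clean up dates in different formats (such as 3/14/2015, 03-14-2015, and 2015/3/14) by replacing them with dates in a single, standard format.
③ Remove sensitive information such as Social Security numbers or credit card numbers.
④ Find common typos such as multiple spaces between words, accidentally repeated words, or multiple exclamation marks at the end of sentences.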
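As one possible starting point for the last idea, here is a hedged sketch that uses sub() to fix the three kinds of typos on a made-up sentence. The names are illustrative, and the repeated-word pattern relies on \b (word boundary) and a \1 backreference inside the pattern, two standard re features that this chapter does not cover.

```python
import re

text = 'This  is is a   test!!!'  # made-up sample with three kinds of typos

multiple_spaces = re.compile(r' {2,}')       # two or more spaces in a row
repeated_word = re.compile(r'\b(\w+) \1\b')  # the same word twice in a row
multiple_bangs = re.compile(r'!{2,}')        # two or more exclamation marks

cleaned = multiple_spaces.sub(' ', text)     # collapse runs of spaces
cleaned = repeated_word.sub(r'\1', cleaned)  # keep one copy of the word
cleaned = multiple_bangs.sub('!', cleaned)   # keep a single exclamation mark
print(cleaned)  # This is a test!
```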

7.16 Summary

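There is more to regular expression syntax than this chapter covers.
You can find out more in the official Python documentation: http://docs.python.org/3/library/re.html.
The tutorial website http://www.regular-expressions.info/ is also a useful resource.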

7.17 Practice Projects

Strong Password Detection

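Write a function that uses regular expressions to make sure the password string it is passed is strong. A strong password is defined as one that is at least eight characters long, contains both uppercase and lowercase letters, and has at least one digit. You may need to test the string against multiple regex patterns to validate its strength.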

#! python3
# strongPasswordDetection.py - Makes sure the password string it is passed is strong.

import re
def isStrongPassword(password):
	if len(password) < 8:
		return False
	
	upperRegex = re.compile(r'.*[A-Z].*') # contains uppercase characters
	mo1 = upperRegex.search(password)
	if mo1 == None:
		return False
	
	lowerRegex = re.compile(r'.*[a-z].*') # contains lowercase characters
	mo2 = lowerRegex.search(password)
	if mo2 == None:
		return False
		
	digitRegex = re.compile(r'.*\d.*') # contains digit characters
	mo3 = digitRegex.search(password)
	if mo3 == None:
		return False
	
	return True
	
password = input()
strong = isStrongPassword(password)
if strong:
	print('The password is strong!')
else:
	print('The password is not strong!')

Regex Version of the strip() Method

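Write a function that takes a string and does the same thing as the strip() string method. If no other arguments are passed besides the string to strip, whitespace characters are removed from the beginning and end of the string. Otherwise, the characters specified in the second argument to the function are removed from the beginning and end of the string.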

#! python3
# stripRegexVersion.py - Does the same thing as the strip() string method.

import re

def stripRegex(string, removed=r'\s'):  # raw string avoids an invalid-escape warning for \s
	spam = '[' + removed + ']'
	allRegex = re.compile(spam)
	listResult = allRegex.findall(string)
	if len(string) == len(listResult):
		return ''
		
	omit = '([' + removed + ']*)'
	keep = '([^' + removed + ']*.*[^' + removed + '])'
	s = omit + keep + omit
	regex = re.compile(s)
	return regex.sub(r'\2', string)

string1 = ' \n white '
print(stripRegex(string1))
string2 = '123424321'
print(stripRegex(string2, '12'))
string3 = 'white space\n'
print(stripRegex(string3))
string4 = 'yesnoyes'
print(stripRegex(string4, 'no'))