Automate the Boring Stuff with Python (7): Pattern Matching with Regular Expressions

半夏云流

Article link: https://blog.csdn.net/qq_33195791/article/details/89603076

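Preface

Knowing regular expressions can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps. When you're a nerd, you forget that the problems you solve with a couple of keystrokes can take other people days of tedious, error-prone work. (Cory Doctorow)

1. Finding Text Patterns with Regular Expressions

1.1 Creating a Regex Object

All of Python's regular expression functions live in the re module, so remember to import re before using them.

Passing a string value that represents your regular expression to re.compile() returns a Regex pattern object, or simply a Regex object.

In a Python regular expression, \d stands for a digit, that is, any single numeral from 0 to 9. For example, a phone number such as 415-555-4292 can be matched with:

phoneNumRegex=re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # r marks a raw string, so backslashes are not treated as escape characters

1.2 Matching Regex Objects

A Regex object's search() method searches the string it is passed for the first match of the regular expression. If the pattern is not found in the string, search() returns None. If the pattern is found, search() returns a Match object. A Match object has a group() method that returns the text that was actually matched in the searched string. For example: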

#!/usr/bin/python3

import re
phoneNumRegex=re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # r marks a raw string, so backslashes are not treated as escape characters

mo = phoneNumRegex.search('My Number is 415-555-4242-X99')
print('Phone number found:' + mo.group())

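Result: Phone number found:415-555-4242

As you can see, the Match object's group() method returned the matched text.

Passing raw strings to re.compile(): regular expressions use backslashes (\) frequently, so it is convenient to pass a raw string to re.compile() instead of doubling every backslash.

1.3 Steps for Regular Expression Matching

Using regular expressions in Python takes a few simple steps:

1. Import the regex module with import re.

2. Create a Regex object with the re.compile() function (remember to use a raw string).

3. Pass the string you want to search to the Regex object's search() method. It returns a Match object, or None if nothing matches.

4. Call the Match object's group() method to get a string of the actual matched text.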
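
The four steps above can be sketched as one small helper. This is only an illustration: the name find_phone_number is hypothetical and not part of the re module, and the explicit None check is added because search() returns None when there is no match.

```python
import re  # Step 1: import the regex module

# Step 2: compile the pattern once (raw string, so backslashes stay literal).
phone_num_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone_number(text):
    """Return the first phone number in text, or None if there is none.

    find_phone_number is a hypothetical helper name used only in this sketch.
    """
    mo = phone_num_regex.search(text)  # Step 3: a Match object, or None
    if mo is None:
        return None
    return mo.group()                  # Step 4: the text that actually matched

print(find_phone_number('My Number is 415-555-4242'))
print(find_phone_number('No number here'))
```

Checking for None before calling group() avoids the AttributeError that would otherwise be raised when the text contains no phone number.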

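2 Matching More Patterns with Regular Expressions

Regular expressions have many powerful features. Following the same basic steps, we can match more complex patterns.

2.3.1 Grouping with Parentheses

Suppose you want to separate the area code from the rest of the phone number. Adding parentheses creates "groups" in the regular expression:

phoneNumRegex=re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')

You can still use the group() method to get the matched text from a single group.

The first pair of parentheses in the regex is group 1, and the second pair is group 2. Passing 0 or nothing to group() returns the entire matched text; passing 1 or 2 returns the corresponding part of the match.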

import re
phoneNumRegex=re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)') # r marks a raw string, so backslashes are not treated as escape characters

mo = phoneNumRegex.search('My Number is 415-555-4242-X99')
print('Phone number Area is:' + mo.group(1))
print('Phone number Number is:' + mo.group(2))
print('Phone number None is:' + mo.group(3))

Result:

Phone number Area is:415
Phone number Number is:555-4242
Traceback (most recent call last):
File "regex.py", line 9, in <module>
print('Phone number None is:' + mo.group(3))
IndexError: no such group

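Note that passing a group number that does not exist raises an IndexError.

Note that group() returns a string, unlike the groups() method introduced next.

To retrieve all the groups at once, use the groups() method (note the plural form of the name), which returns a tuple.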

phoneNumRegex=re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My Number is (415) 555-4242')
if None == mo:
    print('No such regex strings')
else:
    print('found phone numbers is:',end=' ')
    print(mo.groups())

Result:

found phone numbers is: ('(415)', '555-4242')

In the raw string passed to re.compile(), the escaped characters \( and \) match actual parenthesis characters.
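
Because groups() returns a tuple, its values can be unpacked into separate variables in one assignment. A minimal sketch, reusing the pattern above; the variable names area_code and main_number are chosen here only for illustration.

```python
import re

phone_num_regex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phone_num_regex.search('My Number is (415) 555-4242')

# groups() returns a tuple with one item per group, so it can be unpacked.
area_code, main_number = mo.groups()
print(area_code)
print(main_number)
```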

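2.3.2 Matching Multiple Groups with "|"

The "|" character is called a "pipe" in the shell and is the "bitwise OR" operator in most programming languages. Use it when you want to match any one of several expressions. For example, the regular expression r'Batman|Tina Fey' matches either Batman or Tina Fey. If both occur in the searched string, the first occurrence is returned as the Match object.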

HeroRegex = re.compile(r'Batman|Tina Fey')
mo1=HeroRegex.search('Batman and Tina Fey')
if mo1 is None:
    print('No such regex strings')
else:
    print('we searched :'+mo1.group())

Result: we searched :Batman

The findall() method, introduced later, finds all of the matching strings.
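
As a brief preview, the difference between the two methods for an alternation pattern can be sketched like this (the variable names first and every are arbitrary):

```python
import re

hero_regex = re.compile(r'Batman|Tina Fey')
text = 'Batman and Tina Fey'

first = hero_regex.search(text).group()  # only the first occurrence
every = hero_regex.findall(text)         # every non-overlapping match, in order

print(first)
print(every)
```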

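Another use: when all the strings to match start with the same prefix and differ only in the suffix, you can write the prefix once and put the alternatives in a group.

Suppose you want to match any one of Batman, Batmobile, Batcopter, or Batbat: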

BatRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo=BatRegex.search('Batmobile lost a wheel')
if mo is None:
    print('No such regex strings')
else:
    print('complete match is :'+ mo.group())
    print('group(1) = '+ mo.group(1))

Result:

complete match is :Batmobile
group(1) = mobile

Note: called with no arguments, group() returns the full matched text, Batmobile, while group(1) returns only the part matched inside the first pair of parentheses.

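2.3.3 Optional Matching with the Question Mark

Sometimes the pattern you want to match is optional: the regex should match whether or not that piece of text is present. The "?" character marks the group preceding it as optional. For example: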

phoneNumRegex=re.compile(r'(\(\d\d\d\))? (\d\d\d-\d\d\d\d)')
mo1 = phoneNumRegex.search('My Number is (415) 555-4242')
if None == mo1:
    print('No such regex strings')
else:
    print('found phone numbers is:',end=' ')
    print(mo1.groups())
mo2 = phoneNumRegex.search('My Number is 555-4242')
if None == mo2:
    print('No such regex strings')
else:
    print('found phone numbers is:',end=' ')
    print(mo2.groups())

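Result:

found phone numbers is: ('(415)', '555-4242')
found phone numbers is: (None, '555-4242')

As you can see, even though the second phone number has no area code, a tuple is still returned; its first value is simply None.

2.3.4 Matching Zero or More with the Star (*)

The * character is called a star and means "match zero or more times": the group preceding the star can occur any number of times in the text.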

BatRegex=re.compile(r'Bat(wo)*man')
mo1=BatRegex.search('The Adventures of Batman')
mo2=BatRegex.search('The Adventures of Batwoman')
mo3=BatRegex.search('The Adventures of Batwowowowowowowowowowoman')
if None != mo1:
    print('mo1.group:' + mo1.group())
if None != mo2:
    print('mo2.group:' + mo2.group())
if None != mo3 :
    print('mo3.group:' + mo3.group())

Result:

mo1.group:Batman
mo2.group:Batwoman
mo3.group:Batwowowowowowowowowowoman

To match a literal * in the text, escape it with a backslash: \*
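
A short sketch of the escaped star. The sample sentence is made up for illustration.

```python
import re

# \* matches a literal asterisk; an unescaped * would mean "zero or more".
literal_star_regex = re.compile(r'\d+\*\d+')
mo = literal_star_regex.search('The area is 3*4 square meters')
print(mo.group())
```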

2.3.5 Matching One or More with the Plus (+)

The + works like *, except that the group preceding it must appear at least once.

BatRegex=re.compile(r'Bat(wo)+man')
mo1=BatRegex.search('The Adventures of Batman')
mo2=BatRegex.search('The Adventures of Batwoman')
mo3=BatRegex.search('The Adventures of Batwowowowowowowowowowoman')
if None != mo1:
    print('mo1.group:' + mo1.group())
if None != mo2:
    print('mo2.group:' + mo2.group())
if None != mo3 :
    print('mo3.group:' + mo3.group())

Result (mo1 is None because 'Batman' contains no 'wo', so nothing is printed for it):

mo2.group:Batwoman
mo3.group:Batwowowowowowowowowowoman

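2.3.6 Matching Specific Repetitions with Braces "{}"

To repeat a group a specific number of times, follow the group with a number in braces, such as {3}. You can also specify a range: {3,5} matches the group when it occurs three to five times.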

BatRegex=re.compile(r'Bat(wo){3,10}man')
mo1=BatRegex.search('The Adventures of Batman')
mo2=BatRegex.search('The Adventures of Batwoman')
mo3=BatRegex.search('The Adventures of Batwowowowowowowowowowoman')
mo4=BatRegex.search('The Adventures of Batwowowowowoman') # 5 repetitions
if None != mo1:
    print('mo1.group:' + mo1.group())
if None != mo2:
    print('mo2.group:' + mo2.group())
if None != mo3 :
    print('mo3.group:' + mo3.group())
if None !=mo4 :
    print('mo4.group:' + mo4.group())

Result:

mo3.group:Batwowowowowowowowowowoman
mo4.group:Batwowowowowoman

As you can see, only the strings with 5 and 10 repetitions match.
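
The braces also accept an open-ended range: leaving out the second number means "no upper limit", and leaving out the first means "starting from zero". The following sketch compares the three forms on a made-up string; in each case the longest run the quantifier allows is what gets matched.

```python
import re

text = 'HaHaHaHaHaHa'  # six repetitions of "Ha"

exactly_three = re.compile(r'(Ha){3}').search(text).group()   # exactly 3
three_or_more = re.compile(r'(Ha){3,}').search(text).group()  # 3 or more
up_to_five = re.compile(r'(Ha){,5}').search(text).group()     # 0 to 5

print(exactly_three)
print(three_or_more)
print(up_to_five)
```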

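3 Greedy and Non-greedy Matching

Python's regular expressions are "greedy" by default: in ambiguous situations they match the longest string possible.

A question mark after the closing brace makes the repetition of the preceding group non-greedy, so it matches the shortest string possible.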

GreedHaRegex=re.compile(r'(Ha){3,5}')
mo1=GreedHaRegex.search('HaHaHaHaHa')
if None != mo1:
    print('mo1.group:' + mo1.group())

NoGreedHaRegex=re.compile(r'(Ha){3,5}?')
mo2=NoGreedHaRegex.search('HaHaHaHaHa')
if None != mo2:
    print('mo2.group:' + mo2.group())

Result:

mo1.group:HaHaHaHaHa
mo2.group:HaHaHa

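Note that the question mark has two meanings in regular expressions:

<1> It declares an optional group when it follows the group: 'We (HA)?'

<2> It declares a non-greedy match when it follows the braces {}: 'We(HA){3,5}?'

4. The findall() Method

Besides search(), Regex objects also have a findall() method. search() returns a Match object for the first matched text in the searched string, whereas findall() returns a list of strings containing every match in the searched string. For example: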

PhoneNumRegex=re.compile(r'\d{3}-\d{3}-\d{4}')
se=PhoneNumRegex.search('Cell: 415-555-9999 work: 212-555-0000')
if se != None:
    print('search group is '+se.group())

search() returns the first matched string:

search group is 415-555-9999

PhoneNumRegex=re.compile(r'\d{3}-\d{3}-\d{4}')
FD=PhoneNumRegex.findall('Cell: 415-555-9999 work: 212-555-0000')
if FD:  # findall() returns a list, which is empty (falsy) when nothing matches
    print('findall group is ',end='')
    print(FD)
    # print the items one by one
    print('findall group(1) is',FD[0])
    print('findall group(2) is',FD[1])
    FD.append('123-456-7890')
    print('findall group(3) is',FD[2])

Result:

findall group is ['415-555-9999', '212-555-0000']
findall group(1) is 415-555-9999
findall group(2) is 212-555-0000
findall group(3) is 123-456-7890

Because findall() returns a list of strings (when the regex contains no groups), the result can be handled like any other list.
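
One detail worth knowing before the extraction program at the end of this article: when the regex does contain groups, findall() returns a list of tuples, one tuple per match and one string per group. A small sketch comparing the two cases on the same text:

```python
import re

text = 'Cell: 415-555-9999 work: 212-555-0000'

# No groups in the pattern: a list of matched strings.
no_groups = re.compile(r'\d{3}-\d{3}-\d{4}').findall(text)

# Groups in the pattern: a list of tuples, one string per group.
with_groups = re.compile(r'(\d{3})-(\d{3})-(\d{4})').findall(text)

print(no_groups)
print(with_groups)
```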

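5. Character Classes

In the phone number examples we used \d to stand for a digit. Python has several such shorthand character classes:

Shorthand character class	Represents
\d	Any numeric digit from 0 to 9
\D	Any character that is not a digit from 0 to 9
\w	Any letter, digit, or the underscore character (think of this as matching "word" characters)
\W	Any character that is not a letter, digit, or underscore
\s	Any space, tab, or newline character (think of this as matching "whitespace" characters)
\S	Any character that is not a space, tab, or newline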
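
The shorthand classes can be combined in one pattern. The sketch below matches one or more digits, a single whitespace character, and then one or more word characters; the sample sentence is chosen only for illustration.

```python
import re

# \d+ : one or more digits, \s : one whitespace character, \w+ : one or more word characters
count_regex = re.compile(r'\d+\s\w+')
found = count_regex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids')
print(found)
```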

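6. Making Your Own Character Classes

Sometimes shorthand classes such as \d and \w are too broad. You can define your own character class using square brackets [].

For example: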

redef=re.compile(r'[aeiouAEIOU]')
FE=redef.findall('ROBcop,Eats,food,baby,lily')
if None != FE:
    print('FE is',FE)

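Result:

FE is ['O', 'o', 'E', 'a', 'o', 'o', 'a', 'i']

You can also include ranges of letters or numbers with a hyphen. For example, the character class [a-zA-Z0-9] matches all letters and digits.

Note: inside square brackets, the normal regular expression symbols are not interpreted as such, so you do not need to escape ".", "*", "?", or "()" with a backslash. For example, to match the digits 0 to 5 and a period, write [0-5.] rather than [0-5\.].

Placing a caret "^" right after the opening bracket makes a negative character class, which matches every character that is not in the class. For example:

[^aeiouAEIOU] matches every character that is not a vowel.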
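
The three variations just described (a range, an unescaped period inside brackets, and a negated class) can be tried side by side. The input strings are made up for this sketch.

```python
import re

# A range class: letters and digits only, so '-', ' ' and '_' are skipped.
alnum = re.compile(r'[a-zA-Z0-9]').findall('A-1 b_2')

# Inside brackets the period is literal, so no backslash is needed.
digits_and_dot = re.compile(r'[0-5.]').findall('3.14 and 7.9')

# A negated class: everything that is not a vowel, including the space.
non_vowels = re.compile(r'[^aeiouAEIOU]').findall('Eat it')

print(alnum)
print(digits_and_dot)
print(non_vowels)
```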

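7. The Caret ^ and Dollar Sign $

A caret (^) at the start of a regex indicates that the match must occur at the beginning of the searched text. Likewise, a dollar sign ($) at the end of a regex indicates that the string must end with this pattern. Using ^ and $ together indicates that the entire string must match the pattern; matching only some subset of the string is not enough.

In the code below, the regular expression r'^Hello' matches strings that begin with 'Hello':

>>> import re
>>> beginHello = re.compile(r'^Hello')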
>>> beginHello.search('Hello world')
<re.Match object; span=(0, 5), match='Hello'>
>>> beginHello.search('He said hello') == None
True

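The regular expression r'\d$' matches strings that end with a digit from 0 to 9. In the result below, span=(17, 18) means the match starts at index 17 and ends just before index 18.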

>>> endWithNum= re.compile(r'\d$')
>>> endWithNum.search('Your number is 123')
<re.Match object; span=(17, 18), match='3'>

The regular expression r'^\d+$' matches strings that consist entirely of digits from beginning to end.

>>> wholeIsNum = re.compile(r'^\d+$')
>>> wholeIsNum.search('1234567890')
<re.Match object; span=(0, 10), match='1234567890'>
>>> wholeIsNum.search('12345XY890')
>>> wholeIsNum.search('12345XY890') == None
True
>>> wholeIsNum.search('12345 890') == None
True
>>>

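8. The Wildcard Character "."

In a regular expression, the . (period) character is called the "wildcard". It matches any character except a newline.

For example: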

>>> atRegex = re.compile(r'.at')
>>> atRegex.findall('The cat in the hat sat on the flat mat')
['cat', 'hat', 'sat', 'lat', 'mat']

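Remember that the period matches just one character, which is why the text flat in the previous example matched only 'lat'. To match an actual period, escape it with a backslash: \.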
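
A quick sketch of the difference between the wildcard and the escaped period, using a made-up string:

```python
import re

text = 'cat.at bat'

wildcard = re.compile(r'.at').findall(text)   # any single character, then "at"
literal = re.compile(r'\.at').findall(text)   # a real period, then "at"

print(wildcard)
print(literal)
```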

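8.1 Matching Everything with Dot-Star

Sometimes you want to match anything at all. For example, suppose you want to match the string 'First Name:', followed by any text, followed by 'Last Name:', followed again by any text. Recall that the period means "any single character except a newline" and the star means "zero or more of the preceding character".

For example: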

>>> nameRegex = re.compile(r'First Name:(.*) Last Name:(.*)')
>>> mo = nameRegex.search('First Name:lily Last Name:Lokwud')
>>> mo.group(1)
'lily'
>>> mo.group(2)
'Lokwud'

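Dot-star is greedy: it always tries to match as much text as possible. To match any text in a non-greedy fashion, use dot-star followed by a question mark (.*?). As with braces, the question mark tells Python to match non-greedily.

Compare the effect of the question mark in different positions (in <.*>? the question mark only makes the final > optional, so that match stays greedy):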

>>> nogreedyRegex = re.compile(r'<.*?>')
>>> mo =nogreedyRegex.search('<To serverman> for dinner.>')
>>> mo.group()
'<To serverman>'
>>> nogreedyRegex = re.compile(r'<.*>')
>>> mo =nogreedyRegex.search('<To serverman> for dinner.>')
>>> mo.group()
'<To serverman> for dinner.>'
>>> nogreedyRegex = re.compile(r'<.*>?')
>>> mo =nogreedyRegex.search('<To serverman> for dinner.>')
>>> mo.group()
'<To serverman> for dinner.>'
>>>
>>> mo =nogreedyRegex.search('<To serverman> for dinner.> Lily playing piano>')
>>> mo.group()
'<To serverman> for dinner.> Lily playing piano>'
>>> nogreedyRegex = re.compile(r'<.*>?')
>>> mo =nogreedyRegex.search('<To serverman> for dinner.> Lily playing piano>')
>>> mo.group()
'<To serverman> for dinner.> Lily playing piano>'
>>> nogreedyRegex = re.compile(r'<.*?>')
>>> mo =nogreedyRegex.search('<To serverman> for dinner.> Lily playing piano>')
>>> mo.group()
'<To serverman>'

8.2 Matching Newlines with the Dot Character

Dot-star matches everything except a newline. Passing re.DOTALL as the second argument to re.compile() makes the dot match every character, including newlines.

For example:

>>> noNewlineRegex = re.compile(r'.*')
>>> noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law')
<re.Match object; span=(0, 23), match='Serve the public trust.'>
>>> noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law').group()
'Serve the public trust.'
>>> noNewlineRegex = re.compile(r'.*',re.DOTALL)
>>> noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law').group()
'Serve the public trust.\nProtect the innocent.\nUphold the law'

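9. Case-Insensitive Matching

Normally, regular expressions match text with the exact casing you specify. Sometimes, though, you only care about matching the letters, not whether they are uppercase or lowercase. To make a regex case-insensitive, pass re.IGNORECASE or re.I as the second argument to re.compile().

For example: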

>>> robocop = re.compile(r'robocop',re.I)
>>> robocop.search('RoboCop is part man, part machine, all cop.').group()
'RoboCop'
>>> robocop.search('ROBOCOP protects the innocent..').group()
'ROBOCOP'

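10. Substituting Strings with the sub() Method

Regular expressions can not only find text patterns but also substitute new text for them. The sub() method of a Regex object takes two arguments. The first is the string that replaces any matches found. The second is the string to search with the regular expression. sub() returns the string with the substitutions applied.

For example:

>>> namesRegex = re.compile(r'Agent \w+')
>>> namesRegex.sub('CENSORED','Agent Alice gave the secret documents to Agent Bob.')
'CENSORED gave the secret documents to CENSORED.'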

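Sometimes you need to use the matched text itself as part of the substitution. In the first argument to sub(), you can type \1, \2, \3, and so on, to mean "insert the text of group 1, 2, 3, and so on, here".

For example, suppose you want to hide the agents' names and show only the first letter of each. To do this, use the regex r'Agent (\w)\w*' and pass r'\1****' as the first argument to the Regex object's sub() method (it is the second argument of the module-level re.sub() function, whose first argument is the pattern). The \1 in that string is replaced by whatever text group 1 matched, that is, the (\w) group of the regular expression.

The regular expression r'Agent (\w)\w*' matches "Agent ", then one word character (captured as group 1), then zero or more word characters.

>>> re.sub(r'Agent (\w)\w*',r'\1****','Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
'A**** told C**** that E**** knew B**** was a double agent.'
>>> re.sub(r'Agent (\w)\w+',r'\1****','Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
'A**** told C**** that E**** knew B**** was a double agent.'
>>> re.sub(r'Agent (\w+)',r'\1****','Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
'Alice**** told Carol**** that Eve**** knew Bob**** was a double agent.'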

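re.sub() takes five parameters:

re.sub(pattern, repl, string, count=0, flags=0)

The three required parameters, pattern, repl, and string, are the regular expression pattern, the replacement string, and the string to search, respectively.

The two optional parameters are count and flags.

re.sub(r'Agent (\w)\w*',r'\1****','Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

pattern = r'Agent (\w)\w*': this regular expression matches "Agent ", then one word character, then zero or more word characters.

repl = r'\1****': this uses the text matched by group 1, that is (\w), as part of the replacement and appends ****.

This can be hard to grasp at first, so practice with it. It is easier to follow when the replacement does not reuse the matched text and is just a fixed string, for example replacing every match with AAAA:

re.sub(r'Agent (\w)\w*',r'AAAA','Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

Result:

>>> re.sub(r'Agent (\w)\w*',r'AAAA','Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
'AAAA told AAAA that AAAA knew AAAA was a double agent.'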
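
The two optional parameters can be illustrated with a short sketch. The sample sentence is made up, and it deliberately contains a lowercase "agent" to show the effect of flags; count limits how many matches are replaced.

```python
import re

text = 'Agent Alice told agent Carol that Agent Eve knew.'

# count=1 replaces only the first match and leaves the rest untouched.
first_only = re.sub(r'Agent \w+', 'CENSORED', text, count=1)

# flags=re.I makes the pattern case-insensitive, so "agent Carol" matches too.
all_cases = re.sub(r'Agent \w+', 'CENSORED', text, flags=re.I)

print(first_only)
print(all_cases)
```

Passing count and flags by keyword avoids mixing them up, since both are plain integers.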

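11. Managing Complex Regexes by Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE

You can spread a regular expression over multiple lines and add a comment after each line, which makes a complex regex readable. Here is an example that matches phone numbers:

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})? # extension
    )''',re.VERBOSE)

Note that re.VERBOSE must be passed as the second argument.

The first argument, the regular expression, is written inside the parentheses of r'''( )'''.

You can split the regex into meaningful parts, one part per line, each followed by a comment. Comments are ignored when the regex is compiled, and so is extra whitespace. Written the old single-line way, an accidental space could change the meaning of the regular expression.

If you want to use re.VERBOSE to add comments and also re.IGNORECASE to ignore case, note that re.compile() accepts only a single value as its second argument. You can get around this limitation by combining the flags with the pipe character (|), which in this context is the bitwise OR operator.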
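
A minimal sketch of combining all three flags with |. The pattern and the sample text are invented for illustration; note that in verbose mode a literal space in the pattern has to be escaped, because unescaped whitespace is ignored.

```python
import re

combined_regex = re.compile(r'''
    first\ line   # a literal space must be escaped in verbose mode
    .*            # with re.DOTALL the dot also matches newlines
    END           # with re.IGNORECASE this also matches "end"
''', re.IGNORECASE | re.DOTALL | re.VERBOSE)

mo = combined_regex.search('First line\nsecond line\nend')
print(repr(mo.group()))
```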

Phone Number and E-mail Address Extractor

Create a Regex for Phone Numbers

First, create a regular expression that can be used to find phone numbers:

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?  # area code
    (\s|-|\.)?          # separator
    (\d{3})             # first 3 digits
    (\s|-|\.)           # separator
    (\d{4})             # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))? # extension
)''',re.VERBOSE)

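Create a Regex for E-mail Addresses

The username is one or more characters, each an uppercase or lowercase letter, digit, period, underscore, percent sign, plus sign, or hyphen
The username and the domain name are separated by @
The domain name may contain uppercase and lowercase letters, digits, periods, and hyphens
The ".com" part is matched as a period followed by two to four letters

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+   # username
    @                   # @ symbol
    [a-zA-Z0-9.-]+      # domain name
    (\.[a-zA-Z]{2,4})   # dot-something, such as .com
)''',re.VERBOSE | re.I | re.DOTALL) # combining re.VERBOSE | re.I | re.DOTALL

Full example: before running it, copy the docstring text near the top of the script to the clipboard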

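#!/usr/bin/python3

import re, pyperclip, pprint  # pyperclip is a third-party package

'''
Create a regular expression for phone numbers
Phone numbers look like this:
415-863-9990
571-664-8888
That is all of the phone numbers
Sample e-mail addresses:
info@nostarch.com
lily@apple.com
bobo@huawei.com
here is end
'''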

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?  # area code
    (\s|-|\.)?          # separator
    (\d{3})             # first 3 digits
    (\s|-|\.)           # separator
    (\d{4})             # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))? # extension
)''',re.VERBOSE)

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+   # username
    @                   # @ symbol
    [a-zA-Z0-9.-]+      # domain name
    (\.[a-zA-Z]{2,4})   # dot-something, such as .com
)''',re.VERBOSE | re.I | re.DOTALL) # combining re.VERBOSE | re.I | re.DOTALL

def findPhoneNumAndEmail():
    # get the text from the clipboard
    text = str(pyperclip.paste())
    matches = []
    for groups in phoneRegex.findall(text):
        phoneNum = '-'.join([groups[1], groups[3], groups[5]])
        if groups[8] != '':
            phoneNum += ' x' +groups[8]
        matches.append(phoneNum)
    #email
    for groups in emailRegex.findall(text):
        matches.append(groups[0])

    pprint.pprint(matches)

if __name__ == '__main__':
    findPhoneNumAndEmail()

Result:

['415-863-9990',
 '571-664-8888',
 'info@nostarch.com',
 'lily@apple.com',
 'bobo@huawei.com']
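
To try the extraction logic without the clipboard, the same two patterns can be wrapped in a function that takes the text as a parameter. This is only a sketch: extract_contacts is a hypothetical name, the sample string is made up, and the function returns the list instead of printing it.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code
    (\s|-|\.)?                      # separator
    (\d{3})                         # first 3 digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))? # extension
)''', re.VERBOSE)

email_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+   # username
    @                   # @ symbol
    [a-zA-Z0-9.-]+      # domain name
    (\.[a-zA-Z]{2,4})   # dot-something, such as .com
)''', re.VERBOSE | re.I | re.DOTALL)

def extract_contacts(text):
    """Hypothetical helper: return phone numbers, then e-mail addresses, found in text."""
    matches = []
    # Each findall() item is a tuple: index 0 is the whole match,
    # indexes 1, 3, 5 are the digit groups and index 8 is the extension digits.
    for groups in phone_regex.findall(text):
        phone_num = '-'.join([groups[1], groups[3], groups[5]])
        if groups[8] != '':
            phone_num += ' x' + groups[8]
        matches.append(phone_num)
    for groups in email_regex.findall(text):
        matches.append(groups[0])
    return matches

sample = 'Call 415-863-9990 or 571-664-8888 ext 123, or mail info@nostarch.com'
print(extract_contacts(sample))
```

Returning the list rather than printing it makes the function easy to check with assertions or to reuse with text from a file.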