Python Web Scraping Tutorial (3): Regular Expressions

python慕遥

Last modified: 2022-05-02 00:25:33

Column: Web Scraping Tutorial Series. Tags: regular expressions, web scraping, python

First published: 2022-05-01 15:00:00

Article link: https://blog.csdn.net/csdn1561168266/article/details/124521988

01 Character Reference Table

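Symbol	Meaning
.	Any single character except a newline; one "." matches exactly one character
\w+	One or more word characters (letters, digits, underscore)
\s+	One or more whitespace characters
\d+	One or more digits
\n	A newline character
\t	A tab character
^	Start anchor: ^\d requires the string to start with a digit
$	End anchor: \d$ requires the string to end with a digit
\W	Any character that is not a letter, digit, or underscore
\D	Any character that is not a digit
\S	Any character that is not whitespace
()	The expression inside the parentheses forms a group; (?P<hahaha>...) names the group hahaha so it can be extracted by name
[]	A custom character set: [a-zA-Z0-9] matches any one character in a-z, A-Z, or 0-9; [aeiouAEIOU] matches any one of the listed vowels
*	Repeat 0 or more times
+	Repeat 1 or more times
?	Repeat 0 or 1 time
{n,}	Repeat n or more times: "(Ha){3,}" matches 3 or more repetitions, "(Ha){,5}" matches 0 to 5 repetitions
.*	Greedy match: as many non-newline characters as possible; "1.*3" matches everything from the first 1 to the last 3 on the line
.*?	Lazy match: "1.*?3" matches from a 1 to the nearest following 3, taking the shortest possible span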
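To get a feel for the table, the entries can be tried one at a time with re.findall and re.search. The snippet below is only a minimal sketch: the sample strings and variable names are made up for illustration and do not come from the original post.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "id 42\tqty 7"

words = re.findall(r"\w+", text)    # runs of letters, digits, underscores
digits = re.findall(r"\d+", text)   # runs of digits
blanks = re.findall(r"\s", text)    # single whitespace characters
vowels = re.findall(r"[aeiouAEIOU]", "Regex")  # custom character set

# Anchors: ^ ties the match to the start, $ to the end of the string.
starts_with_digit = re.search(r"^\d", "7 days") is not None
ends_with_digit = re.search(r"\d$", "7 days") is not None

print(words)
print(digits)
print(blanks)
print(vowels)
print(starts_with_digit, ends_with_digit)
```

Because findall returns plain strings here, it is a convenient way to check what a single symbol actually matches before building a longer pattern.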

02 Using Regular Expressions

2.1 Four Functions of the re Module

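    import re

1° findall: returns every match as a list of strings

    a = re.findall(r"\d+", "my telephone is 15950005,1343222")
    print(a)

2° search: returns a match object; scanning stops at the first hit, so there is only one result

    c = re.search(r"\d+", "my telephone is 15950005,1343222")
    print(c.group())

3° match: matches from the start of the string, so the beginning itself must satisfy the pattern

    d = re.match(r"\d+", "12my telephone is 15950005,1343222")
    print(d.group())

4° finditer: returns an iterator of match objects, which is memory-efficient because matches are produced one at a time

    b = re.finditer(r"\d+", "my telephone is 15950005,1343222")
    for i in b:
        print(i.group())

· Result ·

    ['15950005', '1343222']
    15950005
    12
    15950005
    1343222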
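One detail worth checking by hand is that search and match both return None when nothing matches, so calling .group() directly can raise AttributeError. The following is a small sketch of that difference, reusing the sample sentence from above; the variable names are my own.

```python
import re

text = "my telephone is 15950005,1343222"

m_search = re.search(r"\d+", text)  # scans the whole string
m_match = re.match(r"\d+", text)    # only tries the very beginning

# Guard against None before calling .group().
first_number = m_search.group() if m_search else None
leading_number = m_match.group() if m_match else None

print(first_number)    # the first run of digits anywhere in the text
print(leading_number)  # None, because the text starts with a letter
```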

2.2 Improving on the Basic re Calls

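*- Precompile the regular expression: this keeps the code short and lets the pattern be reused

    import re

    obj = re.compile(r"\d+")
    e = obj.finditer("my telephone is 15950005,1343222")
    for it in e:
        print(it.group())

*- Get several groups at once

    obj = re.compile(r"(\d+),(\d+)")
    e = obj.search("my telephone is 15950005,1343222")
    print(e.groups())

· Result ·

    15950005
    1343222
    ('15950005', '1343222')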
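Groups can also be read one by one, and findall behaves differently once a pattern contains groups. The sketch below assumes a pattern with two numeric groups separated by a comma; the second sample string is invented for the example.

```python
import re

pair = re.compile(r"(\d+),(\d+)")

m = pair.search("my telephone is 15950005,1343222")
whole = m.group(0)   # the entire matched text
first = m.group(1)   # text captured by the first pair of parentheses
second = m.group(2)  # text captured by the second pair

# With groups in the pattern, findall returns a list of tuples.
pairs = pair.findall("1,2 and 30,40")

print(whole, first, second)
print(pairs)
```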

2.3 Using Quantifiers

1° "*": repeat 0 or more times

    batRegex = re.compile(r'Bat(wo)*man')
    mo1 = batRegex.search('The Adventures of Batman')
    mo2 = batRegex.search('The Adventures of Batwoman')
    mo3 = batRegex.search('The Adventures of Batwowowowoman')
    print(mo1.group())
    print(mo2.group())
    print(mo3.group())

· Result ·

    Batman
    Batwoman
    Batwowowowoman

2° "+": repeat 1 or more times

    batRegex = re.compile(r'Bat(wo)+man')
    mo1 = batRegex.search('The Adventures of Batman')
    mo2 = batRegex.search('The Adventures of Batwoman')
    mo3 = batRegex.search('The Adventures of Batwowowowoman')
    print(mo1 is None)  # "Batman" has no "wo", so there is no match
    print(mo2.group())
    print(mo3.group())

· Result ·

    True
    Batwoman
    Batwowowowoman

3° "?": repeat 0 or 1 time

    batRegex = re.compile(r'Bat(wo)?man')
    mo1 = batRegex.search('The Adventures of Batman')
    mo2 = batRegex.search('The Adventures of Batwoman')
    print(mo1.group())
    print(mo2.group())

· Result ·

    Batman
    Batwoman

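4° "{n}": repeat exactly n times. "(Ha){3,}" matches 3 or more repetitions and "(Ha){,5}" matches 0 to 5 repetitions; the example here uses "(Ha){3}", which matches exactly 3

    haRegex = re.compile(r'(Ha){3}')
    mo1 = haRegex.search('HaHaHa')
    mo2 = haRegex.search('Ha')
    print(mo1.group())
    print(mo2 is None)  # a single "Ha" is not enough

· Result ·

    HaHaHa
    True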
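The open-ended forms "(Ha){3,}" and "(Ha){,5}" mentioned above can be checked the same way. This is a minimal sketch with made-up input strings; note that these quantifiers are greedy by default, and appending "?" makes them take the fewest repetitions allowed.

```python
import re

at_least_three = re.compile(r"(Ha){3,}")
up_to_five = re.compile(r"(Ha){,5}")

a = at_least_three.search("HaHaHaHaHa").group()  # greedy: takes all five
b = at_least_three.search("HaHa")                # only two, so no match
c = up_to_five.search("HaHaHaHaHaHaHa").group()  # stops after five
d = up_to_five.search("xyz").group()             # zero repetitions: empty string
e = re.search(r"(Ha){3,5}?", "HaHaHaHaHa").group()  # lazy: stops at three

print(a)
print(b is None)
print(c)
print(repr(d))
print(e)
```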

2.4 Custom Group Names

1° Use (?P<group_name>.*?) to give a group a name

    import re

    s = r'''<li><a href="#"title="^-[1-9]\d*$">Match negative integers</a></li>
    <li><a href="#"title="^-?[1-9]\d*$">Match integers</a></li>
    <li><a href="#"title="^[1-9]\d*|0$">Match non-negative integers (positive integers + 0)</a></li>'''
    obj = re.compile(r'''<li><a href=".*?"title="(?P<hahaha>.*?)">.*?</a></li>''', re.S)
    f = obj.finditer(s)
    for it in f:
        print(it.group("hahaha"))

· Result ·

    ^-[1-9]\d*$
    ^-?[1-9]\d*$
    ^[1-9]\d*|0$
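When a pattern has several named groups, Match.groupdict() collects them all into a dictionary, which is handy when scraping links. The sketch below uses an invented HTML fragment and invented group names (url, title, text); it is an illustration, not part of the original example.

```python
import re

# Hypothetical HTML fragment, for illustration only.
html = '<a href="/page/2" title="Next page">next</a>'

link = re.compile(
    r'<a href="(?P<url>.*?)" title="(?P<title>.*?)">(?P<text>.*?)</a>'
)
m = link.search(html)

info = m.groupdict()  # every named group as a dict
url = m.group("url")  # or fetch a single group by name

print(info)
print(url)
```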

2° The difference between greedy matching (.*) and non-greedy matching (.*?)

    greedyRegex = re.compile(r'<.*>')
    nongreedyRegex = re.compile(r'<.*?>')
    mo1 = greedyRegex.search('<To serve man> for dinner.>')
    mo2 = nongreedyRegex.search('<To serve man> for dinner.>')
    print(mo1.group())
    print(mo2.group())

· Result ·

    <To serve man> for dinner.>
    <To serve man>
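The "1.*3" and "1.*?3" entries from the reference table show the same contrast. Here is a short sketch with a made-up string that contains two 1...3 spans.

```python
import re

text = "1a3b1c3"

greedy = re.search(r"1.*3", text).group()  # first 1 through the last 3
lazy = re.search(r"1.*?3", text).group()   # first 1 through the nearest 3
all_lazy = re.findall(r"1.*?3", text)      # every shortest 1...3 span

print(greedy)
print(lazy)
print(all_lazy)
```

For scraping, the lazy form is usually the safer default, since a greedy ".*" tends to swallow everything between the first opening tag and the last closing tag.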

2.5 Replacing Text with sub()

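*- The first argument is the string that replaces each match found.
*- The second argument is the string to search; the regular expression itself is the compiled pattern that sub() is called on.

    namesRegex = re.compile(r'Agent \w+')
    a = namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
    print(a)

*- Sometimes you need the matched text itself as part of the replacement.
*- In the first argument of sub(), you can write \1, \2, \3, ... to mean "insert the text of group 1, 2, 3, ... into the replacement".

    agentNamesRegex = re.compile(r'Agent (\w)\w*')
    b = agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
    print(b)

· Result ·

    CENSORED gave the secret documents to CENSORED.
    A**** told C**** that E**** knew B**** was a double agent.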
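sub() also accepts a function as the replacement and an optional count argument that limits how many matches are replaced. The sketch below is one possible variation on the agent example; the helper name mask and the sample sentence are my own.

```python
import re

def mask(match):
    """Keep the first letter of the name and star out the rest."""
    name = match.group(1)
    return name[0] + "*" * (len(name) - 1)

agent = re.compile(r"Agent (\w+)")
text = "Agent Alice met Agent Bob."

masked = agent.sub(mask, text)                      # replacement computed per match
first_only = agent.sub("CENSORED", text, count=1)   # replace only the first match

print(masked)
print(first_only)
```

A function is useful when the replacement depends on the matched text in a way that \1-style references cannot express, such as the name-length-dependent masking here.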

03 The Second Argument of re.compile()

3.1 re.DOTALL

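*- ".*" matches every character except a newline; adding re.DOTALL makes the dot match newlines as well

    noNewlineRegex = re.compile('.*')
    newlineRegex = re.compile('.*', re.DOTALL)
    print(noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group())
    print(newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group())

· Result ·

    Serve the public trust.
    Serve the public trust.
    Protect the innocent.
    Uphold the law.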

3.2 re.I

*- Match letters without caring whether they are uppercase or lowercase

    robocop = re.compile(r'robocop', re.I)
    print(robocop.search('RoboCop is part man, part machine, all cop.').group())
    print(robocop.search('ROBOCOP protects the innocent.').group())

· Result ·

    RoboCop
    ROBOCOP
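Flags can be combined with the bitwise OR operator "|" when more than one is needed. A minimal sketch, with an invented two-line string, combining re.I and re.DOTALL:

```python
import re

# Hypothetical two-line text, for illustration only.
text = "RoboCop serves.\nHe upholds the LAW."

both = re.compile(r"robocop.*law", re.I | re.DOTALL)
case_only = re.compile(r"robocop.*law", re.I)

m_both = both.search(text)       # dot may cross the newline
m_case = case_only.search(text)  # dot stops at the newline, so no match

print(repr(m_both.group()))
print(m_case is None)
```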

3.3 re.VERBOSE

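*- Matching complicated text patterns may require long, hard-to-read regular expressions.
*- re.VERBOSE tells re.compile() to ignore whitespace and comments inside the pattern string, which eases the problem.

    phoneRegex = re.compile(r'''(
        (\d{3}|\(\d{3}\))?             # area code
        (\s|-|\.)?                     # separator
        \d{3}                          # first 3 digits
        (\s|-|\.)                      # separator
        \d{4}                          # last 4 digits
        (\s*(ext|x|ext\.)\s*\d{2,5})?  # extension
        )''', re.VERBOSE)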
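To see the verbose pattern at work, it can be run against some text with finditer. The sketch below repeats the pattern so that it is self-contained (with the dot in "ext." escaped); the sample sentence and phone numbers are made up for illustration.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?             # area code, with or without parentheses
    (\s|-|\.)?                     # separator
    \d{3}                          # first 3 digits
    (\s|-|\.)                      # separator
    \d{4}                          # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?  # optional extension
    )''', re.VERBOSE)

# Hypothetical sample text.
text = "Call 415-555-1011 or (415) 555-9999 ext 123."

# group(0) is the whole match; findall would return tuples of groups instead.
numbers = [m.group(0) for m in phone_regex.finditer(text)]
print(numbers)
```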
