Section 4: Regular Expressions

无根树浮生时

Last modified 2023-12-04 19:00:45

Category: Web Scraping Study. Tags: regular expressions, web scraping

First published 2023-11-22 21:29:42

Article link: https://blog.csdn.net/m0_50802620/article/details/134541249

Regular Expressions

1. Common matching rules
- 1.1 Testing tool
2. Common matching methods

1. Common matching rules

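\w: matches a letter, digit, or underscore
\W: matches any character that is not a letter, digit, or underscore
\s: matches any whitespace character, equivalent to [ \t\n\r\f\v]
\S: matches any non-whitespace character
\d: matches any digit, equivalent to [0-9]
\D: matches any non-digit character
\A: matches the start of the string
\Z: matches the end of the string; if the string ends with a newline, matches just before that newline (in Python's re, \Z matches only at the very end of the string)
\z: matches the very end of the string, after any trailing newline (older versions of Python's re do not recognise \z)
\G: matches the position where the previous match ended (not supported by Python's re)
\n: matches a newline character
\t: matches a tab character
^: matches the start of a line of text
$: matches the end of a line of text
.: matches any character except a newline; when the re.DOTALL flag is specified, it matches any character including a newline
[...]: denotes a set of characters listed individually, e.g. [amk] matches a, m, or k
[^...]: matches any character not in the brackets, e.g. [^abc] matches any character other than a, b, or c
*: matches 0 or more repetitions of the preceding expression
+: matches 1 or more repetitions of the preceding expression
?: matches 0 or 1 occurrence of the preceding expression; placed after another quantifier (*?, +?, {n,m}?) it makes that quantifier non-greedy
{n}: matches exactly n repetitions of the preceding expression
{n,m}: matches n to m repetitions of the preceding expression, greedily
a|b: matches a or b
(): matches the expression inside the parentheses and captures it as a group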
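
A few of the rules above can be tried out directly with Python's `re.findall`. This is only a small sketch; the sample strings are made up for illustration.

```python
import re

text = "id_42: a1 b22 c333"  # made-up sample text

# \w+ matches runs of letters, digits and underscores
words = re.findall(r"\w+", text)
# \d{2,3} matches runs of two to three digits
digits = re.findall(r"\d{2,3}", text)
# [abc]\d+ matches one of a, b, c followed by one or more digits
tagged = re.findall(r"[abc]\d+", text)
# colou?r: the ? makes the preceding "u" optional
spellings = re.findall(r"colou?r", "color colour")

print(words)      # ['id_42', 'a1', 'b22', 'c333']
print(digits)     # ['42', '22', '333']
print(tagged)     # ['a1', 'b22', 'c333']
print(spellings)  # ['color', 'colour']
```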

1.1 Testing tool

Open the OSChina online regular expression testing tool: http://tool.oschina.net/regex

2. Common matching methods

2.1 match

Usage: pass it the regular expression and the string to match, and it checks whether the regular expression matches the string.

## match: tries to match the regular expression from the start of the string; returns a match object on success, or None on failure
import re
content = 'hello 123 4567 world is_beautiful welcome to here'
print(len(content))  # length of the string: 49
result = re.match(r'^hello\s\d\d\d\s\d{4}\s\w{5}\s\w{12}\s\w{7}\s\w{2}\s\w{4}', content)  # first argument: the regular expression; second argument: the string to match
print(result)  # <re.Match object; span=(0, 49), match='hello 123 4567 world is_beautiful welcome to here'>
print(result.group())  # hello 123 4567 world is_beautiful welcome to here
print(result.span())  # (0, 49)
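
Because `match` only tries the pattern at the start of the string, a pattern that occurs later in the text yields `None`. The sketch below illustrates this, along with a simple guard before calling `group()`; the variable names are illustrative only.

```python
import re

content = 'hello 123 4567 world'

at_start = re.match(r'hello', content)      # 'hello' begins at index 0
not_at_start = re.match(r'world', content)  # 'world' is not at the start

print(at_start.span())  # (0, 5)
print(not_at_start)     # None

# Calling .group() on None raises AttributeError, so check first
found = not_at_start.group() if not_at_start else 'no match'
print(found)            # no match
```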

2.1.1 Matching a target

### Matching a target
### Wrap the substring you want to extract in parentheses, then call group() with the index of that group

import re

content = 'hello 1234567 world_this is a good people'
result = re.match(r'^hello\s(\d+)\sworld', content)
print(result)  # <re.Match object; span=(0, 19), match='hello 1234567 world'>
print(result.group())  # hello 1234567 world
print(result.group(1))  # 1234567
print(result.span())  # (0, 19)
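
When a pattern contains more than one pair of parentheses, the groups are numbered from left to right. As a small sketch reusing the same sample string, `groups()` returns all captured groups at once and `span(n)` gives the position of group n:

```python
import re

content = 'hello 1234567 world_this is a good people'
result = re.match(r'^hello\s(\d+)\s(\w+)', content)

first = result.group(1)       # the digits
second = result.group(2)      # the word after the digits
all_groups = result.groups()  # every captured group as a tuple
first_span = result.span(1)   # start and end index of group 1

print(first)       # 1234567
print(second)      # world_this
print(all_groups)  # ('1234567', 'world_this')
print(first_span)  # (6, 13)
```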

2.1.2 General matching

General matching (greedy matching): matches as many characters as possible

## General matching (greedy matching): .* matches any sequence of characters

import re

content = 'hello 1234567 world_this is a good people'
result = re.match(r'^hello.*people$', content)
print(result)  # <re.Match object; span=(0, 41), match='hello 1234567 world_this is a good people'>
print(result.group())  # hello 1234567 world_this is a good people
print(result.span())  # (0, 41)

2.1.3 Greedy and non-greedy matching

Non-greedy matching: matches as few characters as possible. At the end of a pattern a non-greedy quantifier may match nothing at all, so it is best used in the middle of a pattern.

## Non-greedy matching: .*?
import re

content = 'hello 1234567 world_this is a good people'
result = re.match(r'^he.*?(\d+).*people$', content)
print(result)  # <re.Match object; span=(0, 41), match='hello 1234567 world_this is a good people'>
print(result.group())  # hello 1234567 world_this is a good people
print(result.group(1))  # 1234567
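
The difference is easiest to see by running the greedy and non-greedy versions side by side on the same string. In the sketch below, the greedy `.*` swallows most of the digits and leaves only the last one for `(\d+)`, while a non-greedy group at the very end of the pattern captures nothing:

```python
import re

content = 'hello 1234567 world_this is a good people'

# Greedy: .* takes as much as it can, leaving a single digit for (\d+)
greedy = re.match(r'^he.*(\d+).*people$', content)
# Non-greedy: .*? stops as early as possible, so (\d+) gets every digit
lazy = re.match(r'^he.*?(\d+).*people$', content)
# Non-greedy group at the end of the pattern: matches the empty string
trailing = re.match(r'^hello\s\d+\s(.*?)', content)

print(greedy.group(1))          # 7
print(lazy.group(1))            # 1234567
print(repr(trailing.group(1)))  # ''
```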

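2.1.4 Modifiers

re.I : makes matching case-insensitive
re.L : makes matching locale-aware
re.M : multi-line matching; affects ^ and $
re.S : makes . match every character, including newlines
re.U : interprets characters according to the Unicode character set; affects \w, \W, \b, \B
re.X : allows a more flexible layout so that regular expressions can be written in a more readable form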

### Modifiers
import re

content = 'hello 1234567 \nworld_this is a good people'  # the string contains a newline
result = re.match(r'^he.*?(\d+).*people$', content, re.S)  # re.S lets . match the newline
print(result)  # <re.Match object; span=(0, 42), match='hello 1234567 \nworld_this is a good people'>
print(result.group())
# hello 1234567 
# world_this is a good people
print(result.group(1))  # 1234567
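
The other modifiers can be tried in the same way. The following is a minimal sketch with made-up sample text, showing `re.I`, `re.M` (combined with `|`), and `re.X`:

```python
import re

text = 'Hello\nhello world'  # made-up two-line sample

# re.I: ignore case
ignore_case = re.findall(r'hello', text, re.I)
# Without re.M, ^ only matches at the start of the whole string
single_line = re.findall(r'^hello', text, re.I)
# With re.M, ^ also matches at the start of each line
multi_line = re.findall(r'^hello', text, re.I | re.M)

# re.X: whitespace and # comments inside the pattern are ignored
date_pattern = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
''', re.X)
date_parts = date_pattern.match('2023-11').groups()

print(ignore_case)  # ['Hello', 'hello']
print(single_line)  # ['Hello']
print(multi_line)   # ['Hello', 'hello']
print(date_parts)   # ('2023', '11')
```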

2.1.5 Escape characters

If the text to match contains a special character such as . it must be escaped as \.

### Escaped matching
import re

content = '(百度)www.baidu.com'
result = re.match(r'\(百度\)www\.baidu\.com', content)
print(result)  # <re.Match object; span=(0, 17), match='(百度)www.baidu.com'>
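
An unescaped `.` still matches the literal dot, but it also matches any other character, which can hide bugs. As a small sketch, the standard `re.escape` function can build the escaped pattern automatically instead of adding the backslashes by hand:

```python
import re

literal = '(百度)www.baidu.com'

# re.escape adds the backslashes needed to match the text literally
escaped = re.escape(literal)
literal_ok = re.match(escaped, literal) is not None

# An unescaped dot matches any character, so this wrong text also matches
loose = re.match(r'www.baidu.com', 'wwwXbaiduYcom') is not None
# With escaped dots the wrong text is rejected
strict = re.match(r'www\.baidu\.com', 'wwwXbaiduYcom') is not None

print(literal_ok)  # True
print(loose)       # True
print(strict)      # False
```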

2.2 search

search scans the whole string while matching and returns the first successful match

## Goal: use search to fetch a classic poem line
import requests
import re
url = 'https://so.gushiwen.cn/mingjus/'
resp = requests.get(url).text
# print(resp)  # the page source
pattern = '<a style=" float:left;".*?>(.*?)</a>.*?<span style=" color:#65645F;.*?>(.*?)</span><a style=" float:left;".*?>(.*?)</a>'
result = re.search(pattern, resp, re.S)  # the three (.*?) groups are numbered 1, 2, 3 in order
print(result.group(1), result.group(2), result.group(3))  # 东南形胜，三吴都会，钱塘自古繁华。烟柳画桥，风帘翠幕，参差十万人家。 —— 柳永《望海潮·东南形胜》
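
The difference between `match` and `search` can also be checked offline. The sketch below uses a made-up HTML fragment rather than the live page, so the markup and names are assumptions for illustration only:

```python
import re

# Hypothetical HTML fragment standing in for a downloaded page
html = '<div><a href="/a">first line</a><span>author</span></div>'
link_pattern = r'<a.*?>(.*?)</a>'

# match fails: the string starts with <div>, not <a
from_start = re.match(link_pattern, html)
# search scans forward and finds the first <a ...>...</a>
anywhere = re.search(link_pattern, html)

print(from_start)         # None
print(anywhere.group(1))  # first line
```

If the page layout changes, `search` also returns `None`, so it is worth checking the result before calling `group()`.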

2.3 findall

findall extracts every match. It returns a list, so iterate over it to get each group of results in turn.

## Use findall to fetch classic poem lines
import requests
import re

url = 'https://so.gushiwen.cn/mingjus/'
resp = requests.get(url).text
# print(resp)  # the page source
pattern = '<a style=" float:left;".*?>(.*?)</a>.*?<span style=" color:#65645F;.*?>(.*?)</span><a style=" float:left;".*?>(.*?)</a>'
results = re.findall(pattern, resp, re.S)
print(type(results))  # <class 'list'>
# print(results)
# Iterate to get each group of results
for result in results:
    # print(result)
    # Join the three captured parts into the complete line
    all_result = result[0] + result[1] + result[2]
    print(all_result)  # 东南形胜，三吴都会，钱塘自古繁华。烟柳画桥，风帘翠幕，参差十万人家。——柳永《望海潮·东南形胜》......
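
The shape of the list returned by `findall` depends on the number of groups in the pattern: with one group it is a list of strings, with several groups a list of tuples. A minimal offline sketch, using made-up HTML instead of the real page:

```python
import re

# Hypothetical HTML standing in for the downloaded page source
html = '''
<li><a href="/1">line one</a><span>author A</span></li>
<li><a href="/2">line two</a><span>author B</span></li>
'''

# Two groups: each list item is a tuple (text, author)
pairs = re.findall(r'<a href=".*?">(.*?)</a><span>(.*?)</span>', html, re.S)
# One group: each list item is a plain string
authors = re.findall(r'<span>(.*?)</span>', html)

print(pairs)    # [('line one', 'author A'), ('line two', 'author B')]
print(authors)  # ['author A', 'author B']

for text, author in pairs:
    print(text, '-', author)
```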

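2.4 sub

Remove the digits from a piece of text

### sub: modifies text by replacing matches; here it strips the digits and keeps the letters
import re
content = '1a2b3c4d5e6f7g'
result = re.sub(r'\d+', '', content)  # first argument: pattern matching every run of digits; second: the replacement string (may be empty); third: the original string
print(result)  # abcdefg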
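
Beyond deleting matches, `re.sub` also accepts a `count` argument to limit the number of replacements and backreferences such as `\1` in the replacement string. A short sketch on the same sample string:

```python
import re

content = '1a2b3c4d5e6f7g'

letters_only = re.sub(r'\d+', '', content)       # remove digits
digits_only = re.sub(r'[a-z]+', '', content)     # remove letters
first_two = re.sub(r'\d+', '-', content, count=2)  # replace only the first two matches
swapped = re.sub(r'(\d)([a-z])', r'\2\1', content)  # swap each digit/letter pair

print(letters_only)  # abcdefg
print(digits_only)   # 1234567
print(first_two)     # -a-b3c4d5e6f7g
print(swapped)       # a1b2c3d4e5f6g7
```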

2.5 compile

compile turns a regular expression string into a regular expression object so that it can be reused in later matches

### compile: flags can also be passed here, so they need not be passed again to the other methods
import re
content1 = '2023-11-08 11:12'
content2 = '2023-11-08 11:23'
content3 = '2023-11-08 11:24'
pattern = re.compile(r'\d{2}:\d{2}')
result1 = re.sub(pattern, '', content1)
result2 = re.sub(pattern, '', content2)
result3 = re.sub(pattern, '', content3)
print(result1, result2, result3)  # 2023-11-08  2023-11-08  2023-11-08
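
A compiled pattern object also has its own `sub`, `search` and `findall` methods, and flags given to `compile` apply to every call. The sketch below is one possible variation; the `\s*` prefix, added here to strip the space before the time as well, is an assumption and not part of the original pattern.

```python
import re

contents = ['2023-11-08 11:12', '2023-11-08 11:23', '2023-11-08 11:24']

# \s* also removes the space before the time (an addition for illustration)
time_pattern = re.compile(r'\s*\d{2}:\d{2}')
dates = [time_pattern.sub('', c) for c in contents]

# The flag is passed once to compile and reused by every method call
hello_pattern = re.compile(r'hello', re.I)
greeting = hello_pattern.search('Say HELLO').group()

print(dates)     # ['2023-11-08', '2023-11-08', '2023-11-08']
print(greeting)  # HELLO
```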
