[Crawler Basics] Lecture 12: Using Regular Expressions in Web Crawlers

娜年花开666

Last modified: 2024-04-09 13:11:30

Columns: # 爬虫基础 (Crawler Basics) # Python语音基础与测试框架 Tags: crawler, regular expressions, python

First published: 2024-03-29 10:06:49

Article link: https://blog.csdn.net/a272329874a/article/details/137136038

Regular expressions have many uses in a web crawler. Some common scenarios:

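Data extraction: a crawler usually needs to pull specific data out of a page. A regular expression can match a pattern and capture the data you want, such as titles, links, or prices.
URL matching: a crawler needs to collect URLs from a page. A regular expression can match URLs against a specific pattern to decide whether a given URL should be crawled.
Data cleaning: scraped data sometimes contains unwanted characters or tags. A regular expression can strip them out.
Validation: sometimes you need to check whether a string follows a given format, such as an e-mail address or a phone number. A regular expression can perform that check.
HTML parsing: crawling often involves picking apart HTML. A regular expression can find and extract tags or attributes.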
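
The cleaning and validation scenarios above are not demonstrated later in this post, so here is a minimal sketch of both. The scraped snippet, the `looks_like_email` helper, and the e-mail pattern are all assumptions made for illustration; the pattern only checks the general shape of an address and is not a complete validator.

```python
import re

# Hypothetical snippet of scraped markup, used only for illustration.
raw = '<p class="price">  Price:&nbsp;<b>199</b> yuan </p>'

# Data cleaning: drop tags, replace the &nbsp; entity, collapse whitespace.
no_tags = re.sub(r'<[^>]+>', '', raw)
cleaned = re.sub(r'\s+', ' ', no_tags.replace('&nbsp;', ' ')).strip()

# Validation: a deliberately simple e-mail shape check (an assumption, not RFC-complete).
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(\.[\w-]+)+')


def looks_like_email(value):
    """Return True if the whole string has a plausible e-mail shape."""
    return EMAIL_RE.fullmatch(value) is not None


print(cleaned)                                # Price: 199 yuan
print(looks_like_email('user@example.com'))   # True
print(looks_like_email('not-an-email'))       # False
```

`re.fullmatch` is used for validation because the whole string has to fit the pattern, not just a prefix of it.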

Below are several commonly used functions from the re module:

The match method

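import re

# Named text rather than str so the built-in str type is not shadowed
text = 'I say hello to the world4.0'
print('---------------------match(pattern, string to match against)------------------------')
# re.match only tries to match at the very start of the string;
# if the pattern does not match there, it returns None
m1 = re.match(r'I', text)        # matches 'I'
m2 = re.match(r'\w', text)       # matches 'I' (a word character)
m3 = re.match(r'\S', text)       # matches 'I' (a non-whitespace character)
m4 = re.match(r'\D', text)       # matches 'I' (a non-digit character)
m5 = re.match(r'say', text)      # None: 'say' is not at the start of the string
m6 = re.match(r'I say', text)    # matches 'I say'
# print(m6)
m7 = re.match(r'I (say)', text)  # group() is 'I say', group(1) is 'say'

# Check that the match object is not None before using it; otherwise you get
# AttributeError: 'NoneType' object has no attribute 'group'

# if m7:
#     print(m7.group())
# else:
#     print('m7-1 no match found')

# if m7:
#     print(m7.group(1))
# else:
#     print('m7-2 no match found')

# Greedy mode: \w* takes as many characters as it can, so group(1) is 'say'
m8 = re.match(r'I (s\w*)', text)
# if m8:
#     print(m8.group())
# else:
#     print('m8-1 no match found')

# if m8:
#     print(m8.group(1))
# else:
#     print('m8-2 no match found')

# Non-greedy mode: \w*? takes as few characters as it can, so group(1) is 's'
m9 = re.match(r'I (s\w*?)', text)
# if m9:
#     print(m9.group())
# else:
#     print('m9 no match found')

if m9:
    print(m9.group(1))
else:
    print('m9-2 no match found')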
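
Since every use of a match object needs the same None check, one option is to wrap it in a small helper. The sketch below is only an illustration: `first_group` is a hypothetical name, not part of the re module.

```python
import re


def first_group(pattern, text, default=None):
    """Return group(1) of a match anchored at the start of text, or default."""
    m = re.match(pattern, text)
    return m.group(1) if m else default


text = 'I say hello to the world4.0'

greedy = first_group(r'I (s\w*)', text)             # \w* takes as many characters as it can
lazy = first_group(r'I (s\w*?)', text)              # \w*? takes as few characters as it can
missing = first_group(r'(say)', text, default='')   # 'say' is not at the start of the string

print(repr(greedy), repr(lazy), repr(missing))      # 'say' 's' ''
```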

The search method

import re

text = 'I say hello to the world4.0'
print('---------------------search(pattern, string to search)------------------------')
# re.search scans forward from any position and returns only the first match
# Use search when you need a single match
s1 = re.search(r'\D', text)        # I
s2 = re.search(r's\w+', text)      # say
s3 = re.search(r'y', text)         # y
s4 = re.search(r'w\w+', text)      # world4
s5 = re.search(r'w\w+\.\d', text)  # world4.0 (the dot is escaped so it only matches a literal '.')
# print(s5.group())
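
To make the difference between match and search concrete, the sketch below runs the same pattern through both. The pattern and its group names (`name`, `version`) are assumptions chosen for this example string.

```python
import re

text = 'I say hello to the world4.0'

# Hypothetical pattern: lowercase letters immediately followed by a dotted version number.
pattern = r'(?P<name>[a-z]+)(?P<version>\d+\.\d+)'

anchored = re.match(pattern, text)   # None: the string does not start with the pattern
found = re.search(pattern, text)     # scans forward until it reaches 'world4.0'

if found:
    # Named groups can be read by name; span() gives the start and end offsets.
    print(found.group('name'), found.group('version'), found.span())
```

Named groups are optional, but they keep the extraction code readable once a pattern has more than one or two capture groups.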

The findall method

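import re

text = 'I say hello to the world4.0'

print('---------------------findall(pattern, string to search)------------------------')
# re.findall scans from any position and returns every match as a list
# Use findall when you need multiple matches
f1 = re.findall(r'l', text)      # ['l', 'l', 'l']
f2 = re.findall(r'world', text)  # ['world']
f3 = re.findall(r'lll', text)    # [] (no match gives an empty list, not None)
# print(f3)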
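
What findall returns depends on the capture groups in the pattern, which is easy to trip over. The sketch below shows the no-group and multi-group cases, plus `re.finditer` for when positions are needed; the `params` string is made up for illustration.

```python
import re

text = 'I say hello to the world4.0'
# Hypothetical key=value string, for example taken from a query string or cookie.
params = 'id=42; name=spider; depth=3'

words = re.findall(r'[A-Za-z]+', text)       # no group: a list of whole matches
pairs = re.findall(r'(\w+)=(\w+)', params)   # two groups: a list of tuples
# finditer yields match objects, so offsets are available as well.
runs = [(m.group(), m.start()) for m in re.finditer(r'l+', text)]

print(words)   # ['I', 'say', 'hello', 'to', 'the', 'world']
print(pairs)   # [('id', '42'), ('name', 'spider'), ('depth', '3')]
print(runs)    # [('ll', 8), ('l', 22)]
```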

The sub method

import re

text = 'I say hello to the world4.0'

print('---------------------sub(pattern, replacement, original string)------------------------')
su1 = re.sub(r'world', 'World', text)         # I say hello to the World4.0
# Raw string avoids an invalid-escape warning for \w; the dot is escaped to match a literal '.'
su2 = re.sub(r'w\w+\.\d', 'World_day', text)  # I say hello to the World_day
# print(su2)
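
Beyond fixed replacement strings, `re.sub` also accepts backreferences, a replacement function, and a `count` limit, and `re.subn` reports how many substitutions were made. The sketch below illustrates each; the date string is a made-up input.

```python
import re

text = 'I say hello to the world4.0'

# Backreferences: reorder captured groups (hypothetical date string).
date = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', '2024-03-29')

# Replacement function: compute the new text from each match object.
bumped = re.sub(r'\d+\.\d+', lambda m: str(float(m.group()) + 1), text)

# count limits the number of replacements; subn also returns how many were made.
first_only = re.sub(r'l', 'L', text, count=1)
replaced, n = re.subn(r'l', 'L', text)

print(date)          # 29/03/2024
print(bumped)        # I say hello to the world5.0
print(first_only)    # I say heLlo to the world4.0
print(replaced, n)   # I say heLLo to the worLd4.0 3
```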

Regular expressions in actual crawler use

import re

print('---------------------test()------------------------')
html = '<div><a class="title" href="http://www.baidu.com">百度</a></div>'
# Everything is literal; only the link text is captured
t1 = re.findall(r'<div><a class="title" href="http://www.baidu.com">(百度)</a></div>', html)  # ['百度']
# Capture any run of Chinese characters as the link text
t2 = re.findall(r'<div><a class="title" href="http://www.baidu.com">([\u4e00-\u9fa5]+)</a></div>', html)  # ['百度']
# Capture both the URL and the Chinese link text
t3 = re.findall(r'<div><a class="title" href="(.+)">([\u4e00-\u9fa5]+)</a></div>', html)  # [('http://www.baidu.com', '百度')]
# No capture group: findall returns the whole matched string
t4 = re.findall(r'<div><a class="title" href=".+">.+</a></div>', html)  # ['<div><a class="title" href="http://www.baidu.com">百度</a></div>']
# Capture the URL and any link text
t5 = re.findall(r'<div><a class="title" href="(.+)">(.+)</a></div>', html)  # [('http://www.baidu.com', '百度')]
print(t5)

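Note that when using regular expressions in a crawler, it pays to follow a few best practices, such as preferring non-greedy quantifiers where possible and using anchors and boundary assertions (for example ^, $ and \b), to improve both the accuracy and the performance of your patterns.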
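
To see why these practices matter, the sketch below compares greedy and non-greedy matching on markup that holds two links, and shows the effect of the `\b` word boundary. Both input strings are made up for illustration.

```python
import re

# Hypothetical markup with two links on a single line.
html = ('<a class="title" href="http://www.baidu.com">百度</a>'
        '<a class="title" href="https://sports.qq.com/">腾讯体育</a>')

greedy = re.findall(r'href="(.+)">', html)     # .+ runs on to the last '">' in the string
lazy = re.findall(r'href="(.+?)">', html)      # .+? stops at the first '">' after each href
negated = re.findall(r'href="([^"]+)"', html)  # a negated class cannot cross a quote at all

print(len(greedy))   # 1 (one oversized match that swallows both links)
print(lazy)          # ['http://www.baidu.com', 'https://sports.qq.com/']

# Word boundaries: only match 42 when it stands alone.
prices = 'price 42, item42, 42px'
loose = re.findall(r'42', prices)       # matches inside item42 and 42px as well
exact = re.findall(r'\b42\b', prices)   # ['42']
print(len(loose), exact)
```

The negated character class `[^"]+` is often a good alternative to `.+?` for attribute values, since it states directly what the value may not contain.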

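Hands-on example

Scraping a news site can be done with Python's crawling libraries; here requests fetches the page and re extracts the headlines.

Implementation: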

import re

import requests
from fake_useragent import UserAgent

url = 'https://sports.qq.com/'
# Send a browser-like User-Agent header generated by fake_useragent
headers = {'User-Agent': UserAgent().chrome}
resp = requests.get(url, headers=headers)
print(resp.text)
# Start from one literal <li> element copied from the page source:
# regx='<li><a target="_blank" href="https://new.qq.com/rain/a/20240328V08NAR00" class=""  dt-imp-once="true" dt-eid="em_item_article" dt-params="article_id=20240328V08NAR00&article_type=56&article_url=https://new.qq.com/rain/a/20240328V08NAR00&dt_element_path=[\'em_item_article\',\'em_content_card\']">竞者 | 他是将世界纪录尘封29年的三级跳之神！四朝元老终获命运垂青</a></li>'
# . matches any character, + means one or more times, ? after a quantifier makes it non-greedy,
# * means zero or more of the preceding element
# Generalize the URL parts
# regx='<li><a target="_blank" href=".+?" class=""  dt-imp-once="true" dt-eid="em_item_article" dt-params=".+?">竞者 | 他是将世界纪录尘封29年的三级跳之神！四朝元老终获命运垂青</a></li>'
# Generalize the title
# regx='<li><a target="_blank" href=".+?" class=".*?"  dt-imp-once="true" dt-eid="em_item_article" dt-params=".+?">.+?</a></li>'
# Final pattern: capture the title in a group
regx = '<li><a target="_blank" href=".+?" class=".*?"  dt-imp-once="true" dt-eid="em_item_article" dt-params=".+?">(.+?)</a></li>'
datas = re.findall(regx, resp.text)
print(datas)
for d in datas:
    print(d)
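
Because the live page can change at any time, it can help to try the pattern on a saved fragment before sending requests. The sketch below does that offline; the `sample` markup is modeled on the list item quoted in the comments above, with shortened attributes and placeholder titles, and `TITLE_RE` is a hypothetical variant of the pattern that also captures the URL.

```python
import re

# Sample list items modeled on the quoted markup; the second item is invented.
sample = (
    '<li><a target="_blank" href="https://new.qq.com/rain/a/20240328V08NAR00" class=""  '
    'dt-imp-once="true" dt-eid="em_item_article" dt-params="article_id=20240328V08NAR00">'
    'Title A</a></li>\n'
    '<li><a target="_blank" href="https://new.qq.com/rain/a/EXAMPLE" class="hot"  '
    'dt-imp-once="true" dt-eid="em_item_article" dt-params="article_id=EXAMPLE">'
    'Title B</a></li>'
)

# Compiled once and reused; [^"]+ keeps each attribute value inside its own quotes.
TITLE_RE = re.compile(
    r'<li><a target="_blank" href="(?P<url>[^"]+)" class="[^"]*"\s+'
    r'dt-imp-once="true" dt-eid="em_item_article" dt-params="[^"]+">'
    r'(?P<title>.+?)</a></li>'
)

items = [(m.group('url'), m.group('title')) for m in TITLE_RE.finditer(sample)]
for item_url, title in items:
    print(title, item_url)
```

Using `\s+` between attributes instead of a fixed number of spaces makes the pattern a little less sensitive to whitespace changes in the page source.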

Output: [screenshot of the execution result]

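With the code above you can fetch the news headlines from the Tencent news page and print them. From there you can process the data further to suit your needs, for example by saving it to a database or a file.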
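
As one possible way to save the results to a file, the sketch below writes a list of titles to a CSV file with the standard csv module. `save_titles`, the output path, and the placeholder titles are assumptions for illustration; in practice the list would be the `datas` returned by `re.findall`.

```python
import csv
import os
import tempfile


def save_titles(titles, path):
    """Write one title per row to a UTF-8 CSV file and return the row count."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['index', 'title'])
        for i, title in enumerate(titles, start=1):
            writer.writerow([i, title.strip()])
    return len(titles)


# Placeholder data standing in for the list returned by re.findall.
datas = ['Sample headline one', ' Sample headline two ']
out_path = os.path.join(tempfile.gettempdir(), 'titles.csv')
saved = save_titles(datas, out_path)
print(saved, 'titles written to', out_path)
```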