The requests_html module

写进メ诗的结尾。

Category: Python Web Crawlers
Tags: html, python, front-end, crawler

Article link: https://blog.csdn.net/weixin_48158964/article/details/125952381

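requests_html is an enhanced version of the requests library that supports JavaScript rendering, CSS and XPath selectors, automatic redirects, and more. It makes parsing HTML convenient: the find and xpath methods extract data such as page links and prices using CSS and XPath selectors. It also supports asynchronous requests, which can make a crawler more efficient. The examples show how to fetch and parse a page and how to use XPath selectors to extract key information such as housing prices.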


requests_html (an enhanced version of requests)

Installation: pip install requests_html (supports Python 3.6 and above)

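The official description of the library's main features:

Full JavaScript support! (The docs put extra emphasis on this one; beginners can skip it for now.)
CSS Selectors (a.k.a jQuery-style, thanks to PyQuery). (Built on the pyquery library, so CSS selectors are supported.)
XPath Selectors, for the faint of heart. (XPath selectors are supported.)
Mocked user-agent (like a real web browser). (Sends a browser-like User-Agent header, which is handy.)
Automatic following of redirects. (Redirects are followed automatically.)
Connection–pooling and cookie persistence. (Connections are pooled and cookies persist across requests.)
The Requests experience you know and love, with magical parsing abilities.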

Because the library works by parsing an HTML object, you can run print(dir(r.html)) to see which attributes and methods that object provides.

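# Methods and attributes of the html object
# (r is a response returned by session.get(), e.g. for https://python.org/)
print(r.html)  # the parsed HTML object
---------
<HTML url='https://www.python.org/'>

Methods:
print(r.html.find('a'))  # takes a CSS selector, returns a list of elements
print(r.html.xpath('//a'))  # takes an XPath selector, returns a list of elements
print(r.html.search('Python is a {} language'))  # takes a parse-style template, returns the first match (or None)
print(r.html.search_all('Python is a {} language'))  # same template, returns all matches

Attributes:
print(r.html.links)  # all links on the page, as a set
print(r.html.absolute_links)  # all links on the page in absolute form, as a set
print(r.html.url)  # URL of the requested page, as a string
print(r.html.base_url)  # base URL of the page, as a string
print(r.html.html)  # page (response) content as an HTML string, equivalent to r.text
print(r.html.raw_html)  # unparsed page content, as bytes
print(r.html.text)  # all text on the page, as a string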

# Send a request and get the response
from requests_html import HTMLSession
session = HTMLSession()  # create an HTMLSession instance
r = session.get('https://python.org/')  # send a GET request
print(r)
---------
<Response [200]>

print(r.text)  # response body (browser DevTools: F12 --> Network --> Response)

# Fetch several sites concurrently with an async session
from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()
async def get_python():
    r = await asession.get('https://python.org/')
    return r
async def get_reddit():
    r = await asession.get('https://reddit.com/')
    return r
async def get_google():
    r = await asession.get('https://google.com/')
    return r
# Pass the coroutine functions themselves (uncalled); results may come back in any order
results = asession.run(get_python, get_reddit, get_google)
for result in results:
    print(result.html.url)
---------
https://www.python.org/
https://www.google.com/
https://www.reddit.com/
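
The output above lists the sites in a different order from the one in which the functions were passed in, which is typical of concurrent code: results tend to arrive as tasks finish. The sketch below illustrates that idea with the standard library only. It is not requests_html code: fake_fetch and the delays are hypothetical stand-ins for real requests, and asyncio.as_completed and asyncio.gather are used here in place of asession.run.

```python
import asyncio

async def fake_fetch(name, delay):
    # Stand-in for `await asession.get(url)`: sleeps instead of using the network.
    await asyncio.sleep(delay)
    return name

# Hypothetical delays; the slowest "site" is listed first on purpose.
JOBS = [("python", 0.3), ("reddit", 0.2), ("google", 0.1)]

async def main():
    # Collect results as each task finishes: the order follows completion time.
    tasks = [asyncio.ensure_future(fake_fetch(n, d)) for n, d in JOBS]
    completed = [await fut for fut in asyncio.as_completed(tasks)]
    # gather() instead returns results in the order the awaitables were passed.
    gathered = await asyncio.gather(*(fake_fetch(n, d) for n, d in JOBS))
    return completed, gathered

completion_order, submission_order = asyncio.run(main())
print(completion_order)
print(submission_order)
```

If the order of results matters in a crawler, it is safer to identify each result by its URL (as result.html.url does above) than to rely on its position in the list.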

# Smart pagination
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://python.org/')
for html in r.html:
    print(html)
---------
<HTML url='https://www.python.org/'>
<HTML url='https://docs.python.org/3/tutorial/controlflow.html#defining-functions'>
<HTML url='https://docs.python.org/3/tutorial/'>
<HTML url='https://docs.python.org/3/tutorial/appetite.html'>
<HTML url='https://docs.python.org/3/tutorial/interpreter.html'>
......

print(r.html.next())  # a convenient way to get the URL of the next page
---------
https://docs.python.org/3/tutorial/controlflow.html#defining-functions

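# Extracting useful data with XPath selectors
Method 1:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://python.org/')
hrefs = r.html.xpath('//a/@href')  # example selector: the href of every link

Method 2:
from requests_html import HTMLSession
from requests_html import HTML
session = HTMLSession()
r = session.get('https://python.org/')
html = HTML(html=r.text)  # build a parsed HTML object from the response text
hrefs = html.xpath('//a/@href')  # same example selector as in Method 1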

XPath is a path expression language commonly used to extract information from HTML/XML documents. The basic rules are:

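Expression	Description
nodename	Selects all nodes with the name nodename
/	Selects from the root node
//	Selects matching nodes anywhere in the document below the current node, regardless of their position
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes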

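XPath selectors are applied through the xpath method, which takes four parameters:

- selector: the XPath selector to use;
- clean: boolean; if true, the content of style and script tags in the HTML is stripped from the result;
- first: boolean; if true, only the first match is returned, otherwise a list of all matches is returned;
- _encoding: the encoding to use.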

Note that if the XPath expression ends in something like text() or @href, the results are plain strings rather than HTML elements.

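CSS selectors are applied through the find method, which takes five parameters:

- selector: the CSS selector to use;
- clean: boolean; if true, the content of style and script tags in the HTML is stripped from the result;
- containing: if set, only elements whose text contains the given string are returned;
- first: boolean; if true, only the first match is returned, otherwise a list of all matches is returned;
- _encoding: the encoding to use.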

# Example: extracting information with XPath
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://nc.lianjia.com/ershoufang/103118960587.html')
total_price = r.html.xpath('//span[@class="total"]/text()')  # a list whose items are strings
price = r.html.xpath('//span[@class="total"]/text()')[0]  # the first item of that list
p = r.html.xpath('//span[@class="total"]')  # without text(), a list of Element objects
print(total_price)
print(price)
print(p)
---------
['111']
111
[<Element 'span' class=('total',)>]
                 
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://nc.lianjia.com/ershoufang/103118960587.html')
unit_price = r.html.xpath('//span[@class="unitPriceValue"]/text()')[0]  # 12860
unit = r.html.xpath('//span[@class="unitPriceValue"]/i/text()')[0]  # 元/平米
print(unit_price+unit)
---------
12860元/平米