How to Use httpx

Author: 用脑白金维持脑活力

Published 2024-09-02 22:08:09

Category: web scraping. Tags: httpx, web scraping, python

Article link: https://blog.csdn.net/qq_52046196/article/details/141830409

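Some websites only accept HTTP/2 connections. requests cannot scrape them, because it supports only HTTP/1.1, so for these cases we bring in the httpx request library.

First, we need to know how to check which HTTP protocol a site uses.

We again take "https://spa16.scrape.center/" as the example; this site only accepts HTTP/2.

Press F12 to open the developer tools -> click Network -> move the pointer to the divider between the Name and Status column headers and right-click -> select Header Options -> select Protocol

A Protocol column now appears. The value h2 means the request was made over HTTP/2, which is the only protocol this site accepts.

A regular requests call fails with an error: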

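import requests

try:
    # requests speaks only HTTP/1.1, so this site drops the connection.
    response = requests.get('https://spa16.scrape.center/')
    print(response.text)
except requests.RequestException as exc:
    # Print the underlying error details instead of crashing.
    print(exc.args)
------------------------Output---------------------------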
(ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')),)

So we use httpx to solve this problem.

Installation

 pip install "httpx[http2]"
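
The `http2` extra in the command above pulls in the `h2` package, which httpx needs before `http2=True` can be used. As a quick sanity check after installing, something like the following sketch may help; the helper name `http2_extra_installed` is made up for this illustration and is not part of httpx.

```python
import importlib.util


def http2_extra_installed() -> bool:
    """Return True if the optional h2 package can be imported."""
    # find_spec returns None when no top-level package named "h2" is installed.
    return importlib.util.find_spec("h2") is not None


print("HTTP/2 support installed:", http2_extra_installed())
```

If this prints False, rerun the pip command above before creating a client with `http2=True`.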

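Basic usage

import httpx

# Top-level helpers such as httpx.get mirror the requests API.
response = httpx.get("https://www.httpbin.org/get")
print(response.text)         # response body as text
print(response.status_code)  # numeric status code
print(response.headers)      # response headers
---------------Output----------------------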
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.httpbin.org",
    "User-Agent": "python-httpx/0.27.2",
    "X-Amzn-Trace-Id": "Root=1-66d5c0ca-41fe030b610c4b59126ca286"
  },
  "origin": "113.205.146.233",
  "url": "https://www.httpbin.org/get"
}

200
Headers({'date': 'Mon, 02 Sep 2024 13:42:35 GMT', 'content-type': 'application/json', 'content-length': '314', 'connection': 'keep-alive', 'server': 'gunicorn/19.9.0', 'access-control-allow-origin': '*', 'access-control-allow-credentials': 'true'})

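In fact, httpx is used in almost the same way as requests. If you are not yet familiar with requests, see my article on how to use requests: https://blog.csdn.net/qq_52046196/article/details/141690872?spm=1001.2014.3001.5501

Next, we use httpx to access the site from the beginning of this article: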

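import httpx

# http2=True enables HTTP/2 support, which is off by default.
client = httpx.Client(http2=True)
response = client.get('https://spa16.scrape.center/')
print(response.text)
client.close()  # release the client's connections when done
----------------------Output-----------------------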
<!DOCTYPE html><html lang=en><head><meta charset=utf-8><meta http-equiv=X-UA-Compatible content="IE=edge"><meta name=viewport 
content="width=device-width,initial-scale=1"><meta name=referrer content=no-referrer><link rel=icon href=/favicon.ico><title>Scrape | Book</title><link href=/css/chunk-50522e84.e4e1dae6.css rel=prefetch><link href=/css/chunk-f52d396c.4f574d24.css rel=prefetch><link href=/js/chunk-50522e84.6b3e24aa.js rel=prefetch><link href=/js/chunk-f52d396c.f8f41620.js rel=prefetch><link href=/css/app.ea9d802a.css rel=preload as=style><link href=/js/app.b93891e2.js rel=preload as=script><link href=/js/chunk-vendors.a02ff921.js rel=preload as=script><link href=/css/app.ea9d802a.css rel=stylesheet></head><body><noscript><strong>We're sorry but portal doesn't work properly without JavaScript enabled. Please enable it to continue.</strong></noscript><div id=app></div><script src=/js/chunk-vendors.a02ff921.js></script><script src=/js/app.b93891e2.js></script></body></html>

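Note that httpx uses HTTP/1.1 by default; HTTP/2 support has to be switched on explicitly by passing http2=True to the Client.

The Client object

Client is easiest to learn by comparing it with the Session object in requests: the usage and the underlying idea are much the same. The official documentation recommends using it in a with statement, so the earlier example can be rewritten as: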

import httpx

with httpx.Client(http2=True) as client:
    response = client.get('https://spa16.scrape.center/')
    print(response.text)

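This way the Client is closed automatically when the block exits, so there is no need to call its close method by hand.

Of course, we can also pass additional parameters when creating the Client: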

import httpx

# Headers passed to the Client are sent with every request it makes.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0'
}
with httpx.Client(http2=True, headers=headers) as client:
    response = client.get('https://spa16.scrape.center/')
    print(response.text)

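Overall, with the requests background from the earlier article, this chapter is fairly simple, so I won't go into more detail here. If you are interested, you can study the official documentation: https://www.python-httpx.org/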