[httpx] Python Web Scraping (Part 5)

卡利-安

Published 2024-04-20 18:29:51


Tags: httpx, web scraping

Article link: https://blog.csdn.net/2302_78240669/article/details/137798732


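Some websites only accept HTTP/2.0 connections. urllib and requests cannot scrape such sites, because they support only HTTP/1.1 and not HTTP/2.0.

In that case, switch to a request library that supports HTTP/2.0. The two representative options at the moment are hyper and httpx. httpx is easier to use and more capable, and it supports almost everything requests does.

Example

Scrape | Book is one site that only accepts HTTP/2.0, so requests cannot scrape it: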

import requests

url = "https://spa16.scrape.center/"
response = requests.get(url)
print(response)

Output:

requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

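As you can see, a RemoteDisconnected error is raised and the request fails.

The reason is that requests accesses the target site over HTTP/1.1, while the site checks whether the request uses HTTP/2.0 and refuses to return anything if it does not.

Installing httpx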

pip3 install "httpx[http2]"

This installs httpx together with its HTTP/2.0 support module.

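Note: httpx requires Python 3.6 or later. That was the minimum for older releases; recent releases, including the 0.27.0 that appears in the output below, require Python 3.8 or later, so check the requirement for the version you install.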

Basic usage

Many httpx APIs resemble those of requests.

A basic GET request:

import httpx

response = httpx.get("https://www.httpbin.org/get")
print(response.status_code)
print(response.headers)
print(response.text)

Output:

200
Headers({'date': 'Sat, 20 Apr 2024 09:29:08 GMT', 'content-type': 'application/json', 'content-length': '313', 'connection': 'keep-alive', 'server': 'gunicorn/19.9.0', 'access-control-allow-origin': '*', 'access-control-allow-credentials': 'true'})
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "www.httpbin.org", 
    "User-Agent": "python-httpx/0.27.0", 
    "X-Amzn-Trace-Id": "Root=1-66238ae4-01749776337de9332f35681e"
  }, 
  "origin": "222.85.167.205", 
  "url": "https://www.httpbin.org/get"
}

Send the request again with a different User-Agent:

import httpx

headers = {
    'User-Agent': 'Mozilla/5.0(Macintosh; Intel Mac OS X 10_15_7)ApplewebKit/537.36(KHTML, like Gecko)Chrome/90.0.4430.93 Safari/537.36'
}
response = httpx.get("https://www.httpbin.org/get", headers=headers)
print(response.text)

Output:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "www.httpbin.org", 
    "User-Agent": "Mozilla/5.0(Macintosh; Intel Mac OS X 10_15_7)ApplewebKit/537.36(KHTML, like Gecko)Chrome/90.0.4430.93 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-66238c90-66f980a06ab73e781885fb9f"
  }, 
  "origin": "222.85.167.205", 
  "url": "https://www.httpbin.org/get"
}

Now request the site from the beginning of the article with httpx:

import httpx

response = httpx.get("https://spa16.scrape.center/")
print(response.text)

Output:

httpx.RemoteProtocolError: Server disconnected without sending a response.

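As you can see, this raises an error similar to the one from requests. Wasn't httpx supposed to support HTTP/2.0?

In fact, httpx does not enable HTTP/2.0 support by default. It uses HTTP/1.1 unless you explicitly turn HTTP/2.0 on: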

import httpx

client = httpx.Client(http2=True)
response = client.get("https://spa16.scrape.center/")
print(response.text)

Output:

<!DOCTYPE html><html lang=en><head><meta charset=utf-8><meta http-equiv=X-UA-Compatible content="IE=edge"><meta name=viewport content="width=device-width,initial-scale=1"><meta name=referrer content=no-referrer><link rel=icon href=/favicon.ico><title>Scrape | Book</title><link href=/css/chunk-50522e84.e4e1dae6.css rel=prefetch><link href=/css/chunk-f52d396c.4f574d24.css rel=prefetch><link href=/js/chunk-50522e84.6b3e24aa.js rel=prefetch><link href=/js/chunk-f52d396c.f8f41620.js rel=prefetch><link href=/css/app.ea9d802a.css rel=preload as=style><link href=/js/app.b93891e2.js rel=preload as=script><link href=/js/chunk-vendors.a02ff921.js rel=preload as=script><link href=/css/app.ea9d802a.css rel=stylesheet></head><body><noscript><strong>We're sorry but portal doesn't work properly without JavaScript enabled. Please enable it to continue.</strong></noscript><div id=app></div><script src=/js/chunk-vendors.a02ff921.js></script><script src=/js/app.b93891e2.js></script></body></html>

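Here we create a Client object and assign it to the variable client, explicitly setting the http2 parameter to True. This enables HTTP/2.0 support, and the HTML is now fetched successfully.

POST, PUT, DELETE, and PATCH requests work in a similar way: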

import httpx

r = httpx.get("https://www.httpbin.org/get", params={'name': 'Alan'})
r = httpx.post("https://www.httpbin.org/post", data={'name': 'Alan'})
r = httpx.put("https://www.httpbin.org/put")
r = httpx.delete("https://www.httpbin.org/delete")
r = httpx.patch("https://www.httpbin.org/patch")

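The resulting Response object exposes the following attributes and methods:

status_code: the status code.
text: the response body as text.
content: the response body as bytes; use this when the target is binary data such as an image.
headers: the response headers, a Headers object; look up an individual header the same way you would read a value from a dictionary.
json: a method that parses the response text as JSON and returns the result.

Client object

Using the Client object:

The officially recommended approach is the with statement: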

import httpx

with httpx.Client() as client:
    response = client.get('https://www.httpbin.org/get')
    print(response)

Output:

<Response [200 OK]>

This is equivalent to:

import httpx

client = httpx.Client()
try:
    response = client.get('https://www.httpbin.org/get')
    print(response)
finally:
    client.close()

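Both approaches produce the same result, but here we must explicitly call the close method at the end to close the Client object.

You can also pass parameters such as headers when creating the Client object, so that every request sent through it carries that configuration by default: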

import httpx

url = "https://www.httpbin.org/headers"
headers = {'User-Agent': 'my-app/0.0.1'}
with httpx.Client(headers=headers) as client:
    r = client.get(url)
    print(r.json()['headers']['User-Agent'])

Output:

my-app/0.0.1

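Here we define a headers variable containing a User-Agent entry and pass it to the headers parameter when creating the Client object, which is assigned to the variable client. We then use client to request the test site and print the User-Agent from the returned result.

Asynchronous requests

httpx also supports an asynchronous client (AsyncClient) that works with Python's async/await model: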

import httpx
import asyncio


async def fetch(url):
    async with httpx.AsyncClient(http2=True) as client:
        response = await client.get(url)
        print(response.text)

if __name__ == '__main__':
    asyncio.run(fetch('https://www.httpbin.org/get'))

I will cover asynchronous requests in more detail in a later post.
