When scraping, the urllib and requests libraries only support HTTP/1.1. Some sites enforce HTTP/2.0, and against those urllib and requests are of no use. The most widely used HTTP/2.0-capable request libraries at the moment are hyper and httpx; of the two, httpx is the more convenient and more capable, covering almost all of the functionality requests offers.
1. Installation
httpx requires Python 3.6 or later and can be installed with:
pip3 install httpx[http2]
2. Basic usage
httpx's API closely mirrors requests'. Basic usage looks like this:
# *********************************************
# Basic use of httpx
# *********************************************
import httpx
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/90.0.4430.93 Safari/537.36'}
url = 'https://www.httpbin.org/get'
response = httpx.get(url=url, headers=headers)
print(response.text)
Running it prints:
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "www.httpbin.org",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-64d344dc-766eeeb637fbfa915a2e6b0f"
},
"origin": "xxx.xxx.12.60",
"url": "https://www.httpbin.org/get"
}
To request an HTTP/2.0 site with httpx:
import httpx
url = 'https://spa16.scrape.center/'
client = httpx.Client(http2=True)
response = client.get(url)
print(response.text)
Running it prints:
<!DOCTYPE html><html lang=en><head><meta charset=utf-8><meta http-equiv=X-UA-Compatible content="IE=edge"><meta name=viewport content="width=device-width,initial-scale=1"><meta name=referrer content=no-referrer><link rel=icon href=/favicon.ico><title>Scrape | Book</title><link href=/css/chunk-50522e84.e4e1dae6.css rel=prefetch><link href=/css/chunk-f52d396c.4f574d24.css rel=prefetch><link href=/js/chunk-50522e84.6b3e24aa.js rel=prefetch><link href=/js/chunk-f52d396c.f8f41620.js rel=prefetch><link href=/css/app.ea9d802a.css rel=preload as=style><link href=/js/app.b93891e2.js rel=preload as=script><link href=/js/chunk-vendors.a02ff921.js rel=preload as=script><link href=/css/app.ea9d802a.css rel=stylesheet></head><body><noscript><strong>We're sorry but portal doesn't work properly without JavaScript enabled. Please enable it to continue.</strong></noscript><div id=app></div><script src=/js/chunk-vendors.a02ff921.js></script><script src=/js/app.b93891e2.js></script></body></html>
By default httpx speaks HTTP/1.1 and does not enable HTTP/2.0 support. To turn it on, construct the client with httpx.Client(http2=True), as above.
The following response attributes and methods expose the information you usually need:
- status_code: the HTTP status code
- text: the body decoded as text
- content: the body as raw bytes, useful when the target is binary data
- headers: the response Headers object
- json(): parses the text body and returns the resulting JSON object
3. The Client object
httpx.Client works much like requests' Session:
# *********************************************
# use of httpx.Client
# *********************************************
import httpx
with httpx.Client() as client:
    response = client.get('https://www.httpbin.org/get')
    print(response.text)
Running it prints:
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "www.httpbin.org",
"User-Agent": "python-httpx/0.24.1",
"X-Amzn-Trace-Id": "Root=1-64d43731-7ee43b435ecd09e174f2d47a"
},
"origin": "xxx.xxx.12.60",
"url": "https://www.httpbin.org/get"
}
4. Asynchronous requests: AsyncClient
httpx also ships an asynchronous client that plugs into Python's async/await model. Usage looks like this:
# *********************************************
# httpx.AsyncClient
# *********************************************
import httpx
import asyncio
async def fetch(url):
    async with httpx.AsyncClient(http2=True) as client:
        response = await client.get(url)
        print(response.text)

if __name__ == '__main__':
    url = "https://www.httpbin.org/get"
    asyncio.run(fetch(url))
Running it prints:
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "www.httpbin.org",
"User-Agent": "python-httpx/0.24.1",
"X-Amzn-Trace-Id": "Root=1-64d43b02-64f889d97b6deba3369a2f6f"
},
"origin": "xxx.xxx.12.60",
"url": "https://www.httpbin.org/get"
}