Basic Libraries for Web Scraping: a Detailed Look at httpx

        Earlier posts covered the urllib and requests libraries (see my CSDN posts on urllib and on requests), which are enough to scrape data from most websites. But some sites remain out of reach, because they require HTTP/2.0, while urllib and requests only support HTTP/1.1. What can we do in that case? Simply switch to a request library that does support HTTP/2.0. The two widely used options are hyper and httpx. Of the two, httpx is both more convenient and more powerful, supporting almost everything requests can do. So let's go through the httpx library in detail!

Contents

The httpx Library

1. An Example

2. Installation

3. Basic Usage

4. The Client Object


The httpx Library

1. An Example

        Let's look at an example. https://spa16.scrape.center/ is a site that enforces HTTP/2.0, so it cannot be requested with the requests library. Don't believe it? Let's try:

import requests
url = 'https://spa16.scrape.center/'
response = requests.get(url)
print(response.text)

The result is as follows:

Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 1374, in getresponse
    response.begin()
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 847, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\util\retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\util\util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 1374, in getresponse
    response.begin()
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\Python知识\01.py", line 3, in <module>
    response = requests.get(url)
               ^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

           As you can see: a pile of errors. Headache-inducing! A RemoteDisconnected error is raised, which is exactly what happens when requests tries to access a site that only speaks HTTP/2.0.

2. Installation

        httpx is a third-party library that needs to be installed in advance and requires Python 3.6 or later. The installation command is as follows:

pip install httpx
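
        One note before moving on: HTTP/2.0 support in httpx, which we will rely on later in this post, is an optional extra that depends on the h2 package. It can be installed together with httpx like this:

pip install 'httpx[h2]'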

3. Basic Usage

        httpx and requests share many similar APIs. Let's first look at the most basic GET request:

import httpx
response = httpx.get('https://www.httpbin.org/get')
print(response.status_code)
print(response.headers)
print(response.text)

        Here we request the same test site as before, calling httpx's get method directly. The usage is identical to requests: assign the return value to a response variable, then print its status_code, headers, and text attributes. The output is as follows:

200
Headers({'date': 'Thu, 22 Feb 2024 03:37:08 GMT', 'content-type': 'application/json', 'content-length': '311', 'connection': 'keep-alive', 'server': 'gunicorn/19.9.0', 'access-control-allow-origin': '*', 'access-control-allow-credentials': 'true'})
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "www.httpbin.org", 
    "User-Agent": "python-httpx/0.27.0", 
    "X-Amzn-Trace-Id": "Root=1-65d6c164-3cdcebd7381e5873457a6866"
  }, 
  "origin": "111.72.54.67", 
  "url": "https://www.httpbin.org/get"
}

        The output contains three parts. The status_code attribute is the status code, 200 here. The headers attribute holds the response headers as a Headers object, which behaves much like a dictionary. The text attribute is the response body, and in it you can see the User-Agent is python-httpx/0.27.0, showing that the request was made with httpx. Let's switch to a different User-Agent and request again; the code becomes:

import httpx
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}
response = httpx.get('https://www.httpbin.org/get', headers=headers)
print(response.text)

        Here we set a different User-Agent, assign it to the headers variable, and pass it in via the headers parameter. The result is as follows:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "www.httpbin.org", 
    "User-Agent": "Mozilla/5.0(Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36(KHTML, like Gecko)Chrome/90.0.4430.93 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-65d6c2c6-0fef57aa6cc8a6fa4f0ea62e"
  }, 
  "origin": "111.72.54.67", 
  "url": "https://www.httpbin.org/get"
}

        As you can see, the new User-Agent took effect! Next, let's try requesting the HTTP/2.0-only site with httpx and see what happens. The code is as follows:

import httpx
response = httpx.get('https://spa16.scrape.center/')
print(response.text)

The result is as follows:

Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_transports\default.py", line 69, in map_httpcore_exceptions
    yield
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_transports\default.py", line 233, in handle_request
    resp = self._pool.handle_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\connection_pool.py", line 216, in handle_request
    raise exc from None
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\connection_pool.py", line 196, in handle_request
    response = connection.handle_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\connection.py", line 101, in handle_request
    return self._connection.handle_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\http11.py", line 143, in handle_request
    raise exc
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\http11.py", line 113, in handle_request
    ) = self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\http11.py", line 186, in _receive_response_headers
    event = self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\http11.py", line 238, in _receive_event
    raise RemoteProtocolError(msg)
httpcore.RemoteProtocolError: Server disconnected without sending a response.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\Python知识\01.py", line 2, in <module>
    response = httpx.get('https://spa16.scrape.center/')
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_api.py", line 198, in get
    return request(
           ^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_api.py", line 106, in request
    return client.request(
           ^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 827, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 914, in send
    response = self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 942, in _send_handling_auth
    response = self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 979, in _send_handling_redirects
    response = self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 1015, in _send_single_request
    response = transport.handle_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_transports\default.py", line 232, in handle_request
    with map_httpcore_exceptions():
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_transports\default.py", line 86, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.RemoteProtocolError: Server disconnected without sending a response.

      Huh, why is it still failing? Didn't we say httpx supports HTTP/2.0? Well, httpx does not enable HTTP/2.0 support by default; it still speaks HTTP/1.1 unless told otherwise, so we have to enable HTTP/2.0 explicitly. The code is as follows:

import httpx
client = httpx.Client(http2=True)
response = client.get('https://spa16.scrape.center/')
print(response.text)
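
        This time the request goes through and the page's HTML is printed. As a quick sanity check, you can also confirm which protocol was actually negotiated: httpx exposes it on the response as the http_version attribute. A minimal sketch (assuming the h2 extra from the installation section is installed):

import httpx

# http2=True only enables HTTP/2 support; the actual protocol is negotiated with the server
client = httpx.Client(http2=True)
response = client.get('https://spa16.scrape.center/')
# http_version reports what was actually used, e.g. 'HTTP/2'
print(response.http_version)
client.close()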

        Here we declare a Client object, assign it to the client variable, and explicitly set the http2 parameter to True, which switches on HTTP/2.0 support. With that, the HTML comes back successfully, confirming that this demo site really is accessible only over HTTP/2.0. As mentioned earlier, httpx and requests share many similar APIs. The above is a GET request; POST, PUT, and DELETE requests are written analogously:

import httpx
r = httpx.get('https://www.httpbin.org/get', params={'name': 'germey'})
r = httpx.post('https://www.httpbin.org/post', data={'name': 'germey'})
r = httpx.put('https://www.httpbin.org/put')
r = httpx.delete('https://www.httpbin.org/delete')
r = httpx.patch('https://www.httpbin.org/patch')
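
        As in requests, the data parameter above sends a form-encoded body. A small aside (a hedged sketch; httpx, like requests, also accepts a json parameter for JSON-encoded bodies):

import httpx

# form-encoded body, sent as application/x-www-form-urlencoded
r = httpx.post('https://www.httpbin.org/post', data={'name': 'germey'})
# JSON body, serialized and sent as application/json
r = httpx.post('https://www.httpbin.org/post', json={'name': 'germey'})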

        From the returned Response object, the following attributes and methods can be used to extract the content you want (each one is exercised in the sketch after this list).

  • status_code: the status code.
  • text: the text content of the response body.
  • content: the binary content of the response body; use it when the requested target is binary data, such as an image.
  • headers: the response headers, a Headers object whose entries can be read dict-style.
  • json: a method that parses the text result into a JSON object.
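
        To make these concrete, here is a small sketch exercising each of them against the httpbin test site used throughout this post (the Content-Type key is just an illustrative header to look up):

import httpx

response = httpx.get('https://www.httpbin.org/get')
print(response.status_code)              # integer status code, e.g. 200
print(response.text)                     # body decoded as text
print(response.content)                  # body as raw bytes, useful for images
print(response.headers['Content-Type'])  # dict-style lookup on the Headers object
print(response.json())                   # parse a JSON body into Python objects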

        Beyond these, httpx has other basics that closely mirror requests, so I won't repeat them here; see the official documentation: https://www.python-httpx.org/quickstart/

4. The Client Object

        Some of httpx's basic APIs are very similar to those in requests, but some are not. For example, httpx has a Client object, which is best understood by analogy with the Session object in requests. Let's go over how to use it. The officially recommended usage is the with...as statement, for example:

import httpx
with httpx.Client() as client:
    response = client.get('https://www.httpbin.org/get')
    print(response)

The output is as follows:

<Response [200 OK]>

        This usage is equivalent to:

import httpx
client = httpx.Client()
try:
    response = client.get('https://www.httpbin.org/get')
finally:
    client.close()

        Both approaches produce the same result; the second simply requires us to explicitly call the close method at the end to close the Client object. In addition, when declaring a Client object you can specify parameters such as headers, and every request sent through that client will then carry them by default. For example:

import httpx
url = 'http://www.httpbin.org/headers'
headers = {'User-Agent': 'my-app/0.0.1'}
with httpx.Client(headers=headers) as client:
    r = client.get(url)
    print(r.json()['headers']['User-Agent'])

        Here we declare a headers variable containing a User-Agent entry, pass it to the headers parameter when initializing a Client object assigned to the client variable, then use client to request the test site and print the User-Agent from the returned result:

my-app/0.0.1

        As you can see, the headers were applied successfully!
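
        One related detail, sketched here on the basis of httpx's documented configuration merging: headers passed to an individual request are merged with the Client-level defaults, and when the same key appears in both, the per-request value takes precedence:

import httpx

url = 'http://www.httpbin.org/headers'
with httpx.Client(headers={'User-Agent': 'my-app/0.0.1'}) as client:
    # the per-request User-Agent overrides the client-level default
    r = client.get(url, headers={'User-Agent': 'my-app/0.0.2'})
    print(r.json()['headers']['User-Agent'])  # expected: my-app/0.0.2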

        That wraps up the basics of httpx. To summarize: httpx is a library very similar to requests, with the added advantage of supporting HTTP/2.0.
