Web Scraping Series | Part 3: The Requests Library

1. What is Requests?

  • Requests is a popular third-party module for making HTTP requests; it is built as a further wrapper on top of Python's built-in urllib.
  • Requests is more convenient to use than urllib and can greatly improve development efficiency; it is the recommended library for scraper development.
  • Installation: pip install requests
  • Official homepage: http://python-requests.org/
  • Chinese documentation: https://2.python-requests.org//zh_CN/latest/index.html

2. Making GET Requests

2.1 Basic Usage

  • There are two ways to make a GET request with Requests: call the get method directly, or call the generic request method with the method set to "get".
import requests
# Method 1: the get shortcut
response1 = requests.get("http://www.baidu.com/")
print(response1.text)
# Method 2: the generic request method
response2 = requests.request("get", "http://www.baidu.com/")
print(response2.text)

Note:
(1) With response.text, Requests decodes the response body using the text encoding inferred from the HTTP response; most Unicode charsets are decoded seamlessly.
(2) With response.content, you get the raw binary bytes of the server's response, which can be used to save binary files such as images.
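As an illustration of note (2), response.content can be written straight to a file to save binary data; this sketch uses httpbin's sample PNG endpoint (the output filename is arbitrary):

```python
import requests

# Fetch httpbin's sample PNG; response.content holds the raw bytes.
response = requests.get("http://httpbin.org/image/png")
print(type(response.content))  # <class 'bytes'>

# Write the bytes to disk to save the image.
with open("sample.png", "wb") as f:
    f.write(response.content)
```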

  • 如果我们想要在URL中传递数据给服务器,即携带查询字符串参数(Query String Parameters),有两种方式可以实现:① 直接构造携带参数的URL,如httpbin.org/get?key1=val1&key2=val2 ; ② Requests模 块允许的get方法允许我们使用params关键字参数传递,以一个字典来传递这些参数。
import requests

query_string_parameters = {
    "username":"admin",
    "password":"123456"
}
response = requests.get("http://httpbin.org/get",
						 params=query_string_parameters)
print(response.url)
print(response.text)

2.2 Setting Request Headers
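Many sites reject requests whose default User-Agent identifies a script, so a browser-style User-Agent is often set. A minimal sketch, passing a dict via the headers keyword argument (the User-Agent string here is just an example value):

```python
import requests

# Pass custom request headers via the headers keyword argument.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
response = requests.get("http://httpbin.org/get", headers=headers)

# The prepared request records the headers that were actually sent.
print(response.request.headers["User-Agent"])
```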

2.3 Getting Cookies

  • If a response sets cookies, we can retrieve them through the response object's cookies attribute:
import requests

response = requests.get("http://www.baidu.com")

# CookieJar object:
cookiejar = response.cookies
print(cookiejar)

# Iterate over the cookies
for key,value in cookiejar.items():
    print(key + ":" + value)
    
# Convert the CookieJar to a dict:
cookiedict = requests.utils.dict_from_cookiejar(cookiejar)
print(cookiedict)

2.4 Session Persistence

  • In Requests, the Session object is very commonly used; it represents one user session, from the moment the client browser connects to the server until it disconnects.
  • A session lets us persist certain parameters across requests; for example, all requests made through the same Session object share cookies.
import requests

sess = requests.Session()
sess.get("http://httpbin.org/cookies/set/number/123456")
response = sess.get("http://httpbin.org/cookies")
print(response.text)
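To see why the Session matters, compare it with two independent requests.get calls; without a session, the cookie set by the first request is not sent with the second (a sketch against httpbin):

```python
import requests

# Two independent calls: the cookie set by the first request
# is not carried over to the second one.
requests.get("http://httpbin.org/cookies/set/number/123456")
resp_plain = requests.get("http://httpbin.org/cookies")
print(resp_plain.json()["cookies"])  # {}

# The same two calls through a Session share the cookie jar.
sess = requests.Session()
sess.get("http://httpbin.org/cookies/set/number/123456")
resp_sess = sess.get("http://httpbin.org/cookies")
print(resp_sess.json()["cookies"])  # {'number': '123456'}
```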

2.5 Certificate Verification

  • By default, Requests verifies SSL certificates and raises an SSLError when a certificate cannot be verified.
  • To avoid the SSLError, certificate verification can be skipped by setting verify=False:
import requests
response = requests.get("https://www.12306.cn", verify=False)
print(response.status_code)
# print(response.text)
  • Running this produces the following warning:
D:\Program Files\Python37\lib\site-packages\urllib3\connectionpool.py:851: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
D:\Program Files\Python37\lib\site-packages\urllib3\connectionpool.py:851: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
  • To suppress the warning as well, add the following code:
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Disable the insecure request warning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
response = requests.get("https://www.12306.cn", verify=False)
print(response.status_code)
# print(response.text)
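On newer versions, the same warning class can also be imported from urllib3 directly, which avoids the requests.packages indirection; a sketch:

```python
import urllib3

# Disable the InsecureRequestWarning emitted for unverified HTTPS requests.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
```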

3. Making POST Requests

  • Use the post method and set the data parameter:
import requests
if __name__ == '__main__':
	# POST request
	resp = requests.post("http://httpbin.org/post", data={'key': 'value'})
	print(resp)
	print(resp.url)          # final URL of the request
	print(resp.status_code)  # HTTP response status code
	print(resp.text)         # decoded response body as a string; gzip and deflate are decompressed automatically
	print(resp.content)      # raw response body as bytes
	print(resp.raw)          # the underlying urllib3 response object; pass stream=True to the request and read it with resp.raw.read()
	print(resp.headers)      # response headers as a case-insensitive dict; use resp.headers.get(key) to get None for a missing key
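Besides form data via data=, Requests also accepts a json keyword argument that serializes a dict into a JSON body and sets the Content-Type header automatically; a sketch:

```python
import requests

# json= serializes the dict and sets Content-Type: application/json.
resp = requests.post("http://httpbin.org/post", json={"key": "value"})
print(resp.request.headers["Content-Type"])  # application/json
print(resp.json()["json"])                   # {'key': 'value'}
```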

4. Making Other Requests

import requests
if __name__ == '__main__':
	# DELETE request
	resp3 = requests.delete('http://httpbin.org/delete')
	print(resp3)
	# HEAD request
	resp4 = requests.head('http://httpbin.org/get')
	print(resp4)
	# OPTIONS request
	resp5 = requests.options('http://httpbin.org/get')
	print(resp5)
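A HEAD request is useful for checking a resource without downloading it: the server returns the status code and response headers but an empty body. A sketch:

```python
import requests

# HEAD returns status and headers but no response body.
resp = requests.head("http://httpbin.org/get")
print(resp.status_code)             # 200
print(len(resp.content))            # 0
print(resp.headers["Content-Type"])
```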

5. Exception Handling

  • The exception classes provided by Requests all live in requests.exceptions; the source is at http://cn.python-requests.org/zh_CN/latest/_modules/requests/exceptions.html#RequestException
  • The inheritance relationships between the exception classes can be seen from the source:

(1) RequestException inherits from IOError
(2) HTTPError, ConnectionError, and Timeout inherit from RequestException
(3) ProxyError and SSLError inherit from ConnectionError
(4) ReadTimeout inherits from Timeout
These are some of the commonly used inheritance relationships; the full set is in the source below:

# -*- coding: utf-8 -*-

"""
requests.exceptions
~~~~~~~~~~~~~~~~~~~

This module contains the set of Requests' exceptions.
"""
from urllib3.exceptions import HTTPError as BaseHTTPError


class RequestException(IOError):
    """There was an ambiguous exception that occurred while handling your
    request.
    """

    def __init__(self, *args, **kwargs):
        """Initialize RequestException with `request` and `response` objects."""
        response = kwargs.pop('response', None)
        self.response = response
        self.request = kwargs.pop('request', None)
        if (response is not None and not self.request and
                hasattr(response, 'request')):
            self.request = self.response.request
        super(RequestException, self).__init__(*args, **kwargs)



class HTTPError(RequestException):
    """An HTTP error occurred."""



class ConnectionError(RequestException):
    """A Connection error occurred."""



class ProxyError(ConnectionError):
    """A proxy error occurred."""


class SSLError(ConnectionError):
    """An SSL error occurred."""


class Timeout(RequestException):
    """The request timed out.

    Catching this error will catch both
    :exc:`~requests.exceptions.ConnectTimeout` and
    :exc:`~requests.exceptions.ReadTimeout` errors.
    """



class ConnectTimeout(ConnectionError, Timeout):
    """The request timed out while trying to connect to the remote server.

    Requests that produced this error are safe to retry.
    """



class ReadTimeout(Timeout):
    """The server did not send any data in the allotted amount of time."""



class URLRequired(RequestException):
    """A valid URL is required to make a request."""



class TooManyRedirects(RequestException):
    """Too many redirects."""



class MissingSchema(RequestException, ValueError):
    """The URL schema (e.g. http or https) is missing."""


class InvalidSchema(RequestException, ValueError):
    """See defaults.py for valid schemas."""


class InvalidURL(RequestException, ValueError):
    """The URL provided was somehow invalid."""


class InvalidHeader(RequestException, ValueError):
    """The header value provided was somehow invalid."""


class ChunkedEncodingError(RequestException):
    """The server declared chunked encoding but sent an invalid chunk."""


class ContentDecodingError(RequestException, BaseHTTPError):
    """Failed to decode response content"""


class StreamConsumedError(RequestException, TypeError):
    """The content for this response was already consumed"""


class RetryError(RequestException):
    """Custom retries logic failed"""


class UnrewindableBodyError(RequestException):
    """Requests encountered an error when trying to rewind a body"""

# Warnings


class RequestsWarning(Warning):
    """Base warning for Requests."""
    pass


class FileModeWarning(RequestsWarning, DeprecationWarning):
    """A file was opened in text mode, but Requests determined its binary length."""
    pass


class RequestsDependencyWarning(RequestsWarning):
    """An imported dependency doesn't match the expected version range."""
    pass
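The inheritance relationships listed above can be checked directly with issubclass:

```python
from requests.exceptions import (
    RequestException, HTTPError, ConnectionError,
    Timeout, ProxyError, SSLError, ReadTimeout,
)

# Verify the inheritance relationships described above.
print(issubclass(RequestException, IOError))          # True
print(issubclass(HTTPError, RequestException))        # True
print(issubclass(ConnectionError, RequestException))  # True
print(issubclass(ProxyError, ConnectionError))        # True
print(issubclass(SSLError, ConnectionError))          # True
print(issubclass(ReadTimeout, Timeout))               # True
```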
  • Usage example:
import requests

from requests.exceptions import ReadTimeout
from requests.exceptions import ConnectionError
from requests.exceptions import RequestException

try:
    response = requests.get("http://httpbin.org/get", timeout=0.1)
    print(response.status_code)
except ReadTimeout:
    print("timeout")
except ConnectionError:
    print("connection Error")
except RequestException:
    print("error")
  • raise_for_status(): raises an HTTPError if the response status code indicates a client or server error (4xx/5xx)
import requests

if __name__ == '__main__':
	try:
		resp = requests.get('http://httpbin.org/status/404')
		resp.raise_for_status()  # raise an HTTPError for 4xx/5xx status codes
	except requests.RequestException as e:
		print(e)
	else:
		print(resp)
  • Running this prints the HTTPError message for the 404 response.