Web Scraping Series | Part 3: The Requests Library

1. What is Requests?

  • Requests is a popular third-party module for making HTTP requests; it is built as a further wrapper on top of Python's built-in urllib.
  • Requests is more convenient to use than urllib and can greatly improve development efficiency; it is the recommended library for scraper development.
  • Installation: pip install requests
  • Official homepage: http://python-requests.org/
  • Chinese documentation: https://2.python-requests.org//zh_CN/latest/index.html

2. Making GET Requests

2.1 Basic Usage

  • There are two ways to make a GET request with Requests: call the get method directly, or call the generic request method with the method set to "get".
import requests
# Method 1: the get shortcut
response1 = requests.get("http://www.baidu.com/")
print(response1.text)
# Method 2: the generic request method
response2 = requests.request("get", "http://www.baidu.com/")
print(response2.text)

Note:
(1) With response.text, Requests decodes the response body using the text encoding inferred from the HTTP response; most Unicode charsets are decoded seamlessly.
(2) With response.content, you get the raw binary bytes of the server's response, which can be used to save binary files such as images.
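As an illustration of note (2), response.content can be written straight to a file to save binary data; this sketch uses httpbin's sample PNG endpoint (the output filename is arbitrary):

```python
import requests

# Fetch httpbin's sample PNG; response.content holds the raw bytes.
response = requests.get("http://httpbin.org/image/png")
print(type(response.content))  # <class 'bytes'>

# Write the bytes to disk to save the image.
with open("sample.png", "wb") as f:
    f.write(response.content)
```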

  • 如果我们想要在URL中传递数据给服务器,即携带查询字符串参数(Query String Parameters),有两种方式可以实现:① 直接构造携带参数的URL,如httpbin.org/get?key1=val1&key2=val2 ; ② Requests模 块允许的get方法允许我们使用params关键字参数传递,以一个字典来传递这些参数。
import requests

query_string_parameters = {
    "username":"admin",
    "password":"123456"
}
response = requests.get("http://httpbin.org/get",
						 params=query_string_parameters)
print(response.url)
print(response.text)

2.2 Setting Request Headers
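Many sites reject requests whose default User-Agent identifies a script, so a browser-style User-Agent is often set. A minimal sketch, passing a dict via the headers keyword argument (the User-Agent string here is just an example value):

```python
import requests

# Pass custom request headers via the headers keyword argument.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
response = requests.get("http://httpbin.org/get", headers=headers)

# The prepared request records the headers that were actually sent.
print(response.request.headers["User-Agent"])
```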

2.3 Getting Cookies

  • If a response sets cookies, we can retrieve them through the response object's cookies attribute:
import requests

response = requests.get("http://www.baidu.com")

# CookieJar object:
cookiejar = response.cookies
print(cookiejar)

# Iterate over the cookies
for key,value in cookiejar.items():
    print(key + ":" + value)
    
# Convert the CookieJar to a dict:
cookiedict = requests.utils.dict_from_cookiejar(cookiejar)
print(cookiedict)

2.4 Session Persistence

  • In Requests, the Session object is very commonly used; it represents one user session, from the moment the client browser connects to the server until it disconnects.
  • A session lets us persist certain parameters across requests; for example, all requests made through the same Session object share cookies.
import requests

sess = requests.Session()
sess.get("http://httpbin.org/cookies/set/number/123456")
response = sess.get("http://httpbin.org/cookies")
print(response.text)
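To see why the Session matters, compare it with two independent requests.get calls; without a session, the cookie set by the first request is not sent with the second (a sketch against httpbin):

```python
import requests

# Two independent calls: the cookie set by the first request
# is not carried over to the second one.
requests.get("http://httpbin.org/cookies/set/number/123456")
resp_plain = requests.get("http://httpbin.org/cookies")
print(resp_plain.json()["cookies"])  # {}

# The same two calls through a Session share the cookie jar.
sess = requests.Session()
sess.get("http://httpbin.org/cookies/set/number/123456")
resp_sess = sess.get("http://httpbin.org/cookies")
print(resp_sess.json()["cookies"])  # {'number': '123456'}
```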

2.5 Certificate Verification

  • By default, Requests verifies SSL certificates and raises an SSLError when a certificate cannot be verified.
  • To avoid the SSLError, certificate verification can be skipped by setting verify=False:
import requests
response = requests.get("https://www.12306.cn", verify=False)
print(response.status_code)
# print(response.text)
  • Running this produces the following warning:
D:\Program Files\Python37\lib\site-packages\urllib3\connectionpool.py:851: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
D:\Program Files\Python37\lib\site-packages\urllib3\connectionpool.py:851: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
  • To suppress the warning as well, add the following code:
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Disable the insecure request warning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
response = requests.get("https://www.12306.cn", verify=False)
print(response.status_code)
# print(response.text)
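On newer versions, the same warning class can also be imported from urllib3 directly, which avoids the requests.packages indirection; a sketch:

```python
import urllib3

# Disable the InsecureRequestWarning emitted for unverified HTTPS requests.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
```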

3. Making POST Requests

  • Use the post method and set the data parameter:
import requests
if __name__ == '__main__':
	# POST request
	resp = requests.post("http://httpbin.org/post", data={'key': 'value'})
	print(resp)
	print(resp.url)          # final URL of the request
	print(resp.status_code)  # HTTP response status code
	print(resp.text)         # decoded response body as a string; gzip and deflate are decompressed automatically
	print(resp.content)      # raw response body as bytes
	print(resp.raw)          # the underlying urllib3 response object; pass stream=True to the request and read it with resp.raw.read()
	print(resp.headers)      # response headers as a case-insensitive dict; use resp.headers.get(key) to get None for a missing key
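Besides form data via data=, Requests also accepts a json keyword argument that serializes a dict into a JSON body and sets the Content-Type header automatically; a sketch:

```python
import requests

# json= serializes the dict and sets Content-Type: application/json.
resp = requests.post("http://httpbin.org/post", json={"key": "value"})
print(resp.request.headers["Content-Type"])  # application/json
print(resp.json()["json"])                   # {'key': 'value'}
```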

4. Making Other Requests

import requests
if __name__ == '__main__':
	# DELETE request
	resp3 = requests.delete('http://httpbin.org/delete')
	print(resp3)
	# HEAD request
	resp4 = requests.head('http://httpbin.org/get')
	print(resp4)
	# OPTIONS request
	resp5 = requests.options('http://httpbin.org/get')
	print(resp5)
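A HEAD request is useful for checking a resource without downloading it: the server returns the status code and response headers but an empty body. A sketch:

```python
import requests

# HEAD returns status and headers but no response body.
resp = requests.head("http://httpbin.org/get")
print(resp.status_code)             # 200
print(len(resp.content))            # 0
print(resp.headers["Content-Type"])
```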

5. Exception Handling

  • The exception classes provided by Requests all live in requests.exceptions; the source is at http://cn.python-requests.org/zh_CN/latest/_modules/requests/exceptions.html#RequestException
  • The inheritance relationships between the exception classes can be seen from the source:

(1) RequestException inherits from IOError
(2) HTTPError, ConnectionError, and Timeout inherit from RequestException
(3) ProxyError and SSLError inherit from ConnectionError
(4) ReadTimeout inherits from Timeout
These are some of the commonly used inheritance relationships; the full set is in the source below:

# -*- coding: utf-8 -*-

"""
requests.exceptions
~~~~~~~~~~~~~~~~~~~

This module contains the set of Requests' exceptions.
"""
from urllib3.exceptions import HTTPError as BaseHTTPError


class RequestException(IOError):
    """There was an ambiguous exception that occurred while handling your
    request.
    """

    def __init__(self, *args, **kwargs):
        """Initialize RequestException with `request` and `response` objects."""
        response = kwargs.pop('response', None)
        self.response = response
        self.request = kwargs.pop('request', None)
        if (response is not None and not self.request and
                hasattr(response, 'request')):
            self.request = self.response.request
        super(RequestException, self).__init__(*args, **kwargs)



class HTTPError(RequestException):
    """An HTTP error occurred."""



class ConnectionError(RequestException):
    """A Connection error occurred."""



class ProxyError(ConnectionError):
    """A proxy error occurred."""


class SSLError(ConnectionError):
    """An SSL error occurred."""


class Timeout(RequestException):
    """The request timed out.

    Catching this error will catch both
    :exc:`~requests.exceptions.ConnectTimeout` and
    :exc:`~requests.exceptions.ReadTimeout` errors.
    """



class ConnectTimeout(ConnectionError, Timeout):
    """The request timed out while trying to connect to the remote server.

    Requests that produced this error are safe to retry.
    """



class ReadTimeout(Timeout):
    """The server did not send any data in the allotted amount of time."""



class URLRequired(RequestException):
    """A valid URL is required to make a request."""



class TooManyRedirects(RequestException):
    """Too many redirects."""



class MissingSchema(RequestException, ValueError):
    """The URL schema (e.g. http or https) is missing."""


class InvalidSchema(RequestException, ValueError):
    """See defaults.py for valid schemas."""


class InvalidURL(RequestException, ValueError):
    """The URL provided was somehow invalid."""


class InvalidHeader(RequestException, ValueError):
    """The header value provided was somehow invalid."""


class ChunkedEncodingError(RequestException):
    """The server declared chunked encoding but sent an invalid chunk."""


class ContentDecodingError(RequestException, BaseHTTPError):
    """Failed to decode response content"""


class StreamConsumedError(RequestException, TypeError):
    """The content for this response was already consumed"""


class RetryError(RequestException):
    """Custom retries logic failed"""


class UnrewindableBodyError(RequestException):
    """Requests encountered an error when trying to rewind a body"""

# Warnings


class RequestsWarning(Warning):
    """Base warning for Requests."""
    pass


class FileModeWarning(RequestsWarning, DeprecationWarning):
    """A file was opened in text mode, but Requests determined its binary length."""
    pass


class RequestsDependencyWarning(RequestsWarning):
    """An imported dependency doesn't match the expected version range."""
    pass
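The inheritance relationships listed above can be checked directly with issubclass:

```python
from requests.exceptions import (
    RequestException, HTTPError, ConnectionError,
    Timeout, ProxyError, SSLError, ReadTimeout,
)

# Verify the inheritance relationships described above.
print(issubclass(RequestException, IOError))          # True
print(issubclass(HTTPError, RequestException))        # True
print(issubclass(ConnectionError, RequestException))  # True
print(issubclass(ProxyError, ConnectionError))        # True
print(issubclass(SSLError, ConnectionError))          # True
print(issubclass(ReadTimeout, Timeout))               # True
```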
  • Usage example:
import requests

from requests.exceptions import ReadTimeout
from requests.exceptions import ConnectionError
from requests.exceptions import RequestException

try:
    response = requests.get("http://httpbin.org/get", timeout=0.1)
    print(response.status_code)
except ReadTimeout:
    print("timeout")
except ConnectionError:
    print("connection Error")
except RequestException:
    print("error")
  • raise_for_status(): raises an HTTPError if the response status code indicates a client or server error (4xx/5xx)
import requests

if __name__ == '__main__':
	try:
		resp = requests.get('http://httpbin.org/status/404')
		resp.raise_for_status()  # raise an HTTPError for 4xx/5xx status codes
	except requests.RequestException as e:
		print(e)
	else:
		print(resp)
  • Running this prints the HTTPError message for the 404 response.