The requests library

weixin_30498807

Tags: json, python, web crawler

Original link: http://www.cnblogs.com/midworld/p/10847277.html

1. Installation

pip install requests

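1.1 The HTTP protocol

HTTP (Hypertext Transfer Protocol) is a stateless application-layer protocol based on a request/response model. It uses URLs to identify network resources: a URL is the Internet path used to access a resource over HTTP, and each URL corresponds to one data resource:

http://host[:port][path]
host: a valid Internet host name or IP address
port: the port number; defaults to 80
path: the path of the requested resource
# http://www.bit.edu.cn
# http://220.181.111.188/duty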
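
As a rough illustration of the URL layout above, Python's standard urllib.parse module can split a URL into these parts. This is only a sketch: the fallback to port 80 is written out by hand, because urlsplit reports None when the URL carries no explicit port.

```python
from urllib.parse import urlsplit

# Example URL taken from the text above
parts = urlsplit('http://220.181.111.188/duty')

host = parts.hostname      # the host part of the URL
port = parts.port or 80    # no explicit port here, so fall back to the HTTP default
path = parts.path          # the path of the requested resource

print(host, port, path)
```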

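1.2 HTTP operations on resources

Method	Description
GET	Request the resource at the URL
HEAD	Request only the response headers for the resource at the URL, without the body
POST	Submit new data to be appended to the resource at the URL
PUT	Store a resource at the URL, replacing whatever is already there
PATCH	Partially update the resource at the URL, changing only part of its content
DELETE	Delete the resource stored at the URL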

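2. The 7 main methods of the Requests library

The requests library has 7 main methods. get and post are the most commonly used; they, along with the other four, are all thin wrappers around the request method (see the source code for details):

Method	Description
requests.request()	Constructs a request; the base method underlying all the methods below
requests.get()	The main method for fetching an HTML page; corresponds to HTTP GET
requests.head()	Fetches the header information of an HTML page; corresponds to HTTP HEAD
requests.post()	Submits a POST request to an HTML page; corresponds to HTTP POST
requests.put()	Submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch()	Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
requests.delete()	Submits a delete request to an HTML page; corresponds to HTTP DELETE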

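2.1 The get method

Sends a GET request to the server. The parameters are appended to the URL and sent along with it, and a response object containing the server's resource is returned.

def get(url, params=None, **kwargs):
    """
    url: the URL to request
    params: extra parameters, appended to the end of the URL as the query string;
    may be a dict, a list of tuples, or bytes
    """
    return request('get', url, params=params, **kwargs)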

Example

Send a GET request to http://www.autohome.com.cn/news/ and get the page source:

import requests

# Without parameters
url = 'http://www.autohome.com.cn/news/'
response = requests.get(url)
print(response.text)

# With parameters
url = 'http://www.autohome.com.cn/news/'
params = {'k1': 'v1', 'k2': 'v2'}
response = requests.get(url, params=params)
print(response.text)

# Final request URL
# http://www.autohome.com.cn/news/?k1=v1&k2=v2
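
To check how params ends up in the URL without touching the network, one option is to build the request and only prepare it. The sketch below assumes nothing beyond the URL and parameters used above; requests.Request and its prepare() method construct the final URL locally and send nothing.

```python
import requests

# Build the request locally; prepare() encodes params into the URL without sending anything
req = requests.Request('GET', 'http://www.autohome.com.cn/news/', params={'k1': 'v1', 'k2': 'v2'})
prepared = req.prepare()

print(prepared.url)
```

The same approach works for inspecting headers and bodies before a request is actually sent.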

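2.2 The post method

Sends a POST request to the server; typically used for logins and form submissions.

def post(url, data=None, json=None, **kwargs):
    """
    url: the URL to request
    data: (optional) the request body; may be a dict, a list of tuples, bytes, or a file object
    json: (optional) the request body, sent as JSON; when the data is a dict nested inside a dict, data sends only the inner dict's keys, whereas json sends the full structure
    """
    return request('post', url, data=data, json=json, **kwargs)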

Example

import requests

data = {'username': 'rose@126.com', 'password': 'aadaf522'}
url = 'http://www.autohome.com.cn/login/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

response = requests.post(
    url=url,
    data=data,
    headers=headers
)

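2.3 The 2 important objects of the requests library

The response object, which holds the resource returned by the server (and also carries the request object that produced it)
The request object, which asks the server for a resource

Attributes of the response object

Attribute	Description
response.status_code	Status code of the HTTP response; 200 means success, 404 means the resource was not found
response.text	The response body as a string, i.e. the content of the page at the URL
response.encoding	The encoding of the response body, guessed from the HTTP headers
response.apparent_encoding	The encoding of the response body, inferred from the content itself (the fallback encoding)
response.content	The response body as raw bytes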

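Encoding issues with the response object

When we request a site and read its source, we sometimes get garbled text. This is caused by decoding the content with the wrong encoding. We could look up the encoding the page actually uses ourselves, but that is tedious.

Fortunately, requests already offers a fix: the apparent_encoding attribute reports the encoding detected from the content itself, and assigning it to encoding usually gets rid of the garbled text:

# Just add this one line
response.encoding = response.apparent_encoding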
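
Why the wrong encoding produces garbled text can be shown with plain bytes, without any network access. The sample text and the choice of GBK below are made up for illustration; ISO-8859-1 is used as the wrong guess because requests falls back to it for text responses whose headers declare no charset.

```python
# Bytes as a server might send them for a GBK-encoded Chinese page (made-up sample text)
raw = '中文网页'.encode('gbk')

# Decoding with the wrong charset does not fail; it silently yields garbled text
garbled = raw.decode('iso-8859-1')

# Decoding with the charset the content was actually written in recovers the text
fixed = raw.decode('gbk')

print(garbled != '中文网页', fixed == '中文网页')
```

Setting response.encoding before reading response.text plays the role of choosing the right charset in the second decode.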

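2.4 The request method

The seven methods get, post, delete, put, head, patch and options are all built on top of request().

request() takes many parameters. The most commonly used are method (the request method), url (the request address), and data and json (the request body).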

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param json: (optional) A JSON serializable Python object to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How many seconds to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) Either a boolean, in which case it controls whether we verify
            the server's TLS certificate, or a string, in which case it must be a path
            to a CA bundle to use. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'https://httpbin.org/get')
      >>> req
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)

Examples of common parameters

method_url(): the request method and the request URL:

def method_url():
    requests.request(method='get', url='https://www.baidu.com')
    requests.request(method='post', url='https://www.baidu.com')

params(): extra parameters carried by a get() request, appended to the URL

def params():
    # May be a dict, str, or bytes (ASCII only)
    url = 'https://www.baidu.com'
    requests.request(method='get', url=url, params={'name': 'rose', 'age': 18})

    requests.request(method='get', url=url, params='name=rose&age=18')

    requests.request(method='get', url=url, params=bytes('name=rose&age=18', encoding='utf-8'))

# Final URL
# https://www.baidu.com/?name=rose&age=18

data(): the request body (request data)

def data():
    # May be a str, dict, bytes, or a file object
    url = 'http://httpbin.org/post'
    data1 = {'username': 'rose@126.com', 'password': 'aadaf522'}
    # A form-encoded string separates its key=value pairs with &
    data2 = 'username=rose@126.com&password=aadaf522'

    # Dict nested inside a dict: only the inner dict's key is sent
    data4 = {'first_name': {'last_name': 'rose'}}
    # httpbin echoes data4 back as (excerpt):
    # {
    #     "args": {},
    #     "data": "",
    #     "files": {},
    #     "form": {
    #         "first_name": "last_name"
    #     },

    # File object; the file contains  username=rose@126.com, password=123456789
    with open('test.txt', 'rb') as data3:
        response = requests.request(method='POST', url=url, data=data3, headers={"Content-Type": "application/x-www-form-urlencoded"})

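json(): serializes the request data into a JSON string

def json():
    # Used the same way as data(); the difference is that json serializes the request data into a JSON string,
    # sends it in the request body, and sets the header {'Content-Type': 'application/json'}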

    url = 'http://httpbin.org/post'

    data = {'first_name': {'last_name': 'rose'}}
    response = requests.request(method='POST', url=url, json=data)

    print(response.text)

Output:

{
    "args": {}, 
    "data": "{\"first_name\": {\"last_name\": \"rose\"}}", 
    "files": {}, 
    "form": {}, 
    "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Content-Length": "37", 
        "Content-Type": "application/json", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.21.0"
    }, 
    "json": {
        "first_name": {
        "last_name": "rose"
        }
    }, 
    "origin": "183.39.162.167, 183.39.162.167", 
    "url": "https://httpbin.org/post"
}
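
The difference between data and json for a nested dict can be checked locally by preparing both requests and comparing what would be sent. This is a sketch that reuses the httpbin URL and payload from the examples above; prepare() only builds the headers and body and sends nothing.

```python
import json
import requests

url = 'http://httpbin.org/post'
payload = {'first_name': {'last_name': 'rose'}}

# prepare() builds the body and headers locally; nothing is sent
as_form = requests.Request('POST', url, data=payload).prepare()
as_json = requests.Request('POST', url, json=payload).prepare()

# Form encoding keeps only the inner dict's key
print(as_form.headers['Content-Type'], as_form.body)

# JSON encoding keeps the whole nested structure
print(as_json.headers['Content-Type'], json.loads(as_json.body))
```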

headers(): request headers; it is best to send headers with every request

def headers():
    url = 'http://httpbin.org/post'
    data = {'username': 'rose@126.com', 'password': 'aadaf522'}

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
    }

    response = requests.request(method='POST', url=url, json=data, headers=headers)

cookies(): get or set cookies

def cookies():
    url = 'http://dig.chouti.com/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
    }

    r = requests.get(url=url, headers=headers)
    print(r.cookies)
    print(r.cookies.get_dict())

Output:

<RequestsCookieJar[<Cookie gpsd=c2b86684b98a6e0628989195b342ebe8 for .chouti.com/>, <Cookie JSESSIONID=aaa3NHhDIoimCcf5kLJPw for dig.chouti.com/>]>
{'gpsd': 'c2b86684b98a6e0628989195b342ebe8', 'JSESSIONID': 'aaa3NHhDIoimCcf5kLJPw'}

The cookies attribute returns the cookies as a RequestsCookieJar; call get_dict() on it to get the actual values as a dict.

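session(): maintains a session and keeps the client's state from earlier requests

When simulating a login, you normally have to fetch the cookies first and then send them back with later requests. With a session you no longer need to do this by hand, because it handles cookies for you automatically. It is typically used for the operations that follow a simulated login: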

def session():
    pass

Example

Simulate logging in to Chouti (dig.chouti.com) and upvote an article, implemented in two ways below:

Using cookies:

import requests

# Get the cookies
r1 = requests.get('http://dig.chouti.com/')
r1_cookies = r1.cookies.get_dict()


# Send a POST request to simulate the login, carrying the cookies returned by the first GET request
post_dict = {
    "phone": '8615131255089',
    'password': 'woshiniba',
    'oneMonth': 1
}
r2 = requests.post(
    url="http://dig.chouti.com/login",
    data=post_dict,
    cookies=r1_cookies
)

# Upvote the article
r3 = requests.post(
    url='http://dig.chouti.com/link/vote?linksId=11832246',
    cookies={'gpsd': r1_cookies.get('gpsd')}
)
print(r3.text)

Using a session:

import requests

session = requests.Session()

# Visit a page first to obtain cookies
r1 = session.get(url='http://dig.chouti.com/help/service')

# Log in; the session sends the cookies from the previous request automatically
r2 = session.post(
    url= 'http://dig.chouti.com/login',
    data={
        'phone': 'xxxx',
        'password': 'xxxx',
        'oneMonth': 1
    }
)

# Upvote an article
r3 = session.post(
    url="http://dig.chouti.com/link/vote?linksId=11837086",
)
print(r3.text)
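
How a session carries cookies from one request to the next can be observed without logging in anywhere. In the sketch below the cookie value, the example.com domain, and the URL are all made up; the cookie is set by hand to stand in for one that an earlier response would have set, and prepare_request() only builds the outgoing request.

```python
import requests

session = requests.Session()

# Pretend an earlier response set this cookie (the value and domain are made up)
session.cookies.set('gpsd', 'abc123', domain='example.com', path='/')

# prepare_request() merges the session's cookies into the outgoing request; nothing is sent
prepared = session.prepare_request(requests.Request('POST', 'http://example.com/link/vote'))

print(prepared.headers.get('Cookie'))
```

With plain requests.post() the same cookie would have to be passed explicitly through the cookies argument, as in the first implementation above.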

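timeout(): timeout setting

When the local network is poor, or the server responds too slowly or not at all, a request may wait a very long time or eventually fail without a response. We should therefore set a timeout so that an error is raised if no response arrives within that time.

def timeout():
    r = requests.request(method='get', url='https://www.google.com/', timeout=5)

    # A request has two phases, connect and read. A single value applies to each phase separately (it is not a combined total); to set them individually, pass a (connect, read) tuple
    r = requests.request(method='get', url='https://www.google.com/', timeout=(5, 10))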
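
The two timeout phases can be tried out locally. The sketch below is an illustration under stated assumptions: a throwaway local socket that accepts connections but never replies stands in for a stalled server, and the short timeout values are chosen only to keep the run brief.

```python
import socket
import requests

# A local socket that accepts TCP connections but never answers (stand-in for a stalled server)
silent = socket.socket()
silent.bind(('127.0.0.1', 0))   # port 0 lets the OS pick a free port
silent.listen(1)
port = silent.getsockname()[1]

try:
    # (connect timeout, read timeout) in seconds
    requests.get(f'http://127.0.0.1:{port}/', timeout=(1, 0.5))
    outcome = 'response received'
except requests.exceptions.ConnectTimeout:
    outcome = 'connect timeout'
except requests.exceptions.ReadTimeout:
    outcome = 'read timeout'
finally:
    silent.close()

print(outcome)
```

Both exception classes derive from requests.exceptions.Timeout, so catching that single class is enough when the phase does not matter.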

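proxies(): proxy settings

Many sites now block IP addresses to stop crawlers that scrape pages at scale: once requests become too frequent, the IP is banned. Setting proxies (and rotating IPs through them) works around this:

def proxies():
    r = requests.request(method='get', url='https://www.google.com/',
        proxies = {
            # http and https traffic go through different proxies
            'http': 'http://10.10.1.10:3128',
            'https': 'https://10.11.1.11:3824'
        }
    )

requests also supports SOCKS proxies. The optional SOCKS dependency must be installed first: pip3 install 'requests[socks]'

proxies = {
            # http and https traffic both go through a SOCKS5 proxy
            'http': 'socks5://user:password@host:port',
            'https': 'socks5://user:password@host:port'
        }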

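files(): file upload

Some sites require a file upload, which can be done with this parameter:

def file():
    # The dict key is the field name the backend uses to retrieve the upload
    files = {'f1': open('xxx.txt', 'rb')}
    # Uploads travel in the request body, so use POST and the `files` keyword;
    # the httpbin URL is only a placeholder, since the original omitted the URL
    r = requests.request(method='post', url='http://httpbin.org/post', files=files)

    # Custom file name: upload xxx.txt under the name test.txt
    files = {'f1': ('test.txt', open('xxx.txt', 'rb'))}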
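
What the files argument produces can be inspected without uploading anything. The sketch below makes a few assumptions for illustration: an in-memory io.BytesIO object stands in for open('xxx.txt', 'rb'), and the httpbin URL is only a placeholder because prepare() sends nothing.

```python
import io
import requests

# In-memory stand-in for open('xxx.txt', 'rb'); the URL is only a placeholder
fake_file = io.BytesIO(b'hello')
files = {'f1': ('test.txt', fake_file)}

prepared = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()

# The body is multipart/form-data with a generated boundary
print(prepared.headers['Content-Type'])
# The field name and the custom file name both appear in the encoded body
print(b'name="f1"' in prepared.body, b'filename="test.txt"' in prepared.body)
```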

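auth(): basic authentication. Some sites require authentication before a page can be accessed; the client then has to send the user name and password in the request headers (Base64-encoded, not encrypted). The auth parameter of requests handles this. If the credentials are correct the status code is 200; otherwise it is 401.

import requests
from requests.auth import HTTPBasicAuth

# Placeholder URL (the original omitted it); this httpbin endpoint accepts exactly these credentials
url = 'http://httpbin.org/basic-auth/username/password'

r = requests.request(method='get', url=url, auth=HTTPBasicAuth('username', 'password'))
print(r.status_code)

# Passing an HTTPBasicAuth instance every time is a little verbose; requests also accepts a plain tuple, which uses HTTPBasicAuth by default
r = requests.request(method='get', url=url, auth=('username', 'password'))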
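
The header that basic authentication adds can be reproduced by hand, which also shows that the credentials are only encoded. In this sketch the example.com URL is a placeholder and the request is prepared but never sent.

```python
import base64
import requests

# Placeholder URL; prepare() builds the headers locally and sends nothing
prepared = requests.Request('GET', 'http://example.com/', auth=('username', 'password')).prepare()

# Basic auth is just "username:password" encoded with Base64
expected = 'Basic ' + base64.b64encode(b'username:password').decode('ascii')

print(prepared.headers['Authorization'] == expected)
```

Because Base64 is trivially reversible, basic authentication is normally combined with HTTPS.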

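allow_redirects(): whether redirects are followed

r = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)

verify(), cert(): SSL certificate verification

When requesting an https site, the SSL certificate is checked by default. If the certificate is not trusted by an official CA, a certificate verification error is raised.

The verify parameter controls whether the certificate is checked: False skips the check, while leaving it unset or passing True performs it.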

def verify():
    r = requests.request(method='get', url='https://www.12306.cn')
    print(r.status_code)
    print(r.text)

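12306's certificate used not to be trusted, and a request would fail with the following error (the certificate has since been trusted):

requests.exceptions.SSLError: ("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')],)",)

cert() specifies the certificate file: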

def cert():
    r = requests.request(method='get', url='https://www.12306.cn', cert=('/path/server.crt', '/path/key'))

    # cert='xxx.pem'

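stream(): request/download the response as a stream. With True the body is downloaded piece by piece as it is read; with False it is downloaded in full immediately: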

def stream():
    r = requests.request(method='get', url='https://www.12306.cn', stream=True)
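
With stream=True the body is usually consumed in chunks through iter_content(). The sketch below is self-contained under a few assumptions: a throwaway local http.server instance stands in for a real site, and the 10000-byte body and 4096-byte chunk size are arbitrary.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'x' * 10000  # arbitrary payload for the demo
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Local stand-in server on a free port
server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f'http://127.0.0.1:{server.server_port}/'

total = 0
with requests.get(url, stream=True) as r:
    # The body is read chunk by chunk instead of being loaded into memory at once
    for chunk in r.iter_content(chunk_size=4096):
        total += len(chunk)

server.shutdown()
server.server_close()
print(total)
```

For large downloads each chunk would typically be written to a file inside the loop.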

3. A general code framework for fetching web pages

import requests

def get_html_text(url):
    try:
        r = requests.get(url, timeout=30)  # set a timeout
        r.raise_for_status()  # raise HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return 'An exception occurred'


if __name__ == '__main__':
    url = 'http://www.baidu.com'
    print(get_html_text(url))

Reposted from: https://www.cnblogs.com/midworld/p/10847277.html
