Python libraries: requests (using sessions and cookies) and httpx

擒贼先擒王

Last modified: 2024-06-26 17:33:41

First published: 2017-03-08 14:35:23

Article link: https://blog.csdn.net/freeking101/article/details/60868350

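1. Using requests

Official Requests documentation: http://cn.python-requests.org/zh_CN/latest

User guide

Starts with the background of Requests, then walks through its key features one by one.

API documentation / guide

If you need details on a specific function, class, or method, this part of the documentation is for you.

Developer interface

Example: basic usage

Install: pip install requests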

import json
import requests

# HTTP request methods
r = requests.get('https://github.com/timeline.json')  # GET
r = requests.post("http://m.ctrip.com/post")  # POST
r = requests.put("http://m.ctrip.com/put")  # PUT
r = requests.delete("http://m.ctrip.com/delete")  # DELETE
r = requests.head("http://m.ctrip.com/head")  # HEAD
r = requests.options("http://m.ctrip.com/get")  # OPTIONS

# Read the response body
print(r.content)  # raw bytes; non-ASCII text such as Chinese appears as escaped bytes
print(r.text)  # text decoded with r.encoding

payload = {'keyword': '日本', 'salecityid': '2'}
# Pass query parameters in the URL
r = requests.get("https://m.ctrip.com/webapp/tourvisa/visa_list", params=payload)
print(r.url)

# Inspect or change the response encoding
r = requests.get('https://github.com/timeline.json')
print(r.encoding)
# Override the encoding used by r.text
r.encoding = 'utf-8'

# JSON handling
r = requests.get('https://github.com/timeline.json')
print(r.json())  # built-in decoder, no import json needed; raises an error if the body is not valid JSON

# Custom request headers (same for GET and POST)
url = 'https://m.ctrip.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.62'
}
r = requests.post(url, headers=headers)  # or: r = requests.get(url, headers=headers)
print(r.request.headers)


# POST with a JSON body
url = 'https://m.ctrip.com'
payload = {'some': 'data'}
# A dict passed to data= is form-encoded; to send JSON, serialize it to a string with json.dumps first
r = requests.post(url, data=json.dumps(payload))

# POST a multipart-encoded file
url = 'https://m.ctrip.com'
with open('report.xls', 'rb') as f:  # the with block closes the file handle afterwards
    r = requests.post(url, files={'file': f})


r = requests.get('http://m.ctrip.com')
# Response status code
print(r.status_code)

# Response headers
print(r.headers)

# Two ways to read a single response header (keys are case-insensitive)
print(r.headers['Content-Type'])
print(r.headers.get('content-type'))

# Cookies
url = 'https://example.com/some/cookie/setting/url'
r = requests.get(url)
r.cookies['example_cookie_name']  # read a cookie from the response

url = 'https://m.ctrip.com/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)  # send cookies

# Set a timeout in seconds (0.001 is deliberately tiny and will almost always raise requests.exceptions.Timeout)
r = requests.get('https://m.ctrip.com', timeout=0.001)

# Use proxies (same for GET and POST)
proxies = {
    "http": "http://10.10.10.10:8888",
    "https": "http://10.10.10.100:4444",
}
r = requests.get('https://m.ctrip.com', proxies=proxies)

Sending GET requests and passing parameters

import requests

r = requests.get("http://httpbin.org/get")
print(type(r))
print(r.status_code)  # response status code
print(r.encoding)  # response encoding
print(r.text)  # response body as str, decoded with r.encoding
print(r.content)  # response body as bytes; gzip and deflate are decompressed automatically
print(r.cookies)  # cookies set by the response

requests.get('https://github.com/timeline.json')   # GET request
requests.post('https://httpbin.org/post')           # POST request
requests.put('https://httpbin.org/put')             # PUT request
requests.delete('https://httpbin.org/delete')       # DELETE request
requests.head('https://httpbin.org/get')            # HEAD request
requests.options('https://httpbin.org/get')         # OPTIONS request

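The requests module

For a GET request, query parameters can be written directly after the ? in the URL, or put in a dict and passed as params.
For a POST request, form parameters go in a dict passed as data.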

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)
print(r.url)

Output:

http://httpbin.org/get?key1=value1&key2=value2

url = "https://www.baidu.com/s"
r = requests.get(url, params={'wd': 'python'})
print(r.url)
r = requests.get(url, params={'wd': 'php'})
print(r.url)
print(r.text)

Example: requests with parameters

import requests

# GET with parameters
requests.get('https://www.baidu.com/s', params={'wd': 'python'})
# POST with parameters
requests.post(
    'https://www.itwhy.org/wp-comments-post.php',
    data={'comment': '测试POST'}
)

# GET passes data through the params argument
# POST passes data through the data argument

r = requests.get("https://httpbin.org/get", params={'wd': 'python'})  # GET example
print(r.url)
print(r.text)

# With POST, data passed through params is appended to the URL query string and is visible there
r = requests.post("https://httpbin.org/post", params={'wd': 'python'})
print(r.url)
print(r.text)

# With POST, data passed through data goes in the request body and does not appear in the URL
r = requests.post(
    "https://httpbin.org/post",
    data={'comment': 'TEST_POST'}
)  # POST example
print(r.url)
print(r.text)
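
The difference between params and data can also be checked without sending anything. The sketch below is only an illustration: it uses requests.Request(...).prepare() to build the requests locally and print where each argument ends up. The httpbin.org URLs are just the ones used in the examples above.

```python
import requests

# Build the requests without sending them, so no network access is needed.
get_req = requests.Request(
    "GET", "https://httpbin.org/get", params={"wd": "python"}
).prepare()
post_req = requests.Request(
    "POST", "https://httpbin.org/post", data={"comment": "TEST_POST"}
).prepare()

print(get_req.url)                       # params end up in the query string
print(post_req.url)                      # data does not touch the URL
print(post_req.body)                     # data is form-encoded into the body
print(post_req.headers["Content-Type"])  # set automatically for form data
```

Preparing a request this way is handy for debugging, because it shows exactly what would go on the wire.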

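To request a JSON document, parse the response with the json() method. For example, suppose a file named a.json has the following content:

["foo", "bar", {
  "foo": "bar"
}]

The following program requests and parses it:

import requests

# a.json stands for a JSON file hosted on a server.
# http://xxx.com/a.json is a placeholder URL; substitute a real one.
r = requests.get("http://xxx.com/a.json")
print(r.text)
print(r.json())

The output is shown below: the first print shows the raw text, the second shows the Python object returned by json().

["foo", "bar", {
  "foo": "bar"
}]
['foo', 'bar', {'foo': 'bar'}]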

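To get the raw socket response from the server, use r.raw. This requires stream=True on the initial request.

>>> r = requests.get('https://github.com/timeline.json', stream=True)
>>> r.raw
<requests.packages.urllib3.response.HTTPResponse object at 0x101194810>
>>> r.raw.read(10)
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

This returns the raw, undecoded bytes of the response (here the start of a gzip stream). To add request headers, pass the headers parameter: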

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
headers = {'content-type': 'application/json'}
r = requests.get("http://httpbin.org/get", params=payload, headers=headers)
print(r.url)

The headers parameter adds entries to the request headers.

Custom headers

import requests
import json
 
data = {'some': 'data'}
headers = {
    'content-type': 'application/json',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0)'
}
 
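# The Content-Type says JSON, so serialize the dict instead of letting requests form-encode it
r = requests.post('https://api.github.com/some/endpoint', data=json.dumps(data), headers=headers)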
print(r.text)

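The headers that were actually sent can be read from r.request.headers.

>>> r.request.headers

Custom request headers: build a headers dict and pass it with the headers parameter

Spoofing request headers is common when scraping. The default User-Agent identifies the client as python-requests; it can be replaced like this: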

r = requests.get('http://www.zhidaow.com')
print(r.request.headers['User-Agent'])

headers = {'User-Agent': 'alexkh'}
r = requests.get('http://www.zhidaow.com', headers = headers)
print(r.request.headers['User-Agent'])

POST requests

Request parameters: check how the parameters appear in the captured traffic (browser DevTools):

Query String Parameters ---> url (or params)
Form Data ---> requests.post(url, data=dict)
Request Payload ---> requests.post(url, data=json.dumps(dict), headers={"Content-Type": "application/json"})

"""
requests  ->

    get:
        Query String Parameters  ->  url
        append ?xxx=xxx&xxxx=xxxx to the URL
        params -> the same parameters can also be passed here

    post:
        Form Data
            pass the dict to data
            requests.post(url, data=dict)
        Request Payload
            serialize the dict to JSON and pass it to data
            (once serialized, JSON is a string)
            and set the request header Content-Type: application/json
"""

A POST request usually carries parameters. The most basic way to pass them is the data parameter.

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.text)

On success, the server echoes back the data we sent. To send JSON instead of form data, serialize the dict with json.dumps().

POSTing JSON data:

import json
import requests

url = 'http://httpbin.org/post'
payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))
print(r.text)

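The data parameter

When data is a dict, requests form-encodes it (application/x-www-form-urlencoded), the default encoding for HTML form submissions.
When data is a string, it is sent as-is, so you control the format (a JSON string, for example). In that case set 'Content-Type' in headers yourself to 'application/json' or whatever MIME type matches your data.
data accepts a dict, a list of tuples, bytes, or a file-like object.

The json parameter

The json parameter is a more direct way to send JSON. requests serializes the dict to JSON and sets Content-Type to application/json automatically.
With json there is no need to call json.dumps() yourself.

Summary

data is the more general option: use it for form submissions or custom body formats. To send JSON through data you must set the Content-Type header yourself.
json is the convenient option for JSON bodies: no manual Content-Type is needed because requests handles it.

When the POST body is a form (Form Data), data is a dict.

When the POST body is a payload (Request Payload), data is a string, not a dict.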

import requests
import json

headers = {
    "Content-Type": "application/json",
    "u-sign": "f4f3ddab7f10a7d927fdfef3a7a3ca2d",
}
url = "https://uwf7de983aad7a717eb.youzy.cn/youzy.dms.datalib.api.enrolldata.enter.college.encrypted.v2.get"
data_dict = {
    "collegeCode": "10001",
    "provinceCode": 37
}
data_string = json.dumps(data_dict, separators=(',', ':'))
response = requests.post(url, headers=headers, data=data_string)

print(response.text)
print(response)

In general, json is more convenient for submitting JSON to a server, and data is the better fit for form data or other formats.
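
To make the comparison concrete, here is a small offline sketch that prepares the same payload three ways and prints the Content-Type and body requests would send. The payload reuses the field names from the example above; the httpbin.org URL is an assumed stand-in endpoint, and nothing is actually sent.

```python
import json

import requests

payload = {"collegeCode": "10001", "provinceCode": 37}
url = "https://httpbin.org/post"  # assumed stand-in endpoint; never contacted

# 1. data=dict -> form-encoded body
form_req = requests.Request("POST", url, data=payload).prepare()
# 2. data=str  -> sent as-is, Content-Type must be set by hand
raw_req = requests.Request(
    "POST", url,
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
).prepare()
# 3. json=dict -> serialized and labeled automatically
json_req = requests.Request("POST", url, json=payload).prepare()

for name, req in [("data=dict", form_req), ("data=str", raw_req), ("json=dict", json_req)]:
    print(name, "|", req.headers.get("Content-Type"), "|", req.body)
```

Options 2 and 3 produce equivalent JSON bodies; option 3 simply saves the manual header and the json.dumps() call.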

To upload a file, use the files parameter. Create a file named test.txt containing Hello World!

import requests

url = 'http://httpbin.org/post'
with open('test.txt', 'rb') as f:  # the with block closes the file afterwards
    r = requests.post(url, files={'file': f})
print(r.text)

Output:

{
  "args": {}, 
  "data": "", 
  "files": {
    "file": "Hello World!"
  }, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "156", 
    "Content-Type": "multipart/form-data; boundary=7d8eb5ff99a04c11bb3e862ce78d7000", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.9.1"
  }, 
  "json": null, 
  "url": "http://httpbin.org/post"
}

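The file has now been uploaded.
requests also supports streaming uploads, which let you send large streams or files without reading them into memory first. To stream an upload, pass a file-like object as the request body. Open the file in binary mode so that requests can determine the content length correctly.

with open('massive-body', 'rb') as f:
    requests.post('http://some.url/streamed', data=f)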

import requests
 
url = 'http://127.0.0.1:5000/upload'
files = {'file': open('/home/lyb/sjzl.mpg', 'rb')}
# files = {'file': ('report.jpg', open('/home/lyb/sjzl.mpg', 'rb'))}     # set the filename explicitly
 
r = requests.post(url, files=files)
print(r.text)

You can also upload a string as if it were a file:

import requests
 
url = 'http://127.0.0.1:5000/upload'
files = {'file': ('test.txt', b'Hello Requests.')}     # the filename must be given explicitly
 
r = requests.post(url, files=files)
print(r.text)

POSTing a file is what happens when you upload an image or a document to a website. Use the files parameter:

>>> url = 'http://httpbin.org/post'
>>> files = {'file': open('touxiang.png', 'rb')}
>>> r = requests.post(url, files=files)

Simulating a login with POST, and some Response methods

import requests

url1 = 'https://www.example.com/login'  # login URL
url2 = "https://www.example.com/main"  # page that requires login
data = {"user": "user", "password": "pass"}
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "Accept-Encoding": "gzip",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Referer": "https://www.example.com/",
    "User-Agent": ""
}
res1 = requests.post(url1, data=data, headers=headers)
# stream=True keeps the raw stream readable until the body is consumed
res2 = requests.get(url2, cookies=res1.cookies, headers=headers, stream=True)
print(res2.raw)  # raw urllib3 response; requires stream=True
print(res2.raw.read(50))  # read the raw stream before touching .content
print(res2.content)  # remaining body as bytes
print(type(res2.text))  # body decoded to str
print(res2.url)
print(res2.history)  # redirect chain
print(res2.cookies)
print(res2.cookies.get('example_cookie_name'))  # None if the cookie is absent
print(res2.headers)
print(res2.headers['Content-Type'])
print(res2.headers.get('content-type'))
print(res2.encoding)  # encoding used to decode .text
print(res2.status_code)  # HTTP status code
res2.raise_for_status()  # raises HTTPError for 4xx/5xx, returns None otherwise
print(res2.json())  # parse the body as JSON; raises if the body is not JSON

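The Response object

Every requests call returns a Response object holding the server's reply, such as the r.text and r.status_code seen in the examples above.
Reading the body as text: when you access r.text, requests decodes the body with the response's encoding. You can set r.encoding to make r.text decode with an encoding of your choice.

r.status_code    # response status code
r.raw            # raw response, i.e. the underlying urllib3 response object; read it with r.raw.read()
r.content        # body as bytes; gzip and deflate are decoded automatically
r.text           # body as str, decoded using the charset from the response headers
r.headers        # response headers as a dict-like object with case-insensitive keys; r.headers.get() returns None for a missing key

# Special methods
r.json()                # built-in JSON decoder
r.raise_for_status()    # raises an exception for failed requests (4xx and 5xx responses)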
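
The two less obvious behaviours in this list, case-insensitive headers and raise_for_status(), can be tried offline. The sketch below is only an illustration: it builds Response objects by hand, which real code would never need to do, purely to avoid a network call.

```python
import requests
from requests.structures import CaseInsensitiveDict

# r.headers is a CaseInsensitiveDict: lookups ignore the case of the key.
headers = CaseInsensitiveDict({"Content-Type": "application/json"})
same_value = headers["content-type"] == headers["CONTENT-TYPE"]
missing = headers.get("X-Missing")  # .get() returns None for a missing key

# Hand-built Response objects, used here only to show raise_for_status() offline.
ok = requests.Response()
ok.status_code = 200
not_found = requests.Response()
not_found.status_code = 404

ok.raise_for_status()  # 2xx: no exception
try:
    not_found.raise_for_status()
    raised = False
except requests.exceptions.HTTPError:
    raised = True

print(same_value, missing, raised)
```

Note that indexing with a missing key (headers["X-Missing"]) raises KeyError like a normal dict; only .get() returns None.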

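Cookies

If a response contains cookies, they are available through its cookies attribute. Session objects let you persist certain parameters across requests; most usefully, cookies are kept automatically across all requests made from the same Session instance.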

import requests

url = 'http://example.com'
r = requests.get(url)
print(r.cookies)
print(r.cookies['example_cookie_name'])

The program above is only a sample: the cookies attribute gives you the cookies set by the site. To send cookies to the server, use the cookies parameter:

import requests

url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)
print(r.text)

Output:
'{"cookies": {"cookies_are": "working"}}'

If a response contains cookies, you can access them quickly:

import requests

r = requests.get('http://www.google.com.hk/')
print(r.cookies['NID'])
print(tuple(r.cookies))

To send your own cookies to the server, use the cookies parameter:

import requests
 
url = 'http://httpbin.org/cookies'
cookies = {'testCookies_1': 'Hello_Python3', 'testCookies_2': 'Hello_Requests'}
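# Cookie Version 0 forbids special characters such as spaces, square brackets, parentheses, equals signs, commas, double quotes, slashes, question marks, @, colons and semicolons in cookie values.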
r = requests.get(url, cookies=cookies)
print(r.json())
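
A plain dict sends the same cookies to every URL. When cookies should only go to a particular domain or path, a RequestsCookieJar can be used instead. The sketch below is an illustration that runs offline: it attaches a jar to a Session and prepares a request to see which cookie would be sent. The cookie names and values are made up.

```python
import requests

jar = requests.cookies.RequestsCookieJar()
# Hypothetical cookies, each scoped to a different path on the same domain
jar.set("tasty_cookie", "yum", domain="httpbin.org", path="/cookies")
jar.set("gross_cookie", "blech", domain="httpbin.org", path="/elsewhere")

session = requests.Session()
session.cookies = jar

# prepare_request() builds the request with the session's cookies but does not send it
prepped = session.prepare_request(
    requests.Request("GET", "http://httpbin.org/cookies")
)
cookie_header = prepped.headers.get("Cookie")
print(cookie_header)  # only the cookie whose path matches /cookies is included
```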

Below is a Kuaipan sign-in script:

import requests
 
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Encoding': 'gzip, deflate, compress',
           'Accept-Language': 'en-us;q=0.5,en;q=0.3',
           'Cache-Control': 'max-age=0',
           'Connection': 'keep-alive',
           'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
 
s = requests.Session()
s.headers.update(headers)
# s.auth = ('superuser', '123')
s.get('https://www.kuaipan.cn/account_login.htm')
 
_URL = 'http://www.kuaipan.cn/index.php'
s.post(_URL, params={'ac':'account', 'op':'login'},
       data={'username':'****@foxmail.com', 'userpwd':'********', 'isajax':'yes'})
r = s.get(_URL, params={'ac':'zone', 'op':'taskdetail'})
print(r.json())
s.get(_URL, params={'ac':'common', 'op':'usersign'})

Reading cookies from a response

>>> r = requests.get('http://www.baidu.com')
>>> r.cookies['BAIDUID']
'D5810267346AEFB0F25CB0D6D0E043E6:FG=1'

You can also define the cookies to send with a request:
>>> url = 'http://httpbin.org/cookies'
>>> cookies = {'cookies_are':'working'}
>>> r = requests.get(url,cookies = cookies)
>>> 
>>> print(r.text)
{
  "cookies": {
    "cookies_are": "working"
  }
}
>>>

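Timeouts

Use the timeout parameter to cap how long a request may wait. The value is in seconds and applies to establishing the connection and to each wait for data from the server; it is not a limit on the total download time of the response body.

requests.get('http://github.com', timeout=0.001)

In other words, timeout only bounds how long requests waits for the server to respond. If the response is large and takes a long time to download, timeout will not cut the download short as long as data keeps arriving.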
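
The timeout argument also accepts a (connect, read) tuple, so the two waits can be limited separately. The sketch below is a self-contained illustration, not a recipe for production code: it starts a throwaway local server (the hypothetical SlowHandler) that waits one second before answering, then requests it with a 0.2 second read timeout.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical local server that waits 1 second before answering."""

    def do_GET(self):
        time.sleep(1.0)
        try:
            self.send_response(200)
            self.send_header("Content-Length", "4")
            self.end_headers()
            self.wfile.write(b"late")
        except OSError:
            pass  # the client has already given up and closed the socket

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local demo

try:
    # (connect timeout, read timeout) in seconds
    session.get(url, timeout=(3.05, 0.2))
    outcome = "ok"
except requests.exceptions.ConnectTimeout:
    outcome = "connect timeout"
except requests.exceptions.ReadTimeout:
    outcome = "read timeout"
finally:
    server.shutdown()
    server.server_close()

print(outcome)
```

The connection itself is established immediately, so it is the read timeout that fires here. Both exceptions are subclasses of requests.exceptions.Timeout, which can be caught when the distinction does not matter.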

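Session

Crawling with requests: http://www.cnblogs.com/lucky-pin/p/5806394.html

Before looking at Session and Cookie, you need to understand one property of HTTP: it is stateless.

Stateless HTTP

Stateless means that the HTTP protocol has no memory of earlier transactions; the server does not know what state the client is in. The client sends a request, the server parses it and returns a response, and each such exchange is completely independent. The server does not record what changed between requests. If a later request depends on earlier information, the client would have to send that information again, which wastes resources and is especially awkward for pages that require login.
Two techniques emerged for keeping state across HTTP requests: Session and Cookie.
A Session lives on the server side, where the website stores each user's session information.
A Cookie lives on the client side, i.e. in the browser. Once it has a cookie, the browser attaches it automatically the next time it visits the same site. The server reads the cookie to identify the user, checks whether that user is logged in, and returns the appropriate response.
Put another way, the cookie holds the login credential: the client only needs to carry it on the next request and does not have to enter a username and password again.
So when a crawler needs pages that require login, we usually take the cookies obtained after a successful login and send them in the request headers, rather than simulating the login every time.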

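What is a Session

A session is a series of actions or messages with a beginning and an end. A phone call is an example: everything from picking up the phone and dialing to hanging up is one session.
On the web, a Session object stores the attributes and configuration needed for one user's session. Variables stored in it survive as the user moves between pages of the application and last for the whole session. When a user who has no session yet requests a page, the web server creates a Session object automatically. When the session expires or is abandoned, the server ends it.

What is a Cookie

Cookies are data that a website stores on the user's device in order to identify the user and track the session.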

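How a session is maintained

How are cookies used to keep state? On the client's first request, the server returns a response whose headers include a Set-Cookie field that marks the user. The browser saves the cookie and sends it in the request headers the next time it requests the same site. The cookie carries the session ID, so the server can look up the matching session and determine the user's state from it. If the session is still valid, the user is logged in, and the server returns the content that is only visible after login.
Conversely, if the cookie sent to the server is invalid or the session has expired, the client can no longer access the page; it may receive an error response or be redirected to the login page.
Cookie and Session work together, one on the client and one on the server, to implement login control.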
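
The round trip described above can be imitated in a few lines of standard-library Python. This is a toy sketch under obvious assumptions: server_handle, the sessions dict and the sessionid cookie name are all made up for illustration, and no real HTTP traffic is involved.

```python
import uuid
from http.cookies import SimpleCookie

sessions = {}  # server-side store: session id -> session data


def server_handle(cookie_header, login_as=None):
    """Toy server: returns (body, Set-Cookie header or None) for one request."""
    cookie = SimpleCookie(cookie_header or "")
    sid = cookie["sessionid"].value if "sessionid" in cookie else None
    if login_as is not None:
        # Login: create a session on the server and hand its id to the client
        sid = uuid.uuid4().hex
        sessions[sid] = {"user": login_as}
        return "logged in", "sessionid=%s; Path=/; HttpOnly" % sid
    if sid in sessions:
        return "hello " + sessions[sid]["user"], None
    return "please log in", None


# Client side: log in once, store the cookie, send it back on the next request
_, set_cookie = server_handle(None, login_as="alice")
jar = SimpleCookie(set_cookie)
cookie_header = "; ".join("%s=%s" % (name, morsel.value) for name, morsel in jar.items())

with_cookie, _ = server_handle(cookie_header)
without_cookie, _ = server_handle(None)
print(with_cookie)
print(without_cookie)
```

The cookie only carries an opaque ID; the user data stays in the server-side sessions dict, which is why deleting or expiring the entry there logs the client out.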

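A common misconception

A frequent misunderstanding is that closing the browser destroys the session. Think of a membership card: unless the customer asks to cancel it, the shop does not delete the customer's record. Sessions work the same way: the server keeps a session until the application tells it to delete it, which typically happens when the user logs out.
When we close the browser, it does not notify the server first, so the server has no way of knowing that the browser was closed. The misconception arises because most sites store the session ID in a session cookie, which disappears when the browser closes, so on the next visit the original session cannot be found. If the cookie set by the server is saved to disk, or the HTTP request headers are rewritten to send the original cookie, the original session ID is found again and the user stays logged in.
Precisely because closing the browser does not delete the session, the server must give each session an expiry time. Once the time since the client last used the session exceeds it, the server can assume the client has stopped and delete the session to save storage.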

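Using a session in requests

Steps:
1. Create a session object: s = requests.Session()
2. Make requests through that object: r = s.post(url, data=user)

In the earlier examples, every call started a brand-new request, as if each one were made from a separate, freshly opened browser. They do not share a session, even when they request the same URL.

For example: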

import requests

requests.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = requests.get("http://httpbin.org/cookies")
print(r.text)

Result:
{
  "cookies": {}
}

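Clearly the two requests are not in the same session, so the cookie is not available. What if a site requires a persistent session? It is like browsing Taobao in one browser and switching between tabs: all the tabs share one long-lived session.

The solution: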

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")
print(r.text)

Here we make two requests: one sets the cookie, the other reads it back. Output:

{
  "cookies": {
    "sessioncookie": "123456789"
  }
}

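The cookie is now returned, which is exactly what a session is for. Since the session is shared by all requests made through it, it can also hold configuration that applies to all of them.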

import requests

s = requests.Session()
s.headers.update({'x-test': 'true'})
r = s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})
print(r.text)

s.headers.update sets a header on the session, and the request passes another one through headers. What happens?

Both headers are sent. Output:

{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.9.1", 
    "X-Test": "true", 
    "X-Test2": "true"
  }
}

What if the headers passed to get also contain x-test?

r = s.get('http://httpbin.org/headers', headers={'x-test': 'true'})

The per-request value overrides the session-level one:

{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.9.1", 
    "X-Test": "true"
  }
}

What if you want to drop one of the session-level headers for a single request? Set it to None:

r = s.get('http://httpbin.org/headers', headers={'x-test': None})

Output:

{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.9.1"
  }
}

That covers the basic usage of sessions.
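
The three header-merging cases can be reproduced without a network connection by asking the session to prepare a request instead of sending it. This is just a sketch; the helper name prepared_headers and the values "session" and "request" are chosen for illustration so the override is visible.

```python
import requests

s = requests.Session()
s.headers.update({"x-test": "session"})
url = "http://httpbin.org/headers"


def prepared_headers(extra):
    """Return the headers the session would send for a GET with these per-request headers."""
    return s.prepare_request(requests.Request("GET", url, headers=extra)).headers


merged = prepared_headers({"x-test2": "true"})      # both headers are sent
overridden = prepared_headers({"x-test": "request"})  # per-request value wins
removed = prepared_headers({"x-test": None})        # None drops the header for this request

print(merged["x-test"], merged["x-test2"])
print(overridden["x-test"])
print("x-test" in removed)
print(s.headers["x-test"])  # the session-level value itself is untouched
```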

Using a Session with Prepared Requests:

import requests

s = requests.Session()
url1 = 'http://www.example.com/login'  # login URL
url2 = "http://www.example.com/main"  # page that requires login
data = {"user": "user", "password": "pass"}
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "Accept-Encoding": "gzip",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Referer": "http://www.example.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
}

# s.prepare_request() merges session state (cookies, headers) into the request;
# Request.prepare() on its own would ignore the session's cookies.
req1 = requests.Request('POST', url1, data=data, headers=headers)
prepped1 = s.prepare_request(req1)
# prepped1.body and prepped1.headers can be inspected or modified here
s.send(prepped1)

# Prepare the second request after the login so that it carries the login cookies
req2 = requests.Request('POST', url2, headers=headers)
prepped2 = s.prepare_request(req2)
res2 = s.send(prepped2)

print(res2.content)

Another way to write it:

import requests

s = requests.Session()
url1 = 'http://www.example.com/login'  # login URL
url2 = "http://www.example.com/main"  # page that requires login
data = {"user": "user", "password": "pass"}
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "Accept-Encoding": "gzip",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Referer": "http://www.example.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
}
s.headers.update(headers)  # send these headers with every request in the session
res1 = s.post(url1, data=data)
res2 = s.post(url2)
print(res2.content)

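SSL certificate verification

HTTPS sites are everywhere now. Requests can verify SSL certificates for HTTPS requests, just like a web browser. The verify parameter controls this check.
When this was written, the 12306 certificate did not pass verification, so let's test with it: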

import requests

r = requests.get('https://kyfw.12306.cn/otn/', verify=True)
print(r.text)

Result:
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
As expected.

Now try GitHub:

import requests

r = requests.get('https://github.com', verify=True)
print(r.text)

The request succeeds (output omitted). To skip certificate verification for 12306, set verify to False:

import requests

r = requests.get('https://kyfw.12306.cn/otn/', verify=False)
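print(r.text)

Now the request goes through, although an InsecureRequestWarning is emitted. verify defaults to True, so you only need to set it when you want to disable or customize verification.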

Authentication

HTTP Basic Auth:

import requests
from requests.auth import HTTPBasicAuth
 
r = requests.get('https://httpbin.org/hidden-basic-auth/user/passwd', auth=HTTPBasicAuth('user', 'passwd'))
# r = requests.get('https://httpbin.org/hidden-basic-auth/user/passwd', auth=('user', 'passwd'))    # shorthand
print(r.json())

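Another very popular form of HTTP authentication is Digest Auth, which Requests also supports out of the box:

from requests.auth import HTTPDigestAuth

requests.get(URL, auth=HTTPDigestAuth('user', 'pass'))  # URL is a placeholder for a digest-protected endpoint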
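
What Basic Auth actually does is add one Authorization header containing the base64 of "user:password". The sketch below checks that offline by preparing a request rather than sending it; the URL is only an example target.

```python
import base64

import requests
from requests.auth import HTTPBasicAuth

url = "https://httpbin.org/basic-auth/user/passwd"  # example target; never contacted here
prepped = requests.Request("GET", url, auth=HTTPBasicAuth("user", "passwd")).prepare()

# Basic auth is just "Basic " + base64("user:password")
expected = "Basic " + base64.b64encode(b"user:passwd").decode("ascii")
auth_header = prepped.headers["Authorization"]
print(auth_header)
print(auth_header == expected)
```

Because base64 is an encoding and not encryption, Basic Auth should only be used over HTTPS.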

Setting HTTP and SOCKS proxies in requests

To use a proxy for a single request, pass the proxies parameter to any request method:

import requests

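proxies = {
  "http": "http://41.118.132.69:4433"  # the key is the scheme of the target URL, which is http here
}
r = requests.post("http://httpbin.org/post", proxies=proxies)
print(r.text)

Proxies can also be configured through the HTTP_PROXY and HTTPS_PROXY environment variables: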

export HTTP_PROXY="http://10.10.1.10:3128"
export HTTPS_PROXY="http://10.10.1.10:1080"

Proxies are often used when scraping to avoid IP bans. requests takes them through the proxies parameter.

import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

requests.get("http://www.zhidaow.com", proxies=proxies)

If the proxy requires a username and password, write it like this:

proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
}

Example:

import requests
 
proxy = '127.0.0.1:10809'
proxies = {
    'http': 'http://' + proxy,
    'https': 'http://' + proxy,  # the scheme describes how to reach the proxy itself; a local HTTP proxy uses http://
}
try:
    response = requests.get('http://httpbin.org/get', proxies=proxies)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)

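If the proxy requires authentication, put the username and password in front of the address:

proxy = 'username:password@127.0.0.1:10809'

To use a SOCKS5 proxy, first install SOCKS support:

pip3 install "requests[socks]"

With proxy software again running on the local machine, the crawler sets the proxy like this: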

import requests
 
proxy = '127.0.0.1:10808'
proxies = {
    'http': 'socks5://' + proxy,
    'https': 'socks5://' + proxy
}
try:
    response = requests.get('http://httpbin.org/get', proxies=proxies)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)

Another approach sets the proxy globally with the socks module:

import requests
import socks
import socket
 
socks.set_default_proxy(socks.SOCKS5, '127.0.0.1', 10808)
socket.socket = socks.socksocket
try:
    response = requests.get('http://httpbin.org/get')
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)

If the proxy address includes a scheme, use the proxy with that scheme; if it has none, prepend http://.
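
That rule is easy to wrap in a helper. The sketch below is one possible way to do it; normalize_proxy and build_proxies are hypothetical names, not part of requests.

```python
def normalize_proxy(proxy, default_scheme="http"):
    """Return the proxy address with a scheme, adding http:// when none is given."""
    proxy = proxy.strip()
    if "://" in proxy:
        return proxy  # scheme already present, e.g. socks5://host:port
    return default_scheme + "://" + proxy


def build_proxies(proxy):
    """Build a requests-style proxies dict that routes both http and https traffic."""
    proxy_url = normalize_proxy(proxy)
    return {"http": proxy_url, "https": proxy_url}


print(build_proxies("127.0.0.1:10809"))
print(build_proxies("socks5://127.0.0.1:10808"))
```

The resulting dict can be passed straight to the proxies parameter, for example requests.get(url, proxies=build_proxies("127.0.0.1:10809")).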

For Chrome, setting a proxy through Selenium is also straightforward:

from selenium import webdriver
 
proxy = '127.0.0.1:10809'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=http://' + proxy)
browser = webdriver.Chrome(options=chrome_options)  # Selenium 4 takes options=; the chrome_options= keyword was removed
browser.get('http://httpbin.org/get')

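SOCKS5 proxy

(Figure omitted: proxy settings in ccxt, shown just for fun.)

Alternatively, try adding the following code at the very top of your script, changing the port to match your proxy: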

import socks
import socket

# # set proxy
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 10808)
socket.socket = socks.socksocket

Scraping train ticket data with Requests

From observation, the data endpoint looks like this:

https://kyfw.12306.cn/otn/lcxxcx/query?purpose_codes=ADULT&queryDate=2015-05-23&from_station=NCG&to_station=CZQ

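It returns ticket information for trains from Nanchang to Chenzhou on 2015-05-23, as JSON.
The returned data looks like this (only part of it is shown):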

{"validateMessagesShowId":"_validatorMessage","status":true,"httpstatus":200,"data":{"datas":[{"train_no":"5u000G140101","station_train_code":"G1401","start_station_telecode":"NXG","start_station_name":"南昌西","end_station_telecode":"IZQ","end_station_name":"广州南","from_station_telecode":"NXG","from_station_name":"南昌西","to_station_telecode":"ICQ","to_station_name":"郴州西","start_time":"07:29","arrive_time":"10:42","day_difference":"0","train_class_name":"","lishi":"03:13","canWebBuy":"Y","lishiValue":"193","yp_info":"O030850182M0507000009097450000","control_train_day":"20991231","start_train_date":"20150523","seat_feature":"O3M393","yp_ex":"O0M090","train_seat_feature":"3","seat_types":"OM9","location_code":"G2","from_station_no":"01","to_station_no":"11","control_day":59,"sale_time":"0930","is_support_card":"1","note":"","gg_num":"--","gr_num":"--","qt_num":"--","rw_num":"--","rz_num":"--","tz_num":"--","wz_num":"--","yb_num":"--","yw_num":"--","yz_num":"--","ze_num":"182","zy_num":"无","swz_num":"无"}}

It looks messy, so let's tidy it up:

{
    "validateMessagesShowId":"_validatorMessage",
    "status":true,"httpstatus":200,
    "data":{
            "datas":[
                        {
                             "train_no":"5u000G140101",
                             "station_train_code":"G1401",
                             "start_station_telecode":"NXG",
                             "start_station_name":"南昌西",
                             "end_station_telecode":"IZQ",
                             "end_station_name":"广州南",
                             "from_station_telecode":"NXG",
                             "from_station_name":"南昌西",
                             "to_station_telecode":"ICQ",
                             "to_station_name":"郴州西",
                             "start_time":"07:29",
                             "arrive_time":"10:42",
                             "day_difference":"0",
                             ...
                             "swz_num":"无"
                        },
                        {
                              ...
                        }
                    ]
            }
}

With the structure clear, the code below extracts the fields we need.

import requests


class trainTicketsSprider:

    def getTicketsInfo(self, purpose_codes, queryDate, from_station, to_station):
        self.url = 'https://kyfw.12306.cn/otn/lcxxcx/query?purpose_codes=%s&queryDate=%s&from_station=%s&to_station=%s' % (purpose_codes, queryDate, from_station, to_station)
        self.headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;",
            "Accept-Encoding": "gzip",
            "Accept-Language": "zh-CN,zh;q=0.8",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
        }
        self.TicketSession = requests.Session()
        self.TicketSession.verify = False  # disable HTTPS certificate verification
        self.TicketSession.headers.update(self.headers)
        try:
            self.resp_json = self.TicketSession.get(self.url)
            self.ticketsDatas = self.resp_json.json()["data"]["datas"]
            return self.ticketsDatas
        except Exception as e:
            print(e)
            return []  # keep the caller's loop safe when the request or parsing fails


def isZero(num):
    # '--' and '无' both mean there are no tickets of that class
    if num in ('--', '无'):
        return '0'
    return num


def main():
    purpose_codes = 'ADULT'
    queryDate = '2015-05-23'
    from_station = 'NCG'
    to_station = 'CZQ'
    TicketSprider = trainTicketsSprider()
    res = TicketSprider.getTicketsInfo(purpose_codes, queryDate, from_station, to_station)
    for ticketInfo in res:
        print("Train: %s" % ticketInfo["station_train_code"])
        print("Origin: %s" % ticketInfo["start_station_name"])
        print("Destination: %s" % ticketInfo["to_station_name"])
        print("Departure time: %s" % ticketInfo["start_time"])
        print("Arrival time: %s" % ticketInfo["arrive_time"])
        print("Second-class seats left: %s" % isZero(ticketInfo["ze_num"]))
        print("Hard seats left: %s" % isZero(ticketInfo["yz_num"]))
        print("Hard sleepers left: %s" % isZero(ticketInfo["yw_num"]))
        print("Standing tickets left: %s" % isZero(ticketInfo["wz_num"]))
        print("Bookable online: %s" % ticketInfo["canWebBuy"])
        print("**********************************")


if __name__ == '__main__':
    main()

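POSTing multipart-encoded files with Requests

http://lovesoo.org/requests-post-multiple-part-encoding-multipart-encoded-file-format.html

For more, see:

1. Quickstart - Requests 2.18.1 documentation

2. Uploading Data - requests_toolbelt 0.10.1 documentation

Requests offers a simple way to POST multipart-encoded files, but it reads the whole file into memory before building and sending the request.
To send a very large file as multipart/form-data without loading it into memory, the request has to be streamed.
Requests does not support this out of the box (or only with difficulty), which is where the third-party package requests-toolbelt comes in.
Sample code for POSTing a multipart-encoded file with each library follows:

Requests (reads the file into memory first)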

import requests
 
url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}
 
r = requests.post(url, files=files)
print(r.text)

Requests + requests-toolbelt (streams the data)

# -*- coding:utf-8 -*-

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

proxies = {
    "http": "http://172.17.18.80:8080",
    "https": "http://172.17.18.80:8080",
}

if __name__ == "__main__":
    print("main")

    m = MultipartEncoder(
        fields={'field0': 'value', 'field1': 'value',
                # open in binary mode; the third item is the content type of this part
                'field2': ('names.txt', open(r'd:\names.txt', 'rb'), 'text/plain')}
    )

    r = requests.post('http://httpbin.org/post',
                      data=m,
                      headers={'Content-Type': m.content_type},
                      proxies=proxies)
    print(r.text)

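Simulating a Douban login and posting a status

A GitHub project on simulated logins: https://github.com/xchaoinfo/fuck-login

The key to simulating a login is finding the real submission URL of the form, then POSTing the data while carrying the cookies. Once the login succeeds, the same session can fetch other pages on the site and read their content.

If a request reproduces the four elements method, URL, headers and body correctly, almost any content can be fetched, and all four are visible in the browser under Inspect Element > Network.

For the captcha, the current approach is to save the captcha image and type the answer by hand; automatic recognition in Python is left for later. If the saved image is hard to read, open the captcha URL in a browser to see it clearly.

The script logs in to Douban and posts one sentence.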

# -*- coding:utf-8 -*-

import re
import requests
from bs4 import BeautifulSoup


class DouBan(object):
    def __init__(self):
        self.__username = "your_douban_account"  # Douban account
        self.__password = "your_douban_password"  # Douban password
        self.__main_url = "https://www.douban.com"
        self.__login_url = "https://www.douban.com/accounts/login"
        self.__proxies = {
            "http": "http://172.17.18.80:8080",
            "https": "http://172.17.18.80:8080"
        }
        self.__headers = {
            "Host": "www.douban.com",
            "Origin": self.__main_url,
            "Referer": self.__main_url,
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
        }
        self.__data = {
            "source": "index_nav",
            "redir": "https://www.douban.com",
            "form_email": self.__username,
            "form_password": self.__password,
            "login": "登录"
        }

        self.__session = requests.Session()
        self.__session.headers.update(self.__headers)
        self.__session.proxies = self.__proxies

    def login(self):
        r = self.__session.post(self.__login_url, data=self.__data)
        if r.status_code != 200:
            print("login fail", r.status_code)
            return
        html = r.text  # str, so the regular expression below can search it
        soup = BeautifulSoup(html, "lxml")
        captcha_img = soup.find('img', id='captcha_image')
        if captcha_img is None:
            print("no captcha required")
            # The logic is the same as after entering the captcha below; omitted here
            return
        captcha_address = captcha_img['src']
        print(captcha_address)
        # Extract the captcha ID with a regular expression
        re_captcha_id = r'<input type="hidden" name="captcha-id" value="(.*?)"/'
        captcha_id = re.findall(re_captcha_id, html)
        print(captcha_id)
        # Save the image locally in binary mode so it can be read by a human
        with open('captcha.jpg', 'wb') as f:
            f.write(requests.get(captcha_address, proxies=self.__proxies).content)
        captcha = input('please input the captcha:')

        self.__data['captcha-solution'] = captcha
        self.__data['captcha-id'] = captcha_id[0] if captcha_id else ''
        r = self.__session.post(self.__login_url, data=self.__data)
        if r.status_code == 200:
            print("login success")
            data = {
                "ck": "NBJ2",
                "comment": "模拟登录"
            }
            r = self.__session.post(self.__main_url, data=data)
            print(r.status_code)


if __name__ == "__main__":
    t = DouBan()
    t.login()

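2. Using httpx

PyPI: https://pypi.org/project/httpx/

Install: pip install httpx

httpx is a fully featured HTTP client library for Python 3. It includes an integrated command-line client, supports HTTP/1.1 and HTTP/2, and provides both sync and async APIs. HTTPX aims to be broadly compatible with the API of the requests library.

HTTPX introduction

Example: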

import httpx

r = httpx.get('https://www.example.org/')
print(r.status_code)
print(r.headers['content-type'])
print(r.text)

Command-line client: pip install 'httpx[cli]'

Example:

import httpx
from PIL import Image
from io import BytesIO


r = httpx.get('https://httpbin.org/get')
r = httpx.post('https://httpbin.org/post', data={'key': 'value'})
r = httpx.put('https://httpbin.org/put', data={'key': 'value'})
r = httpx.delete('https://httpbin.org/delete')
r = httpx.head('https://httpbin.org/get')
r = httpx.options('https://httpbin.org/get')

headers = {'user-agent': 'my-app/0.0.1'}
params = {'key1': 'value1', 'key2': 'value2'}
r = httpx.get('https://httpbin.org/get', params=params, headers=headers)
print(r.text)
print(r.content)
print(r.json())
print(r.encoding)
r.encoding = 'utf-8'

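# Create an image from a binary response body
# (fetch an actual image first; the JSON response above cannot be opened as one)
r = httpx.get('https://httpbin.org/image/png')
i = Image.open(BytesIO(r.content))

# Upload a file
files = {'upload-file': open('report.xls', 'rb')}
r = httpx.post("https://httpbin.org/post", files=files)
print(r.text)
# You can also set the filename and content type explicitly by passing a tuple
files = {'upload-file': ('report.xls', open('report.xls', 'rb'), 'application/vnd.ms-excel')}
r = httpx.post("https://httpbin.org/post", files=files)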

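Quickstart

Advanced usage

Client instances

If you have used the requests library, httpx.Client() is the equivalent of requests.Session().

Why use a Client?

When you make requests with the top-level API, as in the quickstart, HTTPX has to establish a new connection for every single request (connections are not reused). As the number of requests to a host grows, this quickly becomes inefficient.
A Client instance, by contrast, uses HTTP connection pooling. When you make several requests to the same host, the Client reuses the underlying TCP connection instead of recreating one for every request.

Compared with the top-level API, this can bring significant performance improvements, including:

Reduced latency across requests (no handshaking).
Reduced CPU usage and round-trips.
Reduced network congestion.

Extra features

Client instances also support features that are not available in the top-level API, such as:

Cookie persistence across requests.
Applying configuration across all outgoing requests.
Sending requests through HTTP proxies.
Using HTTP/2.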

Usage (the with block form is recommended):

with httpx.Client() as client:
    ...

Alternatively, close the connection pool explicitly:
client = httpx.Client()
try:
    ...
finally:
    client.close()

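Once the client has been created, you can send requests with client.get(), client.post() and so on. These methods take the same arguments as httpx.get(), httpx.post(), etc.

By passing parameters to the Client constructor, you can share configuration across all outgoing requests.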

class Client(BaseClient):
    """
    An HTTP client, with connection pooling, HTTP/2, redirects, cookie persistence, etc.

    It can be shared between threads.

    Usage:

    ```python
    >>> client = httpx.Client()
    >>> response = client.get('https://example.org')
    ```

    **Parameters:**

    * **auth** - *(optional)* An authentication class to use when sending
    requests.
    * **params** - *(optional)* Query parameters to include in request URLs, as
    a string, dictionary, or sequence of two-tuples.
    * **headers** - *(optional)* Dictionary of HTTP headers to include when
    sending requests.
    * **cookies** - *(optional)* Dictionary of Cookie items to include when
    sending requests.
    * **verify** - *(optional)* SSL certificates (a.k.a CA bundle) used to
    verify the identity of requested hosts. Either `True` (default CA bundle),
    a path to an SSL certificate file, an `ssl.SSLContext`, or `False`
    (which will disable verification).
    * **cert** - *(optional)* An SSL certificate used by the requested host
    to authenticate the client. Either a path to an SSL certificate file, or
    two-tuple of (certificate file, key file), or a three-tuple of (certificate
    file, key file, password).
    * **proxies** - *(optional)* A dictionary mapping proxy keys to proxy
    URLs.
    * **timeout** - *(optional)* The timeout configuration to use when sending
    requests.
    * **limits** - *(optional)* The limits configuration to use.
    * **max_redirects** - *(optional)* The maximum number of redirect responses
    that should be followed.
    * **base_url** - *(optional)* A URL to use as the base when building
    request URLs.
    * **transport** - *(optional)* A transport class to use for sending requests
    over the network.
    * **app** - *(optional)* An WSGI application to send requests to,
    rather than sending actual network requests.
    * **trust_env** - *(optional)* Enables or disables usage of environment
    variables for configuration.
    * **default_encoding** - *(optional)* The default encoding to use for decoding
    response text, if no charset information is included in a response Content-Type
    header. Set to a callable for automatic character set detection. Default: "utf-8".
    """

    def __init__(
        self,
        *,
        auth: typing.Optional[AuthTypes] = None,
        params: typing.Optional[QueryParamTypes] = None,
        headers: typing.Optional[HeaderTypes] = None,
        cookies: typing.Optional[CookieTypes] = None,
        verify: VerifyTypes = True,
        cert: typing.Optional[CertTypes] = None,
        http1: bool = True,
        http2: bool = False,
        proxies: typing.Optional[ProxiesTypes] = None,
        mounts: typing.Optional[typing.Mapping[str, BaseTransport]] = None,
        timeout: TimeoutTypes = DEFAULT_TIMEOUT_CONFIG,
        follow_redirects: bool = False,
        limits: Limits = DEFAULT_LIMITS,
        max_redirects: int = DEFAULT_MAX_REDIRECTS,
        event_hooks: typing.Optional[
            typing.Mapping[str, typing.List[EventHook]]
        ] = None,
        base_url: URLTypes = "",
        transport: typing.Optional[BaseTransport] = None,
        app: typing.Optional[typing.Callable[..., typing.Any]] = None,
        trust_env: bool = True,
        default_encoding: typing.Union[str, typing.Callable[[bytes], str]] = "utf-8",
    ):
        ...

Example of sharing configuration across requests: shared headers

>>> url = 'http://httpbin.org/headers'
>>> headers = {'user-agent': 'my-app/0.0.1'}
>>> with httpx.Client(headers=headers) as client:
...     r = client.get(url)
...
>>> r.json()['headers']['User-Agent']
'my-app/0.0.1'

Merging of configuration

When a configuration option is provided at both the client level and the request level, one of two things can happen:

For headers, query parameters and cookies, the values are combined. For example:

>>> headers = {'X-Auth': 'from-client'}
>>> params = {'client_id': 'client1'}
>>> with httpx.Client(headers=headers, params=params) as client:
...     headers = {'X-Custom': 'from-request'}
...     params = {'request_id': 'request1'}
...     r = client.get('https://example.com', headers=headers, params=params)
...
>>> r.request.url
URL('https://example.com?client_id=client1&request_id=request1')
>>> r.request.headers['X-Auth']
'from-client'
>>> r.request.headers['X-Custom']
'from-request'

For all other parameters, the request-level value takes priority. For example:

>>> with httpx.Client(auth=('tom', 'mot123')) as client:
...     r = client.get('https://example.com', auth=('alice', 'ecila123'))
...
>>> _, _, auth = r.request.headers['Authorization'].partition(' ')
>>> import base64
>>> base64.b64decode(auth)
b'alice:ecila123'

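Client-only configuration

Some options exist only on the client. For example, base_url lets you prepend a URL to all outgoing requests:

>>> with httpx.Client(base_url='http://httpbin.org') as client:
...     r = client.get('/headers')
...
>>> r.request.url
URL('http://httpbin.org/headers')

For a list of all available client parameters, see the Client API reference.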

Character set encodings and auto-detection

import httpx
import chardet

def autodetect(content):
    return chardet.detect(content).get("encoding")

client = httpx.Client(default_encoding=autodetect)
response = client.get(...)
print(response.encoding)  
print(response.text)

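Calling into Python web apps

An httpx client can be configured to call directly into a Python web application using the WSGI protocol.

This is particularly useful for two main use cases:

Using httpx as a client inside test cases.
Mocking out external services during tests or in dev/staging environments.

Example (you can set breakpoints to step through the execution flow):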

from flask import Flask
import httpx

app = Flask(__name__)


@app.route("/")
def hello():
    return "Hello World!"


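# Call the Flask app through httpx's WSGI transport
# (the app= shortcut on the client was removed in httpx 0.28)
transport = httpx.WSGITransport(app=app)
with httpx.Client(transport=transport, base_url="http://testserver") as client: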
    r = client.get("/")
    assert r.status_code == 200
    assert r.text == "Hello World!"

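Request instances

For maximum control over what gets sent over the wire, HTTPX supports building explicit Request instances:

request = httpx.Request("GET", "https://example.com")

To dispatch a Request instance to the network, pass it to Client.send():

with httpx.Client() as client:
    response = client.send(request)
    ...

If you need to mix client-level and request-level options in a way that the default parameter merging does not support, you can call .build_request() to construct a Request instance and then modify it freely before sending it: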

headers = {"X-Api-Key": "...", "X-Client-ID": "ABC123"}

with httpx.Client(headers=headers) as client:
    request = client.build_request("GET", "https://api.example.com")

    print(request.headers["X-Client-ID"])  # "ABC123"

    # Don't send the API key for this particular request.
    del request.headers["X-Api-Key"]

    response = client.send(request)
    ...

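Event hooks

HTTPX lets you register "event hooks" with the client, which are called every time a particular type of event takes place.

There are currently two event hooks:

request - Called after a request is fully prepared, but before it is sent to the network. Passed the request instance.
response - Called after the response has been fetched from the network, but before it is returned to the caller. Passed the response instance.

These allow you to install client-wide functionality such as logging, monitoring, or tracing.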

import httpx


def log_request(request):
    print(f"Request event hook: {request.method} {request.url} - Waiting for response")


def log_response(response):
    request = response.request
    print(f"Response event hook: {request.method} {request.url} - Status {response.status_code}")


client = httpx.Client(event_hooks={'request': [log_request], 'response': [log_response]})

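Inspecting and modifying installed hooks

# raise_on_4xx_5xx is assumed to be a response hook defined elsewhere,
# for example one that calls response.raise_for_status()
client = httpx.Client()
client.event_hooks['request'] = [log_request]
client.event_hooks['response'] = [log_response, raise_on_4xx_5xx]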

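Monitoring download progress

If you need to monitor the download progress of large responses, you can use response streaming and inspect the response.num_bytes_downloaded property.

Example: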

import tempfile

import httpx
from tqdm import tqdm

with tempfile.NamedTemporaryFile() as download_file:
    url = "https://speed.hetzner.de/100MB.bin"
    with httpx.stream("GET", url) as response:
        total = int(response.headers["Content-Length"])

        with tqdm(total=total, unit_scale=True, unit_divisor=1024, unit="B") as progress:
            num_bytes_downloaded = response.num_bytes_downloaded
            for chunk in response.iter_bytes():
                download_file.write(chunk)
                progress.update(response.num_bytes_downloaded - num_bytes_downloaded)
                num_bytes_downloaded = response.num_bytes_downloaded
pass

Example:

import tempfile
import httpx
import rich.progress

with tempfile.NamedTemporaryFile() as download_file:
    url = "https://speed.hetzner.de/100MB.bin"
    with httpx.stream("GET", url) as response:
        total = int(response.headers["Content-Length"])

        with rich.progress.Progress(
            "[progress.percentage]{task.percentage:>3.0f}%",
            rich.progress.BarColumn(bar_width=None),
            rich.progress.DownloadColumn(),
            rich.progress.TransferSpeedColumn(),
        ) as progress:
            download_task = progress.add_task("Download", total=total)
            for chunk in response.iter_bytes():
                download_file.write(chunk)
                progress.update(download_task, completed=response.num_bytes_downloaded)

Monitoring upload progress

import io
import random

import httpx
from tqdm import tqdm


def gen():
    """
    this is a complete example with generated random bytes.
    you can replace `io.BytesIO` with real file object.
    """
    total = 32 * 1024 * 1024  # 32m
    with tqdm(ascii=True, unit_scale=True, unit='B', unit_divisor=1024, total=total) as bar:
        with io.BytesIO(random.randbytes(total)) as f:
            while data := f.read(1024):
                yield data
                bar.update(len(data))


httpx.post("https://httpbin.org/post", content=gen())

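HTTP and SOCKS proxies

proxies = {
    "http://": "http://localhost:8030",
    "https://": "http://localhost:8031",
}

HTTPX supports proxies through the proxies parameter, passed when the client is initialized:
with httpx.Client(proxies=proxies) as client:
    ...
The top-level API functions accept the proxies parameter as well:
httpx.get(..., proxies=...)

Note: the proxies argument shown here was deprecated in httpx 0.26 and removed in 0.28. Newer releases take a single proxy=... argument, or mounts=... for per-pattern routing.

HTTPX provides fine-grained controls for deciding which requests should go through a proxy and which should not.

Proxy credentials can be included in the proxy URL:

proxies = {
    "http://": "http://username:password@localhost:8030",
    # ...
}
The proxies dictionary maps URL patterns ("proxy keys") to proxy URLs as key-value pairs.
HTTPX matches the requested URL against the proxy keys to decide which proxy, if any, should be used.
Matching proceeds from the most specific keys (for example https://<domain>:<port>) to the least specific ones (for example https://).

HTTPX supports routing proxies based on scheme, domain, port, or a combination of these.

These routing features can be combined to build complex proxy routing configurations.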

SOCKS

Timeout configuration

Set the timeout for a single request:

# Using the top-level API:
httpx.get('http://example.com/api/v1/example', timeout=10.0)

# Using a client instance:
with httpx.Client() as client:
    client.get("http://example.com/api/v1/example", timeout=10.0)

Or disable the timeout for a single request:

# Using the top-level API:
httpx.get('http://example.com/api/v1/example', timeout=None)

# Using a client instance:
with httpx.Client() as client:
    client.get("http://example.com/api/v1/example", timeout=None)

Set a default timeout on the client:

client = httpx.Client()              # Use a default 5s timeout everywhere.
client = httpx.Client(timeout=10.0)  # Use a default 10s timeout everywhere.
client = httpx.Client(timeout=None)  # Disable all timeouts by default.

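Pool limit configuration

You can control the connection pool size using the limits keyword argument on the client.

max_keepalive_connections: the number of allowable keep-alive connections, or None to always allow. (Default: 20)
max_connections: the maximum number of allowable connections, or None for no limit. (Default: 100)
keepalive_expiry: the time limit on idle keep-alive connections in seconds, or None for no limit. (Default: 5)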

limits = httpx.Limits(max_keepalive_connections=5, max_connections=10)
client = httpx.Client(limits=limits)

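Multipart file encoding

>>> files = {'upload-file': ('report.xls', open('report.xls', 'rb'), 'application/vnd.ms-excel')}
>>> r = httpx.post("https://httpbin.org/post", files=files)
>>> print(r.text)
{
  ...
  "files": {
    "upload-file": "<... binary content ...>"
  },
  ...
}

More specifically, if a tuple is used as a value, it must have between 2 and 3 elements:

The first element is an optional filename, which can be set to None.
The second element may be a file-like object or a string, which will be automatically encoded in UTF-8.
An optional third element can be used to specify the MIME type of the file being uploaded. If it is not specified, HTTPX will attempt to guess the MIME type based on the file name, with unknown file extensions defaulting to "application/octet-stream". If the file name is explicitly set to None, then HTTPX will not include a content-type MIME header field.

>>> files = {'upload-file': (None, 'text content', 'text/plain')}
>>> r = httpx.post("https://httpbin.org/post", files=files)
>>> print(r.text)
{
  ...
  "files": {},
  "form": {
    "upload-file": "text content"
  },
  ...
}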

Customizing authentication

SSL certificates

To use a custom CA bundle, pass the verify parameter.

import httpx

r = httpx.get("https://example.org", verify="path/to/client.pem")

Alternatively, you can pass a standard library ssl.SSLContext.

>>> import ssl
>>> import httpx
>>> context = ssl.create_default_context()
>>> context.load_verify_locations(cafile="/tmp/client.pem")
>>> httpx.get('https://example.org', verify=context)
<Response [200 OK]>

Disabling SSL verification

import httpx
r = httpx.get("https://example.org", verify=False)

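Custom transports

HTTPX's Client also accepts a transport argument. This argument allows you to provide a custom Transport object that will be used to perform the actual sending of the requests.

Usage: for some advanced configuration you might need to instantiate a transport class directly and pass it to the client instance. One example is the local_address configuration, which is only available via this low-level API.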

>>> import httpx
>>> transport = httpx.HTTPTransport(local_address="0.0.0.0")
>>> client = httpx.Client(transport=transport)

Connection retries are also available via this interface.

>>> import httpx
>>> transport = httpx.HTTPTransport(retries=1)
>>> client = httpx.Client(transport=transport)

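Similarly, instantiating a transport directly provides a uds option for connecting via a Unix Domain Socket, which is only available via this low-level API.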

>>> import httpx
>>> # Connect to the Docker API via a Unix Socket.
>>> transport = httpx.HTTPTransport(uds="/var/run/docker.sock")
>>> client = httpx.Client(transport=transport)
>>> response = client.get("http://docker/info")
>>> response.json()
{"ID": "...", "Containers": 4, "Images": 74, ...}

urllib3 transport

>>> import httpx
>>> from urllib3_transport import URLLib3Transport
>>> client = httpx.Client(transport=URLLib3Transport())
>>> client.get("https://example.org")
<Response [200 OK]>

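Writing custom transports

A transport instance must implement the low-level Transport API, which deals with sending a single request and returning a response. You should subclass either httpx.BaseTransport or httpx.AsyncBaseTransport.

At the Transport API layer we use the familiar Request and Response models.

For more detail, see the handle_request and handle_async_request docstrings, which describe the specifics of the Transport API.

A complete example of a custom transport implementation: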

import json
import httpx

class HelloWorldTransport(httpx.BaseTransport):
    """
    A mock transport that always returns a JSON "Hello, world!" response.
    """

    def handle_request(self, request):
        message = {"text": "Hello, world!"}
        content = json.dumps(message).encode("utf-8")
        stream = httpx.ByteStream(content)
        headers = [(b"content-type", b"application/json")]
        return httpx.Response(200, headers=headers, stream=stream)

It is used in the same way:

>>> import httpx
>>> client = httpx.Client(transport=HelloWorldTransport())
>>> response = client.get("https://example.org/")
>>> response.json()
{"text": "Hello, world!"}

Mock transports

During testing it is often useful to be able to mock out a transport and return pre-determined responses rather than making actual network requests. The httpx.MockTransport class accepts a handler function, which is used to map requests onto pre-determined responses.

import os
import httpx

def handler(request):
    return httpx.Response(200, json={"text": "Hello, world!"})

# Switch to a mock transport, if the TESTING environment variable is set.
if os.environ.get('TESTING', '').upper() == "TRUE":
    transport = httpx.MockTransport(handler)
else:
    transport = httpx.HTTPTransport()

client = httpx.Client(transport=transport)

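For more advanced use cases you might want to take a look at the third-party mocking libraries RESPX or pytest-httpx.

Mounting transports

You can also mount transports against given schemes or domains, to control which transport an outgoing request should be routed through, using the same key style that is used for specifying proxy routing.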

import httpx

class HTTPSRedirectTransport(httpx.BaseTransport):
    """
    A transport that always redirects to HTTPS.
    """

    def handle_request(self, request):
        # Rebuild the requested URL with the https scheme and redirect to it
        url = request.url.copy_with(scheme="https")
        return httpx.Response(303, headers={"Location": str(url)})

# A client where any `http` requests are always redirected to `https`
mounts = {'http://': HTTPSRedirectTransport()}
client = httpx.Client(mounts=mounts)

A few other sketches of how you might take advantage of mounted transports...

Disabling HTTP/2 on a single given domain:

mounts = {
    "all://": httpx.HTTPTransport(http2=True),
    "all://*example.org": httpx.HTTPTransport()
}
client = httpx.Client(mounts=mounts)

Mocking requests to a given domain:

# All requests to "example.org" should be mocked out.
# Other requests occur as usual.
def handler(request):
    return httpx.Response(200, json={"text": "Hello, World!"})

mounts = {"all://example.org": httpx.MockTransport(handler)}
client = httpx.Client(mounts=mounts)

Adding support for custom schemes:

# Support URLs like "file:///Users/sylvia_green/websites/new_client/index.html"
mounts = {"file://": FileSystemTransport()}
client = httpx.Client(mounts=mounts)

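Guides

Async support

A coroutine, also known as a micro-thread or fiber, is a lightweight thread that runs in user space.
A coroutine has its own register context and stack. When the scheduler switches away from a coroutine, the register context and stack are saved elsewhere, and they are restored when execution switches back. A coroutine therefore retains the state of its previous invocation, that is, a particular combination of all its local state, and each re-entry resumes from where the previous call left off.
Coroutine-based code essentially runs in a single thread of a single process. Compared with multithreading, it has no thread context-switching overhead and no overhead from locking and synchronizing atomic operations, and its programming model is very simple.
Coroutines can be used to implement asynchronous operations. In a web-crawling scenario, for example, after sending a request we have to wait some time for the response, but during that wait the program can do plenty of other work and switch back to continue processing once the response arrives. This makes full use of the CPU and other resources, which is the advantage of coroutines.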
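
The "do other work while waiting" idea can be demonstrated with the standard library alone. In this minimal sketch, asyncio.sleep stands in for waiting on a network response; fake_request and the 0.2 s delay are made-up placeholders.

```python
import asyncio
import time


async def fake_request(name, delay):
    # asyncio.sleep stands in for waiting on a network response.
    await asyncio.sleep(delay)
    return name


async def main():
    # All three "requests" wait at the same time on a single thread.
    return await asyncio.gather(
        fake_request("a", 0.2), fake_request("b", 0.2), fake_request("c", 0.2)
    )


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Run one after another, the three waits would take about 0.6 s; run as coroutines they overlap and finish in roughly the time of a single wait.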

To make asynchronous requests, you need an AsyncClient.

>>> async with httpx.AsyncClient() as client:
...     r = await client.get('https://www.example.com/')
...
>>> r
<Response [200 OK]>

Usage format: response = await client.get(...)

AsyncClient.get(url, ...)
AsyncClient.options(url, ...)
AsyncClient.head(url, ...)
AsyncClient.post(url, ...)
AsyncClient.put(url, ...)
AsyncClient.patch(url, ...)
AsyncClient.delete(url, ...)
AsyncClient.request(method, url, ...)
AsyncClient.send(request, ...)

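Async context

Note: to get the most benefit from connection pooling, avoid instantiating multiple client instances, for example by calling async with httpx.AsyncClient() inside a hot loop. Instead, create one client in a single async with scope and pass it to wherever it is needed.

async with httpx.AsyncClient() as client:
    ...

Alternatively, close the client explicitly with await client.aclose():

client = httpx.AsyncClient()
...
await client.aclose()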

Example:

import asyncio
import httpx
import datetime

pool_size_limit = httpx.Limits(max_keepalive_connections=300, max_connections=500)


async def fetch(client, url):
    # Reuse the shared client so that every request draws on one connection pool
    resp = await client.get(url)
    print(resp.status_code)


async def main():
    url = 'https://www.httpbin.org/delay/5'
    # The endpoint waits 5 s, which equals httpx's default timeout, so allow more time
    async with httpx.AsyncClient(limits=pool_size_limit, timeout=30.0) as client:
        task_list = []
        for index in range(100):
            task_list.append(asyncio.create_task(fetch(client, url)))
        await asyncio.wait(task_list)


if __name__ == '__main__':
    time_1 = datetime.datetime.now()
    asyncio.run(main())
    time_2 = datetime.datetime.now()
    print((time_2 - time_1).seconds)

Streaming responses

The AsyncClient.stream(method, url, ...) method is an async context block.

>>> client = httpx.AsyncClient()
>>> async with client.stream('GET', 'https://www.example.com/') as response:
...     async for chunk in response.aiter_bytes():
...         ...

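The async response streaming methods are:

Response.aread() - For conditionally reading a response inside a stream block.
Response.aiter_bytes() - For streaming the response content as bytes.
Response.aiter_text() - For streaming the response content as text.
Response.aiter_lines() - For streaming the response content as lines of text.
Response.aiter_raw() - For streaming the raw response bytes, without applying content decoding.
Response.aclose() - For closing the response. You don't usually need this, since the stream block closes the response automatically on exit.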

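For situations where a context block cannot be used, you can enter "manual mode" by sending a Request instance with client.send(..., stream=True).

An example in the context of forwarding the response to a streaming web endpoint with Starlette: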

import httpx
from starlette.background import BackgroundTask
from starlette.responses import StreamingResponse

client = httpx.AsyncClient()

async def home(request):
    req = client.build_request("GET", "https://www.example.com/")
    r = await client.send(req, stream=True)
    return StreamingResponse(r.aiter_text(), background=BackgroundTask(r.aclose))

Streaming requests

async def upload_bytes():
    ...  # yield byte content

await client.post(url, content=upload_bytes())

Explicit transport instances

When instantiating a transport instance directly, you need to use httpx.AsyncHTTPTransport.

>>> import httpx
>>> transport = httpx.AsyncHTTPTransport(retries=1)
>>> async with httpx.AsyncClient(transport=transport) as client:
...     ...

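Supported async environments

HTTPX supports either asyncio or trio as an async environment. It auto-detects which of the two to use as the backend for socket operations and concurrency primitives.

asyncio is Python's built-in library for writing concurrent code with the async/await syntax.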

import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as client:
        response = await client.get('https://www.example.com/')
        print(response)

asyncio.run(main())

Trio is an alternative async library, designed around the principles of structured concurrency.

import httpx
import trio

async def main():
    async with httpx.AsyncClient() as client:
        response = await client.get('https://www.example.com/')
        print(response)

trio.run(main)

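AnyIO (https://github.com/agronholm/anyio) is an asynchronous networking and concurrency library that works on top of either asyncio or trio. It blends in with the native libraries of your chosen backend (asyncio by default).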

import httpx
import anyio

async def main():
    async with httpx.AsyncClient() as client:
        response = await client.get('https://www.example.com/')
        print(response)

anyio.run(main, backend='trio')

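Calling into Python web apps

Just as httpx.Client allows you to call directly into WSGI web applications, httpx.AsyncClient allows you to call directly into ASGI web applications.

Taking a Starlette application as an example: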

from starlette.applications import Starlette
from starlette.responses import HTMLResponse
from starlette.routing import Route

async def hello(request):
    return HTMLResponse("Hello World!")

app = Starlette(routes=[Route("/", hello)])

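You can make requests directly against the application, like so (the app is passed through httpx.ASGITransport, since the app= shortcut on the client was removed in httpx 0.28):

>>> import httpx
>>> transport = httpx.ASGITransport(app=app)
>>> async with httpx.AsyncClient(transport=transport, base_url="http://testserver") as client:
...     r = await client.get("/")
...     assert r.status_code == 200
...     assert r.text == "Hello World!"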

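Enabling HTTP/2

HTTP/2 is a major new iteration of the HTTP protocol that provides a far more efficient transport, with potential performance benefits. HTTP/2 does not change the core semantics of requests and responses, but it changes the way data is sent to and from the server.

Rather than the text format that HTTP/1.1 uses, HTTP/2 is a binary format. The binary format provides full request and response multiplexing and efficient compression of HTTP headers. Stream multiplexing means that where HTTP/1.1 requires one TCP stream for each concurrent request, HTTP/2 allows a single TCP stream to handle multiple concurrent requests.

HTTP/2 also provides support for functionality such as response prioritization and server push.

For a comprehensive guide to HTTP/2, you may want to check out "http2 explained".

When using the httpx client, HTTP/2 support is not enabled by default, because HTTP/1.1 is a mature, battle-hardened transport layer and is currently the more robust choice. Future versions of httpx may enable HTTP/2 support by default.

First make sure the optional HTTP/2 dependencies are installed: pip install httpx[http2]
Then instantiate a client with HTTP/2 support enabled: client = httpx.AsyncClient(http2=True)

async with httpx.AsyncClient(http2=True) as client:
    ...

Enabling HTTP/2 support on the client does not necessarily mean that your requests and responses will be transported over HTTP/2, since both the client and the server need to support it. If you connect to a server that only supports HTTP/1.1, the client will use a standard HTTP/1.1 connection instead.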

Inspecting the HTTP version

client = httpx.AsyncClient(http2=True)
response = await client.get(...)
print(response.http_version) # "HTTP/1.0", "HTTP/1.1", or "HTTP/2".

API reference

API interface

Exceptions

The exception hierarchy

HTTPError
- RequestError
  - TransportError
    - TimeoutException
      - ConnectTimeout
      - ReadTimeout
      - WriteTimeout
      - PoolTimeout
    - NetworkError
      - ConnectError
      - ReadError
      - WriteError
      - CloseError
    - ProtocolError
      - LocalProtocolError
      - RemoteProtocolError
    - ProxyError
    - UnsupportedProtocol
  - DecodingError
  - TooManyRedirects
- HTTPStatusError
InvalidURL
CookieConflict
StreamError
- StreamConsumed
- ResponseNotRead
- RequestNotRead
- StreamClosed