Python Web Scraping: An Introduction to the Fetching Libraries

Article link: https://blog.csdn.net/cherish1ove/article/details/85242158

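Using urllib

In Python 2, two libraries, urllib and urllib2, were used to send requests. Python 3 unifies them into a single urllib package; the official documentation is at https://docs.python.org/3/library/urllib.html .
urllib is part of Python's standard library, so nothing extra needs to be installed.
urllib contains 4 modules:
- request: the basic HTTP request module, used to simulate sending requests. Just as you would type a URL into a browser and press Enter, you pass a URL and any extra parameters to the library functions to simulate the process.
- error: the exception module. If a request fails, we can catch these exceptions and then retry or take other action so that the program does not terminate unexpectedly.
- parse: a utility module that provides many URL-handling functions, such as splitting, parsing, and joining.
- robotparser: mainly used to read a site's robots.txt file and decide which pages may be crawled and which may not. It is used relatively rarely.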

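The request module

Reference: https://docs.python.org/3/library/urllib.request.html#basehandler-objects .

urlopen()

The urllib.request module provides the basic means of building HTTP requests. With it you can simulate the way a browser issues a request, and it also handles authentication, redirection, browser cookies, and more.

Example: fetching a web page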

import urllib.request

res = urllib.request.urlopen('https://www.baidu.com')
# Print the type of the response object
print(type(res))
# The status attribute holds the status code, e.g. 200 for success, 404 for page not found
print(res.status)
# Print the response headers
print(res.getheaders())
# getheader('Server') returns the Server value from the response headers; here it is BWS/1.1,
# meaning the server is built on BWS/1.1
print(res.getheader('Server'))
# read() returns the content of the page
print(res.read().decode('utf-8'))

Output:

<class 'http.client.HTTPResponse'>
200
[('Accept-Ranges', 'bytes'), ('Cache-Control', 'no-cache'), ('Content-Length', '227'), ('Content-Type', 'text/html'), ('Date', 'Mon, 24 Dec 2018 22:32:50 GMT'), ('Etag', '"5c1a1790-e3"'), ('Last-Modified', 'Wed, 19 Dec 2018 10:04:00 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Pragma', 'no-cache'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BD_NOT_HTTPS=1; path=/; Max-Age=300'), ('Set-Cookie', 'BIDUPSID=F496065066021BD3CEC67DF2678EAA77; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1545690770; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Strict-Transport-Security', 'max-age=0'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'), ('Connection', 'close')]
BWS/1.1
<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>

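As we can see, the response is an HTTPResponse object. It mainly provides methods such as read(), readinto(), getheader(name), getheaders(), and fileno(), and attributes such as msg, version, status, reason, debuglevel, and closed. After assigning the object to the variable res, we can call these methods and attributes to obtain information about the response.
The basic urlopen() function is enough to fetch a simple web page with a GET request.
To pass further parameters, look at the urlopen() signature: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None). Besides the URL as the first argument, several other values can be passed.

The data parameter

The data parameter is optional. If supplied, it must be a byte stream, i.e. of type bytes, so a string has to be converted first, for example with bytes(). Note: once this parameter is passed, the request is sent as POST rather than GET.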

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read().decode('utf-8'))

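Here we pass one parameter, word, with the value hello. It has to be converted to bytes (a byte stream), which is done with bytes(). The first argument of bytes() must be a str, so urlencode() from the urllib.parse module is used to turn the parameter dictionary into a string; the second argument specifies the encoding, here utf-8. The site being requested, httpbin.org, offers HTTP request testing. This URL is meant for testing POST requests and echoes back information about the request, including the data parameter we passed.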

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "word": "hello"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "10", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.7"
  }, 
  "json": null, 
  "origin": "123.157.129.56", 
  "url": "http://httpbin.org/post"
}

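The parameter we passed appears in the form field, which shows that the request simulated a form submission and sent the data by POST.

The timeout parameter

The timeout parameter sets a timeout in seconds: if the request takes longer than this without a response, an exception is raised. If the parameter is not specified, the global default timeout is used. It works for HTTP, HTTPS, and FTP requests.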

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)
print(response.read().decode())

Output:

......
urllib.error.URLError: <urlopen error timed out>

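The timeout is set to 0.1 seconds. After 0.1 seconds the server has still not responded, so a URLError exception is raised. This exception belongs to the urllib.error module, and the reason for the error is the timeout.
The timeout can therefore be used to skip a page that does not respond for a long time, using a try/except statement.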

import socket
import urllib.request

import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print("Time Out!")

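Output:

Time Out!

Under normal circumstances it is practically impossible to get a response from the server within 0.1 seconds, so the message Time Out! is printed.

Other parameters

context: must be an ssl.SSLContext object; it specifies the SSL settings.
cafile and capath: a CA certificate file and a directory of CA certificates, respectively. They are useful when requesting HTTPS links. Both have been deprecated since Python 3.6 in favor of context and were removed in Python 3.13.
cadefault: deprecated and ignored; its default value is False.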

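Request

urlopen() can issue the most basic requests, but its few simple parameters are not enough to build a complete request. If the request needs headers and similar information, use the more powerful Request class.
The Request constructor: class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
- The first parameter, url, is the URL to request. It is required; all the others are optional.
- The second parameter, data, must be of type bytes (a byte stream) if it is passed. If the data is a dictionary, encode it first with urlencode() from the urllib.parse module.
- The third parameter, headers, is a dictionary holding the request headers. Headers can be supplied through the headers parameter when constructing the request, or added afterwards by calling add_header() on the request instance. The most common use is to disguise the client as a browser by changing User-Agent, whose default value is Python-urllib.
- The fourth parameter, origin_req_host, is the host name or IP address of the requesting party.
- The fifth parameter, unverifiable, indicates whether the request is unverifiable; the default is False. An unverifiable request is one whose URL the user had no option to approve. For example, when an image embedded in an HTML document is fetched automatically, the user had no chance to approve that fetch, so unverifiable should be True.
- The sixth parameter, method, is a string naming the request method, such as GET, POST, or PUT.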
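The parameters above can be explored without touching the network, because building a Request does not send anything. The following is a minimal sketch; the form data and the User-Agent string are made-up values used only for illustration.

```python
from urllib import parse, request

# Hypothetical form data and User-Agent string, used only for illustration.
form = {'name': 'Germey'}
data = parse.urlencode(form).encode('utf-8')   # dict -> str -> bytes

req = request.Request(
    url='http://httpbin.org/post',
    data=data,
    headers={'User-Agent': 'Mozilla/5.0 (example)'},
)
# Headers can also be added after construction.
req.add_header('Accept', 'application/json')

# Nothing has been sent yet; we only inspect the Request object.
print(req.get_method())               # POST, because data is not None
print(req.data)
print(req.get_header('User-agent'))   # header names are stored capitalized
print(request.Request('http://httpbin.org/get').get_method())  # GET
```

Note that when method is left as None, the method is inferred from data: GET without a body, POST with one.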

from urllib import request,parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
    'Host':'httpbin.org'
}

# Named form rather than dict, so the built-in dict type is not shadowed
form = {
    'name':'Germey'
}

data = bytes(parse.urlencode(form),encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method="POST")
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1"
  }, 
  "json": null, 
  "origin": "58.216.11.144", 
  "url": "http://httpbin.org/post"
}

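The result shows that data, headers, and method were all set successfully.
Headers can also be added with the add_header() method:

req = request.Request(url=url, data=data, method="POST")
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163')

Advanced usage

We can now build requests, but how do we handle more advanced operations such as cookies and proxy settings?
This is where Handlers come in. Think of them as processors of various kinds: some deal with login authentication, some with cookies, some with proxy settings.
The BaseHandler class is the parent of all other Handlers and defines the basic interface, with methods such as default_open() and protocol_request().
The various Handler subclasses inherit from BaseHandler:
- HTTPDefaultErrorHandler: handles HTTP error responses; every error is raised as an HTTPError exception.
- HTTPRedirectHandler: handles redirects.
- HTTPCookieProcessor: handles cookies.
- ProxyHandler: sets proxies. If no proxies are given, the proxy settings are read from the environment.
- HTTPPasswordMgr: manages passwords by maintaining a table of user names and passwords. (It is a helper class used by the authentication handlers rather than a BaseHandler subclass itself.)
- HTTPBasicAuthHandler: manages authentication. If opening a link requires authentication, this handler can take care of it.
There are other Handler classes that are not listed here. For details, see the official documentation: https://docs.python.org/3/library/urllib.request.html#urllib.request.BaseHandler .
Another important class is OpenerDirector, which we will call the Opener. The urlopen() function used earlier is in fact an opener provided by urllib. Why introduce the Opener? To get more advanced features we have to go one level deeper and configure a lower-level instance ourselves.
In short, Handlers are used to build an Opener. A few examples follow.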

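Authentication

Some sites pop up a prompt as soon as they are opened, asking for a user name and password, and only show the page after successful authentication. To request such a page, use HTTPBasicAuthHandler.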

from urllib.request import HTTPPasswordMgrWithDefaultRealm,HTTPBasicAuthHandler,build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None,url,username,password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
	result = opener.open(url)
	html = result.read().decode('utf-8')
	print(html)
except URLError as e:
	print(e.reason)

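Here we first instantiate an HTTPBasicAuthHandler. Its argument is an HTTPPasswordMgrWithDefaultRealm object, to which the user name and password were added with add_password(). This gives us a Handler that takes care of authentication. Next, build_opener() builds an Opener from this Handler, and the Opener supplies the credentials whenever the server asks for them. Finally, the Opener's open() method opens the link and completes the authentication. The result is the source of the page as seen after authentication.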
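The password manager can be checked on its own, without a server. The sketch below reuses the placeholder credentials from the example above; the realm name 'Any Realm' and the paths are made up for illustration. It also shows what Basic authentication puts on the wire.

```python
import base64
from urllib.request import HTTPPasswordMgrWithDefaultRealm

# Same placeholder credentials and local URL as in the example above.
mgr = HTTPPasswordMgrWithDefaultRealm()
mgr.add_password(None, 'http://localhost:5000/', 'username', 'password')

# Any URL under the registered prefix resolves to the stored credentials,
# whatever realm the server announces ('Any Realm' is a made-up name).
creds = mgr.find_user_password('Any Realm', 'http://localhost:5000/admin/page')
print(creds)

# A different host has no credentials registered.
other = mgr.find_user_password('Any Realm', 'http://example.com/')
print(other)

# Basic auth sends "user:password" base64-encoded in the Authorization header.
token = base64.b64encode(b'username:password').decode('ascii')
print('Authorization: Basic ' + token)
```

Because base64 is an encoding and not encryption, Basic authentication is only safe over HTTPS.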

Proxies

Crawlers often need a proxy. One can be added as follows:

from urllib.error import URLError
from urllib.request import ProxyHandler,build_opener

proxy_handler = ProxyHandler({
    'http':'http://127.0.0.1:9743',
    'https':'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

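This code assumes a proxy running locally on port 9743. ProxyHandler takes a dictionary whose keys are protocol types (such as http or https) and whose values are proxy URLs; several proxies can be added. The Handler is then passed to build_opener() to build an Opener, which sends the request.

Cookies

Handling cookies also requires the corresponding Handler. First, fetch the cookies from a site: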

import http.cookiejar,urllib.request

# Create a CookieJar object
cookie = http.cookiejar.CookieJar()
# Build a Handler with HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build an opener with build_opener() and call open()
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

Output:

BAIDUID=799C89AB9F89F09F03D29194F96D8B16:FG=1
BIDUPSID=799C89AB9F89F09F03D29194F96D8B16
H_PS_PSSID=1465_21103_28206_28131_27751_27244_20719
PSTM=1545779950
delPer=0
BDSVRTM=0
BD_HOME=0

Cookies are actually stored as text, so they can be written out to a file.

filename = 'cookies.txt'
# Use MozillaCookieJar instead of CookieJar
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

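Here CookieJar has to be replaced with MozillaCookieJar, a subclass of CookieJar that is used when writing the file and handles file-related cookie operations such as loading and saving. Running the code produces a cookies.txt file.
LWPCookieJar can also save and load cookies, but in a different format: the libwww-perl (LWP) cookie file format.

filename = 'cookies.txt'
# Use LWPCookieJar instead of CookieJar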
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Having saved the cookies, we also need to read them back. Taking the LWPCookieJar format as an example:

cookie = http.cookiejar.LWPCookieJar(filename)
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

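The load() method reads the local cookie file and restores its contents. This assumes the cookies were first saved to a file in LWPCookieJar format; after loading them, the Handler and Opener are built in the same way as before.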
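The save and load round trip can be tried without a network connection by putting a hand-made cookie into the jar. This is only a sketch: the helper make_cookie(), the cookie name and value, and the domain .example.com are all made up for illustration, and a temporary directory stands in for the working directory.

```python
import http.cookiejar
import os
import tempfile

def make_cookie(name, value, domain):
    """Build a session cookie by hand (normally the server sets these)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')

# Save: ignore_discard=True is needed because session cookies are
# marked "discard" and would otherwise be skipped.
jar = http.cookiejar.LWPCookieJar(path)
jar.set_cookie(make_cookie('number', '123456789', '.example.com'))
jar.save(ignore_discard=True, ignore_expires=True)

# Load into a fresh jar and read the cookies back.
restored = http.cookiejar.LWPCookieJar()
restored.load(path, ignore_discard=True, ignore_expires=True)
loaded = {c.name: c.value for c in restored}
print(loaded)
```

The restored jar can be passed to HTTPCookieProcessor exactly like the one in the example above.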

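Exception handling: the error module

urllib's error module defines the exceptions raised by the request module. When something goes wrong, the request module raises one of the exceptions defined in the error module.

URLError

The URLError class comes from urllib's error module. It inherits from OSError and is the base class of the error module; any exception produced by the request module can be handled by catching this class.
It has a reason attribute, which gives the cause of the error.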

from urllib import request, error
try:
	response = request.urlopen('https://cuiqingcai.com/index.html')
except error.URLError as e:
	print(e.reason)

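We open a page that does not exist, which would normally raise an error, but because the URLError exception is caught, the program does not crash. Output: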

Not Found

HTTPError

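It is a subclass of URLError dedicated to HTTP request errors, such as failed authentication. It has 3 attributes:

- code: the HTTP status code, for example 404 for page not found or 500 for an internal server error.
- reason: as in the parent class, the cause of the error.
- headers: the response headers.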

from urllib import request, error
try:
	response = request.urlopen('https://cuiqingcai.com/index.html')
except error.HTTPError as e:
	print(e.code, e.reason, e.headers, sep='\n')

With the same URL, this time catching HTTPError, the output is:

404
Not Found
Server: nginx/1.10.3 (Ubuntu)
Date: Tue, 25 Dec 2018 23:47:54 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"

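Additional notes

Because of the parent/child relationship, catch the subclass exception first and the parent class exception afterwards, as follows: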

from urllib import request, error
try:
	response = request.urlopen('https://cuiqingcai.com/index.html')
except error.HTTPError as e:
	print(e.code, e.reason, e.headers, sep='\n')
except error.URLError as e:
	print(e.reason)

The reason attribute is not necessarily a string; it may also be an object.

import socket
import urllib.request
import urllib.error

try:
	response = urllib.request.urlopen('https://www.baidu.com', timeout = 0.01)
except urllib.error.URLError as e:
	print(type(e.reason))
	if isinstance(e.reason, socket.timeout):
		print('TIME OUT')

A very short timeout is set to force a timeout exception. Output:

<class 'socket.timeout'>
TIME OUT

Here reason is an instance of socket.timeout. isinstance() can be used to check its type and react more precisely.
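The two rules above (subclass before parent, and reason possibly being an object) can be combined in one small helper. This is a sketch rather than a fixed recipe: the function name describe() and the sample exceptions are made up, and the exceptions are constructed by hand so that the code runs without a network connection.

```python
import socket
from urllib import error

def describe(exc):
    """Return a short label for an exception raised by urllib.request."""
    # HTTPError must be tested first because it is a subclass of URLError.
    if isinstance(exc, error.HTTPError):
        return 'HTTP %d %s' % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        # reason may be an exception object rather than a string.
        if isinstance(exc.reason, socket.timeout):
            return 'timeout'
        return 'URL error: %s' % exc.reason
    return 'other: %r' % exc

# Hand-built exceptions standing in for real failures.
http_err = error.HTTPError('http://example.com/x', 404, 'Not Found', None, None)
print(describe(http_err))
print(describe(error.URLError(socket.timeout('timed out'))))
print(describe(error.URLError('unknown host')))
```

In a crawler, describe(e) would be called inside a single `except error.URLError as e:` block, since that clause also catches HTTPError.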

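Parsing links: the parse module

The urllib library provides the parse module, which defines a standard interface for working with URLs, for example extracting and joining the parts of a URL and converting links. It supports URLs with the following schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, sip, sips, snews, svn, svn+ssh, telnet, and wais.

urlparse()

This function recognizes a URL and splits it into its parts. For example: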

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result, sep='\n')

Output:

<class 'urllib.parse.ParseResult'> 
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

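As we can see, the result is a ParseResult object with 6 parts: scheme, netloc, path, params, query, and fragment.
Looking at the example URL http://www.baidu.com/index.html;user?id=5#comment, urlparse() splits it into 6 parts. Parsing relies on specific delimiters: what comes before :// is the scheme, i.e. the protocol; what comes before the first / is the netloc, i.e. the domain, and what follows is the path; after the semicolon ; come the params; after the question mark ? comes the query, typically used in GET URLs; after the hash # comes the fragment (anchor), used to jump to a position within the page. The standard link format is: scheme://netloc/path;params?query#fragment .
The signature of urlparse() is urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True); it takes three parameters.
- urlstring: required; the URL to parse.
- scheme: the default protocol (such as http or https). It is used only when the URL itself carries no scheme; if the URL does contain a scheme, the parsed scheme is returned instead.
- allow_fragments: whether the fragment is recognized. If set to False, the fragment is not parsed separately; it becomes part of the path, params, or query, and the fragment field is empty.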
```
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',allow_fragments=False)
print(result)
```
  Output:
```
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')
```
  Now suppose the URL contains neither params nor query:
```
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment',allow_fragments=False)
print(result)
```
  Output:
```
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
```
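  As we can see, when the URL contains neither params nor query, the fragment is parsed as part of the path. The returned ParseResult is actually a tuple, so its fields can be read by index or by attribute name. For example: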
```
from urllib.parse import urlparse
		
result = urlparse('http://www.baidu.com/index.html#comment',allow_fragments=False)
print(result.scheme, result[0], result.netloc, result[1], sep='\n')
```
  Output:
```
http
http
www.baidu.com
www.baidu.com
```
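The scheme parameter described above is not demonstrated in the examples, so here is a short sketch of how it behaves. One detail worth knowing: without a leading //, urlparse() does not recognize the host at all and treats the whole string as a path.

```python
from urllib.parse import urlparse

# With a leading '//' the host is recognized and the default scheme is applied.
a = urlparse('//www.baidu.com/index.html', scheme='https')
print(a.scheme, a.netloc, a.path)

# Without '//' there is no netloc: the whole string is treated as a path.
b = urlparse('www.baidu.com/index.html', scheme='https')
print(b.scheme, repr(b.netloc), b.path)

# A scheme already present in the URL wins over the default.
c = urlparse('http://www.baidu.com/index.html', scheme='https')
print(c.scheme)
```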

urlunparse()

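Where there is urlparse(), there is also its inverse, urlunparse(). It accepts an iterable whose length must be exactly 6; otherwise an error about too few or too many values is raised. For example: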
```
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
```
Here data is a list, but other types work too, such as a tuple or a specific data structure.
The output below shows that the URL was built successfully:
```
http://www.baidu.com/index.html;user?a=6#comment
```

urlsplit()

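This function is very similar to urlparse(), except that it does not parse params separately and returns only 5 parts. Any params, like those in the earlier example, are merged into the path. For example: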

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html#comment')
print(result)

Output:

SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html', query='', fragment='comment')

As we can see, the result is a SplitResult, which is also a tuple: its values can be read by attribute or by index.

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html#comment')
print(result.scheme, result[0])

Output:

http http
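The urlsplit() examples above use a URL without params, so the merging of params into the path is not visible. A small comparison using the URL from the urlparse() section makes the difference concrete:

```python
from urllib.parse import urlparse, urlsplit

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parsed = urlparse(url)
split = urlsplit(url)

# urlparse() separates the params from the path ...
print(parsed.path, parsed.params)
# ... while urlsplit() leaves them attached to the path.
print(split.path)
# 6 fields versus 5 fields.
print(len(parsed), len(split))
```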

urlunsplit()

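Like urlunparse(), this function assembles the parts of a link into a complete URL. Its argument is also an iterable, such as a list or tuple; the only difference is that the length must be 5. For example:

from urllib.parse import urlunsplit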
	
data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))

Output:

http://www.baidu.com/index.html?a=6#comment

urljoin()

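Pass a base_url (the base link) as the first argument and the new link as the second. The function analyzes the scheme, netloc, and path of base_url, fills in whatever is missing from the new link, and returns the result.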

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))

Output:

http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2

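As we can see, base_url supplies three things: scheme, netloc, and path. Whichever of these is missing from the new link is filled in from base_url; whichever is present in the new link takes precedence. The params, query, and fragment of base_url have no effect.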
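In a crawler, urljoin() is mostly used to resolve relative links found on a page. The sketch below shows how the path of base_url takes part in that resolution; the /docs/ directory and file names are made up for illustration.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/docs/about.html'

# A bare file name replaces the last path segment of the base.
same_dir = urljoin(base, 'FAQ.html')
# '..' climbs one directory up from the base's directory.
parent_dir = urljoin(base, '../FAQ.html')
# A leading '/' restarts from the site root.
from_root = urljoin(base, '/FAQ.html')
# A leading '//' keeps only the scheme of the base.
other_host = urljoin(base, '//cuiqingcai.com/FAQ.html')

for link in (same_dir, parent_dir, from_root, other_host):
    print(link)
```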

urlencode()

Builds the query string for a GET request. For example:

from urllib.parse import urlencode

# Declare a dictionary holding the parameters
params = {
    'name':'germy',
    'age':'22'
}
base_url = 'http://www.baidu.com?'
# Call urlencode() to serialize the dictionary into GET parameters
url = base_url + urlencode(params)
print(url)

Output:

http://www.baidu.com?name=germy&age=22

parse_qs()

Deserialization: this function turns a string of GET parameters back into a dictionary. For example:

from urllib.parse import parse_qs

query = 'name=germy&age=22'
print(parse_qs(query))

Output:

{'name': ['germy'], 'age': ['22']}

parse_qsl()

Converts the parameters into a list of tuples. For example:

from urllib.parse import parse_qsl

query = 'name=germy&age=22'
print(parse_qsl(query))

Output:

[('name', 'germy'), ('age', '22')]

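As we can see, the result is a list in which every element is a tuple: the first item of each tuple is the parameter name and the second is its value.

quote()

This function converts content into URL-encoded form. When a URL contains Chinese parameters, garbled characters can sometimes result; quote() converts the Chinese characters into URL encoding. For example:

from urllib.parse import quote

keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

Output: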

https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

unquote()

The counterpart of quote(); it decodes a URL.

from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))

Output:

https://www.baidu.com/s?wd=壁纸
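Two details of quote() are easy to miss: by default it leaves / unescaped, and it encodes a space as %20 rather than +. The sketch below compares it with quote_plus() and shows that urlencode() applies the same encoding to dictionary values; the sample strings are made up for illustration.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

text = 'a b/c'
default_quoted = quote(text)            # '/' is kept, space becomes %20
strict_quoted = quote(text, safe='')    # '/' is escaped too
plus_quoted = quote_plus(text)          # space becomes '+', '/' is escaped

print(default_quoted, strict_quoted, plus_quoted)

# Round trips restore the original text.
print(unquote(default_quoted), unquote_plus(plus_quoted))

# urlencode() encodes non-ASCII values for you.
print(urlencode({'wd': '壁纸'}))
```

So when building a whole query string from a dictionary, urlencode() alone is usually enough; quote() is for encoding a single component by hand.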

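Analyzing the Robots protocol (the robotparser module)

The Robots protocol

The Robots protocol, also called the crawler protocol or robot protocol, is formally the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt placed in the root directory of the site.
When a search crawler visits a site, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler restricts itself to the crawl scope defined there. If the file is not found, the crawler visits every page that is directly reachable.

Sample robots.txt

User-agent: *  # name of the crawler the rules apply to; * means they apply to every crawler
Disallow: /    # directories that must not be crawled; / forbids all pages
Allow: /public/  # normally used together with Disallow rather than on its own, to carve out exceptions; here the public directory may be crawled

# Forbid all crawlers from accessing any directory
User-agent: * 
Disallow: /

# Allow all crawlers to access any directory; leaving robots.txt empty has the same effect
User-agent: * 
Disallow:

# Forbid all crawlers from accessing certain directories
User-agent: * 
Disallow: /private/
Disallow: /tmp/

# Allow only one crawler
User-agent: WebCrawler
Disallow:
User-agent: *
Disallow: /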

Crawler names

Common search crawlers and the sites they belong to

Crawler name	Search engine	Website
BaiduSpider	Baidu	www.baidu.com
Googlebot	Google	www.google.com
360Spider	360 Search	www.so.com
YodaoSpider	Youdao	www.youdao.com
ia_archiver	Alexa	www.alexa.com
Scooter	altavista	www.altavista.com

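The robotparser module

This module provides the RobotFileParser class, which uses a site's robots.txt file to decide whether a given crawler is allowed to fetch a given page.
Usage: pass the robots.txt URL to the constructor. Signature: urllib.robotparser.RobotFileParser(url='')
Common methods:
- set_url(): sets the URL of the robots.txt file. If the URL was already passed when the RobotFileParser object was created, this method is not needed.
- read(): fetches the robots.txt file and analyzes it. Note that this method performs the actual fetch and analysis; if it is never called, every later check returns False, so be sure to call it. It returns nothing.
- parse(): parses robots.txt content. Its argument is a list of lines from robots.txt, which it analyzes according to the robots.txt syntax rules.
- can_fetch(): takes two arguments, the User-agent and the URL to fetch, and returns True or False depending on whether that crawler may fetch the URL.
- mtime(): returns the time robots.txt was last fetched and analyzed. This matters for long-running crawlers, which may need to re-fetch robots.txt periodically to stay up to date.
- modified(): also useful for long-running crawlers; it sets the time robots.txt was last fetched and analyzed to the current time.

For example: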

from urllib.robotparser import RobotFileParser

# Create a RobotFileParser object
rp = RobotFileParser()
# Set the robots.txt URL with set_url()
# It can also be passed directly: rp = RobotFileParser('https://www.jianshu.com/robots.txt')
rp.set_url('https://www.jianshu.com/robots.txt')
rp.read()
# Check whether the pages may be fetched
print(rp.can_fetch('*','https://www.jianshu.com/p/2f09699006dd'))
print(rp.can_fetch('*','https://jianshu.com/search?q=python&page=1&type=collections'))

Output:

False
False

The parse() method can be used in the same way, with the fetch done separately. For example:

from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

# Create a RobotFileParser object
rp = RobotFileParser()
rp.parse(urlopen('http://www.jianshu.com/robots.txt').read().decode('utf-8').split('\n'))
# Check whether the pages may be fetched
print(rp.can_fetch('*','https://www.jianshu.com/p/2f09699006dd'))
print(rp.can_fetch('*','https://jianshu.com/search?q=python&page=1&type=collections'))
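Because parse() only needs a list of lines, rules can also be checked without any network access. The following sketch feeds it the "forbid certain directories" sample from earlier; the host example.com and the crawler name MyCrawler are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules taken from the sample robots.txt shown earlier.
rules = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

allowed = rp.can_fetch('*', 'http://example.com/index.html')
blocked = rp.can_fetch('*', 'http://example.com/private/data.html')
# A crawler without its own entry falls back to the '*' rules.
named = rp.can_fetch('MyCrawler', 'http://example.com/tmp/x')
print(allowed, blocked, named)
```

This is handy for unit-testing a crawler's politeness logic against a fixed set of rules.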

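The requests library

Above we covered the basics of urllib, but it is inconvenient in many ways: handling page authentication and cookies, for example, requires writing Openers and Handlers. The more powerful requests library exists to make these operations easier. For details, see the official documentation: http://docs.python-requests.org/ .

Basic usage

urlopen() in urllib actually requests a page with GET; the corresponding function in requests is get(). For example: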

import requests

r = requests.get("https://www.baidu.com/")
print(type(r))
print(r.status_code)
print(type(r.text))
#print(r.text)
print(r.cookies)

Output:

<class 'requests.models.Response'>
200
<class 'str'>
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

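Note: calling get() does the same job as urlopen() and returns a Response object. We then print the type of the Response, the status code, the type of the response body, the body itself (commented out because it is too long), and the cookies. The output shows that the return type is requests.models.Response, the response body is a str, and the cookies are a RequestsCookieJar.

requests can also send other types of requests: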

r = requests.post('http://httpbin.org/post')
r = requests.put('http://httpbin.org/put')
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')

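GET requests

GET is one of the most common HTTP requests. Let us first look in detail at how to build GET requests with requests.

Basic example

First, build the simplest possible GET request to http://httpbin.org/get. When the client sends a GET request, this site returns the corresponding request information.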

import requests

r = requests.get("http://httpbin.org/get")
print(r.text)

Output:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.21.0"
  }, 
  "origin": "58.216.11.153", 
  "url": "http://httpbin.org/get"
}

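How do we attach extra information to a GET request? Say we want to add two parameters, name with the value germy and age with the value 22. The link could be written out directly: r = requests.get("http://httpbin.org/get?name=germy&age=22"). That is not very friendly, though, and becomes awkward when there are many parameters. A dictionary can hold them instead: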

import requests

data = {
    'name':'germy',
    'age':'22'
}
r = requests.get("http://httpbin.org/get",params=data)
print(r.text)

Output:

{
  "args": {
    "age": "22", 
    "name": "germy"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.21.0"
  }, 
  "origin": "58.216.11.153", 
  "url": "http://httpbin.org/get?name=germy&age=22"
}

Also note that the response body is of type str, but here it happens to be in JSON format. To parse it directly into a dictionary, call the json() method:

import requests

data = {
    'name':'germy',
    'age':'22'
}
r = requests.get("http://httpbin.org/get",params=data)
print(type(r.text))
print(r.json())
print(type(r.json()))

Output:

<class 'str'>
{'args': {'age': '22', 'name': 'germy'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0'}, 'origin': '123.157.129.57', 'url': 'http://httpbin.org/get?name=germy&age=22'}
<class 'dict'>

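Note: if the response is not in JSON format, parsing fails and a json.decoder.JSONDecodeError exception is raised.

Fetching a web page

The link requested above returns a JSON string. If we request an ordinary web page instead, we get the page content.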

import requests
import re

# Add headers to imitate a browser
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1'
}
r = requests.get("https://www.zhihu.com/explore",headers=headers)
# Use a regular expression to match all question titles
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = re.findall(pattern, r.text)
print(titles)

Output:

['\n如何评价最近发现翼龙身上有疑似羽毛的结构？\n', '\n一个演员的演技能烂到什么程度？\n', '\n古代没有化学，古人是怎么确认杀人凶手的？\n', '\n有哪些对学生党来说很有用的小文具/小技巧/小知识/小东西？\n', '\n勇士 129:127 险胜快船，库里 42 分 0.5 秒上篮绝杀，如何评价这场比赛？\n', '\n如果精灵宝可梦里的精灵可以食用，你们想要把哪些宝可梦做成哪些菜?\n', '\n作为杨超越粉丝，你最大的感受是什么？\n', '\n怎样看待Bighit于2019年推出新男团？\n', '\n如何评价宋旻浩首张solo专辑《✖️✖️》？\n', '\n你最讨厌抖音里哪个网红？\n']

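Fetching binary data

Images, audio, and video files are all binary data underneath; it is their specific storage formats and the matching decoders that let us see them as multimedia. To fetch them we need their raw bytes. Taking the GitHub site icon as an example: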

import requests

r = requests.get('https://github.com/favicon.ico')
print(r.text)
print(r.content)

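Note: printing r.text here produces garbled characters, because the source file is binary and is being printed as a str without proper conversion.
To save the downloaded image: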

import requests

r = requests.get('https://github.com/favicon.ico')
with open('favicon.ico', 'wb') as f:
	f.write(r.content)

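Here open() takes the file name as its first argument and, as its second, a mode that opens the file for binary writing, so binary data can be written to it. After running the code, an icon file named favicon.ico appears.

Adding headers

As in the example above, Zhihu refuses access when no headers are supplied.

POST requests

Besides GET, the other common type of request is POST.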

import requests

data = {'name':'germey', 'age':'22'}
r = requests.post('http://httpbin.org/post', data=data)
print(r.text)

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "age": "22", 
    "name": "germey"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "18", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.21.0"
  }, 
  "json": null, 
  "origin": "115.238.194.89", 
  "url": "http://httpbin.org/post"
}

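We get a response in which the form section contains the submitted data, which shows that the POST request succeeded.

Responses

After sending a request we naturally get a response. The examples above used text and content to read the response body. Many other attributes and methods provide further information, such as the status code, response headers, and cookies.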

import requests

r = requests.get('http://www.jianshu.com')
# Print the status code, response headers, cookies, URL, and request history in turn
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)

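The status code is commonly used to check whether a request succeeded, and requests provides a built-in lookup object for status codes, requests.codes. For example: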

import requests

r = requests.get('http://www.jianshu.com')
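exit() if not r.status_code == requests.codes.ok else print('Request Successfully')

Comparing the returned code with the built-in success code confirms that the request got a normal response: a success message is printed, otherwise the program exits. requests.codes.ok is the success status code 200. The status codes and their lookup names are listed below: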

# Informational status codes
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
 
# Success status codes
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),
 
# Redirection status codes
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect',
      'resume_incomplete', 'resume',), # These 2 to be removed in 3.0
 
# Client error status codes
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),
 
# Server error status codes
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication')

Advanced usage

File upload

requests can simulate submitting data. Here is an example that uploads the favicon.ico file we downloaded earlier:

import requests

files = {'file': open('favicon.ico', 'rb')}
r = requests.post("http://httpbin.org/post", files=files)
print(r.text)

Output:

{
  "args": {}, 
  "data": "", 
  "files": {
    "file": "data:application/octet-stream;...="
  }, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "6665", 
    "Content-Type": "multipart/form-data; boundary=3fca310d8286eeb7bf62765bd3a97368", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.21.0"
  }, 
  "json": null, 
  "origin": "183.240.33.169", 
  "url": "http://httpbin.org/post"
}

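The content of file is omitted here. The site's response contains a files field while the form field is empty, which shows that uploaded files are reported separately under files.

Cookies

Compared with urllib, getting and setting cookies is simpler in requests. To get cookies: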

import requests

r = requests.get("https://www.baidu.com")
print(r.cookies)
for key,value in r.cookies.items():
        print(key + '=' + value)

Output:

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315

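Analysis: reading the cookies attribute gives us the cookies, which are of type RequestsCookieJar. The items() method converts them into a list of tuples, and the loop prints the name and value of each cookie.

Cookies can be used directly to keep a login session alive

Method 1: log in to a site, copy the Cookie value from the request headers, set it in the headers of the crawler, and send the request. (This is simple enough that no example is given.)

Method 2: set them through the cookies parameter. This requires building a RequestsCookieJar object and splitting the cookie string, so it is somewhat more tedious. For example:

import requests

cookies = 'paste the Cookie value from the headers of your logged-in session here'
# Create a RequestsCookieJar object
jar = requests.cookies.RequestsCookieJar()
headers = {
	'Host' : 'the host of the site you are visiting',
	'User-Agent' : 'your browser User-Agent string'
	}
# Split the copied cookie string with split()
for cookie in cookies.split(';'):
	key, value = cookie.strip().split('=', 1)
	# Use set() to store the key and value of each cookie
	jar.set(key, value)
# Pass the populated jar to get() through the cookies parameter
r = requests.get("http://****", cookies=jar, headers=headers)  # replace with the site URL
print(r.text)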
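The splitting step in method 2 is the part most likely to go wrong (stray spaces, values that themselves contain =), so here it is isolated as a small standard-library sketch. The function name parse_cookie_header() and the sample cookie string are made up for illustration.

```python
def parse_cookie_header(raw):
    """Split a raw 'k1=v1; k2=v2' Cookie header string into a dict."""
    result = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing ';'
        # partition() splits on the first '=' only, so values may contain '='.
        key, _, value = pair.partition('=')
        result[key.strip()] = value.strip()
    return result

# Hypothetical cookie string, shaped like one copied from browser dev tools.
raw = 'sessionid=abc123; theme=dark; token=a=b=c;'
cookies = parse_cookie_header(raw)
print(cookies)
```

The cookies parameter of requests.get() also accepts a plain dictionary like this one, so building a RequestsCookieJar is only necessary when per-cookie domain or path settings are needed.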

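Session persistence

In requests, calling get() or post() directly does simulate page requests, but each call is effectively a separate session, as if you had opened the pages in two different browsers. Consider this scenario: the first request logs in to a site with post(), and the second uses get() to request the personal information page that is available after login. In reality this is like opening two browsers: the two sessions are completely unrelated, so the personal information cannot be retrieved. You could set the same cookies on both requests, but that is too tedious.

The main solution is to keep a single session, which is like opening a new tab in the same browser instead of launching a new browser. To avoid setting cookies every time, use a Session object. First, without a Session: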

import requests

# Request the test URL http://httpbin.org/cookies/set/number/123456789,
# which sets a cookie named number with the value 123456789,
# then request http://httpbin.org/cookies, which returns the current cookies
requests.get("http://httpbin.org/cookies/set/number/123456789")
r = requests.get("http://httpbin.org/cookies")
print(r.text)

Output:

{
  "cookies": {}
}

This does not return the cookies as one might expect. Now with a Session:

import requests

s = requests.Session()
s.get("http://httpbin.org/cookies/set/number/123456789")
r = s.get("http://httpbin.org/cookies")
print(r.text)

Output:

{
  "cookies": {
    "number": "123456789"
  }
}

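This time the cookies are retrieved successfully. A Session simulates a single browser session without any need to worry about cookies, and it is typically used to carry out further steps after a simulated login.

SSL certificate verification

requests also provides certificate verification. When sending an HTTPS request, it checks the SSL certificate; the verify parameter controls whether this check is performed. If verify is not given, it defaults to True and verification happens automatically. The certificate of 12306 used not to be trusted by the official CAs (it appears to be trusted now), which produced a certificate verification error.

Proxy handling

Some sites respond normally to a handful of test requests, but once large-scale crawling begins they may show a CAPTCHA, redirect to a login page, or even ban the client's IP outright, making the site unreachable for a period of time. To prevent this, set up a proxy through the proxies parameter.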
```
import requests

proxies = {
	'http': 'http://10.10.1.10:3128',
	'https': 'http://10.10.1.10:1080',
}

requests.get('https://www.baidu.com', proxies=proxies)
```
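Running this program as it stands will not work, because the proxy is probably invalid. Replace it with a working proxy of your own.

If the proxy requires HTTP Basic Auth, use the syntax http://user:password@host:port to configure it. For example: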

import requests

proxies = {
	'http': 'http://user:password@10.10.1.10:3128/',
}
requests.get('https://www.baidu.com', proxies=proxies)

Besides plain HTTP proxies, requests also supports SOCKS proxies. First install the socks extra: pip3 install 'requests[socks]'. For example:

import requests
	
proxies = {
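	'http': 'socks5://user:password@host:port',
	'https': 'socks5://user:password@host:port',
	}
requests.get('https://www.baidu.com', proxies=proxies)

Timeouts

When the local network is poor, or the server is slow or not responding at all, we may wait a very long time for a response, or never receive one and end up with an error. To guard against a server that does not respond in time, set a timeout: if no response arrives within that time, an error is raised. Use the timeout parameter. For example: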

import requests

r = requests.get('https://www.baidu.com', timeout=1)
print(r.status_code)

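This sets the timeout to 1 second; if there is no response within 1 second, an exception is raised.
A request actually has two phases, connect and read. A single timeout value like the one above is applied to each of the two phases. To set them separately, pass a tuple of (connect, read): r = requests.get('https://www.baidu.com', timeout=(5, 11)). To wait forever, set timeout to None or simply leave it out, since None is the default. In that case, if the server is running but extremely slow, the request just keeps waiting and never raises a timeout error.

Authentication

When visiting a site we may run into an authentication page. In that case, use the authentication support built into requests. For example: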
```
import requests
from requests.auth import HTTPBasicAuth

r = requests.get("http://localhost:5000", auth=HTTPBasicAuth('username', 'password'))
print(r.status_code)
```
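If the user name and password are correct, the request is authenticated automatically and returns status code 200; otherwise it returns 401.
Passing an HTTPBasicAuth instance every time is a little tedious. A simpler form is to pass a tuple, which uses the HTTPBasicAuth class by default.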
```
import requests
	
r = requests.get("http://localhost:5000", auth=('username', 'password'))
print(r.status_code)
```

requests supports other authentication schemes too, such as OAuth, which requires installing the oauth package with pip3 install requests_oauthlib. For example:

import requests
from requests_oauthlib import OAuth1

url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET', 'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')
requests.get(url, auth=auth)

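Prepared Request

When introducing urllib, we represented a request as a data structure, with each parameter expressed through a Request object. The same is possible in requests, where the data structure is called a Prepared Request. For example:

from requests import Request,Session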

url = 'http://httpbin.org/post'
data = {
    'name':'germey'
}
headers = {
    'User-Agent':'Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50'
}

s = Session()
# Build a Request object from the url, data, and headers
req = Request('POST',url,data=data,headers=headers)
# Call the Session's prepare_request() to turn it into a Prepared Request object
prepped = s.prepare_request(req)
# Send it with send()
r = s.send(prepped)
print(r.text)

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "germey"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50"
  }, 
  "json": null, 
  "origin": "183.240.33.169", 
  "url": "http://httpbin.org/post"
}

As we can see, this achieves the same result as the earlier POST request.