The various methods of the requests library
get(), post(), put(), delete(), head(), options()
GET
import requests
r = requests.get('http://httpbin.org/get')
print(r.text)
Output:
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.18.4"
},
"origin": "117.139.252.56",
"url": "http://httpbin.org/get"
}
Here the get() method successfully issued a GET request, and the response contains the request headers, the URL, the origin IP, and so on. To attach extra parameters to the request, we can do the following:
import requests
data = {
'name':'July',
'age':'20'
}
r = requests.get('http://httpbin.org/get',params=data)
print(r.text)
{
"args": {
"age": "20",
"name": "July"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.18.4"
},
"origin": "117.139.252.56",
"url": "http://httpbin.org/get?name=July&age=20"
}
This time the request URL was automatically constructed as http://httpbin.org/get?name=July&age=20.
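Under the hood, requests URL-encodes the params dict and appends it to the URL as a query string. A minimal offline sketch of that step, using the standard library's urlencode:

```python
from urllib.parse import urlencode

data = {'name': 'July', 'age': '20'}
# urlencode turns the dict into a query string, just as params= does
query = urlencode(data)
url = 'http://httpbin.org/get?' + query
print(url)  # http://httpbin.org/get?name=July&age=20
```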
However, r.text is of type str; if the body is JSON, we can call the json() method to parse it into a dictionary:
r.json()
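r.json() is roughly equivalent to running the standard library's json.loads on the response text; a minimal offline sketch with a sample body (the JSON string below is illustrative, mimicking the httpbin output above):

```python
import json

# A sample body like the one httpbin returns (illustrative)
text = ('{"args": {"name": "July", "age": "20"}, '
        '"url": "http://httpbin.org/get?name=July&age=20"}')
result = json.loads(text)  # essentially what r.json() does internally
print(type(result))            # <class 'dict'>
print(result['args']['name'])  # July
```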
Next, taking Zhihu as an example, let's see how to scrape a web page:
import requests
import re
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}
r = requests.get('https://www.zhihu.com/explore',headers=headers)
pattern = re.compile('explore-feed.*?question_link.*>(.*?)</a>',re.S)
titles = re.findall(pattern, r.text)
print(titles)
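The same pattern can be exercised offline against a small hand-written HTML fragment (the snippet below is hypothetical, shaped like the markup the pattern expects):

```python
import re

# Hypothetical HTML mimicking Zhihu's explore-feed markup
html = ('<div class="explore-feed">'
        '<a class="question_link" href="/question/1">Sample question title</a>'
        '</div>')
pattern = re.compile('explore-feed.*?question_link.*>(.*?)</a>', re.S)
titles = re.findall(pattern, html)
print(titles)  # ['Sample question title']
```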
Much of the time, content consists of binary data, such as images, audio, and video. These multimedia files have specific storage formats and matching decoders, so to scrape them we need to obtain their raw bytes. The following example downloads an image and saves its binary data:
import requests
r = requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_'
'10000&sec=1547633875442&di=284dda96022c4223f15d4a7ced1f6c45&'
'imgtype=0&src=http%3A%2F%2Fattachments.gfan.com%2Fforum%2F201504%2F01'
'%2F212511bz3hs0n883hhbt9z.jpg')
with open('photos.jpg','wb') as f:
f.write(r.content)
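The save step itself can be checked offline. The sketch below writes sample bytes the same way r.content is written above, then reads them back (the file name and bytes are stand-ins):

```python
import os
import tempfile

content = b'\xff\xd8\xff\xe0fake-jpeg-bytes'  # stand-in for r.content
path = os.path.join(tempfile.mkdtemp(), 'photos.jpg')

# 'wb' is essential: r.content is bytes, so the file must be opened in binary mode
with open(path, 'wb') as f:
    f.write(content)

with open(path, 'rb') as f:
    print(f.read() == content)  # True: the round-trip is byte-exact
```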
POST
Making a POST request with requests works just like GET; simply use the post() method:
import requests
data = {
'name':'July',
'age':'20'
}
r = requests.post('http://httpbin.org/post',data=data)
print(r.text)
{
"args": {},
"data": "",
"files": {},
"form": {
"age": "20",
"name": "July"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Content-Length": "16",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.18.4"
},
"json": null,
"origin": "117.139.252.56",
"url": "http://httpbin.org/post"
}
The form field contains exactly the data we submitted.
In the examples above we only looked at the response content, but a response carries much more information, such as the status code, headers, and cookies. Here is an example:
import requests
r = requests.get('http://www.taobao.com')
print(type(r.status_code),r.status_code)
print(type(r.headers),r.headers)
print(type(r.cookies),r.cookies)
print(type(r.url),r.url)
print(type(r.history),r.history)
Output:
<class 'int'> 200
<class 'requests.structures.CaseInsensitiveDict'> {'Server': 'Tengine', 'Date': 'Wed, 16 Jan 2019 07:49:03 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding, Ali-Detector-Type, X-CIP-PT', 'Content-MD5': '0HglUXCbCRSfgz4i2y/Zxw==', 'Cache-Control': 'max-age=0, s-maxage=90', 'Ali-Swift-Global-Savetime': '1547624866', 'Via': 'cache33.l2cn657[0,200-0,H], cache35.l2cn657[1,0], cache10.cn341[0,200-0,H], cache4.cn341[1,0]', 'Age': '77', 'X-Cache': 'HIT TCP_MEM_HIT dirn:-2:-2', 'X-Swift-SaveTime': 'Wed, 16 Jan 2019 07:48:25 GMT', 'X-Swift-CacheTime': '51', 'Timing-Allow-Origin': '*', 'EagleId': '2782836515476249432424506e', 'Strict-Transport-Security': 'max-age=31536000', 'Content-Encoding': 'gzip'}
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[]>
<class 'str'> https://www.taobao.com/
<class 'list'> [<Response [302]>]
The examples above cover the basic usage; now let's look at some advanced usage, such as file uploads, cookie handling, and proxy settings.
In an earlier example we saved photos.jpg; now we can use it to simulate a file upload:
import requests
f = {'file':open('photos.jpg','rb')}
r = requests.post('http://httpbin.org/post',files=f)
print(r.text)
The result is as follows:
{
"args": {},
"data": "",
"files": {
"file": "data:application/octet-stream;base64,/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAA0JCgsKCA0LCgsODg0PEyAVExISEyccHhcgLikxMC4pLSwzOko+MzZGNywtQFdBRkxOUlNSMj5aYVpQYEpRUk//2wBDAQ4ODhMREyYVFSZPNS01T09PT09PT09PT09PT09PT09PT09PT09PT09PT09...=="
},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Content-Length": "216996",
"Content-Type": "multipart/form-data; boundary=a4b0ea24bc7e45f59ffc48d19b5779c3",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.18.4"
},
"json": null,
"origin": "117.139.252.56",
"url": "http://httpbin.org/post"
}
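The multipart encoding can be inspected without touching the network by preparing the request; the in-memory bytes below stand in for open('photos.jpg', 'rb'):

```python
import requests

# An in-memory stand-in for open('photos.jpg', 'rb')
files = {'file': ('photos.jpg', b'fake image bytes')}
req = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()

# requests picked the multipart content type and a random boundary for us
print(req.headers['Content-Type'])  # multipart/form-data; boundary=...
```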
The cookies obtained through the cookies attribute are of type RequestsCookieJar. The items() method converts them into a list of tuples, so we can iterate over each cookie's name and value:
import requests
r = requests.get('https://www.csdn.net')
for key,value in r.cookies.items():
print(key + '=' + value)
We can also pass cookies directly to maintain a logged-in state. Zhihu again serves as the example:
import requests
headers = {
'Cookie':'_xsrf=GxZlXRZkH5ow3jcqaJtWQO8aGtPPOP0i; _'
'zap=1b87c7ac-6ed1-47ca-9608-4c0f19e7c5aa; d_'
'c0="ACDiWW8msQ6PTqsJxyeSdVzTSG4aSpff4Zc=|1545186513"'
'; l_cap_id="ZDQ4YmFjMmEwY2FmNDcxM2I2ZjAzYTg4YThlMzMw'
'YTk=|1547428329|c25ccb1d24daa1f2837135334dad63e9ab643ed6"'
'; r_cap_id="ZTMxYmUxZjRmYmU3NGMyZjhiNjUxNmM5ZTIxNjY4OGU=|'
'1547428329|5e5b8acf82f50fd5ec33d0a0ee4f1f3e4f7dd2ba"; '
'cap_id="MmE4OTAwYjMyMDY4NGRiYTlkN2I5NGRmNzBjZWE1NTQ=|1547428329'
'|b3bc8046ad556ba82e7abdd610f831c5064f198f"; tgw_l7_route='
'80f350dcd7c650b07bd7b485fcab5bf7; capsion_ticket="2|1:0|10:1547626069'
'|14:capsion_ticket|44:ZjY3YTUzZDk1ZGIzNGIyYzg2ZTVjZTU5M2NiZmNhMzU='
'|9b5f8969d97727ed23bf1b5ebdbeb8347a42251dac57d568787e51105d40b183"; '
'z_c0="2|1:0|10:1547626113|4:z_c0|92:Mi4xQm5MM0RRQUFBQUFBSU9KWmJ5Y'
'XhEaVlBQUFCZ0FsVk5nVFFzWFFBTkNzdlZ6STJXNzB5emNhT0M3cXdHc0N4and3|'
'1d3b700c69c81b2b1e4dab527aaf366021938f0a4b2914f210c47a2b94130461"; '
'tst=r; q_c1=d04237cecd1d4fdb85137d5872e487ef|1547626115000|1547626115000',
'Host':'www.zhihu.com',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}
r = requests.get('https://www.zhihu.com',headers=headers)
print(r.text)
Of course, the same thing can be done through the cookies parameter:
import requests
cookies = '_xsrf=GxZlXRZkH5ow3jcqaJtWQO8aGtPPOP0i; _zap=1b87c7ac-6ed1-47ca-9608-4c0f19e7c5aa; d_c0="ACDiWW8msQ6PTqsJxyeSdVzTSG4aSpff4Zc=|1545186513"; l_cap_id="ZDQ4YmFjMmEwY2FmNDcxM2I2ZjAzYTg4YThlMzMwYTk=|1547428329|c25ccb1d24daa1f2837135334dad63e9ab643ed6"; r_cap_id="ZTMxYmUxZjRmYmU3NGMyZjhiNjUxNmM5ZTIxNjY4OGU=|1547428329|5e5b8acf82f50fd5ec33d0a0ee4f1f3e4f7dd2ba"; cap_id="MmE4OTAwYjMyMDY4NGRiYTlkN2I5NGRmNzBjZWE1NTQ=|1547428329|b3bc8046ad556ba82e7abdd610f831c5064f198f"; tgw_l7_route=80f350dcd7c650b07bd7b485fcab5bf7; capsion_ticket="2|1:0|10:1547626069|14:capsion_ticket|44:ZjY3YTUzZDk1ZGIzNGIyYzg2ZTVjZTU5M2NiZmNhMzU=|9b5f8969d97727ed23bf1b5ebdbeb8347a42251dac57d568787e51105d40b183"; z_c0="2|1:0|10:1547626113|4:z_c0|92:Mi4xQm5MM0RRQUFBQUFBSU9KWmJ5YXhEaVlBQUFCZ0FsVk5nVFFzWFFBTkNzdlZ6STJXNzB5emNhT0M3cXdHc0N4and3|1d3b700c69c81b2b1e4dab527aaf366021938f0a4b2914f210c47a2b94130461"; tst=r; q_c1=d04237cecd1d4fdb85137d5872e487ef|1547626115000|1547626115000'
jar = requests.cookies.RequestsCookieJar()
headers = {
'Host':'www.zhihu.com',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}
for cookie in cookies.split(';'):
key, value = cookie.strip().split('=', 1)
jar.set(key,value)
r = requests.get('https://www.zhihu.com',cookies=jar,headers=headers)
print(r.text)
First create a RequestsCookieJar object, then split the copied cookie string with split(), set each cookie's key and value on the jar with set(), and finally pass the jar to requests.get() via the cookies parameter.
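The split-and-set logic can be factored into a small helper and verified offline (the function name and sample cookie string below are ours, not part of the requests API):

```python
def parse_cookie_string(cookie_string):
    """Turn a raw 'k1=v1; k2=v2' header string into a dict."""
    jar = {}
    for cookie in cookie_string.split(';'):
        # split on the first '=' only, since values may themselves contain '='
        key, value = cookie.split('=', 1)
        jar[key.strip()] = value
    return jar

cookies = parse_cookie_string('tst=r; _xsrf=GxZlXRZkH5ow; d_c0="abc=|123"')
print(cookies['_xsrf'])  # GxZlXRZkH5ow
print(cookies['d_c0'])   # "abc=|123"
```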
Session maintenance
In requests, calling get() and post() directly does simulate page requests, but each call is effectively a separate session, as if two different browsers had opened two different pages. Suppose the first request logs in to a site with post(), and a second request then tries to fetch your personal information page with get(): that is effectively opening two unrelated browsers, two completely independent sessions, so the personal information cannot be retrieved.
The solution is to stay within one session, the equivalent of opening a new browser tab rather than a whole new browser. Setting cookies by hand on every request would be tedious, so the best approach is the Session object. It conveniently maintains a session without us having to worry about cookies; it handles them automatically. Here is an example:
import requests
requests.get('http://httpbin.org/cookies/set/number/123456789')
r = requests.get('http://httpbin.org/cookies')
print(r.text)
The output is as follows:
{
"cookies": {}
}
This shows it does not work, so let's try a Session instead:
import requests
s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
The result:
{
"cookies": {
"number": "123456789"
}
}
The example above shows that a Session can simulate staying within one and the same session without worrying about cookies; it is typically used for the steps that follow a successful simulated login.
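The cookie jar a Session carries can be inspected offline; setting a cookie on s.cookies below mimics what the server's Set-Cookie response did in the httpbin example:

```python
import requests

s = requests.Session()
# Simulate a Set-Cookie from the server; subsequent s.get() calls would send it
s.cookies.set('number', '123456789')
print(s.cookies.get('number'))  # 123456789
```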
When sending an HTTPS request, requests verifies the SSL certificate; the verify parameter controls whether this check is performed.
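For example, verification can be skipped per request with verify=False, or for a whole Session through its verify attribute. A sketch (no request is actually sent here; disabling verification should only be done against trusted test hosts):

```python
import requests

s = requests.Session()
s.verify = False  # every request made on this session now skips certificate checks
# The per-call form would be: requests.get('https://example.com', verify=False)
print(s.verify)  # False
```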
Proxy settings
Some sites, on detecting large-scale, frequent requests, may show a captcha, redirect to a login page, or simply ban the IP. To guard against this we can set up a proxy, which is what the proxies parameter is for. First, a basic example:
import requests
proxies = {
'http':'http://121.225.52.143:9999',
'https':'https://121.225.52.143:9999'
}
requests.get('https://www.taobao.com',proxies=proxies)
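Proxies can also be attached to a Session so that every request routes through them; the addresses below are the same placeholders as above, so substitute a live proxy to actually route traffic:

```python
import requests

s = requests.Session()
# Placeholder proxy addresses, one per URL scheme
s.proxies.update({
    'http': 'http://121.225.52.143:9999',
    'https': 'https://121.225.52.143:9999',
})
print(s.proxies['http'])  # http://121.225.52.143:9999
```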
The next example shows how to handle a page that requires basic authentication, by passing an HTTPBasicAuth instance through the auth parameter:
import requests
from requests.auth import HTTPBasicAuth
r = requests.get('http://localhost:5000',auth=HTTPBasicAuth('username','password'))
print(r.status_code)
We can also simply pass a tuple, which defaults to HTTPBasicAuth authentication, so the same request can be written as:
import requests
r = requests.get('http://localhost:5000',auth=('username','password'))
print(r.status_code)
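That the tuple really becomes basic authentication can be confirmed offline by preparing the request and inspecting the Authorization header it produces:

```python
import requests

req = requests.Request('GET', 'http://localhost:5000',
                       auth=('username', 'password')).prepare()
# Basic auth is just base64('username:password') behind a 'Basic ' prefix
print(req.headers['Authorization'])  # Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```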