(一) Introduction
requests is one of the more convenient HTTP libraries in Python, much easier to use than urllib. Let's start with a simple example:
import requests

response = requests.get('https://www.baidu.com/')
print(type(response))
print(response.status_code)
print(type(response.text))
print(response.text)
print(response.cookies)

》》》Output:
<class 'requests.models.Response'>
200
<class 'str'>
(the full HTML of the page)
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
Compared with urllib this is much simpler, and the response exposes attributes similar to urllib's. Next, let's look at how to send the various kinds of requests.
(二) Basic GET requests
1. Basic usage:
import requests

response = requests.get('http://httpbin.org/get')  # calling requests.get() is all it takes to send a GET request
print(response.text)

》》》Output:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.19.1"
  },
  "origin": "113.54.225.108",
  "url": "http://httpbin.org/get"
}
If you want to pass query parameters, just build a dict and pass it to the params argument of get(); there is no need to URL-encode anything yourself:
import requests

data = {
    'name': 'Boru',
    'age': '18'
}
response = requests.get('http://httpbin.org/get', params=data)
print(response.text)

》》》Output:
{
  "args": {
    "age": "18",
    "name": "Boru"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.19.1"
  },
  "origin": "113.54.225.108",
  "url": "http://httpbin.org/get?name=Boru&age=18"
}
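As a quick illustration of that automatic encoding, here is a minimal sketch (the parameter values are made up) showing that requests percent-encodes the query string for us, including non-ASCII characters:

import requests

# Hypothetical parameters containing characters that need URL encoding.
data = {'name': '张三', 'keyword': 'a b&c'}
response = requests.get('http://httpbin.org/get', params=data)
print(response.url)  # the final URL contains the percent-encoded query string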
2. Parsing JSON
Previously, when a response came back in JSON format, we had to decode it with json.loads(); with requests we can simply call json() on the response:
import requests
import json

response = requests.get('http://httpbin.org/get')
print(type(response.text))
print(response.json())
print(json.loads(response.text))
print(type(response.json()))

》》》Output:
<class 'str'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.19.1'}, 'origin': '113.54.225.108', 'url': 'http://httpbin.org/get'}
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.19.1'}, 'origin': '113.54.225.108', 'url': 'http://httpbin.org/get'}
<class 'dict'>
As you can see, both approaches produce exactly the same result: a dict.
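One caveat worth noting: if the response body is not valid JSON, json() raises an exception. A small sketch, assuming the httpbin.org/html endpoint (which returns HTML rather than JSON):

import requests

response = requests.get('http://httpbin.org/html')  # returns HTML, not JSON
try:
    print(response.json())
except ValueError:  # the JSON decode error raised here is a subclass of ValueError
    print('Response body is not JSON')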
3. Fetching binary data
We often need to download images, videos and similar content. These are binary streams, so we should read them from response.content (bytes) rather than response.text (str). Look at the following example:
import requests

response = requests.get('https://www.baidu.com/img/baidu_jgylogo3.gif')
print(type(response.text), type(response.content))
print(response.text)
print(response.content)

》》》Output:
<class 'str'> <class 'bytes'>
(garbled text produced by decoding the binary data as a string)
(the raw binary bytes, omitted here)
We can then write the binary content to a file:
import requests

response = requests.get('https://www.baidu.com/img/baidu_jgylogo3.gif')
with open('baidu.image', 'wb') as f:  # the with statement closes the file automatically
    f.write(response.content)

》》》Output:
The Baidu logo image is saved to the file.
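For large files, a streamed download avoids loading the whole body into memory. Here is a rough sketch of the same download using stream=True and iter_content(); the chunk size and output filename are arbitrary choices:

import requests

response = requests.get('https://www.baidu.com/img/baidu_jgylogo3.gif', stream=True)
with open('baidu_stream.gif', 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):  # read the body piece by piece
        if chunk:  # skip keep-alive chunks
            f.write(chunk)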
4. Adding headers
When we need to pretend to be a browser (for example to log in), headers are essential. Adding headers in requests is easy as well: just pass a dict to the headers argument.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.get('https://www.zhihu.com/topics', headers=headers)
print(response.status_code)

》》》Output:
200
Without the headers we cannot open the Zhihu page; after adding them, the request succeeds.
(三) Basic POST requests
When sending a POST request we need to pass the form data. With requests there is no need to encode it into bytes first; just pass the dict to the data argument:
import requests

data = {'name': 'Boru', 'age': '18'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.post("http://httpbin.org/post", data=data, headers=headers)
print(response.json())

》》》Output:
{'args': {}, 'data': '', 'files': {}, 'form': {'age': '18', 'name': 'Boru'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Content-Length': '16', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}, 'json': None, 'origin': '210.41.98.60', 'url': 'http://httpbin.org/post'}
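If the server expects a JSON body instead of form data, requests can serialize the dict for us via the json parameter. A minimal sketch; httpbin simply echoes the parsed body back under the 'json' key:

import requests

payload = {'name': 'Boru', 'age': '18'}
response = requests.post('http://httpbin.org/post', json=payload)  # sets Content-Type: application/json
print(response.json()['json'])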
(四) Response attributes
Let's see which attributes of the response object are available to us:
import requests

response = requests.get('http://www.baidu.com')
print(type(response.status_code), response.status_code)
print(type(response.headers), response.headers)
print(type(response.cookies), response.cookies)
print(type(response.url), response.url)
print(type(response.history), response.history)

》》》Output:
<class 'int'> 200
<class 'requests.structures.CaseInsensitiveDict'> {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'Keep-Alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 25 Sep 2018 14:38:34 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:36 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
<class 'str'> http://www.baidu.com/
<class 'list'> []
Here is a supplementary note on checking the status code:
import requests

response = requests.get('http://www.jianshu.com/hello.html')
exit() if not response.status_code == requests.codes.not_found else print('404 Not Found')

》》》Output:
404 Not Found
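An alternative worth knowing (not used in the snippet above) is raise_for_status(), which raises an HTTPError for any 4xx/5xx response, so we don't have to compare codes by hand:

import requests

response = requests.get('http://www.jianshu.com/hello.html')
try:
    response.raise_for_status()  # raises HTTPError on 4xx/5xx status codes
except requests.exceptions.HTTPError as e:
    print('Request failed:', e)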
For the status codes, you can refer to the following summary (the name aliases defined in requests.codes):
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection.
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect',
      'resume_incomplete', 'resume',),  # These 2 to be removed in 3.0

# Client Error.
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server Error.
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth',
      'network_authentication'),
(五) Advanced usage
1. File upload:
If we want to upload a file (for example an image we downloaded earlier) to a server, we open it in binary mode and pass it to the files argument:
import requests

files = {'file': open('favicon.ico', 'rb')}
response = requests.post("http://httpbin.org/post", files=files)
print(response.text)

》》》Output:
The response from httpbin.org echoes back the uploaded file contents.
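A variant worth mentioning: the files entry can also be a (filename, file object, content type) tuple, which controls the name and MIME type reported to the server; the values below are just examples:

import requests

# Explicit filename and content type for the uploaded file.
files = {'file': ('favicon.ico', open('favicon.ico', 'rb'), 'image/x-icon')}
response = requests.post('http://httpbin.org/post', files=files)
print(response.status_code)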
2. Getting cookies:
Cookies can be read from the cookies attribute of the response, and iterated over as key-value pairs:
import requests

response = requests.get("https://www.baidu.com")
print(response.cookies)
for key, value in response.cookies.items():
    print(key + '=' + value)

》》》Output:
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315
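If cookies need to be carried across several requests (for example to stay logged in), a Session object does this automatically. A minimal sketch using httpbin's cookie endpoints to demonstrate the behaviour:

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')  # server sets a cookie
response = s.get('http://httpbin.org/cookies')             # the session sends it back
print(response.text)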
3. Certificate verification:
When visiting certain sites, certificate verification problems can occur, for example when running the following code:
import requests

response = requests.get('https://www.12306.cn')
print(response.status_code)

》》》Output:
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.12306.cn', port=443): Max retries exceeded with url: / ...
An SSLError is raised. How do we get rid of it? We can solve the problem by setting verify to False:
import requests
from requests.packages import urllib3

urllib3.disable_warnings()
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

》》》Output:
200
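Instead of disabling verification entirely, verify can also point to a local CA bundle or certificate file so the check still happens; the path below is only a placeholder:

import requests

# /path/to/ca_bundle.crt is a hypothetical path to a trusted certificate file.
response = requests.get('https://www.12306.cn', verify='/path/to/ca_bundle.crt')
print(response.status_code)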
4. Proxy settings:
Setting a proxy is also easy. For an HTTP/HTTPS proxy, just pass a proxies dict:
import requests

proxies = {
    "http": "http://127.0.0.1:9743",
    "https": "https://127.0.0.1:9743",
}

response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

》》》Output:
200
If the proxy requires authentication, add the user:password@ credentials in front of the host:
import requests

proxies = {
    "http": "http://user:password@127.0.0.1:9743/",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

》》》Output:
200
If it is a SOCKS proxy rather than an HTTP/HTTPS one, install the SOCKS extra first, then use it the same way:
# Install SOCKS support from the command line first:
# pip3 install 'requests[socks]'

import requests

proxies = {
    'http': 'socks5://127.0.0.1:9742',
    'https': 'socks5://127.0.0.1:9742'
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

》》》Output:
200
5. Timeout settings
Just set the timeout parameter. For example, timeout=1 means a response must arrive within 1 second, otherwise a ReadTimeout exception is raised; we can add exception handling to catch it:
import requests
from requests.exceptions import ReadTimeout

try:
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')

》》》Output:
Timeout
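timeout can also be given as a (connect, read) tuple, so the connection phase and the read phase get separate limits; the values below are just examples:

import requests
from requests.exceptions import ConnectTimeout, ReadTimeout

try:
    response = requests.get('http://httpbin.org/get', timeout=(3, 0.5))  # 3s to connect, 0.5s to read
    print(response.status_code)
except (ConnectTimeout, ReadTimeout):
    print('Timeout')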
6. Authentication
Some sites require login credentials before we can access them. In that case, pass the auth parameter as a tuple:
import requests

r = requests.get('http://120.27.34.24:9001', auth=('user', '123'))
print(r.status_code)

》》》Output:
200
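The tuple form is shorthand for HTTP Basic authentication; the same request can be written explicitly with HTTPBasicAuth (same placeholder URL and credentials as above):

import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
print(r.status_code)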
7. Exception handling
We can check the API documentation for the relevant exceptions. Here we catch a few specific subclasses first and finally the parent class RequestException, which helps us see where the request went wrong:
import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException

try:
    response = requests.get('http://httpbin.org/get', timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
except ConnectionError:
    print('Connection error')
except RequestException:
    print('Error')

》》》Output:
Connection error