1. The Requests library
- Overview: Requests is an HTTP library written in Python on top of urllib, released under the Apache2 License. It is more convenient than urllib, saves a great deal of work, and fully covers everyday HTTP needs.
- Requests supports HTTP keep-alive and connection pooling, session persistence with cookies, file uploads, automatic decoding of response content, internationalized URLs, and automatic encoding of POST data.
import requests
r = requests.get("https://api.github.com/events")
print(r)        # <Response [200]>
print(type(r))  # <class 'requests.models.Response'>
print(r.status_code)  # 200
r = requests.post("http://httpbin.org/post", data={'key': 'value'})  # send an HTTP POST request
r = requests.delete('http://httpbin.org/delete')  # send an HTTP DELETE request
r = requests.head('http://httpbin.org/get')       # send an HTTP HEAD request
r = requests.options('http://httpbin.org/get')    # send an HTTP OPTIONS request
response = requests.get("https://api.github.com/events")
print(response)       # <Response [200]>
print(response.text)  # JSON-formatted body
2. Making requests
GET request with parameters
import requests
data = {
    'name': 'leadingme',
    'age': 18
}
response = requests.get('http://httpbin.org/get', params=data)
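Under the hood, requests serializes the params dict into the URL's query string. A minimal stdlib sketch of the same encoding (no network involved; the URL is only illustrative):

```python
from urllib.parse import urlencode

data = {'name': 'leadingme', 'age': 18}
query = urlencode(data)  # the same query string requests appends to the URL
url = 'http://httpbin.org/get?' + query
print(url)  # http://httpbin.org/get?name=leadingme&age=18
```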
Parsing JSON
import requests
response = requests.get('http://httpbin.org/get')
print(response.text)
print(response.json())
print(type(response.json()))  # response.text is a str; response.json() returns the parsed object
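In essence, response.json() runs the body through a JSON parser, turning the str into a dict. A stdlib sketch with a hard-coded payload (a stand-in, not a real httpbin response):

```python
import json

text = '{"url": "http://httpbin.org/get", "args": {}}'  # stand-in for response.text
parsed = json.loads(text)  # roughly what response.json() does
print(type(text))    # <class 'str'>
print(type(parsed))  # <class 'dict'>
print(parsed['url'])
```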
Fetching binary content
import requests
response = requests.get('http://github.com/favicon.ico')
print(response.text)     # decoded text
print(response.content)  # raw bytes
with open('favicon.ico', 'wb') as f:
    f.write(response.content)  # the with block closes the file automatically
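The same write-bytes-in-'wb'-mode pattern works for any binary payload. A self-contained sketch using a temp file and stand-in bytes instead of a downloaded favicon:

```python
import os
import tempfile

content = b'\x89PNG\r\n'  # stand-in for response.content (raw bytes)
path = os.path.join(tempfile.mkdtemp(), 'favicon.ico')
with open(path, 'wb') as f:   # 'wb': binary mode, no text decoding
    f.write(content)
with open(path, 'rb') as f:   # read back to confirm a byte-for-byte copy
    assert f.read() == content
print('wrote', len(content), 'bytes')
```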
Adding headers
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
}
response = requests.get('http://www.zhihu.com/explore', headers=headers)
print(response.text)
Basic POST request
import requests
data = {
    'name': 'leadingme',
    'age': 18
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
}
response = requests.post('http://httpbin.org/post', data=data, headers=headers)
print(response.text)
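With data=, requests form-encodes the dict into the request body (application/x-www-form-urlencoded); passing json= instead serializes it as JSON and sets the Content-Type accordingly. A stdlib sketch of the two body formats:

```python
import json
from urllib.parse import urlencode

data = {'name': 'leadingme', 'age': 18}
form_body = urlencode(data)   # what data=data sends in the body
json_body = json.dumps(data)  # what json=data sends in the body
print(form_body)  # name=leadingme&age=18
print(json_body)  # {"name": "leadingme", "age": 18}
```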
3. Disabling redirects
response = requests.post('http://httpbin.org/post', data=data, headers=headers, allow_redirects=False)
4. Responses
Response attributes
import requests
response = requests.get('http://www.jianshu.com')
print(type(response.status_code), response.status_code)
print(type(response.headers), response.headers)
print(type(response.cookies), response.cookies)
print(type(response.url), response.url)
print(type(response.history), response.history)
5. Checking status codes
- Common status codes and their meanings:
301 Moved Permanently: permanently redirected to a new URL
302 Found: temporarily redirected to a new URL
304 Not Modified: the requested resource has not been updated
400 Bad Request: malformed request
401 Unauthorized: the request lacks valid credentials
403 Forbidden: access denied
404 Not Found: no matching page
500 Internal Server Error: the server hit an internal error
501 Not Implemented: the server does not support the functionality the request needs
import requests
response = requests.get('http://www.jianshu.com')
if response.status_code != 200:
    exit()
else:
    print("Request Successfully!")
response = requests.get('http://www.jianshu.com')
exit() if response.status_code != requests.codes.not_found else print('404 Not Found')
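The stdlib's http.HTTPStatus carries the same code/phrase pairs as requests.codes, handy for readable checks without memorizing numbers:

```python
from http import HTTPStatus

# each member is an int with a human-readable phrase attached
print(HTTPStatus.NOT_FOUND.value, HTTPStatus.NOT_FOUND.phrase)  # 404 Not Found
print(HTTPStatus.MOVED_PERMANENTLY.value)                       # 301
print(HTTPStatus.OK == 200)                                     # True
```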
6. File upload
import requests
# works on the same principle as passing data
files = {'file': open('favicon.ico', 'rb')}
response = requests.post('http://httpbin.org/post', files=files)
print(response.text)
7. Getting cookies
import requests
response = requests.get('http://www.baidu.com')
print(response.cookies)
for key, value in response.cookies.items():
    print(key + '=' + value)
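Cookies arrive in the Set-Cookie response header; the stdlib's SimpleCookie demonstrates the same key/value iteration on a hand-written header value (an assumed example, not real Baidu cookies):

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie.load('BAIDUID=abc123; path=/')   # stand-in Set-Cookie value
for key, morsel in cookie.items():      # same items() iteration as response.cookies
    print(key + '=' + morsel.value)     # BAIDUID=abc123
```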
8. Session persistence
import requests
s = requests.Session()  # create a session
s.get('http://httpbin.org/cookies/set/number/123456789')  # set a cookie
response = s.get('http://httpbin.org/cookies')  # the session sends the cookie back
print(response.text)
9. Certificate verification
- An SSL certificate establishes an encrypted SSL channel between the client browser and the web server, guaranteeing the confidentiality of the data exchanged, and the server certificate lets users verify that the site they are visiting is genuine.
import requests
from requests.packages import urllib3
urllib3.disable_warnings()  # suppress the warning emitted when verification is disabled
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)
10. Proxy settings
- A complete proxied request works like this: the client first connects to the proxy server, then, using the proxy's protocol, asks it to connect to the target server or to fetch a specified resource (such as a file). In the latter case the proxy may cache the downloaded resource locally; if a requested resource is already in its cache, the proxy returns the cached copy instead of contacting the target server.
- A crawler can pass a proxies dict to mask its own IP address, rotating proxies as it crawls so that the server sees traffic from different addresses and does not ban it.
import requests
proxies = {
    "https": "https://47.100.104.247:8080",
    "http": "http://36.248.10.47:8080",
}
response = requests.get('http://www.taobao.com', proxies=proxies)
print(response.status_code)
11. Timeout settings
import requests
try:
    response = requests.get('http://httpbin.org/get', timeout=0.1)
    print(response.status_code)
except requests.Timeout:
    print('TimeOut!')
12. Authentication
import requests
r = requests.get('http://120.27.34.24:9001', auth=('user', '123'))
print(r.status_code)
13. Exception classes
- requests.ConnectionError: network connection error, e.g. DNS lookup failure or connection refused
- requests.HTTPError: HTTP error
- requests.URLRequired: a valid URL is missing
- requests.TooManyRedirects: the maximum number of redirects was exceeded
- requests.ConnectTimeout: timed out while connecting to the remote server
- requests.Timeout: the request timed out
- requests.SSLError: SSL certificate verification failed
14. Exception handling
import requests
from requests.exceptions import Timeout, HTTPError, RequestException
try:
    response = requests.get('http://httpbin.org/get', timeout=0.5)
except Timeout:
    print('TimeOut!')
except HTTPError:
    print('HttpError!')
except RequestException:
    print('Error!')
else:
    print('Request Successfully!')