Web Scraping: The Requests Library

TT9980

Categories: 爬虫 (web scraping), requests. Tags: python

Article link: https://blog.csdn.net/TT9980/article/details/104103786

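Introduction

Requests is an HTTP library written in Python and built on top of urllib3. It bills itself as "HTTP for Humans".
Features: HTTP keep-alive and connection pooling, session persistence with cookies, file uploads, automatic detection of the response encoding, and automatic encoding of internationalized URLs and POST data.
It is simpler and more convenient to use than urllib, and more Pythonic.

Source code: https://github.com/kennethreitz/requests
Chinese documentation and API reference: http://docs.python-requests.org/zh_CN/latest/index.html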

Installation

pip install requests

Basic GET request

import requests

response = requests.get("http://www.baidu.com/")
# Equivalent form:
# response = requests.request("get", "http://www.baidu.com/")

# response.content holds the raw body as bytes
print(response.content)
print(response.content.decode("utf8"))
# response.text holds the body decoded to a str
print(response.text)
# The final URL of the request
print(response.url)
# The encoding taken from the response headers
print(response.encoding)
# The encoding guessed from the body by a charset detector such as chardet
print(response.apparent_encoding)
# The HTTP status code
print(response.status_code)

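Encoding issues

How the encoding is determined
Requests reads the character set from the Content-Type header of the server's response. If the header includes a charset field, Requests uses it. If the charset is missing and the content type is a text type, Requests falls back to the default, ISO-8859-1.
ISO-8859-1 is also known as Latin-1, the "Western European" character set.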
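
The header lookup can be tried without any network access. The following is a minimal sketch: get_encoding_from_headers is a helper in requests.utils, and the two header dictionaries are made-up values for illustration.

```python
from requests.utils import get_encoding_from_headers

# Hypothetical header sets, for illustration only.
with_charset = {"content-type": "text/html; charset=utf-8"}
without_charset = {"content-type": "text/html"}

# The charset field is used when present.
declared = get_encoding_from_headers(with_charset)
# A text type without a charset falls back to the default.
fallback = get_encoding_from_headers(without_charset)

print(declared)
print(fallback)
```

The second value is the fallback described above, which is why a UTF-8 or GBK page served without a charset tends to come out garbled in response.text.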

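How do you get the correct encoding?
Nonstandard pages often send a Content-Type header without a charset field.
The response object exposes apparent_encoding, which guesses the encoding from the body with a charset detection library (chardet.detect() or, in newer Requests releases, charset_normalizer). Note that detection costs some CPU time.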

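What is the difference between text and content in Requests?
Both are properties, not methods. text returns the body decoded to a str using response.encoding; only when response.encoding is None, meaning no encoding could be determined from the headers, does text fall back to charset detection.
content returns the raw body as bytes, which avoids the decoding work.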
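
To make the difference concrete, here is a small sketch. The first part decodes the same bytes two ways to show what a wrong fallback encoding does. The decode_body helper is a hypothetical name, and its rule of treating ISO-8859-1 as "charset probably missing" is a heuristic assumed for this example, not something Requests does itself; a page that really is Latin-1 would be re-detected too.

```python
raw = "长城".encode("utf-8")          # the kind of bytes response.content holds

garbled = raw.decode("ISO-8859-1")   # decoded with the wrong fallback encoding
correct = raw.decode("utf-8")        # decoded with the page's real encoding


def decode_body(response):
    """Return the body as str, switching to the detected encoding when the
    header-derived one is missing or is only the ISO-8859-1 fallback."""
    encoding = response.encoding
    if encoding is None or encoding.upper() == "ISO-8859-1":
        # Assigning response.encoding changes how response.text decodes.
        response.encoding = response.apparent_encoding
    return response.text


print(repr(garbled))
print(correct)
```

In real use you would call decode_body(requests.get(url)) and pay the detection cost only for pages that need it.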

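Timeouts
You can tell Requests to stop waiting for a response after the number of seconds given by the timeout parameter. Nearly all production code should set it; without it, your program can hang indefinitely.
requests.get('http://www.baidu.com/', timeout=0.001)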
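
The timeout can also be a (connect, read) tuple, so the two phases get separate limits. The sketch below wraps that in a helper; fetch_or_none is a hypothetical name and the default values are illustrative rather than recommended.

```python
import requests


def fetch_or_none(url, timeout=(3.05, 10)):
    """GET url and return the Response, or None if the request timed out.

    timeout is either one number of seconds or a (connect, read) tuple.
    """
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both ConnectTimeout and ReadTimeout.
        return None
```

A call such as fetch_or_none("http://www.baidu.com/", timeout=(3, 5)) then either returns a response or gives up after a bounded wait.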

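Exceptions
When a network problem occurs (for example, a failed DNS lookup or a refused connection), Requests raises a ConnectionError.
If the request times out, it raises a Timeout exception.
All exceptions that Requests raises explicitly inherit from requests.exceptions.RequestException.

import requests

try:
    requests.get('http://www.baidu.com/', timeout=0.01)
except requests.exceptions.RequestException as e:
    print(e)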
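
When different failures need different handling, the specific exceptions can be caught before the base class. This is a sketch under that assumption; describe_outcome and its return labels are made up for the example.

```python
import requests


def describe_outcome(url, timeout=5):
    """Return a short label describing how a GET to url ended."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx statuses into HTTPError
    except requests.exceptions.Timeout:
        return "timeout"
    except requests.exceptions.ConnectionError:
        return "connection error"
    except requests.exceptions.HTTPError as exc:
        return "http error %s" % exc.response.status_code
    except requests.exceptions.RequestException:
        # Anything else Requests raises, e.g. a URL without a scheme.
        return "request failed"
    return "ok"
```

Timeout is listed first because a connect timeout inherits from both Timeout and ConnectionError, and the first matching clause wins.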

Adding headers and query parameters
To add request headers, pass a headers argument. To send parameters in the URL query string, use the params argument.

import requests

kw = {'wd': '长城'}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
# params accepts a dict or a string of query parameters; a dict is URL-encoded automatically, so urlencode() is not needed
response = requests.get("http://www.baidu.com/s?", params=kw, headers=headers)
print(response.text)
print(response.encoding)
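
To see what params and headers turn into without sending anything, a request can be built and prepared offline. A minimal sketch; the shortened User-Agent value is an illustrative placeholder.

```python
import requests

kw = {"wd": "长城"}
headers = {"User-Agent": "Mozilla/5.0"}  # shortened, illustrative value

# Build the request without sending it, then inspect the result.
prepared = requests.Request(
    "GET", "http://www.baidu.com/s", params=kw, headers=headers
).prepare()

print(prepared.url)                    # the query string is percent-encoded UTF-8
print(prepared.headers["User-Agent"])
```

The prepared URL is the same one response.url would report after a real requests.get call with these arguments, before any redirects.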

Handling HTTPS requests: SSL certificate verification

Requests verifies SSL certificates for HTTPS requests.
Verification is controlled by the verify parameter, which defaults to True, so it can be omitted.

import requests

response = requests.get("https://www.baidu.com/", verify=True)
# Equivalent, since verify defaults to True:
# response = requests.get("https://www.baidu.com/")
print(response.text)

If certificate verification fails, or the server's certificate is not trusted, Requests raises an SSLError. 12306 is an example:

import requests

response = requests.get("https://www.12306.cn/mormhweb/")
print(response.text)

Error:
SSLError: HTTPSConnectionPool(host='www.12306.cn', port=443): Max retries exceeded with url: / (Caused by SSLError(CertificateError("hostname 'www.12306.cn'…
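
One way to deal with such a site is to catch the SSLError and, where you hold the site's CA certificate, pass its path through verify. The sketch below assumes hypothetical helper names (default_ca_bundle, get_verified); the ca_bundle path is something you would supply yourself.

```python
import requests
from requests.certs import where


def default_ca_bundle():
    """Path of the CA bundle that ships with Requests' certifi dependency."""
    return where()


def get_verified(url, ca_bundle=None, timeout=5):
    """GET url with certificate verification enabled.

    ca_bundle is an optional path to a CA file for sites whose issuer is
    not in the default bundle. Returns the Response, or None if
    verification failed.
    """
    verify = ca_bundle if ca_bundle else True
    try:
        return requests.get(url, verify=verify, timeout=timeout)
    except requests.exceptions.SSLError as exc:
        print("certificate verification failed:", exc)
        return None
```

Passing verify=False would also silence the error, but it turns verification off entirely, so it is best kept to throwaway tests.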

Downloading an image

import os
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
response = requests.get("http://c1.haibao.cn/img/600_0_100_1/1549794487.7856/fa60e1e7264e6082569d729e4ee302dd.jpg", headers=headers)

# Make sure the target directory exists before writing
os.makedirs('./images', exist_ok=True)
# response.content is the raw bytes of the image
with open('./images/img.jpg', 'wb') as file:
    file.write(response.content)
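
For large files it can be better not to hold the whole body in memory. The following sketch streams the response to disk in chunks; the download function, its chunk size, and its timeout are assumptions chosen for illustration.

```python
import os
import requests


def download(url, dest, chunk_size=8192, timeout=10):
    """Stream url into the file dest and return the number of bytes written."""
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    written = 0
    # stream=True defers reading the body until iter_content is consumed.
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()  # avoid saving an error page as an image
        with open(dest, "wb") as file:
            for chunk in response.iter_content(chunk_size=chunk_size):
                file.write(chunk)
                written += len(chunk)
    return written
```

Usage would look like download(image_url, "./images/img.jpg"), with headers added to the requests.get call if the site requires a browser-like User-Agent.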

Basic POST request (data parameter)

The most basic POST request uses the post method directly:
response = requests.post("http://www.baidu.com/", data=data)
Passing data
A POST request usually needs to carry some parameters. Pass them with the data argument.

import requests

if __name__ == "__main__":
    # Request URL of the captured translation request
    # Request_URL = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
    Request_URL = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
    # Build the Form_Data dict from the Form Data of the captured request
    Form_Data = {}
    Form_Data['i'] = 'Tom'
    Form_Data['from'] = 'AUTO'
    Form_Data['to'] = 'AUTO'
    Form_Data['smartresult'] = 'dict'
    Form_Data['client'] = 'fanyideskweb'
    Form_Data['salt'] = '1526796477689'
    Form_Data['sign'] = 'd0a17aa2a8b0bb831769bd9ce27d28bd'
    Form_Data['doctype'] = 'json'
    Form_Data['version'] = '2.1'
    Form_Data['keyfrom'] = 'fanyi.web'
    Form_Data['action'] = 'FY_BY_REALTIME'
    Form_Data['typoResult'] = 'false'
    head = {}
    # Set the User-Agent header
    head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'

    response = requests.post(Request_URL, data=Form_Data, headers=head)
    print(response)

    print(response.text)
    translate_results = response.json()
    # Extract the translation from the JSON response
    translate_results = translate_results['translateResult'][0][0]['tgt']
    # Print the translation
    print("Translation result: %s" % translate_results)
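
What data= actually puts on the wire can be checked offline by preparing the request instead of sending it. In this sketch the form is a trimmed, illustrative subset of the fields above, and the json= variant is shown only to contrast how Requests encodes it; whether the Youdao endpoint accepts JSON is not something the original states.

```python
import requests

url = "http://fanyi.youdao.com/translate"
form = {"i": "Tom", "doctype": "json"}  # trimmed, illustrative subset

# data= sends the dict as a URL-encoded form body.
as_form = requests.Request("POST", url, data=form).prepare()
# json= serializes the dict as JSON and sets a JSON content type.
as_json = requests.Request("POST", url, json=form).prepare()

print(as_form.headers["Content-Type"])
print(as_form.body)
print(as_json.headers["Content-Type"])
```

A server that reads form fields needs data=, while an API that expects a JSON document needs json=.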

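Proxies (proxies parameter)

If you need to use a proxy, you can configure an individual request by passing the proxies argument to any request method:

import requests

# Choose a proxy per protocol (URL scheme)
proxies = {
    "http": "http://27.184.124.29:8118",
}

response = requests.get("http://www.baidu.com", proxies=proxies)
print(response.text)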
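
The keys of the proxies dict are matched against the scheme of the URL being requested. A small offline sketch using select_proxy from requests.utils shows which entry would be picked; the proxy address is the one from the example above and is unlikely to still be live.

```python
from requests.utils import select_proxy

proxies = {
    "http": "http://27.184.124.29:8118",  # address from the example above
}

# An http:// URL matches the "http" key.
for_http = select_proxy("http://www.baidu.com", proxies)
# An https:// URL finds no matching key in this dict.
for_https = select_proxy("https://www.baidu.com", proxies)

print(for_http)
print(for_https)
```

So a dict with only an "http" key leaves HTTPS requests unproxied by that dict; adding an "https" key covers them.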

Cookies and Session

Cookies

If a response sets cookies, you can read them from the response's cookies attribute:

import requests

response = requests.get("https://www.baidu.com/")
# response.cookies is a CookieJar object
cookiejar = response.cookies
# Convert the CookieJar to a dict
cookiedict = requests.utils.dict_from_cookiejar(cookiejar)
print(cookiejar)
print(cookiedict)

Output:
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
{'BDORZ': '27315'}
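
The same conversion can be reproduced without a network call by filling a jar by hand. A minimal sketch, reusing the cookie name and value from the output above:

```python
import requests
from requests.cookies import RequestsCookieJar

# Build a jar by hand instead of taking it from a response.
jar = RequestsCookieJar()
jar.set("BDORZ", "27315", domain=".baidu.com", path="/")

# CookieJar -> dict, and dict -> CookieJar.
as_dict = requests.utils.dict_from_cookiejar(jar)
round_trip = requests.utils.cookiejar_from_dict(as_dict)

print(as_dict)
print(round_trip.get("BDORZ"))
```

Note that the dict form keeps only names and values; the domain and path stored in the jar are lost in the conversion.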

Simulating a login with cookies

After a user logs in, the cookies carry the logged-in state, so sending those cookies in the request headers lets you skip the login step.
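
As a sketch of that idea, cookies copied from a logged-in browser can be passed through the cookies argument, and Requests turns them into a Cookie header. The cookie name and value below are hypothetical, and the request is only prepared, not sent.

```python
import requests

# Hypothetical cookie copied from a logged-in browser session.
login_cookies = {"sessionid": "abc123"}

prepared = requests.Request(
    "GET", "http://www.renren.com/410043129/profile", cookies=login_cookies
).prepare()

# The cookies dict becomes the Cookie request header.
print(prepared.headers["Cookie"])
```

In real use the same dict goes straight into requests.get(url, cookies=login_cookies), and it stops working once the server expires the session.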

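Session

In Requests, the Session object is used very often. It represents one user session: from the moment the client connects to the server until it disconnects.
A session lets you persist certain parameters across requests; for example, cookies are kept across all requests made from the same Session instance.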

Example: logging in to Renren

import requests

# 1. Create a session object, which stores cookies
session = requests.Session()
# 2. Set the headers on the session so every request sends them
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
session.headers.update(headers)
# 3. The username and password to log in with
data = {"email":"mr_mao_hacker@163.com", "password":"alarmchime"}
# 4. Send the login request; the cookies returned on login are stored in the session
session.post("http://www.renren.com/PLogin.do", data=data)
# 5. The session now carries the login cookies, so pages that require login can be fetched directly
response = session.get("http://www.renren.com/410043129/profile")
# 6. Print the response body
print(response.text)
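
How a Session carries state into later requests can be observed offline by preparing a request through it. This is a sketch under stated assumptions: the User-Agent string and the token cookie are hypothetical stand-ins for what a real login would set.

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "my-crawler/0.1"})  # hypothetical UA
session.cookies.set("token", "abc123")                    # hypothetical cookie

# prepare_request merges the session's headers and cookies into the request.
prepared = session.prepare_request(
    requests.Request("GET", "http://www.renren.com/410043129/profile")
)
session.close()

print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])
```

Every session.get or session.post call goes through the same merge, which is why the profile request in the example above is sent with the login cookies. Using the session as a context manager (with requests.Session() as session:) closes it automatically.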
