Advanced Guide: The Python requests Library for Web Scraping (with Examples)

Latest recommended article published on 2024-10-07 08:22:02

Python_小明


Tags: python, web scraping, programming languages, data analysis

Article link: https://blog.csdn.net/Python_2332/article/details/130941428


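This article covers the Python requests library in detail: installation, basic usage, request methods, GET requests (with headers, parameters, and cookies), and POST requests (sending JSON data, uploading files). It also covers more advanced topics such as keeping state with Session, using proxy IPs, and SSL certificate verification, with the goal of helping developers understand requests thoroughly and use it efficiently for web scraping.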

Summary generated by CSDN using AI technology

1. Introduction to the requests library

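Requests describes itself as a simple, elegant HTTP library built for human beings. It is built on top of urllib3 and is easier to use than working with urllib3 directly. requests sends HTTP/1.1 requests; there is no need to add query strings to URLs by hand or to form-encode POST data. Keep-alive and HTTP connection pooling are fully automatic, thanks to the urllib3 layer underneath. The features of requests include the following.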

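❖ Keep-Alive & connection pooling

❖ International domains and URLs

❖ Sessions with cookie persistence

❖ Browser-style SSL verification

❖ Automatic content decoding

❖ Basic / Digest authentication

❖ Elegant key/value cookies

❖ Automatic decompression

❖ Unicode response bodies

❖ HTTP(S) proxy support

❖ Multipart file uploads

❖ Streaming downloads

❖ Connection timeouts

❖ Chunked requests

❖ .netrc support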

1.1 Installing Requests

pip install requests

1.2 Basic usage of Requests

Code 1-1: Send a GET request and inspect the result

import requests

url = 'http://www.tipdm.com/tipdm/index.html'
# Send a GET request
rqg = requests.get(url)
# Check the result type
print('Result type:', type(rqg))
# Check the status code
print('Status code:', rqg.status_code)
# Check the encoding
print('Encoding:', rqg.encoding)
# Check the response headers
print('Response headers:', rqg.headers)
# Print the page content
print('Page content:', rqg.text)

Result type: <class 'requests.models.Response'>
Status code: 200
Encoding: ISO-8859-1
Response headers: {'Date': 'Mon, 18 Nov 2019 04:45:49 GMT', 'Server': 'Apache-Coyote/1.1', 'Accept-Ranges': 'bytes', 'ETag': 'W/"15693-1562553126764"', 'Last-Modified': 'Mon, 08 Jul 2019 02:32:06 GMT', 'Content-Type': 'text/html', 'Content-Length': '15693', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive'}
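The ISO-8859-1 encoding in this output is not detected from the page itself: when a text/* Content-Type header carries no charset, requests falls back to ISO-8859-1, which can garble Chinese pages. The sketch below shows one common way to deal with this. The helper name fetch_text and the 10-second timeout are illustrative assumptions, not part of the original example.

```python
import requests
from requests.utils import get_encoding_from_headers

# With no charset in a text/* Content-Type, requests falls back to ISO-8859-1.
print(get_encoding_from_headers({"content-type": "text/html"}))
# A charset declared by the server is used as-is.
print(get_encoding_from_headers({"content-type": "text/html; charset=utf-8"}))


def fetch_text(url, timeout=10):
    """Hypothetical helper: fetch a page and decode it with a guessed encoding."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # raise an exception for 4xx/5xx responses
    if resp.encoding is None or resp.encoding.upper() == "ISO-8859-1":
        # apparent_encoding is guessed from the body bytes, so it may be wrong
        resp.encoding = resp.apparent_encoding
    return resp.text
```

Calling fetch_text(url) on the page from Code 1-1 would then return text decoded with the guessed encoding rather than the header fallback.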

1.3 Basic request methods in Requests

You can send any kind of HTTP request with the requests library:

requests.get("http://httpbin.org/get")  # GET request
requests.post("http://httpbin.org/post")  # POST request
requests.put("http://httpbin.org/put")  # PUT request
requests.delete("http://httpbin.org/delete")  # DELETE request
requests.head("http://httpbin.org/get")  # HEAD request
requests.options("http://httpbin.org/get")  # OPTIONS request
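Each of these helpers is a thin wrapper around requests.request(method, url, ...). As a minimal offline sketch (nothing is sent over the network), the loop below builds a prepared request for each method and prints its method and URL. The ENDPOINTS mapping is just an illustrative name.

```python
import requests

# Method name -> httpbin endpoint, mirroring the calls listed above
ENDPOINTS = {
    "GET": "http://httpbin.org/get",
    "POST": "http://httpbin.org/post",
    "PUT": "http://httpbin.org/put",
    "DELETE": "http://httpbin.org/delete",
}

prepared = []
for method, url in ENDPOINTS.items():
    # Request(...).prepare() builds the request without sending it
    req = requests.Request(method, url).prepare()
    prepared.append(req)
    print(req.method, req.url)

# To actually send one of them:
# with requests.Session() as s:
#     resp = s.send(prepared[0], timeout=10)
```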

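2. Sending GET requests with Requests

GET is one of the most common HTTP requests, so let's first look in detail at how to build a GET request with requests.

GET signature: get(url, params=None, **kwargs):

❖ url: the URL to request

❖ params: (optional) a dictionary, list of tuples, or bytes to send in the query string of the request

❖ **kwargs: additional keyword arguments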
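As a small offline sketch of the forms params accepts, the snippet below prepares requests without sending them and prints the resulting URLs. The parameter names key1, key2, and k are made up for illustration.

```python
import requests

BASE = "http://httpbin.org/get"

# A dict: each key/value pair becomes one query parameter
as_dict = requests.Request("GET", BASE, params={"key1": "value1"}).prepare()
# A list value repeats the key once per item
as_list_value = requests.Request("GET", BASE, params={"key2": ["a", "b"]}).prepare()
# A list of tuples keeps the given order and allows repeated keys
as_tuples = requests.Request("GET", BASE, params=[("k", "1"), ("k", "2")]).prepare()

for req in (as_dict, as_list_value, as_tuples):
    print(req.url)
```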

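First, build the simplest possible GET request, to http://httpbin.org/get. When this site sees that the client sent a GET request, it returns the corresponding request information. The following builds a GET request with requests: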

import requests
r = requests.get('http://httpbin.org/get')
print(r.text)

{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.24.0",
    "X-Amzn-Trace-Id": "Root=1-5fb5b166-571d31047bda880d1ec6c311"
  },
  "origin": "36.44.144.134",
  "url": "http://httpbin.org/get"
}

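As you can see, the GET request succeeded, and the response contains the request headers, the URL, the client IP, and other information. So how do we attach extra information to a GET request?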
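Because httpbin replies with JSON, those fields can be read as a dictionary instead of raw text. The following is a minimal sketch; summarize and fetch_summary are hypothetical helper names, and the sample values are copied from the output shown earlier so the demonstration runs without a network connection.

```python
import requests


def summarize(payload):
    """Pick out the URL, client IP, and User-Agent from an httpbin-style payload."""
    return {
        "url": payload["url"],
        "origin": payload["origin"],
        "user_agent": payload["headers"]["User-Agent"],
    }


def fetch_summary(url="http://httpbin.org/get", timeout=10):
    """Hypothetical helper: send the GET request and summarize the JSON reply."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return summarize(resp.json())  # .json() parses the response body into a dict


# Offline demonstration using values from the sample output
sample = {
    "args": {},
    "headers": {"Host": "httpbin.org", "User-Agent": "python-requests/2.24.0"},
    "origin": "36.44.144.134",
    "url": "http://httpbin.org/get",
}
print(summarize(sample))
```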

2.1 Sending a request with headers

First, let's try requesting the Zhihu home page:

import requests
response = requests.get('https://www.zhihu.com/explore')
print(f"Response status code: {response.status_code}")
print(response.text)

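Response status code: 400

400 Bad Request

openresty

The status code is 400, which means the request failed. The most likely reason is that the default headers sent by requests identify the client as a script rather than a browser, so Zhihu rejects it. We therefore need to disguise the request as coming from a browser by adding a matching User-Agent (UA) header.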

import requests
headers = {
    "user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
response = requests.get('https://www.zhihu.com/explore', headers=headers)
print(f"Response status code: {response.status_code}")
# print(response.text)  # uncomment to print the page HTML

Response status code: 200

<!doctype html>

…

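Here we added headers containing a User-Agent field, which is the browser identification string. The disguise clearly worked. Pretending to be a browser in this way is one of the simplest countermeasures against anti-scraping checks.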

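GET parameter notes: how to send a request with request headers

requests.get(url, headers=headers)

- The headers parameter takes the request headers as a dictionary

- Each header field name is a key, and the field's value is the corresponding value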
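To see what the headers dictionary actually changes, the sketch below prepares two requests offline (nothing is sent): one with the library defaults and one with a custom User-Agent. Using Session.prepare_request here is an illustrative choice to make the merged headers visible; the original example simply calls requests.get.

```python
import requests

# The default User-Agent openly identifies the client as python-requests
print(requests.utils.default_user_agent())

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
}

session = requests.Session()
url = "https://www.zhihu.com/explore"
# prepare_request merges the session's default headers with per-request headers
default_req = session.prepare_request(requests.Request("GET", url))
custom_req = session.prepare_request(requests.Request("GET", url, headers=headers))

print(default_req.headers["User-Agent"])  # the python-requests default
print(custom_req.headers["User-Agent"])   # the browser-like value from the dict
```

Header names are matched case-insensitively, so "user-agent" and "User-Agent" refer to the same field.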

Exercise

Request the Baidu home page https://www.baidu.com with custom headers, and print the request headers.

Solution

import requests
url = 'https://www.baidu.com'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
# Include User-Agent in the request headers to imitate a browser
response = requests.get(url, headers=headers)
print(response.content)
# Print the request headers
print(response.request.headers)

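2.2 Sending a request with parameters

When searching on Baidu, you will often see a '?' in the URL. Everything after the question mark is the request parameters, also called the query string.

We usually don't just request a bare page. Especially when scraping dynamic pages, we need to pass different parameters to get different content. There are two ways to pass parameters in a GET request: append them directly to the URL, or pass them through params.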
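As a quick offline check that the two approaches are equivalent, the sketch below prepares both requests without sending them and compares the final URLs. It also shows that params percent-encodes spaces and non-ASCII text for you; the keyword "python 爬虫" (爬虫 means "web crawler") is just an illustrative value.

```python
import requests
from urllib.parse import parse_qs, urlsplit

# Option 1: the query string is written directly into the URL
in_url = requests.Request("GET", "https://www.baidu.com/s?wd=python").prepare()
# Option 2: the same parameter is passed through params
via_params = requests.Request(
    "GET", "https://www.baidu.com/s", params={"wd": "python"}
).prepare()
print(in_url.url == via_params.url)

# params also percent-encodes spaces and non-ASCII text automatically
encoded = requests.Request(
    "GET", "https://www.baidu.com/s", params={"wd": "python 爬虫"}
).prepare()
print(encoded.url)
# Decoding the query string recovers the original keyword
print(parse_qs(urlsplit(encoded.url).query))
```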

2.2.1 Passing parameters in the URL

Send the request directly to a URL that already contains the parameters:

import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
url = 'https://www.baidu.com/s?wd=python'
response = requests.get(url, headers=headers)

2.2.2 Passing a parameter dictionary via params

1. Build a dictionary of request parameters

2. When sending the request, pass that dictionary as the params argument

import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}