Python 3 Web Scraping in Practice (6): The requests Library

please tell me

Column: Python 3 Web Scraping in Practice. Tags: web scraping, python

Article link: https://blog.csdn.net/WXY19990803/article/details/105357217

The requests library

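Although the urllib module in Python's standard library already covers most of the functionality we normally need, its API is awkward to use. requests bills itself as "HTTP for Humans", which signals that it aims to be more convenient.
Chinese documentation: http://docs.python-requests.org/zh_CN/latest/index.html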

1. Sending GET requests

The simplest way to send a GET request is to call requests.get:

import requests

headers = {
	'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
response = requests.get('https://www.baidu.com/',headers = headers)
# Inspect the response body; response.content returns the raw bytes (type bytes)
print(type(response.content))
print(response.content.decode('utf-8'))

Adding headers and query parameters. To add request headers, pass them through the headers argument. To pass parameters in the URL, use the params argument. For example:

import requests

params = {'wd':'胡歌'}
headers = {
	'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
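# params accepts a dict or a string of query parameters; a dict is URL-encoded automatically, so urlencode is not needed
response = requests.get('https://www.baidu.com/s',params = params,headers = headers)
# Inspect the response body; response.text returns the decoded Unicode text (str)
print(response.text)
# Inspect the response body; response.content returns the raw bytes (bytes)
print(response.content)
# Show the full URL that was requested
print(response.url)
# Show the character encoding taken from the response headers
print(response.encoding)
# Show the status code, much like getcode in urllib
print(response.status_code)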
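
To see the URL encoding that params performs without sending anything over the network, a request can be built and prepared locally. This is a minimal sketch: it relies on requests.Request and its prepare() method, and reuses the Baidu search URL and the wd parameter from the example above.

```python
import requests

params = {'wd': '胡歌'}
# Build the request locally; prepare() encodes params into the query string
prepared = requests.Request('GET', 'https://www.baidu.com/s', params=params).prepare()
print(prepared.url)
```

The printed URL carries the percent-encoded UTF-8 bytes of the query value, which is what requests.get sends when given the same params.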

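The difference between response.text and response.content:

response.content	The data exactly as it was fetched from the network, without any decoding, so it is of type bytes. Strings stored on disk or sent over the network are always bytes.
response.text	The string that requests produces by decoding response.content. Decoding needs an encoding, and requests guesses which one to use, so it sometimes guesses wrong and the result is garbled. In that case, decode manually with response.content.decode('utf-8').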
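
The effect of a wrong guess can be reproduced with plain bytes, with no network involved. The sketch below uses a hand-made byte string as a stand-in for response.content and assumes, for illustration only, that the wrong guess is ISO-8859-1.

```python
# Stand-in for response.content: UTF-8 encoded bytes
raw = '胡歌'.encode('utf-8')

wrong = raw.decode('iso-8859-1')  # a wrong guess gives garbled text
right = raw.decode('utf-8')       # the correct encoding restores the text

print(type(raw))
print(repr(wrong))
print(right)
```

On a real response, setting response.encoding = 'utf-8' before reading response.text is an alternative to decoding response.content by hand.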

2. Sending POST requests

The most basic POST request uses the post method:

response = requests.post('url',data = data)

Passing data: there is no need to encode it with urlencode here; just pass a dict. For example, code that requests data from Lagou (lagou.com):

import requests

data = {
	'first':'true',
	'pn':'1',
	'kd':'python'
}
headers = {
	'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
	'Referer':'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
}

response = requests.post('https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',data = data,headers = headers)
print(response.content.decode('utf-8'))
# response.json() parses the JSON body; here the result is a dict
print(response.json())

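This code is for reference only: Lagou keeps updating its anti-scraping measures, so the code above is no longer enough to get the expected result.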
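
Since the live request may no longer work, the form encoding of data can still be inspected offline by preparing the request instead of sending it. This is a sketch under assumptions: https://example.com/search is a placeholder URL, and the form fields are the ones from the example above.

```python
import requests

data = {'first': 'true', 'pn': '1', 'kd': 'python'}
# prepare() form-encodes the dict into the request body without sending it
prepared = requests.Request('POST', 'https://example.com/search', data=data).prepare()

print(prepared.body)
print(prepared.headers['Content-Type'])
```

The body is an ordinary URL-encoded form string, and requests sets the matching Content-Type header itself.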

3. Using a proxy

Adding a proxy with requests is simple: pass the proxies argument to the request method (such as get or post). For example:

import requests

url = 'https://www.baidu.com'
headers = {
	'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
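proxies = {
	# Keys are URL schemes: this entry is used only for http:// URLs; add an 'https' key to proxy https:// URLs
	# The address is the original example value and is not a valid, working proxy
	'http':'171.14.209.180.278:27829'
}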
response = requests.get(url,headers = headers,proxies = proxies)
with open('baidu.html','w',encoding = 'utf-8') as pf:
	pf.write(response.content.decode('utf-8'))

Below is the error message shown when the proxy IP is no longer valid:

[Image: error message for an invalid proxy IP]
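
The keys of the proxies dict are matched against the scheme of the target URL, so an entry under 'http' alone does not apply to an https:// address. The sketch below illustrates this with select_proxy, a helper in requests.utils, used here only for demonstration. The proxy address is a placeholder, not a working proxy.

```python
from requests.utils import select_proxy

# Placeholder proxy address, for illustration only
proxies = {'http': 'http://10.10.1.10:3128'}

# The 'http' entry matches http:// URLs only
for_http = select_proxy('http://www.baidu.com', proxies)
for_https = select_proxy('https://www.baidu.com', proxies)
print(for_http, for_https)

# Adding an 'https' entry covers https:// URLs as well
proxies['https'] = 'http://10.10.1.10:3128'
for_https_after = select_proxy('https://www.baidu.com', proxies)
print(for_https_after)
```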

4. Cookies

If a response contains cookies, you can read the returned cookie values through the cookies attribute:

import requests

url = 'URL of the site to log in to'
data = {
	'email':'122456799',
	'password':'121233456'
}
response = requests.post(url,data = data)
print(response.cookies)
# Return the cookies as a dict (get_dict is a method, so it must be called)
print(response.cookies.get_dict())
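
response.cookies is a RequestsCookieJar, and get_dict is a method on it, so it has to be called with parentheses to produce a dict. A minimal offline sketch, using a hand-built jar and a hypothetical cookie name and value in place of a real login response:

```python
from requests.cookies import RequestsCookieJar

jar = RequestsCookieJar()
jar.set('sessionid', 'abc123')  # hypothetical cookie, for illustration

print(jar.get_dict)    # without parentheses: the bound method object
print(jar.get_dict())  # {'sessionid': 'abc123'}
```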

5. Session

With urllib, an opener can send several requests that share cookies. To share cookies across requests with requests, use the session object that the library provides. For example:

import requests

login_url = 'http://www.renren.com/PLogin.do'
data = {
	'email':'xxxxxx',
	'password':'xxxxxx'
}
headers = {
	'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
session = requests.session()
session.post(login_url,data = data,headers = headers)

resp = session.get('URL of the home page after login')
with open('xxx.html','w',encoding = 'utf-8') as pf:
	pf.write(resp.content.decode('utf-8'))

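One thing I noticed here: the login URL shown on the original page is not 'http://www.renren.com/PLogin.do' but 'http://www.renren.com/SysHome.do'. With the second URL the request fails, meaning the login does not go through and no cookie is obtained.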
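
The cookie sharing can be checked without a real login. In the sketch below, a cookie is placed in the session by hand as a stand-in for what a successful login response would store; the cookie name, its value, and the example.com URL are all hypothetical. Preparing a later request through the same session shows the cookie being attached.

```python
import requests

session = requests.Session()
# Stand-in for the cookie a successful login would set (hypothetical name and value)
session.cookies.set('token', 'xyz')

# Prepare a later request through the same session without sending it
request = requests.Request('GET', 'https://example.com/home')
prepared = session.prepare_request(request)
print(prepared.headers.get('Cookie'))
```

A plain requests.get call has no such jar behind it, which is why the login and the follow-up request must go through the same session object.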

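6. Handling untrusted SSL certificates

For sites whose SSL certificate is already trusted, such as https://www.baidu.com/, requests returns the response normally. For a site whose certificate is not trusted, certificate verification fails by default; passing verify=False skips the check. For example: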

import requests

# verify = False skips SSL certificate verification
resp = requests.get('URL of a site with an invalid certificate',verify = False)
print(resp.content.decode('utf-8'))
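
With verify=False, urllib3 emits an InsecureRequestWarning on every request, and the request can still fail for other reasons. One possible way to wrap this is sketched below. fetch_insecure is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import requests
import urllib3

# Silence the InsecureRequestWarning that urllib3 emits for unverified HTTPS requests
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def fetch_insecure(url, timeout=10):
    """Fetch url without certificate verification; return the text, or None on failure."""
    try:
        resp = requests.get(url, verify=False, timeout=timeout)
        resp.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, timeouts, bad status codes and malformed URLs
        print(f'request failed: {exc}')
        return None
    return resp.content.decode('utf-8')
```

Because skipping verification removes the protection HTTPS normally gives, it is best kept to sites you already trust.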
