Web Scraping from Beginner to Expert (2) | Using the requests Module

Latest recommended article published on 2024-03-05 22:18:54

张烫麻辣亮。

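Category column: -- [Python-网络爬虫入门] (Python Web Scraping for Beginners) Tags: requests module

All rights reserved; infringement will be pursued.

Article link: https://blog.csdn.net/qq_40558166/article/details/102786946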


Reference blog: https://blog.csdn.net/shanzhizi/article/details/50903748

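I. requests Module Basics

1. What requests is for

The requests library covers most of what an HTTP client needs. Its features include keep-alive, connection pooling, cookie persistence, automatic content decompression, HTTP proxies, SSL verification, connection timeouts, and Sessions. Older releases supported both Python 2 and Python 3 (current releases require Python 3), and it is one of the most-starred Python projects on GitHub.

2. Installation

pip install requests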

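3. Parameters

3.1 Parameter overview

import requests

base_url = 'http://httpbin.org/get'  # placeholder target; replace with the URL you want to request

requests.get(
    url=base_url,   # URL to request
    headers={},     # request headers, e.g. {'user-agent': 'xxx'}
    params={},      # query-string parameters as a dict, e.g. {'a': 123}
    proxies={},     # proxies, e.g. {'https': 'http://168.168.16.16:9000'}
    timeout=3,      # timeout in seconds
    verify=False,   # skip SSL certificate verification (unsafe outside testing)
)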
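To see how params ends up in the request, the request can be prepared without being sent. The following is a minimal sketch: the httpbin.org URL and the parameter values are placeholders, and requests.Request(...).prepare() only builds the request locally, so no network access is involved.

```python
import requests

# Placeholder URL and parameters, chosen only for illustration.
req = requests.Request(
    'GET',
    'http://httpbin.org/get',
    params={'a': 123, 'q': 'spider'},
    headers={'user-agent': 'xxx'},
)
prepared = req.prepare()  # builds the request locally; nothing is sent

print(prepared.method)                 # HTTP method
print(prepared.url)                    # params are appended as a query string
print(prepared.headers['user-agent'])  # header as it would be sent
```

The same merging happens inside requests.get(), which is why the base URL and the params dict can be kept separate in the examples that follow.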

3.2 Supported request methods

requests.get('https://github.com/timeline.json')  # GET request
requests.post('http://httpbin.org/post')  # POST request
requests.put('http://httpbin.org/put')  # PUT request
requests.delete('http://httpbin.org/delete')  # DELETE request
requests.head('http://httpbin.org/get')  # HEAD request
requests.options('http://httpbin.org/get')  # OPTIONS request

4. The return value: the response object

import requests
r = requests.get(...)  # arguments as described in 3.1

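4.1 Attributes and methods

Code	Meaning
r.status_code	HTTP status code of the response
r.raw	The raw response body, i.e. the underlying urllib3 response object; pass stream=True to the request and read it with r.raw.read()
r.content	Response body as bytes; gzip and deflate compression are decoded automatically
r.text	Response body as a string, decoded using the character encoding declared in the response headers
r.headers	Server response headers, stored in a dict-like object whose keys are case-insensitive. r.headers.get('Some-Key') returns None when the key is missing, while r.headers['Some-Key'] raises KeyError. Cookies set by the server appear under r.headers['Set-Cookie'] (or, already parsed, in r.cookies)
r.json()	Built-in JSON decoder; parses the response body as JSON
r.raise_for_status()	Raises an exception for a failed request (4xx or 5xx status code)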
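The case-insensitive behaviour of r.headers can be tried without a live request, because the class behind it, requests.structures.CaseInsensitiveDict, can be created directly. A small sketch, with made-up header values standing in for a real server response:

```python
from requests.structures import CaseInsensitiveDict

# Hypothetical header values standing in for a real server response.
headers = CaseInsensitiveDict({
    'Content-Type': 'text/html; charset=utf-8',
    'Set-Cookie': 'sessionid=abc123; Path=/',
})

content_type = headers['content-type']  # lookup ignores case
cookie = headers.get('SET-COOKIE')      # so does .get()
missing = headers.get('X-Not-There')    # .get() returns None for a missing key

try:
    headers['X-Not-There']              # [] raises KeyError for a missing key
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(content_type, cookie, missing, lookup_failed)
```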

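4.2 Garbled text in response.text

When reading the response body as a string through response.text, the text sometimes comes out garbled. The cause is that response.encoding, the encoding requests picked for decoding, does not match the page's actual encoding.

Fix: set the correct encoding before reading response.text.

response.encoding = 'utf-8'
print(response.text)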
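Why a wrong encoding produces garbage can be shown with plain bytes, since response.text is essentially response.content decoded with response.encoding. The sketch below uses only the standard library; the sample string is arbitrary, and ISO-8859-1 is used as the wrong charset because requests falls back to it for text/* responses whose headers declare no charset.

```python
# Bytes as a server might send them: UTF-8 encoded Chinese text.
raw = '孙悟空'.encode('utf-8')

# Decoding with the wrong charset produces garbled text (mojibake).
garbled = raw.decode('iso-8859-1')

# Decoding with the page's real charset recovers the text.
fixed = raw.decode('utf-8')

print(repr(garbled))
print(fixed)
```

When the real charset is not known in advance, response.apparent_encoding (a guess derived from the body) can be assigned to response.encoding instead of a hard-coded value.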

5. Checking which request method a page uses

[Figure: screenshot not preserved in the source]

II. Three Common Cases of GET Requests with requests

1. No request parameters (Baidu products)

import requests

base_url = 'https://www.baidu.com/more/'
response = requests.get(base_url)
response.encoding = 'utf-8'

print(response.status_code)     # HTTP status code
print(response.headers)         # response headers
print(type(response.text))      # <class 'str'>
print(type(response.content))   # <class 'bytes'>


2. With request parameters (Sina News)

import requests

# 1. Set the URL
base_url = 'https://search.sina.com.cn/'  # Sina News

# 2. Build the headers dict
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
}

# 3. Set the request parameters
key = '孙悟空'  # search term (Sun Wukong)
params = {
    'q': key,
    'c': 'news',
    'from': 'channel',
    'ie': 'utf-8',
}

# 4. Send the request
response = requests.get(base_url, headers=headers, params=params)
response.encoding = 'gbk'
print(response.text)


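3. Handling pagination in requests

Pagination steps
- Step 1: work out how the paging parameter changes from page to page
- Step 2: build the headers and params dicts
- Step 3: send the requests in a for loop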

# -------------------- Scrape the first ten pages of a forum search on Baidu Tieba
import os

import requests

base_url = 'https://tieba.baidu.com/f?'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
}

# Create the output directory
dirname = './tieba/woman/'
if not os.path.exists(dirname):
    os.makedirs(dirname)


# Build the parameters and send one request per page in a for loop
for i in range(0, 10):
    params = {
        'ie': 'utf-8',
        'kw': '美女',         # name of the forum to search for
        'pn': str(i * 50),   # paging offset, growing by 50 per page
    }

    response = requests.get(base_url, headers=headers, params=params)

    # Save each page to its own HTML file
    with open(dirname + '美女第%s页.html' % (i + 1), 'w', encoding='utf-8') as file:
        file.write(response.content.decode('utf-8'))
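The paging rule can be checked on its own before any request is sent. The sketch below builds the page URLs with the standard library only; page_params and page_urls are hypothetical helper names, and the step of 50 is taken from the loop above rather than from any Tieba documentation.

```python
from urllib.parse import urlencode

BASE_URL = 'https://tieba.baidu.com/f?'
PAGE_SIZE = 50  # assumption: pn advances by 50 per page, as in the loop above


def page_params(keyword, page_index):
    """Return the query parameters for a zero-based page index."""
    return {'ie': 'utf-8', 'kw': keyword, 'pn': str(page_index * PAGE_SIZE)}


def page_urls(keyword, pages):
    """Build the full URL of each page without sending any request."""
    return [BASE_URL + urlencode(page_params(keyword, i)) for i in range(pages)]


urls = page_urls('python', 3)
for url in urls:
    print(url)
```

Printing the URLs first makes it easy to compare them against what the browser's address bar shows when clicking through the pages by hand.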

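III. Using POST Requests with requests

1. The json module

json.dumps(Python list or dict) ----> (returns) ----> JSON string

json.loads(JSON string) ----> (returns) ----> Python list or dict

The response to a POST request is often JSON data.
The json module is the standard tool for handling JSON data.
JSON data is, at bottom, just a string.

response.json()
# Parses the JSON string in the response body directly into a Python list or dict (the same result as calling json.loads on response.text)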
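The two directions of conversion can be seen in a short round trip. This sketch uses only the standard json module; the sample dict is made up for illustration.

```python
import json

# Made-up sample data.
data = {'name': 'requests', 'tags': ['http', 'python'], 'stars': None}

text = json.dumps(data)      # Python dict -> JSON string
restored = json.loads(text)  # JSON string -> Python dict

print(type(text), text)
print(type(restored), restored)
```

Note that Python's None becomes null in the JSON string and comes back as None after json.loads.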

2. Common form of a POST request

response = requests.post(
    url,
    headers={},
    data={},  # dict of form data to send in the request body
)
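What data= actually puts on the wire, and how it differs from the json= parameter that requests also accepts, can be inspected by preparing the request without sending it. A minimal sketch, assuming a placeholder endpoint and a made-up payload:

```python
import json

import requests

url = 'http://httpbin.org/post'        # placeholder endpoint
payload = {'user': 'tom', 'page': '1'}  # made-up form fields

# data= sends the dict as a URL-encoded form body.
form_req = requests.Request('POST', url, data=payload).prepare()

# json= serializes the dict to JSON and sets the matching Content-Type.
json_req = requests.Request('POST', url, json=payload).prepare()

print(form_req.headers['Content-Type'], form_req.body)
print(json_req.headers['Content-Type'], json.loads(json_req.body))
```

Which of the two a site expects can usually be read off the request's Content-Type header in the browser's developer tools.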

3. Uploading files

import requests

url = 'http://127.0.0.1:5000/upload'
files = {'file': open('/home/lyb/sjzl.mpg', 'rb')}
# files = {'file': ('report.jpg', open('/home/lyb/sjzl.mpg', 'rb'))}  # set the file name explicitly

r = requests.post(url, files=files)
print(r.text)

IV. Hook Functions in requests

Hooks can modify attributes of the response or simply print a message.

import requests


def change_url(response, *args, **kwargs):
    """Callback function."""
    response.url = '123'


# Register the hook as hooks=dict(response=change_url); requests passes each response to the callback, which can modify the result
response = requests.get('https://www.baidu.com', hooks=dict(response=change_url))
print(response.url)


V. Common requests Errors

1. Connection timeout

The server did not respond within the specified time; raises requests.exceptions.ConnectTimeout

requests.get('http://github.com', timeout=0.001)

# Raises
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='github.com', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f1b16da75f8>, 'Connection to github.com timed out. (connect timeout=0.001)'))

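2. Connect and read timeouts

The connect and read timeouts can be set separately with timeout=([connect timeout], [read timeout]). If the connection is not established in time, requests.exceptions.ConnectTimeout is raised; if the connection succeeds but the server does not send data in time, requests.exceptions.ReadTimeout is raised.

Connect: the client establishes a connection to the server, over which the HTTP request is then sent
Read: the time the client waits for the server to send data, in most cases the wait before the first byte arrives

requests.get('http://github.com', timeout=(6.05, 0.01))

# Raises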
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='github.com', port=80): Read timed out. (read timeout=0.01)

3. Unknown server

The host name cannot be resolved; raises requests.exceptions.ConnectionError

requests.get('http://github.comasf', timeout=(6.05, 27.05))

# Raises
requests.exceptions.ConnectionError: HTTPConnectionPool(host='github.comasf', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f75826665f8>: Failed to establish a new connection: [Errno -2] Name or service not known',))

4. Cannot connect to the proxy

The proxy server refuses the connection, or its port is closed or rejects connections; raises requests.exceptions.ProxyError

requests.get('http://github.com', timeout=(6.05, 27.05), proxies={"http": "192.168.10.1:800"})

# Raises
requests.exceptions.ProxyError: HTTPConnectionPool(host='192.168.10.1', port=800): Max retries exceeded with url: http://github.com/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fce3438c6d8>: Failed to establish a new connection: [Errno 111] Connection refused',)))

5. Timeout connecting to the proxy

The proxy server does not respond; raises requests.exceptions.ConnectTimeout

requests.get('http://github.com', timeout=(6.05, 27.05), proxies={"http": "10.200.123.123:800"})

# Raises
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='10.200.123.123', port=800): Max retries exceeded with url: http://github.com/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fa8896cc6d8>, 'Connection to 10.200.123.123 timed out. (connect timeout=6.05)'))

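6. Proxy read timeout

The connection to the proxy succeeded and the proxy forwarded the request to the target site, but the proxy timed out while reading from the target site.
Even if the proxy itself is fast, a slow target site still shows up as a timeout attributed to the proxy.
Assuming the proxy is usable, timeout covers connecting to and reading from the proxy server; from the client side there is no need to track whether the proxy's own connection and read succeeded.

requests.get('http://github.com', timeout=(2, 0.01), proxies={"http": "192.168.10.1:800"})

# Raises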
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='192.168.10.1:800', port=1080): Read timed out. (read timeout=0.5)

7. Network environment failure

Possibly caused by a dropped network connection; raises requests.exceptions.ConnectionError

requests.get('http://github.com', timeout=(6.05, 27.05))

# Raises
requests.exceptions.ConnectionError: HTTPConnectionPool(host='github.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc8c17675f8>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
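The errors in this section overlap in the class hierarchy, so the order of checks matters when telling them apart. The sketch below is one possible way to label them; describe_error is a hypothetical helper, and the exceptions are created by hand purely to exercise it without touching the network.

```python
from requests import exceptions as exc


def describe_error(error):
    """Map a requests exception to a short label.

    Subclasses are checked before their parents: ConnectTimeout inherits
    from both ConnectionError and Timeout, and ProxyError inherits from
    ConnectionError.
    """
    if isinstance(error, exc.ConnectTimeout):
        return 'connect timeout'
    if isinstance(error, exc.ReadTimeout):
        return 'read timeout'
    if isinstance(error, exc.ProxyError):
        return 'proxy error'
    if isinstance(error, exc.ConnectionError):
        return 'connection error'
    return 'other request error'


# Hand-made exception instances, used only to exercise the helper.
samples = (exc.ConnectTimeout(), exc.ReadTimeout(), exc.ProxyError(), exc.ConnectionError())
labels = [describe_error(error) for error in samples]
for error, label in zip(samples, labels):
    print(type(error).__name__, '->', label)
```

Checking ConnectionError first would swallow both ConnectTimeout and ProxyError, which is why it comes last among the specific cases.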

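8. Notes from the official documentation

You can tell requests to stop waiting for a response after the number of seconds given by the timeout parameter. Nearly all production code should use this parameter. Without it, your program may hang indefinitely: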

>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)

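timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds).


- In the event of a network problem (e.g. DNS failure, refused connection), Requests raises a requests.exceptions.ConnectionError exception.
- If the HTTP request returned an unsuccessful status code, Response.raise_for_status() raises an HTTPError exception.
- If a request times out, a Timeout exception is raised.
- If a request exceeds the configured maximum number of redirects, a TooManyRedirects exception is raised.
- All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.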
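Because every exception listed above inherits from requests.exceptions.RequestException, a single except clause is enough for a catch-all wrapper. The following is a minimal sketch, not a recommended production pattern; fetch is a hypothetical helper name, and the default timeout simply reuses the values from the earlier examples.

```python
import requests


def fetch(url, timeout=(6.05, 27.05)):
    """Return the response text, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx status codes into HTTPError
        return response.text
    except requests.exceptions.RequestException as error:
        print('request failed:', type(error).__name__, error)
        return None


# A URL without a scheme fails while the request is being prepared,
# before any network access, so this call is safe to run offline.
result = fetch('not-a-url')
print(result)
```

Code that needs to react differently to timeouts, proxy failures, and bad status codes can add more specific except clauses ahead of the catch-all.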
