Basic Usage of the Requests Library

rocketeerLi

Category: Web scraping. Tags: Python, Requests, web scraping

Article link: https://blog.csdn.net/rocketeerLi/article/details/86485466

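Introduction to the Requests Library

Requests is an easy-to-learn Python HTTP library that is popular for web scraping. Compared with urllib, its API is far more concise.

Requests offers many functions, but under the hood they all call request(). Strictly speaking, then, the library has only one method, request(), although it is rarely called directly.

Below are the notes I took while working through a video tutorial.

The get() Method

Basic usage of get

First, the basic usage of get(). The code below prints the type of the object returned by get() and its status code (200 means the request succeeded):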

import requests
response = requests.get("http://www.baidu.com")
print(type(response))
print(response.status_code)

Output:

<class 'requests.models.Response'>
200

The text attribute of the returned object holds the body of the response:

import requests
response = requests.get("http://httpbin.org/get")
print(response.text)

Output:

{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "origin": "123.126.85.145",
  "url": "http://httpbin.org/get"
}

get() with parameters

The usual way

The usual way to pass parameters is to append them to the URL as a query string:

import requests
response = requests.get("http://httpbin.org/get?name=rocketeerLi&age=22")
print(response.text)

Output:

{
  "args": {
    "age": "22",
    "name": "rocketeerLi"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "origin": "123.126.85.145",
  "url": "http://httpbin.org/get?name=rocketeerLi&age=22"
}

Passing parameters as a dictionary

Writing the parameters as a dictionary and passing it through the params argument of get() makes it easy to change the request parameters dynamically:

import requests
data = {
    'name':"rocketeerLi",
    'age':22
}
response = requests.get("http://httpbin.org/get", params=data)
print(response.text)

The output is the same as when the parameters are appended to the URL directly:

{
  "args": {
    "age": "22",
    "name": "rocketeerLi"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "origin": "123.126.85.145",
  "url": "http://httpbin.org/get?name=rocketeerLi&age=22"
}
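
To see how the params dictionary ends up in the URL without sending anything over the network, one option is to build the request and inspect the prepared URL. This is a minimal sketch that relies on the Request and prepare() API of requests; the variable names are only illustrative.

```python
import requests

# The same parameters as in the dictionary example
params = {"name": "rocketeerLi", "age": 22}

# Build the request without sending it, then let requests encode the URL
request = requests.Request("GET", "http://httpbin.org/get", params=params)
prepared = request.prepare()

print(prepared.url)  # http://httpbin.org/get?name=rocketeerLi&age=22
```

Because nothing is sent, this is a cheap way to check how a dictionary will be encoded before making the real request.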

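Parsing JSON with requests

Responses are very often in JSON format, so converting them into Python objects is a frequent need. The Response object provides a convenient json() method for this, which gives the same result as calling json.loads() on the response text.

Code: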

import requests
import json
response = requests.get("http://httpbin.org/get")
print(type(response.text))
print(response.json())    # response.json() gives the same result as json.loads(response.text)

Output:

<class 'str'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.18.4'}, 'origin': '123.126.85.145', 'url': 'http://httpbin.org/get'}

The same thing with json.loads() called directly:

import requests
import json
response = requests.get("http://httpbin.org/get")
print(json.loads(response.text))
print(type(response.json()))

Output:

{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.18.4'}, 'origin': '123.126.85.145', 'url': 'http://httpbin.org/get'}
<class 'dict'>

Fetching binary data

Fetching binary data is a common task: when scraping images or videos, the response body is binary. The example below requests the GitHub favicon.

Code:

import requests
response = requests.get("https://github.com/favicon.ico")
print(type(response.text), type(response.content))
print(response.text)
print(response.content)

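Output:

<class 'str'> <class 'bytes'>

[Figure: output of the binary request]

As shown, the returned content is binary data. We can write it to a file and check whether it really is the GitHub icon.

Writing the binary data to a file: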

with open('favicon.ico', 'wb') as f:
    # the with block closes the file automatically, so no explicit close() is needed
    f.write(response.content)

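Result:

[Figure: the GitHub icon]

The file we wrote is indeed the GitHub icon.

Adding headers

Requests also lets you add headers directly. Quite often a plain get() call is rejected by the site, and a common reason is that the request carries no browser-like header information. For example, when visiting Zhihu without browser headers, the request is rejected with a 400 Bad Request error.

Code: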

import requests
# Adding headers with requests
# Without any headers, the server responds with 400 Bad Request
response = requests.get("https://www.zhihu.com/explore")
print(response.text)

Output:

<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>

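The code with header information added:

import requests
# Add browser information
headers = {
    # adjacent string literals are concatenated, so the value contains no stray indentation
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36')
}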
response = requests.get("https://www.zhihu.com/explore", headers=headers)
print(response.text)

This time the request returns the normal page:

[Figure: response after adding the headers]
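
To check which User-Agent requests would actually send, the request can be prepared without being sent. This is a rough sketch, assuming requests.utils.default_headers() and Session.prepare_request() behave as in recent requests releases; custom_agent is a shortened, illustrative value rather than a full browser string.

```python
import requests

custom_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # shortened, illustrative value

# What requests sends when no headers are given
default_agent = requests.utils.default_headers()["User-Agent"]

# Prepare (but do not send) a request through a Session, which merges
# the caller's headers over the library defaults
session = requests.Session()
request = requests.Request(
    "GET", "https://www.zhihu.com/explore", headers={"User-Agent": custom_agent}
)
prepared = session.prepare_request(request)

print(default_agent)                   # starts with "python-requests/"
print(prepared.headers["User-Agent"])  # the custom value
print(prepared.headers["user-agent"])  # header names are case-insensitive
```

The default value identifies the client as python-requests, which is the kind of signal a site can use to reject a request.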

The post() Method

Basic usage of post

import requests
# Basic POST request
data = {'name':'rocketeerLi', 'age':'22'}
response = requests.post("http://httpbin.org/post", data=data)
print(response.text)

Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "22",
    "name": "rocketeerLi"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "23",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "json": null,
  "origin": "123.126.85.145",
  "url": "http://httpbin.org/post"
}

Passing headers

import requests
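# Pass headers along with the form data
data = {'name':'rocketeerLi', 'age':'22'}
headers = {
    # adjacent string literals are concatenated, so the value contains no stray indentation
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36')
}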
response = requests.post("http://httpbin.org/post", data=data, headers=headers)
print(response.text)

The User-Agent value in the output has changed:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "22",
    "name": "rocketeerLi"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "23",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
  },
  "json": null,
  "origin": "123.126.85.145",
  "url": "http://httpbin.org/post"
}
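
The "form" and "json" fields in the httpbin output hint that a POST body can be encoded in more than one way. The sketch below compares the data argument used in this section with the json argument of requests (not covered in the original notes) by preparing both requests without sending them; the variable names are illustrative.

```python
import json
import requests

payload = {"name": "rocketeerLi", "age": "22"}
url = "http://httpbin.org/post"

# data= produces a form-encoded body, json= produces a JSON body
form_request = requests.Request("POST", url, data=payload).prepare()
json_request = requests.Request("POST", url, json=payload).prepare()

print(form_request.headers["Content-Type"])  # application/x-www-form-urlencoded
print(form_request.body)                     # name=rocketeerLi&age=22
print(json_request.headers["Content-Type"])  # application/json
print(json.loads(json_request.body))         # the original dictionary
```

The form-encoded body is 23 characters long, which matches the Content-Length value in the output shown in this section.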

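Responses

A response is the information the server sends back, and it exposes several attributes.

Response attributes

First, the values and types of the main response attributes:

import requests
# Responses
# Response attributes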
response = requests.get("https://www.taobao.com")
print(type(response.status_code), response.status_code)
print(type(response.headers), response.headers)
print(type(response.cookies), response.cookies)
print(type(response.url), response.url)
print(type(response.history), response.history)

Output:

<class 'int'> 200
<class 'requests.structures.CaseInsensitiveDict'> {'Server': 'Tengine', 'Date': 'Mon, 14 Jan 2019 15:07:13 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding, Ali-Detector-Type', 'Cache-Control': 'max-age=60, s-maxage=300', 'X-Snapshot-Age': '0', 'Content-MD5': '/6HDXsJxznvy5SRDcFhLMA==', 'ETag': 'W/"2ab2-1684bf41407"', 'Ali-Swift-Global-Savetime': '1547464389', 'Via': 'cache22.l2cm9[38,200-0,C], cache16.l2cm9[1,0], cache16.cn1247[0,200-0,H], cache7.cn1247[1,0]', 'Age': '34', 'X-Cache': 'HIT TCP_MEM_HIT dirn:-2:-2', 'X-Swift-SaveTime': 'Mon, 14 Jan 2019 15:06:39 GMT', 'X-Swift-CacheTime': '300', 'Timing-Allow-Origin': '*', 'EagleId': '2760768a15474784332657903e', 'Set-Cookie': 'thw=cn; Path=/; Domain=.taobao.com; Expires=Tue, 14-Jan-20 15:07:13 GMT;', 'Strict-Transport-Security': 'max-age=31536000', 'Content-Encoding': 'gzip'}
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[<Cookie thw=cn for .taobao.com/>]>
<class 'str'> https://www.taobao.com/
<class 'list'> []

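Checking the status code

The status code tells us whether a request succeeded: 200 means success, and other codes indicate other outcomes. The familiar 404, for example, means the requested page was not found.

Example:

import requests
# Checking the status code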
response = requests.get("http://www.taobao.com")
exit() if not response.status_code == requests.codes.ok else print("Request Successfully")
exit() if not response.status_code == 200 else print("Request Successfully")

Output:

Request Successfully
Request Successfully
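
Besides comparing status_code by hand, requests offers Response.raise_for_status(), which raises HTTPError for 4xx and 5xx codes. The sketch below is only an illustration: it builds a bare Response by hand so that it runs offline, which real code would not normally do, and check() is a hypothetical helper name.

```python
import requests


def check(status_code):
    """Return "ok" or "error" depending on how requests treats the status code."""
    # Build a bare Response by hand, only so the example runs without a network
    response = requests.Response()
    response.status_code = status_code
    try:
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx codes
    except requests.exceptions.HTTPError:
        return "error"
    return "ok"


print(check(requests.codes.ok))         # ok
print(check(requests.codes.not_found))  # error
print(check(500))                       # error
```

In real code the same call would follow requests.get(), for example response = requests.get(url) and then response.raise_for_status().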

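Advanced operations

This section covers some commonly used advanced operations.

File upload

Upload the GitHub icon we just downloaded:

import requests
# File upload
with open("favicon.ico", "rb") as f:    # the with block closes the file after the upload
    files = {'file': f}
    response = requests.post("http://httpbin.org/post", files=files)
print(response.text)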

Upload result:

[Figure: response to the upload request]
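
To get a feel for what the files argument produces, the upload request can be prepared without being sent. This is a sketch under assumptions: fake_icon is a few made-up bytes standing in for the real favicon.ico, and the tuple form (filename, content, content type) is one of the formats accepted by the files argument.

```python
import requests

# Illustrative bytes standing in for the real icon file
fake_icon = b"\x00\x00\x01\x00"

# (filename, file content, content type)
files = {"file": ("favicon.ico", fake_icon, "image/x-icon")}
prepared = requests.Request("POST", "http://httpbin.org/post", files=files).prepare()

content_type = prepared.headers["Content-Type"]
print(content_type.split(";")[0])                  # multipart/form-data
print(b'filename="favicon.ico"' in prepared.body)  # True
print(b"image/x-icon" in prepared.body)            # True
```

The body is encoded as multipart/form-data, with the file name and content type written into the part headers.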

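Getting cookies

Cookies are very useful: when you visit a site, they carry some information about the visitor. For example, if you have logged in to Taobao before, you will find on your next visit that you are still logged in, without entering your username and password again. This is what cookies do: they store information from the browser's previous visits.

Example:

import requests
# Get cookies
response = requests.get("https://www.baidu.com")
print(response.cookies)    # a RequestsCookieJar, printed in list-like form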
for key, value in response.cookies.items() :
    print(key + '=' + value)

Output:

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315

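Session persistence

Cookies are generally used to maintain a session, which makes it possible to simulate a login: after logging in once, the login state is still there on the next request to the site.

Code without session persistence:

import requests
# Two separate get requests share no state, so this does not work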
requests.get("http://httpbin.org/cookies/set/number/123456789")
response = requests.get("http://httpbin.org/cookies")
print(response.text)

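Output:

{
  "cookies": {}
}

As shown, nothing set by the first request is available to the second.

With a Session object, we can simulate a login:

import requests
# A Session object behaves like one browser making consecutive requests (e.g. for login verification)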
s = requests.Session()
s.get("http://httpbin.org/cookies/set/number/123456789")
response = s.get("http://httpbin.org/cookies")
print(response.text)

Output:

{
  "cookies": {
    "number": "123456789"
  }
}

As shown, the cookie value has been kept. This is what makes simulated logins possible.
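
The difference between the two approaches can also be seen offline by looking at the Cookie header that would be sent. This is a minimal sketch: the cookie is placed in the session's jar by hand instead of visiting /cookies/set, purely so that the example needs no network access.

```python
import requests

url = "http://httpbin.org/cookies"

# A Session keeps a cookie jar; here it is filled by hand for illustration
session = requests.Session()
session.cookies.set("number", "123456789", domain="httpbin.org", path="/")

# Prepare the same request with and without the session (nothing is sent)
with_session = session.prepare_request(requests.Request("GET", url))
without_session = requests.Request("GET", url).prepare()

print(with_session.headers.get("Cookie"))     # number=123456789
print(without_session.headers.get("Cookie"))  # None
```

The request prepared through the Session carries the stored cookie, while the standalone request has no Cookie header at all.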

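Certificate verification

Certificate verification is another issue we run into often. For HTTPS requests, requests verifies the server's SSL certificate by default; if the certificate is not valid, an SSLError is raised and the request fails.

A quick workaround is to set verify to False, which skips verification (and gives up the protection it provides). Requests then emits a warning; to silence it, import urllib3 and call disable_warnings().

import requests
# With verification (the default)
response = requests.get("https://cms.hit.edu.cn/")  # the program aborts here if verification fails
print(response.status_code)
# Without verification
response = requests.get("https://cms.hit.edu.cn/", verify=False)
print(response.status_code)
# Suppress the warning
import urllib3
urllib3.disable_warnings()
response = requests.get("https://cms.hit.edu.cn/", verify=False)
print(response.status_code)
# cert: specify a client certificate manually, as a (certificate, key) tuple
response = requests.get("https://cms.hit.edu.cn/", cert=('path/server.crt', '/path/key'))
print(response.status_code)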

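Proxy settings

Proxies are also used a lot. When we need to send requests from several hosts or get around a firewall, we can route the requests through a proxy. Code:

import requests
# Without credentials (one entry per URL scheme; an https:// URL needs the "https" key)
proxies = {
    "http":"http://178.128.63.64:8388",
    "https":"http://178.128.63.64:8388"
}
# With credentials
proxies = {
    "http":"http://user:password@178.128.63.64:8388",
    "https":"http://user:password@178.128.63.64:8388"
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)
# SOCKS proxy (needs the optional dependency: pip install requests[socks])
proxies = {
    "http":"socks5://178.128.63.64:8388",
    "https":"socks5://178.128.63.64:8388"
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)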
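
Which entry of the proxies dictionary applies depends on the scheme of the requested URL. The sketch below uses select_proxy from requests.utils, a helper that requests uses internally to pick a proxy, to show the lookup without sending anything; the dictionary names are illustrative.

```python
from requests.utils import select_proxy

http_only = {"http": "http://178.128.63.64:8388"}
both = {
    "http": "http://178.128.63.64:8388",
    "https": "http://178.128.63.64:8388",
}

# The proxy is chosen by the scheme of the URL being requested
print(select_proxy("http://httpbin.org/get", http_only))  # http://178.128.63.64:8388
print(select_proxy("https://www.taobao.com", http_only))  # None
print(select_proxy("https://www.taobao.com", both))       # http://178.128.63.64:8388
```

A dictionary with only an "http" entry therefore leaves https:// requests unproxied.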

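Timeout settings

The timeout is the longest time a request is willing to wait. If no response arrives within that time, the request is considered failed and requests stops waiting for the result.

Example:

import requests
# Timeout settings
from requests.exceptions import ConnectTimeout, ReadTimeout
response = requests.get("https://httpbin.org/get", timeout = 1)
print(response.status_code)
# A single timeout value applies both to connecting and to reading the response
try :
    response = requests.get("https://httpbin.org/get", timeout = 0.1)
except (ConnectTimeout, ReadTimeout):
    print("Timeout")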

Authentication

import requests
# Authentication
from requests.auth import HTTPBasicAuth
r = requests.get("http://www.github.com", auth=HTTPBasicAuth("rocketeerli", "xxx"))
# Shorthand: a (username, password) tuple is treated as HTTP Basic auth
r = requests.get("http://www.github.com", auth=("rocketeerli", "xxx"))
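
HTTP Basic auth boils down to one Authorization header containing the base64-encoded "username:password" pair. The following sketch prepares a request without sending it to show that header; the URL and the credentials "user" and "pass" are placeholders chosen for illustration.

```python
import base64
import requests

# Prepare (but do not send) a request with placeholder credentials
prepared = requests.Request(
    "GET", "http://httpbin.org/basic-auth/user/pass", auth=("user", "pass")
).prepare()

# Basic auth is "Basic " followed by base64("username:password")
expected = "Basic " + base64.b64encode(b"user:pass").decode("ascii")

print(prepared.headers["Authorization"])
print(prepared.headers["Authorization"] == expected)  # True
```

Base64 is an encoding, not encryption, so Basic auth only protects the credentials when the request goes over HTTPS.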

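Exception handling

Exceptions are unavoidable. As with any other kind of exception, we should anticipate the ones that may occur and handle them appropriately.

Example:

import requests
# Exception handling
from requests.exceptions import ConnectTimeout, HTTPError, RequestException
try :
    response = requests.get("http://httpbin.org/get", timeout = 0.1)
    response.raise_for_status()    # HTTPError is only raised when this is called
    print(response.status_code)
except ConnectTimeout:
    print("Timeout")
except HTTPError:
    print("HTTP error")
except RequestException:
    print("Error")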
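
The order of the except clauses matters because the requests exceptions form a hierarchy with RequestException at the root. The sketch below raises a few exception instances by hand, only to show which handler catches them; describe() is a hypothetical helper, and Timeout is the common parent of ConnectTimeout and ReadTimeout.

```python
from requests import exceptions


def describe(error):
    """Classify a requests exception, most specific handler first."""
    try:
        raise error
    except exceptions.Timeout:            # covers ConnectTimeout and ReadTimeout
        return "timeout"
    except exceptions.HTTPError:          # raised by response.raise_for_status()
        return "http error"
    except exceptions.RequestException:   # base class of the requests exceptions
        return "other request error"


print(describe(exceptions.ConnectTimeout()))   # timeout
print(describe(exceptions.ReadTimeout()))      # timeout
print(describe(exceptions.HTTPError()))        # http error
print(describe(exceptions.ConnectionError()))  # other request error
```

If the RequestException handler came first, it would catch everything and the more specific handlers would never run.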