1. Introduction to the Requests module
Official requests documentation: https://requests.readthedocs.io/projects/cn/zh-cn/latest/
1.1 Installing the requests module
pip/pip3 install requests
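# Install from a mirror for this one command only (-i sets the index URL)
# Tsinghua mirror
pip/pip3 install requests -i https://pypi.tuna.tsinghua.edu.cn/simple
# Aliyun mirror
pip/pip3 install requests -i https://mirrors.aliyun.com/pypi/simple/
# Tencent Cloud mirror (plain HTTP: pip may ignore this index unless you also pass --trusted-host mirrors.cloud.tencent.com)
pip/pip3 install requests -i http://mirrors.cloud.tencent.com/pypi/simple
# Douban mirror (plain HTTP: same caveat, with --trusted-host pypi.douban.com)
pip/pip3 install requests -i http://pypi.douban.com/simple/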
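1.2 Sending a test GET request
import requests
url = "http://www.baidu.com/"  # without a browser-like User-Agent, requesting https://www.baidu.com/ directly does not return the real page content
response = requests.get(url)
print(response.text)  # print the response body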
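2. The Requests response object
The test code above prints a lot of garbled characters because the text is decoded with a different character set from the one it was encoded with. Decoding the raw bytes explicitly fixes this:
import requests
url = "http://www.baidu.com/"
response = requests.get(url)
# Print the response body, decoded as UTF-8
print(response.content.decode())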
2.1 The difference between response.text and response.content
- response.text
  - Type: str
  - Decoding: requests makes an educated guess at the text encoding based on the HTTP response headers
- response.content
  - Type: bytes
  - Decoding: none; these are the raw bytes of the response body
# response.text: set the encoding before reading the text
import requests
url = "http://www.baidu.com/"
response = requests.get(url)
response.encoding = "utf8"
print(response.text)
# response.content: decode the bytes yourself
import requests
url = "http://www.baidu.com/"
response = requests.get(url)
print(response.content.decode())
2.2 Choosing how to decode
- response.content.decode()  # defaults to utf-8
- response.content.decode("GBK")
- Common encodings (see https://segmentfault.com/a/1190000012470400):
  - utf-8
  - gbk
  - gb2312
  - ascii
  - iso-8859-1
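When the encoding of a page is not known in advance, one option is to try several candidates in order. The sketch below is only an illustration: `decode_with_fallback` is a hypothetical helper, not part of requests, and the byte strings stand in for `response.content`.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gbk", "iso-8859-1")):
    """Try each candidate encoding in order and return (text, encoding_used)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Not reached while "iso-8859-1" is a candidate, because it maps every byte
    raise ValueError("none of the candidate encodings could decode the data")


# Simulated response bodies standing in for response.content
utf8_body = "百度一下".encode("utf-8")
gbk_body = "百度一下".encode("gbk")

print(decode_with_fallback(utf8_body))
print(decode_with_fallback(gbk_body))
```

The order of the candidates matters: iso-8859-1 never raises an error, so it only makes sense as the last resort.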
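2.3 Attributes and methods of the response object
- response.headers  # response headers
- response.json()  # parses a JSON response body into a Python object (dict or list)
- response.status_code  # HTTP status code
- response.url  # URL of the response; after redirects it may differ from the requested URL
- response.request.headers  # headers of the request that produced this response
- response.request._cookies  # cookies sent with that request, as a CookieJar (note: _cookies is a private attribute)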
3. Sending GET requests with requests
3.1 Sending a request with headers
- requests.get(url, headers=headers)
# With a browser-like User-Agent header, the https URL can be requested directly and returns the full page
import requests
url = "https://www.baidu.com/"
# First request: no custom headers
response = requests.get(url)
print(response.content.decode())
print(len(response.content.decode()))
# Build the request headers dict
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
}
# Second request: same URL with the headers; compare the two lengths
response1 = requests.get(url, headers=headers)
print(len(response1.content.decode()))
print(response1.content.decode())
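To see why a site can tell a script from a browser, it may help to inspect the headers that requests sends by default. The sketch below sends nothing over the network. It only prepares a request so the final headers can be examined.

```python
import requests

# The User-Agent that requests sends when no headers are supplied
print(requests.utils.default_user_agent())

# Default headers of a fresh session
print(dict(requests.Session().headers))

# Passing headers= overrides the default User-Agent for that request
browser_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
)
prepared = requests.Session().prepare_request(
    requests.Request("GET", "https://www.baidu.com/", headers={"user-agent": browser_ua})
)
print(prepared.headers["User-Agent"])
```

Header names are matched case-insensitively, so "user-agent" replaces the default "User-Agent" entry rather than adding a second one.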
3.2 Sending a request whose URL carries parameters
3.2.1 Requesting a URL that already contains the parameters
import requests
url = "https://www.baidu.com/s?wd=python"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)
# Save the result page so it can be opened in a browser
with open("baidu.html", "wb") as f:
    f.write(response.content)
3.2.2 Passing a parameter dict via params
Build a dict of query parameters and pass it along when sending the request.
import requests
url = "https://www.baidu.com/s?"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
}
# Build the parameter dict
data = {
    "wd": "python"
}
response = requests.get(url, headers=headers, params=data)
print(response.url)  # the final URL, with the parameters encoded into the query string
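The way params ends up in the URL can be checked without sending anything, by preparing the request and reading its url attribute. This is a minimal sketch. The second search term is an arbitrary example chosen to show percent-encoding.

```python
import requests

base_url = "https://www.baidu.com/s"

# requests appends the encoded parameters to the URL as a query string
prepared = requests.Request("GET", base_url, params={"wd": "python"}).prepare()
print(prepared.url)

# Non-ASCII values are percent-encoded as UTF-8
prepared_cn = requests.Request("GET", base_url, params={"wd": "爬虫"}).prepare()
print(prepared_cn.url)
```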
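3.2.3 Carrying cookies in the request headers
Taking a GitHub login as an example: log in with your username and password, then open the browser's developer tools and locate the cookie value in the request headers.
import requests
url = "https://github.com/Justinc-github"
# Build the request headers
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",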
"cookie": "_octo=GH1.1.1998522517.1720237897; _device_id=d9210b9ff71f5ee4935b0e6e546d1cd2; preferred_color_mode=light; tz=Asia%2FShanghai; saved_user_sessions=131849284%3AHGdgXFAY6_XeGqabbdKxVynYHMlcX_bPs_blO19xWes06BWl; user_session=HGdgXFAY6_XeGqabbdKxVynYHMlcX_bPs_blO19xWes06BWl; __Host-user_session_same_site=HGdgXFAY6_XeGqabbdKxVynYHMlcX_bPs_blO19xWes06BWl; tz=Asia%2FShanghai; color_mode=%7B%22color_mode%22%3A%22auto%22%2C%22light_theme%22%3A%7B%22name%22%3A%22light%22%2C%22color_mode%22%3A%22light%22%7D%2C%22dark_theme%22%3A%7B%22name%22%3A%22dark%22%2C%22color_mode%22%3A%22dark%22%7D%7D; logged_in=yes; dotcom_user=Justinc-github; _gh_sess=2XhTGB9CSsYX4wNP2mIHHwmEXaz8t4jDafZtlThiai1zISSbovXTLLe9iUUOlEsFq7GXbE1uq8odzysNAY1DnxtliFYoCfhv8uvk9tUvN6e5lOwKHveZaT9mo%2F9micpeFG0LAAOyJmOKABXlm3eXOlhfFwIdH%2FFPmdSBYPyhSl9uhVF0S%2FKCndVTI7wkVkkXmlfkGz3h%2BErrXv2ZahvcqzD3%2BHHVZsr0MEVB3poXV8anw75ifXUaKu4c3bqsFhYf1%2BEhbwzGwsKGt7bBMopvDn%2FfKWKD9z3ydLVBoOkaFK9fomN67sMk1HazRy5n2s46vOeMF33WH97N0DAZaOXOFlyq%2Fc9HctPkcs5UwUVsgAhnY3rBE8cJkkuiJqOIh2f1ZZB%2By8qr0Ea%2FwJpzjVHI1Swg43LJT5zpJQ%2BUitQJtl%2Fz1A0bHmx7I0SH6m3dHjfY74poY%2BysL3u87k8x%2FYwM8nk897hROELAK8iI6zNvs%2F35w1z7tvIzl%2BRjdIALOilTYDHw7srccoJglmqM4Vs9N%2B5p1PNLeRutbq17fBrSmwx8uTn05WOZdev0LQ8%2B5HgEt44LkmHmsXdXoaB6Y9CVdkgLt77ybUjMV1D%2BuJM069ioJhcNpgnEv2MfceBcrwP5mduZsomFncZjQJYj7g4Zhx%2FU5mmOOdcnVOOh%2FMMjCgOd542g00iWGQnN2QaRt9b7%2B2UR%2Bjl4Orx%2B5iKnNu1P%2F2Bz23Q6SWCw0eTn6b5OZh18aUCF02b2DVBJp2BlwucvhYw3gzn3BC40odUjmrMODdhcnUNyK9t%2B7rW%2BBV0iyLam0t8LRn2QIHyB%2Bfu0mIjnKmtKCNfZtpyP5izEP2KzmvojtUzNHzcWbiISNVViKJ2rCPB0ZcX0AQ%3D%3D--UKTmYCU19bO01YzV--3j50Rs4hv%2Fv7vsIyLVNBWg%3D%3D",
}
# Send the request
response = requests.get(url, headers=headers)
# Check whether the login state was carried over
with open("github.html", "wb") as f:
    f.write(response.content)
print(response.content)
# On success, the title tag in the saved file looks something like this:
# <title>Justinc-github (Just inc)</title>
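3.3 Keeping the login state with the cookies parameter
Cookies usually have an expiry time; once they expire you need to obtain fresh ones.
3.3.1 Format of the cookies parameter (a dict)
cookies = {"key": "value"}
3.3.2 Implementation
import requests
url = "https://github.com/Justinc-github"
# Build the request headers
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
}
# Build the cookies dict from the raw Cookie header string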
temp = "_octo=GH1.1.1998522517.1720237897; _device_id=d9210b9ff71f5ee4935b0e6e546d1cd2; preferred_color_mode=light; tz=Asia%2FShanghai; saved_user_sessions=131849284%3AHGdgXFAY6_XeGqabbdKxVynYHMlcX_bPs_blO19xWes06BWl; user_session=HGdgXFAY6_XeGqabbdKxVynYHMlcX_bPs_blO19xWes06BWl; __Host-user_session_same_site=HGdgXFAY6_XeGqabbdKxVynYHMlcX_bPs_blO19xWes06BWl; tz=Asia%2FShanghai; color_mode=%7B%22color_mode%22%3A%22auto%22%2C%22light_theme%22%3A%7B%22name%22%3A%22light%22%2C%22color_mode%22%3A%22light%22%7D%2C%22dark_theme%22%3A%7B%22name%22%3A%22dark%22%2C%22color_mode%22%3A%22dark%22%7D%7D; logged_in=yes; dotcom_user=Justinc-github; _gh_sess=2XhTGB9CSsYX4wNP2mIHHwmEXaz8t4jDafZtlThiai1zISSbovXTLLe9iUUOlEsFq7GXbE1uq8odzysNAY1DnxtliFYoCfhv8uvk9tUvN6e5lOwKHveZaT9mo%2F9micpeFG0LAAOyJmOKABXlm3eXOlhfFwIdH%2FFPmdSBYPyhSl9uhVF0S%2FKCndVTI7wkVkkXmlfkGz3h%2BErrXv2ZahvcqzD3%2BHHVZsr0MEVB3poXV8anw75ifXUaKu4c3bqsFhYf1%2BEhbwzGwsKGt7bBMopvDn%2FfKWKD9z3ydLVBoOkaFK9fomN67sMk1HazRy5n2s46vOeMF33WH97N0DAZaOXOFlyq%2Fc9HctPkcs5UwUVsgAhnY3rBE8cJkkuiJqOIh2f1ZZB%2By8qr0Ea%2FwJpzjVHI1Swg43LJT5zpJQ%2BUitQJtl%2Fz1A0bHmx7I0SH6m3dHjfY74poY%2BysL3u87k8x%2FYwM8nk897hROELAK8iI6zNvs%2F35w1z7tvIzl%2BRjdIALOilTYDHw7srccoJglmqM4Vs9N%2B5p1PNLeRutbq17fBrSmwx8uTn05WOZdev0LQ8%2B5HgEt44LkmHmsXdXoaB6Y9CVdkgLt77ybUjMV1D%2BuJM069ioJhcNpgnEv2MfceBcrwP5mduZsomFncZjQJYj7g4Zhx%2FU5mmOOdcnVOOh%2FMMjCgOd542g00iWGQnN2QaRt9b7%2B2UR%2Bjl4Orx%2B5iKnNu1P%2F2Bz23Q6SWCw0eTn6b5OZh18aUCF02b2DVBJp2BlwucvhYw3gzn3BC40odUjmrMODdhcnUNyK9t%2B7rW%2BBV0iyLam0t8LRn2QIHyB%2Bfu0mIjnKmtKCNfZtpyP5izEP2KzmvojtUzNHzcWbiISNVViKJ2rCPB0ZcX0AQ%3D%3D--UKTmYCU19bO01YzV--3j50Rs4hv%2Fv7vsIyLVNBWg%3D%3D"
cookie_list = temp.split("; ")
# # Option 1: split in a loop
# cookies = {}
# for cookie in cookie_list:
#     name, _, value = cookie.partition("=")
#     cookies[name] = value
# Option 2: dict comprehension (split on the first "=" only, so values that contain "=" stay intact)
cookies = {name: value for name, _, value in (cookie.partition("=") for cookie in cookie_list)}
print(cookies)
# Send the request
response = requests.get(url, headers=headers, cookies=cookies)
# Check whether the login state was carried over
with open("github_cookies.html", "wb") as f:
    f.write(response.content)
print(response.content)
# Still logged in: <title>Justinc-github (Just inc)</title>
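Splitting each pair on every "=" and keeping the last piece only works while no cookie value contains a literal "=". The sketch below shows a more defensive parser. `parse_cookie_header` and the sample string are hypothetical, chosen so that one value contains "=" characters.

```python
def parse_cookie_header(raw):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}, splitting on the first '=' only."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip empty fragments such as a trailing ';'
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


# Hypothetical cookie string; the token value deliberately contains '=' characters
sample = "logged_in=yes; token=abc==; tz=Asia%2FShanghai"
print(parse_cookie_header(sample))

# Splitting on every '=' and taking the last piece loses the token value
naive_value = "token=abc==".split("=")[-1]
print(repr(naive_value))
```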
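3.4 Converting a CookieJar object to a cookies dict
A response object returned by requests has a cookies attribute whose value is a CookieJar. To convert it to a plain cookies dict, use the following helper (good to know, rarely needed):
cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)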
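The conversion works in both directions and can be tried without a network request. In this sketch the cookie names and values are made up, and the jar is built by hand instead of coming from response.cookies.

```python
import requests

# Build a CookieJar from a plain dict (hypothetical cookie names and values)
jar = requests.utils.cookiejar_from_dict({"logged_in": "yes", "tz": "Asia/Shanghai"})
print(type(jar).__name__)

# Convert the jar back into a dict, as you would with response.cookies
as_dict = requests.utils.dict_from_cookiejar(jar)
print(as_dict)
```

The dict keeps only names and values; the domain, path, and expiry information stored in the jar is dropped.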
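3.5 timeout: the timeout parameter
Sets how long to wait for a response; if the limit is exceeded, an exception is raised. requests applies no timeout by default, so without this parameter a request can wait indefinitely. The parameter is often used to check whether a proxy IP is usable.
import requests
url = "https://cms.youtube.com/"
response = requests.get(url, timeout=3)  # raise an exception if the server does not respond within 3 seconds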
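In practice the timeout is usually paired with exception handling so that one slow host does not stop the whole crawl. The following is a sketch of one way to do that. `is_reachable` is a hypothetical helper, and the test address assumes nothing is listening on port 1 of localhost.

```python
import requests


def is_reachable(url, timeout=3.0):
    """Return True if the URL answers with a non-error status within `timeout` seconds."""
    try:
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        print(f"timed out after {timeout}s: {url}")
        return False
    except requests.exceptions.RequestException as exc:
        # Connection refused, DNS failure, invalid URL, and so on
        print(f"request failed: {type(exc).__name__}")
        return False
    return response.ok


# Port 1 on localhost is assumed to be closed, so this fails quickly
print(is_reachable("http://127.0.0.1:1/", timeout=1))
```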
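3.6 proxies: proxy settings
By level of anonymity, IP proxies fall into transparent, anonymous, and high-anonymity (elite) proxies. High-anonymity proxies hide the client best, so they are the ones most commonly used in crawlers.
3.6.1 What a proxy IP is
A proxy IP is an IP address that points to a proxy server; the proxy server sends requests to the target site on our behalf.
3.6.2 Forward and reverse proxies
- Forward proxy: works on behalf of the client. The client knows the real address of the server that finally handles the request and asks the proxy to forward to it, e.g. a VPN;
- Reverse proxy: works on behalf of the server. The browser does not know the real address of the server that handles the request, e.g. nginx.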
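3.6.3 Using the proxies parameter
Find a proxy list site in the browser to try this out. Taking a free proxy list as an example: https://www.lumiproxy.com/zh-hans/free-proxy/. Pick a proxy IP from the list and send the request through it.
import requests
url = "http://www.baidu.com/"
# Keys are URL schemes: the entry whose key matches the scheme of `url` is the one used
proxies = {
    "http": "http://241.252.105.120:64891",
    "https": "http://241.252.105.120:64891"
}
response = requests.get(url, proxies=proxies)
print(response.text)
- If a response comes back, the request went through the proxy; an unreachable proxy raises an exception instead.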
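Which proxy entry applies depends on the scheme of the requested URL, and this is easy to get wrong. The sketch below uses the `select_proxy` helper from requests.utils to show the lookup without sending anything. The proxy address is a placeholder from a documentation-only IP range, not a working proxy.

```python
import requests

# Placeholder proxy address; not a working proxy
proxy_url = "http://203.0.113.10:8080"
https_only = {"https": proxy_url}
both = {"http": proxy_url, "https": proxy_url}

# select_proxy returns the proxy chosen for a URL, or None if no key matches
http_with_https_only = requests.utils.select_proxy("http://www.baidu.com/", https_only)
https_with_https_only = requests.utils.select_proxy("https://www.baidu.com/", https_only)
http_with_both = requests.utils.select_proxy("http://www.baidu.com/", both)

print(http_with_https_only)
print(https_with_https_only)
print(http_with_both)
```

With only an "https" entry, a request to an http:// URL is not matched by the dict and would be sent without that proxy.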
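3.7 verify: skipping certificate verification
Some sites use a CA certificate that was not issued by a trusted root certificate authority. Visiting such a site in a browser shows "Your connection is not private", and the page cannot be crawled normally. Setting the verify parameter to False allows the request to go through.
- response = requests.get(url, verify=False)
# With verification disabled, each request emits an InsecureRequestWarning; the following code silences only that warning
import warnings
from urllib3.exceptions import InsecureRequestWarning
warnings.filterwarnings("ignore", category=InsecureRequestWarning)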
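4. Sending POST requests with requests
A POST request carries its data in the request body rather than in the URL, so the data does not show up in the address bar or browser history and is not subject to URL length limits. For a detailed comparison of GET and POST, see https://blog.csdn.net/qq_43588129/article/details/115218995. The example below crawls Baidu Translate.
- response = requests.post(url, data)
- The data parameter accepts a dict; the other parameters are the same as for GET requests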
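What requests does with the data dict can be inspected by preparing a POST request without sending it. This is a minimal sketch. The "kw" field and the URL follow the Baidu Translate example in this section, and the word "hello" is an arbitrary input.

```python
import requests

# Prepare (but do not send) a POST request to see what would go on the wire
request = requests.Request("POST", "https://fanyi.baidu.com/sug", data={"kw": "hello"})
prepared = request.prepare()

print(prepared.method)
print(prepared.headers["Content-Type"])
print(prepared.body)
```

The dict is form-encoded into the request body, and the URL itself stays free of parameters.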
4.1 Finding the translation endpoint URL
4.2 Finding the form data sent with the request
4.3 Writing the code to fetch the translation
import json
import requests
class Trans(object):
    def __init__(self, word):
        self.url = "https://fanyi.baidu.com/sug"
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
        }
        self.data = {
            "kw": word,
        }
    def get_data(self):
        # Send the form data with a POST request and return the raw response bytes
        response = requests.post(self.url, headers=self.headers, data=self.data)
        return response.content
    def parse_data(self, data):
        dict_data = json.loads(data)  # parse the JSON text into a Python dict
        print(dict_data["data"][0]["v"])
    def run(self):
        response = self.get_data()
        self.parse_data(response)
if __name__ == '__main__':
    word = input("Enter the word to translate: ")
    translate = Trans(word)
    translate.run()
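The parsing step above raises an IndexError when the "data" list is empty, for example when the input has no suggestions. A more defensive version might look like the sketch below. `first_translation` is a hypothetical helper, and the payloads are made up to match the shape that `parse_data` expects, not captured from the real endpoint.

```python
import json


def first_translation(raw):
    """Return the 'v' field of the first suggestion, or None if there is none."""
    payload = json.loads(raw)
    suggestions = payload.get("data") or []
    if not suggestions:
        return None
    return suggestions[0].get("v")


# Hypothetical payloads shaped like what parse_data expects
with_result = json.dumps({"data": [{"k": "hello", "v": "int. 你好"}]}).encode("utf-8")
without_result = json.dumps({"data": []}).encode("utf-8")

print(first_translation(with_result))
print(first_translation(without_result))
```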
4.4 Testing
The data a form has to submit may change over time, so it needs to be analyzed again during actual crawling.
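5. Using session
A session is another way of sending requests with the requests library. It automatically stores the cookies received from the pages it visits and sends them along on later requests, which keeps the login state across requests.
5.1 When to use it
- When several consecutive requests need to share state
5.2 Usage
session = requests.session()
response = session.get(url, headers=headers, ...)
response = session.post(url, data=data, headers=headers, ...)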
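The state a session carries can be observed without contacting a server. In this sketch the user-agent value, the cookie, and the example.com URL are all placeholders. The cookie is set by hand to stand in for one that a server would normally send in a Set-Cookie header.

```python
import requests

session = requests.Session()
# Headers set on the session are sent with every request made through it
session.headers.update({"user-agent": "example-crawler/0.1"})  # hypothetical value
# Stand-in for a cookie that a server would normally set
session.cookies.set("logged_in", "yes")

# Prepare a request through the session without sending it
prepared = session.prepare_request(requests.Request("GET", "https://example.com/profile"))
print(prepared.headers["User-Agent"])
print(prepared.headers.get("Cookie"))
```

Both the header and the cookie are attached automatically, which is why later requests in the same session remain logged in.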
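5.3 Example: logging in to GitHub
5.3.1 Finding the form used for login
- Enter the username and password, click Sign in, and watch which requests appear
- Before clicking Sign in, tick "Preserve log" so that previously captured requests are not cleared when the page reloads
5.3.2 Analyzing the form parameters
- The session request is followed by a 302 redirect, which matches the jump to the personal home page after a successful login
- Click Payload to view the data carried by the form. (GitHub sends the password in plain text in the form, so it is masked here)
Comparing two login attempts shows that some parameters are fixed and others change. Working out how to obtain the changing ones is the key step.
5.3.3 Analyzing and obtaining the dynamic data
- authenticity_token: look through the requests made before the successful login for this changing value (this takes patience); it turns out to be in the response to the login request;
- timestamp: this usually represents the current timestamp; it is verified shortly
- timestamp_secret: an encrypted string; we leave its value unchanged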
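Pulling a hidden form field out of a page can be sketched with a regular expression. The HTML fragment below is invented for illustration and the real login page markup may differ. `extract_hidden_field` is a hypothetical helper that assumes the name attribute comes before the value attribute.

```python
import re
import time

# Hypothetical fragment of a login page; the real markup may differ
sample_html = (
    '<form action="/session" method="post">'
    '<input type="hidden" name="authenticity_token" value="abc123TOKEN" />'
    '<input type="hidden" name="timestamp" value="1720237897000" />'
    '</form>'
)


def extract_hidden_field(html, name):
    """Return the value attribute of the <input> with the given name, or None."""
    pattern = r'<input[^>]*\bname="' + re.escape(name) + r'"[^>]*\bvalue="([^"]*)"'
    match = re.search(pattern, html)
    return match.group(1) if match else None


print(extract_hidden_field(sample_html, "authenticity_token"))
print(extract_hidden_field(sample_html, "missing_field"))

# A millisecond timestamp as an integer
now_ms = int(time.time() * 1000)
print(now_ms)
```

Returning None for a missing field makes it possible to stop early with a clear message, instead of failing with an IndexError as `re.findall(...)[0]` would.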
5.3.4 Writing the code
import time
import requests
import re
def login():
    # Fetch the login page and extract the token
    url1 = "https://github.com/login"
    session = requests.session()
    session.headers.update({
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
    })
    res_1 = session.get(url1).content.decode()
    token = re.findall('<input type="hidden" data-csrf="true" name="authenticity_token" value="(.*?)" />',
                       res_1)[0]
    print(f"Token retrieved: {token}")
    # Log in
    url2 = "https://github.com/session"
    data = {
        "commit": "Sign in",
        "authenticity_token": token,
        "add_account": "",
        "login": "your_username_or_email",  # placeholder: use your own account
        "password": "your_password",  # placeholder: use your own password
        "webauthn-conditional": "undefined",
        "javascript-support": "true",
        "webauthn-support": "supported",
        "webauthn-iuvpaa-support": "unsupported",
        "return_to": "https://github.com/login",
        # "allow_signup": "",
        # "client_id": "",
        # "integration": "",
        # "required_field_5e5f": "",
        "timestamp": int(time.time() * 1000),  # current time in milliseconds
        "timestamp_secret": "ed50ee1652fb00e6084a47feb0e630588947f2faa4b9f0f978248c5d7ce8b3f7"
    }
    session.post(url2, data=data)
    print("Login request sent")
    # Verify: the session carries the login cookies to the profile page
    url3 = "https://github.com/Justinc-github"
    response = session.get(url3)
    with open("github.html", "wb") as f:
        f.write(response.content)
    print("Saved the profile page to github.html for verification")
if __name__ == '__main__':
    login()
- Runs successfully!