The requests module

1. Introduction to the requests module

Official requests documentation: https://requests.readthedocs.io/projects/cn/zh-cn/latest/

1.1 Installing the requests module
pip/pip3 install requests


# Install from a mirror for a single command
# Tsinghua mirror
pip/pip3 install requests -i https://pypi.tuna.tsinghua.edu.cn/simple
# Aliyun mirror
pip/pip3 install requests -i https://mirrors.aliyun.com/pypi/simple/
# Tencent mirror
pip/pip3 install requests -i http://mirrors.cloud.tencent.com/pypi/simple
# Douban mirror
pip/pip3 install requests -i http://pypi.douban.com/simple/
1.2 Sending a test GET request
import requests

url = "http://www.baidu.com/"  # without a browser-like User-Agent, requesting https://www.baidu.com/ directly does not return the full page

response = requests.get(url)

print(response.text)  # print the response body

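2. The requests response object

The output of the test code above contains many garbled characters, because the body was decoded with a different character set from the one it was encoded in. When the response headers declare no charset for a text response, response.text falls back to ISO-8859-1, while the page here is UTF-8. Decoding the raw bytes ourselves fixes it: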

import requests

url = "http://www.baidu.com/"

response = requests.get(url)

# Print the response body, decoded as UTF-8 (the default for decode())
print(response.content.decode())
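2.1 The difference between response.text and response.content
  • response.text
    • Type: str
    • Decoding: requests makes an educated guess at the text encoding from the HTTP response headers
  • response.content
    • Type: bytes
    • Decoding: none; these are the raw bytes, to be decoded explicitly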
# response.text: set the encoding before reading the text
import requests

url = "http://www.baidu.com/"

response = requests.get(url)

response.encoding = "utf8"

print(response.text)

# response.content: decode the bytes explicitly (decode() defaults to UTF-8)
import requests

url = "http://www.baidu.com/"

response = requests.get(url)

print(response.content.decode())
2.2 Choosing a decoding approach
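
One common way to choose is sketched below, under the assumption that the page is either UTF-8 or GBK. The helper name `decode_body` and the candidate list are illustrative and not part of requests: it tries the most likely encoding first and falls back to the next one when decoding fails.

```python
def decode_body(raw, candidates=("utf-8", "gbk")):
    """Decode raw response bytes with the first candidate encoding that works."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate, replacing undecodable bytes.
    return raw.decode(candidates[0], errors="replace")


utf8_bytes = "百度一下".encode("utf-8")
gbk_bytes = "百度一下".encode("gbk")
print(decode_body(utf8_bytes))
print(decode_body(gbk_bytes))
```

With a real response this would be called as `decode_body(response.content)`. If the encoding is already known, setting `response.encoding` and reading `response.text` is simpler.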
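2.3 Attributes and methods of the response object
  • response.headers # response headers
  • response.json() # parses a JSON response body into a Python object (dict or list)
  • response.status_code # response status code
  • response.url # URL of the response; after a redirect it can differ from the requested URL
  • response.request.headers # headers of the request that produced this response
  • response.request._cookies # cookies sent with that request, as a CookieJar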
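
These attributes can be tried offline. The sketch below assumes a hand-built Response standing in for what `requests.get()` would return; `example.com`, the header value, and the JSON body are placeholders, and `_content` is a private attribute set only for the demo.

```python
import requests

# Prepare a request without sending it (example.com is a placeholder host).
request = requests.Request(
    "GET", "http://example.com/api", headers={"user-agent": "demo-agent"}
).prepare()

# Hand-built Response standing in for what requests.get() would return.
response = requests.Response()
response.status_code = 200
response.url = request.url
response.request = request
response.headers["Content-Type"] = "application/json"
response._content = b'{"name": "requests", "tags": ["http", "python"]}'  # demo only

print(response.status_code)                    # 200
print(response.url)
print(response.headers["content-type"])        # header lookup is case-insensitive
print(response.json()["tags"])                 # JSON body parsed into Python objects
print(response.request.headers["User-Agent"])  # headers of the originating request
```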

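3. Sending GET requests with requests

3.1 Sending a request with headers
  • requests.get(url, headers=headers)
# With a browser User-Agent in the headers, the https URL can be requested directly and returns the full page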

import requests

url = "https://www.baidu.com/"

response = requests.get(url)

print(response.content.decode())
print(len(response.content.decode()))
# Build the request headers dict
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
}

response1 = requests.get(url, headers=headers)
print(len(response1.content.decode()))
print(response1.content.decode())
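
Why the header matters becomes clearer when looking at what requests sends by default. The sketch below prepares requests without sending them; the shortened user-agent string is only an illustration.

```python
import requests

# Headers requests sends when none are supplied.
default_headers = requests.utils.default_headers()
print(default_headers["User-Agent"])  # starts with "python-requests/"

# Prepare (but do not send) a request through a session with a custom user-agent.
session = requests.Session()
custom = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
prepared = session.prepare_request(
    requests.Request("GET", "https://www.baidu.com/", headers=custom)
)
print(prepared.headers["User-Agent"])  # the custom value replaces the default
```

A site can tell the default `python-requests/...` value apart from a browser, which is a plausible reason the two responses above differ in length.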
3.2 Sending a request with URL parameters
3.2.1 Requesting a URL that already contains the parameters
import requests

url = "https://www.baidu.com/s?wd=python"

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
}

response = requests.get(url, headers=headers)

with open("baidu.html", "wb") as f:
    f.write(response.content)
3.2.2 Passing a parameter dict via params

Build a dict of query parameters and pass it along when sending the request.

import requests

url = "https://www.baidu.com/s?"

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
}

# Build the query parameter dict
params = {
    "wd": "python"
}
response = requests.get(url, headers=headers, params=params)

print(response.url)
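
How the dict ends up in the URL can be checked without sending anything. This is a small sketch using a prepared request; the `pn` parameter is a hypothetical second parameter added only to show how several values are joined.

```python
import requests

# Prepare the request without sending it, to see the URL requests would use.
prepared = requests.Request(
    "GET", "https://www.baidu.com/s", params={"wd": "python", "pn": 10}
).prepare()
print(prepared.url)

# Non-ASCII values are percent-encoded as UTF-8.
prepared_cn = requests.Request(
    "GET", "https://www.baidu.com/s", params={"wd": "爬虫"}
).prepare()
print(prepared_cn.url)
```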
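3.2.3 Sending cookies in the headers

Taking GitHub login as the example: log in with your username and password, then open the browser developer tools and locate the cookie header of a request.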

import requests

url = "https://github.com/Justinc-github"

# Build the request headers
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
    # Placeholder: paste the Cookie header copied from your own logged-in browser session; never publish a real one
    "cookie": "name1=value1; name2=value2",
}

# Send the request
response = requests.get(url, headers=headers)

# Check whether we are logged in
with open("github.html", "wb") as f:
    f.write(response.content)
print(response.content)

# On success, the title tag in the saved file looks like this
# <title>Justinc-github (Just inc)</title>
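3.3 Keeping the login state with the cookies parameter

Cookies usually have an expiry time; once they expire they must be obtained again.

3.3.1 Form of the cookies parameter (a dict)
cookies = {"key": "value"}
3.3.2 Implementation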
import requests

url = "https://github.com/Justinc-github"

# Build the request headers
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
}

# Build the cookies dict
# Placeholder: paste the Cookie header copied from your own logged-in browser session; never publish a real one
temp = "name1=value1; name2=value2"
cookie_list = temp.split("; ")
# # Option 1: split in a loop
# cookies = {}
# for cookie in cookie_list:
#     name, value = cookie.split("=", 1)  # split on the first "=" only; values may contain "="
#     cookies[name] = value

# Option 2: dict comprehension (again splitting on the first "=" only)
cookies = {cookie.split("=", 1)[0]: cookie.split("=", 1)[1] for cookie in cookie_list}

print(cookies)

# Send the request
response = requests.get(url, headers=headers, cookies=cookies)

# Check whether we are logged in
with open("github_cookies.html", "wb") as f:
    f.write(response.content)
print(response.content)


# Still logged in: <title>Justinc-github (Just inc)</title>
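
The splitting step deserves care, because a cookie value may itself contain "=". Below is a minimal sketch of a parser that splits on the first "=" only; the function name `parse_cookie_header` and the sample cookie values are made up for illustration.

```python
def parse_cookie_header(header):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in header.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)  # split on the first "=" only
        cookies[name] = value
    return cookies


sample = "logged_in=yes; token=abc==; tz=Asia%2FShanghai"  # made-up values
print(parse_cookie_header(sample))

# Taking the last piece after splitting on every "=" loses such a value.
print(repr("token=abc==".split("=")[-1]))  # ''
```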
3.4 Converting a CookieJar object to a cookies dict

The response object returned by requests has a cookies attribute whose value is a CookieJar. It can be converted to a plain dict as follows (for reference only):

cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)
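
The conversion works in both directions and can be tried offline. In this sketch the jar is built from a dict instead of coming from a response; the cookie names and values are made up.

```python
import requests

# Build a CookieJar from a dict (the cookie names and values are made up).
jar = requests.utils.cookiejar_from_dict({"logged_in": "yes", "tz": "Asia%2FShanghai"})
print(type(jar).__name__)  # RequestsCookieJar

# Convert it back to a plain dict, as done with response.cookies above.
cookies_dict = requests.utils.dict_from_cookiejar(jar)
print(cookies_dict)
```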
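3.5 timeout: the timeout parameter

Sets how long to wait for the server; when the limit is exceeded an exception is raised. requests has no default timeout, so without this parameter a request can hang indefinitely. It is commonly used to check whether a proxy IP is usable.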

import requests

url = "https://cms.youtube.com/"

response = requests.get(url, timeout=3)  # raises an exception if connecting or waiting for data takes longer than 3 seconds
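
Since a timeout surfaces as an exception, it is usually caught. Here is a minimal sketch, assuming that returning None on timeout suits the caller; the function name `fetch_with_timeout` and the default values are illustrative.

```python
import requests


def fetch_with_timeout(url, timeout=(3, 10)):
    """Return the response, or None when the request times out.

    timeout is (connect seconds, read seconds); a single number applies to both.
    """
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both ConnectTimeout and ReadTimeout.
        return None
```

When checking proxies, a None result from something like `fetch_with_timeout(url, timeout=3)` would mark the proxy as too slow.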
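3.6 proxies: proxy settings

By anonymity level, IP proxies fall into transparent, anonymous, and high-anonymity proxies. High-anonymity proxies hide the client best, so they are the ones most often used in crawlers.

3.6.1 What a proxy IP is

A proxy IP is an IP address that points to a proxy server, which sends requests to the target site on our behalf.

3.6.2 Forward and reverse proxies
  • Forward proxy: acts on behalf of the client; the client knows the real address of the server that finally handles the request, e.g. a VPN;
  • Reverse proxy: acts on behalf of the server; the browser does not know the real address of the server, e.g. nginx.
3.6.3 Using the proxies parameter

Find a proxy list site in the browser and try one of its addresses. A free proxy list is used here as the example: https://www.lumiproxy.com/zh-hans/free-proxy/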

import requests

url = "http://www.baidu.com/"
proxies = {
    # The key is the scheme of the target URL, so the http URL above needs the "http" entry
    "http": "http://241.252.105.120:64891",
    "https": "http://241.252.105.120:64891"
}
response = requests.get(url, proxies=proxies)
print(response.text)
  • The proxy was used successfully
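
Which entry of the proxies dict applies depends on the scheme of the target URL. The sketch below uses `select_proxy` from `requests.utils`, a helper requests uses internally (treat relying on it as an assumption, not a documented interface); the proxy address is a placeholder.

```python
from requests.utils import select_proxy

# Placeholder proxy address for illustration; substitute a working one.
proxy = "http://10.10.1.10:3128"

only_https = {"https": proxy}
both = {"http": proxy, "https": proxy}

print(select_proxy("http://www.baidu.com/", only_https))   # None: no entry for http
print(select_proxy("https://www.baidu.com/", only_https))  # the proxy address
print(select_proxy("http://www.baidu.com/", both))         # the proxy address
```

A dict with only an "https" key therefore leaves plain http requests going out directly, which is easy to miss when testing a proxy.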

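3.7 verify: skipping certificate verification

Some sites use a certificate that was not issued by a trusted root certificate authority. The browser then shows "Your connection is not private", and requests refuses to fetch the page by raising an SSLError. Setting verify to False lets the request through.

  • response = requests.get(url, verify=False)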
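# With verification off, a warning is shown; the following code silences it (note that it hides all warnings, not only this one)
import warnings
warnings.filterwarnings('ignore')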

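4. Sending POST requests with requests

A POST request carries its data in the request body rather than in the URL, so the data does not show up in the address bar or browser history, and it is not subject to URL length limits. For the detailed differences from GET, see this article: https://blog.csdn.net/qq_43588129/article/details/115218995. Baidu Translate is used as the example below.

  • response = requests.post(url, data)
  • data takes a dict; the other parameters are the same as for a GET request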
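
What requests does with the data dict can be inspected without sending anything. This sketch prepares a POST with a sample word and prints the resulting body and content type.

```python
import requests

# Prepare (but do not send) a POST to see what the data dict becomes.
prepared = requests.Request(
    "POST", "https://fanyi.baidu.com/sug", data={"kw": "dog"}
).prepare()

print(prepared.method)
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # kw=dog
```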
4.1 Finding the translation endpoint

4.2 Finding the data sent with the form

4.3 Writing the code to fetch the translation
import json

import requests


class Trans(object):
    def __init__(self, word):
        self.url = "https://fanyi.baidu.com/sug"
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
        }
        self.data = {
            "kw": word,
        }

    def get_data(self):
        response = requests.post(self.url, headers=self.headers, data=self.data)
        return response.content

    def parse_data(self, data):
        dict_data = json.loads(data)  # json.loads turns the JSON string into a Python dict
        print(dict_data["data"][0]["v"])

    def run(self):
        response = self.get_data()
        self.parse_data(response)


if __name__ == '__main__':
    word = input("Enter the word to translate: ")
    translator = Trans(word)
    translator.run()
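
The parsing step above raises an IndexError when the endpoint returns no suggestion. Below is a defensive sketch; the function name `first_translation` is hypothetical, and the sample payloads are made up, shaped only after the `data`, index 0, and `v` fields that the code above reads.

```python
import json


def first_translation(raw):
    """Return the "v" field of the first suggestion, or None if there is none."""
    payload = json.loads(raw)
    entries = payload.get("data") or []
    if not entries:
        return None
    return entries[0].get("v")


# Made-up payloads shaped like the fields the code above reads.
print(first_translation(b'{"data": [{"v": "n. dog"}]}'))
print(first_translation(b'{"data": []}'))
```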
4.4 Testing

The data a form has to submit can change over time, so it needs to be analysed as part of the actual crawl.

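5. Using session

A session is another way to send requests with the requests library. It automatically stores the cookies received from the pages it visits and sends them along on later requests, which keeps the login state.

5.1 When to use it
  • When several consecutive requests need to share state
5.2 Usage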
session = requests.session()
response = session.get(url, headers=headers, ...)
response = session.post(url, data=data, headers=headers, ...)
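
The state-keeping behaviour can be observed offline. In this sketch a cookie is placed in the session by hand as a stand-in for one a server would set at login; the cookie name, value, host, and user-agent are made up.

```python
import requests

session = requests.Session()
session.headers.update({"user-agent": "demo-agent"})

# Stand-in for a cookie the server would set on login (name and value are made up).
session.cookies.set("sessionid", "abc123", domain="example.com", path="/")

# Prepare a later request through the same session without sending it.
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/profile")
)
print(prepared.headers["User-Agent"])  # demo-agent
print(prepared.headers["Cookie"])      # sessionid=abc123
```

Headers set on the session and cookies stored in it are attached to every request made through it, so they do not have to be passed on each call.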
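5.3 Example: logging in to GitHub
5.3.1 Finding the form used for login
  • Enter the username and password, click sign in, and watch which requests appear
  • Before clicking sign in, tick "Preserve log" so that previously captured requests are not cleared when the page reloads

5.3.2 Analysing the form parameters
  • The session request is followed by a 302 redirect, which matches being taken to the profile page after a successful login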

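  • Click the Payload tab to view the data sent with the form. (GitHub sends the password as plain text in the form, so I masked it here.)

Comparing two login attempts shows that some parameters are fixed while others change. Working out how to obtain the changing ones is the crucial part.
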
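5.3.3 Analysing and obtaining the dynamic data
  • authenticity_token: look through the requests made before the login succeeded for these changing values (this takes patience); the parameter turns out to be in the response of the login page;
  • timestamp: usually the current timestamp; we verify this shortly
  • timestamp_secret: an encrypted secret; we leave its value unchanged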
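
Obtaining the two values that can be derived is sketched below. The HTML fragment is made up and only mimics the hidden input found in the login page; the helper names `extract_token` and `current_timestamp_ms` are illustrative.

```python
import re
import time

# Made-up HTML fragment shaped like the hidden input the login page contains.
sample_html = (
    '<form action="/session" method="post">'
    '<input type="hidden" name="authenticity_token" value="abc123" />'
    "</form>"
)


def extract_token(html):
    """Return the authenticity_token value, or None if the input is missing."""
    match = re.search(r'name="authenticity_token"\s+value="([^"]+)"', html)
    return match.group(1) if match else None


def current_timestamp_ms():
    """Current Unix time in whole milliseconds."""
    return int(time.time() * 1000)


print(extract_token(sample_html))
print(current_timestamp_ms())
```

Returning None for a missing token makes it possible to stop early with a clear message when the page layout changes, rather than failing on an index error.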
5.3.4 Writing the code
import time

import requests
import re


def login():
    # Get the token
    url1 = "https://github.com/login"
    session = requests.session()
    # update() keeps the session's other default headers instead of replacing them all
    session.headers.update({
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
    })

    res_1 = session.get(url1).content.decode()
    token = re.findall('<input type="hidden" data-csrf="true" name="authenticity_token" value="(.*?)" />'
                       , res_1)[0]
    print(f"Got the token: {token}")
    # Log in
    url2 = "https://github.com/session"
    data = {
        "commit": "Sign in",
        "authenticity_token": token,
        "add_account": "",
        # Placeholders: use your own account, and never publish real credentials
        "login": "your_email@example.com",
        "password": "your_password",
        "webauthn-conditional": "undefined",
        "javascript-support": "true",
        "webauthn-support": "supported",
        "webauthn-iuvpaa-support": "unsupported",
        "return_to": "https://github.com/login",
        # "allow_signup": "",
        # "client_id": "",
        # "integration": "",
        # "required_field_5e5f": "",
        "timestamp": int(time.time() * 1000),  # current time in whole milliseconds
        "timestamp_secret": "ed50ee1652fb00e6084a47feb0e630588947f2faa4b9f0f978248c5d7ce8b3f7"
    }
    session.post(url2, data=data)
    print("Login form submitted")
    # Verify
    url3 = "https://github.com/Justinc-github"
    response = session.get(url3)
    with open("github.html", "wb") as f:
        f.write(response.content)
    print("Profile page saved to github.html")


if __name__ == '__main__':
    login()
  • It runs successfully.

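5.3.5 Inspecting the generated file

Open the saved github.html; when the login worked, its <title> shows the account name, in the same form as the title shown in section 3.2.3.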

Related code: https://gitee.com/justinc666/crawler/tree/master/requests
