Web Scraping Study Notes 2: Crawler Basics with the requests Module


陈弟弟


Article link: https://blog.csdn.net/weixin_41446786/article/details/108072004


1. requests module basics

(1) Install the requests module: pip install requests (or pip3 install requests)
(2) Send a GET request with the requests module

	import requests

	# Target URL
	url = 'https://www.baidu.com'

	# Send a GET request to the target URL
	response = requests.get(url)

	# Print the response body
	print(response.text)


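The returned text is garbled. This is an encoding problem: response.text is decoded with an encoding that requests guesses, and the guess is wrong for this page.

response.text
Type: str
Decoding: requests makes an educated guess at the encoding based on the HTTP headers and decodes the body with the guessed encoding

response.content
Type: bytes
Decoding: none; these are the raw bytes
response.content.decode() decodes as utf-8 by default
response.content.decode("GBK") decodes with an explicitly named encoding
Common character encodings:
utf-8
gbk
gb2312
ascii (pronounced "ask-ee")
iso-8859-1

Improved program: decode response.content explicitly, for example print(response.content.decode()), instead of relying on response.text.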
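When the page's encoding is not known in advance, one option is to try a few candidate encodings in turn. The following is a minimal sketch under that assumption; decode_body and its default encoding list are illustrative names, not part of requests, and the byte strings stand in for response.content so that nothing is fetched.

```python
def decode_body(body: bytes, encodings=("utf-8", "gbk")) -> str:
    """Try each candidate encoding in order and return the first clean decode."""
    for encoding in encodings:
        try:
            return body.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and mark bad bytes with U+FFFD.
    return body.decode(encodings[0], errors="replace")


# Stand-ins for response.content: the same text encoded two different ways.
utf8_body = "百度一下".encode("utf-8")
gbk_body = "百度一下".encode("gbk")
print(decode_body(utf8_body))
print(decode_body(gbk_body))
```

With a real response, the call would be decode_body(response.content).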

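(3) The response object: the object returned by a request to a URL

response.url: the URL of the response; it sometimes differs from the requested URL (for example, after a redirect)
response.status_code: the response status code
response.request.headers: the request headers that produced this response
response.headers: the response headers
response.request._cookies: the cookies sent with the request; a CookieJar
response.cookies: the cookies set by the response (via Set-Cookie); a CookieJar
response.json(): parses a JSON response body into a Python object (dict or list)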
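To make the last item concrete: for a JSON body, response.json() gives roughly what the standard json module gives for the decoded body. The sketch below uses a made-up byte string as a stand-in for response.content, so it runs without a network connection.

```python
import json

# Stand-in for response.content; the payload is invented for illustration.
body = b'{"status": 1, "items": ["a", "b"]}'

# Decode the bytes, then parse the JSON text into a Python object.
data = json.loads(body.decode("utf-8"))
print(type(data).__name__, data["items"])
```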

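(4) Sending requests with requests: a crawler has to look like an ordinary browser to the server, so its requests should be disguised as browser requests
1) Send a request with request headers: requests.get(url, headers=headers)

Note: the headers argument is a dict; each header name is a key and the header's value is the corresponding value

Copy the User-Agent from the browser and build the headers dict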

import requests

url = 'https://www.baidu.com'

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0"}

# Include the User-Agent in the request headers to imitate a browser
response = requests.get(url, headers=headers)

print(response.content.decode())
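When several headers are copied from the browser's developer tools at once, typing the dict by hand gets tedious. A small helper along the following lines can build it from the copied "Name: value" lines. This is a sketch: headers_from_raw is a hypothetical name, and the Accept-Language line is only sample input.

```python
def headers_from_raw(raw: str) -> dict:
    """Turn 'Name: value' lines copied from the browser into a headers dict."""
    headers = {}
    for line in raw.strip().splitlines():
        # Split on the first colon only; values may contain colons themselves.
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            continue  # skip blank lines and lines that are not 'Name: value'
        headers[name.strip()] = value.strip()
    return headers


raw = """
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0
Accept-Language: zh-CN,zh;q=0.8
"""
headers = headers_from_raw(raw)
print(headers["User-Agent"])
```

The resulting dict can be passed directly as requests.get(url, headers=headers).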

The request headers that were actually sent can be printed with print(response.request.headers).

(5) Carry cookies in the request headers to get past the GitHub login

import requests

url = 'https://github.com/Amen-bang'

# Build the request headers dict
headers = {
    # User-Agent copied from the browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
    # Cookie string copied from the browser after logging in
    'Cookie': 'cookie string copied from the browser'
}

# Send the cookie string as part of the headers dict
resp = requests.get(url, headers=headers)

print(resp.text)

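(6) Using the cookies parameter

Build a cookies dict: cookies = {"cookie name": "cookie value"}
Convert a cookie string into the dict that the cookies parameter expects:

① Dict comprehension

	cookies_dict = {cookie.split('=', 1)[0]: cookie.split('=', 1)[-1] for cookie in cookies_str.split('; ')}

② A more robust approach:

cookie_list = cookies_str.split(';')
cookies = {}
for cookie in cookie_list:
	# Split on the first '=' only, so values that contain '=' stay intact
	name, _, value = cookie.strip().partition('=')
	cookies[name] = value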
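The conversion can also be wrapped in a function that tolerates trailing semicolons and malformed segments. The sketch below is one way to do it; cookies_from_string is a hypothetical helper and the sample string is invented. Splitting on the first '=' only matters because cookie values (base64 tokens, for instance) often contain '=' themselves.

```python
def cookies_from_string(cookie_str: str) -> dict:
    """Parse 'k1=v1; k2=v2' into a dict, splitting each pair on the first '='."""
    cookies = {}
    for pair in cookie_str.split(";"):
        name, sep, value = pair.strip().partition("=")
        if not sep or not name:
            continue  # ignore empty or malformed segments
        cookies[name] = value
    return cookies


sample = "logged_in=yes; token=abc==; theme=dark;"
print(cookies_from_string(sample))
```

The returned dict is in the shape expected by requests.get(url, cookies=...).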

import requests

url = 'https://github.com/Amen-bang'

# Build the request headers dict
headers = {
    # User-Agent copied from the browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
}

# Cookie string copied from the browser, in the form 'name1=value1; name2=value2'
cookies_str = 'cookie string copied from the browser'
cookies_dict = {cookie.split('=', 1)[0]: cookie.split('=', 1)[-1] for cookie in cookies_str.split('; ')}

resp = requests.get(url, headers=headers, cookies=cookies_dict)

print(resp.text)

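(7) Handling CookieJar objects: a response object obtained with requests has a cookies attribute. Its value is a CookieJar that holds the cookies the remote server set.

cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)

Note: requests.utils.dict_from_cookiejar returns the cookies as a dict

(8) Using the timeout parameter
Usage: response = requests.get(url, timeout=seconds), where the value is how long to wait for the server, in seconds; if the server does not respond in time, an exception is raised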
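A timeout is often paired with a retry, since a request that times out once may succeed on the next attempt. The following is a generic sketch of that pattern; fetch_with_retry and its parameters are hypothetical and not part of requests. It takes any zero-argument callable, so the request itself is passed in as a lambda.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay=0.0, retry_on=(Exception,)):
    """Call fetch() up to `attempts` times and re-raise the last error."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # wait before the next attempt
    raise last_error


# Intended use with requests (not executed here):
#   import requests
#   response = fetch_with_retry(
#       lambda: requests.get(url, timeout=3),
#       retry_on=(requests.exceptions.Timeout,),
#   )
```

Restricting retry_on to requests.exceptions.Timeout keeps other errors, such as an invalid URL, from being retried pointlessly.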

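2. Proxy servers and the proxies parameter

A proxy IP points to a proxy server: the request goes to the proxy server first, and the proxy server forwards it to the target server
(1) Forward proxies and reverse proxies

A proxy that forwards requests on behalf of the browser or client (the side sending the request) is a forward proxy
The browser knows the real IP address of the server that finally handles the request; a VPN is an example
A proxy that forwards requests on behalf of the server that finally handles them, rather than on behalf of the browser or client, is a reverse proxy
The browser does not know the server's real address; nginx is an example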

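(2) Types of proxy IPs (proxy servers)
1) By degree of anonymity

Transparent proxy: the target server sees the proxy's IP as the connecting address, but the proxy forwards your real IP in a header, so you can still be identified. The target server receives:

	REMOTE_ADDR = Proxy IP
	HTTP_VIA = Proxy IP
	HTTP_X_FORWARDED_FOR = Your IP

Anonymous proxy: others can tell that you are using a proxy, but not who you are. The target server receives:

	REMOTE_ADDR = Proxy IP
	HTTP_VIA = Proxy IP
	HTTP_X_FORWARDED_FOR = Proxy IP

Elite proxy (high anonymity proxy): others cannot tell that you are using a proxy at all, which makes it the best choice of the three. The target server receives:

	REMOTE_ADDR = Proxy IP
	HTTP_VIA = not determined
	HTTP_X_FORWARDED_FOR = not determined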
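The three cases can be written as a small decision function over the values the target server observes. This is only an illustration of the table above: classify_proxy is a hypothetical helper, "not determined" is modelled as a missing key, and the IP addresses are documentation placeholders.

```python
def classify_proxy(seen: dict, client_ip: str) -> str:
    """Classify a proxy from the values the target server observed."""
    via = seen.get("HTTP_VIA")
    forwarded = seen.get("HTTP_X_FORWARDED_FOR")
    if via is None and forwarded is None:
        return "elite"  # no trace of a proxy in the headers
    # X-Forwarded-For may carry a comma-separated list of addresses.
    forwarded_ips = [ip.strip() for ip in (forwarded or "").split(",")]
    if client_ip in forwarded_ips:
        return "transparent"  # the real client IP is passed along
    return "anonymous"  # the proxy is visible, the client IP is not


PROXY, CLIENT = "203.0.113.10", "198.51.100.7"
print(classify_proxy({"REMOTE_ADDR": PROXY, "HTTP_VIA": PROXY,
                      "HTTP_X_FORWARDED_FOR": CLIENT}, CLIENT))
print(classify_proxy({"REMOTE_ADDR": PROXY, "HTTP_VIA": PROXY,
                      "HTTP_X_FORWARDED_FOR": PROXY}, CLIENT))
print(classify_proxy({"REMOTE_ADDR": PROXY}, CLIENT))
```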

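2) By protocol: the proxy service has to match the protocol that the target site uses. By the protocol of the proxied request, proxies can be divided into:

HTTP proxy: the target URL uses http
HTTPS proxy: the target URL uses https
SOCKS tunnel proxy (for example, SOCKS5):
A SOCKS proxy simply relays packets and does not care which application protocol is in use (FTP, HTTP, HTTPS, and so on).
A SOCKS proxy adds less latency than an HTTP or HTTPS proxy.
A SOCKS proxy can forward both HTTP and HTTPS requests.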

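(3) Using the proxies parameter

import requests

url = 'https://www.baidu.com'

# Build a proxies dict (the addresses are placeholders)
proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "https://12.34.56.79:9527",
}
response = requests.get(url, proxies=proxies)

Note: if the proxies dict contains several key-value pairs, the proxy is chosen according to the protocol of the requested URL
Proxy providers include 快代理, 米扑代理, and others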
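The selection rule in the note can be illustrated with a few lines of standard-library code. This is a simplified sketch of the idea, not the logic requests itself runs (requests also understands keys such as "all" and host-specific keys); pick_proxy is a hypothetical name.

```python
from urllib.parse import urlparse


def pick_proxy(url: str, proxies: dict):
    """Return the proxy whose key matches the URL scheme, or None."""
    scheme = urlparse(url).scheme.lower()
    return proxies.get(scheme)


proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "https://12.34.56.79:9527",
}
print(pick_proxy("https://www.baidu.com", proxies))
print(pick_proxy("http://fy.iciba.com/ajax.php?a=fy", proxies))
```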

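3. Ignoring CA certificate verification with the verify parameter

Some sites use a certificate that was not issued by a trusted root certificate authority
Crawling such a site raises an exception: ssl.CertificateError …

Solution
To make the request succeed from code, pass verify=False; requests then skips verification of the CA certificate:

	import requests

	url = "https://sam.huat.edu.cn:8443/selfservice/"
	response = requests.get(url, verify=False)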

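4. Sending POST requests with the requests module

(1) Sending a POST request: response = requests.post(url, data=data), where data is a dict of form fields
(2) Simulating a POST request to Kingsoft Translate (金山翻译)
① Determine the request URL: http://fy.iciba.com/ajax.php?a=fy
② Determine the request parameters

③ Locate the returned data: the out field

④ Imitate the browser and fetch the data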

import requests
import json

class King(object):
    def __init__(self, word):
        # 1 Determine the request URL
        self.url = 'http://fy.iciba.com/ajax.php?a=fy'
        # 2 Build the request headers dict
        self.headers = {
            # User-Agent copied from the browser
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
        }
        # 3 Determine the request parameters
        self.data = {
            "f": "auto",
            "t": "auto",
            "w": word
        }


    def get_data(self):
        # Send the form data as a POST request and return the decoded body
        response = requests.post(self.url, headers=self.headers, data=self.data)
        return response.content.decode()

    # 4 Parse the data
    def parse_data(self, data):
        data = json.loads(data)
        # Example body:
        #{"status":1,"content":{"from":"zh","to":"en-US","vendor":"ciba","out":" The road is long and steady","ciba_use":"\u6765\u81ea\u673a\u5668\u7ffb\u8bd1\u3002","ciba_out":"","err_no":0}}
        print(data['content']['out'])


    def run(self):
        resp = self.get_data()
        #print(resp)
        self.parse_data(resp)


if __name__ == '__main__':
    king = King('道阻且长，稳步前行')
    king.run()
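The parsing step can be tried offline against the sample body quoted in the parse_data comment. The sketch below adds a guard for bodies that are not JSON or lack the expected fields; extract_translation is a hypothetical helper, and the only response format it assumes is the one shown in that comment.

```python
import json

# Sample body taken from the comment in parse_data.
SAMPLE = (
    '{"status":1,"content":{"from":"zh","to":"en-US","vendor":"ciba",'
    '"out":" The road is long and steady",'
    '"ciba_use":"\\u6765\\u81ea\\u673a\\u5668\\u7ffb\\u8bd1\\u3002",'
    '"ciba_out":"","err_no":0}}'
)


def extract_translation(raw: str):
    """Return content.out from a response body, or None if it is missing."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    content = payload.get("content") if isinstance(payload, dict) else None
    if not isinstance(content, dict):
        return None
    out = content.get("out")
    # The sample value has a leading space, so strip surrounding whitespace.
    return out.strip() if isinstance(out, str) else None


print(extract_translation(SAMPLE))
```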

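Where POST data comes from:
(1) Fixed values: values that stay the same across captured requests (Kingsoft Translate)
(2) Input values: values that change with your own input (Kingsoft Translate: the text you type in for translation)
(3) Preset values in static files: must be extracted from the static HTML beforehand (Baidu Translate)
(4) Preset values from a request: obtained by sending a request to a specific URL
(5) Generated on the client: analyze the JavaScript and reproduce the generated data (Youdao Translate)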

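5. Keeping state with requests.session

Purpose: handles cookies automatically, so each request carries the cookies set by earlier ones
Use case: handling the cookies produced over a sequence of requests

session = requests.session() # create a session object
response = session.get(url, headers=headers)
response = session.post(url, data=data)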

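Application: log in to GitHub with requests.session and fetch a page that is only available after login
(1) Packet capture: open the GitHub login page with the browser's network panel open, and tick Preserve log so the capture is not cleared on navigation
(2) Click sign in and capture the session request
(3) Repeat steps (1) and (2) to capture the session request again, compare the two captures, and find the value that changes: authenticity_token
Analysis:
① Rule out fixed values and input values as the source
② Search the static files: login -> response -> search for authenticity_token
③ Extract authenticity_token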

import re
import requests

def login():
    # Session
    session = requests.session()

    # Headers used by every request in this session
    session.headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
    }

    # 1. url1: get the token
    url1 = 'https://github.com/login'
    # Send the request and decode the response body
    response = session.get(url1, headers=session.headers).content.decode()
    # Extract with a regex (.: any single character; *: zero or more of the preceding item; ? after *: match as few characters as possible)
    authenticity_token = re.findall('name="authenticity_token" value="(.*?)" />',response)[0]
    print(authenticity_token)
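The extraction step can be checked offline before running it against the live page. In the sketch below the HTML fragment and the token value are invented, and extract_token is a hypothetical helper; it returns None instead of raising IndexError when the pattern does not match, which is useful if the page markup changes.

```python
import re

# Hypothetical fragment of a login page; the token value is invented.
SAMPLE_HTML = (
    '<form action="/session" method="post">'
    '<input type="hidden" name="authenticity_token" value="abc123==" />'
    '</form>'
)

# (.*?) is non-greedy: it stops at the first closing quote after value=".
TOKEN_PATTERN = re.compile(r'name="authenticity_token" value="(.*?)"')


def extract_token(html: str):
    """Return the first authenticity_token value, or None if none is found."""
    match = TOKEN_PATTERN.search(html)
    return match.group(1) if match else None


print(extract_token(SAMPLE_HTML))
```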

(4) Log in to GitHub in the browser and capture the session request
① Request URL of the session request: https://github.com/session
② Form data to build:

commit: Sign in
authenticity_token: g56j3289HGlSj8+X1e2fVWMmqhu0asyzAcPWvNuTtk6RpiPg7MHpgpY9oDb5cp+2rUoII1/INMlkbaYCld32ow==
ga_id: 1338935970.1597742658
login: your username
password: your password
webauthn-support: supported
webauthn-iuvpaa-support: unsupported
return_to: 
required_field_b52e: 
timestamp: 1597821100566
timestamp_secret: 0f7e176eb53725ab7b2296b326058875b5f9de3ec6af83c5d9a908eae5393dec

③ Verification: open "Your Profile" and use its address as url3 (https://github.com/Amen-bang) to check the login

    # 2. url2: log in
    url2 = 'https://github.com/session'
    # Build the form data
    data = {
        "commit": "Sign in",
        "authenticity_token": authenticity_token,
        "ga_id": "1338935970.1597742658",
        "login": "your username",
        "password": "your password",
        "webauthn-support": "supported",
        "webauthn-iuvpaa-support": "unsupported",
        "return_to": "",
        "required_field_b52e": "",
        "timestamp": "1597821100566",
        "timestamp_secret": "0f7e176eb53725ab7b2296b326058875b5f9de3ec6af83c5d9a908eae5393dec",
    }
    # Send the login request; the session stores the cookies it sets
    session.post(url2, data=data)

    # 3. url3: verify
    url3 = 'https://github.com/Amen-bang'
    response = session.get(url3)
    # Save the page so the login can be checked by opening the file
    with open('github.html', 'wb') as f:
        f.write(response.content)
