Python Crawler Toolkit: Reusable requests Helpers

Latest recommended article published 2023-03-19 23:35:32

顽强拼搏的阿k

Category: Crawlers. Tags: python

Article link: https://blog.csdn.net/weixin_38640052/article/details/116124192

SuperSpider

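This is a long article (over ten thousand characters). Use the table of contents to jump straight to the section you need; it makes development more efficient.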

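Steps for scraping with requests

【1】First determine whether the site loads its data dynamically
【2】Find the pattern in the URLs
【3】Write the regular expression or XPath expression
【4】Define the program skeleton, then fill it in and test the code

Details: check the page's charset, and check whether the request needs certificate verification turned off (verify=False)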

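Approach to scraping multi-level pages

【1】Overall approach
    1.1> Scrape the first-level page, extract the required data and links, then follow the links
    1.2> Scrape the second-level page, extract the required data and links, then follow the links
    1.3> ... ...

【2】Implementation approach
    2.1> Avoid duplicated code: put requesting and parsing into their own functions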
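
The two-level approach above can be sketched as one request function and one parse function that every level reuses. This is a minimal sketch under stated assumptions: the names fetch, parse and crawl, both regular expressions, and the HTML structure they match are hypothetical and would need to be adapted to the real site.

```python
import re
from urllib.parse import urljoin


def fetch(url, headers=None, timeout=5):
    """Request one page and return its HTML as text."""
    import requests  # imported here so the parsing helpers also work offline

    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text


def parse(pattern, html):
    """Apply one regular expression to a page and return all matches."""
    return re.findall(pattern, html, re.S)


# Hypothetical first-level pattern: captures each item's link and title.
LIST_PATTERN = r'<a class="item" href="(.*?)">(.*?)</a>'
# Hypothetical second-level pattern: captures the detail text.
DETAIL_PATTERN = r'<div class="detail">(.*?)</div>'


def crawl(start_url, headers=None):
    """Level 1: collect links. Level 2: follow each link for its details."""
    results = []
    for link, title in parse(LIST_PATTERN, fetch(start_url, headers)):
        detail_url = urljoin(start_url, link)  # resolve relative links
        details = parse(DETAIL_PATTERN, fetch(detail_url, headers))
        results.append({'title': title, 'url': detail_url, 'details': details})
    return results
```

Adding a third level only needs another pattern and one more loop inside crawl; fetch and parse stay unchanged.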

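Handling User-Agent anti-scraping checks

【1】Anti-scraping based on User-Agent
    1.1) Send a request header with every request: headers = {'User-Agent': 'Mozilla/5.0 xxxxxx'}
    1.2) Switch the User-Agent randomly across multiple requests
        a) Keep a large list of User-Agent strings in a .py file, import it, and pick one with random.choice() each time
        b) Use the fake_useragent module to generate a random User-Agent for each request
           from fake_useragent import UserAgent
           agent = UserAgent().random
    Detail: the package to install in PyCharm is named fake-useragent

【2】The response contains special characters
    Pass the 'ignore' error handler when decoding, together with the page's charset (utf-8 here)
    html = requests.get(url=url, headers=headers).content.decode('utf-8', 'ignore')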
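
Both points can be sketched with the standard library alone. The USER_AGENTS list and the helper names random_headers and decode_html are hypothetical; in practice the list would live in its own .py file and be much longer.

```python
import random

# Hypothetical pool of User-Agent strings.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0',
]


def random_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}


def decode_html(raw, encoding='utf-8'):
    """Decode response bytes, silently dropping bytes that cannot be decoded."""
    return raw.decode(encoding, 'ignore')


print(random_headers())
print(decode_html(b'abc\xffdef'))  # the invalid byte \xff is dropped
```

Calling random_headers() once per request gives each request its own User-Agent.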

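Cookie anti-scraping

Using the cookies parameter

Form of the cookies parameter: a dictionary

cookies = {"cookie name": "cookie value"}
- The dictionary corresponds to the Cookie string in the request headers, where the name=value pairs are separated by a semicolon and a space
- The left side of the equals sign is the cookie's name and becomes the key in the cookies dictionary
- The right side of the equals sign becomes the value in the cookies dictionary
How to use the cookies parameter

response = requests.get(url, cookies=cookies)
Converting a cookie string into the dictionary that the cookies parameter needs:

cookies_dict = {cookie.split('=', 1)[0]: cookie.split('=', 1)[1] for cookie in cookies_str.split('; ')}
Note: cookies usually expire; once a cookie has expired it must be obtained again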
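
The conversion can also be written as a small function, which is easier to read and tolerates stray spaces. This is a sketch only: the function name cookie_str_to_dict and the sample cookie string are made up for illustration. It splits each pair on the first '=' only, so values that themselves contain '=' survive intact.

```python
def cookie_str_to_dict(cookies_str):
    """Convert a 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for pair in cookies_str.split(';'):
        pair = pair.strip()
        if not pair:
            continue  # skip empty fragments such as a trailing ';'
        # partition splits on the first '=' only
        name, _, value = pair.partition('=')
        cookies[name] = value
    return cookies


cookies_str = 'sessionid=abc123; token=a=b=c; theme=dark'
print(cookie_str_to_dict(cookies_str))
```

The resulting dictionary can be passed directly as cookies=... to requests.get().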

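Converting a CookieJar object into a cookies dictionary

A response object returned by requests has a cookies attribute. Its value is a CookieJar-type object that holds the cookies the server set locally. How can it be converted into a cookies dictionary?

Conversion method

cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)
Here response.cookies is the CookieJar-type object
and requests.utils.dict_from_cookiejar returns the cookies dictionary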
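
The conversion can be tried without any network access by filling a jar by hand. In this sketch the cookie names, values and the domain example.com are made up; with a real response, response.cookies would take the place of jar.

```python
import requests
from requests.cookies import RequestsCookieJar

# Stand-in for response.cookies: a jar filled by hand.
jar = RequestsCookieJar()
jar.set('sessionid', 'abc123', domain='example.com', path='/')
jar.set('theme', 'dark', domain='example.com', path='/')

# CookieJar -> plain dict
cookies_dict = requests.utils.dict_from_cookiejar(jar)
print(cookies_dict)

# The reverse direction also exists: dict -> CookieJar
restored = requests.utils.cookiejar_from_dict(cookies_dict)
print(restored.get('theme'))
```

Note that the dictionary keeps only names and values; domain, path and expiry information is lost in the conversion.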

Summary of requests module parameters

【1】Method 1: requests.get()
【2】Parameters
   2.1) url
   2.2) headers
   2.3) timeout
   2.4) proxies

【3】Method 2: requests.post()
【4】Parameters
    data

requests.get()

Approach

【1】url
【2】proxies -> {}
     proxies = {
        'http': 'http://1.1.1.1:8888',
        'https': 'https://1.1.1.1:8888'
     }
【3】timeout
【4】headers
【5】cookies

requests.post()

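Use cases

【1】Use case: sites that take POST requests

【2】Parameter: data={}
   2.1) Form data: a dictionary
   2.2) res = requests.post(url=url, data=data, headers=headers)

【3】Characteristic of POST requests: the data is submitted as a form

    data: a dictionary holding the form data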
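Converting headers and form data with a regex in PyCharm

【1】Open the replace dialog in PyCharm: Ctrl + R, then enable Regex
【2】Convert the copied headers and form data
    Search for:   (.*): (.*)
    Replace with: "$1": "$2",
【3】Click Replace All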
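
The same conversion can be done in Python when the copied text is already in a string. A minimal sketch: the function name headers_text_to_dict and the sample header block are made up, and the pattern mirrors the PyCharm search expression applied line by line.

```python
import re


def headers_text_to_dict(raw):
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    # re.M makes ^ and $ match at every line; the name ends at the first ': '
    for name, value in re.findall(r'^(.*?): (.*)$', raw.strip(), re.M):
        headers[name.strip()] = value.strip()
    return headers


raw = """
Accept: application/json, text/javascript
Referer: https://example.com/page?a=1
User-Agent: Mozilla/5.0
"""
print(headers_text_to_dict(raw))
```

Because the name stops at the first colon followed by a space, the colon inside a URL value such as the Referer is left alone.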

Classic demo: Youdao Translate

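requests.session()

The Session class in the requests module automatically handles the cookies produced while sending requests and receiving responses, which keeps state across requests.

Purpose and use cases

Purpose of requests.session
- Handles cookies automatically: the next request carries the cookies from the previous one
Use cases for requests.session
- Handling the cookies produced across a series of consecutive requests

Usage

After a session instance requests a site, the cookies set by the server are stored in the session. The next request made to that server through the same session carries those cookies.

session = requests.session()  # create the session object
response = session.get(url, headers=headers)
response = session.post(url, data=data, headers=headers)

The get and post methods of a session object accept the same parameters as requests.get() and requests.post().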
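
The cookie carry-over can be observed without sending anything, by preparing a request and inspecting its headers. This is a sketch under assumptions: the cookie name and value, the domain example.com and the URL are made up, and the cookie is set by hand to stand in for one a server would have set on an earlier response.

```python
import requests

session = requests.session()
# Simulate a cookie that the server set on an earlier response.
session.cookies.set('sessionid', 'abc123', domain='example.com', path='/')

# Build the next request without sending it, to see what would go out.
request = requests.Request(
    'GET',
    'http://example.com/profile',
    headers={'User-Agent': 'Mozilla/5.0'},
)
prepared = session.prepare_request(request)

# The session merged its stored cookie into the outgoing headers.
print(prepared.headers.get('Cookie'))
```

With real traffic no manual step is needed: session.get() and session.post() store and resend cookies on their own.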

response

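Difference between response.text and response.content:

response.text
- Type: str
- Decoding: requests makes an educated guess at the encoding from the HTTP headers and decodes the body with it
response.content
- Type: bytes
- Decoding: none; the raw bytes are returned as received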
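
The difference matters when the guessed encoding is wrong. The sketch below simulates response.content with hand-made bytes (the sample text and the choice of GBK are assumptions for illustration) and shows that decoding with the page's real charset recovers the text, while a wrong charset does not.

```python
# Bytes as they might arrive in response.content for a GBK-encoded page.
content = '中文页面'.encode('gbk')
print(type(content))

# Decoding with the page's actual charset recovers the text.
right = content.decode('gbk')
print(right)

# Decoding with the wrong charset loses data, even with errors ignored.
wrong = content.decode('utf-8', 'ignore')
print(wrong == right)
```

With a real response, setting response.encoding = 'gbk' before reading response.text has the same effect as decoding response.content by hand.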

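Scraping dynamically loaded data (Ajax)

Characteristics

【1】Right-click -> View Page Source shows none of the actual data
【2】Data loads when you scroll the mouse wheel or perform some other action, or only part of the page refreshes

Scraping

【1】Press F12 to open the console, then perform the page action to capture the network requests
【2】Find the URL that returns the JSON data
   2.1) XHR in the console: asynchronously loaded requests
   2.2) XHR -> Query String Parameters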
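
Once the query string parameters are known, the JSON URL for each page can be rebuilt in code. A minimal sketch: the endpoint https://example.com/api/list and the parameter names type, start and limit are hypothetical stand-ins for whatever the Network tab actually shows.

```python
from urllib.parse import urlencode

# Hypothetical JSON endpoint found under XHR in the Network tab.
BASE_URL = 'https://example.com/api/list'


def build_page_url(page, page_size=20):
    """Build the Ajax URL for one page of results (pages start at 1)."""
    params = {
        'type': 'movie',
        'start': (page - 1) * page_size,  # offset of the first item
        'limit': page_size,               # number of items per request
    }
    return BASE_URL + '?' + urlencode(params)


for page in range(1, 4):
    print(build_page_url(page))
```

requests can do the encoding itself: passing the same dictionary as params=... to requests.get(BASE_URL, ...) produces an equivalent URL.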

Classic demo: Douban Movies

The json parsing module

json.loads(json)

【1】Purpose: convert a JSON-formatted string into Python data types

【2】Example: html = json.loads(res.text)
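
A small self-contained run shows what comes back. The JSON text below is a made-up stand-in for res.text from an Ajax response; the field names are assumptions.

```python
import json

# Stand-in for res.text: a JSON document as a plain string.
text = ('{"total": 2, "subjects": ['
        '{"title": "A", "rate": "9.1"}, '
        '{"title": "B", "rate": "8.7"}]}')

data = json.loads(text)  # JSON object -> dict, JSON array -> list
titles = [item['title'] for item in data['subjects']]

print(type(data))
print(titles)
```

After the conversion the data is handled with ordinary dict and list operations; no regular expression or XPath is needed.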

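json.dump(python,f,ensure_ascii=False)

【1】Purpose
   Serialize Python data as JSON and write it to a file; typically used to save scraped data as a JSON file

【2】Parameters
   2.1) First argument: the Python data (dict, list, etc.)
   2.2) Second argument: a file object
   2.3) Third argument: ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX escapes

【3】Example code
    # Example 1
    import json

    item = {'name': 'QQ', 'app_id': 1}
    # 'w' keeps the file a single valid JSON document; appending a second
    # document with 'a' would make the file unparseable as a whole
    with open('小米.json', 'w', encoding='utf-8') as f:
        json.dump(item, f, ensure_ascii=False)

    # Example 2
    import json

    item_list = []
    for i in range(3):
        item = {'name': 'QQ', 'id': i}
        item_list.append(item)

    # collect all items first, then write the whole list once
    with open('xiaomi.json', 'w', encoding='utf-8') as f:
        json.dump(item_list, f, ensure_ascii=False)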

jsonpath

jsonpath usage example

book_dict = { 
  "store": {
    "book": [ 
      { "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      { "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      },
      { "category": "fiction",
        "author": "Herman Melville",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      { "category": "fiction",
        "author": "J. R. R. Tolkien",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 19.95
    }
  }
}

from jsonpath import jsonpath

print(jsonpath(book_dict, '$..author'))  # returns a list of matches, or False if nothing matches

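Summary of the json module

# Most common in crawlers
【1】Fetching data - json.loads(html)
    Converts the response content from JSON to Python
【2】Saving data - json.dump(item_list,f,ensure_ascii=False)
    Saves the scraped data to a local JSON file

# Common ways to store scraped data
【1】txt file
【2】csv file
【3】json file
【4】MySQL database
【5】MongoDB database
【6】Redis database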

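Capturing requests in the browser console

How to open it and common options

【1】Open the browser, press F12 to open the console, and go to the Network tab

【2】Common console options
   2.1) Network: captures network requests
     a> All: every network request
     b> XHR: asynchronously loaded requests
     c> JS: all JavaScript files
   2.2) Sources: pretty-prints JavaScript and lets you set breakpoints to debug it, which helps when analyzing parameters the crawler has to reproduce
   2.3) Console: interactive mode for testing JavaScript code

【3】After capturing a specific request
   3.1) Click the request in the list on the left to open its details on the right
   3.2) On the right:
     a> Headers: the full request information
        General, Response Headers, Request Headers, Query String, Form Data
     b> Preview: a preview of the response content
     c> Response: the response content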

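Proxy settings

Definition and categories

By degree of anonymity, proxy IPs fall into three categories:
    1. Transparent proxy: although it nominally "hides" your IP address, the target server can still find out who you are.
    2. Anonymous proxy: others can tell that you are using a proxy, but not who you are.
    3. Elite proxy (high anonymity proxy): others cannot tell that you are using a proxy at all, so it is the best choice.

By the protocol used for the proxied request, proxies can be divided into:
    1. HTTP proxy: the target URL uses the http protocol
    2. HTTPS proxy: the target URL uses the https protocol
    3. SOCKS tunnel proxy (for example a SOCKS5 proxy):
      A SOCKS proxy simply relays packets and does not care about the application protocol (FTP, HTTP, HTTPS, and so on).
      A SOCKS proxy takes less time than an HTTP or HTTPS proxy.
      A SOCKS proxy can forward both http and https requests.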

Approach for ordinary proxies

【1】Sites that provide proxy IPs
   西刺代理, 快代理, 全网代理, 代理精灵, 阿布云, 芝麻代理, and others

【2】Parameter format
   proxies = {'protocol': 'protocol://IP:port'}
   proxies = {
       'http': 'http://IP:port',
       'https': 'https://IP:port',
   }

Ordinary proxy

# Use a free ordinary proxy IP to request the test site: http://httpbin.org/get
import requests

url = 'http://httpbin.org/get'
headers = {'User-Agent': 'Mozilla/5.0'}
# Define the proxy; look up a free proxy IP on a proxy listing site
proxies = {
    'http': 'http://112.85.164.220:9999',
    'https': 'https://112.85.164.220:9999'
}
html = requests.get(url, proxies=proxies, headers=headers, timeout=5).text
print(html)

Private proxy + dedicated proxy

【1】Syntax
   proxies = {'protocol': 'protocol://username:password@IP:port'}

【2】Example
   proxies = {
       'http': 'http://username:password@IP:port',
       'https': 'https://username:password@IP:port',
   }
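
Both proxies formats, with and without credentials, can be produced by one helper. This is a sketch only: the function name build_proxies and the sample values are made up. Percent-encoding the credentials is the usual URL convention for characters such as '@' or ':' that would otherwise be read as separators.

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None):
    """Build a requests-style proxies dict, optionally with credentials."""
    auth = ''
    if username is not None and password is not None:
        # Encode reserved characters so the proxy URL stays unambiguous.
        auth = '{}:{}@'.format(quote(username, safe=''),
                               quote(password, safe=''))
    return {
        'http': 'http://{}{}:{}'.format(auth, host, port),
        'https': 'https://{}{}:{}'.format(auth, host, port),
    }


print(build_proxies('1.1.1.1', 8888))
print(build_proxies('1.1.1.1', 8888, 'user', 'p@ss:word'))
```

The returned dictionary is passed as proxies=... to requests.get() or requests.post().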

Private proxy + dedicated proxy: example code

import requests
url = 'http://httpbin.org/get'
proxies = {
    'http': 'http://309435365:szayclhp@106.75.71.140:16816',
    'https': 'https://309435365:szayclhp@106.75.71.140:16816',
}
headers = {
    'User-Agent': 'Mozilla/5.0',
}

html = requests.get(url,proxies=proxies,headers=headers,timeout=5).text
print(html)

Building your own proxy IP pool: open proxies | private proxies

"""
Paid proxy service:
    Build a proxy IP pool from its open proxies
Approach:
    1. Fetch the open proxies
    2. Test each proxy IP in turn and save the working ones to a file
"""
import requests

class ProxyPool:
    def __init__(self):
        self.url = 'API link of the proxy site'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'}
        # Open the file that stores the working proxy IPs
        self.f = open('proxy.txt', 'w')

    def get_html(self):
        html = requests.get(url=self.url, headers=self.headers).text
        proxy_list = html.split('\r\n')
        for proxy in proxy_list:
            # Skip empty entries, then test each proxy IP in turn
            if proxy and self.check_proxy(proxy):
                self.f.write(proxy + '\n')

    def check_proxy(self, proxy):
        """Test one proxy IP; return True if it works, otherwise False."""
        test_url = 'http://httpbin.org/get'
        proxies = {
            'http': 'http://{}'.format(proxy),
            'https': 'https://{}'.format(proxy)
        }
        try:
            res = requests.get(url=test_url, proxies=proxies, headers=self.headers, timeout=2)
            if res.status_code == 200:
                print(proxy, '\033[31musable\033[0m')
                return True
            else:
                print(proxy, 'invalid')
                return False
        except Exception:
            # Any failure (timeout, refused connection, malformed address) counts as invalid
            print(proxy, 'invalid')
            return False

    def run(self):
        self.get_html()
        # Close the file
        self.f.close()

if __name__ == '__main__':
    spider = ProxyPool()
    spider.run()

Lagou (拉勾网) with the 阿布云 proxy

import json
import re
import time
import requests
import multiprocessing
from job_data_analysis.lagou_spider.handle_insert_data import lagou_mysql


class HandleLaGou(object):
    def __init__(self):
        # Use a session to keep the cookies
        self.lagou_session = requests.session()
        self.header = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
        }
        self.city_list = ""

    # Get the list of all cities nationwide
    def handle_city(self):
        city_search = re.compile(r'www\.lagou\.com\/.*\/">(.*?)</a>')
        city_url = "https://www.lagou.com/jobs/allCity.html"
        city_result = self.<
