Crawlers and Anti-Crawling

冰镇毛衣

Last modified: 2023-02-18 16:34:25

Category: Crawlers. Tags: crawler, python, data mining

First published: 2023-02-18 16:34:00

Article link: https://blog.csdn.net/sumatray/article/details/129098925

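1. What is anti-crawling

Anti-crawling: measures that restrict crawler programs from accessing server resources and obtaining data.

Common techniques: request limiting, refusing to respond, client identity verification, text obfuscation, dynamic rendering, and so on.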

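2. Types of anti-crawling

Identity-based anti-crawling:

Validating request headers, validating request parameters, using CAPTCHAs, etc.

Behavior-based anti-crawling:

Restricting IPs, using honeypots to capture IPs, serving fake data, etc.

Data-encryption anti-crawling:

Custom fonts, data rendered as images, special encodings, etc.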

Countering identity-based anti-crawling

Header-based anti-crawling, via the User-Agent field

Via the Cookie field

Via the Referer field
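
The three header checks above can be answered by sending browser-like values for each field. The following is a minimal sketch using only the standard library; the User-Agent strings, the `example.com` URLs, the cookie value, and the helper name `build_request` are all illustrative assumptions, not values from any real site. It only builds the request and does not send it.

```python
import random
import urllib.request

# Hypothetical sample values; replace with strings copied from a real browser session.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.3 Safari/605.1.15",
]


def build_request(url, referer, cookies=None):
    """Build (but do not send) a request carrying browser-like headers."""
    headers = {
        "User-Agent": random.choice(USER_AGENTS),  # rotate between known browsers
        "Referer": referer,                        # the page a browser would come from
    }
    if cookies:
        # Cookie header format: "name1=value1; name2=value2"
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return urllib.request.Request(url, headers=headers)


req = build_request(
    "https://example.com/list?page=2",
    referer="https://example.com/list?page=1",
    cookies={"sessionid": "abc123"},
)
print(req.header_items())
```

In practice the cookie values would come from an earlier login or landing-page response rather than being hard-coded.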

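Anti-crawling based on request parameters

Analyze the captured traffic carefully and work out how the requests relate to one another (for example, which earlier response supplies a value that a later request must send back).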

CAPTCHA-based anti-crawling:

pytesseract, or a commercial CAPTCHA-solving service

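2.1 Handling and recognizing CAPTCHAs

Image recognition engine:

OCR (optical character recognition) software takes image files, produced by scanning text documents with a scanner or digital camera, analyzes them, and automatically extracts the text and layout information.

Tesseract: open source and free

Download: Index of /tesseract

Calling the image recognition engine

Install Pillow and pytesseract

pip install pillow  # a Python imaging library that pytesseract depends on

pip install pytesseract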

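from PIL import Image
import pytesseract

# Point pytesseract at the Tesseract executable (adjust the path to your install)
pytesseract.pytesseract.tesseract_cmd = r'D:\Program Files (x86)\tesseract\tesseract.exe'

# Open the image
img = Image.open('img/01.jpeg')

# Preview the image if needed
# img.show()

# Run the engine on the opened image
text = pytesseract.image_to_string(img)
print(text)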
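
Raw OCR output often carries stray whitespace or punctuation, so it usually helps to normalize it before submitting it as a CAPTCHA answer. The sketch below assumes the CAPTCHA consists only of ASCII letters and digits and has a known length; both assumptions, and the helper name `clean_ocr_text`, are illustrative and not part of pytesseract.

```python
import re


def clean_ocr_text(raw, expected_length=None):
    """Normalize raw OCR output into a captcha candidate, or None if implausible."""
    # Keep only ASCII letters and digits; this drops spaces, newlines and punctuation.
    text = re.sub(r"[^0-9A-Za-z]", "", raw)
    if expected_length is not None and len(text) != expected_length:
        return None  # signal the caller to fetch a fresh captcha and retry
    return text


print(clean_ocr_text(" 7h 4K\n\x0c", expected_length=4))  # prints 7h4K
```

Returning `None` on a length mismatch lets the crawler request a new image instead of submitting an answer that is certain to fail.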

Recognizing click-based CAPTCHAs

Test site: http://121.41.201.214:8083/#/useOnline/pointFixed

Complex images can be recognized with Chaojiying (超级鹰)

# http://121.41.201.214:8083/#/useOnline/pointFixed
# 1. Open the site with selenium
# 2. Take a full-page screenshot
# 3. Locate the captcha element and its region
# 4. Crop the captcha image
# 5. Send the captcha image to Chaojiying
# 6. Click at the coordinates Chaojiying returns
import time
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait  # explicit waits
from selenium.webdriver.support import expected_conditions as EC  # wait conditions
from selenium.webdriver import ActionChains  # mouse actions
from 爬虫开发.chaojiying.chaojiying import Chaojiying_Client


class Click(object):
    def __init__(self):
        self.driver = webdriver.Chrome()
        # Maximize the browser window
        self.driver.maximize_window()

    def handle_captcha(self):
        """Open the page, recognize the captcha, and click the returned points."""
        try:
            self.driver.get('http://121.41.201.214:8083/#/useOnline/pointFixed')
            # Display scaling is often set to 125%; zoom the page to 0.8 to compensate
            self.driver.execute_script('document.body.style.zoom="0.8"')
            # until() raises TimeoutException if the captcha panel never appears
            WebDriverWait(self.driver, 5, 0.5).until(
                EC.presence_of_element_located((By.CLASS_NAME, "verify-img-panel")))

            captcha_element = self.save()
            time.sleep(2)
            nodes = self.handle_chaojiying()
            if nodes:
                print('Captcha recognized, clicking the returned points')
                # nodes has the form "x1,y1|x2,y2|..."
                for node in nodes.split('|'):
                    x, y = (int(v) for v in node.split(','))
                    # Offsets are treated as measured from the element's top-left corner;
                    # newer Selenium 4 releases measure from the element's center,
                    # so adjust the offsets if the clicks land off-target.
                    ActionChains(self.driver).move_to_element_with_offset(
                        captcha_element, x, y).click().perform()
                    time.sleep(1)
        finally:
            # Always close the browser, even if the wait times out
            self.driver.quit()

    def handle_chaojiying(self):
        """Send the cropped captcha to Chaojiying and return the coordinate string."""
        # User Center >> Software ID: generate one and use it in place of 96001
        chaojiying = Chaojiying_Client('your_chaojiying_account', '123456', 'id')
        with open('img/captcha.png', mode='rb') as f:
            img = f.read()
        captcha_data = chaojiying.PostPic(img, 9103).get('pic_str')
        print(captcha_data)
        return captcha_data

    def save(self):
        """Take a screenshot, crop the captcha region, and save it."""
        # Full-page screenshot
        self.driver.save_screenshot('img/browser.png')
        # Locate the captcha image element
        captcha_element = self.driver.find_element(by=By.XPATH, value='//div/img')
        # Top-left corner of the captcha
        location = captcha_element.location
        # Width and height of the captcha
        size = captcha_element.size
        # Crop box (left, top, right, bottom), including 50 px below the image
        box = (location['x'], location['y'], location['x'] + size['width'],
               location['y'] + size['height'] + 50)
        img = Image.open('img/browser.png')
        captcha = img.crop(box)
        captcha.save('img/captcha.png')
        return captcha_element


if __name__ == '__main__':
    c = Click()
    c.handle_captcha()
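
The script above splits the returned coordinate string inline. As a small sketch, the parsing can be pulled into a helper that validates the input before any click is attempted. The `"x1,y1|x2,y2"` format is taken from how the script splits the string; the function name `parse_points` and the sample values are hypothetical.

```python
def parse_points(pic_str):
    """Parse "x1,y1|x2,y2" into [(x1, y1), (x2, y2)]; raise ValueError on bad input."""
    points = []
    for chunk in pic_str.split("|"):
        parts = chunk.split(",")
        if len(parts) != 2:
            raise ValueError(f"malformed point: {chunk!r}")
        # int() itself raises ValueError on non-numeric text
        points.append((int(parts[0]), int(parts[1])))
    return points


print(parse_points("120,85|230,140"))  # prints [(120, 85), (230, 140)]
```

Failing early on a malformed response avoids clicking at meaningless positions and burning a CAPTCHA attempt.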

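3. Behavior-based anti-crawling and countermeasures

Anti-crawling based on the request frequency or request count of an IP/account within a unit of time

Countermeasure: use IP proxies and multiple accounts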
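
One way to spread requests over several proxies is a simple round-robin pool that drops proxies once they stop working. This is a minimal sketch; the class name `ProxyPool` and the proxy addresses are hypothetical placeholders.

```python
class ProxyPool:
    """Hand out proxies in round-robin order, skipping ones marked as failed."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._index = 0

    def next(self):
        """Return the next proxy, wrapping around at the end of the list."""
        if not self._proxies:
            raise RuntimeError("no working proxies left")
        proxy = self._proxies[self._index % len(self._proxies)]
        self._index += 1
        return proxy

    def mark_failed(self, proxy):
        """Remove a proxy that was blocked or timed out."""
        if proxy in self._proxies:
            self._proxies.remove(proxy)


# Hypothetical proxy addresses for illustration only.
pool = ProxyPool(["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"])
first = pool.next()
pool.mark_failed(first)       # pretend the first proxy got blocked
print(pool.next(), pool.next())
```

With the `requests` library, the chosen proxy would typically be passed as `proxies={"http": proxy, "https": proxy}` on each call. The same rotation idea applies to a list of accounts.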

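Anti-crawling based on the interval between requests from the same IP/account

Countermeasure: use IP proxies and sleep for a random interval between requests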
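
Random sleeping can be sketched as a base pause plus random jitter, so that requests do not arrive at a fixed interval. The function names, the default timings, and the injected `fetch` and `sleep` callables are assumptions chosen so the example runs without network access.

```python
import random
import time


def polite_delay(base=2.0, jitter=1.5):
    """Return a randomized pause length in seconds: base plus up to `jitter` extra."""
    return base + random.uniform(0, jitter)


def crawl(urls, fetch, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(polite_delay())  # no pause is needed before the first request
        results.append(fetch(url))
    return results


# Demo with a fake fetch and a recorder in place of time.sleep, so it returns instantly.
pauses = []
pages = crawl(["/a", "/b", "/c"], fetch=lambda u: u.upper(), sleep=pauses.append)
print(pages, len(pauses))
```

In a real crawler `fetch` would perform the HTTP request and the default `time.sleep` would be left in place.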

Anti-crawling through JavaScript redirects

Countermeasure: capture the traffic several times and analyze the pattern
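
If the captured traffic shows that the redirect target is written literally into the page's script, it can be extracted without running JavaScript. The sketch below assumes that simple case (a string assigned to `location` or passed to `location.replace`/`location.assign`); targets computed at runtime would still need a browser or manual analysis. The function name and sample HTML are hypothetical.

```python
import re
from urllib.parse import urljoin

# Matches: location = "...", location.href = "...", location.replace("..."), location.assign("...")
_REDIRECT_RE = re.compile(
    r"""(?:window\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']"""
    r"""|(?:window\.)?location\.(?:replace|assign)\(\s*["']([^"']+)["']\s*\)"""
)


def find_js_redirect(html, base_url):
    """Return the absolute redirect target found in inline JS, or None."""
    match = _REDIRECT_RE.search(html)
    if not match:
        return None
    target = match.group(1) or match.group(2)
    return urljoin(base_url, target)  # resolve relative targets against the page URL


page = '<script>window.location.href = "/real?token=9f2";</script>'
print(find_js_redirect(page, "https://example.com/entry"))
```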

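Capturing IPs with honeypots (traps)

Countermeasure: after finishing the crawler, run test crawls and inspect the responses carefully to find the traps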
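
A common form of trap is a link that a human visitor never sees but a crawler follows. As a sketch of how such links might be flagged while inspecting a response, the parser below separates links hidden through inline attributes from visible ones. It assumes the trap is hidden with the `hidden` attribute or an inline `display:none` / `visibility:hidden` style; links hidden through external CSS classes are not detected. The class name and sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class VisibleLinkParser(HTMLParser):
    """Collect hrefs from <a> tags, separating ones hidden via inline attributes."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # Normalize the inline style so "display: none" and "display:none" both match
        style = (attr.get("style") or "").replace(" ", "").lower()
        hidden = (
            "hidden" in attr
            or "display:none" in style
            or "visibility:hidden" in style
        )
        (self.suspicious if hidden else self.visible).append(href)


html = (
    '<a href="/item/1">Item 1</a>'
    '<a href="/trap" style="display: none">x</a>'
    '<a href="/trap2" hidden>y</a>'
)
parser = VisibleLinkParser()
parser.feed(html)
print(parser.visible, parser.suspicious)
```

Only the `visible` list would be queued for crawling; the `suspicious` list is worth reviewing by hand during test runs.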

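Anti-crawling with fake data

Countermeasure: run the crawler over a long period and compare the data in the database with the data on the actual pages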
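
The comparison can be done as a periodic spot check on a small sample of records. The sketch below assumes both the stored data and a manually verified sample of live page data are available as dictionaries keyed by record ID; the function name, field names, and values are made up for illustration.

```python
def find_mismatches(stored, live, fields):
    """Return the IDs whose stored values differ from the live page in any given field."""
    mismatched = []
    for key, live_row in live.items():
        stored_row = stored.get(key)
        if stored_row is None:
            continue  # not crawled yet, nothing to compare
        if any(stored_row.get(f) != live_row.get(f) for f in fields):
            mismatched.append(key)
    return mismatched


# Hypothetical records: what the crawler saved vs. what the page shows in a real browser.
stored = {"sku1": {"price": "19.90"}, "sku2": {"price": "5.00"}}
live = {"sku1": {"price": "19.90"}, "sku2": {"price": "7.50"}}
print(find_mismatches(stored, live, ["price"]))  # prints ['sku2']
```

A rising share of mismatches suggests the site has started serving altered data to the crawler.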

Clogging the crawler's task queue with junk URLs

Countermeasure: analyze the pattern of the junk URLs and filter them out
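
Once a pattern has been identified, the filter can sit in front of the task queue. The rules in this sketch (same host only, a maximum path depth, no heavily repeated path segments, no duplicates) are assumed examples of patterns such an analysis might turn up, not rules from the article; the class name and thresholds are hypothetical.

```python
from urllib.parse import urlsplit


class UrlFilter:
    """Reject URLs that look like generated junk before they reach the task queue."""

    def __init__(self, allowed_host, max_depth=6, max_repeat=2):
        self.allowed_host = allowed_host
        self.max_depth = max_depth      # maximum number of path segments
        self.max_repeat = max_repeat    # maximum times one segment may repeat
        self._seen = set()

    def accept(self, url):
        parts = urlsplit(url)
        if parts.netloc != self.allowed_host:
            return False
        segments = [s for s in parts.path.split("/") if s]
        if len(segments) > self.max_depth:
            return False
        # Generated trap URLs often repeat segments: /a/b/a/b/a/b ...
        if any(segments.count(s) > self.max_repeat for s in set(segments)):
            return False
        key = (parts.path, parts.query)
        if key in self._seen:
            return False  # already queued once
        self._seen.add(key)
        return True


f = UrlFilter("example.com")
print(f.accept("https://example.com/list/1"))        # prints True
print(f.accept("https://example.com/a/b/a/b/a/b"))   # prints False
```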

Anti-crawling by blocking network I/O

Countermeasure: review the links being fetched and time each request
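
Timing can be added as a thin wrapper around whatever function performs the download. This sketch only measures and flags slow requests; it does not abort them, so a timeout should also be set on the HTTP call itself (for example the `timeout` argument of `requests.get`). The function name, the 5-second threshold, and the injected `clock` are assumptions made for illustration.

```python
import time


def timed_fetch(fetch, url, slow_after=5.0, clock=time.monotonic):
    """Run fetch(url) and return (body, elapsed_seconds, is_slow)."""
    start = clock()
    body = fetch(url)
    elapsed = clock() - start
    return body, elapsed, elapsed > slow_after


# Demo with a fake fetch and a fake clock that reports 7.5 seconds elapsed.
ticks = iter([0.0, 7.5])
result = timed_fetch(lambda u: "ok", "https://example.com/big-file", clock=lambda: next(ticks))
print(result)  # prints ('ok', 7.5, True)
```

URLs that are repeatedly flagged as slow are candidates for the same kind of filtering used against junk URLs.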