Automated Data Collection: Solving Slider CAPTCHAs

Latest recommended article published on 2024-10-11 08:31:43

北愚

Category: Web Scraping. Tags: automation

Article link: https://blog.csdn.net/yj2094632273/article/details/142004937

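For beginners and for small-scale data collection, simulating a browser with an automation library is usually enough. Once the request volume grows, though, the site will still pop up a verification challenge, so being able to handle CAPTCHAs is well worth having.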

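1. Tools

The main Python automation tools are Selenium and DrissionPage.

(1) Selenium

Selenium's core function is automated testing across multiple browsers, which reduces the amount of manual testing that testers have to do.
Selenium's history dates back to 2004, when Jason Huggins, an employee at ThoughtWorks, wrote a JavaScript library to cut down on manual testing work. That library later became Selenium Core, which formed the foundation for Selenium Remote Control (RC) and Selenium IDE. In the Selenium 1.x era, automated testing was done mainly through Selenium RC. As the technology matured, Selenium grew into a powerful automated testing tool that is widely used by many large companies, including Google, Baidu, and Tencent.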

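(2) DrissionPage

DrissionPage is a Python-based web automation tool. It can drive a browser, send and receive HTTP requests directly, or combine the two, which gives you the convenience of browser automation together with the efficiency of requests. It ships with many convenience features, and its syntax is concise, needs little code, and is beginner-friendly.

Since this article is aimed mostly at beginners, we will use DrissionPage.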

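2. Approach

The puzzle piece slides horizontally, so we need two things: the position of the gap and the starting position of the piece. The difference between the two is the distance to drag the slider button.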
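
The distance calculation described above can be sketched as a small function. This is only an illustration: the name drag_distance and the sample pixel values are made up for the example, and both positions are assumed to be measured from the same left edge in the same pixel units.

```python
def drag_distance(gap_left: float, piece_left: float) -> float:
    """Horizontal distance the puzzle piece must travel.

    Both arguments are x coordinates measured from the same origin,
    in the same pixel units (an assumption of this sketch).
    """
    if gap_left < piece_left:
        # In a left-to-right slider the gap sits to the right of the piece.
        raise ValueError("gap is expected to lie to the right of the piece")
    return gap_left - piece_left


# Hypothetical values: the piece starts 10 px from the left edge,
# and the gap is detected at 182 px.
print(drag_distance(182, 10))  # 172
```

Keeping the subtraction in one place makes it easier to spot when the two positions were measured in different coordinate systems (image pixels versus page pixels).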

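First, to get the images we can locate the elements with XPath. If the image URL is fixed, or the image is an img element with a src, it can be fetched directly (for example with src()). If the URL is randomized, take a screenshot of the element instead (with Selenium, a screenshot is usually the route to take).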
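
When the image is set through a CSS background-image rather than an img src, the style value looks like url("...") and the URL has to be extracted from it. The full script later in the article does this with chained replace calls; the sketch below does the same job with a regular expression. The helper name css_background_url and the example URLs are hypothetical.

```python
import re
from typing import Optional

# Matches url(...), with optional single or double quotes around the URL.
_CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""")


def css_background_url(style_value: str) -> Optional[str]:
    """Return the URL inside a CSS url(...) value, or None if there is none."""
    match = _CSS_URL.search(style_value)
    return match.group(2) if match else None


print(css_background_url('url("https://example.com/bg.png")'))
# https://example.com/bg.png
print(css_background_url("none"))
# None
```

Returning None for values such as "none" lets the caller decide whether to retry or fall back to a screenshot.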

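Once we have the two images, ddddocr can return a set of coordinates on the original (background) image: the top-left and bottom-right corners of the gap.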

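import ddddocr

# Disable text detection and OCR; only slider matching is needed
ocr = ddddocr.DdddOcr(det=False, ocr=False)
background = "1.png"  # background image that contains the gap
with open(background, 'rb') as f:
    background_img = f.read()
front = "2.png"  # the puzzle piece
with open(front, 'rb') as f:
    front_img = f.read()
# Call ddddocr's slider matching: the piece comes first, then the background
res = ocr.slide_match(front_img, background_img, simple_target=True)
target = res['target']  # [x1, y1, x2, y2]: top-left and bottom-right corners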
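
The coordinates in target are in the pixel space of the downloaded image. If the page displays the image at a different size than its natural size, the x value has to be scaled before it is used as a drag distance. The sketch below shows that conversion; the function name to_display_px and the sample sizes are assumptions for illustration, and whether scaling is needed at all depends on the page.

```python
def to_display_px(x_in_image: float, natural_width: float, displayed_width: float) -> int:
    """Convert an x coordinate from image pixels to on-page pixels."""
    if natural_width <= 0:
        raise ValueError("natural_width must be positive")
    return round(x_in_image * displayed_width / natural_width)


# Hypothetical case: a 680 px wide image rendered at 340 px on the page,
# with the gap detected at x = 340 in the image.
print(to_display_px(340, natural_width=680, displayed_width=340))  # 170
```

When the image is shown at its natural size the function returns the input unchanged, so it is safe to apply unconditionally.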

3. Implementation

We use the SF Express (Shunfeng) waybill lookup page as the example.

from DrissionPage import ChromiumPage
import time
import ddddocr
import cv2
import matplotlib.pyplot as plt
import requests


def show_with_matplotlib(imgPath, location):
    img = cv2.imread(imgPath)
    # Draw a red rectangle around the detected gap (OpenCV uses BGR order)
    cv2.rectangle(
        img,
        (location[0], location[1]),
        (location[2], location[3]),
        (0, 0, 255),
        2
    )
    # Convert BGR to RGB for Matplotlib
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # Display the image with Matplotlib
    plt.imshow(img_rgb)
    plt.axis('off')  # hide the axes
    plt.show()


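def downloadImage(url, fileName):
    # Download the image and save it to disk; fail loudly on HTTP errors
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    with open(fileName, 'wb') as f:
        f.write(r.content)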


def getLocation(Page):
    time.sleep(5)  # give the CAPTCHA time to render
    # Background image: the URL is stored in the CSS background-image as url("...")
    img1Url = Page.ele('x:/html/body/div[7]/div[1]/div[1]/div[2]/div/div/div[1]/div[2]').style(
        'background-image').replace('url(', '').replace(')', '').replace('"', '')
    downloadImage(img1Url, '1.png')
    # Puzzle piece image, extracted the same way
    img2Url = Page.ele('x:/html/body/div[7]/div[1]/div[1]/div[2]/div/div/div[1]/div[1]/div[1]').style(
        'background-image').replace('url(', '').replace(')', '').replace('"', '')
    downloadImage(img2Url, '2.png')
    # Slider bar that contains the drag button; [0:4:3] keeps its first and fourth corners
    locations = Page.ele('x:/html/body/div[7]/div[1]/div[1]/div[2]/div/div/div[2]/div').rect.corners[0:4:3]
    # x of the first corner, y at the midpoint between the two corners
    location = (int(locations[0][0]), int((locations[0][1] + locations[1][1]) / 2))
    return location


def detect():
    # Disable text detection and OCR; only slider matching is needed
    ocr = ddddocr.DdddOcr(det=False, ocr=False)
    background = "1.png"
    with open(background, 'rb') as f:
        background_img = f.read()
    front = "2.png"
    with open(front, 'rb') as f:
        front_img = f.read()
    # Call ddddocr's slider matching: the piece comes first, then the background
    res = ocr.slide_match(front_img, background_img, simple_target=True)
    target = res['target']  # [x1, y1, x2, y2]
    show_with_matplotlib(background, target)
    return target[0]  # x coordinate of the gap's left edge


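def mover(Page, target):
    # Drag the slider button to the target page coordinates
    Page.ele('x:/html/body/div[7]/div[1]/div[1]/div[2]/div/div/div[2]/div/div[3]').drag_to(target)
    time.sleep(55)  # long pause, presumably to leave time to check the result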


if __name__ == '__main__':
    page = ChromiumPage()
    page.get("https://www.sf-express.com/chn/sc/waybill/waybill-detail/SF3109285694508")
    loc = getLocation(page)
    addX = detect()
    # Target point: slider bar x + gap x + a fixed 40 px offset used for this page
    X = (loc[0] + addX + 40, loc[1])
    mover(page, X)
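
The target point built in the main block combines three things: the left edge of the slider bar, the gap's x coordinate, and a fixed 40 px offset. The sketch below pulls that arithmetic into one function so the offset becomes a named, tunable parameter. The function name drop_point and the sample corner values are hypothetical, and the corners are assumed to be listed clockwise starting from the top-left, which is what the [0:4:3] slice in getLocation relies on.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def drop_point(corners: Sequence[Point], gap_x: float, offset: float = 40) -> Tuple[int, int]:
    """Page coordinates where the slider button should be released.

    corners: four (x, y) corners of the slider bar, clockwise from top-left
             (an assumption of this sketch).
    gap_x:   x coordinate of the gap's left edge in the background image.
    offset:  fixed correction in pixels; 40 is the value used in the script above.
    """
    top_left, bottom_left = corners[0], corners[3]
    x = int(top_left[0] + gap_x + offset)
    y = int((top_left[1] + bottom_left[1]) / 2)  # vertical midpoint of the bar
    return x, y


# Hypothetical slider bar at (100, 500) that is 340 px wide and 40 px tall.
corners = [(100, 500), (440, 500), (440, 540), (100, 540)]
print(drop_point(corners, gap_x=172))  # (312, 520)
```

If the piece consistently lands short of or past the gap, offset is the single value to adjust.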