Summary of CAPTCHA Recognition Solutions for Login in Python Automated Testing

Latest recommended article published on 2024-05-16 03:52:07

IT界的搬运工007

Category: Python crawlers. Tags: python, ocr, crawler

Article link: https://blog.csdn.net/qq_41676496/article/details/112688326

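Background: one problem that automated testing and web scraping almost always run into is how to recognize or bypass a CAPTCHA. The approaches are summarized below: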

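OCR recognition
- Local OCR recognition
- Third-party OCR API recognition
Capturing the CAPTCHA with Fiddler
Bypassing CAPTCHA login via session
Hard-coding the CAPTCHA or disabling CAPTCHA validation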

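1. Local OCR recognition

Preparation:

(1) Install Tesseract-OCR: https://digi.bib.uni-mannheim.de/tesseract/ (version 4.0 is recommended).

(2) Install pytesseract, a third-party Python library; it can be installed directly from PyCharm (or with pip install pytesseract).

Recognition:

(1) Capture a screenshot of the CAPTCHA image

(2) Process the image: binarization

(3) Recognize the image content and delete the saved image

Code example: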

# -*- coding:utf-8 -*-
import os
import time
from io import BytesIO

import pytesseract
from PIL import Image
from selenium.webdriver.common.by import By


def screenshot_code(driver, veri_code_xpath):
    """
    Take a screenshot of the CAPTCHA image
    :param driver: Selenium WebDriver instance
    :param veri_code_xpath: XPath of the CAPTCHA image
    :return: cropped CAPTCHA image
    """
    # find_element(By.XPATH, ...) replaces find_element_by_xpath,
    # which was removed in newer Selenium 4 releases
    element_screen = driver.find_element(By.XPATH, veri_code_xpath)
    location = element_screen.location
    size = element_screen.size
    # Capture the current window as PNG bytes
    graph_ver_code = driver.get_screenshot_as_png()

    # Open the screenshot and locate the region to crop
    image = Image.open(BytesIO(graph_ver_code))
    left = location['x']
    top = location['y']
    right = location['x'] + size['width']
    bottom = location['y'] + size['height']
    image = image.crop((left, top, right, bottom))
    return image


def edit_picture(image):
    """
    Binarize the image to improve the recognition rate
    :param image: cropped CAPTCHA image
    :return: path of the saved image
    """
    image = image.convert('L')
    # image.size is (width, height)
    width, height = image.size
    for x in range(width):
        for y in range(height):
            pixel = image.getpixel((x, y))
            # Pixels between 130 and 150 are left unchanged
            if pixel > 150:
                image.putpixel((x, y), 255)
            elif pixel < 130:
                image.putpixel((x, y), 0)
    pic_name = time.strftime("%Y%m%d%H%M%S", time.localtime())
    # Save path; adjust as needed
    image_path = r'G:\project\picture\%s.png' % pic_name
    image.save(image_path)
    return image_path


def recognize_captcha(image_path):
    """
    Recognize the CAPTCHA
    :param image_path: path of the binarized image
    :return: the recognized CAPTCHA text
    """
    # Close the file before deleting it, otherwise os.remove can fail on Windows
    with Image.open(image_path) as image:
        # strip() drops the trailing whitespace that Tesseract appends
        code = pytesseract.image_to_string(image).strip()
    # Delete the image after recognition (optional)
    os.remove(image_path)
    print(code)
    return code
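
One thing the crop in screenshot_code takes for granted is that screenshot pixels line up with the coordinates in element.location. On high-DPI displays the screenshot can be larger than the page's CSS size, so the crop can land in the wrong place. A minimal sketch of a workaround, assuming a hypothetical helper crop_box and a scale factor that you measure yourself (for example from window.devicePixelRatio); neither is part of the original code:

```python
def crop_box(location, size, scale=1.0):
    """Return the (left, top, right, bottom) crop box in screenshot pixels.

    location and size are the dicts Selenium returns for an element;
    scale is an assumed screenshot-to-CSS pixel ratio (1.0 means no scaling).
    """
    left = location['x'] * scale
    top = location['y'] * scale
    right = (location['x'] + size['width']) * scale
    bottom = (location['y'] + size['height']) * scale
    # Round to whole pixels, since image coordinates are integers
    return tuple(int(round(v)) for v in (left, top, right, bottom))


# Example: an 80x30 element at (100, 200) on a display with a 2x ratio
print(crop_box({'x': 100, 'y': 200}, {'width': 80, 'height': 30}, scale=2.0))
# prints (200, 400, 360, 460)
```

The result can be passed to image.crop(...) in place of the hand-built tuple.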

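Possible problems:

(1) Decoding error: UnicodeDecodeError: 'utf-8' codec can't decode....

Possible cause: the Tesseract path configured in pytesseract.py is wrong. Change the value of the tesseract_cmd variable to the OCR installation path (it can also be set from your own code as pytesseract.pytesseract.tesseract_cmd), for example: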

tesseract_cmd = r'F:\software\OCR\Tesseract-OCR\tesseract.exe'

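(2) Low recognition rate

Adjust the binarization thresholds, or train Tesseract on a set of sample images. Training amounts to reinventing the wheel, so it is worth looking for existing open-source optimizations first.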
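
To experiment with different thresholds, the per-pixel loop can be swapped for a lookup table applied in one call. The sketch below is a variant under stated assumptions rather than the code used above: it uses a single cut-off (the hypothetical threshold parameter) instead of the 130/150 pair, and relies on Pillow's Image.point accepting a 256-entry table for mode 'L' images.

```python
def threshold_table(threshold=140):
    """Build a 256-entry lookup table: values above threshold become 255, the rest 0."""
    return [255 if value > threshold else 0 for value in range(256)]


def binarize(image, threshold=140):
    """Convert a Pillow image to grayscale and binarize it with one cut-off.

    image is expected to be a PIL.Image.Image; 140 is only a starting guess.
    """
    return image.convert('L').point(threshold_table(threshold))
```

Trying a few values, for example binarize(image, 120) and binarize(image, 160), and comparing the OCR output is usually quicker than editing the loop by hand.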

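2. Reading the CAPTCHA from traffic captured with Fiddler

If an API response returns the CAPTCHA, it can be read directly from that response.

Approach: have Fiddler automatically save the captured traffic to local files, then read the CAPTCHA from the saved file.

(1) Open the Fiddler menu: Rules > Customize Rules

(2) Add the following FiddlerScript (JScript.NET) code to the OnBeforeRequest method; customize the login API URLs and the file save path: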

        // Save the request
        if (oSession.fullUrl.Contains("login API URL 1") || oSession.fullUrl.Contains("login API URL 2"))
        {
            var fso;
            var file;
            fso = new ActiveXObject("Scripting.FileSystemObject");
            // File save path; customize as needed
            var timestamp = Date.parse(new Date());
            file = fso.OpenTextFile("G:\\project\\response"+timestamp+".txt",8 ,true, true);
            file.writeLine("Request url: " + oSession.url);
            file.writeLine("Request header:" + "\n" + oSession.oRequest.headers);
            file.writeLine("Request body: " + oSession.GetRequestBodyAsString());
            file.writeLine("\n");
            file.close();
        }

(3) Add the following FiddlerScript (JScript.NET) code to the OnBeforeResponse method; customize the login API URLs and the file save path:

        // Save the response
        if (oSession.fullUrl.Contains("login API URL 1") || oSession.fullUrl.Contains("login API URL 2"))
        {
            oSession.utilDecodeResponse(); // Decode the response so the saved body is not garbled
            var fso;
            var file;
            fso = new ActiveXObject("Scripting.FileSystemObject");
            // File save path; customize as needed
            var timestamp = Date.parse(new Date());
            file = fso.OpenTextFile("G:\\project\\response"+timestamp+".txt",8 ,true, true);
            file.writeLine("Response code: " + oSession.responseCode);
            file.writeLine("Response body: " + oSession.GetResponseBodyAsString());
            file.writeLine("\n");
            file.close();
        }

(4) Start Fiddler, then open the login page so that the login request file is generated

(5) A Python method that reads the CAPTCHA from the saved file:

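# -*- coding:utf-8 -*-
import json


def return_veri_code(response_file):
    """
    Get the CAPTCHA from the file saved by Fiddler
    :param response_file: path of the saved request/response file
    :return: the CAPTCHA, or '' if no matching line is found
    """
    expect = 'Response body: {"header":{"code'
    prefix_len = len('Response body: ')
    code = ''
    # The OpenTextFile call in the Fiddler rule writes the file as Unicode (UTF-16)
    with open(response_file, 'r', encoding='utf-16') as fp:
        for line in fp:
            if expect in line:
                # Parse the JSON response body into a dict; json.loads also
                # handles true/false/null, which ast.literal_eval rejects
                real_dic = json.loads(line[prefix_len:])
                # Get the CAPTCHA from the response
                code = real_dic['body']['code']
    return code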

(6) Call the method above from the login method to obtain the CAPTCHA and log in directly.
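
Because the Fiddler rules above write to a file named response<timestamp>.txt, several such files can accumulate, and the login method first has to decide which one to read. A minimal sketch, assuming the files live in one directory and the most recently modified one belongs to the current login attempt; latest_response_file is a hypothetical helper, not part of the original code:

```python
import glob
import os


def latest_response_file(directory, pattern='response*.txt'):
    """Return the most recently modified file matching pattern, or None if there is none."""
    candidates = glob.glob(os.path.join(directory, pattern))
    if not candidates:
        return None
    # Pick the file with the newest modification time
    return max(candidates, key=os.path.getmtime)
```

In the login method this could be combined with return_veri_code, for example return_veri_code(latest_response_file(r'G:\project')), after checking that the result is not None.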

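3. Baidu AI general text recognition open API

Preparation:

(1) Visit the Baidu AI Cloud platform: https://login.bce.baidu.com/?redirect=https%3A%2F%2Fconsole.bce.baidu.com%2F%3Ffromai%3D1#/aip/overview

(2) Register an account and create an application: https://jingyan.baidu.com/article/ab0b563063a586c15bfa7d55.html

(3) Get your own API_KEY and SECRET_KEY; the API can be called 5000 times per day for free.

Recognition code example:

(1) Method that calls the Baidu OCR open API: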

# -*-coding:utf-8 -*-
import base64

import requests


def image_to_words(image_path):
    """
    Call the Baidu OCR open API to recognize the text in an image
    :param image_path: image path
    :return: words: the recognized text ('' on failure)
    """
    # client_id is the API_KEY and client_secret is the SECRET_KEY from the official site
    host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&' \
           'client_id=API_KEY&client_secret=SECRET_KEY'
    response = requests.get(host)
    # Get the access_token
    access_token = response.json().get('access_token', '') if response.ok else ''
    if not access_token:
        print('Failed to get access_token')
        return ''
    print('access_token obtained:', access_token)
    request_url = "https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic"
    # The image is sent base64-encoded
    with open(image_path, 'rb') as f:
        img = base64.b64encode(f.read())
    params = {"image": img}
    request_url = request_url + "?access_token=" + access_token
    headers = {'content-type': 'application/x-www-form-urlencoded'}
    response = requests.post(request_url, data=params, headers=headers)
    words = ''
    if response.ok:
        # words_result is missing when the API reports an error
        datas = response.json().get('words_result', [])
        for i in datas:
            words = words + i.get('words', '')
    else:
        print('Recognition failed:', response.status_code, response.text)
    return words

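4. Bypassing CAPTCHA login via session

5. Hard-coding the CAPTCHA or disabling CAPTCHA validation

Note: if you know of other methods, please leave a comment. I will update the article after studying and verifying them, in the hope of pooling everyone's ideas to solve this problem better.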
