Python识别验证码的思路及解决方案

最新推荐文章于 2024-08-14 14:09:18 发布

「已注销」

最新推荐文章于 2024-08-14 14:09:18 发布

阅读量2k

点赞数 1

文章标签： python 图像识别程序人生经验分享其他

本文链接：https://blog.csdn.net/weixin_50181817/article/details/109328466

版权

本文介绍了Python识别简单验证码的思路，包括灰度处理、二值化、去除边框等步骤，以及使用pytesseract、OpenCV等库。通过实例展示了如何降噪、分隔、比对数字图片来确定验证码，提供了一个解决验证码识别问题的方法。

摘要由CSDN通过智能技术生成

在本篇内容里小编给大家整理的是一篇关于python识别验证码的思路及解决方案，有需要的朋友们可以参考下。

1、介绍

在爬虫中经常会遇到验证码识别的问题，现在的验证码大多分计算验证码、滑块验证码、识图验证码、语音验证码等四种。本文就是识图验证码，识别的是简单的验证码，要想让识别率更高，识别的更加准确就需要花很多的精力去训练自己的字体库。

识别验证码通常是这几个步骤：

（1）灰度处理

（2）二值化

（3）去除边框（如果有的话）

（4）降噪

（5）切割字符或者倾斜度矫正

（6）训练字体库

（7）识别

这6个步骤中前三个步骤是基本的，4或者5可根据实际情况选择是否需要。

经常用的库有pytesseract(识别库)、OpenCV(高级图像处理库)、imagehash（图片哈希值库）、numpy（开源的、高性能的Python数值计算库）、PIL的 Image，ImageDraw，ImageFile等。博主的Python学习圈子点击即可进入一起交流学习，还有最新的Python资料可以免费下载

2、实例

以某网站登录的验证码识别为例：具体过程和上述的步骤稍有不同。
在这里插入图片描述
首先分析一下，验证码是由4个从0到9等10个数字组成的，那么从0到9这个10个数字没有数字只有第一、第二、第三和第四等4个位置。那么计算下来共有40个数字位置，如下：

在这里插入图片描述

那么接下来就要对验证码图片进行降噪、分隔得到上面的图片。以这40个图片集作为基础。

对要验证的验证码图片进行降噪、分隔后获取四个类似上面的数字图片、通过和上面的比对就可以知道该验证码是什么了。

以上面验证码2837为例：

1、图片降噪
在这里插入图片描述
2、图片分隔

在这里插入图片描述
3、图片比对

通过比验证码降噪、分隔后的四个数字图片，和上面的40个数字图片进行哈希值比对，设置一个误差，max_dif：允许最大hash差值，越小越精确，最小为0。
在这里插入图片描述
这样四个数字图片通过比较后获取对应是数字，连起来，就是要获取的验证码。

完整代码如下：

#coding=utf-8
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.action_chains import ActionChains
import collections
import mongoDbBase
import numpy
import imagehash
from PIL import Image,ImageFile
import datetime
class finalNews_IE:
    def __init__(self,strdate,logonUrl,firstUrl,keyword_list,exportPath,codepath,codedir):
        self.iniDriver()
        self.db = mongoDbBase.mongoDbBase()
        self.date = strdate
        self.firstUrl = firstUrl
        self.logonUrl = logonUrl
        self.keyword_list = keyword_list
        self.exportPath = exportPath
        self.codedir = codedir
        self.hash_code_dict ={
   }
        for f in range(0,10):
            for l in range(1,5):
                file = os.path.join(codedir, "codeLibrary\code" +  str(f) + '_'+str(l) + ".png")
                # print(file)
                hash = self.get_ImageHash(file)
                self.hash_code_dict[hash]= str(f)
    def iniDriver(self):
        # 通过配置文件获取IEDriverServer.exe路径
        IEDriverServer = "C:\Program Files\Internet Explorer\IEDriverServer.exe"
        os.environ["webdriver.ie.driver"] = IEDriverServer
        self.driver = webdriver.Ie(IEDriverServer)
    def WriteData(self, message, fileName):
        fileName = os.path.join(os.getcwd(), self.exportPath + '/' + fileName)
        with open(fileName, 'a') as f:
            f.write(message)
    # 获取图片文件的hash值
    def get_ImageHash(self,imagefile):
        hash = None
        if os.path.exists(imagefile):
            with open(imagefile, 'rb') as fp:
                hash = imagehash.average_hash(Image.open(fp))
        return hash
    # 点降噪
    def clearNoise(self, imageFile, x=0, y=0):
        if os.path.exists(imageFile):
            image = Image.open(imageFile)
            image = image.convert('L')
            image = numpy.asarray(image)
            image = (image > 135) * 255