Python crawler CAPTCHA recognition modules: tesserocr and pytesseract

愤怒的马农

Article link: https://blog.csdn.net/weixin_43407092/article/details/88555394

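tesserocr runs into various compatibility problems on Windows, including with PyCharm virtual environments, so on Windows this article installs the pytesseract module instead. If you really do need tesserocr, install it from a .whl file or with conda.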

pip install pytesseract

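If pytesseract cannot find the tesseract executable at run time (this usually happens inside a virtual environment), add the directory that contains tesseract.exe from the Tesseract-OCR installation to the Windows PATH environment variable, or edit pytesseract.py and set the "tesseract_cmd" field to the full path of tesseract.exe.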
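
Rather than editing pytesseract.py inside site-packages, the same field can be set at run time through `pytesseract.pytesseract.tesseract_cmd`. The following is a minimal sketch of that approach. The helper names `find_tesseract` and `configure_pytesseract` are invented for this example and are not part of pytesseract.

```python
import os
import shutil


def find_tesseract(candidates, which=shutil.which):
    """Return a path to the tesseract executable, or None if none is found.

    PATH is checked first; `candidates` are fallback locations to try in order.
    """
    found = which("tesseract")
    if found:
        return found
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def configure_pytesseract(candidates):
    """Point pytesseract at the located executable and return the path used."""
    # Third-party import kept local so find_tesseract works without pytesseract.
    import pytesseract

    path = find_tesseract(candidates)
    if path is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Call `configure_pytesseract` once before the first OCR call, passing the locations where tesseract.exe might live on your machine. A common guess on Windows is `C:\Program Files\Tesseract-OCR\tesseract.exe`, but that path is an assumption and depends on where the installer put it.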

Test the recognition:

import pytesseract
from PIL import Image

image = Image.open('tesseracttest.png')		# image file name
text = pytesseract.image_to_string(image)
print(text)
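
A CAPTCHA is usually one short line drawn from a small character set, and telling tesseract so can improve the result. The sketch below uses the `lang` and `config` arguments of `image_to_string`. `--psm 7` (treat the image as a single text line) and `tessedit_char_whitelist` are standard tesseract options; the helper names and the chosen character set are assumptions for illustration. Whether the whitelist is honored depends on the tesseract version and engine in use.

```python
import string


def build_captcha_config(psm=7, whitelist=string.ascii_letters + string.digits):
    """Build the tesseract config string for a one-line CAPTCHA."""
    parts = [f"--psm {psm}"]
    if whitelist:
        # Restrict recognition to the characters the CAPTCHA can contain.
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def read_captcha(path, lang="eng"):
    """OCR a CAPTCHA image and return the text without surrounding whitespace."""
    # Third-party imports kept local so build_captcha_config has no dependencies.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        text = pytesseract.image_to_string(
            image, lang=lang, config=build_captcha_config()
        )
    return text.strip()
```

For a digits-only CAPTCHA, `build_captcha_config(whitelist="0123456789")` narrows the character set further.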

On Ubuntu (Linux), the installation commands are as follows:

# Install tesseract
sudo apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev

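# Install the language packs (on newer releases the tessdata directory may be versioned, e.g. /usr/share/tesseract-ocr/4.00/tessdata)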
git clone https://github.com/tesseract-ocr/tessdata.git
sudo mv tessdata/* /usr/share/tesseract-ocr/tessdata

# Install pytesseract
pip3 install pytesseract
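
After installing, it is worth confirming that the tesseract binary is on PATH and which language packs it can see. The check below is a small sketch that relies only on the standard `tesseract --version` and `tesseract --list-langs` commands. The helper name `tesseract_info` is made up for this example.

```python
import shutil
import subprocess


def tesseract_info(command="tesseract"):
    """Return (version_line, languages), or None if the command is not on PATH."""
    if shutil.which(command) is None:
        return None
    # Older tesseract releases print this information to stderr, so capture both streams.
    version = subprocess.run([command, "--version"], capture_output=True, text=True)
    langs = subprocess.run([command, "--list-langs"], capture_output=True, text=True)
    version_lines = (version.stdout or version.stderr).splitlines()
    lang_lines = (langs.stdout or langs.stderr).splitlines()
    # The first line of --list-langs is a header; the rest are language codes.
    languages = [line.strip() for line in lang_lines[1:] if line.strip()]
    return (version_lines[0] if version_lines else "", languages)


if __name__ == "__main__":
    info = tesseract_info()
    print("tesseract not found on PATH" if info is None else info)
```

If the language list does not include the packs that were moved into the tessdata directory, the files most likely went to a directory that this tesseract build does not read.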

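Clean up an image with a threshold filter, save the result as a new image, then recognize the text in the cleaned image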

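from PIL import Image
import subprocess

def cleanFile(filePath, newFilePath):
    # Convert to grayscale so the threshold acts on a single brightness value
    image = Image.open(filePath).convert("L")

    # Threshold the image: pixels below 143 become black, all others white
    image = image.point(lambda x: 0 if x < 143 else 255)
    # Save the cleaned image under the new file name
    image.save(newFilePath)

    # Call the system tesseract command to OCR the cleaned image;
    # tesseract writes the recognized text to output.txt
    subprocess.call(["tesseract", newFilePath, "output"])

    # Open the result file and print the recognized text
    with open("output.txt", 'r', encoding="utf-8") as f:
        print(f.read())

if __name__ == "__main__":
    # Threshold tesseracttest.jpg, save the result as 123.jpg, then OCR 123.jpg
    cleanFile("tesseracttest.jpg", "123.jpg")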
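
The same clean-up step can be combined with pytesseract so that no intermediate output.txt is needed. This is a minimal sketch that keeps the threshold of 143 used above. `threshold_table` and `ocr_cleaned` are names invented for this example. `Image.point` accepts a 256-entry lookup table for 8-bit images, which is what the table supplies.

```python
def threshold_table(threshold=143):
    """256-entry lookup table: values below `threshold` map to 0, all others to 255."""
    return [0 if value < threshold else 255 for value in range(256)]


def ocr_cleaned(path, threshold=143):
    """Binarize the image at `path` in memory and return the recognized text."""
    # Third-party imports kept local so threshold_table can be used on its own.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        # Grayscale first so the table is applied to a single 8-bit band.
        cleaned = image.convert("L").point(threshold_table(threshold))
    return pytesseract.image_to_string(cleaned).strip()
```

The best threshold depends on the CAPTCHA style, so 143 is only a starting point; trying a few values against known samples is a quick way to tune it.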