Simple image CAPTCHA recognition with Tesseract (digit-only recognition)

Article link: https://blog.csdn.net/wang785994599/article/details/96831231

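This post shares practical notes on using different versions of Tesseract OCR to recognize digits, English, and Chinese, together with installation and configuration steps: setting the environment variable, multi-language recognition, and calling Tesseract from Python.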


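For images like this one that contain only digits, Tesseract 3.02 is recommended.
For Chinese characters, version 5.0 works fine.
If you need to train your own model, use version 4.0: its documentation is somewhat more extensive, and it supports models you train yourself.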
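
Because the advice above depends on which version is installed, a small Python sketch for reading the version may help. The helper names `parse_version` and `suggested_use` are hypothetical, and the parser assumes the version text looks like `tesseract 4.1.1` or `tesseract v5.0.0-alpha`; the mapping only restates the recommendations in this post.

```python
import re


def parse_version(output):
    """Extract (major, minor) from the text printed by `tesseract --version`.

    Assumes the output contains something like "tesseract 4.1.1" or
    "tesseract v5.0.0-alpha"; returns None if no version number is found.
    """
    match = re.search(r"(\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def suggested_use(major):
    """Map a major version to the use case recommended in this post."""
    advice = {
        3: "digit-only images",
        4: "training your own model",
        5: "Chinese text",
    }
    return advice.get(major, "no recommendation in this post")
```

The text to parse could come from `subprocess.run(["tesseract", "--version"], capture_output=True, text=True)`. Some versions may print it to stderr rather than stdout, so it is safer to check both.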

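Install Tesseract by following the official documentation.
For the language data, select math and Chinese (Simplified).
On Windows you also need to add an environment variable: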

Variable name: TESSDATA_PREFIX
Variable value: F:\Program Files (x86)\Tesseract-OCR\tessdata
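
To confirm that the variable is set and points at a folder with language data, something like the following sketch could be used. The helper name `check_tessdata` is hypothetical, and it only looks for `.traineddata` files in the folder. Whether Tesseract expects the variable to name the `tessdata` folder itself or its parent can differ between versions, so treat this as a sanity check, not a rule.

```python
import os
from pathlib import Path


def check_tessdata(env=None):
    """Return the sorted language codes found under TESSDATA_PREFIX.

    Hypothetical helper: raises RuntimeError if the variable is missing
    or does not point at an existing directory.
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        raise RuntimeError("TESSDATA_PREFIX is not set")
    folder = Path(prefix)
    if not folder.is_dir():
        raise RuntimeError(f"TESSDATA_PREFIX is not a directory: {folder}")
    # Each installed language is a file such as eng.traineddata.
    return sorted(p.stem for p in folder.glob("*.traineddata"))
```

Calling `check_tessdata()` with no arguments reads the real environment; a list that contains `chi_sim` would indicate that the Simplified Chinese data is installed.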

Change to the Tesseract installation directory
and run:

tesseract.exe 22.png result

The result is stored in result.txt.
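
The same command can be driven from a script. This is a minimal sketch, assuming `tesseract` is reachable on the PATH; the function names `build_command` and `run_tesseract` are made up for illustration. It relies on the behavior described above: Tesseract appends `.txt` to the output base name.

```python
import subprocess
from pathlib import Path


def build_command(image, output_base, exe="tesseract"):
    """Build the argument list for: tesseract <image> <output_base>."""
    return [exe, str(image), str(output_base)]


def run_tesseract(image, output_base, exe="tesseract"):
    """Run the CLI and return the recognized text.

    Tesseract writes its result to "<output_base>.txt", so that is the
    file read back here.
    """
    subprocess.run(build_command(image, output_base, exe), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Example (requires Tesseract and the image file to exist):
# print(run_tesseract("22.png", "result"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, such as `Program Files (x86)`.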

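Note: for Chinese text, version 5.0 is more accurate.
Chinese recognition requires the corresponding Chinese trained data. Download it from: https://github.com/tesseract-ocr/tesseract/wiki
The trained data must be specified at run time: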

tesseract.exe 44.png E:\result -l chi_sim

For multiple languages, specify several trained data sets joined with "+":

tesseract.exe 44.png E:\result -l chi_sim+eng
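
The `-l` value in the commands above is just the language codes joined with `+`. A small sketch of building that argument in Python follows; the helper names `lang_args` and `build_ocr_command` are hypothetical.

```python
def lang_args(*languages):
    """Return the CLI arguments selecting one or more trained data sets.

    lang_args("chi_sim", "eng") gives ["-l", "chi_sim+eng"]; with no
    languages the list is empty and Tesseract uses its default.
    """
    if not languages:
        return []
    return ["-l", "+".join(languages)]


def build_ocr_command(image, output_base, *languages, exe="tesseract"):
    """Build: tesseract <image> <output_base> [-l lang1+lang2...]."""
    return [exe, str(image), str(output_base)] + lang_args(*languages)


# Example:
# build_ocr_command("44.png", r"E:\result", "chi_sim", "eng")
```

The resulting list can be passed directly to `subprocess.run`.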


Install pytesseract: pip install pytesseract

import pytesseract
from PIL import Image

# Open the image and run OCR with the Simplified Chinese trained data.
image = Image.open("44.png")
print(pytesseract.image_to_string(image, lang="chi_sim"))
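
For digit-only CAPTCHAs, one commonly used approach is to restrict the character set through the `config` parameter of `image_to_string`. The sketch below is not from the original post: it assumes the Tesseract options `--psm 7` (treat the image as a single text line) and `tessedit_char_whitelist`, and the helper names are hypothetical. The whitelist is reported to be ignored by the LSTM engine in some 4.0 builds, so results may vary by version.

```python
def digits_config(psm=7):
    """Build a Tesseract config string that only allows the digits 0-9."""
    return f"--psm {psm} -c tessedit_char_whitelist=0123456789"


def keep_digits(text):
    """Drop whitespace and any stray non-digit characters from OCR output."""
    return "".join(ch for ch in text if ch.isdigit())


def read_digits(path):
    """OCR an image file and return only the digits that were recognized."""
    # Imported here so the two helpers above work without the OCR stack.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        text = pytesseract.image_to_string(image, config=digits_config())
    return keep_digits(text)


# Example (requires Tesseract, pytesseract, Pillow, and the image file):
# print(read_digits("22.png"))
```

Filtering the output again with `keep_digits` is a cheap safeguard in case the whitelist has no effect on the installed version.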


The following exception may occur:

pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your path

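This is an environment variable problem: the Tesseract-OCR installation directory just needs to be added to the system PATH.
Alternatively, set tesseract_cmd before running recognition so that it points to tesseract.exe: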

pytesseract.pytesseract.tesseract_cmd = r"D:\Tesseract-OCR\tesseract.exe"
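
To avoid hard-coding a single location, the executable can be searched for first. This is a sketch under assumptions: `find_tesseract` and `configure_pytesseract` are hypothetical helpers, and the candidate paths are whatever install locations apply to your machine.

```python
import shutil
from pathlib import Path


def find_tesseract(candidates=(), name="tesseract"):
    """Return the first usable path to the executable, or None.

    Explicit candidate paths are checked first, then the system PATH.
    """
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return shutil.which(name)


def configure_pytesseract(candidates=()):
    """Point pytesseract at the executable found by find_tesseract."""
    # Imported here so find_tesseract works without pytesseract installed.
    import pytesseract

    exe = find_tesseract(candidates)
    if exe is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = exe
    return exe


# Example:
# configure_pytesseract([r"D:\Tesseract-OCR\tesseract.exe"])
```

Calling `configure_pytesseract` once at startup, before any `image_to_string` call, keeps the rest of the script free of path handling.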