Installing Leptonica and Tesseract-OCR on Linux and Running Them from Python (CAPTCHA Recognition)

Latest recommended article published 2024-10-11 07:48:15

foxtool23


Category: Python. Tags: docker, linux, python, ocr

Article link: https://blog.csdn.net/foxtool23/article/details/105789584


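This article walks through installing the Tesseract OCR engine from scratch on CentOS, including the required leptonica dependency, and shows how to call Tesseract from Python to recognize text in images. It also covers a quick Docker-based deployment of Tesseract and the extra configuration needed for Chinese recognition.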


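Tesseract: an open-source OCR engine

In my tests, English letters and digits are recognized, Chinese is not recognized with the default setup, and simple CAPTCHA images work while complex ones are not recognized.

Option 1 (install it yourself on a clean system)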
docker run --name i-centos --restart=always -v E:\Python:/www -itd daocloud.io/centos:7

https://blog.51cto.com/jschu/1702456

1.
Install the build tools and the image library headers

yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
yum install gcc gcc-c++
yum install autoconf automake libtool
yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel
yum install make

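2.
Install leptonica (download and unpack the source archive from the page below, then run the build commands from the unpacked directory)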
http://www.leptonica.org/download.html
./configure
make && make install

vi /etc/profile
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBLEPT_HEADERSDIR=/usr/local/include
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
source /etc/profile
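
As a quick sanity check on the variables above, here is a minimal Python sketch. The helper names `append_path_entry` and `has_path_entry` are made up for illustration and are not part of any library. It shows how a colon-separated search path can be extended without leaving empty or duplicate entries, and how one might confirm that /usr/local/lib is actually listed after running `source /etc/profile`.

```python
import os


def append_path_entry(current, new_entry):
    """Append new_entry to a colon-separated path list, skipping empties and duplicates."""
    entries = [entry for entry in current.split(":") if entry]
    if new_entry not in entries:
        entries.append(new_entry)
    return ":".join(entries)


def has_path_entry(var_name, directory, environ=None):
    """Return True if directory is listed in the given path-style variable."""
    environ = os.environ if environ is None else environ
    entries = [entry for entry in environ.get(var_name, "").split(":") if entry]
    return directory in entries


if __name__ == "__main__":
    # Both lines should print True in a shell that has sourced /etc/profile.
    print(has_path_entry("LD_LIBRARY_PATH", "/usr/local/lib"))
    print(has_path_entry("PKG_CONFIG_PATH", "/usr/local/lib/pkgconfig"))
```

If either check prints False, the shell that launched Python probably has not picked up the edited /etc/profile yet.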

3.
Install tesseract-ocr
Download the source package manually.
https://github.com/tesseract-ocr/tesseract

cd tesseract-3.04.01
./autogen.sh
./configure
make && make install
ldconfig
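
After the build finishes, it can help to confirm that the binary is reachable before moving on to Python. The sketch below is one way to do that using only the standard library. The function name `tesseract_version` is my own, and I am assuming the installed binary is called `tesseract` and is on PATH.

```python
import shutil
import subprocess


def tesseract_version(binary="tesseract"):
    """Return the first line of `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    completed = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some releases print the version to stderr
        universal_newlines=True,
    )
    lines = completed.stdout.strip().splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version is not None else "tesseract not found on PATH")
```

If this reports that the binary is missing, or the binary fails to start because of a shared-library error, the library path settings from step 2 and the `ldconfig` call are the first things I would re-check.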

4.
Install pytesseract and Pillow, then run the recognition script
pip install pytesseract
pip install pillow

#!/usr/bin/env python
# coding:utf-8

import pytesseract
from PIL import Image

image = Image.open(r'/www/1.jpg')
image = image.convert('L')  # convert to grayscale
threshold = 127  # binarization threshold
# table is a 256-entry lookup table: gray levels below the threshold map to 0, all others map to 1
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)

image = image.point(table, '1')  # binarize the grayscale image using the lookup table
result = pytesseract.image_to_string(image)  # run OCR on the binarized image
print(result)

python /www/a.py
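
To make the thresholding rule in the script easier to see in isolation, here is a small pure-Python sketch that needs neither Pillow nor Tesseract. The function names `build_threshold_table` and `binarize` are hypothetical. It builds the same 256-entry lookup table and applies it to a handful of gray levels.

```python
def build_threshold_table(threshold=127):
    """Return a 256-entry table: 0 for gray levels below threshold, 1 otherwise."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(gray_pixels, threshold=127):
    """Map each 8-bit gray level to 0 or 1 through the lookup table."""
    table = build_threshold_table(threshold)
    return [table[pixel] for pixel in gray_pixels]


if __name__ == "__main__":
    # Levels 0 and 126 fall below the threshold, 127 and 255 do not.
    print(binarize([0, 126, 127, 255]))  # [0, 0, 1, 1]
```

The threshold of 127 is simply the value used in the script above. For noisy CAPTCHA images a different value may work better, so it is worth trying a few.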

Option 2 (use an image someone else has already built)
docker pull ncino/tesseract-ocr
docker run --name i-ocr --restart=always -v E:\Python:/www -itd ncino/tesseract-ocr
tesseract 1.jpg result
pip install pytesseract

Get the result

docker exec -i i-ocr python3 /www/a.py
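
If the recognition needs to be triggered from a Python program on the host rather than typed by hand, the `docker exec` call can be wrapped with `subprocess`. This is only a sketch: the helper names are made up, and it assumes the container is named `i-ocr` and the script is at /www/a.py as in the commands above.

```python
import subprocess


def docker_exec_command(container, *args):
    """Build the argument list for `docker exec -i <container> <args...>`."""
    return ["docker", "exec", "-i", container] + list(args)


def run_in_container(container, *args):
    """Run a command in the container and return its standard output as text."""
    completed = subprocess.run(
        docker_exec_command(container, *args),
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout


if __name__ == "__main__":
    print(run_in_container("i-ocr", "python3", "/www/a.py"))
```

Passing the command as a list avoids shell quoting problems when file names contain spaces.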

Other notes:

Chinese requires the extra language data, selected with -l: docker exec -i i-ocr tesseract /www/1.png /www/1.txt -l chi_sim+eng

The language data is available at: https://github.com/tesseract-ocr/tessdata
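
The language selection can also be assembled programmatically. The sketch below uses a hypothetical helper, `tesseract_command`, to build the argument list for the Tesseract command line, joining several language codes with `+`. Note that Tesseract appends `.txt` to the output base itself, so I pass `/www/1` rather than `/www/1.txt` here. From Python, `pytesseract.image_to_string` accepts the same language string through its `lang` parameter, for example `lang='chi_sim+eng'`.

```python
def tesseract_command(image_path, output_base, languages=("eng",)):
    """Build a `tesseract <image> <outputbase> -l <lang+lang>` argument list."""
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    cmd = tesseract_command("/www/1.png", "/www/1", languages=("chi_sim", "eng"))
    print(" ".join(cmd))  # tesseract /www/1.png /www/1 -l chi_sim+eng
```

Each language code only works if the matching .traineddata file is present in the tessdata directory of the installation.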

If you get the error "wrong number of lut entries", use this version of the script instead:

#!/usr/bin/env python
# coding:utf-8

import pytesseract
from PIL import Image

image = Image.open(r'/www/1.jpg')
image = image.convert('L')  # convert to grayscale

threshold = 127  # binarization threshold
# The explicit lookup table from the first script is replaced by a function:
# table = []
# for i in range(256):
#     if i < threshold:
#         table.append(0)
#     else:
#         table.append(1)
# image = image.point(table, '1')

# gray levels below the threshold map to 0, all others map to 1
filter_func = lambda x: 0 if x < threshold else 1
image = image.point(filter_func, '1')  # binarize the grayscale image

result = pytesseract.image_to_string(image)  # run OCR on the binarized image
print(result)
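
Why passing a function can avoid the error: as far as I understand it, Pillow's `point()` expects a table with 256 entries per band, and when given a function it builds that table itself for however many bands the image has. The sketch below is a simplified model of that behaviour in plain Python, not Pillow's actual code. The names `expand_lut` and `check_lut` are made up for illustration.

```python
def expand_lut(func, bands=1):
    """Evaluate func on every 8-bit level and repeat the table once per band."""
    return [func(level) for level in range(256)] * bands


def check_lut(table, bands=1):
    """Reject a table whose length does not match 256 entries per band."""
    expected = 256 * bands
    if len(table) != expected:
        raise ValueError(
            "wrong number of lut entries: got %d, expected %d" % (len(table), expected)
        )
    return table


if __name__ == "__main__":
    threshold = 127
    to_binary = lambda level: 0 if level < threshold else 1
    print(len(check_lut(expand_lut(to_binary, bands=1), bands=1)))  # 256
    print(len(check_lut(expand_lut(to_binary, bands=3), bands=3)))  # 768
```

Under this model, a hand-built 256-entry table only fits a single-band image, so one plausible cause of the error is that the image was still in a multi-band mode such as RGB when `point()` was called.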