2021-07-23


lltsygxs


Category: Machine Learning. Tags: python, ocr

Article link: https://blog.csdn.net/lltsygxs/article/details/119010955


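This article introduces OCR and focuses on the open-source OCR engine Tesseract-OCR: installing it, setting the environment variable, using it, handling installation errors, and adding language packs. It also covers installing and using the Python library pytesseract, with examples of recognizing text in images and notes on problems that come up during recognition. Tesseract's command-line options and common pytesseract usage are listed as well.

(Summary generated automatically by CSDN.)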


Tesseract-ocr / pytesseract: detailed installation guide


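Brief introduction

OCR (Optical Character Recognition)

OCR, optical character recognition, is the process of scanning characters (for example, an image CAPTCHA) and converting them into electronic text based on their shapes.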

Tesseract-ocr

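Tesseract-ocr, usually just called Tesseract, is an open-source OCR engine written in C++. It was originally developed at HP Labs; Google later improved it and released it. Version 5.0 is already available.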

pytesseract

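pytesseract is a Python library for calling Tesseract. Tesseract is a C++ program and cannot be called directly from Python, so pytesseract acts as the bridge between Python and the Tesseract executable. This means Tesseract itself must be installed.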

Downloading and installing Tesseract

Download location

https://digi.bib.uni-mannheim.de/tesseract/
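win32 means 32-bit Windows.
win64 means 64-bit Windows; alpha is presumably a development/test build.
The remaining entries are presumably for Linux; dev marks a development/test build.
I recommend downloading a stable release. I am on Windows 10 and installed tesseract-ocr-win64-setup-v5.0.0.20190623.exe.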

Tesseract installation steps and error handling

Run the installer as administrator.

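If no error appears, the installation is complete and the additional language packs were installed as well. (By default Tesseract only installs the data for English and digits, so it cannot recognize Chinese without an extra pack.)
If an error dialog appears during installation, a language pack failed to install. Just click OK (you may need to click it a dozen or more times). Once installation finishes, you will have to add the language packs manually.
Language pack download location

https://github.com/tesseract-ocr/tessdata
Extract the downloaded archive, then copy the .traineddata files you need into the tessdata folder of the Tesseract installation directory.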
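The manual copy step can also be scripted. The following is a minimal sketch, assuming the extracted archive contains files named `<lang>.traineddata` and that Tesseract reads them from a `tessdata` folder inside its install directory. The function name `install_traineddata` and any paths passed to it are hypothetical and only for illustration.

```python
import shutil
from pathlib import Path


def install_traineddata(extracted_dir, tessdata_dir, langs=("chi_sim",)):
    """Copy selected <lang>.traineddata files into the tessdata folder.

    Returns the list of destination paths that were written. Languages
    whose file is missing from extracted_dir are skipped.
    """
    src = Path(extracted_dir)
    dst = Path(tessdata_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for lang in langs:
        candidate = src / f"{lang}.traineddata"
        if candidate.is_file():
            target = dst / candidate.name
            shutil.copy2(candidate, target)
            copied.append(target)
    return copied
```

On a machine set up like the one in this article, the call might look like `install_traineddata(r"C:\path\to\extracted", r"C:\Program Files\Tesseract-OCR\tessdata")`. Writing under Program Files may require an elevated prompt.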

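Setting the environment variable

Skipping the environment variable does not affect calls from Python, as long as pytesseract is given the full path to the executable (as in the code later in this article). It does mean you cannot run the tesseract command from an ordinary cmd window: Windows will report that it is not recognized as an internal or external command.
That said, the installation comes with its own command prompt window (hehe).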
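To check from Python whether the tesseract command is reachable, something like the sketch below can help. It is only an illustration: `find_tesseract` is a hypothetical helper, and the fallback path assumes the default install location with an explicit `.exe` suffix.

```python
import os
import shutil

# Assumed default install location on Windows; adjust for your machine.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(command="tesseract", fallbacks=(DEFAULT_WINDOWS_PATH,)):
    """Return a usable path to the tesseract executable, or None.

    First looks on PATH (which is what the environment variable controls),
    then tries each fallback path in order.
    """
    on_path = shutil.which(command)
    if on_path:
        return on_path
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None
```

If this returns a path, that value can be assigned to `pytesseract.pytesseract.tesseract_cmd`. If it returns None, Tesseract is neither on PATH nor at the assumed location.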

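Using Tesseract, with examples

Open the command prompt window that ships with Tesseract and type tesseract. It prints a short usage summary.

A first try
Take test1.png as an example (English letters and digits).

Note: the resulting text file is named output_1.txt.txt, which shows that tesseract appends a .txt suffix to the output name by default. Give the output name without an extension.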
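The naming rule is easy to capture in a helper. This is a small sketch, not part of Tesseract or pytesseract; both function names are made up for illustration.

```python
from pathlib import Path


def normalize_outputbase(name):
    """Drop a trailing .txt so tesseract does not produce name.txt.txt."""
    text = str(name)
    return text[:-4] if text.lower().endswith(".txt") else text


def expected_text_file(outputbase):
    """Path of the text file tesseract writes for a given output base."""
    return Path(normalize_outputbase(outputbase) + ".txt")
```

Passing `normalize_outputbase("output_1.txt")` as the output argument gives `output_1`, so the file on disk ends up as `output_1.txt`.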

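Next, try test2.png (which contains Chinese).

As you can see, something went wrong. The image contains Chinese characters, but tesseract recognizes only English and digits by default, so the Chinese language pack has to be specified. (Likewise, recognizing any other language needs the corresponding pack.)
Full command format: tesseract imgPath savePath -l traineddata
-l: language, followed by the language pack name (traineddata means already-trained data, that is, the language pack).
So the default is equivalent to: tesseract imgPath savePath -l eng (eng: the English language pack)
To recognize Chinese: tesseract imgPath savePath -l chi_sim (chi_sim: Simplified Chinese)

The command above becomes tesseract C:\Users\liulintong\Desktop\test2.png C:\Users\liulintong\Desktop\output_2 -l chi_sim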
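The same command can be assembled and run from Python with the standard library. The sketch below follows the command format given above; `build_command` and `run_tesseract` are hypothetical names, and the code assumes `tesseract` is on PATH unless another executable path is passed in.

```python
import subprocess


def build_command(image_path, output_base, langs=("eng",), executable="tesseract"):
    """Build the argument list for: tesseract imgPath savePath -l LANG[+LANG]."""
    if not langs:
        raise ValueError("at least one language is required")
    return [executable, str(image_path), str(output_base), "-l", "+".join(langs)]


def run_tesseract(image_path, output_base, langs=("eng",), executable="tesseract"):
    """Run tesseract; raises CalledProcessError if it exits with an error."""
    cmd = build_command(image_path, output_base, langs, executable)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

For the Chinese example, `run_tesseract("test2.png", "output_2", langs=("chi_sim",))` would write `output_2.txt`. Several packs can be combined, for example `langs=("eng", "chi_sim")`.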

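Recognition basically succeeded (except for the double quotation marks).

Next, try something more complex.

A few characters were recognized incorrectly. Installing a more accurate language pack, such as the models in the tesseract-ocr/tessdata_best repository, can lower the error rate.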

Detailed Tesseract usage reference

Run the command tesseract --help-extra. Its output is:
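Usage:
    tesseract --help | --help-extra | --help-psm | --help-oem | --version
    tesseract --list-langs [--tessdata-dir PATH]
    tesseract --print-parameters [options...] [configfile...]
    tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

OCR options:
    --tessdata-dir PATH     Specify the location of the tessdata path.
    --user-words PATH       Specify the location of the user words file.
    --user-patterns PATH    Specify the location of the user patterns file.
    -l LANG[+LANG]          Specify the language(s) used for OCR; several language packs can be combined.

    --psm NUM               Specify page segmentation mode.
    --oem NUM               Specify OCR Engine mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
     0 Orientation and script detection (OSD) only.
     1 Automatic page segmentation with OSD.
     2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
     3 Fully automatic page segmentation, but no OSD. (Default)
     4 Assume a single column of text of variable sizes.
     5 Assume a single uniform block of vertically aligned text.
     6 Assume a single uniform block of text.
     7 Treat the image as a single text line.
     8 Treat the image as a single word.
     9 Treat the image as a single word in a circle.
    10 Treat the image as a single character.
    11 Sparse text. Find as much text as possible in no particular order.
    12 Sparse text with OSD.
    13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

OCR Engine modes:
    0 Legacy engine only.
    1 Neural nets LSTM engine only.
    2 Legacy + LSTM engines.
    3 Default, based on what is available.

Single options:
    -h, --help      Show minimal help message.
    --help-extra    Show extra help for advanced users.
    -v, --version   Show version information.
    --list-langs    List the languages available to the tesseract engine.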
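When calling Tesseract through pytesseract, these options are passed as a single config string such as `--psm 6`. The following is a rough sketch of a helper that builds and validates that string from the mode numbers listed above. The `PSM` enum and `build_config` are hypothetical names, not part of pytesseract.

```python
from enum import IntEnum


class PSM(IntEnum):
    """A few of the page segmentation modes from the list above."""
    AUTO = 3          # fully automatic page segmentation, no OSD (default)
    SINGLE_BLOCK = 6  # a single uniform block of text
    SINGLE_LINE = 7   # a single text line
    SINGLE_WORD = 8   # a single word
    SINGLE_CHAR = 10  # a single character


def build_config(psm=None, oem=None):
    """Return a string such as '--oem 1 --psm 7' for a config argument."""
    parts = []
    if oem is not None:
        if int(oem) not in range(0, 4):
            raise ValueError("oem must be between 0 and 3")
        parts.append(f"--oem {int(oem)}")
    if psm is not None:
        if int(psm) not in range(0, 14):
            raise ValueError("psm must be between 0 and 13")
        parts.append(f"--psm {int(psm)}")
    return " ".join(parts)
```

For a cropped single-line image, `build_config(psm=PSM.SINGLE_LINE, oem=1)` gives `--oem 1 --psm 7`, which can then be passed as `config=` to `pytesseract.image_to_string`.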

Installing and using pytesseract

Installing pytesseract

pip install pytesseract

Common usage

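import pytesseract
from PIL import Image

# Path to the local Tesseract executable
tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
pytesseract.pytesseract.tesseract_cmd = tesseract_cmd

# List the available language packs
print(pytesseract.get_languages(config=''))

# Image recognition
'''
	image_to_string(image, lang=None, config='', nice=0, output_type=Output.STRING, timeout=0) -> (bytes | str)

	image        Object or String - a PIL Image, a NumPy array, or the file path of the image to be processed by Tesseract. If you pass an object instead of a file path, pytesseract implicitly converts the image to RGB mode.
	lang         String - Tesseract language code string. Defaults to eng if not specified. Multi-language example: lang='eng+chi_sim'
	config       String - any additional custom configuration flags that are not available through the pytesseract function. Example: config='--psm 6'
	nice         Integer - modifies the scheduling priority (niceness) of the Tesseract process. Not supported on Windows; niceness applies to Unix-like processes.
	output_type  Class attribute - the type of the output, string by default. For the full list of supported types, see the definition of the pytesseract.Output class.
	timeout      Integer or Float - duration in seconds for the OCR processing, after which pytesseract terminates it and raises RuntimeError.
'''
# Example 1: pass an image opened with PIL.Image
print(pytesseract.image_to_string(Image.open('test.png')))

# Example 2: pass the image path directly, and handle a timeout
# (Aside: I did not notice any difference between the two styles. What follows is a guess:
# Tesseract recognizes black-and-white images well and colored content poorly. If the image is
# clear and already black and white, no processing is needed and the path can be passed directly.
# Most images are not that well behaved, so they need to be cleaned up with an image library such
# as PIL first and then passed in.)
try:
    # timeout=2: give up if recognition takes longer than 2 seconds
    print(pytesseract.image_to_string(r'C:\Users\liulintong\Desktop\test1.png', timeout=2, lang='chi_sim'))

    # image_to_data takes nearly the same arguments as image_to_string (plus one or two rarely
    # used ones) and returns much more detailed output; try it yourself
    print(pytesseract.image_to_data(r'C:\Users\liulintong\Desktop\test1.png', output_type='string', timeout=2, lang='chi_sim'))
except RuntimeError as timeout_error:
    # Tesseract processing was terminated
    print("Timed out after 2 seconds")
# pytesseract.image_to_string("test.png")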
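The detailed output of image_to_data is easier to work with as a dictionary than as a string. The sketch below filters words by confidence. It assumes the dictionary form returned with `output_type=pytesseract.Output.DICT`, which holds parallel lists under keys such as `text` and `conf`. The helper name `confident_words` and the threshold of 60 are arbitrary choices for illustration.

```python
def confident_words(data, min_conf=60.0):
    """Keep (word, confidence) pairs whose confidence is at least min_conf.

    `data` is expected to look like the dict returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT):
    parallel lists under keys such as "text" and "conf".
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        score = float(conf)  # conf may arrive as int, float, or str
        if text.strip() and score >= min_conf:
            words.append((text, score))
    return words
```

A possible call, under the same assumptions: `confident_words(pytesseract.image_to_data(Image.open('test.png'), lang='chi_sim', output_type=pytesseract.Output.DICT))`.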

Example: recognizing images by calling Tesseract through pytesseract

import pytesseract

# Path to the local Tesseract executable
tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
print(pytesseract.get_languages())
try:
    print(pytesseract.image_to_string(r'C:\Users\liulintong\Desktop\test1.png', timeout=2, lang='chi_sim'))  # time out after 2 seconds
    print(pytesseract.image_to_string(r'C:\Users\liulintong\Desktop\test2.png', timeout=2, lang='chi_sim'))
    print(pytesseract.image_to_string(r'C:\Users\liulintong\Desktop\test3.png', output_type='string', timeout=10, lang='chi_sim'))  # time out after 10 seconds
except RuntimeError as timeout_error:
    # Raised when any of the calls above exceeds its timeout
    print("OCR timed out")
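With a single try block, one slow image stops the remaining ones from being processed. A possible variation is to handle the timeout per image, as in the sketch below. `ocr_many` is a hypothetical helper; the `ocr` parameter exists only so the function can be exercised without Tesseract installed and defaults to `pytesseract.image_to_string`.

```python
def ocr_many(paths, ocr=None, lang="chi_sim", timeout=2):
    """Run OCR on each path; map path -> text, or None if that image timed out."""
    if ocr is None:
        import pytesseract  # imported lazily so the helper can be tested without it
        ocr = pytesseract.image_to_string
    results = {}
    for path in paths:
        try:
            results[path] = ocr(path, lang=lang, timeout=timeout)
        except RuntimeError:
            # pytesseract raises RuntimeError when the timeout expires
            results[path] = None
    return results
```

Images that time out show up as None in the result, and the others still return their text.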

Recognition results

The results are basically the same as those from the command line above.

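Note: testing showed that the recognized text above contains a duplicated portion (marked with a green box in the screenshot), and I have no idea why.
However, after opening the image with Image.open(imagePath) the problem did not occur. (I am just as baffled. Either way, I recommend passing the image in through Image rather than as a path.)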
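Opening the image with PIL also makes it possible to clean it up before recognition, following the earlier guess that Tesseract copes better with black-and-white input. The sketch below converts an image to grayscale and then to pure black and white. The threshold of 128 and both function names are assumptions for illustration, and whether this improves results depends on the image.

```python
def to_black_or_white(pixel, threshold=128):
    """Map one grayscale value (0-255) to pure black (0) or pure white (255)."""
    return 255 if pixel >= threshold else 0


def load_binarized(path, threshold=128):
    """Open an image with Pillow, convert it to grayscale, then binarize it."""
    from PIL import Image  # Pillow; imported lazily
    gray = Image.open(path).convert("L")
    return gray.point(lambda p: to_black_or_white(p, threshold))
```

The returned image can be passed straight to pytesseract, for example `pytesseract.image_to_string(load_binarized('test.png'), lang='chi_sim')`.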