Tesseract-OCR 3.05: Training Digit Recognition and Merging Multiple Training Runs

ruyulin

Article link: https://blog.csdn.net/ruyulin/article/details/89046148

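A recent project required photographing 3C codes with a handheld device and recognizing them, and we settled on Tesseract-OCR. I knew nothing about it at first. There are plenty of posts online, but following their steps either ended in errors or the posts turned out to be clickbait, which was frustrating. I am writing the process down here because I will probably need it again.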

My environment

Windows 10
JDK 1.8
Tesseract-OCR 3.05.02
Download: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.02-20180621.exe
jTessBoxEditor 2.2.0
Download: https://sourceforge.net/projects/vietocr/files/jTessBoxEditor

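Software installation

Tesseract-OCR 3.05

Installation is simple: double-click the installer and keep the default directory. Once it finishes, configure the environment variables.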

TESSDATA_PREFIX
C:\Program Files (x86)\Tesseract-OCR\tessdata
path
C:\Program Files (x86)\Tesseract-OCR

(Screenshot: TESSDATA_PREFIX configuration)
(Screenshot: Path environment variable configuration)
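
Before moving on, it can help to confirm that the two variables above actually took effect. The following is a minimal sketch, not part of the original workflow: the helper name check_tesseract_env is hypothetical, and it only checks that TESSDATA_PREFIX points to an existing directory and that the command-line tools used later in this post can be found on Path.

```python
import os
import shutil


def check_tesseract_env(env=None, which=shutil.which):
    """Return a list of problems; an empty list means the setup looks usable.

    check_tesseract_env is a hypothetical helper written for this post.
    """
    env = os.environ if env is None else env
    problems = []

    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(prefix):
        problems.append("TESSDATA_PREFIX does not point to a directory: " + prefix)

    # The training steps in this post call each of these executables by name.
    for tool in ("tesseract", "unicharset_extractor", "shapeclustering", "mftraining"):
        if which(tool) is None:
            problems.append(tool + " was not found on Path")
    return problems


if __name__ == "__main__":
    for line in check_tesseract_env() or ["environment looks fine"]:
        print(line)
```

Run it from a newly opened cmd window, since windows opened before the variables were changed keep the old values.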

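jTessBoxEditor 2.2.0 installation

Just unzip the archive. The tool needs a Java runtime; installing Java and setting its environment variables is simple, so I will not cover it here. Note that jTessBoxEditor will not start without Java.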

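Detailed steps

The screenshot below shows my directory after everything was finished. num.traineddata is the trained data file produced at the end.
The num-1, num-2 and num-3 folders hold the images to learn from. Why three folders? Training is an ongoing process: to keep improving recognition accuracy, new samples are added over time. num-1 holds the data for the first training run, num-2 the data for the second, and so on. The benefit is that box files from earlier runs never have to be corrected again. For each new run we only generate a tif from the new images, generate its box file, and correct that one; afterwards the results of all runs are merged into the final trained data file.
(Screenshot: directory after training is complete)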

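Preparing the training data

The project only needs to recognize digits, so all my training images contain digits.

(Screenshot: training data in the num-1 folder)

(Screenshot: training data in the num-2 folder)
(Screenshot: training data in the num-3 folder)

(Screenshot: directory structure with the training data in place)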

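Generating the tif files

Use jTessBoxEditor to generate the tif files. To make later steps easier, save the generated tif files in the Scan-OCR directory. After unzipping jTessBoxEditor, open its folder and double-click train.bat to start it.

In jTessBoxEditor, click Tools, then Merge TIFF, select all images in the num-1 folder and click Open.
Change the target directory, enter num.font.exp1.tif as the file name and click Save.

After saving, a message confirms that num.font.exp1.tif has been created. The new file now appears in the Scan-OCR directory.

That completes the tif file for the num-1 training data. Repeat the same steps for num-2 and num-3, saving them as num.font.exp2.tif and num.font.exp3.tif.
Directory structure after creating the tif files for all three folders:
(Screenshot: tif files created)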
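
The file names follow the pattern [lang].[fontname].exp[num], with the run number taken from the folder name. As a small sketch of that mapping (the helper training_basename is hypothetical, and the defaults "num" and "font" are simply the names used in this post):

```python
import re


def training_basename(folder, lang="num", fontname="font"):
    """Map a folder such as 'num-2' to the base name 'num.font.exp2'."""
    match = re.fullmatch(r".+-(\d+)", folder)
    if match is None:
        raise ValueError("folder name does not end in -<number>: " + folder)
    return "{}.{}.exp{}".format(lang, fontname, int(match.group(1)))


if __name__ == "__main__":
    for folder in ("num-1", "num-2", "num-3"):
        # The tif saved from each folder uses this base name plus ".tif".
        print(folder, "->", training_basename(folder) + ".tif")
```

Keeping the run number in the file name is what lets a fourth run be added later without touching the files from the first three.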

Generating the box files

Generate the three box files from a cmd prompt with these commands:

tesseract num.font.exp1.tif num.font.exp1 batch.nochop makebox
tesseract num.font.exp2.tif num.font.exp2 batch.nochop makebox
tesseract num.font.exp3.tif num.font.exp3 batch.nochop makebox

Command output:

D:\Scan-OCR>tesseract num.font.exp1.tif num.font.exp1 batch.nochop makebox
Tesseract Open Source OCR Engine v3.05.02 with Leptonica
Page 1
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 2
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 3
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 4
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 5
Warning. Invalid resolution 1 dpi. Using 70 instead.

D:\Scan-OCR>tesseract num.font.exp2.tif num.font.exp2 batch.nochop makebox
Tesseract Open Source OCR Engine v3.05.02 with Leptonica
Page 1
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 2
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 3
Warning. Invalid resolution 1 dpi. Using 70 instead.

D:\Scan-OCR>tesseract num.font.exp3.tif num.font.exp3 batch.nochop makebox
Tesseract Open Source OCR Engine v3.05.02 with Leptonica
Page 1
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 2
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 3
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 4
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 5
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 6
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 7
Warning. Invalid resolution 1 dpi. Using 70 instead.

D:\Scan-OCR>

After the commands finish, three box files have been generated.
(Screenshot: box files created)
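
Since the same command is repeated once per training run, it can be scripted. This is only a sketch under the assumption that tesseract is on Path and the script runs from the Scan-OCR directory; makebox_command and run_makebox are hypothetical helper names, while the arguments are exactly the ones used in the commands above.

```python
import subprocess


def makebox_command(basename):
    """Build the makebox command for one run, e.g. basename 'num.font.exp1'."""
    return ["tesseract", basename + ".tif", basename, "batch.nochop", "makebox"]


def run_makebox(basenames, runner=subprocess.run):
    """Run makebox for every base name and stop at the first failure."""
    for basename in basenames:
        runner(makebox_command(basename), check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_makebox(...) to execute them.
    for name in ("num.font.exp1", "num.font.exp2", "num.font.exp3"):
        print(" ".join(makebox_command(name)))
```

Be careful when re-running this for an earlier run: makebox writes the box file again, so corrections already saved in it would need to be redone.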

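Correcting characters and positions

Open each tif in jTessBoxEditor, correct the characters and their positions, and save.
(Screenshot: a tif file opened in jTessBoxEditor)

In the screenshot:
char is the recognized character
x, y, width and height give the character's position. We check and adjust two things:
1) whether the character was recognized correctly
2) whether the position is correct (in the screenshot, for example, the char value for the character 2 is right but its box is not; after adjustment it looks like this:)
(Screenshot: box for the character 2 after adjustment)
If anything was adjusted, save afterwards. The corrections for each training image are stored in the corresponding box file; open one in a text editor if you are curious.
This fine-tuning is tedious, repetitive work. Once every image has been adjusted and saved, move on to the next step.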
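
For anyone who does open a box file: each line holds a symbol followed by left, bottom, right, top and the page index, with coordinates measured from the bottom-left corner of the image. The sketch below parses one such line. The sample line and the image height of 70 are made up for illustration, and to_xywh assumes that the editor's y value is measured from the top edge, which I have not verified against jTessBoxEditor's source.

```python
from collections import namedtuple

Box = namedtuple("Box", "char left bottom right top page")


def parse_box_line(line):
    """Parse one box-file line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError("expected 6 fields, got {}: {!r}".format(len(parts), line))
    left, bottom, right, top, page = (int(value) for value in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def to_xywh(box, image_height):
    """Convert to x, y, width, height with y measured from the top edge (assumed)."""
    return (box.left, image_height - box.top, box.right - box.left, box.top - box.bottom)


if __name__ == "__main__":
    sample = parse_box_line("2 36 20 58 52 0")  # made-up sample line
    print(sample)
    print(to_xywh(sample, image_height=70))
```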

Generating the TR files

tesseract num.font.exp1.tif num.font.exp1 nobatch box.train
tesseract num.font.exp2.tif num.font.exp2 nobatch box.train
tesseract num.font.exp3.tif num.font.exp3 nobatch box.train

Command output:

D:\Scan-OCR>tesseract num.font.exp1.tif num.font.exp1 nobatch box.train
Tesseract Open Source OCR Engine v3.05.02 with Leptonica
Page 1
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
   Boxes read from boxfile:      11
   Found 11 good blobs.
Generated training data for 2 words
Page 2
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
   Boxes read from boxfile:      13
   Found 13 good blobs.
Generated training data for 3 words
Page 3
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
   Boxes read from boxfile:      12
   Found 12 good blobs.
Generated training data for 1 words
Page 4
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
   Boxes read from boxfile:      12
   Found 12 good blobs.
Generated training data for 1 words
Page 5
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
   Boxes read from boxfile:      12
   Found 12 good blobs.
Generated training data for 3 words

D:\Scan-OCR>tesseract num.font.exp2.tif num.font.exp2 nobatch box.train
Tesseract Open Source OCR Engine v3.05.02 with Leptonica
Page 1
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
   Boxes read from boxfile:      12
   Found 12 good blobs.
Generated training data for 2 words
Page 2
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
   Boxes read from boxfile:      12
   Found 12 good blobs.
Generated training data for 1 words
Page 3
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
   Boxes read from boxfile:      12
   Found 12 good blobs.
Generated training data for 1 words

D:\Scan-OCR>tesseract num.font.exp3.tif num.font.exp3 nobatch box.train
Tesseract Open Source OCR Engine v3.05.02 with Leptonica
Page 1
Warning. Invalid resolution 1 dpi. Using 70 instead.
row xheight=78.5, but median xheight = 11.5
APPLY_BOXES:
   Boxes read from boxfile:      13
   Found 13 good blobs.
Generated training data for 2 words
Page 2
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
   Boxes read from boxfile:      12
   Found 12 good blobs.
Generated training data for 1 words
Page 3
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
   Boxes read from boxfile:      12
   Found 12 good blobs.
Generated training data for 2 words
Page 4
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
   Boxes read from boxfile:      12
   Found 12 good blobs.
Generated training data for 1 words
Page 5
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
   Boxes read from boxfile:       6
   Found 6 good blobs.
Generated training data for 1 words
Page 6
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
   Boxes read from boxfile:       6
   Found 6 good blobs.
Generated training data for 1 words
Page 7
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
   Boxes read from boxfile:      14
   Found 14 good blobs.
Generated training data for 3 words

D:\Scan-OCR>

After the commands finish, three tr files have been generated.
(Screenshot: tr files)
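
In the output above, every page reports the same number for "Boxes read from boxfile" and "Found ... good blobs". With more pages this is tedious to check by eye, so here is a sketch that scans saved console output and lists pages where the two numbers differ. The function name mismatched_pages is hypothetical, and treating a mismatch as a sign that some boxes need another look is my assumption rather than something stated by Tesseract.

```python
import re


def mismatched_pages(log_text):
    """Return (page, boxes_read, good_blobs) for pages where the counts differ."""
    page = None
    boxes = None
    mismatches = []
    for raw in log_text.splitlines():
        line = raw.strip()
        match = re.fullmatch(r"Page (\d+)", line)
        if match:
            page, boxes = int(match.group(1)), None
            continue
        match = re.fullmatch(r"Boxes read from boxfile:\s+(\d+)", line)
        if match:
            boxes = int(match.group(1))
            continue
        match = re.fullmatch(r"Found (\d+) good blobs\.", line)
        if match and boxes is not None:
            good = int(match.group(1))
            if good != boxes:
                mismatches.append((page, boxes, good))
    return mismatches


if __name__ == "__main__":
    sample = """Page 1
APPLY_BOXES:
   Boxes read from boxfile:      11
   Found 11 good blobs.
Generated training data for 2 words
"""
    print(mismatched_pages(sample))
```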

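Creating the font properties file

Create a font properties file named font_properties. Each line has the format:

<fontname> <italic> <bold> <fixed> <serif> <fraktur>

fontname is the font name and must match the fontname part of [lang].[fontname].exp[num].box.
<italic>, <bold>, <fixed>, <serif> and <fraktur> are each 1 or 0, indicating whether the font has that property.
In the Scan-OCR directory, create a file named font_properties, open it in Notepad and enter the following: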

font 0 0 0 0 0

All values are 0 here, meaning the font is not bold, italic, and so on. Note that the font_properties file has no file extension.
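
Because the font name in font_properties has to match the one in the box file names, it can be derived from them instead of typed by hand. A minimal sketch follows; fontname_from_box and font_properties_line are hypothetical helpers, and the five flags follow the order italic, bold, fixed, serif, fraktur described in the Tesseract 3.x training documentation.

```python
def fontname_from_box(filename):
    """Extract the fontname from '[lang].[fontname].exp[num].box'."""
    parts = filename.split(".")
    if len(parts) != 4 or not parts[2].startswith("exp") or parts[3] != "box":
        raise ValueError("unexpected box file name: " + filename)
    return parts[1]


def font_properties_line(fontname, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Build one font_properties line; every flag must be 0 or 1."""
    flags = (italic, bold, fixed, serif, fraktur)
    if any(flag not in (0, 1) for flag in flags):
        raise ValueError("flags must be 0 or 1")
    return " ".join([fontname] + [str(flag) for flag in flags])


if __name__ == "__main__":
    boxes = ["num.font.exp1.box", "num.font.exp2.box", "num.font.exp3.box"]
    # All three runs share one font name, so a single line is enough.
    for name in sorted({fontname_from_box(box) for box in boxes}):
        print(font_properties_line(name))
```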

Extracting characters from all files

Run the following command to generate the unicharset file:

D:\Scan-OCR>unicharset_extractor num.font.exp1.box num.font.exp2.box num.font.exp3.box
Extracting unicharset from num.font.exp1.box
Extracting unicharset from num.font.exp2.box
Extracting unicharset from num.font.exp3.box
Wrote unicharset file ./unicharset.
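
The unicharset contains every symbol that appears in the box files, so any misrecognized character left uncorrected ends up in it. The shapeclustering output later in this post lists ?, F, 垄 and ~ next to the digits; if such symbols are not intended for a digit-only model, they can be traced back to the box files. The sketch below lists symbols outside an allowed set; unexpected_symbols is a hypothetical helper and the sample lines are made up.

```python
from collections import Counter


def symbol_counts(box_lines):
    """Count how often each symbol (first field of a box line) occurs."""
    counts = Counter()
    for line in box_lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def unexpected_symbols(box_lines, allowed="0123456789"):
    """Return the symbols that are not in the allowed character set, sorted."""
    allowed_set = set(allowed)
    return sorted(symbol for symbol in symbol_counts(box_lines) if symbol not in allowed_set)


if __name__ == "__main__":
    sample = ["1 10 5 20 30 0", "~ 22 5 30 30 0", "7 32 5 44 30 0", "F 46 5 58 30 0"]
    print(unexpected_symbols(sample))
```

To check a real file, pass its lines in, for example open("num.font.exp1.box", encoding="utf-8").read().splitlines().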

Generating the shape file

Run the following command to generate the shapetable file:

D:\Scan-OCR>shapeclustering -F font_properties -U unicharset num.font.exp1.tr num.font.exp2.tr num.font.exp3.tr
Reading num.font.exp1.tr ...
Reading num.font.exp2.tr ...
Reading num.font.exp3.tr ...
Bad properties for index 3, char 3: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 4, char 0: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 5, char 9: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 6, char 4: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 7, char 1: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 8, char 2: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 9, char 8: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 10, char 6: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 11, char 7: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 12, char 5: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 13, char ?: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 14, char F: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 15, char 垄: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 16, char ~: 0,255 0,255 0,0 0,0 0,0
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4 5 6 7 8 9 10 11 12 13
Stopped with 0 merged, min dist 0.100629
Master shape_table:Number of shapes = 14 max unichars = 1 number with multiple unichars = 0
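
The point of keeping one set of files per run is that unicharset_extractor, shapeclustering and mftraining each accept all runs in a single call. As a sketch, the argument lists can be built from the run base names, so that adding a fourth run only means adding one name. merged_training_commands is a hypothetical helper; the flags are copied from the commands shown in this post.

```python
def merged_training_commands(basenames, font_properties="font_properties",
                             unicharset="unicharset"):
    """Build the three multi-run commands from base names like 'num.font.exp1'."""
    boxes = [name + ".box" for name in basenames]
    trs = [name + ".tr" for name in basenames]
    return [
        ["unicharset_extractor"] + boxes,
        ["shapeclustering", "-F", font_properties, "-U", unicharset] + trs,
        ["mftraining", "-F", font_properties, "-U", unicharset, "-O", unicharset] + trs,
    ]


if __name__ == "__main__":
    runs = ["num.font.exp{}".format(number) for number in (1, 2, 3)]
    for command in merged_training_commands(runs):
        print(" ".join(command))
```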

Generating the clustered character feature files

D:\Scan-OCR>mftraining -F font_properties -U unicharset -O unicharset num.font.exp1.tr num.font.exp2.tr num.font.exp3.tr
Read shape table shapetable of 14 shapes
Reading num.font.exp1.tr ...
Reading num.font.exp2.tr ...
Reading num.font.exp3.tr ...
Bad properties for index 3, char 3: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 4, char 0: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 5, char 9: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 6, char 4: 0,255 0,255 0,0 0,0 0,0
Bad propert
