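A recent project required photographing 3C codes with a handheld device and recognizing them, and we eventually settled on Tesseract-OCR. I knew nothing about it at first. There are plenty of posts online, but following their steps either produced errors or the posts turned out to be clickbait, which was frustrating. I am writing the whole process down here in case I need it again.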
My environment
- Windows 10
- JDK 1.8
- Tesseract-OCR 3.05.02
Download: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.02-20180621.exe
- jTessBoxEditor 2.2.0
Download: https://sourceforge.net/projects/vietocr/files/jTessBoxEditor
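Software installation
Installing Tesseract-OCR 3.05.02
Installation is straightforward: double-click the installer and keep the default directory. After installation, configure the environment variables.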
TESSDATA_PREFIX = C:\Program Files (x86)\Tesseract-OCR\tessdata
Path (add this entry) = C:\Program Files (x86)\Tesseract-OCR
[Screenshot: TESSDATA_PREFIX configuration]
[Screenshot: Path configuration]
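A quick way to confirm the two variables took effect is to inspect them from a script, started from a new cmd window so the updated environment is picked up. The following is only a minimal sketch: the function name check_tesseract_env and the default install directory constant are my own illustration, and the check only compares strings, it does not verify that Tesseract itself runs.

```python
import os

# Assumption: the default install directory used in this article.
DEFAULT_INSTALL_DIR = r"C:\Program Files (x86)\Tesseract-OCR"


def _norm(path):
    """Normalize a Windows path for comparison (case-insensitive, no trailing slash)."""
    return path.strip().rstrip("\\/").lower()


def check_tesseract_env(env=None, install_dir=DEFAULT_INSTALL_DIR):
    """Return a list of problems; an empty list means both settings look consistent."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("TESSDATA_PREFIX", "").strip():
        problems.append("TESSDATA_PREFIX is not set")
    # Windows separates PATH entries with ";".
    path_entries = {_norm(p) for p in env.get("PATH", "").split(";") if p.strip()}
    if _norm(install_dir) not in path_entries:
        problems.append(f"{install_dir} is not on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_tesseract_env():
        print("Problem:", problem)
```

Passing the environment in as a dictionary keeps the check easy to try out without touching the real system settings.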
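Installing jTessBoxEditor 2.2.0
Just unzip the archive. The tool needs a Java runtime. Installing Java and setting its environment variables is simple and is not covered here, but note that jTessBoxEditor will not start without Java.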
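Steps
The screenshot below shows my finished working directory; num.traineddata is the trained data file that the process finally produces.
The num-1, num-2 and num-3 folders hold the images to learn from. Why three folders? Training is an ongoing process aimed at improving recognition accuracy over time: num-1 holds the data for the first training round, num-2 the data for the second, and so on. The benefit is that box files from earlier rounds do not have to be corrected again. For each new round we only merge the new images into a tif, generate its box file and correct that one; the results are then combined to produce the final trained file.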
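Preparing the training data
Everything the project needs to recognize is numeric, so all my training images contain digits.
Training data in the num-1 folder
Training data in the num-2 folder
Training data in the num-3 folder
Directory structure once the data is ready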
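Generating the tif files
Use jTessBoxEditor to generate the tif files. To simplify the later steps, save the generated tif files in the Scan-OCR directory. After unzipping the jTessBoxEditor archive, go into its folder and double-click train.bat to start the tool.
In jTessBoxEditor, click Tools, then Merge TIFF, select all images in the num-1 folder, and click Open.
Change the save directory, enter num.font.exp1.tif as the file name, and click Save.
After saving, a message confirms that num.font.exp1.tif was created. The new file appears in the Scan-OCR directory.
That completes the tif file for the training data in num-1. Repeat the same steps for num-2 and num-3, saving them as num.font.exp2.tif and num.font.exp3.tif.
Directory structure after the tif files for all three folders have been created: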
Generating the box files
Generate the three box files from the command line (cmd) with these commands:
tesseract num.font.exp1.tif num.font.exp1 batch.nochop makebox
tesseract num.font.exp2.tif num.font.exp2 batch.nochop makebox
tesseract num.font.exp3.tif num.font.exp3 batch.nochop makebox
Command output:
D:\Scan-OCR>tesseract num.font.exp1.tif num.font.exp1 batch.nochop makebox
Tesseract Open Source OCR Engine v3.05.02 with Leptonica
Page 1
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 2
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 3
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 4
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 5
Warning. Invalid resolution 1 dpi. Using 70 instead.
D:\Scan-OCR>tesseract num.font.exp2.tif num.font.exp2 batch.nochop makebox
Tesseract Open Source OCR Engine v3.05.02 with Leptonica
Page 1
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 2
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 3
Warning. Invalid resolution 1 dpi. Using 70 instead.
D:\Scan-OCR>tesseract num.font.exp3.tif num.font.exp3 batch.nochop makebox
Tesseract Open Source OCR Engine v3.05.02 with Leptonica
Page 1
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 2
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 3
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 4
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 5
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 6
Warning. Invalid resolution 1 dpi. Using 70 instead.
Page 7
Warning. Invalid resolution 1 dpi. Using 70 instead.
D:\Scan-OCR>
After the commands finish, three box files have been generated.
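Since the three makebox commands differ only in the experiment number, they can also be generated from the [lang].[fontname].exp[num] naming pattern. This is a rough sketch rather than part of the original workflow: the helper names training_base and makebox_command are hypothetical, and the subprocess call is left commented out because it needs tesseract on the PATH and the tif files in the current directory.

```python
import subprocess  # only needed if the commented-out call below is enabled


def training_base(lang, fontname, num):
    """Build the base name shared by the .tif, .box and .tr files of one round."""
    return f"{lang}.{fontname}.exp{num}"


def makebox_command(base):
    """Return the argument list for the makebox step used in this article."""
    return ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"]


commands = [makebox_command(training_base("num", "font", n)) for n in (1, 2, 3)]

for cmd in commands:
    print(" ".join(cmd))
    # Uncomment to actually run the step from the Scan-OCR directory:
    # subprocess.run(cmd, check=True)
```

When a fourth training round is added later, extending the tuple to (1, 2, 3, 4) is enough to cover it.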
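Correcting characters and positions
Open each tif in jTessBoxEditor, correct the characters and their positions, and save.
In the screenshot:
char is the recognized character
x, y, width and height are the character's position, which is what we fine-tune
1) Is the character recognized correctly?
2) Is the position correct? (For example, for the character 2 in the screenshot, char is correct but the position is not; after adjustment it looks like this:)
If anything needs adjusting, save after making the change. Note that the corrected information for every training image is stored in the corresponding box file; open one if you are curious.
The fine-tuning is tedious, repetitive work. Work through all the images, save, and then move on to the next step.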
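Because manual correction is easy to get wrong, a small script can flag box entries that still look off before training continues. This is a sketch under a few assumptions: as far as I know, each line of a Tesseract 3.x box file has the form "char left bottom right top page" with the origin at the bottom-left corner (the editor shows the same box as x, y, width, height); the names Box, parse_box_line and find_suspicious are my own; and the sample lines are made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box-file line of the assumed form 'char left bottom right top page'."""
    char, left, bottom, right, top, page = line.rstrip("\r\n").rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


def find_suspicious(boxes, allowed="0123456789"):
    """Return boxes whose character is unexpected or whose rectangle is empty."""
    allowed_chars = set(allowed)
    return [
        b for b in boxes
        if b.char not in allowed_chars or b.right <= b.left or b.top <= b.bottom
    ]


# Hypothetical sample lines, not taken from the real box files.
sample = ["2 35 10 52 41 0", "F 60 10 75 41 0", "7 80 10 80 41 0"]
boxes = [parse_box_line(line) for line in sample]
for b in find_suspicious(boxes):
    print("check page", b.page, "char", repr(b.char), b[1:5])
```

For a digits-only project like this one, any character outside 0-9 that survives into the box file is worth a second look in jTessBoxEditor.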
Generating the tr files
tesseract num.font.exp1.tif num.font.exp1 nobatch box.train
tesseract num.font.exp2.tif num.font.exp2 nobatch box.train
tesseract num.font.exp3.tif num.font.exp3 nobatch box.train
Command output:
D:\Scan-OCR>tesseract num.font.exp1.tif num.font.exp1 nobatch box.train
Tesseract Open Source OCR Engine v3.05.02 with Leptonica
Page 1
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
Boxes read from boxfile: 11
Found 11 good blobs.
Generated training data for 2 words
Page 2
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
Boxes read from boxfile: 13
Found 13 good blobs.
Generated training data for 3 words
Page 3
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
Boxes read from boxfile: 12
Found 12 good blobs.
Generated training data for 1 words
Page 4
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
Boxes read from boxfile: 12
Found 12 good blobs.
Generated training data for 1 words
Page 5
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
Boxes read from boxfile: 12
Found 12 good blobs.
Generated training data for 3 words
D:\Scan-OCR>tesseract num.font.exp2.tif num.font.exp2 nobatch box.train
Tesseract Open Source OCR Engine v3.05.02 with Leptonica
Page 1
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
Boxes read from boxfile: 12
Found 12 good blobs.
Generated training data for 2 words
Page 2
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
Boxes read from boxfile: 12
Found 12 good blobs.
Generated training data for 1 words
Page 3
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
Boxes read from boxfile: 12
Found 12 good blobs.
Generated training data for 1 words
D:\Scan-OCR>tesseract num.font.exp3.tif num.font.exp3 nobatch box.train
Tesseract Open Source OCR Engine v3.05.02 with Leptonica
Page 1
Warning. Invalid resolution 1 dpi. Using 70 instead.
row xheight=78.5, but median xheight = 11.5
APPLY_BOXES:
Boxes read from boxfile: 13
Found 13 good blobs.
Generated training data for 2 words
Page 2
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
Boxes read from boxfile: 12
Found 12 good blobs.
Generated training data for 1 words
Page 3
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
Boxes read from boxfile: 12
Found 12 good blobs.
Generated training data for 2 words
Page 4
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
Boxes read from boxfile: 12
Found 12 good blobs.
Generated training data for 1 words
Page 5
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
Boxes read from boxfile: 6
Found 6 good blobs.
Generated training data for 1 words
Page 6
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
Boxes read from boxfile: 6
Found 6 good blobs.
Generated training data for 1 words
Page 7
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
Boxes read from boxfile: 14
Found 14 good blobs.
Generated training data for 3 words
D:\Scan-OCR>
After the commands finish, three tr files have been generated.
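In the output above, the number after "Boxes read from boxfile" equals the number of "good blobs" on every page. My understanding is that a difference between the two would mean some boxes could not be matched to the image, so that page's box file deserves another look. The sketch below scans saved console output for such pages; the function name blob_mismatches is hypothetical, and the second page of the sample is an invented mismatch, not real output.

```python
import re


def blob_mismatches(log_text):
    """Return (page, boxes_read, good_blobs) for pages where the two counts differ."""
    mismatches = []
    page = None
    boxes_read = None
    for raw in log_text.splitlines():
        line = raw.strip()
        m = re.match(r"Page (\d+)$", line)
        if m:
            page = int(m.group(1))
            boxes_read = None
            continue
        m = re.match(r"Boxes read from boxfile:\s*(\d+)", line)
        if m:
            boxes_read = int(m.group(1))
            continue
        m = re.match(r"Found (\d+) good blobs", line)
        if m and boxes_read is not None:
            good = int(m.group(1))
            if good != boxes_read:
                mismatches.append((page, boxes_read, good))
    return mismatches


# Page 1 is copied from the output above; page 2 is an invented mismatch.
sample_log = """Page 1
Warning. Invalid resolution 1 dpi. Using 70 instead.
APPLY_BOXES:
 Boxes read from boxfile: 11
 Found 11 good blobs.
Generated training data for 2 words
Page 2
APPLY_BOXES:
 Boxes read from boxfile: 13
 Found 12 good blobs.
"""
print(blob_mismatches(sample_log))
```

The console output can be captured for this by redirecting the command's output to a text file and reading that file into log_text.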
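Creating the font properties file
Create a font properties file named font_properties. Each line of the file has this format:
<fontname> <italic> <bold> <fixed> <serif> <fraktur>
Here fontname is the font name and must match the fontname part of [lang].[fontname].exp[num].box.
<italic>, <bold>, <fixed>, <serif> and <fraktur> each take the value 1 or 0, indicating whether the font has that property.
In the Scan-OCR directory, create a file named font_properties, open it in Notepad, and enter the following content:
font 0 0 0 0 0
All values are 0 here, meaning the font is not bold, not italic, and so on. Note that the font_properties file has no extension.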
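Notepad tends to append .txt when saving, so writing the file from a script is one way to make sure it really has no extension. This is a minimal sketch assuming the usual layout of one line per font: the font name followed by five 0/1 flags for italic, bold, fixed, serif and fraktur. The helper names font_properties_line and write_font_properties are my own.

```python
def font_properties_line(fontname, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Format one font_properties entry: the font name plus five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return fontname + " " + " ".join("1" if flag else "0" for flag in flags)


def write_font_properties(path, fontname, **flags):
    """Write a single-font font_properties file (the file name has no extension)."""
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write(font_properties_line(fontname, **flags) + "\n")


print(font_properties_line("font"))
# To create the file in the current directory:
# write_font_properties("font_properties", "font")
```

Keyword arguments keep the flag order out of the caller's hands, which avoids accidentally swapping, say, bold and italic.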
Extracting characters from all files
Run this command to generate the unicharset file:
D:\Scan-OCR>unicharset_extractor num.font.exp1.box num.font.exp2.box num.font.exp3.box
Extracting unicharset from num.font.exp1.box
Extracting unicharset from num.font.exp2.box
Extracting unicharset from num.font.exp3.box
Wrote unicharset file ./unicharset.
Generating the shape file
Run this command to generate the shapetable file:
D:\Scan-OCR>shapeclustering -F font_properties -U unicharset num.font.exp1.tr num.font.exp2.tr num.font.exp3.tr
Reading num.font.exp1.tr ...
Reading num.font.exp2.tr ...
Reading num.font.exp3.tr ...
Bad properties for index 3, char 3: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 4, char 0: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 5, char 9: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 6, char 4: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 7, char 1: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 8, char 2: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 9, char 8: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 10, char 6: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 11, char 7: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 12, char 5: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 13, char ?: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 14, char F: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 15, char 垄: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 16, char ~: 0,255 0,255 0,0 0,0 0,0
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4 5 6 7 8 9 10 11 12 13
Stopped with 0 merged, min dist 0.100629
Master shape_table:Number of shapes = 14 max unichars = 1 number with multiple unichars = 0
Generating the clustered character feature files
D:\Scan-OCR>mftraining -F font_properties -U unicharset -O unicharset num.font.exp1.tr num.font.exp2.tr num.font.exp3.tr
Read shape table shapetable of 14 shapes
Reading num.font.exp1.tr ...
Reading num.font.exp2.tr ...
Reading num.font.exp3.tr ...
Bad properties for index 3, char 3: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 4, char 0: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 5, char 9: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 6, char 4: 0,255 0,255 0,0 0,0 0,0
Bad propert