Training a Tesseract Font Library from Sample Images

Accccccccv

Last modified 2024-01-09 09:45:24

Categories: Image Recognition, tesseract, Machine Learning. Tags: other, java

First published 2021-10-31 15:45:15

Article link: https://blog.csdn.net/weixin_47142014/article/details/120991287

Double-click train.bat in the jTessBoxEditor folder to start the tool.

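(1) Open jTessBoxEditor and choose Tools->Merge TIFF from the menu. Go to the folder containing the training samples and select the sample images to include in training (hold Shift or Ctrl while clicking to select several).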

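(2) After selecting the images, click Open and save the merged file in the current folder as "chi.test.exp0.tif".
The tif file name follows the format [lang].[fontname].exp[num].tif,
where lang is the language, fontname is the font name, and num is a number of your choosing.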
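
The naming rule above is easy to get wrong by one dot, so it can help to check a file name before training. The following is a minimal sketch, assuming the three parts contain no dots themselves; `parse_training_name` and `TrainingImageName` are hypothetical helper names, not part of Tesseract or jTessBoxEditor.

```python
import re
from typing import NamedTuple

# Pattern for [lang].[fontname].exp[num].tif as described above.
# Assumption: lang and fontname contain no dots, num is a non-negative integer.
_NAME_RE = re.compile(r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.tif$")


class TrainingImageName(NamedTuple):
    lang: str
    fontname: str
    num: int


def parse_training_name(filename: str) -> TrainingImageName:
    """Split a training image name into (lang, fontname, num) or raise ValueError."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(
            f"{filename!r} does not match [lang].[fontname].exp[num].tif"
        )
    return TrainingImageName(match["lang"], match["font"], int(match["num"]))


if __name__ == "__main__":
    print(parse_training_name("chi.test.exp0.tif"))
```

For the file used in this article, the parts come out as lang "chi", fontname "test" and num 0.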

Open a command prompt, change to that folder, and enter the following command:

tesseract chi.test.exp0.tif chi.test.exp0 -l chi_sim -psm 7 batch.nochop makebox

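In jTessBoxEditor, use Box Editor->Open to select the box file that was just generated, then correct any misrecognized characters or misplaced boxes and save.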
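
Before or after editing in the GUI, the box file can also be inspected from a script. The sketch below assumes the plain-text box format used by Tesseract 3.x, one character per line: the symbol followed by left, bottom, right, top and page number, with the origin at the bottom-left corner of the image. `Box`, `parse_box_lines` and `degenerate_boxes` are hypothetical helper names.

```python
from typing import Iterable, List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines: Iterable[str]) -> List[Box]:
    """Parse box-file lines of the form '<symbol> <left> <bottom> <right> <top> <page>'."""
    boxes = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split from the right so the symbol is whatever precedes the five numbers.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            raise ValueError(f"malformed box line: {line!r}")
        symbol, *numbers = parts
        boxes.append(Box(symbol, *map(int, numbers)))
    return boxes


def degenerate_boxes(boxes: Iterable[Box]) -> List[Box]:
    """Return boxes with zero or negative width or height, which usually need fixing."""
    return [b for b in boxes if b.right <= b.left or b.top <= b.bottom]


if __name__ == "__main__":
    with open("chi.test.exp0.box", encoding="utf-8") as handle:
        parsed = parse_box_lines(handle)
    print(len(parsed), "boxes,", len(degenerate_boxes(parsed)), "degenerate")
```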

(1) Run the following command. When it finishes, a font_properties file is created in the current directory:

echo test 0 0 0 0 0 >font_properties
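
Each line of font_properties holds a font name followed by five 0/1 flags, in the order italic, bold, fixed, serif, fraktur; the font name must match the fontname part of the training file name ("test" here). As a small sketch of how such a line could be produced from a script, with `font_properties_line` being a hypothetical helper name:

```python
def font_properties_line(
    fontname: str,
    italic: bool = False,
    bold: bool = False,
    fixed: bool = False,
    serif: bool = False,
    fraktur: bool = False,
) -> str:
    """Build one font_properties line: '<fontname> <italic> <bold> <fixed> <serif> <fraktur>'."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + [str(int(flag)) for flag in flags])


if __name__ == "__main__":
    # Same content as the echo command above, written from Python.
    with open("font_properties", "w", encoding="utf-8") as handle:
        handle.write(font_properties_line("test") + "\n")
```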

(2) Generate the .tr training file:

tesseract chi.test.exp0.tif chi.test.exp0 nobatch box.train

(3) Generate the character set file:

unicharset_extractor chi.test.exp0.box

(4) Generate the shape file:

shapeclustering -F font_properties -U unicharset -O chi.unicharset chi.test.exp0.tr

(5) Generate the clustered character feature files:

mftraining -F font_properties -U unicharset -O chi.unicharset chi.test.exp0.tr

(6) Generate the character normalization feature file:

cntraining chi.test.exp0.tr

(7) Rename the files:

rename normproto chi.normproto
rename inttemp chi.inttemp
rename pffmtable chi.pffmtable
rename shapetable chi.shapetable
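
The `rename` command is specific to the Windows command prompt. For other platforms, or when scripting the whole process, the same step could be sketched in Python as follows; `add_lang_prefix` and `OUTPUT_NAMES` are hypothetical names, and the sketch assumes the four files are in the working directory.

```python
from pathlib import Path
from typing import List

# The four training outputs renamed in step (7).
OUTPUT_NAMES = ("normproto", "inttemp", "pffmtable", "shapetable")


def add_lang_prefix(workdir: str = ".", lang: str = "chi") -> List[str]:
    """Rename each output file to '<lang>.<name>' and return the new names."""
    folder = Path(workdir)
    missing = [name for name in OUTPUT_NAMES if not (folder / name).exists()]
    if missing:
        # Check everything first so a failure does not leave a half-renamed folder.
        raise FileNotFoundError(f"missing training outputs: {missing}")
    renamed = []
    for name in OUTPUT_NAMES:
        target = folder / f"{lang}.{name}"
        (folder / name).rename(target)
        renamed.append(target.name)
    return renamed


if __name__ == "__main__":
    print(add_lang_prefix(".", "chi"))
```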

(8) Combine the training files:

combine_tessdata chi.

Finally, copy the generated "chi.traineddata" language pack into the tessdata folder under the Tesseract-OCR installation directory, and it is ready to use.
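
Because every command is derived from the same lang, fontname and num, the sequence can be generated instead of typed by hand. The sketch below only rebuilds the command lines of steps (2) to (6) and step (8) with the arguments used in this article; the font_properties and rename steps are not covered. `training_commands`, `combine_command` and `run_all` are hypothetical helper names, and `run_all` assumes the Tesseract training tools are on the PATH.

```python
import subprocess
from typing import List


def training_commands(lang: str = "chi", font: str = "test", num: int = 0) -> List[List[str]]:
    """Command lines for steps (2)-(6), in order, as argument lists."""
    base = f"{lang}.{font}.exp{num}"
    cluster_args = ["-F", "font_properties", "-U", "unicharset",
                    "-O", f"{lang}.unicharset", f"{base}.tr"]
    return [
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", f"{base}.box"],
        ["shapeclustering", *cluster_args],
        ["mftraining", *cluster_args],
        ["cntraining", f"{base}.tr"],
    ]


def combine_command(lang: str = "chi") -> List[str]:
    """Command line for step (8); the trailing dot is part of the argument."""
    return ["combine_tessdata", f"{lang}."]


def run_all(commands: List[List[str]], workdir: str = ".") -> None:
    """Run each command in workdir, stopping at the first failure."""
    for command in commands:
        print("running:", " ".join(command))
        subprocess.run(command, cwd=workdir, check=True)


if __name__ == "__main__":
    # Print the commands only; call run_all(...) to actually execute them.
    for command in training_commands() + [combine_command()]:
        print(" ".join(command))
```

Keeping the commands as lists rather than a single string avoids shell quoting problems when a path contains spaces.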