Pitfalls I hit when fine-tuning Tesseract-OCR LSTM

星辰辰大海

Tags: Tesseract LSTM

Article link: https://blog.csdn.net/qq_19313495/article/details/102977915

My environment:

Windows 10
Tesseract 4.1.0
jTessBoxEditor 2.2.1

For the training procedure I followed this article:

https://blog.csdn.net/Hu_helloworld/article/details/100923215

Pitfall 1. makebox:

After running the following command:

tesseract nml.num.exp0.tif nml.num.exp0 -l eng --psm 6 batch.nochop makebox

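the box file only contained information for the first image in the tif. It turned out that I was apparently using jTessBoxEditor's Merge TIFF the wrong way. I had originally merged several jpg files directly into one tif. Once I converted every jpg to tif first (with OpenCV) and then merged those tif files with Merge TIFF, the text in all of the images was recognized.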
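The jpg-to-tif conversion step can be sketched roughly as follows. This is a minimal sketch, not the exact script I used: the helper names `tif_path_for` and `convert_to_tif` and the output folder are made up for illustration, and it assumes the `opencv-python` package is installed. Merging the resulting files is still done in jTessBoxEditor afterwards.

```python
from pathlib import Path


def tif_path_for(src_path, out_dir):
    """Return the .tif path inside out_dir that corresponds to one source image."""
    return Path(out_dir) / (Path(src_path).stem + ".tif")


def convert_to_tif(src_paths, out_dir):
    """Re-encode each image as a single-page TIFF and return the new paths."""
    import cv2  # imported lazily so the path helper works without OpenCV

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for src in src_paths:
        image = cv2.imread(str(src))
        if image is None:  # cv2.imread returns None for unreadable files
            raise ValueError(f"could not read {src}")
        dst = tif_path_for(src, out_dir)
        if not cv2.imwrite(str(dst), image):  # format is chosen from the extension
            raise IOError(f"could not write {dst}")
        written.append(dst)
    return written
```

For example, `convert_to_tif(sorted(Path("jpg_samples").glob("*.jpg")), "tif_samples")` would write one tif per jpg (both folder names are placeholders), and those tif files are then the input for Merge TIFF.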

Pitfall 2. Compute CTC targets failed!

After running the following command:

lstmtraining --model_output="F:\Test\AMyWork\ImgSampleLib\nomal\samples\CTCCB24\output\output" --continue_from="F:\Test\AMyWork\ImgSampleLib\nomal\samples\CTCCB24\eng.lstm" ^
--train_listfile="F:\Test\AMyWork\ImgSampleLib\nomal\samples\CTCCB24\eng.training_files.txt" --traineddata="F:\Test\AMyWork\ImgSampleLib\nomal\samples\CTCCB24\eng.traineddata" ^
--debug_interval -1 --max_iterations 2000

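it printed "Compute CTC targets failed!" in an endless loop. After a round of Googling, I found that the data in the box file was in the wrong format. In LSTM training mode, the box file only accepts boxes that cover a whole text line, not a line split into one box per character. So for all the entries that belong to one text line, change the box coordinates from the single character to the whole line, and also finish the line with a \t entry, like this: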

1 148 127 268 151 0
2 148 127 268 151 0
3 148 127 268 151 0
4 148 127 268 151 0
5 148 127 268 151 0
6 148 127 268 151 0
7 148 127 268 151 0
8 148 127 268 151 0
     148 127 268 151 0

The symbol on the last line is \t, i.e. a Tab character. It marks the end of the text line.
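If you already have a per-character box file from makebox, the rewrite described above can be scripted. The following is only a sketch under assumptions: the function name `to_lstm_box_lines` is made up, and the caller is assumed to pass in exactly the entries that belong to one text line (grouping entries into lines is not handled here). Each entry is expected in the usual `symbol left bottom right top page` layout.

```python
def to_lstm_box_lines(char_lines):
    """Rewrite one text line's per-character boxes as line-level LSTM boxes."""
    entries = []
    for line in char_lines:
        # Split from the right so the symbol stays intact in the first field.
        symbol, left, bottom, right, top, page = line.rsplit(None, 5)
        entries.append((symbol, int(left), int(bottom), int(right), int(top), page))
    if not entries:
        return []

    # Box coordinates have their origin at the bottom-left, so the union box
    # is the smallest left/bottom and the largest right/top.
    left = min(e[1] for e in entries)
    bottom = min(e[2] for e in entries)
    right = max(e[3] for e in entries)
    top = max(e[4] for e in entries)
    box = f"{left} {bottom} {right} {top} {entries[0][5]}"

    out = [f"{symbol} {box}" for symbol, *_ in entries]
    out.append(f"\t {box}")  # Tab entry that ends the text line
    return out
```

Every character keeps its own row, but all rows now share the bounding box of the whole line, and the extra Tab row closes the line.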

The official documentation does explain this, but it is all in English, so few people read it.

Official description of the box format:

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

The correct command for generating LSTM boxes should look something like this:

tesseract cn.my.exp0.tif cn.my.exp1 -l chi_sim lstmbox

After that, you can happily start the training process.
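Before launching lstmtraining, a quick check along these lines might catch a box file that is still in the per-character makebox format. It is a rough sketch with a made-up function name, and it only looks at whether the last entry is a Tab end-of-line entry, so it is a sanity check rather than a full validator.

```python
def missing_line_terminator(box_lines):
    """Return True if the box entries do not end with a Tab end-of-line entry."""
    # Drop blank lines and trailing newlines, but keep the leading Tab.
    entries = [line.rstrip("\r\n") for line in box_lines if line.strip()]
    return not entries or not entries[-1].startswith("\t")
```

Usage could be as simple as reading the box file with `open(path, encoding="utf-8").readlines()` and regenerating the boxes with lstmbox when the function returns True.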
