4. Use the .tif and .box files to generate the .lstmf file used for LSTM training
tesseract nml.num.exp0.tif nml.num.exp0 -l eng --psm 6 lstm.train
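When there are several image/box pairs, the same command can be generated for each of them. The following is a minimal sketch, not part of the original notes: the function name `lstmf_commands` is hypothetical, and it assumes every .tif sits next to a .box file with the same base name.

```python
from pathlib import Path


def lstmf_commands(folder, lang="eng", psm=6):
    """Build one `tesseract ... lstm.train` argv list per .tif that has a .box."""
    commands = []
    for tif in sorted(Path(folder).glob("*.tif")):
        if not tif.with_suffix(".box").exists():
            continue  # skip images without a matching box file
        base = tif.with_suffix("")  # output base, e.g. nml.num.exp0
        commands.append(
            ["tesseract", str(tif), str(base), "-l", lang, "--psm", str(psm), "lstm.train"]
        )
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`; the resulting .lstmf file is written next to the output base.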
5. Extract the .lstm file from an existing or officially downloaded .traineddata file
https://github.com/tesseract-ocr/tessdata_best
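Download the .traineddata file for the language you need from this link.
Note: be sure to use a .traineddata file downloaded from the link above; .lstm files extracted from other .traineddata files cannot be used for training.
Copy the downloaded .traineddata file into the training folder.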
combine_tessdata -e eng.traineddata eng.lstm
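For languages other than eng, the output name simply follows the input name. A small sketch of that rule (the helper `extract_lstm_command` is a hypothetical name, and only the `-e` form shown above is used):

```python
from pathlib import Path


def extract_lstm_command(traineddata):
    """Return the argv for `combine_tessdata -e <lang>.traineddata <lang>.lstm`."""
    src = Path(traineddata)
    if src.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {src.name}")
    # The .lstm file is written next to the source file, with the same base name.
    return ["combine_tessdata", "-e", str(src), str(src.with_suffix(".lstm"))]
```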
Training commands
tesseract eng_my.font.exp0.tif eng_my.font.exp0 -l eng --psm 7 batch.nochop makebox
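The command above does not work with the latest tesseract 4.1.0: the -l eng --psm 7 options are redundant and cause the training tool jTessBoxEditor to fail to display the recognition boxes. Use the following instead: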
tesseract eng_my.font.exp0.tif eng_my.font.exp0 batch.nochop makebox
tesseract eng_my.font.exp0.tif eng_my.font.exp0 nobatch box.train
tesseract eng2_my.font.exp0.tif eng2_my.font.exp0 -l eng --psm 6 lstm.train
The command line below has a serious bug. The order of its arguments is scrambled, so it does not run at all and reports all kinds of errors.
lstmtraining
--traineddata = “C:\jTessBoxEditorFX\samples\trainedsrc\eng.traineddata”
--model_output = “C:\jTessBoxEditorFX\samples\trainingoutput”
--continue_from = “C:\jTessBoxEditorFX\samples\trainedsrc\eng.lstm”
--train_listfile = “C:\jTessBoxEditorFX\samples\trainedsrc\eng.training_files.txt”
--debug_interval -1 --max_iterations 2000
lstmtraining --model_output="C:\jTessBoxEditorFX\samples\trainingoutput" --continue_from="C:\jTessBoxEditorFX\samples\trainedsrc\eng.lstm"
--train_listfile="C:\jTessBoxEditorFX\samples\trainedsrc\eng.training_files.txt" --traineddata="C:\jTessBoxEditorFX\samples\trainedsrc\eng.traineddata"
--debug_interval -1 --max_iterations 800
Run lstmtraining with no arguments to see the order in which the options are listed:
lstmtraining
USAGE: lstmtraining -v | --version | lstmtraining [.tr files ...]
--debug_level Level of Trainer debugging (type:int default:0)
--load_images Load images with tr files (type:int default:0)
--debug_interval How often to display the alignment. (type:int default:0)
--net_mode Controls network behavior. (type:int default:192)
--perfect_sample_delay How many imperfect samples between perfect ones. (type:int default:0)
--max_image_MB Max memory to use for images. (type:int default:6000)
--append_index Index in continue_from Network at which to attach the new network defined by net_spec (type:int default:-1)
--max_iterations If set, exit after this many iterations (type:int default:0)
--clusterconfig_min_samples_fraction Min number of samples per proto as % of total (type:double default:0.625)
--clusterconfig_max_illegal Max percentage of samples in a cluster which have more than 1 feature in that cluster (type:double default:0.05)
--clusterconfig_independence Desired independence between dimensions (type:double default:1)
--clusterconfig_confidence Desired confidence in prototypes created (type:double default:1e-06)
--target_error_rate Final error rate in percent. (type:double default:0.01)
--weight_range Range of initial random weights. (type:double default:0.1)
--learning_rate Weight factor for new deltas. (type:double default:0.001)
--momentum Decay factor for repeating deltas. (type:double default:0.5)
--adam_beta Decay factor for repeating deltas. (type:double default:0.999)
--stop_training Just convert the training model to a runtime model. (type:bool default:false)
--convert_to_int Convert the recognition model to an integer model. (type:bool default:false)
--sequential_training Use the training files sequentially instead of round-robin. (type:bool default:false)
--debug_network Get info on distribution of weight values (type:bool default:false)
--randomly_rotate Train OSD and randomly turn training samples upside-down (type:bool default:false)
--configfile File to load more configs from (type:string default:)
--D Directory to write output files to (type:string default:)
--F File listing font properties (type:string default:font_properties)
--X File listing font xheights (type:string default:)
--U File to load unicharset from (type:string default:unicharset)
--O File to write unicharset to (type:string default:)
--output_trainer File to write trainer to (type:string default:)
--test_ch UTF8 test character string (type:string default:)
--net_spec Network specification (type:string default:)
--continue_from Existing model to extend (type:string default:)
--model_output Basename for output models (type:string default:lstmtrain)
--train_listfile File listing training files in lstmf training format. (type:string default:)
--eval_listfile File listing eval files in lstmf training format. (type:string default:)
--traineddata Combined Dawgs/Unicharset/Recoder for language model (type:string default:)
--old_traineddata When changing the character set, this specifies the old character set that is to be replaced (type:string default:)
Reorder the arguments to follow the listing above:
lstmtraining
--debug_interval -1 --max_iterations 800
--continue_from="C:\jTessBoxEditorFX\samples\trainedsrc\eng.lstm"
--model_output="C:\jTessBoxEditorFX\samples\trainingoutput"
--train_listfile="C:\jTessBoxEditorFX\samples\trainedsrc\eng.training_files.txt"
--traineddata="C:\jTessBoxEditorFX\samples\trainedsrc\eng.traineddata"
With this order, the command ran.
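The command refers to eng.training_files.txt, which is a plain text file listing the .lstmf files, one path per line. The sketch below writes that list and builds the same training command as an argument list. The function names are hypothetical. Passing each flag as its own list item avoids shell quoting and line-break problems, which may also have contributed to the earlier failures; that is an assumption, not something these notes establish.

```python
from pathlib import Path


def write_training_list(lstmf_dir, list_path):
    """Write one .lstmf path per line and return the paths written."""
    files = sorted(str(p) for p in Path(lstmf_dir).glob("*.lstmf"))
    Path(list_path).write_text("".join(f + "\n" for f in files), encoding="utf-8")
    return files


def lstmtraining_args(src_dir, model_output, lang="eng", max_iterations=800):
    """Build the fine-tuning argv in the same flag order as the command above."""
    src = Path(src_dir)
    return [
        "lstmtraining",
        "--debug_interval", "-1",
        "--max_iterations", str(max_iterations),
        f"--continue_from={src / (lang + '.lstm')}",
        f"--model_output={model_output}",
        f"--train_listfile={src / (lang + '.training_files.txt')}",
        f"--traineddata={src / (lang + '.traineddata')}",
    ]
```

The list can then be run with `subprocess.run(lstmtraining_args(src_dir, model_output), check=True)`.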
Combine the training result
lstmtraining --stop_training --continue_from="C:\jTessBoxEditorFX\samples\trainingoutput_checkpoint"
--model_output="C:\jTessBoxEditorFX\samples\trainingoutput\eng2_my.traineddata"
--traineddata="C:\jTessBoxEditorFX\samples\trainedsrc\eng.traineddata"
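The checkpoint passed to --continue_from is the --model_output base name from the training run with `_checkpoint` appended. A minimal sketch of how the three flags relate (the helper name `stop_training_args` is hypothetical):

```python
def stop_training_args(model_output, base_traineddata, final_traineddata):
    """Build the argv that converts the training checkpoint into a .traineddata."""
    # lstmtraining names its checkpoint after the training run's model_output base.
    checkpoint = f"{model_output}_checkpoint"
    return [
        "lstmtraining",
        "--stop_training",
        f"--continue_from={checkpoint}",
        f"--model_output={final_traineddata}",
        f"--traineddata={base_traineddata}",
    ]
```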
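The image has no box where one is needed: how to add one
Select the whole image, then click Insert; a new box will appear.
If two characters are so close together that they end up in a single box, click Split to divide it into two.
Adjusting boxes by hand is slow. To adjust x, for example, select the x field (it turns blue), then select the box to adjust below (it turns red). Now hold down the Up or Down arrow key and the selected box moves quickly.
The whole image may be recognized as a single ~ symbol; remember to delete that entry.
When typing in a character, press Enter twice.
What if the whole image has no annotation boxes at all?
- Open the .box file in Notepad++, then copy and paste its entire contents into a new .txt file
- Add the box entries by hand, following the field order used on each line of the .box file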
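| Recognized character | Left x of box | Bottom y of box | Right x of box | Top y of box | Page number in the tif |
|---|---|---|---|---|---|
| B | 16 | 92 | 138 | 261 | 0 |

Coordinates are in pixels, measured from the bottom-left corner of the image.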
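- Find the lines that belong to the image's page number and insert a new line there. Once the entries are added in the .txt file, copy them back into the .box file. Finally, reopen the .box file and proofread it.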
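The manual steps above can also be sketched in code. A Tesseract box line has the form `symbol left bottom right top page`, with the origin at the bottom-left of the image, so a rectangle measured from the top-left (as most image viewers report it) has to be flipped using the image height. The function names and the image height of 300 in the usage note are hypothetical, chosen for illustration only.

```python
def parse_box_line(line):
    """Split a box line into (symbol, left, bottom, right, top, page)."""
    symbol, left, bottom, right, top, page = line.strip().rsplit(" ", 5)
    return symbol, int(left), int(bottom), int(right), int(top), int(page)


def box_line_from_rect(symbol, x, y, width, height, image_height, page=0):
    """Convert a top-left based rectangle into a bottom-left based box line."""
    left = x
    right = x + width
    top = image_height - y   # flip the y axis
    bottom = top - height
    return f"{symbol} {left} {bottom} {right} {top} {page}"


def insert_box_line(lines, new_line):
    """Insert new_line after the last entry whose page number is not greater."""
    page = parse_box_line(new_line)[5]
    index = 0
    for i, line in enumerate(lines):
        if parse_box_line(line)[5] <= page:
            index = i + 1
    return lines[:index] + [new_line] + lines[index:]
```

For example, `box_line_from_rect("B", 16, 39, 122, 169, image_height=300)` yields the line `B 16 92 138 261 0`, which `insert_box_line` then places at the end of the page 0 entries.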