cnocr training: Tesseract-OCR training process (v3.02)

Software:

jTessBoxEditor Version 0.9 (30 April 2013)

Tesseract-OCR win32 v3.02 with Leptonica

Training steps:

1. In jTessBoxEditor, use Tools -> Merge TIFF to produce the tif file

2. Generate the box file

tesseract.exe eng.arial.01.tif eng.arial.01 batch.nochop makebox

3. Open the result in jTessBoxEditor. Use Insert or Delete to add or remove characters, and adjust each box's coordinates through the x, y, w, h fields
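To make the coordinate adjustment concrete: each line of a Tesseract 3.02 box file has the form `<symbol> <left> <bottom> <right> <top> <page>`, with the origin at the bottom-left corner of the image. The sketch below parses one such line and converts it to x, y, w, h values. It assumes the editor's x/y/w/h fields are measured from the top-left corner; the helper names `parse_box_line` and `to_xywh`, the sample line, and the image height are all made up for illustration.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one '<symbol> <left> <bottom> <right> <top> <page>' line."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def to_xywh(box: Box, image_height: int) -> Tuple[int, int, int, int]:
    """Convert bottom-left-origin coordinates to top-left-origin x, y, w, h."""
    x = box.left
    y = image_height - box.top  # distance from the top edge (assumed origin)
    w = box.right - box.left
    h = box.top - box.bottom
    return x, y, w, h


# Hypothetical box line for the character "A" on page 0 of a 100-pixel-high image.
box = parse_box_line("A 10 20 30 50 0")
print(to_xywh(box, image_height=100))
```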

4. Train (if a character cannot be recognized and you get "couldn't find a matching blob", try moving the box or adjusting its coordinates)

tesseract.exe eng.arial.01.tif eng.arial.01 nobatch box.train

5. Font preprocessing (generates the unicharset file)

unicharset_extractor.exe eng.arial.01.box

6. Create font_properties.txt with the content: arial 0 0 0 0 0
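For reference, the five numbers after the font name are 0/1 flags in the order italic, bold, fixed, serif, fraktur, so `arial 0 0 0 0 0` declares a font with none of those properties set. A minimal sketch of producing such a line follows; `font_properties_line` is a hypothetical helper, not part of Tesseract.

```python
def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Build one font_properties line: '<font> <italic> <bold> <fixed> <serif> <fraktur>'."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(bool(flag))) for flag in flags])


# The line used in this walkthrough.
print(font_properties_line("arial"))
```

The font name on each line should match the font part of the training file names (here `arial`).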

7. Font processing

mftraining.exe -F font_properties.txt -U unicharset eng.arial.01.tr

8. cntraining.exe eng.arial.01.tr

9. Add the prefix "eng.arial.01." to the four files unicharset, inttemp, normproto, and pffmtable
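The renaming can also be scripted. The following is a sketch only: `add_prefix` is a hypothetical helper, and the demo runs against dummy files in a temporary directory rather than real training output.

```python
import tempfile
from pathlib import Path

TRAINING_OUTPUTS = ("unicharset", "inttemp", "normproto", "pffmtable")


def add_prefix(work_dir, prefix="eng.arial.01."):
    """Rename the four training outputs in work_dir by prepending prefix."""
    work_dir = Path(work_dir)
    # Check everything first so a missing file does not leave a half-renamed set.
    missing = [name for name in TRAINING_OUTPUTS if not (work_dir / name).exists()]
    if missing:
        raise FileNotFoundError(f"missing training outputs: {missing}")
    renamed = []
    for name in TRAINING_OUTPUTS:
        target = work_dir / (prefix + name)
        (work_dir / name).replace(target)  # overwrites an existing target
        renamed.append(target.name)
    return renamed


# Demo with placeholder files; real files come from the training steps above.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in TRAINING_OUTPUTS:
        (Path(demo_dir) / name).write_text("placeholder")
    print(add_prefix(demo_dir))
```

In a real run, something like `add_prefix(".")` would be called from the directory that holds the training output.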

10. combine_tessdata.exe eng.arial.01.

Output:

Combining tessdata files

TessdataManager combined tesseract data files.

Offset for type 0 is -1

Offset for type 1 is 108

Offset for type 2 is -1

Offset for type 3 is 1660

Offset for type 4 is 327545

Offset for type 5 is 327781

Offset for type 6 is -1

Offset for type 7 is -1

Offset for type 8 is -1

Offset for type 9 is -1

Offset for type 10 is -1

Offset for type 11 is -1

Offset for type 12 is -1

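Make sure the values on lines 2, 4, 5, and 6 of the offset list (types 1, 3, 4, and 5, which correspond to the four files renamed in step 9) are not -1. If they are all set, the new dictionary has been generated.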
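This check is easy to automate. The sketch below parses the "Offset for type N is M" lines and reports which of the required types are still -1. The function names are hypothetical, and the sample text is the first six offset lines of the output shown above.

```python
import re

# Lines 2, 4, 5 and 6 of the offset listing are types 1, 3, 4 and 5.
REQUIRED_TYPES = (1, 3, 4, 5)

OFFSET_PATTERN = re.compile(r"Offset for type\s+(\d+)\s+is\s+(-?\d+)")


def parse_offsets(output):
    """Return {type_index: offset} parsed from combine_tessdata output."""
    return {int(m.group(1)): int(m.group(2))
            for m in OFFSET_PATTERN.finditer(output)}


def missing_components(output):
    """Return the required type indices whose offset is -1 or absent."""
    offsets = parse_offsets(output)
    return [t for t in REQUIRED_TYPES if offsets.get(t, -1) == -1]


sample = """\
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 1660
Offset for type 4 is 327545
Offset for type 5 is 327781
"""

# An empty list means all four required components were combined.
print(missing_components(sample))
```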

11. Copy the file "eng.arial.01.traineddata" from the current directory into the "tessdata" directory under the Tesseract program directory
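As a recap, the command-line steps (2, 4, 5, 7, 8, and 10) can be collected in order as argument lists. This is a sketch for reference rather than a one-shot script: the box file still has to be corrected by hand after the first command (step 3), and the four files have to be renamed before the last one (step 9). `training_commands` is a hypothetical helper.

```python
def training_commands(base="eng.arial.01", font_properties="font_properties.txt"):
    """Return the argument lists for steps 2, 4, 5, 7, 8 and 10, in order."""
    tif, box, tr = base + ".tif", base + ".box", base + ".tr"
    return [
        ["tesseract.exe", tif, base, "batch.nochop", "makebox"],              # step 2
        ["tesseract.exe", tif, base, "nobatch", "box.train"],                 # step 4
        ["unicharset_extractor.exe", box],                                    # step 5
        ["mftraining.exe", "-F", font_properties, "-U", "unicharset", tr],    # step 7
        ["cntraining.exe", tr],                                               # step 8
        ["combine_tessdata.exe", base + "."],                                 # step 10
    ]


for command in training_commands():
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` if the tools are on the PATH.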

12. Test recognition with the new traineddata

tesseract.exe test.jpg result -l eng.arial.01

tesseract.exe a.bmp result2 -l eng.arial.01

To specify the layout (page segmentation) mode:

tesseract.exe 42.png result2 -l eng.arial.01 -psm 7

Description of the layout parameter:

-psm N

Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:

0 = Orientation and script detection (OSD) only.

1 = Automatic page segmentation with OSD.

2 = Automatic page segmentation, but no OSD, or OCR.

3 = Fully automatic page segmentation, but no OSD. (Default)

4 = Assume a single column of text of variable sizes.

5 = Assume a single uniform block of vertically aligned text.

6 = Assume a single uniform block of text.

7 = Treat the image as a single text line.

8 = Treat the image as a single word.

9 = Treat the image as a single word in a circle.

10 = Treat the image as a single character.
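When recognition is driven from a script, the mode number can be validated before the command is built. A minimal sketch, assuming the v3.02 single-dash `-psm` flag used above; `ocr_command` is a hypothetical helper and the mode descriptions are abbreviated from the list above.

```python
PSM_MODES = {
    0: "OSD only",
    1: "automatic page segmentation with OSD",
    2: "automatic page segmentation, no OSD or OCR",
    3: "fully automatic page segmentation, no OSD (default)",
    4: "single column of text of variable sizes",
    5: "single uniform block of vertically aligned text",
    6: "single uniform block of text",
    7: "single text line",
    8: "single word",
    9: "single word in a circle",
    10: "single character",
}


def ocr_command(image, output_base, lang, psm=None):
    """Build the tesseract.exe argument list, optionally with a -psm mode."""
    command = ["tesseract.exe", image, output_base, "-l", lang]
    if psm is not None:
        if psm not in PSM_MODES:
            raise ValueError(f"psm must be one of {sorted(PSM_MODES)}, got {psm!r}")
        command += ["-psm", str(psm)]
    return command


# Same invocation as the single-text-line example above.
print(" ".join(ocr_command("42.png", "result2", "eng.arial.01", psm=7)))
```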
