Tesseract OCR to Page (TPT) and Page Viewer (PVT)

1). TPT and PVT from PRImA

Both are very nice tools from the Pattern Recognition and Image Analysis Research Lab (PRImA).

Related question: How do I segment a document using Tesseract, then output the resulting bounding boxes and labels?

2). Problems running TPT
TPT only works in layout mode (i.e. with -rec-mode layout).
Any other value, e.g. ocr-regions, ocr-lines, ocr-words or ocr-glyphs, makes it crash.

3). Using Tesseract to output hOCR, then loading it directly in PVT

  • Tesseract 3.02: working!
  • Tesseract 3.04/3.05: partly working :( PVT only recognises Regions, not Lines or Words
  • Reason: the newer versions write more information into the title attribute, which PVT cannot interpret properly
  • Specifically,
    • For class='ocr_line', 3.04 adds baseline, etc. to the title
    • For class='ocr_line', 3.05 adds baseline, x_size, textangle, x_descenders, x_ascenders, etc. to the title, for example:
      • <span class='ocr_line' id='line_1_10' title="bbox 99 516 833 563; baseline 0.014 -17; x_size 39; x_descenders 8; x_ascenders 9">
      • <span class='ocr_line' id='line_1_59' title="bbox 2128 2529 2152 2780; textangle 90; x_size 23; x_descenders 3; x_ascenders 7">
    • For class='ocrx_word', the newer versions add x_wconf to the title
      • e.g. <span class='ocrx_word' id='word_1_4' title='bbox 1303 283 1911 536; x_wconf 86' lang='eng' dir='ltr'><strong>Waves</strong></span>
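
If the extra title fields really are what trips up PVT, one possible workaround is to cut them out of the hOCR file before loading it. The sketch below is only that: I have not confirmed that PVT accepts the result, and the function name strip_extra_title_fields is made up for illustration. It trims every title attribute that starts with a bbox down to the bbox alone and leaves all other titles (such as the ocr_page title, which starts with the image name) untouched.

```python
import re

# A title attribute in single or double quotes; group 1 is the quote character,
# group 2 is the attribute value.
TITLE_ATTR = re.compile(r"""\btitle=(["'])(.*?)\1""")

# The bounding box that hOCR puts at the start of line and word titles.
BBOX = re.compile(r"bbox \d+ \d+ \d+ \d+")


def strip_extra_title_fields(hocr: str) -> str:
    """Reduce titles that begin with a bbox to just that bbox.

    Hypothetical helper: assumes the viewer only needs the bbox, so baseline,
    x_size, textangle, x_descenders, x_ascenders and x_wconf are dropped.
    """
    def keep_bbox(match):
        quote, value = match.group(1), match.group(2)
        bbox = BBOX.match(value)
        if bbox is None:
            # Title does not start with a bbox (e.g. ocr_page): keep it as is.
            return match.group(0)
        return f"title={quote}{bbox.group(0)}{quote}"

    return TITLE_ATTR.sub(keep_bbox, hocr)
```

To use it, read the hOCR file written by Tesseract as text, pass the contents through strip_extra_title_fields, and save the result under a new name so the original output is kept for comparison.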

4). PAGE Converter and Validator
Tried to convert the hOCR output file from Tesseract (in fact an XML format) to PAGE format: NO SUCCESS.

Using the basic Java version of the converter only gives more information.