Tesseract OCR to Page (TPT) and Page Viewer (PVT)

Latest recommended article published on 2024-04-19 14:13:29

windtalkersm

Category: OCR备忘 (OCR notes)

Article link: https://blog.csdn.net/windtalkersm/article/details/59700924

1). TPT and PVT from PRImA
Tesseract OCR to Page (TPT) and Page Viewer (PVT)

Very nice tool sets from the Pattern Recognition and Image Analysis Research Lab (PRImA).

Related question: "How do I segment a document using Tesseract then output the resulting bounding boxes and labels"

2). Had problems running TPT though
It only works in layout mode (i.e. with -rec-mode layout);
any other value, e.g. ocr-regions, ocr-lines, ocr-words or ocr-glyphs, makes it crash.

3). Using Tesseract to output hOCR format, then loading it directly in PVT

Tesseract 3.02: working!
Tesseract 3.04/3.05: partly working :( PVT only recognises Regions, not Lines or Words
Reason: the newer versions write more information into the title attribute, which PVT cannot interpret properly.
Specifically,

For class='ocr_line', 3.04 adds baseline etc. to the title
For class='ocr_line', 3.05 adds baseline, x_size, textangle, x_descenders, x_ascenders etc. to the title, for example:



For class='ocrx_word', the newer versions add x_wconf to the title

e.g. Waves
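
A possible workaround, not tested against PVT: since the extra title properties seem to be what trips PVT up, one could strip everything except bbox from the title of ocr_line and ocrx_word elements before loading the file. Below is a minimal Python sketch of that idea. The function names (strip_title, simplify_hocr) and the sample markup are made up for illustration; the sample is not real Tesseract output, and whether PVT accepts the simplified file is an assumption.

```python
import re

# A title="..." or title='...' attribute inside a tag.
TITLE_ATTR = re.compile(r"""title=(["'])(.*?)\1""", re.DOTALL)
# The bbox property: the keyword followed by four integer coordinates.
BBOX = re.compile(r"bbox\s+-?\d+\s+-?\d+\s+-?\d+\s+-?\d+")
# Opening tags whose class is ocr_line or ocrx_word.
ELEMENT = re.compile(
    r"""<\w+\b[^>]*\bclass=(["'])(?:ocr_line|ocrx_word)\1[^>]*>"""
)


def strip_title(tag):
    """Reduce the title attribute of one opening tag to its bbox property."""
    def keep_bbox(match):
        quote, props = match.group(1), match.group(2)
        bbox = BBOX.search(props)
        if bbox is None:
            # No bbox found: leave the attribute untouched.
            return match.group(0)
        return f"title={quote}{bbox.group(0)}{quote}"

    return TITLE_ATTR.sub(keep_bbox, tag, count=1)


def simplify_hocr(hocr):
    """Drop baseline, x_size, x_wconf etc. from line and word titles."""
    return ELEMENT.sub(lambda m: strip_title(m.group(0)), hocr)


# Hypothetical sample markup, for illustration only.
sample = (
    "<span class='ocr_line' id='line_1_1' "
    "title=\"bbox 36 92 580 122; baseline 0 -6; x_size 30\">"
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 36 92 96 116; x_wconf 90'>Waves</span>"
    "</span>"
)
print(simplify_hocr(sample))
```

To try it on a real file, read the hOCR output as text, pass it through simplify_hocr and write the result to a new file, keeping the original. This is plain regex rewriting rather than proper XML parsing, so it only targets the simple attribute layout shown above.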

4). PAGE Converter and Validator
Tried to convert the hOCR output file from Tesseract (in fact an XML format) to PAGE format: NO SUCCESS.

Using "A basic Java version for conversion only" gives more information.
