1). TPT and PVT from PRImA
Tesseract OCR to Page (TPT) and Page Viewer (PVT)
Very nice tool sets from the Pattern Recognition and Image Analysis Research Lab (PRImA).
Related question: How do I segment a document using Tesseract, then output the resulting bounding boxes and labels?
2). Had a problem running TPT, though
It only works in layout mode (i.e. with -rec-mode layout); any other value, e.g. ocr-regions, ocr-lines, ocr-words or ocr-glyphs, will crash.
3). Using Tesseract to output hOCR format, then loading it directly in PVT
- Tesseract 3.02: working!
- Tesseract 3.04/3.05: partly working :( PVT only recognises Regions, not Lines or Words
  - Reason: the newer versions output more information in the title field, which PVT cannot interpret properly
  - Specifically:
    - For class='ocr_line', 3.04 adds baseline, etc. in title
    - For class='ocr_line', 3.05 adds baseline, x_size, textangle, x_descenders, x_ascenders, etc. in title, for example:
<span class='ocr_line' id='line_1_10' title="bbox 99 516 833 563; baseline 0.014 -17; x_size 39; x_descenders 8; x_ascenders 9">
<span class='ocr_line' id='line_1_59' title="bbox 2128 2529 2152 2780; textangle 90; x_size 23; x_descenders 3; x_ascenders 7">
    - For class='ocrx_word', the new version adds x_wconf in title, e.g.
<span class='ocrx_word' id='word_1_4' title='bbox 1303 283 1911 536; x_wconf 86' lang='eng' dir='ltr'><strong>Waves</strong></span>
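If the extra title properties really are what trips up PVT, one possible workaround is to strip them before loading the file, so that each ocr_line and ocrx_word keeps only its bbox (roughly the shape of the 3.02 output that worked). The sketch below is a minimal, regex-based illustration of that idea, not something verified against PVT. The function names are made up for this example, and it assumes every title attribute sits inside a single opening tag, as in the samples above.

```python
import re

# Opening tags whose class is ocr_line or ocrx_word.
TAG_RE = re.compile(r"""<[^>]*\bclass=["'](?:ocr_line|ocrx_word)["'][^>]*>""")
# A title attribute in either quote style: group 1 is the quote, group 2 the value.
TITLE_RE = re.compile(r"""\btitle=(["'])(.*?)\1""")


def parse_title(title):
    """Split an hOCR title string into a {property: value-string} dict."""
    props = {}
    for part in title.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition(" ")
        props[name] = value.strip()
    return props


def _bbox_only(match):
    """Rebuild one title attribute so it carries only the bbox property."""
    quote, value = match.group(1), match.group(2)
    props = parse_title(value)
    if "bbox" not in props:
        return match.group(0)  # nothing to keep, leave the attribute untouched
    return "title=" + quote + "bbox " + props["bbox"] + quote


def strip_extra_title_props(hocr):
    """Drop baseline, x_size, x_wconf, etc. from line and word titles (hypothetical helper)."""
    return TAG_RE.sub(lambda m: TITLE_RE.sub(_bbox_only, m.group(0)), hocr)
```

To try it, read the hOCR file produced by Tesseract 3.04/3.05 into a string, pass it through strip_extra_title_props, write the result to a new file and open that in PVT. Other elements (ocr_page, ocr_carea, ocr_par) are left alone on purpose, since Regions were already recognised.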
4). PAGE Converter and Validator
Tried to convert the hOCR output file from Tesseract (in fact XML format) to PAGE format: NO SUCCESS.
Using "A basic Java version for conversion only" gives more information.