Introduction to jTessBoxEditor

jTessBoxEditor official site (link to the original text): jTessBoxEditor - Tesseract box editor & trainer

jTessBoxEditor is a box editor and trainer for Tesseract OCR, providing editing of box data of both Tesseract 2.0x and 3.0x formats and full automation of Tesseract training. It can read images of common image formats, including multi-page TIFF. The program requires Java Runtime Environment 7 or later.

jTessBoxEditorFX is jTessBoxEditor rewritten in JavaFX to address the existing issue of rendering complex scripts in Java Swing. It requires JRE 8u40 or later.

jTessBoxEditor is released and distributed under the Apache License, v2.0.

Double click on the JAR file to launch the program, or execute the following command:

java -Xms128m -Xmx1024m -jar jTessBoxEditor.jar

jTessBoxEditor Swing UI

Box View


You need to provide TIFF/Box files as input to the editor. Training images should be uncompressed TIFF at 300 DPI, either 1 bpp (bit per pixel) black-and-white or 8 bpp grayscale. Box files, encoded in UTF-8, are generated by the Tesseract executables with the appropriate command-line options (see Tesseract Training Wiki). Alternatively, both can be created using the built-in TIFF/Box Generator.

Note that the coordinate system used in the box file has (0,0) at the bottom-left, whereas computer graphics devices define (0,0) as the top-left. jTessBoxEditor works in and displays graphics device coordinates; edited box files are still read and written in the proper box-file format.
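The conversion between the two coordinate systems can be sketched as follows. This is a minimal illustration, not jTessBoxEditor's own code: it assumes the six-field Tesseract 3.0x box line layout (`<symbol> <left> <bottom> <right> <top> <page>`), and the names `Box`, `parse_box_line`, and `to_screen_rect` are hypothetical.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """One box-file entry; coordinates have their origin at the bottom-left."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse a line assumed to look like '<symbol> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so that only the last five fields are treated as numbers.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def to_screen_rect(box: Box, image_height: int) -> Tuple[int, int, int, int]:
    """Return (x, y, width, height) with the origin at the top-left of the image."""
    x = box.left
    y = image_height - box.top  # flip the vertical axis
    return x, y, box.right - box.left, box.top - box.bottom
```

For example, on an image 100 pixels high, the line `A 10 20 30 50 0` describes a box whose top edge lies 50 pixels below the top of the image, so it maps to the screen rectangle `(10, 50, 20, 30)`.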

The generator produces, for a given input UTF-8 text file, a TIFF/Box pair of files suitable for training with Tesseract. Depending on whether anti-aliasing is enabled, the generated image is a binary or 8-bpp grayscale, uncompressed multi-page TIFF with 300-DPI resolution. Noise can optionally be added to the image, which could result in better trained data. Letter tracking, or spacing between characters, can be adjusted to eliminate overlapping bounding boxes. Note that some boxes could differ slightly (by 1 or 2 pixels) from the ones that Tesseract itself would have generated; nevertheless, the generated box file can be used to validate the one created by Tesseract with a Unicode-compatible file compare tool, such as WinMerge.
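As an alternative to a visual diff, the validation described above can be approximated with a tolerance-aware comparison. The sketch below is only an illustration under stated assumptions: both files use the six-field 3.0x box line layout, the lines are already read into lists, and the function names and the default tolerance of 2 pixels are hypothetical choices, not part of jTessBoxEditor or Tesseract.

```python
from itertools import zip_longest
from typing import List


def boxes_match(a_line: str, b_line: str, tolerance: int = 2) -> bool:
    """True if symbol and page agree and every coordinate differs by at most `tolerance`."""
    a, b = a_line.split(), b_line.split()
    if len(a) != 6 or len(b) != 6:
        return False  # malformed or missing line
    if a[0] != b[0] or a[5] != b[5]:
        return False  # different symbol or page
    return all(abs(int(x) - int(y)) <= tolerance for x, y in zip(a[1:5], b[1:5]))


def differing_lines(generated: List[str], reference: List[str], tolerance: int = 2) -> List[int]:
    """Return the 1-based numbers of lines that do not match within the tolerance."""
    pairs = zip_longest(generated, reference, fillvalue="")
    return [n for n, (g, r) in enumerate(pairs, start=1) if not boxes_match(g, r, tolerance)]
```

A line-by-line comparison like this only works when both files list the same symbols in the same order; if one file has extra or merged boxes, every later line is reported as different.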

Generate TIFF/Box

Tip: Experiments indicate that training quality with images created by the TIFF/Box Generator is higher with font sizes of 12 pt or greater and with some noise added.

For automated training, first build the necessary Tesseract executables if your platform requires it; Windows executables are already bundled with the program. Place all required source training data files, prefixed with the appropriate language code, in a specified directory; see the samples folder for examples. The training process can also be automated using train.ps1, a Windows PowerShell script.

Train Tesseract

The Merge TIFF function can save multiple images containing text in the same font into a single multi-page TIFF file for convenient training. A conversion function is included to convert numeric character references (NCRs) and escape sequences in the Character text field to Unicode characters.
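The kind of conversion described above can be illustrated with a short sketch. It is not the program's own implementation: it assumes that the escape sequences are of the `\uXXXX` form and that NCRs are decimal (`&#7841;`) or hexadecimal (`&#x1EA1;`), and the name `to_unicode` is hypothetical.

```python
import re

# Decimal (&#7841;) or hexadecimal (&#x1EA1;) numeric character references.
_NCR = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")
# Escape sequences of the assumed \uXXXX form (exactly four hex digits).
_UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def _ncr_to_char(match: "re.Match[str]") -> str:
    hex_digits, dec_digits = match.group(1), match.group(2)
    return chr(int(hex_digits, 16)) if hex_digits else chr(int(dec_digits))


def to_unicode(text: str) -> str:
    """Replace NCRs and \\uXXXX escapes in `text` with the characters they denote."""
    text = _NCR.sub(_ncr_to_char, text)
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

With this sketch, `&#7841;`, `&#x1EA1;`, and `\u1EA1` all convert to the same character, U+1EA1 (ạ). Text without such sequences passes through unchanged.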

If you have any questions, please post them in the VietOCR Forums.
