=== 09/12/2016 Update ===
Updated the Introduction with one recent paper that gives a good overview of Tesseract.
Read it together with How to train Tesseract 3.01 under the Resources section.
=== 08/12/2016 Update ===
Tesseract API Example
=== 15/11/2016 Update: "Resources" ===
Introduction
The Tesseract Optical Character Recognition (OCR) engine, originally developed by Hewlett-Packard between 1984 and 1994, was one of the top 3 engines in the 1995 UNLV Accuracy test as “HP Labs OCR” (Rice et al. 1995). Between 1995 and 2005 there was little activity in Tesseract, until it was open sourced by HP and UNLV.
It was re-released to the open source community in August of 2006 by Google (Vincent, 2006), hosted on Google Code and GitHub under the tesseract-ocr project. More recent evaluations have found Tesseract to perform well in comparisons with other commercial and open source OCR systems (Dhiman and Singh 2013; Chattopadhyay et al. 2011; Heliński et al. 2012; Patel et al. 2012; Vijayarani and Sakila 2015). A wide range of external tools, wrappers and add-on projects is also available, including Tesseract user interfaces, online services, training and training data preparation tools, and additional language data.
Tesseract was originally developed for recognition of English text; Smith (2007), Smith et al. (2009) and Smith (2014) provide overviews of the system during its development and internationalization. Currently, the v3.02 release, the v3.03 release candidate and the v3.04 development version are available, and the tesseract-ocr project supports recognition of over 60 languages.
My recent focus is OCR, so I am doing some preparation.
How do I detect text regions in an image?
`Some answers here mention methods such as the stroke width transform. These methods are well suited to images where text is sparse, scattered and without much regularity. Text region identification in such natural images is known as the text-in-the-wild problem, and multiple deep-learning-based methods have been proposed recently. This is an active research area, but I think the paper mentioned here, Page on ox.ac.uk, is an appealing recent work.
But the data examples you have given here are simpler cases, and you should test more traditional document layout analysis approaches. Fortunately, there is an open source library called Leptonica (Page on leptonica.com) that solves your exact problem. It uses multi-resolution morphology and seems to do very well where the pixel density differs greatly between the text and the image regions, which is true of most magazine and newspaper articles. There is also a paper that extended this library to tackle cases like engineering drawings and maps, where the pixel density does not differ much. The paper can be found at Page on uwa.edu.au.`
How do I detect text/images in a document image?
- MSER(Maximally Stable Extremal Regions) features
One simple and reasonably efficient method is to use MSER (Maximally Stable Extremal Regions) features to detect text. Given a printed page, MSER tries to find connected (and nearly connected) regions. We need to tweak parameters such as the minimum area and the threshold to make it work for a specific kind of printed page.
A ready-made MSER implementation is available as a function in the OpenCV library; it is fast and easy to use from C++, Python or Java. MATLAB also provides the same functionality.
This link, Feature Detection and Description, would be helpful. The inputs (parameters) that have to be given to this algorithm are discussed here: Page on stackoverflow.com. To use it through MATLAB, check this out: Detect MSER features and return MSERRegions object.
OpenALPR
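A very good open source project. The code is complete, the comments are clear, the logic is easy to follow, and the documentation is comprehensive.
I plan to study it carefully and translate it.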
OpenALPR online documentation
OpenALPR Documentation
How to improve accuracy
Accuracy Improvements
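Although it is about improving detection accuracy, it mentions many key implementation details, which is very helpful for engineering work.
How to train OCR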
Training OCR
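Detailed steps for training Tesseract character recognition, with many useful tools provided. When getting started, there is no need to read the Tesseract documentation separately.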
Tesseract
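Probably the most recent comprehensive introduction to Tesseract.
It mainly covers the principles, algorithms and overall design framework, with no code or examples.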
Slides from Tesseract tutorial at DAS, Santorini
A simple application example that provides lib libraries and DLLs (both Release and Debug).
Worth a try when there is time?
tesseract-ocr-sample
Resources
German license plate font
It seems that many other countries use the same font.
How to train Tesseract 3.01
Optical Character Recognition (OCR) is a very popular tool nowadays. It enables machines to automatically identify text in digital images. A lot of research has been done, resulting in many different techniques and publications. Currently, many OCR tools available on the market are expensive and not open source, but a few of them are free and open source.
Tesseract was originally developed at HP as a PhD research project.
In the first three steps, the component analysis and text/word splitting are done. Afterwards the words are split into characters, and each character/component goes through a two-pass recognition process. In the first pass, the recognized characters/words are passed to an adaptive classifier, which uses them as training data. The text is then recognized a second time, now using the adaptive classifier.
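The two-pass idea can be sketched with a toy nearest-prototype classifier. This is not Tesseract's actual classifier or feature set: the 2-D feature vectors, the `confident` distance threshold and the function names are all assumptions made up for illustration. Pass 1 uses only the static prototypes and keeps the confidently recognized glyphs as adaptive training data; pass 2 recognizes every glyph again with the adapted prototypes added.

```python
from math import dist


def nearest(features, prototypes):
    """Return (label, distance) of the closest prototype to a feature vector."""
    return min(
        ((label, dist(features, vec)) for label, vec in prototypes),
        key=lambda result: result[1],
    )


def two_pass_recognize(glyphs, static_prototypes, confident=0.5):
    """Recognize glyphs twice; the second pass also uses adapted prototypes."""
    adaptive = []   # (label, features) pairs learned from this page
    first = []
    # Pass 1: static classifier only; confident results train the adaptive one.
    for features in glyphs:
        label, distance = nearest(features, static_prototypes)
        first.append(label)
        if distance <= confident:
            adaptive.append((label, features))
    # Pass 2: recognize again with static plus page-specific prototypes.
    combined = list(static_prototypes) + adaptive
    second = [nearest(features, combined)[0] for features in glyphs]
    return first, second


# Toy data: the middle glyph is an "a" drawn in a font that drifts towards "b".
STATIC = [("a", (0.0, 0.0)), ("b", (1.0, 0.0))]
GLYPHS = [(0.3, 0.3), (0.55, 0.5), (1.0, 0.1)]
```

With this toy data, `two_pass_recognize(GLYPHS, STATIC)` labels the middle glyph "b" in the first pass and "a" in the second, because the confidently recognized first glyph has pulled the "a" prototypes towards the page's own font.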
Documentation for Tesseract 3.04.01
Documentation for Tesseract 3.02
tesseract::TessBaseAPI Class Reference
Is Tesseract (an OCR engine) reentrant?
From the release notes, Tesseract is (mostly, and to the degree that you describe needing) thread-safe as of 3.01 (Oct 21 2011):
Thread-safety! Moved all critical globals and statics to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads.) with the minor exception that some control parameters are still global and affect all threads.
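The "multiple instances in multiple threads" pattern from the release note can be sketched as one engine instance per worker thread. In the sketch below, `FakeEngine` is a hypothetical stand-in for a real engine wrapper (it is not a Tesseract API), and `recognize_all` is an assumed helper name; a real version would create one Tesseract instance in the factory instead. The sketch does nothing about the control parameters that the release note says are still global.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeEngine:
    """Stand-in for one OCR engine instance (hypothetical, not a Tesseract API)."""

    def __init__(self):
        self.threads_seen = set()

    def recognize(self, image):
        # Record the calling thread so sharing across threads can be detected.
        self.threads_seen.add(threading.get_ident())
        return f"text of {image}"


def recognize_all(images, engine_factory, workers=4):
    """Recognize images in parallel, creating one engine per worker thread."""
    local = threading.local()
    engines = []
    lock = threading.Lock()

    def work(image):
        # Each thread lazily creates its own engine and never shares it.
        if not hasattr(local, "engine"):
            local.engine = engine_factory()
            with lock:
                engines.append(local.engine)
        return local.engine.recognize(image)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(work, images))
    return results, engines
```

Keeping the engine in `threading.local` avoids both sharing one instance between threads and paying the initialization cost for every image.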
Tesseract Ocr Engine Cube mode - Training Tesseract
There is an explanation of the various training files required by the Cube engine mode on the tesseract-ocr-extradocs project wiki:
tesseract-ocr-extradocs - Cube.wiki
the neural network file format