Tesseract Source Code Explained (1): An Introduction to the Commonly Used API Functions


小澎哥


Tesseract can currently recognize more than 100 languages, and it can also be trained to recognize others.

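The source package provides an OCR engine, libtesseract, and a command-line program, tesseract. Tesseract's recognition pipeline consists mainly of binarization, segmentation, recognition, and error correction. Broadly, the engine can be divided into two parts: page layout analysis, and character segmentation and recognition. Of these, character segmentation and recognition is the design goal of Tesseract as a whole. In more detail, segmentation proceeds in four stages: analyzing connected components, finding block regions, finding text lines and words, and producing (recognizing) the text. The API that Tesseract provides is declared in baseapi.h. This article introduces the commonly used API functions in baseapi.h and how to combine them into a simple recognition call. Later articles will look at the source code behind each API in turn, to understand the algorithms and how they are implemented.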

(Based on the Tesseract 4.0 documentation.)

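tesseract::TessBaseAPI: the basic interface class. It covers initialization, simple processing of the text in an image, the structures that hold layout analysis results, and so on.
IMAGE: simply a class that wraps image operations, including reading an image and querying its parameters.
Others: data type declarations, related struct declarations, cross-platform handling, command-line argument parsing, and so on.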

In practice, what we use comes from the first two.

Most of Tesseract's API functions and how to use them

1: SetImage function declaration:

void SetImage(const unsigned char* imagedata, int width, int height,
              int bytes_per_pixel, int bytes_per_line);

/**
 * Provide an image for Tesseract to recognize. Format is as
 * TesseractRect above. Copies the image buffer and converts to Pix.
 * SetImage clears all recognition results, and sets the rectangle to the
 * full image, so it may be followed immediately by a GetUTF8Text, and it
 * will automatically perform recognition.
 */

Provides Tesseract with the image to be recognized.
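The relationship between the raw-buffer parameters is easy to get wrong, so here is a minimal sketch of how they fit together for a tightly packed 8-bit grayscale buffer. `RawGrayImage` and `MakeGrayImage` are hypothetical helpers introduced only for illustration; they are not part of Tesseract.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type (not part of Tesseract): an 8-bit grayscale
// image stored row by row with no padding between rows.
struct RawGrayImage {
  std::vector<unsigned char> data;
  int width;
  int height;
  int bytes_per_pixel;  // 1 for 8-bit grayscale
  int bytes_per_line;   // size of one row in bytes (the stride)
};

// Builds a uniformly filled grayscale image of the given size.
RawGrayImage MakeGrayImage(int width, int height, unsigned char fill) {
  RawGrayImage img;
  img.width = width;
  img.height = height;
  img.bytes_per_pixel = 1;
  // With no row padding, the stride is simply width * bytes_per_pixel.
  img.bytes_per_line = width * img.bytes_per_pixel;
  img.data.assign(static_cast<std::size_t>(img.bytes_per_line) * height, fill);
  return img;
}
```

Assuming an initialized `tesseract::TessBaseAPI api`, such a buffer would then be handed over with `api.SetImage(img.data.data(), img.width, img.height, img.bytes_per_pixel, img.bytes_per_line);`. If the rows are padded (for example, aligned to 4 bytes), `bytes_per_line` must be the padded row size instead.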

void SetImage(Pix* pix);

/**
 * Provide an image for Tesseract to recognize. As with SetImage above,
 * Tesseract takes its own copy of the image, so it need not persist until
 * after Recognize.
 * Pix vs raw, which to use?
 * Use Pix where possible. Tesseract uses Pix as its internal representation
 * and it is therefore more efficient to provide a Pix directly.
 */

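2: SetRectangle function declaration:

void SetRectangle(int left, int top, int width, int height);

/**
 * Restrict recognition to a sub-rectangle of the image. Call after SetImage.
 * Each SetRectangle clears the recognition results so multiple rectangles
 * can be recognized with the same image.
 */

Restricts recognition to a sub-rectangle of the image. Call it after SetImage. Each call clears the recognition results, so several rectangles of the same image can be recognized one after another.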

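3: SetSourceResolution function declaration:

void SetSourceResolution(int ppi);

/**
 * Set the resolution of the source image in pixels per inch so font size
 * information can be calculated in results. Call this after SetImage().
 */

Sets the resolution of the source image in pixels per inch, so that font size information can be calculated in the results. Call it after SetImage.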
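To see why the resolution is needed for font sizes, recall that a typographic point is 1/72 inch, so a height in pixels can only be turned into points once the pixels-per-inch value is known. The sketch below shows that general conversion; `PixelsToPoints` is a hypothetical helper and is not Tesseract's internal code.

```cpp
// Hypothetical helper: converts a glyph height in pixels to points
// (1 point = 1/72 inch) for an image scanned at `ppi` pixels per inch.
// Returns 0 when the resolution is unknown (ppi <= 0).
double PixelsToPoints(double height_px, int ppi) {
  if (ppi <= 0) return 0.0;
  return height_px * 72.0 / ppi;
}
```

For instance, text that is 50 pixels tall in a 300 ppi scan corresponds to 50 × 72 / 300 = 12 points, whereas the same 50 pixels in a 150 ppi scan would be 24 points.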

4: SetThresholder function declaration:

/**
 * In extreme cases only, usually with a subclass of Thresholder, it
 * is possible to provide a different Thresholder. The Thresholder may
 * be preloaded with an image, settings etc, or they may be set after.
 * Note that Tesseract takes ownership of the Thresholder and will
 * delete it when it is replaced or the API is destructed.
 */
void SetThresholder(ImageThresholder* thresholder) {
  delete thresholder_;
  thresholder_ = thresholder;
  ClearResults();
}

5: GetThresholdedImage function declaration:

/**
 * Get a copy of the internal thresholded image from Tesseract.
 * Caller takes ownership of the Pix and must pixDestroy it.
 * May be called any time after SetImage, or after TesseractRect.
 */
Pix* GetThresholdedImage();

6: GetRegions function declaration:

/**
 * Get the result of page layout analysis as a leptonica-style
 * Boxa, Pixa pair, in reading order.
 * Can be called before or after Recognize.
 */
Boxa* GetRegions(Pixa** pixa);

Gets the result of page layout analysis as a leptonica-style Boxa, Pixa pair. It can be called before or after Recognize.

7: GetTextlines function declaration:

/**
 * Get the textlines as a leptonica-style
 * Boxa, Pixa pair, in reading order.
 * Can be called before or after Recognize.
 * If raw_image is true, then extract from the original image instead of the
 * thresholded image and pad by raw_padding pixels.
 * If blockids is not nullptr, the block-id of each line is also returned as an
 * array of one element per line. delete [] after use.
 * If paraids is not nullptr, the paragraph-id of each line within its block is
 * also returned as an array of one element per line. delete [] after use.
 */
Boxa* GetTextlines(const bool raw_image, const int raw_padding,
                   Pixa** pixa, int** blockids, int** paraids);

Gets the text lines as a leptonica-style Boxa, Pixa pair. It can be called before or after Recognize. If blockids is not nullptr, the block-id of each line is also returned as an array with one element per line, which must be freed with delete [] after use.

// Helper method to extract from the thresholded image. (most common usage)
Boxa* GetTextlines(Pixa** pixa, int** blockids) {
  return GetTextlines(false, 0, pixa, blockids, nullptr);
}

/**
 * Get textlines and strips of image regions as a leptonica-style Boxa, Pixa
 * pair, in reading order. Enables downstream handling of non-rectangular
 * regions.
 * Can be called before or after Recognize.
 * If blockids is not nullptr, the block-id of each line is also returned as an
 * array of one element per line. delete [] after use.
 */
Boxa* GetStrips(Pixa** pixa, int** blockids);

Gets the text lines and strips of image regions as a leptonica-style Boxa, Pixa pair, which makes it possible to handle non-rectangular regions downstream. It can be called before or after Recognize.

/**
 * Get the words as a leptonica-style
 * Boxa, Pixa pair, in reading order.
 * Can be called before or after Recognize.
 */
Boxa* GetWords(Pixa** pixa);

Gets the words in the image as a leptonica-style Boxa, Pixa pair. It can be called before or after Recognize.

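8: GetConnectedComponents function declaration:

/**
 * Gets the individual connected (text) components (created
 * after pages segmentation step, but before recognition)
 * as a leptonica-style Boxa, Pixa pair, in reading order.
 * Can be called before or after Recognize.
 * Note: the caller is responsible for calling boxaDestroy()
 * on the returned Boxa array and pixaDestroy() on cc array.
 */
Boxa* GetConnectedComponents(Pixa** cc);

Gets the individual connected (text) components, which are created after page segmentation but before recognition, as a leptonica-style Boxa, Pixa pair. It can be called before or after Recognize. The caller must release the returned Boxa with boxaDestroy() and the cc array with pixaDestroy().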

/**
 * Get the given level kind of components (block, textline, word etc.) as a
 * leptonica-style Boxa, Pixa pair, in reading order.
 * Can be called before or after Recognize.
 * If blockids is not nullptr, the block-id of each component is also returned
 * as an array of one element per component. delete [] after use.
 * If paraids is not nullptr, the paragraph-id of each component with its block
 * is also returned as an array of one element per component. delete [] after
 * use.
 * If raw_image is true, then portions of the original image are extracted
 * instead of the thresholded image and padded with raw_padding.
 * If text_only is true, then only text components are returned.
 */
Boxa* GetComponentImages(const PageIteratorLevel level,
                         const bool text_only, const bool raw_image,
                         const int raw_padding,
                         Pixa** pixa, int** blockids, int** paraids);

Gets the components of the given level (block, textline, word, and so on) as a leptonica-style Boxa, Pixa pair. It can be called before or after Recognize. If blockids is not nullptr, the block-id of each component is also returned as an array with one element per component, which must be freed with delete [] after use. If text_only is true, only text components are returned.

// Helper function to get binary images with no padding (most common usage).
Boxa* GetComponentImages(const PageIteratorLevel level,
                         const bool text_only,
                         Pixa** pixa, int** blockids) {
  return GetComponentImages(level, text_only, false, 0, pixa, blockids, nullptr);
}

9: DumpPGM function declaration:

void tesseract::TessBaseAPI::DumpPGM(const char* filename)

Dumps the internal binary image to a PGM file.

10: AnalyseLayout function declaration:

/**
 * Runs page layout analysis in the mode set by SetPageSegMode.
 * May optionally be called prior to Recognize to get access to just
 * the page layout results. Returns an iterator to the results.
 * If merge_similar_words is true, words are combined where suitable for use
 * with a line recognizer. Use if you want to use AnalyseLayout to find the
 * textlines, and then want to process textline fragments with an external
 * line recognizer.
 * Returns nullptr on error or an empty page.
 * The returned iterator must be deleted after use.
 * WARNING! This class points to data held within the TessBaseAPI class, and
 * therefore can only be used while the TessBaseAPI class still exists and
 * has not been subjected to a call of Init, SetImage, Recognize, Clear, End,
 * DetectOS, or anything else that changes the internal PAGE_RES.
 */
PageIterator* AnalyseLayout();
PageIterator* AnalyseLayout(bool merge_similar_words);

Runs page layout analysis in the mode set by SetPageSegMode and returns an iterator over the results, or nullptr on error or for an empty page. The iterator must be deleted after use. Note: the iterator points to data held inside the TessBaseAPI object, so it can only be used while that object still exists and has not been subjected to a call of Init, SetImage, Recognize, Clear, End, DetectOS, or anything else that changes the internal PAGE_RES.

11: Recognize function declaration:

/**
 * Recognize the image from SetAndThresholdImage, generating Tesseract
 * internal structures. Returns 0 on success.
 * Optional. The Get*Text functions below will call Recognize if needed.
 * After Recognize, the output is kept internally until the next SetImage.
 */
int Recognize(ETEXT_DESC* monitor);

int tesseract::TessBaseAPI::Recognize(ETEXT_DESC* monitor)

Recognizes the image from SetAndThresholdImage and generates Tesseract's internal data structures. Returns 0 on success. The call is optional: the Get*Text functions below call it if needed. After recognition, the output is kept internally until the next SetImage.

12: RecognizeForChopTest function declaration:

/**
 * Methods to retrieve information after SetAndThresholdImage(),
 * Recognize() or TesseractRect(). (Recognize is called implicitly if needed.)
 */

/** Variant on Recognize used for testing chopper. */
int RecognizeForChopTest(ETEXT_DESC* monitor);

int tesseract::TessBaseAPI::RecognizeForChopTest(ETEXT_DESC* monitor)

The methods from here on retrieve information after SetAndThresholdImage(), Recognize(), or TesseractRect() (Recognize is called implicitly if needed). RecognizeForChopTest itself is a variant of Recognize used for testing the chopper.

13: ProcessPages function declaration:

/**
 * Turns images into symbolic text.
 * filename can point to a single image, a multi-page TIFF,
 * or a plain text list of image filenames.
 * retry_config is useful for debugging. If not nullptr, you can fall
 * back to an alternate configuration if a page fails for some
 * reason.
 * timeout_millisec terminates processing if any single page
 * takes too long. Set to 0 for unlimited time.
 * renderer is responsible for creating the output. For example,
 * use the TessTextRenderer if you want plaintext output, or
 * the TessPDFRenderer to produce searchable PDF.
 * If tessedit_page_number is non-negative, will only process that
 * single page. Works for multi-page tiff file, or filelist.
 * Returns true if successful, false on error.
 */
bool ProcessPages(const char* filename, const char* retry_config,
                  int timeout_millisec, TessResultRenderer* renderer);

// Does the real work of ProcessPages.
bool ProcessPagesInternal(const char* filename, const char* retry_config,
                          int timeout_millisec, TessResultRenderer* renderer);

The older API documentation lists a variant that returns the text through a string instead of a renderer:

bool tesseract::TessBaseAPI::ProcessPages(const char* filename,
                                          const char* retry_config,
                                          int timeout_millisec,
                                          STRING* text_out)

The description below follows that older signature. It recognizes all pages of the given file, which may be a multi-page TIFF, a single image in another format, or a plain text list of image filenames, and produces the appropriate text according to the parameters tessedit_create_boxfile, tessedit_make_boxes_from_boxes, tessedit_write_unlv, and tessedit_create_hocr. ProcessPage is run on each page of the input, and the result is placed in text_out. If tessedit_page_number is non-negative, processing starts at the page it designates (the 4.0 comment above says that only that single page is processed). Returns false on error. If timeout_millisec is positive, processing is terminated when a single page takes longer than that. If a page fails for some reason and retry_config is given, the page is reprocessed with the retry_config configuration file.

14: ProcessPage function declaration:

/**
 * Turn a single image into symbolic text.
 * The pix is the image processed. filename and page_index are
 * metadata used by side-effect processes, such as reading a box
 * file or formatting as hOCR.
 * See ProcessPages for descriptions of other parameters.
 */
bool ProcessPage(Pix* pix, int page_index, const char* filename,
                 const char* retry_config, int timeout_millisec,
                 TessResultRenderer* renderer);

The older API documentation lists this variant:

bool tesseract::TessBaseAPI::ProcessPage(Pix* pix, int page_index,
                                         const char* filename,
                                         const char* retry_config,
                                         int timeout_millisec,
                                         STRING* text_out)

Recognizes a single page on behalf of ProcessPages. In the older signature the text is placed in text_out. pix is the image to process; filename and page_index are metadata used by side-effect processes, such as reading a box file or formatting the output as hOCR.

15: GetIterator function declaration:

/**
 * Get a reading-order iterator to the results of LayoutAnalysis and/or
 * Recognize. The returned iterator must be deleted after use.
 * WARNING! This class points to data held within the TessBaseAPI class, and
 * therefore can only be used while the TessBaseAPI class still exists and
 * has not been subjected to a call of Init, SetImage, Recognize, Clear, End,
 * DetectOS, or anything else that changes the internal PAGE_RES.
 */
ResultIterator* GetIterator();

ResultIterator* tesseract::TessBaseAPI::GetIterator()

Gets a reading-order iterator over the results of LayoutAnalysis and/or Recognize. Delete it after use.

16: GetMutableIterator function declaration:

MutableIterator* tesseract::TessBaseAPI::GetMutableIterator()

Gets a mutable iterator over the results of LayoutAnalysis and/or Recognize. Delete it after use.

17: GetUTF8Text function declaration:

/**
 * The recognized text is returned as a char* which is coded
 * as UTF8 and must be freed with the delete [] operator.
 */
char* GetUTF8Text();

char* tesseract::TessBaseAPI::GetUTF8Text()

Builds the text string from the internal data structures and returns the recognized text as a UTF-8 encoded char*, which must be freed with the delete [] operator.

18: GetHOCRText function declaration:

/**
 * Make a HTML-formatted string with hOCR markup from the internal
 * data structures.
 * page_number is 0-based but will appear in the output as 1-based.
 * monitor can be used to
 *   cancel the recognition
 *   receive progress callbacks
 * Returned string must be freed with the delete [] operator.
 */
char* GetHOCRText(ETEXT_DESC* monitor, int page_number);

char* tesseract::TessBaseAPI::GetHOCRText(int page_number)
