Tesseract Source Code Explained (1): An Introduction to the Commonly Used API

     

Tesseract can currently recognize more than 100 languages, and it can also be trained to recognize others.

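The source package provides an OCR engine, libtesseract, and a command-line program, tesseract. The main text-recognition pipeline is binarization, segmentation, recognition, and error correction. Broadly, the Tesseract engine has two parts: page layout analysis, and character segmentation and recognition. Character segmentation and recognition is the design goal of Tesseract as a whole. In finer detail, character segmentation has four stages: analyzing connected components; finding block regions; finding text lines and words; and producing (recognizing) the text. The API that Tesseract provides is declared in baseapi.h. This article introduces how to use the common API functions in baseapi.h and how to combine them into a simple recognition call. Later articles will look at the source of each API in turn to understand the algorithms and how they are implemented.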

(Tesseract 4.0 documentation)

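  1. tesseract::TessBaseAPI: the basic interface. It covers initialization, simple processing of the text in an image, the result structures of layout analysis, and so on.
  2. IMAGE: simply a class that wraps image operations, including reading an image and getting its parameters.
  3. Others: data type declarations, related struct declarations, cross-platform handling, command-line argument extraction, and so on.

In practice we mostly use what is in the first two.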

 

Most of Tesseract's API functions and how to use them

  1. void SetImage(const unsigned char* imagedata, int width, int height,

                int bytes_per_pixel, int bytes_per_line);

 

  Provide an image for Tesseract to recognize. Format is as

   * TesseractRect above. Copies the image buffer and converts to Pix.

   * SetImage clears all recognition results, and sets the rectangle to the

   * full image, so it may be followed immediately by a GetUTF8Text, and it

   * will automatically perform recognition.

   Provides Tesseract with the image to recognize.

 

  void SetImage(Pix* pix);

 

/**

   * Provide an image for Tesseract to recognize. As with SetImage above,

   * Tesseract takes its own copy of the image, so it need not persist until

   * after Recognize.

   * Pix vs raw, which to use?

   * Use Pix where possible. Tesseract uses Pix as its internal representation

   * and it is therefore more efficient to provide a Pix directly.

   */
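The raw-buffer overload requires the caller to describe the memory layout. As a minimal sketch, the hypothetical helper below (PackedGray and MakePackedGray are illustration names, not part of Tesseract) builds a tightly packed 8-bit grayscale buffer and derives the bytes_per_pixel and bytes_per_line values that would be passed to SetImage. It assumes rows have no padding; a buffer with padded rows needs a larger bytes_per_line.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical holder for a raw image and the layout values SetImage expects.
struct PackedGray {
  std::vector<unsigned char> data;
  int width;
  int height;
  int bytes_per_pixel;  // 1 for 8-bit grayscale
  int bytes_per_line;   // row stride in bytes
};

// Builds a width x height grayscale image filled with one value.
// Rows are tightly packed, so the stride is width * bytes_per_pixel.
PackedGray MakePackedGray(int width, int height, unsigned char fill) {
  assert(width > 0 && height > 0);
  PackedGray img;
  img.width = width;
  img.height = height;
  img.bytes_per_pixel = 1;
  img.bytes_per_line = width * img.bytes_per_pixel;
  img.data.assign(static_cast<std::size_t>(img.bytes_per_line) * height, fill);
  return img;
}

// Assumed usage with an initialized tesseract::TessBaseAPI named api:
//   PackedGray img = MakePackedGray(640, 480, 255);
//   api.SetImage(img.data.data(), img.width, img.height,
//                img.bytes_per_pixel, img.bytes_per_line);
// SetImage copies the buffer, so img may be destroyed afterwards.
```

When the image already exists as a Pix, the header comment above recommends passing it directly instead of going through a raw buffer.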

  

  2. void SetRectangle(int left, int top, int width, int height);

/**

   * Restrict recognition to a sub-rectangle of the image. Call after SetImage.

   * Each SetRectangle clears the recognition results so multiple rectangles

   * can be recognized with the same image.

   */

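Restricts recognition to a sub-rectangle of the image; call it after SetImage. Each call clears the recognition results, so several rectangles can be recognized on the same image.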
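The "several rectangles on one image" pattern can be sketched as a loop. The function below is an illustration, not code from Tesseract: Region and RecognizeRegions are hypothetical names, and the function is written as a template so that it only assumes the two member functions it calls, SetRectangle and GetUTF8Text, which tesseract::TessBaseAPI provides. It assumes SetImage has already been called on the object passed in.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical rectangle type for this sketch; not a Tesseract type.
struct Region {
  int left, top, width, height;
};

// Recognizes each region of the image already passed to SetImage.
// Api is expected to offer SetRectangle(int, int, int, int) and
// char* GetUTF8Text(), as tesseract::TessBaseAPI does.
template <class Api>
std::vector<std::string> RecognizeRegions(Api& api,
                                          const std::vector<Region>& regions) {
  std::vector<std::string> texts;
  for (const Region& r : regions) {
    assert(r.width > 0 && r.height > 0);
    api.SetRectangle(r.left, r.top, r.width, r.height);  // clears old results
    char* raw = api.GetUTF8Text();  // runs recognition if needed
    texts.push_back(raw ? std::string(raw) : std::string());
    delete[] raw;  // the returned buffer must be freed with delete []
  }
  return texts;
}
```

Because each SetRectangle call discards the previous results, the text for one region has to be copied out before moving on to the next, which is why the loop stores a std::string per region.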

 

3.  void SetSourceResolution(int ppi);

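Sets the resolution of the source image in pixels per inch, so that font size information can be calculated in the results. Call it after SetImage.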

   * Set the resolution of the source image in pixels per inch so font size

   * information can be calculated in results.  Call this after SetImage().
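To see why the resolution matters for font size, here is the standard typographic conversion from a height in pixels to a size in points (one point is 1/72 inch). This is only an illustration of the arithmetic; PixelsToPoints is a hypothetical helper, and the exact calculation Tesseract performs internally is not shown in this article.

```cpp
#include <cassert>

// Converts a glyph height in pixels to typographic points.
// One point is 1/72 inch, so points = pixels / ppi * 72.
double PixelsToPoints(int pixel_height, int ppi) {
  assert(ppi > 0);
  return static_cast<double>(pixel_height) * 72.0 / ppi;
}
```

For example, a 50-pixel-high glyph scanned at 300 ppi corresponds to 12 points, while the same 50 pixels at 150 ppi would correspond to 24 points, so a wrong ppi value skews every reported font size by the same factor.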

4. /**

   * In extreme cases only, usually with a subclass of Thresholder, it

   * is possible to provide a different Thresholder. The Thresholder may

   * be preloaded with an image, settings etc, or they may be set after.

   * Note that Tesseract takes ownership of the Thresholder and will

   * delete it when it is replaced or the API is destructed.

   */

  void SetThresholder(ImageThresholder* thresholder) {

    delete thresholder_;

    thresholder_ = thresholder;

    ClearResults();

  }

 

5. /**

   * Get a copy of the internal thresholded image from Tesseract.

   * Caller takes ownership of the Pix and must pixDestroy it.

   * May be called any time after SetImage, or after TesseractRect.

   */

  Pix* GetThresholdedImage();

 

6. /**

   * Get the result of page layout analysis as a leptonica-style

   * Boxa, Pixa pair, in reading order.

   * Can be called before or after Recognize.

   */

  Boxa* GetRegions(Pixa** pixa);

Gets the result of page layout analysis as a leptonica-style Boxa, Pixa pair. It can be called before or after Recognize.

 

7. /**

   * Get the textlines as a leptonica-style

   * Boxa, Pixa pair, in reading order.

   * Can be called before or after Recognize.

   * If raw_image is true, then extract from the original image instead of the

   * thresholded image and pad by raw_padding pixels.

   * If blockids is not nullptr, the block-id of each line is also returned as an

   * array of one element per line. delete [] after use.

   * If paraids is not nullptr, the paragraph-id of each line within its block is

   * also returned as an array of one element per line. delete [] after use.

   */

  Boxa* GetTextlines(const bool raw_image, const int raw_padding,

                     Pixa** pixa, int** blockids, int** paraids);

 

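Gets the text lines as a leptonica-style Boxa, Pixa pair. It can be called before or after Recognize. If blockids is not nullptr, the block-id of each line is also returned as an array with one element per line; release the array with delete [] after use.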

  /*

     Helper method to extract from the thresholded image. (most common usage)

  */

  Boxa* GetTextlines(Pixa** pixa, int** blockids) {

    return GetTextlines(false, 0, pixa, blockids, nullptr);

  }
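The blockids array is a flat list with one block id per text line. A minimal sketch of how it might be consumed is shown below: GroupLinesByBlock is a hypothetical helper that maps each block id to the indices of its lines. The usage comment assumes an initialized tesseract::TessBaseAPI named api, that a null pixa argument is accepted, and Leptonica's boxaGetCount and boxaDestroy for the returned Boxa.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Groups line indices by block id. blockids is the array filled in by
// GetTextlines; count is the number of lines in the returned Boxa.
std::map<int, std::vector<int>> GroupLinesByBlock(const int* blockids,
                                                  int count) {
  assert(count == 0 || blockids != nullptr);
  std::map<int, std::vector<int>> lines_in_block;
  for (int i = 0; i < count; ++i) {
    lines_in_block[blockids[i]].push_back(i);
  }
  return lines_in_block;
}

// Assumed usage:
//   int* blockids = nullptr;
//   Boxa* boxes = api.GetTextlines(nullptr, &blockids);
//   if (boxes != nullptr) {
//     auto groups = GroupLinesByBlock(blockids, boxaGetCount(boxes));
//     delete[] blockids;    // the id array is released with delete []
//     boxaDestroy(&boxes);  // the Boxa is released through Leptonica
//   }
```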

 

  /**

   * Get textlines and strips of image regions as a leptonica-style Boxa, Pixa

   * pair, in reading order. Enables downstream handling of non-rectangular

   * regions.

   * Can be called before or after Recognize.

   * If blockids is not nullptr, the block-id of each line is also returned as an

   * array of one element per line. delete [] after use.

   */

  Boxa* GetStrips(Pixa** pixa, int** blockids);

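Gets the text lines and strips of image regions as a leptonica-style Boxa, Pixa pair, which allows downstream handling of non-rectangular regions. It can be called before or after Recognize.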

  /**

   * Get the words as a leptonica-style

   * Boxa, Pixa pair, in reading order.

   * Can be called before or after Recognize.

   */

  Boxa* GetWords(Pixa** pixa);

Gets the words as a leptonica-style Boxa, Pixa pair. It can be called before or after Recognize.

 

8. /**

   * Gets the individual connected (text) components (created

   * after pages segmentation step, but before recognition)

   * as a leptonica-style Boxa, Pixa pair, in reading order.

   * Can be called before or after Recognize.

   * Note: the caller is responsible for calling boxaDestroy()

   * on the returned Boxa array and pixaDestroy() on cc array.

   */

  Boxa* GetConnectedComponents(Pixa** cc);

 

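Gets the individual connected text components, which are created after page segmentation but before recognition, as a leptonica-style Boxa, Pixa pair. It can be called before or after Recognize.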

/**

   * Get the given level kind of components (block, textline, word etc.) as a

   * leptonica-style Boxa, Pixa pair, in reading order.

   * Can be called before or after Recognize.

   * If blockids is not nullptr, the block-id of each component is also returned

   * as an array of one element per component. delete [] after use.

   * If paraids is not nullptr, the paragraph-id of each component with its block

   * is also returned as an array of one element per component. delete [] after

   * use.

   * If raw_image is true, then portions of the original image are extracted

   * instead of the thresholded image and padded with raw_padding.

   * If text_only is true, then only text components are returned.

   */

  Boxa* GetComponentImages(const PageIteratorLevel level,

                           const bool text_only, const bool raw_image,

                           const int raw_padding,

                           Pixa** pixa, int** blockids, int** paraids);

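Gets the components of the given level (block, textline, word, etc.) as a leptonica-style Boxa, Pixa pair. It can be called before or after Recognize. If blockids is not nullptr, the block-id of each component is also returned as an array with one element per component; release the array with delete [] after use. If text_only is true, only text components are returned.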

  // Helper function to get binary images with no padding (most common usage).

  Boxa* GetComponentImages(const PageIteratorLevel level,

                           const bool text_only,

                           Pixa** pixa, int** blockids) {

    return GetComponentImages(level, text_only, false, 0, pixa, blockids, nullptr);

  }

 

9: DumpPGM function declaration:

void tesseract::TessBaseAPI::DumpPGM(const char* filename)

Dumps the internal binary image to a PGM file.

10: AnalyseLayout function declaration:

/**

   * Runs page layout analysis in the mode set by SetPageSegMode.

   * May optionally be called prior to Recognize to get access to just

   * the page layout results. Returns an iterator to the results.

   * If merge_similar_words is true, words are combined where suitable for use

   * with a line recognizer. Use if you want to use AnalyseLayout to find the

   * textlines, and then want to process textline fragments with an external

   * line recognizer.

   * Returns nullptr on error or an empty page.

   * The returned iterator must be deleted after use.

   * WARNING! This class points to data held within the TessBaseAPI class, and

   * therefore can only be used while the TessBaseAPI class still exists and

   * has not been subjected to a call of Init, SetImage, Recognize, Clear, End

   * DetectOS, or anything else that changes the internal PAGE_RES.

   */

  PageIterator* AnalyseLayout();

  PageIterator* AnalyseLayout(bool merge_similar_words);

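Runs page layout analysis in the mode set by SetPageSegMode and returns an iterator over the results, or nullptr on error or for an empty page. The iterator must be deleted after use. Note: the iterator points to data held inside the TessBaseAPI object, so it can only be used while that object still exists and has not been subjected to a call to Init, SetImage, Recognize, Clear, End, DetectOS, or anything else that changes the internal PAGE_RES.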
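A typical use of the returned iterator is to walk the page at one level and read each bounding box. The sketch below is an illustration under stated assumptions: LayoutBox and CollectBoxes are hypothetical names, and the function is a template that relies only on the BoundingBox(level, left, top, right, bottom) and Next(level) members that tesseract::PageIterator offers. With the real library, Level would be tesseract::PageIteratorLevel, for example tesseract::RIL_TEXTLINE.

```cpp
#include <vector>

// Hypothetical box type for this sketch (pixel coordinates).
struct LayoutBox {
  int left, top, right, bottom;
};

// Walks an iterator at the given level and collects bounding boxes.
// The iterator is not deleted here: the caller still owns it and must
// delete it, and must not use it after SetImage, Recognize, Clear, etc.
template <class Iter, class Level>
std::vector<LayoutBox> CollectBoxes(Iter* it, Level level) {
  std::vector<LayoutBox> boxes;
  if (it == nullptr) return boxes;  // nullptr: error or empty page
  do {
    LayoutBox b;
    if (it->BoundingBox(level, &b.left, &b.top, &b.right, &b.bottom)) {
      boxes.push_back(b);
    }
  } while (it->Next(level));
  return boxes;
}

// Assumed usage with an initialized tesseract::TessBaseAPI named api:
//   tesseract::PageIterator* it = api.AnalyseLayout();
//   auto lines = CollectBoxes(it, tesseract::RIL_TEXTLINE);
//   delete it;
```

Copying the boxes into a plain vector before deleting the iterator keeps the results usable after the API object's internal state changes.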

11: Recognize function declaration:

/**

   * Recognize the image from SetAndThresholdImage, generating Tesseract

   * internal structures. Returns 0 on success.

   * Optional. The Get*Text functions below will call Recognize if needed.

   * After Recognize, the output is kept internally until the next SetImage.

   */

  int Recognize(ETEXT_DESC* monitor);

int tesseract::TessBaseAPI::Recognize(ETEXT_DESC * monitor)

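Recognizes the image from SetAndThresholdImage, generating Tesseract's internal structures. Returns 0 on success. The call is optional: the Get*Text functions below call Recognize if needed. After Recognize, the output is kept internally until the next SetImage.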

12: RecognizeForChopTest function declaration:

/**

   * Methods to retrieve information after SetAndThresholdImage(),

   * Recognize() or TesseractRect(). (Recognize is called implicitly if needed.)

   */

 

  /** Variant on Recognize used for testing chopper. */

  int RecognizeForChopTest(ETEXT_DESC* monitor);

int tesseract::TessBaseAPI::RecognizeForChopTest(ETEXT_DESC * monitor)

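A variant of Recognize used for testing the chopper. The first comment above introduces the group of methods that retrieve information after SetAndThresholdImage(), Recognize() or TesseractRect(); Recognize is called implicitly if needed.

13: ProcessPages function declaration: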

/**

   * Turns images into symbolic text.

   *

   * filename can point to a single image, a multi-page TIFF,

   * or a plain text list of image filenames.

   *

   * retry_config is useful for debugging. If not nullptr, you can fall

   * back to an alternate configuration if a page fails for some

   * reason.

   *

   * timeout_millisec terminates processing if any single page

   * takes too long. Set to 0 for unlimited time.

   *

   * renderer is responsible for creating the output. For example,

   * use the TessTextRenderer if you want plaintext output, or

   * the TessPDFRenderer to produce searchable PDF.

   *

   * If tessedit_page_number is non-negative, will only process that

   * single page. Works for multi-page tiff file, or filelist.

   *

   * Returns true if successful, false on error.

   */

  bool ProcessPages(const char* filename, const char* retry_config,

                    int timeout_millisec, TessResultRenderer* renderer);

  // Does the real work of ProcessPages.

  bool ProcessPagesInternal(const char* filename, const char* retry_config,

                            int timeout_millisec, TessResultRenderer* renderer);

 

bool tesseract::TessBaseAPI::ProcessPages(const char* filename, const char* retry_config, int timeout_millisec, STRING* text_out)

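Recognizes all pages of the given file, which may be a multi-page TIFF, a plain text list of image filenames, or a single image, and produces the appropriate text according to the parameters tessedit_create_boxfile, tessedit_make_boxes_from_boxes, tessedit_write_unlv and tessedit_create_hocr. ProcessPage is run on every page of the input file. In the older overload shown just above, the result is placed in text_out; the declaration quoted from baseapi.h passes a renderer instead. If tessedit_page_number is non-negative, only that single page is processed. Returns false on error. If timeout_millisec is non-zero, processing is terminated when any single page takes longer than that. If a page fails for some reason, it can be reprocessed with the configuration named by retry_config.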

14: ProcessPage function declaration:

/**

   * Turn a single image into symbolic text.

   *

   * The pix is the image processed. filename and page_index are

   * metadata used by side-effect processes, such as reading a box

   * file or formatting as hOCR.

   *

   * See ProcessPages for descriptions of other parameters.

   */

  bool ProcessPage(Pix* pix, int page_index, const char* filename,

                   const char* retry_config, int timeout_millisec,

                   TessResultRenderer* renderer);

bool tesseract::TessBaseAPI::ProcessPage(Pix* pix, int page_index, const char* filename, const char* retry_config, int timeout_millisec, STRING* text_out)

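Recognizes a single page on behalf of ProcessPages. In the older overload shown just above, the text is placed in text_out. pix is the image to process; filename and page_index are metadata used by side-effect processes, such as reading a box file or formatting as hOCR.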

15: GetIterator function declaration:

ResultIterator * tesseract::TessBaseAPI::GetIterator()

Gets a reading-order iterator over the results of LayoutAnalysis and/or Recognize. Delete it after use.

16: GetMutableIterator function declaration:

 

  /**

   * Get a reading-order iterator to the results of LayoutAnalysis and/or

   * Recognize. The returned iterator must be deleted after use.

   * WARNING! This class points to data held within the TessBaseAPI class, and

   * therefore can only be used while the TessBaseAPI class still exists and

   * has not been subjected to a call of Init, SetImage, Recognize, Clear, End

   * DetectOS, or anything else that changes the internal PAGE_RES.

   */

  ResultIterator* GetIterator();

MutableIterator * tesseract::TessBaseAPI::GetMutableIterator()

Gets a mutable iterator over the results of LayoutAnalysis and/or Recognize. Delete it after use.

17: GetUTF8Text function declaration:

/**

   * The recognized text is returned as a char* which is coded

   * as UTF8 and must be freed with the delete [] operator.

   */

  char* GetUTF8Text();

char * tesseract::TessBaseAPI::GetUTF8Text()

The recognized text is returned as a char* encoded as UTF-8, which must be freed with the delete [] operator. The text string is built from the internal data structures.
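Because the caller owns the returned buffer, it is easy to leak it on an early return. One way to make the ownership explicit is sketched below; TakeUtf8 is a hypothetical helper, not a Tesseract function. It wraps the pointer in std::unique_ptr<char[]>, whose deleter uses delete [], and copies the text into a std::string. It also assumes that a null return is possible and treats it as empty text.

```cpp
#include <memory>
#include <string>

// Takes ownership of a buffer allocated with new[] (as GetUTF8Text returns)
// and copies it into a std::string. A null pointer yields an empty string.
std::string TakeUtf8(char* raw) {
  std::unique_ptr<char[]> owner(raw);  // calls delete [] on scope exit
  return owner ? std::string(owner.get()) : std::string();
}

// Assumed usage with an initialized tesseract::TessBaseAPI named api:
//   std::string text = TakeUtf8(api.GetUTF8Text());
```

The same pattern applies to other functions whose comments say the result must be freed with the delete [] operator.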

18: GetHOCRText function declaration:

/**

   * Make a HTML-formatted string with hOCR markup from the internal

   * data structures.

   * page_number is 0-based but will appear in the output as 1-based.

   * monitor can be used to

   *  cancel the recognition

   *  receive progress callbacks

   * Returned string must be freed with the delete [] operator.

   */

  char* GetHOCRText(ETEXT_DESC* monitor, int page_number);

 

char * tesseract::TessBaseAPI::GetHOCRText(int page_number)

 
