Chinese character recognition using Tesseract OCR

Latest recommended article published on 2023-07-10 14:32:55

newHung

Category: linux. Tags: internationalization

I have been using the Tesseract 3.0.2 OCR SDK to extract text from images. When I pass an image containing Chinese text through the OCR, Tesseract does not return the Chinese characters; instead I get digits and English letters. I need the Chinese characters exactly as they appear in the image.

How can I achieve this? Is there any way to obtain Chinese characters rather than any other characters?

Any help is appreciated.

Accepted answer (9 votes)

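You need to download the Chinese trained data (a file named chi_sim.traineddata for Simplified Chinese) and add it to your tessdata folder. The download is gzip-compressed (.gz), so decompress it before copying it in.

Download the file from https://code.google.com/p/tesseract-ocr/downloads/detail?name=chi_sim.traineddata.gz

Then initialize Tesseract with that language, like this: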

Tesseract *tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"chi_sim"];
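
If you still get digits and English letters, one thing worth checking is whether the decompressed trained data file is really in the folder you pass as the data path. Here is a minimal sketch of such a check using only Foundation. The helper names TrainedDataPath and HasTrainedData are hypothetical (they are not part of the Tesseract wrapper), and how the "tessdata" path gets resolved in your app is an assumption you will need to adapt.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the Tesseract wrapper): builds the path
// where the trained data for `language` is expected inside `tessdataDir`,
// e.g. "tessdata" + "chi_sim" -> "tessdata/chi_sim.traineddata".
static NSString *TrainedDataPath(NSString *tessdataDir, NSString *language) {
    NSString *fileName = [language stringByAppendingPathExtension:@"traineddata"];
    return [tessdataDir stringByAppendingPathComponent:fileName];
}

// Hypothetical helper: returns YES only if the decompressed .traineddata
// file exists at the expected location. A leftover .gz file does not count.
static BOOL HasTrainedData(NSString *tessdataDir, NSString *language) {
    NSString *path = TrainedDataPath(tessdataDir, language);
    return [[NSFileManager defaultManager] fileExistsAtPath:path];
}
```

For example, calling HasTrainedData(@"tessdata", @"chi_sim") before the initWithDataPath:language: line and logging the result of TrainedDataPath when it returns NO makes a missing or still-compressed file easy to spot.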

If you have any problems, you can download my experiment with Tesseract (with Chinese language support) from https://github.com/aryansbtloe/ExperimentWithTesseract.git

I have tested this one. I hope you find it useful.

edited Sep 28 '13 at 17:08 by Nishant Tyagi

answered May 16 '13 at 8:43 by Alok Singh

Thanks, it works :-) – Nishant Tyagi May 16 '13 at 9:11

Alok, I tried your sample and it works well on about half of the Simplified Chinese characters I tried. For the rest, it either recognizes a compound character as several separate characters, each representing one component of the compound, or gets it totally wrong. Do you know of any method to improve the recognition accuracy? – CodePlumber Jun 14 at 22:11