Chinese character recognition using Tesseract OCR

I have been using the Tesseract 3.0.2 OCR SDK to extract text from images. When I pass an image containing Chinese text through the OCR, Tesseract does not return Chinese characters; instead I get digits and English letters. I need the Chinese characters exactly as they appear in the image.

How can I achieve this? Is there any way to get Chinese characters rather than other characters?

Any help is appreciated.




Accepted answer (9 votes)

You need to download the Chinese trained data (a file like chi_sim.traineddata) and add it to your tessdata folder.

You can download the file from https://code.google.com/p/tesseract-ocr/downloads/detail?name=chi_sim.traineddata.gz

Then use it like this:

Tesseract *tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"chi_sim"];

If you have any problems, you can download my Tesseract experiment (with Chinese language support) from https://github.com/aryansbtloe/ExperimentWithTesseract.git

I have tested this. Hope you find it useful.
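
The download linked above ends in .gz, so it presumably has to be decompressed before the initializer can find chi_sim.traineddata in the tessdata folder. Below is a minimal Python sketch of that preparation step, run on the development machine rather than on the device. It assumes the download is a plain gzip file, as its name suggests; the helper names install_traineddata and has_language are made up for this illustration and are not part of Tesseract.

```python
import gzip
import shutil
from pathlib import Path


def install_traineddata(gz_path, tessdata_dir):
    """Decompress <lang>.traineddata.gz into tessdata_dir and return the new path.

    Hypothetical helper: assumes a plain gzip file, not a tar archive.
    """
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path.name}")
    tessdata_dir = Path(tessdata_dir)
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    # Dropping the last suffix turns "chi_sim.traineddata.gz"
    # into "chi_sim.traineddata".
    target = tessdata_dir / gz_path.stem
    with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def has_language(tessdata_dir, language):
    """Return True if <language>.traineddata is present in tessdata_dir."""
    return (Path(tessdata_dir) / f"{language}.traineddata").is_file()


# Example call (the paths are placeholders):
#   install_traineddata("chi_sim.traineddata.gz", "tessdata")
#   has_language("tessdata", "chi_sim")
```

A check like has_language before building the app can help catch the case from the question, where the Chinese data file is missing and only digits and English letters come back.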

Thanks, it works :-) – Nishant Tyagi, May 16 '13 at 9:11
 
Alok, I tried your sample and it works well on about half of the simplified Chinese characters I tried. For the rest, it either recognizes a compound character as several separate characters, each representing one component of the compound, or gets it totally wrong. Do you know of any way to improve the recognition accuracy? – CodePlumber, Jun 14 at 22:11
