Tesseract in Java: Recognizing Chinese, Digits, and Letters by Using Multiple Languages

weixin_44214515

Last modified 2023-08-22 13:40:51

Tags: java, algorithm, ocr

First published 2021-10-20 11:59:58

Article link: https://blog.csdn.net/weixin_44214515/article/details/120863352

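Summary: When running OCR with Tesseract from Java, the 'chi_sim' language alone does not recognize digits completely, while 'eng' alone cannot recognize Chinese correctly. The fix is to set the language parameter to 'eng+chi_sim', which recognizes Chinese, English, and digits together. The sample code shows how to set Tesseract's data path and language to get this mixed recognition.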

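When doing OCR from Java with Tesseract, using chi_sim alone leaves digits only partially recognized, and using eng alone gets the Chinese text wrong. So how can digits, Chinese characters, and Latin letters all be recognized in the same image?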

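On the command line, Tesseract accepts the -l option to specify the language, and several languages can be joined with a plus sign, for example: -l deu+eng. The Java library supports the same syntax through setLanguage. The code is as follows: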

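import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;

// TempDirectory is not part of Tess4J; it is presumably a helper from the
// author's own project that returns the folder containing "tessdata".
File tempFolder = TempDirectory.location();
File trainDataHome = new File(tempFolder, "tessdata");

ITesseract tesseract = new Tesseract();

// Load two languages at once: English plus Simplified Chinese.
// The tessdata folder must contain eng.traineddata and chi_sim.traineddata.
tesseract.setLanguage("eng+chi_sim");
tesseract.setDatapath(trainDataHome.getAbsolutePath());

// doOCR throws TesseractException, so the enclosing method must catch or declare it.
String content = tesseract.doOCR(new File("D:\\test\\4-0-0.png"));
System.out.println(content);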
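
A combined setting such as "eng+chi_sim" only works if the data path holds a traineddata file for every language named in it. The following is a minimal sketch, using only the Java standard library, of how the language string could be built and checked before it is passed to setLanguage. The class TessLanguages and its two methods are hypothetical helpers written for this illustration, not part of Tess4J, and the default "tessdata" folder name is an assumption.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tess4J.
class TessLanguages {

    // Joins language codes with '+', the separator Tesseract uses
    // for multiple languages (as in "-l deu+eng").
    static String join(String... codes) {
        return String.join("+", codes);
    }

    // Returns the codes in spec that have no <code>.traineddata file
    // inside tessdataDir. An empty list means every language is present.
    static List<String> missingTrainedData(File tessdataDir, String spec) {
        List<String> missing = new ArrayList<>();
        for (String code : spec.split("\\+")) {
            File data = new File(tessdataDir, code + ".traineddata");
            if (!data.isFile()) {
                missing.add(code);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed default folder name; pass the real tessdata path as args[0].
        File tessdata = new File(args.length > 0 ? args[0] : "tessdata");
        String spec = join("eng", "chi_sim");
        System.out.println("language = " + spec);
        System.out.println("missing  = " + missingTrainedData(tessdata, spec));
    }
}
```

If the missing list is not empty, the corresponding traineddata files need to be placed in the tessdata folder before calling doOCR with that language string.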