Environment:
Java: JDK 1.8
OS: Windows 10
Tesseract: 4.1.0
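Step 1: Install and configure Tesseract
a. Download the installer: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.1.0.20190314.exe
Newer versions can be downloaded from the same directory.
b. Double-click the downloaded file to start the installation.
Install it to C:\Program Files\Tesseract-OCR, and make sure to select the Chinese language pack for download during installation.
c. Add C:\Program Files\Tesseract-OCR to the system PATH variable.
d. Create a new system variable named TESSDATA_PREFIX with the value C:\Program Files\Tesseract-OCR\tessdata.
e. Open a command prompt and run the following command:
>tesseract --list-langs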
List of available languages (11):
chi_sim
...
If chi_sim appears in the list as shown above, Tesseract has been installed successfully.
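The same check can also be done from Java, which is handy before wiring Tesseract into an application. The following is only a minimal sketch: the class name TesseractLangCheck and its two helper methods are made up for illustration, and it assumes the tesseract executable is reachable through PATH (step c) and that every line after the "List of available languages" header is one language code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Tesseract or of the original article.
class TesseractLangCheck {

    // Parses the text printed by "tesseract --list-langs".
    // Assumption: the header line starts with "List of available languages"
    // and every other non-empty line is a language code.
    static List<String> parseLanguages(String output) {
        List<String> langs = new ArrayList<String>();
        for (String raw : output.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("List of available languages")) {
                continue;
            }
            langs.add(line);
        }
        return langs;
    }

    // Runs "tesseract --list-langs" and returns everything it printed.
    // Assumption: "tesseract" can be resolved through PATH.
    static String runListLangs() throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("tesseract", "--list-langs");
        pb.redirectErrorStream(true); // merge stderr into stdout
        Process process = pb.start();
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        process.waitFor();
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        List<String> langs = parseLanguages(runListLangs());
        if (langs.contains("chi_sim")) {
            System.out.println("chi_sim is installed");
        } else {
            System.out.println("chi_sim is missing; found: " + langs);
        }
    }
}
```

Keeping the parsing in its own method means it can be tested without Tesseract installed. If tesseract prints warnings before the list, they would be picked up as language codes, so treat the result as a quick sanity check only.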
Step 2: Call Tesseract from Java
Write the following test class, TesseractUtil.java:
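package com.bry.tesseract;

import java.io.File;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class TesseractUtil {

    /**
     * Runs the tesseract command-line tool on an image.
     * Returns the path of the generated .txt file, or null if recognition failed.
     */
    public static String identifyTextFromPicture(String tessPath, String sourceFileName, String savePath, String language) {
        ProcessBuilder pb = new ProcessBuilder();
        pb.directory(new File(tessPath));
        // Point tesseract at the language data directory for this process only.
        pb.environment().put("TESSDATA_PREFIX", tessPath + File.separatorChar + "tessdata");
        // Merge stderr into stdout so that only one stream has to be drained.
        pb.redirectErrorStream(true);

        File sourceFile = new File(sourceFileName);
        // Output base name = image file name without its extension; tesseract appends ".txt".
        String name = sourceFile.getName();
        int dot = name.lastIndexOf('.');
        String ocrResultFileName = dot > 0 ? name.substring(0, dot) : name;
        String outputBase = savePath + File.separatorChar + ocrResultFileName;

        List<String> cmd = new ArrayList<String>();
        cmd.add(pb.directory().getAbsolutePath() + File.separatorChar + "tesseract");
        cmd.add(sourceFile.getAbsolutePath());
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(language);
        pb.command(cmd);

        try {
            Process process = pb.start();
            // Drain the output so the process cannot block on a full pipe buffer.
            try (InputStream in = process.getInputStream()) {
                byte[] buffer = new byte[4096];
                while (in.read(buffer) != -1) {
                    // Output is discarded; print it here when debugging.
                }
            }
            if (process.waitFor() == 0) {
                return outputBase + ".txt";
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    public static void main(String[] args) {
        String resultFile = TesseractUtil.identifyTextFromPicture(
                "C:/Program Files/Tesseract-OCR", "D:/temp/test.png", "D:/temp", "chi_sim");
        System.out.println(resultFile);
    }
}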
After running it, open D:/temp/test.txt to see the recognition result.
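To use the recognized text inside the program rather than opening the file by hand, the result file can be read back into a string. This is a minimal sketch that assumes the result file is UTF-8 encoded, which is what Tesseract normally writes. OcrResultReader and readResult are illustrative names, not part of the class above. The charset is passed explicitly because the JDK 1.8 default charset on a Chinese Windows system is usually not UTF-8, and relying on it would garble the Chinese text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class for illustration only.
class OcrResultReader {

    // Reads the whole OCR result file and decodes it as UTF-8 text.
    static String readResult(Path resultFile) throws IOException {
        byte[] bytes = Files.readAllBytes(resultFile);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Path taken from the example above; adjust it to your own output file.
        System.out.println(readResult(Paths.get("D:/temp/test.txt")));
    }
}
```

In a real application the path would typically be the value returned by identifyTextFromPicture, after checking that it is not null.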