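I. Overall steps
1. Download and install the Tesseract engine (the installation path is needed later for the environment variable).
2. Download the tessdata language files and place them in the tessdata directory of the engine installation.
3. Add the Maven dependency.
4. Write the code.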
II. Steps in detail
1. Official sites (optional)
Official site: UB Mannheim: Digitale Bibliothek
Project page: https://github.com/tesseract-ocr/tesseract/wiki
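2. Tesseract engine installer download (the installation directory is needed for the environment variable)
Pick the installer version that matches your system: https://digi.bib.uni-mannheim.de/tesseract/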
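3. Configure the environment variable
Add the root directory of the Tesseract installation to the PATH environment variable.
Open a terminal and run: tesseract -v. If version information is printed, the installation is complete.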
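If tesseract -v reports that the command cannot be found, the installation directory is probably missing from PATH. The following is only a minimal sketch for checking that from Java; the class name TesseractPathCheck and the helper findOnPath are made up for illustration and are not part of Tesseract or Tess4J.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class TesseractPathCheck {

    /**
     * Returns the entries of a PATH-style string that contain one of the given executable names.
     * (Hypothetical helper, for illustration only.)
     */
    static List<String> findOnPath(String pathValue, String separator, String... exeNames) {
        List<String> hits = new ArrayList<>();
        if (pathValue == null || pathValue.isEmpty()) {
            return hits;
        }
        for (String dir : pathValue.split(Pattern.quote(separator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String exe : exeNames) {
                if (new File(dir, exe).isFile()) {
                    hits.add(dir);
                    break; // one hit per directory is enough
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Check both the Windows and the Unix-style executable name
        List<String> hits = findOnPath(System.getenv("PATH"), File.pathSeparator,
                "tesseract.exe", "tesseract");
        System.out.println(hits.isEmpty()
                ? "tesseract was not found on PATH"
                : "tesseract found in: " + hits);
    }
}
```

If the list is empty, add the installation root to PATH and open a new terminal before trying tesseract -v again.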
4. Download recognition packs for other languages
Language data files: https://tesseract-ocr.github.io/tessdoc/Data-Files
Simplified Chinese pack: https://github.com/tesseract-ocr/tessdata/raw/4.00/chi_sim.traineddata
Traditional Chinese pack: https://github.com/tesseract-ocr/tessdata/raw/4.00/chi_tra.traineddata
Model repositories: https://github.com/tesseract-ocr/tessdata_best and https://github.com/tesseract-ocr/tessdata
Reference article: https://blog.csdn.net/hktkfly6/article/details/104228994
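Before running OCR it can help to confirm that the needed .traineddata files really are in the tessdata directory. The sketch below is only an illustration: it builds download URLs following the pattern of the two links above and lists which languages of a spec such as chi_sim+chi_tra are missing. The class TrainedDataLocator and its methods are hypothetical names, not an existing API.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class TrainedDataLocator {

    // Same URL pattern as the chi_sim / chi_tra links above (4.00 branch of the tessdata repository)
    static final String BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/4.00/";

    /** Builds the download URL for one language code, e.g. "chi_sim". */
    static String downloadUrl(String lang) {
        return BASE_URL + lang + ".traineddata";
    }

    /** Returns the language codes of a spec like "chi_sim+chi_tra" whose files are not in tessdataDir. */
    static List<String> missingLanguages(String tessdataDir, String languageSpec) {
        List<String> missing = new ArrayList<>();
        for (String lang : languageSpec.split("\\+")) {
            if (!new File(tessdataDir, lang + ".traineddata").isFile()) {
                missing.add(lang);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the default Windows install location used elsewhere in this article
        String tessdataDir = args.length > 0 ? args[0]
                : "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata";
        for (String lang : missingLanguages(tessdataDir, "chi_sim+chi_tra")) {
            System.out.println("Missing " + lang + ", download it from " + downloadUrl(lang));
        }
    }
}
```

The program prints nothing when both language files are already in place.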
5. Maven coordinates
<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.4</version>
</dependency>
6. Code
Example 1:
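import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class ORCreadPicWord {

    /**
     * Recognizes the text in an image file.
     *
     * @param srcImage full path of the image
     * @param ZH_CN    whether to recognize Simplified Chinese
     * @return the recognized text, or an error message if recognition fails
     */
    public static String findORC(String srcImage, boolean ZH_CN) {
        try {
            File srcimage = new File(srcImage);
            if (!srcimage.exists()) {
                return "Image does not exist";
            }
            BufferedImage textImage = ImageIO.read(srcimage);
            if (textImage == null) {
                // ImageIO.read returns null when no registered reader can decode the file
                return "Unsupported image format";
            }
            ITesseract instance = new Tesseract(); // available once the tess4j dependency is imported
            // Full path of the tessdata directory that holds the *.traineddata language files
            instance.setDatapath("C:\\Program Files (x86)\\Tesseract-OCR\\tessdata");
            if (ZH_CN) {
                instance.setLanguage("chi_sim"); // Simplified Chinese; otherwise the default (eng) is used
            }
            return instance.doOCR(textImage);
        } catch (IOException | TesseractException e) {
            e.printStackTrace();
            return "Unknown error";
        }
    }

    public static void main(String[] args) {
        // Note: a text cursor captured at the end of the text in the image can also affect recognition
        String orc = findORC("C:\\Users\\Administrator\\Desktop\\w\\test.jpg", true);
        System.out.println(orc);
    }
}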
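The tessdata path in Example 1 is hard-coded, so it only works on a machine with that exact layout. One possible alternative, shown here purely as a sketch, is to pick the first usable value from an explicit argument, an environment variable, and a fallback. The class DatapathResolver is a made-up helper, and using the TESSDATA_PREFIX variable as the source is an assumption about your setup rather than something Tess4J does for you.

```java
class DatapathResolver {

    /** Returns the first candidate that is neither null nor blank, or null if there is none. */
    static String resolve(String... candidates) {
        for (String candidate : candidates) {
            if (candidate != null && !candidate.trim().isEmpty()) {
                return candidate.trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String datapath = resolve(
                args.length > 0 ? args[0] : null,                       // 1. explicit command-line argument
                System.getenv("TESSDATA_PREFIX"),                       // 2. environment variable (assumed to be set by you)
                "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata");    // 3. fallback from Example 1
        System.out.println("Using tessdata directory: " + datapath);
        // The result would then be passed to instance.setDatapath(datapath)
    }
}
```

With this approach the same jar can run on a developer machine and on a server without recompiling, as long as one of the three sources points at a valid tessdata directory.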
Example 2:
OcrService
import net.sourceforge.tess4j.TesseractException;

import java.awt.image.BufferedImage;

public interface OcrService {
    String recognizeText(BufferedImage image) throws TesseractException;
}
OcrServiceImpl
import com.icbc.service.OcrService;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.springframework.stereotype.Service;

import java.awt.image.BufferedImage;

@Service
public class OcrServiceImpl implements OcrService {

    private final ITesseract tesseract;

    public OcrServiceImpl() {
        this.tesseract = new Tesseract();
        // Full path of the tessdata directory that holds the *.traineddata language files
        this.tesseract.setDatapath("C:\\Program Files (x86)\\Tesseract-OCR\\tessdata");
    }

    @Override
    public String recognizeText(BufferedImage image) throws TesseractException {
        tesseract.setLanguage("chi_sim");
        // To recognize Simplified and Traditional Chinese together, join the language codes with "+"
        // tesseract.setLanguage("chi_sim+chi_tra");
        return tesseract.doOCR(image);
    }
}
OcrController
import com.icbc.service.OcrService;
import net.sourceforge.tess4j.TesseractException;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.IOException;

@RestController
@RequestMapping("/ocr")
public class OcrController {

    @Autowired
    private OcrService ocrService;

    @PostMapping("/recognize")
    public String recognizeText(@RequestParam("file") MultipartFile file) {
        try {
            BufferedImage image = ImageIO.read(file.getInputStream());
            if (image == null) {
                // ImageIO.read returns null when the upload is not a decodable image
                return "Image processing error: unsupported image format";
            }
            return ocrService.recognizeText(image);
        } catch (IOException | TesseractException e) {
            return "Image processing error: " + e.getMessage();
        }
    }
}
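One thing worth knowing about the controller: ImageIO.read does not throw for an upload that is not an image; it returns null when no registered reader can decode the data. The small sketch below demonstrates that behaviour with the standard library only; ImageDecodeCheck, isDecodable and tinyPng are hypothetical names used for the demonstration.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ImageDecodeCheck {

    /** Returns true if ImageIO can decode the bytes into a BufferedImage. */
    static boolean isDecodable(byte[] bytes) {
        ImageIO.setUseCache(false); // keep everything in memory, no temp files
        try {
            return ImageIO.read(new ByteArrayInputStream(bytes)) != null;
        } catch (IOException e) {
            return false;
        }
    }

    /** Creates a tiny 2x2 PNG in memory, used only as sample input. */
    static byte[] tinyPng() {
        ImageIO.setUseCache(false);
        BufferedImage img = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(img, "png", out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println("PNG bytes decodable:  " + isDecodable(tinyPng()));
        System.out.println("text bytes decodable: " + isDecodable("not an image".getBytes()));
    }
}
```

A check like isDecodable before calling the OCR service lets the endpoint answer with a clear message instead of failing on a null image.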
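Testing with Postman:
Send a POST request to /ocr/recognize and set the headers and body: choose form-data for the body and add a key named file (type File) containing the image.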
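The same request can also be sent without Postman. Below is a rough sketch using the java.net.http client (Java 11+) that builds the multipart/form-data body by hand with the part name file expected by the controller. The class OcrClientSketch, the helper multipartBody, and the address http://localhost:8080 are assumptions for illustration; adjust the host and port to wherever your application runs.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class OcrClientSketch {

    /** Builds a multipart/form-data body with a single part named "file". */
    static byte[] multipartBody(String boundary, String fileName, byte[] content) {
        byte[] head = ("--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n\r\n").getBytes(StandardCharsets.UTF_8);
        byte[] tail = ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(content, 0, content.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("Usage: java OcrClientSketch <image file>");
            return;
        }
        Path image = Path.of(args[0]);
        String boundary = "----ocr" + System.nanoTime();
        // Assumed address: a local Spring Boot application on its default port
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/ocr/recognize"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(
                        multipartBody(boundary, image.getFileName().toString(), Files.readAllBytes(image))))
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The Content-Type header has to carry the same boundary string that is used inside the body, which is the part Postman normally fills in automatically.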