Recognizing Text in Images with Tesseract in Java

代码的代

Last modified: 2024-01-18 11:05:07

Category: File text recognition and extraction. Tags: java, programming language, ocr

First published: 2024-01-18 10:59:26

Article link: https://blog.csdn.net/weixin_46044938/article/details/135668695

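Tesseract download page
leptonica download page
Language training data download page
Official manual

No separate installation is needed: just add the Maven dependency to your project. You don't even have to configure environment variables, so don't waste time fiddling with that kind of setup. (This holds on Windows, where the Tess4J jar bundles the native Tesseract and Leptonica libraries; on Linux or macOS the native libraries usually still have to be installed.)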

<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>5.8.0</version>
</dependency>

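Put the language training data (for example chi_sim.traineddata) in a folder on any drive. The test code in this article uses D:\tessdata.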
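Before running OCR it can help to confirm that the training data is really where you think it is, because a missing file tends to show up only as a recognition failure later. Below is a minimal sketch that uses only the JDK. The class `TessdataCheck` and the method `hasTrainedData` are hypothetical names made up for this illustration and are not part of Tess4J; the sketch also assumes a single language code such as `chi_sim`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Tess4J.
final class TessdataCheck {

    // Returns true if "<language>.traineddata" exists as a regular file in tessdataDir.
    // Assumes a single language code (for example "chi_sim").
    static boolean hasTrainedData(Path tessdataDir, String language) {
        return Files.isRegularFile(tessdataDir.resolve(language + ".traineddata"));
    }

    public static void main(String[] args) {
        // Folder used by the test code in this article; change it to your own location.
        Path dir = Paths.get("D:\\tessdata");
        System.out.println("chi_sim available: " + hasTrainedData(dir, "chi_sim"));
    }
}
```

If the check prints false, fix the folder or the file name first, before looking at the recognition code itself.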
Test code (Java)

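import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

/**
 * Recognizes the text in an image.
 *
 * @param sb                 buffer that the recognized text is appended to
 * @param file               image file to recognize
 * @param isIncreaseContrast whether to apply contrast enhancement before recognition
 * @return the accumulated content of sb, or an empty string if recognition fails
 */
public static String getImageContentStr(StringBuilder sb, File file, boolean isIncreaseContrast) {
    try {
        ITesseract tesseract = new Tesseract();
        // Folder that contains the language training data
        tesseract.setDatapath("D:\\tessdata");
        // Recognition language (Simplified Chinese)
        tesseract.setLanguage("chi_sim");
        // Decode the file into a BufferedImage first; this avoids recognition
        // problems caused by the way the image is encoded
        BufferedImage bufferedImage = ImageIO.read(file);
        if (bufferedImage == null) {
            // ImageIO.read returns null when no registered reader can decode the file
            throw new IOException("Unsupported or unreadable image: " + file);
        }
        if (isIncreaseContrast) {
            // Contrast enhancement; increaseContrast is the author's own helper
            // and is not shown in this article
            bufferedImage = increaseContrast(bufferedImage);
        }

        // Recognize the image content
        String text = tesseract.doOCR(bufferedImage);
        sb.append(text);
    } catch (TesseractException | IOException e) {
        System.err.println("Image recognition failed: " + e);
        e.printStackTrace();
        return "";
    }
    return sb.toString();
}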
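The method above calls `increaseContrast`, but the article does not show its implementation. As a stand-in, here is a minimal sketch of one common approach: a linear contrast stretch around mid-gray, using only the JDK. This is an assumption about what such a helper might look like, not the author's actual algorithm; the class name `ContrastSketch` and the default factor of 1.5 are invented for this example.

```java
import java.awt.image.BufferedImage;

// Illustrative stand-in for the increaseContrast helper, which the article does not show.
final class ContrastSketch {

    // Applies out = (in - 128) * factor + 128 to each RGB channel, clamped to 0..255.
    // A factor above 1.0 increases contrast; 1.0 leaves the pixels unchanged.
    static BufferedImage increaseContrast(BufferedImage src, double factor) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = stretch((rgb >> 16) & 0xFF, factor);
                int g = stretch((rgb >> 8) & 0xFF, factor);
                int b = stretch(rgb & 0xFF, factor);
                out.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        return out;
    }

    // Single-argument overload matching the call in the article; 1.5 is an arbitrary default.
    static BufferedImage increaseContrast(BufferedImage src) {
        return increaseContrast(src, 1.5);
    }

    private static int stretch(int value, double factor) {
        int stretched = (int) Math.round((value - 128) * factor + 128);
        return Math.max(0, Math.min(255, stretched));
    }
}
```

Whether contrast enhancement helps depends on the image, which is presumably why the original method takes it as an optional flag. It is worth comparing the recognized text with and without it.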

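public static void main(String[] args) {
    File file = new File("F:\\桌面\\休闲\\图片识别5.png");
    StringBuilder sb = new StringBuilder();
    String imageContentStr = FileManageService.getImageContentStr(sb, file, true);
    // getImageContentStr returns an empty string when recognition fails
    if (imageContentStr.isEmpty()) {
        System.out.println("No text recognized.");
    } else {
        System.out.println("Image recognized successfully: " + imageContentStr);
    }
}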
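One thing you may notice in the printed result: for Chinese text, the recognized string sometimes contains stray spaces between characters. If that happens with your images, a small post-processing step can remove them. The sketch below is an optional extra and not part of the original code; `OcrTextCleanup` and `removeSpacesBetweenHan` are hypothetical names. It uses only `String.replaceAll` with the `\p{IsHan}` script class from `java.util.regex`.

```java
// Hypothetical post-processing helper for illustration only; not part of Tess4J.
final class OcrTextCleanup {

    // Removes spaces and tabs that sit directly between two Han (Chinese) characters.
    // Whitespace next to Latin letters, digits, or line breaks is left alone.
    static String removeSpacesBetweenHan(String text) {
        return text.replaceAll("(?<=\\p{IsHan})[ \\t]+(?=\\p{IsHan})", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of the compiler's encoding.
        String raw = "\u4f60 \u597d \u4e16 \u754c abc def";
        System.out.println(removeSpacesBetweenHan(raw));
    }
}
```

You would call it on the string returned by `getImageContentStr` before printing or storing it. Line breaks are preserved, so the layout of the recognized text stays the same.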