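Abstract
Extracting text from images with Java + OCR. On an e-commerce platform, merchants' documents such as business licenses and ID cards may need to be reviewed, and the numbers printed on those documents often have to be extracted. With only a few merchants, the review can be done by hand and the numbers copied out manually; as the number of merchants grows, manual review becomes very labor-intensive. At that point it is natural to ask whether the text can be extracted from the images automatically. This article is a first look at OCR-based text extraction from images.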
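mall4j, a hands-on Spring Boot e-commerce project: https://gitee.com/gz-yami/mall4j
Java + Tesseract OCR: a first look at extracting text from images
One option is to call a third-party API for image text extraction, for example the Baidu API.
Here we do the extraction locally instead, using Java + Tesseract.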
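Environment setup
Windows 10, JDK 1.8, IDEA, Maven
Download Tesseract 3.02 and the Chinese language data
A Java environment (JDK) must be installed first.
1. Download and install Tesseract 3.02
2. Download the Simplified Chinese language data file chi_sim.traineddata
Baidu Netdisk:
Link: https://pan.baidu.com/s/1VYx6zbKAacYkiUY5UvogEQ
Extraction code: hsr7
3. Put chi_sim.traineddata in the tessdata directory
4. Open the digits file in the \tessdata\configs directory and change
tessedit_char_whitelist 0123456789-. to tessedit_char_whitelist 0123456789x.
5. Configure the environment variable (the Tesseract 3.02 installation directory)
6. Test whether the installation succeeded
Press Win+R and open a cmd command prompt
Type tesseract
If the following output appears, the installation is complete: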
Usage:tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile…]
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before anyconfigfile.
Single options:
-v --version: version info
--list-langs: list available languages for tesseract engine
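If typing tesseract in cmd does not print the usage text above, a common cause is that the installation directory is not visible to the shell. The following is a minimal sketch of how that could be checked from Java. It assumes that the environment variable configured in step 5 is PATH and that the executable is named tesseract.exe (or tesseract on other systems); the class name TesseractLocator and the method findOnPath are made up for this illustration and are not part of Tesseract or any library.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper: looks for the tesseract executable in a PATH-style string.
final class TesseractLocator {
    // Assumed executable names: Windows first, then Unix-like systems.
    private static final String[] NAMES = {"tesseract.exe", "tesseract"};

    // Returns the first matching file found in the given PATH value, if any.
    static Optional<Path> findOnPath(String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : NAMES) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip malformed PATH entries (for example, entries with stray quotes).
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> exe = findOnPath(System.getenv("PATH"));
        System.out.println(exe.map(p -> "Found: " + p).orElse("tesseract is not on PATH"));
    }
}
```

Running the main method prints the location of the executable when one is found; otherwise it reports that tesseract is not on PATH, which usually means step 5 needs to be revisited and the cmd window reopened.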
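Extracting text
tesseract C:/pic/1.png C:/pic/1 -l chi_sim
The image is on the C: drive; the command generates 1.txt, which contains the extracted text
-l: this is not the digit 1 but the lowercase form of the letter L
chi_sim: the Simplified Chinese language data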
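The same command can also be launched from Java with the standard ProcessBuilder class. The following is a minimal sketch, assuming that tesseract is on the PATH and that the output file is UTF-8 text; the class name TesseractCli and its methods buildCommand and recognize are made up for this illustration. The argument order follows the usage text: the -l option comes before any config file name.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the tesseract command line shown above.
final class TesseractCli {
    // Builds: tesseract <image> <outputBase> [-l <lang>] [configfile...]
    static List<String> buildCommand(String image, String outputBase,
                                     String lang, String... configs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outputBase);
        if (lang != null && !lang.isEmpty()) {
            cmd.add("-l");
            cmd.add(lang);
        }
        // Config file names (for example "digits") must come last.
        cmd.addAll(Arrays.asList(configs));
        return cmd;
    }

    // Runs tesseract and returns the content of <outputBase>.txt.
    static String recognize(String image, String outputBase, String lang)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase, lang))
                .inheritIO() // show tesseract's own messages in the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        // Assumption: the generated text file is UTF-8 encoded.
        byte[] bytes = Files.readAllBytes(Paths.get(outputBase + ".txt"));
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Same paths and language as the command-line example.
        System.out.println(recognize("C:/pic/1.png", "C:/pic/1", "chi_sim"));
    }
}
```

To recognize digits only with the digits config edited in step 4, the command could be built as buildCommand("C:/pic/1.png", "C:/pic/1", null, "digits"), which corresponds to tesseract C:/pic/1.png C:/pic/1 digits.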
Add the dependencies to the Java project
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>3.2.1</version>
    <exclusions>
        <exclusion>
            <groupId>com.sun.jna</groupId>
            <artifactId>jna</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>net.java.dev.jna</groupId>
    <artifactId>jna</artifactId>
    <version>4.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/javax.media/jai_imageio -->
<dependency>
    <groupId>javax.media</groupId>
    <artifactId>jai_imageio</artifactId>
    <version>1.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.swinglabs/swingx -->
<dependency>
    <groupId>org.swinglabs</groupId>
    <artifactId>swingx</artifactId>
    <version>1.6.1</version>
</dependency>
Code
ImageIOHelper.class
package com.example.excel.excelexportodb.Utils;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import java.util.Locale;
import javax