Setting up a tess4j OCR environment for Java on macOS (including Tesseract installation and configuration)

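I have recently been collecting material from PDFs, but most of them are scanned copies, so the text cannot be copied directly. Typing it out by hand is tedious, so I looked for a technical way around it. The trial versions of a few OCR programs I tried gave impressive results, so I decided to set up a Java OCR environment to see whether it could reduce the workload.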

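OCR (Optical Character Recognition)

OCR is the process in which an electronic device (such as a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition methods to translate those shapes into computer text. In other words, for printed characters, the text of a paper document is first converted optically into a black-and-white bitmap image, and recognition software then converts the text in that image into a text format that word-processing software can edit further.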
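To make "detecting patterns of dark and light" concrete, here is a toy sketch that turns grayscale values into a black-and-white bitmap with a fixed threshold. It is only an illustration of the idea and not how Tesseract itself binarizes images; the class name BinarizeSketch and the threshold of 128 are my own assumptions.

```java
class BinarizeSketch {

    /** Maps each grayscale value (0-255) to 1 (dark) or 0 (light) using a fixed threshold. */
    static int[][] binarize(int[][] gray, int threshold) {
        int[][] out = new int[gray.length][];
        for (int y = 0; y < gray.length; y++) {
            out[y] = new int[gray[y].length];
            for (int x = 0; x < gray[y].length; x++) {
                // Values below the threshold count as ink, everything else as paper
                out[y][x] = gray[y][x] < threshold ? 1 : 0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A 3x3 "image" with a dark vertical stroke in the middle column
        int[][] gray = { {250, 30, 240}, {245, 20, 235}, {250, 25, 245} };
        for (int[] row : binarize(gray, 128)) {
            StringBuilder sb = new StringBuilder();
            for (int v : row) {
                sb.append(v == 1 ? '#' : '.');
            }
            System.out.println(sb);
        }
    }
}
```

Running main prints three rows of .#. which is the kind of dark/light pattern a recognizer then has to match against character shapes.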

Tess4J

Tess4J is a Java JNA wrapper for the Tesseract OCR API. It lets Java code use Tesseract OCR by calling the Tess4J API. Supported formats: TIFF, JPEG, GIF, PNG, BMP, and PDF.

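The Tesseract OCR engine

Development started at HP Labs in 1985, and by 1995 Tesseract had become one of the three most accurate recognition engines in the OCR industry. HP soon decided to give up its OCR business, however, and Tesseract was shelved. Some years later HP realized that, rather than leaving Tesseract on the shelf, it would be better to contribute it to the open-source community and give it a new life. In 2005 Tesseract was handed to the Information Science Research Institute at the University of Nevada, Las Vegas (UNLV), and Google subsequently took on improving and optimizing it.

Tesseract is now published as an open-source project, originally on Google Code and now on GitHub. Combined with the Leptonica image-processing library, it can read images in a variety of formats and convert them to text in more than 60 languages. We can also keep training our own data so that the image-to-text conversion keeps improving. If a team needs to go deeper, it can use Tesseract as a base for developing an OCR engine that fits its own requirements.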

Tesseract: https://github.com/tesseract-ocr/tesseract

Tess4J: https://github.com/nguyenq/tess4j

Tess4J configuration

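  1. Add the dependency to the pom file (with the test scope, the library is only visible to code under src/test/java)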
<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.4.1</version>
    <scope>test</scope>
</dependency>
  2. A simple Java example
package com.chl.test.orc;

import java.io.File;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class Tess4jOcrTest {

	public static void main(String[] args) {
		String basePath = "/Users/chenhailong/Desktop/pdf/";
		test1(basePath + "test.png");
	}
	
	/**
	 * Runs OCR on the image at the given path and prints the recognized text.
	 * @param path path to the image file
	 */
	public static void test1(String path) {
		File file = new File(path);
		ITesseract it = new Tesseract();
		try {
			String result = it.doOCR(file);
			System.out.println("OCR result: " + result);
		} catch (TesseractException e) {
			// Print the failure; a real application should handle or rethrow it
			e.printStackTrace();
		}
	}
}
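
Before handing a file to doOCR it can help to check that the file exists and has one of the formats listed earlier. The following is a minimal sketch using only the JDK; the class name OcrInputCheck, its method names, and the short aliases "tif" and "jpg" are my own assumptions and not part of Tess4J.

```java
import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class OcrInputCheck {

    // Extensions for the formats listed above; "tif" and "jpg" are aliases added as an assumption
    static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList(
            "tif", "tiff", "jpg", "jpeg", "gif", "png", "bmp", "pdf"));

    /** Returns true if the file name ends in one of the supported extensions (case-insensitive). */
    static boolean hasSupportedExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(ext);
    }

    /** Returns true if the file exists, is readable, and has a supported extension. */
    static boolean isReadableOcrInput(File file) {
        return file.isFile() && file.canRead() && hasSupportedExtension(file.getName());
    }

    public static void main(String[] args) {
        File f = new File(args.length > 0 ? args[0] : "test.png");
        System.out.println(f + " usable as OCR input: " + isReadableOcrInput(f));
    }
}
```

A check like this only looks at the file name, so it will not catch a file whose content does not match its extension.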

  3. Right-click and run the code above; the following error appears
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/chenhailong/.m2/repository/org/slf4j/slf4j-log4j12/1.7.25/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/chenhailong/.m2/repository/ch/qos/logback/logback-classic/1.2.3/logback-classic-1.2.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Detected both log4j-over-slf4j.jar AND bound slf4j-log4j12.jar on the class path, preempting StackOverflowError. 
SLF4J: See also http://www.slf4j.org/codes.html#log4jDelegationLoop for more details.
Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.slf4j.impl.StaticLoggerBinder.<init>(StaticLoggerBinder.java:72)
	at org.slf4j.impl.StaticLoggerBinder.<clinit>(StaticLoggerBinder.java:45)
	at org.slf4j.LoggerFactory.bind(LoggerFactory.java:150)
	at org.slf4j.LoggerFactory.performInitialization(LoggerFactory.java:124)
	at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:412)
	at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:357)
	at net.sourceforge.tess4j.Tesseract.<clinit>(Tesseract.java:78)
	at com.chl.test.orc.Tess4jOcrTest.test1(Tess4jOcrTest.java:22)
	at com.chl.test.orc.Tess4jOcrTest.main(Tess4jOcrTest.java:13)
Caused by: java.lang.IllegalStateException: Detected both log4j-over-slf4j.jar AND bound slf4j-log4j12.jar on the class path, preempting StackOverflowError. See also http://www.slf4j.org/codes.html#log4jDelegationLoop for more details.
	at org.slf4j.impl.Log4jLoggerFactory.<clinit>(Log4jLoggerFactory.java:54)
	... 9 more

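The fix is to exclude the conflicting package. The log reports both log4j-over-slf4j and slf4j-log4j12 on the classpath, so log4j-over-slf4j is excluded from the tess4j dependency: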

<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.4.1</version>
    <scope>test</scope>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>log4j-over-slf4j</artifactId>
        </exclusion>
    </exclusions>
</dependency>
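
To see which SLF4J bindings are actually on the classpath, a small check like the sketch below may help. It looks up the same StaticLoggerBinder class that the SLF4J messages above mention; the class name Slf4jBindingCheck and the method findResources are hypothetical names of mine.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

class Slf4jBindingCheck {

    /** Lists every classpath location that provides the given resource. */
    static List<URL> findResources(String resourceName) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(cl.getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The binder class named in the "Found binding in ..." lines of the log above
        List<URL> bindings = findResources("org/slf4j/impl/StaticLoggerBinder.class");
        System.out.println("SLF4J bindings found: " + bindings.size());
        for (URL url : bindings) {
            System.out.println("  " + url);
        }
    }
}
```

More than one entry corresponds to the "multiple SLF4J bindings" warning. Running mvn dependency:tree should then show which dependency brings in each jar.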
  4. Run again; the following error appears
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/chenhailong/.m2/repository/org/slf4j/slf4j-log4j12/1.7.25/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/chenhailong/.m2/repository/ch/qos/logback/logback-classic/1.2.3/logback-classic-1.2.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract':
dlopen(libtesseract.dylib, 9): image not found
dlopen(libtesseract.dylib, 9): image not found
Native library (darwin/libtesseract.dylib) not found in resource path ([file:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/resources.jar, file:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Con... file:/Users/chenhailong/.m2/repository/org/apache/pdfbox/jbig2-imageio/3.0.2/jbig2-imageio-3.0.2.jar, file:/Users/chenhailong/.m2/repository/net/sourceforge/lept4j/lept4j/1.12.3/lept4j-1.12.3.jar, file:/Users/chenhailong/.m2/repository/org/jboss/jboss-vfs/3.2.14.Final/jboss-vfs-3.2.14.Final.jar, file:/Users/chenhailong/.m2/repository/ch/qos/logback/logback-classic/1.2.3/logback-classic-1.2.3.jar, file:/Users/chenhailong/.m2/repository/ch/qos/logback/logback-core/1.2.3/logback-core-1.2.3.jar, file:/Users/chenhailong/.m2/repository/org/slf4j/jul-to-slf4j/1.7.28/jul-to-slf4j-1.7.28.jar])
	at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:302)
	at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:455)
	at com.sun.jna.Library$Handler.<init>(Library.java:192)
	at com.sun.jna.Native.loadLibrary(Native.java:646)
	at com.sun.jna.Native.loadLibrary(Native.java:630)
	at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:85)
	at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:42)
	at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:427)
	at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:223)
	at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:195)
	at com.chl.test.orc.Tess4jOcrTest.test1(Tess4jOcrTest.java:24)
	at com.chl.test.orc.Tess4jOcrTest.main(Tess4jOcrTest.java:13)
	Suppressed: java.lang.UnsatisfiedLinkError: dlopen(libtesseract.dylib, 9): image not found
		at com.sun.jna.Native.open(Native Method)
		at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:191)
		... 11 more
	Suppressed: java.lang.UnsatisfiedLinkError: dlopen(libtesseract.dylib, 9): image not found
		at com.sun.jna.Native.open(Native Method)
		at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:204)
		... 11 more
	Suppressed: java.io.IOException: Native library (darwin/libtesseract.dylib) not found in resource path ([file:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/resources.jar, ... file:/Users/chenhailong/.m2/repository/ch/qos/logback/logback-core/1.2.3/logback-core-1.2.3.jar, file:/Users/chenhailong/.m2/repository/org/slf4j/jul-to-slf4j/1.7.28/jul-to-slf4j-1.7.28.jar])
		at com.sun.jna.Native.extractFromResourcePath(Native.java:1095)
		at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:276)
		... 11 more

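The tess4j package cannot be used directly on macOS: as the error shows, JNA cannot find libtesseract.dylib, so Tesseract itself has to be installed. For the installation commands, see https://github.com/tesseract-ocr/tesseract/wiki.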

Watch out for account permission problems and the like during installation. My session looked like this:

chenhailongdeMacBook-Pro:~ chenhailong$ brew install tesseract
Updating Homebrew...






^C
==> Installing dependencies for tesseract: libpng, jpeg, libtiff, leptonica
==> Installing tesseract dependency: libpng
==> Downloading https://homebrew.bintray.com/bottles/libpng-1.6.34.high_sierra.bottle.tar.gz


######################################################################## 100.0%
==> Pouring libpng-1.6.34.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/libpng/1.6.34: 26 files, 1.2MB
==> Installing tesseract dependency: jpeg
==> Downloading https://homebrew.bintray.com/bottles/jpeg-9c.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring jpeg-9c.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/jpeg/9c: 21 files, 724.5KB
==> Installing tesseract dependency: libtiff
==> Downloading https://homebrew.bintray.com/bottles/libtiff-4.0.9_4.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring libtiff-4.0.9_4.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/libtiff/4.0.9_4: 246 files, 3.5MB
==> Installing tesseract dependency: leptonica
==> Downloading https://homebrew.bintray.com/bottles/leptonica-1.76.0.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring leptonica-1.76.0.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/leptonica/1.76.0: 48 files, 5.6MB
==> Installing tesseract
==> Downloading https://homebrew.bintray.com/bottles/tesseract-3.05.02.high_sierra.bottle.tar.gz
#########################################                                 58.2%


curl: (56) LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 54
Error: Failed to download resource "tesseract"
Download failed: https://homebrew.bintray.com/bottles/tesseract-3.05.02.high_sierra.bottle.tar.gz
Warning: Bottle installation failed: building from source.
==> Installing dependencies for tesseract: autoconf, autoconf-archive, automake, libtool, pkg-config
==> Installing tesseract dependency: autoconf
==> Downloading https://homebrew.bintray.com/bottles/autoconf-2.69.high_sierra.bottle.4.tar.gz
######################################################################## 100.0%
==> Pouring autoconf-2.69.high_sierra.bottle.4.tar.gz
==> Caveats
Emacs Lisp files have been installed to:
  /usr/local/share/emacs/site-lisp/autoconf
==> Summary
🍺  /usr/local/Cellar/autoconf/2.69: 71 files, 3.0MB
==> Installing tesseract dependency: autoconf-archive
==> Downloading https://homebrew.bintray.com/bottles/autoconf-archive-2018.03.13.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring autoconf-archive-2018.03.13.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/autoconf-archive/2018.03.13: 585 files, 3.5MB
==> Installing tesseract dependency: automake
==> Downloading https://homebrew.bintray.com/bottles/automake-1.16.1.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring automake-1.16.1.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/automake/1.16.1: 131 files, 3MB
==> Installing tesseract dependency: libtool
==> Downloading https://homebrew.bintray.com/bottles/libtool-2.4.6_1.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring libtool-2.4.6_1.high_sierra.bottle.tar.gz
==> Caveats
In order to prevent conflicts with Apple's own libtool we have prepended a "g"
so, you have instead: glibtool and glibtoolize.
==> Summary
🍺  /usr/local/Cellar/libtool/2.4.6_1: 71 files, 3.7MB
==> Installing tesseract dependency: pkg-config
==> Downloading https://homebrew.bintray.com/bottles/pkg-config-0.29.2.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring pkg-config-0.29.2.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/pkg-config/0.29.2: 11 files, 627.2KB
==> Downloading https://github.com/tesseract-ocr/tesseract/archive/3.05.02.tar.gz
==> Downloading from https://codeload.github.com/tesseract-ocr/tesseract/tar.gz/3.05.02
######################################################################## 100.0%
==> ./autogen.sh
==> ./configure --prefix=/usr/local/Cellar/tesseract/3.05.02
==> make install


==> Downloading https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
==> Downloading from https://raw.githubusercontent.com/tesseract-ocr/tessdata/3.04.00/eng.traineddata
######################################################################## 100.0%
==> Downloading https://github.com/tesseract-ocr/tessdata/raw/3.04.00/osd.traineddata
==> Downloading from https://raw.githubusercontent.com/tesseract-ocr/tessdata/3.04.00/osd.traineddata
######################################################################## 100.0%
🍺  /usr/local/Cellar/tesseract/3.05.02: 79 files, 38.6MB, built in 10 minutes 40 seconds
==> Caveats
==> autoconf
Emacs Lisp files have been installed to:
  /usr/local/share/emacs/site-lisp/autoconf
==> libtool
In order to prevent conflicts with Apple's own libtool we have prepended a "g"
so, you have instead: glibtool and glibtoolize.

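Pay attention to problems that come up during installation. In my case the tesseract bottle download failed, so Homebrew fell back to building from source, and there was a caveat about a naming conflict with Apple's libtool.
Even so, tesseract --version displays normally: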

chenhailongdeMacBook-Pro:~ chenhailong$ tesseract --version
tesseract 3.05.02
 leptonica-1.76.0
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
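
The command-line tool working does not by itself mean that JNA can find libtesseract.dylib. A rough way to check from Java is sketched below. The candidate directories are assumptions (the usual Homebrew prefixes on Intel and Apple Silicon Macs) and the class name NativeLibLocator is made up; jna.library.path is the system property JNA consults for extra native library directories.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class NativeLibLocator {

    /** Returns the first directory that contains a regular file (or symlink to one) with the given name. */
    static Optional<Path> findLibraryDir(List<Path> dirs, String fileName) {
        for (Path dir : dirs) {
            if (Files.isRegularFile(dir.resolve(fileName))) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed locations: default Homebrew lib directories, adjust for your machine
        List<Path> candidates = Arrays.asList(
                Paths.get("/usr/local/lib"),
                Paths.get("/opt/homebrew/lib"));
        Optional<Path> dir = findLibraryDir(candidates, "libtesseract.dylib");
        if (dir.isPresent()) {
            // Must be set before the first Tess4J call triggers native library loading
            System.setProperty("jna.library.path", dir.get().toString());
            System.out.println("Found libtesseract.dylib in " + dir.get());
        } else {
            System.out.println("libtesseract.dylib not found in " + candidates);
        }
    }
}
```

If the library is found but still fails to load, the installed Tesseract version may not match what the Tess4J release expects, which is worth checking against the Tess4J documentation.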

First, reinstall tesseract:

chenhailongdeMacBook-Pro:~ chenhailong$ brew install tesseract
Updating Homebrew...
^CWarning: tesseract 3.05.02 is already installed and up-to-date
To reinstall 3.05.02, run `brew reinstall tesseract`
chenhailongdeMacBook-Pro:~ chenhailong$ brew reinstall tesseract
==> Reinstalling tesseract 
==> Downloading https://homebrew.bintray.com/bottles/tesseract-3.05.02.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring tesseract-3.05.02.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/tesseract/3.05.02: 79 files, 38.6MB

Running again produces the following error:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/chenhailong/.m2/repository/org/slf4j/slf4j-log4j12/1.7.25/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/chenhailong/.m2/repository/ch/qos/logback/logback-classic/1.2.3/logback-classic-1.2.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Error opening data file ./tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000001270bfe6b, pid=38486, tid=0x0000000000002703
#
# JRE version: Java(TM) SE Runtime Environment (8.0_171-b11) (build 1.8.0_171-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.171-b11 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libtesseract.dylib+0x12e6b]  _ZN9tesseract9Tesseract15recog_all_wordsEP8PAGE_RESP10ETEXT_DESCPK4TBOXPKci+0xa7
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/chenhailong/git/LawSpider/hs_err_pid38486.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

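The message says the tessdata resource directory cannot be found.
I followed this page to deal with it: http://www.zmonster.me/2015/04/17/tesseract-install-usage.html
Searching around, some reports blame permissions and others blame environment variables. I set: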

export TESSDATA_PREFIX=/opt/language/

The same error still appeared.
I then followed this reference:

http://www.itkeyword.com/doc/692839954630676463/tesseract-for-java-setting-tessdata-prefix-for-executable-jar

After adding a setDatapath call in the code, the program ran normally:

ITesseract it = new Tesseract();
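// Use only one of the two calls below; if both run, the later one wins.
// If the tessdata directory has not been moved, pass "."
it.setDatapath(".");
// If the tessdata directory has been moved, pass its new location
it.setDatapath("/opt/language/");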
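
Rather than hard-coding the path, the data directory could be located at runtime. The sketch below checks a few candidate directories for the language file. The candidate list is an assumption (TESSDATA_PREFIX if the JVM can see it, ./tessdata from the error message, and Homebrew's usual share directory on Intel Macs), and TessdataLocator and findDataDir are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class TessdataLocator {

    /** Returns the first candidate directory that directly contains <lang>.traineddata. */
    static Optional<Path> findDataDir(List<Path> candidates, String lang) {
        for (Path dir : candidates) {
            if (Files.isRegularFile(dir.resolve(lang + ".traineddata"))) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Path> candidates = new ArrayList<>();
        // A variable exported in a terminal is not necessarily visible to a JVM started from an IDE
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix != null) {
            candidates.add(Paths.get(prefix));
            candidates.add(Paths.get(prefix, "tessdata"));
        }
        candidates.add(Paths.get("tessdata"));                  // ./tessdata, as in the error message
        candidates.add(Paths.get("/usr/local/share/tessdata")); // assumed Homebrew location

        Optional<Path> dir = findDataDir(candidates, "eng");
        System.out.println(dir.isPresent()
                ? "eng.traineddata found in " + dir.get().toAbsolutePath()
                : "eng.traineddata not found in " + candidates);
    }
}
```

With the Tesseract 3.05 setup in this post, the error message suggests the parent of the tessdata directory is what should be passed to setDatapath. Other Tess4J versions may expect the tessdata directory itself, so it is worth trying both.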

I tested with the following image:

[Image: OCR test sample]

The recognition result:

97X-l-449-38I65-3 Tapworllly: Designing Great thunc Apps r 2010 by O'Reilly Media. Inc. Simplified
Chinese cdilion. jninlly published by O‘Reilly M:dia. Inc, and Publishing Hnusc of Elcclmnlcs Induslry.
2m 1. Aulhorucd uanslauun or me English :dllion‘ 2010 O'Reilly Media‘ Ina. lln: owner ofall nghls m
publish and stll m: same. All rights reamed including the rims nfrcpmdnclion in whole or in van in any
rm...

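Some of the recognized text is clearly wrong, probably because the image is not sharp enough. Recognizing an image that contains Chinese characters directly may fail, because the corresponding language data (traineddata) files have to be downloaded first. They are available at: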
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400
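
If image sharpness is the problem, enlarging the image and converting it to grayscale before recognition may help, although I have not measured how much. The sketch below uses only the JDK imaging classes; the class name OcrPreprocess and the scale factor of 2.0 are assumptions chosen for illustration.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

class OcrPreprocess {

    /** Returns a grayscale copy of the image, scaled by the given factor with bicubic interpolation. */
    static BufferedImage toScaledGray(BufferedImage src, double scale) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * scale));
        int h = Math.max(1, (int) Math.round(src.getHeight() * scale));
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = out.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            // Drawing into a gray image converts the colors and resizes in one step
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: OcrPreprocess <input image> <output png>");
            return;
        }
        BufferedImage src = ImageIO.read(new File(args[0]));
        if (src == null) {
            System.out.println("Could not decode " + args[0]);
            return;
        }
        ImageIO.write(toScaledGray(src, 2.0), "png", new File(args[1]));
    }
}
```

The resulting file can be passed to doOCR like any other image. Tess4J also offers a doOCR overload that takes a BufferedImage, which would avoid writing the intermediate file.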

At this point a simple runtime environment is set up. Next I will try out how accurate Chinese recognition is.
