Linux OCR software: tesseract, a text recognition tool for Linux

weixin_39989973

Installation steps on Ubuntu:

1. Install the required libraries

sudo apt-get install autoconf automake libtool

sudo apt-get install libpng12-dev

sudo apt-get install libjpeg62-dev

sudo apt-get install libtiff4-dev

sudo apt-get install zlib1g-dev

sudo apt-get install libleptonica-dev # install leptonica (the development package provides the headers)

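The following packages are needed as well. Some machines already ship with them, but it is safer to install them anyway; apt-get skips anything that is already installed.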

sudo apt-get install gcc

sudo apt-get install g++

sudo apt-get install automake
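
Before running ./configure, it can help to confirm that the build tools are actually on the PATH. The following is a minimal sketch, not part of the original post; missing_tools is a hypothetical helper name, and the list of tools checked is an assumption based on the packages installed above.

```shell
#!/bin/sh
# Print the names of the given commands that are not found on PATH, one per line.
# (missing_tools is a hypothetical helper, not part of tesseract or apt.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed tool list, matching the packages installed in step 1.
missing=$(missing_tools gcc g++ make autoconf automake)
if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All build tools found"
fi
```

Anything the script lists can be installed with the apt-get commands above.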

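After these steps, ./configure on my machine still reported that it could not find the Leptonica library, so I downloaded the source package from https://leptonica.org/download.html and installed it myself before trying again.

Even that did not settle it. I spent a long time on this: running ./configure in the tesseract directory kept failing with "leptonica library missing".

Neither Baidu nor Google turned up anything useful; in the end I found the answer in the FAQ: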

leptonica library missing

If you get this error message when you run ./configure and your leptonica header files are located in /usr/local/include (e.g. you installed leptonica to /usr/local), then run:

LIBLEPT_HEADERSDIR=/usr/local/include ./configure

or:

CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure

I tried both.

Only the second command worked for me, and ./configure finally passed.
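
The FAQ workaround can be generalized a little: locate allheaders.h first, then build the matching ./configure line. This is only a sketch under assumptions. find_lept_prefix and lept_configure_cmd are hypothetical helper names, the candidate prefixes are guesses, and the header is looked for both directly under include/ and under include/leptonica/ because the layout depends on how Leptonica was installed.

```shell
#!/bin/sh
# Print the first prefix (e.g. /usr/local) under which allheaders.h is readable.
# Arguments: candidate prefixes, searched in order. Returns 1 if none match.
find_lept_prefix() {
    for prefix in "$@"; do
        for sub in include include/leptonica; do
            if [ -r "$prefix/$sub/allheaders.h" ]; then
                printf '%s\n' "$prefix"
                return 0
            fi
        done
    done
    return 1
}

# Print the ./configure command line for a given prefix
# (the same form as the second FAQ command).
lept_configure_cmd() {
    printf 'CPPFLAGS="-I%s/include" LDFLAGS="-L%s/lib" ./configure\n' "$1" "$1"
}

# Candidate prefixes are assumptions; adjust them to your system.
if found=$(find_lept_prefix /usr/local /usr /opt/local); then
    lept_configure_cmd "$found"
else
    echo "allheaders.h not found; install Leptonica first" >&2
fi
```

The script only prints the command; run the printed line from inside the tesseract source directory.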

2. Install tesseract 3.00

$ wget https://tesseract-ocr.googlecode.com/files/tesseract-3.00.tar.gz

$ tar zxvf tesseract-3.00.tar.gz

$ cd tesseract-3.00 && ./configure && make && sudo make install

3. Install the Chinese language data

$ wget https://tesseract-ocr.googlecode.com/files/chi_sim.traineddata.gz

$ gunzip chi_sim.traineddata.gz

$ sudo cp chi_sim.traineddata /usr/local/share/tessdata/
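
A quick way to confirm that the language data landed in the right place is to check for the .traineddata files directly. This is a minimal sketch; has_lang is a hypothetical helper name, and the default directory is the /usr/local/share/tessdata path used above, which may differ if tesseract was configured with another prefix.

```shell
#!/bin/sh
# Return 0 if <lang>.traineddata is a readable file in the tessdata directory.
# $1 = language code, $2 = tessdata directory (defaults to the path used above).
# (has_lang is a hypothetical helper, not a tesseract command.)
has_lang() {
    [ -r "${2:-/usr/local/share/tessdata}/$1.traineddata" ]
}

for lang in eng chi_sim; do
    if has_lang "$lang"; then
        echo "$lang: installed"
    else
        echo "$lang: missing"
    fi
done
```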

At this step I first installed the English language data into /usr/local/share/tessdata/, but running the test image produced the following error:

gzw@gzw-laptop:~/openhw/tesscract/tesseract-3.01$ tesseract phototest.tif phototest -l eng

tesseract: error while loading shared libraries: libtesseract.so.3: cannot open shared object file: No such file or directory

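So I searched for a fix:

./tests: error while loading shared libraries: xxx.so.0: cannot open shared object file: No such file or directory

An error of this kind means the dynamic linker does not know which directory xxx.so is in, so the directory that contains it has to be added to /etc/ld.so.conf.

Many shared libraries end up in /usr/local/lib. Looking there, I did find the .so file I needed.

So add a line containing /usr/local/lib to /etc/ld.so.conf, save the file, and then run /sbin/ldconfig -v (as root) to refresh the linker cache.

I tried again.

Sure enough, this fixed it.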
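
The check-and-fix for the linker configuration can be sketched as follows. dir_listed is a hypothetical helper name. The script also looks in /etc/ld.so.conf.d/*.conf, on the assumption that the system's /etc/ld.so.conf includes that directory (common on Ubuntu, but not stated in the original post). It only prints the suggested commands and does not modify anything.

```shell
#!/bin/sh
# Return 0 if directory $1 appears as a whole line in any of the given config files.
# (dir_listed is a hypothetical helper, not a system command.)
dir_listed() {
    dir=$1
    shift
    for conf in "$@"; do
        [ -r "$conf" ] || continue
        if grep -qxF "$dir" "$conf"; then
            return 0
        fi
    done
    return 1
}

libdir=/usr/local/lib
if dir_listed "$libdir" /etc/ld.so.conf /etc/ld.so.conf.d/*.conf; then
    echo "$libdir is already listed; running 'sudo ldconfig' should be enough"
else
    echo "To register $libdir, run:"
    echo "  echo $libdir | sudo tee -a /etc/ld.so.conf"
    echo "  sudo ldconfig"
fi
```

Afterwards, `ldconfig -p | grep libtesseract` should list the library if the cache was rebuilt successfully.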

tesseract phototest.tif phototest -l eng

Output:

Tesseract Open Source OCR Engine v3.01 with Leptonica

Page 0

This should create a text file named phototest.txt in the current directory, containing the text shown in phototest.tif.

----------------------------------------

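That evening I tried Chinese text as well, and it also works. The recognition rate was actually quite high, although my images were sharp; I have not tried blurry ones.

I would like to start reading the source code, though the code base looks rather large.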

4. Convert the image to TIFF format, then run:

$ tesseract apple.tif result -l chi_sim

Notes:

1. It is fairly slow, but tolerable; it is free software, after all.

2. The text in the image must be horizontal. Skewed text hurts the recognition result, so straighten the image with a rotation tool first.
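
The conversion and straightening steps can be combined into a small helper. This is a sketch under assumptions: it relies on ImageMagick's convert command, which the original post does not mention, and on its -deskew option with the 40% threshold suggested in the ImageMagick documentation. tif_name and ocr_image are hypothetical helper names.

```shell
#!/bin/sh
# Derive the TIFF file name for an input image: apple.png -> apple.tif
# Assumes the file name has an extension.
tif_name() {
    printf '%s.tif\n' "${1%.*}"
}

# Convert one image to TIFF, straighten it, then run OCR on the result.
# $1 = input image, $2 = language code (defaults to chi_sim)
# Requires ImageMagick (convert) and tesseract on PATH.
ocr_image() {
    src=$1
    lang=${2:-chi_sim}
    tif=$(tif_name "$src")
    # -deskew attempts to straighten slightly rotated text.
    convert "$src" -deskew 40% "$tif" &&
        tesseract "$tif" "${tif%.tif}" -l "$lang"
}

# Usage (writes apple.tif and apple.txt next to the input):
#   ocr_image apple.png chi_sim
```

Deskewing only corrects small angles; an image that is rotated by 90 degrees or upside down still needs to be rotated manually first.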