Installing Tesseract-OCR on Linux
1. Install the required libraries
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
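The following tools are also needed. Some machines already have them, but it is safer to install them anyway; packages that are already installed are simply skipped.
sudo apt-get install gcc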
sudo apt-get install g++
sudo apt-get install automake
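As a quick check before building, the sketch below lists which build tools are still missing from PATH. The function name missing_tools is made up for this example, and it only checks the command-line tools, not the -dev library packages.

```shell
#!/bin/sh
# missing_tools NAME...: print each given command that is not found on PATH.
# (Hypothetical helper for illustration; not part of tesseract or apt.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example: check the build tools from this step.
# missing_tools gcc g++ automake make
```

If the example call prints nothing, all of the listed tools are already available.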
2. Download and install leptonica
http://www.leptonica.org/download.html or
http://code.google.com/p/leptonica/downloads/list
Download the leptonica package: http://www.leptonica.org/source/leptonica-1.68.tar.gz
Extract it and change to the leptonica-1.68 root directory
$ ./configure
$ make
$ sudo make install
3. Install tesseract:
Install tesseract 3.02
$ wget http://tesseract-ocr.googlecode.com/files/tesseract-3.0.2.tar.gz
$ tar zxvf tesseract-3.0.2.tar.gz
$ cd tesseract-3.02 && ./configure && make && sudo make install
If the archive extracts to a differently named directory, cd into that directory instead.
4. Install the English language data
Download the tesseract-3.02 English language package: http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz , extract it, and copy everything under tesseract-ocr/tessdata to /usr/local/share/tessdata
$ wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz
$ tar zxvf tesseract-ocr-3.02.eng.tar.gz
$ sudo cp -r tesseract-ocr/tessdata/* /usr/local/share/tessdata/
5. Install the Chinese language data
$ wget http://tesseract-ocr.googlecode.com/files/chi_sim.traineddata.gz
$ gunzip chi_sim.traineddata.gz
$ sudo cp chi_sim.traineddata /usr/local/share/tessdata/
This copies the language data into the /usr/local/share/tessdata/ directory.
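To confirm that the language files landed in the right place, a small check like the following can help. It is only a sketch: has_lang is a hypothetical helper, and it assumes the data directory is /usr/local/share/tessdata as used in the steps above.

```shell
#!/bin/sh
# has_lang DIR LANG: succeed if DIR/LANG.traineddata exists and is not empty.
# (Hypothetical helper for illustration.)
has_lang() {
    [ -s "$1/$2.traineddata" ]
}

# Example: report any language from this guide that is not installed.
# for lang in eng chi_sim; do
#     has_lang /usr/local/share/tessdata "$lang" || echo "missing: $lang"
# done
```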
6. Recognition command
tesseract [image file] [output base name (the result is written to a .txt file)] -l [language]
Recognizing English text
tesseract phototest.tif phototest -l eng
Recognizing Chinese text
tesseract phototest.tif phototest -l chi_sim
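When there are many images, the same command can be run in a loop. The sketch below derives the output base name from the image name by dropping the extension, as in the phototest example above. out_base is a hypothetical helper, and it assumes every file name has an extension.

```shell
#!/bin/sh
# out_base IMAGE: print IMAGE without its final extension,
# e.g. phototest.tif becomes phototest. (Hypothetical helper.)
out_base() {
    printf '%s\n' "${1%.*}"
}

# Example: recognize every .tif file in the current directory as English.
# for f in *.tif; do
#     tesseract "$f" "$(out_base "$f")" -l eng
# done
```

Each result is then written next to its image, for example phototest.txt for phototest.tif.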
Troubleshooting
1. You may run into this error
$ tesseract foo.png bar
tesseract: error while loading shared libraries: libtesseract_api.so.3: cannot open shared object file: No such file or directory
You need to update the cache for the runtime linker. The following should get you up and running:
$ sudo ldconfig
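To see whether any shared library is still unresolved, the output of ldd can be filtered for "not found" entries. This is a sketch only: missing_libs is a hypothetical helper, and it assumes the usual ldd output format of "name => not found".

```shell
#!/bin/sh
# missing_libs: read ldd output on stdin and print the name of every
# library the runtime linker could not resolve. (Hypothetical helper.)
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Example: check the installed tesseract binary.
# ldd "$(command -v tesseract)" | missing_libs
```

No output from the example means every library was resolved.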
2. You may run into the following error when running tesseract
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Error in findTiffCompression: function not present
Error in pixReadStreamTiff: function not present
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Unsupported image type.
These messages come from leptonica having been built without TIFF support. Install the libraries below, then rebuild leptonica and tesseract.
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
Notes:
1. Recognition is fairly slow, but tolerable given that tesseract is free.
2. The text in the image should be horizontal. Skewed text hurts accuracy, so rotate the image with an image editing tool before running recognition.