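Downloading and installing tesseract
I recently installed tesseract on a Mac. Installing it directly with Homebrew, i.e. brew install tesseract, failed: one of the dependency packages could not be downloaded, and after countless retries the server behind that download URL was still unreachable. I then tried brew install tesseract --HEAD to install the latest version, but this time the build failed, reporting only that my system version was too old. In the end, following https://github.com/tesseract-ocr/tesseract/wiki/Compiling#macos got it installed: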
brew install automake autoconf libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
brew install gcc
brew install pango
git clone https://github.com/tesseract-ocr/tesseract/
cd tesseract
./autogen.sh
./configure CC=gcc-8 CXX=g++-8 CPPFLAGS=-I/usr/local/opt/icu4c/include LDFLAGS=-L/usr/local/opt/icu4c/lib
make -j
sudo make install # if desired
make training # if installed with training dependencies
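The configure line above hard-codes /usr/local/opt/icu4c, which is where Homebrew placed icu4c on this machine. As a minimal sketch, assuming Homebrew may use a different prefix on other machines, the flags can be built from whatever `brew --prefix icu4c` reports. The helper name `icu_flags` and the variable `ICU_PREFIX` are hypothetical, and the CC/CXX settings are left out because they depend on the installed gcc version.

```shell
# Hypothetical helper: print configure flags for a given icu4c prefix.
icu_flags() {
    printf 'CPPFLAGS=-I%s/include LDFLAGS=-L%s/lib\n' "$1" "$1"
}

# Ask Homebrew for the prefix when brew is available; otherwise fall back
# to the path used in the steps above.
if command -v brew >/dev/null 2>&1; then
    ICU_PREFIX="$(brew --prefix icu4c)"
else
    ICU_PREFIX=/usr/local/opt/icu4c
fi

# Print the configure command instead of running it, so it can be reviewed.
echo "./configure $(icu_flags "$ICU_PREFIX")"
```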
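Finally, run tesseract --version to check that the installation succeeded.
Installing multiple languages for tesseract
Running tesseract --list-langs at this point reports an error: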
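A small sketch of that check, for use in a script: it first confirms the binary is on PATH and only then asks for the version. The helper name `have_cmd` is hypothetical.

```shell
# Hypothetical helper: succeed if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd tesseract; then
    tesseract --version
else
    echo "tesseract not found on PATH" >&2
fi
```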
linfangfangdeMacBook-Pro:tesseract linfangfang$ tesseract --list-langs
Error opening data file /usr/local/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
List of available languages (0):
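The error means that eng.traineddata is missing from the directory tesseract searched. A minimal sketch of checking this by hand, assuming the data directory is either the one named in TESSDATA_PREFIX or the default path from the error message; the helper name `has_lang` and the variable `TESSDATA_DIR` are hypothetical.

```shell
# Hypothetical helper: succeed if <dir>/<lang>.traineddata exists.
has_lang() {
    [ -f "$1/$2.traineddata" ]
}

# Use TESSDATA_PREFIX when it is set, otherwise the path from the error above.
TESSDATA_DIR="${TESSDATA_PREFIX:-/usr/local/share/tessdata}"

if has_lang "$TESSDATA_DIR" eng; then
    echo "eng.traineddata found in $TESSDATA_DIR"
else
    echo "eng.traineddata missing from $TESSDATA_DIR"
fi
```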
The tesseract language packs are all available at https://github.com/tesseract-ocr/tessdata_fast, and they have to be downloaded manually.
Clone a copy under /usr/local/share, for example:
linfangfangdeMacBook-Pro:share linfangfang$ git clone https://github.com/tesseract-ocr/tessdata_fast
Cloning into 'tessdata_fast'...
remote: Enumerating objects: 243, done.
remote: Total 243 (delta 0), reused 0 (delta 0), pack-reused 243
Receiving objects: 100% (243/243), 335.11 MiB | 1.92 MiB/s, done.
Resolving deltas: 100% (40/40), done.
Checking out files: 100% (163/163), done.
linfangfangdeMacBook-Pro:share linfangfang$
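The full clone transfers about 335 MiB. If only a few languages are needed, one alternative is to fetch individual files with curl. This is a sketch, not the method used in this post: the helper names `traineddata_url` and `fetch_langs` are hypothetical, and the URL pattern assumes the repository's default branch is named "main". The remaining steps continue with the full clone.

```shell
# Hypothetical helper: build the download URL for one language file.
# Assumption: the tessdata_fast default branch is named "main".
traineddata_url() {
    printf 'https://github.com/tesseract-ocr/tessdata_fast/raw/main/%s.traineddata\n' "$1"
}

# Hypothetical helper: download the listed languages into a directory.
# -f makes curl fail on HTTP errors, -L follows GitHub's redirect.
fetch_langs() {
    dest="$1"
    shift
    for lang in "$@"; do
        curl -fL -o "$dest/$lang.traineddata" "$(traineddata_url "$lang")"
    done
}

# Example (needs network access and write permission on the target directory):
# fetch_langs /usr/local/share/tessdata eng chi_sim
```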
Then copy the files from tessdata_fast into the tessdata folder:
linfangfangdeMacBook-Pro:share linfangfang$ sudo cp -rf tessdata_fast/* tessdata/
Password:
linfangfangdeMacBook-Pro:share linfangfang$ cd tessdata
linfangfangdeMacBook-Pro:tessdata linfangfang$ ls
COPYING kaz.traineddata
README.md khm.traineddata
afr.traineddata kir.traineddata
amh.traineddata kmr.traineddata
ara.traineddata kor.traineddata
asm.traineddata kor_vert.traineddata
aze.traineddata lao.traineddata
aze_cyrl.traineddata lat.traineddata
bel.traineddata lav.traineddata
ben.traineddata lit.traineddata
bod.traineddata ltz.traineddata
bos.traineddata mal.traineddata
bre.traineddata mar.traineddata
bul.traineddata mkd.traineddata
cat.traineddata mlt.traineddata
ceb.traineddata mon.traineddata
ces.traineddata mri.traineddata
chi_sim.traineddata msa.traineddata
chi_sim_vert.traineddata mya.traineddata
chi_tra.traineddata nep.traineddata
chi_tra_vert.traineddata nld.traineddata
chr.traineddata nor.traineddata
configs oci.traineddata
cos.traineddata ori.traineddata
cym.traineddata osd.traineddata
dan.traineddata pan.traineddata
deu.traineddata pdf.ttf
div.traineddata pol.traineddata
dzo.traineddata por.traineddata
ell.traineddata pus.traineddata
eng.traineddata que.traineddata
enm.traineddata ron.traineddata
epo.traineddata rus.traineddata
est.traineddata san.traineddata
eus.traineddata script
fao.traineddata sin.traineddata
fas.traineddata slk.traineddata
fil.traineddata slv.traineddata
fin.traineddata snd.traineddata
fra.traineddata spa.traineddata
frk.traineddata spa_old.traineddata
frm.traineddata sqi.traineddata
fry.traineddata srp.traineddata
gla.traineddata srp_latn.traineddata
gle.traineddata sun.traineddata
glg.traineddata swa.traineddata
grc.traineddata swe.traineddata
guj.traineddata syr.traineddata
hat.traineddata tam.traineddata
heb.traineddata tat.traineddata
hin.traineddata tel.traineddata
hrv.traineddata tessconfigs
hun.traineddata tgk.traineddata
hye.traineddata tha.traineddata
iku.traineddata tir.traineddata
ind.traineddata ton.traineddata
isl.traineddata tur.traineddata
ita.traineddata uig.traineddata
ita_old.traineddata ukr.traineddata
jav.traineddata urd.traineddata
jpn.traineddata uzb.traineddata
jpn_vert.traineddata uzb_cyrl.traineddata
kan.traineddata vie.traineddata
kat.traineddata yid.traineddata
kat_old.traineddata yor.traineddata
linfangfangdeMacBook-Pro:~ linfangfang$ tesseract --list-langs
List of available languages (161):
afr
amh
ara
asm
aze
aze_cyrl
...
This shows that the installation succeeded. A quick test:
linfangfangdeMacBook-Pro:~ linfangfang$ tesseract /Users/linfangfang/Desktop/pic.jpg /Users/linfangfang/Desktop/out -l chi_sim
Tesseract Open Source OCR Engine v4.0.0-306-gb67f with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 326
Detected 67 diacritics
linfangfangdeMacBook-Pro:~ linfangfang$
This generates an out.txt file on the desktop that contains the text recognized from the image.
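The same command can be looped over several images. A minimal sketch, assuming all images are .jpg files in one directory and that each result should sit next to its image; the helper names `out_base` and `ocr_dir` are hypothetical.

```shell
# Hypothetical helper: strip the file extension to get the output base name,
# since tesseract appends .txt to the base it is given.
out_base() {
    printf '%s\n' "${1%.*}"
}

# Hypothetical helper: OCR every .jpg in a directory with the given language.
ocr_dir() {
    dir="$1"
    lang="$2"
    for img in "$dir"/*.jpg; do
        [ -e "$img" ] || continue    # skip when no .jpg files match
        tesseract "$img" "$(out_base "$img")" -l "$lang"
    done
}

# Example: ocr_dir "$HOME/Desktop" chi_sim
# For pic.jpg this writes pic.txt in the same directory.
```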