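Open-source software: Tesseract[1]
Wiki: Tesseract[2]
Installing Tesseract on macOS with Homebrew
Install Homebrew from Terminal.app on macOS first. After that, any package in the official Homebrew repositories can be installed with a single command, and everything is stored under one common prefix, which makes it easy to find.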
1. Locate Terminal.app
- In Finder, press Shift+Command+A to open the Applications folder (/Applications)
- Open Terminal.app in the Utilities folder (path: /Applications/Utilities/Terminal.app)
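2. Install Homebrew[3]
Homebrew is a package manager for macOS (formerly OS X) that makes it easy to install and manage software packages.
The project describes itself as the missing package manager for macOS.[4]
To install it, copy the install command from the Homebrew homepage (https://brew.sh) and run it in Terminal.app. Always use the current command published on the official site.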
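※ Basic Homebrew commands (can be skipped for now)
Update Homebrew itself: brew update
Show the version: brew -v
Show help: brew -h
Show details about a package: brew info <name>
List installed packages: brew list
Install a package: brew install <name>
Uninstall a package: brew uninstall <name>
Uninstall a package completely, including old versions: brew uninstall --force <name>
Search for packages: brew search <name or /regex/>
Upgrade all outdated packages: brew upgrade (add <name> to upgrade only that package)
List packages that have a newer version: brew outdated
Remove outdated files of one package: brew cleanup <name>
Remove outdated files of all packages: brew cleanup
Show what would be cleaned up, without removing anything: brew cleanup -n
Open a package's homepage in the browser: brew home <name>
Show a package's dependencies: brew deps <name>
Pin a package to its current version: brew pin <name>
Unpin a package: brew unpin <name>
Show the dependencies of installed packages as a tree: brew deps --installed --tree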
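Before trying the commands above, it can help to confirm that brew is actually on the PATH. The following is a minimal sketch; `have_cmd` is a helper name made up for this example and is not part of Homebrew.

```shell
# have_cmd NAME: succeed if NAME resolves to a command on the PATH.
# (have_cmd is a hypothetical helper for this sketch.)
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd brew; then
  brew -v          # print the installed Homebrew version
  brew outdated    # list packages that have a newer version
else
  echo "Homebrew not found; install it from https://brew.sh first."
fi
```

The same helper works for any other command, for example `have_cmd tesseract` once Tesseract is installed.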
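3. In Terminal.app, use brew to look up information about Tesseract
Input:
brew info tesseract
Output:
The output shows the Tesseract version, the install location, the number of files, and their total size: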
tesseract: stable 4.1.1 (bottled), HEAD
OCR (Optical Character Recognition) engine
https://github.com/tesseract-ocr/
/usr/local/Cellar/tesseract/4.1.1 (65 files, 29.6MB) *
Poured from bottle on 2020-02-12 at 16:58:17
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/tesseract.rb
==> Dependencies
Build: autoconf ✘, autoconf-archive ✘, automake ✘, libtool ✘, pkg-config ✘
Required: leptonica ✔, libtiff ✔
==> Options
--HEAD
Install HEAD version
==> Caveats
This formula contains only the "eng", "osd", and "snum" language data files.
If you need any other supported languages, run `brew install tesseract-lang`.
==> Analytics
install: 62,766 (30 days), 198,380 (90 days), 765,471 (365 days)
install-on-request: 6,102 (30 days), 19,486 (90 days), 84,858 (365 days)
build-error: 0 (30 days)
Note in particular:
==> Caveats
This formula contains only the "eng", "osd", and "snum" language data files. If you need any other supported languages, run `brew install tesseract-lang`.
This caveat says that the standard package ships only a few language data files. To support more languages, run:
brew install tesseract-lang [5]
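If only the version number is needed, for example in a script, it can be extracted from the first line of the output. This is a rough sketch that assumes the "<name>: stable <version> ..." line format shown above, which may differ in other Homebrew releases; `stable_version` is a hypothetical helper name.

```shell
# stable_version: read `brew info` output on stdin and print the stable
# version number. Assumes a line of the form "<name>: stable <version> ..."
# as in the output shown in this article.
stable_version() {
  sed -n 's/.*: stable \([^ ,]*\).*/\1/p' | head -n 1
}

# Only call brew when it is installed.
if command -v brew >/dev/null 2>&1; then
  brew info tesseract | stable_version
fi
```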
4. Install Tesseract with multi-language support
Input:
brew install tesseract-lang
Output:
The installation completes:
==> Downloading https://homebrew.bintray.com/bottles/little-cms2-2.10.mojave.bot
==> Downloading from https://akamai.bintray.com/1d/1d92fdb6dfbacebb2431da4c3c9e2
######################################################################## 100.0%
==> Downloading https://homebrew.bintray.com/bottles/tesseract-lang-4.0.0.mojave
==> Downloading from https://akamai.bintray.com/63/631211ef37fcafa9a3fac6a7cd6ca
######################################################################## 100.0%
==> Installing dependencies for tesseract-lang: little-cms2
==> Installing tesseract-lang dependency: little-cms2
==> Pouring little-cms2-2.10.mojave.bottle.tar.gz
/usr/local/Cellar/little-cms2/2.10: 21 files, 1MB
==> Installing tesseract-lang
==> Pouring tesseract-lang-4.0.0.mojave.bottle.tar.gz
/usr/local/Cellar/tesseract-lang/4.0.0: 163 files, 651.8MB
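After the installation, one way to confirm that a given language is available is to search the output of `tesseract --list-langs`. A minimal sketch, assuming the command prints one language code per line after a header line; `has_lang` is a hypothetical helper name.

```shell
# has_lang LANG: succeed if LANG appears as a whole line on stdin.
# -x matches whole lines only, -F treats LANG as a fixed string.
has_lang() {
  grep -qxF -- "$1"
}

# Only call tesseract when it is installed. 2>&1 is used because some
# versions print the language list to stderr.
if command -v tesseract >/dev/null 2>&1; then
  if tesseract --list-langs 2>&1 | has_lang chi_sim; then
    echo "chi_sim is installed"
  else
    echo "chi_sim is missing; try: brew install tesseract-lang"
  fi
fi
```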
5. Basic Tesseract-OCR commands
Input:
tesseract -h
Output:
Usage:
tesseract --help | --help-extra | --version
tesseract --list-langs
tesseract imagename outputbase [options...] [configfile...]
OCR options:
-l LANG[+LANG] Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.
Single options:
--help Show this help message.
--help-extra Show extra help for advanced users.
--version Show version information.
--list-langs List available languages for tesseract engine.
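6. Tesseract-OCR usage[6]
tesseract imagename outputbase [-l lang] [--psm pagesegmode] [configfile...]
imagename: path of the image to recognize; dragging the file into the terminal window inserts its path.
outputbase: base name of the output file, without an extension; by default the result is plain text and .txt is appended automatically.
lang: the language(s) to recognize. The default is English (eng). For Simplified Chinese use -l chi_sim; for Simplified Chinese plus English use -l chi_sim+eng.
pagesegmode: the page segmentation mode, one of: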
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
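As an illustration of how these parameters combine, the sketch below builds the `-l` argument from a list of language codes and runs Tesseract on a single line of text with `--psm 7`. The file name `line.png` and the helper `join_langs` are made up for this example.

```shell
# join_langs LANG...: join language codes with "+", the form -l expects.
# The body runs in a subshell so the IFS change does not leak out.
join_langs() (
  IFS='+'
  printf '%s\n' "$*"
)

image="line.png"   # hypothetical image containing one line of text

# Run OCR only if tesseract is installed and the image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
  # --psm 7: treat the image as a single text line; result goes to out.txt
  tesseract "$image" out -l "$(join_langs chi_sim eng)" --psm 7
fi
```

The default mode (3) is usually adequate for full pages. The narrower modes tend to matter mostly for small crops such as a single line or word.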
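7. Running OCR with Tesseract
Sample image:
Input (recognizing Chinese and English):
tesseract /Users/AAA/Desktop/ffmpeg1.png out -l chi_sim+eng
Output:
Open out.txt in your home folder (the working directory of a newly opened terminal) to check the result:
The text is recognized accurately.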
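The same command can be applied to a whole folder of images. The following is a minimal sketch, assuming PNG files and the chi_sim+eng language data; `outputbase_for` and `ocr_dir` are hypothetical helper names, and each result is written next to its image.

```shell
# outputbase_for IMAGE: strip the extension, so photo.png gives photo
# and tesseract writes photo.txt next to the image.
outputbase_for() {
  printf '%s\n' "${1%.*}"
}

# ocr_dir DIR [LANGS]: run tesseract on every PNG file in DIR.
# The body runs in a subshell so its variables stay local.
ocr_dir() (
  dir=$1
  langs=${2:-chi_sim+eng}
  for img in "$dir"/*.png; do
    [ -f "$img" ] || continue   # skip the unexpanded pattern if no PNGs
    tesseract "$img" "$(outputbase_for "$img")" -l "$langs"
  done
)

# Example call (requires tesseract): ocr_dir "$HOME/Desktop"
```

To print the text in the terminal instead of writing a file, pass `stdout` as the outputbase.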
References
- [1] Tesseract on GitHub: https://github.com/tesseract-ocr/tesseract/
- [2] Tesseract wiki: https://github.com/tesseract-ocr/tesseract/wiki
- [3] Homebrew homepage: https://brew.sh
- [4] Homebrew homepage (Chinese): https://brew.sh/index_zh-cn
- [5] More languages for Tesseract: https://blog.csdn.net/weixin_40368256/article/details/100624099
- [6] Tesseract-OCR usage: https://www.itread01.com/content/1547557393.html