1. Installing and using Tesseract-OCR on macOS for image text recognition

Open-source software: Tesseract[1]

Wiki: Tesseract[2]

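Installing Tesseract on macOS with Homebrew

Install Homebrew from Terminal.app on macOS. After that, any package in the official Homebrew repositories can be downloaded quickly, and everything is stored under one consistent path, which makes it easy to find.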

  1. Locate Terminal.app
  • Press Shift+Command+A to open the Applications folder in Finder (/Applications)
  • Open Terminal.app in the Utilities folder (location: /Applications/Utilities/Terminal.app)

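2. Install Homebrew[3]

Homebrew is a package manager for macOS (formerly OS X) that makes it easy to manage all kinds of packages.
The official site describes it as "the missing package manager for macOS". [4]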

To install, run the install command published on the Homebrew site in Terminal.app (always take the latest command from https://brew.sh).
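Before running the installer, it can help to check whether Homebrew is already present. The following is a minimal sketch; the helper name have_cmd is hypothetical, and the script only prints a hint rather than running the installer itself.

```shell
# Hypothetical helper: succeed if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd brew; then
  # brew --version prints the version on its first line
  echo "Homebrew found: $(brew --version | head -n 1)"
else
  echo "Homebrew not found; run the install command from https://brew.sh"
fi
```

If the first message appears, step 2 can be skipped.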

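※ Basic Homebrew commands (can be skipped for now)

Update Homebrew: brew update
Show version: brew -v
Show help: brew -h

Show details about a package: brew info <package>
List installed packages: brew list

Install a package: brew install <package>
Uninstall a package: brew uninstall <package>
Completely uninstall a package, including old versions: brew uninstall --force <package>

Search for a package: brew search <text or /regex/>
Upgrade all packages: brew upgrade
Upgrade one package: brew upgrade <package>
List packages that have newer versions: brew outdated
Clean up outdated files of one package: brew cleanup <package>
Clean up all outdated files: brew cleanup
List what would be cleaned up: brew cleanup -n

Open a package's home page in the browser: brew home <package>
Show a package's dependencies: brew deps <package>
Pin a package: brew pin $FORMULA
Unpin a package: brew unpin $FORMULA
Show dependencies of installed packages as a tree: brew deps --installed --tree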
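Several of the commands above are often run together as a routine. The sketch below strings four of them into one function; the names run, brew_maintenance, and the DRY_RUN variable are hypothetical conveniences, not part of Homebrew. With DRY_RUN=1 the commands are only printed, so the routine can be previewed safely.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical maintenance routine built from the commands listed above.
brew_maintenance() {
  run brew update       # refresh Homebrew and its package metadata
  run brew outdated     # list packages that have newer versions
  run brew upgrade      # upgrade all outdated packages
  run brew cleanup -n   # show what cleanup would remove, without removing it
}
```

To preview, set DRY_RUN=1 and call brew_maintenance; to run for real, call brew_maintenance with DRY_RUN unset.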

3. In Terminal.app, use brew to look up information about Tesseract

Input:

brew info tesseract

Output:

The output shows the Tesseract version, the install location, the number of files, and the total size:

tesseract: stable 4.1.1 (bottled), HEAD
OCR (Optical Character Recognition) engine
https://github.com/tesseract-ocr/
/usr/local/Cellar/tesseract/4.1.1 (65 files, 29.6MB) *
  Poured from bottle on 2020-02-12 at 16:58:17
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/tesseract.rb
==> Dependencies
Build: autoconf ✘, autoconf-archive ✘, automake ✘, libtool ✘, pkg-config ✘
Required: leptonica ✔, libtiff ✔
==> Options
--HEAD
	Install HEAD version
==> Caveats
This formula contains only the "eng", "osd", and "snum" language data files.
If you need any other supported languages, run `brew install tesseract-lang`.
==> Analytics
install: 62,766 (30 days), 198,380 (90 days), 765,471 (365 days)
install-on-request: 6,102 (30 days), 19,486 (90 days), 84,858 (365 days)
build-error: 0 (30 days)

In particular, note the Caveats section:

==> Caveats
This formula contains only the "eng", "osd", and "snum" language data files. If you need any other supported languages, run `brew install tesseract-lang`.

This note states that the standard package includes data for only a few languages. To get support for more languages, run:

brew install tesseract-lang [5]

3. Install Tesseract with multi-language support

Input:

brew install tesseract-lang

Output:

Installation complete:

==> Downloading https://homebrew.bintray.com/bottles/little-cms2-2.10.mojave.bot
==> Downloading from https://akamai.bintray.com/1d/1d92fdb6dfbacebb2431da4c3c9e2
######################################################################## 100.0%
==> Downloading https://homebrew.bintray.com/bottles/tesseract-lang-4.0.0.mojave
==> Downloading from https://akamai.bintray.com/63/631211ef37fcafa9a3fac6a7cd6ca
######################################################################## 100.0%
==> Installing dependencies for tesseract-lang: little-cms2
==> Installing tesseract-lang dependency: little-cms2
==> Pouring little-cms2-2.10.mojave.bottle.tar.gz
   /usr/local/Cellar/little-cms2/2.10: 21 files, 1MB
==> Installing tesseract-lang
==> Pouring tesseract-lang-4.0.0.mojave.bottle.tar.gz
   /usr/local/Cellar/tesseract-lang/4.0.0: 163 files, 651.8MB
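After the language pack is installed, one way to confirm that a given language is available is to look for its code in the output of tesseract --list-langs. This is a minimal sketch; the helper name lang_in_list is hypothetical, and chi_sim is used only as an example code.

```shell
# Hypothetical helper: succeed if the language code in $1
# appears as a whole line on standard input.
lang_in_list() {
  grep -qx -- "$1"
}

if command -v tesseract >/dev/null 2>&1; then
  # 2>&1 because some Tesseract versions print the list on stderr
  if tesseract --list-langs 2>&1 | lang_in_list chi_sim; then
    echo "chi_sim is installed"
  else
    echo "chi_sim is missing; try: brew install tesseract-lang"
  fi
else
  echo "tesseract is not on PATH"
fi
```

Matching whole lines avoids a false positive from related codes such as chi_sim_vert.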

4. Basic Tesseract-OCR commands

Input:

 tesseract -h

Output:

Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.

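5. Tesseract-OCR usage[6]

tesseract imagename outputbase [-l lang] [--psm pagesegmode] [configfile...]

imagename: the image to recognize. You can drag the file into the terminal window to enter its path.

outputbase: the base name of the output text file. Do not add an extension; the output is written as a .txt file.

lang: the language(s) to recognize. The default is English. For Simplified Chinese, enter -l chi_sim; for Simplified Chinese plus English, enter -l chi_sim+eng.

pagesegmode: the page segmentation mode (passed with --psm in Tesseract 4), one of the following: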

0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
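The pieces described above (image, output base, languages joined with "+", and a page segmentation mode) can be put together as in the following sketch. The helper names join_langs and ocr_cmd and the file name scan.png are hypothetical. The --psm spelling is the Tesseract 4.x form; 3.x releases used -psm. The sketch only prints the command line for inspection and does not quote paths containing spaces.

```shell
# Join language codes with "+", the separator that -l expects.
join_langs() {
  (IFS=+; printf '%s\n' "$*")
}

# Hypothetical helper: print the tesseract command line for one image.
# Arguments: image, output base, page segmentation mode, then language codes.
ocr_cmd() {
  image=$1
  outbase=$2
  psm=$3
  shift 3
  printf 'tesseract %s %s -l %s --psm %s\n' \
    "$image" "$outbase" "$(join_langs "$@")" "$psm"
}

ocr_cmd scan.png out 6 chi_sim eng
# prints: tesseract scan.png out -l chi_sim+eng --psm 6
```

Mode 6 is chosen here only as an example for a single uniform block of text; leaving --psm out uses the default mode 3.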

6. Recognizing text with Tesseract-OCR

Sample image:

d51eaa242a72d9e28accc01f6a35ebd5.png

Input (to recognize Chinese + English):

tesseract /Users/AAA/Desktop/ffmpeg1.png out -l chi_sim+eng

Output:

b50bf1bebca7e2ee23ee3a21d13ec103.png

Check the output file (out.txt) in your home directory:

db287b8245575175952faa609bf66b51.png

The recognized text is accurate!
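The same command can be repeated for every image in a folder. The sketch below is one way to do that, assuming PNG files; the helper names outbase_for and ocr_dir are hypothetical. Each image gets a text file with the same base name next to it.

```shell
# Derive the output base name from an image path:
# strip the directory and the last extension.
outbase_for() {
  base=${1##*/}
  printf '%s\n' "${base%.*}"
}

# Hypothetical helper: OCR every PNG in a directory.
# Arguments: directory, language spec such as chi_sim+eng.
ocr_dir() {
  dir=$1
  langs=$2
  for img in "$dir"/*.png; do
    [ -e "$img" ] || continue   # the glob matched nothing
    tesseract "$img" "$dir/$(outbase_for "$img")" -l "$langs"
  done
}
```

For example, ocr_dir "$HOME/Desktop" chi_sim+eng would write ffmpeg1.txt next to ffmpeg1.png, given the sample path used above.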

References

  1. ^Tesseract on GitHub: https://github.com/tesseract-ocr/tesseract/
  2. ^Tesseract wiki: https://github.com/tesseract-ocr/tesseract/wiki
  3. ^Homebrew official site: https://brew.sh
  4. ^Homebrew documentation in Chinese: https://brew.sh/index_zh-cn
  5. ^More language support for Tesseract: https://blog.csdn.net/weixin_40368256/article/details/100624099
  6. ^Tesseract-OCR usage: https://www.itread01.com/content/1547557393.html