Mac series: installing tesseract on a Mac and using Python's pytesseract and pillow packages to extract Chinese text from images
1. Install tesseract
brew install tesseract
==> Installing dependencies for tesseract: libarchive
==> Installing tesseract dependency: libarchive
==> Pouring libarchive-3.6.1.catalina.bottle.tar.gz
🍺 /usr/local/Cellar/libarchive/3.6.1: 62 files, 3.6MB
==> Installing tesseract
==> Pouring tesseract--5.1.0.catalina.bottle.tar.gz
==> Caveats
This formula contains only the "eng", "osd", and "snum" language data files.
If you need any other supported languages, run `brew install tesseract-lang`.
==> Summary
🍺 /usr/local/Cellar/tesseract/5.1.0: 58 files, 30.0MB
==> Caveats
==> tesseract
This formula contains only the "eng", "osd", and "snum" language data files.
If you need any other supported languages, run `brew install tesseract-lang`.
2. Check the tesseract version
After a successful installation, check the tesseract version:
tesseract --version
tesseract 5.1.0
leptonica-1.82.0
libgif 5.2.1 : libjpeg 9e : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.6.1 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.6 liblz4/1.9.3 libzstd/1.5.2
Found libcurl/7.64.1 SecureTransport (LibreSSL/2.8.3) zlib/1.2.11 nghttp2/1.39.2
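The same check can be scripted from Python. The following is a minimal sketch using only the standard library; the helper names `parse_tesseract_version` and `installed_tesseract_version` are hypothetical, and the parsing assumes the first line of output looks like the `tesseract 5.1.0` line shown above.

```python
import re
import shutil
import subprocess


def parse_tesseract_version(output):
    """Return (major, minor, patch) from `tesseract --version` output, or None."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_tesseract_version():
    """Run `tesseract --version`; return None if the binary is not on PATH."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Checking both streams is harmless if a build writes the version to stderr.
    return parse_tesseract_version(result.stdout + result.stderr)


if __name__ == "__main__":
    print(installed_tesseract_version())
```

A result of `None` means either that tesseract is not on the PATH or that the output did not match the assumed format.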
3. Fixes for errors encountered during installation
Error 1:
- Installing tesseract fails with a missing-dependency error:
- Error: No such file or directory @ rb_sysopen - /Users/f/Library/Caches/Homebrew/downloads/266702d9bc59c9dfde27ce555b4a3f9ed9d0de770ba697e62a111d74ee0a4231--openjpeg-2.4.0.catalina.bottle.tar.gz
- For this kind of error, install the missing package on its own:
brew install openjpeg
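Error 2:
- Installing the dependency on its own prints the following hint:
- Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP. Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
- This is an informational hint rather than a failure. To turn the cleanup behaviour off, run: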
export HOMEBREW_NO_INSTALL_CLEANUP=TRUE
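An `export` only applies to the current shell session. If the install is driven from a script, the variable can instead be set for a single `brew` call. The following is a rough sketch under that assumption; `brew_env` and `brew_install` are hypothetical helper names, not part of Homebrew.

```python
import os
import shutil
import subprocess


def brew_env(base_env=None):
    """Copy the environment and set HOMEBREW_NO_INSTALL_CLEANUP for one call."""
    env = dict(os.environ if base_env is None else base_env)
    env["HOMEBREW_NO_INSTALL_CLEANUP"] = "TRUE"
    return env


def brew_install(package):
    """Run `brew install <package>`; return the exit code, or None without brew."""
    if shutil.which("brew") is None:
        return None
    completed = subprocess.run(["brew", "install", package], env=brew_env())
    return completed.returncode
```

For example, `brew_install("openjpeg")` would install the missing dependency from Error 1 with the cleanup step disabled for that call only.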
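3. Download the Chinese language pack
- The Homebrew tesseract formula does not include Chinese language data, so the Chinese language pack has to be downloaded separately
- Language pack download page: https://tesseract-ocr.github.io/tessdoc/Data-Files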
4. Directory for the Chinese language pack
- Place the downloaded language file in /usr/local/Cellar/tesseract/{tesseract version}/share/tessdata, for example:
cd /usr/local/Cellar/tesseract/5.1.0/share/tessdata
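The download and copy steps can also be scripted. The sketch below assumes that the language files are served from the `tesseract-ocr/tessdata` GitHub repository under the raw-file URL layout shown in `TESSDATA_BASE_URL`. Check that URL against the download page before relying on it. The helper names are hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: raw-file URL layout of the tesseract-ocr/tessdata GitHub repository.
TESSDATA_BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/main"


def traineddata_url(lang):
    """Build the assumed download URL for a language code such as 'chi_sim'."""
    return f"{TESSDATA_BASE_URL}/{lang}.traineddata"


def tessdata_dir(version, cellar="/usr/local/Cellar"):
    """Return the Homebrew tessdata directory for a given tesseract version."""
    return Path(cellar) / "tesseract" / version / "share" / "tessdata"


def download_language(lang, version):
    """Download <lang>.traineddata into tessdata; return the path, or None."""
    target_dir = tessdata_dir(version)
    if not target_dir.is_dir():
        # tesseract is not installed at this path, so there is nowhere to copy to.
        return None
    target = target_dir / f"{lang}.traineddata"
    if not target.exists():
        urlretrieve(traineddata_url(lang), str(target))
    return target
```

Calling `download_language("chi_sim", "5.1.0")` would fetch the Simplified Chinese data into the directory above, skipping the download if the file is already there.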
5. List all installed languages
tesseract --list-langs
List of available languages in "/usr/local/share/tessdata/" (4):
chi_sim
eng
osd
snum
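To confirm from a script that `chi_sim` was picked up, the listing above can be parsed. This is a minimal sketch that assumes the output format shown here: one header line followed by one language code per line. `parse_languages` and `has_language` are hypothetical helper names.

```python
import shutil
import subprocess


def parse_languages(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line: List of available languages in "..." (N):
    return [
        line for line in lines
        if not line.startswith("List of available languages")
    ]


def has_language(lang):
    """Return True if tesseract reports the given language as installed."""
    if shutil.which("tesseract") is None:
        return False
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    return lang in parse_languages(result.stdout + result.stderr)


if __name__ == "__main__":
    print("chi_sim installed:", has_language("chi_sim"))
```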
6. Install pytesseract and pillow for Python
pip install pytesseract
pip install pillow
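A quick way to confirm that both packages landed in the interpreter you intend to use is to look them up without importing them. The following is a small sketch using the standard library; `missing_packages` is a hypothetical helper name.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the import names that cannot be found by this interpreter."""
    return [name for name in import_names if find_spec(name) is None]


if __name__ == "__main__":
    # The pillow distribution is imported under the name "PIL".
    missing = missing_packages(["pytesseract", "PIL"])
    print("missing:", missing if missing else "none")
```

If a package is reported missing even though `pip install` succeeded, `pip` may belong to a different Python installation than the one running the script.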
7. Recognize Chinese text in an image
import pytesseract
from PIL import Image
# Open the image
im = Image.open('/Users/f/PycharmProjects/firstProject/a/a.png')
# Recognize the text, specifying the language
string = pytesseract.image_to_string(im, lang='chi_sim')
print(string)
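OCR output for Chinese sometimes contains spaces between adjacent characters. The sketch below wraps the call above in a function and strips such spaces. It assumes the text falls in the common CJK Unified Ideographs range (U+4E00 to U+9FFF). `remove_cjk_spaces` and `ocr_chinese` are hypothetical helper names, not part of pytesseract.

```python
import re

# Assumption: the text of interest lies in the basic CJK Unified Ideographs block.
_CJK = "\u4e00-\u9fff"
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def remove_cjk_spaces(text):
    """Delete spaces and tabs that sit directly between two Chinese characters."""
    return _SPACE_BETWEEN_CJK.sub("", text)


def ocr_chinese(image_path):
    """Run Simplified Chinese OCR on an image and tidy the result."""
    # Imported lazily so remove_cjk_spaces works without the OCR packages.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as im:
        raw = pytesseract.image_to_string(im, lang="chi_sim")
    return remove_cjk_spaces(raw).strip()
```

Line breaks and spaces next to Latin text are left untouched, so mixed Chinese and English lines keep their word boundaries.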