A Brief Look at OCR Technology: tesserOCR (3)

aogu9974

Original link: http://www.cnblogs.com/thors/p/9494057.html

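Using tesserOCR

tesserOCR (Tesseract OCR) is text-recognition software that was open-sourced by Hewlett-Packard.

Optical Character Recognition (OCR) is the process of converting printed text into a digital representation. It has a wide range of practical applications, from digitizing printed books and creating electronic records of receipts to license plate recognition and even breaking image-based CAPTCHAs.

tesserOCR training notes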

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

http://qianjiye.de/2015/08/tesseract-ocr

http://yanghespace.com/2015/11/01/Tesseract3训练新语言/

https://blog.csdn.net/huangli19870217/article/details/45075033

1. Installation

Source code: https://github.com/tesseract-ocr/

Google Code: http://code.google.com/p/tesseract-ocr/downloads/list

Ubuntu: sudo apt-get install tesseract-ocr

CentOS: yum install tesseract.i686

Windows: a Windows installer package is available on the cloud drive

2. Add environment variables

Directory structure:

Tesseract: program files

Tessdata: language pack files
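The article does not say which variables to set. A common setup, assumed here, is to put the program directory on PATH and point TESSDATA_PREFIX at the language data (whether that means the tessdata directory itself or its parent differs between Tesseract versions). The sketch below only checks such a setup; findOnPath is a hypothetical helper name, not part of any library.

```php
<?php
// Hypothetical helper: search a PATH-style string for an executable file.
function findOnPath(string $binary, string $pathVar): ?string
{
    foreach (explode(PATH_SEPARATOR, $pathVar) as $dir) {
        if ($dir === '') {
            continue; // skip empty PATH entries
        }
        $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $binary;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }
    return null;
}

// Assumption: the binary is named tesseract.exe on Windows and tesseract elsewhere.
$binary = DIRECTORY_SEPARATOR === '\\' ? 'tesseract.exe' : 'tesseract';
$found = findOnPath($binary, (string) getenv('PATH'));
echo $found ?? 'tesseract not found on PATH', PHP_EOL;

// Report whether the language-data variable is set at all.
$prefix = getenv('TESSDATA_PREFIX');
echo $prefix !== false ? "TESSDATA_PREFIX=$prefix" : 'TESSDATA_PREFIX is not set', PHP_EOL;
```

If the first line reports that tesseract was not found, the program directory is probably missing from PATH.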

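3. Usage

Command-line syntax:

tesseract imagePath outputBase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfile]

Here outputBase is the output file name without its extension. Older 3.x releases spell the page segmentation option -psm instead of --psm.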

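Possible pagesegmode values:

0 = Orientation and script detection (OSD) only.

1 = Automatic page segmentation with OSD.

2 = Automatic page segmentation, but no OSD or OCR.

3 = Fully automatic page segmentation, but no OSD. (Default)

4 = Assume a single column of text of variable sizes.

5 = Assume a single uniform block of vertically aligned text.

6 = Assume a single uniform block of text.

7 = Treat the image as a single text line.

8 = Treat the image as a single word.

9 = Treat the image as a single word in a circle.

10 = Treat the image as a single character.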

The language and mode options must come before the config file.
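The ordering rule can be sketched as a small command builder. This is only an illustration: buildTesseractCommand and its parameters are hypothetical names, the mode check covers only the values 0 to 10 listed in this article, and the $psmFlag parameter is an assumption that lets older builds use -psm.

```php
<?php
// Hypothetical helper: assemble the command line in the documented order,
// i.e. image, output base, language and mode options, then config files last.
function buildTesseractCommand(
    string $image,
    string $outputBase,
    array $langs = ['eng'],
    int $psm = 3,
    ?int $oem = null,
    array $configFiles = [],
    string $psmFlag = '--psm' // assumption: pass '-psm' for older 3.x builds
): string {
    // Only the modes listed in this article are accepted here.
    if ($psm < 0 || $psm > 10) {
        throw new InvalidArgumentException("Page segmentation mode must be 0-10, got $psm");
    }
    $parts = ['tesseract', escapeshellarg($image), escapeshellarg($outputBase)];
    if ($langs) {
        // Several languages are joined with '+', for example eng+jpn.
        $parts[] = '-l ' . escapeshellarg(implode('+', $langs));
    }
    if ($oem !== null) {
        $parts[] = '--oem ' . $oem;
    }
    $parts[] = $psmFlag . ' ' . $psm;
    foreach ($configFiles as $config) {
        $parts[] = escapeshellarg($config); // config files always go last
    }
    return implode(' ', $parts);
}

// Example: treat the image as a single text line (mode 7).
echo buildTesseractCommand('line.png', 'out', ['eng'], 7), PHP_EOL;
```

The builder only returns a string, so the command can be inspected before it is run. The exact quoting produced by escapeshellarg depends on the platform.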

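4. Using tesseract from PHP

An open-source PHP library on GitHub already wraps tesserOCR: https://github.com/thiagoalessio/tesseract-ocr-for-php

Install it with Composer: composer require thiagoalessio/tesseract_ocr

In practice, calling the tesseract command through exec is all that is really needed.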

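<?php
require 'vendor/autoload.php'; // Composer autoloader

use thiagoalessio\TesseractOCR\TesseractOCR;

// Parentheses around "new" are required before chaining method calls.
echo (new TesseractOCR('multi-languages.png'))
    ->lang('eng', 'jpn', 'por')   // language packs to use
    ->whitelist(range('A', 'Z'))  // restrict output to these characters (newer library releases name this allowlist)
    ->run();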
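The remark that exec alone is enough can be illustrated without the library. The following is a minimal sketch, not the library's implementation: ocrWithExec and its $binary parameter are hypothetical, and it assumes the tesseract command is on PATH.

```php
<?php
// Minimal sketch: run the tesseract CLI through exec and return the recognized text.
// The function name and the $binary parameter are illustrative assumptions.
function ocrWithExec(string $image, string $lang = 'eng', string $binary = 'tesseract'): string
{
    // Using 'stdout' as the output base makes tesseract print the text
    // instead of writing it to a file.
    $command = sprintf(
        '%s %s stdout -l %s',
        escapeshellcmd($binary),
        escapeshellarg($image),
        escapeshellarg($lang)
    );
    exec($command, $outputLines, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException("Command failed with exit code $exitCode: $command");
    }
    return implode("\n", $outputLines);
}
```

A call such as ocrWithExec('multi-languages.png', 'eng+jpn+por') would then return the recognized text, provided those language packs are installed. Escaping every argument matters here because the image path often comes from user input.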

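A freshly installed tesserOCR is like a newborn: its recognition ability is limited. To improve it, download the officially provided language packs (not the system language packs) or train your own.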
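Before downloading or training anything, it helps to see which language packs are already installed. The command tesseract --list-langs prints them. The sketch below parses that output under the assumption that the first line is a header and each later line is one language code. parseListLangs is a hypothetical helper, and the sample text is made up for illustration rather than captured from a real run.

```php
<?php
// Hypothetical helper: parse the text printed by `tesseract --list-langs`.
// Assumption: the first line is a header and each following line is one language code.
function parseListLangs(string $output): array
{
    $lines = preg_split('/\R/', trim($output));
    array_shift($lines); // drop the "List of available languages ..." header
    return array_values(array_filter(array_map('trim', $lines), 'strlen'));
}

// Illustrative sample input, not real program output.
$sample = "List of available languages (3):\neng\nchi_sim\nosd\n";
print_r(parseListLangs($sample));
```

In a real script the input could come from exec('tesseract --list-langs 2>&1', $lines). The redirect is there because some versions may print the list to standard error. Downloaded or trained packs are .traineddata files placed in the tessdata directory.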

Reposted from: https://www.cnblogs.com/thors/p/9494057.html
