Python Image-to-Text OCR: tesserocr

XerCis

Category: Python. Tags: Python, OCR, text recognition, tesserocr, pytesseract

Article link: https://blog.csdn.net/lly1122334/article/details/103507671

Contents

1. Introduction
2. Installation
- 2.1 tesseract
- 2.2 pytesseract
3. Testing
- 3.1 tesseract
- 3.2 pytesseract

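1. Introduction

Tesseract is an OCR (optical character recognition) engine. Tesseract 4 introduced an LSTM-based OCR engine while keeping the recognition mode of Tesseract 3.

Tesseract supports UTF-8 and can recognize more than 100 languages out of the box.

Tesseract supports several output formats: plain text, HTML (hOCR), PDF, TSV, and others.

Tesseract itself is a command-line tool only; for a GUI, use a third-party tool.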

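2. Installation

2.1 tesseract

Tesseract download page

Download the latest version; builds marked dev are development versions. I downloaded tesseract-ocr-w64-setup-v5.0.0.20190623.exe
In the installer, tick Additional script data (download) and Additional language data (download). These downloads are slow, so select only the languages you need
Add C:\Program Files\Tesseract-OCR to the Path environment variable
Create a new environment variable named TESSDATA_PREFIX with the value C:\Program Files\Tesseract-OCR\tessdata
Run tesseract -v on the command line to check the version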
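
The checks in the steps above can also be scripted. The following is a minimal sketch using only the Python standard library; the function names are illustrative and are not part of Tesseract or pytesseract. It assumes, as configured above, that TESSDATA_PREFIX points at the tessdata folder itself and that each installed language is stored there as a file named after its code with the .traineddata extension.

```python
import os
import shutil
import subprocess
from pathlib import Path


def find_tesseract():
    """Return the full path of the tesseract executable found on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version(exe):
    """Run `<exe> -v` and return the first line of its output."""
    proc = subprocess.run([exe, "-v"], capture_output=True, text=True, check=True)
    # Fall back to stderr in case the version banner is printed there.
    output = proc.stdout or proc.stderr
    return output.splitlines()[0] if output else ""


def installed_languages(tessdata_dir):
    """List language codes that have a .traineddata file in tessdata_dir."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    return sorted(p.stem for p in folder.glob("*.traineddata"))


def report():
    """Print what was found; a hypothetical helper for manual checking."""
    exe = find_tesseract()
    print("tesseract:", exe or "not found on PATH")
    if exe:
        print(tesseract_version(exe))
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix:
        print("languages:", installed_languages(prefix))
```

Calling report() after installation should print the executable path and, if Simplified Chinese was selected in the installer, a language list that includes chi_sim.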

2.2 pytesseract

pip install pytesseract

pip install pillow
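
To confirm that both packages are importable after the pip commands above, a small standard-library check can help. This is only a sketch; missing_modules and REQUIRED are illustrative names. Note that the pillow package is imported under the name PIL.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# pillow is installed as "pillow" but imported as "PIL".
REQUIRED = ["pytesseract", "PIL"]
```

missing_modules(REQUIRED) returns an empty list when both installs succeeded. If the tesseract executable is not on Path, pytesseract can instead be pointed at it by setting pytesseract.pytesseract.tesseract_cmd to the full path of the executable.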

3. Testing

Test image: image.png

3.1 tesseract

Run tesseract image.png result on the command line; the recognized text is written to result.txt

Result:

Python3WebSpider
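
The same command can be driven from Python with subprocess. The sketch below assumes tesseract is on Path as configured in section 2.1; build_command and ocr_to_text are illustrative helper names, not part of any library. The optional fmt argument is passed as a Tesseract config name such as pdf, tsv, or hocr to select one of the other output formats.

```python
import subprocess
from pathlib import Path


def build_command(image, out_base, lang=None, fmt=None):
    """Build the argument list for `tesseract IMAGE OUTBASE [-l LANG] [FMT]`."""
    cmd = ["tesseract", str(image), str(out_base)]
    if lang:
        cmd += ["-l", lang]
    if fmt:
        cmd.append(fmt)  # config name such as "pdf", "tsv" or "hocr"
    return cmd


def ocr_to_text(image, out_base, lang=None):
    """Run tesseract and return the contents of OUTBASE.txt."""
    subprocess.run(build_command(image, out_base, lang), check=True)
    # Tesseract writes its plain-text output as UTF-8.
    return Path(f"{out_base}.txt").read_text(encoding="utf-8")
```

For example, ocr_to_text("image.png", "result") runs the command shown above and returns the text that was written to result.txt.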

3.2 pytesseract

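import pytesseract
from PIL import Image

# Load the test image with Pillow
image = Image.open("image.png")
# Run OCR with the default language and print the recognized text
print(pytesseract.image_to_string(image))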

Result:

Python3WebSpider

Next, try recognizing Chinese text in image.jpg

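import pytesseract
from PIL import Image
import matplotlib.pyplot as plt

image = Image.open("image.jpg")
# Display the image before running OCR on it
plt.imshow(image)
plt.show()
# lang='chi_sim' selects the Simplified Chinese language data, which must be installed
print(pytesseract.image_to_string(image, lang='chi_sim'))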

Result:

富强民主文明和谐
自由平等公正法治
爱围敬业诚信友善

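Chinese recognition is poor: in the third line, 围 is apparently a misreading of 国.

Improving recognition accuracy requires training custom language data.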
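
To put a number on the result above, the recognized text can be compared with the expected text character by character. This is a rough sketch using difflib from the standard library. The expected string is an assumption: it supposes the test image shows the standard phrase with 爱国 in the third line. compare_text is an illustrative name.

```python
from difflib import SequenceMatcher


def compare_text(expected, recognized):
    """Return (accuracy, differences) for two strings, ignoring whitespace.

    accuracy is the share of expected characters that were matched;
    differences lists (expected_part, recognized_part) for each mismatch.
    """
    exp = "".join(expected.split())
    rec = "".join(recognized.split())
    matcher = SequenceMatcher(None, exp, rec, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    differences = [
        (exp[i1:i2], rec[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    accuracy = matched / len(exp) if exp else 1.0
    return accuracy, differences


# Assumed ground truth for the test image (not stated in the article).
EXPECTED = "富强民主文明和谐\n自由平等公正法治\n爱国敬业诚信友善"
# OCR output copied from the result above.
RECOGNIZED = "富强民主文明和谐\n自由平等公正法治\n爱围敬业诚信友善"

accuracy, differences = compare_text(EXPECTED, RECOGNIZED)
print(f"{accuracy:.1%}", differences)
```

Under that assumption, 23 of the 24 characters match, and the single difference is the pair (国, 围). The same comparison can be rerun after training to see whether accuracy actually improved.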
