A DPI pitfall I hit while debugging Tesseract-OCR and pytesseract

skyfoxbj

Article link: https://blog.csdn.net/skyfoxbj/article/details/116715213

Environment

OS: Windows 10
Python: 3.9.4
tesseract: v5.0.0-alpha.20200223
pytesseract: 0.3.7
Pillow: 8.2.0
fonttools: 4.22.0

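Recently, while working on a Python web-scraping project, I ran into a page that obfuscates its text with a custom woff font, and I needed to decode it. There are basically two approaches: one is to dump the font file to XML and identify each character from the characteristics of its stroke data; the other is to recognize the characters directly with OCR. I chose the second because it is more general. Here is how it went: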

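1. Install Tesseract-OCR
Download both the program and the language data from: https://github.com/tesseract-ocr/tesseract
User manual: https://tesseract-ocr.github.io/tessdoc/Home.html
The version I installed is tesseract v5.0.0-alpha.20200223, installed to C:\Program Files\Tesseract-OCR
After installation, check that the environment variables are set correctly:
TESSDATA_PREFIX should point to the "C:\Program Files\Tesseract-OCR\tessdata" folder
Add a new entry to Path that points to C:\Program Files\Tesseract-OCR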
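
The two environment checks above can also be scripted. This is a minimal sketch using only the standard library; the function name `check_tesseract_env` and its parameters are my own invention for illustration, not part of Tesseract or pytesseract.

```python
import os
import shutil


def check_tesseract_env(env=None, which=shutil.which):
    """Return a list of problems found; an empty list means the setup looks fine.

    `env` and `which` are injectable only so the check can be tried without
    touching the real environment (hypothetical helper, not a library API).
    """
    env = os.environ if env is None else env
    problems = []

    # TESSDATA_PREFIX should name the tessdata folder itself.
    tessdata = env.get("TESSDATA_PREFIX")
    if not tessdata:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(tessdata):
        problems.append(f"TESSDATA_PREFIX is not a directory: {tessdata}")

    # The install folder must be on PATH for `tesseract` to be found.
    if which("tesseract") is None:
        problems.append("tesseract was not found on PATH")

    return problems
```

Calling `check_tesseract_env()` with no arguments inspects the real environment of the current process, so open a new terminal after changing the variables.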

2. Install pytesseract, Pillow and fonttools
pip install pytesseract
After installing, edit pytesseract.py and set tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'
pip install pillow
pip install fonttools
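
An edit to the installed pytesseract.py is lost whenever the package is upgraded. pytesseract also lets you set `pytesseract.pytesseract.tesseract_cmd` from your own script, which achieves the same thing without touching library files. Below is a minimal sketch; `resolve_tesseract_cmd` and `DEFAULT_TESSERACT` are hypothetical names of mine, and the fallback path is simply the install location from this article.

```python
import shutil

# Install location mentioned in this article; adjust for your machine.
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(fallback=DEFAULT_TESSERACT, which=shutil.which):
    """Prefer a tesseract found on PATH, otherwise use the fallback path."""
    return which("tesseract") or fallback


try:
    import pytesseract
except ImportError:
    pytesseract = None  # keeps this sketch importable without pytesseract

if pytesseract is not None:
    # Module-level setting read by pytesseract; no library file is modified.
    pytesseract.pytesseract.tesseract_cmd = resolve_tesseract_cmd()
```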

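3. Test
Copy an image to be recognized to the root of drive D, then open PowerShell and run "tesseract d:\temp.png 1". Opening d:\1.txt, I found recognition errors. I only need to recognize digits, so I added one more argument, making the command "tesseract .\temp.png 1 digits", and the result was perfect.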
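
The same command-line test can be driven from Python, which makes it easier to capture any warnings tesseract prints. This is only a sketch: `build_tesseract_cmd` and `run_tesseract` are hypothetical helpers, and I am assuming tesseract writes its warnings to stderr, which is its usual behaviour as far as I know.

```python
import subprocess


def build_tesseract_cmd(image_path, output_base, *config_files, exe="tesseract"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE [CONFIGFILE...]."""
    return [exe, str(image_path), str(output_base), *config_files]


def run_tesseract(image_path, output_base, *config_files):
    """Run tesseract and return whatever it printed to stderr.

    The recognized text itself goes to OUTPUTBASE.txt, as in the manual test.
    """
    cmd = build_tesseract_cmd(image_path, output_base, *config_files)
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stderr
```

For the test above, `run_tesseract(r"d:\temp.png", r"d:\1", "digits")` would write d:\1.txt and return the warning text, if there is any.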

4. The program

# -*- coding:utf8 -*-

from PIL import Image, ImageDraw, ImageFont
from fontTools.ttLib import TTFont
import pytesseract


def DrawFont(font, char):
    # Measure the text, then draw it in black on a white canvas.
    # font.getsize works on the Pillow 8.2.0 used here; it was removed
    # in Pillow 10 in favour of font.getbbox.
    s = font.getsize(char)
    image = Image.new('RGB', (s[0], s[1] + 10), color='white')
    draw = ImageDraw.Draw(image)
    draw.text((0, 0), char, font=font, fill='black')
    return image


if __name__ == '__main__':
    # Open the font file
    font = ImageFont.truetype(r'0f52e9f5.woff', 200)
    wf = TTFont(r'0f52e9f5.woff')
    # Draw all characters in the font file onto one image
    c = list(wf.getBestCmap().keys())[1:]
    text = ''.join([chr(i) for i in c])  # renamed so the built-in str is not shadowed
    img = DrawFont(font, text)
    # OCR on the image that was just drawn
    result = pytesseract.image_to_string(img, config="digits")
    print(str(int(result)))

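I ran it, and the recognition result was an empty string. That started a long journey of searching and debugging. After hitting one wall after another, on a whim I saved the drawn image to disk and ran "tesseract d:\temp.png 1" in PowerShell, and got this warning: Warning: Invalid resolution 0 dpi. Using 70 instead.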
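
To confirm whether a saved PNG really lacks resolution data, you can look for its pHYs chunk, which is where the PNG format stores pixel density. Here is a minimal standard-library sketch; `png_dpi` is a hypothetical helper of mine, and it skips CRC validation for brevity.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dpi(data):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if there is none."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk is: 4-byte length, 4-byte type, data, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"pHYs" and length == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:  # 1 = pixels per metre; 0 only gives an aspect ratio
                return None
            return (round(x_ppu * 0.0254), round(y_ppu * 0.0254))
        if chunk_type in (b"IDAT", b"IEND"):
            break  # pHYs, when present, comes before the image data
        pos += 12 + length
    return None
```

For example, `png_dpi(open(r"d:\temp.png", "rb").read())` returning None would match the "0 dpi" in the warning.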

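In other words, an image created with Image.new carries no DPI information, and tesseract failed to recognize the image without it. After a lot of searching and experimenting, I finally learned that adding "image.info['dpi'] = (300, 300)" right after the Image.new statement attaches DPI information to the image. With the image saved to disk, the command line recognized it perfectly.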
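
One caveat, hedged: as far as I understand Pillow, its PNG writer takes the resolution from the `dpi` argument of `save()`, so when you save the image yourself it is safest to pass `dpi=` explicitly, using the same keyword the patched pytesseract code in this article passes. The sketch below round-trips an image through an in-memory PNG to show what is read back; `png_roundtrip_dpi` is a hypothetical helper name.

```python
import io

from PIL import Image


def png_roundtrip_dpi(image, **save_kwargs):
    """Save `image` to an in-memory PNG and return the DPI read back, or None.

    Values are rounded because PNG stores pixels per metre, not per inch.
    """
    buffer = io.BytesIO()
    image.save(buffer, format="PNG", **save_kwargs)
    buffer.seek(0)
    with Image.open(buffer) as reopened:
        dpi = reopened.info.get("dpi")
    return None if dpi is None else tuple(round(v) for v in dpi)


image = Image.new("RGB", (10, 10), color="white")
image.info["dpi"] = (300, 300)

# DPI present only in image.info versus DPI passed to save() explicitly.
dpi_from_info_only = png_roundtrip_dpi(image)
dpi_from_save_arg = png_roundtrip_dpi(image, dpi=image.info["dpi"])
```

Printing the two variables on your own Pillow version shows whether `image.info` alone is enough in your setup.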

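Full of confidence, I ran the program again, and sadly the result was still an empty string. For a while I considered switching to the other approach; if parsing the XML were not an even bigger pit, I would not be writing this article. Back to the point: after tracing and debugging, my conclusion was that the problem lies in pytesseract.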

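I stepped into pytesseract.py and focused on whether the DPI information was being passed along correctly. It turns out that pytesseract first saves the image to a temporary file, then calls tesseract on that file and returns the result. When pytesseract saves the image, however, it does not save the DPI information, so the temporary file carries none, and the tesseract call on that file goes wrong and recognition fails. The original code in pytesseract.py is: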

@contextmanager
def save(image):
    try:
        with NamedTemporaryFile(prefix='tess_', delete=False) as f:
            if isinstance(image, str):
                yield f.name, realpath(normpath(normcase(image)))
                return
            image, extension = prepare(image)
            input_file_name = f.name + extsep + extension
            image.save(input_file_name, format=image.format)
            yield f.name, input_file_name
    finally:
        cleanup(f.name)

After the change, with the part that saves the DPI information added:

@contextmanager
def save(image):
    try:
        with NamedTemporaryFile(prefix='tess_', delete=False) as f:
            if isinstance(image, str):
                yield f.name, realpath(normpath(normcase(image)))
                return
            image, extension = prepare(image)
            input_file_name = f.name + extsep + extension
            if 'dpi' in image.info.keys():
                image.save(input_file_name, format=image.format, dpi=image.info['dpi'])
            else:
                image.save(input_file_name, format=image.format)
            yield f.name, input_file_name
    finally:
        cleanup(f.name)
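
A patch to an installed library disappears on the next upgrade, so here is an alternative that I have not verified against this exact setup. Tesseract 4 and later accept a `--dpi` command-line option, and pytesseract appends the contents of `config` to the tesseract command line, so the resolution can be supplied there instead. The helper `with_dpi` below is a hypothetical name of mine, not a pytesseract API.

```python
import shlex


def with_dpi(config="", dpi=300):
    """Return a pytesseract-style config string that carries `--dpi`.

    If `config` already contains --dpi, it is returned unchanged.
    """
    args = shlex.split(config)
    if "--dpi" in args:
        return config
    # Options go first; config file names such as "digits" stay at the end.
    return shlex.join(["--dpi", str(dpi), *args])


# Intended use (needs pytesseract and tesseract installed, so not run here):
# text = pytesseract.image_to_string(img, config=with_dpi("digits"))
```

Whether this also cures the empty result described here is an assumption; it only addresses the missing-resolution warning.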

The calling program:

# -*- coding:utf8 -*-

from PIL import Image, ImageDraw, ImageFont
from fontTools.ttLib import TTFont
import pytesseract


def DrawFont(font, char):
    # Measure the text, then draw it in black on a white canvas.
    s = font.getsize(char)
    image = Image.new('RGB', (s[0], s[1] + 10), color='white')
    # Attach DPI information so the patched pytesseract save() can pass it on
    image.info['dpi'] = (300, 300)
    draw = ImageDraw.Draw(image)
    draw.text((0, 0), char, font=font, fill='black')
    return image


if __name__ == '__main__':
    # Open the font file
    font = ImageFont.truetype(r'0f52e9f5.woff', 200)
    wf = TTFont(r'0f52e9f5.woff')
    # Draw all characters in the font file onto one image
    c = list(wf.getBestCmap().keys())[1:]
    text = ''.join([chr(i) for i in c])  # renamed so the built-in str is not shadowed
    img = DrawFont(font, text)
    # OCR on the image that was just drawn
    result = pytesseract.image_to_string(img, config="digits")
    print(str(int(result)))
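
One detail in the last line is worth guarding: `int(result)` raises ValueError when the OCR result is an empty string, which is exactly the failure described above, and it also drops leading zeros. A small sketch of a more forgiving parser follows; `parse_digits` is a hypothetical helper, not part of pytesseract.

```python
def parse_digits(ocr_text):
    """Return the ASCII digits found in OCR output as a string, or None if none.

    Whitespace and control characters around the digits are ignored, and
    leading zeros are preserved because the result stays a string.
    """
    digits = "".join(ch for ch in ocr_text if ch in "0123456789")
    return digits or None
```

With this, `parse_digits(result)` returning None signals that recognition failed without crashing the program.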