pytesseract: A Remarkably Handy Python Library!

Latest recommended article published 2025-03-27 19:59:31

黑马聊AI

Article link: https://blog.csdn.net/2401_83617404/article/details/140939532

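pytesseract is a Python wrapper that calls the Tesseract-OCR engine to recognize text in images. It converts the text in an image to a string, which makes it a convenient tool for image text recognition tasks.

How to install pytesseract

To use pytesseract you first need to install the Tesseract-OCR engine, which pytesseract depends on. Download and install the version for your operating system from the official Tesseract project site.

Installing the pytesseract library in your Python environment is simple. Open a command-line tool and run: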

pip install pytesseract

After installation, import pytesseract together with PIL (provided by the Pillow package and used for image handling) in your Python code:

import pytesseract
from PIL import Image

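Make sure the Tesseract-OCR installation directory is on the system PATH; otherwise pytesseract cannot find the OCR engine. On Windows this usually means adding the folder that contains tesseract.exe to the PATH variable, while on Linux or macOS you only need to confirm that the tesseract command runs directly from the shell. Alternatively, set pytesseract.pytesseract.tesseract_cmd to the full path of the executable.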
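
The PATH check described above can be sketched in a few lines. This is a minimal illustration rather than an official recipe: the helper names locate_tesseract and configure_pytesseract are hypothetical, and any extra candidate paths you pass in are your own assumptions about where Tesseract was installed.

```python
import shutil
from pathlib import Path


def locate_tesseract(extra_candidates=()):
    """Return the path of the tesseract executable, or None if it is not found."""
    # First check the PATH, which is what pytesseract relies on by default.
    found = shutil.which("tesseract")
    if found:
        return found
    # Then try explicit locations supplied by the caller (hypothetical paths).
    for candidate in extra_candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None


def configure_pytesseract(extra_candidates=()):
    """Point pytesseract at the located executable; raise if there is none."""
    import pytesseract  # imported here so locate_tesseract works without it

    path = locate_tesseract(extra_candidates)
    if path is None:
        raise RuntimeError("Tesseract-OCR was not found; install it or pass its path")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

On Windows, for example, you might call configure_pytesseract([r'C:\Program Files\Tesseract-OCR\tesseract.exe']), assuming that is where the installer placed the executable.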

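Features of pytesseract

Accuracy: recognition quality comes from the Tesseract engine and is high on clean, high-resolution, well-contrasted images; noisy or skewed input lowers it.
Cross-platform: works on Windows, Linux, and macOS.
Easy to integrate: fits into a Python project with a few lines of code.
Rich functionality: supports many recognition languages, and Tesseract itself can be trained on custom data.
Open source and free: released under the Apache 2.0 license, so it is free to use and modify.

Basic features of pytesseract

pytesseract is a Python package that converts text in images to strings. It wraps the Tesseract-OCR engine, a powerful optical character recognition (OCR) engine.

Text recognition

The core function of pytesseract is recognizing text in an image. The simple example below shows how to use the library for that.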

from PIL import Image
import pytesseract

# Open the image file
image = Image.open('path_to_image.jpg')

# Run OCR on the image with pytesseract
text = pytesseract.image_to_string(image)

print(text)

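Language selection

pytesseract lets you specify the OCR language. English is the default, and it is easy to change as long as the matching language data (for example chi_sim.traineddata) is installed.

from PIL import Image
import pytesseract

# Open the image file
image = Image.open('path_to_image.jpg')

# Set the language to Simplified Chinese
text = pytesseract.image_to_string(image, lang='chi_sim')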

print(text)
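
Because a missing language file only shows up as an error at recognition time, it can help to check first. The sketch below is one possible way to do that; missing_languages and check_languages are hypothetical helper names, and it assumes a pytesseract version that provides get_languages.

```python
def missing_languages(lang, installed):
    """Return the codes in a lang string such as 'chi_sim+eng' that are not installed."""
    requested = [code for code in lang.split("+") if code]
    return [code for code in requested if code not in installed]


def check_languages(lang):
    """Ask Tesseract for its installed languages and report the missing ones."""
    import pytesseract  # requires pytesseract and Tesseract-OCR to be installed

    installed = pytesseract.get_languages(config="")
    return missing_languages(lang, installed)
```

If check_languages('chi_sim') returns ['chi_sim'], the Simplified Chinese data file still needs to be installed before the example above can run.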

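Configuration options

pytesseract passes configuration options through to Tesseract, which lets you customize the OCR process.

from PIL import Image
import pytesseract

# Open the image file
image = Image.open('path_to_image.jpg')

# Custom options: --oem 3 is the default engine mode, --psm 6 assumes a single uniform block of text
custom_oem_psm_config = r'--oem 3 --psm 6'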
text = pytesseract.image_to_string(image, config=custom_oem_psm_config)

print(text)
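
Tesseract accepts only numeric values for these two options (--oem from 0 to 3, --psm from 0 to 13), so a small builder can catch mistakes before the engine is called. This is a sketch under that assumption; build_config is a hypothetical helper, not part of pytesseract.

```python
def build_config(oem=3, psm=3, variables=None):
    """Build a Tesseract config string such as '--oem 3 --psm 6'."""
    if not isinstance(oem, int) or oem not in range(0, 4):
        raise ValueError("--oem must be an integer from 0 to 3")
    if not isinstance(psm, int) or psm not in range(0, 14):
        raise ValueError("--psm must be an integer from 0 to 13")
    parts = [f"--oem {oem}", f"--psm {psm}"]
    # Extra Tesseract variables are passed as repeated '-c name=value' options.
    for name, value in (variables or {}).items():
        parts.append(f"-c {name}={value}")
    return " ".join(parts)
```

The result can be passed directly as the config argument, for example pytesseract.image_to_string(image, config=build_config(3, 6)).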

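Output formats

pytesseract supports several output types through pytesseract.Output (STRING, BYTES, DICT, and DATAFRAME), and image_to_pdf_or_hocr can produce a searchable PDF or hOCR. There is no Output.JSON member, so the example below serializes the DICT result to JSON.

import json
from PIL import Image
import pytesseract

# Open the image file
image = Image.open('path_to_image.jpg')

# Get the result as a dict, then serialize it to JSON
data = json.dumps(pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT), ensure_ascii=False)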

print(data)

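Output coordinates

pytesseract can report the position of each recognized word, which is useful for image processing and data analysis.

from PIL import Image
import pytesseract

# Open the image file
image = Image.open('path_to_image.jpg')

# Get the words together with their coordinates
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

for i in range(len(data['text'])):
    # conf is -1 for non-word rows and may arrive as a float string in some versions
    if float(data['conf'][i]) > 60: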
        (x, y, w, h) = (data['left'][i], data['top'][i], data['width'][i], data['height'][i])
        text = data['text'][i]
        print(f'Text: {text} - Position: ({x}, {y}, {w}, {h})')
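
The dict returned by image_to_data also carries page_num, block_num, par_num, and line_num for every entry, so the individual words can be reassembled into lines. The following is a minimal sketch of that idea; group_words_into_lines is a hypothetical helper and the confidence cutoff of 60 simply mirrors the example above.

```python
from collections import defaultdict


def group_words_into_lines(data, min_conf=60.0):
    """Join confident words from an image_to_data dict into text lines."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        # Skip empty entries and low-confidence words (conf is -1 for non-word rows).
        if not word.strip() or float(data["conf"][i]) <= min_conf:
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append(word)
    # Sort the keys so lines come out in page/block/paragraph/line order.
    return [" ".join(lines[key]) for key in sorted(lines)]
```

Calling group_words_into_lines(data) on the dict from the example returns one string per recognized line, with low-confidence words left out.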

These basic features show how capable and flexible pytesseract is, giving programmers a simple, easy-to-use tool for handling text in images.

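Advanced features of pytesseract

Using configuration parameters

pytesseract lets us tune the OCR process through configuration parameters such as the OCR engine mode and the page segmentation mode, which can improve recognition accuracy and speed.

from PIL import Image
import pytesseract

# Tesseract expects numeric values: --oem 1 selects the LSTM engine, --psm 3 is fully automatic page segmentation
# (--oem 2 combines the legacy and LSTM engines but needs traineddata files that include the legacy model)
custom_oem_psm_config = r'--oem 1 --psm 3'
text = pytesseract.image_to_string(Image.open('image.jpg'), config=custom_oem_psm_config)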
print(text)

Outputting text coordinates

pytesseract can output the coordinates of the recognized text, which is useful for image processing and layout analysis.

from PIL import Image
import pytesseract

# Recognize the text in the image along with its coordinates
image = Image.open('image.jpg')
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for i in range(len(data['text'])):
    if float(data['conf'][i]) > 60:  # keep only confident results
        (x, y, w, h) = (data['left'][i], data['top'][i], data['width'][i], data['height'][i])
        print(f"Text: {data['text'][i]}, position: ({x}, {y}), width: {w}, height: {h}")
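
For layout analysis it is often enough to know the region that contains all confidently recognized words. A minimal sketch, assuming the dict layout returned by image_to_data; text_region is a hypothetical helper name.

```python
def text_region(data, min_conf=60.0):
    """Return (left, top, right, bottom) enclosing all confident words, or None."""
    boxes = [
        (
            data["left"][i],
            data["top"][i],
            data["left"][i] + data["width"][i],
            data["top"][i] + data["height"][i],
        )
        for i, word in enumerate(data["text"])
        if word.strip() and float(data["conf"][i]) > min_conf
    ]
    if not boxes:
        return None
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))
```

The returned tuple uses the same (left, upper, right, lower) order as PIL's Image.crop, so image.crop(text_region(data)) trims the image to the text area when a region is found.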

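Recognizing multiple languages

pytesseract can recognize several languages at once: join the language codes with + and pass them through the lang parameter.

from PIL import Image
import pytesseract

# Recognize English and French (both traineddata files must be installed)
custom_config = r'--psm 3 --oem 1'
image = Image.open('image.jpg')
text = pytesseract.image_to_string(image, lang='eng+fra', config=custom_config)
print(text)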

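Using a trained model

If the built-in OCR models do not meet a specific need, you can train your own model. Tesseract loads it as a .traineddata file: place the file in a tessdata directory and select it by name through the lang parameter (image_to_string has no model argument).

from PIL import Image
import pytesseract

# Load a custom model; 'path/to/your/tessdata' and 'my_model' are placeholders
# for the directory that holds my_model.traineddata and for the model name
image = Image.open('image.jpg')
tessdata_dir_config = r'--tessdata-dir "path/to/your/tessdata"'
text = pytesseract.image_to_string(image, lang='my_model', config=tessdata_dir_config)
print(text)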

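Recognizing tables in images

pytesseract can read the text inside a table image, but it does not detect the table structure. tabula-py reads tables from PDF files and cannot parse an OCR string, so the example below keeps the spacing between columns and splits each line into cells instead.

import re
from PIL import Image
import pytesseract

# Recognize the table text, preserving the gaps between columns
image = Image.open('table_image.jpg')
text = pytesseract.image_to_string(image, config='--psm 6 -c preserve_interword_spaces=1')
# Treat runs of two or more spaces as column separators
rows = [re.split(r'\s{2,}', line.strip()) for line in text.splitlines() if line.strip()]
for row in rows:
    print(row)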

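Handling images with complex backgrounds

For images with complex backgrounds, preprocessing steps can improve the OCR recognition rate.

from PIL import Image, ImageFilter, ImageOps
import pytesseract

# Preprocess the image: enhance edges, then convert to grayscale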
image = Image.open('complex_image.jpg')
image = image.filter(ImageFilter.EDGE_ENHANCE_MORE)
image = ImageOps.grayscale(image)
text = pytesseract.image_to_string(image)
print(text)
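
A common further preprocessing step is binarization, which turns the grayscale image into pure black and white. The sketch below uses Otsu's method to pick the threshold from a 256-bin histogram. Otsu's method is not mentioned in the original example; it is offered here as one reasonable choice, and otsu_threshold is a hypothetical helper name.

```python
def otsu_threshold(histogram):
    """Return the gray level (0-255) that maximizes the between-class variance."""
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))
    best_threshold, best_variance = 0, -1.0
    background_count, background_sum = 0, 0
    for level, count in enumerate(histogram):
        # Pixels at or below the current level form the background class.
        background_count += count
        background_sum += level * count
        if background_count == 0:
            continue
        foreground_count = total - background_count
        if foreground_count == 0:
            break
        mean_background = background_sum / background_count
        mean_foreground = (weighted_total - background_sum) / foreground_count
        variance = background_count * foreground_count * (mean_background - mean_foreground) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold
```

With Pillow, the grayscale image from the example can be binarized using t = otsu_threshold(image.histogram()) followed by image = image.point(lambda p: 255 if p > t else 0) before it is passed to image_to_string.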

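Practical applications of pytesseract

ID number recognition

When processing large volumes of documents, automatically extracting ID numbers is a common requirement, and pytesseract makes a first pass at this straightforward.

from PIL import Image
import pytesseract

# Load the image
image = Image.open('id_card.jpg')

# Recognize the text in the image with pytesseract
text = pytesseract.image_to_string(image, lang='eng', config='--psm 6')

# Print the result
print(text)
# The output contains the ID number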
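
The OCR output is plain text, so the ID number still has to be picked out of it. The sketch below assumes the 18-character resident ID format (17 digits plus a check character that may be X) and its mod-11 checksum. The article itself does not specify a format, so treat both the pattern and the helper names as assumptions.

```python
import re

# 18 characters: 17 digits followed by a digit or X, not embedded in a longer number.
ID_PATTERN = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CODES = "10X98765432"


def has_valid_checksum(id_number):
    """Verify the final check character against the first 17 digits."""
    total = sum(int(digit) * weight for digit, weight in zip(id_number[:17], WEIGHTS))
    return CHECK_CODES[total % 11] == id_number[17].upper()


def find_id_numbers(ocr_text):
    """Return all candidates in the OCR text that pass the checksum."""
    candidates = ID_PATTERN.findall(ocr_text)
    return [candidate.upper() for candidate in candidates if has_valid_checksum(candidate)]
```

The checksum filter is useful with OCR because a single misread digit usually makes the number fail validation, which flags the result for manual review instead of silently accepting it.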

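CAPTCHA recognition

CAPTCHAs are a common obstacle in automated testing and web scraping. pytesseract can help recognize simple CAPTCHAs; heavily distorted or noisy ones usually need preprocessing or a different approach.

from PIL import Image
import pytesseract

# Load the CAPTCHA image
image = Image.open('captcha.jpg')

# Recognize the CAPTCHA with pytesseract
captcha_text = pytesseract.image_to_string(image, lang='eng', config='--psm 7')

# Print the result
print(captcha_text)
# Outputs the CAPTCHA text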
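
Raw OCR output for a CAPTCHA usually contains stray whitespace or punctuation. A minimal sketch of two mitigations follows: restricting Tesseract to letters and digits with the tessedit_char_whitelist variable, and cleaning the result afterwards. It assumes an alphanumeric CAPTCHA, and whitelist support can vary between Tesseract versions and engine modes; clean_captcha is a hypothetical helper.

```python
import string

ALLOWED = string.ascii_letters + string.digits
# Restrict recognition to letters and digits; --psm 7 treats the image as a single text line.
CAPTCHA_CONFIG = f"--psm 7 -c tessedit_char_whitelist={ALLOWED}"


def clean_captcha(raw_text, expected_length=None):
    """Keep only letters and digits; return None if the length is unexpected."""
    cleaned = "".join(ch for ch in raw_text if ch in ALLOWED)
    if expected_length is not None and len(cleaned) != expected_length:
        return None
    return cleaned
```

Usage would look like clean_captcha(pytesseract.image_to_string(image, lang='eng', config=CAPTCHA_CONFIG), expected_length=4), where the length of 4 is only an example.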

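Document information extraction

Extracting key information such as invoice numbers and dates from scanned documents is another application of pytesseract.

from PIL import Image
import pytesseract

# Load the document image
image = Image.open('invoice.jpg')

# Extract the document text with pytesseract
text = pytesseract.image_to_string(image, lang='eng', config='--psm 6')

# Print the result
print(text)
# Extract the required information from the output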
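
Pulling the fields out of the recognized text is typically done with regular expressions. The sketch below assumes an English invoice with a label such as "Invoice No" and a year-first date. Both patterns are hypothetical and need adapting to the real document layout.

```python
import re

# Year-first dates such as 2024-03-05, 2024/3/5, or 2024.3.5.
DATE_PATTERN = re.compile(r"\b(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})\b")
# Hypothetical label format: "Invoice No" or "Invoice Number" followed by a code.
INVOICE_NO_PATTERN = re.compile(
    r"Invoice\s*(?:Number|No)\b\.?\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE
)


def extract_invoice_fields(ocr_text):
    """Return a dict with the first invoice number and date found, or None for each."""
    fields = {"invoice_number": None, "date": None}
    number_match = INVOICE_NO_PATTERN.search(ocr_text)
    if number_match:
        fields["invoice_number"] = number_match.group(1)
    date_match = DATE_PATTERN.search(ocr_text)
    if date_match:
        year, month, day = (int(part) for part in date_match.groups())
        fields["date"] = f"{year:04d}-{month:02d}-{day:02d}"
    return fields
```

Fields that come back as None are a signal that the OCR result or the pattern needs attention for that document.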

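Text comparison

Document review often requires checking whether two documents have the same content. pytesseract makes a quick check possible, although an exact string comparison is sensitive to even a single misrecognized character.

from PIL import Image
import pytesseract

# Load the two document images
image1 = Image.open('document1.jpg')
image2 = Image.open('document2.jpg')

# Extract the document content with pytesseract
text1 = pytesseract.image_to_string(image1, lang='eng', config='--psm 6')
text2 = pytesseract.image_to_string(image2, lang='eng', config='--psm 6')

# Compare the content of the two documents
print(text1 == text2)
# Outputs the comparison result, showing whether the texts are identical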
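
Since OCR output rarely matches character for character, a similarity score tends to be more useful than strict equality. A minimal sketch using the standard library's difflib follows; the 0.95 threshold is an arbitrary assumption to tune for your documents.

```python
import difflib


def normalize(text):
    """Collapse all runs of whitespace so layout differences are ignored."""
    return " ".join(text.split())


def similarity(text1, text2):
    """Return a ratio between 0.0 and 1.0 after whitespace normalization."""
    return difflib.SequenceMatcher(None, normalize(text1), normalize(text2)).ratio()


def roughly_equal(text1, text2, threshold=0.95):
    """Treat two OCR results as the same document above the given similarity."""
    return similarity(text1, text2) >= threshold
```

Replacing print(text1 == text2) with print(roughly_equal(text1, text2)) tolerates small recognition errors while still flagging documents that genuinely differ.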

Batch processing images

When there are many images to process, automating text recognition with pytesseract greatly improves efficiency.

import os
from PIL import Image
import pytesseract

# Path to the image folder
image_folder = 'path_to_image_folder'

# Iterate over all images in the folder
for filename in os.listdir(image_folder):
    if filename.endswith('.jpg'):
        # Load the image
        image_path = os.path.join(image_folder, filename)
        image = Image.open(image_path)

        # Recognize the text in the image with pytesseract
        text = pytesseract.image_to_string(image, lang='eng', config='--psm 6')

        # Print the result
        print(filename + ': ' + text)
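
The loop above only matches lowercase .jpg names and stops at the first unreadable file. A slightly more defensive variant is sketched below; the suffix list and the helper names is_image_file and ocr_folder are assumptions rather than part of pytesseract.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def is_image_file(filename):
    """Match image extensions regardless of letter case."""
    return Path(filename).suffix.lower() in IMAGE_SUFFIXES


def ocr_folder(folder, lang="eng", config="--psm 6"):
    """Yield (filename, text or None, error or None) for each image in a folder."""
    from PIL import Image  # imported here; requires Pillow and pytesseract
    import pytesseract

    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or not is_image_file(path.name):
            continue
        try:
            with Image.open(path) as image:
                text = pytesseract.image_to_string(image, lang=lang, config=config)
            yield path.name, text, None
        except (OSError, pytesseract.TesseractError) as error:
            # A corrupt file or an OCR failure should not stop the whole batch.
            yield path.name, None, error
```

Iterating with for name, text, error in ocr_folder('path_to_image_folder') lets the caller log failures and keep going.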

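QR code recognition

Tesseract is an OCR engine and does not decode QR codes, so pytesseract alone cannot extract the information stored in one. A dedicated decoder is needed; the example below uses the pyzbar library, which also works with PIL images.

from PIL import Image
from pyzbar.pyzbar import decode

# Load the QR code image
image = Image.open('qrcode.jpg')

# Decode the QR codes found in the image with pyzbar
results = decode(image)

# Print the result
for result in results:
    print(result.data.decode('utf-8'))
# Outputs the text stored in each QR code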

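Summary

In this article we covered the basic operations and more advanced techniques for recognizing text in images with pytesseract. As a wrapper around a powerful OCR engine, pytesseract helps extract text from images quickly, from simple text recognition to text on complex backgrounds once the image has been preprocessed. In practice it fits scenarios such as data scraping and automated information processing, where it can save considerable manual work. Hopefully this article helps you understand and use pytesseract and get started with image text recognition.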