Batch Text Extraction from Images with Python and Tesseract

Latest recommended article published 2024-08-03 15:35:50

南河༜


Article link: https://blog.csdn.net/jxs_hk/article/details/139410553


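In Python, you can batch-extract the text in images by driving the Tesseract OCR engine from a script. Tesseract is an open-source OCR engine that can recognize text in many languages; how accurate the result is depends on image quality and on the language data that is installed.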

First, install Python and the Tesseract OCR engine. Then install the Python wrapper for Tesseract (pytesseract) and any other libraries you need, such as Pillow (for image handling).

# Install the Tesseract OCR engine (Windows):
# download the installer from https://github.com/UB-Mannheim/tesseract/wiki and run it

# Install the Python wrapper for Tesseract (pytesseract) and Pillow
pip install pytesseract pillow
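
Before running the scripts, it can help to confirm that the Tesseract executable is actually reachable. The following is a minimal sketch that uses only the standard library; `find_tesseract` and its `candidates` parameter are hypothetical names introduced here for illustration and are not part of pytesseract.

```python
import shutil


def find_tesseract(candidates=("tesseract",)):
    """Return the first usable Tesseract executable path, or None.

    Each candidate is either a command name looked up on PATH
    or an explicit path to the executable (hypothetical helper).
    """
    for candidate in candidates:
        # shutil.which resolves command names via PATH and also
        # accepts explicit paths to executable files
        found = shutil.which(candidate)
        if found:
            return found
    return None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; set tesseract_cmd explicitly")
    else:
        print(f"Using tesseract at {path}")
```

If a path is returned, it can be assigned to `pytesseract.pytesseract.tesseract_cmd`; otherwise, pass the full installation path as one of the candidates.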

Example

# @description: save the text extracted from all images into a single txt file
import os

import pytesseract
from PIL import Image

# Path to the Tesseract-OCR executable (Windows path)
pytesseract.pytesseract.tesseract_cmd = r'D:\Software\WorkSoftware\Tesseract-OCR\tesseract.exe'

# Folder containing the images
image_folder = 'C:\\Users\\……\\Desktop\\test'

# Path of the output text file
output_file = 'D:\\Software\\ITSoftware\\PyCharm 2021.3\\Workspace\\picture2txt\\output.txt'

# Read every image in the folder and extract its text
# (no lang is passed, so Tesseract falls back to its default language, English)
# Write as UTF-8 so the result does not depend on the system's default encoding
with open(output_file, 'w', encoding='utf-8') as f:
    for image_name in os.listdir(image_folder):
        # Compare extensions case-insensitively so that e.g. .PNG is not skipped
        if image_name.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp')):
            image_path = os.path.join(image_folder, image_name)
            text = pytesseract.image_to_string(Image.open(image_path))
            f.write(text)
            print(f'Extracted text from {image_name}')
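
Because `os.listdir` returns entries in arbitrary order, the order of the text in the combined output file can vary between runs. A minimal sketch of one way to get a stable, case-insensitive file selection is shown below; `list_images` and `IMAGE_EXTENSIONS` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Extensions handled by the script above, in lowercase
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp"}


def list_images(folder, extensions=IMAGE_EXTENSIONS):
    """Return the image files in `folder`, sorted by path.

    Extensions are matched case-insensitively and directories are skipped.
    """
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
```

The loop in the script could then iterate over `list_images(image_folder)` and pass each path to `Image.open`, which accepts `Path` objects.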

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2024/6/3 10:35
# @Author  : River❀
# @File    : pic2text.py
# @Software: PyCharm
# @description: write one text file per image; some results are inaccurate and need manual verification

import os
from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'D:\Software\WorkSoftware\Tesseract-OCR\tesseract.exe'

# Image folder and output folder
image_folder = 'C:\\Users\\……\\Desktop\\test'
output_folder = 'D:\\Software\\ITSoftware\\PyCharm 2021.3\\Workspace\\picture2txt\\files'

# Make sure the output folder exists
os.makedirs(output_folder, exist_ok=True)

# Iterate over all image files in the image folder
for filename in os.listdir(image_folder):
    # Compare extensions case-insensitively so that e.g. .PNG is not skipped
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        # Open the image file
        img_path = os.path.join(image_folder, filename)
        image = Image.open(img_path)

        # Extract text with Tesseract; lang='chi_sim' is Simplified Chinese
        # (use lang='chi_sim+eng' for mixed Chinese and English)
        text = pytesseract.image_to_string(image, lang='chi_sim')

        # Save the extracted text to <image name>.txt;
        # os.path.splitext returns (name, extension), so take element [0]
        base_name = os.path.splitext(filename)[0]
        output_file = os.path.join(output_folder, f'{base_name}.txt')
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(text)
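
Since some results need manual verification, one possible way to decide which files to check first is to look at the per-word confidence values that Tesseract reports. The sketch below is an assumption about how this could be done, not the author's method: `mean_confidence` is a hypothetical helper that works on the dictionary returned by `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`, and the threshold of 60 in the usage comment is arbitrary.

```python
def mean_confidence(data):
    """Average word confidence (0-100) from an image_to_data-style dict.

    Entries with a negative confidence (non-word boxes) or empty text
    are ignored. Returns None when no words were recognized.
    """
    scores = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # confidences may arrive as strings or numbers
        if conf >= 0 and str(text).strip():
            scores.append(conf)
    return sum(scores) / len(scores) if scores else None


# Usage sketch (requires pytesseract and an installed Tesseract):
# import pytesseract
# from PIL import Image
# data = pytesseract.image_to_data(Image.open(img_path), lang='chi_sim',
#                                  output_type=pytesseract.Output.DICT)
# if (mean_confidence(data) or 0) < 60:  # 60 is an arbitrary threshold
#     print(f'{img_path}: low confidence, review manually')
```

A low average does not prove the text is wrong, and a high one does not prove it is right; it only helps prioritize which outputs to review by hand.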