python docx 获取图片包括获取压缩包中的图片与整页转换成图片方案

路人与大师

已于 2024-05-31 18:17:13 修改

阅读量693

点赞数 9

文章标签： python 开发语言

于 2024-05-31 17:45:55 首次发布

本文链接：https://blog.csdn.net/weixin_41046245/article/details/139358634

版权

要使用 python-docx 提取 Word 文档中的图片，首先要了解的是，python-docx 本身并不支持直接提取嵌入在文档中的图片。但可以通过一些其他方法来实现。以下是一个基本的方法，使用 python-docx 结合 zipfile 模块来提取图片。

首先，确保你已经安装了必要的库：

pip install python-docx

以下是提取图片的步骤：

读取 Word 文档：将 .docx 文件视为一个 zip 文件。
提取图片文件：从 zip 文件中找到并提取所有图片文件。

下面是一个示例代码来演示如何实现这个过程：

import zipfile
import os

def extract_images_from_docx(docx_path, output_dir):
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)
    
    # Open the docx file as a zip file
    with zipfile.ZipFile(docx_path, 'r') as docx:
        # Iterate over each file in the docx (zip) file
        for file in docx.namelist():
            # Check if the file is an image (Word stores images in the word/media/ directory)
            if file.startswith('word/media/'):
                # Extract the image to the output directory
                image_data = docx.read(file)
                image_filename = os.path.basename(file)
                output_path = os.path.join(output_dir, image_filename)
                
                # Write the image data to the output file
                with open(output_path, 'wb') as image_file:
                    image_file.write(image_data)
                print(f"Extracted {image_filename} to {output_path}")

# Example usage
docx_path = 'example.docx'  # Replace with your .docx file path
output_dir = 'extracted_images'  # Directory to save extracted images
extract_images_from_docx(docx_path, output_dir)

这个代码将会：

创建一个目标目录来保存提取的图片（如果目录不存在）。
打开 .docx 文件作为 zip 文件。
遍历 zip 文件中的所有文件，找到存储图片的目录（通常是 word/media/）。
将找到的图片文件提取并保存到目标目录中。

确保将 docx_path 替换为你的 Word 文档的实际路径，将 output_dir 替换为你希望保存图片的目录路径。运行这段代码后，所有嵌入在 Word 文档中的图片将会被提取并保存到指定的目录中。

我希望一整页都转换成图片输出

要将整页转换成图片输出，这超出了 python-docx 的功能范围。你需要使用其他工具或库来实现这一点。python-docx 主要用于操作 Word 文档的内容，而不是呈现或渲染它们。为了将 Word 文档的页面转换成图片，通常有以下几个选择：

使用 python-docx 和 python-pptx：将每个页面的内容转换成幻灯片，然后将幻灯片转换为图片。
使用 pywin32 和 Word COM：在 Windows 上通过自动化 Word 应用程序将页面另存为图片。
使用 libreoffice 的命令行工具：在命令行中使用 LibreOffice 将 Word 文档转换为 PDF，然后将 PDF 转换为图片。
使用 pdf2image：将 Word 文档先转换成 PDF，然后将 PDF 转换为图片。

下面是一个示例，展示了如何使用 libreoffice 和 pdf2image 将 Word 文档转换为图片：

安装必要的软件和库

首先，你需要安装 LibreOffice 和 pdf2image Python 库：

sudo apt-get install libreoffice
pip install pdf2image

转换流程

使用 LibreOffice 将 Word 文档转换为 PDF。
使用 pdf2image 将 PDF 转换为图片。

示例代码

import subprocess
from pdf2image import convert_from_path

def convert_docx_to_pdf(docx_path, pdf_path):
    # 使用LibreOffice将DOCX转换为PDF
    subprocess.run(['libreoffice', '--headless', '--convert-to', 'pdf', docx_path, '--outdir', os.path.dirname(pdf_path)], check=True)

def convert_pdf_to_images(pdf_path, output_dir):
    # 使用pdf2image将PDF转换为图片
    os.makedirs(output_dir, exist_ok=True)
    images = convert_from_path(pdf_path)
    for i, image in enumerate(images):
        image_path = os.path.join(output_dir, f'page_{i + 1}.png')
        image.save(image_path, 'PNG')
        print(f"Saved {image_path}")

# 示例用法
docx_path = 'example.docx'  # 替换为你的DOCX文件路径
pdf_path = 'example.pdf'  # 临时PDF文件路径
output_dir = 'output_images'  # 图片输出目录

# 将DOCX转换为PDF
convert_docx_to_pdf(docx_path, pdf_path)

# 将PDF转换为图片
convert_pdf_to_images(pdf_path, output_dir)

解释

convert_docx_to_pdf 函数：使用 LibreOffice 的命令行工具将 DOCX 文件转换为 PDF。
convert_pdf_to_images 函数：使用 pdf2image 库将 PDF 文件的每一页转换为 PNG 图片，并保存到指定的输出目录。

确保你在系统上正确安装了 LibreOffice，并且命令 libreoffice 可以从命令行运行。运行这段代码后，你的 Word 文档的每一页将被转换为 PNG 图片并保存到指定的输出目录中。

在linux系统中会出现NotImplementedError: docx2pdf is not implemented for linux as it requires Microsoft Word to be installed
那么如何在linux系统中实现功能呢？
由于 docx2pdf 需要 Microsoft Word 的支持，目前只能在 Windows 环境下运行。但在 Linux 环境下，我们可以使用 LibreOffice 的命令行工具 soffice 将 .docx 文件转换为 PDF 文件。然后再使用 pdf2image 将 PDF 文件的每一页转换为图片。

首先，确保安装 LibreOffice 和所需的 Python 库：

sudo apt-get install libreoffice
pip install pdf2image

然后，可以使用以下代码将 .docx 文件转换为 PDF 并将每一页转换为图片：

import subprocess
from pdf2image import convert_from_path
import os

def docx_to_pdf(docx_path, pdf_path):
    # Use LibreOffice to convert DOCX to PDF
    subprocess.run(['soffice', '--headless', '--convert-to', 'pdf', '--outdir', os.path.dirname(pdf_path), docx_path])

def pdf_to_images(pdf_path, output_dir):
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)
    
    # Convert the PDF to images
    images = convert_from_path(pdf_path)
    
    # Save each image
    for i, image in enumerate(images):
        image_path = os.path.join(output_dir, f'page_{i + 1}.png')
        image.save(image_path, 'PNG')
        print(f"Saved {image_path}")

def docx_to_images(docx_path, output_dir):
    pdf_path = os.path.join(output_dir, 'temp.pdf')
    
    # Convert DOCX to PDF
    docx_to_pdf(docx_path, pdf_path)
    
    # Convert PDF to images
    pdf_to_images(pdf_path, output_dir)
    
    # Clean up temporary PDF file
    os.remove(pdf_path)

# Example usage
docx_path = 'example.docx'  # Replace with your .docx file path
output_dir = 'output_images'  # Directory to save output images
docx_to_images(docx_path, output_dir)

这个代码将会：

使用 LibreOffice 将 .docx 文件转换为 PDF 文件。
使用 pdf2image 将生成的 PDF 文件的每一页转换为图片。
将生成的每一页图片保存到目标目录中，并命名为 page_1.png, page_2.png, 等等。
删除临时生成的 PDF 文件。

确保将 docx_path 替换为你的 Word 文档的实际路径，将 output_dir 替换为你希望保存图片的目录路径。运行这段代码后，Word 文档的每一页将会被转换成图片并保存到指定的目录中。

路人与大师

关注

9
点赞
踩
8

收藏

觉得还不错? 一键收藏
0
评论
python docx 获取图片包括获取压缩包中的图片与整页转换成图片方案

要使用提取 Word 文档中的图片，首先要了解的是，本身并不支持直接提取嵌入在文档中的图片。但可以通过一些其他方法来实现。以下是一个基本的方法，使用结合zipfile模块来提取图片。.docx.docx确保将docx_path替换为你的 Word 文档的实际路径，将output_dir替换为你希望保存图片的目录路径。运行这段代码后，所有嵌入在 Word 文档中的图片将会被提取并保存到指定的目录中。我希望一整页都转换成图片输出要将整页转换成图片输出，这超出了的功能范围。
复制链接

扫一扫