Python File Processing: Parsing Text, Images, and Checkboxes from docx/Word Files

EelBarb

Last modified: 2024-04-02 15:23:25

First published: 2024-04-02 15:18:59

Article link: https://blog.csdn.net/G541788_/article/details/137265022

Foreword

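For a project, I needed to provide a feature that parses the contents of docx files. I expected this to be fairly simple, but parsing checkbox options tripped me up: for quite a while, nothing I found through various channels actually solved the problem, and I ended up taking a long detour.

Eventually I found the key idea and the clues I needed in the Issues of the python-docx project on GitHub. With some further work I was able to parse text, images, and checkboxes from docx/Word files.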

Feature: Read checkboxes in Word forms · Issue #224 · python-openxml/python-docx · GitHub

Basic python-docx operations

# Install the python-docx module
pip install python-docx

import os
import docx
import time

# Directory where the extracted images are saved
image_save_path = 'appendix_dir'

# Read the table text and the embedded images from a docx file
def read_table_from_docx(file_path):
    """
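    :param file_path: path of the docx/Word file to read
    :return: table_data (a list of rows, each a list of cell texts),
             images (a list of raw image bytes)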
    """

    # Open the docx/Word file
    doc = docx.Document(file_path)

    # Get the table objects in the document
    tables = doc.tables
    table_data = []
    images = []

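    # Collect the images referenced by the main document part and store
    # their raw bytes in the images list. Externally linked images have no
    # target part inside the package, so they are skipped.
    for rel in doc.part.rels.values():
        if "image" in rel.reltype and not rel.is_external: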
            image = rel.target_part
            image_data = image.blob
            images.append(image_data)

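    # Read the text content of the tables.
    # Special characters and checkboxes cannot be parsed this way.
    # Also note that the text of a merged cell is repeated for every
    # row and column the merged cell spans.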
    for table in tables:
        for row in table.rows:
            row_data = []
            for cell in row.cells:
                # print(cell, cell.text)
                row_data.append(cell.text)
            table_data.append(row_data)

    return table_data, images

table_data, images = read_table_from_docx('template.docx')
print(table_data)

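# Save the extracted images to local disk
os.makedirs(image_save_path, exist_ok=True)  # create the output directory if it is missing
for i, image_data in enumerate(images):
    # Build the file name. The index i keeps names unique even when several
    # images are written within the same millisecond. The .jpg suffix is
    # assumed; the embedded image may actually be PNG or another format.
    image_name = f"expert_{int(time.time() * 1000)}_{i}.jpg"
    with open(os.path.join(image_save_path, image_name), "wb") as f:
        f.write(image_data)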
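
The comments in the table loop above warn that the text of a merged cell shows up once for every grid position it covers. One possible workaround, shown here only as a minimal sketch and not as the approach used in this article, is to skip cells that share the same underlying XML element. The helper name `unique_cell_texts` is hypothetical, and it relies on `cell._tc`, a private python-docx attribute that is not part of the documented API and may change between versions. It also assumes python-docx returns the same `_tc` element for each position of a merged cell.

```python
def unique_cell_texts(cells, seen=None):
    """Return the text of each distinct cell, skipping merged-cell repeats.

    `cells` is any iterable of objects exposing `.text` and `._tc`,
    for example `row.cells` from python-docx. Using `_tc` is an
    assumption: it is a private attribute, not a documented API.

    Pass the same `seen` list for every row of a table to also skip
    repeats caused by vertically merged cells.
    """
    if seen is None:
        seen = []  # elements already emitted, compared by identity
    texts = []
    for cell in cells:
        tc = cell._tc
        # Compare by identity, not equality: two different cells may
        # legitimately contain the same text.
        if any(tc is known for known in seen):
            continue
        seen.append(tc)
        texts.append(cell.text)
    return texts
```

Inside `read_table_from_docx`, the line `row_data.append(cell.text)` and its loop could then be replaced with `row_data = unique_cell_texts(row.cells)`. Rows may end up with different lengths, so code that expects a rectangular grid would need to account for that.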
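
The saving loop above gives every file a `.jpg` suffix, although a Word file can embed PNG, GIF, and other formats as well. As a hedged sketch, the suffix could instead be chosen from the leading magic bytes of each image blob. The helper name `guess_image_extension` and the `.bin` fallback are hypothetical, and only a few common signatures are checked.

```python
def guess_image_extension(blob: bytes, default: str = ".bin") -> str:
    """Guess a file extension from the first bytes of an image blob.

    Only a handful of common formats are recognised; anything else
    falls back to `default`.
    """
    signatures = [
        (b"\x89PNG\r\n\x1a\n", ".png"),  # PNG file signature
        (b"\xff\xd8\xff", ".jpg"),       # JPEG start-of-image marker
        (b"GIF87a", ".gif"),             # GIF, 1987 version
        (b"GIF89a", ".gif"),             # GIF, 1989 version
        (b"BM", ".bmp"),                 # Windows bitmap
    ]
    for magic, extension in signatures:
        if blob.startswith(magic):
            return extension
    return default
```

In the saving loop, the file name could then be built as `f"expert_{i}{guess_image_extension(image_data)}"`, so each file keeps a suffix that matches its actual content.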