Automating Document Consolidation - CSDN Blog

Article link: https://blog.csdn.net/qq_42052591/article/details/148293780

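The script merges multiple Word documents (.docx) in the order specified in a JSON file (Sort.json) and clears all hyperlinks from the documents. The merged output is saved as "sorted_按章节顺序.docx".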
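The code in this post reads the order from a `file_order` key in Sort.json and looks for the documents in a `Word_Test` folder. As a minimal sketch of what such a file could contain and how it is read, the snippet below writes a sample Sort.json to a temporary folder and loads it back. The chapter file names and the helper name `load_file_order` are made up for illustration.

```python
import json
import os
import tempfile


def load_file_order(json_path, base_dir):
    """Return full paths for the names listed under "file_order", in order."""
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return [os.path.join(base_dir, name) for name in data["file_order"]]


# Hypothetical chapter file names, for illustration only
sample = {"file_order": ["chapter1.docx", "chapter2.docx", "chapter3.docx"]}

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "Sort.json")
    # Write the sample ordering file
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)
    # Read it back and build the full paths under Word_Test
    ordered = load_file_order(json_path, os.path.join(tmp, "Word_Test"))
    names = [os.path.basename(p) for p in ordered]
    print(names)
```

The list order in the JSON is the merge order, so reordering chapters only requires editing Sort.json.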

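The script has the following parts:

Initial setup

Define the hyperlink-removal function (handles paragraphs and tables)
Get the current working directory

Read the sort order

Parse the Sort.json file
Build the list of full file paths

File verification

Check that every file listed in the JSON exists
Print a warning for any missing files

Main flow: read the JSON, verify the files, merge the documents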
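The verification step in the outline above can be sketched on its own, without any Word libraries. This is a variant written with `pathlib`, not the code from the post. The helper name `split_existing` and the placeholder file names are assumptions for illustration.

```python
from pathlib import Path
import tempfile


def split_existing(paths):
    """Split paths into (existing files, missing files), keeping input order."""
    existing, missing = [], []
    for p in map(Path, paths):
        # is_file() is False for directories, unlike os.path.exists()
        (existing if p.is_file() else missing).append(p)
    return existing, missing


with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "chapter1.docx"
    present.write_bytes(b"")  # empty placeholder, not a real Word file
    absent = Path(tmp) / "chapter2.docx"  # never created

    existing, missing = split_existing([present, absent])
    for p in missing:
        print(f"missing: {p.name}")
```

Checking with `is_file()` rather than a plain existence test also catches the case where a folder happens to share a listed name.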

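Environment setup:

Install Python and set the relevant environment variables; my Python version is 3.8.2.
Install win32com, docx, and docxcompose with the commands below (the abridged code in this post imports only docx and docxcompose):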

pip install pypiwin32
pip install python-docx
pip install docxcompose

💡 Tip: after installing, you can verify with python -c "import win32com; print('success')"
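To check all three packages at once, a small sketch using the standard library's `importlib.util.find_spec` is shown here. It only reports whether each top-level module can be located and does not import it. The helper name `check_modules` is an assumption.

```python
import importlib.util


def check_modules(names):
    """Map each top-level module name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names (not pip package names) of the three dependencies
status = check_modules(["win32com", "docx", "docxcompose"])
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'missing'}")
```

Note that the import names differ from the pip package names: `python-docx` is imported as `docx`, and `pypiwin32` provides `win32com`.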

Abridged code:

#!/usr/bin/python3.6
# -*- coding: utf-8 -*-
"""
@Time    : 24-12 10:07
@Software: PyCharm
@Project : Merge files001
"""
import os
import json
from docx import Document
from docxcompose.composer import Composer
 
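# Remove runs whose XML mentions 'hyperlink' from a list of paragraphs
def _remove_hyperlink_runs(paragraphs):
    for para in paragraphs:
        # para.runs builds a new list on each access, so removing while looping is safe.
        # Note: it only yields runs that are direct children of the paragraph;
        # runs nested inside a <w:hyperlink> element are not visited here.
        for run in para.runs:
            # Check the run's XML for the substring 'hyperlink'
            if 'hyperlink' in run._r.xml:
                # This removes the whole run, including its visible text
                run._r.getparent().remove(run._r)

# Hyperlink cleanup for the whole document (body paragraphs and table cells)
def remove_hyperlinks(doc):
    _remove_hyperlink_runs(doc.paragraphs)
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                _remove_hyperlink_runs(cell.paragraphs)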
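# Get the current working directory (cwd)
cwd = os.getcwd()
# Read the JSON file and return the ordered list of full file paths
def get_order_from_json(json_path):
    # The JSON file is expected to list the file names under the 'file_order' key
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    # The documents are looked up in the 'Word_Test' folder under cwd
    return [os.path.join(cwd, 'Word_Test', file_name) for file_name in data['file_order']]
# Call the function to read the sort order
json_path = 'Sort.json'  # path to the JSON file
ordered_files = get_order_from_json(json_path)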

 
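# Merge the documents in the order given by the sorted file list
def combine_all_docx_ordered(filename_master, files_list_ordered):
    # Make sure the file list is not empty
    if not files_list_ordered:
        print("No documents to merge.")
        return
    try:
        master = Document(filename_master)
        remove_hyperlinks(master)
    except Exception as e:
        print(f"Cannot open master document {filename_master}: {e}")
        return

    # If the list holds only one file (the master document itself), save it and stop.
    # This check runs before the page break is added, so the single document is left unchanged.
    if len(files_list_ordered) == 1:
        print("Only one document; nothing to merge.")
        master.save("single_doc.docx")
        return

    # Add a page break before the loop so the first appended document starts on a new page
    master.add_page_break()
    composer = Composer(master)

    for doc_temp_path in files_list_ordered[1:]:
        try:
            doc_temp = Document(doc_temp_path)
            remove_hyperlinks(doc_temp)
        except Exception as e:
            print(f"Cannot open document {doc_temp_path}: {e}")
            continue
        # Page break after each appended document (this also leaves one after the last)
        doc_temp.add_page_break()
        composer.append(doc_temp)

    # Save the merged document
    try:
        composer.save("sorted_按章节顺序.docx")
        print("Merged document saved.")
    except Exception as e:
        print(f"Error while saving the merged document: {e}")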
 
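# Check whether the files listed in the JSON exist
def verify_files_existence(files_paths):
    existing_files = []
    missing_files = []
    for file_path in files_paths:
        if os.path.exists(file_path):
            existing_files.append(file_path)
        else:
            missing_files.append(file_path)
    return existing_files, missing_files
# Verify the files and report any that are missing
existing_files, missing_files = verify_files_existence(ordered_files)
 
if missing_files:
    print("The following files are listed in the JSON but were not found:")
    for missing_file in missing_files:
        print(missing_file)
elif not ordered_files:
    # Guard: an empty 'file_order' list would make ordered_files[0] raise IndexError
    print("No documents to merge.")
else:
    # Merge the documents; the first file in the list serves as the master document
    combine_all_docx_ordered(ordered_files[0], ordered_files)
    print("---- Merge in JSON order finished ----")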
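The `remove_hyperlinks` function above deletes any matching run together with its text. If the goal is instead to keep the link text and drop only the link, one alternative is to unwrap each `w:hyperlink` element so that its runs become direct children of the paragraph. The sketch below shows the idea with the standard library's `xml.etree.ElementTree` on a hand-written WordprocessingML fragment. It is not the method used in this post and has not been wired into python-docx here. The helper name `unwrap_hyperlinks` and the sample paragraph are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespaces used by WordprocessingML document parts
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
HYPERLINK_TAG = f"{{{W_NS}}}hyperlink"


def unwrap_hyperlinks(xml_text):
    """Replace each w:hyperlink element with its own children (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Snapshot the elements first, because children are rearranged below
    for parent in list(root.iter()):
        children = list(parent)
        if not any(child.tag == HYPERLINK_TAG for child in children):
            continue
        flattened = []
        for child in children:
            if child.tag == HYPERLINK_TAG:
                flattened.extend(list(child))  # keep the runs, drop the wrapper
            else:
                flattened.append(child)
        parent[:] = flattened
    return ET.tostring(root, encoding="unicode")


# Hand-written sample paragraph: a plain run followed by a hyperlinked run
sample = (
    f'<w:p xmlns:w="{W_NS}" xmlns:r="{R_NS}">'
    '<w:r><w:t xml:space="preserve">See </w:t></w:r>'
    '<w:hyperlink r:id="rId5"><w:r><w:t>the site</w:t></w:r></w:hyperlink>'
    '</w:p>'
)
cleaned = unwrap_hyperlinks(sample)
text = "".join(ET.fromstring(cleaned).itertext())
print(text)
```

The same unwrapping idea should carry over to the lxml elements that python-docx exposes (for example `paragraph._p`), but that adaptation is left as an untested assumption. Nested hyperlinks are not handled by this single pass.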