How to Batch-Convert Files (Word, PDF, Excel) to TXT with Python (Linux Version)

小鱼学Ai

Published 2025-03-13 16:46:35

Article link: https://blog.csdn.net/weixin_46623613/article/details/146235547

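1. Introduction

In day-to-day work we often need to convert documents in different formats (such as .docx, .pdf, and .xlsx) to .txt so the text is easier to process and store. A Python script can automate this and remove the repetitive manual work.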

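2. Requirements

Before starting, install the Python libraries below. Converting .doc and .wps files also requires LibreOffice, which the script in section 4.3 runs inside a Docker container named libreoffice.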

pip install python-docx PyMuPDF pandas openpyxl

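3. Script Overview

The script converts the following file formats to .txt:

.docx -> TXT
.doc -> TXT (converted to .docx with LibreOffice first, then processed)
.wps -> TXT (same as .doc)
.pdf -> TXT
.xlsx -> TXT

The script also converts several files in one batch and adds a timestamp to each output file name, so that results from earlier runs are not overwritten.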

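4. Key Code Walkthrough

4.1 Getting the Current Timestamp

from datetime import datetime

# Return the current local time as a compact timestamp string
def get_timestamp():
    return datetime.now().strftime("%Y%m%d_%H%M%S")

This function returns the current time as a string such as 20250313_144530, which is appended to each output file name. The resolution is one second, so it separates different runs but does not guarantee uniqueness on its own: two inputs with the same base name converted within the same second get the same output name.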

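4.2 Converting a Word File to TXT

from docx import Document

# Convert a Word file (.docx) to TXT
def convert_docx_to_txt(docx_file, txt_file):
    try:
        doc = Document(docx_file)
        with open(txt_file, "w", encoding="utf-8") as f:
            # doc.paragraphs holds body paragraphs only, not text inside tables
            for para in doc.paragraphs:
                f.write(para.text + "\n")
        print(f"Converted Word file {docx_file} to TXT.")
    except Exception as e:
        print(f"Error converting Word file {docx_file} to TXT: {e}")

This function reads the .docx file with the python-docx library and writes each body paragraph to the .txt file on its own line. Text inside tables, headers, and footers is not part of doc.paragraphs, so it is skipped.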
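
Since the function above leaves out table text, the following is a minimal sketch of one way to include it. The helper name document_to_text is hypothetical and not part of the original script. It assumes doc is a python-docx Document, and it uses only the paragraphs, tables, rows, cells, and text attributes. Table rows are appended after all paragraphs, so the original interleaving of text and tables is not preserved.

```python
def document_to_text(doc):
    """Return body paragraphs followed by table rows as plain text.

    `doc` is assumed to be a python-docx Document, or any object that
    exposes the same `paragraphs` and `tables` attributes.
    """
    lines = [para.text for para in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            # Join the cells of one row with tabs so columns stay distinguishable
            lines.append("\t".join(cell.text for cell in row.cells))
    return "\n".join(lines) + "\n"
```

It could be called as document_to_text(Document(docx_file)) and the result written to the .txt file in place of the paragraph loop.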

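4.3 Converting WPS / DOC Files to DOCX

import os
import subprocess

# Convert WPS / DOC to DOCX with LibreOffice running in a Docker container
def convert_wps_or_doc_to_docx(input_file):
    try:
        output_file = input_file.rsplit(".", 1)[0] + ".docx"
        # No -it here: a script has no TTY to attach to.
        # An argument list (no shell=True) avoids shell quoting problems in file names.
        cmd = ["docker", "exec", "libreoffice", "libreoffice", "--headless",
               "--convert-to", "docx",
               "--outdir", os.path.dirname(input_file) or ".", input_file]
        subprocess.run(cmd, check=True)
        return output_file
    except Exception as e:
        print(f"Failed to convert {input_file} to DOCX: {e}")
        return None

This function runs LibreOffice inside a Docker container named libreoffice to convert a .doc or .wps file to .docx, and returns the path of the new file. Because docker exec resolves paths in the container's file system, the input file must be visible inside the container at the same path, for example through a bind mount.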
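
If LibreOffice is installed directly on the host rather than in a container, the same conversion can be sketched without Docker. This is an illustration under assumptions, not the author's setup: it assumes the LibreOffice launcher is on PATH under the name soffice, and the helper names build_convert_command and convert_to_docx_locally are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(input_file, binary="soffice"):
    """Build the LibreOffice command line that converts input_file to .docx."""
    src = Path(input_file)
    return [binary, "--headless", "--convert-to", "docx",
            "--outdir", str(src.parent), str(src)]


def convert_to_docx_locally(input_file, binary="soffice"):
    """Run the conversion and return the .docx path, or None on failure."""
    if shutil.which(binary) is None:
        print(f"{binary} not found on PATH")
        return None
    result = subprocess.run(build_convert_command(input_file, binary))
    output_file = Path(input_file).with_suffix(".docx")
    # Check for the output file as well as the exit status, since a zero
    # exit status alone does not prove that a file was written
    if result.returncode != 0 or not output_file.exists():
        return None
    return str(output_file)
```

Separating the command construction from the execution makes the command easy to inspect or log before anything is run.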

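4.4 Converting a PDF to TXT

import fitz  # PyMuPDF

# Convert a PDF file (.pdf) to TXT
def convert_pdf_to_txt(pdf_file, txt_file):
    try:
        # The context manager closes the PDF once the text has been read
        with fitz.open(pdf_file) as doc:
            text = "".join(page.get_text("text") + "\n" for page in doc)
        with open(txt_file, "w", encoding="utf-8") as f:
            f.write(text)
        print(f"Converted PDF file {pdf_file} to TXT.")
    except Exception as e:
        print(f"Error converting PDF file {pdf_file} to TXT: {e}")

This function reads the PDF page by page with the PyMuPDF (fitz) library and saves the extracted text to the .txt file. It only extracts an existing text layer; scanned PDFs that contain nothing but page images produce empty output, because no OCR is performed.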

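4.5 Converting an Excel File to TXT

import pandas as pd

# Convert an Excel file (.xlsx) to TXT
def convert_excel_to_txt(excel_file, txt_file):
    try:
        # read_excel loads only the first worksheet by default
        df = pd.read_excel(excel_file, engine="openpyxl")
        text = df.to_string(index=False, header=True)
        with open(txt_file, "w", encoding="utf-8") as f:
            f.write(text)
        print(f"Converted Excel file {excel_file} to TXT.")
    except Exception as e:
        print(f"Error converting Excel file {excel_file} to TXT: {e}")

This function reads the .xlsx file with pandas and writes it to the .txt file as an aligned text table with a header row and no index column. Only the first worksheet is converted.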
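
For workbooks with more than one worksheet, a possible extension is sketched below. The helper names load_all_sheets and sheets_to_text and the "=== name ===" separator are assumptions made for illustration. The sketch relies on pandas.read_excel returning a dict of DataFrames keyed by sheet name when sheet_name=None is passed.

```python
def load_all_sheets(excel_file):
    """Read every worksheet into a dict that maps sheet name to DataFrame."""
    # Imported here so the formatting helper below has no hard pandas dependency
    import pandas as pd
    return pd.read_excel(excel_file, sheet_name=None, engine="openpyxl")


def sheets_to_text(sheets):
    """Render each sheet as a text table under a line naming the sheet."""
    parts = []
    for name, df in sheets.items():
        parts.append(f"=== {name} ===")
        parts.append(df.to_string(index=False))
    return "\n".join(parts) + "\n"
```

Writing sheets_to_text(load_all_sheets(excel_file)) to the output file would then cover the whole workbook instead of only the first sheet.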

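4.6 Dispatching by File Type

import os

# Main function: pick the converter that matches the file type
def convert_to_txt(input_file, output_file):
    file_extension = os.path.splitext(input_file)[1].lower()
    if file_extension == ".docx":
        convert_docx_to_txt(input_file, output_file)
    elif file_extension in (".doc", ".wps"):
        # Convert to .docx with LibreOffice first, then reuse the .docx converter
        docx_file = convert_wps_or_doc_to_docx(input_file)
        if docx_file:
            convert_docx_to_txt(docx_file, output_file)
    elif file_extension == ".pdf":
        convert_pdf_to_txt(input_file, output_file)
    elif file_extension == ".xlsx":
        convert_excel_to_txt(input_file, output_file)
    else:
        print(f"Unsupported file format: {file_extension}")

This function calls the matching converter based on the file extension. The .doc and .wps formats go through convert_wps_or_doc_to_docx from section 4.3 and then through the .docx converter; any other extension is reported as unsupported.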
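
The same dispatch can also be written as a lookup table, which keeps the list of supported formats in one place. The sketch below is an alternative formulation, not the author's code; make_dispatcher is a hypothetical name, and each handler is assumed to take (input_file, output_file) like the converters above.

```python
import os


def make_dispatcher(handlers):
    """Return a convert function that picks a handler by lower-cased extension."""
    def convert(input_file, output_file):
        ext = os.path.splitext(input_file)[1].lower()
        handler = handlers.get(ext)
        if handler is None:
            print(f"Unsupported file format: {ext}")
            return False
        handler(input_file, output_file)
        return True
    return convert
```

With the converters from this article it could be set up as make_dispatcher({".docx": convert_docx_to_txt, ".pdf": convert_pdf_to_txt, ".xlsx": convert_excel_to_txt}). Supporting a new format then only requires a new dictionary entry.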

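4.7 Batch Conversion

import os

# Convert a list of files and write the results into output_dir
def batch_convert(input_files, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    for input_file in input_files:
        file_name = os.path.splitext(os.path.basename(input_file))[0]
        timestamp = get_timestamp()
        output_file = os.path.join(output_dir, f"{file_name}_{timestamp}.txt")
        print(f"Converting: {input_file} -> {output_file}")
        convert_to_txt(input_file, output_file)

This function takes a list of input paths, creates the output directory if needed, and converts each file to a .txt file named after the input's base name plus a timestamp. Inputs that share a base name (for example report.docx and report.pdf) and are processed within the same second map to the same output file, so the later one overwrites the earlier one.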
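
One way to make the output name robust against such collisions is sketched below. The helper unique_output_path and its naming scheme (source extension in the name, numeric suffix on conflict) are assumptions for illustration, not part of the original script.

```python
import os


def unique_output_path(output_dir, input_file, timestamp):
    """Return a .txt path in output_dir that does not exist yet."""
    stem, ext = os.path.splitext(os.path.basename(input_file))
    # Keep the source extension in the name so a.docx and a.pdf do not collide
    name = f"{stem}_{ext.lstrip('.').lower()}_{timestamp}"
    candidate = os.path.join(output_dir, name + ".txt")
    counter = 1
    # Append a counter until an unused name is found
    while os.path.exists(candidate):
        candidate = os.path.join(output_dir, f"{name}_{counter}.txt")
        counter += 1
    return candidate
```

The existence check and the later file creation are separate steps, so this is adequate for a single-process script but not for several converters writing into the same directory at once.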

5. Example

Suppose you have a folder files_to_convert/ that contains .docx, .pdf, and .xlsx files. You can batch-convert them with the following code:

input_files = ["files_to_convert/file1.docx", "files_to_convert/file2.pdf", "files_to_convert/file3.xlsx"]
output_dir = "converted_files"
batch_convert(input_files, output_dir)

This code converts the listed files to .txt files and saves them in the converted_files directory.
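
Rather than listing each path by hand, the input list can be built from the folder contents. The following is a small sketch; collect_input_files and SUPPORTED_EXTENSIONS are hypothetical names, and the extension set mirrors the formats listed in section 3. Only files directly inside the folder are considered, not subfolders.

```python
from pathlib import Path

# Extensions handled by the converters described in this article
SUPPORTED_EXTENSIONS = {".docx", ".doc", ".wps", ".pdf", ".xlsx"}


def collect_input_files(folder):
    """Return the sorted paths of supported files directly inside folder."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

The result can be passed straight to the batch function, for example batch_convert(collect_input_files("files_to_convert"), "converted_files").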