Converting Word Documents to Markdown with Python

Latest recommended article published 2025-04-07 17:32:37

pacong

Article link: https://blog.csdn.net/pacong/article/details/134316578

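This article describes how to use the Python libraries mammoth and markdownify to convert a Word document to Markdown, including how to handle images in the Word document and customize the conversion logic. Using these two modules, the author walks through the complete process, from reading the Word file to saving the HTML and Markdown files.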

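Many desktop applications (Typora, for example) can import Word files. This feature is usually implemented by delegating to Pandoc.

Pandoc is the Swiss Army knife of document conversion and handles most document formats well. If we write our own program around it, however, Pandoc has to be installed separately, and the conversion is not easy to customize.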

Conversion logic

Converting a Word document to Markdown takes two steps:

Step 1: convert the Word document to HTML;
Step 2: convert the HTML to Markdown.
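
The two steps can be sketched as a small pipeline in which each step is a pluggable function. This is only an illustration of the flow, not the article's final script: the names `docx_to_markdown`, `mammoth_to_html`, and `markdownify_to_markdown` are hypothetical, and the two default steps assume that the mammoth and markdownify modules described in the next section are installed.

```python
def docx_to_markdown(docx_file, to_html, to_markdown):
    """Run the two-step conversion: Word -> HTML -> Markdown."""
    html = to_html(docx_file)   # step 1: Word document to HTML
    return to_markdown(html)    # step 2: HTML to Markdown


def mammoth_to_html(docx_file):
    # Imported lazily so the pipeline itself has no hard dependency on mammoth.
    import mammoth
    return mammoth.convert_to_html(docx_file).value


def markdownify_to_markdown(html):
    # Imported lazily for the same reason as above.
    import markdownify
    return markdownify.markdownify(html, heading_style="ATX")


# Possible usage, assuming a file named example.docx exists:
# with open("example.docx", "rb") as f:
#     print(docx_to_markdown(f, mammoth_to_html, markdownify_to_markdown))
```

Keeping the steps separate makes it easy to save or inspect the intermediate HTML, which is what the script later in this article does.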

Dependencies

This requires two third-party Python modules:

mammoth
markdownify

mammoth converts Word documents to HTML. It is available for Python, JavaScript, Java, and .NET.

markdownify converts HTML to Markdown.

Installation

pip install mammoth -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install markdownify -i https://pypi.tuna.tsinghua.edu.cn/simple
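
Before running the conversion, it can help to confirm that both packages are importable in the current environment. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is made up for this illustration.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["mammoth", "markdownify"])
    if missing:
        print("Missing modules, install them with pip:", ", ".join(missing))
    else:
        print("mammoth and markdownify are both available.")
```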

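File directory
[Figure: screenshot of the file directory]

Running
Note: the input file must be in .docx format; mammoth cannot read the legacy .doc format and will raise an error.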

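import os
import uuid

import mammoth
import markdownify


# Save each image embedded in the Word document to a local file
def convert_imgs(image):
    # Create the output directory if needed; open() does not create it
    os.makedirs("./img", exist_ok=True)
    # Derive the extension from the content type, e.g. "image/png" -> "png"
    file_suffix = image.content_type.split("/")[1]
    # A random name avoids collisions between images saved in quick succession
    path_file = "./img/{}.{}".format(uuid.uuid4().hex, file_suffix)
    with image.open() as image_bytes, open(path_file, "wb") as f:
        f.write(image_bytes.read())
    # The returned dict becomes the attributes of the generated <img> element
    return {"src": path_file}


# Read the Word file
with open(r"D:\code\Test\TestBook\测试用例相关文档\WEB测试用例\12345.docx", "rb") as docx_file:
    # Convert the Word document to HTML, saving images with convert_imgs
    result = mammoth.convert_to_html(
        docx_file, convert_image=mammoth.images.img_element(convert_imgs)
    )

# The generated HTML
html = result.value
# Convert the HTML to Markdown
md = markdownify.markdownify(html, heading_style="ATX")
print(md)

with open("./docx_to_html.html", "w", encoding="utf-8") as html_file, \
        open("./docx_to_md.md", "w", encoding="utf-8") as md_file:
    html_file.write(html)
    md_file.write(md)

# Warnings or errors reported by mammoth during conversion
for message in result.messages:
    print(message.type, message.message)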
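
The script above assumes its input really is a .docx file. A .docx file is a ZIP archive that contains `word/document.xml`, so a rough check can be made up front with the standard library. This is a heuristic sketch, not something mammoth provides; the function name `looks_like_docx` is hypothetical.

```python
import io
import zipfile


def looks_like_docx(path_or_file):
    """Heuristic: a .docx file is a ZIP archive containing word/document.xml."""
    if not zipfile.is_zipfile(path_or_file):
        return False
    with zipfile.ZipFile(path_or_file) as archive:
        return "word/document.xml" in archive.namelist()


if __name__ == "__main__":
    # Build an in-memory ZIP that mimics the docx layout, for demonstration only.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
    print(looks_like_docx(buffer))                           # True
    print(looks_like_docx(io.BytesIO(b"legacy .doc bytes")))  # False
```

Calling such a check before `mammoth.convert_to_html` lets the program report a clear message instead of failing inside the conversion.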

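File directory after running
[Figure: screenshot of the file directory after running]

Analysis

Handling images in Word

Word documents usually contain images, so to display them correctly in the converted document we need to customize how mammoth handles them. By default, mammoth embeds each image as a base64-encoded string. This avoids generating extra local image files but makes the document much larger, so we save the images as local files instead.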
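
To give a feel for the size cost of inline images, here is a rough sketch of what a base64 `src` value looks like and how much larger it is than the raw bytes. It does not call mammoth; the helper `data_uri` is hypothetical, and the random bytes are a stand-in for real image data.

```python
import base64
import os


def data_uri(image_bytes, content_type):
    """Build an inline src value that embeds the image bytes as base64."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:{};base64,{}".format(content_type, encoded)


if __name__ == "__main__":
    raw = os.urandom(30000)  # stand-in for a 30 KB image, not real image data
    uri = data_uri(raw, "image/png")
    # Base64 turns every 3 bytes into 4 characters, so the text grows by about a third.
    print(len(raw), len(uri))
```

Every embedded image carries this overhead inside the HTML and Markdown text itself, which is why saving images to separate files keeps the converted documents small.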

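import os
import uuid

# Save each image embedded in the Word document to a local file
def convert_imgs(image):
    # Create the output directory if needed; open() does not create it
    os.makedirs("./img", exist_ok=True)
    # Derive the extension from the content type, e.g. "image/png" -> "png"
    file_suffix = image.content_type.split("/")[1]
    # A random name avoids collisions between images saved in quick succession
    path_file = "./img/{}.{}".format(uuid.uuid4().hex, file_suffix)
    with image.open() as image_bytes, open(path_file, "wb") as f:
        f.write(image_bytes.read())
    # The returned dict becomes the attributes of the generated <img> element
    return {"src": path_file}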
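
Taking the file extension straight from the content type works for common cases such as `image/png`, but some subtypes are not usable extensions. As one possible variation, the sketch below builds sequential, readable file names and maps a few content types to conventional extensions. The mapping `EXTENSION_OVERRIDES` and the helper `image_path` are assumptions made for this example, not part of mammoth.

```python
# Hypothetical overrides for content types whose subtype is not a usable extension.
EXTENSION_OVERRIDES = {
    "image/jpeg": "jpg",
    "image/svg+xml": "svg",
    "image/x-emf": "emf",
    "image/x-wmf": "wmf",
}


def image_path(index, content_type, directory="./img"):
    """Return a path such as ./img/image_001.png for the index-th image."""
    # Fall back to the subtype itself, e.g. "image/png" -> "png".
    suffix = EXTENSION_OVERRIDES.get(content_type, content_type.split("/")[-1])
    # Forward slashes keep the path usable as an HTML src on any platform.
    return "{}/image_{:03d}.{}".format(directory, index, suffix)


if __name__ == "__main__":
    print(image_path(1, "image/jpeg"))
    print(image_path(2, "image/png"))
```

Inside an image handler, the index could come from a counter such as `itertools.count(1)`, so that images are numbered in the order they appear in the document.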

Conversion

The code is as follows:

# Read the Word file
with open(r"D:\code\Test\TestBook\测试用例相关文档\WEB测试用例\12345.docx", "rb") as docx_file:
    # Convert the Word document to HTML, saving images with convert_imgs
    result = mammoth.convert_to_html(
        docx_file, convert_image=mammoth.images.img_element(convert_imgs)
    )

# The generated HTML
html = result.value
# Convert the HTML to Markdown
md = markdownify.markdownify(html, heading_style="ATX")
print(md)

with open("./docx_to_html.html", "w", encoding="utf-8") as html_file, \
        open("./docx_to_md.md", "w", encoding="utf-8") as md_file:
    html_file.write(html)
    md_file.write(md)

# Warnings or errors reported by mammoth during conversion
for message in result.messages:
    print(message.type, message.message)
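
The `result.messages` list collected at the end of the conversion holds any warnings or errors that mammoth produced, each with a `type` and a `message` attribute. A small helper can turn them into readable lines. This is a sketch: `format_messages` is a hypothetical name, and the `Message` namedtuple below only stands in for mammoth's own message objects so the example runs without a Word file.

```python
from collections import namedtuple


def format_messages(messages):
    """Format conversion messages as '[type] text' lines."""
    return ["[{}] {}".format(m.type, m.message) for m in messages]


if __name__ == "__main__":
    # Stand-in for the objects found in result.messages; the text is invented.
    Message = namedtuple("Message", ["type", "message"])
    sample = [Message("warning", "example warning text")]
    for line in format_messages(sample):
        print(line)
```

In the script above, this would be called as `format_messages(result.messages)` after the conversion finishes.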