[318] pandoc usage: converting Markdown to Word, PDF, and HTML

周小董

Last modified: 2024-04-12 21:59:34


First published: 2018-07-12 09:05:51

Article link: https://blog.csdn.net/xc_zhou/article/details/81009893


Online converter: https://pandoc.org/try/
User manual: https://www.pandoc.org/MANUAL.html
Conversion examples: https://pandoc.org/demos.html
Pandoc wiki: https://github.com/jgm/pandoc/wiki
GitHub: https://github.com/jgm/pandoc
PyPI: https://pypi.org/project/pypandoc/

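Installing and using pandoc

What is pandoc

pandoc is a tool for converting between a very wide range of document formats, a Swiss-army knife for document conversion. The diagram on its website (http://www.pandoc.org) gives a sense of the scope.

Pandoc supports a large number of formats. It is also free and open source: the code is hosted on GitHub and written in Haskell. Specifically, Pandoc converts between the following formats (← means the format can be read as input; → means the format can be written as output; ↔︎ means both directions are supported):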

Lightweight markup formats
- ↔︎ Markdown (including CommonMark and GitHub-flavored Markdown)
- ↔︎ reStructuredText
- → AsciiDoc
- ↔︎ Emacs Org-Mode
- ↔︎ Emacs Muse
- ↔︎ Textile
- ← txt2tags
HTML formats
- ↔︎ (X)HTML 4
- ↔︎ HTML5
Ebooks
- ↔︎ EPUB version 2 or 3
- ↔︎ FictionBook2
Documentation formats
- → GNU TexInfo
- ↔︎ Haddock markup
Roff formats
- ↔︎ roff man
- → roff ms
TeX formats
- ↔︎ LaTeX
- → ConTeXt
XML formats
- ↔︎ DocBook version 4 or 5
- ↔︎ JATS
- → TEI Simple
Outline formats
- ↔︎ OPML
Data formats
- ← CSV tables
Word processor formats
- ↔︎ Microsoft Word docx
- ↔︎ OpenOffice/LibreOffice ODT
- → OpenDocument XML
- → Microsoft PowerPoint
Interactive notebook formats
- ↔︎ Jupyter notebook (ipynb)
Page layout formats
- → InDesign ICML
Wiki markup formats
- ↔︎ MediaWiki markup
- ↔︎ DokuWiki markup
- ← TikiWiki markup
- ← TWiki markup
- ← Vimwiki markup
- → XWiki markup
- → ZimWiki markup
- ↔︎ Jira wiki markup
Slide show formats
- → LaTeX Beamer
- → Slidy
- → reveal.js
- → Slideous
- → S5
- → DZSlides
Custom formats
- → custom writers can be written in Lua
PDF
- → via pdflatex, lualatex, xelatex, latexmk, tectonic, wkhtmltopdf, weasyprint, prince, context, or pdfroff

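Pandoc integrations

Besides the command line, many development tools and applications integrate Pandoc for file conversion, for example the Markdown editors PanWriter and Typora, the text editors Atom, Sublime Text, Emacs, and Vim, as well as R Markdown, PanConvert, Manubot, and others.

A longer list of third-party software that integrates Pandoc is kept on the Pandoc wiki linked above.

Installing pandoc on Windows

The steps below use the Windows build of pandoc as an example.

Download page: https://github.com/jgm/pandoc/releases/
Download the msi installer that matches your system architecture. I used pandoc-3.1.12.3-windows-x86_64.msi.

Run the installer like any other Windows installer. When it finishes, open a command prompt and run pandoc -v. If output similar to the following appears, the installation succeeded. If the command is not found, add the installation directory C:\Program Files\Pandoc\ to the PATH environment variable: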

PS C:\Users\admin> pandoc -v
pandoc.exe 3.1.12.3
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: C:\Users\admin\AppData\Roaming\pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

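Installing pandoc on CentOS

First download the release archive from the official site.

Upload it to the server (here it is placed in /root) and extract it: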

tar -zxvf pandoc-3.1.12.3-linux-amd64.tar.gz

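Then create a symlink so that pandoc is on the PATH (the directory name must match the version that was extracted) and check that it runs:

ln -s /root/pandoc-3.1.12.3/bin/pandoc /usr/bin/pandoc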
cd /
pandoc --version

Basic usage

The commands below are run in CMD; options are case-sensitive.

Show pandoc's command-line options.

pandoc -h
pandoc --help

Print pandoc's default LaTeX template.

pandoc -D latex

Save pandoc's default LaTeX template to the root of drive C: so it can be edited.

pandoc -D latex > C:\1.latex

List the supported input and output formats:

pandoc --list-input-formats
pandoc --list-output-formats

List the programming languages supported for syntax highlighting:

pandoc --list-highlight-languages

Converting Markdown to HTML

pandoc -s input.md -o output.html

Here -s produces a standalone file with a proper header and footer, input.md is the Markdown file to convert, and -o output.html writes the result to output.html.

Converting Markdown to PDF

pandoc -s input.md -o output.pdf

This is the same as the HTML command except for the output file type, which is PDF here.

Converting several files

pandoc -s file1.md file2.md -o output.html

Several Markdown files can be converted into a single output file in one run.

Adding a CSS stylesheet

pandoc -s input.md -o output.html --css=mycss.css

The --css option links a custom CSS stylesheet.

pandoc -s input.md -o output.html --toc

The --toc option generates a table of contents in the HTML output.

Converting to other formats

pandoc -s input.md -o output.docx

Besides HTML and PDF, Markdown can also be converted to Word and other formats.
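For scripting, the basic conversions above can be assembled as an argument list in Python and handed to pandoc with subprocess. The sketch below is only an illustration: pandoc_args and run_pandoc are hypothetical helper names, not part of pandoc, and only the flags already shown above (-s, -o, --css, --toc) are used.

```python
import subprocess


def pandoc_args(inputs, output, standalone=True, css=None, toc=False):
    """Build the argument list for a basic pandoc conversion (hypothetical helper)."""
    args = ["pandoc"]
    if standalone:
        args.append("-s")          # standalone document with header and footer
    args.extend(inputs)            # one or more input files
    args += ["-o", output]         # the output format is inferred from the extension
    if css:
        args.append(f"--css={css}")
    if toc:
        args.append("--toc")
    return args


def run_pandoc(args):
    """Run pandoc with the given arguments; raises CalledProcessError on failure."""
    subprocess.run(args, check=True)


# Building the list does not require pandoc to be installed; running it does.
print(pandoc_args(["file1.md", "file2.md"], "output.html", toc=True))
```

Passing a list instead of a single command string means file names are handed to pandoc as-is, without shell quoting.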

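Options for Word export

--toc # generate a table of contents
--toc-depth=NUMBER # number of heading levels included in the table of contents
--wrap=auto|none|preserve # how text is wrapped in the output
--reference-doc=FILE # use FILE as the style reference document

The default Word reference document can be written out with:

pandoc -o custom-reference.docx --print-default-data-file reference.docx

This creates custom-reference.docx in the current directory (using -o rather than > keeps the binary file intact in shells such as PowerShell, where redirection can re-encode the output). You can then modify the text styles and the default table style in it.
Mac users should note that editing the reference document with Pages breaks it; edit it with Office and save it in compatibility mode.

Text styles can be changed from the Styles group on the Home tab.

To change the table style, select a table in the document and choose Table Tools > Design > Modify Table Style.

After editing, pass --reference-doc=custom-reference.docx to use it.
Alternatively, place the file in Pandoc's user data directory under the name reference.docx, and Pandoc will use it as the default from then on.
The user data directory is shown on the "User data directory" line of the pandoc -v output.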
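To locate that directory from a script, the text printed by pandoc -v can be scanned for the data directory line. This is a minimal sketch: user_data_dir is a hypothetical helper, and it assumes the line has the form shown in the sample output earlier in this article ("User data directory: ...").

```python
def user_data_dir(version_output):
    """Return the data directory reported in `pandoc -v` output, or None if absent."""
    for line in version_output.splitlines():
        # Matches "User data directory: ..." as printed by the version shown earlier
        _before, found, rest = line.partition("data directory:")
        if found:
            return rest.strip()
    return None


sample = (
    "pandoc.exe 3.1.12.3\n"
    "Features: +server +lua\n"
    "Scripting engine: Lua 5.4\n"
    "User data directory: C:\\Users\\admin\\AppData\\Roaming\\pandoc\n"
)
print(user_data_dir(sample))
```

In real use the text would come from something like subprocess.run(["pandoc", "-v"], capture_output=True, text=True).stdout.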

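Options for HTML export

-s or --standalone # include head, body, and the other parts of a complete HTML file
--self-contained # embed resources so the output has no external dependencies (deprecated in recent pandoc versions in favor of --embed-resources --standalone)
--data-dir=DIRECTORY # directory to search for pandoc data files
# template options
--template=FILE|URL # render with a custom template
-V KEY[=VAL], --variable=KEY[:VAL] # set a template variable to the given value

The template can be customized:

pandoc --print-default-template=html > template.html # write the default template to template.html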

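Options for EPUB export

--epub-cover-image=FILE # set the cover image
--epub-metadata=FILE # XML file with additional metadata
--epub-embed-font=FILE # embed a font

Setting metadata

Metadata is information such as author, date, and abstract that is carried into the output; it matters for EPUB, PDF, Word, and similar formats.
It can be set on the command line:

-M KEY[=VAL], --metadata=KEY[:VAL] # set metadata field KEY to VAL
--metadata-file=FILE # read metadata from FILE

It can also be declared at the top of the document using YAML syntax:

---
# this block must be at the top of the document
# the three dashes at the top and bottom must be kept
title: Title
author:
- Author 1
- Author 2
date: Date
keywords: [Keyword 1, Keyword 2]
abstract: |
  First paragraph

  Second paragraph
---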
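A metadata block of this shape can also be generated from a script before the Markdown body is written. The sketch below is an illustration only: front_matter is a hypothetical helper, and it writes every value as a JSON-style double-quoted string, which is also valid YAML and avoids problems with values that contain a colon.

```python
import json


def front_matter(title, authors, date):
    """Return a YAML metadata block with title, author list, and date."""
    def quoted(value):
        # JSON double-quoted strings are valid YAML scalars
        return json.dumps(value, ensure_ascii=False)

    lines = ["---", f"title: {quoted(title)}", "author:"]
    lines += [f"- {quoted(name)}" for name in authors]
    lines += [f"date: {quoted(date)}", "---", ""]
    return "\n".join(lines)


# Prepend the block to the Markdown body before passing the file to pandoc.
document = front_matter("Title", ["Author 1", "Author 2"], "2024-04-12") + "\n# Heading\n"
print(document)
```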

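Markdown to PDF: usage and common problems

When exporting a Markdown file to PDF, pandoc uses its built-in LaTeX template by default (run pandoc -D latex in cmd to see it). With the default template there is no need to give a template path or name. The output file is written to the path given with -o; a bare file name lands in the current directory, normally the same one as the input.

pandoc input.ext -o output.ext

The command below uses the default LaTeX template, so no template path or name is needed.
To use a custom template, add --template followed by the template path and name. The .latex extension can be omitted and is added automatically. A template may also use the .tex extension.

pandoc en.md -o en.pdf

Export to a given folder (relative path): this writes the file to the photo folder next to en.md. No quotes are needed around the path.

pandoc en.md -o .\photo\en.pdf

Export to a given folder (absolute path): this writes the file to F:\OBS. No quotes are needed around the path.

pandoc en.md -o F:\OBS\en.pdf

The options below are shown for Markdown to PDF with pandoc's default LaTeX template.

Select the PDF engine:

--pdf-engine=xelatex

Set the font:

-V CJKmainfont="SimSun"

Set the page margins:

-V geometry:"left=1.5cm,right=1.5cm,top=2cm,bottom=2cm"

Number the headings automatically:

-N

Generate a table of contents:

--toc

When the Markdown contains Chinese, note that pandoc's default LaTeX engine is pdflatex, which does not handle Chinese. Set the engine to xelatex and specify a Chinese font.

pandoc test.md -o test.pdf --pdf-engine=xelatex

The --pdf-engine option can also come first:

pandoc --pdf-engine=xelatex test.md -o test.pdf

With the commands above the build may finish without errors while every Chinese character is missing from the PDF, or it may fail outright. This is a font problem: the default font has no Chinese glyphs. The fix is to set a Chinese font explicitly, for example SimSun (宋体) through the CJKmainfont variable:

pandoc test.md -o test.pdf --pdf-engine=xelatex -V CJKmainfont="SimSun"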
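The PDF options above can be combined in one call from Python. This is a sketch under stated assumptions: pdf_args is a hypothetical helper, the defaults (xelatex, SimSun) simply mirror the commands in this section, and a working LaTeX installation is still required to actually run the result.

```python
def pdf_args(src, dst, engine="xelatex", cjk_font="SimSun",
             margins=None, number_sections=False, toc=False):
    """Build a pandoc argument list for Markdown to PDF with a CJK font."""
    args = ["pandoc", src, "-o", dst, f"--pdf-engine={engine}",
            "-V", f"CJKmainfont={cjk_font}"]
    if margins:
        # e.g. "left=1.5cm,right=1.5cm,top=2cm,bottom=2cm"
        args += ["-V", f"geometry:{margins}"]
    if number_sections:
        args.append("-N")
    if toc:
        args.append("--toc")
    return args


print(pdf_args("test.md", "test.pdf", margins="left=1.5cm,right=1.5cm,top=2cm,bottom=2cm"))
# To run it: subprocess.run(pdf_args("test.md", "test.pdf"), check=True)
```

Because the arguments are passed as a list, font names containing spaces such as Microsoft YaHei need no extra quoting.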

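Installing LaTeX

If no PDF engine is installed, the conversion fails with an error such as:

RuntimeError: Pandoc died with exitcode “47” during conversion: pdflatex not found. Please select a different --pdf-engine or install pdflatex

so a LaTeX distribution that provides pdflatex has to be installed separately.

Installing on Windows

The TeX Live website is https://www.tug.org/texlive/. Downloads from it can be slow from within China, so the USTC mirror is used here: https://mirrors.ustc.edu.cn/CTAN/systems/texlive/Images/

After downloading, run install-tl-windows.bat as administrator and follow the prompts. Then add D:\soft\texlive\2024\bin\windows to the PATH environment variable (adjust the path to your installation directory). The installation can be checked by running: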

tex -v
latex -v
xelatex -v
pdflatex -v

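These commands print the version of each TeX program. If all of them print version information, TeX Live is installed correctly.

In the Windows Start menu, find the TeXworks editor, type in a short test document, and click the green arrow to compile it.

On Linux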

sudo yum update
sudo yum install texlive
sudo yum install texlive*
# sudo yum install texlive-xetex

Word documents can also be converted to PDF with LibreOffice. Download: https://zh-cn.libreoffice.org/download/libreoffice/

(After installing, add it to the PATH environment variable.)

import platform
import subprocess


def word2pdf(file_path, output_path):
    # Convert a Word document to PDF with LibreOffice in headless mode
    pf = platform.system()
    if "Windows" in pf:
        command = r"C:\Program Files\LibreOffice\program\soffice.exe"
    elif "Linux" in pf:
        version = "6.4"
        command = f"libreoffice{version}"
    else:
        raise Exception("Unsupported platform")

    # Passing a list avoids shell quoting problems with spaces in paths
    subprocess.run(
        [command, "--headless", "--convert-to", "pdf", file_path, "--outdir", output_path],
        check=True,
    )


if __name__ == "__main__":
    word2pdf(r"C:/Users/fang/Documents/ssh登录日志_2022-05-29.docx", r"C:/Users/fang/Documents/")

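Common problems

Note: file names must not contain spaces (or the path has to be quoted).

Problems while generating the PDF

Command: pandoc --latex-engine=xelatex test.md -o test.pdf
Error: latex-engine has been removed. Use --pdf-engine instead.
Fix: pandoc --pdf-engine=xelatex test.md -o test.pdf
The generated PDF contains only the English text and no Chinese because no Chinese font was specified. Run fc-list in cmd to list installed fonts; fc-list :lang=zh lists the Chinese fonts: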

fc-list :lang=zh
C:/Windows/fonts/STSONG.TTF: 华文宋体,STSong:style=Regular
C:/Windows/fonts/FZSTK.TTF: 方正舒体,FZShuTi:style=Regular
C:/Windows/fonts/simsun.ttc: 新宋体,NSimSun:style=常规,Regular
C:/Windows/fonts/msyh.ttc: Microsoft YaHei UI:style=Regular,Normal,obyčejné,Standard,Κανονικά,Normaali,Normál,Normale,Standaard,Normalny,Обычный,Normálne,Navadno,Arrunta
C:/Windows/fonts/msyh.ttc: 微软雅黑,Microsoft YaHei:style=Regular,Normal,obyčejné,Standard,Κανονικά,Normaali,Normál,Normale,Standaard,Normalny,Обычный,Normálne,Navadno,Arrunta

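If cmd shows the Chinese font names as garbled text, run chcp 65001 to switch the active code page to 65001 (UTF-8), after which they display correctly. Note: when setting the font with -V mainfont="Microsoft YaHei", the double quotes are required; without them the command fails.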

pandoc --pdf-engine=xelatex -V mainfont="Microsoft YaHei" test.md -o test.pdf
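To pick a font name programmatically, the fc-list output shown above can be parsed into family names. This is a rough sketch: font_families is a hypothetical helper, and it assumes each line has the "path: family1,family2:style=..." layout seen in the listing.

```python
def font_families(fc_list_output):
    """Extract unique font family names from `fc-list` output, in order of appearance."""
    families = []
    for line in fc_list_output.splitlines():
        # The file path ends at the first ": " (a Windows drive letter has no space after its colon)
        _path, found, rest = line.partition(": ")
        if not found:
            continue
        names = rest.split(":", 1)[0]   # drop the ":style=..." part
        for name in names.split(","):
            name = name.strip()
            if name and name not in families:
                families.append(name)
    return families


sample = (
    "C:/Windows/fonts/STSONG.TTF: 华文宋体,STSong:style=Regular\n"
    "C:/Windows/fonts/msyh.ttc: 微软雅黑,Microsoft YaHei:style=Regular,Normal\n"
)
print(font_families(sample))
```

Any of the returned names can then be passed to pandoc through -V mainfont or -V CJKmainfont.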

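Once Chinese text shows up, the next problem is that Chinese lines do not wrap. The cause is that the LaTeX template used by Pandoc needs to be modified.
Several articles switched to the template by Tzeng Yuxio. Download that template and change the command to:

pandoc --pdf-engine=xelatex --template=pm-template.latex test.md -o test.pdf
# or give an absolute path:
pandoc --pdf-engine=xelatex --template=D:\tools\Pandoc\pm-template.latex test.md -o test.pdf

The template can be downloaded from: https://raw.githubusercontent.com/tzengyuxio/pages/gh-pages/pandoc/pm-template.latex

The template is reproduced below; the changes are mainly needed because of how TeX handles Chinese.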

\documentclass[$if(fontsize)$$fontsize$,$endif$$if(lang)$$lang$,$endif$$if(papersize)$$papersize$,$endif$]{$documentclass$}
\newcommand{\tightlist}{%
  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\usepackage{geometry} 		% set the margins
\geometry{
  top=1in,
  inner=1in,
  outer=1in,
  bottom=1in,
  headheight=3ex,
  headsep=2ex
}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\usepackage{fixltx2e} % provides \textsubscript
% use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
  \usepackage[utf8]{inputenc}
$if(euro)$
  \usepackage{eurosym}
$endif$
\else % if luatex or xelatex
  \usepackage{fontspec} 	% allow fonts to be selected
  \usepackage{xeCJK} 		% set Chinese and Latin fonts separately
  \setCJKmainfont{Microsoft YaHei} 	% Chinese font
  \setmainfont{Georgia} 	% Latin font
  \setromanfont{Georgia} 	% roman font
  \setmonofont{Courier New}
  \linespread{1.2}\selectfont 	% line spacing
  \XeTeXlinebreaklocale "zh" 	% automatic line breaking for Chinese
  \XeTeXlinebreakskip = 0pt plus 1pt % add 0pt to 1pt of glue between characters so both margins stay aligned
  \parindent 0em 		% paragraph indentation
  \setlength{\parskip}{20pt} 	% space between paragraphs
  \ifxetex
    \usepackage{xltxtra,xunicode}
  \fi
  \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
  \newcommand{\euro}{€}
$if(mainfont)$
    \setmainfont{$mainfont$}
$endif$
$if(sansfont)$
    \setsansfont{$sansfont$}
$endif$
$if(monofont)$
    \setmonofont{$monofont$}
$endif$
$if(mathfont)$
    \setmathfont{$mathfont$}
$endif$
\fi
% use microtype if available
\IfFileExists{microtype.sty}{\usepackage{microtype}}{}
$if(geometry)$
\usepackage[$for(geometry)$$geometry$$sep$,$endfor$]{geometry}
$endif$
$if(natbib)$
\usepackage{natbib}
\bibliographystyle{plainnat}
$endif$
$if(biblatex)$
\usepackage{biblatex}
$if(biblio-files)$
\bibliography{$biblio-files$}
$endif$
$endif$
$if(listings)$
\usepackage{listings}
$endif$
$if(lhs)$
\lstnewenvironment{code}{\lstset{language=Haskell,basicstyle=\small\ttfamily}}{}
$endif$
$if(highlighting-macros)$
$highlighting-macros$
$endif$
$if(verbatim-in-note)$
\usepackage{fancyvrb}
$endif$
$if(tables)$
\usepackage{longtable}
$endif$
$if(graphics)$
\usepackage{graphicx}
% We will generate all images so they have a width \maxwidth. This means
% that they will get their normal width if they fit onto the page, but
% are scaled down if they would overflow the margins.
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth
\else\Gin@nat@width\fi}
\makeatother
\let\Oldincludegraphics\includegraphics
\renewcommand{\includegraphics}[1]{\Oldincludegraphics[width=\maxwidth]{#1}}
$endif$
\ifxetex
  \usepackage[setpagesize=false, % page size defined by xetex
              unicode=false, % unicode breaks when used with xetex
              xetex]{hyperref}
\else
  \usepackage[unicode=true]{hyperref}
\fi
\hypersetup{breaklinks=true,
            bookmarks=true,
            pdfauthor={$author-meta$},
            pdftitle={$title-meta$},
            colorlinks=true,
            urlcolor=$if(urlcolor)$$urlcolor$$else$blue$endif$,
            linkcolor=$if(linkcolor)$$linkcolor$$else$magenta$endif$,
            pdfborder={0 0 0}}
\urlstyle{same}  % don't use monospace font for urls
$if(links-as-notes)$
% Make links footnotes instead of hotlinks:
\renewcommand{\href}[2]{#2\footnote{\url{#1}}}
$endif$
$if(strikeout)$
\usepackage[normalem]{ulem}
% avoid problems with \sout in headers with hyperref:
\pdfstringdefDisableCommands{\renewcommand{\sout}{}}
$endif$
\setlength{\parindent}{0pt}
%\setlength{\parskip}{6pt plus 2pt minus 1pt}
\setlength{\emergencystretch}{3em}  % prevent overfull lines

\title{\huge 在OSX平台上的XeLaTeX中文測試} % title, set in a huge font
\author{FoolEgg.com} 		% author
\date{February 2013} 		% date
\usepackage{titling}
\setlength{\droptitle}{-8em} 	% move the title towards the top of the page

\usepackage{fancyhdr}
\usepackage{lastpage}
\pagestyle{fancyplain}

$if(numbersections)$
\setcounter{secnumdepth}{5}
$else$
\setcounter{secnumdepth}{0}
$endif$
$if(verbatim-in-note)$
\VerbatimFootnotes % allows verbatim text in footnotes
$endif$
$if(lang)$
\ifxetex
  \usepackage{polyglossia}
  \setmainlanguage{$mainlang$}
\else
  \usepackage[$lang$]{babel}
\fi
$endif$
$for(header-includes)$
$header-includes$
$endfor$

$if(title)$
\title{$title$}
$endif$
\author{$for(author)$$author$$sep$ \and $endfor$}
\date{$date$}

\begin{document}
$if(title)$
\maketitle
$endif$

$for(include-before)$
$include-before$

$endfor$
$if(toc)$
{
\hypersetup{linkcolor=black}
\setcounter{tocdepth}{$toc-depth$}
\tableofcontents
}
$endif$
$body$

$if(natbib)$
$if(biblio-files)$
$if(biblio-title)$
$if(book-class)$
\renewcommand\bibname{$biblio-title$}
$else$
\renewcommand\refname{$biblio-title$}
$endif$
$endif$
\bibliography{$biblio-files$}

$endif$
$endif$
$if(biblatex)$
\printbibliography$if(biblio-title)$[title=$biblio-title$]$endif$

$endif$
$for(include-after)$
$include-after$

$endfor$
\end{document}

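Other templates are also available, such as "pandoc template" and a pandoc graduation thesis template.

Compilation fails after specifying the template file
- Change the Chinese font set in pm-template.latex (LiHei Pro) to a Chinese font installed on your machine
- If compilation fails again, look up the exact error in the input.log file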

xeCJK warning: “CJKfamily-Unknown”
Unknown CJK family `\CJKsfdefault' is being ignored.
Try to use `\setCJKmonofont[…]{…}' to define it.

The fix is to add \setCJKmonofont{Courier New} to pm-template.latex. Compiling again then fails with:

! Undefined control sequence.
l.199 \tightlist

The fix is to add the following definition to the template file:

\newcommand{\tightlist}{%
  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}

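After that change, compiling again fails with:

Error producing PDF. ! File ended while scanning use of \end. <inserted text>

Suggestions found online were to delete all log files and a few other generated files and compile again, but the same error kept coming back. After an hour of trial and error it turned out that I had accidentally deleted a brace while editing the template file. With that fixed, the command finally produced a reasonably good-looking PDF.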
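A missing brace of this kind can be caught before compiling with a quick count. The sketch below is a heuristic, not a LaTeX parser: brace_balance is a hypothetical helper that ignores comments (text after an unescaped %) and escaped braces (\{ and \}), and it knows nothing about verbatim environments.

```python
import re


def brace_balance(tex):
    """Return (number of '{') - (number of '}') outside comments; 0 means balanced."""
    balance = 0
    for line in tex.splitlines():
        line = re.sub(r"(?<!\\)%.*", "", line)                 # strip LaTeX comments
        line = line.replace(r"\{", "").replace(r"\}", "")      # ignore escaped braces
        balance += line.count("{") - line.count("}")
    return balance


good = "\\newcommand{\\tightlist}{%\n  \\setlength{\\itemsep}{0pt}\\setlength{\\parskip}{0pt}}"
broken = good[:-1]  # simulate accidentally deleting the final brace

print(brace_balance(good), brace_balance(broken))
```

A non-zero result for the whole template file points to an unbalanced brace somewhere; running the check on progressively smaller parts of the file narrows down where.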

The document to convert contains images
If a docx file contains images, extract them into a media directory with --extract-media:

pandoc test.docx --extract-media=. -o test.md

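The rest of this section describes the pitfalls encountered when converting Markdown (md below) files to PDF with pandoc.

Approaches that were tried and did not work well

The conversions mentioned below are covered in the pandoc manual (http://www.pandoc.org/MANUAL.html), so the details are not repeated here.

Convert md to PDF directly with pandoc. Problem: it fails because no LaTeX engine is installed.
After installing the MiKTeX LaTeX distribution, the conversion fails again because Chinese fonts are not supported (the md file contains Chinese).
After setting the font to SimSun, the conversion produces a PDF, but the fonts, line spacing, and layout look poor.
After more searching, a LaTeX template was added. Some parameters can now be adjusted, but the template is very hard to read and many details still cannot be fine-tuned.
Change of approach: go md -> HTML -> PDF. Converting md to HTML gives good-looking output, and the browser's print function turns the HTML into a PDF. Fine control over the formatting is still missing.
The HTML conversion accepts a CSS file in which headings, body text, and paragraphs can be styled in detail. This has the same drawback as the LaTeX template: too much effort for the result, so it was abandoned.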

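The approach finally adopted

The final approach is another detour: md -> docx -> PDF. It works well for two reasons. First, md to docx is a single pandoc command and needs no extra software. Second, once the docx exists, fonts, sizes, paragraph settings, and other formatting can be adjusted by hand until the result is satisfactory. Word can then convert the docx to PDF.

After the formatting of the generated docx has been tuned by hand, its styles can be saved as a dot template file, so later docx files do not need to be adjusted one by one; importing all styles from the template is enough (assume the template is named md2doc-template.dot). The import steps, using Microsoft Word 2013: open the docx converted from the md file (assume it is named md-doc.docx), and on the Home tab click the small arrow at the bottom right of the Styles group to open the Styles pane:

Click the Manage Styles button at the bottom of the Styles pane to open the Manage Styles dialog:

Click the Import/Export button at the bottom left of the Manage Styles dialog to open the Import/Export dialog:

In the Import/Export dialog, leave the left side unchanged and click Close File on the right:

Then click Open File and open the saved template file md2doc-template.dot:

Then copy the template's styles from the right-hand list into the document:

All styles from the template are now copied into the new document.

Finally, the pandoc command for md to docx looks like this: pandoc text.md -o text.docx

Wrapping the command in a script and converting in batches

Put the conversion command in a bat script so that a later conversion only takes a double-click. For example, convert.bat could contain:

pandoc text.md -o text.docx && pause

Batch processing: with dozens or hundreds of md files, there is no need to copy each file name into a pandoc command by hand. One bat script (convert.bat) handles them all: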

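@echo off
rem Loop over every file with the md extension in the current folder
for %%a in (*.md) do (
    rem Convert each md file to docx; the output is named <md file name>.md.docx
    pandoc "%%a" -o "%%a.docx"
)
pause

Suppose the current folder contains three md files, test1.md, test2.md, and test3.md, all with this content:

# Level-1 heading 1
## Level-2 heading 1
* List item 1
* List item 2

## Level-2 heading 2
Body text 1-1

# Level-1 heading 2
Body text 2-1

# Level-1 heading 3
Table 1:

Name|Age|Gender
---|---|---
Zhang San|30|Male
Li Si|28|Male
Wang Wu|29|Male

Double-clicking convert.bat produces three docx files: test1.md.docx, test2.md.docx, and test3.md.docx.

Opening test1.md.docx shows that the overall formatting is decent and that the md table syntax is supported.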
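The same batch conversion can be written in Python, which also works outside Windows. This is a minimal sketch: plan_conversions and convert_all are hypothetical names, and the output naming (md file name plus .docx) simply follows the bat script above.

```python
import subprocess
from pathlib import Path


def plan_conversions(folder):
    """List (source, target) pairs for every .md file in folder, e.g. test1.md -> test1.md.docx."""
    folder = Path(folder)
    return [(md, md.with_name(md.name + ".docx")) for md in sorted(folder.glob("*.md"))]


def convert_all(folder):
    """Run pandoc once per Markdown file; requires pandoc on the PATH."""
    for src, dst in plan_conversions(folder):
        subprocess.run(["pandoc", str(src), "-o", str(dst)], check=True)


# Example: convert_all(".") converts every md file in the current folder.
```

Separating the planning step from the pandoc call makes it easy to print or review the file list before anything is converted.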


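Using pypandoc

pypandoc is a Python library that provides a Python interface to Pandoc. With it, Pandoc can be called directly from Python code without writing subprocess calls yourself.

First install pypandoc with pip:

pip install pypandoc

pypandoc has three main functions:

convert()
convert_file()
convert_text()

The official recommendation is not to use convert(), because it is ambiguous; it is deprecated.

convert_file() takes a file path and convert_text() takes a string. The example below makes the difference clear: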

import pypandoc

"""
    pypandoc.convert_file(source_file, to, format=None, extra_args=(), encoding='utf-8',
                 outputfile=None, filters=None, verify_format=True)
    Parameters:
    source_file: path to the source file
    to: format to convert to; one of pypandoc.get_pandoc_formats()[1]
    format: format of the input; inferred from source_file when it has a known extension; one of pypandoc.get_pandoc_formats()[0] (default: None)
    extra_args: extra arguments passed to pandoc (list of strings) (default: ())
    encoding: encoding of the file or the input bytes (default: 'utf-8')
    outputfile: output path and file name; the extension should match `to`; if omitted, the converted content is returned (default: None)
    filters: pandoc filters, e.g. filters=['pandoc-citeproc']
    verify_format: whether to validate the given format arguments (pypandoc compares the extension taken from the file name with the format passed in)

    pypandoc.convert_text(source, to, format, extra_args=(), encoding='utf-8',
                     outputfile=None, filters=None, verify_format=True)
    Parameters:
    source: a string
    the rest are the same as for convert_file()

"""

# Convert 1.html in the html directory straight to a .docx file named file1.docx in the doc directory
pypandoc.convert_file('./html/1.html', 'docx', outputfile="./doc/file1.docx")

# Read 1.html from the html directory first, then convert the text to a .docx file named file2.docx in the doc directory
with open('./html/1.html', encoding='utf-8') as f:
    f_text = f.read()
pypandoc.convert_text(f_text, 'docx', 'html', outputfile="./doc/file2.docx")

Supported formats

Input formats:

biblatex, bibtex, commonmark, commonmark_x, creole, csljson, csv, docbook, docx, dokuwiki, epub, fb2, gfm, haddock, html, ipynb, jats, jira, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, rtf, t2t, textile, tikiwiki, twiki, vimwiki

Output formats:

asciidoc, asciidoctor, beamer, biblatex, bibtex, commonmark, commonmark_x, context, csljson, docbook, docbook4, docbook5, docx, dokuwiki, dzslides, epub, epub2, epub3, fb2, gfm, haddock, html, html4, html5, icml, ipynb, jats, jats_archiving, jats_articleauthoring, jats_publishing, jira, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, ms, muse, native, odt, opendocument, opml, org, pdf, plain, pptx, revealjs, rst, rtf, s5, slideous, slidy, tei, texinfo, textile, xwiki, zimwiki

Example: converting a docx file to txt

The pypandoc.convert_file function converts files. The following example converts a .docx file to a .txt file:

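import pypandoc

def convert_docx_to_txt(docx_file_path, txt_file_path):
    # With outputfile set, convert_file writes the file and returns an empty string
    output = pypandoc.convert_file(docx_file_path, 'plain', outputfile=txt_file_path)
    assert output == ""
    print(f"File converted: {txt_file_path}")

# Convert a file with the function
convert_docx_to_txt('input.docx', 'output.txt')

In this example, convert_docx_to_txt takes two arguments: docx_file_path is the path of the .docx file and txt_file_path is the path of the resulting .txt file. pypandoc.convert_file performs the conversion, and the 'plain' argument selects plain text output.

Note that pypandoc still requires Pandoc to be installed on the system, with its executable on the PATH environment variable so that Python can find and run it.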
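Whether the pandoc executable is visible on the PATH can be checked from Python before any conversion is attempted. This is a small sketch using only the standard library; require_pandoc is a hypothetical helper name.

```python
import shutil


def pandoc_path():
    """Return the full path of the pandoc executable, or None if it is not on the PATH."""
    return shutil.which("pandoc")


def require_pandoc():
    """Raise a readable error early instead of failing in the middle of a conversion."""
    path = pandoc_path()
    if path is None:
        raise RuntimeError("pandoc was not found on PATH; install it or add its directory to PATH")
    return path


print(pandoc_path())
```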

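Example: converting an md file to html

# -*- coding: utf-8 -*-
#
def readme():
    """
    Convert a file to another format.
    convert_file(source_file, to, format=None, extra_args=(), encoding='utf-8', outputfile=None, filters=None)
    Parameters:
        source_file: the source file
        to: target format, e.g. html, rst, md
        format: source format, e.g. html, rst, md. Defaults to None, in which case it is detected from the file extension
        encoding: character encoding
        outputfile: output file, e.g. test.html (its extension should match `to`)
    """
    try:
        import pypandoc
        # convert_file is used instead of the deprecated convert(); with outputfile set it returns an empty string
        return pypandoc.convert_file('README.md', 'html', format='md', outputfile='1.html')
    except (IOError, ImportError):
        # Fall back to the raw Markdown if pypandoc or pandoc is unavailable
        with open('README.md', encoding='utf-8') as f:
            return f.read()
readme()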

Example: converting an epub file to pdf

With the pandoc command line

pandoc -s ./data/input/test.epub -o ./data/output/pdf/epub_pdf.pdf --pdf-engine=xelatex \
    --template=./docs/pm-template.latex \
    -V mainfont="Microsoft YaHei" \
    -V geometry:"left=1.5cm,right=1.5cm,top=2cm,bottom=2cm"

With pypandoc

# -*- coding: utf-8 -*-
import pypandoc
from typing import Iterable

# List all available output formats
# print(pypandoc.get_pandoc_formats()[1])
# Get the pandoc version
# print(pypandoc.get_pandoc_version())


# Convert a file with pypandoc (the function name is kept from the docx example; it works for any format)
def convert_docx_to_txt(docx_file_path='',to_str='plain', txt_file_path='',extra_args:Iterable=()):
    print('extra_args:',extra_args)
    output = pypandoc.convert_file(docx_file_path, to_str, outputfile=txt_file_path, extra_args=extra_args)
    # With outputfile set, convert_file returns an empty string
    assert output == ""
    print(f"File converted: {txt_file_path}")

docx_file_path = './data/input/test.epub'
# Take the input type from the file extension
input_type = docx_file_path[docx_file_path.rindex('.')+1:]
print('input_type:',input_type)

txt_file_path='./data/output/pdf/%s_pdf.pdf'%input_type
convert_docx_to_txt(docx_file_path=docx_file_path,to_str='pdf',
                        txt_file_path=txt_file_path,
                        extra_args=[
                            '--pdf-engine=xelatex',
                            '--variable=mainfont:Microsoft YaHei',
                            '--template=./docs/pm-template.latex',
                            # '--wrap=auto',
                        ]
                    )
