Batch-converting html files to pdf with pdfkit

yivifu

Last modified 2022-04-30 14:28:05

Category: python. Tags: python

First published 2022-04-28 15:51:00

Article link: https://blog.csdn.net/yivifu/article/details/124474913

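The simplest way to convert an epub file to pdf is the e-book manager calibre. However, calibre only converts books that have been added to its library, and adding a book creates extra folders and files, which is a little annoying. The calibre suite also includes an epub editor, The Calibre e-book Editor, which can export the resource files inside an epub to a local directory. Once exported, the html files can be converted to pdf with the Python package pdfkit.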

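This approach needs three preparation steps:

1. Install the wkhtmltopdf tool, which pdfkit wraps. Download and installation instructions are easy to find with a web search.

2. Use The Calibre e-book Editor to replace the references to external files (cascading style sheets, images and so on) in the epub's html files with absolute paths. Without this replacement, the script is very likely to raise an OSError. The figure below shows one way to batch-convert the relative resource links in the html files to absolute paths (the exact search regular expression depends on the link format in your html files; the one in the figure applies when ../ is used to reach the root directory):

In the figure, file:/// in the replacement path is the protocol, and e:/book is the directory the epub's resources will be exported to. A typical directory structure after exporting is shown in the figure below:

Here the images folder holds images and other resources, and the text folder holds the html files.

3. Install the pdfkit package: pip install pdfkit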
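The path rewriting in step 2 could also be done in Python instead of the editor's search-and-replace. The following is only a minimal sketch, assuming the links reach the export root with a single ../ prefix as in the case described above. The function name absolutize_links, the regular expression and the sample html are hypothetical and not taken from the original figure; e:/book is the export directory named in the text.

```python
import re

# Matches href="../..." or src="../..." (single or double quotes).
# Assumption: links reach the export root with exactly one "../" prefix.
LINK_PATTERN = re.compile(r'''(?P<attr>\b(?:href|src))=(?P<quote>["'])\.\./''')


def absolutize_links(html: str, root_uri: str = "file:///e:/book/") -> str:
    """Replace a leading ../ in href/src attributes with an absolute file URI."""
    return LINK_PATTERN.sub(
        lambda m: f"{m.group('attr')}={m.group('quote')}{root_uri}", html
    )


sample = '<link href="../stylesheet.css" rel="stylesheet"/><img src="../images/cover.jpg"/>'
print(absolutize_links(sample))
```

To apply it to the exported files, each html file in the text folder would be read as UTF-8, passed through absolutize_links, and written back. Links between chapters that also start with ../ would be rewritten as well, so the pattern may need adjusting for a particular book.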

With the preparation done, pdfkit can convert the html files in the text folder into a single pdf file:

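import pdfkit
from pathlib import Path


# Folder holding the exported html files, and the folder that receives the pdf
input_path = Path(r"E:\book\text")
output_path = Path(r"E:\book")
# Point pdfkit at the wkhtmltopdf executable, or add wkhtmltopdf.exe to the PATH environment variable
config_pdf = pdfkit.configuration(wkhtmltopdf=r'd:\programs\wkhtmltopdf\bin\wkhtmltopdf.exe')

options = {
    'encoding': 'UTF-8',
    'page-size': 'A4',
    'margin-top': '0.75in',
    'margin-right': '0.75in',
    'margin-bottom': '0.75in',
    'margin-left': '0.75in',
    # Per the pdfkit README, False, None or '' mark an option without a value,
    # so this entry passes the --no-outline flag
    'no-outline': False,
    # Allow wkhtmltopdf.exe to access local resources
    'enable-local-file-access': True,
}

# Sort by name so the chapters keep a predictable order in the pdf
input_files = sorted(str(o) for o in input_path.glob("*.html"))

pdfkit.from_file(input=input_files,
                 output_path=str(output_path / 'output.pdf'),  # replace with the pdf file name you want
                 options=options, configuration=config_pdf)

print('Done!')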

In the options of the script above, 'enable-local-file-access' is off by default. If it is not enabled, the conversion fails with a ProtocolUnknownError; the full message is:

OSError: wkhtmltopdf reported an error:
Exit with code 1 due to network error: ProtocolUnknownError
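Since the single-pdf script joins the html files in the order of the input list, file order decides chapter order. Plain name sorting is enough when the exported names are zero-padded, but names such as part2.html and part10.html would sort the wrong way round. The following is a minimal sketch of a natural sort key under that assumption; natural_key and the sample file names are hypothetical and do not come from the original post.

```python
import re
from pathlib import Path


def natural_key(path: Path) -> list:
    """Split a file name into text and integer chunks so part2 sorts before part10."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", path.name)]


names = [Path("part10.html"), Path("part2.html"), Path("part1.html")]
ordered = sorted(names, key=natural_key)
print([p.name for p in ordered])  # ['part1.html', 'part2.html', 'part10.html']
```

In the conversion script, the same key could be passed to sorted() when building the list of input files.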

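If the html files do not link a cascading style sheet, or you want to apply a different style sheet to them, there is one restriction: pdfkit can only add a style sheet to a single html input file. In other words, each html file has to be converted to its own pdf file, and the resulting pdf files then have to be merged with another tool or with Python code (the PyPDF2 package is a good choice). The code for converting each html file to its own pdf file is as follows: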

import pdfkit
from pathlib import Path


input_path = Path(r"E:\book\text")
output_path = Path(r'E:\book\pdf')
# Create the output folder if it does not exist yet
output_path.mkdir(parents=True, exist_ok=True)

config_pdf = pdfkit.configuration(wkhtmltopdf=r'd:\programs\wkhtmltopdf\bin\wkhtmltopdf.exe')

options = {
    'encoding': 'UTF-8',
    'page-size': 'A4',
    'margin-top': '0.75in',
    'margin-right': '0.75in',
    'margin-bottom': '0.75in',
    'margin-left': '0.75in',
    'no-outline': False,
    # Allow wkhtmltopdf.exe to access local resources
    'enable-local-file-access': True,
}
# Several style sheets must be passed as a list
css = [r'E:\book\page_styles.css', r"E:\book\stylesheet.css"]

input_files = sorted(input_path.glob("*.html"))

for f in input_files:
    # One pdf per html file, named after the html file
    o_f = output_path / (f.stem + '.pdf')
    pdfkit.from_file(input=str(f),
                     output_path=str(o_f),
                     css=css,  # style sheets cannot be given for several input files at once
                     options=options, configuration=config_pdf)
print('Done!')
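
The merge step mentioned above is not shown in the post. The following is a minimal sketch of how it might look, assuming PyPDF2 2.x or 3.x, where the merging class is named PdfMerger (older releases call it PdfFileMerger). The helper names pdfs_in_order and merge_pdfs and the file name merged.pdf are hypothetical.

```python
from pathlib import Path


def pdfs_in_order(folder: Path) -> list:
    """Return the pdf files in a folder sorted by file name."""
    return sorted(folder.glob("*.pdf"), key=lambda p: p.name)


def merge_pdfs(folder: Path, merged_file: Path) -> None:
    """Append every pdf in folder to merged_file, in file-name order."""
    # Imported here so pdfs_in_order works without PyPDF2 installed.
    # Assumption: PyPDF2 2.x/3.x, where the class is named PdfMerger.
    from PyPDF2 import PdfMerger

    merger = PdfMerger()
    for pdf in pdfs_in_order(folder):
        merger.append(str(pdf))
    merger.write(str(merged_file))
    merger.close()


# Example call (paths follow the scripts above; merged.pdf is a made-up name):
# merge_pdfs(Path(r"E:\book\pdf"), Path(r"E:\book\merged.pdf"))
```

The merged file is written outside the pdf folder on purpose, so that running the sketch a second time does not pick up the previous result as an input.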