爬虫html转换成pdf,爬虫实战【3】Python-如何将html转化为pdf(PdfKit)

最新推荐文章于 2024-05-12 09:54:41 发布

i福

最新推荐文章于 2024-05-12 09:54:41 发布

阅读量225

点赞数

文章标签：爬虫html转换成pdf

前言

前面我们对博客园的文章进行了爬取，结果比较令人满意，可以一下子下载某个博主的所有文章了。但是，我们获取的只有文章中的文本内容，并且是没有排版的，看起来也比较费劲。。。

咋么办的？一个比较好的方法是将文章的正文内容转化成pdf，就不要考虑排版的事情了，看起来比较美观，也不会丢失一些关键信息。

python中将html转化为pdf的常用工具是Wkhtmltopdf工具包，在python环境下，pdfkit是这个工具包的封装类。如何使用pdfkit以及如何配置呢？分如下几个步骤。

1、下载wkhtmltopdf安装包，并且安装到电脑上，在系统Path变量中添加wkhtmltopdf的bin路径，以便于pdfkit的调用。

下载地址：https://wkhtmltopdf.org/downloads.html

请根据自己的系统版本，选择合适的安装包。如果没有装C语言库，建议选择Windows下的第二种。

【插入图片 pdf1】

2、在pycharm中安装pdfkit库，过程就不介绍啦，前面讲过类似的内容。

3、在pycharm中安装whtmltopdf库。

这个和第一步中的安装包是两个东西，请区别开来。

用法简介

对于简单的任务来说，代码很easy，比如：

import pdfkit

pdfkit.from_url('http://baidu.com','out.pdf')

pdfkit.from_file('test.html','out.pdf')

pdfkit.from_string('Hello!','out.pdf')

pdfkit包含的方法很少，主要用的就是这三个，我们简单看一下每个函数的API：

from_ulr()

def from_url(url, output_path, options=None, toc=None, cover=None,

configuration=None, cover_first=False):

"""

Convert file of files from URLs to PDF document

:param url: url可以是某一个url也可以是url的列表，

:param output_path: 输出pdf的路径，如果设置为False意味着返回一个string

Returns: True on success

"""

r = PDFKit(url, 'url', options=options, toc=toc, cover=cover,

configuration=configuration, cover_first=cover_first)

return r.to_pdf(output_path)

from_file()

def from_file(input, output_path, options=None, toc=None, cover=None, css=None,

configuration=None, cover_first=False):

"""

Convert HTML file or files to PDF document

:param input: 输入的内容可以是一个html文件，或者一个路径的list，或者一个类文件对象

:param output_path: 输出pdf的路径，如果设置为False意味着返回一个string

Returns: True on success

"""

r = PDFKit(input, 'file', options=options, toc=toc, cover=cover, css=css,

configuration=configuration, cover_first=cover_first)

return r.to_pdf(output_path)

from_string()

def from_string(input, output_path, options=None, toc=None, cover=None, css=None,

configuration=None, cover_first=False):

#类似的，这里就不介绍了

r = PDFKit(input, 'string', options=options, toc=toc, cover=cover, css=css,

configuration=configuration, cover_first=cover_first)

return r.to_pdf(output_path)

举几个栗子

我们可以传入列表：

pdfkit.from_url(['google.com', 'yandex.ru', 'engadget.com'], 'out.pdf')

pdfkit.from_file(['file1.html', 'file2.html'], 'out.pdf')

我们可以将一个打开的文件对象传进去：

with open('file.html') as f:

pdfkit.from_file(f, 'out.pdf')

如果我们想继续操作pdf，可以将其读取成一个变量，其实就是一个string变量。

# Use False instead of output path to save pdf to a variable

pdf = pdfkit.from_url('http://google.com', False)

指定pdf的格式

我们可以指定各种选项，就是上面三个方法中的options。

具体的设置可以参考https://wkhtmltopdf.org/usage/wkhtmltopdf.txt 里面的内容。

我们这里只举个栗子：

options = {

'page-size': 'Letter',

'margin-top': '0.75in',

'margin-right': '0.75in',

'margin-bottom': '0.75in',

'margin-left': '0.75in',

'encoding': "UTF-8",

'custom-header' : [

('Accept-Encoding', 'gzip')

]

'cookie': [

('cookie-name1', 'cookie-value1'),

('cookie-name2', 'cookie-value2'),

'no-outline': None

}

pdfkit.from_url('http://google.com', 'out.pdf', options=options)

默认的，pdfkit会show出所有的output，如果你不想使用，可以设置为quite：

options = {

'quiet': ''

}

pdfkit.from_url('google.com', 'out.pdf', options=options)

我们还可以传入任何html标签，比如：

body = """

Hello World!

"""

pdfkit.from_string(body, 'out.pdf') #with --page-size=Legal and --orientation=Landscape

改进

有了上面的知识之后，我们大可以尝试一下，如果将之前的save_file方法做一些改变，就能够实现我们下载PDF的目标啦。

我们将方法名改成save_to_pdf，并且在get_body方法中直接返回str(div)，而不是div.text。代码如下：

def save_to_pdf(url):

'''

根据url，将文章保存到本地

:param url:

:return:

'''

title=get_title(url)

body=get_Body(url)

filename=author+'-'+title+'.pdf'

if '/' in filename:

filename=filename.replace('/','+')

if '\' in filename:

filename=filename.replace('\','+')

print(filename)

options = {

'page-size': 'Letter',

'encoding': "UTF-8",

'custom-header': [

('Accept-Encoding', 'gzip')

]

}

#本来直接调用pdfkid的from方法就可以了，但是由于我们的wkhtmltopdf安装包有点问题，一直没法搜到，所以只能用本办法，直接配置了wk的地址

#尴尬了，主要是一直没法下载到最新的wk，只能在网上down了旧版本的。有谁能下到的话发我一份。。。

config=pdfkit.configuration(wkhtmltopdf=r'C:Program Fileswkhtmltopdfinwkhtmltopdf.exe')

pdfkit.from_string(body,filename,options=options,configuration=config)

print('打印成功！')

【插入图片，pdf2】

哈哈，成功了，下载了这么多pdf，回头慢慢看就可以了。

i福

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
爬虫html转换成pdf,爬虫实战【3】Python-如何将html转化为pdf(PdfKit)

前言前面我们对博客园的文章进行了爬取，结果比较令人满意，可以一下子下载某个博主的所有文章了。但是，我们获取的只有文章中的文本内容，并且是没有排版的，看起来也比较费劲。。。咋么办的？一个比较好的方法是将文章的正文内容转化成pdf，就不要考虑排版的事情了，看起来比较美观，也不会丢失一些关键信息。python中将html转化为pdf的常用工具是Wkhtmltopdf工具包，在python环境下，pdfk...
复制链接

扫一扫