【Python】Converting HTML to PDF with Python

Latest recommended article published 2024-01-11 17:54:18

weixin_39755218


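I actually did this last year but never wrote it up, so here are some brief notes.

1. Main tool: 【wkhtmltopdf】

Pick the installer that matches your operating system. The download is a bit slow, so leave it running in the background.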
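
Once the installer has finished, it can help to confirm that Python can actually see the executable. A minimal sketch, assuming the binary is named wkhtmltopdf (wkhtmltopdf.exe on Windows) and is expected on PATH; the helper name find_wkhtmltopdf is hypothetical:

```python
import shutil


def find_wkhtmltopdf(name="wkhtmltopdf"):
    """Return the full path of the wkhtmltopdf executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_wkhtmltopdf()
    if path is None:
        # Not on PATH: the location can be passed explicitly via
        # pdfkit.configuration(wkhtmltopdf=...), as done later in this post.
        print("wkhtmltopdf not found on PATH")
    else:
        print("wkhtmltopdf found at", path)
```

If the function returns None, either add the install directory to PATH or point pdfkit at the executable explicitly.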

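2. Install the Python libraries

pip install pdfkit

pip install wkhtmltopdf

Note that pdfkit works by calling the wkhtmltopdf executable installed in step 1, so that binary must be installed (and either on PATH or configured explicitly) for the calls below to work.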

3. Quick sanity check

import pdfkit

pdfkit.from_url('http://baidu.com','out.pdf')

pdfkit.from_file('test.html','out1.pdf')

pdfkit.from_string('Hello World!','out2.pdf')

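If the calls print Done and return True, the environment is set up correctly.

The PDF files are written to the working directory and open normally. The source HTML page is dynamic and large, while the PDF is a static rendering at a reduced size.

Since the files open correctly, the code works, and from here you can combine it freely with your web-scraping skills.

Lists of inputs are also supported: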

pdfkit.from_url(['google.com', 'yandex.ru', 'engadget.com'], 'out.pdf')

pdfkit.from_file(['file1.html', 'file2.html'], 'out.pdf')

File objects are supported:

with open('file.html') as f:
    pdfkit.from_file(f, 'out.pdf')

To keep the PDF in a variable for further processing instead of writing it to a file:

# Use False instead of output path to save pdf to a variable

pdf = pdfkit.from_url('http://google.com', False)
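
With False as the output path, the call returns the PDF data itself (a bytes object under Python 3, as far as I can tell). The sketch below shows one way such data could be checked and saved later; looks_like_pdf and save_pdf_bytes are hypothetical helper names, and the sample bytes are a stand-in for a real pdfkit result:

```python
from pathlib import Path


def looks_like_pdf(data):
    """Cheap check: PDF files start with the magic bytes '%PDF-'."""
    return isinstance(data, (bytes, bytearray)) and bytes(data[:5]) == b"%PDF-"


def save_pdf_bytes(data, path):
    """Write in-memory PDF data to disk and return the number of bytes written."""
    if not looks_like_pdf(data):
        raise ValueError("data does not look like a PDF")
    return Path(path).write_bytes(data)


# Stand-in for: pdf = pdfkit.from_url('http://google.com', False)
pdf = b"%PDF-1.4\n% placeholder content, not a complete PDF\n"
print(looks_like_pdf(pdf))
```

Keeping the PDF in memory is handy when it is going to be uploaded, attached to an email, or returned from a web handler rather than stored locally.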

Specifying the PDF format (options)

See https://wkhtmltopdf.org/usage/wkhtmltopdf.txt for the full list of options.

options = {
    'page-size': 'Letter',
    'margin-top': '0.75in',
    'margin-right': '0.75in',
    'margin-bottom': '0.75in',
    'margin-left': '0.75in',
    'encoding': "UTF-8",
    'custom-header': [
        ('Accept-Encoding', 'gzip')
    ],
    'cookie': [
        ('cookie-name1', 'cookie-value1'),
        ('cookie-name2', 'cookie-value2'),
    ],
    'no-outline': None
}

pdfkit.from_url('http://google.com', 'out.pdf', options=options)
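
Each key in the options dictionary corresponds to a wkhtmltopdf command-line flag of the same name. The sketch below illustrates how such a dictionary could be flattened into arguments. It is a simplified illustration of the idea, not pdfkit's actual implementation, and options_to_args is a hypothetical name:

```python
def options_to_args(options):
    """Flatten an options dict into a wkhtmltopdf-style argument list (illustrative only)."""
    args = []
    for key, value in options.items():
        flag = "--" + key
        if value is None or value == "":
            # Switches such as 'no-outline' or 'quiet' take no value
            args.append(flag)
        elif isinstance(value, (list, tuple)):
            # Repeatable options: one flag per entry, e.g. several cookies
            for item in value:
                args.append(flag)
                if isinstance(item, (list, tuple)):
                    args.extend(str(part) for part in item)
                else:
                    args.append(str(item))
        else:
            args.append(flag)
            args.append(str(value))
    return args


options = {
    'page-size': 'Letter',
    'custom-header': [('Accept-Encoding', 'gzip')],
    'no-outline': None,
}
print(options_to_args(options))
```

Thinking of the dictionary this way makes the wkhtmltopdf usage page linked above easier to apply: drop the leading dashes from a flag to get the key, and use None (or an empty string) for flags that take no value.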

By default, pdfkit shows all of wkhtmltopdf's output. If you do not want that, set the quiet option:

options = {'quiet': ''}

pdfkit.from_url('google.com', 'out.pdf', options=options)

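You can pass in any HTML markup (goodbye annoying ads; the page can be truly customized to your liking). Page settings can also be given as meta tags inside the HTML:

body = """
<html>
  <head>
    <meta name="pdfkit-page-size" content="Legal"/>
    <meta name="pdfkit-orientation" content="Landscape"/>
  </head>
  Hello World!
</html>
"""

pdfkit.from_string(body, 'out.pdf')  # with --page-size=Legal and --orientation=Landscape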

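【Improvements】

Rename the earlier save_file method to save_to_pdf, and have the get_body method return str(div) directly instead of div.text, so that the HTML markup is kept rather than only the plain text. The code is as follows:

def save_to_pdf(url):
    '''Save the article at the given url to a local PDF file.

    :param url:
    :return:
    '''
    # get_title, get_Body and author come from the earlier crawler code (not shown here)
    title = get_title(url)
    body = get_Body(url)
    filename = author + '-' + title + '.pdf'
    # Windows file names cannot contain certain special characters;
    # look up the full list and replace them as needed
    if '/' in filename:
        filename = filename.replace('/', '+')
    if '\\' in filename:
        filename = filename.replace('\\', '+')
    print(filename)
    options = {
        'page-size': 'Letter',
        'encoding': "UTF-8",
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ]
    }
    # Point pdfkit at the wkhtmltopdf executable explicitly
    config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
    pdfkit.from_string(body, filename, options=options, configuration=config)
    print('PDF saved successfully!')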
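
Because str(div) is only a fragment of a page, it may help to wrap it in a complete HTML document before handing it to pdfkit.from_string, so that the encoding is declared in the markup itself. A minimal sketch, assuming the fragment is already trusted HTML; wrap_fragment is a hypothetical helper, and the meta charset tag is an addition that does not appear in the original code:

```python
import html


def wrap_fragment(fragment, title="Untitled"):
    """Wrap an HTML fragment in a minimal document that declares UTF-8."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8"/>\n'
        "<title>" + html.escape(title) + "</title>\n"
        "</head>\n<body>\n"
        + fragment +
        "\n</body>\n</html>\n"
    )


# The fragment stands in for the str(div) returned by the crawler
body = wrap_fragment("<div><h1>Hello World!</h1></div>", title="Demo <test>")
print(body)
```

The wrapped string can then be passed as the first argument of pdfkit.from_string in place of the bare fragment.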

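【File naming conventions】

With the rise of self-published media, article titles (and the file names derived from them) come in every shape imaginable. The line of code below replaces the illegal characters.

# Filter out characters that are illegal in Windows file names
import re

title = 'xxxxxxx'
# Replace each run of illegal characters with '-'. Inside [] the * needs no
# escaping: there it is a literal asterisk, not a repetition operator.
fileName = re.sub(r'[\\/:*?"<>|\r\n]+', '-', title)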
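
The same pattern can be wrapped in a small reusable function. A sketch, with safe_filename as a hypothetical name; stripping trailing dots and spaces (which Windows does not accept at the end of a name) and the fallback name are assumptions added here, not part of the original one-liner:

```python
import re

# Characters Windows does not allow in file names, plus line breaks
_ILLEGAL = re.compile(r'[\\/:*?"<>|\r\n]+')


def safe_filename(title, replacement='-', fallback='untitled'):
    """Replace runs of illegal characters and tidy the ends of the name."""
    name = _ILLEGAL.sub(replacement, title)
    name = name.strip().rstrip('. ')
    return name or fallback


print(safe_filename('Python: HTML to PDF? (part 1/2)'))
```

Using one function for every generated file name keeps the '/' and '\\' special cases out of the saving code.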

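From now on, when you come across a good article, you can scrape it yourself and save it as a PDF. There is no need to worry about the source site deleting it; a copy on your own computer is the safest.

【Reference】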

https://blog.csdn.net/xc_zhou/article/details/80952168
