Web Scraping: Download a CSDN Blog Post by URL and Convert It to PDF

Latest recommended article published on 2024-06-14 10:31:03

S-Tatum

Tags: web scraping, pdf, front end

Article link: https://blog.csdn.net/weixin_44791757/article/details/128951601

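Before running the script, install the Python libraries it imports: requests, pdfkit, imgkit, and lxml.

You also need to install the wkhtmltopdf software, which pdfkit and imgkit call to do the actual rendering. Download it from: https://wkhtmltopdf.org/downloads.html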
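
The script in this post hard-codes the location of the wkhtmltopdf executable. As a minimal sketch of an alternative, assuming the install directory has been added to PATH, the executable can be looked up at run time. `find_executable` is a hypothetical helper name used for illustration; it is not part of pdfkit or imgkit.

```python
import shutil


def find_executable(name, fallback=None):
    """Return the full path of `name` if it is on PATH, otherwise `fallback`."""
    found = shutil.which(name)
    return found if found is not None else fallback


# Look up wkhtmltopdf; the result is None when it is not on PATH.
wkhtmltopdf_path = find_executable("wkhtmltopdf")
if wkhtmltopdf_path is None:
    print("wkhtmltopdf was not found on PATH; install it or pass its full path.")
else:
    print("Using wkhtmltopdf at:", wkhtmltopdf_path)
```

When a path is found, it can be passed as the `wkhtmltopdf` argument of `pdfkit.configuration`, the same call the script uses with a fixed path.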

import requests
import pdfkit
import imgkit
from lxml import etree


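# URL of the blog post
# url = 'https://blog.csdn.net/qq_45404396/article/details/105965562'
# Note: after entering the URL, delete its last character and type it again
# before pressing Enter; otherwise the URL is opened directly in the browser.
url = input('Enter the URL: ')
# Request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
}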

htmlStr='''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
{}
</body>
</html>'''

# Fetch the page and extract the article body as an HTML string


rsp=requests.get(url=url,headers=headers)
HTML=etree.HTML(rsp.text)
content=HTML.xpath("//article[@class='baidu_pl']")[0]
html=etree.tostring(content,encoding='utf-8').decode('utf-8')
htmlStr=htmlStr.format(html)
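path = input('Enter the output directory: ')

# The blog title is not scraped: some titles contain characters that are not
# allowed in file names and would cause an error, so enter the name yourself.
fileName = input('Enter the blog title: ')
print('Converting to PDF...')
file = r'{}\{}.pdf'.format(path, fileName)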

# Path to the wkhtmltopdf executable (raw string, so the backslashes are kept literally)
config = pdfkit.configuration(wkhtmltopdf=r'D:\PycharmProjects\wkhtmltopdf\bin\wkhtmltopdf.exe')
pdfkit.from_string(htmlStr, file, configuration=config)

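# Convert the same HTML to an image
print('Converting to an image...')
file = r'{}\{}.png'.format(path, fileName)

# imgkit needs the wkhtmltoimage executable rather than wkhtmltopdf; it is
# assumed here to sit in the same bin directory.
config2 = imgkit.config(wkhtmltoimage=r'D:\PycharmProjects\wkhtmltopdf\bin\wkhtmltoimage.exe')
imgkit.from_string(htmlStr, file, config=config2)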
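
The script asks for the blog title by hand because some titles contain characters that are not allowed in file names. One possible alternative, sketched here under assumptions, is to replace the characters that Windows rejects before building the output path. `safe_filename` and its parameters are hypothetical names for illustration, and the sketch does not handle reserved device names such as `CON` or `NUL`.

```python
import re

# Characters that Windows does not allow in file names, plus control characters.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, replacement="_", default="untitled"):
    """Return `title` with characters invalid in Windows file names replaced."""
    cleaned = _INVALID_CHARS.sub(replacement, title)
    # Windows also rejects names that end with a dot or a space.
    cleaned = cleaned.strip().rstrip(". ")
    # Fall back to a default name if nothing usable is left.
    return cleaned or default


print(safe_filename('Scraping: CSDN -> "PDF"?'))
```

With a helper like this, a scraped title could be cleaned first and then used in place of the manually entered `fileName`.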

Reference article: "Python web scraping: downloading CSDN blog posts with the pdfkit and imgkit modules", from the CSDN blog of 坚持不懈的大白