Preface:
What do you do when the free CSDN articles you have bookmarked for years suddenly go behind a paywall, or get deleted by the author? Save them as PDFs.
1. Tools
1.1 Required modules
pdfkit, requests, parsel
1.2 Required software
The wkhtmltox-0.12.5-1.msvc2015-win64.exe installer
Link: https://pan.baidu.com/s/1e_0_4tpyxIU8lHqJF56BhA
Extraction code: 2141
Just run the installer and accept all the defaults.
2. Getting the article's HTML (stripping irrelevant content)
2.1 Open the browser's developer tools (right-click → Inspect) and analyze the page
A quick look shows that the element the arrow on the right points to contains exactly the article content we need, and nothing unrelated.
2.2 Parsing the data
import requests
import parsel

url = 'https://blog.csdn.net/A728848944/article/details/108026415'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
html = response.text
selector = parsel.Selector(html)
title = selector.css('.title-article::text').get()
article = selector.css('article').get()  # extract the <article> tag and its contents
print(article)
Output:
Comparing it against the page confirms the extraction is correct.
2.3 Assembling the HTML
The extracted HTML is only a fragment, so it needs to be wrapped in a full document.
A standard HTML skeleton looks like this:
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
    content goes here
</body>
</html>
Now concatenate the pieces and save the result:
src_html = '''
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Document</title>
</head>
<body>
{content}
</body>
</html>
'''
with open(title + '.html', mode='w+', encoding='utf-8') as f:
    f.write(src_html.format(content=article))
    print(title + '================saved successfully')
Output:
Open the saved HTML file:
Nice and clean, with nothing extraneous.
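One caveat when saving by title: article titles can contain characters that are illegal in Windows filenames (e.g. `?`, `:`, `*`), which would make the `open()` call fail. A small helper, sketched below (the `sanitize_filename` name is my own, not from the original code), replaces them before saving:

```python
import re

def sanitize_filename(title):
    # Replace characters that are invalid in Windows filenames with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

print(sanitize_filename('C++ tutorial: part 1?'))
```

It could be applied right after extracting the title, i.e. `title = sanitize_filename(title)`.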
3. Converting the HTML to PDF
3.1 Check that the tool installed correctly
If installation succeeded, you can find the installed files at the location shown below (the default install path is the same for everyone).
3.2 Using the pdfkit module
3.2.1 A bit of configuration first
config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
This points at the wkhtmltopdf.exe program inside the bin folder of the install path above.
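pdfkit also accepts an `options` dict whose keys mirror wkhtmltopdf's command-line flags; forcing UTF-8 encoding and setting margins can help with Chinese articles. A sketch (these particular option values are my own choice, not from the original post):

```python
# Options forwarded to wkhtmltopdf; keys mirror its command-line flags
options = {
    'encoding': 'UTF-8',      # avoid garbled Chinese text in the PDF
    'margin-top': '10mm',
    'margin-bottom': '10mm',
}
# They would be passed alongside the configuration, e.g.:
# pdfkit.from_file(title + '.html', title + '.pdf', configuration=config, options=options)
```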
3.2.2 Run the conversion
pdfkit.from_file(title + '.html', title + '.pdf', configuration=config)
print(title + '.pdf', 'saved successfully')
Output:
Check that it printed successfully, as shown below:
Check that the resulting PDF is usable:
== All done! ==
And finally, of course, the complete code:
import pdfkit
import requests
import parsel

url = 'https://blog.csdn.net/A728848944/article/details/108026415'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
html = response.text
selector = parsel.Selector(html)
title = selector.css('.title-article::text').get()
article = selector.css('article').get()  # extract the <article> tag and its contents
src_html = '''
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Document</title>
</head>
<body>
{content}
</body>
</html>
'''
with open(title + '.html', mode='w+', encoding='utf-8') as f:
    f.write(src_html.format(content=article))
    print(title + '================saved successfully')
config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
pdfkit.from_file(title + '.html', title + '.pdf', configuration=config)
print(title + '.pdf', 'saved successfully')
Bonus
Code to download all of a blogger's articles as PDFs:
import pdfkit
import requests
import parsel
import time

src_html = '''
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Document</title>
</head>
<body>
{content}
</body>
</html>
'''
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}

def download_one_page(page_url):
    response = requests.get(url=page_url, headers=headers)
    html = response.text
    selector = parsel.Selector(html)
    title = selector.css('.title-article::text').get()
    article = selector.css('article').get()  # extract the <article> tag and its contents
    with open(title + '.html', mode='w+', encoding='utf-8') as f:
        f.write(src_html.format(content=article))
    config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
    pdfkit.from_file(title + '.html', title + '.pdf', configuration=config)
    print(title + '.pdf', '=================saved successfully')

def down_all_url(index_url):
    index_response = requests.get(url=index_url, headers=headers)
    index_selector = parsel.Selector(index_response.text)
    urls = index_selector.css('.article-list h4 a::attr(href)').getall()
    for url in urls:
        download_one_page(url)
        time.sleep(2)  # pause between requests to avoid hammering the server

if __name__ == '__main__':
    down_all_url('https://blog.csdn.net/A728848944')
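Note that the blog's front page only lists the first page of articles. CSDN article-list pages appear to follow the pattern `https://blog.csdn.net/<user>/article/list/<page>` (an assumption worth verifying for your target blog); a small helper of my own to build those URLs, each of which could then be fed to `down_all_url`:

```python
def list_page_urls(user, pages):
    # Build the assumed CSDN article-list URL for each page number
    return ['https://blog.csdn.net/{}/article/list/{}'.format(user, page)
            for page in range(1, pages + 1)]

print(list_page_urls('A728848944', 2))
```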