Scraping CSDN Articles with Python and Converting Them to PDF

Latest recommended article published on 2024-04-12 21:11:01

嗨学编程

Article link: https://blog.csdn.net/fei347795790/article/details/102997714

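Preface

The text and images in this article come from the internet and are intended for learning and discussion only, with no commercial purpose. Copyright belongs to the original authors; if there is any problem, please contact us promptly so we can deal with it.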

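Use Python to scrape the original CSDN articles you like and turn them into PDFs to read at your own pace. Mom will never have to worry about me running out of study material again.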

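Key points:

requests
CSS selectors

Third-party libraries:

requests
parsel
pdfkit

Development environment:

Version: anaconda5.2.0 (python3.6.5)
Editor: pycharm

There is no best editor, only the one that suits you best.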

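Downloading CSDN articles

If the author deletes their account on a whim, an article that exists only online is gone for good, and you would be left in tears. A local backup avoids that.

The first step is to get the HTML of the article.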
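
As a minimal sketch of this step, the helper below fetches a page and returns its HTML text. The name fetch_html, the getter parameter, and the 10-second timeout are assumptions made for illustration and are not part of the original script. getter defaults to requests.get and exists only so that the function can be tried out without network access.

```python
def fetch_html(url, headers=None, getter=None, timeout=10):
    """Return the HTML text of `url`, raising an exception on HTTP errors.

    `getter` is a hypothetical hook for this sketch; it defaults to requests.get.
    """
    if getter is None:
        import requests  # third-party: pip install requests (imported lazily)
        getter = requests.get
    response = getter(url, headers=headers, timeout=timeout)
    # Fail loudly on 4xx/5xx responses instead of saving an error page as the article.
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html = fetch_html('https://blog.csdn.net/ZackSock/article/details/101645494',
#                   headers={'User-Agent': 'Mozilla/5.0'})
```

The timeout keeps a stalled connection from hanging the script, which matters once many articles are downloaded in a loop.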

pdfkit

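pdfkit is a Python library that converts HTML into PDF documents. It is a wrapper around the wkhtmltopdf command-line tool, so wkhtmltopdf has to be installed separately.

Installing wkhtmltopdf

Download: https://wkhtmltopdf.org/downloads.html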
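
Because pdfkit only drives the wkhtmltopdf executable, it has to be able to find that binary. The sketch below looks for it on PATH and then falls back to a list of candidate locations. find_wkhtmltopdf is a hypothetical helper name, and the candidate path in the usage comment is simply the Windows path used later in this article.

```python
import os
import shutil


def find_wkhtmltopdf(candidates=()):
    """Return the path to the wkhtmltopdf executable, or None if it is not found."""
    on_path = shutil.which('wkhtmltopdf')
    if on_path:
        return on_path
    # Fall back to explicit locations, e.g. the default Windows install directory.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# Example use with pdfkit (third-party: pip install pdfkit):
# path = find_wkhtmltopdf([r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'])
# if path is not None:
#     config = pdfkit.configuration(wkhtmltopdf=path)
```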

Three ways to create a PDF

from_url: the argument passed in is a URL

def from_url(url, output_path, options=None, toc=None, cover=None, configuration=None, cover_first=False)

from_file: the argument passed in is an HTML file

def from_file(input, output_path, options=None, toc=None, cover=None, css=None, configuration=None, cover_first=False)

from_string: the argument passed in is a string of HTML

def from_string(input, output_path, options=None, toc=None, cover=None, css=None, configuration=None, cover_first=False)
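
To make the difference between the three entry points concrete, here is a small sketch that picks one of them based on the input. source_kind and to_pdf are hypothetical helper names, and the rule for telling the inputs apart (URL scheme first, then an existing file, otherwise raw HTML) is an assumption of this example rather than something pdfkit does by itself.

```python
import os


def source_kind(source):
    """Classify `source` as 'url', 'file', or 'string' (illustrative rule only)."""
    if source.startswith(('http://', 'https://')):
        return 'url'
    if os.path.isfile(source):
        return 'file'
    return 'string'


def to_pdf(source, output_path, **kwargs):
    """Call pdfkit.from_url, from_file, or from_string depending on `source`."""
    import pdfkit  # third-party: pip install pdfkit (wkhtmltopdf must be installed too)
    convert = getattr(pdfkit, 'from_' + source_kind(source))
    return convert(source, output_path, **kwargs)


# Example calls (each needs pdfkit and wkhtmltopdf):
# to_pdf('https://blog.csdn.net/', 'page.pdf')
# to_pdf('<h1>Hello</h1>', 'hello.pdf')
```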

Configuration: the configuration argument carries the wkhtmltopdf settings, such as the path to the wkhtmltopdf executable. For details, see:

https://blog.csdn.net/xc_zhou/article/details/80952168

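Specifying the PDF format

We can set various options through the options argument of the three methods above.
The available settings are listed at https://wkhtmltopdf.org/usage/wkhtmltopdf.txt.
Here is just one example: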

import pdfkit

options = {
    'page-size': 'Letter',
    'margin-top': '0.75in',
    'margin-right': '0.75in',
    'margin-bottom': '0.75in',
    'margin-left': '0.75in',
    'encoding': "UTF-8",
    # A list of tuples repeats the flag once per entry.
    'custom-header': [
        ('Accept-Encoding', 'gzip')
    ],
    'cookie': [
        ('cookie-name1', 'cookie-value1'),
        ('cookie-name2', 'cookie-value2'),
    ],
    # None means the flag takes no value.
    'no-outline': None
}

pdfkit.from_url('http://google.com', 'out.pdf', options=options)
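
Each key in options corresponds to a wkhtmltopdf command-line flag. The function below is a rough approximation of that mapping, written only to illustrate the idea. It is not pdfkit's own code, and options_to_args is a hypothetical name.

```python
def options_to_args(options):
    """Approximate how an options dict maps onto wkhtmltopdf command-line arguments."""
    args = []
    for key, value in options.items():
        flag = '--' + key
        if value is None:
            # A flag without a value, e.g. --no-outline
            args.append(flag)
        elif isinstance(value, list):
            # Repeatable flags: emit the flag once per list entry.
            for item in value:
                parts = item if isinstance(item, tuple) else (item,)
                args.append(flag)
                args.extend(str(part) for part in parts)
        else:
            args.extend([flag, str(value)])
    return args


example = {
    'page-size': 'Letter',
    'custom-header': [('Accept-Encoding', 'gzip')],
    'no-outline': None,
}
print(options_to_args(example))
# ['--page-size', 'Letter', '--custom-header', 'Accept-Encoding', 'gzip', '--no-outline']
```

Reading the options this way makes it easier to look up a setting in the wkhtmltopdf usage text: drop the leading dashes from the flag name and use the rest as the dictionary key.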

Complete code

selenium_page.py

import re
import requests


def get_songid():
    """Collect the song IDs listed on the artist page."""
    url = 'http://music.taihe.com/artist/2517'
    response = requests.get(url=url)
    html = response.text
    sids = re.findall(r'href="/song/(\d+)"', html)
    return sids


def get_music_url(songid):
    """Return (title, download link) for one song, or None if the lookup fails."""
    # The f-string already fills in songid, so no extra .format() call is needed.
    api_url = f'http://musicapi.taihe.com/v1/restserver/ting?method=baidu.ting.song.playAAC&format=jsonp&songid={songid}&from=web'
    response = requests.get(api_url)
    data = response.json()
    print(data)
    try:
        music_name = data['songinfo']['title']
        music_url = data['bitrate']['file_link']
        return music_name, music_url
    except (KeyError, TypeError) as e:
        print(e)
        return None


def download_music(music_name, music_url):
    """Download one song and save it as an .mp3 file."""
    response = requests.get(music_url)
    content = response.content
    save_file(music_name+'.mp3', content)


def save_file(filename, content):
    """Write the downloaded bytes to disk."""
    with open(file=filename, mode="wb") as f:
        f.write(content)


if __name__ == "__main__":
    for song_id in get_songid():
        result = get_music_url(song_id)
        if result is None:
            # Skip songs whose metadata could not be read instead of
            # crashing when unpacking None.
            continue
        music_name, music_url = result
        download_music(music_name, music_url)

CSDN.py

# -*- coding: utf-8 -*-
import pdfkit
import parsel
import requests

html_template = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
</head>
<body>
{content}
</body>
</html>
"""

# No explicit Host header: requests derives it from each URL, and a fixed
# 'blog.csdn.net' value would be wrong for the www.csdn.net request below.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
    'Referer': 'https://blog.csdn.net',
}


def get_csdn_cookie():
    """Visit the CSDN home page and return the cookies it sets."""
    response = requests.get('https://www.csdn.net/', headers=headers)
    return response.cookies


def get_html(url):
    """Yield the article links found on an index (article list) page."""
    response = requests.get(url, headers=headers)
    sel = parsel.Selector(response.text)
    list_a = sel.css('.article-list a')
    for i in list_a[2:]:
        article_index = i.css('a::attr(href)').get()
        yield article_index


def csdn(url: str, cookie=None):
    """Download a CSDN article and return its title and article HTML."""
    # Fetch the cookies at call time. A default of get_csdn_cookie() would be
    # evaluated once at import time and trigger a network request on import.
    if cookie is None:
        cookie = get_csdn_cookie()
    response = requests.get(url, headers=headers, cookies=cookie)
    # Extract the article title and body
    sel = parsel.Selector(response.text)
    # print(response.text)
    title = sel.css('.title-article::text').get()
    article = sel.css('article').get()
    return title, article


def html_to_pdf(filename_html, filename_pdf):
    """Convert an HTML file to a PDF file."""
    # Path to the wkhtmltopdf executable; adjust it to match your installation.
    config = pdfkit.configuration(wkhtmltopdf='C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe')
    options = {
        'page-size': 'Letter',
        'margin-top': '0.75in',
        'margin-right': '0.75in',
        'margin-bottom': '0.75in',
        'margin-left': '0.75in',
        'encoding': "UTF-8",
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ],
    }
    pdfkit.from_file(filename_html, filename_pdf, options=options, configuration=config)


if __name__ == '__main__':
    title, article = csdn('https://blog.csdn.net/ZackSock/article/details/101645494')
    html = html_template.format(content=article)
    with open(f'{title}.html', mode='w', encoding='utf-8') as f:
        f.write(html)

    html_to_pdf(f'{title}.html', f'{title}.pdf')
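
The script uses the article title directly as the file name. That can fail when the title contains characters that are not allowed in file names (for example / or ? on Windows), and it produces a file named None.html when the title selector matches nothing. One possible helper is sketched below. safe_filename, the fallback name article, and the 100-character limit are assumptions of this sketch.

```python
import re

# Characters Windows rejects in file names, plus line breaks and tabs.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]+')


def safe_filename(title, fallback='article', max_length=100):
    """Turn an article title into a string that is safe to use as a file name."""
    if not title:
        return fallback
    # Replace each run of invalid characters with one underscore, then trim
    # spaces, dots, and underscores from both ends.
    cleaned = _INVALID_CHARS.sub('_', title).strip(' ._')
    return cleaned[:max_length] or fallback


# Example use with the script's variables:
# name = safe_filename(title)
# html_to_pdf(f'{name}.html', f'{name}.pdf')
```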
