Scraping All of a Blogger's Articles to PDF with Python

First Snowflakes

Last modified: 2022-03-07 22:02:06

First published: 2020-11-22 11:49:22

Article link: https://blog.csdn.net/qq_35865125/article/details/109921762

Step 1:

Install the pdfkit package; see the post "Python: pdfkit, a web-page-to-PDF toolkit" (Peace, CSDN blog).
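pdfkit is a wrapper around the wkhtmltopdf command-line tool, so that binary has to be installed separately. Below is a minimal sketch of a pre-flight check, assuming the binary is on PATH under its default name. The helper name find_wkhtmltopdf is made up for illustration.

```python
import shutil


def find_wkhtmltopdf(binary_name="wkhtmltopdf"):
    """Return the full path of the wkhtmltopdf binary, or None if it is not on PATH.

    The default binary name is an assumption; adjust it if your install differs.
    """
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_wkhtmltopdf()
    if path is None:
        print("wkhtmltopdf not found on PATH; install it before using pdfkit")
    else:
        print("wkhtmltopdf found at " + path)
```

If the binary lives somewhere that is not on PATH, pdfkit can be pointed at it with pdfkit.configuration(wkhtmltopdf=path), passing the result as the configuration argument of pdfkit.from_string.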

Step 2:

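Scrape a single article and convert it to PDF. First, fetch the full content of the page at the article's URL (using the urllib, bs4, and re modules). Then extract just the article body, because the page also contains the comment section and a lot of other material.

Finally, convert the body to PDF.

Example (runs as-is):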

import pdfkit
import os
import urllib.request
import re
from bs4 import BeautifulSoup

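def get_html(url):
    '''
    Return the decoded HTML source of the page at the given URL.
    :param url: address of the page to fetch
    :return: page source as a str
    '''
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)  # This call crashed while a VPN was running; turn the VPN off first
    html_page = resp.read().decode('utf-8')
    return html_page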


def get_body_for_pdf(url):
    """
    Return the article body found at the given URL.
    :param url: address of the article page
    :return: the body element as an HTML string
    """
    html_page = get_html(url)
    soup = BeautifulSoup(html_page,'html.parser')   # parse the HTML document
    # Extract the article body. On 博客园 (cnblogs) its id is "cnblogs_post_body"; on CSDN it is "article_content"
    #div = soup.find(id = "cnblogs_post_body") # For 博客园 (cnblogs)
    div = soup.find(id="article_content")      # For CSDN
    return str(div)


def save_single_file_to_PDF(url):
    title = "HackedTitle"
    body = get_body_for_pdf(url)

    options = {
        'page-size':'Letter',
        'encoding':"UTF-8",
        'custom-header':[('Accept-Encoding','gzip')]
    }
    filename = title + '.pdf'
    try:
        # Write the PDF into the current working directory; any other path works too
        pdfkit.from_string(body, filename, options=options)
        print(filename + " has been saved")     # report that the article was downloaded
    except Exception as exc:
        print('save_single_file_to_PDF failed! title: ' + title + ', url: ' + url + ', error: ' + str(exc))

if __name__ == '__main__':
    save_single_file_to_PDF('https://blog.csdn.net/qq_35865125/article/details/109837687')
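The fetch step above sends urllib's default headers, sets no timeout, and assumes the page is UTF-8. The following is a sketch of a slightly more defensive variant, under the assumption (not verified here) that the target site may treat requests without a browser-like User-Agent differently. The names build_request, decode_body, and fetch_html are hypothetical and are not part of the original script.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a Request carrying a browser-like User-Agent header (value is an assumption)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def decode_body(raw_bytes, charset=None):
    """Decode response bytes with the declared charset, falling back to UTF-8."""
    try:
        return raw_bytes.decode(charset or "utf-8", errors="replace")
    except LookupError:  # the server declared a charset name Python does not know
        return raw_bytes.decode("utf-8", errors="replace")


def fetch_html(url, timeout=10):
    """Fetch a page with a timeout and decode it using the charset from Content-Type."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset()
        return decode_body(resp.read(), charset)
```

fetch_html could stand in for get_html; the timeout keeps a stalled connection from hanging the whole run.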

Step 3:

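Scrape all articles automatically.

Open the blogger's article-list page. Its source contains the title and URL of every article; extract both with regular expressions, then call the Step 2 function on each URL.

To view a page's HTML source in Firefox, right-click and choose View Page Source. In the HTML of one blogger's CSDN article-list page (Peace_First Snowflakes_CSDN博客), each article's title appears together with its URL.

Different blogging platforms use different markup; 博客园 (cnblogs), for example, has its own format.

Also, when a blogger has many articles, the list spans several pages, and the last number in each page's URL is usually the page index. For this CSDN blogger the list pages are https://blog.csdn.net/qq_35865125/article/list/1, https://blog.csdn.net/qq_35865125/article/list/2, and so on.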
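The page-numbering scheme just described can be sketched as a small helper. The function name list_page_urls is hypothetical, and the number of pages still has to be read off the site by hand.

```python
def list_page_urls(base_url, pages):
    """Return the article-list URLs for pages 1..pages by appending the page index."""
    return [base_url + str(i) for i in range(1, pages + 1)]


if __name__ == "__main__":
    for page_url in list_page_urls("https://blog.csdn.net/qq_35865125/article/list/", 2):
        print(page_url)
```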

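Example: scrape the titles and URLs of all articles on the CSDN page https://blog.csdn.net/qq_35865125/article/list/1:

In that page's source, the article 《C++ STD标准模板库的泛型思想》 has the URL https://.../109889333.

Searching for <a href="https://blog.csdn.net/qq_35865125/article/details/109889333" finds two matches in the page, so a regular expression built on it would need its duplicates removed afterwards.

Searching for <a href="https://blog.csdn.net/qq_35865125/article/details/109889333" data-report-click finds only one match, so it can be used for the regular expression directly:

Code (verified):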

import pdfkit
import os
import urllib.request
import re
from bs4 import BeautifulSoup

def get_urls(url, pages):
    # get_html() is the function defined in Step 2
    total_urls = []

    for i in range(1, pages+1):      # visit every page of the article list

        url_temp = url + str(i)
        htmlContent = get_html(url_temp)   # fetch the source of this list page

        # Ref:  https://blog.csdn.net/weixin_42793426/article/details/88545939, Python regular expressions: https://blog.csdn.net/qq_41800366/article/details/86527810
        # https://www.cnblogs.com/wuxunyan/p/10615260.html
        # Target markup: <a href="https://blog.csdn.net/qq_35865125/article/details/109920565"  data-report-click=
        net_pattern = re.compile(r'<a href="https://blog\.csdn\.net/qq_35865125/article/details/[0-9]+"  data-report-click=')

        url_withExtra = re.findall(net_pattern, htmlContent)     # every article link on this list page

        # To drop duplicates:
        #url_withExtraNoDupli = set(url_withExtra)

        for _url in url_withExtra:
            stIdx = _url.find("https://")
            endIdx = _url.find('"  data-report-click=')
            _url_sub = _url[stIdx:endIdx]
            total_urls.append(_url_sub)            # collect the URLs from all pages
            print(_url)
    return total_urls

if __name__ == '__main__':
    #save_single_file_to_PDF('https://blog.csdn.net/qq_35865125/article/details/109837687')
    get_urls('https://blog.csdn.net/qq_35865125/article/list/', 1)
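As an alternative to slicing each match with find, a capture group lets re.findall return the bare URL. Matching on the shorter <a href="..." prefix and then removing duplicates is the other option mentioned earlier. The sketch below shows both ideas on a made-up HTML fragment; the sample markup and the name extract_article_urls are assumptions for illustration, not CSDN's actual page source.

```python
import re

# The parentheses form a capture group, so findall returns only the URL part.
ARTICLE_LINK = re.compile(
    r'<a href="(https://blog\.csdn\.net/qq_35865125/article/details/[0-9]+)"'
)


def extract_article_urls(html):
    """Return article URLs in page order, with duplicates removed."""
    found = ARTICLE_LINK.findall(html)
    return list(dict.fromkeys(found))  # dict keys keep first-seen order


# Hypothetical fragment: the first article is linked twice, as on the real page.
sample = (
    '<a href="https://blog.csdn.net/qq_35865125/article/details/109889333" target="_blank">x</a>'
    '<a href="https://blog.csdn.net/qq_35865125/article/details/109889333"  data-report-click=>y</a>'
    '<a href="https://blog.csdn.net/qq_35865125/article/details/109920565"  data-report-click=>z</a>'
)

if __name__ == "__main__":
    for article_url in extract_article_urls(sample):
        print(article_url)
```

Using dict.fromkeys instead of set keeps the articles in the order they appear on the list page.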

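Extracting the article titles from the list page https://blog.csdn.net/qq_35865125/article/list/1 is awkward (it would take a closer study of Python regular expressions), so a workaround is used: take each title from the article's own page. For example,

the HTML of https://blog.csdn.net/qq_35865125/article/details/109889333 contains a title tag, which a simple regular expression can extract directly.

Code: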

# Given an article's URL, extract its title
def get_title_of_one_artical(url):
    htmlContent = get_html(url)
    title_pattern = re.compile(r'<title>.*</title>') # .* matches any run of characters except newlines
    title_withExtra = re.findall(title_pattern, htmlContent)
    if len(title_withExtra)<1:
        return 'NotFoundName'
    found_name = title_withExtra[0]
    stIdx = found_name.find("<title>")+7
    endIdx = found_name.find('</title>')
    title = found_name[stIdx:endIdx]
    return title
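The pattern <title>.*</title> only matches when the whole tag sits on one line, because . does not match a newline by default. As a sketch of an alternative, the standard library's html.parser can collect the title text wherever the line breaks fall. The names TitleParser and extract_title are hypothetical, and this is not the method used in the original script.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title>, e.g. inside an <svg>

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html, default="NotFoundName"):
    """Return the stripped title text, or the default when no title is present."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.parts).strip()
    return title or default
```

extract_title(get_html(url)) would then play the role of get_title_of_one_artical(url).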

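Note:

1) If an article's title contains a character such as /, pdfkit.from_string fails, because the title is used as the output file name and / is a path separator. Such titles need special handling, so it is best to avoid special characters like this when naming blog posts.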
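One way to handle such titles is to replace the offending characters before building the file name. The following is a minimal sketch; the helper name safe_filename is hypothetical, and the set of characters replaced is an assumption that covers the ones Windows rejects in file names plus line breaks and tabs.

```python
import re

# Characters that are invalid in Windows file names, plus line breaks and tabs.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]')


def safe_filename(title, replacement="_", default="untitled"):
    """Replace characters that cannot appear in a file name and trim the result."""
    cleaned = _INVALID_CHARS.sub(replacement, title).strip(" .")
    return cleaned or default


if __name__ == "__main__":
    print(safe_filename("C/C++: a?b") + ".pdf")
```

Calling safe_filename(title) just before the title is joined with '.pdf' keeps the rest of the script unchanged.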

Final code (can be run directly):

import pdfkit
import os
import urllib.request
import re
from bs4 import BeautifulSoup
 
def get_html(url):
    '''
    Return the decoded HTML source of the page at the given URL.
    :param url: address of the page to fetch
    :return: page source as a str
    '''
    print('Enter get_html ' + '---------------------------------')
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)  # This call crashed while a VPN was running; turn the VPN off first
    html_page = resp.read().decode('utf-8')
    print('Out get_html ' + '---------------------------------')
    return html_page
 
 
def get_body_for_pdf(url):
    """
    Return the article body found at the given URL.
    :param url: address of the article page
    :return: the body element as an HTML string
    """
    print('Enter get_body_for_pdf ' + '---------------------------------')
    html_page = get_html(url)
    soup = BeautifulSoup(html_page,'html.parser')   # parse the HTML document
    # Extract the article body. On 博客园 (cnblogs) its id is "cnblogs_post_body"; on CSDN it is "article_content"
    #div = soup.find(id = "cnblogs_post_body") # For 博客园 (cnblogs)
    div = soup.find(id="article_content")      # For CSDN
    print('Out get_body_for_pdf ' + '---------------------------------')
    return str(div)
 
 
def save_single_file_to_PDF(url, title, save_path):
    print('Enter save_single_file_to_PDF ' + '---------------------------------')
    body = get_body_for_pdf(url)
    options = {
        'page-size':'Letter',
        'encoding':"UTF-8",
        'custom-header':[('Accept-Encoding','gzip')]
    }
    filename = title + '.pdf'
    file_full_path = os.path.join(save_path, filename)
    try:
        pdfkit.from_string(body, file_full_path, options=options)  # write the PDF to file_full_path
        print(filename + " has been saved to " + file_full_path)
    except Exception as exc:
        print('save_single_file_to_PDF failed! title: ' + title + ', url: ' + url + ', error: ' + str(exc))
 
# Given an article's URL, extract its title
def get_title_of_one_artical(url):
    print('Enter get_title_of_one_artical ' + '---------------------------------')
    htmlContent = get_html(url)
    title_pattern = re.compile(r'<title>.*</title>') # .* matches any run of characters except newlines
    title_withExtra = re.findall(title_pattern, htmlContent)
    if len(title_withExtra)<1:
        print('title of ' + url + ' was not found!')
        return 'NotFoundName'
    found_name = title_withExtra[0]
    stIdx = found_name.find("<title>")+7
    endIdx = found_name.find('</title>')
    title = found_name[stIdx:endIdx]
    print('Out get_title_of_one_artical ' + '---------------------------------' +title)
    return title
 
 
def get_urls(url, pages):
    total_urls = []
 
    for i in range(1, pages+1):      # visit every page of the article list
 
        url_temp = url + str(i)
        htmlContent = get_html(url_temp)   # fetch the source of this list page
 
        # Ref:  https://blog.csdn.net/weixin_42793426/article/details/88545939, Python regular expressions: https://blog.csdn.net/qq_41800366/article/details/86527810
        # https://www.cnblogs.com/wuxunyan/p/10615260.html
        net_pattern = re.compile(r'<a href="https://blog\.csdn\.net/qq_35865125/article/details/[0-9]+"  data-report-click=')
 
        url_withExtra = re.findall(net_pattern, htmlContent)     # every article link on this list page
 
        # To drop duplicates:
        #url_withExtraNoDupli = set(url_withExtra)
 
        for _url in url_withExtra:
            stIdx = _url.find("https://")
            endIdx = _url.find('"  data-report-click=')
            _url_sub = _url[stIdx:endIdx]
            total_urls.append(_url_sub)            # collect the URLs from all pages
            print(_url)
        # report how many article URLs were found on this list page
        print('i=' + str(i) + '  the number of urls on the current page is ' + str(len(url_withExtra)))
    return total_urls
 
if __name__ == '__main__':
    #save_single_file_to_PDF('https://blog.csdn.net/qq_35865125/article/details/109837687')
    #title = get_title_of_one_artical("https://blog.csdn.net/qq_35865125/article/details/109837687")
    save_path = 'E:/0307Blogs/'  # Note: on Windows, write the path in the form E:/0307Blogs/

    # The second argument is the number of pages under https://blog.csdn.net/qq_35865125/article/list/; a blogger with many articles has several
    total_urls = get_urls('https://blog.csdn.net/qq_35865125/article/list/', 2)
    for _url in total_urls:
        title = get_title_of_one_artical(_url)
        save_single_file_to_PDF(_url, title, save_path)
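The final loop requests every article back to back and does not record which ones failed. The following is a sketch of a loop that pauses between requests and collects the failures, assuming a handler that raises on error. The name process_all and the one-second default delay are illustrative choices, not part of the original script.

```python
import time


def process_all(urls, handler, delay_seconds=1.0):
    """Call handler(url) for each URL, pausing between calls; return the URLs that failed."""
    failed = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # be gentle with the server between requests
        try:
            handler(url)
        except Exception as exc:  # handler is assumed to raise when it cannot finish
            print("failed: " + url + " (" + str(exc) + ")")
            failed.append(url)
    return failed
```

The returned list can be passed back into process_all for a retry.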
