Web Scraping Example: Converting Web Page Content to a PDF/HTML File

云袖秀大本营

Last modified 2023-02-26 16:50:28

First published 2022-01-12 01:10:44

Article link: https://blog.csdn.net/j1451284189/article/details/122445004

Results:

[Screenshot: the downloaded PDF file]

[Screenshot: the contents of the PDF file]

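How it works:

Fetch the page with urllib and read the response for that URL
Use BeautifulSoup to select the nodes you need from the response
Assemble the data into a complete HTML page
Save the HTML file
Use wkhtmltopdf to convert the HTML file to a PDF file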
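The assembly step can be tried on its own before the network and PDF steps are wired up. The following is a minimal sketch rather than the code from this post: `build_page` and `PAGE_TEMPLATE` are hypothetical names introduced here for illustration, and the sample fragment is made up. It wraps an already-extracted HTML fragment in a standalone page and, unlike the fixed `Title` placeholder used in the full script, puts the article title into `<title>`.

```python
import html

# Skeleton of a standalone page. The <meta charset> declaration matters
# because the file is later saved as UTF-8 and handed to a converter.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>{title}</title>
</head>
<body>
{body}
</body>
</html>
"""


def build_page(title, body_html):
    """Wrap an extracted HTML fragment in a complete HTML document.

    `build_page` is a hypothetical helper for illustration only.
    """
    # The title is plain text, so escape it; the body is already markup
    # and is inserted unchanged.
    return PAGE_TEMPLATE.format(title=html.escape(title), body=body_html)


if __name__ == "__main__":
    # Made-up fragment standing in for the node selected by BeautifulSoup.
    fragment = '<div class="article_content"><p>Hello</p></div>'
    print(build_page("Demo & test", fragment))
```

Passing `str(blog_detail)` as `body_html` would connect this helper to the selection step, assuming `blog_detail` is the BeautifulSoup node chosen there.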

Code:

import os
import urllib.error  # URLError, raised when a request fails
import urllib.request  # send the request and read the page data

import pdfkit  # Python wrapper around the wkhtmltopdf executable
from bs4 import BeautifulSoup  # parse the page and extract data


def main():
    # 1. Fetch the page with urllib and read the response for that URL
    baseUrl = "https://blog.csdn.net/j1451284189/article/details/122420310"
    header = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3776.400 QQBrowser/10.6.4212.400"}
    request = urllib.request.Request(url=baseUrl, headers=header)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        return  # nothing to parse if the request failed
    # 2. Use BeautifulSoup to select the nodes you need from the response
    bs = BeautifulSoup(html, "html.parser")
    blog_detail = bs.find_all('div', class_="article_content")[0]
    file_name = bs.find_all('h1', class_="title-article")[0]
    file_name = file_name.text
    # 3. Assemble the data
    html = \
        '''
            <!DOCTYPE html>
                <html lang="en">
                <head>
                    <meta charset="UTF-8">
                    <title>Title</title>
                </head>
                <body>
                    {}
                </body>
            </html>
        '''.format(blog_detail)
    # 4. Save the HTML file
    os.makedirs('blog', exist_ok=True)  # make sure the output folder exists
    try:
        with open(r'.\blog\{}.html'.format(file_name), 'w', encoding='utf-8') as f:
            f.write(html)
    except Exception as e:
        print('Invalid file name:', e)
        return  # no HTML file to convert
    # 5. Use wkhtmltopdf to convert the HTML file to a PDF file
    try:
        # Path to the local wkhtmltopdf executable; adjust it for your machine
        config = pdfkit.configuration(wkhtmltopdf=r'D:\LenovoSoftstore\wkhtmltopdf\bin\wkhtmltopdf.exe')
        pdfkit.from_file(
            r'.\blog\{}.html'.format(file_name),
            r'.\blog\{}.pdf'.format(file_name),
            configuration=config
        )
        print(r'--File downloaded: .\blog\{}.pdf'.format(file_name))
    except Exception as e:
        print('--Failed to convert the file to PDF:', e)


if __name__ == '__main__':
    main()
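The script uses the article title directly as the file name, so a title containing characters such as `/` or `:` makes the save step fail. One possible way to avoid that is to clean the title first. The sketch below is an assumption added for illustration, not part of the original script: `safe_file_name` is a hypothetical helper, and the character set it strips is the one Windows rejects in file names.

```python
import re

# Characters Windows does not allow in file names, plus ASCII control characters.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_file_name(title, replacement="_", max_length=100):
    """Turn a page title into a string that is safe to use as a file name.

    Hypothetical helper for illustration; the defaults are arbitrary choices.
    """
    name = _INVALID_CHARS.sub(replacement, title)
    # Truncate first, then trim: Windows also rejects names that end
    # in a space or a dot.
    name = name[:max_length].strip().rstrip(". ")
    # Fall back to a fixed name if nothing usable is left.
    return name or "untitled"


if __name__ == "__main__":
    print(safe_file_name("爬虫案例-将网页内容转为PDF/HTML文件"))
```

In `main()`, this would replace the plain `file_name = file_name.text` with `file_name = safe_file_name(file_name.text)`.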