Python Learning Diary: Improving the First Crawler by Saving Scraped Content as PDF

Article link: https://blog.csdn.net/jimson_zhu/article/details/130953285

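In the previous post we started from a simple crawler example and got a first taste of Python's appeal: a few short lines were enough for a working crawler to take shape.

In this post we keep building on that first example and save the scraped content as HTML files and PDF files.

The improved code is shown below: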

# Import the required libraries
import os  # file system operations
import requests  # sending HTTP requests
from bs4 import BeautifulSoup  # parsing HTML content
import pdfkit  # converting HTML files to PDF files
import time  # pausing between requests

# Request headers that mimic a browser, to avoid 403 errors
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

# Fetch the listing page http://www.ci123.com/category.php/84/114
response = requests.get('http://www.ci123.com/category.php/84/114', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Collect all <a> tags
a_tags = soup.find_all('a')
# print(len(a_tags))

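# Loop over the <a> tags
for a in a_tags:
    # Get the link text
    text = a.text.strip()
    # Only keep links whose text is longer than 10 characters
    if len(text) > 10:
        # Get the article URL; skip tags that have no href attribute
        article_url = a.get('href')
        if not article_url:
            continue
        article_response = requests.get(article_url, headers=headers)
        article_soup = BeautifulSoup(article_response.text, 'html.parser')
        # Skip pages that lack the expected <div class="page"> container
        page_div = article_soup.find('div', {'class': 'page'})
        if page_div is None:
            continue
        article_content = page_div.prettify()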
        
        # Save the article content as an HTML file
        html_file = f'{text}.html'
        # Replace characters that are not allowed in file names
        html_file = html_file.replace('\\', '_').replace('/', '_').replace(':', '_').replace('*', '_').replace('?', '_').replace('"', '_').replace('<', '_').replace('>', '_').replace('|', '_')
        with open(html_file, 'w', encoding='utf-8') as f:
            f.write(article_content)
        # Convert the HTML file to a PDF file
        pdf_file = f'{text}.pdf'
        # Replace characters that are not allowed in file names
        pdf_file = pdf_file.replace('\\', '_').replace('/', '_').replace(':', '_').replace('*', '_').replace('?', '_').replace('"', '_').replace('<', '_').replace('>', '_').replace('|', '_')
        # Write the PDF; the encoding option fixes garbled characters in the output
        pdfkit.from_file(os.path.abspath(html_file), pdf_file, configuration=pdfkit.configuration(wkhtmltopdf='F:\\wkhtmltox\\bin\\wkhtmltopdf.exe'), options={'encoding': 'utf-8'})
        print(f'{pdf_file} saved successfully!')
        # os.remove(html_file)

        # Wait between requests so the site is not hit too often
        time.sleep(2)

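Code notes:

1) This code extracts all <a> links from the target site but keeps only those whose link text is longer than 10 characters. The goal is to save the article behind each of those links as an HTML file and a PDF file.

2) The headers variable mimics a browser so that the site does not block the requests.

3) The BeautifulSoup library parses the HTML, and the find_all method finds every <a> tag. The code then loops over the tags and checks whether the link text is longer than 10 characters. If it is, the code reads the link's href attribute and text, fetches the article, and saves its content as an HTML file. The pdfkit library then converts the HTML file to a PDF file.

4) The code pauses between requests so that it does not hit the site too often and get the crawler banned.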
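
The listing applies the same chain of replace calls twice, once for the HTML file name and once for the PDF file name. As a rough sketch of one way to avoid the repetition, the helper below does the same substitution with a single regular expression. The name sanitize_filename and the sample title are made up for illustration and are not part of the original code.

```python
import re

# The same nine characters that the replace chain in the listing swaps out
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def sanitize_filename(name: str, replacement: str = '_') -> str:
    """Return name with every disallowed character swapped for replacement."""
    return _INVALID_CHARS.sub(replacement, name)


if __name__ == '__main__':
    # Hypothetical article title containing several disallowed characters
    title = 'What should I do? A/B guide: part 1'
    safe_title = sanitize_filename(title)
    print(safe_title + '.html')
    print(safe_title + '.pdf')
```

Cleaning the title once and then appending the extension keeps the HTML and PDF names in sync, which the two separate replace chains only guarantee as long as they are edited together.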

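Notes:

We only save articles whose <a> tag title is longer than ten characters, because we observed that article titles on this site are almost always longer than that.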
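
The length filter can also be pulled into a small function, which makes it easier to try out without touching the network. The sketch below is only an illustration under a few assumptions: the function name select_article_links is hypothetical, and resolving relative links with urljoin and skipping duplicate URLs are additions that the original listing does not do.

```python
from urllib.parse import urljoin


def select_article_links(links, base_url, min_length=10):
    """Keep (title, absolute_url) pairs whose title is longer than min_length.

    links is an iterable of (text, href) pairs, for example built with
    [(a.text, a.get('href')) for a in soup.find_all('a')].
    """
    selected = []
    seen = set()
    for text, href in links:
        title = (text or '').strip()
        # Same rule as the listing: the title must be longer than min_length
        if len(title) <= min_length or not href:
            continue
        # Assumption: hrefs may be relative, so resolve them against the listing page
        url = urljoin(base_url, href)
        # Assumption: the same article may be linked more than once on the page
        if url in seen:
            continue
        seen.add(url)
        selected.append((title, url))
    return selected


if __name__ == '__main__':
    # Made-up sample data standing in for the <a> tags of the listing page
    sample = [
        ('Home', '/'),
        ('A sample article title that is long enough', '/post/1'),
        ('A sample article title that is long enough', '/post/1'),
    ]
    for title, url in select_article_links(sample, 'http://www.ci123.com/category.php/84/114'):
        print(title, url)
```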

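We also need to download wkhtmltopdf, a tool that converts HTML to PDF, and configure the system environment variable for it. I installed it in "F:\wkhtmltox\bin\", so the system variable value is also "F:\wkhtmltox\bin\", and my pdfkit.configuration setting is wkhtmltopdf='F:\\wkhtmltox\\bin\\wkhtmltopdf.exe'. Adjust this to match your own installation path.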
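
Because the install location differs from machine to machine, it can help to check that the executable is really there before handing the path to pdfkit. The following is a minimal sketch using only the standard library. The helper name find_wkhtmltopdf is hypothetical, and the fallback to a PATH lookup is an assumption about how you might want it to behave, not something the original code does.

```python
import os
import shutil


def find_wkhtmltopdf(explicit_path=None):
    """Return a usable wkhtmltopdf path, or None if none is found."""
    # 1) Prefer the explicitly configured location when the file exists
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    # 2) Otherwise search the directories listed in the PATH environment variable
    return shutil.which('wkhtmltopdf')


if __name__ == '__main__':
    # The path from this post; replace it with your own install location
    path = find_wkhtmltopdf(r'F:\wkhtmltox\bin\wkhtmltopdf.exe')
    if path is None:
        print('wkhtmltopdf not found; install it or adjust the path')
    else:
        print(f'Using wkhtmltopdf at {path}')
        # With pdfkit installed, the path would then be passed as:
        # config = pdfkit.configuration(wkhtmltopdf=path)
```

Failing early with a clear message is usually easier to debug than an error raised in the middle of the crawl, after several articles have already been downloaded.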

wkhtmltopdf download page: wkhtmltopdf - Download