Asynchronous Programming with Python: Downloading Web Novels at High Speed

CoderTLL

Article link: https://blog.csdn.net/qq_34666239/article/details/136157527


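In this digital age, reading web novels has become a daily pastime for many people. Sometimes, though, we want to read them offline, or simply keep a backup. So how do you download an entire novel efficiently? The answer lies in Python's asynchronous programming! ✨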

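Before You Start 🛠️

Before we set off on this magical journey, make sure your toolbox holds the following magical tools:

aiohttp: the crystal ball for sending HTTP requests asynchronously.
aiofiles: the secret scroll for reading and writing files asynchronously.
lxml: the oracle tablet for parsing HTML documents.
asyncio: the staff that controls asynchronous magic in Python (part of the standard library; the other three are installed with pip).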

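The Magic of Asynchronous Crawlers 🧙‍♂️

Our goal is to build an asynchronous crawler that keeps working on other tasks while a network response is still pending, instead of sitting idle until it arrives. With many chapters to fetch, this greatly speeds up the download. The sections below walk through our magic recipe (the code).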
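To make the speed-up concrete, here is a minimal sketch that compares awaiting downloads one by one with running them together through asyncio.gather. The fake_download coroutine is a hypothetical stand-in that sleeps instead of touching the network, and the 0.1 second delay is an assumed latency, so the timings only illustrate the idea.

```python
import asyncio
import time


async def fake_download(chapter: int, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a network request: sleeps instead of doing real I/O."""
    await asyncio.sleep(delay)
    return f"chapter {chapter}"


async def download_sequentially(count: int) -> list:
    # Each await finishes before the next download starts.
    return [await fake_download(i) for i in range(count)]


async def download_concurrently(count: int) -> list:
    # gather starts every coroutine and returns the results in call order.
    return await asyncio.gather(*(fake_download(i) for i in range(count)))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, slow = timed(download_sequentially(5))
    _, fast = timed(download_concurrently(5))
    print(f"sequential: {slow:.2f}s, concurrent: {fast:.2f}s")
```

The sequential version pays the delay once per chapter, while the concurrent version pays it roughly once in total, because all the waits overlap.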

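Fetching Data Asynchronously

We start by defining a fetch function that catches data from the ocean of the network and can retry, just in case it runs into a dragon (a network error).

async def fetch(url, session, retries=3):
    # Send the request asynchronously, retrying with exponential backoff
    ...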
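Exponential backoff means the wait doubles after every failed attempt. The sketch below shows that pattern on its own, without any network code. Both retry_with_backoff and make_flaky are hypothetical helpers written for this illustration, and catching every Exception is a simplifying assumption; a real crawler would catch only network errors.

```python
import asyncio


async def retry_with_backoff(make_attempt, retries=3, base_delay=1.0):
    """Await make_attempt() up to `retries` times.

    After the n-th failure (counting from 0) it sleeps base_delay * 2**n seconds.
    """
    for attempt in range(retries):
        try:
            return await make_attempt()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the last error
            await asyncio.sleep(base_delay * 2 ** attempt)


def make_flaky(failures):
    """Return a coroutine function that fails `failures` times, then succeeds."""
    calls = {"count": 0}

    async def attempt():
        calls["count"] += 1
        if calls["count"] <= failures:
            raise ConnectionError(f"simulated failure {calls['count']}")
        return calls["count"]  # the number of the call that finally succeeded

    return attempt


if __name__ == "__main__":
    # Two simulated failures, then success on the third call.
    print(asyncio.run(retry_with_backoff(make_flaky(2), base_delay=0.1)))
```

With base_delay=1.0 and retries=3, the waits are 1 second and then 2 seconds, which matches the 2 ** attempt delay used in the full listing.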

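Parsing the List Page

Next, we use the mysterious lxml tablet to parse the list page and find the secret passage (URL) to each chapter.

async def parse_list_page(url, session):
    # Extract the book title and the chapter URLs
    ...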
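The href values found on a list page are often relative, so they have to be resolved against the page's own URL before they can be fetched. A minimal sketch using urljoin from the standard library follows. The resolve_chapter_urls helper and the example.com addresses are assumptions for illustration, not part of the original crawler.

```python
from urllib.parse import urljoin


def resolve_chapter_urls(list_page_url, hrefs):
    """Turn the href values found on a list page into absolute URLs."""
    return [urljoin(list_page_url, href) for href in hrefs]


if __name__ == "__main__":
    base = "https://example.com/book/40420/"  # placeholder list page URL
    hrefs = [
        "1.html",                            # relative to the list page
        "/book/40420/2.html",                # relative to the site root
        "https://example.com/other/3.html",  # already absolute
    ]
    for absolute in resolve_chapter_urls(base, hrefs):
        print(absolute)
```

urljoin handles all three href styles, whereas os.path.join is meant for file system paths: it would use backslashes on Windows and would drop the host when the href starts with a slash.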

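Downloading and Saving Chapters

Finally, we cast the fetch_and_save_chapter spell on every chapter, which not only captures the chapter content but also saves it in our spellbook (the hard drive).

async def fetch_and_save_chapter(url, book_name, chapter_name, session):
    # Download the chapter content asynchronously and save it
    ...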
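Because each chapter title becomes a file name, a title containing characters such as ? or : can make the save fail on some systems. One possible guard is sketched below. The safe_filename helper, its set of rejected characters, and the "untitled" fallback are assumptions chosen for this example rather than part of the original code.

```python
import re

# Characters that Windows rejects in file names, plus ASCII control characters.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(name, replacement="_"):
    """Replace characters that are unsafe in file names and trim the ends."""
    cleaned = _INVALID_CHARS.sub(replacement, name).strip(" .")
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or "untitled"


if __name__ == "__main__":
    print(safe_filename("Chapter 1: What?"))
    print(safe_filename("a/b\\c"))
```

A crawler could pass the chapter title through such a helper before building the file path.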

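Starting Our Magic Ritual

Putting all the magic together, we wake this powerful asynchronous crawler with the asyncio staff.

async def main(url):
    # Organize and run all the asynchronous tasks
    ...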
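Launching every chapter at once can overwhelm a site or get the crawler blocked, so it is often worth capping how many requests run at the same time. The sketch below does this with asyncio.Semaphore. The run_limited helper, the ConcurrencyProbe class, and the default limit of 10 are hypothetical and only demonstrate the pattern.

```python
import asyncio


async def run_limited(coro_factories, limit=10):
    """Run the coroutines produced by coro_factories, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(factory):
        # Only `limit` tasks can be inside this block at the same moment.
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in coro_factories))


class ConcurrencyProbe:
    """Hypothetical job that records how many copies run simultaneously."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    async def job(self, value):
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)  # stands in for a network request
        self.active -= 1
        return value


if __name__ == "__main__":
    probe = ConcurrencyProbe()
    jobs = [lambda i=i: probe.job(i) for i in range(8)]
    results = asyncio.run(run_limited(jobs, limit=3))
    print(results, "peak concurrency:", probe.peak)
```

The helper takes zero-argument callables rather than coroutine objects, so each job is created only once it is allowed to run.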

Complete Code

import asyncio
import os
import time
from urllib.parse import urljoin

import aiofiles
import aiohttp
from lxml import etree

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}


# Asynchronous fetch function with retry logic
async def fetch(url, session, retries=3):
    for attempt in range(retries):
        try:
            # Try to send the GET request
            async with session.get(url) as response:
                if response.status == 200:
                    # Success: return the raw response body
                    return await response.read()
                # Report the non-200 status, then fall through to the backoff below
                print(f"Attempt {attempt + 1} failed, status code: {response.status}")
        except (aiohttp.ClientError, asyncio.TimeoutError) as e:
            # Network-level failure, e.g. an incomplete payload or a dropped connection
            print(f"Attempt {attempt + 1} failed: {e!r}")
        if attempt < retries - 1:
            # Not the last attempt: wait, then retry
            await asyncio.sleep(2 ** attempt)  # exponential backoff
    # Every attempt failed: raise instead of silently returning None
    raise RuntimeError(f"Failed to fetch {url} after {retries} attempts")


async def parse_list_page(url, session):
    html = await fetch(url, session)
    tree = etree.HTML(html)
    # These XPath expressions depend on the target site's page layout
    book_name = tree.xpath('//div[@class="top"]/h1/text()')[0]
    chapter_urls = tree.xpath('//div[@class="section-box"]/ul/li/a/@href')
    chapter_names = tree.xpath('//div[@class="section-box"]/ul/li/a/text()')
    return book_name, chapter_urls, chapter_names


async def fetch_and_save_chapter(url, book_name, chapter_name, session):
    html = await fetch(url, session)
    tree = etree.HTML(html)
    chapter_content = tree.xpath('//div[@id="content"]/text()')
    # Runs of four non-breaking spaces mark paragraph indents; turn them into line breaks
    chapter_content = "\n".join(chapter_content).replace("\xa0\xa0\xa0\xa0", "\n")

    # One directory per book, one text file per chapter
    directory_path = f"./{book_name}"
    os.makedirs(directory_path, exist_ok=True)
    file_path = os.path.join(directory_path, f"{chapter_name}.txt")

    async with aiofiles.open(file_path, mode="w", encoding="utf-8") as f:
        await f.write(chapter_content)
    print(f"{chapter_name} saved")


async def main(url):
    async with aiohttp.ClientSession(headers=headers) as session:
        book_name, chapter_urls, chapter_names = await parse_list_page(url, session)
        tasks = []
        for chapter_url, chapter_name in zip(chapter_urls, chapter_names):
            # urljoin resolves relative and absolute hrefs; os.path.join is for file paths
            full_url = urljoin(url, chapter_url)
            tasks.append(fetch_and_save_chapter(full_url, book_name, chapter_name, session))
        await asyncio.gather(*tasks)


if __name__ == '__main__':
    # Time the whole run
    start_time = time.time()
    url = "https://www.biquge635.com/book/40420/"
    asyncio.run(main(url))
    end_time = time.time()
    print(f"Elapsed: {end_time - start_time:.2f}s")
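One thing to keep in mind with asyncio.gather is that, by default, the first exception raised by any task is propagated to the caller, so a single bad chapter can end the whole run. A possible variation passes return_exceptions=True and sorts the outcomes afterwards. The gather_with_report helper and the ok and broken coroutines below are hypothetical and use no network access.

```python
import asyncio


async def gather_with_report(named_coros):
    """Run all coroutines in named_coros (name -> coroutine).

    Returns (succeeded, failed): results by name, and exceptions by name.
    """
    names = list(named_coros)
    # With return_exceptions=True, errors come back as values instead of being raised.
    outcomes = await asyncio.gather(*named_coros.values(), return_exceptions=True)
    succeeded, failed = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            failed[name] = outcome
        else:
            succeeded[name] = outcome
    return succeeded, failed


async def ok(text):
    await asyncio.sleep(0)
    return text


async def broken():
    await asyncio.sleep(0)
    raise ConnectionError("simulated download failure")


async def demo():
    return await gather_with_report({"ch1": ok("a"), "ch2": broken(), "ch3": ok("c")})


if __name__ == "__main__":
    succeeded, failed = asyncio.run(demo())
    print("saved:", sorted(succeeded), "failed:", sorted(failed))
```

The names in the failed dictionary could then be retried or reported at the end of the run.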