Hands-on project: complete Python code for scraping the novel 《蛊真人》

Article link: https://blog.csdn.net/2402_86248070/article/details/144152548

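Disclaimer:

1. The site data scraped in this project is for learning and discussion only and is not used for any commercial purpose.
2. I respect the original site's copyright and intellectual property, and I will not alter or distribute the scraped data or use it for any illegal purpose.
3. The scraping strictly complies with the applicable laws and regulations in China. If it infringes the original site's rights, please contact me promptly; I will stop scraping immediately and delete the data.
4. This project accepts no liability for any loss or damage caused by use of the scraped data.
5. Users must comply with this disclaimer and bear the consequences of any violation.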

Part 1: Required libraries

from fake_useragent import UserAgent
import requests
from lxml import html
from urllib.parse import urljoin
import json
import os
import re

The third-party libraries can be installed with a single command:

pip install fake-useragent requests lxml

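Notes:

fake-useragent: generates User-Agent strings that mimic real browsers.
requests: sends HTTP requests.
lxml: parses HTML and XML, with XPath support.
urllib.parse: Python standard library; urljoin resolves a relative link against a base URL (built in, no installation needed).
json: Python standard library; reads and writes JSON data (built in, no installation needed).
os: Python standard library; file and operating-system functions (built in, no installation needed).
re: Python standard library; regular expressions (built in, no installation needed).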

Part 2: Check whether the novel site responds

import requests
from fake_useragent import UserAgent

# Target site
url = "https://www.22biqu.com/biqu1/"

# Headers that make the request look like it comes from a browser
headers = {
    'User-Agent': UserAgent().random,
    'Referer': 'https://www.beqege.cc/',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}

try:
    # Send the request; give up if the server is silent for 10 seconds
    response = requests.get(url, headers=headers, timeout=10)

    # Check the response status
    if response.status_code == 200:
        print("Request succeeded, response body:")
        print(response.text[:100])  # print only the first 100 characters
    else:
        print(f"Request failed, status code: {response.status_code}")
except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")

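Setting the request URL and headers:
- url holds the address of the site to request.
- headers defines the HTTP request headers:
  - User-Agent: mimics a browser so that the site is less likely to reject the request.
  - Referer: states which page the request came from, so that it does not look like a direct visit.
  - Accept-Language: asks for Chinese content.
Sending the request:
- requests.get(url, headers=headers, timeout=10) sends a GET request to the URL with the headers defined above.
- If the server does not respond within 10 seconds, requests raises a timeout exception (requests.exceptions.Timeout).
Checking the response status:
- response.status_code == 200 tests whether the status code is 200, which means the request succeeded.
- On success, the first 100 characters of the response are printed (response.text[:100]).
- Otherwise, only the status code is printed.
Exception handling:
- The try-except statement catches errors raised while the request is in flight, such as a timeout or a connection failure, and prints the error message. requests.exceptions.RequestException is the base class of these errors, so one except clause covers them all.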
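A single failed request does not have to end the run. The sketch below is one possible way to retry a call a few times before giving up; it is not part of the original script. The names fetch_with_retry and flaky_get are hypothetical, and flaky_get only simulates a network call so that the example runs offline. The default retry_on=(OSError,) relies on requests.exceptions.RequestException being a subclass of IOError, which is OSError in Python 3.

```python
import time


def fetch_with_retry(get, url, retries=3, delay=1.0, retry_on=(OSError,)):
    """Call get(url) up to `retries` times, sleeping `delay` seconds between attempts.

    `get` is any callable that returns a response or raises. With the script
    above it could be: lambda u: requests.get(u, headers=headers, timeout=10)
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return get(url)
        except retry_on as e:
            last_error = e
            print(f"Attempt {attempt}/{retries} failed: {e}")
            if attempt < retries:
                time.sleep(delay)
    # Every attempt failed: re-raise the last error for the caller to handle
    raise last_error


# Hypothetical stand-in for a flaky network call; no real request is made.
calls = {"count": 0}


def flaky_get(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return f"fake response for {url}"


result = fetch_with_retry(flaky_get, "https://example.com/", retries=3, delay=0)
print(result)
```

In a real crawl a non-zero delay also keeps the load on the target site low.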

HTTP status codes

An HTTP status code tells the client how the server handled its request:

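Class	Range	Meaning	Common code	Description
1xx	100-199	Informational	100	Continue (the first part of the request was accepted; the client may continue)
2xx	200-299	Success	200	OK (the request was processed and a response returned)
3xx	300-399	Redirection	301	Moved Permanently (the resource now lives at a new location)
4xx	400-499	Client error	400	Bad Request (the request has a syntax error or is otherwise invalid)
5xx	500-599	Server error	500	Internal Server Error (the server hit an error and could not complete the request)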

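In short:

2xx means the request succeeded.
3xx means a redirect; the client has to take a further step.
4xx means a client error, usually a problem with the request itself.
5xx means a server error; the server could not process the request.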
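The classes in the table follow from the first digit of the code, so a scraper can branch on the class instead of listing every code. A minimal sketch using only the standard library; status_category is a hypothetical helper name, not something defined by requests:

```python
from http import HTTPStatus


def status_category(code):
    """Map a numeric status code to the class names used in the table above."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 keeps only the leading digit of a 3-digit code
    return categories.get(code // 100, "unknown")


for code in (100, 200, 301, 400, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, HTTPStatus(code).phrase, "->", status_category(code))
```

With requests, the same check would be applied to response.status_code.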

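Part 3: Create the folder for the scraped content

For a small job it is easiest to set the folder name by hand; add more code only if this step needs to be automated.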

output_dir = "蛊真人"  # change to your target path
os.makedirs(output_dir, exist_ok=True)

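This code creates a folder (directory) at the given path and does not raise an error if the folder already exists.

os.makedirs(output_dir, exist_ok=True)
os.makedirs creates directories recursively: if some of the parent directories in the path are missing, it creates them as well.

exist_ok=True: if the directory already exists, no error is raised.
exist_ok=False (the default): if the directory already exists, FileExistsError is raised.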
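The difference between the two settings can be checked in a throwaway directory. This is a small sketch; the folder names "novels" and "chapters" are placeholders chosen for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    nested = os.path.join(base, "novels", "chapters")

    os.makedirs(nested, exist_ok=True)  # creates both "novels" and "chapters"
    os.makedirs(nested, exist_ok=True)  # second call: directory exists, no error
    created = os.path.isdir(nested)

    try:
        os.makedirs(nested)             # exist_ok defaults to False
        raised = False
    except FileExistsError:
        raised = True

print(f"created={created}, FileExistsError raised={raised}")
```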

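Part 4: Check that the chapter titles can be read

Purpose: extract every chapter title and link, following the pagination links to cover the whole chapter index.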

chapter_list = []

# URL of the first index page
url = "https://www.22biqu.com/biqu440/1/"  # change to your starting page
headers = {
    'User-Agent': UserAgent().random,
    'Referer': 'https://www.beqege.cc/',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}

while url:
    # Request the current page
    response = requests.get(url, headers=headers, timeout=10)

    # Parse the HTML
    tree = html.fromstring(response.text)

    # Locate the chapter list
    chapter_links = tree.xpath("//ul[@class='section-list fix']//a/@href")
    chapter_titles = tree.xpath("//ul[@class='section-list fix']//a/text()")

    if chapter_links and chapter_titles:
        print(f"Extracted {len(chapter_titles)} chapters:")
        for title, link in zip(chapter_titles, chapter_links):
            full_url = urljoin(url, link)  # absolute chapter URL
            print(f"Chapter title: {title.strip()} | Link: {full_url}")

            # Store the chapter unless its URL is already in chapter_list.
            # The stored href is absolute, so compare against full_url, not link.
            if title and link and not any(d['href'] == full_url for d in chapter_list):
                chapter_list.append({
                    'title': title.strip(),  # strip surrounding whitespace
                    'href': full_url
                })
    else:
        print("No chapters extracted; the XPath selector may be wrong, please check it.")

    # Running total across all pages fetched so far
    print(f"Chapters collected so far: {len(chapter_list)}")

    # Look for the link to the next index page
    next_page = tree.xpath("//a[@class='index-container-btn' and contains(text(), '下一页')]/@href")

    if next_page:
        # urljoin turns the href into a complete URL
        url = urljoin(url, next_page[0])
        print(f"Next page: {url}")
    else:
        print("No next page, index crawl finished.")
        break

# Print the final chapter list
print(f"Total chapters collected: {len(chapter_list)}")
for chapter in chapter_list:
    print(f"Title: {chapter['title']} | Link: {chapter['href']}")

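1. Initialise the data structure and the starting page

chapter_list = []  # list that stores the chapter data
url = "https://www.22biqu.com/biqu440/1/"  # first index page

chapter_list is an empty list that will hold the scraped chapter information.
url is the page where the crawl starts.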

2. Loop over the index pages

while url:
    # Request the current page
    response = requests.get(url, headers=headers, timeout=10)

The while url: loop keeps running until every index page has been fetched.

3. Parse the HTML

tree = html.fromstring(response.text)

response.text is the HTML returned by the site.
html.fromstring(response.text) uses lxml to turn that HTML into an element tree, tree, that can be queried with XPath.

4. Extract the chapter links and titles

chapter_links = tree.xpath("//ul[@class='section-list fix']//a/@href")
chapter_titles = tree.xpath("//ul[@class='section-list fix']//a/text()")

xpath() is the lxml method that extracts data from the parsed document.
chapter_links holds the href attribute of every chapter link on the page. The query //ul[@class='section-list fix']//a/@href selects the href of each <a> tag inside the chapter list.
chapter_titles holds the chapter titles. The query //ul[@class='section-list fix']//a/text() selects the text of each <a> tag inside the chapter list.

5. Check and store the chapters

if chapter_links and chapter_titles:
    for title, link in zip(chapter_titles, chapter_links):
        full_url = urljoin(url, link)
        print(f"Chapter title: {title.strip()} | Link: {full_url}")

        if title and link and not any(d['href'] == full_url for d in chapter_list):
            chapter_list.append({'title': title.strip(), 'href': full_url})

The code first checks that chapter_links and chapter_titles are not empty, and only then processes them.
zip(chapter_titles, chapter_links) pairs each title with its link, and the loop prints every pair.
urljoin(url, link) makes sure the link is a complete URL rather than a relative path.
chapter_list.append({...}) stores the title and link in chapter_list. If the same absolute URL is already in chapter_list (the chapter was seen on an earlier page), it is not added again.
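Because chapter_list stores absolute URLs, a duplicate check is only reliable when it compares absolute URLs too. The sketch below shows how urljoin resolves different kinds of href and how a set can do the duplicate check. The href values are hypothetical and the real site's paths may differ.

```python
from urllib.parse import urljoin

page_url = "https://www.22biqu.com/biqu440/1/"

# Hypothetical href values: root-relative, page-relative, and a repeat
hrefs = ["/biqu440/100.html", "101.html", "/biqu440/100.html"]

chapter_urls = []
seen = set()  # absolute URLs that are already stored
for href in hrefs:
    full_url = urljoin(page_url, href)  # resolve against the current page
    if full_url in seen:
        continue                        # same chapter seen before, skip it
    seen.add(full_url)
    chapter_urls.append(full_url)

for u in chapter_urls:
    print(u)
```

A set lookup also stays fast as the list grows, whereas scanning chapter_list with any() rereads every stored chapter for each new link.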

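6. Find the next-page link

next_page = tree.xpath("//a[@class='index-container-btn' and contains(text(), '下一页')]/@href")
if next_page:
    next_url = next_page[0]
    url = urljoin(url, next_url)
    print(f"Next page: {url}")
else:
    print("No next page, index crawl finished.")
    break

The XPath query looks for the link on the "下一页" (next page) button. contains(text(), '下一页') matches an <a> tag whose text contains "下一页".
If a next-page link is found, url is updated to the new address and the loop fetches that page.
If no next-page link is found (the last page has been reached), the script prints a message and leaves the loop with break, which ends the crawl.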
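The loop above trusts the site to stop offering a next-page link eventually. A defensive variant remembers the pages it has visited and stops if a link points back to one of them. This is a sketch under stated assumptions: walk_pages is a hypothetical helper, and fake_site is an in-memory stand-in for the real pages so that the example runs without network access.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for the site: page URL -> href of its "next page" link
fake_site = {
    "https://example.com/list/1/": "/list/2/",
    "https://example.com/list/2/": "/list/3/",
    "https://example.com/list/3/": None,  # last page: no next-page link
}


def walk_pages(start_url, next_href_of, max_pages=1000):
    """Follow next-page links from start_url and return the visited URLs in order."""
    visited = []
    url = start_url
    # Stop on a missing link, a repeated page, or the page limit
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        next_href = next_href_of(url)
        url = urljoin(url, next_href) if next_href else None
    return visited


pages = walk_pages("https://example.com/list/1/", fake_site.get)
for p in pages:
    print(p)
```

In the real script, next_href_of would fetch the page, run the XPath query, and return next_page[0] or None.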

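7. Print the chapter list

print(f"Total chapters collected: {len(chapter_list)}")
for chapter in chapter_list:
    print(f"Title: {chapter['title']} | Link: {chapter['href']}")

The first line prints how many chapters were collected (len(chapter_list)).
The loop then walks through chapter_list and prints the title and link of each chapter.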

Part 5: Scrape the novel content

import os
import json
import requests
from urllib.parse import urljoin
from lxml import html

# headers and chapter_list come from Part 4

# Replace characters that are not allowed in file names,
# which avoids "OSError: [Errno 22] Invalid argument"
def sanitize_filename(filename):
    return filename.replace(':', '：').replace('/', '_').replace('\\', '_') \
                   .replace('*', '_').replace('?', '_').replace('"', '_') \
                   .replace('<', '_').replace('>', '_').replace('|', '_')

# Output folder
output_dir = "蛊真人"  # change to your target path
os.makedirs(output_dir, exist_ok=True)

# File that records the finished chapters
completed_file = "蛊真人.json"

# Load the finished chapters if the file exists and is not empty
if os.path.exists(completed_file) and os.path.getsize(completed_file) > 0:
    with open(completed_file, "r", encoding="utf-8") as f:
        completed_chapters = set(json.load(f))
else:
    completed_chapters = set()
    print("No record of finished chapters (or the file is empty), starting a fresh crawl...")

# Visit every chapter link and fetch its content, following in-chapter pagination
for chapter in chapter_list:
    sanitized_title = sanitize_filename(chapter['title'])  # safe file name
    if sanitized_title in completed_chapters:
        print(f"Chapter 《{chapter['title']}》 already scraped, skipping...")
        continue

    chapter_url = chapter['href']
    full_content = []  # text of every page of this chapter
    failed = False     # set when any page of the chapter cannot be fetched
    print(f"Scraping chapter: {chapter['title']} | Start link: {chapter_url}")

    while chapter_url:
        try:
            # Request the chapter page
            chapter_response = requests.get(chapter_url, headers=headers, timeout=10)
            chapter_response.raise_for_status()  # raise on an error status code

            # Parse the chapter page
            chapter_tree = html.fromstring(chapter_response.text)

            # Extract the text of the current page
            content = chapter_tree.xpath("//div[@id='content']//text()")
            if content:
                full_content.append(''.join(content).strip())

            # Follow the next-page link if there is one
            next_page = chapter_tree.xpath("//a[contains(text(), '下一页')]/@href")
            chapter_url = urljoin(chapter_url, next_page[0]) if next_page else None
        except Exception as e:
            print(f"Request for chapter 《{chapter['title']}》 failed: {e}")
            failed = True
            break

    chapter_text = '\n'.join(full_content)
    if failed or not chapter_text:
        # Leave the chapter unmarked so that the next run retries it
        print(f"Chapter 《{chapter['title']}》 was not saved.")
        continue

    # Save the chapter to a local file
    filename = os.path.join(output_dir, f"{sanitized_title}.txt")
    with open(filename, "w", encoding="utf-8") as f:
        f.write(chapter_text)
    print(f"Chapter 《{chapter['title']}》 saved to: {filename}")

    # Mark the chapter as finished and update the JSON record
    completed_chapters.add(sanitized_title)
    with open(completed_file, "w", encoding="utf-8") as f:
        json.dump(list(completed_chapters), f, ensure_ascii=False)

print("Crawl finished!")
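The script above rewrites the JSON record after every chapter, so an interruption in the middle of a write could leave a truncated file. One common way to reduce that risk is to write to a temporary file and then swap it into place with os.replace. The sketch below illustrates the idea; load_completed and save_completed are hypothetical helper names, and the chapter titles are placeholders.

```python
import json
import os
import tempfile


def load_completed(path):
    """Return the set of finished chapter names, or an empty set if none are recorded."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "r", encoding="utf-8") as f:
            return set(json.load(f))
    return set()


def save_completed(path, completed):
    """Write the record to a temporary file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(sorted(completed), f, ensure_ascii=False)
    os.replace(tmp_path, path)  # the old record stays intact until this call


with tempfile.TemporaryDirectory() as workdir:
    progress_file = os.path.join(workdir, "progress.json")

    done = load_completed(progress_file)  # first run: nothing recorded yet
    done.add("Chapter 1")                 # placeholder titles
    done.add("Chapter 2")
    save_completed(progress_file, done)

    reloaded = load_completed(progress_file)

print(sorted(reloaded))
```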

Part 6: Combine everything into one Markdown file

import os
import re

# Mapping from Chinese numerals to Arabic numbers
hanzi_to_num = {
    "零": 0, "一": 1, "二": 2, "三": 3, "四": 4, "五": 5, "六": 6, "七": 7, "八": 8, "九": 9,
    "十": 10, "百": 100, "千": 1000, "万": 10000
}

# Convert a Chinese numeral to an Arabic number
def convert_hanzi_to_num(hanzi):
    # Handles numerals such as "一", "二", "三", "十", "二十", "三百"
    result = 0
    temp = 0
    for char in hanzi:
        if char in hanzi_to_num:
            num = hanzi_to_num[char]
            if num == 10 or num == 100 or num == 1000 or num == 10000:
                if temp == 0:
                    temp = 1  # a unit with no digit in front of it counts as 1
                result += temp * num
                temp = 0
            else:
                temp += num
    result += temp
    return result

# Read the files in modification-time order
def read_files_sorted_by_time(folder_path):
    # Collect every .txt file in the folder together with its modification time
    files_with_time = [
        (filename, os.path.getmtime(os.path.join(folder_path, filename)))
        for filename in os.listdir(folder_path)
        if filename.endswith('.txt')
    ]

    # Sort by modification time, oldest first
    files_sorted = sorted(files_with_time, key=lambda x: x[1])

    # Read the files in that order
    texts = {}
    total_files = len(files_sorted)

    for i, (filename, _) in enumerate(files_sorted):
        with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
            texts[filename] = file.read()  # store the file content in the dict

        # Print the progress
        progress = (i + 1) / total_files * 100  # percentage done
        print(f"Reading file {i + 1}/{total_files} ({progress:.2f}%) - {filename}")

    return texts  # dict in modification-time order

# Extract the chapter number (Arabic or Chinese numeral) from a file name
def extract_chapter_number(chapter_title):
    # Try Arabic digits first
    match = re.search(r'(\d+)', chapter_title)
    if match:
        return int(match.group(1))

    # Otherwise look for a Chinese numeral and convert it
    hanzi_match = re.search(r'([零一二三四五六七八九十百千万]+)', chapter_title)
    if hanzi_match:
        return convert_hanzi_to_num(hanzi_match.group(1))

    # No number at all: return infinity so the chapter sorts after all numbered ones
    return float('inf')

# Create the Markdown file
def create_markdown(novel_texts, output_file):
    with open(output_file, 'w', encoding='utf-8') as md_file:
        # Write the book title
        md_file.write("# 蛊真人\n\n")  # change the title as needed

        total_chapters = len(novel_texts)

        # Sort by the number in the file name; files without a number sort last
        sorted_chapters = sorted(novel_texts.items(), key=lambda x: extract_chapter_number(x[0]))

        # Add the chapters
        for i, (chapter_title, content) in enumerate(sorted_chapters):
            # Chapter title as a second-level Markdown heading ('##')
            md_file.write(f"## {chapter_title}\n\n")
            # Chapter content
            md_file.write(f"{content}\n\n")

            # Print the progress
            progress = (i + 1) / total_chapters * 100  # percentage done
            print(f"Writing chapter {i + 1}/{total_chapters} ({progress:.2f}%) - {chapter_title}")

# Main program
def main(folder_path, output_file):
    # Read the files in modification-time order
    print("Reading and sorting files...")
    novel_texts = read_files_sorted_by_time(folder_path)

    # Create the Markdown file
    print("Generating the Markdown file...")
    create_markdown(novel_texts, output_file)

    print("Markdown file generated!")

# Folder that holds the chapter files; replace with your actual path
folder_path = r"path"

# Path of the Markdown file to generate
output_file = r"path/name.md"

# Run the main program
main(folder_path, output_file)

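Reading the files in modification-time order:
- read_files_sorted_by_time reads every .txt file in the given folder, ordered by modification time.
- It returns a dict whose keys are file names and whose values are the file contents.
Creating the Markdown file:
- create_markdown turns the loaded texts into a single Markdown file, one chapter per source file. It writes each chapter title and its content, ordered by the number found in the file name.
- The program prints its progress while the file is being generated.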
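The numeral conversion in the script adds each unit independently, which works for typical chapter numbers but gives 20010 instead of 120000 for an input such as 十二万. That rarely matters for chapter numbering. For completeness, here is a sketch of a variant that treats 万 as a multiplier for everything before it. hanzi_to_int is a hypothetical name, the function is not the author's, and it does not handle digit-by-digit spellings such as 二零二四.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}


def hanzi_to_int(text):
    """Convert a Chinese numeral below 一亿 (10^8) to an int."""
    total = 0    # value of the completed 万 part
    section = 0  # value accumulated below the current 万
    digit = 0    # digit waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in UNITS:
            # A unit with no digit in front of it counts as 1 (十 -> 10)
            section += (digit or 1) * UNITS[ch]
            digit = 0
        elif ch == "万":
            # 万 multiplies everything accumulated so far in this section
            total += ((section + digit) or 1) * 10000
            section = 0
            digit = 0
        else:
            raise ValueError(f"not a Chinese numeral character: {ch!r}")
    return total + section + digit


for sample in ("十", "二十", "一百零五", "一千二百三十四", "十二万"):
    print(sample, hanzi_to_int(sample))
```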

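Summary:

The goal is to read the novel's chapters from a folder (assuming each .txt file is one chapter) and produce a Markdown e-book in chapter order. The files are first read in modification-time order and then sorted by the number extracted from the file name. extract_chapter_number returns float('inf') for a name without a number, so such files sort to the end. Python's sort is stable, so files with the same number keep their modification-time order.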
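The ordering rule can be checked on a handful of names. This is a simplified sketch: chapter_key is a hypothetical key function that looks at Arabic digits only, and the file names are made up for the example.

```python
import re


def chapter_key(name):
    """Sort key: the first number in the name, or infinity if there is none."""
    match = re.search(r"\d+", name)
    return int(match.group()) if match else float("inf")


# Hypothetical file names, listed in modification-time order
by_mtime = ["preface.txt", "chapter 2.txt", "chapter 10.txt",
            "chapter 1.txt", "afterword.txt"]

ordered = sorted(by_mtime, key=chapter_key)
for name in ordered:
    print(name)
```

The numbered chapters come out as 1, 2, 10 (numeric order, not text order), and the two unnumbered files follow in their original relative order.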