[Python: Scraping Specified Data from the Zongheng Novel Site and Storing It in a MySQL Database] 2

随缘859

Last modified: 2024-09-07 14:46:51

First published: 2024-09-07 14:37:27

Article link: https://blog.csdn.net/weixin_59638462/article/details/141995210

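Requirement: visit the information for every novel and create one directory per novel, named after the novel. The directory holds all of that novel's chapters, one txt file per chapter, with the chapter title as the file name and the chapter text as the file content.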

import json
import os
import re
from lxml import etree
import requests

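num = 0  # running count of chapters written across all books, used as a file-name prefix
n = 0    # running count of books processed, used in the directory name
url = 'https://book.zongheng.com/store/c0/c0/b0/u4/p1/v0/s9/t0/u0/i1/ALL.html'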

Send an HTTP request to fetch the book list

response = requests.get(url=url)
content = response.content.decode('utf8')

Extract the book information with regular expressions

book_url = set(re.findall('<a href="https://book.zongheng.com/book/(.*?)" target="_blank">', content))
url_chapter = 'https://bookapi.zongheng.com/api/chapter/getChapterList'
headers = {
    'user-agent': '******',
    'Cookie': '******'}
# titles = re.findall(r'<a href="https://book.zongheng.com/book/\d+.html" target="_blank">(.*?)</a>', content)
# print(titles,len(titles))
id_book = re.findall(
    '<a href="https://book.zongheng.com/book/(.*?).html" target="_blank">([\u4e00-\u9fa5\u3000-\u303f\uff00-\uffefA-Za-z]+)</a>',
    content)
# print(id_book, len(id_book))

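Here, book_url collects the URL fragment of each book (it is not used again in the rest of the script), while id_book collects each book's ID and title. Note that the title pattern only accepts Chinese characters, CJK and full-width punctuation, and ASCII letters, so a title containing digits or spaces is not matched.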
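To see what the title pattern does and does not capture, here is a minimal sketch run against a made-up HTML fragment. The two anchor tags and their IDs are hypothetical, not copied from the site, and the ID group uses `\d+` (as in the commented-out line above) on the assumption that book IDs are numeric.

```python
import re

# Hypothetical fragment imitating the anchor tags on the listing page
sample_html = (
    '<a href="https://book.zongheng.com/book/1001.html" target="_blank">示例小说</a>'
    '<a href="https://book.zongheng.com/book/1002.html" target="_blank">Book2024</a>'
)

# Same character class as the script; the dots in the URL are escaped
pattern = re.compile(
    r'<a href="https://book\.zongheng\.com/book/(\d+)\.html" target="_blank">'
    r'([\u4e00-\u9fa5\u3000-\u303f\uff00-\uffefA-Za-z]+)</a>'
)

id_book = pattern.findall(sample_html)
print(id_book)  # the second title contains digits, so only the first book is matched
```

If titles with digits should be kept as well, adding `0-9` to the character class is one possible adjustment.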

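Loop over each book

Increment the counter n.

Build the request data, which contains the book ID.

Send a POST request to fetch the chapter list.

Parse the JSON response to get the chapter information.

Create a directory to hold the book's chapters.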

for j in range(0, len(id_book)):
    n += 1
    data = {'bookId': id_book[j][0]}
    # print(data)
    response_b = requests.post(url=url_chapter, data=data)
    content_b = response_b.content.decode('utf8')
    chapters = json.loads(content_b)
    result1 = chapters['result']['chapterList']
    # Merge the chapterViewList of every entry in chapterList,
    # however many entries there are (including none)
    results = []
    for group in result1:
        results += group['chapterViewList']
    li_chapterId = []
    li_chapterName = []
    for result in results:
        li_chapterId.append(result['chapterId'])
        li_chapterName.append(result['chapterName'])
    # print(li_chapterId, li_chapterName)
    os.makedirs(f'./data/小说/第{n}本 {id_book[j][1]}', exist_ok=True)
    print(id_book[j][1])
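The JSON handling in this step can be tried offline. The sketch below flattens the nested chapter list so that every entry of chapterList is covered. The sample payload is invented for illustration and only mirrors the keys the script reads (`result`, `chapterList`, `chapterViewList`, `chapterId`, `chapterName`); the real response may carry more fields. `flatten_chapters` is a hypothetical helper name.

```python
import json

# Hypothetical response body with the same nesting the script reads
sample = json.dumps({
    "result": {
        "chapterList": [
            {"chapterViewList": [{"chapterId": 1, "chapterName": "第一章"}]},
            {"chapterViewList": [{"chapterId": 2, "chapterName": "第二章"}]},
            {"chapterViewList": [{"chapterId": 3, "chapterName": "第三章"}]},
        ]
    }
})


def flatten_chapters(body):
    """Return (chapterId, chapterName) pairs from every entry, in order."""
    groups = json.loads(body)["result"]["chapterList"]
    return [
        (chapter["chapterId"], chapter["chapterName"])
        for group in groups
        for chapter in group["chapterViewList"]
    ]


chapters = flatten_chapters(sample)
print(chapters)
```

Keeping ID and name together in one tuple also avoids having to keep two parallel lists in sync.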

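Extract the chapter content and save it to a file

Build the URL of the chapter page.

Send a GET request to fetch the chapter content.

Parse the HTML with lxml and extract the text.

Save the text to a file.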

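    for i in range(0, len(li_chapterId)):
        # print(id_book[j][0])
        url_content = f'https://read.zongheng.com/chapter/{id_book[j][0]}/{li_chapterId[i]}.html'
        response_c = requests.get(url=url_content, headers=headers)
        datas = response_c.content.decode("utf8")
        tree = etree.HTML(datas)
        # Collect the text of every <p> directly inside the content <div>
        text = tree.xpath('//div[@class="content"]/p/text()')
        text = "\n".join(text)
        # Build the path outside the f-string: nested quotes and backslashes in an
        # f-string expression are a syntax error before Python 3.12, and
        # os.path.join picks the right separator for the platform
        chapter_name = li_chapterName[i].replace('\t', ' ')
        file_path = os.path.join(f'./data/小说/第{n}本 {id_book[j][1]}', f'{num} {chapter_name}.txt')
        with open(file_path, 'w', encoding='utf8') as f:
            f.write(text)
        num += 1
        print(num, li_chapterName[i])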
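Chapter titles go straight into file names here, so a title containing a character such as `?` or `:` can make `open` fail on Windows. A minimal sketch of one way to guard against that follows; `safe_filename` is a hypothetical helper, not part of the original script, and the set of replaced characters is the usual Windows-forbidden set plus tab and line breaks.

```python
import re


def safe_filename(name, replacement=" "):
    """Replace characters that Windows forbids in file names, then trim."""
    cleaned = re.sub(r'[\\/:*?"<>|\t\r\n]', replacement, name)
    # Fall back to a placeholder if nothing usable is left
    return cleaned.strip() or "untitled"


print(safe_filename("第1章\t初入江湖?"))
```

It could be applied to both the novel title used for the directory and the chapter title used for the file.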

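This code uses neither try nor sleep, so it does not handle exceptions or limit the request rate. Both can be added to keep the script from crashing when a request fails and to avoid putting unnecessary load on the server.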
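As a rough sketch of what that could look like, the helper below retries a failing call and pauses between requests. `fetch_with_retry` and its parameters are hypothetical names chosen for illustration, and the retry count and delay are arbitrary defaults, not values tuned for this site. The request itself is passed in as a callable, so the block runs without any network access.

```python
import time


def fetch_with_retry(fetch, retries=3, delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call fetch() up to `retries` times, pausing `delay` seconds around attempts."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            result = fetch()
            sleep(delay)  # pause after a successful request to limit the rate
            return result
        except exceptions as error:
            last_error = error
            print(f"attempt {attempt} failed: {error}")
            if attempt < retries:
                sleep(delay)  # wait before trying again
    raise last_error


# Demo with a stand-in for a request that fails twice, then succeeds
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "page content"


print(fetch_with_retry(flaky, delay=0.01))
```

In the script above, the callable could be something like `lambda: requests.get(url_content, headers=headers, timeout=10)` with `exceptions=(requests.RequestException,)`, so that only network-related errors are retried.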