Python Web Scraping in Practice (Basics), Part 5: Fetching an xx Novel (Full Code Included)

一晌小贪欢

Last modified: 2023-07-27 15:29:06

Column: Python Web Scraping. Tags: python, web scraping, database, programming language

First published: 2023-07-27 14:56:18

Article link: https://blog.csdn.net/weixin_42636075/article/details/131958852

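Before we knew it, we have reached lesson 5 of our scraping basics course. Today we will fetch some content from a novel website for reading and study.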

PS: The earlier lessons are collected in this column; feel free to go back through them.

To begin, we will pick a book at random and fetch its content.

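Step 1: click the first chapter and look at the URL.

We can see that the trailing [57628310.html] is the page number of that chapter. Once we have found all of the page numbers, we can fetch the content of every chapter.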
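
As a small illustration of that URL pattern, the sketch below builds a chapter URL from the two numbers and splits one back apart. The helper names chapter_url and parse_chapter_url are made up for this example, and the pattern is assumed from the single chapter URL used in this post.

```python
import re

BASE = "https://book.zongheng.com/chapter"

def chapter_url(book_no, chapter_id):
    """Build a chapter URL of the form .../chapter/<book>/<chapter>.html."""
    return f"{BASE}/{book_no}/{chapter_id}.html"

def parse_chapter_url(url):
    """Return (book_no, chapter_id) as strings, or None if the URL does not match."""
    m = re.search(r"/chapter/(\d+)/(\d+)\.html$", url)
    return (m.group(1), m.group(2)) if m else None

url = chapter_url(867252, 57628310)
print(url)
print(parse_chapter_url(url))
```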

Next, let's try fetching the content of the first chapter.

Code 1 (fetch the page content; the full version is at the end):

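import requests

# URL of the first chapter; the trailing number is the chapter's page number
url = 'https://book.zongheng.com/chapter/867252/57628310.html'
# Download the page and keep the HTML as text
book_data = requests.get(url=url).text
print(book_data)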

OK, we got it!
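
Code 1 calls requests.get with its default settings. A slightly more defensive variant might set a timeout, send a browser-like User-Agent, and fail loudly on HTTP errors. This is only a sketch: the function name fetch_html, the header value, and the 10-second timeout are assumptions, and this post does not establish whether the site needs a custom User-Agent. The http argument is any object with a requests-style get method, such as the requests module itself or a requests.Session().

```python
def fetch_html(url, http, timeout=10):
    """Fetch a page and return its HTML text.

    `http` is the `requests` module or a `requests.Session()`.
    """
    headers = {"User-Agent": "Mozilla/5.0"}  # assumed value, adjust as needed
    resp = http.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return resp.text

# Usage (needs network access and the requests package):
#   import requests
#   page = fetch_html('https://book.zongheng.com/chapter/867252/57628310.html', requests)
```

Passing the HTTP client in as an argument also makes it easy to reuse one Session across many chapter requests.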

Step 2: data cleaning

Code 2 (data cleaning; the full version is at the end):

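import re
import requests

url = 'https://book.zongheng.com/chapter/867252/57628310.html'
res_str = requests.get(url=url).text
# print(res_str)
# The title is read from the content of the page's keywords meta tag
title = re.findall(r'name="keywords" content="(.*?)"/>', res_str)[0]
# Every <p>...</p> pair is treated as one paragraph of text
book_content = re.findall(r'<p>(.*?)</p>', res_str)
# print(book_content)
print(title)
# [:-1] leaves out the last <p> match
for c in book_content[:-1]:
    print(c)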
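
The same two regular expressions can be wrapped in a function and tried on a small string without touching the network. The sketch below is one way to do that; the name extract_chapter and the sample HTML are invented for illustration and are not taken from the real page. Unlike Code 2, it does not drop the last <p> match, so slice the result if that is needed.

```python
import html
import re

def extract_chapter(page):
    """Return (title, paragraphs) from a chapter page's HTML text."""
    m = re.search(r'name="keywords" content="(.*?)"\s*/>', page)
    title = m.group(1) if m else ""
    # re.S lets a paragraph span several lines
    paragraphs = re.findall(r"<p>(.*?)</p>", page, flags=re.S)
    # Decode entities such as &amp; and trim surrounding whitespace
    cleaned = [html.unescape(p).strip() for p in paragraphs]
    # Drop paragraphs that are empty after cleaning
    return title, [p for p in cleaned if p]

# Made-up sample page, only loosely shaped like the real one
sample = (
    '<meta name="keywords" content="Chapter 1"/>\n'
    "<p>First &amp; second.</p>\n<p>  Third.  </p>\n<p></p>"
)
print(extract_chapter(sample))
```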

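Step 3: get all chapters:

Click [Table of Contents] and look at the URL that the chapter list is loaded from.

From this we can see that each book has its own id, and that the request takes one parameter: [_] (underscore). The value used below, 1690437685919, has the form of a millisecond timestamp, while 867252 in the URL path appears to be the number that identifies the book.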
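
One quick way to check what kind of value the [_] parameter holds is to read it as milliseconds since the Unix epoch. The sketch below does that for 1690437685919, the value used in this post; the helper name from_millis is made up here. The result falls on 2023-07-27, the day this post was published, which suggests the value is a timestamp rather than something fixed per book.

```python
from datetime import datetime, timedelta, timezone

def from_millis(ms):
    """Interpret an integer (or digit string) as milliseconds since the Unix epoch, in UTC."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Integer milliseconds avoid float rounding in the conversion
    return epoch + timedelta(milliseconds=int(ms))

print(from_millis("1690437685919"))
```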

Code 3 (get the page numbers of all chapters, plus data cleaning; the full version is at the end):

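import json
import requests

book_id = '1690437685919'
# f-string, so that {book_id} is actually substituted into the URL
data = requests.get(f"https://book.zongheng.com/showchapterList/867252?_={book_id}").text
data = json.loads(data)
# print(data)
# Walk the chapter list of the first entry in showTomeViewList
for i in data['data']['showTomeViewList'][0]['chapterViewList']:
    print(f'Chapter: {i["chapterName"]}, chapter id: {i["chapterId"]}')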
Merging this with Code 2 gives the full version of the code:

Full code:

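book_id = '1690437685919'; you can replace this with the bookid you prefer. The number 867252 is also hard-coded in both URLs and appears to identify the book, so it likely needs changing as well to fetch a different one.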

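import json
import os
import re
import requests

book_id = '1690437685919'
# f-string, so that {book_id} is actually substituted into the URL
data = requests.get(f"https://book.zongheng.com/showchapterList/867252?_={book_id}").text
data = json.loads(data)
# print(data)
# Create the output folder ("结果" means "results") if it does not exist yet
os.makedirs("./结果", exist_ok=True)
for i in data['data']['showTomeViewList'][0]['chapterViewList']:
    # One output file per chapter, named after the chapter
    with open(f"./结果/xxxx-{i['chapterName']}", 'w', encoding='utf-8') as f:
        print(f'Chapter: {i["chapterName"]}, chapter id: {i["chapterId"]}')

        # Fetch the chapter page and extract the title and paragraphs, as in Code 2
        url = f'https://book.zongheng.com/chapter/867252/{i["chapterId"]}.html'
        res_str = requests.get(url=url).text
        # print(res_str)
        title = re.findall(r'name="keywords" content="(.*?)"/>', res_str)[0]
        book_content = re.findall(r'<p>(.*?)</p>', res_str)
        # print(book_content)
        print(title)
        # [:-1] leaves out the last <p> match
        for c in book_content[:-1]:
            # print(c)
            f.write(c + '\n')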
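
The full code uses the chapter name directly as a file name, which can fail if a name contains characters such as / or ?. One possible guard is sketched below; the helper names safe_filename and chapter_path, the .txt extension, and the 100-character limit are all assumptions for illustration, not part of the original code.

```python
import re
from pathlib import Path

def safe_filename(name, max_len=100):
    """Replace characters that are not allowed in Windows or Unix file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", name).strip(" .")
    return cleaned[:max_len] or "untitled"

def chapter_path(out_dir, chapter_name):
    """Create out_dir if needed and return the .txt path for one chapter."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{safe_filename(chapter_name)}.txt"

print(safe_filename("Chapter 1: What/Why?"))
```

In the loop above, open(chapter_path("./结果", i["chapterName"]), 'w', encoding='utf-8') could then stand in for the original open call.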