Scraping a Novel

Latest recommended article published on 2024-08-07 09:00:00

Harlan（lhl）

Tags: python, data mining

Article link: https://blog.csdn.net/weixin_42556361/article/details/105162708

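This post shares a worked example of scraping a novel with Python. Although plenty of free and paid novel sources exist today, writing a scraper lets you read for free and learn some programming along the way. The author explains how to set the User-Agent, how to use BeautifulSoup to locate the chapter titles and links, and finally how to loop over the chapters, fetch their content, and save the result to a txt file.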

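Hello everyone, I'm back again.

Today I'd like to share a worked example of scraping a novel.

You might say: apps and websites already let you read novels for free, so why bother scraping them yourself?

Fair question, so here is the answer:

There are indeed many sites where you can read novels for free, but the free ones tend to have a small catalog, and the ones with a large catalog tend to charge. Scraping a novel with Python lets you read it for free and learn some technique along the way, so why not?

Let's go back to the days before smartphones, when we read e-books as plain text files!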

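Preparation:

            A code editor (PyCharm, Sublime, etc.)

            If you need the Sublime editor, send me a private message (just unzip and use it)

            Python 3 (I'm using Python 3.6.4)

            The site to scrape (http://www.shicimingju.com)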

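Walkthrough:

            We'll use 《三国演义》 (Romance of the Three Kingdoms) as today's example

As shown in the screenshot, click “古籍” (Classics) and then “《三国演义》” in turn
👇👇👇
[Screenshot: navigating to 《三国演义》 on the site]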

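How to get the User-Agent
Right-click anywhere on the page -> Inspect, then open the Network tab, reload the page, select any request, and copy the User-Agent value from its request headers

👇👇👇

[Screenshot: browser developer tools]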

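Use BeautifulSoup from the bs4 package to locate tags
Send the request with urllib.request
Configure the User-Agent in headers
[Screenshot: request code with the User-Agent header]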
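As a minimal sketch of the step above, the snippet below builds a request object that carries a browser-like User-Agent, without sending anything over the network. The function name `build_request` is only illustrative, and the User-Agent value is just an example; use the one copied from your own browser.

```python
import urllib.request


def build_request(url):
    # Browser-like User-Agent; the exact value here is only an example
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
    }
    # Request only describes the request; nothing is sent until urlopen is called
    return urllib.request.Request(url=url, headers=headers)


request = build_request('http://www.shicimingju.com/book/sanguoyanyi.html')
# urllib stores header names in capitalized form, hence 'User-agent'
print(request.get_header('User-agent'))
```

Passing the object to `urllib.request.urlopen(request)` is what actually performs the request.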

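Create the soup object

Use BeautifulSoup to find the titles and links
[Screenshot: code that selects the titles and links]

In other words, find this
👇👇👇
[Screenshot: the chapter list in the page source]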
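To see what the selector does without touching the real site, here is a small sketch that runs it on a hand-written HTML fragment. The fragment and its chapter URLs are made up for illustration and only assume the table of contents is shaped like `.book-mulu > ul > li > a`. It uses the built-in `html.parser` so that lxml is not required, and `urljoin` as one possible way to build absolute links.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure of the table of contents
SAMPLE_HTML = """
<div class="book-mulu">
  <ul>
    <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
    <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
# Select every <a> that is a direct child of li, inside ul, inside .book-mulu
links = soup.select('.book-mulu > ul > li > a')

# Pair each title with an absolute URL built from the relative href
chapters = [
    (a.get_text(strip=True), urljoin('http://www.shicimingju.com', a['href']))
    for a in links
]
for title, href in chapters:
    print(title, href)
```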

Next step: loop over the titles and open each one
Use soup.find to get the chapter content

[Screenshot: code that fetches the chapter content]
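The following sketch shows `soup.find` on a made-up chapter page, assuming the chapter body sits in a `div` with class `chapter_content`. The helper name `extract_chapter_text` and the sample HTML are illustrative only. It also guards against the case where the tag is missing, since `find` returns `None` then.

```python
from bs4 import BeautifulSoup

# Hypothetical chapter page; the real page contains much more markup
SAMPLE_CHAPTER = (
    '<html><body><div class="chapter_content">'
    '<p>First paragraph.</p><p>Second paragraph.</p>'
    '</div></body></html>'
)


def extract_chapter_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    node = soup.find('div', class_='chapter_content')
    # find returns None when no tag matches, so check before reading the text
    if node is None:
        return ''
    # Join the text of nested tags with newlines and trim surrounding whitespace
    return node.get_text('\n', strip=True)


print(extract_chapter_text(SAMPLE_CHAPTER))
```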

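In the code,
string = title + '\n' + text + '\n'

means each chapter is written with the value of the variable title as its heading, then a newline, then the chapter text, then another newline
Create the sanguo.txt file
[Screenshot: code that writes to sanguo.txt]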
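A small sketch of the write step, run against a temporary file rather than the real sanguo.txt. The helper name `append_chapter` and the sample strings are made up for illustration. Note that the chapter title is passed in as a variable, not as the literal string 'title'.

```python
import os
import tempfile


def append_chapter(path, title, text):
    # Title on its own line, then the chapter text, then a trailing newline
    string = title + '\n' + text + '\n'
    # Mode 'a' appends, so each chapter is added after the previous one
    with open(path, 'a', encoding='utf8') as fp:
        fp.write(string)


# Write two placeholder chapters to a temporary file and read them back
demo_path = os.path.join(tempfile.mkdtemp(), 'sanguo_demo.txt')
append_chapter(demo_path, 'Chapter 1', 'Text of chapter one.')
append_chapter(demo_path, 'Chapter 2', 'Text of chapter two.')

with open(demo_path, encoding='utf8') as fp:
    saved = fp.read()
print(saved)
```

Because the file is opened in append mode, running the scraper twice adds the chapters twice; delete the old file first if you want a clean copy.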

The result looks like this

👇👇👇
[Screenshot: the generated sanguo.txt]

Full code:

from bs4 import BeautifulSoup
import urllib.request
import time




def handle_request(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
    }
    return urllib.request.Request(url=url, headers=headers)




def parse_content(content):
    # Build the soup object
    soup = BeautifulSoup(content, 'lxml')
    # Find all chapter titles and links
    a_title_href_list = soup.select('.book-mulu > ul > li > a')
    # print(a_title_href_list)
    # print(len(a_title_href_list))
    # Iterate over the list to get each title and link
    for oa in a_title_href_list:
        # Get the title
        title = oa.string
        # Build the absolute link
        href = 'http://www.shicimingju.com' + oa['href']
        # Request this href and get the chapter text
        text = get_chapter_text(href)


        # Chapter title on its own line, followed by the chapter text
        string = title + '\n' + text + '\n'


        with open('sanguo.txt', 'a', encoding='utf8') as fp:
            fp.write(string)


        time.sleep(2)




def get_chapter_text(href):
    # Build the request object
    request = handle_request(href)
    # Send the request and decode the response
    content = urllib.request.urlopen(request).read().decode('utf8')
    soup = BeautifulSoup(content, 'lxml')
    # Extract the chapter content
    text = soup.find('div', class_='chapter_content').text
    return text




def main():
    url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
    request = handle_request(url)
    # Send the request and get the response
    content = urllib.request.urlopen(request).read().decode('utf8')
    # Parse the content
    parse_content(content)




if __name__ == '__main__':
    main()