Python Web Scraping in Practice, Part 2: Crawling a Full Novel Across Multiple Pages and Saving Each Chapter Locally


不秃头的小李同学



Article link: https://blog.csdn.net/qq_41767945/article/details/92660524


Crawling a Full Novel Across Multiple Pages and Saving Each Chapter Locally

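Sometimes I need to crawl articles for my own development work. This post uses the classic novel 《西游记》 (Journey to the West) as the example; it has 100 chapters in total.
Before starting, install and import the libraries we need:
① beautifulsoup4
② requests
③ lxml
Import whichever libraries your task actually requires. You do not have to use the ones above; if you know another parsing library better, prefer that one.
The site crawled here is the Journey to the West novel on 诗词名句网 (shicimingju.com).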
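To illustrate the point that the parsing library is interchangeable, here is a minimal sketch that pulls a chapter title and its paragraphs out of a page using only the standard library's `html.parser`. The `ChapterParser` class and the `SAMPLE_HTML` snippet are made up for illustration; the real site's markup is more deeply nested, and the script in this post uses BeautifulSoup instead.

```python
from html.parser import HTMLParser


class ChapterParser(HTMLParser):
    """Collects the text of the first <h1> and of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.paragraphs = []
        self._current = None  # tag currently being captured, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'h1' and not self.title:
            self._current = 'h1'
        elif tag == 'p':
            self._current = 'p'
            self.paragraphs.append('')  # start a new paragraph buffer

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == 'h1':
            self.title += data
        elif self._current == 'p':
            self.paragraphs[-1] += data


# Hypothetical sample page, not the real site's HTML
SAMPLE_HTML = """
<div class="www-main-container">
  <h1>Chapter 1</h1>
  <div><p>First paragraph.</p><p>Second paragraph.</p></div>
</div>
"""

parser = ChapterParser()
parser.feed(SAMPLE_HTML)
print(parser.title)
print(parser.paragraphs)
```

For real pages a dedicated library such as BeautifulSoup or lxml is usually more convenient, because CSS selectors and XPath cope with nested markup far better than a hand-written handler.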

The complete code is as follows:

import os
import time

import requests
from bs4 import BeautifulSoup      # import the libraries we need
from lxml import etree

# Folder where the chapter files are saved (the author's local path; change it for your machine)
SAVE_DIR = r'C:\Users\LeeChoy\Desktop\APPs\经典文学\西游记素材\content'


def get_word(url):
    # Request headers: send a browser User-Agent so the request looks like a normal visit
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')  # parse the page with BeautifulSoup
    html = etree.HTML(response.text)  # lxml tree for XPath queries (not used below)
    title = soup.select(
        'body > div.layui-container > div.layui-row.layui-col-space10 > div.layui-col-md8.layui-col-sm7 > div.www-main-container.www-shadow-card > h1')
    chapter_title = title[0].get_text()
    # Create the output folder if it does not exist yet;
    # each file is named after its chapter title
    os.makedirs(SAVE_DIR, exist_ok=True)
    src = os.path.join(SAVE_DIR, chapter_title + '.txt')
    words = soup.select(
        'body > div.layui-container > div.layui-row.layui-col-space10 >'
        ' div.layui-col-md8.layui-col-sm7 > div.www-main-container.www-shadow-card > div > p')
    # Open the file as UTF-8; the with block closes it automatically
    with open(src, 'w', encoding='utf-8') as f:
        f.write(chapter_title + '\n')  # write the chapter title
        for word in words:
            print(word.text)
            f.write("  " + word.text + '\n')     # write the chapter text


if __name__ == '__main__':     # script entry point
    for i in range(1, 101):         # build the 100 chapter URLs the simplest way
        url = 'http://www.shicimingju.com/book/xiyouji/'
        url = url + str(i) + '.html'
        get_word(url)
        time.sleep(1)        # pause 1 second between requests to avoid an IP ban for crawling too fast
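The loop above builds each URL by string concatenation and uses the raw chapter title as the file name. As a rough sketch of how those two steps could be factored out, the helpers below generate the URL list up front and replace characters that Windows rejects in file names. Both `chapter_urls` and `safe_filename` are hypothetical names introduced here, not part of the original script, and the sample title is made up.

```python
import re


def chapter_urls(base_url, first, last):
    """Return the chapter page URLs from `first` to `last`, inclusive."""
    return [f'{base_url}{i}.html' for i in range(first, last + 1)]


def safe_filename(title):
    """Replace characters that Windows does not allow in file names."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()


# The base URL is the one used in the script above
urls = chapter_urls('http://www.shicimingju.com/book/xiyouji/', 1, 100)
print(len(urls), urls[0], urls[-1])

# Made-up title containing characters that would break open() on Windows
print(safe_filename('Chapter 1: What is "Tao"?'))
```

With helpers like these, the main loop could iterate over `urls` directly, and `get_word` could pass the scraped title through `safe_filename` before building the output path.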