Scraping Novel Content with a Web Crawler


想去的远方


Article link: https://blog.csdn.net/qq_42185999/article/details/87966550


PS: My environment is Spyder (Python 3.6).

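General approach: crawl the site level by level (novel list, then each novel's chapters, then each chapter's text). Create a folder named after each novel, create a .txt file named after each chapter inside it, and save the chapter text to that file.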
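The folder-per-novel, file-per-chapter layout can be tried on its own, without any network access. The snippet below is a minimal sketch, not the code used in this post: `safe_name`, `chapter_path`, and `save_chapter` are hypothetical helper names, and replacing the characters that Windows forbids in file names with `_` is an assumption about what a reasonable cleanup looks like.

```python
import os
import re


def safe_name(name):
    """Replace characters that Windows does not allow in file names (assumed cleanup rule)."""
    return re.sub(r'[\\/:*?"<>|]', "_", name).strip()


def chapter_path(novel_title, chapter_title, root="."):
    """Build root/<novel>/<chapter>.txt without touching the disk."""
    folder = os.path.join(root, safe_name(novel_title))
    return os.path.join(folder, safe_name(chapter_title) + ".txt")


def save_chapter(novel_title, chapter_title, text, root="."):
    """Create the novel folder if needed and write the chapter text into it."""
    path = chapter_path(novel_title, chapter_title, root=root)
    os.makedirs(os.path.dirname(path), exist_ok=True)  # no error if the folder already exists
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path
```

Using `os.path.join` instead of a hard-coded `"\\"` keeps the same layout working on Linux and macOS as well as Windows.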

import requests
from lxml import etree
import os

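# Design: object-oriented (inheritance, encapsulation)
class Spider(object):

    # Purpose: request the novel site, get each novel's title and link,
    # and create a folder named after each novel
    def start_request(self):
        # Request the site and get the page data
        response = requests.get("https://www.qidian.com/all")
        print(response.text)  # print the returned page content, for verification
        html = etree.HTML(response.text)  # parse the page into a structured tree
        bigtil_list = html.xpath('//div[@class="book-mid-info"]/h4/a/text()')  # what to extract: here, the novel titles
        bigsrc_list = html.xpath('//div[@class="book-mid-info"]/h4/a/@href')  # the link for each novel title
        print(bigtil_list, bigsrc_list)
        # Titles and links come back as two separate lists; zip pairs them up one-to-one
        for bigtil, bigsrc in zip(bigtil_list, bigsrc_list):
            print(bigtil, bigsrc)
            if not os.path.exists(bigtil):  # if no folder named after this novel exists yet
                os.mkdir(bigtil)  # create a folder named after the novel
            self.file_data(bigtil, bigsrc)  # call file_data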
            
    
    # Purpose: request a novel's page and get each chapter's title and link
    def file_data(self, bigtil, bigsrc):
        response = requests.get("https:" + bigsrc)  # bigsrc comes without the "https:" prefix, so add it before requesting
        print(response)
        html = etree.HTML(response.text)
        littil_list = html.xpath('//ul[@class="cf"]/li/a/text()')  # chapter titles
        litsrc_list = html.xpath('//ul[@class="cf"]/li/a/@href')  # the link for each chapter title
        for littil, litsrc in zip(littil_list, litsrc_list):
            print(littil, litsrc)
            self.final_data(bigtil, littil, litsrc)
            
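    # Purpose: request a chapter page and save its text to <novel>/<chapter>.txt
    def final_data(self, bigtil, littil, litsrc):
        response = requests.get("https:" + litsrc)  # open the chapter link to get the chapter content
        html = etree.HTML(response.text)
        content = "\n".join(html.xpath('//div[@class="read-content j_readContent"]/p/text()'))
        file_name = os.path.join(bigtil, littil + ".txt")  # inside the novel's folder, named after the chapter, saved as a .txt file
        print("Saving file: " + file_name)
        with open(file_name, "a", encoding="utf-8") as f:  # open the file
            f.write(content)  # write the chapter content
            # Note: only a string can be written to the file, but the scraped
            # content comes back as a list, so "\n".join() turns it into a string.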
            
spider = Spider()  # create an instance
spider.start_request()  # start the crawl
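One thing to watch with the two-list approach: `zip` pairs items purely by position, so if any link has no text node, the title list comes out shorter and every later title is matched with the wrong link. A possible alternative, sketched below, selects each `<a>` element once and reads its text and `href` from that same element. The `SAMPLE` HTML is made up to mimic the structure the XPath expects, and `extract_links` is a hypothetical helper name, not part of the original script.

```python
from lxml import etree

# Made-up HTML mimicking the listing structure; the second link has no text.
SAMPLE = """
<div class="book-mid-info"><h4><a href="//example.com/book/1">Book One</a></h4></div>
<div class="book-mid-info"><h4><a href="//example.com/book/2"></a></h4></div>
<div class="book-mid-info"><h4><a href="//example.com/book/3">Book Three</a></h4></div>
"""


def extract_links(html_text, xpath='//div[@class="book-mid-info"]/h4/a'):
    """Return (title, href) pairs, reading both values from the same <a> element."""
    html = etree.HTML(html_text)
    pairs = []
    for a in html.xpath(xpath):
        title = (a.text or "").strip()  # a.text is None when the element has no text
        href = a.get("href")
        if title and href:  # skip links that lack a title or an href
            pairs.append((title, href))
    return pairs


if __name__ == "__main__":
    for title, href in extract_links(SAMPLE):
        print(title, href)
```

On this sample, querying `text()` and `@href` separately would give two titles but three links, so `zip` would pair "Book Three" with the link for book 2. Reading both from one element avoids that.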

Run result:
[Image: run result]
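A note on the `"https:" + href` step in the code: it works only when every link is protocol-relative (starts with `//`). If a page also contained root-relative or fully qualified links, the standard library's `urllib.parse.urljoin` would handle all three cases. The following is a small sketch under that assumption; `absolute_url` is a hypothetical helper name and the `example.com` URLs are placeholders.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    base = "https://www.example.com/all"
    # Protocol-relative link: the scheme is taken from the page URL.
    print(absolute_url(base, "//book.example.com/info/1"))
    # Root-relative link: scheme and host are taken from the page URL.
    print(absolute_url(base, "/chapter/2"))
    # Already absolute: returned unchanged.
    print(absolute_url(base, "https://other.example.com/x"))
```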
