Scraping a novel with Scrapy and saving it as a txt file

Latest recommended article published on 2023-11-15 17:03:49

陈建江！

Category: Scrapy crawlers. Tags: scraping novels with scrapy, scraping novels with python, multi-level page crawling, text crawling

Article link: https://blog.csdn.net/weixin_44841312/article/details/96700147

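I. Crawl path:

1. Open the novel's table-of-contents page https://www.x81zw.com/book/5/5182/
2. Extract the link to each chapter
3. Follow each chapter link and scrape the chapter title and body text
4. Save the content of each chapter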
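
Steps 1 and 2 can be tried without Scrapy at all. The sketch below uses only the Python standard library to pull chapter links out of a table of contents and turn them into absolute URLs. The markup in `SAMPLE_HTML`, the chapter file names, and the helper names `ChapterLinkParser` and `chapter_urls` are assumptions made for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical, simplified stand-in for the table-of-contents markup.
SAMPLE_HTML = """
<div id="list"><dl>
  <dd><a href="/book/5/5182/1001.html">Chapter 1</a></dd>
  <dd><a href="/book/5/5182/1002.html">Chapter 2</a></dd>
</dl></div>
"""


class ChapterLinkParser(HTMLParser):
    """Collect the href of every <a> that sits inside a <dd>."""

    def __init__(self):
        super().__init__()
        self.in_dd = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "dd":
            self.in_dd = True
        elif tag == "a" and self.in_dd:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "dd":
            self.in_dd = False


def chapter_urls(html, base_url):
    """Return absolute chapter URLs in document order."""
    parser = ChapterLinkParser()
    parser.feed(html)
    # urljoin resolves site-relative hrefs against the page URL
    return [urljoin(base_url, href) for href in parser.hrefs]


urls = chapter_urls(SAMPLE_HTML, "https://www.x81zw.com/book/5/5182/")
```

Inside a spider the same job is done by the XPath selectors and the request callbacks, so this is only a way to check the link handling in isolation.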

II. Files

1.spider.py

# -*- coding: utf-8 -*-
import scrapy
from novel.items import NovelItem
import re

class Demo1Spider(scrapy.Spider):
    name = 'demo1'  # spider name
    allowed_domains = ['x81zw.com']  # domain the spider is allowed to crawl
    start_urls = ['https://www.x81zw.com/book/5/5182/']  # table-of-contents page

    def parse(self, response):
        # dd[10] is the first chapter, so 10..99 covers the first 90 chapters
        for i in range(10, 100):
            # extract_first() returns None instead of raising when a dd has no link
            path = response.xpath('//*[@id="list"]/dl/dd[{}]/a/@href'.format(i)).extract_first()
            if path is None:
                continue
            # create a fresh item per chapter so concurrent requests do not share one object
            item = NovelItem()
            # the extracted href is incomplete, so resolve it against the page URL
            item['href'] = response.urljoin(path)
            # print(item['href'])
            # Yield a request for the chapter page. meta carries the item to
            # parse_detail, which is registered as the callback.
            # dont_filter=True bypasses Scrapy's duplicate-request filter.
            yield scrapy.Request(url=item['href'], meta={"item": item},
                                 callback=self.parse_detail, dont_filter=True)


    def parse_detail(self, response):
        # pick up the NovelItem that parse() attached to the request via meta;
        # it already holds the href set in parse()
        item = response.meta['item']
        # extract every text node of the chapter body; the result is a list of strings
        contents = response.xpath('//*[@id="content"]/text()').extract()
        # extract the chapter title
        item['title'] = response.xpath('//div[@class="bookname"]/h1/text()')[0].extract()
        # initialise item['content'] before appending to it
        item['content'] = ''
        for i in contents:
            # i = ''.join(i.split('\u3000\u3000'))
            # merge the list entries and strip every \u3000 (full-width space)
            item['content'] = item['content'] + "\n" + i.replace(u'\u3000', u'')
        print("Final result:")
        print(item['title'])
        # print(item['content'])
        yield item
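
The text clean-up in `parse_detail` can also be written as a small standalone function, which makes it easy to test on its own. This is a variation on the loop above, not the same code: it uses `str.join` instead of repeated concatenation, and it additionally drops nodes that are empty after stripping. The name `join_paragraphs` and the sample nodes are made up for the example.

```python
def join_paragraphs(text_nodes):
    """Join text nodes into chapter text, dropping full-width spaces and empty nodes."""
    # remove the full-width space U+3000, then trim ordinary whitespace
    cleaned = (node.replace("\u3000", "").strip() for node in text_nodes)
    # keep only non-empty lines and join them with a single newline
    return "\n".join(line for line in cleaned if line)


# Hypothetical text nodes, shaped like the list that extract() returns.
nodes = ["\u3000\u3000First paragraph.", "\r\n", "\u3000\u3000Second paragraph."]
chapter_text = join_paragraphs(nodes)
```

With this helper the loop would reduce to `item['content'] = join_paragraphs(contents)`, at the cost of losing the leading newline the original version produces.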

2.items.py

import scrapy
class NovelItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    href = scrapy.Field()
    content = scrapy.Field()
    title = scrapy.Field()

3.pipelines.py

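class NovelPipeline(object):
    def process_item(self, item, spider):
        # append mode, so every chapter ends up in the same file
        with open("星际之全能进化.txt", 'a', encoding='utf-8') as f:
            # leave a blank line between the title and the body,
            # and another between consecutive chapters
            f.write(item['title'])
            f.write('\n\n')
            f.write(item['content'])
            f.write('\n\n')
        # return the item so that any later pipeline still receives it
        return item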
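
One thing to watch for: Scrapy downloads pages concurrently, so chapters may reach the pipeline out of order, and appending them as they arrive can shuffle the book. A possible workaround, sketched below, is to buffer the chapters and write them sorted when the spider closes. It assumes each item carries an extra `index` field with the chapter's position, which the `NovelItem` above does not define. The class name `OrderedNovelPipeline` and the default file name are also hypothetical.

```python
class OrderedNovelPipeline:
    """Buffer chapters and write them in index order when the spider closes."""

    def __init__(self, path="novel.txt"):
        self.path = path  # hypothetical output path
        self.chapters = []

    def open_spider(self, spider):
        # called by Scrapy once when the spider starts
        self.chapters = []

    def process_item(self, item, spider):
        # 'index' is an assumed extra field holding the chapter's position
        self.chapters.append((item["index"], item["title"], item["content"]))
        return item

    def close_spider(self, spider):
        # called by Scrapy once when the spider finishes
        with open(self.path, "w", encoding="utf-8") as f:
            for _, title, content in sorted(self.chapters, key=lambda c: c[0]):
                f.write(title + "\n\n" + content + "\n\n")
```

In `parse`, the loop variable `i` would be a natural value for that field. Whichever pipeline is used, it only runs once it is enabled in the project's settings.py through the `ITEM_PIPELINES` setting, for example by mapping `'novel.pipelines.NovelPipeline'` to a priority such as 300.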