Target site: Daomubiji (盗墓笔记) novel site
Target URL: http://www.daomubiji.com/
Target content: information on the Daomubiji novels, specifically:
    book title
    chapter number
    chapter title
Results are saved in MongoDB.
####################################
Remember to flush Redis before each run.
Addition: also scrape the full text of each chapter.

Add to settings.py:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = None
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379

Add to items.py:

text = Field()  # stores the chapter's full text
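The "flush Redis before each run" step can be done from the command line; a minimal sketch, assuming the Redis instance at the host and port configured above:

```shell
# Flush the current Redis database so stale scrapy-redis request queues
# and dupefilter fingerprints do not leak into the next crawl.
redis-cli -h 127.0.0.1 -p 6379 flushdb
```

Note that FLUSHDB clears the entire selected database, not just the spider's keys; if other data shares the instance, delete only the spider's keys (e.g. the `novelSpider:*` keys) instead.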
Sample code found online, suggested as a reference:
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from novelspider.items import NovelspiderItem


class novelSpider(CrawlSpider):
    name = 'novelSpider'
    redis_key = 'novelSpider:start_urls'
    start_urls = ['http://www.daomubiji.com/']

    def parse(self, response):
        '''
        Collect the links to each book from the Daomubiji home page.
        :param response: the downloaded home page
        :return: yields a Request per book link
        '''
        selector = Selector(response)
        section = selector.xpath('//article')
        bookUrls = section.xpath('p/a/@href').extract()
        print(bookUrls)
        for eachUrl in bookUrls:
            # The source snippet ends here; a per-book callback
            # (not shown in the original) would normally be passed
            # via the callback= argument.
            yield Request(eachUrl)
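The XPath logic above (`//article`, then `p/a/@href`) can be illustrated without Scrapy using only the standard library. The HTML fragment below is an assumption for demonstration, not the site's real markup:

```python
from html.parser import HTMLParser


class BookLinkParser(HTMLParser):
    """Collect href values from <a> tags nested as article > p > a,
    mimicking the spider's XPath: //article then p/a/@href."""

    def __init__(self):
        super().__init__()
        self.stack = []   # stack of currently open tags
        self.links = []   # collected href values

    def handle_starttag(self, tag, attrs):
        # An <a> qualifies if its direct parent is <p> inside an <article>.
        if tag == "a" and self.stack[-1:] == ["p"] and "article" in self.stack:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass


# Hypothetical fragment -- the real page layout may differ.
sample = """
<article>
  <p><a href="http://www.daomubiji.com/dao-mu-bi-ji-1">Book 1</a></p>
  <p><a href="http://www.daomubiji.com/dao-mu-bi-ji-2">Book 2</a></p>
</article>
"""

parser = BookLinkParser()
parser.feed(sample)
print(parser.links)
```

In the real spider, Scrapy's `Selector` does this traversal for you; the sketch only shows which nesting the XPath expressions select.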