Target site: Daomubiji (盗墓笔记) novel site
Target URL: http://www.daomubiji.com/
Target content: information on the Daomubiji novels, specifically:
    book title
    chapter number
    chapter title
Results are saved in MongoDB.
####################################
Remember to flush Redis before each run.
Addition: also scrape the full text of each chapter.

Add to settings.py:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = None
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379

Add to items.py:

text = Field()  # stores the chapter's full text
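The "flush Redis before each run" step can be done from the command line; a minimal sketch, assuming the Redis instance at the host and port configured above:

```shell
# Flush the current Redis database so stale scrapy-redis request queues
# and dupefilter fingerprints do not leak into the next crawl.
redis-cli -h 127.0.0.1 -p 6379 flushdb
```

Note that FLUSHDB clears the entire selected database, not just the spider's keys; if other data shares the instance, delete only the spider's keys (e.g. the `novelSpider:*` keys) instead.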
Sample code found online, suggested as a reference:
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from novelspider.items import NovelspiderItem


class novelSpider(CrawlSpider):
    name = 'novelSpider'
    redis_key = 'novelSpider:start_urls'
    start_urls = ['http://www.daomubiji.com/']

    def parse(self, response):
        '''
        Collect the links to each book from the Daomubiji home page.
        :param response: the downloaded home page
        :return: yields a Request per book link
        '''
        selector = Selector(response)
        section = selector.xpath('//article')
        bookUrls = section.xpath('p/a/@href').extract()
        print(bookUrls)
        for eachUrl in bookUrls:
            # The source snippet ends here; a per-book callback
            # (not shown in the original) would normally be passed
            # via the callback= argument.
            yield Request(eachUrl)
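The XPath logic above (`//article`, then `p/a/@href`) can be illustrated without Scrapy using only the standard library. The HTML fragment below is an assumption for demonstration, not the site's real markup:

```python
from html.parser import HTMLParser


class BookLinkParser(HTMLParser):
    """Collect href values from <a> tags nested as article > p > a,
    mimicking the spider's XPath: //article then p/a/@href."""

    def __init__(self):
        super().__init__()
        self.stack = []   # stack of currently open tags
        self.links = []   # collected href values

    def handle_starttag(self, tag, attrs):
        # An <a> qualifies if its direct parent is <p> inside an <article>.
        if tag == "a" and self.stack[-1:] == ["p"] and "article" in self.stack:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass


# Hypothetical fragment -- the real page layout may differ.
sample = """
<article>
  <p><a href="http://www.daomubiji.com/dao-mu-bi-ji-1">Book 1</a></p>
  <p><a href="http://www.daomubiji.com/dao-mu-bi-ji-2">Book 2</a></p>
</article>
"""

parser = BookLinkParser()
parser.feed(sample)
print(parser.links)
```

In the real spider, Scrapy's `Selector` does this traversal for you; the sketch only shows which nesting the XPath expressions select.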