Fetching Douban Movie Information (Including Images) and Movie Comments with the Scrapy Framework - 1

1. Project Directory

Spider project directory


Note: the spider built in "Fetching Douban Movie Information (Including Images) and Movie Comments with the Scrapy Framework - 1" cannot run on its own yet. (1. This part defines two spiders (movieInfo and comment) and several pipelines (the many ITEM_PIPELINES entries in settings.py), and which pipelines apply to which spider has not been sorted out yet; 2. Douban's anti-crawling measures have not been dealt with yet.)
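
One common way to handle the first problem is to bind pipelines to individual spiders through each spider's custom_settings. The snippet below is only a minimal sketch of that idea (it is not the fix used later in this series); it reuses the pipeline names registered in settings.py further down:

# Hypothetical per-spider pipeline binding via custom_settings (not the project's actual fix)
import scrapy


class MovieinfoSpider(scrapy.Spider):
    name = 'movieInfo'
    # Only these pipelines will process items yielded by this spider
    custom_settings = {
        'ITEM_PIPELINES': {
            'douban.pipelines.MovieInfoPipeline': 300,
            'douban.pipelines2mongodb.MovieInfoPipeline': 305,
        },
    }
    # ... the parse methods stay exactly as in movieInfo.py below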


2. Data Structures for Storing the Scraped Results (items.py)

Movie information

import scrapy


# The information section of a movie
class MovieInfoItem(scrapy.Item):
    # Movie title
    movieName = scrapy.Field()
    # Poster image
    photo = scrapy.Field()
    # Director
    director = scrapy.Field()
    # Screenwriter
    screenwriter = scrapy.Field()
    # Leading actors
    performer = scrapy.Field()
    # Genre
    type = scrapy.Field()
    # Country/region of production
    country = scrapy.Field()
    # Language
    language = scrapy.Field()
    # Plot synopsis
    synopsis = scrapy.Field()

Comment information

# The comments section of a movie
class CommentItem(scrapy.Item):
    # Movie title
    movieName = scrapy.Field()
    # Username of the commenter
    username = scrapy.Field()
    # Rating (one of the Douban labels 力荐, 推荐, 还行, 较差, 很差)
    score = scrapy.Field()
    # Comment time
    commentTime = scrapy.Field()
    # Number of "useful" votes (likes)
    fabulous = scrapy.Field()
    # Comment content
    content = scrapy.Field()
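
Scrapy items expose dict-style access, which is how the spiders below fill and read them; a quick illustration (hypothetical values):

# Quick illustration of dict-style item access (hypothetical values)
item = CommentItem()
item['movieName'] = ['扬名立万']
item['score'] = ['力荐']
print(item['movieName'][0], item['score'][0])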

3. Classes That Process the Crawled Pages (the spiders folder)

Movie information (movieInfo.py)

import datetime

import scrapy
from scrapy.cmdline import execute

from douban.items import MovieInfoItem


class MovieinfoSpider(scrapy.Spider):
    name = 'movieInfo'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/']

    def parse(self, response):
        urls = response.xpath('//div[@class="screening-bd"]/ul[@class="ui-slide-content"]/li/ul/li['
                              '@class="poster"]/a/@href').extract()
        for url in urls:
            yield scrapy.Request(url, callback=self.parse1, dont_filter=True)
            pass
        pass

    # This method extracts the data from the page (poster, title, director, screenwriter,
    # cast, genre, country/region of production, language, plot synopsis)
    def parse1(self, response):
        # Movie information item
        item = MovieInfoItem()
        # Movie title
        item['movieName'] = response.xpath('//div[@id="content"]/h1/span')[0].xpath('./text()').extract()
        if len(item['movieName']) == 0:
            item['movieName'] = ['0']

        # Poster image
        item['photo'] = response.xpath('//div[@class="subject clearfix"]/div[@id="mainpic"]/a/img/@src').extract()
        if len(item['photo']) == 0:
            item['photo'] = ['0']

        movieInfo = response.xpath('//div[@class="subject clearfix"]/div[@id="info"]')
        # Director
        item['director'] = movieInfo.xpath('./span')[0].xpath('./span[@class="attrs"]/a/text()').extract()
        if len(item['director']) == 0:
            item['director'] = ['0']

        # Screenwriter
        item['screenwriter'] = movieInfo.xpath('./span')[1].xpath('./span[@class="attrs"]/a/text()').extract()
        if len(item['screenwriter']) == 0:
            item['screenwriter'] = ['0']

        # Leading actors
        item['performer'] = movieInfo.xpath('./span[@class="actor"]/span/a/text()').extract()
        if len(item['performer']) == 0:
            item['performer'] = ['0']

        # Genre
        item['type'] = movieInfo.xpath('./span[@property="v:genre"]/text()').extract()
        if len(item['type']) == 0:
            item['type'] = ['0']

        # Country/region of production
        item['country'] = movieInfo.xpath(u'./span[contains(./text(), "制片国家/地区:")]/following::text()[1]').extract()
        if len(item['country']) == 0:
            item['country'] = ['0']

        # Language
        item['language'] = movieInfo.xpath(u'./span[contains(./text(), "语言:")]/following::text()[1]').extract()
        if len(item['language']) == 0:
            item['language'] = ['0']

        # Plot synopsis
        item['synopsis'] = response.xpath('//div[@id="link-report"]/span/text()').extract()
        # The raw extraction keeps leading/trailing whitespace and full-width spaces (\u3000), e.g.:
        # ['\n        \u3000\u3000月黑风高之夜,一群电影人被秘密召集到一起,欲将轰动一时的血案翻拍成电影,借此扬名立万。……\n    ',
        #  '\n        \u3000\u3000伴随着利欲熏天的创作风暴,案件背后的故事似乎也更加扑朔迷离,……\n    ']
        item['synopsis'] = [x.strip() for x in item['synopsis']]
        # After the line above each paragraph is stripped:
        # ['月黑风高之夜,一群电影人被秘密召集到一起,欲将轰动一时的血案翻拍成电影,借此扬名立万。殊不知他们正身处案发现场,
        # 并步步陷入一个巨大迷局之中,而凶手就在他们中间……', '伴随着利欲熏天的创作风暴,案件背后的故事似乎也更加扑朔迷离,
        # 戏中戏、案中案、局中局、人外人,环环相扣,一场野心与良心的较量愈演愈烈。究竟是命悬一线,还是另有惊天逆转?
        # 爱与温暖的血色花又能否从快将干涸的血河中终极绽放?一切都有待揭开。']
        if len(item['synopsis']) == 0:
            item['synopsis'] = ['0']
        yield item


# The usual entry point, so the spider can be started by running this file directly
if __name__ == '__main__':
    execute(['scrapy', 'crawl', 'movieInfo'])
    # execute('scrapy crawl movieInfo -s JOBDIR=../../crawls/movieInfo'.split())
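
When adjusting the XPath expressions above, it can help to try them interactively first; scrapy shell works for that. The subject id below is just the example movie used later in this article, and Douban may answer with 403 unless a browser-like User-Agent is configured:

scrapy shell "https://movie.douban.com/subject/35422807/"
>>> response.xpath('//div[@id="content"]/h1/span')[0].xpath('./text()').extract()
>>> response.xpath('//div[@id="info"]/span[@property="v:genre"]/text()').extract()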

Comment information (comment.py)

import datetime
import re

import scrapy
from scrapy.cmdline import execute

from douban.items import CommentItem


class CommentSpider(scrapy.Spider):
    name = 'comment'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/']

    # Introduction page of 扬名立万
    # start_urls = ['https://movie.douban.com/subject/35422807/?from=showing']
    # Popular comments page of 扬名立万
    # start_urls = ['https://movie.douban.com/subject/35422807/comments?status=P']
    # Latest comments page of 扬名立万 (not crawled)
    # start_urls = ['https://movie.douban.com/subject/35422807/comments?sort=time&status=P']

    # Positive comments on 扬名立万
    # start_urls = ['https://movie.douban.com/subject/35422807/comments?percent_type=h&limit=20&status=P&sort=new_score']
    # Average comments on 扬名立万
    # start_urls = ['https://movie.douban.com/subject/35422807/comments?percent_type=m&limit=20&status=P&sort=new_score']
    # Negative comments on 扬名立万
    # start_urls = ['https://movie.douban.com/subject/35422807/comments?percent_type=l&limit=20&status=P&sort=new_score']

    def parse(self, response):
        urls = response.xpath('//div[@class="screening-bd"]/ul[@class="ui-slide-content"]/li/ul/li['
                              '@class="poster"]/a/@href').extract()
        # First iterate over all the urls
        for url in urls:
            # yield a request and register a callback for it via the callback parameter; once the
            # request completes, the response is passed to that callback
            # Scrapy acts on the type of object the generator yields: for a scrapy.Request it fetches
            # the link and calls the request's callback when the response arrives;
            # for a scrapy.Item it hands the object over to pipelines.py for further processing
            # Scrapy's duplicate filter would silently drop repeated requests, so no result would appear
            # for them; dont_filter=True switches that filter off
            yield scrapy.Request(url, self.parse1, dont_filter=True)

    def parse1(self, response):
        urls = response.xpath(
            '//div[@id="comments-section"]/div[@class="mod-hd"]/h2/span[@class="pl"]/a/@href').extract()
        # urls is a list; iterate over its elements (there is only one element, and it is a string)
        for url in urls:
            yield scrapy.Request(url, self.parse2, dont_filter=True)
        pass

    # Some movies have the positive/average/negative comment filter, some do not
    def parse2(self, response):
        has = response.xpath('//div[@class="comment-filter"]/label/span[@class="filter-name"]/text()').extract()
        if has:
            for i in range(4):
                if i == 0:
                    # Positive comments
                    yield scrapy.Request(
                        re.sub('status=P', 'percent_type=h&limit=20&status=P&sort=new_score',
                               response.request.url),
                        self.parse3, dont_filter=True)
                elif i == 1:
                    # Average comments
                    yield scrapy.Request(
                        re.sub('status=P', 'percent_type=m&limit=20&status=P&sort=new_score',
                               response.request.url),
                        self.parse3, dont_filter=True)
                elif i == 2:
                    # Negative comments
                    yield scrapy.Request(
                        re.sub('status=P', 'percent_type=l&limit=20&status=P&sort=new_score',
                               response.request.url),
                        self.parse3, dont_filter=True)
                else:
                    # All popular comments
                    yield scrapy.Request(
                        re.sub('status=P', 'limit=20&status=P&sort=new_score', response.request.url),
                        self.parse3, dont_filter=True)
                    pass
                pass
            pass
        else:
            yield scrapy.Request(response.request.url, self.parse3, dont_filter=True)
            pass
        pass

    # This method extracts the data from the page (title, the commenter's rating, comment time,
    # number of likes, comment content)
    # response is the page the server returned for the request Scrapy sent
    def parse3(self, response):
        # The list of comments on the page
        comment_list = response.xpath('//div[@id="comments"]/div[@class="comment-item "]')

        # # Cookies of a logged-in Douban session (paste everything after "Cookie: " from the browser
        # # between the quotes below)
        # cookies = ''
        # # Turn the cookie string into a dict
        # cookies = {i.split("=")[0]: i.split("=")[1] for i in cookies.split("; ")}

        for comment_li in comment_list:
            # Each comment_li is the element of one comment, so a separate item object is created per comment
            item = CommentItem()

            # Movie title
            item['movieName'] = response.xpath('//div[@id="wrapper"]/div[@id="content"]/h1/text()').extract()
            # The extracted title looks like "['xxx 短评']"; strip the trailing " 短评" ("short comments")
            for x in item['movieName']:
                item['movieName'] = [re.sub(r' 短评', r'', x)]

            # Username of the commenter
            item['username'] = comment_li.xpath(
                './div[@class="comment"]/h3/span[@class="comment-info"]/a/text()').extract()
            if len(item['username']) == 0:
                item['username'] = ['0']

            # The commenter's rating (some users do not rate, in which case the time may be extracted instead)
            item['score'] = comment_li.xpath('./div[@class="comment"]/h3/span[@class="comment-info"]/span')[1].xpath(
                './@title').extract()
            # Normalise the rating value
            if item['score'] != ['力荐'] and item['score'] != ['推荐'] and item['score'] != ['还行'] \
                    and item['score'] != ['较差'] and item['score'] != ['很差']:
                item['score'] = ['未评分']

            # Comment time
            item['commentTime'] = comment_li.xpath('./div[@class="comment"]/h3/span[@class="comment-info"]/span['
                                                   '@class="comment-time "]/text()').extract()
            # Strip the newlines and spaces around the time (['\n                    2021-11-14\n                '])
            item['commentTime'] = [x.strip() for x in item['commentTime']]
            # The loop below would turn "2021-12-10 20:22:29" into "2021-12-10" (date only)
            # for x in item['commentTime']:
            #     item['commentTime'] = [x[:-9]]
            if len(item['commentTime']) == 0:
                item['commentTime'] = ['0']

            # Number of "useful" votes the comment received
            item['fabulous'] = comment_li.xpath(
                './div[@class="comment"]/h3/span[@class="comment-vote"]/span/text()').extract()
            if len(item['fabulous']) == 0:
                item['fabulous'] = ['0']

            # Comment content
            item['content'] = comment_li.xpath('./div[@class="comment"]/p/span/text()').extract()
            if len(item['content']) == 0:
                item['content'] = ['0']

            # This is a generator; every item it yields is handed to pipelines.py for processing
            yield item
            pass  # no-op, kept only as a placeholder to close the block
        # Request the next page automatically so the crawl goes deeper
        nextPage = response.xpath('//div[@id="paginator"]/a[@class="next"]/@href')
        # Check whether nextPage is valid. The href taken from the a tag differs from the url shown in the
        # address bar; match against the a tag's href, because that is what is extracted here
        # Note: with a logged-in cookie at most 500 popular comments can be crawled; without logging in,
        # at most 220 popular comments and 80 latest comments
        # Crawling by category (all popular, positive, average, negative, latest) can yield more than 1500
        # comments (with a login cookie, and when each category holds more comments than a user can browse)
        # In 2022 Douban changed how many comments a user can browse: logged-in users can browse more, and beyond
        # a certain start offset pages no longer hold 20 comments each; the limit for anonymous users is unchanged
        # 220 comments (if nextPage and nextPage[0].extract() != '?start=220&limit=20&sort=new_score&status=P&percent_type=':)
        # When requests carry cookies, use the check below instead
        # if nextPage:
        if nextPage and nextPage[0].extract() != '?start=220&limit=20&sort=new_score&status=P&percent_type=' and \
                nextPage[0].extract() != '?start=220&limit=20&sort=new_score&status=P&percent_type=h' and \
                nextPage[0].extract() != '?start=220&limit=20&sort=new_score&status=P&percent_type=l' and \
                nextPage[0].extract() != '?start=220&limit=20&sort=new_score&status=P&percent_type=m':
            # Build the absolute url of the next page
            url = response.urljoin(nextPage[0].extract())
            # Send the next-page request with cookies
            # yield scrapy.Request(url, cookies=cookies, callback=self.parse3, dont_filter=True)
            # Send the next-page request without cookies
            yield scrapy.Request(url, callback=self.parse3, dont_filter=True)

# The usual entry point, so the spider can be started by running this file directly
if __name__ == '__main__':
    # execute only runs the first command it is given
    execute(['scrapy', 'crawl', 'comment'])
    # # With JOBDIR the crawl state is saved under crawls/comment automatically, so an interrupted crawl can be resumed
    # # To keep that state usable, press ctrl+c only once and let the spider shut down cleanly;
    # # never press ctrl+c repeatedly, because the spider needs to do its cleanup before exiting
    # execute('scrapy crawl comment -s JOBDIR=../../crawls/comment'.split())
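
For reference, the comment-page query parameters that parse2 patches in with re.sub (percent_type, limit, status, sort) can also be assembled directly. The helper below is only a hypothetical illustration, not part of the project:

# Hypothetical helper that builds the same comment-page URLs parse2 produces with re.sub
def build_comment_url(subject_id, percent_type='', start=0):
    """percent_type: '' for all popular, 'h' positive, 'm' average, 'l' negative."""
    return ('https://movie.douban.com/subject/{}/comments'
            '?percent_type={}&start={}&limit=20&status=P&sort=new_score'
            .format(subject_id, percent_type, start))

# e.g. the first page of negative comments for subject 35422807 (扬名立万)
print(build_comment_url('35422807', percent_type='l'))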

4. Item Pipelines That Operate on the Scraped Data (py files whose names start with pipelines)

Console output

Movie information (pipelines.py) (saves the images locally)

from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


# MovieInfoPipeline inherits from ImagesPipeline
class MovieInfoPipeline(ImagesPipeline):
    # Override get_media_requests, which issues the download requests; item['photo'] holds the image urls
    def get_media_requests(self, item, info):
        for url in item['photo']:
            yield Request(url)

    def item_completed(self, results, item, info):
        print('Title:', item['movieName'][0])
        print('Director:', item['director'][0])
        print('Screenwriter:', item['screenwriter'][0])
        print('Cast:', item['performer'][0])
        print('Genre:', item['type'][0])
        print('Country/region:', item['country'][0])
        print('Language:', item['language'][0])
        print('Synopsis:', item['synopsis'][0])

        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            # if not results[0][0]:
            raise DropItem('Image download failed')
        return item

    # Turn https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2704603496.jpg into p2704603496.jpg,
    # which names the saved image
    # Defines the path the downloaded image is stored at
    def file_path(self, request, response=None, info=None, *, item=None):
        return request.url.split('/')[-1]
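
For context, the results argument that ImagesPipeline passes to item_completed is a list of (success, info) two-tuples, one per requested image; on success the info dict carries the download details. Roughly (illustrative values only):

# Illustrative shape of the `results` argument (values are examples, not real output)
results = [
    (True, {'url': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2704603496.jpg',
            'path': 'p2704603496.jpg',   # relative to IMAGES_STORE, as returned by file_path
            'checksum': 'abc123...'}),
]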

Comment information (pipelines.py)

class CommentPipeline:
    # When process_item and item_completed are defined in the same class, process_item runs and item_completed does not
    # The item parameter here is each item yielded by the spider
    def process_item(self, item, spider):
        # This is where the scraped information in item could be written to any destination
        # Here it is simply printed (each list holds a single string, so print the string rather than the whole list)
        print('Title:', item['movieName'][0])
        print('Rating given by the commenter:', item['score'][0])
        print('Comment time:', item['commentTime'][0])
        print('Number of "useful" votes:', item['fabulous'][0])
        print('Comment content:', item['content'][0])

Saving to a MongoDB database

Movie information (pipelines2mongodb.py) (images are saved locally and then loaded into MongoDB)

import os
import re

from gridfs import GridFS
from pymongo import MongoClient
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MovieInfoPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for url in item['photo']:
            yield Request(url, meta={'name': item['movieName'][0]})

    def item_completed(self, results, item, info):
        client = MongoClient('127.0.0.1', 27017)
        db = client['douban']
        coll = db['movieInfo']
        data = {"movieName": item["movieName"][0], "photo": item["photo"][0], "director": item["director"],
                "screenwriter": item["screenwriter"], "performer": item["performer"], "type": item["type"],
                "country": item['country'][0], "language": item['language'][0], "synopsis": item['synopsis']}
        # self.coll.insert({"movieName": item["movieName"][0], "photo": item["photo"][0], "director": item["director"],
        #                   "screenwriter": item["screenwriter"], "performer": item["performer"], "type": item["type"],
        #                   "country": item['country'][0], "language": item['language'][0], "synopsis": item['synopsis']})
        # Upsert so no duplicate documents are inserted (if movieName already exists, the other fields are
        # updated in place whenever they differ)
        coll.update_one({"movieName": item["movieName"][0]}, {'$set': data}, upsert=True)
        # Initialising the connection in __init__ or open_spider did not work here, so the client is opened
        # and closed for every item
        client.close()

        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            # if not results[0][0]:
            raise DropItem('Image download failed')
        return item

    def file_path(self, request, response=None, info=None, *, item=None):
        name = request.meta['name']  # the image name passed along via meta above
        name = re.sub(r'[?\\*|“<>:/]', '', name)  # strip characters Windows forbids in file names; otherwise names can be garbled or the download fails
        filename = name + '.jpg'  # rename the image
        return filename

    def close_spider(self, spider):
        client = MongoClient('127.0.0.1', 27017)
        db = client["douban"]
        dirs = "./images"
        files = os.listdir(dirs)
        # Iterate over the files in the image directory
        for file in files:
            # Full path of the image
            filesname = os.path.join(dirs, file)
            # Split off the extension to get the file name and type for storage
            f = file.split('.')
            # Open the image file for reading in binary mode
            datatmp = open(filesname, 'rb')
            # Get a GridFS handle for writing
            imgput = GridFS(db)
            # Write the data into GridFS; the content type and name come from the split above
            imgput.put(datatmp, content_type=f[1], filename=f[0])
            datatmp.close()
        client.close()
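
To confirm that the posters actually landed in GridFS, a read-back check like the one below can be run separately (a hypothetical snippet, not part of the pipeline; the filename is the movie name, because close_spider stores each file under its name without the .jpg extension):

# Hypothetical read-back check for one poster stored in GridFS
from gridfs import GridFS
from pymongo import MongoClient

client = MongoClient('127.0.0.1', 27017)
fs = GridFS(client['douban'])
grid_out = fs.find_one({'filename': '扬名立万'})  # movie name used as the GridFS filename
if grid_out is not None:
    with open('check.jpg', 'wb') as f:
        f.write(grid_out.read())
client.close()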

Comment information (pipelines2mongodb.py)

class CommentPipeline(object):
    def __init__(self):
        self.databaseIp = '127.0.0.1'
        self.databasePort = 27017
        self.mongodbName = 'douban'
        self.mongodbCollection = 'comment'
        # Connect to the database (host ip, port)
        self.client = MongoClient(self.databaseIp, self.databasePort)
        # Handle to the database (by name)
        self.db = self.client[self.mongodbName]
        # Handle to the collection
        self.coll = self.db[self.mongodbCollection]

    def process_item(self, item, spider):
        # Insert one record into the database (each list holds a single string, so the string itself
        # is stored in MongoDB rather than the whole list)
        data = {"movieName": item["movieName"][0], "username": item["username"][0], "score": item["score"][0],
                "commentTime": item["commentTime"][0], "fabulous": int(item["fabulous"][0]), "content": item["content"]}
        # Upsert: $set updates the field values of an existing document, or inserts a new one
        self.coll.update_one({"movieName": item["movieName"][0], "username": item["username"][0]}, {'$set': data}, upsert=True)
        # self.coll.insert({"movieName": item["movieName"][0], "username": item["username"][0], "score": item["score"][0],
        #                   "commentTime": item["commentTime"][0], "fabulous": int(item["fabulous"][0]), "content": item["content"]})
        # Returning the item echoes it to the console; this line is optional
        return item

    # Runs once when the spider closes. (If the spider crashes with an exception midway, close_spider may not run.)
    def close_spider(self, spider):
        self.client.close()
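
Once the comments are in MongoDB, a quick aggregation shows how many were stored per movie; a small, hypothetical check script:

# Hypothetical check: count the stored comments per movie
from pymongo import MongoClient

client = MongoClient('127.0.0.1', 27017)
coll = client['douban']['comment']
for row in coll.aggregate([{'$group': {'_id': '$movieName', 'count': {'$sum': 1}}}]):
    print(row['_id'], row['count'])
client.close()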

5. Project Configuration (settings.py)

# Scrapy settings for the douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
# Obeying Douban's robots.txt would make it impossible to fetch Douban's images, so the rules are not obeyed here
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests; the default is 16 and it can be tuned
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# How often the same site may be hit per unit of time; a common anti-crawler countermeasure on many sites
# (a fixed 30-second delay cannot vary dynamically, so requests arrive at near-identical intervals and are easy to spot)
# DOWNLOAD_DELAY = 30
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# Cookie handling: some sites use cookies to tell bots from people, so cookies are disabled unless specifically required
# COOKIES_ENABLED = True

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'douban.middlewares.DoubanSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# A 403 response (request refused by the server) means the site does not welcome crawlers; set header
# information and use a user_agent_list to disguise the request headers
# Once useragent.py is wired into the framework, a random user-agent from the list is added to every request
# The crawler's ip risks being banned, so proxy ips are used (with fallbacks when a proxy is banned);
# proxies are configured in the downloader middleware
# The numbers 543, 400, ... are the order in which middlewares are invoked: the smaller the number, the earlier it runs
# DOWNLOADER_MIDDLEWARES = {
#     'douban.middlewares.DoubanDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# The pipelines*.py files are Scrapy's output channels; to use one, it must be registered in settings.py
# The scraped data flows through this list only once: when the two console pipelines below run together,
# only the first produces output and the second prints nothing (or even raises an error)
ITEM_PIPELINES = {
    # pipelines prints the data directly to the console
    'douban.pipelines.MovieInfoPipeline': 300,
    'douban.pipelines.CommentPipeline': 301,
    # pipelines2excel saves the data to a spreadsheet
    # 'douban.pipelines2excel.CommentPipeline': 302,
    # names each worksheet after the movie title (not implemented)
    # 'douban.pipelines2excels.CommentPipeline': 303,
    # pipelines2json saves the data to a json file (untested)
    # 'douban.pipelines2json.CommentPipeline': 304,
    # pipelines2mongodb saves the data to a MongoDB database
    'douban.pipelines2mongodb.MovieInfoPipeline': 305,
    'douban.pipelines2mongodb.CommentPipeline': 306,
    # pipelines2mysql saves the data to a MySQL database (untested)
    # 'douban.pipelines2mysql.CommentPipeline': 307,
    'douban.pipelines2redis.ProxyPoolPipeline': 308,
}
# Where the images are stored (relative to the working directory at run time)
IMAGES_STORE = './images/'
# Skip images that were already downloaded within the last 30 days
IMAGES_EXPIRES = 30

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
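
The useragent.py mentioned in the downloader-middleware comments above is not shown in this part; a minimal sketch of such a random User-Agent middleware (hypothetical names, and the DOWNLOADER_MIDDLEWARES entry would have to point at wherever it actually lives) could look like this:

# middlewares.py - hypothetical random User-Agent downloader middleware (not the project's actual useragent.py)
import random

USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15',
]


class RandomUserAgentMiddleware:
    # Called for every outgoing request; overwrite the User-Agent header with a random choice from the list
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)
        return None

# settings.py (example registration):
# DOWNLOADER_MIDDLEWARES = {
#     'douban.middlewares.RandomUserAgentMiddleware': 543,
# }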

10. My Resource Packages

Fetching Douban Movie Information (Including Images) and Movie Comments with the Scrapy Framework - 3


Fetching Douban Movie Information (Including Images) and Movie Comments with the Scrapy Framework - 2
