Scraping Douban Movies with Scrapy

The four steps of a Scrapy crawl

  • Create the project
  • Define the target data
  • Write the spider
  • Store the content

Create the project

scrapy startproject douban
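
The command creates the standard Scrapy project skeleton, roughly the following layout (shown here for orientation):

douban/
    scrapy.cfg
    douban/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py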

Create the spider file

scrapy genspider douban_spider movie.douban.com
Creating the file generates the following template code:

# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem


class DoubanSpiderSpider(scrapy.Spider):
    # Spider name
    name = 'douban_spider'
    # Domains the spider is allowed to crawl
    allowed_domains = ['movie.douban.com']
    # Entry URL, handed to the scheduler; the top250 path is appended by hand
    start_urls = ['https://movie.douban.com/top250']

    # Default parse callback
    def parse(self, response):
        pass

Write the items file

An Item is a container for the scraped data and is used just like a dict.
To create one, subclass scrapy.Item and declare each field as scrapy.Field (a short usage sketch follows the items.py code below).

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # MongoDB collection name, used later by the pipeline
    collection = 'douban_movie'
    # Ranking number
    serial_number = Field()
    # Movie title
    movie_name = Field()
    # Movie introduction
    introduce = Field()
    # Star rating
    star = Field()
    # Review count
    evaluate = Field()
    # Short description (quote)
    describe = Field()
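
As a quick illustration of the dict-like usage described above, here is a minimal standalone sketch (run from the project root, with illustrative values):

from douban.items import DoubanItem

item = DoubanItem()
# Declared fields are set and read like dict keys
item['movie_name'] = '肖申克的救赎'
item['star'] = '9.7'
print(dict(item))
# Assigning to a key that was not declared as a Field raises KeyError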

Parse the Response

The parsing below uses XPath; the response object comes with a built-in xpath selector.
Each field on the page is extracted separately.

# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem


class DoubanSpiderSpider(scrapy.Spider):
    # Spider name
    name = 'douban_spider'
    # Domains the spider is allowed to crawl
    allowed_domains = ['movie.douban.com']
    # Entry URL, handed to the scheduler
    start_urls = ['https://movie.douban.com/top250']

    # Default parse callback
    def parse(self, response):
        # Loop over the movie entries on the page
        movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")  # XPath rule for one entry
        for i_item in movie_list:
            # Create an item for this movie
            douban_item = DoubanItem()
            # Detailed XPath for each field
            douban_item['serial_number'] = i_item.xpath(".//div[@class='item']//em/text()").extract_first()
            douban_item['movie_name'] = i_item.xpath(
                ".//div[@class='info']/div[@class='hd']/a/span[1]/text()").extract_first()
            content = i_item.xpath(".//div[@class='info']/div[@class='bd']/p[1]/text()").extract()
            # The introduction spans several lines, so collapse the whitespace in each line
            douban_item['introduce'] = [" ".join(i.split()) for i in content]
            douban_item['star'] = i_item.xpath(".//span[@class='rating_num']/text()").extract_first()
            douban_item['evaluate'] = i_item.xpath(".//div[@class='star']//span[4]/text()").extract_first()
            douban_item['describe'] = i_item.xpath(".//p[@class='quote']/span/text()").extract_first()
            # Yield the item to the pipelines for cleaning and storage
            yield douban_item
        # Extract the next-page link with XPath
        next_link = response.xpath("//span[@class='next']/link/@href").extract_first()
        if next_link:
            yield scrapy.Request("https://movie.douban.com/top250" + next_link, callback=self.parse)

Note that the movie introduction spans several lines, so it needs a small cleanup pass (see the standalone sketch below).
After each page is parsed, the spider looks for the next page: it extracts the next-page link, checks whether one exists, and if so yields a scrapy.Request with the new URL and the parse callback.
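
The cleanup itself is plain Python string handling; a minimal standalone sketch of what the list comprehension above does (the sample strings are made up for illustration):

content = [
    "\n                导演: 弗兰克·德拉邦特   主演: 蒂姆·罗宾斯\n                ",
    "\n                1994 / 美国 / 犯罪 剧情\n            ",
]
cleaned = [" ".join(line.split()) for line in content]
print(cleaned)
# ['导演: 弗兰克·德拉邦特 主演: 蒂姆·罗宾斯', '1994 / 美国 / 犯罪 剧情']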

The spider can now crawl the pages.

Save the crawl results

Running scrapy crawl douban_spider -o result.csv writes the scraped data to result.csv in the directory where the command is executed.
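
If the Chinese text in the exported file shows up garbled, one optional tweak (my addition, not part of the original walkthrough) is to force the feed export encoding in settings.py:

# settings.py: make the exported feed use UTF-8
FEED_EXPORT_ENCODING = 'utf-8'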

Save data to MongoDB

First install pymongo: pip install pymongo
Then add the connection settings to settings.py:

MONGO_URL = '127.0.0.1'
MONGO_DB = 'douban'
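
Before wiring up the pipeline, a quick standalone check (my own addition) can confirm that MongoDB is actually reachable at that address:

import pymongo

client = pymongo.MongoClient('127.0.0.1')   # same host as MONGO_URL above
print(client.server_info()['version'])      # raises an error if the server is unreachable
client.close()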

Next, write the pipelines file.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo


class DoubanPipeline(object):
    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MongoDB settings from settings.py
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        # Convert the item to a plain dict and insert it into the collection
        data = dict(item)
        self.db[item.collection].insert_one(data)
        return item

    def close_spider(self, spider):
        self.client.close()

Next, enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}

Rotate a random User-Agent

Write the downloader middleware in middlewares.py and add a class my_useragent.

import random


class my_useragent(object):
    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]
        user_agent = random.choice(user_agent_list)
        request.headers['User-Agent'] = user_agent

Finally, enable the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    # 'douban.middlewares.DoubanDownloaderMiddleware': 543,
    'douban.middlewares.my_useragent': 543,
}
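
As a standalone illustration of what the middleware does to each request (an optional sketch, not part of the project code), a scrapy.Request can be built and inspected without running a crawl:

import random
from scrapy import Request

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]
request = Request('https://movie.douban.com/top250')
request.headers['User-Agent'] = random.choice(user_agent_list)
print(request.headers.get('User-Agent'))   # prints the header that would be sent, as bytes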

Summary

  1. Whenever you write pipelines.py or middlewares.py, remember to enable them in settings.py; the lower the number, the higher the priority.
  2. The spider name must not be the same as the project name, and the spiders directory must not contain duplicate spider names.