Scraping Douban Top 250 with Scrapy + MongoDB

Goal: use the Scrapy crawler framework to scrape the Douban Top 250 list at https://movie.douban.com/top250 and store the results in MongoDB.

  • Scrapy
  • MongoDB
  • XPath text parsing

Creating the project

scrapy startproject doubantop

Then create the spider file spider_douban.py inside the spiders folder.
The project structure now looks like this:

├── doubantop
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── spider_douban.py  # newly created
└── scrapy.cfg

items.py

We'll scrape each movie's title, link, rating, and reviewer count, and declare the corresponding fields in items.py:

import scrapy

class DoubanItem(scrapy.Item):
    title = scrapy.Field()  # movie title
    link = scrapy.Field()   # detail-page URL
    star = scrapy.Field()   # rating, e.g. "9.7"
    num = scrapy.Field()    # reviewer count

The spider

To crawl Douban we first need to simulate a browser's request headers.

Add this to settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
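
Two optional settings can also help here (these are my additions, not part of the original post): the startproject template sets ROBOTSTXT_OBEY = True, which can stop the crawl, and a short download delay keeps the crawler polite:

ROBOTSTXT_OBEY = False  # the project template defaults this to True
DOWNLOAD_DELAY = 1      # wait one second between requests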

spider_douban.py

# coding:utf-8

import scrapy
from doubantop.items import DoubanItem


class DoubanSpider(scrapy.Spider):
    # the unique name used to launch this spider: `scrapy crawl douban`
    name = 'douban'
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        for sel in response.xpath('//div[@class="info"]'):
            # create a fresh item per movie so yielded items don't share state
            item = DoubanItem()
            item['title'] = sel.xpath('div[@class="hd"]/a/span/text()').extract()[0]
            item['link'] = sel.xpath('div[@class="hd"]/a/@href').extract()[0]
            # the star div exposes the rating and the reviewer count as its
            # first and second <span> text nodes
            item['star'] = sel.xpath('div[2]/div/span/text()').extract()[0]
            item['num'] = sel.xpath('div[2]/div/span/text()').extract()[1]
            yield item

        # follow the next page; extract_first() returns None on the last page
        next_link = response.xpath('//span[@class="next"]/a/@href').extract_first()
        if next_link:
            yield scrapy.Request(response.urljoin(next_link), callback=self.parse)
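
Before running the full crawl, you can sanity-check the XPath expressions with the interactive Scrapy shell (run it from the project directory so the USER_AGENT from settings.py is picked up; this step is an addition of mine):

scrapy shell "https://movie.douban.com/top250"
>>> response.xpath('//div[@class="info"]/div[@class="hd"]/a/span/text()').extract_first()

The call should print the title of the first movie on the page; a None result means the expression (or the request headers) needs adjusting.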

Then run the spider, saving the output to a JSON file:

scrapy crawl douban -o items.json

Check items.json in the project directory to confirm the data was scraped.
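
To verify the count programmatically rather than by eye, here's a minimal sketch (assuming the items.json produced above; note that re-running with -o appends to the file, so delete it between runs to keep the JSON valid):

import json

with open('items.json', encoding='utf-8') as f:
    items = json.load(f)  # `-o items.json` writes a JSON array

print(len(items))         # expect 250 once all pages are crawled
print(items[0]['title'])  # spot-check one record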

Saving to MongoDB

To save to a local MongoDB instance, first configure settings.py:


ITEM_PIPELINES = {
   'doubantop.pipelines.MongoDBPipeline': 300,
}
MONGODB_HOST = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "douban"
MONGODB_COLLECTION = "movie"

pipelines.py:


import pymongo


class MongoDBPipeline(object):
    # `from scrapy.conf import settings` has been removed from Scrapy;
    # read the settings through the crawler instead
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.client = pymongo.MongoClient(
            host=settings['MONGODB_HOST'],
            port=settings['MONGODB_PORT'])
        db = self.client[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # insert one document per movie; insert_one() replaces the
        # deprecated pymongo insert()
        self.collection.insert_one(dict(item))
        return item
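
Reading the configuration through from_crawler keeps the pipeline working on current Scrapy releases (the old scrapy.conf module was deprecated and later removed), and it lets you retarget the pipeline at a different host or database from settings.py alone.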

Now run the crawl:

scrapy crawl douban

Open a MongoDB GUI client and connect to inspect the collection (I use Robomongo, now known as Robo 3T).

All 250 records are there. Done!
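
If you'd rather verify from Python than from a GUI, a quick check (assuming the MongoDB settings above):

import pymongo

client = pymongo.MongoClient("localhost", 27017)
collection = client["douban"]["movie"]
print(collection.count_documents({}))                   # expect 250
print(collection.find_one({}, {'_id': 0, 'title': 1}))  # spot-check one title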
