Scrapy + MongoDB: Scraping the Douban Top 250


This post uses the Scrapy crawler framework to scrape the Douban Top 250 (https://movie.douban.com/top250) and stores the results in MongoDB.

Create the project

scrapy startproject Douban250

Create the spider

scrapy genspider douban douban.com

File structure

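For reference, scrapy startproject generates roughly this layout (middlewares.py may be absent in older Scrapy versions):

Douban250/
├── scrapy.cfg
└── Douban250/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── douban.py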

items.py

We scrape each movie's rank, title, rating, number of ratings, one-line quote, link, and brief introduction, and declare those fields in items.py:

import scrapy


class Douban250Item(scrapy.Item):
    ranking = scrapy.Field()    # rank on the list
    name = scrapy.Field()       # movie title
    score = scrapy.Field()      # rating
    score_num = scrapy.Field()  # number of people who rated
    quote = scrapy.Field()      # one-line quote
    cover_url = scrapy.Field()  # link to the movie's detail page
    introduce = scrapy.Field()  # brief introduction

douban.py

import scrapy
from Douban250.items import Douban250Item


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        for node in response.xpath('//ol[@class="grid_view"]/li'):
            # Create a fresh item per movie; reusing a single item across
            # iterations would let later movies overwrite earlier ones
            item = Douban250Item()
            item['ranking'] = node.xpath('.//div[@class="pic"]/em/text()').extract()[0]
            item['name'] = node.xpath('.//span[@class="title"][1]/text()').extract()[0]
            item['score'] = node.xpath('.//div[@class="star"]/span[2]/text()').extract()[0]
            item['score_num'] = node.xpath('.//div[@class="star"]/span[4]/text()').extract()[0]
            # one movie has no quote, so use extract_first(), which
            # returns None instead of raising IndexError
            item['quote'] = node.xpath('.//p[@class="quote"]/span/text()').extract_first()
            # link to the movie's detail page
            item['cover_url'] = node.xpath('.//div[@class="pic"]/a/@href').extract()[0]
            # the intro <p> holds two text nodes (the cast line and the
            # year/country/genre line); join them with a newline
            intro = node.xpath('.//div[@class="bd"]/p[1]/text()').extract()
            item['introduce'] = intro[0].strip() + '\n' + intro[1].strip()

            print(item)  # inspect the output
            yield item


        next_page = response.xpath('//span[@class="next"]/a/@href')
        if next_page:
            url = 'https://movie.douban.com/top250' + next_page[0].extract()
            yield scrapy.Request(url, callback=self.parse)
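The next-page href is relative (something like ?start=25&filter=), which is why it is concatenated onto the base URL above. An equivalent, slightly more robust sketch for the last lines of parse uses response.urljoin, which resolves the href against the page's own URL:

        next_page = response.xpath('//span[@class="next"]/a/@href').extract_first()
        if next_page:
            # urljoin handles relative hrefs without hard-coding the base URL
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)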

settings.py

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'
ROBOTSTXT_OBEY = False

You can also control the log verbosity by adding one line to settings.py:

LOG_LEVEL = 'WARNING'

By default Scrapy prints log messages at DEBUG level and above.

Now run the spider and check what print outputs:

scrapy crawl douban
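As a quick sanity check you can also dump the scraped items straight to a file with Scrapy's built-in feed exports, no pipeline required (movies.json is an arbitrary file name):

scrapy crawl douban -o movies.json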

For convenience you can also write a small launcher script. Create a .py file in the project root (mine is called run.py) containing:

from scrapy import cmdline
cmdline.execute("scrapy crawl douban".split())

Now a single click on Run starts the crawl.


Screenshot: the items printed to the console.

Storing the data in MongoDB

pipelines.py

Note: from scrapy.conf import settings was removed in later Scrapy versions, and pymongo 4 removed Collection.insert(). The version below reads the settings through Scrapy's standard from_crawler hook and uses insert_one() instead.

import pymongo


class Douban250Pipeline(object):
    def __init__(self, host, port, dbname, sheetname):
        # create the MongoDB connection
        client = pymongo.MongoClient(host=host, port=port)
        # select the database
        mydb = client[dbname]
        # collection that will hold the scraped movies
        self.post = mydb[sheetname]

    @classmethod
    def from_crawler(cls, crawler):
        # pull the connection parameters from settings.py
        s = crawler.settings
        return cls(
            host=s["MONGODB_HOST"],
            port=s["MONGODB_PORT"],
            dbname=s["MONGODB_DBNAME"],
            sheetname=s["MONGODB_SHEETNAME"],
        )

    def process_item(self, item, spider):
        data = dict(item)
        self.post.insert_one(data)
        return item

settings.py

ITEM_PIPELINES = {
    'Douban250.pipelines.Douban250Pipeline': 300,
}
# MongoDB host
MONGODB_HOST = "127.0.0.1"
# MongoDB port
MONGODB_PORT = 27017
# database name
MONGODB_DBNAME = "Douban"
# collection that stores the data
MONGODB_SHEETNAME = "doubanmovies"

Run it

Open the collection in a MongoDB GUI to check the results (here: the Mongo Explorer plugin in PyCharm).

All 250 records are there. Done!
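If you would rather verify from Python than from a GUI, a minimal pymongo sketch (assuming the connection settings above) counts the stored documents:

import pymongo

client = pymongo.MongoClient("127.0.0.1", 27017)
# count the documents the pipeline wrote; this should print 250
print(client["Douban"]["doubanmovies"].count_documents({}))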
