028. (7.27) Scraping basic information for the IMDb Top 250 movies with Scrapy

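This post walks through using Scrapy to scrape basic information for the IMDb Top 250 movies. It stresses analyzing the page structure before writing the spider, and shows how regular expressions extract the key data. For the error "'FeedExporter' object has no attribute 'slot'", the fix is to close the output file and then run Scrapy again. When passing data between callbacks with Request and meta, apply deepcopy to avoid data pollution.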
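The data pollution that deepcopy prevents can be reproduced without Scrapy. The sketch below is only an illustration under stated assumptions: plain dicts stand in for `ImdbItem` and for `Request.meta`, and the helpers `schedule_shared`, `schedule_copied`, and `run_callbacks` are hypothetical names, not part of Scrapy. It mimics the situation where requests are queued inside a loop and their callbacks run only later.

```python
import copy


def schedule_shared(ranks):
    """Queue one 'request' per rank, reusing the same item object each time."""
    item = {}  # stand-in for a single ImdbItem created outside the loop
    pending = []
    for rank in ranks:
        item['rank'] = rank
        pending.append({'item': item})  # every meta dict points at the same object
    return pending


def schedule_copied(ranks):
    """Queue one 'request' per rank, giving each its own deep copy of the item."""
    item = {}
    pending = []
    for rank in ranks:
        item['rank'] = rank
        pending.append({'item': copy.deepcopy(item)})  # independent snapshot
    return pending


def run_callbacks(pending):
    """Read the item from each meta dict after the loop has finished."""
    return [meta['item']['rank'] for meta in pending]


shared_result = run_callbacks(schedule_shared(['1', '2', '3']))
copied_result = run_callbacks(schedule_copied(['1', '2', '3']))

print(shared_result)  # ['3', '3', '3']  (every callback sees the last value)
print(copied_result)  # ['1', '2', '3']
```

In a real spider the equivalent would presumably be passing `copy.deepcopy(item)` in the `meta` dict of each `Request`, so that each detail-page callback works on its own item.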

Main code

items:

import scrapy

class ImdbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    rank = scrapy.Field()

    movie_name = scrapy.Field()
    movie_type = scrapy.Field()
    director = scrapy.Field()
    writer = scrapy.Field()
    stars = scrapy.Field()
    score = scrapy.Field()

    country = scrapy.Field()
    metascore = scrapy.Field()
    movie_length = scrapy.Field()
    year = scrapy.Field()
    comment_num = scrapy.Field()
    critic_num = scrapy.Field()
    CWG = scrapy.Field()
    # budget = scrapy.Field()
    # budget_type = scrapy.Field()

spiders:

# -*- coding: utf-8 -*-
import scrapy
from imdb.items import ImdbItem
import re
import time
import copy

# scrapy crawl rank -o rank.csv

class RankSpider(scrapy.Spider):
    name = 'rank'
    allowed_domains = ['imdb.com']
    start_urls = ['https://www.imdb.com/chart/top/?ref_=nv_mv_250']

    # request top250 page, get movie url
    def parse(self, response):
        item = ImdbItem()
        # raw string so that \d is not treated as a string escape sequence
        rank_list = response.xpath('//td[@class="titleColumn"]/text()').re(r'\d+')
        movie_index = 0

        for i in rank_list:
            detail_url = response.xpath(
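The `\d+` pattern passed to `.re()` in `parse` can be tried on its own with the standard `re` module. This is a rough sketch, not the spider's actual output: the `text_nodes` list is made-up sample data shaped like the text nodes the XPath might return, and `extract_ranks` is a hypothetical helper that imitates how Scrapy's `.re()` applies the pattern to each selected node and flattens the matches into one list of strings.

```python
import re

# Hypothetical text nodes, shaped like what
# '//td[@class="titleColumn"]/text()' might yield for two table rows.
text_nodes = ['\n      1.\n      ', '\n    ', '\n      2.\n      ', '\n    ']

RANK_PATTERN = re.compile(r'\d+')


def extract_ranks(nodes):
    """Apply the pattern to every node and flatten all matches into one list."""
    ranks = []
    for node in nodes:
        ranks.extend(RANK_PATTERN.findall(node))
    return ranks


ranks = extract_ranks(text_nodes)
print(ranks)  # ['1', '2']
```

Whitespace-only nodes contribute nothing, so the result holds one string per rank. The values are strings, so convert them with `int()` if numeric ranks are needed.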