Python crawler: saving Douban Top 250 movie posters and renaming the files

Latest recommended article published on 2024-04-08 09:22:57

猩猩真可爱

Tags: python scrapy crawler

Article link: https://blog.csdn.net/u011311418/article/details/78995984

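This post describes using Python's Scrapy framework to crawl the posters of the Douban Top 250 movies, processing each movie's title and rating along the way. In the spider, XPath expressions locate the different elements and extract each movie's title, star, and image link one by one. Pipelines then handle this data, including downloading the images and renaming the files. The settings file configures multiple pipelines and a proxy to make the crawl more stable.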

Summary generated by CSDN using intelligent technology

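1. Spider code: note that the XPath for title and star differs from the one for pic. The first two sit under info, the last one under pic. The for loop works item by item: each pass extracts the title, star, and image information of one item (one movie) and yields it once, so that it can be processed in the pipelines. Once all items on the page have been found, the spider looks up the link to the next page and calls parse again to process it.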

# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem

class Douban250Spider(scrapy.Spider):
    name = 'douban250'
    allowed_domains = ['movie.douban.com']  # domain names only, not full URLs
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        for sel in response.xpath('//div[@class="item"]'):
            item = DoubanItem()
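            # title and star live under div.info; the poster URL lives under div.pic
            item['title'] = sel.xpath(
                'div[@class="info"]/div[@class="hd"]/a/span/text()').extract_first()
            item['star'] = sel.xpath(
                'div[@class="info"]/div[@class="bd"]/div[@class="star"]'
                '/span[@class="rating_num"]/text()').extract_first()
            # image_urls stays a list, which is what Scrapy's ImagesPipeline expects
            item['image_urls'] = sel.xpath('div[@class="pic"]/a/img/@src').extract()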
            yield item
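        # The last page has no "next" link; extract_first() then returns None,
        # whereas extract()[0] would raise IndexError.
        next_page = response.xpath(
            '//div[@class="paginator"]/span[@class="next"]/a/@href').extract_first()
        if next_page:
            # The href is relative, so resolve it against the current page URL.
            next_url = response.urljoin(next_page.strip())
            yield scrapy.Request(next_url, callback=self.parse)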
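
The next-page step can be checked offline, without running the spider. The following is only a minimal sketch: it assumes the paginator href is a relative query string such as `?start=25&filter=` (the sample values below are assumptions, not captured from the site), and the helper name `next_page_url` is hypothetical. It uses the standard library's `urljoin` to show how a relative link resolves against the page URL, and how a missing link on the last page can end the crawl cleanly.

```python
from urllib.parse import urljoin

BASE_URL = 'https://movie.douban.com/top250'


def next_page_url(current_url, href):
    """Return the absolute URL of the next page, or None when there is no link."""
    if not href:
        return None
    # strip() removes stray whitespace around the attribute value
    return urljoin(current_url, href.strip())


# Assumed sample hrefs, shaped like relative paginator links.
first = next_page_url(BASE_URL, '?start=25&filter=')
second = next_page_url(first, ' ?start=50&filter= ')
last = next_page_url(second, None)  # no "next" link on the last page

print(first)
print(second)
print(last)
```

For a query-only href like these, resolving against the current URL gives the same result as prepending the fixed `https://movie.douban.com/top250` prefix, but it does not depend on the link keeping that exact shape.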

2. Settings file: specify the pipelines. There are two pipelines here, one for the text and one for the images, and a random proxy is configured:

# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/la