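1. Spider code: note that the XPath for the title and star (rating) differs from the one for the picture. The first two sit under info, the last one under pic. The for loop walks the page item by item: each iteration extracts the title, star and image information of one item (one movie) and yields it, so that it can be processed in the pipeline. Once all items on the page are done, the spider looks up the link to the next page and calls parse again on it.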
# -*- coding: utf-8 -*-
import scrapy

from douban.items import DoubanItem


class Douban250Spider(scrapy.Spider):
    name = 'douban250'
    # allowed_domains = ['https://movie.douban.com/']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Each div.item holds one movie; the XPaths below are relative to it.
        for sel in response.xpath('//div[@class="item"]'):
            item = DoubanItem()
            # Title and rating live under div.info.
            item['title'] = sel.xpath(
                'div[@class="info"]/div[@class="hd"]/a/span/text()').extract()[0]
            item['star'] = sel.xpath(
                'div[@class="info"]/div[@class="bd"]/div[@class="star"]'
                '/span[@class="rating_num"]/text()').extract()[0]
            # The poster URL lives under div.pic; it is kept as a list.
            item['image_urls'] = sel.xpath('div[@class="pic"]/a/img/@src').extract()
            yield item

        # Query the response itself for the next-page link, and use
        # extract_first() so that a page without such a link gives None
        # instead of raising IndexError.
        nextPage = response.xpath(
            '//div[@class="paginator"]/span[@class="next"]/a/@href').extract_first()
        if nextPage:
            next_url = 'https://movie.douban.com/top250' + nextPage.strip()
            yield scrapy.http.Request(next_url, callback=self.parse, dont_filter=True)
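To try out the per-item lookup and the next-page step without running a crawl, the same idea can be sketched with the standard library alone. This is only an illustration under stated assumptions: xml.etree.ElementTree stands in for Scrapy's selectors, SAMPLE_PAGE is a made-up, well-formed fragment (not real Douban markup), and the next-page href is assumed to be a query-only link such as ?start=25&filter=.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical stand-in for one listing page; not real Douban markup.
SAMPLE_PAGE = """
<html><body>
  <div class="item">
    <div class="pic"><a href="#"><img src="https://example.com/a.jpg"/></a></div>
    <div class="info">
      <div class="hd"><a href="#"><span>Movie A</span></a></div>
      <div class="bd"><div class="star"><span class="rating_num">9.7</span></div></div>
    </div>
  </div>
  <div class="item">
    <div class="pic"><a href="#"><img src="https://example.com/b.jpg"/></a></div>
    <div class="info">
      <div class="hd"><a href="#"><span>Movie B</span></a></div>
      <div class="bd"><div class="star"><span class="rating_num">9.6</span></div></div>
    </div>
  </div>
  <div class="paginator"><span class="next"><a href="?start=25&amp;filter=">next</a></span></div>
</body></html>
"""


def parse(page_source, page_url):
    """Yield one dict per movie, then the absolute next-page URL if there is one."""
    root = ET.fromstring(page_source.strip())
    for sel in root.iterfind(".//div[@class='item']"):
        # Paths are relative to the current item: title and star under
        # div.info, the image under div.pic.
        info = "div[@class='info']"
        yield {
            "title": sel.find(info + "/div[@class='hd']/a/span").text,
            "star": sel.find(
                info + "/div[@class='bd']/div[@class='star']"
                "/span[@class='rating_num']").text,
            "image_urls": [sel.find("div[@class='pic']/a/img").get("src")],
        }
    # After all items: resolve the next-page link against the current URL.
    next_link = root.find(".//div[@class='paginator']/span[@class='next']/a")
    if next_link is not None:
        yield urljoin(page_url, next_link.get("href"))


results = list(parse(SAMPLE_PAGE, "https://movie.douban.com/top250"))
for entry in results:
    print(entry)
```

In the spider itself the dicts correspond to DoubanItem instances and the final URL to the follow-up Request. Resolving the href with urljoin rather than string concatenation is a design choice of this sketch; it keeps working if the link is relative to a page other than the first.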
2. settings file: specify the pipelines. There are two pipelines here, one handling text and one handling images. A random proxy is also configured:
# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/la