Feature Description
• Goal: scrape detailed information on the Douban Movie Top 250
• Output: save the results to a CSV file
• Approach: crawl with the Scrapy framework
Program Structure Design
(1) First, work out pagination. Clicking through to other pages shows that the start parameter in the URL changes; since each page lists 25 movies, start equals (page number - 1) * 25, so a simple loop produces every page URL.
(2) The pages are static HTML, so the movie information can be extracted directly from the page source.
(3) Store the results in a file.
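The pagination rule in step (1) can be sketched as a stand-alone loop (the URL pattern comes from the site; the list here is just illustrative):

```python
# Each page lists 25 movies, so page n (1-based) starts at (n - 1) * 25.
base = "https://movie.douban.com/top250?start="
urls = [base + str((page - 1) * 25) for page in range(1, 11)]
# urls[0] ends with start=0, urls[9] ends with start=225
```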
Code Implementation
A new file, top_movies.py, is created under the spiders directory.
top_movies.py
# -*- coding: utf-8 -*-
import scrapy
import re
from ScrapyDouban.items import ScrapydoubanItem


class TopMoviesSpider(scrapy.Spider):
    name = 'top_movies'
    kv = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/84.0.4147.105 Safari/537.36"
    }
    current_page = 0

    def start_requests(self):
        start_url = "https://movie.douban.com/top250?start="
        while self.current_page < 10:
            url = start_url + str(self.current_page * 25)
            self.current_page += 1
            yield scrapy.Request(url=url, headers=self.kv, callback=self.parse)

    def parse(self, response):
        list_selectors = response.xpath("//div[@class='item']")
        for selector in list_selectors:
            # Rank
            rank = selector.xpath("div[@class='pic']/em/text()").extract()[0]
            # Title
            name = selector.xpath("div[@class='info']/div[@class='hd']/a/span[1]/text()").extract()[0]
            # The two text lines of the first <p>: line 0 holds the director
            # and lead actors, line 1 holds year / country / genre
            info = selector.xpath("div[@class='info']/div[@class='bd']/p/text()").extract()
            # Director
            director = info[0].split(':')[1].split('主演')[0].strip()
            # Lead actors
            actor = info[0].split(':')[2].split('/')[0].strip()
            # Year
            year = re.search(r'\d{4}', info[1]).group(0)
            # Country/region
            country = info[1].split('/')[1].strip()
            # Genre
            type = info[1].split('/')[2].strip()
            # Rating
            mark = selector.xpath("div[@class='info']/div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()").extract()[0]
            item = ScrapydoubanItem()
            item["rank"] = rank
            item["name"] = name
            item["director"] = director
            item["actor"] = actor
            item["year"] = year
            item["country"] = country
            item["type"] = type
            item["mark"] = mark
            yield item
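The split-based parsing in parse() can be tried on its own. The two sample lines below mimic the format of the `<p>` text lines in each Top 250 item (the sample text is illustrative, not scraped); the \xa0 characters are the non-breaking spaces Douban uses as separators, which str.strip() also removes:

```python
import re

# Illustrative samples of the two info lines in one Top 250 entry
line1 = "导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /..."
line2 = "1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情"

director = line1.split(':')[1].split('主演')[0].strip()  # director name
actor = line1.split(':')[2].split('/')[0].strip()        # first lead actor
year = re.search(r'\d{4}', line2).group(0)               # four-digit year
country = line2.split('/')[1].strip()                    # country/region
genre = line2.split('/')[2].strip()                      # genre string
```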
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ScrapydoubanItem(scrapy.Item):
    # define the fields for your item here like:
    rank = scrapy.Field()      # rank
    name = scrapy.Field()      # title
    director = scrapy.Field()  # director
    actor = scrapy.Field()     # lead actors
    year = scrapy.Field()      # year
    country = scrapy.Field()   # country/region
    type = scrapy.Field()      # genre
    mark = scrapy.Field()      # rating
Here the data is saved to a CSV file directly from the command line, so pipelines.py is not needed:
scrapy crawl top_movies -o movies.csv
This produces the information for all Top 250 movies. Because Scrapy issues requests asynchronously, the rows do not arrive in rank order; to display them by rank, sort the rank column in Excel, or do the data cleaning in pipelines.py.
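If you do want rank-ordered output straight from the crawl, a hypothetical pipeline along these lines could buffer items and write a sorted CSV when the spider closes (the class name, output filename, and int-based sort key are assumptions, not part of the original project):

```python
import csv

class SortedCsvPipeline:
    """Hypothetical sketch: collect items, write them sorted by rank."""

    def open_spider(self, spider):
        self.rows = []

    def process_item(self, item, spider):
        # Buffer a plain-dict copy of each item as it arrives
        self.rows.append(dict(item))
        return item

    def close_spider(self, spider):
        # Sort numerically by rank, then write one CSV with a header row
        self.rows.sort(key=lambda r: int(r["rank"]))
        fields = ["rank", "name", "director", "actor",
                  "year", "country", "type", "mark"]
        with open("movies_sorted.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            writer.writerows(self.rows)
```

To use it, enable the class in settings.py via ITEM_PIPELINES, e.g. {'ScrapyDouban.pipelines.SortedCsvPipeline': 300}.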