First, a quick flurry of commands to scaffold the project:
scrapy startproject douban
cd douban
scrapy genspider douban_movie -t basic douban.com
This generates the project files:
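The scaffolding should look roughly like this (the exact file list varies slightly by Scrapy version):

douban/
├── scrapy.cfg
└── douban/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── douban_movie.py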
We want to scrape the Top 250 movie data. Opening https://movie.douban.com/top250 shows the first 25 entries. Clicking through to the next page, the URL changes to https://movie.douban.com/top250?start=25&filter=. The trailing &filter= can be dropped: https://movie.douban.com/top250?start=25 still loads and returns the same data. Since Douban shows 25 movies per page, we can reach every page by setting start to successive multiples of 25 (0, 25, 50, ..., 225).
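As a quick sanity check outside Scrapy, the stripped URL can be fetched directly (a minimal sketch using the third-party requests library; the user agent string is only an example):

import requests

# Douban tends to reject the default user agent of HTTP libraries,
# so send a browser-like one.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
resp = requests.get('https://movie.douban.com/top250?start=25', headers=headers)
print(resp.status_code)  # 200 means the stripped URL works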
Based on these observations, we modify the relevant files.
douban_movie.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from douban.items import DoubanItem


class DoubanMovieSpider(scrapy.Spider):
    name = 'douban_movie'
    allowed_domains = ['douban.com']

    def start_requests(self):
        # The Top 250 spans 10 pages of 25 movies each,
        # so step through start=0, 25, 50, ..., 225.
        for i in range(10):
            url = "https://movie.douban.com/top250?start=%s" % (i * 25)
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # Iterate over each movie card so name, score, and quote stay
        # aligned; a few movies have no quote, so default it to ''.
        for movie in response.xpath("//div[@class='item']"):
            item = DoubanItem()
            item['movie_name'] = movie.xpath(".//div[@class='pic']/a/img/@alt").extract_first()
            item['movie_score'] = movie.xpath(".//span[@class='rating_num']/text()").extract_first()
            item['movie_quote'] = movie.xpath(".//p[@class='quote']/span[@class='inq']/text()").extract_first(default='')
            yield item
This crawls the name, rating, and one-line blurb for each movie in the Top 250 and saves them to a douban.txt file.
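Before running the full crawl, the XPaths can be checked interactively in the Scrapy shell (the -s flag overrides a setting for one run; the user agent string is just an example, since Douban may reject the default one):

scrapy shell -s USER_AGENT="Mozilla/5.0" "https://movie.douban.com/top250"
>>> response.xpath("//div[@class='item']//div[@class='pic']/a/img/@alt").extract_first()

If the selector is right, this should return the title of the first movie on the page.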
items.py:
import scrapy


class DoubanItem(scrapy.Item):
    # One item per movie: title, rating, and one-line blurb.
    movie_name = scrapy.Field()
    movie_score = scrapy.Field()
    movie_quote = scrapy.Field()
pipelines.py:
class DoubanPipeline(object):
    def process_item(self, item, spider):
        # Append one line per movie: title, rating, and blurb.
        with open('douban.txt', 'a', encoding='utf-8') as w:
            w.write(item['movie_name'] + ' 评分:' + item['movie_score'] +
                    ' 介绍:' + item['movie_quote'] + '\n')
        return item
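Alternatively, the pipeline can be skipped entirely: Scrapy's built-in feed exports will dump the items to CSV or JSON straight from the command line:

scrapy crawl douban_movie -o douban.csv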
Finally, edit settings.py to disable robots.txt checking and enable the pipeline:

ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
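One likely snag: Douban tends to reject Scrapy's default user agent, so a browser-like USER_AGENT may also need to be set in settings.py (the exact string below is only an example):

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'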
Run scrapy crawl douban_movie to start the crawl and collect the results.