Create the project
scrapy startproject douban
Create the spider
cd douban
scrapy genspider douban_spider movie.douban.com
Write the project files
1. Edit items.py
import scrapy


class DoubanItem(scrapy.Item):
    # One field per value scraped from each Top 250 entry.
    serial_number = scrapy.Field()  # rank in the list
    movie_name = scrapy.Field()     # movie title
    introduce = scrapy.Field()      # short introduction
    star = scrapy.Field()           # rating score
    evaluate = scrapy.Field()       # number of ratings
    describe = scrapy.Field()       # one-line quote
2. Edit douban_spider.py
# -*- coding: utf-8 -*-
import scrapy

from douban.items import DoubanItem


class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Each <li> in the grid_view list holds one movie entry.
        movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
        for i_item in movie_list:
            douban_item = DoubanItem()
            douban_item['serial_number'] = i_item.xpath("./div[@class='item']//em/text()").extract_first()
            douban_item['movie_name'] = i_item.xpath(".//div[@class='info']/div[@class='hd']/a/span[1]/text()").extract_first()
            # The introduction spans several text nodes. Strip the whitespace from
            # each and keep them all; assigning inside a loop kept only the last one.
            content = i_item.xpath(".//div[@class='bd']/p[1]/text()").extract()
            douban_item['introduce'] = ";".join(
                "".join(i_content.split()) for i_content in content if i_content.strip()
            )
            douban_item['star'] = i_item.xpath(".//span[contains(@class,'rating_num')]/text()").extract_first()
            douban_item['evaluate'] = i_item.xpath(".//div[@class='star']//span[4]/text()").extract_first()
            douban_item['describe'] = i_item.xpath(".//p[@class='quote']/span/text()").extract_first()
            yield douban_item

        # Follow the next-page link; the last page has none, which ends the crawl.
        next_link = response.xpath("//span[@class='next']/link/@href").extract_first()
        if next_link:
            yield scrapy.Request(response.urljoin(next_link), callback=self.parse)
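The two string operations inside parse, removing whitespace from the introduction text and building the next-page URL, can be tried outside Scrapy. The following is a minimal sketch using only the standard library. The helper names squash_whitespace and next_page_url are hypothetical and the sample strings are made up for illustration. For a query-only href such as ?start=25&filter=, urljoin gives the same result as appending the href to the base URL.

```python
from urllib.parse import urljoin

BASE_URL = "https://movie.douban.com/top250"


def squash_whitespace(text):
    """Remove all whitespace, including spaces between words."""
    return "".join(text.split())


def next_page_url(current_url, href):
    """Resolve a relative next-page href against the current page URL."""
    return urljoin(current_url, href)


if __name__ == "__main__":
    # Made-up sample resembling one text node of the introduction paragraph.
    sample = "\n    1994 / USA  / Crime Drama\n   "
    print(squash_whitespace(sample))
    print(next_page_url(BASE_URL, "?start=25&filter="))
```

Note that "".join(text.split()) also removes the spaces between words, so it suits compact fields better than free-form sentences.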
3. Edit settings.py
Set USER_AGENT, ROBOTSTXT_OBEY, and DOWNLOAD_DELAY.
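The notes name the three settings but not their values. As a rough sketch, the corresponding lines in settings.py might look like the excerpt below. Every value here is an assumption chosen for illustration: the user-agent string is a generic browser-style example, ROBOTSTXT_OBEY is left at the value the Scrapy project template generates, and the delay is an arbitrary number of seconds.

```python
# settings.py (excerpt). All values are illustrative assumptions,
# not taken from the original notes.

# Browser-style user agent sent with every request (example string).
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Whether Scrapy should honor robots.txt; True is the project template default.
ROBOTSTXT_OBEY = True

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 2
```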
Run the project
Change into the project directory.
To run the spider without exporting anything, use the following command:
scrapy crawl douban_spider
To export the scraped data, use one of the following commands instead:
scrapy crawl douban_spider -o douban.csv
scrapy crawl douban_spider -o douban.json
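Once douban.json exists, it can be read back with the standard library. The sketch below assumes the file holds a single JSON array, as produced by one run of the command above, and that every item has a non-empty star value. The function names load_movies and top_rated are hypothetical.

```python
import json


def load_movies(path):
    """Load the list of items from a JSON array file such as douban.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def top_rated(movies, n=3):
    """Return the n items with the highest rating.

    The spider stores 'star' as text, so it is converted to float for sorting.
    """
    return sorted(movies, key=lambda m: float(m["star"]), reverse=True)[:n]
```

For example, top_rated(load_movies("douban.json"), 5) would return the five highest-rated entries from the export.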