1. Create the project
Change into the directory where you want to keep the code, then run the following on the command line:
scrapy startproject tutorial
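The command generates a project skeleton roughly like the one below (the exact layout can vary slightly between Scrapy versions):

```
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # Item definitions (step 2)
        pipelines.py
        settings.py       # project settings, e.g. USER_AGENT (step 4)
        spiders/          # spider code lives here (step 3)
            __init__.py
```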
2. Define the Item
import scrapy

class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # stores the Douban movie title
    star = scrapy.Field()   # stores the Douban movie rating
3. Create the spider
import scrapy
# from douban.items import DoubanItem

class DoubanSpider(scrapy.Spider):
    name = "douban"  # spider name
    allowed_domains = ["movie.douban.com"]  # domain only, no scheme or path
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        for sel in response.xpath('//div[@class="info"]'):
            # item = DoubanItem()
            # extract() returns a list of unicode strings; [0] takes the first one
            title = sel.xpath('div[@class="hd"]/a/span/text()').extract()[0]
            star = sel.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            print(star, title)
4. Keep the crawler from being blocked: spoof the User-Agent
In settings.py:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

Otherwise requests fail with a 403 error:
2016-12-31 23:20:16 [scrapy] INFO: Spider opened
2016-12-31 23:20:17 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-31 23:20:17 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-31 23:20:17 [scrap
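Independently of Scrapy, you can check what such a header looks like on a request with the standard library alone. This sketch only builds a request object and inspects it; nothing is sent over the network:

```python
from urllib.request import Request

# the same UA string configured in settings.py above
UA = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) '
      'AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5')

req = Request('https://movie.douban.com/top250', headers={'User-Agent': UA})
# urllib normalizes header names to capitalized form internally
print(req.get_header('User-agent'))
```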