Learning to Scrape with Scrapy CSS Selectors (Part 1)
Scraping the movie list from Bilibili's popular recommendations: information on 50 films
1. Create the project
scrapy startproject bilibili
Change into the project directory (cd bilibili),
then name the spider and specify the site it should crawl:
scrapy genspider [spider name] [domain]
2. Scrape the page data
A handy browser extension for finding CSS selectors: SelectorGadget
First, define the URL to scrape:
start_url : https://www.bilibili.com/ranking/cinema/23/0/3/
For each film, scrape the title, overall score, release date, play count, danmaku (bullet comment) count, and like count.
import scrapy


class VideoinfoSpider(scrapy.Spider):
    name = 'videoinfo'
    # allowed_domains expects a real domain name; 'bilibili' alone matches no host.
    allowed_domains = ['bilibili.com']
    start_urls = ['https://www.bilibili.com/ranking/cinema/23/0/3/']

    def parse(self, response):
        # Each selector returns one list of strings covering the whole page.
        title = response.css('.title::text').extract()
        score = response.css('.pts div::text').extract()
        time = response.css('.pgc-info::text').extract()
        play = response.css('.data-box:nth-child(1)::text').extract()
        comment = response.css('.data-box:nth-child(2)::text').extract()
        like = response.css('.data-box:nth-child(3)::text').extract()

        # zip() combines the lists position by position: one tuple per film.
        for item in zip(title, score, time, play, comment, like):
            yield {
                "title": item[0],
                "score": item[1],
                "time": item[2],
                "play": item[3],
                "comment": item[4],
                "like": item[5],
            }
            print(item)  # debug output
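One thing worth keeping in mind about the parse method: zip() pairs the lists purely by position and stops at the shortest one, so if a selector finds no value for some film, rows can shift or be dropped without any error. Below is a minimal sketch of that behaviour in plain Python. The sample lists are made up (not real Bilibili data), and check_aligned is a hypothetical helper introduced only for illustration.

```python
# Made-up sample data standing in for the lists returned by
# response.css(...).extract(); not real Bilibili data.
titles = ["Film A", "Film B", "Film C"]
scores = ["9.1", "8.7"]  # one score is missing


def check_aligned(**columns):
    """Hypothetical helper: True if every column has the same length."""
    lengths = {len(values) for values in columns.values()}
    return len(lengths) == 1


# zip() stops at the shortest list, so "Film C" is silently dropped.
rows = list(zip(titles, scores))
aligned = check_aligned(title=titles, score=scores)

print(rows)
print(aligned)
```

A check like this before the loop would at least make a mismatch visible. A sturdier alternative, which this post does not use, is to select each film's container element first and extract the fields relative to it.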
Run:
scrapy crawl videoinfo -o output.json
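The -o option writes the yielded dictionaries to output.json as a JSON array, so the result can be inspected with Python's standard json module. The sketch below first writes a tiny made-up sample of the same shape to a temporary file, so it runs without crawling anything. The helper load_items and the sample values are assumptions for illustration only.

```python
import json
import os
import tempfile


def load_items(path):
    """Hypothetical helper: load the list of scraped items from a JSON feed file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Made-up sample using the same keys the spider yields (not real Bilibili data).
sample = [
    {"title": "Film A", "score": "9.1", "time": "2019",
     "play": "100", "comment": "5", "like": "7"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.json")
    # Stand-in for the file that `scrapy crawl videoinfo -o output.json` produces.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    items = load_items(path)

print(len(items), items[0]["title"])
```

With a real crawl, calling load_items("output.json") from the project directory should return one dictionary per film.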