Scraping Bilibili bangumi (番剧) listings with the Scrapy framework.
It has been a while since I last wrote a crawler. Browsing Bilibili today, I noticed plenty of things worth scraping, such as the homepage rankings; the bangumi index looked like an easy data source to locate, so I used it for practice.
Target URL:
https://www.bilibili.com/anime/index/#season_version=-1&area=-1&is_finish=-1&copyright=-1&season_status=-1&season_month=-1&year=-1&style_id=-1&order=3&st=1&sort=0&page=1
Watching how the URL changes and stripping out the parameters that don't affect the response leaves this API endpoint:
https://api.bilibili.com/pgc/season/index//result?page=1&season_type=1&pagesize=20&type=1
Changing only the page= value walks through the results; the maximum page is 153. The data itself isn't terribly useful, but I wrote the code anyway.
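Before wiring this into Scrapy, a quick sanity check on the endpoint confirms the JSON shape. A minimal sketch, assuming the requests library is available; the keys match what the spider below reads:
import requests

resp = requests.get(
    'https://api.bilibili.com/pgc/season/index//result',
    params={'page': 1, 'season_type': 1, 'pagesize': 20, 'type': 1},
)
data = resp.json()['data']
print(len(data['list']))         # 20 entries per page
print(data['list'][0]['title'])  # first title on page 1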
A small entry script runs the spider from the IDE, so you don't have to type scrapy crawl drama every time:
# -*- coding: utf-8 -*-
# @Project filename: PythonDemo dramaMain.py
# @IDE    : IntelliJ IDEA
# @Author : ganxiang
# @Date   : 2020/03/02 19:16
import os
import sys

from scrapy.cmdline import execute

# Make sure the project directory is on the path so Scrapy finds the project.
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Equivalent to running "scrapy crawl drama" on the command line.
execute(['scrapy', 'crawl', 'drama'])
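If you'd rather not go through the command-line wrapper, Scrapy's CrawlerProcess runs the spider in-process; a sketch using the standard API:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('drama')  # the spider name registered below
process.start()         # blocks until the crawl finishes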
The spider, dramaSeries.py:
# -*- coding: utf-8 -*-
import json

import scrapy

from ..items import DramaseriesItem


class DramaSpider(scrapy.Spider):
    name = 'drama'
    # allowed_domains expects bare domains, not full URLs.
    allowed_domains = ['api.bilibili.com']
    # Running counter used to number the scraped entries.
    i = 1
    # One request per page; the API tops out at page 153.
    start_urls = [
        'https://api.bilibili.com/pgc/season/index//result'
        '?page=%s&season_type=1&pagesize=20&type=1' % s
        for s in range(1, 101)
    ]

    def parse(self, response):
        drama = json.loads(response.text)
        data_list = drama['data']['list']
        for field in data_list:
            # Create a fresh item per entry; reusing one instance would
            # make every yielded item share (and overwrite) the same data.
            item = DramaseriesItem()
            item['number'] = self.i
            item['badge'] = field['badge']
            item['cover_img'] = field['cover']
            item['index_show'] = field['index_show']
            item['link'] = field['link']
            item['media_id'] = field['media_id']
            item['order_type'] = field['order_type']
            item['season_id'] = field['season_id']
            item['title'] = field['title']
            print(self.i, item)
            self.i += 1
            yield item
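One caveat: Scrapy fetches the start_urls concurrently, so responses can arrive out of page order and the shared i counter won't match the site's ordering. A sketch of a more stable numbering, assuming the page parameter stays in the request URL, derives the index from the URL instead:
from urllib.parse import urlparse, parse_qs

def parse(self, response):
    # Which page this response belongs to, read back from its own URL.
    page = int(parse_qs(urlparse(response.url).query)['page'][0])
    data_list = json.loads(response.text)['data']['list']
    for offset, field in enumerate(data_list):
        item = DramaseriesItem()
        item['number'] = (page - 1) * 20 + offset + 1  # 20 entries per page
        # ... fill the remaining fields exactly as above ...
        yield item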
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DramaseriesItem(scrapy.Item):
    number = scrapy.Field()
    badge = scrapy.Field()
    cover_img = scrapy.Field()
    index_show = scrapy.Field()
    link = scrapy.Field()
    media_id = scrapy.Field()
    order_type = scrapy.Field()
    season_id = scrapy.Field()
    title = scrapy.Field()
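One nice property of Scrapy Items over plain dicts: assigning a key that wasn't declared as a Field raises a KeyError, so a typo in the spider fails loudly rather than silently writing a wrong column. For example:
item = DramaseriesItem(title='example')
item['badge'] = 'VIP'     # fine, declared above
item['cover'] = 'x.jpg'   # KeyError: the declared field is cover_img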
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from openpyxl import Workbook


class DramaseriesPipeline(object):

    def open_spider(self, spider):
        # One workbook per crawl, with a header row.
        self.excelBook = Workbook()
        self.activeSheet = self.excelBook.active
        header = ['number', 'title', 'link', 'media_id', 'season_id',
                  'index_show', 'cover_img', 'badge']
        self.activeSheet.append(header)

    def process_item(self, item, spider):
        row = [item['number'], item['title'], item['link'], item['media_id'],
               item['season_id'], item['index_show'], item['cover_img'],
               item['badge']]
        self.activeSheet.append(row)
        return item

    def close_spider(self, spider):
        # Save once at the end instead of rewriting the file per item.
        self.excelBook.save('./drama.xlsx')
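As an aside, for a simple dump like this Scrapy's built-in feed exports could replace the pipeline entirely: running scrapy crawl drama -o drama.csv (or drama.json) writes every yielded item with no openpyxl code. The pipeline is only needed because the goal here is an .xlsx file.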
Finally, open settings.py and make these changes:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
ITEM_PIPELINES = {
'dramaSeries.pipelines.DramaseriesPipeline': 300,
}
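If the API ever starts rejecting requests, one thing worth trying (I didn't need it in this run) is replacing Scrapy's default bot-identifying User-Agent with a browser-style one; the string below is just an illustrative example:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # Illustrative browser UA; any current browser string works.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}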
The run collected a little over two thousand entries (100 pages × 20 per page); bumping the page range toward the 153-page maximum would pull in the rest.