Scraping Bilibili anime (番剧) information with the Scrapy framework.

It has been a while since I last wrote a crawler. Today I browsed Bilibili for a bit and noticed there is plenty of data worth scraping, such as the homepage ranking lists. The anime (番剧) index looked like the easiest place to find a data source, so I used it for practice.

Target URL:
https://www.bilibili.com/anime/index/#season_version=-1&area=-1&is_finish=-1&copyright=-1&season_status=-1&season_month=-1&year=-1&style_id=-1&order=3&st=1&sort=0&page=1*
By watching the requests the page makes and stripping out the query parameters that don't affect the response, the request boils down to
https://api.bilibili.com/pgc/season/index//result?page=1&season_type=1&pagesize=20&type=1. Only the page value needs to change to fetch the next batch of results, and page goes up to 153. The scraped data probably isn't all that useful, but here is the code anyway.
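To sanity-check the API before writing the spider, a plain requests call can confirm that the JSON carries a data -> list array with the fields extracted further down. This snippet is my own quick check, not part of the original project:

# Quick manual check of the API (my own snippet, not part of the Scrapy project).
# The 'data' -> 'list' structure and field names mirror what the spider extracts.
import requests

url = ('https://api.bilibili.com/pgc/season/index//result'
       '?page=1&season_type=1&pagesize=20&type=1')
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
for entry in resp.json()['data']['list'][:3]:
    print(entry['title'], entry['link'], entry['index_show'])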

A small main script for running Scrapy, so there is no need to type scrapy crawl drama on the command line every time:

# -*- coding: utf-8 -*-
#@Project filename:PythonDemo  dramaMain.py
#@IDE   :IntelliJ IDEA
#@Author :ganxiang
#@Date   :2020/03/02 0002 19:16

from scrapy.cmdline import execute
import os
import sys

# Put the project root on sys.path, then run the spider named 'drama',
# exactly as if 'scrapy crawl drama' had been typed at the command line.
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy', 'crawl', 'drama'])
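For reference, the same thing can also be done without scrapy.cmdline by using CrawlerProcess. This is just an alternative sketch, not what the original project uses:

# Alternative runner sketch using CrawlerProcess (not the original approach).
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('drama')   # spider name as defined in dramaSeries.py
process.start()          # blocks until the crawl is finished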

The spider, dramaSeries.py:

# -*- coding: utf-8 -*-
import scrapy
import json
from ..items import DramaseriesItem


class DramaSpider(scrapy.Spider):
    name = 'drama'
    # allowed_domains expects bare domain names, not full URLs
    allowed_domains = ['api.bilibili.com']
    i = 1  # running index written into item['number']
    start_urls = ['https://api.bilibili.com/pgc/season/index//result'
                  '?page=%s&season_type=1&pagesize=20&type=1' % s
                  for s in range(1, 101)]

    def parse(self, response):
        data_list = json.loads(response.text)['data']['list']
        for entry in data_list:
            # create a fresh item for each entry instead of reusing one instance
            item = DramaseriesItem()
            item['number'] = self.i
            item['badge'] = entry['badge']
            item['cover_img'] = entry['cover']
            item['index_show'] = entry['index_show']
            item['link'] = entry['link']
            item['media_id'] = entry['media_id']
            item['order_type'] = entry['order_type']
            item['season_id'] = entry['season_id']
            item['title'] = entry['title']
            self.i += 1
            yield item
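One caveat: Scrapy fetches the 100 start_urls concurrently, so the number counter increases in arrival order rather than page order. If strict ordering (or all 153 pages) matters, pagination can instead be driven from parse(). A rough sketch of that variation, assuming the same API parameters as above:

# Pagination sketch (my own variation, not the original spider): start at page 1
# and request the next page from parse() until MAX_PAGE (153 per the prose above).
import json
import scrapy

MAX_PAGE = 153

class DramaPagedSpider(scrapy.Spider):
    name = 'drama_paged'
    allowed_domains = ['api.bilibili.com']
    base_url = ('https://api.bilibili.com/pgc/season/index//result'
                '?page={}&season_type=1&pagesize=20&type=1')
    start_urls = [base_url.format(1)]

    def parse(self, response, page=1):
        for entry in json.loads(response.text)['data']['list']:
            yield {'title': entry['title'], 'link': entry['link']}
        if page < MAX_PAGE:
            yield scrapy.Request(self.base_url.format(page + 1),
                                 callback=self.parse,
                                 cb_kwargs={'page': page + 1})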

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DramaseriesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    number = scrapy.Field()
    badge = scrapy.Field()
    cover_img = scrapy.Field()
    index_show = scrapy.Field()
    link = scrapy.Field()
    media_id = scrapy.Field()
    order_type = scrapy.Field()
    season_id = scrapy.Field()
    title = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from openpyxl import Workbook


class DramaseriesPipeline(object):
    # One shared workbook; the first row holds the column headers.
    excelBook = Workbook()
    activeSheet = excelBook.active
    header = ['number', 'title', 'link', 'media_id', 'season_id',
              'index_show', 'cover_img', 'badge']
    activeSheet.append(header)

    def process_item(self, item, spider):
        row = [item['number'], item['title'], item['link'], item['media_id'],
               item['season_id'], item['index_show'], item['cover_img'], item['badge']]
        self.activeSheet.append(row)
        return item

    def close_spider(self, spider):
        # write the file once, when the crawl has finished
        self.excelBook.save('./drama.xlsx')
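As an aside, if an .xlsx file is not a hard requirement, Scrapy's built-in feed exports can write the same items to CSV or JSON with no pipeline code at all. A minimal sketch (my own suggestion; the FEEDS setting needs Scrapy 2.1+, while the -o flag works on older versions too):

# Optional alternative to the openpyxl pipeline: let Scrapy's feed exports do it.
# In settings.py (Scrapy 2.1+):
FEEDS = {
    'drama.csv': {'format': 'csv'},
}
# Or, on any version, from the command line:  scrapy crawl drama -o drama.csv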

settings.py, with the parts that need to be enabled or overridden:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}
ITEM_PIPELINES = {
   'dramaSeries.pipelines.DramaseriesPipeline': 300,
}
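One optional addition that is not in the original settings: the headers above carry no User-Agent, so Scrapy sends its default one. If the API ever rejects that, a browser-like User-Agent can be set:

# Optional (not in the original settings): a browser-like User-Agent,
# in case the API rejects Scrapy's default one.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'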

The run scraped a bit over two thousand records; there is of course a lot more that could be collected.
