Data Analysis: Kaggle Competition - Sales Forecasting

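This post walks through the data analysis for a Kaggle sales forecasting competition. It covers reading the data, the sales distribution of a single item, and analysis along the time, item ID, item category, life-cycle, store, combined-time, and price dimensions. Looking at different time granularities, item categories, stores, and prices reveals sales patterns, seasonal trends, and the cadence at which new items are introduced.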

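Contents

Part 0: Reading the Data

0.1 Data Overview - Data Files:

0.2 Loading the Data

Part I: Spot-Checking the Sales Distribution of a Single Item

Part II: Time Dimension - Distributions at Different Time Granularities

Part III: Item ID Dimension - Sampling Several Random Items to Inspect Sales and Identify Potential Features

Part IV: Item Category Dimension - Comparing Sales Across Categories - Totals / Over Time

Part V: Life-Cycle Dimension - New and Discontinued Items (SKU Counts)

Part VI: Store Dimension - Store Sales by Date

Part VII: Combined Time Dimension - Weekday Expanded by Date

Part VIII: Price Dimension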


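Following my earlier hands-on write-up of this Kaggle competition, and drawing on the work of other Kagglers, this post summarizes the data analysis for future reference.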

Competition link (closed; the data can still be downloaded): https://www.kaggle.com/c/m5-forecasting-accuracy

Part 0: Reading the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle
pd.set_option('display.max_columns', 50)  # full option name; the short form is ambiguous in recent pandas
plt.style.use('bmh')
color_pal = plt.rcParams['axes.prop_cycle'].by_key()['color']
color_cycle = cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])

0.1 Data Overview - Data Files:

Raw data files:

  • calendar.csv - Contains information about the dates on which the products are sold.
  • sales_train_validation.csv - Contains the historical daily unit sales data per product and store [d_1 - d_1913]
  • sample_submission.csv - The correct format for submissions. Reference the Evaluation tab for more info.
  • sell_prices.csv - Contains information about the price of the products sold per store and date.

Prediction target

We are trying to forecast sales for 28 forecast days. The sample submission has the following format:

  • The columns represent 28 forecast days. We will fill these forecast days with our predictions.
  • The rows each represent a specific item. This id tells us the item type, state, and store. We don't know what these items are exactly.
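
To make this layout concrete, here is a minimal sketch of building a submission-shaped frame. It assumes the forecast columns are named F1 to F28 and sit next to an `id` column, as in sample_submission.csv. The two ids and the all-zero baseline are purely illustrative.

```python
import numpy as np
import pandas as pd

HORIZON = 28  # number of forecast days

# Hypothetical ids, used for illustration only.
ids = ['FOODS_3_090_CA_3_validation', 'FOODS_3_090_CA_3_evaluation']

# Assumed column names F1..F28, one per forecast day.
f_cols = [f'F{i}' for i in range(1, HORIZON + 1)]

# All-zero placeholder predictions: one row per id, one column per day.
submission = pd.DataFrame(np.zeros((len(ids), HORIZON)), columns=f_cols)
submission.insert(0, 'id', ids)

print(submission.shape)  # (2, 29)
```

In practice the zeros would be replaced by model output before writing the frame with `submission.to_csv(..., index=False)`.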

0.2 Loading the Data

# Read in the data
INPUT_DIR = '../input/m5-forecasting-accuracy'
cal = pd.read_csv(f'{INPUT_DIR}/calendar.csv')
stv = pd.read_csv(f'{INPUT_DIR}/sales_train_validation.csv')
ss = pd.read_csv(f'{INPUT_DIR}/sample_submission.csv')
sellp = pd.read_csv(f'{INPUT_DIR}/sell_prices.csv')

We are given historical sales data in the `sales_train_validation` dataset.
- This dataset has one column per day, d_1 to d_1913, along with the department, category, state, and store id of each item.
- d_1914 - d_1941 are the `validation` days, which we predict in stage 1.
- d_1942 - d_1969 are the `evaluation` days, which we predict for the final competition standings.
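
The three day ranges above can be written out as lists of column names. This is only a sketch; the helper `day_cols` is a hypothetical name and not part of the original notebook.

```python
def day_cols(first, last):
    """Return the column names d_<first> .. d_<last>, inclusive."""
    return [f'd_{i}' for i in range(first, last + 1)]

train_days = day_cols(1, 1913)          # history in sales_train_validation
validation_days = day_cols(1914, 1941)  # stage 1 targets
evaluation_days = day_cols(1942, 1969)  # final-standings targets

print(len(train_days), len(validation_days), len(evaluation_days))
# 1913 28 28
```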

Part I: Spot-Checking the Sales Distribution of a Single Item

1. Helps form a first impression

Visualizing the data for a single item

  • Let's take a random item that sells a lot and see how its sales look across the training data.
  • FOODS_3_090_CA_3_validation sells a lot
  • Note there are days where it appears the item is unavailable and sales flatline
d_cols = [c for c in stv.columns if c.startswith('d_')] # sales data columns

# Below we are chaining the following steps in pandas:
# 1. Select the item.
# 2. Set the id as the index, Keep only sales data columns
# 3. Transform so it's a column
# 4. Plot the data
stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'] \
    .set_index('id')[d_cols] \
    .T \
    .plot(figsize=(15, 5),
          title='FOODS_3_090_CA_3 sales by "d" number',
          color=next(color_cycle))
plt.legend(['Sales'])  # pass a list; a bare string is split into one label per character
plt.show()
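
The flat stretches noted above can also be located programmatically. Below is a minimal sketch on a synthetic series; the helper name `zero_runs` and the `min_len` threshold are assumptions made for illustration.

```python
import pandas as pd

def zero_runs(sales, min_len=3):
    """Return (start_label, end_label, length) for each run of zeros
    that lasts at least `min_len` consecutive days."""
    is_zero = sales.eq(0)
    # A new run id starts every time the zero/non-zero state changes.
    run_id = is_zero.ne(is_zero.shift()).cumsum()
    runs = []
    for _, run in sales[is_zero].groupby(run_id[is_zero]):
        if len(run) >= min_len:
            runs.append((run.index[0], run.index[-1], len(run)))
    return runs

# Synthetic daily sales with one 4-day gap and one isolated zero.
sales = pd.Series([5, 7, 0, 0, 0, 0, 6, 0, 8],
                  index=[f'd_{i}' for i in range(1, 10)])
print(zero_runs(sales))  # [('d_3', 'd_6', 4)]
```

On the real data the same function could be applied to the transposed row of a single item. Whether a long zero run means the item was out of stock or simply did not sell cannot be told from the sales column alone.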

Merging the data with real dates

  • We are given a calendar with additional information about past and future dates.
  • The calendar data can be merged with our days data
  • From this we can find weekly and annual trends
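
The merge described above can be sketched on a toy example before it is applied to the real data. The three-row calendar and the sales values below are illustrative; only the `d` and `date` column names are taken from calendar.csv.

```python
import pandas as pd

# Toy calendar: maps each "d" label to a real date (illustrative values).
toy_cal = pd.DataFrame({
    'd': ['d_1', 'd_2', 'd_3'],
    'date': pd.to_datetime(['2011-01-29', '2011-01-30', '2011-01-31']),
})

# Toy sales for one item in long form: one row per day.
toy_sales = pd.DataFrame({'d': ['d_1', 'd_2', 'd_3'], 'sales': [3, 0, 5]})

# Left-merge on "d"; validate='1:1' raises if a day label is duplicated.
merged = toy_sales.merge(toy_cal, on='d', how='left', validate='1:1')

# With real dates attached, calendar features such as weekday are available.
merged['weekday'] = merged['date'].dt.day_name()
print(merged[['d', 'weekday', 'sales']])
```

Once each row carries a real date, grouping by weekday, month, or year is a one-line `groupby`, which is what makes weekly and annual patterns visible.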
cal[['d', 'date', 'event_name_1', 'event_name_2', 'event_type_1', 'event_type_2', 'snap_CA']].head()
# Example 1: FOODS_3_090_CA_3_validation
# Merge calendar on our items' data
example = stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'][d_cols].T # days become the index; the item's row label becomes the column name
example.columns = ['FOODS_3_090_CA_3'] # Name the single column after the item instead of relying on its row label (8412)
example = example.reset_index().rename(columns={'index': 'd'}) # turn the day labels into a "d" column
example = 