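Contents
Part III: Item-ID dimension - randomly sample several items and inspect their sales - identify potential features
Part IV: Item-category dimension - compare sales across item categories - totals / distribution over time
Part V: Life-cycle dimension - analyze item sales - new and discontinued items (SKU count)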
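Following my earlier hands-on notes from this Kaggle competition, and drawing on the experience of other Kagglers, this post summarizes a data analysis for future reference.
Competition link (closed; the data can still be downloaded): https://www.kaggle.com/c/m5-forecasting-accuracy
Part 0: Reading the data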
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle
pd.set_option('display.max_columns', 50)  # use the full option name; the short form can be ambiguous in newer pandas
plt.style.use('bmh')
color_pal = plt.rcParams['axes.prop_cycle'].by_key()['color']
color_cycle = cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])
0.1 Overview of the data files
Raw data files:
calendar.csv
- Contains information about the dates on which the products are sold.
sales_train_validation.csv
- Contains the historical daily unit sales data per product and store [d_1 - d_1913].
sample_submission.csv
- The correct format for submissions. Reference the Evaluation tab for more info.
sell_prices.csv
- Contains information about the price of the products sold per store and date.
Forecast target
We are trying to forecast sales for the next 28 days. The sample submission has the following format:
- The columns represent 28 forecast days. We will fill these forecast days with our predictions.
- The rows each represent a specific item. This id tells us the item type, state, and store. We don't know what these items are exactly.
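To make that layout concrete, here is a minimal sketch of a submission-shaped table built from toy data. The column names `F1` to `F28` and the second item id are assumptions for illustration; check the header of `sample_submission.csv` before relying on them.

```python
import pandas as pd

# Hypothetical item ids; the real ids come from sample_submission.csv
ids = ['FOODS_3_090_CA_3_validation', 'HOBBIES_1_001_CA_1_validation']

# Assumed naming: one column per forecast day, F1 .. F28
forecast_cols = [f'F{i}' for i in range(1, 29)]

# Start from an all-zero forecast, one row per item id
toy_submission = pd.DataFrame(0, index=ids, columns=forecast_cols)
toy_submission = toy_submission.rename_axis('id').reset_index()

print(toy_submission.shape)  # one id column plus 28 forecast columns
```

Filling the 28 forecast columns with model output, row by row, is then all that remains before writing the file.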
0.2 Loading the data
# Read in the data
INPUT_DIR = '../input/m5-forecasting-accuracy'
cal = pd.read_csv(f'{INPUT_DIR}/calendar.csv')
stv = pd.read_csv(f'{INPUT_DIR}/sales_train_validation.csv')
ss = pd.read_csv(f'{INPUT_DIR}/sample_submission.csv')
sellp = pd.read_csv(f'{INPUT_DIR}/sell_prices.csv')
We are given historic sales data in the `sales_train_validation` dataset.
- Each row is one item in one store, with one column per day from d_1 to d_1913. We are also given the department, category, state, and store id of the item.
- d_1914 - d_1941 represents the `validation` rows which we will predict in stage 1
- d_1942 - d_1969 represents the `evaluation` rows which we will predict for the final competition standings.
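The three day ranges above can be written out as lists of column names, which is handy for slicing later. This is only a sketch; the variable names are hypothetical and not part of the original notebook.

```python
# Day-column names for each block, following the ranges listed above
train_days = [f'd_{i}' for i in range(1, 1914)]          # d_1 .. d_1913
validation_days = [f'd_{i}' for i in range(1914, 1942)]  # d_1914 .. d_1941
evaluation_days = [f'd_{i}' for i in range(1942, 1970)]  # d_1942 .. d_1969

print(len(train_days), len(validation_days), len(evaluation_days))
# prints: 1913 28 28
```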
Part I: Spot-checking the sales distribution of a single item
1. Helps form a first impression
Visualizing the data for a single item
- Let's take a random item that sells a lot and see how its sales look across the training data.
- FOODS_3_090_CA_3_validation sells a lot.
- Note there are days where it appears the item is unavailable and sales flatline.
d_cols = [c for c in stv.columns if 'd_' in c] # sales data columns
# Below we are chaining the following steps in pandas:
# 1. Select the item.
# 2. Set the id as the index and keep only the sales data columns.
# 3. Transpose so that each day becomes a row.
# 4. Plot the data.
stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'] \
    .set_index('id')[d_cols] \
    .T \
    .plot(figsize=(15, 5),
          title='FOODS_3_090_CA_3 sales by "d" number',
          color=next(color_cycle))
plt.legend(['Sales'])  # pass a list; a bare string is split into single characters
plt.show()
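The flatline stretches noted above can also be located numerically rather than by eye. The following is a minimal sketch on toy data that finds runs of consecutive zero-sales days; the `min_run` threshold is a hypothetical choice, not something the competition defines.

```python
import pandas as pd

# Toy daily sales; in the notebook this would be one item's d_ columns
sales = pd.Series([5, 3, 0, 0, 0, 4, 0, 6, 0, 0],
                  index=[f'd_{i}' for i in range(1, 11)])

is_zero = sales.eq(0)
# A new run starts wherever the zero / non-zero state changes
run_id = is_zero.ne(is_zero.shift()).cumsum()
# Length of each run of zeros (non-zero days are filtered out first)
zero_run_lengths = is_zero[is_zero].groupby(run_id[is_zero]).size()

# min_run is a hypothetical threshold for calling a stretch a "flatline"
min_run = 2
long_runs = zero_run_lengths[zero_run_lengths >= min_run]
print(long_runs.tolist())  # prints: [3, 2]
```

A single zero day may just be a slow day, so a threshold on run length is one simple way to separate likely stock-outs from ordinary noise.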
Merging the data with real dates
- We are given a calendar with additional information about past and future dates.
- The calendar data can be merged with our days data
- From this we can find weekly and annual trends
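As a rough illustration of the idea, the sketch below merges a toy calendar with toy sales on the `d` key and then averages sales by weekday. All values, including the start date, are made up, and the frame names `toy_cal` and `toy_sales` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the calendar and one item's sales (values are made up)
day_labels = [f'd_{i}' for i in range(1, 15)]
toy_cal = pd.DataFrame({
    'd': day_labels,
    'date': pd.date_range('2011-01-29', periods=14, freq='D'),
})
toy_sales = pd.DataFrame({
    'd': day_labels,
    'sales': [3, 5, 2, 1, 0, 4, 6, 2, 7, 1, 3, 2, 5, 8],
})

# Attach a real date to each day label; validate guards against duplicate keys
merged = toy_sales.merge(toy_cal, on='d', how='left', validate='1:1')

# Average sales per weekday as a first look at weekly seasonality
merged['weekday'] = merged['date'].dt.day_name()
weekday_mean = merged.groupby('weekday')['sales'].mean()
print(weekday_mean)
```

The same pattern with `dt.month` or `dt.year` in place of `dt.day_name()` gives a first view of annual trends.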
cal[['d', 'date', 'event_name_1', 'event_name_2', 'event_type_1', 'event_type_2', 'snap_CA']].head()
# Example 1: FOODS_3_090_CA_3_validation
# Merge calendar on our items' data
example = stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'][d_cols].T # after .T the d_ column names become the index and the row label becomes the column name
example = example.rename(columns={8412:'FOODS_3_090_CA_3'}) # Name it correctly
example = example.reset_index().rename(columns={'index': 'd'}) # make the index "d"
example =