M5 Forecasting - Accuracy
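Overall approach
TODO LIST (to be finished over the Dragon Boat Festival holiday; the basic materials are already in place, so no further material gathering is needed):
- [DONE] 1. Complete the data analysis from five angles (my data analysis baseline);
- [DONE] 2. Complete the data engineering pipeline, based on a walkthrough of the LightGBM model;
- [DONE] 3. Reduce the data volume by randomly sampling 1/10 of the original data;
- [DONE] 4. Build the RMSE, MAPE, and WMAPE evaluation metrics;
- [CLOSED] 5. Tune a single LightGBM model with grid search;
- [CLOSED] 6. Ensemble multiple models: Prophet, RF, and LightGBM;
- [DOING] 7. Write up the report;
- 8. Extension: apply the approach to forecasting for large warehouses and high-frequency SKUs;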
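Item 4 of the list names three evaluation metrics. The notes do not show how they were implemented, so the following is only a minimal sketch using the usual textbook definitions. The function names and the choice to skip zero actuals in MAPE (which is otherwise undefined on zero-sales days) are assumptions, not the author's code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error over non-zero actuals only (assumption)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # zero actuals would cause a division by zero
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])))

def wmape(y_true, y_pred):
    """Weighted MAPE: total absolute error divided by total absolute actuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)))

# Toy values, not taken from the competition data
actual = [0, 2, 4, 4]
forecast = [1, 2, 2, 6]
print(rmse(actual, forecast), mape(actual, forecast), wmape(actual, forecast))
```

WMAPE stays well defined when many individual days have zero sales, which is why it is often preferred over MAPE for sparse item-level series.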
Part I. Understanding the task
1.1 OBJECTIVE:
How much camping gear will one store sell each month in a year?
In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world’s largest company by revenue, to forecast daily sales for the next 28 days.
1.1.1 Input data
The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy.
Time range: [2011-01-29, 2016-06-19]
1969 days in total.
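The day count can be checked directly from the two endpoints of the time range. This is a small sanity check using only the standard library; the variable names are illustrative.

```python
from datetime import date

start = date(2011, 1, 29)
end = date(2016, 6, 19)

# +1 because both the first and the last day are included in the range
n_days = (end - start).days + 1
print(n_days)  # 1969
```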
/kaggle/input/m5-forecasting-accuracy/sample_submission.csv
/kaggle/input/m5-forecasting-accuracy/sales_train_validation.csv
/kaggle/input/m5-forecasting-accuracy/sell_prices.csv
/kaggle/input/m5-forecasting-accuracy/calendar.csv
1.1.2 Output and submission format
Each row contains an id that is a concatenation of an item_id and a store_id, which is either validation (corresponding to the Public leaderboard), or evaluation (corresponding to the Private leaderboard).
In the challenge, you are predicting item sales at stores in various locations for two 28-day time periods.
id,F1,...F28
HOBBIES_1_001_CA_1_validation,0,...,2
HOBBIES_1_002_CA_1_validation,2,...,11
...
HOBBIES_1_001_CA_1_evaluation,3,...,7
HOBBIES_1_002_CA_1_evaluation,1,...,4
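The layout above can be reproduced with a few lines of pandas. The sketch below builds an all-zero frame in that shape for two example items. The helper name `empty_submission` and the zero placeholder values are assumptions for illustration; the real ids should be taken from sample_submission.csv.

```python
import pandas as pd

HORIZON = 28
f_cols = [f"F{i}" for i in range(1, HORIZON + 1)]  # F1 ... F28

def empty_submission(pairs):
    """Build an all-zero submission frame for (item_id, store_id) pairs.

    Hypothetical helper: each pair yields one validation row and one
    evaluation row, with the id formed as item_id + '_' + store_id + suffix.
    """
    pairs = list(pairs)
    ids = [
        f"{item}_{store}_{part}"
        for part in ("validation", "evaluation")
        for item, store in pairs
    ]
    sub = pd.DataFrame(0, index=range(len(ids)), columns=f_cols)
    sub.insert(0, "id", ids)
    return sub

sub = empty_submission([("HOBBIES_1_001", "CA_1"), ("HOBBIES_1_002", "CA_1")])
print(sub.iloc[:, :4])
```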
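1.1.3 Baseline model (see appendix)
1.1.4 Optimization methodology:
- Data analysis: choose the model according to the distribution of sales volumes, e.g. whether there is a discernible pattern, whether volumes are very low, whether sales are stationary, and whether there are outliers;
- Use historical sales, combined with moving averages, as features to predict future sales;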
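One possible reading of the second point is a lagged sales column plus rolling means of that lag, computed per series. The sketch below assumes a long-format table with hypothetical columns `id`, `d_num`, and `sales`; the function name, the lag, and the window sizes are illustrative choices rather than the author's settings. Rolling on the lagged column (not on raw sales) keeps each feature free of information from the days being predicted.

```python
import pandas as pd

def add_lag_features(long_df, lag=28, windows=(7, 28)):
    """Add a lagged sales column and rolling means of that lag, per series."""
    out = long_df.sort_values(["id", "d_num"]).copy()
    lag_col = f"lag_{lag}"
    # Sales observed `lag` days earlier within the same series
    out[lag_col] = out.groupby("id")["sales"].shift(lag)
    for w in windows:
        # Moving average over the lagged values only, computed per series
        out[f"rmean_{lag}_{w}"] = out.groupby("id")[lag_col].transform(
            lambda s: s.rolling(w).mean()
        )
    return out

# Toy data: two series of six days each (not competition data)
toy = pd.DataFrame({
    "id": ["A"] * 6 + ["B"] * 6,
    "d_num": list(range(1, 7)) * 2,
    "sales": [1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50, 60],
})
feats = add_lag_features(toy, lag=2, windows=(3,))
print(feats)
```

For a 28-day forecast horizon, a lag of at least 28 days means every feature for a forecast day can be computed from sales that are already known at prediction time.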
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle
pd.set_option('display.max_columns', 50)  # full option name; the bare 'max_columns' pattern is ambiguous in recent pandas
plt.style.use('bmh')
color_pal = plt.rcParams['axes.prop_cycle'].by_key()['color']
color_cycle = cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])
# Read in the data
INPUT_DIR = '../input/m5-forecasting-accuracy'
cal = pd.read_csv(f'{INPUT_DIR}/calendar.csv')
stv = pd.read_csv(f'{INPUT_DIR}/sales_train_validation.csv')
ss = pd.read_csv(f'{INPUT_DIR}/sample_submission.csv')
sellp = pd.read_csv(f'{INPUT_DIR}/sell_prices.csv')
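TODO item 3 mentions shrinking the data to a random 1/10 sample. The notes do not say how the sample was drawn, so the sketch below assumes it means keeping a random tenth of the series (rows of the wide sales table) with a fixed seed. A small stand-in frame replaces `stv` so the snippet runs on its own; `wide`, `wide_small`, and the seed value are hypothetical names and choices.

```python
import pandas as pd

# Stand-in for the wide sales table: one row per series, one column per day
wide = pd.DataFrame({
    "id": [f"ITEM_{i:03d}_CA_1_validation" for i in range(100)],
    "d_1": range(100),
    "d_2": range(100, 200),
})

# Keep a random 10% of the series; a fixed seed makes the sample reproducible
wide_small = wide.sample(frac=0.1, random_state=42).reset_index(drop=True)
print(len(wide), "->", len(wide_small))
```

Sampling whole series rather than individual days keeps each retained time series intact, which matters for lag and rolling features.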
Part II. Data analysis
2.1 Data range analysis
- Visualizing the data for a single item
d_cols = [c for c in stv.columns if c.startswith('d_')]  # daily sales columns: d_1, d_2, ...
# Below we chain the following steps in pandas:
# 1. Select the item.
# 2. Set the id as the index and keep only the sales columns.
# 3. Transpose so that days become rows and the item becomes a single column.
# 4. Plot the data.
stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'] \
    .set_index('id')[d_cols] \
    .T \
    .plot(figsize=(15, 5),
          title='FOODS_3_090_CA_3 sales by "d" number',
          color=next(color_cycle),
          legend=False)  # single series, so the legend is hidden
plt.show()