Kaggle Competition - Sales Forecasting in Practice: A Complete Record

M5 Forecasting - Accuracy

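Overall Approach

TODO LIST (to be finished by the Dragon Boat Festival; the basic materials are already in place, so little further collection is needed):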

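  • [DONE] 1. Complete the data analysis from five angles (my data-analysis baseline);
  • [DONE] 2. Complete the data engineering pipeline, based on an analysis of the LightGBM model;
  • [DONE] 3. Reduce the data volume by randomly sampling 1/10 of the original data;
  • [DONE] 4. Build the RMSE, MAPE, and WMAPE evaluation metrics;
  • [CLOSED] 5. Tune a single LightGBM model with grid search;
  • [CLOSED] 6. Ensemble multiple models (Prophet, RF, LightGBM);
  • [DOING] 7. Write up the report;
  • 8. Extension: apply the approach to forecasting for large warehouses and high-frequency SKUs;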
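
Item 4 of the list names three error metrics. The sketch below shows one common way to define them with NumPy. The function names are illustrative, and skipping zero actuals in MAPE is an assumption made here (sparse sales series contain many zeros, where the ratio is undefined); it is not necessarily how the original evaluation was written.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error over non-zero actuals (assumption)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mask = y_true != 0  # the ratio is undefined where the actual is zero
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])))

def wmape(y_true, y_pred):
    """Weighted MAPE: total absolute error divided by total actual volume."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)))

# Toy example with one zero-sales day
actual = [0, 2, 4, 4]
forecast = [1, 2, 2, 5]
scores = {
    'rmse': rmse(actual, forecast),
    'mape': mape(actual, forecast),
    'wmape': wmape(actual, forecast),
}
print(scores)
```

WMAPE stays defined as long as total sales are non-zero, which is one reason it is often preferred over MAPE for intermittent demand.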

Part I. Understanding the Problem


1.1 OBJECTIVE:

How much camping gear will one store sell each month in a year?

In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world’s largest company by revenue, to forecast daily sales for the next 28 days.

1.1.1 Input Data

The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy.

Time range: [2011-01-29, 2016-06-19]

1,969 days in total.
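
The day count can be checked directly from the two endpoints. A minimal sketch, assuming the range is inclusive on both ends (the variable names are illustrative):

```python
import pandas as pd

# Daily calendar covering the stated range, both endpoints included
days = pd.date_range('2011-01-29', '2016-06-19', freq='D')
n_days = len(days)
print(n_days)
```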

/kaggle/input/m5-forecasting-accuracy/sample_submission.csv
/kaggle/input/m5-forecasting-accuracy/sales_train_validation.csv
/kaggle/input/m5-forecasting-accuracy/sell_prices.csv
/kaggle/input/m5-forecasting-accuracy/calendar.csv

1.1.2 Output and Submission Format

Each row contains an id that is a concatenation of an item_id and a store_id, followed by a suffix that is either validation (corresponding to the Public leaderboard) or evaluation (corresponding to the Private leaderboard).

In the challenge, you are predicting item sales at stores in various locations for two 28-day time periods.

id,F1,...F28
HOBBIES_1_001_CA_1_validation,0,...,2
HOBBIES_1_002_CA_1_validation,2,...,11
...
HOBBIES_1_001_CA_1_evaluation,3,...,7
HOBBIES_1_002_CA_1_evaluation,1,...,4
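
To make the layout concrete, the sketch below builds an empty submission frame for two item/store pairs. The toy pairs, the variable names, and the zero placeholder values are assumptions for illustration; in practice the ids would come from sample_submission.csv and the F columns from a model.

```python
import pandas as pd

HORIZON = 28
f_cols = [f'F{i}' for i in range(1, HORIZON + 1)]  # F1 ... F28

# Toy item/store pairs (illustrative only)
pairs = pd.DataFrame({
    'item_id': ['HOBBIES_1_001', 'HOBBIES_1_002'],
    'store_id': ['CA_1', 'CA_1'],
})

parts = []
for suffix in ('validation', 'evaluation'):
    # id = item_id + store_id + suffix, joined by underscores
    part = pd.DataFrame({'id': pairs['item_id'] + '_' + pairs['store_id'] + '_' + suffix})
    # Add the 28 forecast columns, filled with a placeholder of 0
    part = part.reindex(columns=['id'] + f_cols, fill_value=0)
    parts.append(part)

submission = pd.concat(parts, ignore_index=True)
print(submission[['id', 'F1', 'F28']])
```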

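1.1.3 Baseline Model (see appendix)

1.1.4 Optimization Methodology:

  • Data analysis: choose the model according to the distribution of sales volume, e.g. whether there is a regular pattern, whether volumes are very small, whether sales are stationary, and whether there are outliers;
  • Use historical sales, combined with moving averages, as features for predicting future sales;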
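
The second point can be sketched as lag and rolling-mean features computed per series. The toy data, the column names lag_2 and rmean_3, and the small LAG and WINDOW values are assumptions chosen so the result is easy to inspect; they are not the settings used in the original pipeline.

```python
import pandas as pd

# Toy long-format sales: one row per (id, day)
sales = pd.DataFrame({
    'id': ['A'] * 6 + ['B'] * 6,
    'd': list(range(1, 7)) * 2,
    'sales': [1, 2, 3, 4, 5, 6, 0, 0, 2, 0, 4, 0],
})

LAG = 2      # how many days back the feature looks
WINDOW = 3   # length of the moving average

sales = sales.sort_values(['id', 'd']).reset_index(drop=True)
grp = sales.groupby('id')['sales']

# Sales LAG days earlier, computed within each series
sales['lag_2'] = grp.shift(LAG)
# Moving average of the lagged sales, so no same-day information leaks in
sales['rmean_3'] = grp.transform(lambda s: s.shift(LAG).rolling(WINDOW).mean())

print(sales)
```

Shifting before averaging keeps the feature computable at prediction time. For a 28-day horizon predicted in one pass, a lag of at least 28 would be the analogous choice.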

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle

# Show up to 50 columns when printing DataFrames
pd.set_option('display.max_columns', 50)
plt.style.use('bmh')
# Colors of the active style, as a list and as an endless iterator
color_pal = plt.rcParams['axes.prop_cycle'].by_key()['color']
color_cycle = cycle(color_pal)

# Read in the data
INPUT_DIR = '../input/m5-forecasting-accuracy'
cal = pd.read_csv(f'{INPUT_DIR}/calendar.csv')
stv = pd.read_csv(f'{INPUT_DIR}/sales_train_validation.csv')
ss = pd.read_csv(f'{INPUT_DIR}/sample_submission.csv')
sellp = pd.read_csv(f'{INPUT_DIR}/sell_prices.csv')
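
The TODO list mentions shrinking the data by randomly drawing 1/10 of it. One way this might be done is to sample whole rows (that is, whole item/store series) from the wide sales table. The toy frame below stands in for stv, and sampling by row with a fixed seed is an assumption, since the original sampling scheme is not shown.

```python
import pandas as pd

# Toy stand-in for the wide sales table: one row per series
toy = pd.DataFrame({
    'id': [f'ITEM_{i:03d}' for i in range(100)],
    'd_1': range(100),
})

# Keep a random 10% of the series; random_state makes the draw repeatable
sampled = toy.sample(frac=0.1, random_state=42)
print(len(sampled))
```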

Part II. Data Analysis


2.1 Data Range Analysis

  • Visualizing the data for a single item

d_cols = [c for c in stv.columns if c.startswith('d_')]  # daily sales columns (d_1, d_2, ...)

# Below we chain the following steps in pandas:
# 1. Select the item.
# 2. Set the id as the index and keep only the sales data columns.
# 3. Transpose so that each day becomes a row and the item a single column.
# 4. Plot the data.
stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'] \
    .set_index('id')[d_cols] \
    .T \
    .plot(figsize=(15, 5),
          title='FOODS_3_090_CA_3 sales by "d" number',
          color=next(color_cycle),
          legend=False)  # single series, so no legend is needed
plt.show()
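
The plot above is indexed by "d" number rather than by date. A possible next step is to map each d_N label to its calendar date. The sketch uses tiny stand-ins for stv and cal so that it runs without the competition files; it assumes calendar.csv provides d and date columns, and the toy values are made up.

```python
import pandas as pd

# Toy stand-ins for stv and cal (values are illustrative)
stv_toy = pd.DataFrame({
    'id': ['FOODS_3_090_CA_3_validation'],
    'd_1': [3], 'd_2': [0], 'd_3': [5],
})
cal_toy = pd.DataFrame({
    'd': ['d_1', 'd_2', 'd_3'],
    'date': ['2011-01-29', '2011-01-30', '2011-01-31'],
})

d_cols_toy = [c for c in stv_toy.columns if c.startswith('d_')]

# Wide to long: one row per (id, day label)
long_df = stv_toy.melt(id_vars='id', value_vars=d_cols_toy,
                       var_name='d', value_name='sales')
# Attach the calendar date for each day label
long_df = long_df.merge(cal_toy, on='d', how='left')
long_df['date'] = pd.to_datetime(long_df['date'])

# Date-indexed series, ready for plotting or resampling
series = long_df.set_index('date')['sales']
print(series)
```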
