Data Analysis: Kaggle Competition - Sales Forecasting

Latest recommended article published on 2024-08-08 17:33:20

Elffer

Article link: https://blog.csdn.net/bmwlwg/article/details/107139656

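This article walks through the data analysis for a Kaggle sales forecasting competition. It covers data loading, the sales distribution of a single item, and analyses along the time, item ID, item category, life-cycle, store, combined-time and price dimensions. By examining different time granularities, item categories, stores and prices, it brings out key information such as sales patterns, seasonal trends and product introduction cycles.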

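Part 0: Reading the Data

0.1 Data Overview - Data Files:

0.2 Loading the Data

Part I: Spot-Checking the Sales Distribution of a Single Item

Part II: Time Dimension - Distributions at Different Time Granularities

Part III: Item ID Dimension - Sampling Several Random Items to View Sales and Identify Potential Features

Part IV: Item Category Dimension - Comparing Sales Across Categories - Totals and Distribution over Time

Part V: Life-Cycle Dimension - Analyzing Product Sales - New and Discontinued Items (SKU Counts)

Part VI: Store Dimension - Store Sales by Date

Part VII: Combined Time Dimension - Weekdays Unfolded by Date

Part VIII: Price Dimension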

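Following my earlier hands-on notes from Kaggle competitions, and drawing on the experience of other Kagglers, this write-up summarizes a data analysis for future reference.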

Competition link (the competition has ended; the data can still be downloaded): https://www.kaggle.com/c/m5-forecasting-accuracy

Part 0: Reading the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle
pd.set_option('display.max_columns', 50)  # full option name; the bare 'max_columns' is ambiguous in recent pandas
plt.style.use('bmh')
color_pal = plt.rcParams['axes.prop_cycle'].by_key()['color']
color_cycle = cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])

0.1 Data Overview - Data Files:

Raw data files:

calendar.csv - Contains information about the dates on which the products are sold.
sales_train_validation.csv - Contains the historical daily unit sales data per product and store [d_1 - d_1913]
sample_submission.csv - The correct format for submissions. Reference the Evaluation tab for more info.
sell_prices.csv - Contains information about the price of the products sold per store and date.

Forecast target

We are trying to forecast sales for 28 days. The sample submission has the following format:

The columns represent 28 forecast days. We will fill these forecast days with our predictions.
The rows each represent a specific item. This id tells us the item type, state, and store. We don't know what these items are exactly.
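
As a rough illustration of that layout, the sketch below builds a placeholder submission frame with one row per id and 28 forecast columns. The column names `F1` to `F28`, the two example ids and the all-zero predictions are assumptions made for illustration; the header of `sample_submission.csv` is the authority on the real format.

```python
import pandas as pd

HORIZON = 28  # number of forecast days stated above

# Hypothetical ids for illustration; the real ones come from sample_submission.csv
ids = ['FOODS_3_090_CA_3_validation', 'HOBBIES_1_001_CA_1_validation']

# Assumed column names F1..F28, one per forecast day
forecast_cols = [f'F{i}' for i in range(1, HORIZON + 1)]

# Placeholder predictions (all zeros) in a wide, one-row-per-id layout
submission = pd.DataFrame(0.0, index=ids, columns=forecast_cols)
submission.index.name = 'id'
submission = submission.reset_index()

print(submission.shape)
```

Real predictions would replace the zeros before writing the frame out with `submission.to_csv(..., index=False)`.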

0.2 Loading the Data

# Read in the data
INPUT_DIR = '../input/m5-forecasting-accuracy'
cal = pd.read_csv(f'{INPUT_DIR}/calendar.csv')
stv = pd.read_csv(f'{INPUT_DIR}/sales_train_validation.csv')
ss = pd.read_csv(f'{INPUT_DIR}/sample_submission.csv')
sellp = pd.read_csv(f'{INPUT_DIR}/sell_prices.csv')

We are given historical sales data in the `sales_train_validation` dataset.
- Columns exist in this dataset for days d_1 to d_1913. Each row also gives the department, category, state, and store id of the item.
- d_1914 - d_1941 are the `validation` days, which we will predict in stage 1.
- d_1942 - d_1969 are the `evaluation` days, which we will predict for the final competition standings.
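
To make the three day ranges concrete, here is a small sketch that generates the corresponding `d_` labels and checks that each forecast window spans 28 days. The helper name `day_labels` is hypothetical and not part of the competition data or the original notebook.

```python
def day_labels(first, last):
    """Return the labels 'd_<first>' .. 'd_<last>', inclusive."""
    return [f'd_{i}' for i in range(first, last + 1)]

train_days = day_labels(1, 1913)          # columns present in sales_train_validation.csv
validation_days = day_labels(1914, 1941)  # stage 1 forecast window
evaluation_days = day_labels(1942, 1969)  # final forecast window

print(len(train_days), len(validation_days), len(evaluation_days))
# prints: 1913 28 28
```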

Part I: Spot-Checking the Sales Distribution of a Single Item

1. Helps form a first impression

Visualizing the data for a single item

Let's take a random item that sells a lot and see how its sales look across the training data.
FOODS_3_090_CA_3_validation sells a lot.
Note that there are days where the item appears to be unavailable and sales flatline.

d_cols = [c for c in stv.columns if 'd_' in c] # sales data columns

# Below we are chaining the following steps in pandas:
# 1. Select the item.
# 2. Set the id as the index, Keep only sales data columns
# 3. Transpose so the days become rows and the item becomes a single column
# 4. Plot the data
stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'] \
    .set_index('id')[d_cols] \
    .T \
    .plot(figsize=(15, 5),
          title='FOODS_3_090_CA_3 sales by "d" number',
          color=next(color_cycle))
plt.legend(['Sales'])  # pass a list; a bare string is split into one label per character
plt.show()
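
The flatline periods noted above can also be located numerically instead of by eye. The following is a minimal sketch on made-up numbers: it reports each run of consecutive zero-sales days. The function name `zero_runs` and the three-day threshold are assumptions chosen for illustration, not something taken from the original notebook.

```python
from itertools import groupby

def zero_runs(sales, min_len=3):
    """Return (start_position, length) for each run of at least min_len consecutive zeros."""
    runs = []
    pos = 0
    for is_zero, group in groupby(sales, key=lambda v: v == 0):
        length = len(list(group))
        if is_zero and length >= min_len:
            runs.append((pos, length))
        pos += length
    return runs

# Made-up daily sales for a single item
toy_sales = [5, 7, 0, 0, 0, 0, 6, 0, 4, 0, 0, 0]
print(zero_runs(toy_sales))
# prints: [(2, 4), (9, 3)]
```

For a real item, the list passed in could be the row of `d_cols` values selected from `stv`; positions are then zero-based offsets from d_1.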

Merging the data with real dates

We are given a calendar with additional information about past and future dates.
The calendar data can be merged with our days data
From this we can find weekly and annual trends

cal[['d', 'date', 'event_name_1', 'event_name_2', 'event_type_1', 'event_type_2', 'snap_CA']].head()
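
As a self-contained illustration of the calendar merge, the sketch below uses toy stand-ins for the two files rather than the real data. It reshapes the wide day columns into long form with `melt` and joins on `d`, which is an alternative to the transpose-based approach used in this article. The frame names, the item id `ITEM_A` and the values are made up.

```python
import pandas as pd

# Toy stand-in for calendar.csv (only the columns needed here)
toy_cal = pd.DataFrame({
    'd': ['d_1', 'd_2', 'd_3'],
    'date': ['2011-01-29', '2011-01-30', '2011-01-31'],
})

# Toy stand-in for one row of sales_train_validation.csv
toy_sales = pd.DataFrame({
    'id': ['ITEM_A'],
    'd_1': [3], 'd_2': [0], 'd_3': [5],
})

# Wide -> long: one row per (id, d)
long_sales = toy_sales.melt(id_vars='id', var_name='d', value_name='sales')

# Attach the calendar date; validate='m:1' checks that 'd' is unique in the calendar
merged = long_sales.merge(toy_cal, on='d', how='left', validate='m:1')
merged['date'] = pd.to_datetime(merged['date'])

print(merged)
```

With real dates attached, weekly and annual patterns can be examined through `merged['date'].dt.dayofweek` or `merged['date'].dt.month`.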

# Example 1: FOODS_3_090_CA_3_validation
# Merge calendar on our items' data
example = stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'][d_cols].T # The index and col_names become index
example = example.rename(columns={8412:'FOODS_3_090_CA_3'}) # Name it correctly
example = example.reset_index().rename(columns={'index': 'd'}) # make the index "d"
example =