Task
Use machine learning techniques such as deep learning and reinforcement learning to forecast each store's sales over the next 12 weeks for a retail chain, so as to better understand store performance and better manage inventory allocation.
Data Overview and Quick Exploration
The data consists of three CSV files: store.csv, train.csv, and test.csv. store.csv describes each store; its fields are listed below.
Field | Description |
---|---|
商店ID | Store ID, the unique identifier of the store |
商店模式 | Store mode; four values: a, b, c, d |
商店级别 | Store level; three values: a, b, c (a = basic, b = extra, c = extended) |
竞争者最近距离 | Distance from the store to its nearest competitor |
The store table contains 1115 stores in total. For each store it records the operating mode (direct-sale, franchise, management contract, or strategic alliance), the level (basic, extra, or extended), and the distance to the nearest competitor, which roughly indicates the intensity of local competition: in general, the farther away the competitor, the lower the competitive pressure, and the closer, the higher. This pressure is especially pronounced for low-level direct-sale stores.
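The quick checks described above (store count, mode and level distribution, competitor-distance spread) can be sketched with pandas. A minimal sketch; the helper name `summarize_store` is illustrative, and in practice you would pass `pd.read_csv("store.csv")` instead of an inline frame:

```python
import pandas as pd

def summarize_store(store: pd.DataFrame) -> dict:
    """Quick profile of the store table: store count, mode/level distribution, distance quartiles."""
    return {
        'n_stores': store['商店ID'].nunique(),
        'mode_counts': store['商店模式'].value_counts().to_dict(),
        'level_counts': store['商店级别'].value_counts().to_dict(),
        'distance_quartiles': store['竞争者最近距离'].quantile([0.25, 0.5, 0.75]).to_dict(),
    }

# Usage with an inline toy frame (replace with pd.read_csv("store.csv")):
toy = pd.DataFrame({'商店ID': [1, 2, 3], '商店模式': ['a', 'a', 'b'],
                    '商店级别': ['a', 'c', 'c'], '竞争者最近距离': [10.0, 20.0, 30.0]})
print(summarize_store(toy))
```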
train.csv and test.csv are used for model training and for generating the forecasts, respectively. Their fields are listed below.
Field | Description |
---|---|
商店ID | Store ID, the unique identifier of the store |
年 | Year |
周 | Week number within the year |
营业天数 | Number of days the store was open that week |
打折天数 | Number of days with discount promotions that week |
非节日 | Number of non-holiday days that week |
节日A | Number of days of holiday type A that week |
节日B | Number of days of holiday type B that week |
节日C | Number of days of holiday type C that week |
周销量 | Weekly sales (the ground-truth label) |
The train and test tables show the actual sales trends of different stores over time, from which we can analyze the impact of promotions such as discounts, as well as the impact of holidays, on sales.
Correlation Exploration
The correlation matrix shows that promo days, open days, and non-holiday days have a strong, positive effect on weekly sales, while holidays A, B, and C and competitor distance have a weak, negative effect.
Solution Approach
Sales forecasting is a core algorithm in the supply-chain domain, so the design should aim for high accuracy, fast response, and robust operation. We need to forecast 12 weeks of sales for every store, and with a moderate number of stores the output alone comes to at least 1115 × 12 ≈ 13,000 forecast values. Conventional time-series models struggle with this kind of multi-entity, multi-period forecasting, so traditional machine learning and deep learning methods are the better fit.
Traditional Machine Learning
For structured data like this, traditional machine learning works very well; it just requires a bit more feature engineering, such as handling high-cardinality and low-cardinality categorical variables differently.
Code
# -*- encoding: utf-8 -*-
'''
@Project : sales_train
@Desc    : chain-store sales forecasting
@Time    : 2023/02/02 15:19:28
@Author  : 帅帅de三叔,zengbowengood@163.com
'''
import math
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
import pickle

store = pd.read_csv(r"D:\项目\商简智能\回归预测题目\store.csv")
train = pd.read_csv(r"D:\项目\商简智能\回归预测题目\train.csv")
train_df = pd.merge(left=train, right=store, on='商店ID', how='left')
train_df = train_df.query("周销量>0")  # drop zero-sales weeks (required for RMSPE)
train_df['商店ID'] = train_df['商店ID'].astype('str')
train_df['年'] = train_df['年'].astype('str')
train_df['周'] = train_df['周'].astype('str')
train_df['节日A'] = train_df['节日A'].astype('bool')
train_df['节日B'] = train_df['节日B'].astype('bool')
train_df['节日C'] = train_df['节日C'].astype('bool')
train_df.info()  # concise data summary
X, y = train_df.drop(['周销量'], axis=1), train_df.周销量  # features and target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=0)  # train/test split
less_cat_col = [col_name for col_name in X_train.columns if X_train[col_name].dtype == 'object' and X_train[col_name].nunique() < 10]  # low-cardinality categorical columns
more_cat_col = [col_name for col_name in X_train.columns if X_train[col_name].dtype == 'object' and X_train[col_name].nunique() >= 10]  # high-cardinality categorical columns
num_col = [col_name for col_name in X_train.columns if X_train[col_name].dtype in ['int64', 'float64']]  # numeric columns
print(less_cat_col, more_cat_col, num_col)
print(train_df.corr(numeric_only=True))  # correlation matrix over numeric columns
# sns.heatmap(train_df.corr(numeric_only=True))
# plt.show()
less_cat_transform = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                     ('encoder', OneHotEncoder(handle_unknown='ignore'))]
                              )  # low-cardinality: mode imputation, then one-hot encoding
more_cat_transform = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                     ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))]
                              )  # high-cardinality: mode imputation, then ordinal encoding
num_transform = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                                ('scaler', StandardScaler())]
                         )  # numeric: mean imputation, then standardization
preprocessor = ColumnTransformer(transformers=[('less_cat', less_cat_transform, less_cat_col),
                                               ('more_cat', more_cat_transform, more_cat_col),
                                               ('num', num_transform, num_col)]
                                 )  # bundle the preprocessing steps
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, max_depth=9, min_samples_leaf=3, random_state=0)  # model initialization
pipe = Pipeline(steps=[('preprocessing', preprocessor),
                       ('model', model)]
                )
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
RMSPE = math.sqrt(sum(((t - p) / t) ** 2 for t, p in zip(y_test, y_pred)) / len(y_test))  # root mean square percentage error
score = pipe.score(X_test, y_test)
print("Root Mean Square Percentage Error: {}, and model score: {}".format(RMSPE, score))
with open(r"D:\项目\商简智能\回归预测题目\sales_predict.pickle", "wb") as model_file:  # persist the fitted pipeline
    pickle.dump(pipe, model_file)
# Optional: grid search for hyperparameters
# model = GradientBoostingRegressor(random_state=0)
# pipe = Pipeline(steps=[('preprocessing', preprocessor),
#                        ('model', model)]
#                 )
# params = {
#     'model__n_estimators': [100, 200, 300],
#     'model__learning_rate': [0.01, 0.05, 0.1, 0.2],
#     'model__max_depth': [3, 5, 7, 9],
#     'model__max_features': [9, 11, 14],
#     'model__min_samples_leaf': [1, 2, 3]
# }
# gs = GridSearchCV(pipe, param_grid=params)
# gs.fit(X_train, y_train)
# print(gs.best_params_)
# y_pred = gs.best_estimator_.predict(X_test)
# RMSPE = math.sqrt(sum(((t - p) / t) ** 2 for t, p in zip(y_test, y_pred)) / len(y_test))  # RMSPE
# score = gs.score(X_test, y_test)
# print("Root Mean Square Percentage Error: {}, and model score: {}".format(RMSPE, score))
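The task also requires generating forecasts for test.csv with the saved pipeline. A minimal sketch of that step; the `prepare_test` helper name and the relative file paths are illustrative, not part of the original script, and the dtype conversions mirror those applied to the training data:

```python
import pandas as pd
import pickle

def prepare_test(test: pd.DataFrame, store: pd.DataFrame) -> pd.DataFrame:
    """Merge store attributes onto the test rows and apply the same dtype conversions as training."""
    df = pd.merge(left=test, right=store, on='商店ID', how='left')
    for col in ['商店ID', '年', '周']:
        df[col] = df[col].astype('str')      # treated as categorical by the pipeline
    for col in ['节日A', '节日B', '节日C']:
        df[col] = df[col].astype('bool')
    return df

# Usage (paths illustrative; the fitted pipeline handles imputation/encoding):
# store = pd.read_csv("store.csv"); test = pd.read_csv("test.csv")
# with open("sales_predict.pickle", "rb") as f:
#     pipe = pickle.load(f)
# test_df = prepare_test(test, store)
# test_df['周销量'] = pipe.predict(test_df)
# test_df.to_csv("sales_forecast.csv", index=False)
```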
Evaluation
Based on past business experience, the Root Mean Square Percentage Error (RMSPE) is a suitable evaluation metric. The formula is:
$$RMSPE=\sqrt{\frac{1}{n} \sum_{i=1}^n\left(\frac{y_i-\hat{y}_i}{y_i}\right)^2}$$
- where $y_i$ is the store's actual sales, $\hat{y}_i$ the corresponding predicted sales, and $n$ the number of samples.
- If the actual sales are 0, that record is skipped when computing the error.
- The smaller the RMSPE, the smaller the error and the higher the score.
The model score and RMSPE on the test set for this case are as follows:
Root Mean Square Percentage Error: 0.10308496736657641, and model score: 0.9485827177221499
Deep Learning
Unlike traditional machine learning, deep learning needs far less feature engineering: the data simply has to be reshaped into the required input format, and the custom loss function defined.
Code
# -*- encoding: utf-8 -*-
'''
@Project : sales
@Desc    : PyTorch-based deep-learning sales forecasting
@Time    : 2023/02/07 16:30:14
@Author  : 帅帅de三叔,zengbowengood@163.com
'''
import math
import torch
from torch import nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler  # for standardization

store = pd.read_csv(r"D:\项目\商简智能\回归预测题目\store.csv")
train = pd.read_csv(r"D:\项目\商简智能\回归预测题目\train.csv")
train_df = pd.merge(left=train, right=store, on='商店ID', how='left')
train_df = train_df.query("周销量>0")
train_df.dropna(how='any', inplace=True)
for col in ['商店ID', '年', '周', '节日A', '节日B', '节日C']:
    train_df[col] = train_df[col].astype('float64')
train_df['商店模式'] = LabelEncoder().fit_transform(train_df['商店模式'])  # encode category letters as integers
train_df['商店级别'] = LabelEncoder().fit_transform(train_df['商店级别'])
train_df = train_df[["商店ID", "年", "周", "营业天数", "打折天数", "非节日", "节日A", "节日B", "节日C", "商店模式", "商店级别", "竞争者最近距离", "周销量"]]
X, y = train_df.drop(['周销量'], axis=1), train_df.周销量  # features and target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=0)  # train/test split
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
X_train, X_test = X_train.values, X_test.values
y_train, y_test = y_train.values.reshape(-1, 1), y_test.values.reshape(-1, 1)
X_train = torch.from_numpy(X_train).type(torch.FloatTensor)
X_test = torch.from_numpy(X_test).type(torch.FloatTensor)
y_train = torch.from_numpy(y_train).type(torch.FloatTensor)
y_test = torch.from_numpy(y_test).type(torch.FloatTensor)
training_data = TensorDataset(X_train, y_train)
test_data = TensorDataset(X_test, y_test)
train_dataloader = DataLoader(training_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

class Net(nn.Module):
    def __init__(self, features):
        super(Net, self).__init__()
        self.h1 = nn.Linear(features, 30, bias=True)
        self.a1 = nn.ReLU()
        self.h2 = nn.Linear(30, 10)
        self.a2 = nn.ReLU()
        self.regression = nn.Linear(10, 1)

    def forward(self, x):
        x = self.h1(x)
        x = self.a1(x)
        x = self.h2(x)
        x = self.a2(x)
        y_pred = self.regression(x)
        return y_pred

class CustomLoss(nn.Module):  # custom RMSPE-style loss
    def __init__(self):
        super(CustomLoss, self).__init__()

    def forward(self, pred, target):
        # divide by the true value, consistent with the RMSPE definition
        loss = torch.sqrt(torch.sum(torch.pow((pred - target) / target, 2)) / len(pred))
        return loss

epochs = 201
model = Net(features=X_train.shape[1])
# criterion = nn.MSELoss(reduction='mean')
criterion = CustomLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_loss = []
test_loss = []
for epoch in range(epochs):
    model.train()
    for xb, yb in train_dataloader:
        pred = model(xb)
        loss = criterion(pred, yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if epoch % 10 == 0:
        model.eval()
        with torch.no_grad():
            train_epoch_loss = sum(criterion(model(xb), yb) for xb, yb in train_dataloader)
            test_epoch_loss = sum(criterion(model(xb), yb) for xb, yb in test_dataloader)
            train_loss.append(train_epoch_loss.item() / len(train_dataloader))
            test_loss.append(test_epoch_loss.item() / len(test_dataloader))
            template = "epoch:{:2d}, train loss:{:.5f}, test loss:{:.5f}"
            print(template.format(epoch, train_epoch_loss.item() / len(train_dataloader), test_epoch_loss.item() / len(test_dataloader)))
print('Training finished')
fig = plt.figure(figsize=(6, 4))
plt.plot(range(len(train_loss)), train_loss, label="train_loss")
plt.plot(range(len(test_loss)), test_loss, label="test_loss")
plt.legend()
plt.show()
torch.save(model.state_dict(), "./model_parameter.pkl")
print("Saved pytorch model state to model_parameter.pkl")
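To use the saved weights later, rebuild the same architecture and load the state dict. A minimal round-trip sketch; the 12-feature input size matches the feature table used in training, the file name follows the training script, and the sample input is illustrative:

```python
import torch
from torch import nn

class Net(nn.Module):  # same architecture as the training script
    def __init__(self, features):
        super(Net, self).__init__()
        self.h1 = nn.Linear(features, 30, bias=True)
        self.a1 = nn.ReLU()
        self.h2 = nn.Linear(30, 10)
        self.a2 = nn.ReLU()
        self.regression = nn.Linear(10, 1)

    def forward(self, x):
        return self.regression(self.a2(self.h2(self.a1(self.h1(x)))))

model = Net(features=12)
torch.save(model.state_dict(), "./model_parameter.pkl")       # as done at the end of training
reloaded = Net(features=12)                                    # must match the saved architecture
reloaded.load_state_dict(torch.load("./model_parameter.pkl"))
reloaded.eval()                                                # inference mode
with torch.no_grad():
    sample = torch.randn(1, 12)                                # one illustrative feature row
    print(reloaded(sample).shape)                              # torch.Size([1, 1])
```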
Evaluation
Plot the loss curves for both the training set and the test set.
Recommendations
For structured data like this, traditional machine learning already performs very well, and deep learning is not necessarily better. Data that carries both ordinary covariates and a temporal component is essentially spatio-temporal, so spatio-temporal graph network models are worth considering; in the end, the specific problem dictates the approach.