Chain Store Sales Forecasting

Task

Use machine learning techniques such as deep learning and reinforcement learning to forecast each store's sales over the next 12 weeks for a retail chain, so as to get a better handle on store operations and better manage inventory allocation.

Data Description and Quick Exploration

The data consists of three CSV files: store.csv, train.csv, and test.csv. store.csv describes each store; the fields it contains are as follows.

Field name: description
商店ID: the unique identifier of the store
商店模式: store operating mode; four values: a, b, c, d
商店级别: store level; three values: a, b, c (note: a = basic, b = extra, c = extended)
竞争者最近距离: distance from the store to its nearest competitor

The store table contains 1,115 stores. For each store it records the operating mode (direct-sale, franchise, management contract, or strategic alliance), the store level (basic, extra, or extended), and the distance to the nearest competitor, which gives a rough sense of competitive intensity: generally, the farther away the nearest competitor, the lower the competitive pressure, and the closer it is, the higher; this competition is especially pronounced for low-level direct-sale stores.
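This kind of store-table exploration can be sketched with pandas. The frame below is a tiny synthetic stand-in following the store.csv schema described above (all values invented):

```python
import pandas as pd

# Synthetic stand-in for store.csv, following the schema above (values invented)
store = pd.DataFrame({
    "商店ID": [1, 2, 3, 4],
    "商店模式": ["a", "b", "a", "d"],
    "商店级别": ["a", "c", "b", "a"],
    "竞争者最近距离": [120.0, 3500.0, None, 60.0],
})

# Distribution of operating modes and levels
mode_counts = store["商店模式"].value_counts()
level_counts = store["商店级别"].value_counts()
print(mode_counts)
print(level_counts)

# Nearest-competitor distance as a rough proxy for competitive pressure
print(store["竞争者最近距离"].describe())
```

On the real store.csv the same calls give the 1,115-store breakdown by mode and level, plus summary statistics of the competitor distance.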

train.csv and test.csv are used for model training and for generating the predictions, respectively. They contain the following fields.

Field name: description
商店ID: the unique identifier of the store
年: the year
周: the week number within the year
营业天数: number of days the store was open that week
打折天数: number of days with discount promotions that week
非节日: number of non-holiday days that week
节日A: number of days of holiday type A that week
节日B: number of days of holiday type B that week
节日C: number of days of holiday type C that week
周销量: sales for that week (i.e., the ground-truth label)

The train and test tables show the actual sales trajectories of different stores over time, from which we can analyze how promotions such as discounts affect sales, and how holidays affect sales.
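For instance, the effect of discounts on sales can be gauged with a simple groupby. This is a minimal sketch on synthetic rows in the train.csv schema (sales figures invented):

```python
import pandas as pd

# Synthetic rows in the train.csv schema (sales figures invented)
train = pd.DataFrame({
    "商店ID": [1, 1, 2, 2],
    "年": [2014, 2014, 2014, 2014],
    "周": [1, 2, 1, 2],
    "打折天数": [0, 5, 0, 3],
    "周销量": [8000.0, 11000.0, 6000.0, 7500.0],
})

# Average weekly sales in weeks with vs. without discount days
mean_sales = train.groupby(train["打折天数"] > 0)["周销量"].mean()
print(mean_sales)
```

The same comparison against the holiday columns (节日A, 节日B, 节日C) gives a first read on holiday effects.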

Correlation Exploration

Correlation matrix
From the correlation matrix plot, discount days, open days, and non-holiday days have a large, positive effect on weekly sales, while holidays A, B, and C and the competitor distance have a small, negative effect.
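A correlation matrix of this kind comes straight from DataFrame.corr(). The sketch below uses synthetic data in which open days and discount days are given an invented positive effect on sales, so the signs in this toy matrix are assumptions, not the real data's values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
open_days = rng.integers(4, 8, n)      # 营业天数 (4-7 days open per week)
discount_days = rng.integers(0, 6, n)  # 打折天数 (0-5 discount days per week)
# Invented relationship: sales rise with open days and discount days, plus noise
sales = 1000 * open_days + 400 * discount_days + rng.normal(0, 500, n)

df = pd.DataFrame({"营业天数": open_days, "打折天数": discount_days, "周销量": sales})
corr = df.corr()
print(corr["周销量"])  # correlations of each column with weekly sales
```

Feeding `corr` to seaborn's heatmap, as in the commented-out lines of the code below, reproduces the kind of figure described here.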

Solution Approach

Sales forecasting is a core algorithm in the supply-chain domain, so the design should aim for high accuracy, fast response, and robust operation. Since we must forecast the next 12 weeks of sales for every store, and the number of stores is moderate, the number of predictions to generate is at least 1115 × 12 = 13,380, roughly 13,000. Conventional univariate time-series methods struggle with this kind of multi-entity, multi-horizon forecasting, so traditional machine learning and deep learning methods are more appropriate.

Traditional Machine Learning Approach

For structured data like this, traditional machine learning still works very nicely; it just requires a bit more feature engineering, such as splitting the categorical variables into high-cardinality and low-cardinality groups.

Code

# -*- encoding: utf-8 -*-
'''
@Project :   sales_train
@Desc    :   Chain store sales forecasting
@Time    :   2023/02/02 15:19:28
@Author  :   帅帅de三叔,zengbowengood@163.com
'''
import math
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['Simhei'] 
plt.rcParams['axes.unicode_minus']=False
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
import pickle

store = pd.read_csv(r"D:\项目\商简智能\回归预测题目\store.csv")
train = pd.read_csv(r"D:\项目\商简智能\回归预测题目\train.csv")
train_df = pd.merge(left=train, right=store, left_on='商店ID', right_on='商店ID', how='left')

train_df = train_df.query("周销量>0")
train_df['商店ID'] = train_df['商店ID'].astype('str')
train_df['年'] = train_df['年'].astype('str')
train_df['周'] = train_df['周'].astype('str')
train_df['节日A'] = train_df['节日A'].astype('bool')
train_df['节日B'] = train_df['节日B'].astype('bool')
train_df['节日C'] = train_df['节日C'].astype('bool')
print(train_df.info())  # info is a method; calling it prints the dtype/null summary
X, y =  train_df.drop(['周销量'], axis=1), train_df.周销量 # features and target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle=True, random_state=0) # train/test split
less_cat_col = [col_name for col_name in X_train.columns if X_train[col_name].dtype=='object' and X_train[col_name].nunique()<10] # low-cardinality categorical columns
more_cat_col = [col_name for col_name in X_train.columns if X_train[col_name].dtype=='object' and X_train[col_name].nunique()>=10] # high-cardinality categorical columns
num_col = [col_name for col_name in X_train.columns if X_train[col_name].dtype in ['int64', 'float64']] # numeric columns
print(less_cat_col, more_cat_col, num_col)
print(train_df.corr(numeric_only=True))  # numeric_only avoids errors on the string columns in pandas >= 2.0
# sns.heatmap(train_df.corr(numeric_only=True))
# plt.show()

less_cat_transform = Pipeline(steps = [('imputer', SimpleImputer(strategy='most_frequent')),
                                ('encoder', OneHotEncoder(handle_unknown='ignore'))]
                        ) # low-cardinality categoricals: impute with the mode, then one-hot encode
more_cat_transform = Pipeline(steps = [('imputer', SimpleImputer(strategy='most_frequent')),
                                ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))]
                        ) # high-cardinality categoricals: impute with the mode, then ordinal-encode

num_transform = Pipeline(steps = [('imputer', SimpleImputer(strategy='mean')),
                            ('scaler', StandardScaler())]
                    ) # numeric columns: impute with the mean, then standardize
preprocessor = ColumnTransformer(transformers = [('less_cat', less_cat_transform, less_cat_col),
                                        ('more_cat', more_cat_transform, more_cat_col),
                                    ('num', num_transform, num_col)]
                            ) # bundle the preprocessing steps together
model = GradientBoostingRegressor(n_estimators = 500, learning_rate = 0.05, max_depth = 9,  min_samples_leaf= 3, random_state=0) # model initialization
pipe = Pipeline(steps=[('preprocessing', preprocessor),
                ('model', model)]
            )
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

RMSPE = math.sqrt(sum([((x - y)/x) ** 2 for x, y in zip(y_test, y_pred)]) / len(y_test)) # root mean square percentage error (x = actual, y = predicted)
score = pipe.score(X_test, y_test)
print("Root Mean Square Percentage Error: {}, and model score: {}".format(RMSPE, score))
with open(r"D:\项目\商简智能\回归预测题目\sales_predict.pickle", "wb") as model_file: # save the model
    pickle.dump(pipe, model_file)

# model = GradientBoostingRegressor(random_state=0) # untuned model for the grid search
# pipe = Pipeline(steps=[('preprocessing', preprocessor),
#                 ('model', model)]
#             )
# params = {
#     'model__n_estimators':[100, 200, 300],
#     'model__learning_rate':[0.01, 0.05, 0.1, 0.2],
#     'model__max_depth': [3, 5, 7, 9,],
#     'model__max_features':[9, 11, 14],
#     'model__min_samples_leaf': [1, 2, 3]
# }
# gs = GridSearchCV(pipe, param_grid = params)
# gs.fit(X_train, y_train)
# print(gs.best_params_)
# y_pred = gs.best_estimator_.predict(X_test)
# RMSPE = math.sqrt(sum([((x - y)/x) ** 2 for x, y in zip(y_test, y_pred)]) / len(y_test)) # root mean square percentage error
# score = gs.score(X_test, y_test)
# print("Root Mean Square Percentage Error: {}, and model score: {}".format(RMSPE, score))

Evaluation

Based on prior business experience, the Root Mean Square Percentage Error (RMSPE) makes a good evaluation metric; the formula is:
$$RMSPE=\sqrt{\frac{1}{n} \sum_{i=1}^n\left(\frac{y_i-\hat{y}_i}{y_i}\right)^2}$$

  • where $y_i$ is the store's actual sales, $\hat{y}_i$ the corresponding predicted sales, and $n$ the number of samples.

  • If the actual sales are 0, that record can be skipped when computing the error.

  • The smaller the RMSPE, the smaller the error and the better the score.
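Translated directly into Python (skipping records whose true value is 0, per the rule above), the metric might look like:

```python
import math

def rmspe(y_true, y_pred):
    """Root mean square percentage error; records with a true value of 0 are skipped."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return math.sqrt(sum(((t - p) / t) ** 2 for t, p in pairs) / len(pairs))

# The third record has true sales of 0 and is ignored
print(rmspe([100, 200, 0], [110, 180, 50]))
```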

The model score and RMSPE on the test set for this case are as follows:

Root Mean Square Percentage Error: 0.10308496736657641, and model score: 0.9485827177221499

Deep Learning Approach

Unlike traditional machine learning, deep learning does not need as much feature engineering: just reshape the data into the required input format, and define the custom loss function.

Code

# -*- encoding: utf-8 -*-
'''
@Project :   sales
@Desc    :   PyTorch-based deep-learning sales forecasting
@Time    :   2023/02/07 16:30:14
@Author  :   帅帅de三叔,zengbowengood@163.com
'''

import math
import torch
from torch import nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler # data standardization

store = pd.read_csv(r"D:\项目\商简智能\回归预测题目\store.csv")
train = pd.read_csv(r"D:\项目\商简智能\回归预测题目\train.csv")
train_df = pd.merge(left=train, right=store, left_on='商店ID', right_on='商店ID', how='left')
train_df = train_df.query("周销量>0")
train_df.dropna(how='any', inplace=True)
train_df['商店ID'] = train_df['商店ID'].astype('float64')
train_df['年'] = train_df['年'].astype('float64')
train_df['周'] = train_df['周'].astype('float64')
train_df['节日A'] = train_df['节日A'].astype('float64')
train_df['节日B'] = train_df['节日B'].astype('float64')
train_df['节日C'] = train_df['节日C'].astype('float64')
train_df['商店模式'] = LabelEncoder().fit_transform(train_df['商店模式']) # encode the store mode
train_df['商店级别'] = LabelEncoder().fit_transform(train_df['商店级别']) # encode the store level
train_df = train_df[["商店ID", "年",  "周",  "营业天数",  "打折天数",  "非节日",  "节日A",  "节日B",  "节日C",  "商店模式",  "商店级别",  "竞争者最近距离",  "周销量"]]

X, y =  train_df.drop(['周销量'], axis=1), train_df.周销量 # features and target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle=True, random_state=0) # train/test split
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
X_train, X_test = X_train.values, X_test.values
y_train, y_test = y_train.values.reshape(-1, 1), y_test.values.reshape(-1, 1)

X_train  = torch.from_numpy(X_train).type(torch.FloatTensor)
X_test = torch.from_numpy(X_test).type(torch.FloatTensor)
y_train = torch.from_numpy(y_train).type(torch.FloatTensor)
y_test = torch.from_numpy(y_test).type(torch.FloatTensor)

training_data = TensorDataset(X_train, y_train)
test_data = TensorDataset(X_test, y_test)
train_dataloader = DataLoader(training_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

class Net(nn.Module):      
    def __init__(self, features):
        super(Net, self).__init__() 
        self.h1 = nn.Linear(features, 30, bias= True)
        self.a1 = nn.ReLU()
        self.h2 = nn.Linear(30, 10)
        self.a2 = nn.ReLU()
        self.regression = nn.Linear(10, 1)
        
    def forward(self, x): 
        x = self.h1(x)
        x = self.a1(x)
        x = self.h2(x)
        x = self.a2(x)
        y_pred = self.regression(x)
        return y_pred

class CustomLoss(nn.Module): # custom RMSPE loss
    def __init__(self):
        super(CustomLoss, self).__init__()

    def forward(self, pred, target):
        # divide by the true value, matching the RMSPE definition (targets were filtered to be > 0)
        loss = torch.sqrt(torch.mean(torch.pow((target - pred) / target, 2)))
        return loss

epochs = 201
model = Net(features=X_train.shape[1])
# criterion = nn.MSELoss(reduction='mean')
criterion = CustomLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
losses = []

train_loss = []
train_acc = []

test_loss = []
test_acc = []


for epoch in range(epochs):
    model.train()
    for xb, yb in train_dataloader:
        pred = model(xb)
        loss = criterion(pred, yb)
        # loss.requires_grad_(True) 
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    if epoch%10==0:
        model.eval()
        with torch.no_grad():
            train_epoch_loss = sum(criterion(model(xb), yb) for xb, yb in train_dataloader)
            test_epoch_loss = sum(criterion(model(xb), yb) for xb, yb in test_dataloader)
        train_loss.append(train_epoch_loss.data.item() / len(train_dataloader))
        test_loss.append(test_epoch_loss.data.item() / len(test_dataloader))
        template = ("epoch:{:2d}, train loss:{:.5f}, test loss:{:.5f}")
        print(template.format(epoch, train_epoch_loss.data.item() / len(train_dataloader), test_epoch_loss.data.item() / len(test_dataloader)))
print('Training complete')

fig = plt.figure(figsize = (6,4))
plt.plot(range(len(train_loss)), train_loss, label = "train_loss")
plt.plot(range(len(test_loss)), test_loss, label = "test_loss")
plt.legend()
plt.show()

torch.save(model.state_dict(), "./model_parameter.pkl")
print("Saved pytorch model state to model_parameter.pkl")

评估

Plot the loss curves for both the training set and the test set.
[Figure: training and test loss curves]

Recommendations

For structured data like this, traditional machine learning algorithms already perform very well, and deep learning is not necessarily stronger. Data that combines exogenous influencing factors with a temporal dimension is essentially spatio-temporal data, so spatio-temporal graph network algorithms are worth considering; the specific problem still calls for case-by-case analysis.

