本篇文章想通过历史的销量数据来预测未来几个月的数据,用以定目标和供运营、供应链人员备货需求参考,本文通过SARIMA模型来实现,预测下一个月的销量相对误差在17%,预测结果仅供参考,具体还得结合运营人员经验共同预测得出,后续也会尝试其他模型来提高预测准确性
一、导入需要的包
from __future__ import division
from datetime import datetime, timedelta, date
import pandas as pd
% matplotlib inline
import matplotlib. pyplot as plt
import numpy as np
from statsmodels. graphics. tsaplots import plot_pacf
from statsmodels. graphics. tsaplots import plot_acf
from statsmodels. tsa. stattools import adfuller
from statsmodels. stats. diagnostic import acorr_ljungbox
from statsmodels. tsa. statespace. sarimax import SARIMAX
import warnings
warnings. filterwarnings( "ignore" )
二、导入历史数据
df = pd. read_excel( "C:\\Users\\10237\\Desktop\\F7NP历史销量.xlsx" )
plt. figure( figsize= ( 10 , 5 ) )
df[ "sales" ] . plot( )
plt. legend( [ "sales" ] )
plt. show( )
三、自相关图
df[ "sales" ] = df[ "sales" ] . astype( "float" )
plot_acf( df[ "sales" ] ) . show( )
自相关系数
平稳序列通常具有短期相关性,即随着延迟期数k的增加,平稳序列的自相关系数会很快地衰减向零,而非平稳序列的自相关系数的衰减速度会比较慢,这就是我们利用自相关图判断平稳性的标准。 其横轴表示延迟期数,纵轴表示自相关系数,从图中可以看出自相关系数衰减到零的速度比较缓慢,在很长的延迟期内,自相关系数一直为正,然后为负,呈现出三角对称性,这是具有单调趋势的非平稳序列的一种典型的自相关图形式
四、平稳性检验
adfuller( df[ "sales" ] )
(1.8408412567719112,
0.9984266704565955,
8,
16,
{'1%': -3.9240193847656246, '5%': -3.0684982031250003, '10%': -2.67389265625},
283.93234809442)
P值为0.99大于显著性水平0.05,接受原假设(非平稳序列),说明原始序列是非平稳序列
五、一阶差分序列的检验
df1 = df. diff( periods= 1 , axis= 0 ) . dropna( )
df1 = df1[ "sales" ]
plt. figure( figsize= ( 10 , 5 ) )
df1. plot( )
plot_acf( df1) . show( )
adfuller( df1)
(0.6199434003434391,
0.9880970219599929,
7,
16,
{'1%': -3.9240193847656246, '5%': -3.0684982031250003, '10%': -2.67389265625},
267.98692610966367)
可见原始序列波动太大,直接差分意义不大,所以下面先取对数把数值控制在较小的波动区间内,再进行差分序列容易平稳
六、取对数进行差分
df[ "npsales" ] = np. log( df[ "sales" ] )
df[ "diffsales" ] = df[ "npsales" ] . diff( ) . dropna( )
df2 = df. drop( df. index[ 0 ] )
df
date sales npsales diffsales 0 2021-12-01 3972.0 8.287025 NaN 1 2022-01-01 2140.0 7.668561 -0.618464 2 2022-02-01 1884.0 7.541152 -0.127409 3 2022-03-01 2293.0 7.737616 0.196464 4 2022-04-01 1681.0 7.427144 -0.310472 5 2022-05-01 1788.0 7.488853 0.061709 6 2022-06-01 1742.0 7.462789 -0.026064 7 2022-07-01 2560.0 7.847763 0.384973 8 2022-08-01 2329.0 7.753194 -0.094568 9 2022-09-01 2309.0 7.744570 -0.008624 10 2022-10-01 3343.0 8.114624 0.370054 11 2022-11-01 6817.0 8.827175 0.712551 12 2022-12-01 8446.0 9.041448 0.214273 13 2023-01-01 6769.0 8.820109 -0.221340 14 2023-02-01 6538.0 8.785387 -0.034722 15 2023-03-01 8709.0 9.072112 0.286726 16 2023-04-01 9131.0 9.119430 0.047318 17 2023-05-01 12873.0 9.462887 0.343457 18 2023-06-01 15708.0 9.661925 0.199038 19 2023-07-01 29836.0 10.303471 0.641546 20 2023-08-01 17128.0 9.748470 -0.555001 21 2023-09-01 16098.0 9.686450 -0.062020 22 2023-10-01 20588.0 9.932464 0.246013 23 2023-11-01 36124.0 10.494713 0.562249 24 2023-12-01 36785.0 10.512845 0.018133
plt. figure( figsize= [ 15 , 7.5 ] )
plt. plot( df[ "diffsales" ] )
plot_acf( df2[ "diffsales" ] ) . show( )
adfuller( df2[ "diffsales" ] )
(-1.9799606891943407,
0.2954281636375788,
8,
15,
{'1%': -3.9644434814814815,
'5%': -3.0849081481481484,
'10%': -2.6818144444444445},
-2.6195422547466265)
P值0.29大于显著性水平0.05,接受原假设,一阶差分不平稳,继续二阶差分
df2[ "diff2sales" ] = df[ "diffsales" ] . diff( 1 )
df3 = df2. drop( [ 1 , 2 ] , axis= 0 ) . reset_index( drop= True )
df3
date sales npsales diffsales diff2sales 0 2022-03-01 2293.0 7.737616 0.196464 0.323872 1 2022-04-01 1681.0 7.427144 -0.310472 -0.506936 2 2022-05-01 1788.0 7.488853 0.061709 0.372181 3 2022-06-01 1742.0 7.462789 -0.026064 -0.087773 4 2022-07-01 2560.0 7.847763 0.384973 0.411037 5 2022-08-01 2329.0 7.753194 -0.094568 -0.479542 6 2022-09-01 2309.0 7.744570 -0.008624 0.085944 7 2022-10-01 3343.0 8.114624 0.370054 0.378679 8 2022-11-01 6817.0 8.827175 0.712551 0.342497 9 2022-12-01 8446.0 9.041448 0.214273 -0.498277 10 2023-01-01 6769.0 8.820109 -0.221340 -0.435613 11 2023-02-01 6538.0 8.785387 -0.034722 0.186618 12 2023-03-01 8709.0 9.072112 0.286726 0.321448 13 2023-04-01 9131.0 9.119430 0.047318 -0.239407 14 2023-05-01 12873.0 9.462887 0.343457 0.296139 15 2023-06-01 15708.0 9.661925 0.199038 -0.144419 16 2023-07-01 29836.0 10.303471 0.641546 0.442508 17 2023-08-01 17128.0 9.748470 -0.555001 -1.196547 18 2023-09-01 16098.0 9.686450 -0.062020 0.492982 19 2023-10-01 20588.0 9.932464 0.246013 0.308033 20 2023-11-01 36124.0 10.494713 0.562249 0.316236 21 2023-12-01 36785.0 10.512845 0.018133 -0.544116
plt. figure( figsize= [ 15 , 7.5 ] )
plt. plot( df3[ "diff2sales" ] )
plot_acf( df3[ "diff2sales" ] ) . show( )
adfuller( df3[ "diff2sales" ] )
(-4.377278173091498,
0.00032555635791855304,
9,
12,
{'1%': -4.137829282407408,
'5%': -3.1549724074074077,
'10%': -2.7144769444444443},
-26.776419186897854)
P值0.0003小于显著性0.05,拒绝原假设,说明二阶差分是平稳序列
七、白噪声检验
acorr_ljungbox( df3[ "diff2sales" ] , lags= [ 6 , 12 , 20 ] )
lb_stat lb_pvalue 6 17.270867 0.008338 12 30.720564 0.002173 20 52.256762 0.000104
P值均小于0.05,拒绝原假设,说明二阶差分序列非白噪声序列
八、拟合参数
plot_acf( df3[ 'diff2sales' ] ) . show( )
plot_pacf( df3[ 'diff2sales' ] , lags= 10 ) . show( )
import itertools
import warnings
p = q = range ( 0 , 4 )
d = range ( 0 , 4 )
pdq = list ( itertools. product( p, d, q) )
seasonal_pdq = [ ( x[ 0 ] , x[ 1 ] , x[ 2 ] , 12 ) for x in list ( itertools. product( p, d, q) ) ]
warnings. filterwarnings( "ignore" )
for param in pdq:
for param_seasonal in seasonal_pdq:
try :
mod = sm. tsa. statespace. SARIMAX( df3[ "diff2sales" ] ,
order= param,
seasonal_order= param_seasonal,
enforce_stationarity= False ,
enforce_invertibility= False )
results = mod. fit( )
print ( 'SARIMA{}x{} - AIC:{}' . format ( param, param_seasonal, results. aic) )
except :
continue
综上,我们得知SARIMA模型使AIC最小的参数组合为:SARIMA(0,0,0)(0,2,0)s s=12
best_model1 = SARIMAX( df3[ 'diff2sales' ] , order= ( 0 , 0 , 0 ) , seasonal_order= ( 0 , 2 , 0 , 12 ) ) . fit( dis= - 1 )
print ( best_model1. summary( ) )
SARIMAX Results
================================================================================
Dep. Variable: diff2sales No. Observations: 22
Model: SARIMAX(0, 2, 0, 12) Log Likelihood 0.000
Date: Wed, 07 Feb 2024 AIC 2.000
Time: 13:59:29 BIC nan
Sample: 0 HQIC nan
- 22
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
sigma2 1e-10 -0 -inf 0.000 1e-10 1e-10
===================================================================================
Ljung-Box (L1) (Q): nan Jarque-Bera (JB): nan
Prob(Q): nan Prob(JB): nan
Heteroskedasticity (H): nan Skew: nan
Prob(H) (two-sided): nan Kurtosis: nan
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number inf. Standard errors may be unstable.
df3[ 'sarima_model' ] = best_model1. fittedvalues
df3[ 'sarima_model' ] [ : 11 + 1 ] = np. NaN
forecast = best_model1. predict( start= df3. shape[ 0 ] , end= df3. shape[ 0 ] + 12 )
forecast = df3[ 'sarima_model' ] . append( forecast)
plt. figure( figsize= ( 15 , 7.5 ) )
plt. plot( forecast, color= 'r' , label= 'model' )
plt. axvspan( df3. index[ - 1 ] , forecast. index[ - 1 ] , alpha= 0.5 , color= 'lightgrey' )
plt. plot( df3[ 'diff2sales' ] , label= 'actual' )
plt. legend( )
plt. show( )
从预测值与实际值的曲线来看,误差不大相对吻合
forecast
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 0.485809
13 -0.760404
14 0.558271
15 -0.131659
16 0.616556
17 -0.719312
18 0.128916
19 0.568018
20 0.513745
21 -0.747416
22 -0.653420
23 0.279926
24 0.319023
25 0.028121
26 0.220096
27 -0.201065
28 0.473978
29 -1.913552
30 0.900020
31 0.237387
32 0.289975
33 -0.589955
34 -0.871226
dtype: float64
df4 = pd. concat( [ df3, forecast] , axis= 1 )
df4
date sales npsales diffsales diff2sales sarima_model 0 0 2022-03-01 2293.0 7.737616 0.196464 0.323872 NaN NaN 1 2022-04-01 1681.0 7.427144 -0.310472 -0.506936 NaN NaN 2 2022-05-01 1788.0 7.488853 0.061709 0.372181 NaN NaN 3 2022-06-01 1742.0 7.462789 -0.026064 -0.087773 NaN NaN 4 2022-07-01 2560.0 7.847763 0.384973 0.411037 NaN NaN 5 2022-08-01 2329.0 7.753194 -0.094568 -0.479542 NaN NaN 6 2022-09-01 2309.0 7.744570 -0.008624 0.085944 NaN NaN 7 2022-10-01 3343.0 8.114624 0.370054 0.378679 NaN NaN 8 2022-11-01 6817.0 8.827175 0.712551 0.342497 NaN NaN 9 2022-12-01 8446.0 9.041448 0.214273 -0.498277 NaN NaN 10 2023-01-01 6769.0 8.820109 -0.221340 -0.435613 NaN NaN 11 2023-02-01 6538.0 8.785387 -0.034722 0.186618 NaN NaN 12 2023-03-01 8709.0 9.072112 0.286726 0.321448 0.485809 0.485809 13 2023-04-01 9131.0 9.119430 0.047318 -0.239407 -0.760404 -0.760404 14 2023-05-01 12873.0 9.462887 0.343457 0.296139 0.558271 0.558271 15 2023-06-01 15708.0 9.661925 0.199038 -0.144419 -0.131659 -0.131659 16 2023-07-01 29836.0 10.303471 0.641546 0.442508 0.616556 0.616556 17 2023-08-01 17128.0 9.748470 -0.555001 -1.196547 -0.719312 -0.719312 18 2023-09-01 16098.0 9.686450 -0.062020 0.492982 0.128916 0.128916 19 2023-10-01 20588.0 9.932464 0.246013 0.308033 0.568018 0.568018 20 2023-11-01 36124.0 10.494713 0.562249 0.316236 0.513745 0.513745 21 2023-12-01 36785.0 10.512845 0.018133 -0.544116 -0.747416 -0.747416 22 NaT NaN NaN NaN NaN NaN -0.653420 23 NaT NaN NaN NaN NaN NaN 0.279926 24 NaT NaN NaN NaN NaN NaN 0.319023 25 NaT NaN NaN NaN NaN NaN 0.028121 26 NaT NaN NaN NaN NaN NaN 0.220096 27 NaT NaN NaN NaN NaN NaN -0.201065 28 NaT NaN NaN NaN NaN NaN 0.473978 29 NaT NaN NaN NaN NaN NaN -1.913552 30 NaT NaN NaN NaN NaN NaN 0.900020 31 NaT NaN NaN NaN NaN NaN 0.237387 32 NaT NaN NaN NaN NaN NaN 0.289975 33 NaT NaN NaN NaN NaN NaN -0.589955 34 NaT NaN NaN NaN NaN NaN -0.871226
九、还原预测值
def diff1_reduction ( y_pred_test_1, y_train_real) :
'''
:param y_pred_test_1: 一阶差分后测试集的预测结果
:param y_train_real: 训练集的输出
:return: 测试集的真实预测结果(未经差分)
'''
y_test_real = y_pred_test_1. cumsum( ) + y_train_real. values[ - 1 ]
return y_test_real. dropna( )
def diff2_reduction ( y_pred_test_2, y_train_real) :
'''
:param y_pred_test_2: 二阶差分后测试集的预测结果
:param y_train_real: 训练集的输出
:return: 测试集的真实预测结果(未经差分)
'''
data_diff_1_reduction = y_pred_test_2. cumsum( ) + (
y_train_real. values[ - 1 ] - y_train_real. values[ - 2 ] )
y_test_real = diff1_reduction( data_diff_1_reduction, y_train_real)
return y_test_real
def diff3_reduction ( y_pred_test_3, y_train_real) :
'''
:param y_pred_test_3: 三阶差分后测试集的预测结果
:param y_train_real: 训练集的输出
:return: 测试集的真实预测结果(未经差分)
'''
data_diff_2_reduction = y_pred_test_3. cumsum( ) + (
y_train_real. values[ - 1 ] - 2 * y_train_real. values[ - 2 ] + y_train_real. values[ - 3 ] )
y_test_real = diff2_reduction( data_diff_2_reduction, y_train_real)
return y_test_real
y_return = diff2_reduction( forecast[ 22 : ] , df3[ "npsales" ] )
y_return
22 9.877559
23 9.522198
24 9.485860
25 9.477644
26 9.689524
27 9.700338
28 10.185131
29 8.756372
30 8.227632
31 7.936280
32 7.934902
33 7.343569
34 5.881009
dtype: float64
np. exp( y_return)
22 19488.085127
23 13659.601305
24 13172.153768
25 13064.368259
26 16147.550146
27 16323.129592
28 26506.122567
29 6351.027108
30 3742.960429
31 2796.935436
32 2793.084329
33 1546.220029
34 358.170530
dtype: float64
十、预测值与实际值对比结论
1、得到下一个月也就是24年1月的预测值是19488,实际24年1月的销售额是16559,相对误差(|预测值-实际值|/实际值)为17.68%。
2、另外从预测数据来看,29行之后数据没有太大的价值,后面每月会填入最新的实际值来滚动预测最新的数据,也会尝试其他模型来提高预测准确性。