【SARIMAX】Champagne Sales

ppt的代码加自己整理的笔记,并没有完全搞明白,后面应该会有更新,欢迎交流学习!

Introduction to computational social science

Time Series Analysis

SARIMAX

The Algorithm (Step by Step)

❑Import data

❑Pre-processing

❑Divide the dataset into train and test parts

❑Finding best parameter (auto ARIMA) based on the training part

❑Make the model (SARIMAX) based on the achieved best parameters and analyze its summary (evaluate model)

❑Make the prediction by the achieved model over the test part and calculate RMSE (to find the accuracy)

❑If the accuracy was acceptable, proceed with the model to predict for unseen data (beyond the dataset)

Import Data

import pandas as pd
df = pd.read_csv('Champagne.csv')
df
MonthSales
01964-012815
11964-022672
21964-032755
31964-042721
41964-052946
.........
1001972-054618
1011972-065312
1021972-074298
1031972-081413
1041972-095877

105 rows × 2 columns

Data Preprocessing

  1. Names should be meaningful
  2. NaN need to be dropped or imputed
  3. To apply time series analysis, the index should be timestamp format
df.columns=["Month","Sales"]
df.head()
MonthSales
01964-012815
11964-022672
21964-032755
31964-042721
41964-052946
df.dropna()
MonthSales
01964-012815
11964-022672
21964-032755
31964-042721
41964-052946
.........
1001972-054618
1011972-065312
1021972-074298
1031972-081413
1041972-095877

105 rows × 2 columns

df['Month']=pd.to_datetime(df['Month'])
df.head()
MonthSales
01964-01-012815
11964-02-012672
21964-03-012755
31964-04-012721
41964-05-012946
df.set_index('Month',inplace=True)
df.head()
Sales
Month
1964-01-012815
1964-02-012672
1964-03-012755
1964-04-012721
1964-05-012946

Divide the dataset into train and test parts

  1. train–> find best parameters
  2. test–> predict on it and calculate the acuracy. If the acuracy is acceptable, use the model to predict unseen data(这时用所有已知的 data(train+test) to train model,然后预测)
df.shape
(105, 1)

80% - 85% should be assigned to train

这里最后13个给了test,其它train

train = df.iloc[:-13]
test = df.iloc[-13:]
print("Train shape:", train.shape)
print("Test shape:", test.shape)
print(train)
test
Train shape: (92, 1)
Test shape: (13, 1)
            Sales
Month            
1964-01-01   2815
1964-02-01   2672
1964-03-01   2755
1964-04-01   2721
1964-05-01   2946
...           ...
1971-04-01   4676
1971-05-01   5010
1971-06-01   4874
1971-07-01   4633
1971-08-01   1659

[92 rows x 1 columns]
Sales
Month
1971-09-015951
1971-10-016981
1971-11-019851
1971-12-0112670
1972-01-014348
1972-02-013564
1972-03-014577
1972-04-014788
1972-05-014618
1972-06-015312
1972-07-014298
1972-08-011413
1972-09-015877

Finding best parameter

auto_arima 自动 detect best parameter

seasonal=True–>SARIMA

suppress_warnings=True–>因为warnings很烦

stepwise=True–>step by step

from pmdarima.arima import auto_arima
stepwise_fit=auto_arima(train['Sales'],m=12,seasonal=True,d=None,D=1,trace=True,error_action='ignore',suppress_warnings=True,stepwise=True)
Performing stepwise search to minimize aic
 ARIMA(2,0,2)(1,1,1)[12] intercept   : AIC=1305.033, Time=0.49 sec
 ARIMA(0,0,0)(0,1,0)[12] intercept   : AIC=1300.270, Time=0.02 sec
 ARIMA(1,0,0)(1,1,0)[12] intercept   : AIC=1298.319, Time=0.14 sec
 ARIMA(0,0,1)(0,1,1)[12] intercept   : AIC=1299.128, Time=0.23 sec
 ARIMA(0,0,0)(0,1,0)[12]             : AIC=1309.721, Time=0.02 sec
 ARIMA(1,0,0)(0,1,0)[12] intercept   : AIC=1299.950, Time=0.02 sec
 ARIMA(1,0,0)(2,1,0)[12] intercept   : AIC=1298.706, Time=0.12 sec
 ARIMA(1,0,0)(1,1,1)[12] intercept   : AIC=1299.277, Time=0.36 sec
 ARIMA(1,0,0)(0,1,1)[12] intercept   : AIC=1299.289, Time=0.05 sec
 ARIMA(1,0,0)(2,1,1)[12] intercept   : AIC=1300.076, Time=0.46 sec
 ARIMA(0,0,0)(1,1,0)[12] intercept   : AIC=1300.359, Time=0.11 sec
 ARIMA(2,0,0)(1,1,0)[12] intercept   : AIC=1300.344, Time=0.16 sec
 ARIMA(1,0,1)(1,1,0)[12] intercept   : AIC=1300.399, Time=0.17 sec
 ARIMA(0,0,1)(1,1,0)[12] intercept   : AIC=1298.227, Time=0.19 sec
 ARIMA(0,0,1)(0,1,0)[12] intercept   : AIC=1300.004, Time=0.02 sec
 ARIMA(0,0,1)(2,1,0)[12] intercept   : AIC=1298.163, Time=0.24 sec
 ARIMA(0,0,1)(2,1,1)[12] intercept   : AIC=1299.854, Time=0.37 sec
 ARIMA(0,0,1)(1,1,1)[12] intercept   : AIC=1299.239, Time=0.34 sec
 ARIMA(0,0,0)(2,1,0)[12] intercept   : AIC=1300.543, Time=0.09 sec
 ARIMA(1,0,1)(2,1,0)[12] intercept   : AIC=1300.219, Time=0.40 sec
 ARIMA(0,0,2)(2,1,0)[12] intercept   : AIC=1300.184, Time=0.32 sec
 ARIMA(1,0,2)(2,1,0)[12] intercept   : AIC=1302.144, Time=0.50 sec
 ARIMA(0,0,1)(2,1,0)[12]             : AIC=1301.353, Time=0.19 sec

Best model:  ARIMA(0,0,1)(2,1,0)[12] intercept
Total fit time: 5.012 seconds
  • 没有BIC,AIC is enough, AIC 越小越好
  • review:SARIMA(p,d,q)(P,D,Q)s
    • p: AR 的 # of lags, PACF
    • d: # of differencing
    • q: MA 的 # of lags, ACF
    • P: seasonality for AR
    • D: seasonality for differencing
    • Q: seasonality for MA
    • s: # Of time steps for a single seasonal period

Evaluate model: analyze its summary

衡量刚刚得到的 best model–> 输入刚刚的参数, 用train parts fit model

from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(train['Sales'], order=(0,0,1), seasonal_order=(2,1,0,12),enforce_stationary=False,enforce_invertibility=False)
results = model.fit()
results.summary()
SARIMAX Results
Dep. Variable:Sales No. Observations: 92
Model:SARIMAX(0, 0, 1)x(2, 1, [], 12) Log Likelihood -646.676
Date:Fri, 28 Jan 2022 AIC 1301.353
Time:18:37:07 BIC 1310.881
Sample:01-01-1964 HQIC 1305.173
- 08-01-1971
Covariance Type:opg
coefstd errzP>|z|[0.0250.975]
ma.L1 0.2411 0.096 2.500 0.012 0.052 0.430
ar.S.L12 -0.0456 0.072 -0.638 0.523 -0.186 0.095
ar.S.L24 0.2839 0.105 2.711 0.007 0.079 0.489
sigma2 5.868e+05 7.71e+04 7.607 0.000 4.36e+05 7.38e+05
Ljung-Box (L1) (Q):0.03 Jarque-Bera (JB): 4.88
Prob(Q):0.87 Prob(JB): 0.09
Heteroskedasticity (H):1.86 Skew: -0.02
Prob(H) (two-sided):0.11 Kurtosis: 4.21


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
  1. No. Observations: 92 --> train data
  2. ma.L1–> (0,0,1) lags for MA is 1

    ar.S.L12 ar.S.L24–> (2,1,0) lags of seasonality for AR is 2
    • and because seasonaily = 12. Lags jump back twice
  3. P>|z| if p-value<0.05, the coefficient to achieve is significant ->proceed with that
  4. 最下面那一坨是其它的衡量指标,可以用于比较SARIMA和其它类型的model

Find the accuracy: Prediction on Test Data

start=len(train)
end = len(train)+len(test)-1
df['forecast'] = results.predict(start=start,end=end,dynamic=True)
df['forecast']

Month
1964-01-01            NaN
1964-02-01            NaN
1964-03-01            NaN
1964-04-01            NaN
1964-05-01            NaN
                 ...     
1972-05-01    4335.597877
1972-06-01    4637.333846
1972-07-01    4811.011815
1972-08-01    1639.045080
1972-09-01    5216.793461
Name: forecast, Length: 105, dtype: float64
import matplotlib.pyplot as plt
df[['Sales','forecast']].plot(figsize=(12,8))
plt.show()


在这里插入图片描述

test['Sales'].mean()
5711.384615384615
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(df['forecast'].dropna(),test['Sales']))
rmse
643.8346721715318

RSME << Mean Test Data -> Model is good

Prediction (on Unseen Data)

用所有已知的数据训练模型来预测 unseen

model2 = SARIMAX(df['Sales'], order = (0,0,1), seasonal_order=(2,1,0,12),enforce_stationarity=False,enforce_invertibility=False)
results2 = model2.fit()
results2.summary()
SARIMAX Results
Dep. Variable:Sales No. Observations: 105
Model:SARIMAX(0, 0, 1)x(2, 1, [], 12) Log Likelihood -559.547
Date:Fri, 28 Jan 2022 AIC 1127.095
Time:18:37:07 BIC 1136.031
Sample:01-01-1964 HQIC 1130.640
- 09-01-1972
Covariance Type:opg
coefstd errzP>|z|[0.0250.975]
ma.L1 0.3183 0.105 3.036 0.002 0.113 0.524
ar.S.L12 -0.2382 0.114 -2.086 0.037 -0.462 -0.014
ar.S.L24 0.1056 0.092 1.148 0.251 -0.075 0.286
sigma2 6.382e+05 9.14e+04 6.979 0.000 4.59e+05 8.17e+05
Ljung-Box (L1) (Q):0.25 Jarque-Bera (JB): 3.51
Prob(Q):0.62 Prob(JB): 0.17
Heteroskedasticity (H):0.43 Skew: -0.05
Prob(H) (two-sided):0.05 Kurtosis: 4.10


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

ar.S.L24 的 p-value >0.05, but it’s inside the model, we can proceed with the model ?????

df.tail()
Salesforecast
Month
1972-05-0146184335.597877
1972-06-0153124637.333846
1972-07-0142984811.011815
1972-08-0114131639.045080
1972-09-0158775216.793461
from pandas.tseries.offsets import DateOffset
future_dates = [df.index[-1]+ DateOffset(months=x) for x in range(0,24)]
future_datest_df = pd.DataFrame(index=future_dates[1:],columns=df.columns)
future_datest_df
Salesforecast
1972-10-01NaNNaN
1972-11-01NaNNaN
1972-12-01NaNNaN
1973-01-01NaNNaN
1973-02-01NaNNaN
1973-03-01NaNNaN
1973-04-01NaNNaN
1973-05-01NaNNaN
1973-06-01NaNNaN
1973-07-01NaNNaN
1973-08-01NaNNaN
1973-09-01NaNNaN
1973-10-01NaNNaN
1973-11-01NaNNaN
1973-12-01NaNNaN
1974-01-01NaNNaN
1974-02-01NaNNaN
1974-03-01NaNNaN
1974-04-01NaNNaN
1974-05-01NaNNaN
1974-06-01NaNNaN
1974-07-01NaNNaN
1974-08-01NaNNaN
future_df = pd.concat([df,future_datest_df]) #合起来,好plot
future_df
Salesforecast
1964-01-012815NaN
1964-02-012672NaN
1964-03-012755NaN
1964-04-012721NaN
1964-05-012946NaN
.........
1974-04-01NaNNaN
1974-05-01NaNNaN
1974-06-01NaNNaN
1974-07-01NaNNaN
1974-08-01NaNNaN

128 rows × 2 columns

future_df['forecast']=results2.predict(start=104,end=128,dynamic=True)
future_df[['Sales','forecast']].plot(figsize=(12,8))
plt.show()

在这里插入图片描述

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值