【SARIMAX】Champagne Sales

最新推荐文章于 2024-08-30 08:56:20 发布

小毛毛。。。。。嘻嘻

最新推荐文章于 2024-08-30 08:56:20 发布

阅读量1.1k

点赞数 1

文章标签： python

本文链接：https://blog.csdn.net/qq_54413568/article/details/122735174

版权

ppt的代码加自己整理的笔记，并没有完全搞明白，后面应该会有更新，欢迎交流学习！

Introduction to computational social science

Time Series Analysis

SARIMAX

The Algorithm (Step by Step)

❑Import data

❑Pre-processing

❑Divide the dataset into train and test parts

❑Finding best parameter (auto ARIMA) based on the training part

❑Make the model (SARIMAX) based on the achieved best parameters and analyze its summary (evaluate model)

❑Make the prediction by the achieved model over the test part and calculate RMSE (to find the accuracy)

❑If the accuracy was acceptable, proceed with the model to predict for unseen data (beyond the dataset)

Import Data

import pandas as pd
df = pd.read_csv('Champagne.csv')
df

	Month	Sales
0	1964-01	2815
1	1964-02	2672
2	1964-03	2755
3	1964-04	2721
4	1964-05	2946
...	...	...
100	1972-05	4618
101	1972-06	5312
102	1972-07	4298
103	1972-08	1413
104	1972-09	5877

105 rows × 2 columns

Data Preprocessing

Names should be meaningful
NaN need to be dropped or imputed
To apply time series analysis, the index should be timestamp format

df.columns=["Month","Sales"]
df.head()

	Month	Sales
0	1964-01	2815
1	1964-02	2672
2	1964-03	2755
3	1964-04	2721
4	1964-05	2946

df.dropna()

	Month	Sales
0	1964-01	2815
1	1964-02	2672
2	1964-03	2755
3	1964-04	2721
4	1964-05	2946
...	...	...
100	1972-05	4618
101	1972-06	5312
102	1972-07	4298
103	1972-08	1413
104	1972-09	5877

105 rows × 2 columns

df['Month']=pd.to_datetime(df['Month'])
df.head()

	Month	Sales
0	1964-01-01	2815
1	1964-02-01	2672
2	1964-03-01	2755
3	1964-04-01	2721
4	1964-05-01	2946

df.set_index('Month',inplace=True)
df.head()

	Sales
Month
1964-01-01	2815
1964-02-01	2672
1964-03-01	2755
1964-04-01	2721
1964-05-01	2946

Divide the dataset into train and test parts

train–> find best parameters
test–> predict on it and calculate the acuracy. If the acuracy is acceptable, use the model to predict unseen data(这时用所有已知的 data(train+test) to train model，然后预测)

df.shape

(105, 1)

80% - 85% should be assigned to train

这里最后13个给了test，其它train

train = df.iloc[:-13]
test = df.iloc[-13:]
print("Train shape:", train.shape)
print("Test shape:", test.shape)
print(train)
test

Train shape: (92, 1)
Test shape: (13, 1)
            Sales
Month            
1964-01-01   2815
1964-02-01   2672
1964-03-01   2755
1964-04-01   2721
1964-05-01   2946
...           ...
1971-04-01   4676
1971-05-01   5010
1971-06-01   4874
1971-07-01   4633
1971-08-01   1659

[92 rows x 1 columns]

	Sales
Month
1971-09-01	5951
1971-10-01	6981
1971-11-01	9851
1971-12-01	12670
1972-01-01	4348
1972-02-01	3564
1972-03-01	4577
1972-04-01	4788
1972-05-01	4618
1972-06-01	5312
1972-07-01	4298
1972-08-01	1413
1972-09-01	5877

Finding best parameter

auto_arima 自动 detect best parameter

seasonal=True–>SARIMA

suppress_warnings=True–>因为warnings很烦

stepwise=True–>step by step

from pmdarima.arima import auto_arima
stepwise_fit=auto_arima(train['Sales'],m=12,seasonal=True,d=None,D=1,trace=True,error_action='ignore',suppress_warnings=True,stepwise=True)

Performing stepwise search to minimize aic
 ARIMA(2,0,2)(1,1,1)[12] intercept   : AIC=1305.033, Time=0.49 sec
 ARIMA(0,0,0)(0,1,0)[12] intercept   : AIC=1300.270, Time=0.02 sec
 ARIMA(1,0,0)(1,1,0)[12] intercept   : AIC=1298.319, Time=0.14 sec
 ARIMA(0,0,1)(0,1,1)[12] intercept   : AIC=1299.128, Time=0.23 sec
 ARIMA(0,0,0)(0,1,0)[12]             : AIC=1309.721, Time=0.02 sec
 ARIMA(1,0,0)(0,1,0)[12] intercept   : AIC=1299.950, Time=0.02 sec
 ARIMA(1,0,0)(2,1,0)[12] intercept   : AIC=1298.706, Time=0.12 sec
 ARIMA(1,0,0)(1,1,1)[12] intercept   : AIC=1299.277, Time=0.36 sec
 ARIMA(1,0,0)(0,1,1)[12] intercept   : AIC=1299.289, Time=0.05 sec
 ARIMA(1,0,0)(2,1,1)[12] intercept   : AIC=1300.076, Time=0.46 sec
 ARIMA(0,0,0)(1,1,0)[12] intercept   : AIC=1300.359, Time=0.11 sec
 ARIMA(2,0,0)(1,1,0)[12] intercept   : AIC=1300.344, Time=0.16 sec
 ARIMA(1,0,1)(1,1,0)[12] intercept   : AIC=1300.399, Time=0.17 sec
 ARIMA(0,0,1)(1,1,0)[12] intercept   : AIC=1298.227, Time=0.19 sec
 ARIMA(0,0,1)(0,1,0)[12] intercept   : AIC=1300.004, Time=0.02 sec
 ARIMA(0,0,1)(2,1,0)[12] intercept   : AIC=1298.163, Time=0.24 sec
 ARIMA(0,0,1)(2,1,1)[12] intercept   : AIC=1299.854, Time=0.37 sec
 ARIMA(0,0,1)(1,1,1)[12] intercept   : AIC=1299.239, Time=0.34 sec
 ARIMA(0,0,0)(2,1,0)[12] intercept   : AIC=1300.543, Time=0.09 sec
 ARIMA(1,0,1)(2,1,0)[12] intercept   : AIC=1300.219, Time=0.40 sec
 ARIMA(0,0,2)(2,1,0)[12] intercept   : AIC=1300.184, Time=0.32 sec
 ARIMA(1,0,2)(2,1,0)[12] intercept   : AIC=1302.144, Time=0.50 sec
 ARIMA(0,0,1)(2,1,0)[12]             : AIC=1301.353, Time=0.19 sec

Best model:  ARIMA(0,0,1)(2,1,0)[12] intercept
Total fit time: 5.012 seconds

没有BIC，AIC is enough, AIC 越小越好
review：SARIMA(p,d,q)(P,D,Q)s
- p: AR 的 # of lags, PACF
- d: # of differencing
- q: MA 的 # of lags, ACF
- P: seasonality for AR
- D: seasonality for differencing
- Q: seasonality for MA
- s: # Of time steps for a single seasonal period

Evaluate model: analyze its summary

衡量刚刚得到的 best model–> 输入刚刚的参数, 用train parts fit model

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(train['Sales'], order=(0,0,1), seasonal_order=(2,1,0,12),enforce_stationary=False,enforce_invertibility=False)
results = model.fit()
results.summary()

SARIMAX Results
Dep. Variable:	Sales	No. Observations:	92
Model:	SARIMAX(0, 0, 1)x(2, 1, [], 12)	Log Likelihood	-646.676
Date:	Fri, 28 Jan 2022	AIC	1301.353
Time:	18:37:07	BIC	1310.881
Sample:	01-01-1964	HQIC	1305.173
	- 08-01-1971
Covariance Type:	opg

	coef	std err	z	P>\|z\|	[0.025	0.975]
ma.L1	0.2411	0.096	2.500	0.012	0.052	0.430
ar.S.L12	-0.0456	0.072	-0.638	0.523	-0.186	0.095
ar.S.L24	0.2839	0.105	2.711	0.007	0.079	0.489
sigma2	5.868e+05	7.71e+04	7.607	0.000	4.36e+05	7.38e+05

Ljung-Box (L1) (Q):	0.03	Jarque-Bera (JB):	4.88
Prob(Q):	0.87	Prob(JB):	0.09
Heteroskedasticity (H):	1.86	Skew:	-0.02
Prob(H) (two-sided):	0.11	Kurtosis:	4.21

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

No. Observations: 92 --> train data
ma.L1–> (0,0,1) lags for MA is 1

ar.S.L12 ar.S.L24–> (2,1,0) lags of seasonality for AR is 2
- and because seasonaily = 12. Lags jump back twice
P>|z| if p-value<0.05, the coefficient to achieve is significant ->proceed with that
最下面那一坨是其它的衡量指标，可以用于比较SARIMA和其它类型的model

Find the accuracy: Prediction on Test Data

start=len(train)
end = len(train)+len(test)-1
df['forecast'] = results.predict(start=start,end=end,dynamic=True)
df['forecast']

Month
1964-01-01            NaN
1964-02-01            NaN
1964-03-01            NaN
1964-04-01            NaN
1964-05-01            NaN
                 ...     
1972-05-01    4335.597877
1972-06-01    4637.333846
1972-07-01    4811.011815
1972-08-01    1639.045080
1972-09-01    5216.793461
Name: forecast, Length: 105, dtype: float64

import matplotlib.pyplot as plt
df[['Sales','forecast']].plot(figsize=(12,8))
plt.show()

在这里插入图片描述

test['Sales'].mean()

5711.384615384615

from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(df['forecast'].dropna(),test['Sales']))
rmse

643.8346721715318

RSME << Mean Test Data -> Model is good

Prediction (on Unseen Data)

用所有已知的数据训练模型来预测 unseen

model2 = SARIMAX(df['Sales'], order = (0,0,1), seasonal_order=(2,1,0,12),enforce_stationarity=False,enforce_invertibility=False)
results2 = model2.fit()
results2.summary()

SARIMAX Results
Dep. Variable:	Sales	No. Observations:	105
Model:	SARIMAX(0, 0, 1)x(2, 1, [], 12)	Log Likelihood	-559.547
Date:	Fri, 28 Jan 2022	AIC	1127.095
Time:	18:37:07	BIC	1136.031
Sample:	01-01-1964	HQIC	1130.640
	- 09-01-1972
Covariance Type:	opg

	coef	std err	z	P>\|z\|	[0.025	0.975]
ma.L1	0.3183	0.105	3.036	0.002	0.113	0.524
ar.S.L12	-0.2382	0.114	-2.086	0.037	-0.462	-0.014
ar.S.L24	0.1056	0.092	1.148	0.251	-0.075	0.286
sigma2	6.382e+05	9.14e+04	6.979	0.000	4.59e+05	8.17e+05

Ljung-Box (L1) (Q):	0.25	Jarque-Bera (JB):	3.51
Prob(Q):	0.62	Prob(JB):	0.17
Heteroskedasticity (H):	0.43	Skew:	-0.05
Prob(H) (two-sided):	0.05	Kurtosis:	4.10

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

ar.S.L24 的 p-value >0.05, but it’s inside the model, we can proceed with the model ？？？？？

df.tail()

	Sales	forecast
Month
1972-05-01	4618	4335.597877
1972-06-01	5312	4637.333846
1972-07-01	4298	4811.011815
1972-08-01	1413	1639.045080
1972-09-01	5877	5216.793461

from pandas.tseries.offsets import DateOffset
future_dates = [df.index[-1]+ DateOffset(months=x) for x in range(0,24)]
future_datest_df = pd.DataFrame(index=future_dates[1:],columns=df.columns)
future_datest_df

	Sales	forecast
1972-10-01	NaN	NaN
1972-11-01	NaN	NaN
1972-12-01	NaN	NaN
1973-01-01	NaN	NaN
1973-02-01	NaN	NaN
1973-03-01	NaN	NaN
1973-04-01	NaN	NaN
1973-05-01	NaN	NaN
1973-06-01	NaN	NaN
1973-07-01	NaN	NaN
1973-08-01	NaN	NaN
1973-09-01	NaN	NaN
1973-10-01	NaN	NaN
1973-11-01	NaN	NaN
1973-12-01	NaN	NaN
1974-01-01	NaN	NaN
1974-02-01	NaN	NaN
1974-03-01	NaN	NaN
1974-04-01	NaN	NaN
1974-05-01	NaN	NaN
1974-06-01	NaN	NaN
1974-07-01	NaN	NaN
1974-08-01	NaN	NaN

future_df = pd.concat([df,future_datest_df]) #合起来，好plot
future_df

	Sales	forecast
1964-01-01	2815	NaN
1964-02-01	2672	NaN
1964-03-01	2755	NaN
1964-04-01	2721	NaN
1964-05-01	2946	NaN
...	...	...
1974-04-01	NaN	NaN
1974-05-01	NaN	NaN
1974-06-01	NaN	NaN
1974-07-01	NaN	NaN
1974-08-01	NaN	NaN

128 rows × 2 columns

future_df['forecast']=results2.predict(start=104,end=128,dynamic=True)
future_df[['Sales','forecast']].plot(figsize=(12,8))
plt.show()

在这里插入图片描述

小毛毛。。。。。嘻嘻

关注

1
点赞
踩
5

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫