ppt的代码加自己整理的笔记,并没有完全搞明白,后面应该会有更新,欢迎交流学习!
Introduction to computational social science
Time Series Analysis
SARIMAX
The Algorithm (Step by Step)
❑Import data
❑Pre-processing
❑Divide the dataset into train and test parts
❑Finding best parameter (auto ARIMA) based on the training part
❑Make the model (SARIMAX) based on the achieved best parameters and analyze its summary (evaluate model)
❑Make the prediction by the achieved model over the test part and calculate RMSE (to find the accuracy)
❑If the accuracy was acceptable, proceed with the model to predict for unseen data (beyond the dataset)
Import Data
import pandas as pd
df = pd.read_csv('Champagne.csv')
df
Month | Sales | |
---|---|---|
0 | 1964-01 | 2815 |
1 | 1964-02 | 2672 |
2 | 1964-03 | 2755 |
3 | 1964-04 | 2721 |
4 | 1964-05 | 2946 |
... | ... | ... |
100 | 1972-05 | 4618 |
101 | 1972-06 | 5312 |
102 | 1972-07 | 4298 |
103 | 1972-08 | 1413 |
104 | 1972-09 | 5877 |
105 rows × 2 columns
Data Preprocessing
- Names should be meaningful
- NaN need to be dropped or imputed
- To apply time series analysis, the index should be timestamp format
df.columns=["Month","Sales"]
df.head()
Month | Sales | |
---|---|---|
0 | 1964-01 | 2815 |
1 | 1964-02 | 2672 |
2 | 1964-03 | 2755 |
3 | 1964-04 | 2721 |
4 | 1964-05 | 2946 |
df.dropna()
Month | Sales | |
---|---|---|
0 | 1964-01 | 2815 |
1 | 1964-02 | 2672 |
2 | 1964-03 | 2755 |
3 | 1964-04 | 2721 |
4 | 1964-05 | 2946 |
... | ... | ... |
100 | 1972-05 | 4618 |
101 | 1972-06 | 5312 |
102 | 1972-07 | 4298 |
103 | 1972-08 | 1413 |
104 | 1972-09 | 5877 |
105 rows × 2 columns
df['Month']=pd.to_datetime(df['Month'])
df.head()
Month | Sales | |
---|---|---|
0 | 1964-01-01 | 2815 |
1 | 1964-02-01 | 2672 |
2 | 1964-03-01 | 2755 |
3 | 1964-04-01 | 2721 |
4 | 1964-05-01 | 2946 |
df.set_index('Month',inplace=True)
df.head()
Sales | |
---|---|
Month | |
1964-01-01 | 2815 |
1964-02-01 | 2672 |
1964-03-01 | 2755 |
1964-04-01 | 2721 |
1964-05-01 | 2946 |
Divide the dataset into train and test parts
- train–> find best parameters
- test–> predict on it and calculate the acuracy. If the acuracy is acceptable, use the model to predict unseen data(这时用所有已知的 data(train+test) to train model,然后预测)
df.shape
(105, 1)
80% - 85% should be assigned to train
这里最后13个给了test,其它train
train = df.iloc[:-13]
test = df.iloc[-13:]
print("Train shape:", train.shape)
print("Test shape:", test.shape)
print(train)
test
Train shape: (92, 1)
Test shape: (13, 1)
Sales
Month
1964-01-01 2815
1964-02-01 2672
1964-03-01 2755
1964-04-01 2721
1964-05-01 2946
... ...
1971-04-01 4676
1971-05-01 5010
1971-06-01 4874
1971-07-01 4633
1971-08-01 1659
[92 rows x 1 columns]
Sales | |
---|---|
Month | |
1971-09-01 | 5951 |
1971-10-01 | 6981 |
1971-11-01 | 9851 |
1971-12-01 | 12670 |
1972-01-01 | 4348 |
1972-02-01 | 3564 |
1972-03-01 | 4577 |
1972-04-01 | 4788 |
1972-05-01 | 4618 |
1972-06-01 | 5312 |
1972-07-01 | 4298 |
1972-08-01 | 1413 |
1972-09-01 | 5877 |
Finding best parameter
auto_arima 自动 detect best parameter
seasonal=True
–>SARIMA
suppress_warnings=True
–>因为warnings很烦
stepwise=True
–>step by step
from pmdarima.arima import auto_arima
stepwise_fit=auto_arima(train['Sales'],m=12,seasonal=True,d=None,D=1,trace=True,error_action='ignore',suppress_warnings=True,stepwise=True)
Performing stepwise search to minimize aic
ARIMA(2,0,2)(1,1,1)[12] intercept : AIC=1305.033, Time=0.49 sec
ARIMA(0,0,0)(0,1,0)[12] intercept : AIC=1300.270, Time=0.02 sec
ARIMA(1,0,0)(1,1,0)[12] intercept : AIC=1298.319, Time=0.14 sec
ARIMA(0,0,1)(0,1,1)[12] intercept : AIC=1299.128, Time=0.23 sec
ARIMA(0,0,0)(0,1,0)[12] : AIC=1309.721, Time=0.02 sec
ARIMA(1,0,0)(0,1,0)[12] intercept : AIC=1299.950, Time=0.02 sec
ARIMA(1,0,0)(2,1,0)[12] intercept : AIC=1298.706, Time=0.12 sec
ARIMA(1,0,0)(1,1,1)[12] intercept : AIC=1299.277, Time=0.36 sec
ARIMA(1,0,0)(0,1,1)[12] intercept : AIC=1299.289, Time=0.05 sec
ARIMA(1,0,0)(2,1,1)[12] intercept : AIC=1300.076, Time=0.46 sec
ARIMA(0,0,0)(1,1,0)[12] intercept : AIC=1300.359, Time=0.11 sec
ARIMA(2,0,0)(1,1,0)[12] intercept : AIC=1300.344, Time=0.16 sec
ARIMA(1,0,1)(1,1,0)[12] intercept : AIC=1300.399, Time=0.17 sec
ARIMA(0,0,1)(1,1,0)[12] intercept : AIC=1298.227, Time=0.19 sec
ARIMA(0,0,1)(0,1,0)[12] intercept : AIC=1300.004, Time=0.02 sec
ARIMA(0,0,1)(2,1,0)[12] intercept : AIC=1298.163, Time=0.24 sec
ARIMA(0,0,1)(2,1,1)[12] intercept : AIC=1299.854, Time=0.37 sec
ARIMA(0,0,1)(1,1,1)[12] intercept : AIC=1299.239, Time=0.34 sec
ARIMA(0,0,0)(2,1,0)[12] intercept : AIC=1300.543, Time=0.09 sec
ARIMA(1,0,1)(2,1,0)[12] intercept : AIC=1300.219, Time=0.40 sec
ARIMA(0,0,2)(2,1,0)[12] intercept : AIC=1300.184, Time=0.32 sec
ARIMA(1,0,2)(2,1,0)[12] intercept : AIC=1302.144, Time=0.50 sec
ARIMA(0,0,1)(2,1,0)[12] : AIC=1301.353, Time=0.19 sec
Best model: ARIMA(0,0,1)(2,1,0)[12] intercept
Total fit time: 5.012 seconds
- 没有BIC,AIC is enough, AIC 越小越好
- review:SARIMA(p,d,q)(P,D,Q)s
- p: AR 的 # of lags, PACF
- d: # of differencing
- q: MA 的 # of lags, ACF
- P: seasonality for AR
- D: seasonality for differencing
- Q: seasonality for MA
- s: # Of time steps for a single seasonal period
Evaluate model: analyze its summary
衡量刚刚得到的 best model–> 输入刚刚的参数, 用train parts fit model
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(train['Sales'], order=(0,0,1), seasonal_order=(2,1,0,12),enforce_stationary=False,enforce_invertibility=False)
results = model.fit()
results.summary()
Dep. Variable: | Sales | No. Observations: | 92 |
---|---|---|---|
Model: | SARIMAX(0, 0, 1)x(2, 1, [], 12) | Log Likelihood | -646.676 |
Date: | Fri, 28 Jan 2022 | AIC | 1301.353 |
Time: | 18:37:07 | BIC | 1310.881 |
Sample: | 01-01-1964 | HQIC | 1305.173 |
- 08-01-1971 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
ma.L1 | 0.2411 | 0.096 | 2.500 | 0.012 | 0.052 | 0.430 |
ar.S.L12 | -0.0456 | 0.072 | -0.638 | 0.523 | -0.186 | 0.095 |
ar.S.L24 | 0.2839 | 0.105 | 2.711 | 0.007 | 0.079 | 0.489 |
sigma2 | 5.868e+05 | 7.71e+04 | 7.607 | 0.000 | 4.36e+05 | 7.38e+05 |
Ljung-Box (L1) (Q): | 0.03 | Jarque-Bera (JB): | 4.88 |
---|---|---|---|
Prob(Q): | 0.87 | Prob(JB): | 0.09 |
Heteroskedasticity (H): | 1.86 | Skew: | -0.02 |
Prob(H) (two-sided): | 0.11 | Kurtosis: | 4.21 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
No. Observations: 92
--> train datama.L1
–> (0,0,1) lags for MA is 1
ar.S.L12
ar.S.L24
–> (2,1,0) lags of seasonality for AR is 2- and because seasonaily = 12. Lags jump back twice
P>|z|
if p-value<0.05, the coefficient to achieve is significant ->proceed with that- 最下面那一坨是其它的衡量指标,可以用于比较SARIMA和其它类型的model
Find the accuracy: Prediction on Test Data
start=len(train)
end = len(train)+len(test)-1
df['forecast'] = results.predict(start=start,end=end,dynamic=True)
df['forecast']
Month
1964-01-01 NaN
1964-02-01 NaN
1964-03-01 NaN
1964-04-01 NaN
1964-05-01 NaN
...
1972-05-01 4335.597877
1972-06-01 4637.333846
1972-07-01 4811.011815
1972-08-01 1639.045080
1972-09-01 5216.793461
Name: forecast, Length: 105, dtype: float64
import matplotlib.pyplot as plt
df[['Sales','forecast']].plot(figsize=(12,8))
plt.show()
test['Sales'].mean()
5711.384615384615
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(df['forecast'].dropna(),test['Sales']))
rmse
643.8346721715318
RSME << Mean Test Data -> Model is good
Prediction (on Unseen Data)
用所有已知的数据训练模型来预测 unseen
model2 = SARIMAX(df['Sales'], order = (0,0,1), seasonal_order=(2,1,0,12),enforce_stationarity=False,enforce_invertibility=False)
results2 = model2.fit()
results2.summary()
Dep. Variable: | Sales | No. Observations: | 105 |
---|---|---|---|
Model: | SARIMAX(0, 0, 1)x(2, 1, [], 12) | Log Likelihood | -559.547 |
Date: | Fri, 28 Jan 2022 | AIC | 1127.095 |
Time: | 18:37:07 | BIC | 1136.031 |
Sample: | 01-01-1964 | HQIC | 1130.640 |
- 09-01-1972 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
ma.L1 | 0.3183 | 0.105 | 3.036 | 0.002 | 0.113 | 0.524 |
ar.S.L12 | -0.2382 | 0.114 | -2.086 | 0.037 | -0.462 | -0.014 |
ar.S.L24 | 0.1056 | 0.092 | 1.148 | 0.251 | -0.075 | 0.286 |
sigma2 | 6.382e+05 | 9.14e+04 | 6.979 | 0.000 | 4.59e+05 | 8.17e+05 |
Ljung-Box (L1) (Q): | 0.25 | Jarque-Bera (JB): | 3.51 |
---|---|---|---|
Prob(Q): | 0.62 | Prob(JB): | 0.17 |
Heteroskedasticity (H): | 0.43 | Skew: | -0.05 |
Prob(H) (two-sided): | 0.05 | Kurtosis: | 4.10 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
ar.S.L24
的 p-value >0.05, but it’s inside the model, we can proceed with the model ?????
df.tail()
Sales | forecast | |
---|---|---|
Month | ||
1972-05-01 | 4618 | 4335.597877 |
1972-06-01 | 5312 | 4637.333846 |
1972-07-01 | 4298 | 4811.011815 |
1972-08-01 | 1413 | 1639.045080 |
1972-09-01 | 5877 | 5216.793461 |
from pandas.tseries.offsets import DateOffset
future_dates = [df.index[-1]+ DateOffset(months=x) for x in range(0,24)]
future_datest_df = pd.DataFrame(index=future_dates[1:],columns=df.columns)
future_datest_df
Sales | forecast | |
---|---|---|
1972-10-01 | NaN | NaN |
1972-11-01 | NaN | NaN |
1972-12-01 | NaN | NaN |
1973-01-01 | NaN | NaN |
1973-02-01 | NaN | NaN |
1973-03-01 | NaN | NaN |
1973-04-01 | NaN | NaN |
1973-05-01 | NaN | NaN |
1973-06-01 | NaN | NaN |
1973-07-01 | NaN | NaN |
1973-08-01 | NaN | NaN |
1973-09-01 | NaN | NaN |
1973-10-01 | NaN | NaN |
1973-11-01 | NaN | NaN |
1973-12-01 | NaN | NaN |
1974-01-01 | NaN | NaN |
1974-02-01 | NaN | NaN |
1974-03-01 | NaN | NaN |
1974-04-01 | NaN | NaN |
1974-05-01 | NaN | NaN |
1974-06-01 | NaN | NaN |
1974-07-01 | NaN | NaN |
1974-08-01 | NaN | NaN |
future_df = pd.concat([df,future_datest_df]) #合起来,好plot
future_df
Sales | forecast | |
---|---|---|
1964-01-01 | 2815 | NaN |
1964-02-01 | 2672 | NaN |
1964-03-01 | 2755 | NaN |
1964-04-01 | 2721 | NaN |
1964-05-01 | 2946 | NaN |
... | ... | ... |
1974-04-01 | NaN | NaN |
1974-05-01 | NaN | NaN |
1974-06-01 | NaN | NaN |
1974-07-01 | NaN | NaN |
1974-08-01 | NaN | NaN |
128 rows × 2 columns
future_df['forecast']=results2.predict(start=104,end=128,dynamic=True)
future_df[['Sales','forecast']].plot(figsize=(12,8))
plt.show()