1. Code
Data preprocessing:
Libraries: matplotlib, pandas.
pd.Series is a one-dimensional array built on NumPy's ndarray. Both dicts and lists can be represented as a one-dimensional array via pd.Series. The index argument is optional; if omitted, the index defaults to integers starting at 0. Here the years are used as the index. The format is pd.Series([list], index=[list]).
time_series.plot renders the mapping {year1: value, year2: value, ...} as a line plot.
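As a minimal sketch of the Series construction (the values are the first few entries of the GDP-like series used below):

```python
import pandas as pd

# Build a one-dimensional Series from a list, with years as the optional index.
# Without index=, pandas would default to an integer index starting at 0.
values = [151.0, 188.46, 199.38]
series = pd.Series(values, index=[1978, 1979, 1980])
print(series[1979])   # -> 188.46
```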
Log transform
The raw time series grows exponentially and fluctuates widely, so it is not stationary. For data with this kind of exponential shape, taking the logarithm converts the trend into a linear one. After the log transform, the time series plot shows a clearly linear trend.

```python
time_series = np.log(time_series)
```
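A toy illustration (synthetic data, not the series above) of why the log transform helps: for a purely exponential series the log is exactly linear, so its first differences are constant:

```python
import numpy as np
import pandas as pd

# An exponential series: e^0, e^1, ..., e^4.
s = pd.Series([np.e ** k for k in range(5)])
logged = np.log(s)             # the log of an exponential series is linear in k
steps = logged.diff().dropna()
print(steps.tolist())          # each step is 1.0 (up to float rounding)
```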
ADF test
To check stationarity, run an ADF (Augmented Dickey-Fuller) test on the log-transformed data. The ADF test is also called a unit root test: it checks whether the time series contains a unit root. A series with a unit root process is non-stationary, and using it in regression analysis can produce spurious regression results.
The test results are as follows:
| Item | Result |
|---|---|
| Test Statistic Value | 0.807369 |
| p-value | 0.991754 |
| Lags Used | 1 |
| Number of Observations Used | 31 |
| Critical Value(1%) | -3.66143 |
| Critical Value(5%) | -2.96053 |
| Critical Value(10%) | -2.61932 |
From the table above, the test statistic (0.807369) is greater than the critical value at every significance level (1%, 5%, 10%) and the p-value is greater than 0.05, so the series is non-stationary. We therefore difference the series. After first differencing, the test statistic falls below the 5% critical value and the p-value drops below 0.05, so the differenced series is essentially stationary.
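The first-difference operation used here can be sketched on toy numbers:

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 6.0, 10.0])
d = s.diff(1).dropna()   # first difference; the first element becomes NaN and is dropped
print(d.tolist())        # -> [2.0, 3.0, 4.0]
```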
Differencing
Next, determine the autoregressive order p and the moving-average order q of the ARMA(p, q) model.
Following the usual identification rules for time series, the order of the ARMA model is chosen by combining the ACF plot, the PACF plot, the AIC (Akaike information criterion), and the BIC (Bayesian information criterion); the (p, q) pair that minimizes AIC and BIC is taken as the ideal order.
Partial autocorrelation (PACF) plot
Autocorrelation (ACF) plot
```
    lag        AC          Q       PAC  Prob(>Q)
0   1.0  0.394807   5.470618  0.394807  0.019339
1   2.0  0.154901   6.340817 -0.001150  0.041986
2   3.0  0.055267   6.455413 -0.006522  0.091438
3   4.0 -0.248371   8.852447 -0.317338  0.064895
4   5.0 -0.260612  11.589317 -0.074026  0.040870
5   6.0 -0.208604  13.410270 -0.056045  0.036964
6   7.0 -0.285771  16.964330 -0.177312  0.017628
7   8.0 -0.174044  18.337541 -0.069038  0.018833
8   9.0 -0.030905  18.382724  0.005425  0.030984
9  10.0  0.029119  18.424658  0.007243  0.048209
10 11.0 -0.066373  18.652901 -0.262074  0.067616
11 12.0 -0.146481  19.820146 -0.259094  0.070566
12 13.0 -0.072985  20.125172 -0.032766  0.092133
13 14.0 -0.127924  21.114325 -0.175995  0.098738
14 15.0  0.014849  21.128436 -0.019744  0.132776
15 16.0  0.162653  22.927440 -0.002973  0.115688
16 17.0  0.106608  23.751802 -0.076931  0.126337
17 18.0  0.114241  24.766054 -0.159748  0.131497
18 19.0  0.138092  26.362025 -0.125241  0.120414
19 20.0  0.094671  27.174631 -0.002794  0.130422
20 21.0  0.021795  27.221614 -0.110668  0.163638
21 22.0  0.044433  27.436416  0.011325  0.195118
22 23.0  0.036667  27.598943  0.022050  0.231329
23 24.0 -0.076350  28.391726 -0.152129  0.243865
24 25.0 -0.035761  28.590497 -0.112621  0.281395
25 26.0  0.013603  28.624051 -0.040208  0.328453
26 27.0 -0.118207  31.664578 -0.105866  0.244735
27 28.0 -0.028374  31.883563  0.008338  0.279275
28 29.0  0.017351  31.992746  0.015010  0.320192
29 30.0 -0.070272  34.679110 -0.043223  0.254504
30 31.0  0.051978  37.618628  0.022608  0.191978
```
```python
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.arima_model import ARMA  # legacy API (removed in statsmodels >= 0.13)

time_series = pd.Series([151.0, 188.46, 199.38, 219.75, 241.55, 262.58, 328.22, 396.26, 442.04, 517.77, 626.52, 717.08,
                         824.38, 913.38, 1088.39, 1325.83, 1700.92, 2109.38, 2499.77, 2856.47, 3114.02, 3229.29, 3545.39,
                         3880.53, 4212.82, 4757.45, 5633.24, 6590.19, 7617.47, 9333.4, 11328.92, 12961.1, 15967.61])
time_series.index = pd.Index(sm.tsa.datetools.dates_from_range('1978', '2010'))
print(time_series.index)
print(time_series)
time_series.plot(figsize=(12, 8))
time_series_ori = time_series
plt.show()

# Log-transform to turn the exponential trend into a linear one
time_series = np.log(time_series)
time_series_log = time_series
time_series.plot(figsize=(8, 6))
plt.show()

# ADF (unit root) test on the log series
t = sm.tsa.stattools.adfuller(time_series)
output = pd.DataFrame(index=['Test Statistic Value', 'p-value', 'Lags Used', 'Number of Observations Used',
                             'Critical Value(1%)', 'Critical Value(5%)', 'Critical Value(10%)'], columns=['value'])
output.loc['Test Statistic Value', 'value'] = t[0]
output.loc['p-value', 'value'] = t[1]
output.loc['Lags Used', 'value'] = t[2]
output.loc['Number of Observations Used', 'value'] = t[3]
output.loc['Critical Value(1%)', 'value'] = t[4]['1%']
output.loc['Critical Value(5%)', 'value'] = t[4]['5%']
output.loc['Critical Value(10%)', 'value'] = t[4]['10%']
print(output)

# First difference, then re-test
time_series = time_series.diff(1)
time_series = time_series.dropna()  # was dropna(how=any); `any` is the builtin, the keyword expects 'any'
time_series.plot(figsize=(8, 6))
plt.show()
t = sm.tsa.stattools.adfuller(time_series)
output = pd.DataFrame(index=['Test Statistic Value', 'p-value', 'Lags Used', 'Number of Observations Used',
                             'Critical Value(1%)', 'Critical Value(5%)', 'Critical Value(10%)'], columns=['value'])
output.loc['Test Statistic Value', 'value'] = t[0]
output.loc['p-value', 'value'] = t[1]
output.loc['Lags Used', 'value'] = t[2]
output.loc['Number of Observations Used', 'value'] = t[3]
output.loc['Critical Value(1%)', 'value'] = t[4]['1%']
output.loc['Critical Value(5%)', 'value'] = t[4]['5%']
output.loc['Critical Value(10%)', 'value'] = t[4]['10%']
print(output)

# ACF / PACF plots and the correlogram table
plot_acf(time_series)
plot_pacf(time_series)
plt.show()
acf_vals, q_stat, p_vals = sm.tsa.acf(time_series, qstat=True)  # Ljung-Box Q statistics and their p-values
pacf_vals = pacf(time_series, method='ywmle')
table_data = np.c_[range(1, len(acf_vals)), acf_vals[1:], q_stat, pacf_vals[1:len(q_stat) + 1], p_vals]
table = pd.DataFrame(table_data, columns=['lag', 'AC', 'Q', 'PAC', 'Prob(>Q)'])
print(table)

# Trial fit with p=1, q=2 (the series is already differenced, so fit ARMA, not ARIMA)
p, q = (1, 2)
arma_mod = ARMA(time_series, (p, q)).fit(disp=-1, method='mle')
summary = arma_mod.summary2(alpha=.05, float_format="%.8f")
print(summary)

# Automatic order selection: compute AIC for every (p, q) from (0, 0) to (3, 3) and take the minimum
(p, q) = sm.tsa.arma_order_select_ic(time_series, max_ar=3, max_ma=3, ic='aic')['aic_min_order']

# ADF test on the residuals of the chosen model
# (ARMA(0, 1) on the differenced log series corresponds to ARIMA(0, 1, 1) on the log series)
arma_mod = ARMA(time_series, (0, 1)).fit(disp=-1, method='mle')
resid = arma_mod.resid
t = sm.tsa.stattools.adfuller(resid)
output = pd.DataFrame(index=['Test Statistic Value', 'p-value', 'Lags Used', 'Number of Observations Used',
                             'Critical Value(1%)', 'Critical Value(5%)', 'Critical Value(10%)'], columns=['value'])
output.loc['Test Statistic Value', 'value'] = t[0]
output.loc['p-value', 'value'] = t[1]
output.loc['Lags Used', 'value'] = t[2]
output.loc['Number of Observations Used', 'value'] = t[3]
output.loc['Critical Value(1%)', 'value'] = t[4]['1%']
output.loc['Critical Value(5%)', 'value'] = t[4]['5%']
output.loc['Critical Value(10%)', 'value'] = t[4]['10%']
print(output)

# Forecast the differenced log series through 2013
arma_model = sm.tsa.ARMA(time_series, (0, 1)).fit(disp=-1, maxiter=100)
predict_data = arma_model.predict(start=str(1979), end=str(2010 + 3), dynamic=False)
print(predict_data)

# Restore the forecast: undo the differencing, then undo the log
predict_data_ori = predict_data
predict_data_ori_re = predict_data_ori.reset_index()
predict_data_ori_re.rename(columns={0: 'value1'}, inplace=True)
time_series_log_re = time_series_log.reset_index()
time_series_log_re.rename(columns={0: 'value2'}, inplace=True)
time_series_log_re.rename(columns={'index': 'index1'}, inplace=True)
# concat the predicted differences with the observed log values
df_predict = pd.concat([predict_data_ori_re, time_series_log_re], axis=1)
df_predict.fillna(0, inplace=True)
print(df_predict[:10])
df_predict['predict'] = df_predict['value1'] + df_predict['value2']
# 2012 and 2013 have no observed log value, so roll the forecast forward
df_predict.loc[33, 'predict'] = df_predict['predict'][32] + df_predict['value1'][33]
df_predict.loc[34, 'predict'] = df_predict['predict'][33] + df_predict['value1'][34]
predict_data = df_predict[['index', 'predict']].copy()  # .copy() avoids SettingWithCopyWarning
predict_data['exp'] = np.exp(predict_data['predict'])
df_data = predict_data[['index', 'exp']]
plt.figure(figsize=(12, 8))
plt.plot(time_series_ori.index, time_series_ori.values)
plt.plot(df_data['index'], df_data['exp'])
print(df_data)
plt.show()
```
Based on the plots above, we first try p=1, q=2 and estimate the model. The results are shown below.
```
                         Results: ARMA
====================================================================
Model:              ARMA             BIC:                 -89.4537
Dependent Variable: y                Log-Likelihood:      51.658
Date:               2020-07-16 01:46 Scale:               1.0000
No. Observations:   32               Method:              mle
Df Model:           3                Sample:              12-31-1979
Df Residuals:       29                                    12-31-2010
Converged:          1.0000           S.D. of innovations: 0.048
No. Iterations:     24.0000          HQIC:                -93.373
AIC:                -95.3166
----------------------------------------------------------------------
            Coef.   Std.Err.       t       P>|t|     [0.025    0.975]
----------------------------------------------------------------------
const      0.1487     0.0130   11.4221    0.0000     0.1232    0.1743
ar.L1.y   -0.0346     0.3180   -0.1087    0.9135    -0.6578    0.5887
ma.L1.y    0.5940     0.2520    2.3571    0.0184     0.1001    1.0880
-----------------------------------------------------------------------------
                 Real          Imaginary          Modulus          Frequency
-----------------------------------------------------------------------------
AR.1         -28.9419             0.0000          28.9419             0.5000
MA.1          -1.6834             0.0000           1.6834             0.5000
```
The p and q here can be tuned; the best pair is the one with the smallest AIC and BIC. After comparison, p=0, q=1 is the ideal order.
There is also a function that selects p and q automatically:

```python
(p, q) = sm.tsa.arma_order_select_ic(time_series, max_ar=3, max_ma=3, ic='aic')['aic_min_order']
# max_ar and max_ma set the upper bounds for the automatic order search. The ic argument
# picks the criterion ('aic' here; 'bic' also works). The function computes the AIC of every
# (p, q) combination from (0, 0) to (3, 3) and keeps the smallest; here the result is (p=0, q=1).
```
Residual and white-noise test
This runs an ADF test on the residual series arma_mod.resid of the ARIMA(0,1,1) model.
```python
# ARMA(0, 1) on the differenced series corresponds to ARIMA(0, 1, 1) on the log series
arma_mod = ARMA(time_series, (0, 1)).fit(disp=-1, method='mle')
resid = arma_mod.resid
t = sm.tsa.stattools.adfuller(resid)
output = pd.DataFrame(index=['Test Statistic Value', 'p-value', 'Lags Used', 'Number of Observations Used',
                             'Critical Value(1%)', 'Critical Value(5%)', 'Critical Value(10%)'], columns=['value'])
output.loc['Test Statistic Value', 'value'] = t[0]
output.loc['p-value', 'value'] = t[1]
output.loc['Lags Used', 'value'] = t[2]
output.loc['Number of Observations Used', 'value'] = t[3]
output.loc['Critical Value(1%)', 'value'] = t[4]['1%']
output.loc['Critical Value(5%)', 'value'] = t[4]['5%']
output.loc['Critical Value(10%)', 'value'] = t[4]['10%']
print(output)
```
The results:

| Item | Result |
|---|---|
| Test Statistic Value | -3.114 |
| p-value | 0.025534 |
| Lags Used | 1 |
| Number of Observations Used | 30 |
| Critical Value(1%) | -3.66992 |
| Critical Value(5%) | -2.96407 |
| Critical Value(10%) | -2.62117 |

The p-value (0.0255) is below 0.05 and the test statistic is below the 5% critical value, so the residual series is stationary.
Model forecasting
The predicted values of the differenced log series for 1979-2013:
```
1979-12-31    0.148903
1980-12-31    0.180234
1981-12-31    0.083328
1982-12-31    0.156691
1983-12-31    0.113660
1984-12-31    0.131687
1985-12-31    0.201155
1986-12-31    0.141603
1987-12-31    0.130448
1988-12-31    0.164734
1989-12-31    0.163724
1990-12-31    0.132480
1991-12-31    0.152886
1992-12-31    0.120100
1993-12-31    0.180473
1994-12-31    0.158549
1995-12-31    0.200706
1996-12-31    0.157206
1997-12-31    0.156108
1998-12-31    0.135910
1999-12-31    0.120548
2000-12-31    0.100750
2001-12-31    0.144692
2002-12-31    0.117811
2003-12-31    0.128515
2004-12-31    0.144937
2005-12-31    0.162649
2006-12-31    0.145614
2007-12-31    0.148473
2008-12-31    0.180175
2009-12-31    0.156672
2010-12-31    0.136277
2011-12-31    0.190270
2012-12-31    0.148903
2013-12-31    0.148903
Freq: A-DEC, dtype: float64
```
Restoring the forecast
The predictions are on the differenced log scale, so restore them by undoing the differencing (adding the predicted differences back onto the log-transformed series) and then taking the exponential.
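The restoration step can be sketched on toy numbers: cumulatively summing the log differences and adding the first log value undoes the differencing, and np.exp undoes the log (the series here is just the first four values of the article's data):

```python
import numpy as np
import pandas as pd

orig = pd.Series([151.0, 188.46, 199.38, 219.75])
logged = np.log(orig)
diffed = logged.diff().dropna()          # what the model actually fits and predicts

# Undo the differencing (cumsum + first log value), then undo the log (exp).
restored = np.exp(logged.iloc[0] + diffed.cumsum())
print(restored.tolist())                 # matches orig[1:]
```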
```
        index           exp
0  1979-12-31    175.244657
1  1980-12-31    225.680412
2  1981-12-31    216.705691
3  1982-12-31    257.027171
4  1983-12-31    270.625744
5  1984-12-31    299.538305
6  1985-12-31    401.351978
7  1986-12-31    456.538812
8  1987-12-31    503.633305
9  1988-12-31    610.492068
10 1989-12-31    737.970970
11 1990-12-31    818.659037
12 1991-12-31    960.560884
13 1992-12-31   1029.935870
14 1993-12-31   1303.656133
15 1994-12-31   1553.619439
16 1995-12-31   2078.975662
17 1996-12-31   2468.474595
18 1997-12-31   2922.112404
19 1998-12-31   3272.310709
20 1999-12-31   3512.971870
21 2000-12-31   3571.595473
22 2001-12-31   4097.347615
23 2002-12-31   4365.716948
24 2003-12-31   4790.560978
25 2004-12-31   5499.452576
26 2005-12-31   6628.204874
27 2006-12-31   7623.198041
28 2007-12-31   8836.736352
29 2008-12-31  11176.065185
30 2009-12-31  13250.438645
31 2010-12-31  14853.411147
32 2011-12-31  19314.030486
33 2012-12-31  22415.103624
34 2013-12-31  26014.087057
```
```python
# Restore the forecast: undo the differencing, then undo the log
predict_data_ori = predict_data
predict_data_ori_re = predict_data_ori.reset_index()
predict_data_ori_re.rename(columns={0: 'value1'}, inplace=True)
time_series_log_re = time_series_log.reset_index()
time_series_log_re.rename(columns={0: 'value2'}, inplace=True)
time_series_log_re.rename(columns={'index': 'index1'}, inplace=True)
# concat the predicted differences with the observed log values
df_predict = pd.concat([predict_data_ori_re, time_series_log_re], axis=1)
df_predict.fillna(0, inplace=True)
print(df_predict[:10])
df_predict['predict'] = df_predict['value1'] + df_predict['value2']
# 2012 and 2013 have no observed log value, so roll the forecast forward
df_predict.loc[33, 'predict'] = df_predict['predict'][32] + df_predict['value1'][33]
df_predict.loc[34, 'predict'] = df_predict['predict'][33] + df_predict['value1'][34]
predict_data = df_predict[['index', 'predict']].copy()  # .copy() avoids SettingWithCopyWarning
predict_data['exp'] = np.exp(predict_data['predict'])
df_data = predict_data[['index', 'exp']]
plt.figure(figsize=(12, 8))
plt.plot(time_series_ori.index, time_series_ori.values)
plt.plot(df_data['index'], df_data['exp'])
print(df_data)
plt.show()
```
Judging from the forecast, the predictions for 2011-2013 are close to the actual values, so the model performs well for short-term forecasting. The modeling mainly relies on the algorithms in the third-party Python library statsmodels.
Drawback: the model depends too heavily on early historical data.