1. Code
Data preprocessing:
Libraries: matplotlib, pandas.
pd.Series is a one-dimensional array built on NumPy's ndarray. Both dicts and lists can be represented as a one-dimensional array via pd.Series. The index argument is optional; if omitted, the index defaults to integers starting at 0. Here the years are used as the index. The format is pd.Series([list], index=[list]).
time_series.plot renders the mapping {year1: value, year2: value, ...} as a line plot.
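As a minimal sketch of the Series construction (the values are the first few entries of the GDP-like series used below):

```python
import pandas as pd

# Build a one-dimensional Series from a list, with years as the optional index.
# Without index=, pandas would default to an integer index starting at 0.
values = [151.0, 188.46, 199.38]
series = pd.Series(values, index=[1978, 1979, 1980])
print(series[1979])   # -> 188.46
```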
Log transform
The raw time series grows exponentially and fluctuates widely, so it is not stationary. For data with this kind of exponential shape, taking the logarithm converts the trend into a linear one. After the log transform, the time series plot shows a clearly linear trend.

```python
time_series = np.log(time_series)
```
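A toy illustration (synthetic data, not the series above) of why the log transform helps: for a purely exponential series the log is exactly linear, so its first differences are constant:

```python
import numpy as np
import pandas as pd

# An exponential series: e^0, e^1, ..., e^4.
s = pd.Series([np.e ** k for k in range(5)])
logged = np.log(s)             # the log of an exponential series is linear in k
steps = logged.diff().dropna()
print(steps.tolist())          # each step is 1.0 (up to float rounding)
```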
ADF test
To check stationarity, run an ADF (Augmented Dickey-Fuller) test on the log-transformed data. The ADF test is also called a unit root test: it checks whether the time series contains a unit root. A series with a unit root process is non-stationary, and using it in regression analysis can produce spurious regression results.
The test results are as follows:
| Item | Result |
|---|---|
| Test Statistic Value | 0.807369 |
| p-value | 0.991754 |
| Lags Used | 1 |
| Number of Observations Used | 31 |
| Critical Value(1%) | -3.66143 |
| Critical Value(5%) | -2.96053 |
| Critical Value(10%) | -2.61932 |
From the table above, the test statistic (0.807369) is greater than the critical value at every significance level (1%, 5%, 10%) and the p-value is greater than 0.05, so the series is non-stationary. We therefore difference the series. After first differencing, the test statistic falls below the 5% critical value and the p-value drops below 0.05, so the differenced series is essentially stationary.
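The first-difference operation used here can be sketched on toy numbers:

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 6.0, 10.0])
d = s.diff(1).dropna()   # first difference; the first element becomes NaN and is dropped
print(d.tolist())        # -> [2.0, 3.0, 4.0]
```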
Differencing
Next, determine the autoregressive order p and the moving-average order q of the ARMA(p, q) model.
Following the usual identification rules for time series, the order of the ARMA model is chosen by combining the ACF plot, the PACF plot, the AIC (Akaike information criterion), and the BIC (Bayesian information criterion); the (p, q) pair that minimizes AIC and BIC is taken as the ideal order.
Partial autocorrelation (PACF) plot
Autocorrelation (ACF) plot
```
    lag        AC          Q       PAC  Prob(>Q)
0   1.0  0.394807   5.470618  0.394807  0.019339
1   2.0  0.154901   6.340817 -0.001150  0.041986
2   3.0  0.055267   6.455413 -0.006522  0.091438
3   4.0 -0.248371   8.852447 -0.317338  0.064895
4   5.0 -0.260612  11.589317 -0.074026  0.040870
5   6.0 -0.208604  13.410270 -0.056045  0.036964
6   7.0 -0.285771  16.964330 -0.177312  0.017628
7   8.0 -0.174044  18.337541 -0.069038  0.018833
8   9.0 -0.030905  18.382724  0.005425  0.030984
9  10.0  0.029119  18.424658  0.007243  0.048209
10 11.0 -0.066373  18.652901 -0.262074  0.067616
11 12.0 -0.146481  19.820146 -0.259094  0.070566
12 13.0 -0.072985  20.125172 -0.032766  0.092133
13 14.0 -0.127924  21.114325 -0.175995  0.098738
14 15.0  0.014849  21.128436 -0.019744  0.132776
15 16.0  0.162653  22.927440 -0.002973  0.115688
16 17.0  0.106608  23.751802 -0.076931  0.126337
17 18.0  0.114241  24.766054 -0.159748  0.131497
18 19.0  0.138092  26.362025 -0.125241  0.120414
19 20.0  0.094671  27.174631 -0.002794  0.130422
20 21.0  0.021795  27.221614 -0.110668  0.163638
21 22.0  0.044433  27.436416  0.011325  0.195118
22 23.0  0.036667  27.598943  0.022050  0.231329
23 24.0 -0.076350  28.391726 -0.152129  0.243865
24 25.0 -0.035761  28.590497 -0.112621  0.281395
25 26.0  0.013603  28.624051 -0.040208  0.328453
26 27.0 -0.118207  31.664578 -0.105866  0.244735
27 28.0 -0.028374  31.883563  0.008338  0.279275
28 29.0  0.017351  31.992746  0.015010  0.320192
29 30.0 -0.070272  34.679110 -0.043223  0.254504
30 31.0  0.051978  37.618628  0.022608  0.191978
```
```python
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.arima_model import ARMA  # legacy API (removed in statsmodels >= 0.13)

time_series = pd.Series([151.0, 188.46, 199.38, 219.75, 241.55, 262.58, 328.22, 396.26, 442.04, 517.77, 626.52, 717.08,
                         824.38, 913.38, 1088.39, 1325.83, 1700.92, 2109.38, 2499.77, 2856.47, 3114.02, 3229.29, 3545.39,
                         3880.53, 4212.82, 4757.45, 5633.24, 6590.19, 7617.47, 9333.4, 11328.92, 12961.1, 15967.61])
time_series.index = pd.Index(sm.tsa.datetools.dates_from_range('1978', '2010'))
print(time_series.index)
print(time_series)
time_series.plot(figsize=(12, 8))
time_series_ori = time_series
plt.show()

# Log-transform to turn the exponential trend into a linear one
time_series = np.log(time_series)
time_series_log = time_series
time_series.plot(figsize=(8, 6))
plt.show()

# ADF (unit root) test on the log series
t = sm.tsa.stattools.adfuller(time_series)
output = pd.DataFrame(index=['Test Statistic Value', 'p-value', 'Lags Used', 'Number of Observations Used',
                             'Critical Value(1%)', 'Critical Value(5%)', 'Critical Value(10%)'], columns=['value'])
output.loc['Test Statistic Value', 'value'] = t[0]
output.loc['p-value', 'value'] = t[1]
output.loc['Lags Used', 'value'] = t[2]
output.loc['Number of Observations Used', 'value'] = t[3]
output.loc['Critical Value(1%)', 'value'] = t[4]['1%']
output.loc['Critical Value(5%)', 'value'] = t[4]['5%']
output.loc['Critical Value(10%)', 'value'] = t[4]['10%']
print(output)

# First difference, then re-test
time_series = time_series.diff(1)
time_series = time_series.dropna()  # was dropna(how=any); `any` is the builtin, the keyword expects 'any'
time_series.plot(figsize=(8, 6))
plt.show()
t = sm.tsa.stattools.adfuller(time_series)
output = pd.DataFrame(index=['Test Statistic Value', 'p-value', 'Lags Used', 'Number of Observations Used',
                             'Critical Value(1%)', 'Critical Value(5%)', 'Critical Value(10%)'], columns=['value'])
output.loc['Test Statistic Value', 'value'] = t[0]
output.loc['p-value', 'value'] = t[1]
output.loc['Lags Used', 'value'] = t[2]
output.loc['Number of Observations Used', 'value'] = t[3]
output.loc['Critical Value(1%)', 'value'] = t[4]['1%']
output.loc['Critical Value(5%)', 'value'] = t[4]['5%']
output.loc['Critical Value(10%)', 'value'] = t[4]['10%']
print(output)

# ACF / PACF plots and the correlogram table
plot_acf(time_series)
plot_pacf(time_series)
plt.show()
acf_vals, q_stat, p_vals = sm.tsa.acf(time_series, qstat=True)  # Ljung-Box Q statistics and their p-values
pacf_vals = pacf(time_series, method='ywmle')
table_data = np.c_[range(1, len(acf_vals)), acf_vals[1:], q_stat, pacf_vals[1:len(q_stat) + 1], p_vals]
table = pd.DataFrame(table_data, columns=['lag', 'AC', 'Q', 'PAC', 'Prob(>Q)'])
print(table)

# Trial fit with p=1, q=2 (the series is already differenced, so fit ARMA, not ARIMA)
p, q = (1, 2)
arma_mod = ARMA(time_series, (p, q)).fit(disp=-1, method='mle')
summary = arma_mod.summary2(alpha=.05, float_format="%.8f")
print(summary)

# Automatic order selection: compute AIC for every (p, q) from (0, 0) to (3, 3) and take the minimum
(p, q) = sm.tsa.arma_order_select_ic(time_series, max_ar=3, max_ma=3, ic='aic')['aic_min_order']

# ADF test on the residuals of the chosen model
# (ARMA(0, 1) on the differenced log series corresponds to ARIMA(0, 1, 1) on the log series)
arma_mod = ARMA(time_series, (0, 1)).fit(disp=-1, method='mle')
resid = arma_mod.resid
t = sm.tsa.stattools.adfuller(resid)
output = pd.DataFrame(index=['Test Statistic Value', 'p-value', 'Lags Used', 'Number of Observations Used',
                             'Critical Value(1%)', 'Critical Value(5%)', 'Critical Value(10%)'], columns=['value'])
output.loc['Test Statistic Value', 'value'] = t[0]
output.loc['p-value', 'value'] = t[1]
output.loc['Lags Used', 'value'] = t[2]
output.loc['Number of Observations Used', 'value'] = t[3]
output.loc['Critical Value(1%)', 'value'] = t[4]['1%']
output.loc['Critical Value(5%)', 'value'] = t[4]['5%']
output.loc['Critical Value(10%)', 'value'] = t[4]['10%']
print(output)

# Forecast the differenced log series through 2013
arma_model = sm.tsa.ARMA(time_series, (0, 1)).fit(disp=-1, maxiter=100)
predict_data = arma_model.predict(start=str(1979), end=str(2010 + 3), dynamic=False)
print(predict_data)

# Restore the forecast: undo the differencing, then undo the log
predict_data_ori = predict_data
predict_data_ori_re = predict_data_ori.reset_index()
predict_data_ori_re.rename(columns={0: 'value1'}, inplace=True)
time_series_log_re = time_series_log.reset_index()
time_series_log_re.rename(columns={0: 'value2'}, inplace=True)
time_series_log_re.rename(columns={'index': 'index1'}, inplace=True)
# concat the predicted differences with the observed log values
df_predict = pd.concat([predict_data_ori_re, time_series_log_re], axis=1)
df_predict.fillna(0, inplace=True)
print(df_predict[:10])
df_predict['predict'] = df_predict['value1'] + df_predict['value2']
# 2012 and 2013 have no observed log value, so roll the forecast forward
df_predict.loc[33, 'predict'] = df_predict['predict'][32] + df_predict['value1'][33]
df_predict.loc[34, 'predict'] = df_predict['predict'][33] + df_predict['value1'][34]
predict_data = df_predict[['index', 'predict']].copy()  # .copy() avoids SettingWithCopyWarning
predict_data['exp'] = np.exp(predict_data['predict'])
df_data = predict_data[['index', 'exp']]
plt.figure(figsize=(12, 8))
plt.plot(time_series_ori.index, time_series_ori.values)
plt.plot(df_data['index'], df_data['exp'])
print(df_data)
plt.show()
```
Based on the plots above, we first try p=1, q=2 and estimate the model. The results are shown below.
```
                         Results: ARMA
====================================================================
Model:              ARMA             BIC:                 -89.4537
Dependent Variable: y                Log-Likelihood:      51.658
Date:               2020-07-16 01:46 Scale:               1.0000
No. Observations:   32               Method:              mle
Df Model:           3                Sample:              12-31-1979
Df Residuals:       29                                    12-31-2010
Converged:          1.0000           S.D. of innovations: 0.048
No. Iterations:     24.0000          HQIC:                -93.373
AIC:                -95.3166
----------------------------------------------------------------------
            Coef.   Std.Err.       t       P>|t|     [0.025    0.975]
----------------------------------------------------------------------
const      0.1487     0.0130   11.4221    0.0000     0.1232    0.1743
ar.L1.y   -0.0346     0.3180   -0.1087    0.9135    -0.6578    0.5887
ma.L1.y    0.5940     0.2520    2.3571    0.0184     0.1001    1.0880
-----------------------------------------------------------------------------
                 Real          Imaginary          Modulus          Frequency
-----------------------------------------------------------------------------
AR.1         -28.9419             0.0000          28.9419             0.5000
MA.1          -1.6834             0.0000           1.6834             0.5000
```
The p and q here can be tuned; the best pair is the one with the smallest AIC and BIC. After comparison, p=0, q=1 is the ideal order.
There is also a function that selects p and q automatically:

```python
(p, q) = sm.tsa.arma_order_select_ic(time_series, max_ar=3, max_ma=3, ic='aic')['aic_min_order']
# max_ar and max_ma set the upper bounds for the automatic order search. The ic argument
# picks the criterion ('aic' here; 'bic' also works). The function computes the AIC of every
# (p, q) combination from (0, 0) to (3, 3) and keeps the smallest; here the result is (p=0, q=1).
```
Residual and white-noise test
This runs an ADF test on the residual series arma_mod.resid of the ARIMA(0,1,1) model.
```python
# ARMA(0, 1) on the differenced series corresponds to ARIMA(0, 1, 1) on the log series
arma_mod = ARMA(time_series, (0, 1)).fit(disp=-1, method='mle')
resid = arma_mod.resid
t = sm.tsa.stattools.adfuller(resid)
output = pd.DataFrame(index=['Test Statistic Value', 'p-value', 'Lags Used', 'Number of Observations Used',
                             'Critical Value(1%)', 'Critical Value(5%)', 'Critical Value(10%)'], columns=['value'])
output.loc['Test Statistic Value', 'value'] = t[0]
output.loc['p-value', 'value'] = t[1]
output.loc['Lags Used', 'value'] = t[2]
output.loc['Number of Observations Used', 'value'] = t[3]
output.loc['Critical Value(1%)', 'value'] = t[4]['1%']
output.loc['Critical Value(5%)', 'value'] = t[4]['5%']
output.loc['Critical Value(10%)', 'value'] = t[4]['10%']
print(output)
```
The results:

| Item | Result |
|---|---|
| Test Statistic Value | -3.114 |
| p-value | 0.025534 |
| Lags Used | 1 |
| Number of Observations Used | 30 |
| Critical Value(1%) | -3.66992 |
| Critical Value(5%) | -2.96407 |
| Critical Value(10%) | -2.62117 |

The p-value (0.0255) is below 0.05 and the test statistic is below the 5% critical value, so the residual series is stationary.
Model forecasting
The predicted values of the differenced log series for 1979-2013:
```
1979-12-31    0.148903
1980-12-31    0.180234
1981-12-31    0.083328
1982-12-31    0.156691
1983-12-31    0.113660
1984-12-31    0.131687
1985-12-31    0.201155
1986-12-31    0.141603
1987-12-31    0.130448
1988-12-31    0.164734
1989-12-31    0.163724
1990-12-31    0.132480
1991-12-31    0.152886
1992-12-31    0.120100
1993-12-31    0.180473
1994-12-31    0.158549
1995-12-31    0.200706
1996-12-31    0.157206
1997-12-31    0.156108
1998-12-31    0.135910
1999-12-31    0.120548
2000-12-31    0.100750
2001-12-31    0.144692
2002-12-31    0.117811
2003-12-31    0.128515
2004-12-31    0.144937
2005-12-31    0.162649
2006-12-31    0.145614
2007-12-31    0.148473
2008-12-31    0.180175
2009-12-31    0.156672
2010-12-31    0.136277
2011-12-31    0.190270
2012-12-31    0.148903
2013-12-31    0.148903
Freq: A-DEC, dtype: float64
```
Restoring the forecast
The predictions are on the differenced log scale, so restore them by undoing the differencing (adding the predicted differences back onto the log-transformed series) and then taking the exponential.
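The restoration step can be sketched on toy numbers: cumulatively summing the log differences and adding the first log value undoes the differencing, and np.exp undoes the log (the series here is just the first four values of the article's data):

```python
import numpy as np
import pandas as pd

orig = pd.Series([151.0, 188.46, 199.38, 219.75])
logged = np.log(orig)
diffed = logged.diff().dropna()          # what the model actually fits and predicts

# Undo the differencing (cumsum + first log value), then undo the log (exp).
restored = np.exp(logged.iloc[0] + diffed.cumsum())
print(restored.tolist())                 # matches orig[1:]
```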
```
        index           exp
0  1979-12-31    175.244657
1  1980-12-31    225.680412
2  1981-12-31    216.705691
3  1982-12-31    257.027171
4  1983-12-31    270.625744
5  1984-12-31    299.538305
6  1985-12-31    401.351978
7  1986-12-31    456.538812
8  1987-12-31    503.633305
9  1988-12-31    610.492068
10 1989-12-31    737.970970
11 1990-12-31    818.659037
12 1991-12-31    960.560884
13 1992-12-31   1029.935870
14 1993-12-31   1303.656133
15 1994-12-31   1553.619439
16 1995-12-31   2078.975662
17 1996-12-31   2468.474595
18 1997-12-31   2922.112404
19 1998-12-31   3272.310709
20 1999-12-31   3512.971870
21 2000-12-31   3571.595473
22 2001-12-31   4097.347615
23 2002-12-31   4365.716948
24 2003-12-31   4790.560978
25 2004-12-31   5499.452576
26 2005-12-31   6628.204874
27 2006-12-31   7623.198041
28 2007-12-31   8836.736352
29 2008-12-31  11176.065185
30 2009-12-31  13250.438645
31 2010-12-31  14853.411147
32 2011-12-31  19314.030486
33 2012-12-31  22415.103624
34 2013-12-31  26014.087057
```
```python
# Restore the forecast: undo the differencing, then undo the log
predict_data_ori = predict_data
predict_data_ori_re = predict_data_ori.reset_index()
predict_data_ori_re.rename(columns={0: 'value1'}, inplace=True)
time_series_log_re = time_series_log.reset_index()
time_series_log_re.rename(columns={0: 'value2'}, inplace=True)
time_series_log_re.rename(columns={'index': 'index1'}, inplace=True)
# concat the predicted differences with the observed log values
df_predict = pd.concat([predict_data_ori_re, time_series_log_re], axis=1)
df_predict.fillna(0, inplace=True)
print(df_predict[:10])
df_predict['predict'] = df_predict['value1'] + df_predict['value2']
# 2012 and 2013 have no observed log value, so roll the forecast forward
df_predict.loc[33, 'predict'] = df_predict['predict'][32] + df_predict['value1'][33]
df_predict.loc[34, 'predict'] = df_predict['predict'][33] + df_predict['value1'][34]
predict_data = df_predict[['index', 'predict']].copy()  # .copy() avoids SettingWithCopyWarning
predict_data['exp'] = np.exp(predict_data['predict'])
df_data = predict_data[['index', 'exp']]
plt.figure(figsize=(12, 8))
plt.plot(time_series_ori.index, time_series_ori.values)
plt.plot(df_data['index'], df_data['exp'])
print(df_data)
plt.show()
```
Judging from the forecast, the predictions for 2011-2013 are close to the actual values, so the model performs well for short-term forecasting. The modeling mainly relies on the algorithms in the third-party Python library statsmodels.
Drawback: the model depends too heavily on early historical data.