1 背景
1.1 AR模型
AutoRegress,AR,自回归模型,描述当前值和历史数据的关系,利用历史数据进行预测。AR模型必须满足平稳性。p阶自回归方程的定义为:
y
t
=
u
+
∑
i
=
1
p
r
i
y
t
−
i
+
ϵ
t
y_t=u+\sum_{i=1}^{p}r_iy_{t-i}+\epsilon_t
yt=u+i=1∑priyt−i+ϵt
其中yt是当前值,u是常数项,p是阶数,ri是自回归系数,后面是误差
- 自回归(AR),就是指当前值只与历史值有关,用自己预测自己
- p阶自回归,指当前值与前p个值有关
- 求常数u与自回归系数ri
- 自回归模型的限制
(1)自回归模型是用自身的数据来进行预测,即建模使用的数据与预测使用的数据是同一组数据;
(2)必须具有平稳性;
(3)必须具有自相关性,如果自相关系数(φi)小于0.5,则不宜采用;
(4)自回归只适用于预测与自身前期相关的现象。
1.2 移动平均模型MA
移动平均模型关注的是自回归模型中的误差项的累加,移动平均法能有效地消除预测中的随机波动。q阶移动平均模型公式为:
y
t
=
u
+
ϵ
t
+
∑
i
=
−
1
q
θ
i
ϵ
t
−
i
y_t = u+\epsilon_t+\sum_{i=-1}^q\theta_i\epsilon_{t-i}
yt=u+ϵt+i=−1∑qθiϵt−i
- q阶自回归,指当前值与前q个误差有关
- 求常数u与系数θi
1.3 自回归移动平均模型(ARMA)
y t = u + ∑ i = 1 p r i y t − i + ϵ t + ∑ i = 1 q θ i ϵ t − i y_t=u+\sum_{i=1}^{p}r_iy_{t-i}+\epsilon_t+\sum_{i=1}^{q}\theta_i\epsilon_{t-i} yt=u+i=1∑priyt−i+ϵt+i=1∑qθiϵt−i
- p与q分别为自回归模型与移动平均模型的阶数,需要人为定义
- ri与θi分别是两个模型的相关系数,需要求解
- 如果原始数据不满足平稳性要求而进行了差分,则为差分自相关移动平均模型(ARIMA),将差分后所得的新数据带入ARMA公式中即可
1.4 判断时序数据是否稳定的方法
严谨的定义: 一个时间序列的随机变量是稳定的,当且仅当它的所有统计特征都是独立于时间的(是关于时间的常量)。
判断的方法:
(1)稳定的数据是没有趋势(trend),没有周期性(seasonality)的; 即它的均值,在时间轴上拥有常量的振幅,并且它的方差,在时间轴上是趋于同一个稳定的值的。
(2)可以使用Dickey-Fuller Test进行假设检验
1.5 ARIMA
ARIMA模型
有三个参数:p,d,q。
- p --代表预测模型中采用的时序数据本身的滞后数(lags) ,也叫做AR/Auto-Regressive项
- d --代表时序数据需要进行几阶差分化,才是稳定的,也叫Integrated项。
- q --代表预测模型中采用的预测误差的滞后数(lags),也叫做MA/Moving Average项
先解释一下差分: 假设y表示t时刻的Y的差分:
ARIMA的预测模型可以表示为:
假设p,q,d已知,ARIMA用数学形式表示为:
建模步骤:
- 获取被观测系统时间序列数据;
- 对数据绘图,观测是否为平稳时间序列;对于非平稳时间序列要先进行d阶差分运算,化为平稳时间序列;
- 经过第二步处理,已经得到平稳时间序列。要对平稳时间序列分别求得其自相关系数ACF和偏自相关系数PACF,通过对自相关图和偏自相关图的分析,得到最佳的阶层 p 和阶数 q
- 由以上得到的d、q、p,得到ARIMA模型。然后开始对得到的模型进行模型检验。
总结:
d、p、q这三者的选择,一般而言就算不是非平稳的时间序列数据,经过一阶差分或者二阶差分就可以转换为弱平稳甚至是平稳的时间序列数据;对于p和q的选择一般需要根据ACF和PACF图进行判断,下面说明如何根据ACF和PACF图得到相应的p、q值。
ARIMA优缺点
- 优点: 模型十分简单,只需要内生变量而不需要借助其他外生变量。
- 缺点:
(1)要求时序数据是稳定的(stationary),或者是通过差分化(differencing)后是稳定的。
(2)本质上只能捕捉线性关系,而不能捕捉非线性关系。
注意,采用ARIMA模型预测时序数据,必须是稳定的,如果不稳定的数据,是无法捕捉到规律的。比如股票数据用ARIMA无法预测的原因就是股票数据是非稳定的,常常受政策和新闻的影响而波动。
2 SARIMA简介
Seasonal Autoregression Moving Average model, SARIMA季节自回归移动平均模型。对于周期性时间序列,首先需要去除周期性,去除的方式是在周期间隔上做一次ARIMA,此时可以得到一个非平稳非周期性的时间序列,然后在此基础之上再一次使用ARIMA进行分析。可以表示为:
S
A
R
I
M
A
(
p
,
d
,
q
)
×
(
P
,
D
,
Q
)
s
SARIMA(p,d,q)\times(P,D,Q)s
SARIMA(p,d,q)×(P,D,Q)s
P: 周期性自回归阶数.
D: 周期性差分阶数.
Q: 周期性移动平均阶数.
S: 周期时间间隔.
p,d,q的含义与上面的ARIMA里面含义相同
3 代码
SARIMA代码如下
2.1 导入相关包
import warnings # do not disturbe mode
warnings.filterwarnings('ignore')
# Load packages
import numpy as np # vectors and matrices
import pandas as pd # tables and data manipulations
import matplotlib.pyplot as plt # plots
import seaborn as sns # more plots
from dateutil.relativedelta import relativedelta # working with dates with style
from scipy.optimize import minimize # for function minimization
import statsmodels.formula.api as smf # statistics and econometrics
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs
from itertools import product # some useful functions
from tqdm import tqdm_notebook
# Importing everything from forecasting quality metrics
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import median_absolute_error, mean_squared_error, mean_squared_log_error
2.2 ACF,PACF画图函数
ACF和PACF对于估计参数p,q非常重要
# MAPE 平均绝对误差
def mean_absolute_percentage_error(y_true, y_pred):
return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
def tsplot(y, lags=None, figsize=(12, 7), style='bmh'):
"""
Plot time series, its ACF and PACF, calculate Dickey–Fuller test
y - timeseries
lags - how many lags to include in ACF, PACF calculation
"""
if not isinstance(y, pd.Series):
y = pd.Series(y)
with plt.style.context(style):
fig = plt.figure(figsize=figsize)
layout = (2, 2)
ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
acf_ax = plt.subplot2grid(layout, (1, 0))
pacf_ax = plt.subplot2grid(layout, (1, 1))
y.plot(ax=ts_ax)
p_value = sm.tsa.stattools.adfuller(y)[1]
ts_ax.set_title('Time Series Analysis Plots\n Dickey-Fuller: p={0:.5f}'.format(p_value))
smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
plt.tight_layout()
2.3 可视化
导入数据
ads = pd.read_csv('ads.csv',index_col=['Time'], parse_dates=['Time'])
数据折线图
plt.figure(figsize=(18, 6))
plt.plot(ads.Ads)
plt.title('Ads watched (hourly data)')
plt.grid(True)
plt.show()
ACF和PACF图
tsplot(ads.Ads, lags=60)
# The seasonal difference
ads_diff = ads.Ads - ads.Ads.shift(24)
tsplot(ads_diff[24:], lags=60)
ads_diff = ads_diff - ads_diff.shift(1)
tsplot(ads_diff[24+1:], lags=60)
2.3 SARIMA 参数
- p is most probably 4 since it is the last significant lag on the PACF, after which, most others are not significant.
- d equals 1 because we had first differences
- q should be somewhere around 4 as well as seen on the ACF
- P might be 2, since 24-th and 48-th lags are somewhat significant on the PACF
- D again equals 1 because we performed seasonal differentiation
- Q is probably 1. The 24-th lag on ACF is significant while the 48-th is not
# setting initial values and some bounds for them
ps = range(2, 5)
d=1
qs = range(2, 5)
Ps = range(0, 2)
D=1
Qs = range(0, 2)
s = 24 # season length is still 24
# creating list with all the possible combinations of parameters
parameters = product(ps, qs, Ps, Qs)
parameters_list = list(parameters)
len(parameters_list)
36
2.4 训练模型
def optimizeSARIMA(y, parameters_list, d, D, s):
"""Return dataframe with parameters and corresponding AIC
y - time series
parameters_list - list with (p, q, P, Q) tuples
d - integration order in ARIMA model
D - seasonal integration order
s - length of season
"""
results = []
best_aic = float("inf")
for param in tqdm_notebook(parameters_list):
# we need try-except because on some combinations model fails to converge
try:
model=sm.tsa.statespace.SARIMAX(y, order=(param[0], d, param[1]),
seasonal_order=(param[2], D, param[3], s)).fit(disp=-1)
except:
continue
aic = model.aic
# saving best model, AIC and parameters
if aic < best_aic:
best_model = model
best_aic = aic
best_param = param
results.append([param, model.aic])
result_table = pd.DataFrame(results)
result_table.columns = ['parameters', 'aic']
# sorting in ascending order, the lower AIC is - the better
result_table = result_table.sort_values(by='aic', ascending=True).reset_index(drop=True)
return result_table
%%time
warnings.filterwarnings("ignore")
result_table = optimizeSARIMA(ads.Ads, parameters_list, d, D, s)
0%| | 0/36 [00:00<?, ?it/s]
Wall time: 1min 34s
result_table.head()
parameters | aic | |
---|---|---|
0 | (2, 3, 1, 1) | 3888.642174 |
1 | (3, 2, 1, 1) | 3888.763568 |
2 | (4, 2, 1, 1) | 3890.279740 |
3 | (3, 3, 1, 1) | 3890.513196 |
4 | (2, 4, 1, 1) | 3892.302849 |
# set the parameters that give the lowest AIC
p, q, P, Q = result_table.parameters[0]
best_model=sm.tsa.statespace.SARIMAX(ads.Ads, order=(p, d, q),
seasonal_order=(P, D, Q, s)).fit(disp=-1)
print(best_model.summary())
SARIMAX Results
============================================================================================
Dep. Variable: Ads No. Observations: 216
Model: SARIMAX(2, 1, 3)x(1, 1, [1], 24) Log Likelihood -1936.321
Date: Mon, 29 Nov 2021 AIC 3888.642
Time: 10:29:43 BIC 3914.660
Sample: 09-13-2017 HQIC 3899.181
- 09-21-2017
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.7913 0.270 2.928 0.003 0.262 1.321
ar.L2 -0.5503 0.306 -1.799 0.072 -1.150 0.049
ma.L1 -0.7316 0.262 -2.793 0.005 -1.245 -0.218
ma.L2 0.5651 0.282 2.005 0.045 0.013 1.118
ma.L3 -0.1811 0.092 -1.964 0.049 -0.362 -0.000
ar.S.L24 0.3312 0.076 4.351 0.000 0.182 0.480
ma.S.L24 -0.7635 0.104 -7.361 0.000 -0.967 -0.560
sigma2 4.574e+07 5.61e-09 8.15e+15 0.000 4.57e+07 4.57e+07
===================================================================================
Ljung-Box (L1) (Q): 0.88 Jarque-Bera (JB): 10.56
Prob(Q): 0.35 Prob(JB): 0.01
Heteroskedasticity (H): 0.65 Skew: -0.28
Prob(H) (two-sided): 0.09 Kurtosis: 4.00
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 3.18e+32. Standard errors may be unstable.
tsplot(best_model.resid[24+1:], lags=60)
2.5 预测
定义可视化函数
def plotSARIMA(series, model, n_steps):
"""Plots model vs predicted values
series - dataset with timeseries
model - fitted SARIMA model
n_steps - number of steps to predict in the future
"""
# adding model values
data = series.copy()
data.columns = ['actual']
data['sarima_model'] = model.fittedvalues
# making a shift on s+d steps, because these values were unobserved by the model
# due to the differentiating
data['sarima_model'][:s+d] = np.NaN
# forecasting on n_steps forward
forecast = model.predict(start = data.shape[0], end = data.shape[0]+n_steps)
forecast = data.sarima_model.append(forecast)
# calculate error, again having shifted on s+d steps from the beginning
error = mean_absolute_percentage_error(data['actual'][s+d:], data['sarima_model'][s+d:])
plt.figure(figsize=(15, 7))
plt.title("Mean Absolute Percentage Error: {0:.2f}%".format(error))
plt.plot(forecast, color='r', label="model")
plt.axvspan(data.index[-1], forecast.index[-1], alpha=0.5, color='lightgrey')
plt.plot(data.actual, label="actual")
plt.legend()
plt.grid(True)
预测长度为50
plotSARIMA(ads, best_model, 50)