ARIMA
An observed time series can be decomposed into three main components: the trend (the long-term cycle), the seasonal component (systematic, calendar-related movements), and the irregular component (unsystematic, short-term fluctuations).
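As a rough illustration of this decomposition, the sketch below splits a synthetic series into the three components, assuming an additive model (series = trend + seasonal + irregular) and a known period. The helper name `decompose_additive`, the period of 12, and the generated data are assumptions made for this example only.

```python
import numpy as np

def decompose_additive(y, period):
    """Split y into trend, seasonal and irregular parts (additive model)."""
    y = np.asarray(y, dtype=float)
    # Centered moving average; even periods use a 2 x period window
    # with half weights at both ends.
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(len(y), np.nan)  # edges stay NaN: no full window there
    trend[half:len(y) - half] = np.convolve(y, weights, mode="valid")
    detrended = y - trend
    # Average the detrended values at each position within the cycle.
    seasonal_means = np.array(
        [np.nanmean(detrended[k::period]) for k in range(period)]
    )
    seasonal_means -= seasonal_means.mean()  # seasonal effects sum to zero
    seasonal = seasonal_means[np.arange(len(y)) % period]
    irregular = y - trend - seasonal
    return trend, seasonal, irregular

# Synthetic data: linear trend + yearly-style cycle + noise.
rng = np.random.default_rng(0)
t = np.arange(240)
true_seasonal = 2.0 * np.sin(2 * np.pi * t / 12)
y = 0.1 * t + true_seasonal + rng.normal(scale=0.3, size=t.size)

trend, seasonal, irregular = decompose_additive(y, period=12)
```

The trend is undefined for the first and last half-window of observations, which is why those entries are left as NaN rather than extrapolated.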
An ARIMA model has three parts: Auto-Regressive (AR), Integrated (I), and Moving Average (MA).
AR: autoregression, which regresses the series on its own past values. It emphasizes the dependent relationship between an observation and its preceding ('lagged') observations.
I: integrated, which differences the data to make it stationary; the order is the number of times the data is differenced. Differencing typically means subtracting the preceding observation from the current one.
MA: moving average, which models the relationship between the current observation and the residual errors of a moving average model applied to lagged observations.
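Each part has a corresponding parameter:
- p: the lag order, i.e. the number of lagged observations included in the model.
- d: the degree of differencing, i.e. the number of times the raw observations are differenced.
- q: the order of the moving average, i.e. the size of the moving average window.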
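To make the meaning of d concrete, here is a minimal sketch using `np.diff`, whose `n` argument plays the role of d. The two toy series are made up for illustration: one difference flattens a linear trend, and two differences flatten a quadratic one.

```python
import numpy as np

t = np.arange(10, dtype=float)
linear = 3.0 + 2.0 * t           # linear trend
quadratic = 1.0 + 0.5 * t ** 2   # quadratic trend

# d = 1: each observation minus the one before it.
linear_d1 = np.diff(linear, n=1)

# d = 2: difference the already differenced series once more.
quadratic_d2 = np.diff(quadratic, n=2)

print(linear_d1)
print(quadratic_d2)
```

Each round of differencing shortens the series by one observation, so a series of length n has n - d values left after d differences.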
Before fitting an ARIMA model, the time series is assumed to be stationary and univariate, so the complete modeling workflow is:
- Load and preprocess the data.
- Check the stationarity of the data with an Augmented Dickey-Fuller test (from statsmodels.tsa.stattools import adfuller). If the data is stationary, proceed to the next steps; if not, make it stationary by differencing.
- Determine the degree of differencing (d).
- Determine the lag order (p) and the moving average order (q) from the PACF (partial autocorrelation function) and ACF (autocorrelation function) plots.
- Fit the model and make predictions.
- Check the performance of the model by calculating the RMSE (root mean square error) between the actual and predicted values.
AR model
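A pure Auto Regressive (AR only) model is one where $Y_t$ depends only on its own lags:
$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + \epsilon_t$$
where $Y_{t-1}$ is lag 1 of the series, $\beta_1$ is the coefficient of lag 1 that the model estimates, $\alpha$ is the intercept term, also estimated by the model, and $\epsilon_t$ is the error term.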
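To see what "estimated by the model" means for the intercept and the lag-1 coefficient, the sketch below simulates an AR(1) series and recovers both values by ordinary least squares. The true values (alpha = 2.0, beta1 = 0.6), the series length, and the use of plain least squares instead of a dedicated time-series library are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta1, n = 2.0, 0.6, 5000

# Simulate Y_t = alpha + beta1 * Y_{t-1} + error_t.
y = np.empty(n)
y[0] = alpha / (1 - beta1)  # start at the long-run mean of the process
for t in range(1, n):
    y[t] = alpha + beta1 * y[t - 1] + rng.normal()

# Regress Y_t on a constant and Y_{t-1}.
X = np.column_stack([np.ones(n - 1), y[:-1]])
alpha_hat, beta1_hat = np.linalg.lstsq(X, y[1:], rcond=None)[0]

print(f"alpha_hat={alpha_hat:.3f}, beta1_hat={beta1_hat:.3f}")
```

With a long enough series the estimates should land close to the values used in the simulation; shorter series give noisier estimates.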
MA model
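A pure Moving Average (MA only) model is one where $Y_t$ depends only on the lagged forecast errors:
$$Y_t = \alpha + \epsilon_t + \phi_1 \epsilon_{t-1} + \phi_2 \epsilon_{t-2} + \dots + \phi_q \epsilon_{t-q}$$
where $\phi_1, \dots, \phi_q$ are the coefficients of the lagged errors, and the error terms are the errors of the autoregressive models of the respective lags. The errors $\epsilon_t$ and $\epsilon_{t-1}$ are the errors from the following equations:
$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \epsilon_t$$
$$Y_{t-1} = \alpha + \beta_1 Y_{t-2} + \beta_2 Y_{t-3} + \dots + \epsilon_{t-1}$$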
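A small simulation can show the signature of an MA model. The sketch below generates an MA(1) series, assuming the errors are independent white-noise draws (the usual textbook simplification, not something stated above) and a hypothetical coefficient `phi1 = 0.5`. For such a process the theoretical autocorrelation is phi1 / (1 + phi1^2) at lag 1 and zero at every lag beyond q.

```python
import numpy as np

def sample_acf(x, lag):
    """Sample autocorrelation of x at a positive lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

rng = np.random.default_rng(1)
phi1, n = 0.5, 20000

# MA(1): Y_t = constant + error_t + phi1 * error_{t-1}
errors = rng.normal(size=n + 1)
y = 10.0 + errors[1:] + phi1 * errors[:-1]

acf1 = sample_acf(y, 1)  # theory: phi1 / (1 + phi1**2) = 0.4
acf2 = sample_acf(y, 2)  # theory: 0, because lag 2 exceeds q = 1

print(f"lag 1: {acf1:.3f}, lag 2: {acf2:.3f}")
```

This cut-off after lag q is the reason the ACF plot is used to choose q.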
ARIMA model
An ARIMA model is one where the time series has been differenced at least once to make it stationary and the AR and MA terms are combined. In words, the equation becomes:
Predicted Yt = Constant + Linear combination of lags of Y (up to p lags) + Linear combination of lagged forecast errors (up to q lags)
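Because the AR and MA terms are fitted to the differenced series, forecasts come out on the differenced scale and have to be integrated back to the original scale (the "I" in ARIMA). The sketch below shows that step for d = 1. The numbers in `dy_forecast` are hypothetical placeholders for whatever the fitted AR and MA terms would predict.

```python
import numpy as np

y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
dy = np.diff(y)  # first differences: 2, 3, 4, 5

# Differencing is reversible: the first value plus the running sum of
# the differences rebuilds the rest of the series.
rebuilt = y[0] + np.cumsum(dy)

# Hypothetical forecasts of the next three differences.
dy_forecast = np.array([6.0, 7.0, 8.0])

# Integrate: start from the last observed level and accumulate.
y_forecast = y[-1] + np.cumsum(dy_forecast)
print(y_forecast)  # [30. 37. 45.]
```

For d = 2 the same accumulation would be applied twice, once per round of differencing.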
Determining the values of p, d, and q
Determining d
What is stationarity?
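Stationarity comes in two forms, strict and weak (wide-sense). Strict stationarity is a very demanding definition: all statistical properties of the series are invariant over time. Weak stationarity is more relaxed: only the low-order moments (the mean and the autocovariance) need to be stable over time.
A stationary series has two important properties: 1. The mean of the series is constant. 2. The autocovariance and autocorrelation functions depend only on the time lag, not on where in time the window starts or ends.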
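A crude way to probe the constant-mean property is to compare summary statistics over the two halves of a series. The sketch below does this for white noise and for the same noise with a linear trend added. The helper `half_stats`, the slope of 0.05, and the synthetic data are assumptions for illustration, and this check is an informal diagnostic, not a replacement for a formal test.

```python
import numpy as np

def half_stats(x):
    """Return (means, variances) of the first and second half of x."""
    x = np.asarray(x, dtype=float)
    first, second = x[: len(x) // 2], x[len(x) // 2:]
    return (first.mean(), second.mean()), (first.var(), second.var())

rng = np.random.default_rng(7)
noise = rng.normal(size=2000)               # stationary
trending = 0.05 * np.arange(2000) + noise   # mean drifts upward over time

noise_means, noise_vars = half_stats(noise)
trend_means, trend_vars = half_stats(trending)

print("white noise means:", noise_means)
print("trending means:   ", trend_means)
```

The two halves of the white noise have nearly the same mean, while the trending series shows a large gap between them, which violates the first property.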
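The purpose of differencing is to make the data stationary, so take care not to over-difference: an over-differenced series is still stationary, but it distorts the estimation of the model parameters. The right order of differencing is the minimum number of differences needed to get a near-stationary series, one that roams around a well-defined mean and whose ACF plot reaches zero fairly quickly.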
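Signs that d has been chosen poorly:
- If the autocorrelations are still positive for lags greater than 10, the series needs further differencing.
- If the lag-1 autocorrelation is strongly negative, the series is probably over-differenced.
- In practice it can be hard to choose between two values of d; in that case, pick the d whose differenced series has the smaller standard deviation.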
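These rules of thumb can be checked numerically. The sketch below applies them to a synthetic random walk, for which d = 1 is the appropriate choice by construction. The helper `lag1_autocorr` and the data are assumptions for illustration.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation of x at lag 1."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

rng = np.random.default_rng(3)
walk = np.cumsum(rng.normal(size=2000))  # random walk: needs one difference

summary = {}
for d in range(3):
    diffed = np.diff(walk, n=d)  # n=0 returns the series unchanged
    summary[d] = (diffed.std(), lag1_autocorr(diffed))
    print(f"d={d}: std={summary[d][0]:.3f}, lag-1 acf={summary[d][1]:.3f}")

# Tie-breaker from the list above: prefer the d with the smallest std.
best_d = min(summary, key=lambda d: summary[d][0])
```

Differencing white noise produces a lag-1 autocorrelation of about -0.5 in theory, so a strongly negative value at d = 2 is the over-differencing signal described above.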
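Example
- First, use the Augmented Dickey-Fuller (ADF) test (from statsmodels.tsa.stattools import adfuller) to check whether the data is stationary. The null hypothesis of the ADF test is that the time series is non-stationary. So, if the p-value of the test is less than the significance level (0.05), you reject the null hypothesis and infer that the time series is indeed stationary.
- In this example the p-value is greater than 0.05, so the data is non-stationary and needs to be differenced.
- If the data is stationary, then d = 0. If not, difference it, and as d increases, plot the autocorrelation and partial autocorrelation at each step to help judge a suitable value of d: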
import numpy as np
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.figsize': (9, 7), 'figure.dpi': 120})
# Import data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/wwwusage.csv', names=['value'], header=0)
# Original Series
fig, axes = plt.subplots(3, 2, sharex=True)
axes[0, 0].plot(df.value);
axes[0, 0].set_title('Original Series')
plot_acf(df.value, ax=axes[0, 1])
# 1st Differencing
axes[1, 0].plot(df.value.diff());
axes[1, 0].set_title('1st Order Differencing')
plot_acf(df.value.diff().dropna(), ax=axes[1, 1])
# 2nd Differencing
axes[2, 0].plot(df.value.diff().diff());
axes[2, 0].set_title('2nd Order Differencing')
plot_acf(df.value.diff().diff().dropna(), ax=axes[2, 1])
plt