欢迎使用CSDN-markdown编辑器

最新推荐文章于 2024-08-30 08:30:00 发布

tlgt2078

最新推荐文章于 2024-08-30 08:30:00 发布

阅读量510

点赞数

分类专栏： TS 文章标签： test

TS 专栏收录该内容

3 篇文章 0 订阅

订阅专栏

Jupyter 设置
In [1]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
return false;
}
使用pandas 和 pyplot
In [2]:
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

Jupyter 设置

%matplotlib inline
读取数据及简单的预处理
载入数据集
In [3]:
df_fx_data = pd.read_csv(‘BOE-XUDLERD.csv’)
df_fx_data.head()
Out[3]:
Date Value
0 2017-06-09 0.8941
1 2017-06-08 0.8912
2 2017-06-07 0.8878
3 2017-06-06 0.8875
4 2017-06-05 0.8888
查看日期的数据类型：str
In [4]:
type(df_fx_data[‘Date’][0])
Out[4]:
str
In [5]:
df_fx_data[‘Date’] = pd.to_datetime(df_fx_data[‘Date’])
df_fx_data.head()
Out[5]:
Date Value
0 2017-06-09 0.8941
1 2017-06-08 0.8912
2 2017-06-07 0.8878
3 2017-06-06 0.8875
4 2017-06-05 0.8888
现在日期类型为： pandas._libs.tslib.Timestamp
In [6]:
type(df_fx_data[‘Date’][0])
Out[6]:
pandas._libs.tslib.Timestamp
用变量Date作为索引
In [7]:
indexed_df = df_fx_data.set_index(‘Date’)
时间序列的变量即为Value
In [8]:
ts = indexed_df[‘Value’]
ts.head(5)
Out[8]:
Date
2017-06-09 0.8941
2017-06-08 0.8912
2017-06-07 0.8878
2017-06-06 0.8875
2017-06-05 0.8888
Name: Value, dtype: float64
绘制折线图
In [9]:
plt.plot(ts)
Out[9]:
[

seed random number generator

seed(1)

create white noise series

whitenoise = Series([gauss(0.0, 1.0) for i in range(1000)])
acf_pacf(whitenoise, lags=40, title=’White Noise Series’)

汇率时间序列, 显然不是一个白噪声序列
In [15]:
acf_pacf(ts_week, lags=40, title=’Weekly Euro Dollar Exchange Rate’)

稳定时间序列
检查稳定性(Stationarity)
稳定性(Stationarity): 在稳定的时间序列中变量的统计属性，诸如均值(mean)，方差(variance)，自相关(autocorrelation)，自协方差(autocovariance) 等不随时间改变。
adfuller(timeseries, autolag=’AIC’) autolag: 按某个度量自动选择测试使用的lag数量以使该度量最小化
AIC: Akaike Information Criterion, 估算两个回归模型之间的KL散度/相对熵; KL散度/相对熵用于衡量近似分布与真实分布的接近程度
p-value: 显著性概率
In [16]:
from statsmodels.tsa.stattools import adfuller

def test_stationarity(timeseries, window=10):
# 求移动平均和移动标准差
# window为选取的时间窗口个数
rolmean = timeseries.rolling(window=window, center=False).mean()
rolstd = timeseries.rolling(window=window, center=False).std()

# 将以周重取样后的时间序列、移动平均和移动标准差制成折线图
orig = plt.plot(timeseries, color='blue', label='Original')
mean = plt.plot(rolmean, color='red', label='Rolling Mean')
std = plt.plot(rolstd, color='black', label='Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show(block=False)

# 用Augmented Dickey-Fuller检验测试时间序列稳定性:
print('Results of Augmented Dickey-Fuller Test:')
# 使用减小AIC的办法估算ADF测试所需的滞后数
dftest = adfuller(timeseries, autolag='AIC')
# 将ADF测试结果、显著性概率、所用的滞后数和所用的观测数打印出来
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', 
                                         'Num Lags Used', 
                                         'Num Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)

In [17]:
test_stationarity(ts_week)

Results of Augmented Dickey-Fuller Test:
Test Statistic -2.058902
p-value 0.261358
Num Lags Used 2.000000
Num Observations Used 2212.000000
Critical Value (1%) -3.433310
Critical Value (10%) -2.567466
Critical Value (5%) -2.862848
dtype: float64
-2.217962 > Critical Value (5%), p-value > 0.05
移除趋势和季节性
seasonaldecompose: 按照滑动均值将维经过指数转化的时间序列分为趋势(trend), 季节性(seasonality)和残差(residual)三部分
有明显的趋势但季节性并不明显(数值变化范围小）
残差不是白噪音
对数转换: np.log(series)
使用第一差分去掉趋势: $Y' = Y{t} - Y_{t-1}$
比较发现这里log再差分的平稳性较仅差分要好
In [18]:
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.stats.diagnostic import acorr_ljungbox
def decompose_plot(series, title=”):
decomposition = seasonal_decompose(series)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
fig = decomposition.plot()
fig.set_size_inches(10,6)
fig.suptitle(title)
fig.tight_layout()
fig2 = acf_pacf(residual, title=’Residuals’, figsize=(10,4))

每周汇率

decompose_plot(ts_week, title=’ts_week’)

In [19]:

使用第一差分去掉趋势

ts_week_diff = ts_week - ts_week.shift(1)
test_stationarity(ts_week_diff.dropna())
decompose_plot(ts_week_diff.dropna(), title=’ts_week_diff’)

Results of Augmented Dickey-Fuller Test:
Test Statistic -29.988468
p-value 0.000000
Num Lags Used 1.000000
Num Observations Used 2212.000000
Critical Value (1%) -3.433310
Critical Value (10%) -2.567466
Critical Value (5%) -2.862848
dtype: float64

In [20]:

对数转换, 使用第一差分去掉趋势

ts_week_log = np.log(ts_week)
ts_week_log_diff = ts_week_log - ts_week_log.shift(1)
test_stationarity(ts_week_log_diff.dropna())
decompose_plot(ts_week_log_diff.dropna(), title=’ts_week_log_diff’)

plt.show()

Results of Augmented Dickey-Fuller Test:
Test Statistic -36.417784
p-value 0.000000
Num Lags Used 0.000000
Num Observations Used 2213.000000
Critical Value (1%) -3.433308
Critical Value (10%) -2.567466
Critical Value (5%) -2.862847
dtype: float64

建立模型
ARIMA(p, d, q)
找到合适的差分阶数 d
刚才通过一次差分已达到平稳 d=1
也可以通过ACF和PACF来找到合适的d
In [21]:
acf_pacf(ts_week_log, title=’ts_week_log’)

序列在lag>1出有显著的正自相关, 需提高差分阶数
lag=1 时自相关为0或负值, 无需差分
lag=1 时自相关< -0.5, 则为过差分
因此ts_week_log 需要差分一次, ts_week_log_diff是差分一次后的ts_week_log
In [22]:
acf_pacf(ts_week_log_diff, title=’ts_week_log_diff’, lags=10)

序列在lag>1已无显著的正自相关, 无需差分
AR和MA项数 p, q
得到平稳的时间序列后需要决定p和q
参照上图
p: 如果PACF(lag<=k)显著且PACF(lag>k)均不显著, 考虑将p设为k
q: 如果ACF(lag<=k)显著且ACF(lag>k)均不显著, 考虑将q设为k
p: 如果差分后序列的PACF迅速衰减到0而ACF在lag较高处仍有显著的较强的相关, 和/或 ACF(lag=1) > 0, 考虑增加p
q: 如果差分后序列的ACF迅速衰减到0, 和/或 ACF(lag=1) < 0, 考虑增加q
绘出拟合后的ARIMA的拟合残差的ACF和PACF, 按上述规则调整p和/或q
下图为ARIMA(0,1,0)的拟合残差
可以考虑p=1, q=1
In [23]:
from statsmodels.tsa.arima_model import ARIMA

ARIMA的参数: order = p, d, q

model = ARIMA(ts_week_log, order=(0, 1, 0))

已拟合的模型

results_ARIMA = model.fit()

拟合的数据和原始数据的残差

residuals = pd.DataFrame(results_ARIMA.resid)
acf_pacf(residuals, lags=10, title=’Residuals, p=0, d=1, q=0’)

这里ARIMA(1,1,0), ARIMA(0,1,1) 和 ARIMA(1,1,1)都是可用的
ARIMA模型是通过优化AIC, HQIC, BIC等得出的, 一般来说AIC较小的模型拟合度较好, 但拟合度较好不能说明预测能力好
In [24]:
a110=ARIMA(ts_week_log, order=(1, 1, 0)).fit()
a011=ARIMA(ts_week_log, order=(0, 1, 1)).fit()
a111=ARIMA(ts_week_log, order=(1, 1, 1)).fit()
acf_pacf(a110.resid, lags=10, title=’Residuals, p=1, d=1, q=0, aic=%f’%a110.aic)
acf_pacf(a011.resid, lags=10, title=’Residuals, p=1, d=1, q=1, aic=%f’%a011.aic)
acf_pacf(a111.resid, lags=10, title=’Residuals, p=0, d=1, q=1, aic=%f’%a111.aic)

拟合ARIMA模型
In [25]:
from statsmodels.tsa.arima_model import ARIMA

ARIMA的参数: order = p, d, q

model = ARIMA(ts_week_log, order=(1, 1, 1))

已拟合的模型

results_ARIMA = model.fit()

ARIMA的结果

print(results_ARIMA.summary())

比较拟合的数据和原始数据,

注意fittedvalues返回的是d次差分后的序列, 因此和ts_week_log_diff比较

plt.plot(ts_week_log_diff)
plt.plot(results_ARIMA.fittedvalues, color=’red’)

拟合的数据和原始数据的残差

residuals = pd.DataFrame(results_ARIMA.resid)

计算拟残差平方和(RSS)

plt.title(‘RSS: %.4f’% sum((results_ARIMA.fittedvalues-ts_week_log_diff)**2))
acf_pacf(residuals, lags=10, title=’Residuals’)

残差的核密度估计

residuals.plot(kind=’kde’)
print(‘\n\nResiduals Summary:\n’, residuals.describe())

plt.show()

ARIMA Model Results

Dep. Variable: D.Value No. Observations: 2214
Model: ARIMA(1, 1, 1) Log Likelihood 6796.505
Method: css-mle S.D. of innovations 0.011
Date: Tue, 20 Jun 2017 AIC -13585.010
Time: 05:37:20 BIC -13562.200
Sample: 01-12-1975 HQIC -13576.678

- 06-11-2017

coef std err z P>|z| [0.025 0.975]

const 6.925e-05 0.000 0.223 0.824 -0.001 0.001
ar.L1.D.Value 0.1447 0.087 1.666 0.096 -0.026 0.315
ma.L1.D.Value 0.1125 0.087 1.286 0.199 -0.059 0.284

Roots

Real Imaginary Modulus Frequency

AR.1 6.9095 +0.0000j 6.9095 0.0000

MA.1 -8.8887 +0.0000j 8.8887 0.5000

Residuals Summary:
0
count 2214.000000
mean 0.000001
std 0.011237
min -0.061446
25% -0.006740
50% -0.000251
75% 0.006910
max 0.063923

ARIMA模型会使拟合和原始数据的残差变为白噪声
In [26]:
acf_pacf(results_ARIMA.fittedvalues-ts_week_log_diff, title=’Residuals’)

将拟合的值逆向转换并可视化
In [27]:
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff.head()
Out[27]:
Date
1975-01-12 0.000069
1975-01-19 -0.002406
1975-01-26 0.001011
1975-02-02 -0.004028
1975-02-09 -0.001174
Freq: W-SUN, dtype: float64
差分的逆向转换为累计和
指数的逆向转换为对数
In [28]:

一阶差分的逆向转换为累计和

predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()

对数函数的逆向转换为指数函数

predictions_ARIMA_log = pd.Series(ts_week_log.iloc [0], index=ts_week_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)
比较拟合的序列和原序列的均根差, 即为拟合错误
In [29]:
predictions_ARIMA = np.exp(predictions_ARIMA_log)
plt.plot(ts_week)
plt.plot(predictions_ARIMA)
plt.title(‘RMSE: %.4f’% np.sqrt(sum((predictions_ARIMA-ts_week)**2)/len(ts_week)))
Out[29]:

我们的ARIMA模型针对的是序列ts_week_log

因此使用的数据为ts_week_log

因此须将预测做指数转换

使用 n_test_obs 个点校验

n_test_obs = 50
size = int(len(ts_week_log) - n_test_obs)

切分数据集

train, test = ts_week_log[:size], ts_week_log[size:len(ts_week_log)]

历史值

history = [x for x in train]

预测值

predictions = list()

置信区间

confidence_intervals = list()

print(‘Predicted vs Expected Values…’)
print(‘\n’)
for t in range(len(test)):
model = ARIMA(history, order=(1,1,1))
model_fit = model.fit()
output = model_fit.forecast(steps=1)
yhat = output[0]
predictions.append(float(yhat))
confidence_intervals.append(output[2])
obs = test[t]
history.append(obs)

# 须将预测做指数转换
print('predicted=%f, expected=%f' % (np.exp(yhat), np.exp(obs)))

error = mean_squared_error(test, predictions)

print(‘\n’)
predictions_series = pd.Series(predictions, index = test.index)
print(‘Test MSE: %.6f’ % error)
Predicted vs Expected Values…

predicted=0.886215, expected=0.902880
predicted=0.907080, expected=0.902180
predicted=0.901661, expected=0.902660
predicted=0.902891, expected=0.908120
predicted=0.909561, expected=0.905840
predicted=0.905181, expected=0.896540
predicted=0.894294, expected=0.897600
predicted=0.898145, expected=0.886780
predicted=0.884036, expected=0.885600
predicted=0.885608, expected=0.896075
predicted=0.898827, expected=0.891100
predicted=0.889625, expected=0.891240
predicted=0.891471, expected=0.893720
predicted=0.894386, expected=0.890820
predicted=0.890071, expected=0.894120
predicted=0.895091, expected=0.904860
predicted=0.907586, expected=0.913620
predicted=0.915668, expected=0.917360
predicted=0.918185, expected=0.903520
predicted=0.899997, expected=0.913160
predicted=0.916041, expected=0.936860
predicted=0.942785, expected=0.945060
predicted=0.946632, expected=0.941900
predicted=0.941006, expected=0.936860
predicted=0.935742, expected=0.948440
predicted=0.951627, expected=0.958640
predicted=0.961011, expected=0.954200
predicted=0.952895, expected=0.951600
predicted=0.951158, expected=0.945180
predicted=0.943669, expected=0.938440
predicted=0.936958, expected=0.932520
predicted=0.931239, expected=0.928520
predicted=0.927704, expected=0.935940
predicted=0.938014, expected=0.942780
predicted=0.944393, expected=0.945760
predicted=0.946430, expected=0.945500
predicted=0.945439, expected=0.943900
predicted=0.943574, expected=0.935900
predicted=0.933967, expected=0.926700
predicted=0.924628, expected=0.927140
predicted=0.927547, expected=0.938880
predicted=0.941948, expected=0.941850
predicted=0.942353, expected=0.933150
predicted=0.930947, expected=0.918800
predicted=0.915453, expected=0.914425
predicted=0.913740, expected=0.917740
predicted=0.918738, expected=0.900400
predicted=0.895946, expected=0.891920
predicted=0.890302, expected=0.890350
predicted=0.890181, expected=0.889880

Test MSE: 0.000073
In [31]:
predictions_series = pd.Series(predictions, index = test.index)

fig, ax = plt.subplots(figsize=(10, 6))
ax.set(title=’Spot Exchange Rate, Euro into USD’, xlabel=’Date’, ylabel=’Euro into USD’)
ax.plot(ts_week[-n_test_obs:], ‘o’, label=’observed’)

须将预测做指数转换

ax.plot(np.exp(predictions_series), ‘g’, label=’rolling one-step out-of-sample forecast’)
legend = ax.legend(loc=’upper left’)

可以对季节性建模的 SARIMA 模型
In [32]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from dateutil.relativedelta import relativedelta
import seaborn as sns
import statsmodels.api as sm
from statsmodels.tsa.stattools import acf
from statsmodels.tsa.stattools import pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from pandas import DatetimeIndex
Portland Oregon 月平均乘公车人数 (/100) January 1960 至 June 1969, 样本数=114
In [33]:

读取CSV

df = pd.read_csv(‘portland-oregon-average-monthly-.csv’)
print(df.head(), ‘\nindex type:\n’, type(df.index))

按’%Y-%m’格式转换CSV的日期, 为了方便处理, 将时间段转为timestamp

df[‘Month’] = pd.to_datetime(df[‘Month’], format=’%Y-%m’)

索引并resample为月

indexed_df = df.set_index(‘Month’)
ts = indexed_df[‘riders’]
ts = ts.resample(‘M’).sum()
print(ts.head(), ‘\nindex type:\n’, type(ts.index))
Month riders
0 1960-01 648
1 1960-02 646
2 1960-03 639
3 1960-04 654
4 1960-05 630
index type:

SARIMA的参数: order = p, d, q, seasonal_order=P, D, Q, season的周期=12

model = SARIMAX(ts, order=(0,1,0),seasonal_order=(0,1,1,12))

已拟合的模型

fitted = model.fit()

ARIMA的结果

print(fitted.summary())

比较拟合的数据和原始数据,

ts.plot()
fitted.fittedvalues.plot(color=’red’)

拟合的数据和原始数据的残差

residuals = pd.DataFrame(fitted.resid)

acf_pacf(residuals, title=’Residuals’, figsize=(10,4))

计算拟残差平方和(RSS)

plt.title(‘RSS: %.4f’% sum((fitted.fittedvalues-ts)**2))

残差的核密度估计

residuals.plot(kind=’kde’)
print(‘\n\nResiduals Summary:\n’, residuals.describe())

plt.show()

Statespace Model Results

Dep. Variable: riders No. Observations: 114
Model: SARIMAX(0, 1, 0)x(0, 1, 1, 12) Log Likelihood -504.683
Date: Tue, 20 Jun 2017 AIC 1013.365
Time: 05:37:32 BIC 1018.838
Sample: 01-31-1960 HQIC 1015.586
- 06-30-1969

Covariance Type: opg

coef std err z P>|z| [0.025 0.975]

ma.S.L12 -0.6937 0.118 -5.867 0.000 -0.925 -0.462

sigma2 1185.5644 183.539 6.459 0.000 825.834 1545.295

Ljung-Box (Q): 43.16 Jarque-Bera (JB): 2.01
Prob(Q): 0.34 Prob(JB): 0.37
Heteroskedasticity (H): 1.49 Skew: 0.28

Prob(H) (two-sided): 0.25 Kurtosis: 3.39

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Residuals Summary:
0
count 114.000000
mean 0.920410
std 72.659634
min -214.018001
25% -22.898384
50% -5.576031
75% 18.514757
max 648.000000

预测并校验模型
这种校验方法称为 one-step out-of-sample forecast
out-of-sample 是指预测时所用样本/观测值不在训练集内
将原序列分为训练和测试集
将训练集设为历史, 此时时间为t
迭代, 迭代次数为测试集的大小:
1. 在历史上拟合SARIMA模型, 参数为先前得到的p,d,q,P,D,Q
2. 生成t+1的预测, 并将其加入历史
比较预测值和真实值
In [40]:
from sklearn.metrics import mean_squared_error

我们的ARIMA模型针对的是序列ts_week_log

因此使用的数据为ts_week_log

因此须将预测做指数转换

使用 n_test_obs 个点校验

n_test_obs = 20
size = int(len(ts) - n_test_obs)

切分数据集

train, test = ts[:size], ts[size:len(ts)]

历史值

history = [x for x in train]

预测值

predictions = list()

置信区间

confidence_intervals = list()

print(‘Predicted vs Expected Values…’)
print(‘\n’)
for t in range(len(test)):
model = SARIMAX(history, order=(0,1,0),seasonal_order=(0,1,1,12))
model_fit = model.fit()
output = model_fit.forecast(steps=1)
yhat = output[0]
predictions.append(float(yhat))
obs = test[t]
history.append(obs)

print('predicted=%f, expected=%f' % (yhat, obs))

error = mean_squared_error(test, predictions)

print(‘\n’)
predictions_series = pd.Series(predictions, index = test.index)
print(‘Test MSE: %.6f’ % error)

predictions_series = pd.Series(predictions, index = test.index)

fig, ax = plt.subplots(figsize=(10, 6))
ax.set(title=’Bus Riders’, xlabel=’Month’, ylabel=’100 Bus Riders’)
ax.plot(ts[-n_test_obs:], ‘o’, label=’observed’)
ax.plot(predictions_series, ‘g’, label=’rolling one-step out-of-sample forecast’)
legend = ax.legend(loc=’upper left’)
Predicted vs Expected Values…

predicted=1475.509981, expected=1424.000000
predicted=1367.194981, expected=1360.000000
predicted=1441.147682, expected=1429.000000
predicted=1461.899309, expected=1440.000000
predicted=1406.502132, expected=1414.000000
predicted=1438.593286, expected=1424.000000
predicted=1396.225163, expected=1408.000000
predicted=1386.002350, expected=1337.000000
predicted=1299.644308, expected=1258.000000
predicted=1214.892724, expected=1214.000000
predicted=1324.750929, expected=1326.000000
predicted=1353.201730, expected=1417.000000
predicted=1424.747027, expected=1417.000000
predicted=1357.550530, expected=1329.000000
predicted=1407.819886, expected=1461.000000
predicted=1483.069065, expected=1425.000000
predicted=1392.212819, expected=1419.000000
predicted=1441.571043, expected=1432.000000
predicted=1408.266795, expected=1394.000000
predicted=1357.971777, expected=1327.000000