I am running a rolling OLS regression with a window of 100 on the dataset found at this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), which has the following format.
time X Y
0.000543 0 10
0.000575 0 10
0.041324 1 10
0.041331 2 10
0.041336 3 10
0.04134 4 10
...
9.987735 55 239
9.987739 56 239
9.987744 57 239
9.987749 58 239
9.987938 59 239
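The third column (Y) in my dataset is the true value, which is what I want to predict (estimate). I want to predict the current value of Y from the previous 3 rolling values of X. To do this, I am using the following Python script with statsmodels.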
#!/usr/bin/python -tt
import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.read_csv('estimated_pred.csv')
df = df.dropna()  # drop NaNs in case there are any
window = 100
# print(df.index)  # to print the index
df['a'] = None   # constant (intercept)
df['b1'] = None  # beta1 (coefficient on time)
df['b2'] = None  # beta2 (coefficient on X)
for i in range(window, len(df)):
    # fit on the previous `window` rows only
    temp = df.iloc[i - window:i, :]
    RollOLS = sm.OLS(temp.loc[:, 'Y'], sm.add_constant(temp.loc[:, ['time', 'X']], has_constant='add')).fit()
    # params is a pandas Series, so index it by position with .iloc
    df.iloc[i, df.columns.get_loc('a')] = RollOLS.params.iloc[0]
    df.iloc[i, df.columns.get_loc('b1')] = RollOLS.params.iloc[1]
    df.iloc[i, df.columns.get_loc('b2')] = RollOLS.params.iloc[2]
# Predicted values in a row, using the coefficients from the previous row
df['predicted'] = df['a'].shift(1) + df['b1'].shift(1) * df['time'] + df['b2'].shift(1) * df['X']
# print(df['predicted'])
print(df)  # full frame, including the new columns
This gives me sample output in the following format.
time X Y a b1 b2 predicted
0 0.000543 0 10 None None None NaN
1 0.000575 0 10 None None None NaN
2 0.041324 1 10 None None None NaN
3 0.041331 2 10 None None None NaN
4 0.041336 3 10 None None None NaN
.. ... .. .. ... ... ... ...
50 0.041340 4 10 10 0 1.55431e-15 NaN
51 0.041345 5 10 10 1.7053e-13 7.77156e-16 10
52 0.041350 6 10 10 1.74623e-09 -7.99361e-15 10
53 0.041354 7 10 10 6.98492e-10 -6.21725e-15 10
.. ... .. .. ... ... ... ...
509 0.160835 38 20 20 4.88944e-09 -1.15463e-14 20
510 0.160839 39 20 20 1.86265e-09 5.32907e-15 20
.. ... .. .. ... ... ... ...
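Finally, I want to include the mean squared error (MSE) over all predicted values (as a summary of the OLS regression analysis). For example, if we look at row 5, the value of X is 2 and the value of Y is 10. Suppose the predicted value of Y for the current row is 6; the squared error for that row would then be (10-6)^2. sm.OLS returns an instance of <class 'statsmodels.regression.linear_model.OLS'>, and this is what we get from print(RollOLS.summary()):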
OLS Regression Results
==============================================================================
Dep. Variable: Y R-squared: -inf
Model: OLS Adj. R-squared: -inf
Method: Least Squares F-statistic: -48.50
Date: Tue, 04 Jul 2017 Prob (F-statistic): 1.00
Time: 22:19:18 Log-Likelihood: 2359.7
No. Observations: 100 AIC: -4713.
Df Residuals: 97 BIC: -4706.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 239.0000 2.58e-09 9.26e+10 0.000 239.000 239.000
time 4.547e-13 2.58e-10 0.002 0.999 -5.12e-10 5.13e-10
X -3.886e-16 1.1e-13 -0.004 0.997 -2.19e-13 2.19e-13
==============================================================================
Omnibus: 44.322 Durbin-Watson: 0.000
Prob(Omnibus): 0.000 Jarque-Bera (JB): 86.471
Skew: -1.886 Prob(JB): 1.67e-19
Kurtosis: 5.556 Cond. No. 9.72e+04
==============================================================================
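However, the value of rsquared (print(RollOLS.rsquared)), for example, should be between 0 and 1, not -inf, which looks like a missing-intercept problem. If we want to print the MSE, the documentation says to use print(RollOLS.mse_model), and so on. How can we add the intercept and print the regression statistics with correct values, the same way we do for the predicted values? What am I doing wrong here? Or is there another way to do this using the scikit-learn library?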
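To make the MSE I am after concrete, here is a minimal sketch in plain Python. The helper name `mse_skip_missing` and the sample values are made up for illustration; they are not part of my script or of statsmodels. Rows that have no prediction yet (NaN, as in the first rows of the output above) are skipped.

```python
import math


def mse_skip_missing(actual, predicted):
    """Mean of squared errors over the rows that have a prediction.

    Hypothetical helper: rows whose prediction is None or NaN are ignored.
    """
    squared_errors = [
        (y - y_hat) ** 2
        for y, y_hat in zip(actual, predicted)
        if y_hat is not None and not math.isnan(y_hat)
    ]
    if not squared_errors:
        # No usable predictions at all
        return float("nan")
    return sum(squared_errors) / len(squared_errors)


# Made-up values: the first two rows have no prediction,
# the last row is the (10 - 6)^2 case from the example above.
actual = [10, 10, 10, 10]
predicted = [float("nan"), float("nan"), 10.0, 6.0]
mse = mse_skip_missing(actual, predicted)
```

On the DataFrame itself, something like `((df['Y'] - df['predicted']) ** 2).mean()` should give the same quantity, assuming the `predicted` column is numeric (it may need `astype(float)` first, since the coefficient columns start out as `None`). As far as I can tell this is not the same thing as `RollOLS.mse_model`, which statsmodels defines as the explained sum of squares divided by the model degrees of freedom.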