The idea of stepwise regression
Stepwise regression is a method for selecting, from a large pool of candidate variables, those that matter most for building the regression model.
- Each step introduces or removes exactly one variable: the predictor with the most significant effect on the response is brought in, then every variable already in the model is re-tested one by one, and any that have become non-significant are removed from the equation. The final model thus omits no predictor that is significant for the response, and contains none that is not.
- Significance levels must be set for entering and removing variables, with alpha(in) <= alpha(out); otherwise the procedure can fall into an infinite loop, with a variable that was just removed being selected back in.
- Generally, for small samples one might use alpha(in)=0.05, alpha(out)=0.1; for large samples, alpha(in)=0.1, alpha(out)=0.15.
- A variable is entered when the p-value of its F test is < alpha(in); after a new variable enters, if any variable in the model has an F-test p-value > alpha(out), it should be removed.
- The choice of significance levels has a considerable effect on the resulting regression equation.
Stepwise regression is essentially ordinary least squares regression; the only difference is that after each fit, the significance of the predictors is checked and the variable set is revised accordingly.
In Python this is done mainly with OLS from the statsmodels.api module.
Code implementation:
import pandas as pd
import numpy as np
import statsmodels.api as sm
def stepwise(X, y, alpha_in=0.1, alpha_out=0.15):
    '''X: DataFrame of all candidate predictors,
    y: response variable,
    alpha_in: significance level for entering a variable,
    alpha_out: significance level for removing a variable'''
    included = []
    while True:
        changed = False
        excluded = list(set(X.columns) - set(included))
        p_val = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_column]])).fit()
            p_val[new_column] = model.pvalues[new_column]
        min_pval = p_val.min()
        # forward step
        if min_pval < alpha_in:
            changed = True
            add_feature = p_val.idxmin()
            included.append(add_feature)
            print("Add {:20} with p_value {:.6}".format(add_feature, min_pval))
        # backward step
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        p_val = model.pvalues.iloc[1:]   # skip the constant term
        max_pval = p_val.max()
        if max_pval > alpha_out:
            changed = True
            drop_feature = p_val.idxmax()
            included.remove(drop_feature)
            print("Drop {:20} with p_value {:.6}".format(drop_feature, max_pval))
        if not changed:
            break
    return included
Data example
data=pd.read_csv("/Users/mac/Desktop/test.csv",sep=",")
columns=['x1', 'x2', 'x3', 'x4']
X=data[columns]
y=data["y"]
result = stepwise(X,y)
print('resulting features:')
print(result)
#### Having found the best predictors, use them to build the regression equation
new_columns = result
model = sm.OLS(y, sm.add_constant(X[new_columns])).fit()
model.summary2()
Forward stepwise selection based on maximum adjusted R-squared
def forward_selection(X, y):
    included = []
    current_R, best_R = 0.0, 0.0
    while True:
        changed = False
        excluded = list(set(X.columns) - set(included))
        if not excluded:   # every column is already in the model
            break
        adj_R = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_column]])).fit()
            adj_R[new_column] = model.rsquared_adj
        best_R = adj_R.max()
        add_feature = adj_R.idxmax()
        if best_R > current_R:
            changed = True
            included.append(add_feature)
            current_R = best_R
            print("Add {}".format(add_feature))
        if not changed:
            break
    return included