First, write out from memory the code covered in this session:
One block implements forward selection by AIC, and two short blocks run Lasso regression and Ridge regression.
AIC is the Akaike information criterion; its close relative is BIC, the Bayesian information criterion. Both are criteria used for feature (model) selection.
The forward-selection code works as follows: in a dataset, some features are only weakly related to the target variable, and clinging to them weakens the model's generalization and robs predictions of robustness. So we add promising features one at a time; once adding another feature no longer decreases the AIC, that feature and those after it are likely to hurt prediction, so we keep the essence and discard the dross.
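To make the criterion concrete, here is a minimal sketch (not part of the original notes) of how AIC can be computed from scratch for an OLS fit with Gaussian errors. statsmodels' `.aic` is built on the same maximized log-likelihood, though it counts parameters slightly differently; the ranking of models is unaffected.

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC = 2k - 2*logL for OLS with Gaussian errors.
    Illustrative helper, not the statsmodels implementation:
    here k counts the coefficients, the intercept and sigma^2."""
    n, p = X.shape
    Xc = np.column_stack([np.ones(n), X])          # add an intercept column
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)  # ordinary least squares
    rss = np.sum((y - Xc @ beta) ** 2)
    # maximized Gaussian log-likelihood, with sigma^2 = RSS / n plugged in
    loglik = -n / 2 * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    k = p + 2                                      # slopes + intercept + sigma^2
    return 2 * k - 2 * loglik

# toy data: y depends on features 0 and 2, feature 1 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=100)

aic_full = gaussian_aic(X, y)          # true model
aic_null = gaussian_aic(X[:, :0], y)   # intercept-only model
print(aic_full, aic_null)
```

A lower AIC is better: the informative model scores far below the intercept-only model, which is exactly the comparison forward selection repeats at every step.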
import numpy as np
import pandas as pd
from sklearn import datasets
from statsmodels.formula.api import ols

# NOTE: load_boston was removed in scikit-learn 1.2; run this with an
# older version, or load the Boston data from another source.
boston = datasets.load_boston()
X = boston.data
y = boston.target
features = boston.feature_names
boston_data = pd.DataFrame(X, columns=features)
boston_data['Price'] = y
def forward_select(data, target):
    """Greedy forward selection: at each step add the candidate feature
    that yields the lowest AIC, and stop once AIC no longer improves."""
    variate = set(data.columns)
    variate.remove(target)
    selected = []
    current_score, best_new_score = float('inf'), float('inf')
    while variate:
        aic_with_variate = []
        for candidate in variate:
            # fit OLS on the already-selected features plus this candidate
            formula = "{}~{}".format(target, "+".join(selected + [candidate]))
            aic = ols(formula=formula, data=data).fit().aic
            aic_with_variate.append((aic, candidate))
        aic_with_variate.sort(reverse=True)
        best_new_score, best_candidate = aic_with_variate.pop()  # lowest AIC
        if current_score > best_new_score:
            variate.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
            print("aic is {},continuing!".format(current_score))
        else:
            print("selection is over!")
            break
    formula = "{}~{}".format(target, "+".join(selected))
    print("the final formula is {}".format(formula))
    model = ols(formula=formula, data=data).fit()
    return model
boston_forward_select_model = forward_select(data=boston_data,target='Price')
boston_forward_select_model.summary()
from sklearn import linear_model

reg_rid = linear_model.Ridge(alpha=0.5)  # alpha is the L2 penalty strength
reg_rid.fit(X, y)
reg_rid.score(X, y)  # R^2 on the training data
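The role of `alpha` in Ridge is easiest to see by sweeping it and watching the coefficients shrink. A small sketch (using scikit-learn's diabetes dataset, since `load_boston` was removed in scikit-learn 1.2; the shrinkage behaviour is the same):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)

# L2 norm of the fitted coefficients for increasing penalty strength
norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_)
         for a in (0.01, 1.0, 100.0)]
print(norms)
```

The norms decrease monotonically as `alpha` grows: Ridge shrinks all coefficients toward zero but never sets them exactly to zero.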
reg_lasso = linear_model.Lasso(alpha=0.5)  # alpha is the L1 penalty strength
reg_lasso.fit(X, y)
reg_lasso.score(X, y)  # R^2 on the training data
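Unlike Ridge, the L1 penalty in Lasso drives some coefficients exactly to zero, which is why Lasso doubles as a feature selector. A quick sketch (again on the diabetes dataset, as `load_boston` is no longer available in recent scikit-learn):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

small = Lasso(alpha=0.1, max_iter=10000).fit(X, y)
large = Lasso(alpha=1.0, max_iter=10000).fit(X, y)

# count the coefficients that were zeroed out at each penalty level
zeros_small = int((small.coef_ == 0).sum())
zeros_large = int((large.coef_ == 0).sum())
print(zeros_small, zeros_large)
```

The stronger the penalty, the more features are dropped, which mirrors what forward selection does, but through the loss function instead of an explicit search.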
Draw two figures: one of variance versus bias, and another comparing Ridge regression and Lasso regression.
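For the second figure, one common way to contrast the two penalties is to plot their coefficient paths side by side as `alpha` varies. A hedged sketch with matplotlib (the dataset and alpha grid are my own choices, not from the original notes):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso

X, y = load_diabetes(return_X_y=True)
alphas = np.logspace(-2, 2, 50)

# one row of coefficients per alpha value
ridge_path = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in alphas])
lasso_path = np.array([Lasso(alpha=a, max_iter=10000).fit(X, y).coef_
                       for a in alphas])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(alphas, ridge_path)
ax1.set_xscale("log")
ax1.set_title("Ridge coefficient paths")
ax1.set_xlabel("alpha")
ax2.plot(alphas, lasso_path)
ax2.set_xscale("log")
ax2.set_title("Lasso coefficient paths")
ax2.set_xlabel("alpha")
fig.savefig("ridge_vs_lasso.png")
```

The Ridge panel shows smooth shrinkage of all coefficients, while the Lasso panel shows coefficients hitting exactly zero one by one as `alpha` increases.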