Following the ordinary linear regression from the first tutorial, let's build an ordinary linear regression model on the load_boston dataset and see what problems come up.
First, let's take a look at the load_boston dataset:
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.data.shape)
You will see something similar to:
(506, 13)
This means the dataset has 506 samples, each with 13 features.
from sklearn.datasets import load_boston
boston = load_boston()
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error,r2_score
X = boston.data
y = boston.target
num_training = int(0.7*len(X))
# Split the dataset: first 70% for training, the rest for testing
X_train = X[:num_training]
y_train = y[:num_training]
X_test = X[num_training:]
y_test = y[num_training:]
reg = linear_model.LinearRegression()
# Train the model
reg.fit(X_train,y_train)
# Make predictions on the test set
y_pred = reg.predict(X_test)
# Print the model parameters
print("Coefficients", reg.coef_)
print(reg.coef_.shape)
print("Intercept", reg.intercept_)
# Compute the mean squared error
print("Test-set MSE", mean_squared_error(y_test, y_pred))
# Compute the R^2 score
print("R^2 score", r2_score(y_test, y_pred))
You will see something similar to:
Coefficients [ 1.29693856 0.01469497 0.04050457 0.79060732 -9.12933243 9.24839787
-0.0451214 -0.91395374 0.14079658 -0.01477291 -0.63369567 0.01577172
-0.09514128]
(13,)
Intercept -13.6721465522
Test-set MSE 545.445002115
R^2 score -7.2211853282
This linear regression model has 13 coefficients plus one intercept term. Notice that the R^2 score on the test set is negative: the plain model generalizes poorly here.
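To make the "13 coefficients plus one intercept" concrete: a fitted LinearRegression predicts with nothing more than the linear form X @ coef_ + intercept_. The sketch below checks this on synthetic data from make_regression (used as a stand-in, since recent scikit-learn versions no longer ship load_boston), shaped like the Boston data: 506 samples, 13 features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for load_boston: 506 samples, 13 features.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

reg = LinearRegression()
reg.fit(X, y)

# The model's prediction is exactly the linear form X @ coef_ + intercept_.
manual_pred = X @ reg.coef_ + reg.intercept_
print(np.allclose(manual_pred, reg.predict(X)))  # True
```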
Ridge regression addresses some of the problems of Ordinary Least Squares (e.g., overfitting) by imposing a penalty on the size of the coefficients (regularization).
When a dataset has many features but relatively few samples, ordinary linear regression easily overfits; ridge regression reduces this risk by introducing an L2 regularization term.
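The L2 penalty also gives ridge regression a closed-form solution: w = (X^T X + alpha * I)^(-1) X^T y, with the intercept left unpenalized (computed on centered data). As a sketch, we can verify this against sklearn's Ridge on synthetic make_regression data (a stand-in, since load_boston is absent from recent scikit-learn versions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=13, noise=5.0, random_state=0)
alpha = 0.5

# Center X and y so the intercept is not penalized (matching sklearn's behavior).
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Closed form: w = (X^T X + alpha * I)^{-1} X^T y on the centered data.
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
b = y.mean() - X.mean(axis=0) @ w

reg = Ridge(alpha=alpha).fit(X, y)
print(np.allclose(w, reg.coef_, atol=1e-6))   # coefficients match
print(abs(b - reg.intercept_) < 1e-6)         # intercept matches
```

When alpha = 0 this reduces to ordinary least squares; larger alpha shrinks the coefficients harder.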
from sklearn.datasets import load_boston
boston = load_boston()
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error,r2_score
X = boston.data
y = boston.target
num_training = int(0.7*len(X))
# Split the dataset: first 70% for training, the rest for testing
X_train = X[:num_training]
y_train = y[:num_training]
X_test = X[num_training:]
y_test = y[num_training:]
reg = linear_model.Ridge(alpha = .5)
# Train the model
reg.fit(X_train,y_train)
# Make predictions on the test set
y_pred = reg.predict(X_test)
# Print the model parameters
print("Coefficients", reg.coef_)
print(reg.coef_.shape)
print("Intercept", reg.intercept_)
# Compute the mean squared error
print("Test-set MSE", mean_squared_error(y_test, y_pred))
# Compute the R^2 score
print("R^2 score", r2_score(y_test, y_pred))
You will see something similar to:
Coefficients [ 1.06913232 0.01534766 0.03083921 0.81470562 -5.44619698 9.22075685
-0.04681829 -0.86607139 0.13700694 -0.01498462 -0.60960326 0.01610884
-0.10287555]
(13,)
Intercept -15.7171489971
Test-set MSE 398.766928585
R^2 score -5.01038933338
As we can see, ridge regression does reduce the risk of overfitting: on the same train/test split, its test-set error is lower than that of ordinary linear regression.
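One way to see why: the L2 penalty shrinks the coefficient vector toward zero, so the ridge solution always has a smaller norm than the OLS solution for alpha > 0. A minimal sketch on synthetic data (make_regression, with few samples relative to the number of features, i.e., the regime where OLS tends to overfit):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Few samples relative to 13 features: the regime where OLS tends to overfit.
X, y = make_regression(n_samples=30, n_features=13, noise=20.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# The L2 penalty shrinks the coefficient vector toward zero.
print(np.linalg.norm(ridge.coef_) < np.linalg.norm(ols.coef_))  # True
```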
Using cross-validation to select the regularization parameter
from sklearn import linear_model
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
boston = load_boston()
X = boston.data
y = boston.target
num_training = int(0.7*len(X))
X_train = X[:num_training]
y_train = y[:num_training]
X_test = X[num_training:]
y_test = y[num_training:]
reg = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
reg.fit(X_train,y_train)
print("Best alpha:", reg.alpha_)
y_pred = reg.predict(X_test)
print("Test-set MSE", mean_squared_error(y_test, y_pred))
You will see something similar to:
Best alpha: 0.1
Test-set MSE 499.704726051
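RidgeCV hides the selection loop (by default it uses an efficient leave-one-out scheme). The same idea can be written out by hand with k-fold cross-validation: fit a Ridge model per candidate alpha and keep the one with the best average validation score. A sketch on synthetic make_regression data (a stand-in for load_boston, which newer scikit-learn versions no longer include):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)
alphas = [0.1, 1.0, 10.0]

# Score each candidate alpha with 5-fold cross-validation
# (neg_mean_squared_error: higher is better).
scores = [cross_val_score(Ridge(alpha=a), X, y, cv=5,
                          scoring="neg_mean_squared_error").mean()
          for a in alphas]
best = alphas[int(np.argmax(scores))]
print("Best alpha:", best)
```

Note that this 5-fold scheme is not identical to RidgeCV's default leave-one-out procedure, so the selected alpha can differ.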
References
http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets