A Linear Model with L2 Regularization: Ridge Regression
Ridge regression is another linear model commonly used in regression analysis; it is essentially a regularized variant of ordinary least squares. Ridge regression helps avoid overfitting: the model keeps all of the features, but shrinks their coefficients so that each feature has a smaller influence on the prediction.
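A minimal sketch of this shrinkage effect on synthetic data (not from the original text; the data and the alpha value are chosen purely for illustration). Since ridge adds an L2 penalty on the weights, its coefficient vector always has a smaller norm than the unregularized least-squares solution:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data: 50 samples, 5 features with known true coefficients.
rng = np.random.RandomState(0)
X = rng.randn(50, 5)
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y = X @ true_coef + rng.randn(50) * 0.5

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10).fit(X, y)

# The L2 penalty shrinks the coefficients toward zero, so the ridge
# coefficient vector has a smaller norm than the least-squares one.
print(np.linalg.norm(ridge.coef_) < np.linalg.norm(ols.coef_))  # True
```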
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

# The Boston housing file stores each sample across two physical rows,
# so the feature columns are reassembled with np.hstack.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
X_train, X_test, y_train, y_test = train_test_split(data, target)
print(X_train.shape)
(379, 13)
ridge = Ridge().fit(X_train, y_train)
print("Ridge training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Ridge test set score: {:.2f}".format(ridge.score(X_test, y_test)))
Ridge training set score: 0.74
Ridge test set score: 0.70
In ridge regression, the alpha parameter controls how strongly the coefficients are shrunk: a larger alpha means stronger regularization and smaller coefficients, while a smaller alpha brings the model closer to ordinary least squares.
ridge10 = Ridge(alpha=10).fit(X_train, y_train)
print("Ridge training set score: {:.2f}".format(ridge10.score(X_train, y_train)))
print("Ridge test set score: {:.2f}".format(ridge10.score(X_test, y_test)))
Ridge training set score: 0.74
Ridge test set score: 0.69
ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
print("Ridge training set score: {:.2f}".format(ridge01.score(X_train, y_train)))
print("Ridge test set score: {:.2f}".format(ridge01.score(X_test, y_test)))
Ridge training set score: 0.75
Ridge test set score: 0.70
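Rather than trying alpha values by hand as above, scikit-learn's RidgeCV can pick one by cross-validation. This is a hedged sketch on synthetic stand-in data (make_regression here replaces the Boston housing split used in the text), not part of the original example:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data with the same number of features (13) as the
# Boston housing set used in the text.
X_cv, y_cv = make_regression(n_samples=200, n_features=13, noise=10.0,
                             random_state=0)

# RidgeCV evaluates each candidate alpha by cross-validation and keeps
# the best one in the alpha_ attribute.
ridge_cv = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X_cv, y_cv)
print("alpha chosen by cross-validation:", ridge_cv.alpha_)
```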
import matplotlib.pyplot as plt

plt.plot(ridge.coef_, 's', label='Ridge alpha = 1')
plt.plot(ridge10.coef_, '^', label='Ridge alpha = 10')
plt.plot(ridge01.coef_, 'v', label='Ridge alpha = 0.1')
plt.xlabel("coefficient index")
plt.ylabel("coefficient magnitude")
plt.grid()
plt.legend()
plt.show()
A Linear Model with L1 Regularization: Lasso Regression
Lasso regression uses L1 regularization. Unlike the L2 penalty, the L1 penalty drives some coefficients to exactly zero, so when there are many features, lasso simply ignores some of them. This acts as a form of feature selection: it makes the model easier to interpret and highlights its most important features.
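A minimal sketch of this on synthetic data (not from the original text; the dataset shape and alpha are assumptions for illustration). Only a handful of the features actually influence the target, and lasso zeroes out most of the rest:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 30 features, only 5 of which actually drive the target.
X_demo, y_demo = make_regression(n_samples=100, n_features=30, n_informative=5,
                                 noise=5.0, random_state=0)

# With an L1 penalty, many coefficients become exactly zero, so counting
# the nonzero entries shows how many features the model really uses.
lasso_demo = Lasso(alpha=5.0).fit(X_demo, y_demo)
n_used = int(np.sum(lasso_demo.coef_ != 0))
print("features with nonzero coefficients:", n_used)
```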
from sklearn.linear_model import Lasso

lasso = Lasso().fit(X_train, y_train)
print("Lasso training set score: {:.2f}".format(lasso.score(X_train, y_train)))
print("Lasso test set score: {:.2f}".format(lasso.score(X_test, y_test)))
print("Number of features used by lasso: {}".format(np.sum(lasso.coef_ != 0)))
Lasso training set score: 0.70
Lasso test set score: 0.65
Number of features used by lasso: 10
lasso01 = Lasso(alpha=0.1, max_iter=100000).fit(X_train, y_train)
print("Lasso training set score: {:.2f}".format(lasso01.score(X_train, y_train)))
print("Lasso test set score: {:.2f}".format(lasso01.score(X_test, y_test)))
print("Number of features used by lasso: {}".format(np.sum(lasso01.coef_ != 0)))
Lasso training set score: 0.74
Lasso test set score: 0.68
Number of features used by lasso: 12
lasso10 = Lasso(alpha=10, max_iter=100000).fit(X_train, y_train)
print("Lasso training set score: {:.2f}".format(lasso10.score(X_train, y_train)))
print("Lasso test set score: {:.2f}".format(lasso10.score(X_test, y_test)))
print("Number of features used by lasso: {}".format(np.sum(lasso10.coef_ != 0)))
Lasso training set score: 0.51
Lasso test set score: 0.57
Number of features used by lasso: 4
lasso0001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
print("Lasso training set score: {:.2f}".format(lasso0001.score(X_train, y_train)))
print("Lasso test set score: {:.2f}".format(lasso0001.score(X_test, y_test)))
print("Number of features used by lasso: {}".format(np.sum(lasso0001.coef_ != 0)))
Lasso training set score: 0.75
Lasso test set score: 0.70
Number of features used by lasso: 13

plt.plot(lasso.coef_, 's', label='Lasso alpha = 1')
plt.plot(lasso10.coef_, '^', label='Lasso alpha = 10')
plt.plot(lasso01.coef_, 'v', label='Lasso alpha = 0.1')
plt.plot(lasso0001.coef_, 'o', label='Lasso alpha = 0.0001')
plt.xlabel("coefficient index")
plt.ylabel("coefficient magnitude")
plt.grid()
plt.legend()
plt.show()