Machine Learning Basics (3): Regularization

柚子味的羊

Categories: Data Analysis, Machine Learning. Tags: machine learning, python, data analysis

Article link: https://blog.csdn.net/qq_43368987/article/details/113773256


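I. Overview

In the previous two parts we built a suitable model, trained it on the training set, and then used it to make predictions on the test set. Comparing those predictions with the actual test-set values gives a score, and different scores reflect different levels of model accuracy. Any dataset contains both underlying signal and noise; we want the model to learn the signal without being misled by the noise.

Below is a dataset; let's see how different models fit it.
Figure 1 fits the data with a linear model.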

# Linear model
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# step1: create the dataset (squaring y makes the relationship non-linear)
x, y = make_regression(n_samples=100, n_features=1, noise=15, random_state=0)
y = y ** 2
# step2: wrap training in a pipeline and fit the data with a linear model
model = Pipeline([('poly', PolynomialFeatures(degree=1)), ('linear', LinearRegression(fit_intercept=False))])
# step3: train the model on the dataset
model = model.fit(x, y)
# step4: predict with the model
y_predictions = model.predict(x)
# step5: plot the fitted line over the data points
sns.set_style('darkgrid')
plt.plot(x, y_predictions, color='black')
plt.scatter(x, y, marker='o')
plt.xticks(())
plt.yticks(())
plt.tight_layout()
plt.show()

Output:
Figure 1: linear model
Figure 2 fits the data with a quadratic model.

# Quadratic model
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np

# step1: create the dataset
x, y = make_regression(n_samples=100, n_features=1, noise=15, random_state=0)
y = y ** 2
# step2: wrap training in a pipeline and fit the data with a quadratic model
model = Pipeline([('poly', PolynomialFeatures(degree=2)), ('linear', LinearRegression(fit_intercept=False))])
# step3: train the model on the dataset
model = model.fit(x, y)
# step4: predict on an evenly spaced grid so the curve is drawn smoothly
x_plot = np.linspace(x.min(), x.max(), 100)
x_plot = x_plot[:, np.newaxis]
y_predictions = model.predict(x_plot)
# step5: plot the fitted curve over the data points
sns.set_style('darkgrid')
plt.plot(x_plot, y_predictions, color='black')
plt.scatter(x, y, marker='o')
plt.xticks(())
plt.yticks(())
plt.tight_layout()
plt.show()

Output: Figure 2: quadratic model
Figure 3 fits the data with a high-degree polynomial model.

# High-degree polynomial model
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np

# step1: create the dataset
x, y = make_regression(n_samples=100, n_features=1, noise=15, random_state=0)
y = y ** 2
# step2: wrap training in a pipeline and fit the data with a degree-10 model
model = Pipeline([('poly', PolynomialFeatures(degree=10)), ('linear', LinearRegression(fit_intercept=False))])
# step3: train the model on the dataset
model = model.fit(x, y)
# step4: predict on an evenly spaced grid so the curve is drawn smoothly
x_plot = np.linspace(x.min(), x.max(), 100)
x_plot = x_plot[:, np.newaxis]
y_predictions = model.predict(x_plot)
# step5: plot the fitted curve over the data points
sns.set_style('darkgrid')
plt.plot(x_plot, y_predictions, color='black')
plt.scatter(x, y, marker='o')
plt.xticks(())
plt.yticks(())
plt.tight_layout()
plt.show()

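Output: Figure 3: high-degree model
Analysis: The model in Figure 1 underfits, the model in Figure 2 fits the data well, and the model in Figure 3 overfits. Of the three, the Figure 3 model looks the most precise on the data it was trained on, but once we add more data its predictions are no longer accurate, while the predictions of the Figure 2 model remain fairly accurate. This is because the Figure 3 model has overfitted: an overfitted model performs very well on the training set but generalizes poorly to new data. The Figure 2 model scores lower than the Figure 3 model on the training data but generalizes better, so on balance it is the most suitable of the three.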
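One way to check the underfitting and overfitting claims is to hold out part of the data and compare scores. The sketch below is a minimal illustration, not part of the original figures: the 70/30 split and the `random_state=0` seed are assumptions chosen for the example, and the exact scores will depend on them.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic dataset as in the figures above
x, y = make_regression(n_samples=100, n_features=1, noise=15, random_state=0)
y = y ** 2

# Assumption: hold out 30% of the samples as a test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

train_scores, test_scores = {}, {}
for degree in (1, 2, 10):
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression(fit_intercept=False)),
    ])
    model.fit(x_train, y_train)
    # Pipeline.score returns R^2 for a regression estimator
    train_scores[degree] = model.score(x_train, y_train)
    test_scores[degree] = model.score(x_test, y_test)
    print(f"degree={degree:>2}  train R^2={train_scores[degree]:.3f}  "
          f"test R^2={test_scores[degree]:.3f}")
```

The training score can only go up as the degree increases, because each higher-degree model contains the lower-degree ones. The test score is the number that tells us whether the extra flexibility actually helps on unseen data.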

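Regularization avoids overfitting by penalizing complex models. It adds a penalty term to the cost function, made up of a tuning parameter multiplied by a function of the model's weights. A more complex model tends to have larger weights and therefore receives a larger penalty. We adjust the tuning parameter (called alpha in scikit-learn) until the cost function strikes a suitable balance between fitting the data and keeping the model simple.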
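The idea of a penalized cost function can be sketched in a few lines of NumPy. This is only an illustration: the function name `penalized_cost`, the squared-error data term, and the toy arrays are assumptions made for the example, not code from this series.

```python
import numpy as np

def penalized_cost(X, y, w, alpha, penalty="l2"):
    """Squared-error cost plus a weight penalty scaled by alpha (hypothetical helper)."""
    residuals = y - X @ w
    data_term = np.sum(residuals ** 2)
    if penalty == "l2":
        penalty_term = np.sum(w ** 2)       # sum of squared weights
    elif penalty == "l1":
        penalty_term = np.sum(np.abs(w))    # sum of absolute weights
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return float(data_term + alpha * penalty_term)

# Toy data: these weights fit the data exactly, so the data term is 0
X = np.eye(2)
y = np.array([1.0, 2.0])
w = np.array([1.0, 2.0])

print(penalized_cost(X, y, w, alpha=0.0))                 # 0.0 (no penalty)
print(penalized_cost(X, y, w, alpha=1.0, penalty="l2"))   # 5.0 (1^2 + 2^2)
print(penalized_cost(X, y, w, alpha=1.0, penalty="l1"))   # 3.0 (|1| + |2|)
```

With `alpha=0` the cost only measures the fit; as `alpha` grows, large weights become more expensive, which is what pushes the optimizer toward simpler models.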

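II. Regularization Methods

1. Ridge regression

Ridge regression is a regularization method in which the cost function includes the sum of the squared weights. An example of a ridge-regularized cost function, with $\alpha$ as the tuning parameter, is

$$J(w) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} w_j^2$$

Ridge regression lets the model learn from all of the features without overfitting. When we do not have a large number of features but still want to avoid overfitting, ridge regression is a good choice.
Figure 4 compares the models with and without ridge regression.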

# High-degree polynomial model, with and without ridge regression
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
import numpy as np

# step1: create the dataset
x, y = make_regression(n_samples=100, n_features=1, noise=15, random_state=0)
y = y ** 2
# step2.1: wrap training in a pipeline and fit the data with a degree-10 model
model = Pipeline([('poly', PolynomialFeatures(degree=10)), ('linear', LinearRegression(fit_intercept=False))])
# step2.2: the same polynomial model with ridge regression
regModel = Pipeline([('poly', PolynomialFeatures(degree=10)), ('ridge', Ridge(alpha=5.0))])

# step3: train both models on the dataset
model = model.fit(x, y)
regModel = regModel.fit(x, y)
# step4: predict on an evenly spaced grid so the curves are drawn smoothly
x_plot = np.linspace(x.min(), x.max(), 100)
x_plot = x_plot[:, np.newaxis]
y_plot = model.predict(x_plot)
yReg_plot = regModel.predict(x_plot)
# step5: plot (black = unregularized, red = ridge)
sns.set_style('darkgrid')
plt.plot(x_plot, y_plot, color='black')
plt.plot(x_plot, yReg_plot, color='red')
plt.scatter(x, y, marker='o')
plt.xticks(())
plt.yticks(())
plt.tight_layout()
plt.show()

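Output:
Figure 4: comparison with and without ridge regression
Analysis: In the figure above, the black line is the model without ridge regression and the red line is the model with ridge regression. The red line is noticeably smoother and fits the underlying data better.
The code that was added is: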

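# step2.1: wrap training in a pipeline and fit the data with a degree-10 model
model = Pipeline([('poly', PolynomialFeatures(degree=10)), ('linear', LinearRegression(fit_intercept=False))])
# step2.2: the same polynomial model with ridge regression
regModel = Pipeline([('poly', PolynomialFeatures(degree=10)), ('ridge', Ridge(alpha=5.0))])

As you can see, adding ridge regression is as simple as replacing the LinearRegression step in the pipeline with Ridge, where alpha is the tuning parameter that controls the strength of the penalty.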
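To get a feel for what alpha does, one can fit the same ridge pipeline with a few different values and look at the size of the learned weights. The sketch below is an illustration under assumptions: the three alpha values are arbitrary, and a `StandardScaler` step (not used in the code above) is added so that the degree-10 features are on comparable scales.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same synthetic dataset as in the figures above
x, y = make_regression(n_samples=100, n_features=1, noise=15, random_state=0)
y = y ** 2

alphas = [0.1, 5.0, 500.0]   # assumption: illustrative values only
coef_norms = []
for alpha in alphas:
    reg_model = Pipeline([
        ('poly', PolynomialFeatures(degree=10, include_bias=False)),
        ('scale', StandardScaler()),   # assumption: added for numerical stability
        ('ridge', Ridge(alpha=alpha)),
    ])
    reg_model.fit(x, y)
    # Euclidean norm of the fitted weight vector (intercept excluded)
    norm = np.linalg.norm(reg_model.named_steps['ridge'].coef_)
    coef_norms.append(norm)
    print(f"alpha={alpha:>6}: ||w|| = {norm:.2f}")
```

A larger alpha shrinks the weights more strongly, which is why the red curve in Figure 4 is smoother than the black one.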

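2. Lasso regression

Lasso regression is a regularization method in which the cost function includes the sum of the absolute values of the weights. Lasso and ridge regression are very similar; the difference is that lasso sums the absolute values of the weights, while ridge sums their squares.
Unlike ridge regression, lasso regression can force some weights to be exactly 0. This means the trained model does not have to use every feature. For example, if we have a million features of which only a few are useful, lasso lets us keep those few and ignore the rest. Lasso regression therefore helps us avoid overfitting while considering only the useful features.
Figure 5 compares the models with and without lasso regression.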
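The feature-selection effect can be seen by counting how many weights each method sets to exactly zero. This is a small sketch on a separate synthetic dataset: the 50 features with 5 informative ones, and `alpha=5.0` for both models, are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Assumption: 50 features, only 5 of which actually influence y
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10, random_state=0)

lasso = Lasso(alpha=5.0, max_iter=100000).fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)

# Count the weights that are exactly zero in each fitted model
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print(f"lasso: {lasso_zeros} of {lasso.coef_.size} weights are exactly 0")
print(f"ridge: {ridge_zeros} of {ridge.coef_.size} weights are exactly 0")
```

Ridge shrinks weights toward zero but generally leaves them non-zero, whereas the absolute-value penalty of lasso drives many of them to exactly zero.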

# High-degree polynomial model, with and without lasso regression
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.pipeline import Pipeline
import numpy as np

# step1: create the dataset
x, y = make_regression(n_samples=100, n_features=1, noise=15, random_state=0)
y = y ** 2
# step2.1: wrap training in a pipeline and fit the data with a degree-10 model
model = Pipeline([('poly', PolynomialFeatures(degree=10)), ('linear', LinearRegression(fit_intercept=False))])
# step2.2: the same polynomial model with lasso regression
regModel = Pipeline([('poly', PolynomialFeatures(degree=10)), ('lasso', Lasso(alpha=5, max_iter=1000000))])

# step3: train both models on the dataset
model = model.fit(x, y)
regModel = regModel.fit(x, y)
# step4: predict on an evenly spaced grid so the curves are drawn smoothly
x_plot = np.linspace(x.min(), x.max(), 100)
x_plot = x_plot[:, np.newaxis]
y_plot = model.predict(x_plot)
yReg_plot = regModel.predict(x_plot)
# step5: plot (black = unregularized, red = lasso)
sns.set_style('darkgrid')
plt.plot(x_plot, y_plot, color='black')
plt.plot(x_plot, yReg_plot, color='red')
plt.scatter(x, y, marker='o')
plt.xticks(())
plt.yticks(())
plt.tight_layout()
plt.show()

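Output:
Figure 5: comparison with and without lasso regression
Analysis: In the figure above, the black line is the model without lasso regression and the red line is the model with lasso regression. The red line is noticeably smoother and fits the underlying data better.
The code that was added is: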

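# step2.1: wrap training in a pipeline and fit the data with a degree-10 model
model = Pipeline([('poly', PolynomialFeatures(degree=10)), ('linear', LinearRegression(fit_intercept=False))])
# step2.2: the same polynomial model with lasso regression
regModel = Pipeline([('poly', PolynomialFeatures(degree=10)), ('lasso', Lasso(alpha=5, max_iter=1000000))])

As you can see, adding lasso regression is just as simple as adding ridge regression: alpha is the tuning parameter and max_iter is the maximum number of iterations the solver may run.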
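Rather than fixing alpha by hand, one option is to let cross-validation pick it from a list of candidates. The sketch below uses scikit-learn's `LassoCV` for this; the candidate values, the 5-fold setting, and the added `StandardScaler` step are assumptions for the example and are not part of the code above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same synthetic dataset as in the figures above
x, y = make_regression(n_samples=100, n_features=1, noise=15, random_state=0)
y = y ** 2

candidates = [0.1, 1.0, 5.0, 10.0, 50.0]   # assumption: illustrative grid
cv_model = Pipeline([
    ('poly', PolynomialFeatures(degree=10, include_bias=False)),
    ('scale', StandardScaler()),   # assumption: helps the solver converge
    ('lasso', LassoCV(alphas=candidates, cv=5, max_iter=100000)),
])
cv_model.fit(x, y)

# alpha_ holds the candidate with the best cross-validated error
best_alpha = cv_model.named_steps['lasso'].alpha_
print("alpha chosen by 5-fold cross-validation:", best_alpha)
```

If the solver prints a convergence warning, raising max_iter (as the code above does) or scaling the features are the usual things to try.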

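III. Summary

Overfitting is a very common problem in machine learning modeling. Regularization gives us a way to avoid it, and we have seen the conditions under which each regularization method applies. In the next part we will look at supervised learning.
These are notes from my own self-study. I hope we can all learn from each other, so keep going, everyone! If you spot a mistake, please point it out, and if you found this useful, a like is much appreciated. Thanks!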
