Evaluating Machine Learning Models with Learning Curves

GwentBoy

Article link: https://blog.csdn.net/GwentBoy/article/details/113568617


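Contents

What Is a Learning Curve
What Learning Curves Are For
Plotting a Learning Curve
Evaluating Models with Learning Curves
- Building Several Models for Comparison
- Judging Model Quality from the Learning Curves
Complete Code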

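What Is a Learning Curve

A learning curve plots a model's performance on the training set and on the validation set as the training set grows. It shows how models trained on different amounts of data perform on unseen data, which helps judge whether a model is underfitting or overfitting.
The x-axis is the number of training samples. The y-axis is the performance metric; in this article it is the error (the root mean squared error), so lower is better.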

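What Learning Curves Are For

Performance: how accurate the model's predictions are.
Model evaluation: the curves make underfitting and overfitting easy to see.
Learning behavior: they show whether adding more training data is likely to improve the model.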

Plotting a Learning Curve

First we define a quadratic function of one variable, then add some noise to it to build our dataset.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x ** 2 + x + 2. + np.random.normal(0, 1, size=100)
plt.scatter(X,y)
plt.show()


Split the dataset into a training set and a test set:

from sklearn.model_selection import train_test_split

# random_state=10 fixes the random seed; the default test_size=0.25 leaves 75 training and 25 test samples
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)
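
The later loops run over range(1, 76), a bound that comes from the size of the training set. As a quick check, the sketch below rebuilds the same data and confirms the split sizes, assuming test_size is left at its default as in the call above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same synthetic data as in the article
np.random.seed(666)
x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x ** 2 + x + 2. + np.random.normal(0, 1, size=100)

# test_size is left at its default (0.25), as in the article
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

print(X_train.shape, X_test.shape)  # (75, 1) (25, 1)
```

With 75 training samples, range(1, 76) covers every training-set size from 1 to 75.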

We first fit the dataset with linear regression and measure the error with the root mean squared error (RMSE).
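
For reference, RMSE is the square root of the mean of the squared residuals. The short sketch below computes it by hand and via scikit-learn's mean_squared_error; the arrays y_true and y_pred are made-up toy values used only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical toy values, chosen only to illustrate the formula
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE written out by hand: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The same value via scikit-learn's MSE followed by a square root
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

The loop that follows stores MSE values and applies np.sqrt only when plotting, which yields the same quantity.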

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

train_score = []
test_score = []

# Train on the first i samples, for i = 1..75 (the size of the training set)
for i in range(1, 76):
    lin_reg = LinearRegression()
    lin_reg.fit(X_train[:i], y_train[:i])

    # MSE on the samples the model was trained on
    y_train_predict = lin_reg.predict(X_train[:i])
    train_score.append(mean_squared_error(y_train[:i], y_train_predict))

    # MSE on the full, fixed test set
    y_test_predict = lin_reg.predict(X_test)
    test_score.append(mean_squared_error(y_test, y_test_predict))

# Take the square root so the plot shows RMSE
plt.plot(range(1, 76), np.sqrt(train_score), label='Train')
plt.plot(range(1, 76), np.sqrt(test_score), label='Test')
plt.legend()
plt.show()

That gives us the learning curve of one model.
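
As an alternative to the hand-written loop, scikit-learn ships a learning_curve helper in sklearn.model_selection. The sketch below is not the method used in this article: it evaluates with cross-validation rather than a single fixed test set, and the choices of 5 folds and ten training-set sizes are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Same synthetic data as in the article
np.random.seed(666)
x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x ** 2 + x + 2. + np.random.normal(0, 1, size=100)

# Absolute training-set sizes 8, 16, ..., 80 (illustrative choice);
# with cv=5 each fold trains on at most 80 of the 100 samples
train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.arange(8, 81, 8),
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Scores are negated RMSE, so flip the sign and average over the folds
train_rmse = -train_scores.mean(axis=1)
valid_rmse = -valid_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_rmse, valid_rmse):
    print(f"n={int(n):3d}  train RMSE={tr:.3f}  validation RMSE={va:.3f}")
```

The arrays train_rmse and valid_rmse can be passed to plt.plot against train_sizes to draw the same kind of figure as above.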

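Evaluating Models with Learning Curves

Building Several Models for Comparison

So far we have built a simple (single-feature) linear regression model and plotted its learning curve. Now we build a polynomial regression model: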

from sklearn.preprocessing import PolynomialFeatures,StandardScaler
from sklearn.pipeline import Pipeline
def PolynomialRegression(degree):
    # Expand the features to the given degree, standardize them, then fit a linear model
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("std_scaler", StandardScaler()),
        ("lin_reg", LinearRegression())
    ])

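A brief explanation: in real-world data, features and targets are not always linearly related, so a plain linear regression cannot always capture the pattern in the samples.
For example, the dataset built at the start of this article comes from a quadratic function: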

y = 0.5 * x ** 2 + x + 2. + np.random.normal(0, 1, size=100)

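That is why we bring in PolynomialFeatures. It generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two-dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
The PolynomialRegression function above builds polynomial features up to the given degree and fits a linear regression on them, which we use to fit the dataset built at the start.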
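
To make the [a, b] example concrete, here is a small sketch with a = 2 and b = 3. The sample values are arbitrary and chosen only so each output column is easy to recognize.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features: a = 2, b = 3 (arbitrary illustrative values)
sample = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(sample)

# Columns are ordered as [1, a, b, a^2, ab, b^2]
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```

The dataset in this article has a single feature x, so degree=2 expands each sample to [1, x, x^2].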

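Let us tidy up the code and wrap the learning-curve loop above in a function that returns the y-values (the MSE at each training-set size) for the training set and the test set: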

def train_test_score(algo, X_train, X_test, y_train, y_test):
    train_scores, test_scores = [], []
    for i in range(1, 76):
        algo.fit(X_train[:i], y_train[:i])
        y_train_predict = algo.predict(X_train[:i])
        y_test_predict = algo.predict(X_test)
        train_scores.append(mean_squared_error(y_train[:i], y_train_predict))
        test_scores.append(mean_squared_error(y_test, y_test_predict))
    return train_scores, test_scores
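
The function above hard-codes the upper bound 76 and returns MSE. A possible variant, sketched below under the hypothetical name learning_curve_rmse, derives the range from the training set and returns RMSE directly. It also fits a clone of the estimator, which is a design choice of this sketch and not something the article does.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def learning_curve_rmse(algo, X_train, X_test, y_train, y_test):
    """Return (sizes, train_rmse, test_rmse) for training sizes 1..len(X_train)."""
    sizes = np.arange(1, len(X_train) + 1)
    train_rmse, test_rmse = [], []
    for n in sizes:
        model = clone(algo)  # fresh unfitted copy; the caller's estimator is untouched
        model.fit(X_train[:n], y_train[:n])
        train_mse = mean_squared_error(y_train[:n], model.predict(X_train[:n]))
        test_mse = mean_squared_error(y_test, model.predict(X_test))
        train_rmse.append(np.sqrt(train_mse))
        test_rmse.append(np.sqrt(test_mse))
    return sizes, np.array(train_rmse), np.array(test_rmse)


# Usage on a small made-up linear dataset (30 training and 10 test samples)
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(40, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 1, size=40)
sizes, tr, te = learning_curve_rmse(
    LinearRegression(), X_demo[:30], X_demo[30:], y_demo[:30], y_demo[30:]
)
print(len(sizes), tr[0], te[-1])
```

Because of the clone, the estimator passed in stays unfitted. The plotting code later in this article calls predict on the models after train_test_score has fitted them in place, so it relies on the original function's behavior.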

Next we create three models, with degree 1, 2, and 20:

poly1_reg = PolynomialRegression( degree = 1)
poly2_reg = PolynomialRegression( degree = 2)
poly20_reg = PolynomialRegression( degree = 20)

Using the three models, compute the y-values for each learning curve:

train_score, test_score = train_test_score(poly1_reg, X_train, X_test, y_train, y_test)
train_score_poly2, test_score_poly2 = train_test_score(poly2_reg, X_train, X_test, y_train, y_test)
train_score_poly20, test_score_poly20 = train_test_score(poly20_reg, X_train, X_test, y_train, y_test)

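Then we plot the learning curve of each of the three models. To make it easier to see whether a model overfits or underfits, I also plot each model's fitted curve: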

ax1 = plt.subplot(2,3,1)
ax2 = plt.subplot(2,3,2)
ax3 = plt.subplot(2,3,3)
ax4 = plt.subplot(2,3,4)
ax5 = plt.subplot(2,3,5)
ax6 = plt.subplot(2,3,6)
plt.sca(ax1)
plt.scatter(x,y)
plt.plot(x,poly1_reg.predict(X),color='red')
plt.title('poly1')
plt.sca(ax2)
plt.scatter(X,y)
y_predict2 = poly2_reg.predict(X)
plt.plot(x[np.argsort(x)],y_predict2[np.argsort(x)],color='red')
plt.title('poly2')
plt.sca(ax3)
plt.scatter(X,y)
plt.title('poly20')
y_predict20 = poly20_reg.predict(X)
plt.plot(x[np.argsort(x)],y_predict20[np.argsort(x)],color='red')
plt.sca(ax4)
plt.plot([i for i in range(1, 76)], np.sqrt(train_score), label='train')
plt.plot([i for i in range(1, 76)], np.sqrt(test_score), label='test')
plt.legend()
plt.sca(ax5)
plt.plot([i for i in range(1, 76)], np.sqrt(train_score_poly2), label='train')
plt.plot([i for i in range(1, 76)], np.sqrt(test_score_poly2), label='test')
plt.legend()
plt.sca(ax6)
plt.plot([i for i in range(1, 76)], np.sqrt(train_score_poly20), label='train')
plt.plot([i for i in range(1, 76)], np.sqrt(test_score_poly20), label='test')
ax4.axis([0,80,0,4])
ax5.axis([0,80,0,4])
ax6.axis([0,80,0,4])
plt.legend()
plt.show()

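Judging Model Quality from the Learning Curves

As the plots produced by the code above show, as the training set grows from a single sample, the training error rises and the test error falls, until both curves level off.

How to read the two curves: with only one training sample, the training error is close to zero, because almost any model can fit a single point exactly, so the blue (train) curve starts near 0. As more samples are added, the noise in the data means the model can no longer fit every point, so the blue curve rises. At the same time the model learns more of the underlying pattern, so its error on the test set falls.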
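
The claim about very small training sets can be checked numerically. The sketch below fits a plain LinearRegression on the first 1, 2, and 75 training samples of the article's data and prints the training RMSE; the three sizes are picked only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same synthetic data and split as in the article
np.random.seed(666)
x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x ** 2 + x + 2. + np.random.normal(0, 1, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

train_rmse = {}
for n in (1, 2, 75):
    reg = LinearRegression().fit(X_train[:n], y_train[:n])
    mse = mean_squared_error(y_train[:n], reg.predict(X_train[:n]))
    train_rmse[n] = np.sqrt(mse)
    print(n, train_rmse[n])
```

A straight line passes exactly through one or two points, so the training error there is zero up to floating-point rounding. With 75 noisy samples it cannot, and the training error is clearly above zero.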

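Comparing the learning curves of polynomials of different degree:

For poly1 versus poly2: as training samples are added, the error of poly1 levels off at about 1.5, while that of poly2 levels off at about 1.0. poly2 has the lower error overall, so poly1 has probably not captured the pattern in the data and is likely underfitting; poly2 fits the samples better.
For poly2 versus poly20: at first poly20 has a very small training error, meaning it fits the training samples almost perfectly, but its test error is very large. As more training samples are added, its training error ends up about the same as poly2's, with both approaching 1, yet its test error stays clearly higher than poly2's. The model does well on the training set but makes larger errors on new samples, so poly20 generalizes worse than poly2 and is probably overfitting.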
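
The same comparison can be made numerically from the right-hand end of each curve, that is, with all 75 training samples. The sketch below assumes the same data and split as the article and uses make_pipeline as shorthand for the PolynomialRegression pipeline. It prints the final training and test RMSE per degree without asserting specific values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same synthetic data and split as in the article
np.random.seed(666)
x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x ** 2 + x + 2. + np.random.normal(0, 1, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

results = {}
for degree in (1, 2, 20):
    model = make_pipeline(
        PolynomialFeatures(degree=degree), StandardScaler(), LinearRegression()
    )
    model.fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    results[degree] = (train_rmse, test_rmse)
    print(f"degree={degree:2d}  train RMSE={train_rmse:.3f}  test RMSE={test_rmse:.3f}")
```

A large gap between training and test RMSE points toward overfitting, while high error on both points toward underfitting.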

Complete Code

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(666)
x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x ** 2 + x + 2. + np.random.normal(0, 1, size=100)
# plt.scatter(X,y)
# plt.show()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)

from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures,StandardScaler
from sklearn.pipeline import Pipeline

def train_test_score(algo, X_train, X_test, y_train, y_test):
    train_scores, test_scores = [], []
    for i in range(1, 76):
        algo.fit(X_train[:i], y_train[:i])
        y_train_predict = algo.predict(X_train[:i])
        y_test_predict = algo.predict(X_test)
        train_scores.append(mean_squared_error(y_train[:i], y_train_predict))
        test_scores.append(mean_squared_error(y_test, y_test_predict))
    return train_scores, test_scores

def PolynomialRegression(degree):
    # Expand the features to the given degree, standardize them, then fit a linear model
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("std_scaler", StandardScaler()),
        ("lin_reg", LinearRegression())
    ])


poly1_reg = PolynomialRegression( degree = 1)
poly2_reg = PolynomialRegression( degree = 2)
poly20_reg = PolynomialRegression( degree = 20)
train_score, test_score = train_test_score(poly1_reg, X_train, X_test, y_train, y_test)
train_score_poly2, test_score_poly2 = train_test_score(poly2_reg, X_train, X_test, y_train, y_test)
train_score_poly20, test_score_poly20 = train_test_score(poly20_reg, X_train, X_test, y_train, y_test)
ax1 = plt.subplot(2,3,1)
ax2 = plt.subplot(2,3,2)
ax3 = plt.subplot(2,3,3)
ax4 = plt.subplot(2,3,4)
ax5 = plt.subplot(2,3,5)
ax6 = plt.subplot(2,3,6)
plt.sca(ax1)
plt.scatter(x,y)
plt.plot(x,poly1_reg.predict(X),color='red')
plt.title('poly1')
plt.sca(ax2)
plt.scatter(X,y)
y_predict2 = poly2_reg.predict(X)
plt.plot(x[np.argsort(x)],y_predict2[np.argsort(x)],color='red')
plt.title('poly2')
plt.sca(ax3)
plt.scatter(X,y)
plt.title('poly20')
y_predict20 = poly20_reg.predict(X)
plt.plot(x[np.argsort(x)],y_predict20[np.argsort(x)],color='red')
plt.sca(ax4)
plt.plot([i for i in range(1, 76)], np.sqrt(train_score), label='train')
plt.plot([i for i in range(1, 76)], np.sqrt(test_score), label='test')
plt.legend()
plt.sca(ax5)
plt.plot([i for i in range(1, 76)], np.sqrt(train_score_poly2), label='train')
plt.plot([i for i in range(1, 76)], np.sqrt(test_score_poly2), label='test')
plt.legend()
plt.sca(ax6)
plt.plot([i for i in range(1, 76)], np.sqrt(train_score_poly20), label='train')
plt.plot([i for i in range(1, 76)], np.sqrt(test_score_poly20), label='test')
ax4.axis([0,80,0,4])
ax5.axis([0,80,0,4])
ax6.axis([0,80,0,4])
plt.legend()
plt.show()
