Model Selection, Overfitting, and Underfitting

SupremeNO.1

Article link: https://blog.csdn.net/supremelv/article/details/129865819

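This article discusses training error and generalization error and uses an example to explain the difference between them. It introduces the roles of the validation and test datasets and the K-fold cross-validation method. It also examines how model capacity affects overfitting and underfitting, and how to match model capacity to data complexity. Finally, it provides code that illustrates these concepts.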

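Training Error

The model's error on the training data

Generalization Error

The model's error on new data

Example

Predicting future exam scores from mock exam results

Doing well on past exams (low training error) does not guarantee doing well on future exams (low generalization error)
Student A gets good mock exam scores by rote memorization
Student B understands the reasoning behind the answers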
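
The two students can be imitated with a toy sketch. Everything here is hypothetical and chosen only for illustration: the rule `2 * x + 1`, the tiny datasets, and the names `student_a` and `student_b`. Student A is a lookup table over the training questions; Student B applies the underlying rule.

```python
def rule(x):
    # Hypothetical "true" relationship behind the exam questions.
    return 2 * x + 1

# Mock exams (training data) and future exams (new data).
train_x = [0, 1, 2, 3]
train_y = [rule(x) for x in train_x]
new_x = [4, 5, 6]
new_y = [rule(x) for x in new_x]

# Student A memorizes question -> answer pairs and guesses 0 for unseen questions.
memory = dict(zip(train_x, train_y))

def student_a(x):
    return memory.get(x, 0)

# Student B has learned the rule itself.
def student_b(x):
    return rule(x)

def mean_squared_error(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

a_train = mean_squared_error(student_a, train_x, train_y)
a_new = mean_squared_error(student_a, new_x, new_y)
b_train = mean_squared_error(student_b, train_x, train_y)
b_new = mean_squared_error(student_b, new_x, new_y)

print("Student A: train", a_train, "new", a_new)
print("Student B: train", b_train, "new", b_new)
```

Both students have zero error on the mock exams, so training error alone cannot tell them apart; only the error on new questions does.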

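Validation and Test Datasets

Validation Dataset

Definition: a dataset used to evaluate how good a model is
For example, hold out 50% of the training data
Do not mix it with the training data (a common mistake)

Test Dataset

Definition: a dataset that is used only once
A future exam
The actual sale price of a house I bid on
The dataset behind the Kaggle private leaderboard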
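
A minimal sketch of holding out a validation set, assuming the samples fit in a Python list. The function name `split_train_validation`, the fixed seed, and the 50% default (taken from the example above) are illustrative choices, not part of any library.

```python
import random

def split_train_validation(samples, validation_fraction=0.5, seed=0):
    """Shuffle indices once, then hold out a fraction of the samples for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_valid = int(len(samples) * validation_fraction)
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    train = [samples[i] for i in train_idx]
    valid = [samples[i] for i in valid_idx]
    return train, valid

data = list(range(10))
train_part, valid_part = split_train_validation(data)

# The two parts never share a sample, which is exactly the "do not mix" rule.
print("train:", sorted(train_part))
print("valid:", sorted(valid_part))
```

The model is fit on `train_part` only; `valid_part` is used solely to compare models or hyperparameter settings.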

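K-Fold Cross-Validation

Used when there is not enough data
Algorithm:
- Split the training data into K blocks
- for i = 1, …, K
  - Use block i as the validation dataset and the remaining blocks as the training dataset
- Report the average of the K validation errors
Common choices: K = 5 or 10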
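
The algorithm above can be sketched in plain Python. The helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `mse`) are hypothetical, and the "model" is a deliberately trivial one that predicts the mean training label, so that the loop structure stays in focus.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, valid_indices) for each of the k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size

def cross_validate(xs, ys, k, fit, error):
    """Train k times, each time validating on a different fold; return the mean error."""
    errors = []
    for train_idx, valid_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors.append(error(model, [xs[i] for i in valid_idx], [ys[i] for i in valid_idx]))
    return sum(errors) / k

# Toy model for illustration: always predict the mean of the training labels.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)

xs = list(range(10))
ys = [2.0] * 10  # constant labels, so the mean predictor is exact
cv_error = cross_validate(xs, ys, k=5, fit=fit_mean, error=mse)
print("5-fold validation error:", cv_error)
```

In practice the data is usually shuffled before splitting; this sketch keeps the folds contiguous to make the indexing easy to follow.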

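Summary

Training dataset: used to train the model parameters
Validation dataset: used to select the model hyperparameters
On datasets that are not large, K-fold cross-validation is usually used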

Overfitting and Underfitting

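Model Capacity

The ability to fit a variety of functions
Low-capacity models struggle to fit the training data
High-capacity models can memorize all of the training data

Effect of Model Capacity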

Estimating Model Capacity

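Data Complexity

Number of samples
Number of elements in each sample
Temporal and spatial structure
Diversity

Summary

Model capacity needs to match data complexity; otherwise the model may underfit or overfit
Statistical machine learning provides mathematical tools for measuring model complexity
In practice, one usually relies on observing the training error and the validation error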
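
As a rough sketch of "observing the training error and the validation error", the snippet below fits polynomials of increasing degree with NumPy least squares and prints both errors. The cubic target mirrors the `true_w` coefficients used in this article's code; the sample sizes, the seed, and the degrees 1, 3, and 10 are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Cubic target: 5 + 1.2x - 3.4x^2/2! + 5.6x^3/3!
    return 5 + 1.2 * x - 3.4 * x**2 / 2 + 5.6 * x**3 / 6

# Small training set, larger validation set, both with Gaussian label noise.
x_train = rng.normal(size=20)
x_valid = rng.normal(size=100)
y_train = true_fn(x_train) + rng.normal(scale=0.1, size=x_train.shape)
y_valid = true_fn(x_valid) + rng.normal(scale=0.1, size=x_valid.shape)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 10):  # low, matching, and high capacity
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_valid, y_valid))
    print(f"degree {degree:2d}: train {results[degree][0]:.4f}, "
          f"valid {results[degree][1]:.4f}")
```

A training error that stays high suggests underfitting; a training error that is low while the validation error is much higher suggests overfitting.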

Code

import math
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l

max_degree = 20  # maximum degree of the polynomial
n_train, n_test = 100, 100  # sizes of the training and test datasets
true_w = np.zeros(max_degree)  # room for all coefficients; only the first four are nonzero
true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])

features = np.random.normal(size=(n_train + n_test, 1))
# print(features)
np.random.shuffle(features)
poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))
# print(poly_features[0:2])
for i in range(max_degree):
    poly_features[:, i] /= math.gamma(i + 1)  # gamma(n) = (n-1)!, so this divides by i!
# Shape of labels: (n_train + n_test,)
labels = np.dot(poly_features, true_w)
labels += np.random.normal(scale=0.1, size=labels.shape)  # add noise

# Convert the NumPy ndarrays to tensors
true_w, features, poly_features, labels = [
    torch.tensor(x, dtype=torch.float32)
    for x in [true_w, features, poly_features, labels]]

def evaluate_loss(net, data_iter, loss):  #@save
    """Evaluate the loss of a model on the given dataset."""
    metric = d2l.Accumulator(2)  # sum of losses, number of samples
    for X, y in data_iter:
        out = net(X)
        y = y.reshape(out.shape)
        l = loss(out, y)
        metric.add(l.sum(), l.numel())
    return metric[0] / metric[1]

def train(train_features, test_features, train_labels, test_labels,
          num_epochs=400):
    loss = nn.MSELoss(reduction='none')
    input_shape = train_features.shape[-1]
    # No bias term: the constant feature x^0 = 1 in the polynomial features already plays that role
    net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))
    batch_size = min(10, train_labels.shape[0])
    train_iter = d2l.load_array((train_features, train_labels.reshape(-1,1)),
                                batch_size)
    test_iter = d2l.load_array((test_features, test_labels.reshape(-1,1)),
                               batch_size, is_train=False)
    trainer = torch.optim.SGD(net.parameters(), lr=0.01)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
                            legend=['train', 'test'])
    for epoch in range(num_epochs):
        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
        if epoch == 0 or (epoch + 1) % 20 == 0:
            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
                                     evaluate_loss(net, test_iter, loss)))
    print('weight:', net[0].weight.data.numpy())

# Use the first 4 dimensions of the polynomial features, i.e. 1, x, x^2/2!, x^3/3! (matches the true model)
train(poly_features[:n_train, :4], poly_features[n_train:, :4],
      labels[:n_train], labels[n_train:])

# Use only the first 2 dimensions, i.e. 1 and x (capacity too low: underfitting)
train(poly_features[:n_train, :2], poly_features[n_train:, :2],
      labels[:n_train], labels[n_train:])

# Use all dimensions of the polynomial features (capacity too high: overfitting)
train(poly_features[:n_train, :], poly_features[n_train:, :],
      labels[:n_train], labels[n_train:], num_epochs=1500)