II. Model Complexity and Training Set Size
Two phenomena frequently arise in model training: overfitting (training error far smaller than generalization error) and underfitting (high training error). Two important factors behind both are model complexity and training set size. A machine learning model should focus on reducing generalization error.
1. Training Set Size
If the training set is too small, especially smaller than the number of model parameters (counted element-wise), overfitting becomes more likely. Moreover, generalization error does not increase as the training set grows, so we generally prefer a larger training set.
2. Model Complexity
If a model has too many (too few) parameters, its complexity is high (low), which tends to cause overfitting (underfitting). The effect of model complexity on overfitting and underfitting is shown in the figure below:
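This pattern can also be sketched numerically (an illustrative example, using `np.polyfit` on the cubic target defined in Section 3; degrees 1, 3, and 9 stand in for an underfit, matched, and overfit model):

```python
import numpy as np

rng = np.random.default_rng(1)

def cubic_with_noise(n):
    # y = 2x^3 + 3x^2 - 12x + 1 + noise, the target from Section 3
    x = rng.standard_normal(n)
    return x, 2 * x**3 + 3 * x**2 - 12 * x + 1 + rng.standard_normal(n)

x_tr, y_tr = cubic_with_noise(50)
x_te, y_te = cubic_with_noise(200)

errors = {}
for degree in (1, 3, 9):   # underfit, matched, overfit
    coef = np.polyfit(x_tr, y_tr, degree)
    train_mse = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    test_mse = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```

Training error falls (or stays flat) as the degree grows, while test error is smallest near the true degree: the U-shaped complexity curve described above.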
3. Programming Experiment (Polynomial Function Fitting)
3.1 Model Complexity
Suppose the input feature in the dataset is X and the target value is y, where X and y satisfy:
$y = 2X^3 + 3X^2 - 12X + 1 + \epsilon$, where the noise $\epsilon \sim N(0, 1)$
Overfitting model:
$y = w_0 X^5 + w_1 X^4 + w_2 X^3 + w_3 X^2 + w_4 X + b$
Normal model:
$y = w_0 X^3 + w_1 X^2 + w_2 X + b$
Underfitting model:
$y = w_0 X + b$
Code implementation:
# -*- coding: utf-8 -*-
"""
Created on Wed Feb 12 10:51:57 2020
@author: chengang
"""
import numpy as np
import matplotlib.pyplot as plt
import torch
np.random.seed(53)
batch_size = 10
lr = 0.01
num_epochs = 300
sample_num = 1000
train_num = 700
def GenerateDataset(X, y):
    # first train_num samples form the training set; the rest are held out for testing
    train_dataset = torch.utils.data.TensorDataset(X[:train_num], y[:train_num])
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    X_test, y_test = X[train_num:], y[train_num:]
    return train_loader, X_test, y_test
# overfitting model
def net1(x, w, b):
    return w[0] * x**5 + w[1] * x**4 + w[2] * x**3 + w[3] * x**2 + w[4] * x + b
# normal model
def net2(x, w, b):
    return w[0] * x**3 + w[1] * x**2 + w[2] * x + b
# underfitting model: y = w0 * x + b, matching the formula above
def net3(x, w, b):
    return w[0] * x + b
if __name__ == '__main__':
    X_data = torch.from_numpy(np.random.randn(sample_num, 1).astype(np.float32))
    y_data = 2 * torch.pow(X_data, 3) + 3 * torch.pow(X_data, 2) - 12 * X_data + 1 + torch.randn(X_data.shape)
    train_dataset, X_test, y_test = GenerateDataset(X_data, y_data)
    # parameters for overfitting model, normal model, underfitting model
    w1 = torch.randn(5, requires_grad=True)