Why Data Augmentation Prevents Overfitting: A Curve-Fitting Example


The mathematical basis of curve fitting with polynomials and least squares

Write the degree-$n$ polynomial as:

$$y = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n$$

The goal of fitting is to determine all of the coefficients $a_i$ from the known points.

Suppose we have $m$ points available for fitting, with coordinates $(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)$.

These coordinates are all known; substituting them into the polynomial gives:

$$\begin{aligned}
a_0 + a_1 x_1 + a_2 x_1^2 + \dots + a_n x_1^n &= y_1 \\
a_0 + a_1 x_2 + a_2 x_2^2 + \dots + a_n x_2^n &= y_2 \\
&\ \vdots \\
a_0 + a_1 x_m + a_2 x_m^2 + \dots + a_n x_m^n &= y_m
\end{aligned}$$
Writing this system of equations in matrix form:

$$\begin{bmatrix}
1 & x_1 & x_1^2 & \dots & x_1^n \\
1 & x_2 & x_2^2 & \dots & x_2^n \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_m & x_m^2 & \dots & x_m^n
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$$
Define:

$$X = \begin{bmatrix}
1 & x_1 & x_1^2 & \dots & x_1^n \\
1 & x_2 & x_2^2 & \dots & x_2^n \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_m & x_m^2 & \dots & x_m^n
\end{bmatrix}, \quad
A = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix}, \quad
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$$
Then the derivation goes as follows:

$$\begin{aligned}
XA &= Y \\
X^T X A &= X^T Y \\
(X^T X)^{-1} X^T X A &= (X^T X)^{-1} X^T Y \\
A &= (X^T X)^{-1} X^T Y
\end{aligned}$$
The last expression is the least-squares solution for $A$. Substituting the solved $A$ back into the polynomial yields the least-squares fitting curve for the known points.
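
As a quick sanity check, the closed-form result can be compared against NumPy's own least-squares polynomial fit, np.polyfit. The following is a minimal sketch with made-up toy numbers, not part of the program below:

import numpy as np

# toy data: m = 5 points, degree n = 2, so m > n + 1 (overdetermined)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 0.9, 2.2, 4.8, 9.1])
n = 2

# X matrix with columns 1, x, x^2
X = np.vander(x, n + 1, increasing=True)

# closed-form normal-equation solution: A = (X^T X)^-1 X^T Y
A = np.linalg.inv(X.T @ X) @ X.T @ y

# np.polyfit returns coefficients highest degree first, hence the [::-1]
assert np.allclose(A, np.polyfit(x, y, n)[::-1])
print(A)  # [a_0, a_1, a_2]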
Note that $A$ has $n + 1$ entries, so the system is overdetermined only when $m > n + 1$, and the formula then yields the solution in the least-squares sense. When $m = n + 1$ (with distinct $x_i$), $X$ is square and invertible, and we obtain the exact solution: the fitted curve passes precisely through all known points. When $m < n + 1$ the system is underdetermined, $X^T X$ is singular, and the result can no longer be called a least-squares solution (the standard choice here is the minimum-norm solution, obtained via the pseudoinverse). In that case the polynomial itself has more than enough capacity to pass through all the known points, but the normal-equation method above cannot produce such a solution.
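
To make the underdetermined case concrete, here is a small sketch (again with arbitrary toy numbers): for m = 3 points and a degree-4 polynomial, X^T X is singular, yet the pseudoinverse still produces the minimum-norm solution, and that solution does pass exactly through all known points:

import numpy as np

# m = 3 points, degree n = 4, so n + 1 = 5 unknowns: underdetermined
x = np.array([-1.0, 0.0, 1.0])
y = np.array([2.0, -1.0, 3.0])
X = np.vander(x, 5, increasing=True)

# X^T X is 5x5 but only rank 3, so inv(X.T @ X) is unusable here;
# np.linalg.pinv gives the minimum-norm solution instead
A = np.linalg.pinv(X) @ y

print(np.allclose(X @ A, y))  # True: the curve hits every known point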

The program

In the code, the coefficients $a_i$ are referred to as the model, the known points used for fitting as the training set (train data), and the points used for evaluation as the test set (test data).

Fitting is done by fit, whose purpose is to compute the model from the training set (we merely borrow the word "training" here; there is really nothing to train, since the solution is obtained in a single step). Prediction is done by predict, which evaluates the trained model on arbitrary inputs to produce outputs.

The code constructs x_train, y_train, x_test, and y_test for the experiments, and draws several figures from these data to aid understanding.

# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt

# polynomial degrees tried without and with data augmentation
DEGREE = [2, 4, 6, 8]
DEGREE_AUG = [2, 4, 6, 8, 10, 12]


class PolynomialModel(object):
    def __init__(self, degree):
        self.model = None
        self.degree = degree

    def _generate_x_matrix(self, X):
        """
        generate X matrix

        Parameters:
        -----------
        X: list or 1-dim numpy array
            x coordinates of points

        Returns:
        --------
        x_matrix: 2-dim numpy array
            X matrix
        """
        x_matrix = np.ones([len(X), self.degree + 1])
        for i in range(1, self.degree + 1):
            x_matrix[:, i] = np.power(X, i)
        return x_matrix

    def fit(self, X, Y):
        """
        compute model by least square fitting

        Parameters:
        -----------
        X, Y: list or 1-dim numpy array
            coordinates of points (x, y)
        """
        Y = np.array(Y)
        x_mat = self._generate_x_matrix(X)
        Y = np.reshape(Y, [Y.shape[0], 1])
        # closed-form normal-equation solution: A = (X^T X)^-1 X^T Y
        self.model = np.linalg.inv(x_mat.T @ x_mat) @ x_mat.T @ Y

    def predict(self, X):
        """
        predict output of input X, using the computed model

        Parameters:
        -----------
        X: list or 1-dim numpy array
            x coordinates of points

        Returns:
        --------
        Y: 2-dim numpy array (column vector)
            predicted y values
        """
        x_mat = self._generate_x_matrix(X)
        Y = x_mat @ self.model
        return Y


def compute_mse(y1, y2):
    """
    compute mean square error

    Parameters:
    -----------
    y1, y2: list or numpy array, with same shape

    Returns:
    --------
    mse: mean square error
    """
    y1 = np.array(y1)
    y2 = np.array(y2)
    assert y1.size == y2.size
    y2 = np.reshape(y2, y1.shape)
    mse = np.mean(np.power(y1 - y2, 2))
    return mse


def gauss_augmentation(X, Y, aug_num, std):
    """
    augment data with Gaussian-distributed random noise

    Parameters:
    -----------
    X, Y: list or 1-dim numpy array
        coordinates of points (x, y)
    aug_num: augmentation number for each point
    std: std of random number

    Returns:
    --------
    x_aug, y_aug: augmented points
    """
    assert len(X) == len(Y)
    length = len(X)

    x_aug = []
    y_aug = []
    for i in range(length):
        x = X[i] + np.random.randn(aug_num) * std
        y = Y[i] + np.random.randn(aug_num) * std
        x_aug.append(x)
        y_aug.append(y)
    x_aug = np.array(x_aug).reshape([-1])
    y_aug = np.array(y_aug).reshape([-1])
    return x_aug, y_aug


def get_x_range(x, increment):
    """
    get evenly spaced coordinates spanning the input x; the lower and upper
    limits are min(x) and max(x), respectively.

    Parameters:
    -----------
    x: list or 1-dim numpy array
        x coordinates of points
    increment: float number
        interval of adjacent coordinates

    Returns:
    --------
    x_range: evenly spaced coordinates
    """
    x = np.array(x)
    xmin = np.min(x)
    xmax = np.max(x) + increment
    x_range = np.arange(xmin, xmax, increment)
    return x_range


def plot_original_dataset(x_train, y_train, x_test, y_test):
    plt.figure(1, figsize=(9, 6))
    plt.plot(x_train, y_train,
             color='red', marker='o', markersize=6,
             linewidth=0, label='train_data')
    plt.plot(x_test, y_test,
             color='green', marker='o', markersize=6,
             linewidth=0, label='test_data')
    plt.legend()
    plt.savefig('original_dataset.png')


def plot_augmented_dataset(x_train, y_train, x_test, y_test, x_aug, y_aug):
    plt.figure(2, figsize=(9, 6))
    plt.plot(x_train, y_train,
             color='red', marker='o', markersize=6,
             linewidth=0, label='train_data')
    plt.plot(x_test, y_test,
             color='green', marker='o', markersize=6,
             linewidth=0, label='test_data')
    plt.plot(x_aug, y_aug,
             color='blue', marker='o', markersize=3,
             linewidth=0, label='augmented_data')
    plt.legend()
    plt.savefig('augmented_dataset.png')


def plot_no_augmentation_model(x_train, y_train, x_test, y_test):
    x_range = get_x_range(x_train, 0.1)
    plt.figure(3, figsize=(15, 9))
    for i, degree in enumerate(DEGREE):
        plt.subplot(2, 2, i + 1)
        poly_model = PolynomialModel(degree)
        poly_model.fit(x_train, y_train)
        y_predict = poly_model.predict(x_range)
        plt.plot(x_train, y_train,
                 color='red', marker='o', markersize=6,
                 linewidth=0, label='train_data')
        plt.plot(x_test, y_test,
                 color='green', marker='o', markersize=6,
                 linewidth=0, label='test_data')
        plt.plot(x_range, y_predict, color='black', label='fitted curve')
        plt.title('degree = %d' % degree)
        plt.legend()
    plt.savefig('no_augmentation_model.png')


def plot_loss_for_overfitting(x_train, y_train, x_test, y_test):
    plt.figure(4, figsize=(9, 6))
    degrees = get_x_range(DEGREE, 1)
    train_loss = []
    test_loss = []
    for degree in degrees:
        poly_model = PolynomialModel(degree)
        poly_model.fit(x_train, y_train)
        train_predict = poly_model.predict(x_train)
        test_predict = poly_model.predict(x_test)
        train_mse = compute_mse(y_train, train_predict)
        test_mse = compute_mse(y_test, test_predict)
        train_loss.append(train_mse)
        test_loss.append(test_mse)
    plt.plot(degrees, train_loss, color='red', label='train_loss')
    plt.plot(degrees, test_loss, color='green', label='test_loss')
    plt.xlabel('degree')
    plt.ylabel('mean square error')
    plt.legend()
    plt.savefig('loss_for_overfitting.png')


def plot_augmentation_model(x_train, y_train, x_test, y_test, x_aug, y_aug):
    x_range = get_x_range(x_train, 0.1)
    plt.figure(5, figsize=(15, 15))
    for i, degree in enumerate(DEGREE_AUG):
        plt.subplot(3, 2, i + 1)
        poly_model = PolynomialModel(degree)
        poly_model.fit(x_aug, y_aug)
        y_predict = poly_model.predict(x_range)
        plt.plot(x_train, y_train,
                 color='red', marker='o', markersize=6,
                 linewidth=0, label='train_data')
        plt.plot(x_test, y_test,
                 color='green', marker='o', markersize=6,
                 linewidth=0, label='test_data')
        plt.plot(x_range, y_predict, color='black', label='fitted curve')
        plt.title('degree = %d' % degree)
        plt.legend()
    plt.savefig('augmentation_model.png')


def plot_loss_for_augmentation(x_train, y_train, x_test, y_test, x_aug, y_aug):
    plt.figure(6, figsize=(9, 6))
    degrees = get_x_range(DEGREE_AUG, 1)
    train_loss = []
    test_loss = []
    for degree in degrees:
        poly_model = PolynomialModel(degree)
        poly_model.fit(x_aug, y_aug)
        train_predict = poly_model.predict(x_train)
        test_predict = poly_model.predict(x_test)
        train_mse = compute_mse(y_train, train_predict)
        test_mse = compute_mse(y_test, test_predict)
        train_loss.append(train_mse)
        test_loss.append(test_mse)
    plt.plot(degrees, train_loss, color='red', label='train_loss')
    plt.plot(degrees, test_loss, color='green', label='test_loss')
    plt.xlabel('degree')
    plt.ylabel('mean square error')
    plt.legend()
    plt.savefig('loss_for_augmentation.png')


if __name__ == '__main__':
    np.random.seed(1337)
    x_train = [-3.0, -2.1, -0.9, 0.1, 1.2, 2.0, 3]
    y_train = [2.5, 1.2, 1.1, -2.9, -0.7, -3.2, 1.3]

    x_test = [-3.0, -2.7, -2.3, -2.0, -1.8, -1.6, -1.3, -1.0, -0.9, -0.6, -0.2,
              0.1, 0.4, 0.7, 1.0, 1.2, 1.5, 1.8, 2.0, 2.3, 2.5, 2.7, 3.0]
    y_test = [2.4, 2.1, 1.6, 1.1, 1.3, 1.0, 1.2, 1.0, 0.8, -0.2, -1.3, -2.3,
              -2.7, -2.3, -1.5, -1.2, -1.5, -2.9, -2.5, -1.3, -1.1, -0.4, 1.1]

    x_aug, y_aug = gauss_augmentation(x_train, y_train, 20, 0.4)

    plot_original_dataset(x_train, y_train, x_test, y_test)
    plot_augmented_dataset(x_train, y_train, x_test, y_test, x_aug, y_aug)
    plot_no_augmentation_model(x_train, y_train, x_test, y_test)
    plot_loss_for_overfitting(x_train, y_train, x_test, y_test)
    plot_augmentation_model(x_train, y_train, x_test, y_test, x_aug, y_aug)
    plot_loss_for_augmentation(x_train, y_train, x_test, y_test, x_aug, y_aug)

Explanation

Training and test data

[Figure: training data (red points) and test data (green points)]

Curve fitting with the original training data

As the figure shows, as the polynomial degree goes from 2 to 6, the fitted curve approximates the training data (red points) more and more closely, but not the test data: in particular, once the degree goes from 4 to 6, the curve deviates further and further from the test data. This is the phenomenon of overfitting.

When the degree is 8 the curve fails to pass through all the training points; the reason was explained above (here m = 7 training points while the polynomial has n + 1 = 9 coefficients, so the normal-equation method breaks down).
[Figure: curves fitted to the original training data, degree = 2, 4, 6, 8]

Overfitting as seen in the loss

As the model grows larger, the test loss starts to rise instead; this is the typical signature of overfitting in the loss curves.
[Figure: train and test MSE versus polynomial degree, without augmentation]

Data augmentation

The augmentation takes each point of the training set as a base and adds Gaussian-distributed random numbers to it, constructing random points in its neighbourhood; these random points are the augmented data. In this example, 20 augmented points are constructed for each training point.
[Figure: training, test, and augmented data points]
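
To illustrate what this does to a single training point, here is a minimal sketch of the same idea as gauss_augmentation above (the numbers are arbitrary):

import numpy as np

np.random.seed(0)
# jitter one training point (x, y) = (1.2, -0.7) into 5 noisy copies
x, y, std = 1.2, -0.7, 0.4
x_aug = x + np.random.randn(5) * std
y_aug = y + np.random.randn(5) * std
print(np.c_[x_aug, y_aug])  # 5 random neighbours of the original point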

Curve fitting with the augmented data

As the figure shows, as the model grows, the fitted curve approximates not only the training set but also the test set more and more closely. This is the power of data augmentation: it prevents overfitting and improves performance on the test set, and, crucially, without any extra data-collection cost.
[Figure: curves fitted to the augmented data, degree = 2, 4, 6, 8, 10, 12]

The loss after data augmentation

The loss curves also show the clear improvement brought by data augmentation: as the model grows, not only does the training loss decrease, the test loss keeps decreasing as well.
[Figure: train and test MSE versus polynomial degree, with augmentation]
