Andrew Ng Machine Learning Exercises in Python (5): Bias vs. Variance (full code at the end)

Preface:

Training, validation, and test sets
If the sample data are plentiful, we usually split the dataset by uniform random sampling into three disjoint parts: a training set, a validation set, and a test set; a common ratio is 8:1:1. Note that usually only a training set and a test set are given, with no validation set. Where does the validation set come from, then? The usual practice is to sample a portion of the training set uniformly at random and use it as the validation set.
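A minimal sketch of such a split in NumPy (the 8:1:1 ratio is the one above; the function name and seed are illustrative):

import numpy as np

def split_dataset(X, y, seed=0):
    '''Shuffle the samples, then split them 8:1:1 into train / validation / test.'''
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                 # uniform random shuffle
    n_train, n_val = int(0.8 * len(X)), int(0.9 * len(X))
    tr, va, te = idx[:n_train], idx[n_train:n_val], idx[n_val:]
    return X[tr], y[tr], X[va], y[va], X[te], y[te]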

Training set
The training set is used to train the model, i.e., to determine parameters such as the weights and biases; we usually call these the learned parameters.

Validation set
The validation set is used for model selection. More concretely, the validation set plays no part in determining the learned parameters; that is, it never enters the gradient-descent process. It exists only to choose hyperparameters, such as the number of layers, the number of nodes per layer, the number of iterations, and the learning rate. In the k-NN algorithm, for instance, k is a hyperparameter, so the validation set can be used to find the k with the lowest error, as sketched below.
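A minimal sketch of this idea (assuming scikit-learn is available; the toy data are purely illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)                    # toy data, for illustration only
X_train, y_train = rng.normal(size=(80, 2)), rng.integers(0, 2, 80)
X_val, y_val = rng.normal(size=(20, 2)), rng.integers(0, 2, 20)

best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7, 9]:
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:                            # keep the k with the highest validation accuracy
        best_k, best_acc = k, acc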

Test set
The test set is used exactly once, to evaluate the final model after training is complete. It takes part neither in learning the parameters nor in selecting the hyperparameters; it is used purely for model evaluation.
Note: never use the test set during training and then test the model on that same set. Doing so is cheating and artificially inflates the model's test accuracy.

Because the training set participates directly in tuning the model, it obviously cannot reflect the model's true ability: students who memorize the textbook (overfitting) would get the best grades, which is clearly wrong. Likewise, since the validation set participates in the manual tuning of hyperparameters, it cannot deliver the final verdict on a model either, just as students who grind through question banks don't necessarily count as good learners. So a final exam (the test set) is needed to measure what a student (model) can really do.

Cross validation
Cross validation exists mainly because the training set may be too small to simply carve out a training set, a validation set, and a test set as above (simple cross validation).
In practice, people are not that fond of cross validation because it consumes more compute. Usually the training data is simply split 50%-90% into training and validation sets. But this depends on the situation: with many hyperparameters you may want a larger validation set, and if there isn't enough validation data, cross validation is the better choice, as sketched below. As for how many folds to use, 3, 5, and 10 are all common.
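A minimal sketch of k-fold cross-validation in NumPy (evaluate is a placeholder that trains a model and returns its validation error):

import numpy as np

def k_fold_error(X, y, k, evaluate):
    '''Each of the k folds serves once as the validation set;
    return the average validation error over the k rounds.'''
    folds = np.array_split(np.random.permutation(len(X)), k)
    errors = []
    for i in range(k):
        va = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(evaluate(X[tr], y[tr], X[va], y[va]))
    return np.mean(errors)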

Original article: https://blog.csdn.net/Chaolei3/article/details/79270939

This exercise focuses on bias vs. variance, using the change in a reservoir's water level and the amount of water flowing out of the dam as the example.

1 Prepare dataset

1.1 Visualizing the dataset

The dataset is split into a training set, a cross-validation set, and a test set.

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat
import scipy.optimize as opt
'''
1.Prepare dataset
'''
data = loadmat('ex5data1.mat')
print(data)
X, y = data['X'], data['y']
Xval, yval = data['Xval'], data['yval']
Xtest, ytest = data['Xtest'], data['ytest']

# Insert a column of 1's to all of the X's, as usual
X = np.insert(X, 0, 1, axis=1)
Xval = np.insert(Xval, 0, 1, axis=1)
Xtest = np.insert(Xtest, 0, 1, axis=1)
print(X.shape, y.shape)  # (12, 2) (12, 1)
print(Xval.shape, yval.shape)  # (21, 2) (21, 1)
print(Xtest.shape, ytest.shape)  # (21, 2) (21, 1)


# Visualizing the data


def plotData(X, y):
    '''Visualize the data.'''
    fig, ax = plt.subplots(figsize=(6, 4))
    X = X[:, 1:]
    ax.scatter(X, y, c='r')
    ax.set_xlabel('Change in water level (x)')
    ax.set_ylabel('Water flowing out of the dam (y)')
    ax.grid(True)
    # plt.show()

[Figure: scatter plot of the training data — change in water level vs. water flowing out of the dam]

1.2 Regularized linear regression cost function

$$J(\theta)=\frac{1}{2m}\left(\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}\right)+\frac{\lambda}{2m}\left(\sum_{j=1}^{n}\theta_{j}^{2}\right)$$

def costReg(theta, X, y, lam):
    '''
    Regularized cost; do not regularize theta0.
    theta is a 1-d array with shape (n+1,)
    X is a matrix with shape (m, n+1)
    y is a matrix with shape (m, 1)
    :param theta: weights
    :param X: input matrix
    :param y: output vector
    :param lam: lambda, the regularization strength
    :return: regularized cost
    '''
    cost = np.nansum((X @ theta - y.flatten()) ** 2)
    reg = lam * theta[1:] @ theta[1:]
    return (cost + reg) / (2 * len(X))


theta = np.ones(X.shape[1])
a = costReg(theta, X, y, 1)  # 303.9931922202643

1.3 Regularized linear regression gradient

$$\frac{\partial J(\theta)}{\partial \theta_{0}}=\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)x_{j}^{(i)} \qquad \text{for } j=0$$

$$\frac{\partial J(\theta)}{\partial \theta_{j}}=\left(\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)x_{j}^{(i)}\right)+\frac{\lambda}{m}\theta_{j} \qquad \text{for } j\geq 1$$

def gradientReg(theta, X, y, lam):
    '''
    theta: 1-d array with shape (n+1,)
    X: 2-d array with shape (m, n+1)
    y: 2-d array with shape (m, 1)
    grad has the same shape as theta
    :param theta: weights
    :param X: input matrix
    :param y: output vector
    :param lam: lambda, the regularization strength
    :return: regularized gradient
    '''
    grad = (X @ theta - y.flatten()) @ X
    reg = lam * theta
    reg[0] = 0  # theta0 is not regularized
    return (grad + reg) / len(X)


b = gradientReg(theta, X, y, 1)  # [-15.30301567 598.25074417]

1.4 Fitting linear regression

def trainLinearReg(X, y, lam):
    theta = np.zeros(X.shape[1])
    res = opt.minimize(fun=costReg, jac=gradientReg, method='TNC', args=(X, y, lam), x0=theta)
    return res.x


fit_theta = trainLinearReg(X, y, 0)
# plotData(X, y)
# plt.plot(X[:, 1], X @ fit_theta)
# plt.show()

[Figure: best-fit line for linear regression over the training data]
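As a quick sanity check (my own addition, not part of the exercise): with lambda = 0 the minimizer should land on the ordinary least-squares solution, which has a closed form:

# Sanity check: for lambda = 0, TNC should recover the least-squares solution
theta_closed, *_ = np.linalg.lstsq(X, y.flatten(), rcond=None)
print(fit_theta, theta_closed)  # the two vectors should agree closely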

2 Bias-variance

2.1 Learning curves

$$J_{\mathrm{train}}(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}\right]$$

When computing the cross-validation cost, use the entire cross-validation set; there is no need to split it into subsets. Note also that both the training error and the cross-validation error are reported without the regularization term, i.e., with lambda set to 0.

def plot_learning_curve(X, y, Xval, yval, lam):
    '''Plot the learning curve: how the training error and the cross-validation
    error change as the number of training samples grows.'''
    xx = range(1, len(X) + 1)
    training_cost, cv_cost = [], []
    for i in xx:
        res = trainLinearReg(X[:i], y[:i], lam)
        # report both errors without the regularization term (lambda = 0)
        training_cost_i = costReg(res, X[:i], y[:i], 0)
        cv_cost_i = costReg(res, Xval, yval, 0)
        training_cost.append(training_cost_i)
        cv_cost.append(cv_cost_i)
    plt.figure(figsize=(8, 5))
    plt.plot(xx, training_cost, label='training cost')
    plt.plot(xx, cv_cost, label='cross validation cost')
    plt.legend()
    plt.xlabel('Number of training samples')
    plt.ylabel('Error')
    plt.title('Learning curve for linear regression')
    plt.grid(True)
    plt.show()
# plot_learning_curve(X, y, Xval, yval, 0)

[Figure: learning curve for linear regression]

3 Polynomial regression

Data preprocessing:

  1. X, Xval, and Xtest all need polynomial features added; here we go up to the 6th power.
  2. Don't forget to normalize.

# Data preprocessing
def genPolyFeatures(X, power):
    '''
    Add polynomial features: for i from 2 to power, append the i-th power of
    the second column (the first column is the bias) as a new last column.
    We start at the 2nd power because the 1st power is already present.
    :param X: design matrix with a bias column
    :param power: highest power to add
    :return: X with polynomial feature columns appended
    '''
    Xpoly = X.copy()
    for i in range(2, power + 1):
        Xpoly = np.insert(Xpoly, Xpoly.shape[1], np.power(Xpoly[:, 1], i), axis=1)
    return Xpoly

For normalization, every dataset must be processed with the training set's mean and sample standard deviation. Keep this in mind: store the training set's mean and sample standard deviation, and apply them to the later datasets.

def get_mean_std(X):
    '''Get the column-wise mean and sample standard deviation, used to normalize all datasets.'''
    means = np.nanmean(X, axis=0)
    stds = np.nanstd(X, axis=0, ddof=1)     # ddof=1 means sample standard deviation
    return means, stds


def featureNormalize(myX, means, stds):
    '''Normalize: subtract the mean and divide by the standard deviation, column by column, skipping the bias column.'''
    X_norm = myX.copy()
    X_norm[:, 1:] = X_norm[:, 1:] - means[1:]
    X_norm[:, 1:] = X_norm[:, 1:] / stds[1:]
    return X_norm

Also note that this is the sample standard deviation, not the population standard deviation: with np.std(), ddof=1 gives the sample standard deviation, while the default ddof=0 gives the population standard deviation. pandas computes the sample standard deviation by default.
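A quick illustration of the difference (toy numbers):

v = np.array([1.0, 2.0, 3.0, 4.0])
print(np.std(v))           # 1.1180..., population std: divides by n
print(np.std(v, ddof=1))   # 1.2909..., sample std: divides by n - 1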

Generate the datasets with polynomial features added and then normalized.

power = 6
X_poly = genPolyFeatures(X, power)
train_means, train_stds = get_mean_std(X_poly)
X_norm = featureNormalize(X_poly, train_means, train_stds)
Xval_norm = featureNormalize(genPolyFeatures(Xval, power), train_means, train_stds)
Xtest_norm = featureNormalize(genPolyFeatures(Xtest, power), train_means, train_stds)
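As a quick sanity check (my own addition): after normalization the training columns (excluding the bias) should have mean ≈ 0 and sample std ≈ 1; the validation and test sets generally will not, since they are scaled with the training set's statistics.

print(np.nanmean(X_norm[:, 1:], axis=0))         # ≈ 0 for every column
print(np.nanstd(X_norm[:, 1:], axis=0, ddof=1))  # ≈ 1 for every column
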
def plot_fit(means, stds, lam, power):
    '''Plot the fitted polynomial curve over the training data.'''
    theta = trainLinearReg(X_norm, y, lam)
    x = np.linspace(-75, 55, 50)
    xmat = x.reshape(-1, 1)
    # reshape(-1, 1): NumPy infers the first dimension from the array's total size
    xmat = np.insert(xmat, 0, 1, axis=1)
    Xmat = genPolyFeatures(xmat, power)
    Xmat_norm = featureNormalize(Xmat, means, stds)

    plotData(X, y)
    plt.plot(x, Xmat_norm @ theta, 'b--')
    plt.show()
# lambda = 0
plot_fit(train_means, train_stds, 0, 6)
plot_learning_curve(X_norm, y, Xval_norm, yval, 0)

[Figure: polynomial fit, λ = 0]

[Figure: learning curve, λ = 0]

# lambda = 1
plot_fit(train_means, train_stds, 1, 6)
plot_learning_curve(X_norm, y, Xval_norm, yval, 1)

[Figure: polynomial fit, λ = 1]

[Figure: learning curve, λ = 1]

# lambda = 100
plot_fit(train_means, train_stds, 100, 6)
plot_learning_curve(X_norm, y, Xval_norm, yval, 100)

[Figure: polynomial fit, λ = 100]

[Figure: learning curve, λ = 100]

With this much penalty the model underfits.

3.1 Selecting λ using a cross validation set

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.15, 0.32, 0.64, 1.28, 2.56, 3, 5.12, 10]
errors_train, errors_val = [], []
for l in lambdas:
    theta = trainLinearReg(X_norm, y, l)
    errors_train.append(costReg(theta, X_norm, y, 0))     # remember: lambda = 0 when computing the errors
    errors_val.append(costReg(theta, Xval_norm, yval, 0))
plt.figure(figsize=(8, 6))
plt.plot(lambdas, errors_train, c='b', label='train')
plt.plot(lambdas, errors_val, c='r', label='cv')
plt.legend()
plt.xlabel('Regularization parameter λ')
plt.ylabel('Error')
plt.grid(True)
# plt.show()

# The cross-validation cost is minimized at lambda = 2.56
var = lambdas[np.nanargmin(errors_val)]  # 2.56
# print(var)

image-20210531142402825

3.2 Computing test set error

'''
6.Computing test set error
'''
theta = trainLinearReg(X_norm, y, 2.56)
print('test cost(l={}) = {}'.format(2.56, costReg(theta, Xtest_norm, ytest, 0)))
# for l in lambdas:
#     theta = trainLinearReg(X_norm, y, l)
#     print('test cost(l={}) = {}'.format(l, costReg(theta, Xtest_norm, ytest, 0)))

[Output: test set cost for λ = 2.56]
Full code:

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat
import scipy.optimize as opt

'''
1.Prepare dataset
'''
data = loadmat('ex5data1.mat')
print(data)
X, y = data['X'], data['y']
Xval, yval = data['Xval'], data['yval']
Xtest, ytest = data['Xtest'], data['ytest']

# Insert a column of 1's to all of the X's, as usual
X = np.insert(X, 0, 1, axis=1)
Xval = np.insert(Xval, 0, 1, axis=1)
Xtest = np.insert(Xtest, 0, 1, axis=1)
print(X.shape, y.shape)  # (12, 2) (12, 1)
print(Xval.shape, yval.shape)  # (21, 2) (21, 1)
print(Xtest.shape, ytest.shape)  # (21, 2) (21, 1)


# Visualizing the data


def plotData(X, y):
    '''Visualize the data.'''
    fig, ax = plt.subplots(figsize=(6, 4))
    X = X[:, 1:]
    ax.scatter(X, y, c='r')
    ax.set_xlabel('Change in water level (x)')
    ax.set_ylabel('Water flowing out of the dam (y)')
    ax.grid(True)
    # plt.show()



'''
2.Regularized linear regression cost function
'''


def costReg(theta, X, y, lam):
    '''
    Regularized cost; do not regularize theta0.
    theta is a 1-d array with shape (n+1,)
    X is a matrix with shape (m, n+1)
    y is a matrix with shape (m, 1)
    :param theta: weights
    :param X: input matrix
    :param y: output vector
    :param lam: lambda, the regularization strength
    :return: regularized cost
    '''
    cost = np.nansum((X @ theta - y.flatten()) ** 2)
    reg = lam * theta[1:] @ theta[1:]
    return (cost + reg) / (2 * len(X))


theta = np.ones(X.shape[1])
a = costReg(theta, X, y, 1)  # 303.9931922202643


def gradientReg(theta, X, y, lam):
    '''
    theta: 1-d array with shape (n+1,)
    X: 2-d array with shape (m, n+1)
    y: 2-d array with shape (m, 1)
    grad has the same shape as theta
    :param theta: weights
    :param X: input matrix
    :param y: output vector
    :param lam: lambda, the regularization strength
    :return: regularized gradient
    '''
    grad = (X @ theta - y.flatten()) @ X
    reg = lam * theta
    reg[0] = 0  # theta0 is not regularized
    return (grad + reg) / len(X)


b = gradientReg(theta, X, y, 1)  # [-15.30301567 598.25074417]
'''
3.Fitting linear regression
'''


def trainLinearReg(X, y, lam):
    theta = np.zeros(X.shape[1])
    res = opt.minimize(fun=costReg, jac=gradientReg, method='TNC', args=(X, y, lam), x0=theta)
    return res.x


fit_theta = trainLinearReg(X, y, 0)
# plotData(X, y)
# plt.plot(X[:, 1], X @ fit_theta)
# plt.show()


def plot_learning_curve(X, y, Xval, yval, lam):
    '''Plot the learning curve: how the training error and the cross-validation
    error change as the number of training samples grows.'''
    xx = range(1, len(X) + 1)
    training_cost, cv_cost = [], []
    for i in xx:
        res = trainLinearReg(X[:i], y[:i], lam)
        # report both errors without the regularization term (lambda = 0)
        training_cost_i = costReg(res, X[:i], y[:i], 0)
        cv_cost_i = costReg(res, Xval, yval, 0)
        training_cost.append(training_cost_i)
        cv_cost.append(cv_cost_i)
    plt.figure(figsize=(8, 5))
    plt.plot(xx, training_cost, label='training cost')
    plt.plot(xx, cv_cost, label='cross validation cost')
    plt.legend()
    plt.xlabel('Number of training samples')
    plt.ylabel('Error')
    plt.title('Learning curve for linear regression')
    plt.grid(True)
    plt.show()


# plot_learning_curve(X, y, Xval, yval, 0)
'''
4.Polynomial regression
'''


# Data preprocessing
def genPolyFeatures(X, power):
    '''
    Add polynomial features: for i from 2 to power, append the i-th power of
    the second column (the first column is the bias) as a new last column.
    We start at the 2nd power because the 1st power is already present.
    :param X: design matrix with a bias column
    :param power: highest power to add
    :return: X with polynomial feature columns appended
    '''
    Xpoly = X.copy()
    for i in range(2, power + 1):
        Xpoly = np.insert(Xpoly, Xpoly.shape[1], np.power(Xpoly[:, 1], i), axis=1)
    return Xpoly


def get_mean_std(X):
    '''Get the column-wise mean and sample standard deviation, used to normalize all datasets.'''
    means = np.nanmean(X, axis=0)
    stds = np.nanstd(X, axis=0, ddof=1)     # ddof=1 means sample standard deviation
    return means, stds


def featureNormalize(myX, means, stds):
    '''Normalize: subtract the mean and divide by the standard deviation, column by column, skipping the bias column.'''
    X_norm = myX.copy()
    X_norm[:, 1:] = X_norm[:, 1:] - means[1:]
    X_norm[:, 1:] = X_norm[:, 1:] / stds[1:]
    return X_norm


power = 6
X_poly = genPolyFeatures(X, power)
train_means, train_stds = get_mean_std(X_poly)
X_norm = featureNormalize(X_poly, train_means, train_stds)
Xval_norm = featureNormalize(genPolyFeatures(Xval, power), train_means, train_stds)
Xtest_norm = featureNormalize(genPolyFeatures(Xtest, power), train_means, train_stds)


def plot_fit(means, stds, lam, power):
    '''Plot the fitted polynomial curve over the training data.'''
    theta = trainLinearReg(X_norm, y, lam)
    x = np.linspace(-75, 55, 50)
    xmat = x.reshape(-1, 1)
    # reshape(-1, 1): NumPy infers the first dimension from the array's total size
    xmat = np.insert(xmat, 0, 1, axis=1)
    Xmat = genPolyFeatures(xmat, power)
    Xmat_norm = featureNormalize(Xmat, means, stds)

    plotData(X, y)
    plt.plot(x, Xmat_norm @ theta, 'b--')
    plt.show()


# lambda = 0
# plot_fit(train_means, train_stds, 0, 6)
# plot_learning_curve(X_norm, y, Xval_norm, yval, 0)

# lambda = 1
# plot_fit(train_means, train_stds, 1, 6)
# plot_learning_curve(X_norm, y, Xval_norm, yval, 1)


# lambda = 100
# plot_fit(train_means, train_stds, 100, 6)
# plot_learning_curve(X_norm, y, Xval_norm, yval, 100)


'''
5.Selecting λ using a cross validation set
'''
lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.15, 0.32, 0.64, 1.28, 2.56, 3, 5.12, 10]
errors_train, errors_val = [], []
for l in lambdas:
    theta = trainLinearReg(X_norm, y, l)
    errors_train.append(costReg(theta, X_norm, y, 0))     # remember: lambda = 0 when computing the errors
    errors_val.append(costReg(theta, Xval_norm, yval, 0))
plt.figure(figsize=(8, 6))
plt.plot(lambdas, errors_train, c='b', label='train')
plt.plot(lambdas, errors_val, c='r', label='cv')
plt.legend()
plt.xlabel('Regularization parameter λ')
plt.ylabel('Error')
plt.grid(True)
# plt.show()

# The cross-validation cost is minimized at lambda = 2.56
var = lambdas[np.nanargmin(errors_val)]  # 2.56
# print(var)


'''
6.Computing test set error
'''
theta = trainLinearReg(X_norm, y, 2.56)
print('test cost(l={}) = {}'.format(2.56, costReg(theta, Xtest_norm, ytest, 0)))
# for l in lambdas:
#     theta = trainLinearReg(X_norm, y, l)
#     print('test cost(l={}) = {}'.format(l, costReg(theta, Xtest_norm, ytest, 0)))

Reference: https://blog.csdn.net/Cowry5/article/details/80421712
