Andrew Ng Machine Learning Exercises in Python (5): Bias vs. Variance (full code at the end)

Preface:

Training, validation, and test sets
If the sample data are plentiful, we usually split the dataset by uniform random sampling into three disjoint parts: a training set, a validation set, and a test set; a common ratio is 8:1:1. Note that usually only a training set and a test set are given, with no validation set. Where does the validation set come from, then? The usual practice is to sample a portion of the training set uniformly at random and use it as the validation set.
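A minimal sketch of such a split in NumPy (the 8:1:1 ratio is the one above; the function name and seed are illustrative):

import numpy as np

def split_dataset(X, y, seed=0):
    '''Shuffle the samples, then split them 8:1:1 into train / validation / test.'''
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                 # uniform random shuffle
    n_train, n_val = int(0.8 * len(X)), int(0.9 * len(X))
    tr, va, te = idx[:n_train], idx[n_train:n_val], idx[n_val:]
    return X[tr], y[tr], X[va], y[va], X[te], y[te]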

Training set
The training set is used to train the model, i.e., to determine parameters such as the weights and biases; we usually call these the learned parameters.

Validation set
The validation set is used for model selection. More concretely, the validation set plays no part in determining the learned parameters; that is, it never enters the gradient-descent process. It exists only to choose hyperparameters, such as the number of layers, the number of nodes per layer, the number of iterations, and the learning rate. In the k-NN algorithm, for instance, k is a hyperparameter, so the validation set can be used to find the k with the lowest error, as sketched below.
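A minimal sketch of this idea (assuming scikit-learn is available; the toy data are purely illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)                    # toy data, for illustration only
X_train, y_train = rng.normal(size=(80, 2)), rng.integers(0, 2, 80)
X_val, y_val = rng.normal(size=(20, 2)), rng.integers(0, 2, 20)

best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7, 9]:
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:                            # keep the k with the highest validation accuracy
        best_k, best_acc = k, acc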

Test set
The test set is used exactly once, to evaluate the final model after training is complete. It takes part neither in learning the parameters nor in selecting the hyperparameters; it is used purely for model evaluation.
Note: never use the test set during training and then test the model on that same set. Doing so is cheating and artificially inflates the model's test accuracy.

Because the training set participates directly in tuning the model, it obviously cannot reflect the model's true ability: students who memorize the textbook (overfitting) would get the best grades, which is clearly wrong. Likewise, since the validation set participates in the manual tuning of hyperparameters, it cannot deliver the final verdict on a model either, just as students who grind through question banks don't necessarily count as good learners. So a final exam (the test set) is needed to measure what a student (model) can really do.

Cross validation
Cross validation exists mainly because the training set may be too small to simply carve out a training set, a validation set, and a test set as above (simple cross validation).
In practice, people are not that fond of cross validation because it consumes more compute. Usually the training data is simply split 50%-90% into training and validation sets. But this depends on the situation: with many hyperparameters you may want a larger validation set, and if there isn't enough validation data, cross validation is the better choice, as sketched below. As for how many folds to use, 3, 5, and 10 are all common.
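A minimal sketch of k-fold cross-validation in NumPy (evaluate is a placeholder that trains a model and returns its validation error):

import numpy as np

def k_fold_error(X, y, k, evaluate):
    '''Each of the k folds serves once as the validation set;
    return the average validation error over the k rounds.'''
    folds = np.array_split(np.random.permutation(len(X)), k)
    errors = []
    for i in range(k):
        va = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(evaluate(X[tr], y[tr], X[va], y[va]))
    return np.mean(errors)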

Original article: https://blog.csdn.net/Chaolei3/article/details/79270939

This exercise focuses on bias vs. variance, using the change in a reservoir's water level and the amount of water flowing out of the dam as the example.

1 Prepare dataset

1.1 Visualizing the dataset

The dataset is split into a training set, a cross-validation set, and a test set.

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat
import scipy.optimize as opt
'''
1.Prepare dataset
'''
data = loadmat('ex5data1.mat')
print(data)
X, y = data['X'], data['y']
Xval, yval = data['Xval'], data['yval']
Xtest, ytest = data['Xtest'], data['ytest']

# Insert a column of 1's to all of the X's, as usual
X = np.insert(X, 0, 1, axis=1)
Xval = np.insert(Xval, 0, 1, axis=1)
Xtest = np.insert(Xtest, 0, 1, axis=1)
print(X.shape, y.shape)  # (12, 2) (12, 1)
print(Xval.shape, yval.shape)  # (21, 2) (21, 1)
print(Xtest.shape, ytest.shape)  # (21, 2) (21, 1)


# Visualizing the data


def plotData(X, y):
    '''Visualize the data.'''
    fig, ax = plt.subplots(figsize=(6, 4))
    X = X[:, 1:]
    ax.scatter(X, y, c='r')
    ax.set_xlabel('Change in water level (x)')
    ax.set_ylabel('Water flowing out of the dam (y)')
    ax.grid(True)
    # plt.show()

[Figure: scatter plot of the training data — change in water level vs. water flowing out of the dam]

1.2 Regularized linear regression cost function

$$J(\theta)=\frac{1}{2m}\left(\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}\right)+\frac{\lambda}{2m}\left(\sum_{j=1}^{n}\theta_{j}^{2}\right)$$

def costReg(theta, X, y, lam):
    '''
    Regularized cost; do not regularize theta0.
    theta is a 1-d array with shape (n+1,)
    X is a matrix with shape (m, n+1)
    y is a matrix with shape (m, 1)
    :param theta: weights
    :param X: input matrix
    :param y: output vector
    :param lam: lambda, the regularization strength
    :return: regularized cost
    '''
    cost = np.nansum((X @ theta - y.flatten()) ** 2)
    reg = lam * theta[1:] @ theta[1:]
    return (cost + reg) / (2 * len(X))


theta = np.ones(X.shape[1])
a = costReg(theta, X, y, 1)  # 303.9931922202643

1.3 Regularized linear regression gradient

$$\frac{\partial J(\theta)}{\partial \theta_{0}}=\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)x_{j}^{(i)} \qquad \text{for } j=0$$

$$\frac{\partial J(\theta)}{\partial \theta_{j}}=\left(\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)x_{j}^{(i)}\right)+\frac{\lambda}{m}\theta_{j} \qquad \text{for } j\geq 1$$

def gradientReg(theta, X, y, lam):
    '''
    theta: 1-d array with shape (n+1,)
    X: 2-d array with shape (m, n+1)
    y: 2-d array with shape (m, 1)
    grad has the same shape as theta
    :param theta: weights
    :param X: input matrix
    :param y: output vector
    :param lam: lambda, the regularization strength
    :return: regularized gradient
    '''
    grad = (X @ theta - y.flatten()) @ X
    reg = lam * theta
    reg[0] = 0  # theta0 is not regularized
    return (grad + reg) / len(X)


b = gradientReg(theta, X, y, 1)  # [-15.30301567 598.25074417]

1.4 Fitting linear regression

def trainLinearReg(X, y, lam):
    theta = np.zeros(X.shape[1])
    res = opt.minimize(fun=costReg, jac=gradientReg, method='TNC', args=(X, y, lam), x0=theta)
    return res.x


fit_theta = trainLinearReg(X, y, 0)
# plotData(X, y)
# plt.plot(X[:, 1], X @ fit_theta)
# plt.show()

[Figure: best-fit line for linear regression over the training data]
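As a quick sanity check (my own addition, not part of the exercise): with lambda = 0 the minimizer should land on the ordinary least-squares solution, which has a closed form:

# Sanity check: for lambda = 0, TNC should recover the least-squares solution
theta_closed, *_ = np.linalg.lstsq(X, y.flatten(), rcond=None)
print(fit_theta, theta_closed)  # the two vectors should agree closely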

2 Bias-variance

2.1 Learning curves

$$J_{\mathrm{train}}(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}\right]$$

When computing the cross-validation cost, use the entire cross-validation set; there is no need to split it into subsets. Note also that both the training error and the cross-validation error are reported without the regularization term, i.e., with lambda set to 0.

def plot_learning_curve(X, y, Xval, yval, lam):
    '''Plot the learning curve: how the training error and the cross-validation
    error change as the number of training samples grows.'''
    xx = range(1, len(X) + 1)
    training_cost, cv_cost = [], []
    for i in xx:
        res = trainLinearReg(X[:i], y[:i], lam)
        # report both errors without the regularization term (lambda = 0)
        training_cost_i = costReg(res, X[:i], y[:i], 0)
        cv_cost_i = costReg(res, Xval, yval, 0)
        training_cost.append(training_cost_i)
        cv_cost.append(cv_cost_i)
    plt.figure(figsize=(8, 5))
    plt.plot(xx, training_cost, label='training cost')
    plt.plot(xx, cv_cost, label='cross validation cost')
    plt.legend()
    plt.xlabel('Number of training samples')
    plt.ylabel('Error')
    plt.title('Learning curve for linear regression')
    plt.grid(True)
    plt.show()
# plot_learning_curve(X, y, Xval, yval, 0)

[Figure: learning curve for linear regression]

3 Polynomial regression

Data preprocessing:

  1. X, Xval, and Xtest all need polynomial features added; here we go up to the 6th power.
  2. Don't forget to normalize.

# Data preprocessing
def genPolyFeatures(X, power):
    '''
    Add polynomial features: for i from 2 to power, append the i-th power of
    the second column (the first column is the bias) as a new last column.
    We start at the 2nd power because the 1st power is already present.
    :param X: design matrix with a bias column
    :param power: highest power to add
    :return: X with polynomial feature columns appended
    '''
    Xpoly = X.copy()
    for i in range(2, power + 1):
        Xpoly = np.insert(Xpoly, Xpoly.shape[1], np.power(Xpoly[:, 1], i), axis=1)
    return Xpoly

For normalization, every dataset must be processed with the training set's mean and sample standard deviation. Keep this in mind: store the training set's mean and sample standard deviation, and apply them to the later datasets.

def get_mean_std(X):
    '''Get the column-wise mean and sample standard deviation, used to normalize all datasets.'''
    means = np.nanmean(X, axis=0)
    stds = np.nanstd(X, axis=0, ddof=1)     # ddof=1 means sample standard deviation
    return means, stds


def featureNormalize(myX, means, stds):
    '''Normalize: subtract the mean and divide by the standard deviation, column by column, skipping the bias column.'''
    X_norm = myX.copy()
    X_norm[:, 1:] = X_norm[:, 1:] - means[1:]
    X_norm[:, 1:] = X_norm[:, 1:] / stds[1:]
    return X_norm

Also note that this is the sample standard deviation, not the population standard deviation: with np.std(), ddof=1 gives the sample standard deviation, while the default ddof=0 gives the population standard deviation. pandas computes the sample standard deviation by default.
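A quick illustration of the difference (toy numbers):

v = np.array([1.0, 2.0, 3.0, 4.0])
print(np.std(v))           # 1.1180..., population std: divides by n
print(np.std(v, ddof=1))   # 1.2909..., sample std: divides by n - 1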

Generate the datasets with polynomial features added and then normalized.

power = 6
X_poly = genPolyFeatures(X, power)
train_means, train_stds = get_mean_std(X_poly)
X_norm = featureNormalize(X_poly, train_means, train_stds)
Xval_norm = featureNormalize(genPolyFeatures(Xval, power), train_means, train_stds)
Xtest_norm = featureNormalize(genPolyFeatures(Xtest, power), train_means, train_stds)
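As a quick sanity check (my own addition): after normalization the training columns (excluding the bias) should have mean ≈ 0 and sample std ≈ 1; the validation and test sets generally will not, since they are scaled with the training set's statistics.

print(np.nanmean(X_norm[:, 1:], axis=0))         # ≈ 0 for every column
print(np.nanstd(X_norm[:, 1:], axis=0, ddof=1))  # ≈ 1 for every column
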
def plot_fit(means, stds, lam, power):
    '''Plot the fitted polynomial curve over the training data.'''
    theta = trainLinearReg(X_norm, y, lam)
    x = np.linspace(-75, 55, 50)
    xmat = x.reshape(-1, 1)
    # reshape(-1, 1): NumPy infers the first dimension from the array's total size
    xmat = np.insert(xmat, 0, 1, axis=1)
    Xmat = genPolyFeatures(xmat, power)
    Xmat_norm = featureNormalize(Xmat, means, stds)

    plotData(X, y)
    plt.plot(x, Xmat_norm @ theta, 'b--')
    plt.show()
# lambda = 0
plot_fit(train_means, train_stds, 0, 6)
plot_learning_curve(X_norm, y, Xval_norm, yval, 0)

[Figure: polynomial fit, λ = 0]

[Figure: learning curve, λ = 0]

# lambda = 1
plot_fit(train_means, train_stds, 1, 6)
plot_learning_curve(X_norm, y, Xval_norm, yval, 1)

[Figure: polynomial fit, λ = 1]

[Figure: learning curve, λ = 1]

# lambda = 100
plot_fit(train_means, train_stds, 100, 6)
plot_learning_curve(X_norm, y, Xval_norm, yval, 100)

[Figure: polynomial fit, λ = 100]

[Figure: learning curve, λ = 100]

With this much penalty the model underfits.

3.1 Selecting λ using a cross validation set

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.15, 0.32, 0.64, 1.28, 2.56, 3, 5.12, 10]
errors_train, errors_val = [], []
for l in lambdas:
    theta = trainLinearReg(X_norm, y, l)
    errors_train.append(costReg(theta, X_norm, y, 0))     # remember: lambda = 0 when computing the errors
    errors_val.append(costReg(theta, Xval_norm, yval, 0))
plt.figure(figsize=(8, 6))
plt.plot(lambdas, errors_train, c='b', label='train')
plt.plot(lambdas, errors_val, c='r', label='cv')
plt.legend()
plt.xlabel('Regularization parameter λ')
plt.ylabel('Error')
plt.grid(True)
# plt.show()

# The cross-validation cost is minimized at lambda = 2.56
var = lambdas[np.nanargmin(errors_val)]  # 2.56
# print(var)

image-20210531142402825

3.2 Computing test set error

'''
6.Computing test set error
'''
theta = trainLinearReg(X_norm, y, 2.56)
print('test cost(l={}) = {}'.format(2.56, costReg(theta, Xtest_norm, ytest, 0)))
# for l in lambdas:
#     theta = trainLinearReg(X_norm, y, l)
#     print('test cost(l={}) = {}'.format(l, costReg(theta, Xtest_norm, ytest, 0)))

[Output: test set cost for λ = 2.56]
Full code:

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat
import scipy.optimize as opt

'''
1.Prepare dataset
'''
data = loadmat('ex5data1.mat')
print(data)
X, y = data['X'], data['y']
Xval, yval = data['Xval'], data['yval']
Xtest, ytest = data['Xtest'], data['ytest']

# Insert a column of 1's to all of the X's, as usual
X = np.insert(X, 0, 1, axis=1)
Xval = np.insert(Xval, 0, 1, axis=1)
Xtest = np.insert(Xtest, 0, 1, axis=1)
print(X.shape, y.shape)  # (12, 2) (12, 1)
print(Xval.shape, yval.shape)  # (21, 2) (21, 1)
print(Xtest.shape, ytest.shape)  # (21, 2) (21, 1)


# Visualizing the data


def plotData(X, y):
    '''Visualize the data.'''
    fig, ax = plt.subplots(figsize=(6, 4))
    X = X[:, 1:]
    ax.scatter(X, y, c='r')
    ax.set_xlabel('Change in water level (x)')
    ax.set_ylabel('Water flowing out of the dam (y)')
    ax.grid(True)
    # plt.show()



'''
2.Regularized linear regression cost function
'''


def costReg(theta, X, y, lam):
    '''
    Regularized cost; do not regularize theta0.
    theta is a 1-d array with shape (n+1,)
    X is a matrix with shape (m, n+1)
    y is a matrix with shape (m, 1)
    :param theta: weights
    :param X: input matrix
    :param y: output vector
    :param lam: lambda, the regularization strength
    :return: regularized cost
    '''
    cost = np.nansum((X @ theta - y.flatten()) ** 2)
    reg = lam * theta[1:] @ theta[1:]
    return (cost + reg) / (2 * len(X))


theta = np.ones(X.shape[1])
a = costReg(theta, X, y, 1)  # 303.9931922202643


def gradientReg(theta, X, y, lam):
    '''
    theta: 1-d array with shape (n+1,)
    X: 2-d array with shape (m, n+1)
    y: 2-d array with shape (m, 1)
    grad has the same shape as theta
    :param theta: weights
    :param X: input matrix
    :param y: output vector
    :param lam: lambda, the regularization strength
    :return: regularized gradient
    '''
    grad = (X @ theta - y.flatten()) @ X
    reg = lam * theta
    reg[0] = 0  # theta0 is not regularized
    return (grad + reg) / len(X)


b = gradientReg(theta, X, y, 1)  # [-15.30301567 598.25074417]
'''
3.Fitting linear regression
'''


def trainLinearReg(X, y, lam):
    theta = np.zeros(X.shape[1])
    res = opt.minimize(fun=costReg, jac=gradientReg, method='TNC', args=(X, y, lam), x0=theta)
    return res.x


fit_theta = trainLinearReg(X, y, 0)
# plotData(X, y)
# plt.plot(X[:, 1], X @ fit_theta)
# plt.show()


def plot_learning_curve(X, y, Xval, yval, lam):
    '''Plot the learning curve: how the training error and the cross-validation
    error change as the number of training samples grows.'''
    xx = range(1, len(X) + 1)
    training_cost, cv_cost = [], []
    for i in xx:
        res = trainLinearReg(X[:i], y[:i], lam)
        # report both errors without the regularization term (lambda = 0)
        training_cost_i = costReg(res, X[:i], y[:i], 0)
        cv_cost_i = costReg(res, Xval, yval, 0)
        training_cost.append(training_cost_i)
        cv_cost.append(cv_cost_i)
    plt.figure(figsize=(8, 5))
    plt.plot(xx, training_cost, label='training cost')
    plt.plot(xx, cv_cost, label='cross validation cost')
    plt.legend()
    plt.xlabel('Number of training samples')
    plt.ylabel('Error')
    plt.title('Learning curve for linear regression')
    plt.grid(True)
    plt.show()


# plot_learning_curve(X, y, Xval, yval, 0)
'''
4.Polynomial regression
'''


# Data preprocessing
def genPolyFeatures(X, power):
    '''
    Add polynomial features: for i from 2 to power, append the i-th power of
    the second column (the first column is the bias) as a new last column.
    We start at the 2nd power because the 1st power is already present.
    :param X: design matrix with a bias column
    :param power: highest power to add
    :return: X with polynomial feature columns appended
    '''
    Xpoly = X.copy()
    for i in range(2, power + 1):
        Xpoly = np.insert(Xpoly, Xpoly.shape[1], np.power(Xpoly[:, 1], i), axis=1)
    return Xpoly


def get_mean_std(X):
    '''Get the column-wise mean and sample standard deviation, used to normalize all datasets.'''
    means = np.nanmean(X, axis=0)
    stds = np.nanstd(X, axis=0, ddof=1)     # ddof=1 means sample standard deviation
    return means, stds


def featureNormalize(myX, means, stds):
    '''Normalize: subtract the mean and divide by the standard deviation, column by column, skipping the bias column.'''
    X_norm = myX.copy()
    X_norm[:, 1:] = X_norm[:, 1:] - means[1:]
    X_norm[:, 1:] = X_norm[:, 1:] / stds[1:]
    return X_norm


power = 6
X_poly = genPolyFeatures(X, power)
train_means, train_stds = get_mean_std(X_poly)
X_norm = featureNormalize(X_poly, train_means, train_stds)
Xval_norm = featureNormalize(genPolyFeatures(Xval, power), train_means, train_stds)
Xtest_norm = featureNormalize(genPolyFeatures(Xtest, power), train_means, train_stds)


def plot_fit(means, stds, lam, power):
    '''Plot the fitted polynomial curve over the training data.'''
    theta = trainLinearReg(X_norm, y, lam)
    x = np.linspace(-75, 55, 50)
    xmat = x.reshape(-1, 1)
    # reshape(-1, 1): NumPy infers the first dimension from the array's total size
    xmat = np.insert(xmat, 0, 1, axis=1)
    Xmat = genPolyFeatures(xmat, power)
    Xmat_norm = featureNormalize(Xmat, means, stds)

    plotData(X, y)
    plt.plot(x, Xmat_norm @ theta, 'b--')
    plt.show()


# lambda = 0
# plot_fit(train_means, train_stds, 0, 6)
# plot_learning_curve(X_norm, y, Xval_norm, yval, 0)

# lambda = 1
# plot_fit(train_means, train_stds, 1, 6)
# plot_learning_curve(X_norm, y, Xval_norm, yval, 1)


# lambda = 100
# plot_fit(train_means, train_stds, 100, 6)
# plot_learning_curve(X_norm, y, Xval_norm, yval, 100)


'''
5.Selecting λ using a cross validation set
'''
lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.15, 0.32, 0.64, 1.28, 2.56, 3, 5.12, 10]
errors_train, errors_val = [], []
for l in lambdas:
    theta = trainLinearReg(X_norm, y, l)
    errors_train.append(costReg(theta, X_norm, y, 0))     # remember: lambda = 0 when computing the errors
    errors_val.append(costReg(theta, Xval_norm, yval, 0))
plt.figure(figsize=(8, 6))
plt.plot(lambdas, errors_train, c='b', label='train')
plt.plot(lambdas, errors_val, c='r', label='cv')
plt.legend()
plt.xlabel('Regularization parameter λ')
plt.ylabel('Error')
plt.grid(True)
# plt.show()

# The cross-validation cost is minimized at lambda = 2.56
var = lambdas[np.nanargmin(errors_val)]  # 2.56
# print(var)


'''
6.Computing test set error
'''
theta = trainLinearReg(X_norm, y, 2.56)
print('test cost(l={}) = {}'.format(2.56, costReg(theta, Xtest_norm, ytest, 0)))
# for l in lambdas:
#     theta = trainLinearReg(X_norm, y, l)
#     print('test cost(l={}) = {}'.format(l, costReg(theta, Xtest_norm, ytest, 0)))

Reference: https://blog.csdn.net/Cowry5/article/details/80421712
