机器学习练习 5 - 偏差和方差

最新推荐文章于 2022-11-10 16:12:29 发布

追寻远方的人

最新推荐文章于 2022-11-10 16:12:29 发布

阅读量530

点赞数

分类专栏：机器学习练习(Stanford) 文章标签：机器学习 python 人工智能

本文链接：https://blog.csdn.net/segegse/article/details/125802213

版权

机器学习练习(Stanford) 专栏收录该内容

8 篇文章 1 订阅

订阅专栏

1 正则化线性回归

这一部分，我们需要先对一个水库的流出水量以及水库水位进行正则化线性归回。然后将会探讨方差-偏差的问题

1.1 数据可视化

In [1]:

import numpy as np
import scipy.io as sio
import scipy.optimize as opt
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
/opt/conda/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/opt/conda/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)

In [2]:

data = sio.loadmat('/home/kesci/input/andrew_ml_ex55139/ex5data1.mat')
X, y, Xval, yval, Xtest, ytest = map(np.ravel,[data['X'], data['y'], data['Xval'], data['yval'], data['Xtest'], data['ytest']])
X.shape, y.shape, Xval.shape, yval.shape, Xtest.shape, ytest.shape

Out[2]:

((12,), (12,), (21,), (21,), (21,), (21,))

In [3]:

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X, y)
ax.set_xlabel('water_level')
ax.set_ylabel('flow')
plt.show()

1.2 正则化线性回归代价函数

正则化线性回归代价函数公式如下:

$J(\theta)=\frac{1}{2m}\left(\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2 \right )+\frac{\lambda}{2m}\left(\sum_{j=1}^n\theta_j^2\right)$

其中λ是正则化参数，他控制了正则化的程度。正则化项在原来的代价J上加了一个惩罚项。随着参数 $\theta_j$ 的变大，惩罚项也会增大。另外，你不需要正则化 $\theta_0$
下面完场计算正则化线性回归代价函数的代码。theta初始值为[1,1]，输出应该为303.993

np.array和np.matrix的差别

'''
上面所说的 np.array([1,2,3])这种不是矩阵np.array([[1,2,3]])或np.mat([1,2,3])才算
np.array 大多数操作符号都是element-wise的, 除了@可以表示叉乘(python>=3.5) , np.array要表示叉乘需要使用函数np.dot(A, B)
两者都有 .T 操作以返回转置矩阵, 但是np.mat多了.H(共轭转置)和.I(逆矩阵)
np.array 可以表示超过1~n维的数据, 而np.mat只能用于二维
np.array 取第一列A[:, 0] 返回的不是矩阵, shape是没有列的维度的(例如(3,) 而不是(3,1)), 而np.mat则是列向量形式的矩阵(符合预期)
两者可以互相用np.asmatrix()或np.asarray()互相转换
'''

In [4]:

X, Xval, Xtest = [np.insert(x.reshape(x.shape[0], 1), 0, np.ones(x.shape[0]), axis=1) for x in (X, Xval, Xtest)]

In [5]:

def cost(theta, X, y):
    """
    X: R(m*n), m records, n features
    y: R(m)
    theta : R(n), linear regression parameters
    """
    m = X.shape[0]

    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    inner = X * theta.T - y.T  # R(m*1)

    # 1*m @ m*1 = 1*1 in matrix multiplication
    # but you know numpy didn't do transpose in 1d array, so here is just a
    # vector inner product to itselves
    square_sum = inner.T * inner
    cost = square_sum / (2 * m)

    return cost[0,0] # 注意此处

In [6]:

def costReg(theta, X, y, reg = 1):
    
    m = X.shape[0]

    regularized_term = (reg / (2 * m)) * np.power(theta[1:], 2).sum()

    return cost(theta, X, y) + regularized_term

In [7]:

theta = np.ones(X.shape[1])
costReg(theta, X, y, 1)

Out[7]:

303.9931922202643

1.3 正则化线性回归的梯度

$\frac{\partial J( \theta )}{\partial {{\theta }_{0}}}=\frac{1}{m}\sum_{i=1}^{m}{({{h}_{\theta }}( {{x}^{(i)}})-{{y}^{(i)}})x_{_{j}}^{(i)}},for j = 0$

$\frac{\partial J( \theta )}{\partial {{\theta }_{j}}}=\frac{1}{m}\sum_{i=1}^{m}{({{h}_{\theta }}( {{x}^{(i)}})-{{y}^{(i)}})x_{_{j}}^{(i)}}+\frac{\lambda}{m}\theta_j,for j\geq0$
设定θ初始值为[1,1]，输出应该为[-15.30, 598.250]

In [8]:

def gradient(theta, X, y):
    m = X.shape[0]
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)

    inner = X.T * (X * theta.T - y.T)  #  (n, 1)  矩阵代替for循环

    return inner / m

In [9]:

def gradientReg(theta, X, y, reg):
    m = X.shape[0]

    theta = np.matrix(theta)
    theta = theta.T
    regularized_term = theta.copy()  # same shape as theta
    regularized_term[0] = 0  # don't regularize intercept theta

    regularized_term = (reg / m) * regularized_term

    return gradient(theta.T, X, y) + regularized_term

In [10]:

gradientReg(theta, X, y, 1)

Out[10]:

array([-15.30301567, 598.25074417])

1.4 拟合线性回归

调用工具库找到θ最优解，在这个部分，我们令λ=0。因为我们现在训练的是2维的θ，所以正则化不会对这种低维的θ有很大的帮助。
完成之后，将数据和拟合曲线可视化。

In [11]:

theta = np.ones(X.shape[1])
final_theta = opt.minimize(fun=costReg, x0=theta, args=(X, y, 0), method='TNC', jac=gradientReg, options={'disp': True}).x
final_theta

Out[11]:

array([13.08790348,  0.36777923])

In [12]:

b = final_theta[0] # intercept
m = final_theta[1] # slope

fig, ax = plt.subplots(figsize=(12,8))
plt.scatter(X[:,1], y, c='r', label="Training data")
plt.plot(X[:, 1], X[:, 1]*m + b, c='b', label="Prediction")
ax.set_xlabel('water_level')
ax.set_ylabel('flow')
ax.legend()
plt.show()

2 方差和偏差

机器学习中的一个重要概念是偏差-方差权衡。偏差较大的模型会欠拟合，而方差较大的模型会过拟合。这部分会让你画出学习曲线来判断方差和偏差的问题。

Bias/variance

Training error:
$KaTeX parse error: Expected group after '_' at position 38: …frac{1}{2m}\sum_̲\limits{i=1}^{m…$

Cross Validation error:
$KaTeX parse error: Expected group after '_' at position 40: …1}{2m_{cv}}\sum_̲\limits{i=1}^{m…$

对于训练集，当 $d$ 较小时，模型拟合程度更低，误差较大；随着 $d$ 的增长，拟合程度提高，误差减小。

对于交叉验证集，当 $d$ 较小时，模型拟合程度低，误差较大；但是随着 $d$ 的增长，误差呈现先减小后增大的趋势，转折点是我们的模型开始过拟合训练数据集的时候。

如果我们的交叉验证集误差较大，我们如何判断是方差还是偏差呢

训练集误差和交叉验证集误差近似时：偏差/欠拟合

交叉验证集误差远大于训练集误差时：方差/过拟合

当 $\lambda$ 较小时，训练集误差较小（过拟合）而交叉验证集误差较大

随着 $\lambda$ 的增加，训练集误差不断增加（欠拟合），而交叉验证集误差则是先减小后增加

2.1 学习曲线

1.使用训练集的子集来拟合应模型
2.在计算训练代价和验证集代价时，没有用正则化
3.记住使用相同的训练集子集来计算训练代价
代价公式为：

$J_{train}(\theta)=\frac{1}{m}[\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2]$

In [13]:

def linear_regression(X, y, l=1):
    """linear regression
    args:
        X: feature matrix, (m, n+1) # with incercept x0=1
        y: target vector, (m, )
        l: lambda constant for regularization

    return: trained parameters
    """
    # init theta
    theta = np.ones(X.shape[1])

    # train it
    res = opt.minimize(fun=costReg,
                       x0=theta,
                       args=(X, y, l),
                       method='TNC',
                       jac=gradientReg,
                       options={'disp': True})
    return res

In [14]:

training_cost, cv_cost = [], []

In [15]:

for i in range(1, m + 1):
    res = linear_regression(X[:i, :], y[:i], 0)

    tc = costReg(res.x, X[:i, :], y[:i], 0)
    cv = costReg(res.x, Xval, yval, 0)

    training_cost.append(tc)
    cv_cost.append(cv)

In [16]:

fig, ax = plt.subplots(figsize=(12,8))
plt.plot(np.arange(1, m+1), training_cost, label='training cost')
plt.plot(np.arange(1, m+1), cv_cost, label='cv cost')
plt.legend()
plt.show()

这个模型拟合不太好, 欠拟合了

让我们再次回到最开始的例子，在那里寻找答案，这就是我们之前的例子。回顾 1.1 中提出的六种可选的下一步，让我们来看一看我们在什么情况下应该怎样选择：

获得更多的训练样本——解决高方差
尝试减少特征的数量——解决高方差
尝试获得更多的特征——解决高偏差
尝试增加多项式特征——解决高偏差
尝试减少正则化程度λ——解决高偏差
尝试增加正则化程度λ——解决高方差

3 多项式回归

线性回归对于现有数据来说太简单了，会欠拟合，我们需要多添加一些特征。
写一个函数，输入原始X，和幂的次数p，返回X的1到p次幂

In [17]:

def poly_features(x, power, as_ndarray=False):
    data = {'f{}'.format(i): np.power(x, i) for i in range(1, power + 1)}
    df = pd.DataFrame(data)

    return df.values if as_ndarray else df
    #.values：对应的二维NumPy值数组。

In [18]:

data = sio.loadmat('/home/kesci/input/andrew_ml_ex55139/ex5data1.mat')
X, y, Xval, yval, Xtest, ytest = map(np.ravel,[data['X'], data['y'], data['Xval'], data['yval'], data['Xtest'], data['ytest']])

In [19]:

poly_features(X, power=3)

Out[19]:

	f1	f2	f3
0	-15.936758	253.980260	-4047.621971
1	-29.152979	849.896197	-24777.006175
2	36.189549	1309.683430	47396.852168
3	37.492187	1405.664111	52701.422173
4	-48.058829	2309.651088	-110999.127750
5	-8.941458	79.949670	-714.866612
6	15.307793	234.328523	3587.052500
7	-34.706266	1204.524887	-41804.560890
8	1.389154	1.929750	2.680720
9	-44.383760	1969.918139	-87432.373590
10	7.013502	49.189211	344.988637
11	22.762749	518.142738	11794.353058

3.1 多项式回归

使用之前的代价函数和梯度函数
扩展特征到8阶特征
使用 归一化 来处理 $x^n$
λ=0

apply() 使用时，通常放入一个 lambda 函数表达式、或一个函数作为操作运算，官方上给出DataFrame的 apply() 用法：

DataFrame.apply(self, func, axis=0, raw=False, result_type=None, args=(), **kwargs)

参数：

func：函数或 lambda 表达式,应用于每行或者每列

axis：{0 or ‘index’, 1 or ‘columns’}, 默认为0

0 or ‘index’: 表示函数处理的是每一列
1 or ‘columns’: 表示函数处理的是每一行

raw：bool 类型，默认为 False;

False ，表示把每一行或列作为 Series 传入函数中；
True，表示接受的是 ndarray 数据类型；

result_type：{‘expand’, ‘reduce’, ‘broadcast’, None}, default None

These only act when axis=1 (columns):

‘expand’ : 列表式的结果将被转化为列。
‘reduce’ : 如果可能的话，返回一个 Series，而不是展开类似列表的结果。这与 expand 相反。
‘broadcast’ : 结果将被广播到 DataFrame 的原始形状，原始索引和列将被保留。

args: func 的位置参数

**kwargs：要作为关键字参数传递给 func 的其他关键字参数，1.3.0 开始支持

返回值：

Series 或者 DataFrame：沿数据的给定轴应用 func 的结果

注：DataFrame与Series的区别与联系：

区别：

series，只是一个一维结构，它由index和value组成。
dataframe，是一个二维结构，除了拥有index和value之外，还拥有column。

联系：

dataframe由多个series组成，无论是行还是列，单独拆分出来都是一个series。

In [20]:

def normalize_feature(df):
    """Applies function along input axis(default 0) of DataFrame."""
    return df.apply(lambda column: (column - column.mean()) / column.std())

In [21]:

def prepare_poly_data(*args, power):
    """
    args: keep feeding in X, Xval, or Xtest
        will return in the same order
    """
    def prepare(x):
        # expand feature
        df = poly_features(x, power=power)

        # normalization
        ndarr = normalize_feature(df).values

        # add intercept term
        return np.insert(ndarr, 0, np.ones(ndarr.shape[0]), axis=1)

    return [prepare(x) for x in args]

In [22]:

X_poly, Xval_poly, Xtest_poly= prepare_poly_data(X, Xval, Xtest, power=8)
X_poly[:3, :]

Out[22]:

array([[ 1.00000000e+00, -3.62140776e-01, -7.55086688e-01,
         1.82225876e-01, -7.06189908e-01,  3.06617917e-01,
        -5.90877673e-01,  3.44515797e-01, -5.08481165e-01],
       [ 1.00000000e+00, -8.03204845e-01,  1.25825266e-03,
        -2.47936991e-01, -3.27023420e-01,  9.33963187e-02,
        -4.35817606e-01,  2.55416116e-01, -4.48912493e-01],
       [ 1.00000000e+00,  1.37746700e+00,  5.84826715e-01,
         1.24976856e+00,  2.45311974e-01,  9.78359696e-01,
        -1.21556976e-02,  7.56568484e-01, -1.70352114e-01]])

画出学习曲线
首先，我们没有使用正则化，所以 λ=0

In [23]:

def plot_learning_curve(X, Xinit, y, Xval, yval, l=0):
    training_cost, cv_cost = [], []
    m = X.shape[0]

    for i in range(1, m + 1):
        # regularization applies here for fitting parameters
        res = linear_regression(X[:i, :], y[:i], l=l)

        # remember, when you compute the cost here, you are computing
        # non-regularized cost. Regularization is used to fit parameters only
        tc = cost(res.x, X[:i, :], y[:i])
        cv = cost(res.x, Xval, yval)

        training_cost.append(tc)
        cv_cost.append(cv)

    fig, ax = plt.subplots(2, 1, figsize=(12, 12))
    ax[0].plot(np.arange(1, m + 1), training_cost, label='training cost')
    ax[0].plot(np.arange(1, m + 1), cv_cost, label='cv cost')
    ax[0].legend()

    fitx = np.linspace(-50, 50, 100)
    fitxtmp = prepare_poly_data(fitx, power=8)
    fity = np.dot(prepare_poly_data(fitx, power=8)[0], linear_regression(X, y, l).x.T)
    # (100, 9) (9,)用dot
    ax[1].plot(fitx, fity, c='r', label='fitcurve')
    ax[1].scatter(Xinit, y, c='b', label='initial_Xy')

    ax[1].set_xlabel('water_level')
    ax[1].set_ylabel('flow')

In [24]:

plot_learning_curve(X_poly, X, y, Xval_poly, yval, l=0)
plt.show()

你可以看到训练的代价太低了，不真实. 这是 过拟合了

3.2 调整正则化系数λ

令 λ=1

In [25]:

plot_learning_curve(X_poly, X, y, Xval_poly, yval, l=1)
plt.show()

训练代价不再是0了，也就是说我们减轻过拟合

令 λ=100

In [26]:

plot_learning_curve(X_poly, X, y, Xval_poly, yval, l=100)
plt.show()

太多正则化惩罚太多，变成 欠拟合状态

3.3 找到最佳的λ

通过之前的实验，我们可以发现λ可以极大程度地影响正则化多项式回归。
所以这部分我们会会使用验证集去评价λ的表现好坏，然后选择表现最好的λ后，用测试集测试模型在没有出现过的数据上会表现多好。
尝试λ值[0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]

In [27]:

l_candidate = [0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]
training_cost, cv_cost = [], []

In [28]:

for l in l_candidate:
    res = linear_regression(X_poly, y, l)
    
    tc = cost(res.x, X_poly, y)
    cv = cost(res.x, Xval_poly, yval)
    
    training_cost.append(tc)
    cv_cost.append(cv)

In [29]:

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(l_candidate, training_cost, label='training')
ax.plot(l_candidate, cv_cost, label='cross validation')
plt.legend()

plt.xlabel('lambda')

plt.ylabel('cost')
plt.show()

我们可以看到，最小值在4左右，对应的λ的值约为1

3.4 计算测试集上的误差

实际上，为了获得一个更好的模型，我们需要把最终的模型用在一个从来没有在计算中出现过的测试集上，也就是说，需要既没有被用作选择θ，也没有被用作选择λ的数据（？？？？有问题吧）

In [30]:

# use test data to compute the cost
for l in l_candidate:
    theta = linear_regression(X_poly, y, l).x
    print('test cost(l={}) = {}'.format(l, cost(theta, Xtest_poly, ytest)))
test cost(l=0) = 10.804375286491785
test cost(l=0.001) = 10.911365745177878
test cost(l=0.003) = 11.265060784108712
test cost(l=0.01) = 10.879143763702967
test cost(l=0.03) = 10.022378551698187
test cost(l=0.1) = 8.631776100446476
test cost(l=0.3) = 7.3365081011786275
test cost(l=1) = 7.466282452677015
test cost(l=3) = 11.643940740451052
test cost(l=10) = 27.715080273166386