4 Gradient Descent

Latest recommended article published on 2022-04-02 21:15:24

weixin_34185320

Tags: data structures and algorithms, python, artificial intelligence

Original link: https://juejin.im/post/5ceea64e51882567697f98c2

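The previous chapter covered solving linear regression with the normal equation. This article introduces another method that is very useful in machine learning: gradient descent. A few points to know first:

It is not itself a machine learning algorithm
It is a search-based optimization method
Purpose: minimize a loss function
Gradient ascent: maximize a utility function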

1 Definition

The model we use for prediction is (with $x_0 = 1$):

$$h_{\theta}(x)=\theta^{T}x=\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+\dots+\theta_{n}x_{n}$$

Goal: make the following expression as small as possible

$$J(\theta_{0},\theta_{1},\dots,\theta_{n})=\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^{2}$$

However, we want the objective to be independent of the number of samples m, so the loss function is finally defined as:

$$J(\theta_{0},\theta_{1},\dots,\theta_{n})=\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^{2}$$
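As a quick illustration of the loss just defined, the following sketch evaluates $J$ on a tiny made-up data set. The function name `mse_loss` and the toy numbers are assumptions for illustration only, not part of the original article.

```python
import numpy as np

def mse_loss(theta, X_b, y):
    """Mean squared error J(theta) = (1/m) * sum((X_b @ theta - y)^2)."""
    residual = X_b @ theta - y
    return np.sum(residual ** 2) / len(y)

# Hypothetical toy data: one feature, three samples
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
X_b = np.hstack([np.ones((len(X), 1)), X])  # prepend the dummy feature x_0 = 1

theta = np.array([0.0, 1.0])    # intercept 0, slope 1
loss = mse_loss(theta, X_b, y)  # residuals are -1, -2, -3, so J = 14 / 3
print(loss)
```

With slope 2 instead of 1 the predictions match `y` exactly and the loss drops to 0, which is the minimum that gradient descent searches for.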

Taking the partial derivative of the loss function with respect to each θ gives the update rules:

When n >= 1,

$$\theta_{0}:=\theta_{0}-\eta\frac{2}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_{0}^{(i)}$$

$$\theta_{1}:=\theta_{1}-\eta\frac{2}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_{1}^{(i)}$$

$$\theta_{2}:=\theta_{2}-\eta\frac{2}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_{2}^{(i)}$$

In vector form, with $h_{\theta}(x^{(i)})=X_b^{(i)}\theta$, the update for every $j=0,1,\dots,n$ is:

$$\theta_{j}:=\theta_{j}-\eta\frac{2}{m}\sum_{i=1}^{m}\left(X_b^{(i)}\theta-y^{(i)}\right)x_{j}^{(i)}$$

Here $X_b^{(i)}$ is the row vector holding the (n+1) features of the i-th sample; its first entry is the dummy feature $X_0^{(i)}=1$.

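Here η is the learning rate. It determines how large a step we take in the direction that decreases the loss function fastest, so η is a hyperparameter of gradient descent.

In addition, the descent is fast at first and slow later: the size of each step depends on the slope (the derivative), and the slope shrinks as we approach the minimum.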
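The "fast at first, slow later" behavior can be seen on a one-dimensional example. The sketch below uses a hypothetical quadratic loss with its minimum at 2.5 and a learning rate of 0.1; both are assumptions chosen only for illustration.

```python
def J(theta):
    """A simple convex loss with its minimum at theta = 2.5."""
    return (theta - 2.5) ** 2 - 1.0

def dJ(theta):
    """Derivative of J."""
    return 2.0 * (theta - 2.5)

eta = 0.1      # learning rate (hyperparameter)
theta = 0.0    # starting point
steps = []     # size of each update, |eta * gradient|

for _ in range(50):
    step = eta * dJ(theta)
    theta = theta - step
    steps.append(abs(step))

print(theta)                # close to 2.5
print(steps[0], steps[-1])  # the first step is much larger than the last
```

Because each update is η times the derivative, the updates shrink automatically as the slope flattens near the minimum, even though η itself is constant.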

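Note also that not every function has a unique extremum, so gradient descent may stop at a local minimum rather than the global one.

Run the algorithm several times with randomized starting points
The starting point of gradient descent is also a hyperparameter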
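The effect of the starting point can be sketched on a non-convex function. The function `f`, the helper names, and the settings below are hypothetical choices for illustration. `f` has two local minima, and which one gradient descent finds depends on where it starts.

```python
import random

def f(x):
    """A non-convex function with two local minima (near -1.30 and 1.13)."""
    return x ** 4 - 3 * x ** 2 + x

def df(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x0, eta=0.01, n_iters=1000):
    """Plain gradient descent from the starting point x0."""
    x = x0
    for _ in range(n_iters):
        x = x - eta * df(x)
    return x

def best_of_restarts(starts):
    """Run gradient descent from every start and keep the result with the lowest f."""
    results = [gradient_descent(x0) for x0 in starts]
    return min(results, key=f)

random.seed(0)
starts = [random.uniform(-2.0, 2.0) for _ in range(10)]
best = best_of_restarts(starts)
print(best, f(best))
```

A single run started at `2.0` settles in the shallower minimum near 1.13, while a run started at `-2.0` reaches the deeper one near -1.30. Keeping the best of several runs makes the outcome less sensitive to any one start.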

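We start by choosing a set of parameter values at random, compute all the predictions, then assign new values to all the parameters, and repeat this loop until convergence.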

2 Using Gradient Descent in a Linear Regression Model

import numpy as np
from .metrics import r2_score


class LinearRegression:

    def __init__(self):
        """Initialize the Linear Regression model"""
        self.coef_ = None
        self.intercept_ = None
        self._theta = None

    def fit_normal(self, X_train, y_train):
        """Train the Linear Regression model on X_train, y_train with the normal equation"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def fit_bgd(self, X_train, y_train, eta=0.01, n_iters=1e4):
        """Train the Linear Regression model on X_train, y_train using gradient descent"""
        assert X_train.shape[0] == y_train.shape[0]

        def J(theta, X_b, y):
            """Value of the loss J. If eta is too large the value keeps growing and may overflow."""
            try:
                return np.sum((y - X_b.dot(theta)) ** 2) / len(y)
            except Exception:
                return float('inf')

        def dJ_loop(theta, X_b, y):
            """Gradient of the loss J, one component at a time (kept for reference)"""
            res = np.empty(len(theta))
            res[0] = np.sum(X_b.dot(theta) - y)  # theta_0 is handled separately: its column is all ones
            for i in range(1, len(theta)):
                res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
            return res * 2 / len(X_b)

        def dJ(theta, X_b, y):
            """Gradient of the loss J, computed with a matrix-vector product"""
            return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)

        def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
            theta = initial_theta  # starting point for theta
            cur_iter = 0  # cap on the number of iterations

            while cur_iter < n_iters:
                gradient = dJ(theta, X_b, y)
                last_theta = theta
                theta = theta - eta * gradient
                # epsilon: floating point has limited precision, so a change in J
                # smaller than this value is treated as zero
                if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
                    break
                cur_iter += 1
            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])  # prepend a column of ones
        initial_theta = np.zeros(X_b.shape[1])  # initialize theta to zeros
        self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self


    def predict(self, X_predict):
        """Given the data set X_predict, return the vector of predictions"""
        assert self.intercept_ is not None and self.coef_ is not None
        assert X_predict.shape[1] == len(self.coef_)

        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        """Compute the R^2 score of the current model on the test set X_test, y_test"""

        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

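The method above computes the gradient of the loss $J(\theta)$ with a for loop. Can it be written in vector form instead? Certainly.

Each component of the gradient is the dot product of the residual vector $X_b\theta-y$ with one column of $X_b$. Collecting all components, the gradient is the $1\times m$ row vector $(X_b\theta-y)^{T}$ multiplied by the $m\times(n+1)$ matrix $X_b$. Transposing the result gives a column vector:

$$\nabla J(\theta)=\frac{2}{m}\left((X_b\theta-y)^{T}X_b\right)^{T}=\frac{2}{m}X_b^{T}(X_b\theta-y)$$

The final form is:

def dJ(theta, X_b, y):
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)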
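To check that the loop and the vectorized form agree, one can evaluate both on random data. This is a minimal sketch: the names `dJ_loop` and `dJ_vec` and the random test data are assumptions for illustration.

```python
import numpy as np

def dJ_loop(theta, X_b, y):
    """Gradient computed one component at a time."""
    res = np.empty(len(theta))
    for j in range(len(theta)):
        res[j] = (X_b.dot(theta) - y).dot(X_b[:, j])
    return res * 2 / len(X_b)

def dJ_vec(theta, X_b, y):
    """Gradient computed with a single matrix-vector product."""
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)

# Hypothetical random data: 20 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
X_b = np.hstack([np.ones((len(X), 1)), X])
theta = rng.normal(size=X_b.shape[1])

grad_loop = dJ_loop(theta, X_b, y)
grad_vec = dJ_vec(theta, X_b, y)
print(np.allclose(grad_loop, grad_vec))  # True
```

The vectorized version avoids the Python-level loop over features, which is where most of the time goes when the number of features is large.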

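3 A Few Notes

3.1 Data Normalization

When we solve with the normal equation, the answer comes from a closed-form formula with no search involved, so the data does not need to be normalized. Gradient descent, however, involves many search steps, so normalizing the data lets the algorithm converge faster.

Take the house price problem as an example. Suppose we use two features, the size of the house and the number of rooms. Size ranges over 0-2000 square feet, while the number of rooms ranges over 0-5. If we plot the loss function with the two parameters as the horizontal and vertical axes, the contours look very flat and elongated.

Step size = gradient × η. With features on such different scales, one η gives a step that is either too large or too small: too large and the result does not converge, too small and the search is too slow.

The simplest approach is to set $x_n=\frac{x_n-\mu_n}{s_n}$, where $\mu_n$ is the mean and $s_n$ is the standard deviation.

We covered the details in the kNN chapter, so they are not repeated here.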
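The scaling formula can be sketched directly in NumPy. The synthetic "size" and "rooms" columns below are made-up data standing in for the house example. The mean and standard deviation are computed on the training split only and then reused for the test split.

```python
import numpy as np

# Hypothetical data: size in square feet (0-2000) and number of rooms (0-5)
rng = np.random.default_rng(42)
size = rng.uniform(0, 2000, size=100)
rooms = rng.integers(0, 6, size=100).astype(float)
X = np.column_stack([size, rooms])
X_train, X_test = X[:80], X[80:]

# Statistics come from the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std   # reuse the training statistics

print(X_train_std.mean(axis=0))  # approximately 0 for each feature
print(X_train_std.std(axis=0))   # approximately 1 for each feature
```

After scaling, both features have comparable spread, so a single learning rate η suits every direction of the search.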

Here we run a comparison:

Without data normalization:

import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler  # assumed import; the original snippet omits it
from playML.model_selection import train_test_split
from playML.LinearRegression import LinearRegression

# Note: load_boston was removed in scikit-learn 1.2, so this requires an older version
boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)

# Without data normalization, using the default of 1e4 iterations:
lin_reg2 = LinearRegression()
lin_reg2.fit_bgd(X_train, y_train, eta=0.000001)
lin_reg2.score(X_test, y_test)  # 0.27556634853389195, the R^2 is too low

# With 1e6 iterations, training takes 49.9 s
%time lin_reg2.fit_bgd(X_train, y_train, eta=0.000001, n_iters=1e6)
lin_reg2.score(X_test, y_test)  # R^2 0.75418523539807636

# With data normalization
standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)

lin_reg3 = LinearRegression()
%time lin_reg3.fit_bgd(X_train_standard, y_train)  # 258 ms

X_test_standard = standardScaler.transform(X_test)
lin_reg3.score(X_test_standard, y_test)  # 0.81298806201222351

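3.2 Comparison with the Normal Equation

The normal equation cannot be used when the matrix $X^{T}X$ is not invertible. This usually happens because the features are not independent (for example, the data contains both the size in feet and the size in meters), or because the number of features exceeds the number of training samples. Such non-invertible matrices are called singular or degenerate.

For example, when predicting house prices, suppose ${x_{1}}$ is the size of the house in square feet and ${x_{2}}$ is the size in square meters. Since 1 meter equals 3.28 feet (rounded to two decimal places), these two features always satisfy the constraint ${x_{1}}={x_{2}}*{{\left( 3.28 \right)}^{2}}$.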
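The square feet versus square meters example can be checked numerically. In the sketch below the sizes and prices are made-up numbers. `np.linalg.matrix_rank` shows that the design matrix has fewer independent columns than it has columns, which is exactly when $X^{T}X$ is singular. As an aside that goes beyond the original text, `np.linalg.lstsq` (an SVD-based solver, not the normal equation) still returns a minimum-norm solution.

```python
import numpy as np

x2 = np.array([50.0, 80.0, 100.0, 120.0, 150.0])   # size in square meters (hypothetical)
x1 = x2 * 3.28 ** 2                                 # the same size in square feet
X_b = np.column_stack([np.ones_like(x2), x1, x2])   # columns: 1, x1, x2
y = 3.0 + 2.0 * x2                                  # hypothetical prices

# x1 is a multiple of x2, so only 2 of the 3 columns are independent
rank = np.linalg.matrix_rank(X_b)
print(rank, X_b.shape[1])

# Swapped-in technique: SVD-based least squares instead of inverting X^T X
theta, _, lstsq_rank, _ = np.linalg.lstsq(X_b, y, rcond=None)
print(np.allclose(X_b @ theta, y))
```

Dropping one of the two redundant columns would make the normal equation usable again. Gradient descent does not need the inverse at all.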

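Comparison of gradient descent and the normal equation:

Gradient descent	Normal equation
Requires choosing a learning rate $\eta$	No learning rate needed
Requires many iterations	Computed in one step
Works well even when the number of features is large	Requires computing ${{\left( {{X}^{T}}X \right)}^{-1}}$; costly when the number of features n is large, because matrix inversion takes $O\left( {{n}^{3}} \right)$ time; usually still acceptable when n is below 10000
Applies to many types of models	Only applies to linear models, not to logistic regression or other models

In summary, as long as the number of features is not large, the normal equation is a good alternative for computing the parameters $\theta$. Specifically, when the number of features is below ten thousand, I usually use the normal equation rather than gradient descent.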

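4 Stochastic Gradient Descent

4.1 Definition

The gradient descent described so far has to process every sample at each step and is called batch gradient descent. If the number of samples is very large, this becomes very time-consuming. Is there a way to improve on it?

Batch gradient descent computes its direction from the sum of the gradients over all samples. If we instead compute the gradient on a single sample $i$, namely $2\left(X_b^{(i)}\theta-y^{(i)}\right)\left(X_b^{(i)}\right)^{T}$, and use it as the search direction, we get stochastic gradient descent. Note that this is not the gradient direction of the full loss; the full gradient is the direction of steepest descent.

Stochastic gradient descent cannot guarantee that each search direction is a descent direction, let alone the steepest one. Experiments show, however, that although it does not settle exactly at the minimum the way batch gradient descent does, it still ends up near the minimum.

When the number of samples is very large, we are willing to trade some accuracy for time.

Here the learning rate η matters a great deal. If η stays fixed, then even after we reach the neighborhood of the minimum, the randomness of the gradient may push us back out of it. So we let η decrease gradually as the number of search steps grows.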

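The learning rate is set to $\eta=\frac{a}{i_{iters}+b}$, which borrows the idea of simulated annealing. If a=1 and b=0, then going from i_iters = 1 to i_iters = 2 shrinks η by 50%, whereas when i_iters is very large each further reduction is tiny.

Common choices are a = 5 and b = 50. In the code these are named t0 and t1:

$\eta = \frac{t_0}{i_{iters}+t_1}$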
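The decay schedule is easy to tabulate. This small sketch uses a hypothetical standalone `learning_rate` helper. It prints η for a few iteration counts and compares the relative drop early and late in the search for a=1, b=0.

```python
def learning_rate(i_iters, t0=5.0, t1=50.0):
    """Decaying learning rate eta = t0 / (i_iters + t1)."""
    return t0 / (i_iters + t1)

# eta starts at t0 / t1 and decays smoothly toward 0
for i in [0, 1, 10, 100, 1000, 10000]:
    print(i, learning_rate(i))

# With a=1, b=0 the rate halves between step 1 and step 2 ...
drop_early = 1 - learning_rate(2, 1.0, 0.0) / learning_rate(1, 1.0, 0.0)
# ... but barely changes between step 1000 and step 1001
drop_late = 1 - learning_rate(1001, 1.0, 0.0) / learning_rate(1000, 1.0, 0.0)
print(drop_early, drop_late)
```

A nonzero b (t1) softens the steep drop of the first few steps, which is why values such as a=5, b=50 are preferred over a=1, b=0.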

    def fit_sgd(self, X_train, y_train, n_iters=50, t0=5, t1=50):
        """Train the Linear Regression model using stochastic gradient descent"""
        assert X_train.shape[0] == y_train.shape[0]
        assert n_iters >= 1

        def dJ_sgd(theta, X_b_i, y_i):
            """Gradient of the loss on a single sample"""
            return X_b_i * (X_b_i.dot(theta) - y_i) * 2.

        def sgd(X_b, y, initial_theta, n_iters=5, t0=5, t1=50):
            """Search for theta one sample at a time"""

            def learning_rate(t):
                """Decaying learning rate for stochastic gradient descent"""
                return t0 / (t + t1)

            theta = initial_theta
            m = len(X_b)
            # n_iters is the number of passes over the whole data set
            for i_iter in range(n_iters):
                indexes = np.random.permutation(m)  # shuffle the sample order for this pass
                X_b_new = X_b[indexes, :]
                y_new = y[indexes]
                # Unlike batch gradient descent, we do not compare successive losses,
                # because a step is not guaranteed to move closer to the minimum.
                # We simply fix the number of passes and visit all m samples in each.
                for i in range(m):
                    gradient = dJ_sgd(theta, X_b_new[i], y_new[i])
                    theta = theta - learning_rate(i_iter * m + i) * gradient

            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.random.randn(X_b.shape[1])
        self._theta = sgd(X_b, y_train, initial_theta, n_iters, t0, t1)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

4.2 Comparison with Batch Gradient Descent

import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler  # assumed import; the original snippet omits it
from playML.model_selection import train_test_split
from playML.LinearRegression import LinearRegression

# Note: load_boston was removed in scikit-learn 1.2, so this requires an older version
boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)

standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)
lin_reg3 = LinearRegression()

# Batch gradient descent
%time lin_reg3.fit_bgd(X_train_standard, y_train)  # 258 ms
lin_reg3.score(X_test_standard, y_test)  # 0.81298806201222351

# Stochastic gradient descent
lin_reg = LinearRegression()  # the original snippet uses lin_reg without creating it
%time lin_reg.fit_sgd(X_train_standard, y_train, n_iters=2)  # 13.5 ms
lin_reg.score(X_test_standard, y_test)  # 0.78651716204682975

# Increase n_iters to 50
%time lin_reg.fit_sgd(X_train_standard, y_train, n_iters=50)  # 158 ms
lin_reg.score(X_test_standard, y_test)  # 0.80857287165738345

# Increase n_iters to 100; the result is now very close to the minimum
%time lin_reg.fit_sgd(X_train_standard, y_train, n_iters=100)  # 287 ms
lin_reg.score(X_test_standard, y_test)  # 0.81294846132723497
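Because the comparison above depends on the `playML` package and on a data set that newer scikit-learn versions no longer ship, here is a self-contained sketch of the same comparison on synthetic data. The data (y = 4x + 3 plus noise), the standalone `batch_gd` and `sgd` functions, and all settings are assumptions for illustration. It does not reproduce the timings or scores quoted above.

```python
import numpy as np

# Hypothetical synthetic data: y = 4x + 3 plus Gaussian noise
rng = np.random.default_rng(666)
m = 200
x = rng.normal(size=m)
y = 4.0 * x + 3.0 + rng.normal(scale=0.5, size=m)
X_b = np.column_stack([np.ones(m), x])

def batch_gd(X_b, y, eta=0.01, n_iters=10000, epsilon=1e-8):
    """Batch gradient descent: every step uses all samples."""
    theta = np.zeros(X_b.shape[1])
    last_J = np.mean((X_b @ theta - y) ** 2)
    for _ in range(n_iters):
        gradient = X_b.T @ (X_b @ theta - y) * 2.0 / len(y)
        theta = theta - eta * gradient
        cur_J = np.mean((X_b @ theta - y) ** 2)
        if abs(cur_J - last_J) < epsilon:  # stop once the loss stops changing
            break
        last_J = cur_J
    return theta

def sgd(X_b, y, n_iters=50, t0=5.0, t1=50.0):
    """Stochastic gradient descent: every step uses one sample."""
    theta = np.zeros(X_b.shape[1])
    m = len(X_b)
    for i_iter in range(n_iters):               # passes over the data
        for k, i in enumerate(rng.permutation(m)):
            gradient = 2.0 * X_b[i] * (X_b[i] @ theta - y[i])
            eta = t0 / (i_iter * m + k + t1)    # decaying learning rate
            theta = theta - eta * gradient
    return theta

theta_exact = np.linalg.lstsq(X_b, y, rcond=None)[0]  # reference solution
theta_bgd = batch_gd(X_b, y)
theta_sgd = sgd(X_b, y)
print(theta_exact, theta_bgd, theta_sgd)
```

On data like this both methods land close to the least-squares solution. The batch result is typically the closer of the two, which matches the accuracy-for-time trade-off described in section 4.1.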

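4.3 SGD in scikit-learn

scikit-learn applies many optimizations to SGD. See the source code for the details (I have not read it myself). It is also much faster than our implementation. Here we only cover the most basic principle.

5 Debugging the Gradient

How can we find out whether the gradient we computed is wrong?

The gradient at a point (the red point in the original figure) can be approximated by the slope of the line through two nearby points on either side of it (the blue points), just like the mathematical definition of the derivative: $\frac{dJ}{d\theta}\approx\frac{J(\theta+\varepsilon)-J(\theta-\varepsilon)}{2\varepsilon}$.

So we can compare our result with the debug result to find out whether our gradient is correct.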

def dJ_debug(theta, X_b, y, epsilon=0.01):
    """Debug gradient: approximate each partial derivative by a central difference"""
    # J is the loss function J(theta, X_b, y) defined in section 2
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[i] += epsilon
        theta_2 = theta.copy()
        theta_2[i] -= epsilon
        res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon)
    return res
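A gradient check of this kind might be run as follows. This is a minimal sketch: the random data, the name `dJ_math` for the analytic gradient, and the deliberately wrong `dJ_buggy` (which forgets the factor 2) are all assumptions for illustration.

```python
import numpy as np

def J(theta, X_b, y):
    """Mean squared error loss."""
    return np.sum((y - X_b.dot(theta)) ** 2) / len(y)

def dJ_math(theta, X_b, y):
    """Analytic gradient of J."""
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)

def dJ_buggy(theta, X_b, y):
    """A deliberately wrong gradient: the factor 2 is missing."""
    return X_b.T.dot(X_b.dot(theta) - y) / len(y)

def dJ_debug(theta, X_b, y, epsilon=0.01):
    """Numerical gradient from central differences."""
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[i] += epsilon
        theta_2 = theta.copy()
        theta_2[i] -= epsilon
        res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon)
    return res

# Hypothetical random data: 50 samples, 3 features
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
X_b = np.hstack([np.ones((len(X), 1)), X])
theta = rng.normal(size=X_b.shape[1])

numeric = dJ_debug(theta, X_b, y)
print(np.allclose(dJ_math(theta, X_b, y), numeric))   # the analytic gradient agrees
print(np.allclose(dJ_buggy(theta, X_b, y), numeric))  # the buggy gradient does not
```

The numerical gradient needs two loss evaluations per parameter, so it is slow. It is best used once on a small sample to validate the analytic gradient, and then switched off for training.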