Gradient Descent
1. What Is Gradient Descent
A search-based optimization method whose purpose is to minimize a loss function.
It is not a machine learning algorithm in itself!
- $\eta$ is called the learning rate, a hyperparameter of gradient descent. Its value affects how quickly the optimum is reached; a poorly chosen value may fail to reach the optimum at all.
- Not every function has a unique extremum.
  Solution: run the algorithm several times with randomized initial points (the initial point is itself a hyperparameter).
2. Implementing Gradient Descent in Code
```python
def gradient_descent(initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    theta_history.append(initial_theta)
    i_iter = 0
    while i_iter < n_iters:
        gradient = dJ(theta)            # derivative of the loss w.r.t. theta
        last_theta = theta
        theta = theta - eta * gradient  # gradient descent step
        theta_history.append(theta)
        # stop once the loss barely changes between iterations
        if abs(J(theta) - J(last_theta)) < epsilon:
            break
        i_iter += 1
    return theta
```
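To make the loop concrete, here is a self-contained sketch that reuses the same routine on a hypothetical one-dimensional loss $J(\theta) = (\theta - 2.5)^2 + 1$, whose minimum sits at $\theta = 2.5$ (the loss, its derivative, and the starting point are illustrative choices, not from the original text):

```python
import numpy as np

# Hypothetical 1-D loss with its minimum at theta = 2.5
def J(theta):
    return (theta - 2.5) ** 2 + 1

def dJ(theta):
    return 2 * (theta - 2.5)

theta_history = []

def gradient_descent(initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    theta_history.append(initial_theta)
    i_iter = 0
    while i_iter < n_iters:
        gradient = dJ(theta)            # derivative of the loss w.r.t. theta
        last_theta = theta
        theta = theta - eta * gradient  # gradient descent step
        theta_history.append(theta)
        # stop once the loss barely changes between iterations
        if abs(J(theta) - J(last_theta)) < epsilon:
            break
        i_iter += 1
    return theta

theta = gradient_descent(initial_theta=0.0, eta=0.1)
```

With $\eta = 0.1$ each step shrinks the distance to the minimum by a constant factor, so `theta` ends up very close to 2.5; `theta_history` records the whole trajectory, which is handy for plotting.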
3. Gradient Descent for Linear Regression
Loss function:

$$J(\theta) = MSE(y, \hat{y}) = \frac{1}{m}\sum_{i=1}^m \left(y^{(i)} - \hat{y}^{(i)}\right)^2 = \frac{1}{m}\sum_{i=1}^m \left(y^{(i)} - \theta_0 - \theta_1 X_1^{(i)} - \dots - \theta_n X_n^{(i)}\right)^2$$
Gradient:

$$\nabla J(\theta) = \begin{pmatrix} \partial J/\partial \theta_0 \\ \partial J/\partial \theta_1 \\ \partial J/\partial \theta_2 \\ \vdots \\ \partial J/\partial \theta_n \end{pmatrix} = \frac{1}{m}\begin{pmatrix} \sum_{i=1}^m 2\left(y^{(i)} - X_b^{(i)}\theta\right)\cdot(-1) \\ \sum_{i=1}^m 2\left(y^{(i)} - X_b^{(i)}\theta\right)\cdot\left(-X_1^{(i)}\right) \\ \sum_{i=1}^m 2\left(y^{(i)} - X_b^{(i)}\theta\right)\cdot\left(-X_2^{(i)}\right) \\ \vdots \\ \sum_{i=1}^m 2\left(y^{(i)} - X_b^{(i)}\theta\right)\cdot\left(-X_n^{(i)}\right) \end{pmatrix} = \frac{2}{m}\cdot\begin{pmatrix} \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_0^{(i)} \\ \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_1^{(i)} \\ \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_2^{(i)} \\ \vdots \\ \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_n^{(i)} \end{pmatrix}$$
where $X_0^{(i)} \equiv 1$ (the intercept column) and

$$X_b = \begin{pmatrix} X_0^{(1)} & X_1^{(1)} & \dots & X_n^{(1)} \\ X_0^{(2)} & X_1^{(2)} & \dots & X_n^{(2)} \\ X_0^{(3)} & X_1^{(3)} & \dots & X_n^{(3)} \\ \vdots & \vdots & \ddots & \vdots \\ X_0^{(m)} & X_1^{(m)} & \dots & X_n^{(m)} \end{pmatrix}$$
Vectorized form:

$$\nabla J(\theta) = \frac{2}{m}\cdot X_b^T\cdot (X_b\theta - y)$$
Before using gradient descent, it is best to normalize the data first (otherwise features on very different scales can make the gradient blow up).
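As a sanity check on the vectorized form, the sketch below (on made-up random data) computes the gradient both component by component from the summation formula and via $\frac{2}{m} X_b^T (X_b\theta - y)$; the two must agree:

```python
import numpy as np

np.random.seed(42)
m, n = 100, 3
X = np.random.random((m, n))
X_b = np.hstack([np.ones((m, 1)), X])      # prepend the X_0 = 1 column
theta = np.random.random(n + 1)            # arbitrary current parameters
y = X_b.dot(np.array([4.0, 3.0, 2.0, 1.0]))  # hypothetical targets

# Component-wise gradient, following the summation formula
grad_loop = np.empty(n + 1)
for j in range(n + 1):
    grad_loop[j] = 2 / m * np.sum((X_b.dot(theta) - y) * X_b[:, j])

# Vectorized gradient: (2/m) * X_b^T (X_b theta - y)
grad_vec = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
```

The vectorized version is both shorter and much faster, since it replaces the Python loop with a single matrix product.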
4. Batch Gradient vs. Stochastic Gradient
Batch gradient (uses all $m$ rows):

$$\nabla J(\theta) = \frac{2}{m}\cdot\begin{pmatrix} \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_0^{(i)} \\ \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_1^{(i)} \\ \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_2^{(i)} \\ \vdots \\ \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_n^{(i)} \end{pmatrix} = \frac{2}{m}\cdot X_b^T\cdot (X_b\theta - y)$$
Stochastic gradient (uses a single random row $i$):

$$\nabla J(\theta) = 2\cdot\begin{pmatrix} \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_0^{(i)} \\ \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_1^{(i)} \\ \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_2^{(i)} \\ \vdots \\ \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_n^{(i)} \end{pmatrix} = 2\cdot \left(X_b^{(i)}\right)^T\cdot \left(X_b^{(i)}\theta - y^{(i)}\right)$$
5. Stochastic Gradient Descent
- Advantages: runs faster, and the noise in each step helps it jump out of local optima.
- In stochastic gradient descent the learning rate $\eta$ needs to decay over time; a common schedule is

$$\eta = \frac{a}{i\_iters + b}$$

where $a$ and $b$ are two hyperparameters.
- Code implementation

```python
def sgd(X_b, y, initial_theta, n_iters):
    t0 = 5
    t1 = 50

    def learning_rate(t):
        # decaying schedule eta = t0 / (t + t1)
        return t0 / (t + t1)

    theta = initial_theta
    for cur_iter in range(n_iters):
        # pick one random sample and step along its (noisy) gradient
        rand_i = np.random.randint(len(X_b))
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
        theta = theta - learning_rate(cur_iter) * gradient
    return theta
```
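A self-contained way to try this out: the `dJ_sgd` helper below implements the single-sample gradient from section 4, and the synthetic line-fitting data ($y \approx 4x + 3$ plus noise) is an illustrative assumption, not from the original text:

```python
import numpy as np

# Stochastic gradient for one sample: 2 * (X_b_i)^T (X_b_i theta - y_i)
def dJ_sgd(theta, X_b_i, y_i):
    return 2.0 * X_b_i.T.dot(X_b_i.dot(theta) - y_i)

def sgd(X_b, y, initial_theta, n_iters):
    t0 = 5
    t1 = 50

    def learning_rate(t):
        # decaying schedule eta = t0 / (t + t1)
        return t0 / (t + t1)

    theta = initial_theta
    for cur_iter in range(n_iters):
        # pick one random sample and step along its (noisy) gradient
        rand_i = np.random.randint(len(X_b))
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
        theta = theta - learning_rate(cur_iter) * gradient
    return theta

np.random.seed(666)
m = 10000
x = np.random.normal(size=m)
X = x.reshape(-1, 1)
y = 4.0 * x + 3.0 + np.random.normal(0, 1, size=m)  # true line: y = 4x + 3
X_b = np.hstack([np.ones((len(X), 1)), X])

theta = sgd(X_b, y, np.zeros(X_b.shape[1]), n_iters=len(X_b) // 3)
```

Even though each step sees only one sample, and only a third as many steps as there are samples are taken, `theta` still lands near the true intercept 3 and slope 4.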
6. SGD in scikit-learn
```python
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=100)  # the old n_iter parameter is now max_iter
sgd_reg.fit(X_train, y_train)
sgd_reg.score(X_test, y_test)
```
Note: SGDRegressor lives in linear_model, so it can only fit linear models.
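A fuller runnable version of the snippet above might look like the following; the synthetic data and the StandardScaler step are assumptions added here (SGD is sensitive to feature scale, echoing the normalization advice in section 3):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(666)
m = 1000
x = np.random.normal(size=(m, 1))
y = (4.0 * x).ravel() + 3.0 + np.random.normal(0, 1, size=m)  # y = 4x + 3 + noise

X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=666)

# Standardize features using statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

sgd_reg = SGDRegressor(max_iter=100)
sgd_reg.fit(X_train_std, y_train)
score = sgd_reg.score(X_test_std, y_test)  # R^2 on the held-out set
```

Fitting the scaler on the training split and reusing it on the test split avoids leaking test statistics into training.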
7. Mini-Batch Gradient Descent
Batch gradient descent computes the gradient from all rows; its advantage is that the loss decreases stably.
Stochastic gradient descent computes the gradient from a single row; its advantage is speed.
Mini-batch gradient descent combines the two: it computes the gradient from k rows at a time.
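A minimal sketch of the mini-batch idea, assuming the same linear-regression setting and the decaying learning-rate schedule from section 5 (the batch size k, the helper names, and the synthetic data are all illustrative):

```python
import numpy as np

def dJ_mini_batch(theta, X_b_batch, y_batch):
    # Gradient estimated from k rows: (2/k) * X_batch^T (X_batch theta - y_batch)
    k = len(X_b_batch)
    return 2.0 / k * X_b_batch.T.dot(X_b_batch.dot(theta) - y_batch)

def mini_batch_gd(X_b, y, initial_theta, n_epochs=50, k=16, t0=5, t1=50):
    theta = initial_theta
    t = 0
    for _ in range(n_epochs):
        # shuffle once per epoch, then sweep over batches of k rows
        indexes = np.random.permutation(len(X_b))
        X_b_shuffled, y_shuffled = X_b[indexes], y[indexes]
        for i in range(0, len(X_b), k):
            gradient = dJ_mini_batch(theta, X_b_shuffled[i:i + k],
                                     y_shuffled[i:i + k])
            theta = theta - t0 / (t + t1) * gradient  # decaying learning rate
            t += 1
    return theta

np.random.seed(666)
m = 2000
x = np.random.normal(size=m)
X_b = np.hstack([np.ones((m, 1)), x.reshape(-1, 1)])
y = 4.0 * x + 3.0 + np.random.normal(0, 1, size=m)  # true line: y = 4x + 3

theta = mini_batch_gd(X_b, y, np.zeros(2))
```

Averaging over k rows makes each step less noisy than pure SGD while staying far cheaper than a full pass over all m rows.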
8. A General Way to Compute Gradients
When an analytic gradient is hard to derive (or you want to verify one), each derivative can be approximated numerically.
Computing a derivative: $\dfrac{dJ}{d\theta} \approx \dfrac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$
Computing the gradient: apply the same two-point (central) difference to each component $\theta_j$ of $\theta$ in turn.
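Following that recipe, a debug-style numerical gradient can be sketched as below and checked against the analytic MSE gradient from section 3 (the small random dataset and the helper name `dJ_debug` are assumptions for illustration):

```python
import numpy as np

def dJ_debug(J, theta, epsilon=0.01):
    # Central difference per component: dJ/dtheta_j ~ (J(theta + eps*e_j) - J(theta - eps*e_j)) / (2*eps)
    res = np.empty(len(theta))
    for j in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[j] += epsilon
        theta_2 = theta.copy()
        theta_2[j] -= epsilon
        res[j] = (J(theta_1) - J(theta_2)) / (2 * epsilon)
    return res

np.random.seed(42)
m, n = 50, 3
X_b = np.hstack([np.ones((m, 1)), np.random.random((m, n))])
y = np.random.random(m)
theta = np.random.random(n + 1)

def J(theta):
    # MSE loss from section 3
    return np.sum((y - X_b.dot(theta)) ** 2) / m

grad_math = 2.0 / m * X_b.T.dot(X_b.dot(theta) - y)  # analytic gradient
grad_debug = dJ_debug(J, theta)                      # numerical gradient
```

The numerical version only needs to evaluate $J$, so it works for any differentiable loss; it is much slower than the analytic gradient, which makes it a verification tool rather than a production choice.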