liuyubobobo "Machine Learning" Study Notes (9)

Gradient Descent


1. What is gradient descent

Gradient descent is a search-based optimization method whose job is to minimize a loss function.

It is not a machine learning algorithm in itself!

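Each iteration moves the parameters a small step in the direction opposite to the gradient of the loss; this update rule is exactly what the code in section 2 implements:

$\theta \leftarrow \theta - \eta \cdot \nabla J(\theta)$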

  • η (the learning rate) is a hyperparameter of gradient descent; its value affects how quickly the optimum is reached, and an unsuitable value may prevent the method from finding the optimum at all.

  • Not every function has a unique extremum.

    Workaround: run the algorithm several times with randomized initial points (the initial point is itself a hyperparameter).

2. Code implementation of gradient descent
def gradient_descent(initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    # J(theta), dJ(theta) (the loss and its derivative) and the list
    # theta_history are assumed to be defined outside this function
    theta = initial_theta
    theta_history.append(initial_theta)
    i_iter = 0

    while i_iter < n_iters:
        gradient = dJ(theta)             # derivative of the loss with respect to theta
        last_theta = theta
        theta = theta - eta * gradient   # gradient descent step
        theta_history.append(theta)

        if abs(J(theta) - J(last_theta)) < epsilon:
            break                        # stop once the loss barely changes
        i_iter += 1

    return theta
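
A minimal usage sketch, assuming a hypothetical one-dimensional loss (J, dJ and theta_history are not given in the notes, so the definitions below are illustrative only):

def J(theta):                       # illustrative loss, not from the notes
    return (theta - 2.5) ** 2 - 1.0

def dJ(theta):                      # its derivative
    return 2.0 * (theta - 2.5)

theta_history = []                  # record of every theta visited, e.g. for plotting
theta = gradient_descent(initial_theta=0.0, eta=0.1)
print(theta)                        # converges to roughly 2.5
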
3. Gradient descent in linear regression

Loss function: $J(\theta) = MSE(y, \hat{y}) = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \theta_0 - \theta_1 X_1^{(i)} - \dots - \theta_n X_n^{(i)}\right)^2$

Gradient: $\nabla J(\theta) = \begin{pmatrix} \partial J/\partial\theta_0 \\ \partial J/\partial\theta_1 \\ \partial J/\partial\theta_2 \\ \vdots \\ \partial J/\partial\theta_n \end{pmatrix} = \frac{1}{m}\begin{pmatrix} \sum_{i=1}^{m} 2\left(y^{(i)} - X_b^{(i)}\theta\right)\cdot(-1) \\ \sum_{i=1}^{m} 2\left(y^{(i)} - X_b^{(i)}\theta\right)\cdot\left(-X_1^{(i)}\right) \\ \sum_{i=1}^{m} 2\left(y^{(i)} - X_b^{(i)}\theta\right)\cdot\left(-X_2^{(i)}\right) \\ \vdots \\ \sum_{i=1}^{m} 2\left(y^{(i)} - X_b^{(i)}\theta\right)\cdot\left(-X_n^{(i)}\right) \end{pmatrix} = \frac{2}{m}\begin{pmatrix} \sum_{i=1}^{m}\left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_0^{(i)} \\ \sum_{i=1}^{m}\left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_1^{(i)} \\ \sum_{i=1}^{m}\left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_2^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_n^{(i)} \end{pmatrix}$

where $X_b = \begin{pmatrix} X_0^{(1)} & X_1^{(1)} & \dots & X_n^{(1)} \\ X_0^{(2)} & X_1^{(2)} & \dots & X_n^{(2)} \\ X_0^{(3)} & X_1^{(3)} & \dots & X_n^{(3)} \\ \vdots & \vdots & \ddots & \vdots \\ X_0^{(m)} & X_1^{(m)} & \dots & X_n^{(m)} \end{pmatrix}$, with $X_0^{(i)} \equiv 1$.

Vectorized form: $\nabla J(\theta) = \frac{2}{m}\cdot X_b^T\cdot\left(X_b\theta - y\right)$
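
A minimal NumPy sketch of this vectorized gradient (the function name dJ and the argument order are my own choices, not from the notes):

import numpy as np

def dJ(theta, X_b, y):
    # gradient of the MSE loss: (2/m) * X_b^T * (X_b @ theta - y)
    return X_b.T.dot(X_b.dot(theta) - y) * 2.0 / len(y)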

Before using gradient descent, it is best to normalize the data first (to keep the gradient from becoming too large).

4. Batch gradient vs. stochastic gradient

Batch gradient: $\nabla J(\theta) = \frac{2}{m}\begin{pmatrix} \sum_{i=1}^{m}\left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_0^{(i)} \\ \sum_{i=1}^{m}\left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_1^{(i)} \\ \sum_{i=1}^{m}\left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_2^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_n^{(i)} \end{pmatrix} = \frac{2}{m}\cdot X_b^T\cdot\left(X_b\theta - y\right)$

Stochastic gradient: $\nabla J(\theta) = 2\begin{pmatrix} \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_0^{(i)} \\ \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_1^{(i)} \\ \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_2^{(i)} \\ \vdots \\ \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_n^{(i)} \end{pmatrix} = 2\cdot\left(X_b^{(i)}\right)^T\cdot\left(X_b^{(i)}\theta - y^{(i)}\right)$, computed from a single randomly chosen sample $i$.
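
A sketch of the single-sample gradient this formula describes; the name dJ_sgd matches what the SGD code in section 5 expects, but the exact implementation is my own:

def dJ_sgd(theta, X_b_i, y_i):
    # X_b_i is one row of X_b (a 1-D NumPy array), y_i the matching label
    # gradient estimated from one sample: 2 * (X_b^(i))^T * (X_b^(i) theta - y^(i))
    return 2.0 * X_b_i * (X_b_i.dot(theta) - y_i)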

5. Stochastic Gradient Descent


  • Advantages: faster runtime, and the randomness can help jump out of local optima.
  • In stochastic gradient descent the learning rate η needs to decrease over time; construct a function:

$\eta = \frac{a}{i\_iters + b}$

where $a$ and $b$ are two hyperparameters.

  • Code implementation
import numpy as np

def sgd(X_b, y, initial_theta, n_iters):
    # dJ_sgd is the single-sample gradient (sketched at the end of section 4)
    t0 = 5
    t1 = 50

    def learning_rate(t):
        # decaying learning rate: eta = t0 / (t + t1)
        return t0 / (t + t1)

    theta = initial_theta
    for cur_iter in range(n_iters):
        rand_i = np.random.randint(len(X_b))              # pick one sample at random
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
        theta = theta - learning_rate(cur_iter) * gradient

    return theta
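
A hedged usage sketch on synthetic data (the data-generating process and the number of iterations are illustrative assumptions, not from the notes):

import numpy as np

np.random.seed(666)
m = 100000
x = np.random.normal(size=m)
X = x.reshape(-1, 1)
y = 4.0 * x + 3.0 + np.random.normal(0.0, 3.0, size=m)

X_b = np.hstack([np.ones((len(X), 1)), X])    # prepend the column of ones (X_0 = 1)
initial_theta = np.zeros(X_b.shape[1])
theta = sgd(X_b, y, initial_theta, n_iters=len(X_b) // 3)
print(theta)                                  # should land near [3, 4]
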
6. SGD in scikit-learn
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=100)   # older scikit-learn versions called this parameter n_iter
sgd_reg.fit(X_train, y_train)
sgd_reg.score(X_test, y_test)

Note: SGDRegressor is part of sklearn.linear_model, so it can only fit linear models.

7. Mini-Batch Gradient Descent

Batch gradient descent uses every row (all m samples) to compute the gradient. Advantage: the loss function decreases stably.

Stochastic gradient descent picks a single row to compute the gradient. Advantage: it is fast to compute.

Mini-batch gradient descent combines the strengths of both: it picks k rows to compute the gradient, as sketched below.
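
A minimal sketch of that idea (the batch size k and the function name dJ_mini_batch are assumptions for illustration):

import numpy as np

def dJ_mini_batch(theta, X_b, y, k=10):
    # estimate the gradient from k randomly chosen rows instead of all m or just one
    idx = np.random.randint(0, len(X_b), size=k)
    X_batch, y_batch = X_b[idx], y[idx]
    return X_batch.T.dot(X_batch.dot(theta) - y_batch) * 2.0 / k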

8. A general method for computing gradients

Computing the derivative

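The figure here presumably illustrated the finite-difference idea; the standard central-difference approximation of a derivative is

$\frac{dJ}{d\theta} \approx \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2\varepsilon}$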

Computing the gradient

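A sketch of applying the same idea coordinate by coordinate to approximate the full gradient (often used to debug an analytic gradient; the function name and signature are my own):

import numpy as np

def dJ_debug(J, theta, epsilon=0.01):
    # J: loss function of theta (a 1-D NumPy array)
    # approximate each partial derivative with a central difference
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        res[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    return res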
