CMU 11-785 L05 Convergence

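This post looks at how backpropagation and gradient descent affect convergence when training neural networks. For a single scalar parameter with a quadratic objective, a Taylor expansion gives the optimal step size, and any smaller step converges monotonically. In the multivariate case, each coordinate has its own optimal learning rate, and normalizing the objective makes those rates identical. Because the Hessian of a complex model is hard to compute and a non-convex loss can lead to divergence, the choice of learning rate is critical. A suitably adjusted learning rate can also help the model escape local optima.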

Backpropagation

  • The divergence function that is minimized (e.g., a softmax-based loss) is only a proxy for classification error
  • Minimizing the divergence may not minimize the classification error
    • Backprop may fail to separate the points even though they are linearly separable
    • This is because the separating solution is not a feasible optimum for the loss function
  • Compared to the perceptron
    • The perceptron rule has low bias (it makes no errors if that is possible)
      • But high variance (it swings wildly in response to small changes in the input)
    • Backprop is minimally changed by new training instances
      • It prefers consistency over perfection (which is good)

Convergence

Univariate inputs

  • For quadratic surfaces

$$\text{Minimize } E=\frac{1}{2} a w^{2}+b w+c$$

$$w^{(k+1)}=w^{(k)}-\eta \frac{d E\left(w^{(k)}\right)}{d w}$$

  • Gradient descent with fixed step size $\eta$ to estimate scalar parameter $w$
  • Using Taylor expansion

$$E(w)=E\left(w^{(k)}\right)+E^{\prime}\left(w^{(k)}\right)\left(w-w^{(k)}\right)+\frac{1}{2} E^{\prime\prime}\left(w^{(k)}\right)\left(w-w^{(k)}\right)^{2}$$

  • So we can get the optimum step size $\eta_{opt} = E^{\prime\prime}(w^{(k)})^{-1}$
    • For $\eta < \eta_{opt}$ the algorithm will converge monotonically
    • For $2\eta_{opt} > \eta > \eta_{opt}$, we have oscillating convergence
    • For $\eta > 2\eta_{opt}$, we get divergence
  • For generic differentiable convex objectives
    • A Taylor expansion can also be used to approximate the objective locally
    • This gives Newton's method, with the step size recomputed at every iteration

$$\eta_{opt}=\left(\frac{d^{2} E\left(w^{(k)}\right)}{d w^{2}}\right)^{-1}$$
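The three step-size regimes can be checked numerically. For the quadratic $E=\frac{1}{2}aw^2+bw+c$ with minimum $w^*=-b/a$, one gradient step gives $w^{(k+1)}-w^*=(1-\eta a)(w^{(k)}-w^*)$, so the error shrinks monotonically, flips sign while shrinking, or grows depending on where $\eta$ sits relative to $\eta_{opt}=1/a$. The sketch below is only an illustration: the coefficients, the starting point, and the helper names `gradient_descent_1d` and `errors` are assumptions, not part of the lecture.

```python
def gradient_descent_1d(a, b, eta, w0, steps):
    """Fixed-step gradient descent on E(w) = 0.5*a*w**2 + b*w + c."""
    w = w0
    trajectory = [w]
    for _ in range(steps):
        grad = a * w + b           # dE/dw for the quadratic
        w = w - eta * grad
        trajectory.append(w)
    return trajectory


# Hypothetical coefficients chosen for illustration; the minimum is at w* = -b/a.
a, b = 2.0, -4.0
w_star = -b / a
eta_opt = 1.0 / a                  # inverse of the second derivative


def errors(eta, steps=20):
    """Signed distance to the minimum after each step."""
    return [w - w_star for w in gradient_descent_1d(a, b, eta, w0=5.0, steps=steps)]


err_small = errors(0.5 * eta_opt)  # eta < eta_opt
err_opt = errors(eta_opt)          # eta = eta_opt
err_osc = errors(1.5 * eta_opt)    # eta_opt < eta < 2 * eta_opt
err_div = errors(2.5 * eta_opt)    # eta > 2 * eta_opt

for name, err in [("small", err_small), ("opt", err_opt),
                  ("oscillating", err_osc), ("divergent", err_div)]:
    print(name, [round(e, 4) for e in err[:5]])
```

With $\eta=\eta_{opt}$ the quadratic is solved in a single step, which is exactly what the Newton step size above achieves for a quadratic objective.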

Multivariate inputs

  • Quadratic convex function

$$E=\frac{1}{2} \mathbf{w}^{T} \mathbf{A} \mathbf{w}+\mathbf{w}^{T} \mathbf{b}+c$$

  • If $\mathbf{A}$ is diagonal

$$E=\sum_{i}\left(\frac{1}{2} a_{ii} w_{i}^{2}+b_{i} w_{i}\right)+c$$

  • We can optimize each coordinate independently
    • For example, $\eta_{1,opt} = a_{11}^{-1}$ and $\eta_{2,opt} = a_{22}^{-1}$
    • But the optimal learning rate is different for different coordinates
  • If gradient descent updates the entire vector with a single learning rate, it must satisfy

$$\eta < 2 \min_i \eta_{i,opt}$$

  • This, however, makes learning very slow if $\frac{\max_i \eta_{i,opt}}{\min_i \eta_{i,opt}}$ is large
  • Solution: Normalize the objective to have identical eccentricity in all directions
    • Then all of them will have identical optimal learning rates
    • Easier to find a working learning rate
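To make the cost of a single shared learning rate concrete, here is a small sketch on a diagonal quadratic whose two curvatures differ by a factor of 100. The values of `a_diag` and `b`, the starting point, and the helper `run_gd` are assumptions chosen for illustration. A shared $\eta$ just below $2\min_i \eta_{i,opt}$ converges quickly along the steep coordinate but is still far from the minimum along the flat one after 100 steps, while a shared $\eta$ just above the bound diverges.

```python
import numpy as np

# Hypothetical diagonal curvatures a_11, a_22 with a 100x spread.
a_diag = np.array([1.0, 100.0])
b = np.array([-1.0, -100.0])       # chosen so the minimum is at w* = (1, 1)
w_star = -b / a_diag

eta_i_opt = 1.0 / a_diag           # per-coordinate optimal step sizes
eta_max = 2.0 * eta_i_opt.min()    # a shared eta must stay below this bound


def run_gd(eta, steps):
    """Gradient descent on the diagonal quadratic with one shared step size."""
    w = np.zeros(2)
    for _ in range(steps):
        w = w - eta * (a_diag * w + b)
    return w


w_safe = run_gd(0.9 * eta_max, steps=100)    # below the bound: converges, slowly
w_unsafe = run_gd(1.1 * eta_max, steps=100)  # above the bound: diverges

# One step with per-coordinate optimal rates lands exactly on the minimum.
w_start = np.zeros(2)
w_per_coord = w_start - eta_i_opt * (a_diag * w_start + b)

print("shared eta below bound:", w_safe)
print("shared eta above bound:", w_unsafe)
print("per-coordinate rates:  ", w_per_coord)
```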
  • Target

$$E=\frac{1}{2} \widehat{\mathbf{w}}^{T} \widehat{\mathbf{w}}+\hat{\mathbf{b}}^{T} \widehat{\mathbf{w}}+c$$

  • So let $\widehat{\mathbf{w}}=\mathbf{S}\mathbf{w}$ with $\mathbf{S}=\mathbf{A}^{0.5}$, so that $\widehat{\mathbf{w}}=\mathbf{A}^{0.5}\mathbf{w}$ and $\hat{\mathbf{b}}=\mathbf{A}^{-0.5}\mathbf{b}$
  • Gradient descent rule

$$\widehat{\mathbf{w}}^{(k+1)}=\widehat{\mathbf{w}}^{(k)}-\eta \nabla_{\widehat{\mathbf{w}}} E\left(\widehat{\mathbf{w}}^{(k)}\right)^{T}$$

$$\mathbf{w}^{(k+1)}=\mathbf{w}^{(k)}-\eta \mathbf{A}^{-1} \nabla_{\mathbf{w}} E\left(\mathbf{w}^{(k)}\right)^{T}$$

  • So we just need to calculate $\mathbf{A}^{-1}$, and the optimal step size is then the same in every direction ($\eta_{opt} = 1$)
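The normalized rule can be sanity-checked on a small non-diagonal quadratic. In this sketch the matrix `A`, the vector `b`, and the starting point `w0` are assumed values, and the step is computed with `np.linalg.solve` instead of forming $\mathbf{A}^{-1}$ explicitly, which is a numerical convenience rather than something the notes prescribe. With $\eta=1$ the normalized update reaches the minimum in one step, whereas plain gradient descent with a safe fixed step is still some distance away after five steps.

```python
import numpy as np

# Hypothetical symmetric positive definite A and vector b for illustration.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([-1.0, 4.0])
w_star = np.linalg.solve(A, -b)    # the gradient A w + b vanishes here


def grad(w):
    """Gradient of E(w) = 0.5 * w^T A w + w^T b + c."""
    return A @ w + b


w0 = np.array([10.0, -7.0])

# Normalized update with eta = 1: w <- w - A^{-1} grad(w).
w_normalized = w0 - 1.0 * np.linalg.solve(A, grad(w0))

# Plain gradient descent with a safe fixed step (1 / largest eigenvalue of A).
eta_plain = 1.0 / np.linalg.eigvalsh(A).max()
w_plain = w0.copy()
for _ in range(5):
    w_plain = w_plain - eta_plain * grad(w_plain)

print("minimum:            ", w_star)
print("normalized, 1 step: ", w_normalized)
print("plain GD, 5 steps:  ", w_plain)
```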

  • For generic differentiable multivariate convex functions

    • Also use Taylor expansion

    • $E(\mathbf{w}) \approx E\left(\mathbf{w}^{(k)}\right)+\nabla_{\mathbf{w}} E\left(\mathbf{w}^{(k)}\right)\left(\mathbf{w}-\mathbf{w}^{(k)}\right)+\frac{1}{2}\left(\mathbf{w}-\mathbf{w}^{(k)}\right)^{T} H_{E}\left(\mathbf{w}^{(k)}\right)\left(\mathbf{w}-\mathbf{w}^{(k)}\right)+\cdots$

    • We get the normalized update rule

    • $\mathbf{w}^{(k+1)}=\mathbf{w}^{(k)}-\eta H_{E}\left(\mathbf{w}^{(k)}\right)^{-1} \nabla_{\mathbf{w}} E\left(\mathbf{w}^{(k)}\right)^{T}$

    • Each update minimizes a local quadratic approximation of $E$
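As a rough illustration of the Hessian-normalized update on a convex function that is not quadratic, the sketch below applies it with $\eta=1$ to $E(\mathbf{w})=\frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w}+\mathbf{b}^T\mathbf{w}+\sum_i e^{w_i}$. This test function, the values of `A` and `b`, and the helper names `grad`, `hessian`, and `newton` are all assumptions made for the example. The gradient and Hessian are written out by hand, and the linear system is solved rather than inverting the Hessian.

```python
import numpy as np

# Hypothetical convex test function:
#   E(w) = 0.5 * w^T A w + b^T w + sum_i exp(w_i)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([-1.0, -2.0])


def grad(w):
    """Gradient of E: A w + b + exp(w)."""
    return A @ w + b + np.exp(w)


def hessian(w):
    """Hessian of E: A + diag(exp(w)), positive definite for every w."""
    return A + np.diag(np.exp(w))


def newton(w0, eta=1.0, steps=10):
    """Iterate w <- w - eta * H(w)^{-1} grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * np.linalg.solve(hessian(w), grad(w))
    return w


w_newton = newton(np.zeros(2))
grad_norm = np.linalg.norm(grad(w_newton))
print("w after 10 steps:", w_newton)
print("gradient norm:   ", grad_norm)
```

Because the Hessian changes with $\mathbf{w}$ here, it has to be rebuilt and solved against at every iteration, which is the cost that becomes prohibitive for models with many parameters.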

Issues

Hessian
  • For complex models such as neural networks, with a very large number of parameters, the Hessian is extremely difficult to compute
  • For non-convex functions, the Hessian may not be positive semi-definite, in which case the algorithm can diverge
Learning rate
  • For complex models such as neural networks, the loss function is often not convex
    • $\eta > 2\eta_{opt}$ can actually help escape local optima
  • However, always having $\eta > 2\eta_{opt}$ ensures that you never actually settle on a solution
  • So use a decaying learning rate: start with a large step and shrink it over time
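The effect of decay can be seen even on a simple quadratic, which keeps the behavior easy to verify. In the sketch below, the schedule $\eta_k=\eta_0/(1+k/\tau)$ is just one possible choice, and `eta0`, `tau`, and the helper `run` are assumed names and values. The initial rate is above $2\eta_{opt}$, so a fixed rate diverges, while the decayed rate drops below the bound after a couple of steps and then converges.

```python
def run(schedule, steps=50, w0=1.0, a=1.0):
    """Gradient descent on E(w) = 0.5*a*w**2 with a step size that depends on k."""
    w = w0
    for k in range(steps):
        w = w - schedule(k) * a * w   # the gradient of E is a * w
    return w


eta_opt = 1.0                         # 1 / a for a = 1
eta0 = 2.5 * eta_opt                  # starts above the divergence bound 2 * eta_opt
tau = 5.0                             # hypothetical decay time constant

w_fixed = run(lambda k: eta0)                      # fixed rate: error grows
w_decay = run(lambda k: eta0 / (1.0 + k / tau))    # decayed rate: error shrinks

print("fixed rate:   ", w_fixed)
print("decaying rate:", w_decay)
```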