Stochastic Optimization Techniques

@(Paper summaries)[Neural Networks|Optimization]

Use case: the training set is very large.
Benefit: helps avoid settling in a local minimum.

Neural networks are often trained stochastically, i.e. using a method where the objective function changes at each iteration. This stochastic variation is due to the model being trained on different data during each iteration. This is motivated by (at least) two factors: First, the dataset used as training data is often too large to fit in memory and/or be optimized over efficiently. Second, the objective function is typically nonconvex, so using different data at each iteration can help prevent the model from settling in a local minimum. Furthermore, training neural networks is usually done using only the first-order gradient of the loss function with respect to the parameters. This is due to the large number of parameters present in a neural network, which for practical purposes prevents the computation of the Hessian matrix. Because vanilla gradient descent can diverge or converge incredibly slowly if its learning rate hyperparameter is set inappropriately, many alternative methods have been proposed which are intended to produce desirable convergence with less dependence on hyperparameter settings. These methods often effectively compute and utilize a preconditioner on the gradient, adaptively change the learning rate over time, or approximate the Hessian matrix.

In the following, we will use $\theta_t$ to denote some generic parameter of the model at iteration $t$, to be optimized according to some loss function $\mathcal{L}(\theta)$ which is to be minimized.

Stochastic Gradient Descent

Stochastic gradient descent (SGD) simply updates each parameter by subtracting the gradient of the loss with respect to the parameter, scaled by the learning rate $\eta$, a hyperparameter. If $\eta$ is too large, SGD will diverge; if it's too small, it will converge slowly. The update rule is simply

$$\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$$
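
As a concrete illustration, here is a minimal NumPy sketch of the SGD update on a toy quadratic loss; the function name `sgd_update`, the learning rate value, and the loss are illustrative assumptions, not from the original text.

```python
import numpy as np

def sgd_update(theta, grad, lr):
    """One SGD step: theta <- theta - lr * grad."""
    return theta - lr * grad

# Toy example (assumed for illustration): minimize L(theta) = ||theta||^2,
# whose gradient is 2 * theta. In real stochastic training, `grad` would come
# from backpropagation on a random minibatch, so it changes at every iteration.
theta = np.array([1.0, -2.0])
for _ in range(100):
    grad = 2 * theta
    theta = sgd_update(theta, grad, lr=0.1)
print(theta)  # converges toward [0, 0]
```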

Momentum

In SGD, the gradient $\nabla \mathcal{L}(\theta_t)$ often changes rapidly at each iteration $t$ due to the fact that the loss is being computed over different data. This is often partially mitigated by re-using the gradient value from the previous iteration, scaled by a momentum hyperparameter $\mu$, as follows:

$$v_{t+1} = \mu v_t - \eta \nabla \mathcal{L}(\theta_t)$$
$$\theta_{t+1} = \theta_t + v_{t+1}$$

It has been argued that including the previous gradient step has the effect of approximating some second-order information about the gradient.
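
To make the two-step update concrete, here is a minimal NumPy sketch of momentum using the same toy quadratic loss as above; the helper name `momentum_update` and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def momentum_update(theta, v, grad, lr, mu):
    """One momentum step: v <- mu * v - lr * grad; theta <- theta + v."""
    v_next = mu * v - lr * grad
    return theta + v_next, v_next

# Toy loss L(theta) = ||theta||^2 (illustrative), with gradient 2 * theta.
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)  # velocity, initialized to zero
for _ in range(100):
    grad = 2 * theta
    theta, v = momentum_update(theta, v, grad, lr=0.1, mu=0.9)
```

Because the velocity $v$ accumulates gradients across iterations, it smooths out the iteration-to-iteration variation that stochastic minibatches introduce.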

Nesterov’s Accelerated Gradient

In Nesterov's Accelerated Gradient (NAG), the gradient of the loss at each step is computed at $\theta_t + \mu v_t$ instead of $\theta_t$. In momentum, the parameter update could be written $\theta_{t+1} = \theta_t + \mu v_t - \eta \nabla \mathcal{L}(\theta_t)$, so NAG effectively computes the gradient at the new parameter location, but without considering the gradient term. In practice, this causes NAG to behave more stably than regular momentum in many situations. A more thorough analysis can be found in (Sutskever, Martens, Dahl, and Hinton, "On the importance of initialization and momentum in deep learning", ICML 2013). The update rules are then as follows:

$$v_{t+1} = \mu v_t - \eta \nabla \mathcal{L}(\theta_t + \mu v_t)$$
$$\theta_{t+1} = \theta_t + v_{t+1}$$
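
The only change relative to momentum is where the gradient is evaluated, which the following NumPy sketch makes explicit: the gradient is computed at the look-ahead point $\theta_t + \mu v_t$. The name `nag_update` and the toy loss are again illustrative assumptions.

```python
import numpy as np

def nag_update(theta, v, grad_fn, lr, mu):
    """One NAG step: the gradient is evaluated at the look-ahead
    point theta + mu * v, rather than at theta itself."""
    grad = grad_fn(theta + mu * v)
    v_next = mu * v - lr * grad
    return theta + v_next, v_next

# Toy loss L(theta) = ||theta||^2 (illustrative), with gradient 2 * theta.
grad_fn = lambda t: 2 * t
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for _ in range(100):
    theta, v = nag_update(theta, v, grad_fn, lr=0.1, mu=0.9)
```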