Deep Learning: Optimization for Training Deep Models (Part 2)

Challenges in Neural Network Optimization

When training neural networks, we must confront the general non-convex case. Even convex optimization is not without its complications. In this section, we summarize several of the most prominent challenges involved in optimization for training deep models.

Ill-Conditioning

Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the Hessian matrix $H$.
The ill-conditioning problem is generally believed to be present in neural network training problems. Ill-conditioning can manifest by causing SGD to get “stuck” in the sense that even very small steps increase the cost function.
A second-order Taylor series expansion of the cost function predicts that a gradient descent step of $-\epsilon g$ will add

$$\frac{1}{2}\epsilon^2 g^\top H g - \epsilon g^\top g$$

to the cost. Ill-conditioning of the gradient becomes a problem when $\frac{1}{2}\epsilon^2 g^\top H g$ exceeds $\epsilon g^\top g$.
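
To make this concrete, here is a minimal NumPy sketch on an ill-conditioned quadratic $f(x) = \frac{1}{2}x^\top H x$ (the matrix $H$, the starting point, and the step size are illustrative choices, not from the text): even a seemingly small step $-\epsilon g$ increases the cost because the curvature term dominates the gradient term.

```python
import numpy as np

# Illustrative ill-conditioned quadratic: f(x) = 1/2 x^T H x with condition number 1000.
H = np.diag([1.0, 1000.0])
x = np.array([1.0, 1.0])
f = lambda z: 0.5 * z @ H @ z

g = H @ x                                             # gradient of f at x
eps = 0.01                                            # a "small" learning rate
predicted = 0.5 * eps**2 * g @ H @ g - eps * g @ g    # Taylor prediction of the change in cost
actual = f(x - eps * g) - f(x)                        # exact change (exact for a quadratic)
print(predicted, actual)                              # both ~ +4.0e4: the step increases the cost
```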
To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient norm $g^\top g$ and the $g^\top H g$ term. In many cases, the gradient norm does not shrink significantly throughout learning, but the $g^\top H g$ term grows by more than an order of magnitude. The result is that learning becomes very slow despite the presence of a strong gradient, because the learning rate must be shrunk to compensate for even stronger curvature.
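
One way to do this monitoring in practice is with a Hessian-vector product, which avoids forming $H$ explicitly. Below is a minimal PyTorch sketch (the model, data, and step size are placeholders of my own, not from the text) that computes $g^\top g$ and $g^\top H g$ for a given loss and plugs them into the Taylor prediction above.

```python
import torch

def curvature_terms(loss, params):
    """Return (g^T g, g^T H g) for a scalar loss, using a Hessian-vector product."""
    grads = torch.autograd.grad(loss, params, create_graph=True)   # g, kept differentiable
    g_dot_g = sum((g * g).sum() for g in grads)
    g_const = [g.detach() for g in grads]                          # treat g as a constant vector
    dot = sum((g * gc).sum() for g, gc in zip(grads, g_const))     # g^T g_const, still differentiable
    Hg = torch.autograd.grad(dot, params)                          # d(g^T g_const)/dtheta = H g_const
    g_H_g = sum((gc * hv).sum() for gc, hv in zip(g_const, Hg))
    return g_dot_g.item(), g_H_g.item()

# Placeholder model and data, just to exercise the monitor.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
x, y = torch.randn(64, 10), torch.randn(64, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

gTg, gHg = curvature_terms(loss, list(model.parameters()))
eps = 0.1
print(f"g^T g = {gTg:.4f}, g^T H g = {gHg:.4f}, "
      f"predicted cost change = {0.5 * eps**2 * gHg - eps * gTg:.4f}")
```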
Though ill-conditioning is present in other settings besides neural network training, some of the techniques used to combat it in other contexts are less applicable to neural networks. For example, Newton’s method is an excellent tool for minimizing convex functions with poorly conditioned Hessian matrices, but in the subsequent sections we will argue that Newton’s method requires significant modification before it can be applied to neural networks.

Local Minima

With non-convex functions, such as neural nets, it is possible to have many local minima. Indeed, nearly any deep model is essentially guaranteed to have an extremely large number of local minima.
Neural networks and any models with multiple equivalently parametrized latent variables all have multiple local minima because of the model identifiability problem.
A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of the model’s parameters. Models with latent variables are often not identifiable because we can obtain equivalent models by exchanging latent variables with each other.
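
As a concrete illustration of this non-identifiability (weight-space symmetry), the sketch below uses a hypothetical one-hidden-layer PyTorch network of my own choosing: permuting the hidden units, i.e. the rows of the first layer and the matching columns of the second, gives a different parameter setting that computes exactly the same function.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1))
x = torch.randn(5, 4)
y_before = net(x).detach()

# Permute the hidden units: rows of the first layer and columns of the second.
perm = torch.randperm(8)
with torch.no_grad():
    net[0].weight.copy_(net[0].weight[perm])
    net[0].bias.copy_(net[0].bias[perm])
    net[2].weight.copy_(net[2].weight[:, perm])

y_after = net(x).detach()
print(torch.allclose(y_before, y_after, atol=1e-6))  # True: same function, different parameters
```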
