Second derivative & Hessian matrix

We are also sometimes interested in a derivative of a derivative. This is known as a second derivative. For example, $\frac{\partial^2}{\partial x_i \partial x_j} f$ is the derivative with respect to $x_i$ of the derivative of $f$ with respect to $x_j$. Note that the order of differentiation can be swapped, so that $\frac{\partial^2}{\partial x_i \partial x_j} f = \frac{\partial^2}{\partial x_j \partial x_i} f$. In a single dimension, we can denote $\frac{d^2}{dx^2} f$ by $f''(x)$.

The second derivative tells us how the first derivative will change as we vary the input. This means it can be useful for determining whether a critical point is a local maximum, a local minimum, or a saddle point. Recall that at a critical point, $f'(x) = 0$. When $f''(x) > 0$, the first derivative $f'(x)$ increases as we move to the right and decreases as we move to the left. This means $f'(x - \epsilon) < 0$ and $f'(x + \epsilon) > 0$ for small enough $\epsilon$. In other words, as we move right, the slope begins to point uphill to the right, and as we move left, the slope begins to point uphill to the left. Thus, when $f'(x) = 0$ and $f''(x) > 0$, we can conclude that $x$ is a local minimum. Similarly, when $f'(x) = 0$ and $f''(x) < 0$, we can conclude that $x$ is a local maximum. This is known as the second derivative test. Unfortunately, when $f''(x) = 0$, the test is inconclusive. In this case $x$ may be a saddle point or part of a flat region.
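To make the univariate test concrete, here is a minimal numerical sketch in Python. The example function `f`, the candidate point `x0`, and the step size `eps` are illustrative assumptions, not anything prescribed above; the second derivative is approximated with a central finite difference.

```python
def second_derivative(f, x, eps=1e-5):
    # Central finite-difference approximation of f''(x).
    return (f(x + eps) - 2.0 * f(x) + f(x - eps)) / eps**2

def f(x):
    return x**2  # f'(0) = 0 and f''(0) = 2 > 0, so x = 0 is a local minimum

x0 = 0.0
d2 = second_derivative(f, x0)
if d2 > 0:
    print("local minimum")    # fires for f(x) = x^2
elif d2 < 0:
    print("local maximum")
else:
    print("inconclusive")     # could be a saddle point or part of a flat region
```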

In multiple dimensions, we need to examine all of the second derivatives of the function. These derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix $H(f)(\boldsymbol{x})$ is defined such that

$$H(f)(\boldsymbol{x})_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j} f(\boldsymbol{x}).$$

Equivalently, the Hessian is the Jacobian of the gradient.
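As a rough illustration of this definition, the sketch below builds the Hessian entry by entry with central finite differences. The quadratic test function `f` is an assumed example (its exact Hessian is [[2, 3], [3, 4]]); in practice one would typically use automatic differentiation instead.

```python
import numpy as np

def f(x):
    # Assumed example: f(x) = x0^2 + 3*x0*x1 + 2*x1^2
    return x[0]**2 + 3.0 * x[0] * x[1] + 2.0 * x[1]**2

def hessian(f, x, eps=1e-4):
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = eps
            e_j = np.zeros(n); e_j[j] = eps
            # H[i, j] approximates the second partial derivative
            # of f with respect to x_i and x_j.
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4.0 * eps**2)
    return H

x = np.array([1.0, -2.0])
print(hessian(f, x))  # close to [[2., 3.], [3., 4.]], and symmetric
```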

Anywhere that the second partial derivatives are continuous, the differential operators are commutative:

$$\frac{\partial^2}{\partial x_i \partial x_j} f(\boldsymbol{x}) = \frac{\partial^2}{\partial x_j \partial x_i} f(\boldsymbol{x}).$$

This implies that $H_{i,j} = H_{j,i}$, so the Hessian matrix is symmetric at such points (which includes nearly all inputs to nearly all functions we encounter in deep learning). Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.
Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions. At a critical point, where $\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = 0$, we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, a local minimum, or a saddle point. When the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum. This can be seen by observing that the directional second derivative in any direction must be positive, and making reference to the univariate second derivative test. Likewise, when the Hessian is negative definite (all its eigenvalues are negative), the point is a local maximum. In multiple dimensions, it is actually possible to find positive evidence of saddle points in some cases. When at least one eigenvalue is positive and at least one eigenvalue is negative, we know that $\boldsymbol{x}$ is a local maximum on one cross section of $f$ but a local minimum on another cross section. Finally, the multidimensional second derivative test can be inconclusive, just like the univariate version. The test is inconclusive whenever all of the nonzero eigenvalues have the same sign but at least one eigenvalue is zero. This is because the univariate second derivative test is inconclusive in the cross section corresponding to the zero eigenvalue.
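A short sketch of this multidimensional test, assuming the Hessian at the critical point has already been computed: classify the point by the signs of the Hessian's eigenvalues. The example matrix below is an assumption, chosen to have one positive and one negative eigenvalue (e.g. the origin of $f(x, y) = x^2 - y^2$).

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    # eigvalsh is appropriate because the Hessian is real and symmetric.
    eigs = np.linalg.eigvalsh(H)
    if np.all(eigs > tol):
        return "local minimum"    # positive definite
    if np.all(eigs < -tol):
        return "local maximum"    # negative definite
    if np.any(eigs > tol) and np.any(eigs < -tol):
        return "saddle point"     # mixed signs
    return "inconclusive"         # some eigenvalue is (numerically) zero

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
print(classify_critical_point(H))  # saddle point
```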

The Hessian can also be useful for understanding the performance of gradient descent. When the Hessian has a poor condition number, gradient descent performs poorly. This is because in one direction the derivative increases rapidly, while in another direction it increases slowly. Gradient descent is unaware of this change in the derivative, so it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer.
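The effect of a poor condition number can be seen on a simple quadratic. The sketch below is an assumed example: $f(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^\top H \boldsymbol{x}$ with $H = \mathrm{diag}(1, 100)$, so the Hessian's condition number is 100. The step size must be small enough for the steep direction, which makes progress along the shallow direction very slow.

```python
import numpy as np

H = np.diag([1.0, 100.0])          # Hessian of f(x) = 0.5 * x^T H x
print(np.linalg.cond(H))           # condition number: 100.0

x = np.array([1.0, 1.0])
lr = 1.0 / np.max(np.linalg.eigvalsh(H))  # step size limited by the largest eigenvalue
for _ in range(100):
    grad = H @ x                   # gradient of the quadratic is H x
    x = x - lr * grad

print(x)  # the steep coordinate is essentially 0; the shallow one is still ~0.37
```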
