Deep Learning (Goodfellow, Bengio, and Courville), notes part 1
Information theory
• Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.
• Less likely events should have higher information content.
• Independent events should have additive information.
In order to satisfy all three of these properties, we define the self-information of an event X = x to be

I(x) = − log P(x).
The Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution:

H(X) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)].
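A minimal sketch of these two definitions in NumPy (function names are my own, using natural logarithms, so information is measured in nats):

```python
import numpy as np

def self_information(p):
    # I(x) = -log P(x): a guaranteed event (p = 1) carries zero information.
    return -np.log(p)

def entropy(probs):
    # Shannon entropy H(P) = E[I(x)] = -sum_x P(x) log P(x).
    probs = np.asarray(probs, dtype=float)
    nz = probs[probs > 0]  # use the convention 0 * log 0 = 0
    return float(-np.sum(nz * np.log(nz)))
```

For example, `self_information(1.0)` is `0.0`, matching the first property above, and a fair coin has entropy `log 2`.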
The Kullback–Leibler (KL) divergence:

D_KL(P‖Q) = E_{x∼P}[log P(x) − log Q(x)].

Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions. However, it is not a true distance measure because it is not symmetric: in general, D_KL(P‖Q) ≠ D_KL(Q‖P).
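The asymmetry is easy to check numerically. A small sketch (discrete distributions of my own choosing, both with full support so the formula is well defined):

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x))
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p) - np.log(q))))

p = np.array([0.9, 0.1])  # a skewed coin
q = np.array([0.5, 0.5])  # a fair coin

# Both directions are non-negative, but they differ:
# kl_divergence(p, q) != kl_divergence(q, p)
```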
Rounding error:
• Underflow occurs when numbers near zero are rounded to zero.
• Overflow occurs when numbers with large magnitude are approximated as ∞ or −∞.
One example of a function that must be stabilized against underflow and overflow is the softmax function:

softmax(x)_i = exp(x_i) / Σ_j exp(x_j).
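The standard stabilization is to evaluate softmax(z) with z = x − max_i x_i, which leaves the result mathematically unchanged. A sketch of both versions:

```python
import numpy as np

def softmax_naive(x):
    # Overflows for large inputs: exp(1000) is already inf in float64.
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Shifting by max(x) keeps the largest exponent at 0, preventing
    # overflow; the denominator then contains at least one term equal
    # to exp(0) = 1, so it cannot underflow to zero either.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1001.0])
# softmax_naive(x) produces nan (inf / inf); softmax_stable(x) is finite.
```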
Poor conditioning: conditioning refers to how rapidly a function changes with respect to small changes in its inputs. Functions that change rapidly are problematic because rounding errors in the inputs can produce large changes in the output.
Gradient-based optimization
Beyond the Gradient: Jacobian and Hessian Matrices
Equivalently, the Hessian is the Jacobian of the gradient.
Most of the functions we encounter in the context of deep learning have a symmetric
Hessian almost everywhere.
The (directional) second derivative tells us how well we can expect a gradient
descent step to perform.
We can make a second-order Taylor series approximation to the function f(x) around the current point x⁽⁰⁾:

f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀ g + ½ (x − x⁽⁰⁾)ᵀ H (x − x⁽⁰⁾),

where g is the gradient and H is the Hessian at x⁽⁰⁾.
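As a sanity check on the expansion, here is a 1-D sketch (my own example function, f(x) = exp(x) around x⁽⁰⁾ = 0), where the gradient and Hessian reduce to the scalar first and second derivatives:

```python
import numpy as np

f = np.exp
x0 = 0.0
g = np.exp(x0)  # f'(x0)
h = np.exp(x0)  # f''(x0)

def taylor2(x):
    # f(x0) + g * (x - x0) + 0.5 * h * (x - x0)^2
    d = x - x0
    return f(x0) + g * d + 0.5 * h * d * d

# Near x0 the quadratic model tracks f closely; the error at
# x = 0.1 is on the order of the cubed step size.
```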
The second derivative can be used to determine whether a critical point is a local maximum, a local minimum, or a saddle point.
In more than one dimension, it is not necessary to have an eigenvalue
of 0 in order to get a saddle point: it is only necessary to have both positive and negative
eigenvalues.
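This second-derivative test can be sketched directly from the eigenvalues of the Hessian (a helper of my own; the "inconclusive" case covers zero eigenvalues, where the test fails):

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    # Eigenvalues of the symmetric Hessian at a critical point:
    #   all positive -> local minimum
    #   all negative -> local maximum
    #   mixed signs  -> saddle point
    eig = np.linalg.eigvalsh(hessian)
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "inconclusive"

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) everywhere:
# a saddle at the origin, with no zero eigenvalue needed.
H = np.array([[2.0, 0.0], [0.0, -2.0]])
```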
Constrained Optimization
We wish to find the maximal or minimal value of f(x) for values of x in some set S.
The Karush–Kuhn–Tucker (KKT) approach provides a very general solution to constrained optimization. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function (the method of Lagrange multipliers).
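Spelled out, with equality constraints g⁽ⁱ⁾(x) = 0 and inequality constraints h⁽ʲ⁾(x) ≤ 0, the generalized Lagrangian takes the form:

```latex
L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})
  = f(\boldsymbol{x})
  + \sum_i \lambda_i\, g^{(i)}(\boldsymbol{x})
  + \sum_j \alpha_j\, h^{(j)}(\boldsymbol{x}),
\qquad \alpha_j \ge 0,
```

where the λᵢ and αⱼ are the KKT multipliers; constrained optima of f correspond to solutions of min_x max_λ max_{α≥0} L.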