Applied Math and Machine Learning Basics: Summary Notes

Deep Learning (Goodfellow, Bengio, and Courville), Part I

Information theory
Likely events should have low information content, and in the extreme case,
events that are guaranteed to happen should have no information content.
Less likely events should have higher information content.
Independent events should have additive information.
In order to satisfy all three of these properties, we define the self-information
of an event x = x (the random variable x taking the value x) to be
I(x) = −log P(x).
The Shannon entropy of a distribution is the
expected amount of information in an event drawn from that distribution:
H(x) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)].
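The three properties and the entropy definition can be checked numerically. The following is a minimal sketch using the negative-log definition of self-information with the natural logarithm (so values are in nats); the function names and the example coin distributions are illustrative assumptions, not part of the original notes.

```python
import math

def self_information(p):
    """Self-information I(x) = -log P(x), in nats (natural logarithm)."""
    return -math.log(p)

def shannon_entropy(probs):
    """Expected self-information of a discrete distribution.

    Outcomes with probability 0 are skipped, following the usual
    convention that they contribute nothing to the entropy.
    """
    return sum(p * self_information(p) for p in probs if p > 0)

# Hypothetical example distributions.
fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
certain = [1.0]

# A guaranteed event carries no information.
info_certain = self_information(1.0)

# Independent events: P(a and b) = P(a) * P(b), so information adds.
info_joint = self_information(0.5 * 0.25)
info_sum = self_information(0.5) + self_information(0.25)

entropy_fair = shannon_entropy(fair_coin)      # equals log(2) nats
entropy_biased = shannon_entropy(biased_coin)  # lower: outcome is more predictable
```

The fair coin has the larger entropy because its outcome is the hardest to predict; the certain distribution has entropy zero.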
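The Kullback-Leibler (KL) divergence measures how different two distributions P(x) and Q(x) over the same random variable x are:
D_KL(P‖Q) = E_{x∼P}[log P(x) − log Q(x)].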

Because the KL divergence is non-negative and measures the difference
between two distributions, it is often conceptualized as measuring some sort of
distance between these distributions. However, it is not a true distance measure
because it is not symmetric: D_KL(P‖Q) ≠ D_KL(Q‖P) for some P and Q.
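The asymmetry is easy to see on a small discrete example. This is a sketch only: it assumes the standard discrete form D_KL(P‖Q) = Σ_x P(x) (log P(x) − log Q(x)), and the two distributions `p` and `q` are made-up values chosen for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as lists, in nats.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 are skipped.
    """
    return sum(
        pi * (math.log(pi) - math.log(qi))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)  # D_KL(P || Q)
kl_qp = kl_divergence(q, p)  # D_KL(Q || P), a different number
kl_pp = kl_divergence(p, p)  # zero: a distribution does not diverge from itself
```

Both values are non-negative, but swapping the arguments changes the result, which is why the KL divergence is not a true distance.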

Rounding error
Underflow occurs when numbers near zero are rounded to zero.
Overflow occurs when numbers with large magnitude are approximated as ∞ or −∞.
One example of a function that must be stabilized against underflow and
overflow is the softmax function:
softmax(x)_i = exp(x_i) / Σ_j exp(x_j).
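A common way to stabilize softmax is to subtract the largest entry of x before exponentiating. This leaves the result unchanged analytically, but the largest exponent becomes exp(0) = 1, so the exponentials cannot overflow and the denominator is at least 1. The sketch below assumes NumPy; the function names and test vectors are illustrative.

```python
import numpy as np

def softmax_naive(x):
    """Direct translation of the formula; breaks for large |x_i|."""
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    """Softmax evaluated on z = x - max(x), which is analytically identical."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

big = np.array([1000.0, 1001.0, 1002.0])       # exp overflows to inf
small = np.array([-1000.0, -1001.0, -1002.0])  # exp underflows to 0

# Silence the floating-point warnings the naive version triggers.
with np.errstate(over="ignore", under="ignore", invalid="ignore", divide="ignore"):
    naive_big = softmax_naive(big)      # inf / inf -> nan
    naive_small = softmax_naive(small)  # 0 / 0 -> nan

stable_big = softmax_stable(big)
stable_small = softmax_stable(small)
```

The naive version returns NaN for both inputs, while the stabilized version returns valid probabilities that sum to 1.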

Poor conditioning
Gradient-based optimization
Beyond the Gradient: Jacobian and Hessian Matrices

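The Hessian matrix H(f)(x) collects all second partial derivatives of f:
H(f)(x)_{i,j} = ∂²f(x) / (∂x_i ∂x_j).
Equivalently, the Hessian is the Jacobian of the gradient.
Most of the functions we encounter in the context of deep learning have a symmetric
Hessian almost everywhere.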
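The statement that the Hessian is the Jacobian of the gradient can be illustrated by differentiating a gradient function numerically. This is a rough sketch using central differences in NumPy; the test function f(x) = x0² · x1 + sin(x1), the helper `jacobian_fd`, and the step size `h` are all assumptions made for the example.

```python
import numpy as np

def gradient(x):
    """Analytic gradient of the hypothetical f(x) = x0**2 * x1 + sin(x1)."""
    return np.array([2 * x[0] * x[1], x[0] ** 2 + np.cos(x[1])])

def jacobian_fd(func, x, h=1e-5):
    """Central-difference Jacobian of a vector-valued function at x.

    Entry (i, j) approximates d func_i / d x_j.
    """
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        cols.append((func(x + step) - func(x - step)) / (2 * h))
    return np.stack(cols, axis=1)

x0 = np.array([1.5, -0.7])

# Jacobian of the gradient = Hessian of f.
H = jacobian_fd(gradient, x0)

# Hessian worked out by hand for comparison.
H_exact = np.array([
    [2 * x0[1], 2 * x0[0]],
    [2 * x0[0], -np.sin(x0[1])],
])
```

Up to finite-difference error, `H` matches `H_exact` and is symmetric, as expected for a function with continuous second partial derivatives.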
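The (directional) second derivative tells us how well we can expect a gradient descent step to perform.
We can make a second-order Taylor series approximation
to the function f(x) around the current point x^(0):
f(x) ≈ f(x^(0)) + (x − x^(0))^T g + 1/2 (x − x^(0))^T H (x − x^(0)),
where g is the gradient and H is the Hessian at x^(0).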
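Substituting a gradient descent step x = x^(0) − εg into the second-order approximation gives f(x^(0) − εg) ≈ f(x^(0)) − ε g^T g + 1/2 ε² g^T H g. When g^T H g is positive, this prediction is minimized at ε* = g^T g / (g^T H g). The sketch below checks this on a quadratic function, for which the approximation is exact; the matrix `A`, vector `b`, and starting point are made-up values.

```python
import numpy as np

# Hypothetical quadratic f(x) = 1/2 x^T A x - b^T x with constant Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad(x):
    return A @ x - b

x0 = np.array([2.0, 1.0])
g = grad(x0)
H = A

def predicted(eps):
    """Second-order Taylor prediction of f after a step of size eps along -g."""
    return f(x0) - eps * (g @ g) + 0.5 * eps**2 * (g @ H @ g)

# Step size that minimizes the Taylor prediction (valid since g^T H g > 0 here).
eps_star = (g @ g) / (g @ H @ g)

actual_at_star = f(x0 - eps_star * g)
```

The first-order term predicts an improvement proportional to ε, while the curvature term g^T H g corrects for how quickly the gradient changes along the step direction; large curvature means only a small step helps.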

The second derivative can be used to determine whether a critical point is a
local maximum, a local minimum, or a saddle point.
In more than one dimension, it is not necessary to have an eigenvalue
of 0 in order to get a saddle point: it is only necessary to have both positive and negative eigenvalues of the Hessian.
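In multiple dimensions the second derivative test looks at the eigenvalues of the Hessian at a critical point: all positive indicates a local minimum, all negative a local maximum, mixed signs a saddle point, and anything else leaves the test inconclusive. A minimal sketch, assuming NumPy and a symmetric Hessian; the function name, tolerance, and example matrices are illustrative.

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eig = np.linalg.eigvalsh(H)  # eigenvalues of a symmetric matrix
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "inconclusive"  # some eigenvalue is (numerically) zero

# Hessians at the origin of some simple two-variable functions.
H_min = np.array([[2.0, 0.0], [0.0, 2.0]])       # f = x^2 + y^2
H_max = np.array([[-2.0, 0.0], [0.0, -2.0]])     # f = -(x^2 + y^2)
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])   # f = x^2 - y^2
H_flat = np.array([[2.0, 0.0], [0.0, 0.0]])      # f = x^2 (flat along y)

results = [classify_critical_point(H) for H in (H_min, H_max, H_saddle, H_flat)]
```

Note that `H_saddle` has no zero eigenvalue: one positive and one negative eigenvalue are enough to produce a saddle.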
Constrained Optimization
We may wish to find the maximal or minimal value of f(x) for values of x in some set S.
The Karush–Kuhn–Tucker (KKT) approach provides a very general solution
to constrained optimization. With the KKT approach, we introduce a new function
called the generalized Lagrangian or generalized Lagrange function (compare the method of Lagrange multipliers).
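As a small worked case, consider only a single equality constraint, where the generalized Lagrangian reduces to the classical Lagrange multiplier form L(x, λ) = f(x) + λ g(x). The example problem below (minimize x1² + x2² subject to x1 + x2 − 1 = 0) is an assumption chosen because its stationarity conditions are linear; inequality constraints and their complementary slackness conditions are not covered by this sketch.

```python
import numpy as np

# Hypothetical problem: minimize f(x) = x1^2 + x2^2
# subject to g(x) = x1 + x2 - 1 = 0.
#
# Lagrangian: L(x, lam) = x1^2 + x2^2 + lam * (x1 + x2 - 1)
# Setting all partial derivatives to zero gives a linear system:
#   dL/dx1  = 2*x1 + lam = 0
#   dL/dx2  = 2*x2 + lam = 0
#   dL/dlam = x1 + x2 - 1 = 0
K = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [1.0, 1.0, 0.0],
])
rhs = np.array([0.0, 0.0, 1.0])

x1, x2, lam = np.linalg.solve(K, rhs)

constraint_residual = x1 + x2 - 1.0  # should be zero at the solution
objective = x1**2 + x2**2
```

The solution is x1 = x2 = 0.5 with multiplier λ = −1: the closest point to the origin on the line x1 + x2 = 1.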







