Applied Math and Machine Learning Basics: Summary Notes

Deep Learning (Goodfellow, Bengio, and Courville), Part 1

Information theory
Likely events should have low information content, and in the extreme case,
events that are guaranteed to happen should have no information content
whatsoever.
Less likely events should have higher information content.
Independent events should have additive information.
In order to satisfy all three of these properties, we define the self-information of an event x = x to be
I(x) = −log P(x).
The Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution:
H(x) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)].
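As a small illustration (not from the book), the sketch below computes self-information and the Shannon entropy of a discrete distribution in nats; the probability vectors are arbitrary examples.

```python
import numpy as np

def self_information(p):
    """Self-information I(x) = -log P(x), in nats (natural log)."""
    return -np.log(p)

def entropy(p):
    """Shannon entropy H(P) = -sum_x P(x) log P(x) for a discrete distribution."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

# A fair coin has higher entropy than a heavily biased one.
print(entropy([0.5, 0.5]))    # ~0.693 nats (log 2)
print(entropy([0.99, 0.01]))  # ~0.056 nats
print(self_information(0.5))  # ~0.693 nats
```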
If we have two distributions P(x) and Q(x) over the same random variable, we can measure how different they are with the Kullback-Leibler (KL) divergence:
D_KL(P‖Q) = E_{x∼P}[log P(x) − log Q(x)].

Because the KL divergence is non-negative and measures the difference
between two distributions, it is often conceptualized as measuring some sort of
distance between these distributions. However, it is not a true distance measure
because it is not symmetric: D_KL(P‖Q) ≠ D_KL(Q‖P) in general.
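A quick numerical check of this asymmetry, assuming discrete distributions given as probability vectors (an illustrative sketch, not code from the book):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x)), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])
print(kl_divergence(p, q))  # ~0.373
print(kl_divergence(q, p))  # ~0.422, not equal to the other direction
```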

Rounding error:
Underflow occurs when numbers near zero are rounded to zero.
Overflow occurs when numbers with large magnitude are approximated as ∞ or −∞.
One example of a function that must be stabilized against underflow and
overflow is the softmax function, softmax(x)_i = exp(x_i) / Σ_j exp(x_j).
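A minimal sketch of the standard stabilization trick: subtract the maximum input before exponentiating, which leaves the result unchanged because softmax is invariant to adding a constant to every input.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: softmax(x)_i = exp(x_i) / sum_j exp(x_j)."""
    z = x - np.max(x)      # shifting by max(x) prevents overflow in exp
    e = np.exp(z)          # the largest exponent is now exp(0) = 1
    return e / np.sum(e)   # denominator is at least 1, so no division by zero

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax(x))          # well-defined; naive exp(1000) would overflow to inf
```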

Poor conditioning: conditioning refers to how rapidly a function changes with respect to small changes in its inputs; for a matrix, the condition number is the ratio of the magnitudes of its largest and smallest eigenvalues, and a poorly conditioned Hessian makes gradient descent perform poorly.
Gradient-based optimization: gradient descent reduces f(x) by taking small steps in the direction of the negative gradient, x' = x − ε ∇_x f(x), where ε is the learning rate. A toy example is sketched below.
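As a toy illustration of gradient descent (not from the book), minimizing f(x) = ½‖Ax − b‖² with a fixed learning rate; the matrix A, vector b, step size, and iteration count are arbitrary assumptions:

```python
import numpy as np

# Toy problem: minimize f(x) = 0.5 * ||A x - b||^2 by gradient descent.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

x = np.zeros(5)
lr = 0.01                      # learning rate (epsilon)
for _ in range(1000):
    grad = A.T @ (A @ x - b)   # gradient of f at the current x
    x = x - lr * grad          # step in the direction of the negative gradient

print(np.linalg.norm(A.T @ (A @ x - b)))  # gradient norm should be near zero
```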
Beyond the Gradient: Jacobian and Hessian Matrices

Equivalently, the Hessian is the Jacobian of the gradient.
Most of the functions we encounter in the context of deep learning have a symmetric
Hessian almost everywhere.
The (directional) second derivative tells us how well we can expect a gradient  descent step to perform.
We can make a second-order Taylor series approximation to the function f(x) around the current point x^(0):
f(x) ≈ f(x^(0)) + (x − x^(0))^T g + (1/2)(x − x^(0))^T H (x − x^(0)),
where g is the gradient and H is the Hessian at x^(0).

The second derivative can be used to determine whether a critical point is a local maximum, a local minimum, or a saddle point.
In more than one dimension, it is not necessary to have an eigenvalue of 0 in order to get a saddle point: it is only necessary to have both positive and negative eigenvalues.
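A small sketch of this second-derivative test using the eigenvalues of the Hessian at a critical point; the example function f(x, y) = x² − y², which has a saddle point at the origin, is my own choice, not the book's.

```python
import numpy as np

def classify_critical_point(hessian):
    """Classify a critical point from the eigenvalues of the Hessian there."""
    eig = np.linalg.eigvalsh(hessian)   # the Hessian is symmetric, so eigvalsh applies
    if np.all(eig > 0):
        return "local minimum"
    if np.all(eig < 0):
        return "local maximum"
    if np.any(eig > 0) and np.any(eig < 0):
        return "saddle point"
    return "inconclusive (some eigenvalue is zero)"

# f(x, y) = x^2 - y^2 has zero gradient at (0, 0) and Hessian diag(2, -2).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
print(classify_critical_point(H))  # saddle point
```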
Constrained Optimization
We wish to find the maximal or minimal value of f(x) for values of x in some set S.
The Karush–Kuhn–Tucker (KKT) approach provides a very general solution to constrained optimization. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function (a generalization of the method of Lagrange multipliers):
L(x, λ, α) = f(x) + Σ_i λ_i g^(i)(x) + Σ_j α_j h^(j)(x),
where the g^(i)(x) = 0 are equality constraints and the h^(j)(x) ≤ 0 are inequality constraints.
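A minimal sketch of the Lagrange-multiplier special case (equality constraints only, no inequality terms), using sympy to solve the stationarity conditions for minimizing x² + y² subject to x + y = 1; the specific problem is an arbitrary illustration.

```python
import sympy as sp

# Minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y - 1 = 0.
x, y, lam = sp.symbols('x y lambda', real=True)
f = x**2 + y**2
g = x + y - 1

# Generalized Lagrangian with a single equality constraint (no inequality terms).
L = f + lam * g

# Stationarity: set all partial derivatives of L to zero and solve.
sols = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(sols)  # [{x: 1/2, y: 1/2, lambda: -1}]
```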

