Optimizing techniques in Deep Learning
- Jacobian Matrix
represents the first-order partial derivatives of a MIMO (vector-valued) function f: R^n → R^m, with entries J_{i,j} = ∂f_i/∂x_j.
- Hessian Matrix
the matrix of second derivatives. For a function f: R^n → R, H(f)(x)_{i,j} = ∂²f(x)/(∂x_i ∂x_j). When the second partial derivatives are continuous, the Hessian matrix is real and symmetric.
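As a quick sketch of the definition above, the Hessian can be estimated numerically with central differences; the function `hessian` and the test function below are illustrative choices, not from the text:

```python
import numpy as np

def hessian(f, x, eps=1e-4):
    """Estimate the Hessian of f at x via central finite differences:
    H[i, j] ≈ ∂²f/∂x_i∂x_j."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            # Four-point stencil for the mixed second partial derivative
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

# f(x, y) = x^2 + 3xy has constant Hessian [[2, 3], [3, 0]]
f = lambda v: v[0]**2 + 3 * v[0] * v[1]
H = hessian(f, np.array([1.0, 2.0]))
```

Note that the numerical estimate comes out symmetric, matching the symmetry property stated above.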
First-order optimization algorithms: use only first derivatives (the gradient / Jacobian).
Second-order optimization algorithms: also make use of the Hessian Matrix. Example: Newton's Method, which assumes the function is locally well approximated by a positive definite quadratic. It is not suitable around saddle points, where the Hessian is not positive definite.
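A minimal sketch of one Newton step, x ← x − H⁻¹∇f(x), on a hypothetical quadratic objective (the function and its derivatives below are my own example, chosen so the method converges in a single step):

```python
import numpy as np

def newton_step(grad, hess, x):
    # Solve H p = ∇f(x) and move against p; this is a descent
    # direction only when H is positive definite
    return x - np.linalg.solve(hess(x), grad(x))

# Minimize f(x) = (x0 - 1)^2 + 2*(x1 + 3)^2, minimum at (1, -3)
grad = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])

x = np.array([10.0, 10.0])
x = newton_step(grad, hess, x)  # one step is exact for a quadratic
```

Because the quadratic approximation here is exact, a single step lands on the minimizer; on a general function Newton's method iterates this update.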
KL Divergence
Measures how different two distributions P(x) and Q(x) are. It is not symmetric, so it is not a true distance metric. Note that the KL divergence is non-negative, and it is zero if and only if P(x) = Q(x). It is closely related to cross-entropy. See the Deep Learning textbook, pages 76-78, for a detailed explanation.
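A small sketch for the discrete case, D_KL(P‖Q) = Σ_x P(x) log(P(x)/Q(x)); the distributions below are arbitrary examples used to exhibit the asymmetry and non-negativity noted above:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)  # != kl_divergence(q, p): not symmetric
d_qp = kl_divergence(q, p)
```

Swapping the arguments gives a different value, which is why KL divergence is not a true distance measure.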