Overflow and Underflow
- underflow
- occurs when numbers near zero are rounded to zero
- Overflow
- occurs when numbers with large magnitude are approximated as $\infty$ or $-\infty$
- Softmax function
- must be stabilized against underflow and overflow
- used to predict the probabilities associated with a multinoulli distribution
- $\text{softmax}(\vec{x})_i=\frac{\exp(x_i)}{\sum_{j=1}^n\exp(x_j)}$
- evaluating $\text{softmax}(\vec{z})$ instead, where $\vec{z}=\vec{x}-\max_i x_i$, resolves the difficulty of the result being undefined: the numerator cannot overflow and the denominator is at least $1$
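A minimal NumPy sketch of this max-shift trick (the function name `stable_softmax` is my own):

```python
import numpy as np

def stable_softmax(x):
    # Shift by the maximum so the largest exponent is exp(0) = 1:
    # the numerator cannot overflow and the denominator is at least 1.
    z = x - np.max(x)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

# A naive softmax overflows for inputs like these; the shifted version does not.
x = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(x))  # approx. [0.090, 0.245, 0.665]
```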
Poor conditioning
- conditioning refers to how rapidly a function changes with respect to small changes in its inputs
- For $f(\vec{x})=\mathbf{A}^{-1}\vec{x}$ with $\mathbf{A}\in\mathbb{R}^{n\times n}$, the condition number is $\max\limits_{i,j}\left|\frac{\lambda_i}{\lambda_j}\right|$
- namely, the ratio of the magnitudes of the largest and smallest eigenvalues
- the larger this number, the more sensitive matrix inversion is to error in the input
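A short sketch, assuming NumPy, that computes this eigenvalue ratio for a symmetric example and compares it with NumPy's built-in condition number:

```python
import numpy as np

# Eigenvalue-ratio condition number for a (symmetric) matrix; a large value
# means computing A^{-1} x is very sensitive to small errors in x.
A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])
eigvals = np.linalg.eigvals(A)
cond = np.max(np.abs(eigvals)) / np.min(np.abs(eigvals))
print(cond)               # 1e6 -> poorly conditioned
print(np.linalg.cond(A))  # NumPy's built-in version (based on singular values)
```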
Gradient-Based Optimization
- objective function or criterion: the function we want to minimize or maximize (for minimization problems, it is also called the cost function, loss function, or error function)
- $\vec{x}^*=\arg\min f(\vec{x})$
- gradient descent:
- $f(x+\epsilon)\approx f(x)+\epsilon f'(x)$
- critical points or stationary points: $f'(x)=0$
- saddle points: critical points that are neither maxima nor minima
- local minimum: a point where $f(x)$ is lower than at all neighboring points
- local maximum: a point where $f(x)$ is higher than at all neighboring points
- global minimum: a point that obtains the absolute lowest value of $f(x)$
- the partial derivative $\frac{\partial}{\partial x_i}f(\vec{x})$ measures how $f$ changes as only the variable $x_i$ increases at point $\vec{x}$
- the gradient of $f$ is denoted $\nabla_{\vec{x}}f(\vec{x})$; it is the vector containing all of the partial derivatives with respect to the $x_i$
- the directional derivative in direction $\vec{u}$ is the slope of the function in direction $\vec{u}$
- the derivative of $f(\vec{x}+\alpha\vec{u})$ with respect to $\alpha$, evaluated at $\alpha=0$
- namely, $\frac{\partial}{\partial\alpha}f(\vec{x}+\alpha\vec{u})$ evaluates to $\vec{u}^T\nabla_{\vec{x}}f(\vec{x})$ when $\alpha=0$ (numerical check in the sketch below)
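A tiny check, assuming NumPy and a toy function $f(x)=x_1^2+3x_1x_2$ of my own choosing, that $\vec{u}^T\nabla_{\vec{x}}f(\vec{x})$ matches a finite-difference estimate of $\frac{\partial}{\partial\alpha}f(\vec{x}+\alpha\vec{u})$ at $\alpha=0$:

```python
import numpy as np

# f(x) = x1^2 + 3*x1*x2 and its gradient, for checking the identity above.
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
grad_f = lambda x: np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x = np.array([1.0, 2.0])
u = np.array([3.0, 4.0]) / 5.0            # unit-norm direction
alpha = 1e-6
fd = (f(x + alpha * u) - f(x)) / alpha    # finite-difference slope along u
print(fd, u @ grad_f(x))                  # both approx. 7.2
```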
- To minimize $f$, we solve:
- $\min\limits_{\vec{u},\vec{u}^T\vec{u}=1}\vec{u}^T\nabla_{\vec{x}}f(\vec{x})=\min\limits_{\vec{u},\vec{u}^T\vec{u}=1}\|\vec{u}\|_2\|\nabla_{\vec{x}}f(\vec{x})\|_2\cos\theta$, where $\theta$ is the angle between $\vec{u}$ and the gradient
- decrease $f$ by moving in the direction of the negative gradient
- steepest descent or gradient descent
- $\vec{x}'=\vec{x}-\epsilon\nabla_{\vec{x}}f(\vec{x})$, where $\epsilon$ is the learning rate
- choose $\epsilon$:
- set $\epsilon$ to a small constant
- line search: evaluate $f(\vec{x}-\epsilon\nabla_{\vec{x}}f(\vec{x}))$ for several values of $\epsilon$ and choose the one that results in the smallest objective function value (see the sketch below)
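A minimal gradient descent sketch, assuming NumPy and a quadratic objective $f(\vec{x})=\frac{1}{2}\vec{x}^T\mathbf{A}\vec{x}-\vec{b}^T\vec{x}$ of my own choosing, using the crude grid-based line search just described:

```python
import numpy as np

# Gradient descent on f(x) = 0.5*x^T A x - b^T x, whose gradient is A x - b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x = np.zeros(2)
for _ in range(100):
    g = grad(x)
    # Line search: try a few step sizes and keep the one with the lowest f.
    eps = min([1e-3, 1e-2, 1e-1, 0.3], key=lambda e: f(x - e * g))
    x = x - eps * g

print(x, np.linalg.solve(A, b))  # both close to the minimizer A^{-1} b
```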
Beyond the Gradient: Jacobian and Hessian Matrices
- Jacobian matrix
- if we have a function $f:\mathbb{R}^m\rightarrow\mathbb{R}^n$, then the Jacobian matrix $J\in\mathbb{R}^{n\times m}$ of $f$ is defined such that $J_{i,j}=\frac{\partial}{\partial x_j}f(\vec{x})_i$ (shape example below)
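A small shape check, assuming NumPy and a toy map $f:\mathbb{R}^2\rightarrow\mathbb{R}^3$ of my own choosing, to make the $n\times m$ convention concrete:

```python
import numpy as np

# f(x) = (x1*x2, x1^2, sin(x2)) maps R^2 -> R^3, so its Jacobian is 3 x 2,
# with J[i, j] = d f_i / d x_j.
def jacobian(x):
    x1, x2 = x
    return np.array([[x2,     x1],
                     [2 * x1, 0.0],
                     [0.0,    np.cos(x2)]])

print(jacobian(np.array([1.0, np.pi])).shape)  # (3, 2), i.e. n x m
```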
- second derivative: a derivative of a derivative; it can be regarded as measuring curvature
- Hessian matrix $H(f)(\vec{x})$
- $H(f)(\vec{x})_{i,j}=\frac{\partial^2}{\partial x_i\partial x_j}f(\vec{x})$
- the Hessian is the Jacobian of the gradient
- $H_{i,j}=H_{j,i}$
- Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors
- second-order Taylor series approximation:
- $f(\vec{x})\approx f(\vec{x}^{(0)})+(\vec{x}-\vec{x}^{(0)})^T\vec{g}+\frac{1}{2}(\vec{x}-\vec{x}^{(0)})^TH(\vec{x}-\vec{x}^{(0)})$, where $\vec{g}$ is the gradient and $H$ is the Hessian at $\vec{x}^{(0)}$
- Substituting the new point $\vec{x}^{(0)}-\epsilon\vec{g}$ gives $f(\vec{x}^{(0)}-\epsilon\vec{g})\approx f(\vec{x}^{(0)})-\epsilon\vec{g}^T\vec{g}+\frac{1}{2}\epsilon^2\vec{g}^TH\vec{g}$
- the original value of the function, $f(\vec{x}^{(0)})$
- the expected improvement due to the slope of the function, $-\epsilon\vec{g}^T\vec{g}$
- the correction we must apply to account for the curvature of the function, $\frac{1}{2}\epsilon^2\vec{g}^TH\vec{g}$
- when $\vec{g}^TH\vec{g}$ is positive, minimizing this approximation over $\epsilon$ gives the optimal step size $\epsilon^*=\frac{\vec{g}^T\vec{g}}{\vec{g}^TH\vec{g}}$ (computed in the sketch below)
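A short sketch, assuming NumPy and reusing the quadratic $f(\vec{x})=\frac{1}{2}\vec{x}^T\mathbf{A}\vec{x}-\vec{b}^T\vec{x}$ from the earlier example (so the Hessian is simply $\mathbf{A}$), that computes $\epsilon^*$:

```python
import numpy as np

# Optimal step size eps* = (g^T g) / (g^T H g) along the negative gradient,
# for f(x) = 0.5*x^T A x - b^T x, whose Hessian is H = A everywhere.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x

x0 = np.zeros(2)
g = A @ x0 - b                      # gradient at x0
eps_star = (g @ g) / (g @ A @ g)    # valid here since g^T H g > 0
print(f(x0), f(x0 - eps_star * g))  # 0.0 vs. approx. -0.286
```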
- critical point: $f'(x)=0$
- if $f''(x)>0$, then $f'(x-\epsilon)<0$ and $f'(x+\epsilon)>0$ for small enough $\epsilon$
- local minimum: $f'(x)=0$ and $f''(x)>0$
- local maximum: $f'(x)=0$ and $f''(x)<0$
- Newton’s method
- based on using a second-order Taylor series
- $f(\vec{x})\approx f(\vec{x}^{(0)})+(\vec{x}-\vec{x}^{(0)})^T\nabla_{\vec{x}}f(\vec{x}^{(0)})+\frac{1}{2}(\vec{x}-\vec{x}^{(0)})^TH(f)(\vec{x}^{(0)})(\vec{x}-\vec{x}^{(0)})$
- Solving for the critical point gives $\vec{x}^*=\vec{x}^{(0)}-H(f)(\vec{x}^{(0)})^{-1}\nabla_{\vec{x}}f(\vec{x}^{(0)})$
- Newton’s method is only appropriate when the nearby critical point is a minimum (one-step sketch below)
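A minimal one-step sketch, assuming NumPy and a convex quadratic of my own choosing, for which a single Newton step jumps directly to the minimizer:

```python
import numpy as np

# One Newton step x* = x0 - H^{-1} grad f(x0); for the convex quadratic
# f(x) = 0.5*x^T A x - b^T x it lands exactly on the minimizer A^{-1} b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hessian = lambda x: A               # the Hessian of a quadratic is constant

x0 = np.array([5.0, -7.0])
x_star = x0 - np.linalg.solve(hessian(x0), grad(x0))
print(x_star, np.linalg.solve(A, b))  # both [0.2, 0.4]
```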
- Lipschitz continuity (of the function or its derivatives)
- A Lipschitz continuous function is a function $f$ whose rate of change is bounded by a Lipschitz constant $\mathcal{L}$: $\forall\vec{x},\forall\vec{y},\ |f(\vec{x})-f(\vec{y})|\leq\mathcal{L}\|\vec{x}-\vec{y}\|_2$
- weak constraint
- Convex optimization
- strong constraint
- all of their local minima are necessarily global minima
Constrained Optimization
- find the maximal or minimal value of $f(\vec{x})$ for values of $\vec{x}$ in some set $\mathbb{S}$
- methods:
- modify gradient descent to take the constraint into account
- design a different, unconstrained optimization problem whose solution can be converted into a solution to the original, constrained optimization problem
- Karush–Kuhn–Tucker (KKT) approach (need more information)
- generalized Lagrangian or generalized Lagrange function
- $\mathbb{S}=\{\vec{x}\mid\forall i,g^{(i)}(\vec{x})=0\text{ and }\forall j,h^{(j)}(\vec{x})\leq0\}$
- equality constraints $g^{(i)}$
- inequality constraints $h^{(j)}$
- KKT multipliers: $\lambda_i$ and $\alpha_j$ for each constraint
- $L(\vec{x},\vec{\lambda},\vec{\alpha})=f(\vec{x})+\sum\limits_i\lambda_ig^{(i)}(\vec{x})+\sum\limits_j\alpha_jh^{(j)}(\vec{x})$
- $\min\limits_{\vec{x}}\max\limits_{\vec{\lambda}}\max\limits_{\vec{\alpha},\vec{\alpha}\geq0}L(\vec{x},\vec{\lambda},\vec{\alpha})\Leftrightarrow\min\limits_{\vec{x}\in\mathbb{S}}f(\vec{x})$
- Karush-Kuhn-Tucker (KKT) conditions:
- The gradient of the generalized Lagrangian is zero
- All constraints on both $\vec{x}$ and the KKT multipliers are satisfied
- The inequality constraints exhibit “complementary slackness”: $\vec{\alpha}\odot h(\vec{x})=0$
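A small worked KKT check, assuming NumPy and a toy problem of my own choosing: minimize $f(\vec{x})=x_1^2+x_2^2$ subject to the single inequality constraint $h(\vec{x})=1-x_1-x_2\leq0$; the candidate solution $\vec{x}=(0.5,0.5)$ with multiplier $\alpha=1$ satisfies all three conditions:

```python
import numpy as np

# KKT check for: minimize x1^2 + x2^2  s.t.  h(x) = 1 - x1 - x2 <= 0
x = np.array([0.5, 0.5])
alpha = 1.0

grad_f = 2 * x                    # gradient of the objective
grad_h = np.array([-1.0, -1.0])   # gradient of the inequality constraint
h = 1.0 - x[0] - x[1]

# 1) Stationarity of the generalized Lagrangian: grad f + alpha * grad h = 0
print(grad_f + alpha * grad_h)    # [0. 0.]
# 2) Feasibility of x and nonnegativity of the multiplier
print(h <= 0, alpha >= 0)         # True True
# 3) Complementary slackness: alpha * h = 0
print(alpha * h)                  # 0.0
```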