C4-Numerical Computation

Overflow and Underflow

  • Underflow
    • occurs when numbers near zero are rounded to zero
  • Overflow
    • occurs when numbers with large magnitude are approximated as $\infty$ or $-\infty$
  • Softmax function
    • can be stabilized against underflow and overflow
    • used to predict the probabilities associated with a multinoulli distribution
    • $\text{softmax}(\vec{x})_i=\frac{\exp(x_i)}{\sum_{j=1}^n\exp(x_j)}$
    • evaluating $\text{softmax}(\vec{z})$ instead, where $\vec{z}=\vec{x}-\max_i x_i$, avoids the overflow and underflow that would otherwise leave the result undefined.
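As an illustrative sketch (NumPy assumed; the helper name `softmax` is ours), the max-subtraction trick looks like:

```python
import numpy as np

# A minimal stable-softmax sketch: subtracting max(x) leaves the result
# unchanged, keeps every exponent <= 0 so exp() cannot overflow, and
# guarantees one denominator term is exp(0) = 1 so the denominator
# cannot underflow to zero.
def softmax(x):
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

big = np.array([1000.0, 1000.0, 1000.0])   # naive exp(1000) overflows
print(softmax(big))                        # uniform: [1/3 1/3 1/3]
```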

Poor conditioning

  • Conditioning refers to how rapidly a function changes with respect to small changes in its inputs.
  • For $f(\vec{x})=\mathbf{A}^{-1}\vec{x}$ with $\mathbf{A}\in\mathbb{R}^{n\times n}$, the condition number is $\max_{i,j}\left|\frac{\lambda_i}{\lambda_j}\right|$,
  • namely, the ratio of the magnitudes of the largest and smallest eigenvalues.
  • When this number is large, matrix inversion is more sensitive to error in the input.
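A quick numerical sketch of this ratio (NumPy assumed; the matrices are made-up examples):

```python
import numpy as np

# Condition number as the ratio of the largest to smallest eigenvalue
# magnitude; the two matrices are illustrative.
def condition_number(A):
    lam = np.abs(np.linalg.eigvals(A))
    return lam.max() / lam.min()

A_well = np.array([[2.0, 0.0], [0.0, 1.0]])    # eigenvalues 2 and 1
A_ill  = np.array([[1.0, 0.0], [0.0, 1e-8]])   # eigenvalues 1 and 1e-8

print(condition_number(A_well))   # 2.0
print(condition_number(A_ill))    # ~1e8: inversion amplifies input error
```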

Gradient-Based Optimization

  • objective function or criterion: the function we want to minimize or maximize (for minimization problems, it may also be called the cost function, loss function, or error function)
  • $\vec{x}^*=\arg\min f(\vec{x})$
  • gradient descent is based on the first-order approximation $f(x+\epsilon)\approx f(x)+\epsilon f'(x)$
    Fig4.1
  • critical points or stationary points: points where $f'(x)=0$
  • local minimum: a point where $f(x)$ is lower than at all neighboring points
  • local maximum: a point where $f(x)$ is higher than at all neighboring points
    Fig4.2
  • saddle points: critical points that are neither maxima nor minima
  • global minimum: a point that attains the absolute lowest value of $f(x)$
    Fig4.3
  • partial derivative $\frac{\partial}{\partial x_i}f(\vec{x})$: measures how $f$ changes as only the variable $x_i$ increases at point $\vec{x}$.
  • the gradient of $f$, denoted $\nabla_{\vec{x}}f(\vec{x})$, is the vector containing all of the partial derivatives with respect to the $x_i$
  • the directional derivative in direction $\vec{u}$ (a unit vector) is the slope of the function in direction $\vec{u}$.
    • it is the derivative of $f(\vec{x}+\alpha\vec{u})$ with respect to $\alpha$, evaluated at $\alpha=0$
    • namely, $\frac{\partial}{\partial\alpha}f(\vec{x}+\alpha\vec{u})$ evaluates to $\vec{u}^T\nabla_{\vec{x}}f(\vec{x})$ when $\alpha=0$
    • To minimize $f$, we find the direction in which $f$ decreases fastest:
      • $\min\limits_{\vec{u},\,\vec{u}^T\vec{u}=1}\vec{u}^T\nabla_{\vec{x}}f(\vec{x})=\min\limits_{\vec{u},\,\vec{u}^T\vec{u}=1}\|\vec{u}\|_2\|\nabla_{\vec{x}}f(\vec{x})\|_2\cos\theta$, where $\theta$ is the angle between $\vec{u}$ and the gradient.
      • this is minimized when $\vec{u}$ points opposite the gradient, so we decrease $f$ by moving in the direction of the negative gradient
    • steepest descent or gradient descent
      • $\vec{x}'=\vec{x}-\epsilon\nabla_{\vec{x}}f(\vec{x})$, where $\epsilon$ is the learning rate.
      • choosing $\epsilon$:
        • set $\epsilon$ to a small constant.
        • line search: evaluate $f(\vec{x}-\epsilon\nabla_{\vec{x}}f(\vec{x}))$ for several values of $\epsilon$ and choose the one that results in the smallest objective function value
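A minimal sketch of gradient descent with this kind of line search on a made-up quadratic (NumPy assumed; the matrix, candidate step sizes, and iteration count are illustrative):

```python
import numpy as np

# Minimize f(v) = 1/2 v^T A v - b^T v, whose gradient is A @ v - b;
# A, b, and the candidate step sizes are illustrative.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([3.0, 1.0])                  # exact minimizer is [1, 1]

f = lambda v: 0.5 * v @ A @ v - b @ v
grad = lambda v: A @ v - b

x = np.zeros(2)
candidates = [0.01, 0.1, 0.3, 0.5]        # line-search values of epsilon
for _ in range(100):
    g = grad(x)
    # line search: evaluate f at each candidate step, keep the best
    x = min((x - eps * g for eps in candidates), key=f)

print(x)   # close to [1. 1.]
```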

Beyond the Gradient: Jacobian and Hessian Matrices

  • Jacobian matrix
    • if we have a function $f:\mathbb{R}^m\rightarrow\mathbb{R}^n$, then the Jacobian matrix $J\in\mathbb{R}^{n\times m}$ of $f$ is defined such that $J_{i,j}=\frac{\partial}{\partial x_j}f(\vec{x})_i$
  • second derivative: a derivative of a derivative; it can be regarded as measuring curvature.
    Fig4.4
  • Hessian matrix $H(f)(\vec{x})$
    • $H(f)(\vec{x})_{i,j}=\frac{\partial^2}{\partial x_i\partial x_j}f(\vec{x})$
    • the Hessian is the Jacobian of the gradient
    • $H_{i,j}=H_{j,i}$
    • Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.
    • second-order Taylor series approximation:
      • $f(\vec{x})\approx f(\vec{x}^{(0)})+(\vec{x}-\vec{x}^{(0)})^T\vec{g}+\frac{1}{2}(\vec{x}-\vec{x}^{(0)})^TH(\vec{x}-\vec{x}^{(0)})$, where $\vec{g}$ is the gradient and $H$ is the Hessian at $\vec{x}^{(0)}$.
      • With a gradient step to $\vec{x}^{(0)}-\epsilon\vec{g}$, we get $f(\vec{x}^{(0)}-\epsilon\vec{g})\approx f(\vec{x}^{(0)})-\epsilon\vec{g}^T\vec{g}+\frac{1}{2}\epsilon^2\vec{g}^TH\vec{g}$, whose three terms are:
      • the original value of the function, $f(\vec{x}^{(0)})$
      • the expected improvement due to the slope of the function, $-\epsilon\vec{g}^T\vec{g}$
      • the correction we must apply to account for the curvature of the function, $\frac{1}{2}\epsilon^2\vec{g}^TH\vec{g}$
      • when $\vec{g}^TH\vec{g}$ is positive, minimizing the approximation over $\epsilon$ gives the optimal step size $\epsilon^*=\frac{\vec{g}^T\vec{g}}{\vec{g}^TH\vec{g}}$
      • critical point: $f'(x)=0$
        • if $f''(x)>0$, then $f'(x-\epsilon)<0$ and $f'(x+\epsilon)>0$ for small enough $\epsilon$
      • local minimum: $f'(x)=0$ and $f''(x)>0$
      • local maximum: $f'(x)=0$ and $f''(x)<0$
        Fig4.5
        Fig4.6
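For a quadratic function the second-order approximation is exact, so this $\epsilon^*$ reaches the exact minimum along $-\vec{g}$. A small check (NumPy assumed; the Hessian and starting point are arbitrary illustrative values):

```python
import numpy as np

# The step eps* = (g^T g) / (g^T H g) lands on the exact line minimum
# along -g when f is quadratic; H and x0 are illustrative.
H = np.array([[4.0, 1.0], [1.0, 2.0]])   # positive definite Hessian
f = lambda v: 0.5 * v @ H @ v            # gradient is H @ v
x0 = np.array([1.0, 1.0])

g = H @ x0                               # gradient at x0
eps_star = (g @ g) / (g @ H @ g)         # well-defined since g^T H g > 0
x1 = x0 - eps_star * g

print(f(x1) < f(x0))                     # True
```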
  • Newton’s method
    • based on using a second-order Taylor series expansion
    • $f(\vec{x})\approx f(\vec{x}^{(0)})+(\vec{x}-\vec{x}^{(0)})^T\nabla_{\vec{x}}f(\vec{x}^{(0)})+\frac{1}{2}(\vec{x}-\vec{x}^{(0)})^TH(f)(\vec{x}^{(0)})(\vec{x}-\vec{x}^{(0)})$
    • solving for the critical point gives $\vec{x}^*=\vec{x}^{(0)}-H(f)(\vec{x}^{(0)})^{-1}\nabla_{\vec{x}}f(\vec{x}^{(0)})$
    • Newton’s method is only appropriate when the nearby critical point is a minimum
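A sketch on a made-up quadratic (NumPy assumed): the second-order expansion is exact here, so a single Newton step lands on the critical point.

```python
import numpy as np

# One Newton step x* = x0 - H^{-1} grad(x0) reaches the critical point
# of a quadratic exactly; H, b, and x0 are illustrative.
H = np.array([[3.0, 1.0], [1.0, 2.0]])       # positive definite Hessian
b = np.array([1.0, 1.0])
grad = lambda v: H @ v - b                   # gradient of 1/2 v^T H v - b^T v

x0 = np.array([10.0, -10.0])
x_star = x0 - np.linalg.solve(H, grad(x0))   # one Newton step

print(np.allclose(grad(x_star), 0.0))        # True: critical point found
```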
  • Lipschitz continuity (of the function or its derivatives)
    • A Lipschitz continuous function is a function $f$ whose rate of change is bounded by a Lipschitz constant $\mathcal{L}$: $\forall\vec{x},\forall\vec{y},\ |f(\vec{x})-f(\vec{y})|\leq\mathcal{L}\|\vec{x}-\vec{y}\|_2$
    • a weak constraint
  • Convex optimization
    • applies only to convex functions (those whose Hessian is positive semidefinite everywhere), a strong constraint
    • all of their local minima are necessarily global minima
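As an illustrative check of the Lipschitz bound (NumPy assumed): sin has derivative bounded by 1 in magnitude, so $\mathcal{L}=1$ is a valid Lipschitz constant, and the inequality holds for random point pairs.

```python
import numpy as np

# sin has |derivative| <= 1, so it is Lipschitz with constant L = 1;
# sample random pairs and check |f(x) - f(y)| <= L * |x - y|.
rng = np.random.default_rng(0)
xs = rng.uniform(-10.0, 10.0, 1000)
ys = rng.uniform(-10.0, 10.0, 1000)
L = 1.0
violations = np.abs(np.sin(xs) - np.sin(ys)) > L * np.abs(xs - ys)
print(violations.sum())   # 0: the bound always holds
```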

Constrained Optimization

  • find the maximal or minimal value of $f(\vec{x})$ for values of $\vec{x}$ in some set $\mathbb{S}$.
  • method:
    • modify gradient descent taking the constraint into account
    • design a different, unconstrained optimization problem whose solution can be converted into a solution to the original, constrained optimization problem
    • Karush–Kuhn–Tucker (KKT) approach (Need more information)
      • generalized Lagrangian or generalized Lagrange function
        • $\mathbb{S}=\{\vec{x}\mid\forall i,\ g^{(i)}(\vec{x})=0\text{ and }\forall j,\ h^{(j)}(\vec{x})\leq0\}$
        • equality constraints $g^{(i)}$
        • inequality constraints $h^{(j)}$
        • KKT multipliers: $\lambda_i$ and $\alpha_j$ for each constraint
        • $L(\vec{x},\vec{\lambda},\vec{\alpha})=f(\vec{x})+\sum\limits_i\lambda_ig^{(i)}(\vec{x})+\sum\limits_j\alpha_jh^{(j)}(\vec{x})$
    • $\min\limits_{\vec{x}}\max\limits_{\vec{\lambda}}\max\limits_{\vec{\alpha},\vec{\alpha}\geq0}L(\vec{x},\vec{\lambda},\vec{\alpha})\Leftrightarrow\min\limits_{\vec{x}\in\mathbb{S}}f(\vec{x})$
    • Karush-Kuhn-Tucker (KKT) conditions:
      • The gradient of the generalized Lagrangian is zero
      • All constraints on both x ⃗ \vec{x} x and the KKT multipliers are satisfied
      • The inequality constraints exhibit "complementary slackness": $\vec{\alpha}\odot h(\vec{x})=0$
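A tiny equality-constrained sketch (NumPy assumed; the objective and constraint are made up): minimize $f(x,y)=x^2+y^2$ subject to $g(x,y)=x+y-1=0$. Setting the gradient of the Lagrangian to zero together with the constraint gives a linear system.

```python
import numpy as np

# Lagrangian L = x^2 + y^2 + lam * (x + y - 1); its stationarity
# conditions plus the constraint form a linear system in (x, y, lam):
#   2x + lam = 0,  2y + lam = 0,  x + y = 1
M = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(M, rhs)
print(x, y, lam)   # 0.5 0.5 -1.0
```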