Optimization Week 13: Newton method and barrier method

1 Newton method (Second order method)

1.1 Motivation

  • Gradient methods are first-order methods: they iterate using a linear (first-order) approximation of the objective.
  • Gradient descent is not affine invariant: a linear or affine change of variables changes the convergence rate.
  • We can therefore pick the change of variables that gives the best convergence rate, which amounts to making the Hessian the identity. This leads to the second-order method: Newton's method.

1.2 Idea of Newton method

  • Use the local quadratic approximation
    $$g(x)=f(x_0)+\nabla f(x_0)^T(x-x_0)+\tfrac{1}{2}(x-x_0)^T\nabla^2 f(x_0)(x-x_0)$$
    and take the next iterate to be its minimizer (derived just below):
    $$x_+=\arg\min_x g(x)=x_0-[\nabla^2 f(x_0)]^{-1}\nabla f(x_0)$$
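Setting the gradient of the quadratic model to zero (assuming $\nabla^2 f(x_0)\succ 0$, so the model has a unique minimizer) gives the update:
$$\nabla g(x)=\nabla f(x_0)+\nabla^2 f(x_0)(x-x_0)=0 \;\Longrightarrow\; x_+=x_0-[\nabla^2 f(x_0)]^{-1}\nabla f(x_0).$$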

1.3 Newton method

$$x_+=x_0-\eta\,[\nabla^2 f(x_0)]^{-1}\nabla f(x_0)$$

  • Affine invariant
  • Needs the Hessian $\nabla^2 f$
  • Converges fast
  • Each iteration is expensive

1.4 Step size

  • $\eta=1$: pure Newton step; may not converge globally.
  • Backtracking line search (BTLS): pick $\alpha<1/2$, $\beta<1$, set $d=[\nabla^2 f(x_0)]^{-1}\nabla f(x_0)$, start with $\eta=1$, and repeat
    while $f(x-\eta d)>f(x)-\alpha\eta\nabla f(x)^T d$:
        $\eta=\beta\eta$
    Backtracking line search chooses the step size by starting from a very optimistic step size and checking whether we are too optimistic; if we are, the step is shrunk by the fraction $\beta$. The parameter $\alpha$ decides how demanding the acceptance test is. A minimal Python sketch follows this list.
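A minimal sketch of damped Newton with backtracking line search, assuming the user supplies callables `f`, `grad`, and `hess` (the function name and default parameters are illustrative, not from the notes):

```python
import numpy as np

def newton_btls(f, grad, hess, x0, alpha=0.25, beta=0.5, tol=1e-10, max_iter=100):
    """Damped Newton's method with backtracking line search (BTLS)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        H = hess(x)
        # Newton direction: solve H d = g instead of forming the inverse explicitly.
        d = np.linalg.solve(H, g)
        # (Squared) Newton decrement, used as the stopping criterion.
        if g @ d / 2 <= tol:
            break
        # Backtracking: start optimistic (eta = 1) and shrink by beta until
        # the sufficient-decrease condition holds.
        eta = 1.0
        while f(x - eta * d) > f(x) - alpha * eta * (g @ d):
            eta *= beta
        x = x - eta * d
    return x


# Example: a strictly convex quadratic, minimized in a single Newton step.
if __name__ == "__main__":
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    x_star = newton_btls(lambda x: 0.5 * x @ A @ x - b @ x,
                         lambda x: A @ x - b,
                         lambda x: A,
                         np.zeros(2))
    print(x_star, np.linalg.solve(A, b))  # the two should match
```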

1.5 Convergence with BTLS

  • Prerequisite 1: $mI \preceq \nabla^2 f(x) \preceq MI$
  • Prerequisite 2: $\nabla^2 f(x)$ is $L$-Lipschitz
  • Two-phase convergence:
    • Damped phase ($\|\nabla f(x)\| \geq \alpha$): $f(x_t)-f^* \leq f(x_0)-f^*-\gamma t$
    • Pure phase ($\|\nabla f(x)\| < \alpha$, BTLS selects $\eta=1$): $f(x_t)-f^* \leq \frac{2m^3}{L^2}\left(\frac{1}{2}\right)^{2^{t-t_0+1}}$, or equivalently $\frac{L}{2m^2}\|\nabla f(x_+)\| \leq \left(\frac{L}{2m^2}\|\nabla f(x)\|\right)^2$.
  • Steps to reach $\varepsilon$ accuracy: $\frac{f(x_0)-f^*}{\gamma}+\log\log\left(\frac{\varepsilon_0}{\varepsilon}\right)$ (with $\varepsilon_0=\frac{2m^3}{L^2}$): quadratic convergence
  • Error decays like $\left(\frac{1}{2}\right)^{2^t}$
  • Gradient descent needs $\log\left(\frac{1}{\varepsilon}\right)$ iterations: linear convergence (see the illustration after this list).
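As a rough numerical illustration (numbers are mine, not from the notes): for $\varepsilon=10^{-10}$, the quadratic phase needs about $\log_2\log_2(10^{10}) \approx \log_2(33.2) \approx 5$ additional Newton steps, whereas an error that only shrinks by a constant factor of $\frac{1}{2}$ per step (linear convergence) needs about $\log_2(10^{10}) \approx 33$ steps.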

1.6 Scale-free Newton

1.6.1 Definition

  • A one-dimensional convex function $f:\mathbb{R}\rightarrow\mathbb{R}$ is self-concordant if $|f'''(x)|\leq 2[f''(x)]^{\frac{3}{2}}$.
  • An $n$-dimensional convex function $f:\mathbb{R}^n\rightarrow\mathbb{R}$ is self-concordant if its restriction to every line is self-concordant.
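For example (a standard example, not worked in the notes), the log barrier $f(x)=-\log x$ on $x>0$ is self-concordant, with equality in the defining inequality:
$$f''(x)=\frac{1}{x^2},\qquad f'''(x)=-\frac{2}{x^3},\qquad |f'''(x)|=\frac{2}{x^3}=2\left(\frac{1}{x^2}\right)^{3/2}=2[f''(x)]^{\frac{3}{2}}.$$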

1.6.2 Convergence

Newton's method with BTLS parameters $(\alpha,\beta)$ applied to a self-concordant $f$ reaches $\varepsilon$-optimality in $c(\alpha,\beta)\,(f(x_0)-f^*)+\log\log\left(\frac{1}{\varepsilon}\right)$ iterations.

2 Quasi-Newton methods

2.1 Motivation

  • Gradient descent: $x_{t+1}=x_t-\eta\, I\, \nabla f(x_t)$ (cheap iterations, slow convergence)
  • Newton's method: $x_{t+1}=x_t-\eta\, [\nabla^2 f(x_t)]^{-1}\nabla f(x_t)$ (fast convergence, but each iteration needs the Hessian and a linear solve)
  • Quasi-Newton: $x_{t+1}=x_t-\eta\, H_t\, \nabla f(x_t)$, where $H_t$ is a cheap approximation of the inverse Hessian

2.2 Basic idea

Let $s=x_+-x$ and $y=\nabla f(x_+)-\nabla f(x)$. Solve the secant equation $B_+ s=y$ for $B_+$ via a cheap update of $B$ that keeps $B_+$ symmetric and positive semidefinite. Then solve $B_+ p=-\nabla f(x)$ for $p$, and finally approximate the Newton update $x_+=x-\eta[\nabla^2 f(x)]^{-1}\nabla f(x)$ with $x_+=x+\eta p$.

https://en.wikipedia.org/wiki/Quasi-Newton_method

Use the Sherman–Morrison formula to update the inverse of $B$ directly, so that no linear system has to be solved from scratch at each iteration; a sketch of such an update is given below.
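A minimal sketch of one quasi-Newton step using the BFGS inverse-Hessian update (BFGS is one particular quasi-Newton scheme; the notes do not commit to a specific update, so the formula and function names below are illustrative):

```python
import numpy as np

def bfgs_update_inverse(H, s, y):
    """BFGS update of the inverse-Hessian approximation H.

    s = x_plus - x, y = grad(x_plus) - grad(x).
    H stays symmetric positive definite as long as s @ y > 0.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

def quasi_newton_step(x, grad, H, eta=1.0):
    """One step x_+ = x - eta * H * grad(x), followed by the update of H."""
    g = grad(x)
    x_new = x - eta * H @ g
    s, y = x_new - x, grad(x_new) - g
    return x_new, bfgs_update_inverse(H, s, y)
```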

2.3 Convergence

  • Superlinear convergence under strong convexity plus extra assumptions.

3 Barrier method (with constraints)

3.1 Basic idea

$$\begin{aligned} \min_x\quad & f(x)\\ \text{s.t.}\quad & h_i(x)\leq 0,\ i=1,\dots,m\\ & Ax=b \end{aligned}$$
Bring the inequality constraints into the objective using the indicator function $I(u)=\infty$ if $u\geq 0$, and $I(u)=0$ otherwise:
$$\begin{aligned} \min_x\quad & f(x)+\sum_i I(h_i(x))\\ \text{s.t.}\quad & Ax=b \end{aligned}$$
Then approximate $I(u)$ by the smooth function $-\frac{1}{t}\log(-u)$; as $t\rightarrow\infty$, the approximation tends to $I(u)$.

$$\begin{aligned} \min_x\quad & t f(x)+\left[-\sum_{i=1}^m \log(-h_i(x))\right]\\ \text{s.t.}\quad & Ax=b \end{aligned}$$
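As a tiny example (mine, not from the notes), take $\min_x x$ subject to $x\geq 0$, i.e. $h(x)=-x$. The barrier problem is $\min_x\, t x-\log x$, whose optimality condition $t-\frac{1}{x}=0$ gives $x^*(t)=\frac{1}{t}$; as $t\rightarrow\infty$, $x^*(t)\rightarrow 0=x^*$.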

3.2 Barrier method

  • Solve a sequence of problems: $$\begin{aligned} \min_x\quad & t^k f(x)+\left[-\sum_{i=1}^m \log(-h_i(x))\right]\\ \text{s.t.}\quad & Ax=b \end{aligned}$$
  • Start from an initial $t^0$.
  • At each epoch $k$, find $x^*(t^k)$ using Newton's method started at $x^*(t^{k-1})$, then increase $t^{k+1}=\mu t^k$ (a sketch of this outer loop follows the list).
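A minimal sketch of this outer loop for the case without equality constraints, reusing the `newton_btls` solver sketched earlier; `f`, `h_list`, and the derivative callables are user-supplied, and the stopping rule $m/t\leq\varepsilon$ uses the standard duality-gap bound (names and defaults are illustrative):

```python
import numpy as np

def barrier_method(f, grad_f, hess_f, h_list, grad_h_list, hess_h_list,
                   x0, t0=1.0, mu=10.0, eps=1e-8):
    """Log-barrier method for min f(x) s.t. h_i(x) <= 0 (no equality constraints).

    x0 must be strictly feasible: h_i(x0) < 0 for all i.
    """
    m = len(h_list)
    x, t = np.asarray(x0, dtype=float), t0
    while m / t > eps:  # suboptimality of x*(t) is at most m/t
        # Centering step: minimize phi(x) = t*f(x) - sum_i log(-h_i(x))
        # with Newton's method, warm-started at the previous central point.
        def phi(z):
            return t * f(z) - sum(np.log(-h(z)) for h in h_list)
        def grad_phi(z):
            g = t * grad_f(z)
            for h, gh in zip(h_list, grad_h_list):
                g = g - gh(z) / h(z)
            return g
        def hess_phi(z):
            H = t * hess_f(z)
            for h, gh, Hh in zip(h_list, grad_h_list, hess_h_list):
                H = H + np.outer(gh(z), gh(z)) / h(z) ** 2 - Hh(z) / h(z)
            return H
        x = newton_btls(phi, grad_phi, hess_phi, x)  # inner Newton solve
        t *= mu  # increase t and re-center
    return x
```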

3.2.1 Central path

Solving $$\begin{aligned} \min_x\quad & t^k f(x)+\left[-\sum_{i=1}^m \log(-h_i(x))\right]\\ \text{s.t.}\quad & Ax=b \end{aligned}$$ gives the optimal solution $x^*(t^k)$. The curve traced out by these solutions as $t^k$ grows is the central path, and $x^*(t^k)\rightarrow x^*$ as $k\rightarrow\infty$.
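A standard bound (quoted here as a known fact from duality, not derived in the notes) quantifies how suboptimal a point on the central path is:
$$f(x^*(t))-f^*\leq\frac{m}{t},$$
so choosing $t\geq m/\varepsilon$ guarantees $\varepsilon$-suboptimality.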

3.2.2 Choose t t t

  • Initially, use a small $t$ to avoid bad conditioning of the barrier subproblem; gradually increase $t$ as the iterates approach $x^*$.