Week 13: Newton method and barrier method
1 Newton method (Second order method)
1.1 Motivation
- Gradient methods are first order: they iterate using a linear approximation of $f$.
- Gradient descent is not affine invariant: a linear or affine change of variables changes the convergence rate.
- Idea: pick the change of variables that gives the best convergence rate, which requires the (transformed) Hessian to be the identity. This leads to the second-order method: Newton's method.
1.2 Idea of Newton method
- Use local quadratic approximation
$$g(x)=f(x_0)+\nabla f(x_0)^T(x-x_0)+\frac{1}{2}(x-x_0)^T\nabla^2 f(x_0)(x-x_0)$$
$$x_+=\arg\min_x g(x)=x_0-[\nabla^2 f(x_0)]^{-1}\nabla f(x_0)$$
1.3 Newton method
$$x_+=x_0-\eta[\nabla^2 f(x_0)]^{-1}\nabla f(x_0)$$
- Affine invariant
- Needs the Hessian $\nabla^2 f$
- Converges fast
- Expensive per iteration
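A minimal sketch of the update (the toy objective below is an assumption for illustration, not from the notes):

```python
import numpy as np

def newton_step(grad, hess, x):
    """One pure-Newton step (eta = 1): x+ = x - [hess(x)]^{-1} grad(x)."""
    return x - np.linalg.solve(hess(x), grad(x))

# Toy objective (assumed): f(x) = x1^4 + x2^2, minimized at the origin.
grad = lambda x: np.array([4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2, 0.0], [0.0, 2.0]])

x = np.array([1.0, 1.0])
for _ in range(20):
    x = newton_step(grad, hess, x)
print(x)  # close to the minimizer [0, 0]
```

The quadratic coordinate is solved exactly in one step; the quartic one contracts geometrically toward 0.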
1.4 Step size
- $\eta=1$: pure Newton, may not converge
- Backtracking line search (BTLS): choose $\alpha<1/2$, $\beta<1$, compute $d=[\nabla^2 f(x)]^{-1}\nabla f(x)$, and start from $\eta=1$:
  while $f(x-\eta d)>f(x)-\alpha \eta \nabla f(x)^T d$:
  $\quad \eta=\beta\eta$
Backtracking line search chooses the step size by starting from an optimistic step size ($\eta=1$) and checking whether we were too optimistic; if so, the step size is shrunk by the fraction $\beta$. The parameter $\alpha$ sets the threshold for how much decrease counts as "optimistic enough".
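The loop above can be sketched in Python (the test function $f(x)=\sqrt{1+x^2}$ and the constants are assumptions for illustration, not from the notes):

```python
import numpy as np

def backtracking_newton_step(f, grad, hess, x, alpha=0.25, beta=0.5):
    """One Newton step with backtracking line search (alpha < 1/2, beta < 1)."""
    d = grad(x) / hess(x)        # Newton direction [hess(x)]^{-1} grad(x), 1-D case
    eta = 1.0                    # start optimistic
    while f(x - eta * d) > f(x) - alpha * eta * grad(x) * d:
        eta *= beta              # too optimistic: shrink by beta
    return x - eta * d

# Toy 1-D function (assumed): f(x) = sqrt(1 + x^2), minimized at x* = 0.
# Pure Newton (eta = 1) diverges from x0 = 3; BTLS damps the early steps.
f = lambda x: np.sqrt(1 + x * x)
grad = lambda x: x / np.sqrt(1 + x * x)
hess = lambda x: (1 + x * x) ** (-1.5)

x = 3.0
for _ in range(20):
    x = backtracking_newton_step(f, grad, hess, x)
print(abs(x))  # near the minimizer x* = 0
```

This example also shows the two phases discussed next: a few damped steps with $\eta<1$, then full steps with very fast convergence near the optimum.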
1.5 Convergence with BTLS
- Prerequisite 1: $mI \preceq \nabla^2 f(x) \preceq MI$
- Prerequisite 2: $\nabla^2 f(x)$ is $L$-Lipschitz
- Two-phase convergence:
- Damped phase ($\|\nabla f(x)\|\geq \alpha$): $f(x_t)-f^*\leq f(x_0)-f^*-\gamma t$
- Pure phase ($\|\nabla f(x)\|< \alpha$, BTLS selects $\eta=1$): $f(x_t)-f^*\leq \frac{2m^3}{L^2}\left(\frac{1}{2}\right)^{2^{t-t_0+1}}$, equivalently $\frac{L}{2m^2}\|\nabla f(x_+)\|\leq\left(\frac{L}{2m^2}\|\nabla f(x)\|\right)^2$.
- Steps to reach $\varepsilon$ accuracy: $\frac{f(x_0)-f^*}{\gamma}+\log\log\left(\frac{\varepsilon_0}{\varepsilon}\right)$, i.e. quadratic convergence
- Error shrinks like $\left(\frac{1}{2}\right)^{2^t}$
- Compare gradient descent: $\log\left(\frac{1}{\varepsilon}\right)$ steps, linear convergence.
1.6 Scale free Newton
1.6.1 Definition
- A one-dimensional convex function $f:\mathbb{R}\rightarrow \mathbb{R}$ is self-concordant if $|f'''(x)|\leq 2[f''(x)]^{\frac{3}{2}}$.
- An $n$-dimensional convex function $f:\mathbb{R}^n\rightarrow \mathbb{R}$ is self-concordant if every 1-d restriction of it is self-concordant.
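A standard sanity check (example assumed, not from the notes): $f(x)=-\log x$ has $f''(x)=1/x^2$ and $f'''(x)=-2/x^3$, so the defining inequality holds with equality for every $x>0$:

```python
import math

# f(x) = -log(x): f''(x) = 1/x^2, f'''(x) = -2/x^3.
# Self-concordance requires |f'''(x)| <= 2 [f''(x)]^{3/2};
# for -log it holds with equality at every x > 0.
for x in [0.1, 1.0, 7.5, 100.0]:
    f2 = 1.0 / x ** 2           # second derivative
    f3 = -2.0 / x ** 3          # third derivative
    assert math.isclose(abs(f3), 2.0 * f2 ** 1.5)
```

This is why the log barrier used in Section 3 fits so well with Newton's method.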
1.6.2 Convergence
Newton with BTLS $(\alpha,\beta)$ for a self-concordant $f$ reaches $\varepsilon$-optimality in $c(\alpha,\beta)(f(x_0)-f^*)+\log\log\left(\frac{1}{\varepsilon}\right)$ iterations.
2 Quasi Newton methods
2.1 Motivation
- Gradient descent: $x_{t+1}=x_t-\eta I \nabla f(x_t)$
- Newton: $x_{t+1}=x_t-\eta [\nabla^2f(x_t)]^{-1} \nabla f(x_t)$
- Quasi-Newton: $x_{t+1}=x_t-\eta H_t \nabla f(x_t)$, where $H_t$ is a cheap approximation of $[\nabla^2f(x_t)]^{-1}$
2.2 Basic idea
Let $s=x_+-x$ and $y=\nabla f(x_+)-\nabla f(x)$. Solve the secant equation $B_+s=y$ for $B_+$ through a cheap update of $B$, keeping $B_+$ symmetric and positive semidefinite. Then solve $B_+p=-\nabla f(x)$ for $p$, and finally approximate the Newton step $x_+=x-\eta[\nabla^2 f(x)]^{-1}\nabla f(x)$ with $x_+=x+\eta p$.
https://en.wikipedia.org/wiki/Quasi-Newton_method
Use the Sherman–Morrison formula to update the inverse $B^{-1}$ directly, so no linear solve is needed at each iteration.
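One standard instance is the BFGS update (not derived in the notes). Written directly on the inverse approximation $H\approx B^{-1}$, which is what the Sherman–Morrison trick buys, it reads $H_+=(I-\rho s y^T)H(I-\rho y s^T)+\rho s s^T$ with $\rho=1/(y^Ts)$, and by construction it satisfies the inverse secant equation $H_+y=s$:

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """BFGS update of the inverse Hessian approximation H.

    s = x_+ - x, y = grad f(x_+) - grad f(x).
    Satisfies the inverse secant equation H_+ y = s, keeps H_+ symmetric,
    and keeps it positive definite when the curvature condition y^T s > 0 holds.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Quick check of the secant equation H_+ y = s on random data.
rng = np.random.default_rng(0)
s, y = rng.standard_normal(3), rng.standard_normal(3)
if y @ s < 0:                 # enforce y^T s > 0 for the check
    y = -y
H_plus = bfgs_inverse_update(np.eye(3), s, y)
print(np.allclose(H_plus @ y, s))  # True
```

Note that $H_+y=s$ holds for any starting $H$: $(I-\rho y s^T)y = y - y = 0$, so only the $\rho s s^T y = s$ term survives.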
2.3 Convergence
- Superlinear convergence under strong convexity plus extra assumptions.
3 Barrier method (with constraints)
3.1 Basic idea
$$\begin{aligned} \min_x& \quad f(x)\\ \text{s.t.}& \quad h_i(x)\leq 0,\ i=1\dots m\\ & \quad Ax=b \end{aligned}$$
Bring the constraints into the objective function using the indicator function
$$I(u)=\begin{cases}\infty & \text{if } u>0\\ 0 & \text{otherwise.}\end{cases}$$
$$\begin{aligned} \min_x& \quad f(x)+\sum_i I(h_i(x))\\ \text{s.t.}& \quad Ax=b \end{aligned}$$
Then use a smooth function
$$\hat I(u)=-\frac{1}{t}\log(-u)$$
to approximate $I(u)$; as $t\rightarrow \infty$, $\hat I(u)\rightarrow I(u)$.
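A quick numerical look at the approximation (sample points are assumptions for illustration):

```python
import math

def barrier_approx(u, t):
    """Smooth approximation -(1/t) log(-u) of the indicator I(u), valid for u < 0."""
    return -math.log(-u) / t

# At a strictly feasible point (u < 0) the approximation tends to I(u) = 0 as t grows.
for t in [1, 10, 100, 1000]:
    print(t, barrier_approx(-0.5, t))

# As u approaches 0 from below, the penalty grows without bound, mimicking I(u) = infinity.
print(barrier_approx(-1e-9, 1))
```

The first limit ($t\to\infty$ at fixed $u<0$) gives the indicator's zero branch; the second ($u\to 0^-$ at fixed $t$) gives its infinite branch.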
$$\begin{aligned} \min_x& \quad tf(x)+\left[-\sum_{i=1}^m \log (-h_i(x))\right]\\ \text{s.t.}& \quad Ax=b \end{aligned}$$
3.2 Barrier method
- Solve a sequence of problems: $$\begin{aligned} \min_x& \quad t^kf(x)+\left[-\sum_{i=1}^m \log (-h_i(x))\right]\\ \text{s.t.}& \quad Ax=b \end{aligned}$$
- Start from an initial $t^0$.
- At each epoch $k$, find $x^*(t^k)$ using Newton's method started at $x^*(t^{k-1})$; then increase $t^{k+1}=\mu t^k$.
3.2.1 Central path
Solving $$\begin{aligned} \min_x& \quad t^kf(x)+\left[-\sum_{i=1}^m \log (-h_i(x))\right]\\ \text{s.t.}& \quad Ax=b \end{aligned}$$ gives the optimal solution $x^*(t^k)$. The curve traced by these solutions as $t^k$ varies is the central path, and $x^*(t^k)\rightarrow x^*$ as $k\rightarrow\infty$.
3.2.2 Choose t t t
- Initially, use a small $t$ to avoid bad conditioning; gradually increase $t$ as the iterates approach $x^*$.
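The whole scheme can be sketched end to end on a toy problem (the problem, the damping rule, and all constants are assumptions for illustration, not from the notes): minimize $x$ subject to $x\geq 1$, i.e. $h(x)=1-x\leq 0$, whose optimum is $x^*=1$. The centering problem is $\min_x\, t x-\log(x-1)$, with minimizer $x^*(t)=1+1/t$:

```python
def centering_newton(t, x, iters=50):
    """Minimize t*x - log(x - 1) with crudely damped Newton (toy inner solver)."""
    for _ in range(iters):
        g = t - 1.0 / (x - 1.0)        # gradient of the barrier objective
        h = 1.0 / (x - 1.0) ** 2       # Hessian (here a second derivative)
        step = g / h
        eta = 1.0                      # damp the step to stay strictly feasible (x > 1)
        while x - eta * step <= 1.0:
            eta *= 0.5
        x -= eta * step
    return x

# Barrier method for: min x  s.t.  x >= 1.
t, mu, x = 1.0, 10.0, 5.0              # strictly feasible start, t^{k+1} = mu * t^k
while t < 1e8:
    x = centering_newton(t, x)         # warm start from the previous x*(t)
    t *= mu
print(x)  # close to x* = 1; the central path is x*(t) = 1 + 1/t
```

Each outer epoch lands on the central path point $x^*(t^k)=1+1/t^k$, and the warm start keeps the inner Newton solves cheap, exactly as described above.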