# 1 Newton method (second-order method)

## 1.1 Motivation

• Gradient descent is a first-order method: it iterates using a linear approximation of the objective.
• Gradient descent is not affine invariant: a linear or affine change of variables changes the convergence rate.
• Ideally we would pick the change of variables that gives the best convergence rate, which requires the Hessian to be the identity. Doing this implicitly at every iterate yields the second-order method: Newton's method.

## 1.2 Idea of Newton method

Minimize the second-order Taylor approximation of $f$ around the current iterate $x_0$:

$$g(x)=f(x_0)+\nabla f(x_0)^T(x-x_0)+\frac{1}{2}(x-x_0)^T\nabla^2 f(x_0)(x-x_0)$$

$$x_+=\arg\min_x g(x)=x_0-[\nabla^2 f(x_0)]^{-1}\nabla f(x_0)$$

## 1.3 Newton method

$$x_+=x_0-\eta[\nabla^2 f(x_0)]^{-1}\nabla f(x_0)$$

• Affine invariant
• Needs the Hessian $\nabla^2 f$
• Converges fast (quadratically near the optimum)
• Each iteration is expensive (requires solving a linear system in the Hessian)
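As a concrete sketch, here is the pure Newton update applied to a small strictly convex objective; the objective, starting point, and iteration count are illustrative choices, not part of the notes:

```python
import numpy as np

def newton_step(grad, hess, x):
    """One pure Newton step: x+ = x - [Hess f(x)]^{-1} grad f(x).
    Solve the linear system rather than forming the inverse explicitly."""
    return x - np.linalg.solve(hess(x), grad(x))

# Illustrative strictly convex objective: f(x) = 0.5 x^T A x - sum_i log(x_i)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
grad = lambda x: A @ x - 1.0 / x
hess = lambda x: A + np.diag(1.0 / x**2)

x = np.array([1.0, 1.0])
for _ in range(20):
    x = newton_step(grad, hess, x)
# x now satisfies grad f(x) ~ 0
```

Solving `hess(x) @ d = grad(x)` instead of inverting the Hessian is the standard way to implement the update; it is cheaper and numerically more stable.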

## 1.4 Step size

• $\eta=1$: pure Newton; may not converge from a poor starting point.
• Backtracking line search (BTLS): pick $\alpha<1/2$, $\beta<1$, and set the Newton direction $d=[\nabla^2 f(x)]^{-1}\nabla f(x)$. Start with $\eta=1$; while $f(x-\eta d)>f(x)-\alpha\eta\nabla f(x)^Td$, shrink $\eta\leftarrow\beta\eta$.

Backtracking line search chooses the step size by starting from an optimistic step size ($\eta=1$) and checking whether we are too optimistic; if so, the step is decreased by the fraction $\beta$. The parameter $\alpha$ sets how much decrease counts as sufficient.
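A minimal sketch of this backtracking loop (the test function, starting point, and $\alpha,\beta$ values are illustrative assumptions):

```python
import numpy as np

def btls_newton_step(f, grad, hess, x, alpha=0.25, beta=0.5):
    """Newton step with backtracking line search:
    shrink eta by beta until the sufficient-decrease condition holds."""
    g = grad(x)
    d = np.linalg.solve(hess(x), g)     # Newton direction: H d = g
    eta = 1.0                           # start optimistic (pure Newton)
    while f(x - eta * d) > f(x) - alpha * eta * g @ d:
        eta *= beta                     # too optimistic: shrink the step
    return x - eta * d

# Illustrative smooth convex function: f(x) = log(e^{x1} + e^{-x1}) + x2^2.
# Pure Newton (eta = 1) diverges from x1 = 3 here, so the line search matters.
f = lambda x: np.logaddexp(x[0], -x[0]) + x[1]**2
grad = lambda x: np.array([np.tanh(x[0]), 2 * x[1]])
hess = lambda x: np.diag([1 / np.cosh(x[0])**2, 2.0])

x = np.array([3.0, 1.0])
for _ in range(30):
    x = btls_newton_step(f, grad, hess, x)
# x converges to the minimizer (0, 0)
```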

## 1.5 Convergence with BTLS

• Prerequisite 1: $mI\preceq\nabla^2f(x)\preceq MI$
• Prerequisite 2: $\nabla^2f(x)$ is $L$-Lipschitz
• Two-phase convergence:
  • Damped phase ($\|\nabla f(x)\|\geq \alpha$): $f(x_t)-f^*\leq f(x_0)-f^*-\gamma t$
  • Pure phase ($\|\nabla f(x)\|< \alpha$, BTLS selects $\eta=1$): $f(x_t)-f^*\leq \frac{2m^3}{L^2}\left(\frac{1}{2}\right)^{2^{t-t_0+1}}$, or equivalently $\frac{L}{2m^2}\|\nabla f(x_+)\|\leq\left(\frac{L}{2m^2}\|\nabla f(x)\|\right)^2$.
• Steps to reach $\varepsilon$ accuracy: $\frac{f(x_0)-f^*}{\gamma}+\log\log\left(\frac{\varepsilon_0}{\varepsilon}\right)$: quadratic convergence.
• Gradient descent needs $O(\log(\frac{1}{\varepsilon}))$ steps: linear convergence.
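The squaring recursion of the pure phase can be observed numerically. A toy 1-D example (the function is an illustrative choice): pure Newton on $f(x)=e^x-2x$, whose minimizer is $x^*=\log 2$, with the gradient magnitude recorded at each step.

```python
import math

# Pure Newton (eta = 1) on f(x) = exp(x) - 2x; minimizer x* = log 2.
# In the pure phase |f'(x_t)| roughly squares each step: quadratic convergence.
fp = lambda x: math.exp(x) - 2.0    # f'(x)
fpp = lambda x: math.exp(x)         # f''(x)

x, grads = 0.0, []
for _ in range(5):
    grads.append(abs(fp(x)))
    x = x - fp(x) / fpp(x)
# grads decays doubly exponentially: each entry is below the square
# of the previous one, as in the pure-phase bound above
```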

## 1.6 Scale free Newton

### 1.6.1 Definition

• A one-dimensional convex function $f:\mathbb{R}\rightarrow\mathbb{R}$ is self-concordant if $|f'''(x)|\leq 2[f''(x)]^{\frac{3}{2}}$.
• An $n$-dimensional convex function $f:\mathbb{R}^n\rightarrow\mathbb{R}$ is self-concordant if every 1-D restriction of it (along any line) is self-concordant.
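For example, the log barrier $f(x)=-\log x$ (which reappears in the barrier method below) is self-concordant; a quick check against the definition:

$$f''(x)=\frac{1}{x^2},\qquad f'''(x)=-\frac{2}{x^3},\qquad |f'''(x)|=\frac{2}{x^3}=2\left[f''(x)\right]^{\frac{3}{2}}\quad(x>0),$$

so the inequality holds with equality for this function.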

### 1.6.2 Convergence

Newton's method with BTLS $(\alpha,\beta)$ for a self-concordant $f$ reaches $\varepsilon$-optimality in $c(\alpha,\beta)(f(x_0)-f^*)+\log\log(\frac{1}{\varepsilon})$ iterations. The bound is scale free: it does not involve the constants $m$, $M$, $L$.

# 2 Quasi-Newton methods

## 2.1 Basic idea

Let $s=x_+-x$ and $y=\nabla f(x_+)-\nabla f(x)$. Solve the secant equation $B_+s=y$ for $B_+$ through a cheap update of $B$, keeping $B_+$ symmetric and positive semidefinite. Then solve $B_+p=-\nabla f(x)$ for $p$, and finally take the step $x_+=x+\eta p$, which approximates the Newton update $x_+=x-\eta[\nabla^2 f(x)]^{-1}\nabla f(x)$.

https://en.wikipedia.org/wiki/Quasi-Newton_method

Use the Sherman–Morrison formula to update the inverse $B^{-1}$ directly, so each iteration avoids solving a linear system from scratch.
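A minimal sketch using BFGS, the most common quasi-Newton update: the inverse approximation $H\approx B^{-1}$ is maintained directly (the closed-form inverse update below is what the Sherman–Morrison formula yields for BFGS's rank-two update of $B$). The objective and line-search constants are illustrative; practical implementations add further safeguards.

```python
import numpy as np

def bfgs_minimize(f, grad, x0, iters=50):
    """Quasi-Newton (BFGS) sketch: H approximates the inverse Hessian.
    The update keeps H symmetric PSD and enforces the secant equation
    H_+ y = s; the step p = -H g approximates the Newton direction."""
    n = len(x0)
    x, H = np.asarray(x0, float), np.eye(n)
    for _ in range(iters):
        g = grad(x)
        p = -H @ g                          # quasi-Newton direction
        eta = 1.0
        while f(x + eta * p) > f(x) + 1e-4 * eta * g @ p:
            eta *= 0.5                      # backtracking: keep it a descent step
        s = eta * p                         # s = x_+ - x
        x_new = x + s
        y = grad(x_new) - g                 # y = grad f(x_+) - grad f(x)
        if s @ y > 1e-12:                   # curvature condition: keep H PSD
            rho = 1.0 / (s @ y)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x = x_new
    return x

# Illustrative strongly convex quadratic: f(x) = 0.5 x^T Q x - b^T x
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b

x = bfgs_minimize(f, grad, np.array([5.0, -3.0]))
# x converges to the minimizer Q^{-1} b, without ever evaluating the Hessian
```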

## 2.2 Convergence

• Superlinear convergence under strong convexity plus extra assumptions.

# 3 Barrier method (with constraints)

## 3.1 Basic idea

$$\begin{aligned} \min_x& \quad f(x)\\ \text{s.t.}& \quad h_i(x)\leq 0,\ i=1\dots m\\ & \quad Ax=b \end{aligned}$$

Bring the inequality constraints into the objective using the indicator function $I(u)=\infty$ if $u>0$; $=0$ otherwise:

$$\begin{aligned} \min_x& \quad f(x)+\sum_i I(h_i(x))\\ \text{s.t.}& \quad Ax=b \end{aligned}$$

Then approximate $I(u)$ by the smooth function $-\frac{1}{t}\log(-u)$: as $t\rightarrow\infty$, $-\frac{1}{t}\log(-u)\rightarrow I(u)$. Multiplying the objective through by $t$ gives:

$$\begin{aligned} \min_x& \quad tf(x)-\sum_{i=1}^m \log(-h_i(x))\\ \text{s.t.}& \quad Ax=b \end{aligned}$$

## 3.2 Barrier method

• Solve a sequence of problems: $$\begin{aligned} \min_x& \quad t^kf(x)-\sum_{i=1}^m \log(-h_i(x))\\ \text{s.t.}& \quad Ax=b \end{aligned}$$
• Start from an initial $t^0$.
• At each epoch $k$, find $x^*(t^k)$ using Newton's method starting at $x^*(t^{k-1})$; then increase $t^{k+1}=\mu t^k$.
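The loop above can be sketched on a tiny 1-D problem (the problem, $t^0$, $\mu$, and iteration counts are illustrative; there are no equality constraints here, so each centering step is unconstrained Newton):

```python
# Barrier method sketch:
#   minimize f(x) = x^2   subject to   h(x) = 1 - x <= 0   (so x* = 1).
# Centering problem for parameter t:  phi_t(x) = t*x^2 - log(x - 1).

def center(t, x):
    """Minimize phi_t with damped Newton steps, warm-started at x."""
    for _ in range(50):
        g = 2 * t * x - 1.0 / (x - 1)        # phi_t'(x)
        h = 2 * t + 1.0 / (x - 1) ** 2       # phi_t''(x)
        step = g / h
        while x - step <= 1.0:               # damp to stay strictly feasible
            step *= 0.5
        x = x - step
    return x

t, mu = 1.0, 10.0                            # initial t^0 and growth factor mu
x = 2.0                                      # strictly feasible starting point
while t < 1e8:
    x = center(t, x)                         # x*(t^k), warm-started
    t *= mu                                  # t^{k+1} = mu * t^k
# x tracks the central path toward the constrained optimum x* = 1
```

The feasibility damping inside `center` is needed because a raw Newton step can jump across the barrier's domain boundary, where $\log(x-1)$ is undefined.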

### 3.2.1 Central path

Solving the problem $$\begin{aligned} \min_x& \quad t^kf(x)-\sum_{i=1}^m \log(-h_i(x))\\ \text{s.t.}& \quad Ax=b \end{aligned}$$ gives the optimal solution $x^*(t^k)$. The curve traced by the $x^*(t^k)$ is the central path, and $x^*(t^k)\rightarrow x^*$ as $k\rightarrow\infty$.

### 3.2.2 Choose t t

• Initially use a small $t$ to avoid the bad conditioning of the barrier term; gradually increase $t$ as the iterates approach $x^*$.
