Week 10: Gradient Descent
1 Motivation
1.1 First Order Taylor expansion
$f(x_t+\eta d)\approx f(x_t)+\nabla f(x_t)^T \eta d$
The approximation is minimized over directions when $d=-\nabla f(x_t)$.
$x_{t+1}=x_t-\eta \nabla f(x_t)$
Where does $\eta$ come from? From a quadratic approximation.
1.2 Quadratic approximation
$f(x_{t+1})=f(x_t)+\nabla f(x_t)^T(x_{t+1}-x_t)+\frac{1}{2\eta}\|x_{t+1}-x_t\|^2_2$
The quadratic proximal term keeps $x_{t+1}$ from deviating too far from $x_t$.
Minimize w.r.t. $x_{t+1}$:
$x_{t+1}=x_t-\eta \nabla f(x_t)$
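The update above can be sketched in a few lines; the quadratic test function and all names here are illustrative, not from the lecture:

```python
import numpy as np

def gradient_descent(grad, x0, eta, iters=100):
    """Vanilla gradient descent: x_{t+1} = x_t - eta * grad(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - eta * grad(x)
    return x

# Minimize f(x) = ||x - b||^2 / 2, whose gradient is x - b; the minimizer is b.
b = np.array([1.0, -2.0])
x_star = gradient_descent(lambda x: x - b, x0=np.zeros(2), eta=0.5, iters=100)
```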
2 Step size
Exact line search
$x_{t+1}=x_t+\eta_t d_t$, where $\eta_t=\arg\min_{\eta} f(x_t+\eta d_t)$
Backtracking line search (BTLS)
Goal: ensure that $f(x+\eta d)$ decreases enough.
By convexity:
$f(x+\eta d)\geq f(x)+\eta \nabla f(x)^T d$
BTLS for gradient descent
If $f$ is $M$-smooth, then $\eta_{BTLS}\geq \beta/M$, and $f(x_+)\leq f(x)-\frac{\alpha \beta}{M} \|\nabla f(x)\|_2^2$
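A minimal sketch of backtracking line search with the Armijo sufficient-decrease test; the one-dimensional quadratic example and parameter values are my own, chosen so the result is consistent with the $\eta_{BTLS}\geq \beta/M$ guarantee:

```python
import numpy as np

def backtracking_line_search(f, grad_fx, x, d, eta0=1.0, alpha=0.3, beta=0.5):
    """Shrink eta by factor beta until
    f(x + eta*d) <= f(x) + alpha * eta * <grad_fx, d> (Armijo condition)."""
    eta = eta0
    fx = f(x)
    while f(x + eta * d) > fx + alpha * eta * (grad_fx @ d):
        eta *= beta
    return eta

# Example: f(x) = 0.5 * M * x^2 with M = 10, starting at x = 1, direction d = -grad.
M = 10.0
f = lambda x: 0.5 * M * x[0] ** 2
x = np.array([1.0])
g = np.array([M * x[0]])                      # gradient of f at x
eta = backtracking_line_search(f, g, x, d=-g) # -> 0.125, which is >= beta/M = 0.05
```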
3 Convergence, step size
3.1 Smoothness, upper bound, and self-tuning
Lipschitz gradients (note: it is the gradients, not $f$ itself, that are Lipschitz)
$\|\nabla f(x)-\nabla f(y)\|\leq M\|x-y\|, \quad \forall x,y$
Then $f$ has $M$-Lipschitz gradients; $f$ need not be convex. $M$ is the smoothness parameter; when $f$ is quadratic, $M$ is the largest eigenvalue of the Hessian.
Quadratic upper bound
- If $f$ has $M$-Lipschitz gradients and is convex, then $g(x)=\frac{M}{2} x^Tx-f(x)$ is convex.
- $f(y)\leq f(x)+\nabla f(x)^T(y-x)+\frac{M}{2}\|y-x\|_2^2$
Step size
Take $y=x-\eta \nabla f(x)$ in the quadratic upper bound:
$f(y)\leq f(x)-\eta \|\nabla f(x)\|^2+\frac{M}{2}\eta^2\|\nabla f(x)\|^2=f(x)+\eta(\frac{M}{2}\eta-1) \|\nabla f(x)\|^2$
Thus, to guarantee descent we need $\frac{M}{2}\eta-1< 0$, i.e. $\eta< \frac{2}{M}$; the bound decreases fastest at $\eta=\frac{1}{M}$.
Convergence
- $\eta<\frac{2}{M}$ ensures convergence; $\eta=\frac{1}{M}$ gives the fastest guaranteed decrease.
- If $\eta\leq \frac{1}{M}$, GD is a descent method with $f(x_t)-f^*\leq\frac{1}{2\eta t}\|x_0-x^*\|_2^2=O(\frac{M}{t})$
- Self-tuning: the update $\eta\nabla f(x_t) \rightarrow 0$ as $x_t \rightarrow x^*$.
- Smoothness ensures convergence of the function value, and the iterates do not oscillate, because the update shrinks as we approach the optimum.
- However, the iterates are not guaranteed to converge to the optimal point if $f$ is not strongly convex, since the function can be flat near the minimum; this is guaranteed by the strong convexity discussed below.
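A quick numerical check of these step-size thresholds on a one-dimensional quadratic (my own example, not from the lecture): $\eta=\frac{1}{M}$ and $\eta<\frac{2}{M}$ converge, while $\eta>\frac{2}{M}$ diverges.

```python
import numpy as np

M = 4.0                       # smoothness of f(x) = 0.5 * M * x^2 (f'' = M)
grad = lambda x: M * x

def run(eta, iters=50, x0=1.0):
    """Run gradient descent on the 1-D quadratic and return the final iterate."""
    x = x0
    for _ in range(iters):
        x = x - eta * grad(x)
    return x

x_small = run(eta=1.0 / M)    # eta = 1/M: converges in one step here
x_edge  = run(eta=1.9 / M)    # eta < 2/M: converges (contraction factor |1 - 1.9| = 0.9)
x_big   = run(eta=2.1 / M)    # eta > 2/M: diverges (factor |1 - 2.1| = 1.1)
```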
Bound on suboptimality (OL)
If $f$ is $M$-smooth:
$\frac{1}{2M}\|\nabla f(x) \|_2^2\leq f(x)-f(x^*)\leq \frac{M}{2}\|x-x^*\|^2$
Co-coercivity (OL)
If $f$ is $M$-smooth:
$\langle \nabla f(x)-\nabla f(y),x-y\rangle \geq \frac{1}{M}\|\nabla f(x)-\nabla f(y)\|^2_2$
3.2 Strong convexity, lower bound
Strong convexity
- $\forall x,y,\ \langle \nabla f(x)-\nabla f(y), x-y \rangle \geq m\|x-y\|^2_2$
- If $\nabla^2 f(x)$ exists, $\nabla^2 f(x)\succeq mI$
- When $f$ is quadratic, $m$ is the smallest eigenvalue of the Hessian.
Quadratic lower bound
- If $f$ is $m$-strongly convex, then $g(x)=f(x)-\frac{m}{2}x^T x$ is convex.
- If $f$ is $m$-strongly convex, $f(y)\geq f(x)+\nabla f(x)^T (y-x)+\frac{m}{2}\|x-y\|^2_2$; corollary: $f(y)\geq f(x)-\frac{1}{2m}\|\nabla f(x)\|^2_2$
Convergence
- If $\eta>\frac{2}{m}$: divergence.
- With $\eta=\frac{1}{M}$: $f(x_+)-f(x^*)\leq [1-\frac{m}{M}](f(x)-f(x^*))$
- Strong convexity ensures that GD makes very fast progress when far away from the optimal point.
Bound on suboptimality (OL)
If $f$ is $m$-strongly convex:
$\frac{m}{2}\|x-x^*\|^2\leq f(x)-f(x^*)\leq \frac{1}{2m}\|\nabla f(x) \|_2^2$
Co-coercivity (OL)
If $f$ is $m$-strongly convex:
$\langle \nabla f(x)-\nabla f(y),x-y\rangle \geq m \|x-y\|^2_2$
3.3 Smoothness and strong convexity
M and m
$m\leq \frac{\|\nabla f(x)-\nabla f(y)\|}{\|x-y\|}\leq M$
$\frac{m}{M}\leq 1$. When $\frac{m}{M}$ is small (ill-conditioned), the trajectory zigzags; when it is close to $1$, GD converges quickly. When $\frac{m}{M}=1$, $f(x_+)-f(x^*)=0$ after a single step.
Convergence
Linear convergence when smooth and strongly convex:
$f(x_t)-f(x^*)\leq O((1-\frac{m}{M})^t)$
Whenever we have strong convexity, we can guarantee that $x_t$ converges to $x^*$. (Piazza @338)
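The per-step contraction $f(x_+)-f^*\leq(1-\frac{m}{M})(f(x)-f^*)$ can be checked numerically on a small quadratic; the matrix, iterate count, and names here are mine, chosen for illustration:

```python
import numpy as np

# f(x) = 0.5 * x^T A x with Hessian eigenvalues m = 1 and M = 10; f^* = 0 at x = 0.
A = np.diag([1.0, 10.0])
m, M = 1.0, 10.0
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
eta = 1.0 / M
gaps = []                           # suboptimality gaps f(x_t) - f^*
for _ in range(30):
    gaps.append(0.5 * x @ A @ x)
    x = x - eta * grad(x)

rate = 1 - m / M                    # guaranteed per-step contraction factor
```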
4 Oracle lower bounds
4.1 Lipschitz convex function
For Lipschitz convex functions, no algorithm can guarantee error better than $O(1/\sqrt{T})$.
4.2 Smooth convex function
For smooth convex functions, no algorithm can guarantee error better than $O(1/T^2)$. So gradient descent ($O(1/T)$) can be improved, via accelerated gradient descent.
4.3 Smooth and strongly convex function
For $M$-smooth, $m$-strongly convex functions, no algorithm can guarantee error better than $O((\frac{\sqrt{K}-1}{\sqrt{K}+1})^T)$, where $K=M/m$.
5 Accelerated gradient method (week 14)
5.1 First order methods
$x_{t+1}\in x_1+\mathrm{span}(\nabla f(x_1),\dots,\nabla f(x_t))$
5.2 Convergence performance
5.3 Heavy ball method (momentum)
5.3.1 Update rule 1
$x_{k+1}=x_k-\eta \nabla f(x_k)+\beta_k (x_k-x_{k-1})$
- Vanilla gradient descent step: $x_k-\eta \nabla f(x_k)$
- Momentum term: $\beta_k (x_k-x_{k-1})$
- First the vanilla update, then the momentum update
- Also works in the proximal gradient setting.
Can be rewritten as: $p_k=-\nabla f(x_k)+\beta_k p_{k-1}$, $x_{k+1}=x_k+\alpha_k p_k$
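Update rule 1 in code, on an ill-conditioned quadratic; the step size and momentum weight below are hand-picked for this example, not the theoretically optimal choices:

```python
import numpy as np

def heavy_ball(grad, x0, eta, beta, iters=500):
    """Heavy ball: x_{k+1} = x_k - eta*grad(x_k) + beta*(x_k - x_{k-1})."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x, x_prev = x - eta * grad(x) + beta * (x - x_prev), x
    return x

# f(x) = 0.5 * x^T A x with m = 1, M = 100 (condition number 100); minimizer is 0.
A = np.diag([1.0, 100.0])
x = heavy_ball(lambda v: A @ v, x0=np.ones(2), eta=1.0 / 100, beta=0.9)
```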
5.3.2 Update rule 2
$y_{k+1}=x_k-\eta \nabla f(x_k)$
$x_{k+1}=y_{k+1}+\frac{\sqrt{K}-1}{\sqrt{K}+1}(y_{k+1}-y_k)$
5.3.3 Convergence rate
For strongly convex $f$ with condition number $\kappa \geq 1$:
$\|x_k-x^*\|\leq (1-\frac{2}{\sqrt{\kappa}+1})^k \|x_0-x^*\|$
For Lipschitz-only functions: unknown.
5.4 Nesterov accelerated gradient
5.4.1 Update rule
$p_k=-\nabla f(x_k+\beta_k(x_k-x_{k-1}))+\beta_k p_{k-1}$, $x_{k+1}=x_k+\alpha_k p_k$
- Momentum is applied before the gradient evaluation
- $\alpha_k=\frac{1}{L}$
- $\beta_k=\frac{k-2}{k+1}$
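A sketch of Nesterov acceleration in the gradient-step-plus-extrapolation form, with $\alpha_k=1/L$ and $\beta_k=(k-2)/(k+1)$ clamped at $0$ for the first steps (a common convention; the test problem is my own):

```python
import numpy as np

def nesterov(grad, x0, L, iters=2000):
    """Nesterov accelerated gradient:
    y_{k+1} = x_k - (1/L) grad(x_k);  x_{k+1} = y_{k+1} + beta_k (y_{k+1} - y_k)."""
    x = y_prev = np.asarray(x0, dtype=float)
    for k in range(1, iters + 1):
        beta = max(0.0, (k - 2) / (k + 1))  # momentum weight, 0 for k <= 2
        y = x - (1.0 / L) * grad(x)          # plain gradient step
        x = y + beta * (y - y_prev)          # extrapolation (momentum) step
        y_prev = y
    return y_prev

# Quadratic f(x) = 0.5 * x^T A x with L = 10; the minimizer is the origin.
A = np.diag([1.0, 10.0])
sol = nesterov(lambda v: A @ v, x0=np.ones(2), L=10.0)
```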
5.4.2 Convergence rate
- $O(\frac{1}{T^2})$ error for Lipschitz gradients
- $O((1-\frac{2}{\sqrt{\kappa}+1})^k)$ error for $\kappa$-conditioned strongly convex functions
Optimal for all first-order settings.
6 Mirror descent
6.1 New motivation
$x_{t+1}=\arg\min_x\ \eta g_t^T x+\frac{1}{2}D_{\phi}(x,x_t)$
where $D_{\phi}$ is the Bregman divergence.
6.2 Dual norm
For the $\|\cdot\|_p$ norm, its dual norm is $\|\cdot\|_q$, where $\frac{1}{p}+\frac{1}{q}=1$.
6.3 For $\phi=\sum x_i \log x_i$
$\nabla\phi(y_{t+1})=\nabla\phi(x_t)-\eta g_t$
For this special $\phi$,
$y_{t+1}(i)=x_t(i) e^{-\eta(\nabla f(x_t))_i}$
$x_{t+1}=\frac{y_{t+1}}{\|y_{t+1}\|_1}$
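For this entropic $\phi$ the two steps above give the exponentiated-gradient update on the probability simplex; a sketch, where the linear objective and all names are my own example:

```python
import numpy as np

def exponentiated_gradient(grad, x0, eta, iters=200):
    """Mirror descent with phi(x) = sum_i x_i log x_i on the simplex:
    multiplicative update y_i = x_i * exp(-eta * g_i), then L1 normalization."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        y = x * np.exp(-eta * grad(x))
        x = y / np.linalg.norm(y, 1)
    return x

# Minimize the linear function <c, x> over the simplex; the optimum puts all
# mass on the coordinate with the smallest c_i (index 1 here).
c = np.array([0.5, 0.2, 0.9])
x = exponentiated_gradient(lambda v: c, x0=np.ones(3) / 3, eta=0.5)
```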