Optimization Week 10: Gradient Descent

1 Motivation

1.1 First Order Taylor expansion

$$f(x_t+\eta d)\approx f(x_t)+\nabla f(x_t)^T \eta d$$
The right-hand side is minimized (over unit directions) when $d=-\nabla f(x_t)$, giving
$$x_{t+1}=x_t-\eta \nabla f(x_t)$$
Where does $\eta$ come from? The quadratic approximation.

1.2 Quadratic approximation

Add a quadratic proximal term to penalize moving too far from the current iterate:
$$f(x_{t+1})\approx f(x_t)+\nabla f(x_t)^T(x_{t+1}-x_t)+\frac{1}{2\eta}\|x_{t+1}-x_t\|^2_2$$

Minimizing the right-hand side with respect to $x_{t+1}$ (setting its gradient to zero) again gives
$$x_{t+1}=x_t-\eta \nabla f(x_t)$$
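The update above can be sketched in a few lines. A minimal, illustrative implementation (the toy function, step size, and iteration count are my own choices, not from the notes):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, num_iters):
    """Basic gradient descent: x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - eta * grad_f(x)
    return x

# Toy problem: f(x) = ||x||^2 / 2, so grad_f(x) = x and the minimizer is the origin.
x_final = gradient_descent(lambda x: x, x0=[4.0, -2.0], eta=0.1, num_iters=100)
```

Each step contracts the iterate by a factor $1-\eta$ here, so the iterate approaches the origin geometrically.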

2 Step size

Exact line search

$$x_{t+1}=x_t+\eta_t d_t,\qquad \eta_t=\arg\min_{\eta} f(x_t+\eta d_t)$$

Backtracking line search (BTLS)

Goal: ensure $f(x+\eta d)$ decreases enough. By convexity,
$$f(x+\eta d)\geq f(x)+\eta \nabla f(x)^T d,$$
so the full linear decrease is unattainable; instead, accept $\eta$ once the sufficient-decrease condition $f(x+\eta d)\leq f(x)+\alpha\eta\nabla f(x)^T d$ holds for some $\alpha\in(0,1)$, shrinking $\eta\leftarrow\beta\eta$ otherwise.

BTLS for gradient descent

If $f$ is $M$-smooth, then $\eta_{BTLS}\geq \beta/M$, and
$$f(x_+)\leq f(x)-\frac{\alpha \beta}{M} \|\nabla f(x)\|_2^2$$
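A sketch of backtracking line search along the gradient-descent direction $d=-\nabla f(x)$ (the values $\alpha=0.3$, $\beta=0.5$ and the toy function are illustrative choices, not from the notes):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, alpha=0.3, beta=0.5, eta0=1.0):
    """Shrink eta geometrically until the sufficient-decrease condition
    f(x - eta*g) <= f(x) - alpha*eta*||g||^2 holds (here d = -grad_f(x))."""
    g = grad_f(x)
    eta = eta0
    while f(x - eta * g) > f(x) - alpha * eta * np.dot(g, g):
        eta *= beta
    return eta

# Toy problem: f(x) = 2*||x||^2, which is M-smooth with M = 4;
# the guarantee above predicts eta_BTLS >= beta/M = 0.125.
f = lambda x: 2.0 * np.dot(x, x)
grad_f = lambda x: 4.0 * x
eta = backtracking_line_search(f, grad_f, np.array([1.0, -3.0]))
```

On this quadratic the loop stops at $\eta=0.25$, consistent with the lower bound $\beta/M=0.125$.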

3 Convergence, step size

3.1 Smoothness, upper bound, and self-tuning

Lipschitz gradients (note: it is the gradients that are Lipschitz, not $f$ itself)


$$\|\nabla f(x)-\nabla f(y)\|\leq M\|x-y\|\quad \forall x,y$$
Then $f$ has $M$-Lipschitz gradients ($f$ need not be convex). $M$ is the smoothness parameter; when $f$ is quadratic, $M$ is the largest eigenvalue of the Hessian.

Quadratic upper bound

  • If $f$ has $M$-Lipschitz gradients and is convex, then $g(x)=\frac{M}{2} x^Tx-f(x)$ is convex.
  • $f(y)\leq f(x)+\nabla f(x)^T(y-x)+\frac{M}{2}\|y-x\|_2^2$

Step size

Start from the quadratic upper bound $f(y)\leq f(x)+\nabla f(x)^T(y-x)+\frac{M}{2}\|y-x\|_2^2$ and choose $y=x-\eta \nabla f(x)$. Then
$$f(y)\leq f(x)-\eta \|\nabla f(x)\|^2+\frac{M}{2}\eta^2\|\nabla f(x)\|^2=f(x)+\eta\left(\tfrac{M}{2}\eta-1\right) \|\nabla f(x)\|^2$$

Thus, to guarantee descent we need $\frac{M}{2}\eta-1<0$, i.e. $\eta<\frac{2}{M}$; the bound decreases fastest at $\eta=\frac{1}{M}$.

  • $\eta<\frac{2}{M}$ ensures convergence; $\eta=\frac{1}{M}$ gives the fastest guaranteed decrease.
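The $\eta<\frac{2}{M}$ threshold can be checked numerically; a sketch on the 1-D quadratic $f(x)=\frac{M}{2}x^2$ (an illustrative choice), whose gradient is $Mx$:

```python
def gd_1d(M, eta, x0=1.0, steps=50):
    """Run GD on f(x) = (M/2) * x^2, whose gradient is M*x; return |x_final|."""
    x = x0
    for _ in range(steps):
        x -= eta * M * x
    return abs(x)

M = 2.0
best = gd_1d(M, eta=1.0 / M)    # eta = 1/M: converges immediately on this f
small = gd_1d(M, eta=1.9 / M)   # eta < 2/M: still converges, slowly
large = gd_1d(M, eta=2.1 / M)   # eta > 2/M: diverges
```

Each step multiplies the iterate by $1-\eta M$, so the iterate shrinks exactly when $|1-\eta M|<1$, i.e. $\eta<\frac{2}{M}$.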

Convergence

  • If $\eta\leq \frac{1}{M}$, GD is a descent method with $f(x_t)-f^*\leq\frac{1}{t}\cdot\frac{1}{2 \eta}\|x_0-x^*\|^2=O(\frac{M}{t})$
  • Self-tuning: the update $\eta\nabla f(x_t)\rightarrow 0$ as $x_t\rightarrow x^*$.
  • Smoothness ensures convergence of the function value, and the iterates do not oscillate, because the update shrinks as we approach the optimum.
  • However, without strong convexity the iterates are not guaranteed to converge to the optimal solution, since the function can be flat near the optimum; strong convexity (next subsection) guarantees this.

Bound on suboptimality (OL)

If $f$ is $M$-smooth:
$$\frac{1}{2M}\|\nabla f(x)\|_2^2\leq f(x)-f(x^*)\leq \frac{M}{2}\|x-x^*\|^2$$

Co-coercivity (OL)

If $f$ is $M$-smooth:
$$\langle\nabla f(x)-\nabla f(y),\,x-y\rangle\geq \frac{1}{M}\|\nabla f(x)-\nabla f(y)\|^2_2$$

3.2 Strong convexity, lower bound

Strong convexity

  • $\forall x,y,\ \langle \nabla f(x)-\nabla f(y),\, x-y \rangle \geq m\|x-y\|^2_2$
  • If $\nabla^2 f(x)$ exists, $\nabla^2 f(x)\succeq mI$
  • When $f$ is quadratic, $m$ is the smallest eigenvalue of the Hessian.

Quadratic lower bound

  • If $f$ is $m$-strongly convex, then $g(x)=f(x)-\frac{m}{2}x^T x$ is convex.
  • If $f$ is $m$-strongly convex, $f(y)\geq f(x)+\nabla f(x)^T (y-x)+\frac{m}{2}\|x-y\|^2_2$; corollary (minimizing both sides over $y$): $f(y)\geq f(x)-\frac{1}{2m}\|\nabla f(x)\|^2_2$

Convergence

  • If $\eta>\frac{2}{m}$, GD diverges.
  • With $\eta=\frac{1}{M}$: $f(x_+)-f(x^*)\leq \left[1-\frac{m}{M}\right](f(x)-f(x^*))$
  • Strong convexity ensures that GD makes very fast progress when far from the optimal point.

Bound on suboptimality (OL)

If $f$ is $m$-strongly convex:
$$\frac{m}{2}\|x-x^*\|^2\leq f(x)-f(x^*)\leq \frac{1}{2m}\|\nabla f(x)\|_2^2$$

Co-coercivity (OL)

If $f$ is $m$-strongly convex:
$$\langle\nabla f(x)-\nabla f(y),\,x-y\rangle\geq m \|x-y\|^2_2$$

3.3 Smoothness and strong convexity

M and m

$$m\leq \frac{\|\nabla f(x)-\nabla f(y)\|}{\|x-y\|}\leq M$$
$\frac{m}{M}\leq 1$. When $\frac{m}{M}$ is small (ill-conditioned), the trajectory zigzags; when it is close to 1 (well-conditioned), GD converges quickly. When $\frac{m}{M}=1$, $f(x_+)-f(x^*)=0$: GD converges in a single step.

Convergence

Linear convergence when $f$ is smooth and strongly convex:

$$f(x_t)-f(x^*)\leq O\left(\left(1-\frac{m}{M}\right)^t\right)$$

Whenever we have strong convexity, we can also guarantee that $x_t$ converges to $x^*$ (Piazza @338).
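This geometric contraction can be observed directly. A sketch on an illustrative diagonal quadratic whose Hessian eigenvalues are $m$ and $M$ (my own toy choices), checking the per-step bound $f(x_t)-f^*\leq(1-\frac{m}{M})^t(f(x_0)-f^*)$ with $\eta=\frac{1}{M}$:

```python
import numpy as np

m, M = 1.0, 10.0          # strong convexity / smoothness constants
eta = 1.0 / M

def f(x):
    # Diagonal quadratic with Hessian eigenvalues m and M; minimum f* = 0 at 0.
    return 0.5 * (m * x[0] ** 2 + M * x[1] ** 2)

def grad(x):
    return np.array([m * x[0], M * x[1]])

x = np.array([1.0, 1.0])
gap0 = f(x)
ok = True
for t in range(1, 21):
    x = x - eta * grad(x)
    # Linear convergence: the optimality gap contracts by (1 - m/M) each step.
    ok = ok and f(x) <= (1.0 - m / M) ** t * gap0 + 1e-12
```

Here each coordinate contracts by $1-\eta\lambda$, so the gap shrinks at least as fast as the $(1-\frac{m}{M})^t$ envelope.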

4 Oracle lower bounds

4.1 Lipschitz convex function

For Lipschitz convex functions, no first-order algorithm can guarantee error better than $O(1/\sqrt{T})$.

4.2 Smooth convex function

For smooth convex functions, no first-order algorithm can guarantee error better than $O(1/T^2)$. This is why gradient descent ($O(1/T)$) can be improved to accelerated gradient descent.

4.3 Smooth and strongly convex function

For $M$-smooth, $m$-strongly convex functions, no first-order algorithm can guarantee error better than $O\left(\left(\frac{\sqrt{K}-1}{\sqrt{K}+1}\right)^T\right)$, where $K=M/m$ is the condition number.

5 Accelerated gradient method (week 14)

5.1 First order methods

$$x_{t+1}\in x_1+\mathrm{span}\{\nabla f(x_1),\dots,\nabla f(x_t)\}$$

5.2 Convergence performance


5.3 Heavy ball method (momentum)

5.3.1 Update rule 1

$$x_{k+1}=x_k-\eta \nabla f(x_k)+\beta_k (x_k-x_{k-1})$$

  • Vanilla gradient descent step: $x_k-\eta \nabla f(x_k)$
  • Momentum term: $\beta_k (x_k-x_{k-1})$
  • First the vanilla gradient update, then the momentum correction
  • Also works in the proximal gradient setting.

Can be rewritten as:
$$p_k=-\nabla f(x_k)+\beta_k p_{k-1},\qquad x_{k+1}=x_k+\alpha_k p_k$$
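The heavy-ball rule above can be sketched as follows, with constant $\eta$ and $\beta$ (the ill-conditioned quadratic and the parameter values are illustrative choices, not from the notes):

```python
import numpy as np

def heavy_ball(grad_f, x0, eta, beta, num_iters):
    """Heavy-ball: x_{k+1} = x_k - eta*grad_f(x_k) + beta*(x_k - x_{k-1})."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(num_iters):
        x_next = x - eta * grad_f(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Ill-conditioned quadratic: f(x) = 0.5*(x1^2 + 100*x2^2), minimized at the origin.
grad_f = lambda x: np.array([x[0], 100.0 * x[1]])
x = heavy_ball(grad_f, x0=[1.0, 1.0], eta=0.01, beta=0.9, num_iters=500)
```

Initializing $x_{-1}=x_0$ makes the first step a plain gradient step, since the momentum term is zero.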

5.3.2 Update rule 2

$$y_{k+1}=x_k-\eta \nabla f(x_k)$$

  • $x_{k+1}=y_{k+1}+\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}(y_{k+1}-y_k)$, where $\kappa=M/m$

5.3.3 Convergence rate

For strongly convex $f$ with condition number $\kappa=M/m\geq 1$:
$$\|x_k-x^*\|\leq \left(1-\frac{2}{\sqrt{\kappa}+1}\right)^k \|x_0-x^*\|$$
Lipschitz only: unknown.

5.4 Nesterov accelerated gradient

5.4.1 Update rule

$$p_k=-\nabla f\big(x_k+\beta_k(x_k-x_{k-1})\big)+\beta_k p_{k-1},\qquad x_{k+1}=x_k+\alpha_k p_k$$

  • Momentum is applied before the gradient evaluation
  • $\alpha_k=\frac{1}{M}$
  • $\beta_k=\frac{k-2}{k+1}$
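A sketch of Nesterov's method in the look-ahead form: extrapolate with momentum first, then take the gradient step at the extrapolated point. The quadratic test function is an illustrative choice, not from the notes:

```python
import numpy as np

def nesterov(grad_f, x0, eta, num_iters):
    """Nesterov's accelerated gradient with beta_k = (k-2)/(k+1)."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for k in range(1, num_iters + 1):
        beta = (k - 2) / (k + 1)
        y = x + beta * (x - x_prev)        # momentum before the gradient
        x_prev, x = x, y - eta * grad_f(y)
    return x

# Smooth convex quadratic f(x) = 0.5*||A x||^2; M = largest eigenvalue of A^T A.
A = np.array([[3.0, 1.0], [0.0, 2.0]])
grad_f = lambda x: A.T @ (A @ x)
M = np.linalg.norm(A, 2) ** 2              # smoothness constant (spectral norm squared)
x = nesterov(grad_f, x0=[1.0, -1.0], eta=1.0 / M, num_iters=300)
```

With $x_{-1}=x_0$, the first iteration reduces to a plain gradient step, and the $O(1/T^2)$ guarantee drives the gap down much faster than vanilla GD on the same problem.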

5.4.2 Convergence rate

  • $O(\frac{1}{T^2})$ error for functions with Lipschitz gradients
  • $O\left(\left(1-\frac{2}{\sqrt{\kappa}+1}\right)^k\right)$ error for $\kappa$-conditioned strongly convex functions.
    Optimal for all first-order settings.

6 Mirror descent

6.1 New motivation

$$x_{t+1}=\arg\min_{x}\ \eta g_t^T x+\frac{1}{2}D_{\phi}(x,x_t)$$
where $D_{\phi}$ is the Bregman divergence induced by $\phi$.

6.2 Dual norm

For the $\|\cdot\|_p$ norm, the dual norm is $\|\cdot\|_q$, where $\frac{1}{p}+\frac{1}{q}=1$.

6.3 For $\phi(x)=\sum_i x_i \log x_i$

The mirror step sets
$$\nabla\phi(y_{t+1})=\nabla\phi(x_t)-\eta g_t$$
For this special $\phi$ (with $g_t=\nabla f(x_t)$):
$$y_{t+1}(i)=x_t(i)\, e^{-\eta(\nabla f(x_t))_i},\qquad x_{t+1}=\frac{y_{t+1}}{\|y_{t+1}\|_1}$$
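With this negative-entropy mirror map, the update is the exponentiated-gradient (multiplicative-weights) rule on the probability simplex; a sketch, where the linear loss is an illustrative choice, not from the notes:

```python
import numpy as np

def exponentiated_gradient(grad_f, x0, eta, num_iters):
    """Mirror descent with phi(x) = sum_i x_i*log(x_i): a multiplicative
    update followed by normalization back onto the probability simplex."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        y = x * np.exp(-eta * grad_f(x))   # y_{t+1}(i) = x_t(i) * exp(-eta * g_i)
        x = y / np.sum(y)                  # x_{t+1} = y_{t+1} / ||y_{t+1}||_1
    return x

# Illustrative linear loss f(x) = c^T x on the simplex; its minimizer puts
# all mass on the coordinate with the smallest c_i (index 1 here).
c = np.array([0.5, 0.2, 0.9])
x = exponentiated_gradient(lambda x: c, np.ones(3) / 3, eta=0.5, num_iters=200)
```

Because each step rescales coordinates and renormalizes, the iterate stays a probability vector throughout, with no explicit projection needed.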
