Optimization Week 10: Gradient Descent

1 Motivation

1.1 First Order Taylor expansion

$$f(x_t+\eta d)\approx f(x_t)+\eta\nabla f(x_t)^T d$$
For a fixed step length, the right-hand side is minimized over directions $d$ of a given norm when $d$ points along the negative gradient, so take $d=-\nabla f(x_t)$:
$$x_{t+1}=x_t-\eta \nabla f(x_t)$$
Where does $\eta$ come from? From the quadratic approximation.

1.2 Quadratic approximation

$$f(x_{t+1})\approx f(x_t)+\nabla f(x_t)^T(x_{t+1}-x_t)+\frac{1}{2\eta}\|x_{t+1}-x_t\|_2^2$$
The quadratic proximal term keeps $x_{t+1}$ from deviating too far from $x_t$, where the linear model is no longer accurate.

Minimize w.r.t. $x_{t+1}$: setting the gradient to zero gives $\nabla f(x_t)+\frac{1}{\eta}(x_{t+1}-x_t)=0$, i.e.
$$x_{t+1}=x_t-\eta \nabla f(x_t)$$
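As a minimal sketch of the update $x_{t+1}=x_t-\eta\nabla f(x_t)$, the loop below runs plain gradient descent on vectors stored as Python lists. The function name `gradient_descent` and the test objective $f(x)=\frac12(x_1^2+10x_2^2)$ are illustrative assumptions, not part of the original notes.

```python
def gradient_descent(grad, x0, eta, num_steps):
    """Iterate x_{t+1} = x_t - eta * grad(x_t) on a list of floats."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical test problem: f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2),
# whose unique minimizer is the origin.
def grad_f(x):
    return [x[0], 10.0 * x[1]]


# eta = 0.1 equals 1/M here, since the largest Hessian eigenvalue is M = 10.
x_final = gradient_descent(grad_f, x0=[1.0, 1.0], eta=0.1, num_steps=200)
```

With this step size both coordinates shrink toward zero, the second one in a single step because $1-\eta\cdot 10=0$.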

2 Step size

Exact line search

$$x_{t+1}=x_t+\eta_t d_t$$

$$\eta_t=\arg\min_{\eta} f(x_t+\eta d_t)$$
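Exact line search rarely has a closed form, but for a quadratic it does. The sketch below assumes $f(x)=\frac12 x^TAx-b^Tx$ with $A$ symmetric positive definite and $d_t=-g_t$ where $g_t=Ax_t-b$; in that case $\eta_t=\frac{g_t^Tg_t}{g_t^TAg_t}$. The matrix `A`, vector `b`, and helper names are made up for illustration.

```python
def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def steepest_descent_exact(A, b, x0, num_steps):
    """Gradient descent on 0.5 x^T A x - b^T x with exact line search."""
    x = list(x0)
    for _ in range(num_steps):
        g = [ax - bi for ax, bi in zip(matvec(A, x), b)]  # gradient A x - b
        gg = dot(g, g)
        if gg == 0.0:  # already at the minimizer
            break
        eta = gg / dot(g, matvec(A, g))  # argmin over eta of f(x - eta g)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical 2x2 example; the minimizer solves A x = b, i.e. x = (0.2, 0.4).
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
x_exact = steepest_descent_exact(A, b, [0.0, 0.0], 50)
```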

Backtracking line search (BTLS)

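Goal: ensure that $f(x+\eta d)$ decreases enough.
By convexity, the linear model is a lower bound:
$$f(x+\eta d)\geq f(x)+\eta \nabla f(x)^T d$$
so the full decrease predicted by the linear model cannot be achieved. BTLS therefore asks for only a fraction $\alpha\in(0,\frac12)$ of it: starting from an initial trial step, shrink $\eta\leftarrow\beta\eta$ with $\beta\in(0,1)$ until $f(x+\eta d)\leq f(x)+\alpha\eta \nabla f(x)^T d$.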

BTLS for gradient descent

If $f$ is $M$-smooth and $d=-\nabla f(x)$, then (for $\alpha\leq\frac12$ and an initial trial step of at least $\beta/M$) $\eta_{BTLS}\geq \beta/M$, and $f(x_+)\leq f(x)-\frac{\alpha \beta}{M}\|\nabla f(x)\|_2^2$.
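A minimal sketch of one backtracking step for gradient descent follows. It assumes the standard sufficient-decrease test $f(x-\eta g)\leq f(x)-\alpha\eta\|g\|_2^2$ with shrink factor $\beta$; the default values `alpha=0.3`, `beta=0.5`, `eta0=1.0` and the test function are illustrative choices, not values from the notes.

```python
def backtracking_step(f, grad, x, alpha=0.3, beta=0.5, eta0=1.0):
    """Return (eta, x_new) for one gradient step chosen by backtracking."""
    g = grad(x)
    g_sq = sum(gi * gi for gi in g)
    fx = f(x)
    eta = eta0
    # Shrink eta until the sufficient-decrease condition holds.
    while f([xi - eta * gi for xi, gi in zip(x, g)]) > fx - alpha * eta * g_sq:
        eta *= beta
    return eta, [xi - eta * gi for xi, gi in zip(x, g)]


# Hypothetical M-smooth test function with M = 10.
def f_quad(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def grad_quad(x):
    return [x[0], 10.0 * x[1]]


eta_bt, x_bt = backtracking_step(f_quad, grad_quad, [1.0, 1.0])
```

On this example the accepted step stays above $\beta/M=0.05$, in line with the bound stated above.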

3 Convergence, step size

3.1 Smoothness, upper bound, and self-tuning

Lipschitz gradients (the gradient, not $f$ itself, is Lipschitz)


$$\|\nabla f(x)-\nabla f(y)\|\leq M\|x-y\|,\quad \forall x,y$$
Then $f$ has $M$-Lipschitz gradients; $f$ need not be convex.
$M$ is the smoothness parameter. When $f$ is a convex quadratic, $M$ is the largest eigenvalue of its Hessian.

Quadratic upper bound

  • If $f$ is convex and has an $M$-Lipschitz gradient, then $g(x)=\frac{M}{2} x^Tx-f(x)$ is convex.
  • $f(y)\leq f(x)+\nabla f(x)^T(y-x)+\frac{M}{2}\|y-x\|_2^2$

Step size

Take $f(y)\leq f(x)+\nabla f(x)^T(y-x)+\frac{M}{2}\|y-x\|_2^2$ and choose $y=x-\eta \nabla f(x)$.

Then $f(y)\leq f(x)-\eta \|\nabla f(x)\|_2^2+\frac{M}{2}\eta^2\|\nabla f(x)\|_2^2=f(x)+\eta(\frac{M}{2}\eta-1) \|\nabla f(x)\|_2^2$.

Thus, to guarantee that each step decreases $f$, we need $\frac{M}{2}\eta-1<0$, i.e. $\eta<\frac{2}{M}$. The guaranteed decrease is largest at $\eta=\frac{1}{M}$, which minimizes $\eta(\frac{M}{2}\eta-1)$ and gives $f(y)\leq f(x)-\frac{1}{2M}\|\nabla f(x)\|_2^2$.

  • $\eta<\frac{2}{M}$ ensures convergence; $\eta=\frac{1}{M}$ gives the fastest guaranteed decrease.
  • If $f$ is convex and $\eta\leq \frac{1}{M}$, GD is a descent method with $f(x_t)-f^*\leq\frac{1}{t}\frac{1}{2 \eta}\|x_0-x^*\|_2^2$, which is $O(\frac{M}{t})$ for $\eta=\frac{1}{M}$.
  • Self-tuning: the update $\eta\nabla f(x)\rightarrow 0$ as $x\rightarrow x^*$.
  • Smoothness ensures that the function value converges, and the iterates do not oscillate, because the update shrinks as they approach the optimum.
  • However, without strong convexity this does not guarantee that the iterates themselves approach $x^*$ at any rate, since the function can be flat around the optimum. Strong convexity (next section) provides this guarantee.
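The role of the threshold $\frac{2}{M}$ is easiest to see in one dimension. The sketch below assumes the toy objective $f(x)=\frac{M}{2}x^2$ (an illustrative choice), for which a GD step is $x_{t+1}=(1-\eta M)x_t$, so the iterates shrink exactly when $|1-\eta M|<1$.

```python
def run_gd_1d(M, eta, x0=1.0, num_steps=50):
    """Gradient descent on f(x) = (M/2) x^2, whose gradient is M x."""
    x = x0
    for _ in range(num_steps):
        x = x - eta * M * x
    return x


M = 4.0
x_best = run_gd_1d(M, eta=1.0 / M)     # eta = 1/M: reaches the minimizer in one step
x_slow = run_gd_1d(M, eta=1.9 / M)     # eta < 2/M: sign flips each step, magnitude shrinks
x_diverge = run_gd_1d(M, eta=2.1 / M)  # eta > 2/M: magnitude grows each step
```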

Bound on suboptimality (OL)

If $f$ is $M$-smooth:
$$\frac{1}{2M}\|\nabla f(x) \|_2^2\leq f(x)-f(x^*)\leq \frac{M}{2}\|x-x^*\|_2^2$$
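Both inequalities follow from the quadratic upper bound of Section 3.1. A short derivation, assuming $x^*$ is an unconstrained minimizer so that $\nabla f(x^*)=0$:

```latex
\begin{aligned}
f(x^*) &\leq f\!\left(x-\tfrac{1}{M}\nabla f(x)\right)
        \leq f(x)-\tfrac{1}{2M}\|\nabla f(x)\|_2^2
  &&\text{(upper bound with } y=x-\tfrac{1}{M}\nabla f(x)\text{)}\\
f(x) &\leq f(x^*)+\nabla f(x^*)^T(x-x^*)+\tfrac{M}{2}\|x-x^*\|_2^2
      = f(x^*)+\tfrac{M}{2}\|x-x^*\|_2^2
  &&\text{(upper bound at } x^*,\ \nabla f(x^*)=0\text{)}
\end{aligned}
```

Rearranging the first line gives the left inequality, and the second line is the right inequality.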

Co-coercivity (OL)

If $f$ is convex and $M$-smooth:
$$\langle\nabla f(x)-\nabla f(y),x-y\rangle\geq \frac{1}{M}\|\nabla f(x)-\nabla f(y)\|^2_2$$

3.2 Strong convexity, lower bound

Strong convexity

  • $\forall x,y$: $\langle \nabla f(x)-\nabla f(y), x-y \rangle \geq m\|x-y\|^2_2$
  • If $\nabla^2 f(x)$ exists, $\nabla^2 f(x)\succeq mI$
  • When $f$ is quadratic, $m$ is the smallest eigenvalue of the Hessian.

Quadratic lower bound

  • If $f$ is $m$-strongly convex and $g(x)=f(x)-\frac{m}{2}x^T x$, then $g$ is convex.
  • If $f$ is $m$-strongly convex, $f(y)\geq f(x)+\nabla f(x)^T (y-x)+\frac{m}{2}\|x-y\|^2_2$. Corollary (minimize the right-hand side over $y$): $f(y)\geq f(x)-\frac{1}{2m}\|\nabla f(x)\|^2_2$ for all $y$.


  • Step size $\eta>\frac{2}{m}$: divergence.
  • With $\eta=\frac{1}{M}$: $f(x_+)-f(x^*)\leq [1-\frac{m}{M}](f(x)-f(x^*))$
  • Strong convexity ensures that GD makes fast progress when far from the optimal point, because the gradient is large there.

Bound on suboptimality (OL)

If $f$ is $m$-strongly convex:
$$\frac{m}{2}\|x-x^*\|_2^2\leq f(x)-f(x^*)\leq \frac{1}{2m}\|\nabla f(x) \|_2^2$$

Coercivity (strong monotonicity of the gradient) (OL)

If $f$ is $m$-strongly convex:
$$\langle\nabla f(x)-\nabla f(y),x-y\rangle\geq m \|x-y\|^2_2$$

3.3 Smoothness and strong convexity

M and m

$$m\leq \frac{\|\nabla f(x)-\nabla f(y)\|}{\|x-y\|}\leq M$$
$\frac{m}{M}\leq 1$. When $\frac{m}{M}$ is small (an ill-conditioned problem), the trajectory zigzags and converges slowly; when $\frac{m}{M}$ is close to 1, GD converges quickly. When $\frac{m}{M}=1$ and $\eta=\frac{1}{M}$, one step gives $f(x_+)-f(x^*)=0$.


Linear convergence when smooth and strongly convex:

$f(x_t)-f(x^*)\leq O((1-\frac{m}{M})^t)$.

Whenever we have strong convexity, we can guarantee that $x_t$ converges to $x^*$ (Piazza @338).
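The contraction factor $1-\frac{m}{M}$ can be checked numerically. The sketch below assumes the separable quadratic $f(x)=\frac12(m x_1^2+M x_2^2)$ with $m=1$, $M=10$ (illustrative values), runs GD with $\eta=\frac{1}{M}$, and records the per-step ratio $\frac{f(x_+)-f^*}{f(x)-f^*}$, where $f^*=0$.

```python
def f_sc(x, m, M):
    """Separable quadratic with Hessian eigenvalues m and M; minimum value 0."""
    return 0.5 * (m * x[0] ** 2 + M * x[1] ** 2)


def contraction_ratios(m, M, x0, num_steps):
    """Run GD with eta = 1/M and return f(x_{t+1}) / f(x_t) for each step."""
    x = list(x0)
    eta = 1.0 / M
    ratios = []
    for _ in range(num_steps):
        prev = f_sc(x, m, M)
        x = [x[0] - eta * m * x[0], x[1] - eta * M * x[1]]
        ratios.append(f_sc(x, m, M) / prev)
    return ratios


m_val, M_val = 1.0, 10.0
ratios = contraction_ratios(m_val, M_val, [1.0, 1.0], 20)
worst_ratio = max(ratios)  # should not exceed 1 - m/M = 0.9
```

On this example the observed ratio is below the bound, which is expected since $1-\frac{m}{M}$ is a worst-case guarantee.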

4 Oracle lower bounds

4.1 Lipschitz convex function

For Lipschitz convex functions, no first-order algorithm can guarantee an error better than $O(1/\sqrt{T})$ after $T$ oracle calls.

4.2 Smooth convex function

For smooth convex functions, no first-order algorithm can guarantee an error better than $O(1/T^2)$. Gradient descent only achieves $O(1/T)$, so there is room for improvement, which accelerated gradient descent provides.

4.3 Smooth and strongly convex function

For $M$-smooth, $m$-strongly convex functions, no first-order algorithm can guarantee an error better than $O((\frac{\sqrt{K}-1}{\sqrt{K}+1})^T)$, where $K=M/m$ is the condition number.

5 Accelerated gradient method (week 14)

5.1 First order methods

$$x_{t+1}\in x_1+\mathrm{span}(\nabla f(x_1),\dots,\nabla f(x_t))$$

5.2 Convergence performance


5.3 Heavy ball method (momentum)

5.3.1 Update rule 1

$$x_{k+1}=x_k-\eta \nabla f(x_k)+\beta_k (x_k-x_{k-1})$$

  • Vanilla gradient descent: $x_k-\eta \nabla f(x_k)$
  • Momentum term: $\beta_k (x_k-x_{k-1})$
  • Take the vanilla gradient step first, then add the momentum term.
  • Also works in the proximal gradient setting.

Can be rewritten (with a constant step $\alpha_k=\eta$) as:
$$p_k=-\nabla f(x_k)+\beta_k p_{k-1},\qquad x_{k+1}=x_k+\alpha_k p_k$$
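A minimal sketch of update rule 1 with constant $\eta$ and $\beta$ follows. The parameter choice $\eta=\frac{4}{(\sqrt{M}+\sqrt{m})^2}$, $\beta=\left(\frac{\sqrt{M}-\sqrt{m}}{\sqrt{M}+\sqrt{m}}\right)^2$ is the classical tuning for strongly convex quadratics and is an assumption here, since the notes do not specify values; the test problem with $m=1$, $M=100$ is also illustrative.

```python
import math


def heavy_ball(grad, x0, eta, beta, num_steps):
    """x_{k+1} = x_k - eta * grad(x_k) + beta * (x_k - x_{k-1}), with x_{-1} = x_0."""
    x_prev = list(x0)
    x = list(x0)
    for _ in range(num_steps):
        g = grad(x)
        x_next = [xi - eta * gi + beta * (xi - xpi)
                  for xi, gi, xpi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x


# Hypothetical quadratic f(x) = 0.5 * (m x_1^2 + M x_2^2), minimized at the origin.
m_hb, M_hb = 1.0, 100.0


def grad_hb(x):
    return [m_hb * x[0], M_hb * x[1]]


sqrt_m, sqrt_M = math.sqrt(m_hb), math.sqrt(M_hb)
eta_hb = 4.0 / (sqrt_M + sqrt_m) ** 2                      # assumed classical step size
beta_hb = ((sqrt_M - sqrt_m) / (sqrt_M + sqrt_m)) ** 2     # assumed classical momentum
x_hb = heavy_ball(grad_hb, [1.0, 1.0], eta_hb, beta_hb, 200)
```

Setting `beta=0.0` recovers vanilla gradient descent, which makes the two methods easy to compare on the same problem.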

5.3.2 Update rule 2

  • $y_{k+1}=x_k-\eta \nabla f(x_k)$

  • $x_{k+1}=y_{k+1}+\frac{\sqrt{K}-1}{\sqrt{K}+1}(y_{k+1}-y_k)$, where $K=M/m$

5.3.3 Convergence rate

For strongly convex $f$ with condition number $\kappa \geq 1$:
$$\|x_k-x^*\|\leq \left(1-\frac{2}{\sqrt{\kappa}+1}\right)^k \|x_0-x^*\|$$
With Lipschitz gradients only (no strong convexity): unknown.

5.4 Nesterov accelerated gradient

5.4.1 Update rule

$$p_k=-\nabla f(x_k+\beta_k(x_k-x_{k-1}))+\beta_kp_{k-1},\qquad x_{k+1}=x_k+\alpha_kp_k$$

  • Momentum before gradient: the gradient is evaluated at the look-ahead point $x_k+\beta_k(x_k-x_{k-1})$.
  • $\alpha_k=\frac{1}{L}$, where $L$ is the Lipschitz constant of the gradient ($M$ above)
  • $\beta_k=\frac{k-2}{k+1}$
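A minimal sketch of the accelerated method in its look-ahead form: $y_k=x_k+\beta_k(x_k-x_{k-1})$, then a gradient descent step $x_{k+1}=y_k-\frac{1}{M}\nabla f(y_k)$, with $\beta_k=\frac{k-2}{k+1}$. It assumes the iteration counter starts at $k=1$ with $x_0=x_1$ (so the first momentum term vanishes) and compares against plain GD on an illustrative quadratic with $M=100$.

```python
def nesterov(grad, x0, M, num_steps):
    """Accelerated gradient with step 1/M and beta_k = (k - 2) / (k + 1)."""
    x_prev = list(x0)
    x = list(x0)
    for k in range(1, num_steps + 1):
        beta = (k - 2.0) / (k + 1.0)
        # Look-ahead point: momentum is applied before the gradient is taken.
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        g = grad(y)
        x_prev, x = x, [yi - gi / M for yi, gi in zip(y, g)]
    return x


def plain_gd(grad, x0, M, num_steps):
    """Baseline: gradient descent with the same step size 1/M."""
    x = list(x0)
    for _ in range(num_steps):
        x = [xi - gi / M for xi, gi in zip(x, grad(x))]
    return x


# Hypothetical test problem: f(x) = 0.5 * (x_1^2 + 100 x_2^2), f* = 0.
def f_nag(x):
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)


def grad_nag(x):
    return [x[0], 100.0 * x[1]]


T = 200
gap_nag = f_nag(nesterov(grad_nag, [1.0, 1.0], 100.0, T))
gap_gd = f_nag(plain_gd(grad_nag, [1.0, 1.0], 100.0, T))
```

On this problem the accelerated iterates reach a smaller objective gap than plain GD after the same number of gradient evaluations.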

5.4.2 Convergence rate

  • $O(\frac{1}{T^2})$ error for convex functions with Lipschitz gradients
  • $O((1-\frac{2}{\sqrt{\kappa}+1})^k)$ error for $\kappa$-conditioned strongly convex functions.
    Optimal among first-order methods in both settings.

6 Mirror descent

6.1 New motivation

$$x_{t+1}=\arg\min_x\ \eta g_t^Tx+D_{\phi}(x,x_t)$$
Here $D_{\phi}(x,y)=\phi(x)-\phi(y)-\nabla\phi(y)^T(x-y)$ is the Bregman divergence of $\phi$. For $\phi(x)=\frac{1}{2}\|x\|_2^2$ it equals $\frac{1}{2}\|x-y\|_2^2$, which recovers gradient descent.

6.2 Dual norm

For the $\|\cdot\|_p$ norm, the dual norm is $\|\cdot\|_q$, where $\frac{1}{p}+\frac{1}{q}=1$.
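As a small illustration, the sketch below computes the conjugate exponent $q$ for a given $p$ and checks Hölder's inequality $|x^Ty|\leq\|x\|_p\|y\|_q$, which is the property that makes $\|\cdot\|_q$ the dual norm. The helper names and the sample vectors are made up for this example.

```python
def p_norm(v, p):
    """l_p norm of a list of floats; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(vi) for vi in v)
    return sum(abs(vi) ** p for vi in v) ** (1.0 / p)


def dual_exponent(p):
    """Return q with 1/p + 1/q = 1, for p >= 1."""
    if p == 1:
        return float("inf")
    if p == float("inf"):
        return 1.0
    return p / (p - 1.0)


x_vec = [1.0, -2.0, 3.0]
y_vec = [0.5, 4.0, -1.0]
inner = abs(sum(a * b for a, b in zip(x_vec, y_vec)))
# Hoelder bound for p = 3 (so q = 1.5) and for p = 1 (so q = infinity).
bound_p3 = p_norm(x_vec, 3.0) * p_norm(y_vec, dual_exponent(3.0))
bound_p1 = p_norm(x_vec, 1.0) * p_norm(y_vec, dual_exponent(1.0))
```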

6.3 For $\phi=\sum x_i \log x_i$

$$\nabla\phi(y_{t+1})=\nabla\phi(x_t)-\eta g_t$$
For this special $\phi$ (negative entropy), $\nabla\phi(x)_i=1+\log x_i$, so with $g_t=\nabla f(x_t)$:
$$y_{t+1}(i)=x_t(i) e^{-\eta(\nabla f(x_t))_i}$$
$$x_{t+1}=\frac{y_{t+1}}{\|y_{t+1}\|_1}$$
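The two steps above (multiplicative update, then $\ell_1$ normalization) can be sketched as follows. The example assumes the iterates live on the probability simplex and uses a made-up linear objective $f(x)=c^Tx$, whose minimizer over the simplex puts all mass on the smallest entry of $c$.

```python
import math


def entropic_mirror_descent(grad, x0, eta, num_steps):
    """Mirror descent with the negative-entropy mirror map on the simplex."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad(x)
        # Multiplicative update: y(i) = x(i) * exp(-eta * g(i)).
        y = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
        total = sum(y)  # equals the l1 norm of y, since all entries are positive
        x = [yi / total for yi in y]
    return x


# Hypothetical costs; the gradient of c^T x is the constant vector c.
costs = [0.9, 0.2, 0.5]
x_md = entropic_mirror_descent(lambda x: costs, [1.0 / 3] * 3, eta=0.5, num_steps=200)
```

Every iterate stays nonnegative and sums to one without an explicit Euclidean projection, which is the practical appeal of this choice of $\phi$.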
