Week 10: Gradient Descent
1 Motivation
1.1 First Order Taylor expansion
$f(x_t+\eta d)\approx f(x_t)+\nabla f(x_t)^T \eta d$
The approximation is minimized over directions when $d=-\nabla f(x_t)$.
$x_{t+1}=x_t-\eta \nabla f(x_t)$
Where does $\eta$ come from? From a quadratic approximation.
1.2 Quadratic approximation
$f(x_{t+1})=f(x_t)+\nabla f(x_t)^T(x_{t+1}-x_t)+\frac{1}{2\eta}\|x_{t+1}-x_t\|^2_2$
The quadratic proximal term keeps $x_{t+1}$ from deviating too far from $x_t$.
Minimize w.r.t. $x_{t+1}$:
$x_{t+1}=x_t-\eta \nabla f(x_t)$
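The update above can be sketched in a few lines; the quadratic test function and all names here are illustrative, not from the lecture:

```python
import numpy as np

def gradient_descent(grad, x0, eta, iters=100):
    """Vanilla gradient descent: x_{t+1} = x_t - eta * grad(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - eta * grad(x)
    return x

# Minimize f(x) = ||x - b||^2 / 2, whose gradient is x - b; the minimizer is b.
b = np.array([1.0, -2.0])
x_star = gradient_descent(lambda x: x - b, x0=np.zeros(2), eta=0.5, iters=100)
```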
2 Step size
Exact line search
$x_{t+1}=x_t+\eta_t d_t$, where $\eta_t=\arg\min_{\eta} f(x_t+\eta d_t)$
Backtracking line search (BTLS)
Goal: ensure that $f(x+\eta d)$ decreases enough.
By convexity:
$f(x+\eta d)\geq f(x)+\eta \nabla f(x)^T d$
BTLS for gradient descent
If $f$ is $M$-smooth, then $\eta_{BTLS}\geq \beta/M$, and $f(x_+)\leq f(x)-\frac{\alpha \beta}{M} \|\nabla f(x)\|_2^2$
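A minimal sketch of backtracking line search with the Armijo sufficient-decrease test; the one-dimensional quadratic example and parameter values are my own, chosen so the result is consistent with the $\eta_{BTLS}\geq \beta/M$ guarantee:

```python
import numpy as np

def backtracking_line_search(f, grad_fx, x, d, eta0=1.0, alpha=0.3, beta=0.5):
    """Shrink eta by factor beta until
    f(x + eta*d) <= f(x) + alpha * eta * <grad_fx, d> (Armijo condition)."""
    eta = eta0
    fx = f(x)
    while f(x + eta * d) > fx + alpha * eta * (grad_fx @ d):
        eta *= beta
    return eta

# Example: f(x) = 0.5 * M * x^2 with M = 10, starting at x = 1, direction d = -grad.
M = 10.0
f = lambda x: 0.5 * M * x[0] ** 2
x = np.array([1.0])
g = np.array([M * x[0]])                      # gradient of f at x
eta = backtracking_line_search(f, g, x, d=-g) # -> 0.125, which is >= beta/M = 0.05
```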
3 Convergence, step size
3.1 Smoothness, upper bound, and self-tuning
Lipschitz gradients (note: it is the gradients, not $f$ itself, that are Lipschitz)
$\|\nabla f(x)-\nabla f(y)\|\leq M\|x-y\|, \quad \forall x,y$
Then $f$ has $M$-Lipschitz gradients; $f$ need not be convex. $M$ is the smoothness parameter; when $f$ is quadratic, $M$ is the largest eigenvalue of the Hessian.
Quadratic upper bound
- If $f$ has $M$-Lipschitz gradients and is convex, then $g(x)=\frac{M}{2} x^Tx-f(x)$ is convex.
- $f(y)\leq f(x)+\nabla f(x)^T(y-x)+\frac{M}{2}\|y-x\|_2^2$
Step size
Take $y=x-\eta \nabla f(x)$ in the quadratic upper bound:
$f(y)\leq f(x)-\eta \|\nabla f(x)\|^2+\frac{M}{2}\eta^2\|\nabla f(x)\|^2=f(x)+\eta(\frac{M}{2}\eta-1) \|\nabla f(x)\|^2$
Thus, to guarantee descent we need $\frac{M}{2}\eta-1< 0$, i.e. $\eta< \frac{2}{M}$; the bound decreases fastest at $\eta=\frac{1}{M}$.
Convergence
- $\eta<\frac{2}{M}$ ensures convergence; $\eta=\frac{1}{M}$ gives the fastest guaranteed decrease.
- If $\eta\leq \frac{1}{M}$, GD is a descent method with $f(x_t)-f^*\leq\frac{1}{2\eta t}\|x_0-x^*\|_2^2=O(\frac{M}{t})$
- Self-tuning: the update $\eta\nabla f(x_t) \rightarrow 0$ as $x_t \rightarrow x^*$.
- Smoothness ensures convergence of the function value, and the iterates do not oscillate, because the update shrinks as we approach the optimum.
- However, the iterates are not guaranteed to converge to the optimal point if $f$ is not strongly convex, since the function can be flat near the minimum; this is guaranteed by the strong convexity discussed below.
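A quick numerical check of these step-size thresholds on a one-dimensional quadratic (my own example, not from the lecture): $\eta=\frac{1}{M}$ and $\eta<\frac{2}{M}$ converge, while $\eta>\frac{2}{M}$ diverges.

```python
import numpy as np

M = 4.0                       # smoothness of f(x) = 0.5 * M * x^2 (f'' = M)
grad = lambda x: M * x

def run(eta, iters=50, x0=1.0):
    """Run gradient descent on the 1-D quadratic and return the final iterate."""
    x = x0
    for _ in range(iters):
        x = x - eta * grad(x)
    return x

x_small = run(eta=1.0 / M)    # eta = 1/M: converges in one step here
x_edge  = run(eta=1.9 / M)    # eta < 2/M: converges (contraction factor |1 - 1.9| = 0.9)
x_big   = run(eta=2.1 / M)    # eta > 2/M: diverges (factor |1 - 2.1| = 1.1)
```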
Bound on suboptimality (OL)
If $f$ is $M$-smooth:
$\frac{1}{2M}\|\nabla f(x) \|_2^2\leq f(x)-f(x^*)\leq \frac{M}{2}\|x-x^*\|^2$
Co-coercivity (OL)
If $f$ is $M$-smooth:
$\langle \nabla f(x)-\nabla f(y),x-y\rangle \geq \frac{1}{M}\|\nabla f(x)-\nabla f(y)\|^2_2$
3.2 Strong convexity, lower bound
Strong convexity
- $\forall x,y,\ \langle \nabla f(x)-\nabla f(y), x-y \rangle \geq m\|x-y\|^2_2$
- If $\nabla^2 f(x)$ exists, $\nabla^2 f(x)\succeq mI$
- When $f$ is quadratic, $m$ is the smallest eigenvalue of the Hessian.
Quadratic lower bound
- If $f$ is $m$-strongly convex, then $g(x)=f(x)-\frac{m}{2}x^T x$ is convex.
- If $f$ is $m$-strongly convex, $f(y)\geq f(x)+\nabla f(x)^T (y-x)+\frac{m}{2}\|x-y\|^2_2$; corollary: $f(y)\geq f(x)-\frac{1}{2m}\|\nabla f(x)\|^2_2$
Convergence
- If $\eta>\frac{2}{m}$: divergence.
- With $\eta=\frac{1}{M}$: $f(x_+)-f(x^*)\leq [1-\frac{m}{M}](f(x)-f(x^*))$
- Strong convexity ensures that GD makes very fast progress when far away from the optimal point.
Bound on suboptimality (OL)
If $f$ is $m$-strongly convex:
$\frac{m}{2}\|x-x^*\|^2\leq f(x)-f(x^*)\leq \frac{1}{2m}\|\nabla f(x) \|_2^2$
Co-coercivity (OL)
If $f$ is $m$-strongly convex:
$\langle \nabla f(x)-\nabla f(y),x-y\rangle \geq m \|x-y\|^2_2$
3.3 Smoothness and strong convexity
M and m
$m\leq \frac{\|\nabla f(x)-\nabla f(y)\|}{\|x-y\|}\leq M$
$\frac{m}{M}\leq 1$. When $\frac{m}{M}$ is small (ill-conditioned), the trajectory zigzags; when it is close to $1$, GD converges quickly. When $\frac{m}{M}=1$, $f(x_+)-f(x^*)=0$ after a single step.
Convergence
Linear convergence when smooth and strongly convex:
$f(x_t)-f(x^*)\leq O((1-\frac{m}{M})^t)$
Whenever we have strong convexity, we can guarantee that $x_t$ converges to $x^*$. (Piazza @338)
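The per-step contraction $f(x_+)-f^*\leq(1-\frac{m}{M})(f(x)-f^*)$ can be checked numerically on a small quadratic; the matrix, iterate count, and names here are mine, chosen for illustration:

```python
import numpy as np

# f(x) = 0.5 * x^T A x with Hessian eigenvalues m = 1 and M = 10; f^* = 0 at x = 0.
A = np.diag([1.0, 10.0])
m, M = 1.0, 10.0
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
eta = 1.0 / M
gaps = []                           # suboptimality gaps f(x_t) - f^*
for _ in range(30):
    gaps.append(0.5 * x @ A @ x)
    x = x - eta * grad(x)

rate = 1 - m / M                    # guaranteed per-step contraction factor
```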
4 Oracle lower bounds
4.1 Lipschitz convex function
For Lipschitz convex functions, no algorithm can guarantee error better than $O(1/\sqrt{T})$.
4.2 Smooth convex function
For smooth convex functions, no algorithm can guarantee error better than $O(1/T^2)$. So gradient descent ($O(1/T)$) can be improved, via accelerated gradient descent.
4.3 Smooth and strongly convex function
For $M$-smooth, $m$-strongly convex functions, no algorithm can guarantee error better than $O((\frac{\sqrt{K}-1}{\sqrt{K}+1})^T)$, where $K=M/m$.
5 Accelerated gradient method (week 14)
5.1 First order methods
$x_{t+1}\in x_1+\mathrm{span}(\nabla f(x_1),\dots,\nabla f(x_t))$
5.2 Convergence performance
5.3 Heavy ball method (momentum)
5.3.1 Update rule 1
$x_{k+1}=x_k-\eta \nabla f(x_k)+\beta_k (x_k-x_{k-1})$
- Vanilla gradient descent step: $x_k-\eta \nabla f(x_k)$
- Momentum term: $\beta_k (x_k-x_{k-1})$
- First the vanilla update, then the momentum update
- Also works in the proximal gradient setting.
Can be rewritten as: $p_k=-\nabla f(x_k)+\beta_k p_{k-1}$, $x_{k+1}=x_k+\alpha_k p_k$
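Update rule 1 in code, on an ill-conditioned quadratic; the step size and momentum weight below are hand-picked for this example, not the theoretically optimal choices:

```python
import numpy as np

def heavy_ball(grad, x0, eta, beta, iters=500):
    """Heavy ball: x_{k+1} = x_k - eta*grad(x_k) + beta*(x_k - x_{k-1})."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x, x_prev = x - eta * grad(x) + beta * (x - x_prev), x
    return x

# f(x) = 0.5 * x^T A x with m = 1, M = 100 (condition number 100); minimizer is 0.
A = np.diag([1.0, 100.0])
x = heavy_ball(lambda v: A @ v, x0=np.ones(2), eta=1.0 / 100, beta=0.9)
```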
5.3.2 Update rule 2
$y_{k+1}=x_k-\eta \nabla f(x_k)$
$x_{k+1}=y_{k+1}+\frac{\sqrt{K}-1}{\sqrt{K}+1}(y_{k+1}-y_k)$
5.3.3 Convergence rate
For strongly convex $f$ with condition number $\kappa \geq 1$:
$\|x_k-x^*\|\leq (1-\frac{2}{\sqrt{\kappa}+1})^k \|x_0-x^*\|$
For Lipschitz-only functions: unknown.
5.4 Nesterov accelerated gradient
5.4.1 Update rule
$p_k=-\nabla f(x_k+\beta_k(x_k-x_{k-1}))+\beta_k p_{k-1}$, $x_{k+1}=x_k+\alpha_k p_k$
- Momentum is applied before the gradient evaluation
- $\alpha_k=\frac{1}{L}$
- $\beta_k=\frac{k-2}{k+1}$
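A sketch of Nesterov acceleration in the gradient-step-plus-extrapolation form, with $\alpha_k=1/L$ and $\beta_k=(k-2)/(k+1)$ clamped at $0$ for the first steps (a common convention; the test problem is my own):

```python
import numpy as np

def nesterov(grad, x0, L, iters=2000):
    """Nesterov accelerated gradient:
    y_{k+1} = x_k - (1/L) grad(x_k);  x_{k+1} = y_{k+1} + beta_k (y_{k+1} - y_k)."""
    x = y_prev = np.asarray(x0, dtype=float)
    for k in range(1, iters + 1):
        beta = max(0.0, (k - 2) / (k + 1))  # momentum weight, 0 for k <= 2
        y = x - (1.0 / L) * grad(x)          # plain gradient step
        x = y + beta * (y - y_prev)          # extrapolation (momentum) step
        y_prev = y
    return y_prev

# Quadratic f(x) = 0.5 * x^T A x with L = 10; the minimizer is the origin.
A = np.diag([1.0, 10.0])
sol = nesterov(lambda v: A @ v, x0=np.ones(2), L=10.0)
```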
5.4.2 Convergence rate
- $O(\frac{1}{T^2})$ error for Lipschitz gradients
- $O((1-\frac{2}{\sqrt{\kappa}+1})^k)$ error for $\kappa$-conditioned strongly convex functions
Optimal for all first-order settings.
6 Mirror descent
6.1 New motivation
$x_{t+1}=\arg\min_x\ \eta g_t^T x+\frac{1}{2}D_{\phi}(x,x_t)$
where $D_{\phi}$ is the Bregman divergence.
6.2 Dual norm
For the $\|\cdot\|_p$ norm, its dual norm is $\|\cdot\|_q$, where $\frac{1}{p}+\frac{1}{q}=1$.
6.3 For $\phi=\sum x_i \log x_i$
$\nabla\phi(y_{t+1})=\nabla\phi(x_t)-\eta g_t$
For this special $\phi$,
$y_{t+1}(i)=x_t(i) e^{-\eta(\nabla f(x_t))_i}$
$x_{t+1}=\frac{y_{t+1}}{\|y_{t+1}\|_1}$
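For this entropic $\phi$ the two steps above give the exponentiated-gradient update on the probability simplex; a sketch, where the linear objective and all names are my own example:

```python
import numpy as np

def exponentiated_gradient(grad, x0, eta, iters=200):
    """Mirror descent with phi(x) = sum_i x_i log x_i on the simplex:
    multiplicative update y_i = x_i * exp(-eta * g_i), then L1 normalization."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        y = x * np.exp(-eta * grad(x))
        x = y / np.linalg.norm(y, 1)
    return x

# Minimize the linear function <c, x> over the simplex; the optimum puts all
# mass on the coordinate with the smallest c_i (index 1 here).
c = np.array([0.5, 0.2, 0.9])
x = exponentiated_gradient(lambda v: c, x0=np.ones(3) / 3, eta=0.5)
```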