[First order method] Gradient descent

1. Gradient descent

1.1 Model to consider

Consider unconstrained, smooth convex optimization

$$\min_x f(x)$$

i.e., $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$. Denote the optimal criterion value by $f^\star = \min_x f(x)$, and a solution by $x^\star$.

Gradient descent: choose an initial value $x^{(0)}$, then repeat

$$x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)}), \qquad k = 1, 2, 3, \ldots$$

Stop at some point, e.g., when $\|\nabla f(x^{(k)})\|_2$ is small.
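
As a concrete sketch of the iteration, here is a minimal NumPy implementation; the quadratic test function, step size, and tolerance below are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def gradient_descent(grad, x0, t=0.1, max_iter=1000, tol=1e-8):
    """Fixed-step gradient descent: x^(k) = x^(k-1) - t * grad(x^(k-1))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop when the gradient is small
            break
        x = x - t * g
    return x

# Illustrative example: f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 2.0])
x_hat = gradient_descent(lambda x: A @ x - b, x0=np.zeros(2), t=0.2)
print(x_hat, np.linalg.solve(A, b))   # the two should nearly coincide
```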

1.2 Interpretation

1.2.1 Interpretation via Newton's method

Since $f(x)$ is smooth and convex, the minimizer $x^\star$ is characterized by the first-order condition

$$\nabla f(x^\star) = 0$$

  • If $\nabla f(x) = 0$ can be solved easily, we can obtain $x^\star$ directly from this equation.
  • If $\nabla f(x) = 0$ is difficult to solve, we can use a linear approximation of $\nabla f(x)$ at $x^{(0)}$:
    $$\ell(x) = \nabla f(x^{(0)}) + \nabla^2 f(x^{(0)})\,(x - x^{(0)})$$

    and by setting this linear approximation to zero, we get
    $$x^{(\mathrm{new})} = x^{(0)} - \nabla^2 f(x^{(0)})^{-1} \nabla f(x^{(0)})$$

But for many functions, computing the Hessian $\nabla^2 f(x)$ is difficult or expensive. So we can replace $\nabla^2 f(x^{(0)})$ by $\frac{1}{t} I$:

$$x^{(\mathrm{new})} = x^{(0)} - t\,\nabla f(x^{(0)})$$

The core idea behind gradient descent is therefore to use a linear approximation of $\nabla f(x)$ to find the root of $\nabla f(x)$, which is exactly the Newton-Raphson method: a method for finding successively better approximations to the roots (zeros) of a real-valued function.
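
To make the connection concrete, here is a tiny one-dimensional sketch contrasting a Newton-Raphson step on $f'(x)$ with the gradient step obtained when $f''(x)$ is replaced by $1/t$; the function $f(x) = x^4$ and the step size are made up for illustration.

```python
# Hypothetical example: f(x) = x**4, so f'(x) = 4x**3 and f''(x) = 12x**2.
def newton_step(x):
    return x - (4 * x**3) / (12 * x**2)   # x - f''(x)^{-1} f'(x)

def gradient_step(x, t=0.01):
    return x - t * (4 * x**3)             # x - t f'(x): f''(x) replaced by 1/t

x_newton = x_grad = 2.0
for _ in range(5):
    x_newton = newton_step(x_newton)
    x_grad = gradient_step(x_grad)
print(x_newton, x_grad)                   # both approach the root x* = 0 of f'(x)
```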

1.2.2 Interpretation via quadratic approximation of the original function

The linear approximation of $\nabla f(x)$ can equivalently be regarded as a quadratic approximation of the original function $f(x)$:

$$q(x) = f(x^{(0)}) + \nabla f(x^{(0)})^\top (x - x^{(0)}) + \frac{1}{2}(x - x^{(0)})^\top \nabla^2 f(x^{(0)})\,(x - x^{(0)})$$

which satisfies
$$\nabla q(x) = \ell(x)$$

We can again use $\frac{1}{t} I$ to replace $\nabla^2 f(x^{(0)})$:

$$\tilde f(x) = f(x^{(0)}) + \nabla f(x^{(0)})^\top (x - x^{(0)}) + \frac{1}{2t}(x - x^{(0)})^\top (x - x^{(0)})$$

Then, setting the gradient of $\tilde f(x)$ to zero:

$$\nabla \tilde f(x) = \nabla f(x^{(0)}) + \frac{1}{t}(x - x^{(0)}) = 0$$

we get
$$x^{(\mathrm{new})} = x^{(0)} - t\,\nabla f(x^{(0)})$$
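
As a quick sanity check, the following SymPy sketch (one-dimensional for simplicity; the symbols f0 and g0 stand for $f(x^{(0)})$ and $\nabla f(x^{(0)})$ and are illustrative) minimizes the surrogate $\tilde f$ and recovers exactly the gradient step.

```python
import sympy as sp

x, x0, t, f0, g0 = sp.symbols('x x0 t f0 g0', real=True)
# One-dimensional surrogate: f~(x) = f(x0) + f'(x0)(x - x0) + (1/(2t)) (x - x0)^2
f_tilde = f0 + g0 * (x - x0) + (x - x0)**2 / (2 * t)
x_new = sp.solve(sp.diff(f_tilde, x), x)[0]
print(sp.simplify(x_new))   # -> -g0*t + x0, i.e. the gradient descent update
```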

1.3 How to choose the step size $t_k$

If $t$ is too large, the algorithm may not converge. If it is too small, the algorithm converges too slowly. So how do we choose a suitable $t$?

  • Fixed $t$.
  • Exact line search can be used if $f(x)$ is simple enough:
    $$t = \operatorname*{argmin}_{s \ge 0} f(x - s\,\nabla f(x))$$
  • But in most cases, we use backtracking line search:

(i) Fix $\beta \in (0, 1)$ and $\alpha \in (0, 1/2]$ (in practice, one often chooses $\alpha = 1/2$).
(ii) At each iteration, start with $t = 1$, and while

$$f(x - t\,\nabla f(x)) > f(x) - \alpha t \|\nabla f(x)\|_2^2$$

shrink $t = \beta t$. Otherwise, perform the gradient descent update:
$$x^+ = x - t\,\nabla f(x)$$

From the backtracking condition, we can see that

$$f(x - t\,\nabla f(x)) \le f(x) - \alpha t \|\nabla f(x)\|_2^2 \le f(x)$$

which ensures that each gradient descent step moves in a genuine descent direction, i.e., the objective value never increases. A code sketch is given below.
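
Here is a minimal NumPy sketch of gradient descent with backtracking line search; the quadratic test problem and the parameter values are illustrative assumptions.

```python
import numpy as np

def backtracking_gd(f, grad, x0, alpha=0.5, beta=0.8, max_iter=500, tol=1e-8):
    """Gradient descent where each step size is chosen by backtracking."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        t = 1.0
        # Shrink t until the sufficient-decrease condition holds.
        while f(x - t * g) > f(x) - alpha * t * (g @ g):
            t *= beta
        x = x - t * g
    return x

# Illustrative test problem: f(x) = 0.5 x^T A x - b^T x
A = np.array([[10.0, 2.0], [2.0, 1.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
print(backtracking_gd(f, grad, np.zeros(2)), np.linalg.solve(A, b))
```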

1.4 Convergence analysis

Theorem [Lipschitz gradient, fixed $t$]: If $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$, and $\nabla f$ is $L$-Lipschitz continuous, then gradient descent with fixed step size $t \le 1/L$ satisfies

$$f(x^{(k)}) - f^\star \le \frac{\|x^{(0)} - x^\star\|_2^2}{2tk}$$

Proof:

  • From the Lipschitz property of $\nabla f$, we have
    $$\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2 \;\Longrightarrow\; f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|_2^2$$
  • From the definition of the gradient descent update, we know
    $$x^+ = x - t\,\nabla f(x)$$
  • From the convexity of $f$, we have
    $$f(x^\star) \ge f(x) + \nabla f(x)^\top (x^\star - x)$$

    which can be written as
    $$f(x) \le f(x^\star) + \nabla f(x)^\top (x - x^\star)$$

Combining these three facts, we have

$$
\begin{aligned}
f(x^+) = f(x - t\,\nabla f(x)) &\le f(x) - t\|\nabla f(x)\|_2^2 + \frac{L t^2}{2}\|\nabla f(x)\|_2^2 \\
&= f(x) - \Big(1 - \frac{Lt}{2}\Big) t\,\|\nabla f(x)\|_2^2 \\
&\le f(x) - \frac{t}{2}\|\nabla f(x)\|_2^2 \\
&\le f(x^\star) + \nabla f(x)^\top (x - x^\star) - \frac{t}{2}\|\nabla f(x)\|_2^2 \\
&= f(x^\star) + \frac{1}{2}\Big(\frac{2}{t}(x - x^+)^\top (x - x^\star) - \frac{1}{t}\|x - x^+\|_2^2\Big) \\
&= f(x^\star) + \frac{1}{2t}\Big(2(x - x^+)^\top (x - x^\star) - \|x - x^+\|_2^2\Big) \\
&= f(x^\star) + \frac{1}{2t}\Big(\|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2\Big)
\end{aligned}
$$

where the third line uses $t \le 1/L$ and the fifth line substitutes $\nabla f(x) = (x - x^+)/t$.

Applying this bound at iterations $1, \ldots, k$, we have

$$
\begin{aligned}
f(x^{(k)}) &\le f(x^\star) + \frac{1}{2t}\Big(\|x^{(k-1)} - x^\star\|_2^2 - \|x^{(k)} - x^\star\|_2^2\Big) \\
&\;\;\vdots \\
f(x^{(1)}) &\le f(x^\star) + \frac{1}{2t}\Big(\|x^{(0)} - x^\star\|_2^2 - \|x^{(1)} - x^\star\|_2^2\Big)
\end{aligned}
$$

Summing all of these inequalities, the norm terms telescope and we get

$$f(x^{(1)}) + \cdots + f(x^{(k)}) \le k f(x^\star) + \frac{1}{2t}\Big(\|x^{(0)} - x^\star\|_2^2 - \|x^{(k)} - x^\star\|_2^2\Big) \le k f(x^\star) + \frac{1}{2t}\|x^{(0)} - x^\star\|_2^2$$

Then, since the objective values $f(x^{(i)})$ are non-increasing, $f(x^{(k)})$ is bounded by their average:
$$f(x^{(k)}) \le \frac{f(x^{(1)}) + \cdots + f(x^{(k)})}{k} \le f(x^\star) + \frac{\|x^{(0)} - x^\star\|_2^2}{2tk}$$
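
The following NumPy sketch (an illustrative diagonal quadratic with $f^\star = 0$, chosen so the gap is easy to compute) tracks $f(x^{(k)}) - f^\star$ against the bound $\|x^{(0)} - x^\star\|_2^2 / (2tk)$.

```python
import numpy as np

# Illustrative quadratic: f(x) = 0.5 x^T A x, minimized at x* = 0 with f* = 0.
A = np.diag([1.0, 4.0, 9.0])
L = 9.0                                # largest eigenvalue = Lipschitz constant of grad f
t = 1.0 / L
f = lambda z: 0.5 * z @ A @ z
x = np.array([1.0, 1.0, 1.0])
x0 = x.copy()

for k in range(1, 101):
    x = x - t * (A @ x)                # fixed-step gradient descent
    bound = np.sum(x0**2) / (2 * t * k)
    if k % 25 == 0:
        print(f"k={k:3d}  gap={f(x):.2e}  bound={bound:.2e}")
```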

Theorem [Lipschitz gradient, backtracking]: If $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$, and $\nabla f$ is $L$-Lipschitz continuous, then gradient descent with backtracking line search satisfies

$$f(x^{(k)}) - f^\star \le \frac{\|x^{(0)} - x^\star\|_2^2}{2 t_{\min} k}$$

where $t_{\min} = \min\{1, \beta/L\}$.

Proof:
The argument is the same as for fixed $t$; we only need a lower bound $t_{\min}$ on the accepted step size.
From the backtracking line search idea, we know there exists a $t_0$ such that for all $t \in (0, t_0]$,

$$f(x - t\,\nabla f(x)) \le f(x) - \frac{t}{2}\|\nabla f(x)\|_2^2$$

so backtracking either accepts $t = 1$ immediately or stops at some $t_{\mathrm{backtrack}} \in (\beta t_0, t_0]$. From the inequality in the last theorem,

$$f(x^+) \le f(x) - \Big(1 - \frac{Lt}{2}\Big) t\,\|\nabla f(x)\|_2^2$$

we see that $t_0 = 1/L$ works.

So the accepted step size is always at least $\min\{1, \beta/L\}$, i.e.,

$$t_{\min} = \min\{1, \beta/L\}$$
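
As a quick empirical check of this bound (the quadratic, $\alpha$, and $\beta$ below are illustrative assumptions), the step sizes accepted by backtracking indeed never fall below $\min\{1, \beta/L\}$.

```python
import numpy as np

# Illustrative quadratic with known L (largest eigenvalue of A).
A = np.diag([1.0, 25.0])
L, alpha, beta = 25.0, 0.5, 0.8
f = lambda z: 0.5 * z @ A @ z
grad = lambda z: A @ z

x = np.array([1.0, 1.0])
accepted = []
for _ in range(20):
    g = grad(x)
    t = 1.0
    while f(x - t * g) > f(x) - alpha * t * (g @ g):
        t *= beta
    accepted.append(t)
    x = x - t * g

print(min(accepted), ">=", min(1.0, beta / L))   # lower bound from the theorem
```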

Theorem [Lipschitz gradient and strong convexity]: If $f$ is $m$-strongly convex and $\nabla f$ is $L$-Lipschitz continuous, then gradient descent with fixed step size $t \le 2/(m + L)$ or with backtracking line search satisfies

$$f(x^{(k)}) - f^\star \le c^k\,\frac{L}{2}\|x^{(0)} - x^\star\|_2^2$$

with $0 < c < 1$.

Proof:

$$
\begin{aligned}
\|x^+ - x^\star\|_2^2 &= \|x - t\,\nabla f(x) - x^\star\|_2^2 \\
&= \|x - x^\star\|_2^2 + t^2\|\nabla f(x)\|_2^2 - 2t\,\nabla f(x)^\top (x - x^\star) \\
&\le \|x - x^\star\|_2^2 + t^2\|\nabla f(x)\|_2^2 - 2t\Big(\frac{mL}{m + L}\|x - x^\star\|_2^2 + \frac{1}{m + L}\|\nabla f(x)\|_2^2\Big) \\
&= \Big(1 - \frac{2tmL}{m + L}\Big)\|x - x^\star\|_2^2 + \Big(t^2 - \frac{2t}{m + L}\Big)\|\nabla f(x)\|_2^2 \\
&\le \Big(1 - \frac{2tmL}{m + L}\Big)\|x - x^\star\|_2^2
\end{aligned}
$$

where the first inequality is the standard coercivity bound $\nabla f(x)^\top (x - x^\star) \ge \frac{mL}{m+L}\|x - x^\star\|_2^2 + \frac{1}{m+L}\|\nabla f(x)\|_2^2$ for $m$-strongly convex $f$ with $L$-Lipschitz gradient, and the last step uses $t \le 2/(m + L)$.

Then we have
$$\|x^{(k)} - x^\star\|_2^2 \le \Big(1 - \frac{2tmL}{m + L}\Big)^k \|x^{(0)} - x^\star\|_2^2$$

and from the $L$-Lipschitz continuity of $\nabla f$ (together with $\nabla f(x^\star) = 0$),
$$f(x^{(k)}) - f(x^\star) \le \nabla f(x^\star)^\top (x^{(k)} - x^\star) + \frac{L}{2}\|x^{(k)} - x^\star\|_2^2 = \frac{L}{2}\|x^{(k)} - x^\star\|_2^2 \le \frac{L}{2}\Big(1 - \frac{2tmL}{m + L}\Big)^k \|x^{(0)} - x^\star\|_2^2$$
so the theorem holds with $c = 1 - \frac{2tmL}{m + L}$.
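
The geometric decay can be seen numerically; the strongly convex quadratic and the step size $t = 2/(m+L)$ below are illustrative assumptions.

```python
import numpy as np

# Illustrative strongly convex quadratic: f(x) = 0.5 x^T A x with x* = 0, f* = 0,
# m = smallest and L = largest eigenvalue of A.
A = np.diag([1.0, 10.0])
m, L = 1.0, 10.0
t = 2.0 / (m + L)
c = 1.0 - 2.0 * t * m * L / (m + L)    # contraction factor from the theorem

f = lambda z: 0.5 * z @ A @ z
x = np.array([1.0, 1.0])
x0 = x.copy()
for k in range(1, 31):
    x = x - t * (A @ x)                # fixed-step gradient descent
    if k % 10 == 0:
        bound = (c**k) * (L / 2) * np.sum(x0**2)
        print(f"k={k:2d}  gap={f(x):.3e}  bound={bound:.3e}")
```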

1.5 Summary

  • For the Lipschitz-gradient case, gradient descent has convergence rate $O(1/k)$;
    i.e., to get $f(x^{(k)}) - f^\star \le \epsilon$, we need $O(1/\epsilon)$ iterations (see the worked example after this list).
  • For the Lipschitz-gradient and strongly convex case, gradient descent has a geometric (linear) convergence rate: the gap shrinks like $c^k$ with $c < 1$, so only $O(\log(1/\epsilon))$ iterations are needed.
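
For a concrete sense of scale (the numbers are chosen only for illustration), suppose $\|x^{(0)} - x^\star\|_2 = 1$, $L = 100$, $t = 1/L$, and target accuracy $\epsilon = 10^{-3}$. Then the fixed-step bound requires roughly

$$k \;\ge\; \frac{\|x^{(0)} - x^\star\|_2^2}{2t\epsilon} \;=\; \frac{1}{2 \cdot (1/100) \cdot 10^{-3}} \;=\; 5 \times 10^4$$

iterations, whereas under strong convexity the gap is multiplied by the fixed factor $c$ each iteration, so the required iteration count grows only like $\log(1/\epsilon)$.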

