Week 14: Stochastic gradient descent
1 Noisy Unbiased (Sub)gradient (NUS)
- $g$ is a NUS of $f$ if $E[g(x)\mid x]\in \partial f(x)$ for all $x$;
- If $\nabla f(x)$ exists, then $E[g(x)\mid x]=\nabla f(x)$
- For $f(x)=\frac{1}{n}\sum_i f_i(x)$, the estimator $g(x)=\nabla f_i(x)$ with index $i$ chosen uniformly at random is a NUS
- Random coordinate descent: $x_+^i=x^i-\eta \frac{\partial}{\partial x_i}f(x)$, with coordinate index $i$ chosen uniformly at random
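As a concrete illustration, here is a minimal sketch of both estimators for a hypothetical least-squares finite sum (the setup `A`, `b` and all constants are assumptions for the example, not from the lecture):

```python
import numpy as np

# Minimal sketch of two noisy unbiased (sub)gradient estimators for a
# finite-sum objective f(x) = (1/n) * sum_i f_i(x), using least squares
# f_i(x) = 0.5 * (a_i @ x - b_i)**2 as a hypothetical example.

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.normal(size=(n, d))   # rows are the a_i
b = rng.normal(size=n)

def full_gradient(x):
    # grad f(x) = (1/n) * sum_i a_i * (a_i @ x - b_i)
    return A.T @ (A @ x - b) / n

def single_sample_gradient(x):
    # g(x) = grad f_i(x), index i uniform; E[g(x) | x] = grad f(x).
    i = rng.integers(n)
    return A[i] * (A[i] @ x - b[i])

def random_coordinate_gradient(x):
    # Keep one uniformly random coordinate of grad f(x), scaled by d so
    # that E[g(x) | x] = grad f(x). (In practice only the single partial
    # derivative is computed; the factor d can instead be absorbed into
    # the step size, as in the coordinate-descent update above.)
    i = rng.integers(d)
    g = np.zeros(d)
    g[i] = d * full_gradient(x)[i]
    return g
```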
2 Stochastic gradient descent
2.1 Update rule
$$x_{k+1}=x_k-\eta\, g(x_k)$$
- $g(x_k)=$ gradient estimate
- Random updates at every step
- Keep track of $x_{\text{BEST}}$, the best iterate seen so far (the iterates need not improve monotonically)
For the finite-sum problem $\min_x \frac{1}{n}\sum_{i=1}^{n}f_i(x)$:
- Gradient descent update rule: $x_{k+1}=x_k-\eta_k \frac{1}{n}\sum_i\nabla f_i(x_k)$
- Stochastic gradient descent rule: $x_{k+1}=x_k-\eta_k\nabla f_{i_k}(x_k)$, where $i_k$ is a random index
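A minimal SGD loop with the $x_{\text{BEST}}$ bookkeeping from above; the least-squares objective and all constants are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def f(x):
    # f(x) = (1/n) * sum_i 0.5 * (a_i @ x - b_i)**2
    return 0.5 * np.mean((A @ x - b) ** 2)

x = np.zeros(d)
x_best, f_best = x.copy(), f(x)
for k in range(1000):
    eta = 0.1 / np.sqrt(k + 1)              # diminishing step size (see 2.3)
    i = rng.integers(n)                      # random index i_k
    x = x - eta * A[i] * (A[i] @ x - b[i])   # x_{k+1} = x_k - eta_k * grad f_{i_k}(x_k)
    fx = f(x)
    if fx < f_best:                          # iterates are noisy, so track x_BEST
        x_best, f_best = x.copy(), fx
```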
2.2 Convergence rate
$$E[f(x_{\text{BEST}}^T)]-f^*\leq \frac{R^2+G^2\sum_{k=0}^T\eta_k^2}{2\sum_{k=0}^T\eta_k}$$
where (a worked step-size instantiation follows at the end of this subsection):
- $\|x_0-x^*\|^2\leq R^2$
- Variance bound: $G^2 \geq E[\|g(x)\|^2\mid x]$
- A fixed step size is not good: the gradient noise makes the iterates keep jumping around $x^*$ instead of converging to it
- Error after $T$ steps:

| | GD | SGD |
| --- | --- | --- |
| convex $f$ | $O(1/\sqrt{T})$ | $O(1/\sqrt{T})$ |
| Lipschitz $\nabla f$ | $O(1/T)$ | $O(1/\sqrt{T})$ |
| strongly convex | $O(C^T)$ | $O(1/T)$ |
| cost per iteration | $O(nd)$ | $O(d)$ |
- Because SGD is forced to use step sizes close to zero, it cannot adapt to the function: when the function is nicer (e.g., smooth or strongly convex), gradient descent speeds up, but SGD essentially cannot
- Faster iterations, but slower convergence and no ability to adapt to a nice $f$
- No linear convergence: the variance prevents the method from self-tuning
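To see where the $O(1/\sqrt{T})$ rates above come from, one can instantiate the bound at the top of this subsection with the standard ($T$-dependent) constant step size $\eta_k \equiv \frac{R}{G\sqrt{T+1}}$, assuming the sums run over $k=0,\dots,T$:

$$E[f(x_{\text{BEST}}^T)]-f^*\leq \frac{R^2+G^2\,(T+1)\,\frac{R^2}{G^2(T+1)}}{2(T+1)\,\frac{R}{G\sqrt{T+1}}}=\frac{2R^2}{2R\sqrt{T+1}/G}=\frac{RG}{\sqrt{T+1}}=O(1/\sqrt{T})$$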
2.3 Step size
- $\sum_{k=0}^T \eta_k\rightarrow\infty$ (the steps must be able to travel arbitrarily far)
- $\eta_k\rightarrow0$, because $\mathrm{Var}(g_k)\nrightarrow 0$; but a vanishing step size also makes SGD slower than it needs to be
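A common schedule satisfying both conditions is $\eta_k = \eta_0/\sqrt{k+1}$; a minimal sketch ($\eta_0$ is a hypothetical choice):

```python
import numpy as np

def eta(k, eta0=0.1):
    # eta_k -> 0, while the partial sums grow like sqrt(T),
    # so sum_k eta_k -> infinity: both conditions hold.
    return eta0 / np.sqrt(k + 1)
```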
3 Mini-batch Stochastic Gradient Descent
3.1 Update rule
$$x_{k+1}=x_k-\eta_k \frac{1}{|I_k|}\sum_{i\in I_k}\nabla f_i(x_k)$$
- $\text{UPDATE}=\frac{1}{|I_k|}\sum_{i\in I_k}\nabla f_i(x_k)$
- $E[\text{UPDATE}\mid x_k]=\nabla f(x_k)$
- $\mathrm{Var}[\text{UPDATE}\mid x_k]=\frac{1}{|I_k|}\mathrm{Var}[\nabla f_i(x_k)\mid x_k]$: averaging over a batch sampled with replacement cuts the variance of the single-sample SGD update by a factor $|I_k|$ (see the sketch below)
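An empirical check of this variance reduction under the hypothetical least-squares setup used earlier (indices sampled with replacement): the total variance of the update should shrink roughly like $1/|I|$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x = rng.normal(size=d)

def minibatch_update(batch_size):
    # (1/|I_k|) * sum_{i in I_k} grad f_i(x), I_k sampled with replacement
    idx = rng.integers(n, size=batch_size)
    return (A[idx] * (A[idx] @ x - b[idx])[:, None]).mean(axis=0)

for B in (1, 10, 100):
    samples = np.stack([minibatch_update(B) for _ in range(5000)])
    # Trace of the covariance of the update: scales roughly like 1/B.
    print(B, samples.var(axis=0).sum())
```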
3.2 Convergence rate
- Error after $T$ steps:

| | SGD | Mini-batch SGD |
| --- | --- | --- |
| error after $T$ steps | $O(1/\sqrt{T})$ | $O(1/\sqrt{T}+1/\sqrt{\lvert I\rvert T})$ |
| cost per iteration | $O(d)$ | $O(\lvert I\rvert d)$ |
- Gain in the number of iterations, but no gain in total compute cost
- Variance reduction is needed: the slow $O(1/\sqrt{T})$ term persists even for Lipschitz $\nabla f$
- Convergence is faster, but only by a constant factor for a fixed batch size $|I|$
- Mini-batch SGD follows essentially the same trend as SGD, but with smaller oscillations around it
3.3 Step size
- The problem with SGD remains: it needs $\eta_k\rightarrow0$ because $\mathrm{Var}\nrightarrow 0$
- This remains a problem for mini-batch SGD: the variance is smaller by a factor $|I|$, but it still does not vanish
4 Variance reduction in SGD
4.1 Bias and Variance
- Bias: $E[g(x_k)\mid x_k]-\nabla f(x_k)$
- Variance: $\mathrm{Var}[\,\|g(x_k)\|\mid x_k]$
- When $\text{Bias}=0$ and $\mathrm{Var}\leq G^2$: $O\!\left(\frac{G^2}{\sqrt{T}}\right)$ convergence
- The same holds for mini-batch SGD, where the variance constant improves by a factor of $|I|$ (the batch size)
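A quick sanity check of these definitions for the single-sample estimator at a fixed point $x$, again under the hypothetical least-squares setup: the bias is exactly zero, and the variance of $\|g(x)\|$ is bounded.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x = rng.normal(size=d)

full_grad = A.T @ (A @ x - b) / n
per_sample = A * (A @ x - b)[:, None]       # all n gradients grad f_i(x)

bias = per_sample.mean(axis=0) - full_grad  # E[g(x)|x] - grad f(x) = 0
var_norm = np.linalg.norm(per_sample, axis=1).var()  # Var[ ||g(x)|| | x ]
print(np.linalg.norm(bias), var_norm)
```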
4.2 Stochastic Average Gradient
4.2.1 Update rule
Minimization problem: $\min_x \frac{1}{n}\sum_{i=1}^n f_i(x)$
- Maintain $g_1^k,\dots,g_n^k$, the current estimates of $\nabla f_i$ at step $k$
- Initialize with all samples: $g_i^0=\nabla f_i(x^0)$
Update:
- At step $k$, pick $i_k$ at random and refresh one entry: $g_{i_k}^k=\nabla f_{i_k}(x^{k-1})$ and $g_j^k=g_j^{k-1}$ for $j \neq i_k$
- Update $x^k=x^{k-1}-\eta_k \frac{1}{n}\sum_{i=1}^{n}g^k_i$, using all the $g_i$, including the stale ones (a sketch follows below)
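A minimal sketch of SAG on the hypothetical least-squares problem. The average of the stored gradients is maintained incrementally, so each step costs $O(d)$ rather than $O(nd)$; the fixed step size is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

x = np.zeros(d)
g = A * (A @ x - b)[:, None]   # g_i^0 = grad f_i(x^0) for all i
g_avg = g.mean(axis=0)         # (1/n) * sum_i g_i
eta = 0.01                     # fixed step size (illustrative choice)

for k in range(2000):
    i = rng.integers(n)                # pick i_k at random
    new_gi = A[i] * (A[i] @ x - b[i])  # grad f_{i_k}(x^{k-1})
    g_avg += (new_gi - g[i]) / n       # refresh the running average in O(d)
    g[i] = new_gi
    x = x - eta * g_avg                # uses all g_i, stale ones included
```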
4.2.2 Step size
- The bias is nonzero at any given $x^k$, but the estimator is asymptotically unbiased: bias $\rightarrow0$ and variance $\rightarrow0$. This is in contrast to plain SGD, where the bias is always zero but the variance does not vanish
- As a consequence, we can use a fixed step size, which is the key to an algorithm that adapts to the niceness of $f$
4.2.3 Convergence rate
- Smooth function: $O(1/T)$
- Strongly convex: $O(C^T)$, i.e., linear convergence