Optimization Week 14: Stochastic gradient descent

1 Noisy Unbiased (sub) Gradient (NUS)

  • $g$ is a NUS of $f$ if $E[g(x)\mid x]\in \partial f(x),\ \forall x$
  • If $\nabla f(x)$ exists, $E[g(x)\mid x]=\nabla f(x)$
  • Example: $f(x)=\frac{1}{n}\sum_i f_i(x)$, $g(x)=\nabla f_i(x)$, index $i$ chosen uniformly at random.
  • Random coordinate descent: $x_+^i = x^i - \eta\,\frac{\partial}{\partial x_i}f(x)$, coordinate index $i$ chosen uniformly at random.
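A quick sanity check of the unbiasedness property: with $f_i$ chosen as toy 1-D quadratics (an assumption for illustration, not from the notes), the expectation of the sampled component gradient over a uniform index equals the full gradient.

```python
import random

# Toy example: f(x) = (1/n) * sum_i (x - a_i)^2 / 2,
# so grad f(x) = x - mean(a) and grad f_i(x) = x - a_i.
a = [1.0, 2.0, 4.0, 5.0]
n = len(a)

def full_grad(x):
    return x - sum(a) / n

def stochastic_grad(x):
    # NUS: pick index i uniformly at random
    i = random.randrange(n)
    return x - a[i]

# The exact expectation over i equals the full gradient (zero bias):
x = 0.5
expected = sum(x - ai for ai in a) / n
assert abs(expected - full_grad(x)) < 1e-12
```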

2 Stochastic gradient descent

2.1 Update rule

$$x_{k+1} = x_k - \eta\, g(x_k)$$

  • $g(x_k)=$ gradient estimate
  • Random updates at every step
  • Keep track of $x_{BEST}$
  • For $\min_x \frac{1}{n}\sum_{i=1}^{n} f_i(x)$:
    Gradient descent update rule: $x_{k+1}=x_k-\eta_k \frac{1}{n}\sum_i \nabla f_i(x_k)$
    Stochastic gradient descent rule: $x_{k+1}=x_k-\eta_k \nabla f_{i_k}(x_k)$, with $i_k$ a random index
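The SGD rule with $x_{BEST}$ tracking can be sketched as follows; the toy least-squares objective, fixed step size, and iteration count are assumptions for illustration only.

```python
import random

# Toy objective (an assumption): f(x) = (1/n) sum_i (x - a_i)^2 / 2,
# minimized at x* = mean(a) = 3, with grad f_i(x) = x - a_i.
a = [1.0, 2.0, 4.0, 5.0]
n = len(a)
f = lambda x: sum((x - ai) ** 2 for ai in a) / (2 * n)

random.seed(0)
x, eta = 0.0, 0.1
x_best = x
for k in range(2000):
    i = random.randrange(n)      # i_k chosen uniformly at random
    x = x - eta * (x - a[i])     # x_{k+1} = x_k - eta * grad f_{i_k}(x_k)
    if f(x) < f(x_best):         # keep track of the best iterate seen so far
        x_best = x

assert abs(x_best - 3.0) < 0.5   # x_best lands near x* = 3
```

Tracking $x_{BEST}$ matters because the last iterate keeps bouncing inside a noise ball whose radius scales with the step size.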

2.2 Convergence rate

$$E[f(x_{BEST}^T)] - f^* \leq \frac{R^2 + G^2\sum_{k=0}^{T}\eta_k^2}{2\sum_{k=0}^{T}\eta_k}$$

  • $\|x_0 - x^*\|^2 \leq R^2$

  • Variance (second-moment) bound: $G^2 \geq E[\|g(x)\|^2 \mid x]$

  • A fixed step size is not good: the iterates keep jumping around $x^*$ instead of converging

  • Error after $T$ steps:

    |                      | GD              | SGD             |
    |----------------------|-----------------|-----------------|
    | convex $f$           | $O(1/\sqrt{T})$ | $O(1/\sqrt{T})$ |
    | Lipschitz $\nabla f$ | $O(1/T)$        | $O(1/\sqrt{T})$ |
    | strongly convex      | $O(C^T)$        | $O(1/T)$        |
    | cost per iteration   | $O(nd)$         | $O(d)$          |

  • Because SGD is forced to drive its step size toward zero, it cannot adapt to the function: if $f$ is nicer (smooth or strongly convex), gradient descent speeds up, but SGD barely improves.

  • Faster iterations, but slower convergence and no ability to adapt to a nice $f$.
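The contrast in the table can be seen numerically. In this sketch (toy strongly convex objective and step size are assumptions), full-gradient descent with a fixed step contracts geometrically toward $x^*$, while fixed-step SGD stalls at a noise floor.

```python
import random

# Toy strongly convex objective (an assumption): f(x) = (1/n) sum_i (x - a_i)^2 / 2
a = [1.0, 2.0, 4.0, 5.0]
n, eta = len(a), 0.1
mean_a = sum(a) / n              # x* = 3

x_gd = x_sgd = 20.0
random.seed(1)
for k in range(500):
    x_gd = x_gd - eta * (x_gd - mean_a)    # exact gradient: error shrinks by 0.9 each step
    i = random.randrange(n)
    x_sgd = x_sgd - eta * (x_sgd - a[i])   # noisy gradient: variance never vanishes

assert abs(x_gd - mean_a) < 1e-6   # GD: geometric convergence, O(C^T)
assert abs(x_sgd - mean_a) > 1e-6  # SGD: stuck at an O(eta)-size noise floor
```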

2.3 Step size

  • $\sum_{k=0}^{T} \eta_k \rightarrow \infty$
  • $\eta_k \rightarrow 0$ because $\mathrm{Var}(g_k) \nrightarrow 0$; but then SGD is slower than it needs to be
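A schedule such as $\eta_k = \eta_0/\sqrt{k+1}$ (a common choice; the particular constant is an assumption) satisfies both conditions: the individual steps vanish while the partial sums diverge.

```python
import math

# eta_k = eta0 / sqrt(k+1): eta_k -> 0 but sum eta_k -> infinity
eta0 = 0.5
etas = [eta0 / math.sqrt(k + 1) for k in range(100000)]

assert etas[-1] < 0.01 * etas[0]   # steps decay toward zero
assert sum(etas) > 100.0           # partial sums grow without bound (~ 2*eta0*sqrt(T))
```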

3 Mini-batch Stochastic Gradient Descent

3.1 Update rule

$$x_{k+1} = x_k - \eta_k \frac{1}{|I_k|}\sum_{i\in I_k} \nabla f_i(x_k)$$

  • $\text{UPDATE} = \frac{1}{|I_k|}\sum_{i\in I_k} \nabla f_i(x_k)$
  • $E[\text{UPDATE}\mid x_k] = \nabla f(x_k)$
  • $\mathrm{Var}[\text{UPDATE}\mid x_k] = \frac{1}{|I_k|}\mathrm{Var}[\nabla f_i(x_k)\mid x_k]$: the single-sample SGD variance, divided by the batch size
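The $1/|I_k|$ variance scaling can be checked empirically. This is a numerical sketch under assumed toy data, with batches drawn with replacement.

```python
import random
import statistics

# Check: Var[UPDATE | x] ~= Var[grad f_i(x) | x] / |I| for batches drawn
# with replacement, using grad f_i(x) = x - a_i on toy data.
a = [1.0, 2.0, 4.0, 5.0]
x, B = 0.0, 4
random.seed(2)

def minibatch_update():
    batch = [random.choice(a) for _ in range(B)]   # I_k with |I_k| = B
    return sum(x - ai for ai in batch) / B

samples = [minibatch_update() for _ in range(200000)]
var_single = statistics.pvariance([x - ai for ai in a])  # Var of one grad f_i = 2.5
var_batch = statistics.pvariance(samples)

assert abs(var_batch - var_single / B) < 0.05            # ~ 2.5 / 4 = 0.625
```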

3.2 Convergence rate

  • Error after $T$ steps:

    |                    | SGD             | Mini-batch SGD                                  |
    |--------------------|-----------------|-------------------------------------------------|
    | error              | $O(1/\sqrt{T})$ | $O(1/\sqrt{T} + 1/\sqrt{\lvert I\rvert T})$     |
    | cost per iteration | $O(d)$          | $O(\lvert I\rvert d)$                           |
  • Gain in the number of iterations, but no gain in total compute cost
  • Need variance reduction
  • Even for Lipschitz $\nabla f$
  • Convergence is faster, but only by a constant factor for a fixed batch size $B$
  • Mini-batch SGD follows essentially the same trend as SGD, but with smaller oscillations

3.3 Step size

  • The problem with SGD: it needs $\eta_k \rightarrow 0$ because $\mathrm{Var} \nrightarrow 0$
  • This remains a problem for mini-batch SGD.

4 Variance reduction in SGD

4.1 Bias and Variance

  • Bias: $E[g(x_k)\mid x_k] - \nabla f(x_k)$
  • Variance: $\mathrm{Var}[\,\|g(x_k)\|\mid x_k\,]$
  • When $\text{Bias}=0$ and $\mathrm{Var}\leq G^2$: $O\!\left(\frac{G^2}{\sqrt{T}}\right)$ convergence

The same holds for mini-batch SGD, where the constant improves by a factor of the batch size $B$.

4.2 Stochastic Average Gradient

4.2.1 Update rule

Minimization problem:
$$\min_x \frac{1}{n}\sum_{i=1}^{n} f_i(x)$$

  • Maintain $g_1^k,\dots,g_n^k$, the current estimates of $\nabla f_i(x)$
  • Initialize with all samples: $g_i^0 = \nabla f_i(x^0)$

Update:

  • At step $k$, pick $i_k$ at random:
    $g_{i_k}^k = \nabla f_{i_k}(x^{k-1})$
    $g_j^k = g_j^{k-1},\ \text{for } j \neq i_k$

  • Update $x^k = x^{k-1} - \eta_k \frac{1}{n}\sum_{i=1}^{n} g_i^k$, using all the $g_i$, including the stale ones
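The update above can be sketched in a few lines. This is a minimal SAG sketch; the toy quadratic components, step size, and iteration count are assumptions, and the running average of the gradient table is maintained in $O(1)$ per step.

```python
import random

# SAG on the toy objective f(x) = (1/n) sum_i (x - a_i)^2 / 2 (an assumption),
# with grad f_i(x) = x - a_i and minimizer x* = mean(a) = 3.
a = [1.0, 2.0, 4.0, 5.0]
n, eta = len(a), 0.05
x = 0.0

# Initialize the gradient table with all samples: g_i = grad f_i(x^0)
g = [x - ai for ai in a]
g_avg = sum(g) / n                  # maintain the running average

random.seed(3)
for k in range(5000):
    i = random.randrange(n)         # pick i_k at random
    new_gi = x - a[i]               # refresh only that table entry
    g_avg += (new_gi - g[i]) / n    # update the average in O(1)
    g[i] = new_gi
    x = x - eta * g_avg             # step uses the stale gradients too

assert abs(x - 3.0) < 1e-6          # converges with a FIXED step size
```

Note that, unlike plain SGD, the fixed step size here does not leave a noise floor: as $x^k$ stabilizes, the stale gradients agree with the fresh ones and the update's variance vanishes.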

4.2.2 Step size

  • The bias is not zero at any given $x^k$, but the estimator is asymptotically unbiased (bias $\rightarrow 0$, variance $\rightarrow 0$). This is in contrast to plain SGD, where the bias was always zero.
  • As a result we can use a fixed step size, which is the key to an algorithm that adapts to the niceness of $f$.

4.2.3 Convergence rate

  • Smooth function: $O(1/T)$
  • Strongly convex: $O(C^T)$, linear convergence

4.3 SAGA, variant of stochastic average gradient (note)
