1 Noisy Unbiased (sub) Gradient (NUS)

• $g$ is a NUS of $f$ if $E[g(x)\mid x]\in \partial f(x),\ \forall x$
• If $\nabla f(x)$ exists, then $E[g(x)\mid x]=\nabla f(x)$
• Example: for $f(x)=\frac{1}{n}\sum_i f_i(x)$, take $g(x)=\nabla f_i(x)$ with index $i$ chosen uniformly at random (sketched below)
• Random coordinate descent: $x_+^i=x^i-\eta \frac{\partial}{\partial x_i}f(x)$, with coordinate index $i$ chosen uniformly at random
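
A minimal sketch of the random-index example, assuming hypothetical least-squares components $f_i(x)=\frac{1}{2}(a_i^\top x-b_i)^2$ (the data `A`, `b` and the sizes below are made up for illustration): averaging many draws of the single-sample gradient recovers the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)  # hypothetical data

def full_gradient(x):
    # nabla f(x) for f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2
    return A.T @ (A @ x - b) / n

def nus(x):
    # g(x) = nabla f_i(x) with i uniform, so E[g(x) | x] = nabla f(x)
    i = rng.integers(n)
    return A[i] * (A[i] @ x - b[i])

x = rng.normal(size=d)
avg = np.mean([nus(x) for _ in range(50000)], axis=0)
print(np.linalg.norm(avg - full_gradient(x)))  # small; shrinks with more draws
```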

2 Stochastic Gradient Descent (SGD)

2.1 Update rule

$x_{k+1}=x_k-\eta\, g(x_k)$

• $g(x_k)=$ gradient estimate
• Random updates at every step
• Keep track of $x_{BEST}$
For $\min_x\frac{1}{n}\sum_{i=1}^{n}f_i(x)$:
• Gradient descent update rule: $x_{k+1}=x_k-\eta_k \frac{1}{n}\sum_i\nabla f_i(x_k)$
• Stochastic gradient descent update rule: $x_{k+1}=x_k-\eta_k\nabla f_{i_k}(x_k)$, with $i_k$ a random index (see the sketch below)
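
A minimal SGD loop under the same assumed least-squares setup as the sketch above; the decaying step size $\eta_k=0.1/\sqrt{k+1}$ is an arbitrary choice, not one prescribed by these notes. It tracks $x_{BEST}$ as suggested.

```python
def sgd(x0, T):
    # Reuses A, b and nus() from the earlier sketch.
    x, x_best, f_best = x0.copy(), x0.copy(), np.inf
    for k in range(T):
        eta = 0.1 / np.sqrt(k + 1)             # assumed decaying step size
        x = x - eta * nus(x)                   # x_{k+1} = x_k - eta_k * grad f_{i_k}(x_k)
        f_x = 0.5 * np.mean((A @ x - b) ** 2)
        if f_x < f_best:                       # keep track of x_BEST
            x_best, f_best = x.copy(), f_x
    return x_best
```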

2.2 Convergence rate

$E[f(x_{BEST}^T)]-f^*\leq \frac{R^2+G^2\sum_{k=0}^T\eta_k^2}{2\sum_{k=0}^T\eta_k}$

• $\|x_0-x^*\|^2\leq R^2$

• Variance bound: $G^2 \geq E[\|g(x)\|^2\mid x]$

• A fixed step size is not good: the gradient noise never vanishes, so the iterates keep jumping away from $x^*$

• Error after $T$ steps (last row: cost per iteration):

| | GD | SGD |
| --- | --- | --- |
| convex $f$ | $O(1/\sqrt{T})$ | $O(1/\sqrt{T})$ |
| Lipschitz $\nabla f$ | $O(1/T)$ | $O(1/\sqrt{T})$ |
| strongly convex | $O(C^T)$ | $O(1/T)$ |
| cost per iteration | $O(nd)$ | $O(d)$ |

• Because SGD is forced to drive its step size toward zero, it cannot adapt to the function: when $f$ is nicer (Lipschitz gradient, strong convexity), gradient descent speeds up, but SGD essentially does not.

• Faster iterations, but slower convergence and an inability to adapt to a nice $f$.

2.3 Step size

• $\sum_{k=0}^T \eta_k\rightarrow\infty$
• $\eta_k\rightarrow0$, because $Var(g_k)\nrightarrow 0$; but this makes SGD slower than it needs to be
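
A worked instance of why this tuning matters: with a constant step $\eta_k=\eta$, the bound from 2.2 becomes $\frac{R^2}{2\eta T}+\frac{G^2\eta}{2}$, and the horizon-dependent choice $\eta=\frac{R}{G\sqrt{T}}$ (effectively a shrinking step) yields the $O(1/\sqrt{T})$ rate:

$$\frac{R^2+G^2T\eta^2}{2\eta T}\,\Bigg|_{\eta=\frac{R}{G\sqrt{T}}}=\frac{RG}{2\sqrt{T}}+\frac{RG}{2\sqrt{T}}=\frac{RG}{\sqrt{T}}.$$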

3 Mini-batch SGD

3.1 Update rule

$x_{k+1}=x_k-\eta_k \frac{1}{|I_k|}\sum_{i\in I_k}\nabla f_i(x_k)$

• UPDATE $=\frac{1}{|I_k|}\sum_{i\in I_k}\nabla f_i(x_k)$ (see the sketch below)
• $E[\text{UPDATE}\mid x_k]=\nabla f(x_k)$
• $Var[\text{UPDATE}\mid x_k]=Var[\text{SGD}\mid x_k]/|I_k|$: averaging $|I_k|$ independent samples divides the variance by $|I_k|$
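
A minimal mini-batch update sketch under the same assumed least-squares setup; the batch size $B=10$ is arbitrary, and sampling with replacement is assumed so the variance claim above applies directly.

```python
def minibatch_update(x, eta, B=10):
    # Reuses A, b, n and rng from the earlier sketch.
    I = rng.integers(n, size=B)             # I_k: batch sampled with replacement
    grad = A[I].T @ (A[I] @ x - b[I]) / B   # (1/|I_k|) * sum_{i in I_k} grad f_i(x)
    return x - eta * grad                   # x_{k+1} = x_k - eta_k * UPDATE
```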

3.2 Convergence rate

• Error after $T$ steps (second row: cost per iteration):

| SGD | Mini-batch SGD |
| --- | --- |
| $O(1/\sqrt{T})$ | $O(1/\sqrt{T}+1/\sqrt{\vert I\vert T})$ |
| $O(d)$ | $O(\vert I\vert d)$ |
• Gain in the number of iterations, but no gain in total compute cost
• Need variance reduction
• This holds even for Lipschitz $\nabla f$
• Convergence is faster, but only by a constant factor for a fixed batch size $B$
• Mini-batch SGD follows essentially the same trend as SGD, just with smaller oscillations

3.3 Step size

• The problem with SGD is that it needs $\eta_k\rightarrow0$ because $Var\nrightarrow 0$
• This remains a problem for mini-batch SGD

4 Variance reduction in SGD

4.1 Bias and Variance

• Bias: $E[g(x_k)\mid x_k]-\nabla f(x_k)$
• Variance: $Var[\|g(x_k)\|\mid x_k]$
• When $Bias=0$ and $Var\leq G^2$: $O(\frac{G^2}{\sqrt{T}})$ convergence

The same holds for mini-batch SGD, where the constant improves by a factor of $B$.
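
A one-line check of the factor-$B$ improvement, assuming the batch indices are sampled independently:

$$Var\Big[\frac{1}{B}\sum_{j=1}^{B}\nabla f_{i_j}(x)\,\Big|\,x\Big]=\frac{1}{B^2}\sum_{j=1}^{B}Var[\nabla f_{i_j}(x)\mid x]=\frac{Var[\nabla f_{i_1}(x)\mid x]}{B}.$$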

4.2 Stochastic Average Gradient (SAG)

4.2.1 Update rule

Minimization problem:
$\min_x \frac{1}{n}\sum_{i=1}^n f_i(x)$

• Maintain $g_1^k,\dots,g_n^k$, the current estimates of the gradients $\nabla f_i$
• Initialize with all samples: $g_i^0=\nabla f_i(x^0)$

Update:

• At step $k$, pick $i_k$ randomly and set
$g_{i_k}^k=\nabla f_{i_k}(x^{k-1})$, $\quad g_j^k=g_j^{k-1} \text{ for } j \neq i_k$

• Update $x^k=x^{k-1}-\eta_k \frac{1}{n}\sum_{i=1}^{n}g^k_i$, using all $g_i$, including the stale ones (sketched below)
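
A minimal sketch of this update under the same assumed least-squares setup; the fixed step size $\eta=0.05$ is an arbitrary illustration of the point made in 4.2.2 below.

```python
def sag(x0, T, eta=0.05):
    # Reuses A, b, n and rng from the earlier sketches.
    x = x0.copy()
    g = A * (A @ x - b)[:, None]          # g_i^0 = grad f_i(x^0), one row per i
    g_avg = g.mean(axis=0)                # (1/n) * sum_i g_i, kept up to date
    for _ in range(T):
        i = rng.integers(n)               # pick i_k uniformly at random
        g_new = A[i] * (A[i] @ x - b[i])  # fresh gradient at the current iterate
        g_avg += (g_new - g[i]) / n       # swap in the new term; others stay stale
        g[i] = g_new
        x = x - eta * g_avg               # x^k = x^{k-1} - eta * (1/n) sum_i g_i^k
    return x
```

Adjusting the running average by only the replaced term keeps the per-step cost at $O(d)$ rather than $O(nd)$, at the price of $O(nd)$ memory for the stored gradients.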

4.2.2 Step size

• The bias is not zero at any given $x^k$, but the method is asymptotically unbiased (bias $\rightarrow0$, variance $\rightarrow0$). This is in contrast to plain SGD, where the bias was always exactly zero but the variance did not vanish.
• This means we can use a fixed step size, which is the key to an algorithm that adapts to the niceness of $f$.

4.2.3 Convergence rate

• Smooth $f$: $O(1/T)$
• Strongly convex: $O(C^T)$, i.e. linear convergence
