Week 14: Stochastic gradient descent
1 Noisy Unbiased (Sub)gradient (NUS)
- $g$ is a NUS of $f$ if $E[g(x)\mid x]\in \partial f(x)$ for all $x$;
- If $\nabla f(x)$ exists, then $E[g(x)\mid x]=\nabla f(x)$
- For $f(x)=\frac{1}{n}\sum_i f_i(x)$, the estimator $g(x)=\nabla f_i(x)$ with index $i$ chosen uniformly at random is a NUS
- Random coordinate descent: $x_+^i=x^i-\eta \frac{\partial}{\partial x_i}f(x)$, with coordinate index $i$ chosen uniformly at random
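As a concrete illustration, here is a minimal sketch of both estimators for a hypothetical least-squares finite sum (the setup `A`, `b` and all constants are assumptions for the example, not from the lecture):

```python
import numpy as np

# Minimal sketch of two noisy unbiased (sub)gradient estimators for a
# finite-sum objective f(x) = (1/n) * sum_i f_i(x), using least squares
# f_i(x) = 0.5 * (a_i @ x - b_i)**2 as a hypothetical example.

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.normal(size=(n, d))   # rows are the a_i
b = rng.normal(size=n)

def full_gradient(x):
    # grad f(x) = (1/n) * sum_i a_i * (a_i @ x - b_i)
    return A.T @ (A @ x - b) / n

def single_sample_gradient(x):
    # g(x) = grad f_i(x), index i uniform; E[g(x) | x] = grad f(x).
    i = rng.integers(n)
    return A[i] * (A[i] @ x - b[i])

def random_coordinate_gradient(x):
    # Keep one uniformly random coordinate of grad f(x), scaled by d so
    # that E[g(x) | x] = grad f(x). (In practice only the single partial
    # derivative is computed; the factor d can instead be absorbed into
    # the step size, as in the coordinate-descent update above.)
    i = rng.integers(d)
    g = np.zeros(d)
    g[i] = d * full_gradient(x)[i]
    return g
```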
2 Stochastic gradient descent
2.1 Update rule
$$x_{k+1}=x_k-\eta\, g(x_k)$$
- $g(x_k)=$ gradient estimate
- Random updates at every step
- Keep track of $x_{\text{BEST}}$, the best iterate seen so far (the iterates need not improve monotonically)
For the finite-sum problem $\min_x \frac{1}{n}\sum_{i=1}^{n}f_i(x)$:
- Gradient descent update rule: $x_{k+1}=x_k-\eta_k \frac{1}{n}\sum_i\nabla f_i(x_k)$
- Stochastic gradient descent rule: $x_{k+1}=x_k-\eta_k\nabla f_{i_k}(x_k)$, where $i_k$ is a random index
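A minimal SGD loop with the $x_{\text{BEST}}$ bookkeeping from above; the least-squares objective and all constants are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def f(x):
    # f(x) = (1/n) * sum_i 0.5 * (a_i @ x - b_i)**2
    return 0.5 * np.mean((A @ x - b) ** 2)

x = np.zeros(d)
x_best, f_best = x.copy(), f(x)
for k in range(1000):
    eta = 0.1 / np.sqrt(k + 1)              # diminishing step size (see 2.3)
    i = rng.integers(n)                      # random index i_k
    x = x - eta * A[i] * (A[i] @ x - b[i])   # x_{k+1} = x_k - eta_k * grad f_{i_k}(x_k)
    fx = f(x)
    if fx < f_best:                          # iterates are noisy, so track x_BEST
        x_best, f_best = x.copy(), fx
```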
2.2 Convergence rate
$$E[f(x_{\text{BEST}}^T)]-f^*\leq \frac{R^2+G^2\sum_{k=0}^T\eta_k^2}{2\sum_{k=0}^T\eta_k}$$
where (a worked step-size instantiation follows at the end of this subsection):
- $\|x_0-x^*\|^2\leq R^2$
- Variance bound: $G^2 \geq E[\|g(x)\|^2\mid x]$
- A fixed step size is not good: the gradient noise makes the iterates keep jumping around $x^*$ instead of converging to it
- Error after $T$ steps:

| | GD | SGD |
| --- | --- | --- |
| convex $f$ | $O(1/\sqrt{T})$ | $O(1/\sqrt{T})$ |
| Lipschitz $\nabla f$ | $O(1/T)$ | $O(1/\sqrt{T})$ |
| strongly convex | $O(C^T)$ | $O(1/T)$ |
| cost per iteration | $O(nd)$ | $O(d)$ |
- Because SGD is forced to use step sizes close to zero, it cannot adapt to the function: when the function is nicer (e.g., smooth or strongly convex), gradient descent speeds up, but SGD essentially cannot
- Faster iterations, but slower convergence and no ability to adapt to a nice $f$
- No linear convergence: the variance prevents the method from self-tuning
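To see where the $O(1/\sqrt{T})$ rates above come from, one can instantiate the bound at the top of this subsection with the standard ($T$-dependent) constant step size $\eta_k \equiv \frac{R}{G\sqrt{T+1}}$, assuming the sums run over $k=0,\dots,T$:

$$E[f(x_{\text{BEST}}^T)]-f^*\leq \frac{R^2+G^2\,(T+1)\,\frac{R^2}{G^2(T+1)}}{2(T+1)\,\frac{R}{G\sqrt{T+1}}}=\frac{2R^2}{2R\sqrt{T+1}/G}=\frac{RG}{\sqrt{T+1}}=O(1/\sqrt{T})$$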
2.3 Step size
- $\sum_{k=0}^T \eta_k\rightarrow\infty$ (the steps must be able to travel arbitrarily far)
- $\eta_k\rightarrow0$, because $\mathrm{Var}(g_k)\nrightarrow 0$; but a vanishing step size also makes SGD slower than it needs to be
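A common schedule satisfying both conditions is $\eta_k = \eta_0/\sqrt{k+1}$; a minimal sketch ($\eta_0$ is a hypothetical choice):

```python
import numpy as np

def eta(k, eta0=0.1):
    # eta_k -> 0, while the partial sums grow like sqrt(T),
    # so sum_k eta_k -> infinity: both conditions hold.
    return eta0 / np.sqrt(k + 1)
```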
3 Mini-batch Stochastic Gradient Descent
3.1 Update rule
$$x_{k+1}=x_k-\eta_k \frac{1}{|I_k|}\sum_{i\in I_k}\nabla f_i(x_k)$$
- $\text{UPDATE}=\frac{1}{|I_k|}\sum_{i\in I_k}\nabla f_i(x_k)$
- $E[\text{UPDATE}\mid x_k]=\nabla f(x_k)$
- $\mathrm{Var}[\text{UPDATE}\mid x_k]=\frac{1}{|I_k|}\mathrm{Var}[\nabla f_i(x_k)\mid x_k]$: averaging over a batch sampled with replacement cuts the variance of the single-sample SGD update by a factor $|I_k|$ (see the sketch below)
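An empirical check of this variance reduction under the hypothetical least-squares setup used earlier (indices sampled with replacement): the total variance of the update should shrink roughly like $1/|I|$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x = rng.normal(size=d)

def minibatch_update(batch_size):
    # (1/|I_k|) * sum_{i in I_k} grad f_i(x), I_k sampled with replacement
    idx = rng.integers(n, size=batch_size)
    return (A[idx] * (A[idx] @ x - b[idx])[:, None]).mean(axis=0)

for B in (1, 10, 100):
    samples = np.stack([minibatch_update(B) for _ in range(5000)])
    # Trace of the covariance of the update: scales roughly like 1/B.
    print(B, samples.var(axis=0).sum())
```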
3.2 Convergence rate
- Error after $T$ steps:

| | SGD | Mini-batch SGD |
| --- | --- | --- |
| error after $T$ steps | $O(1/\sqrt{T})$ | $O(1/\sqrt{T}+1/\sqrt{\lvert I\rvert T})$ |
| cost per iteration | $O(d)$ | $O(\lvert I\rvert d)$ |
- Gain in the number of iterations, but no gain in total compute cost
- Variance reduction is needed: the slow $O(1/\sqrt{T})$ term persists even for Lipschitz $\nabla f$
- Convergence is faster, but only by a constant factor for a fixed batch size $|I|$
- Mini-batch SGD follows essentially the same trend as SGD, but with smaller oscillations around it
3.3 Step size
- The problem with SGD remains: it needs $\eta_k\rightarrow0$ because $\mathrm{Var}\nrightarrow 0$
- This remains a problem for mini-batch SGD: the variance is smaller by a factor $|I|$, but it still does not vanish
4 Variance reduction in SGD
4.1 Bias and Variance
- Bias: $E[g(x_k)\mid x_k]-\nabla f(x_k)$
- Variance: $\mathrm{Var}[\,\|g(x_k)\|\mid x_k]$
- When $\text{Bias}=0$ and $\mathrm{Var}\leq G^2$: $O\!\left(\frac{G^2}{\sqrt{T}}\right)$ convergence
- The same holds for mini-batch SGD, where the variance constant improves by a factor of $|I|$ (the batch size)
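A quick sanity check of these definitions for the single-sample estimator at a fixed point $x$, again under the hypothetical least-squares setup: the bias is exactly zero, and the variance of $\|g(x)\|$ is bounded.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x = rng.normal(size=d)

full_grad = A.T @ (A @ x - b) / n
per_sample = A * (A @ x - b)[:, None]       # all n gradients grad f_i(x)

bias = per_sample.mean(axis=0) - full_grad  # E[g(x)|x] - grad f(x) = 0
var_norm = np.linalg.norm(per_sample, axis=1).var()  # Var[ ||g(x)|| | x ]
print(np.linalg.norm(bias), var_norm)
```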
4.2 Stochastic Average Gradient
4.2.1 Update rule
Minimization problem: $\min_x \frac{1}{n}\sum_{i=1}^n f_i(x)$
- Maintain $g_1^k,\dots,g_n^k$, the current estimates of $\nabla f_i$ at step $k$
- Initialize with all samples: $g_i^0=\nabla f_i(x^0)$
Update:
- At step $k$, pick $i_k$ at random and refresh one entry: $g_{i_k}^k=\nabla f_{i_k}(x^{k-1})$ and $g_j^k=g_j^{k-1}$ for $j \neq i_k$
- Update $x^k=x^{k-1}-\eta_k \frac{1}{n}\sum_{i=1}^{n}g^k_i$, using all the $g_i$, including the stale ones (a sketch follows below)
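A minimal sketch of SAG on the hypothetical least-squares problem. The average of the stored gradients is maintained incrementally, so each step costs $O(d)$ rather than $O(nd)$; the fixed step size is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

x = np.zeros(d)
g = A * (A @ x - b)[:, None]   # g_i^0 = grad f_i(x^0) for all i
g_avg = g.mean(axis=0)         # (1/n) * sum_i g_i
eta = 0.01                     # fixed step size (illustrative choice)

for k in range(2000):
    i = rng.integers(n)                # pick i_k at random
    new_gi = A[i] * (A[i] @ x - b[i])  # grad f_{i_k}(x^{k-1})
    g_avg += (new_gi - g[i]) / n       # refresh the running average in O(d)
    g[i] = new_gi
    x = x - eta * g_avg                # uses all g_i, stale ones included
```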
4.2.2 Step size
- The bias is nonzero at any given $x^k$, but the estimator is asymptotically unbiased: bias $\rightarrow0$ and variance $\rightarrow0$. This is in contrast to plain SGD, where the bias is always zero but the variance does not vanish
- As a consequence, we can use a fixed step size, which is the key to an algorithm that adapts to the niceness of $f$
4.2.3 Convergence rate
- Smooth function: $O(1/T)$
- Strongly convex: $O(C^T)$, i.e., linear convergence