【Example】
Question: how do we compute the mean $\bar{x}$ of a set of samples?
Answer:
- Method 1: collect all the data, then average:
$$\mathbb{E}[X] \approx \bar{x}:=\frac{1}{N} \sum_{i=1}^N x_i$$
Problem: we have to wait until all the data have been collected before we can compute the average.
- Method 2: an incremental, iterative approach.
Let $w_{k+1}$ and $w_k$ denote the averages of the first $k$ and the first $k-1$ samples, respectively:
$$\begin{aligned} & w_{k+1}=\frac{1}{k} \sum_{i=1}^k x_i, \quad k=1,2, \ldots \\ & w_k=\frac{1}{k-1} \sum_{i=1}^{k-1} x_i, \quad k=2,3, \ldots \end{aligned}$$
We observe that these two averages are related:
$$w_{k+1}=\frac{1}{k} \sum_{i=1}^k x_i =\frac{1}{k}\left(\sum_{i=1}^{k-1} x_i+x_k\right) =\frac{1}{k}\left((k-1) w_k+x_k\right)=w_k-\frac{1}{k}\left(w_k-x_k\right).$$
Verification: unrolling the recursion confirms that it reproduces the sample mean, so we have obtained an iterative algorithm for computing the average:
$$\begin{aligned} w_1 & =x_1, \\ w_2 & =w_1-\frac{1}{1}\left(w_1-x_1\right)=x_1, \\ w_3 & =w_2-\frac{1}{2}\left(w_2-x_2\right)=x_1-\frac{1}{2}\left(x_1-x_2\right)=\frac{1}{2}\left(x_1+x_2\right), \\ w_4 & =w_3-\frac{1}{3}\left(w_3-x_3\right)=\frac{1}{3}\left(x_1+x_2+x_3\right), \\ \vdots & \\ w_{k+1} & =\frac{1}{k} \sum_{i=1}^k x_i . \end{aligned}$$
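As a minimal illustration, the recursion can be coded directly; the Gaussian data stream with true mean 5.0 is an assumption made purely for this demo:

```python
import numpy as np

# Minimal sketch of the incremental mean: w_{k+1} = w_k - (1/k) * (w_k - x_k).
# The Gaussian data stream (true mean 5.0) is an illustrative assumption.
rng = np.random.default_rng(0)
stream = rng.normal(5.0, 1.0, size=1000)

w = 0.0  # w_1 is arbitrary: the first update overwrites it with x_1
for k, x_k in enumerate(stream, start=1):
    w -= (1.0 / k) * (w - x_k)  # no need to store past samples

print(w, stream.mean())  # the two agree up to floating-point error
```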
【Robbins-Monro Algorithm (RM algorithm)】
Stochastic approximation (SA):
- Refers to a broad class of stochastic iterative algorithms that involve sampling random variables, used mainly for solving equations and for optimization problems
- It does not require an analytic expression of the equation or the objective function
Robbins-Monro algorithm (RM algorithm):
- Belongs to the field of stochastic approximation
- Stochastic gradient descent is a special case of this method
The problem to solve: find the root of
$$g(w)=0$$
- Many problems can be expressed in this form; for example, an optimization problem becomes $g(w)=\nabla_w J(w)=0$
- If we need to solve $g(w)=c$, we can rewrite it as $g(w)-c=0$
Two cases arise:
- Case 1: the expression of $g$ is known
- Case 2: the expression of $g$ is unknown (e.g., $g$ is realized by a neural network)
✨RM algorithm:
$$w_{k+1}=w_k-a_k \tilde{g}\left(w_k, \eta_k\right), \quad k=1,2,3, \ldots$$
- $w_k$: the $k$-th estimate of the root $w^*$
- $\tilde{g}\left(w_k, \eta_k\right)=g\left(w_k\right)+\eta_k$: the $k$-th noisy observation
- $a_k$: a positive coefficient (step size)
Here $g(w)$ is a black box: we feed in the estimates $\left\{w_k\right\}$ and receive only the noisy outputs $\left\{\tilde{g}\left(w_k, \eta_k\right)\right\}$, as the sketch below illustrates.
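To make the black-box view concrete, here is a minimal RM sketch in Python. The function $g(w)=2(w-3)$ (root $w^*=3$), the Gaussian noise, and the step size $a_k=1/k$ are illustrative assumptions, not taken from the source:

```python
import numpy as np

# Minimal sketch of the RM iteration w_{k+1} = w_k - a_k * g~(w_k, eta_k).
# Demo assumptions: g(w) = 2(w - 3) with root w* = 3, Gaussian observation
# noise eta_k, and step size a_k = 1/k.
rng = np.random.default_rng(0)

def g_tilde(w):
    """Black box: returns the noisy observation g(w) + eta."""
    return 2.0 * (w - 3.0) + rng.normal(0.0, 1.0)

w = 0.0  # initial guess w_1
for k in range(1, 1001):
    w -= (1.0 / k) * g_tilde(w)

print(w)  # close to the root w* = 3 despite the noise
```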
✨RM algorithm example:
$$g(w)=\tanh (w-1)=0$$
Parameters: $w_1=3$, $a_k=1/k$, $\eta_k \equiv 0$
RM algorithm: $w_{k+1}=w_k-a_k g\left(w_k\right)$
Simulation result:
We find that $w_{k+1}$ is closer to $w^*$ than $w_k$: since $g\left(w_k\right)=\tanh\left(w_k-1\right)>0$ whenever $w_k>w^*=1$, we have $w_{k+1}=w_k-a_k g\left(w_k\right)<w_k$, so each iterate decreases toward $w^*$.
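A minimal sketch reproducing this example under the stated parameters; the number of iterations and the printed checkpoints are arbitrary choices:

```python
import numpy as np

# Minimal sketch of the RM example: g(w) = tanh(w - 1), root w* = 1,
# with w_1 = 3, a_k = 1/k, and noiseless observations (eta_k = 0).
w = 3.0
for k in range(1, 101):
    w -= (1.0 / k) * np.tanh(w - 1.0)  # w_{k+1} = w_k - a_k * g(w_k)
    if k in (1, 2, 3, 10, 100):
        print(f"k={k:3d}  w={w:.4f}")  # w decreases monotonically toward 1
```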
【Stochastic Gradient Descent】
✨The optimization problem to solve:
$$\min _w \quad J(w)=\mathbb{E}[f(w, X)]$$
- $w$: the parameter to be optimized
- $X$: a random variable; the expectation is taken with respect to its distribution
✨Solution 1: gradient descent (GD)
$$w_{k+1}=w_k-\alpha_k \nabla_w \mathbb{E}\left[f\left(w_k, X\right)\right]=w_k-\alpha_k \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]$$
- $\alpha_k$: the step size, controlling how fast we descend
- $\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]=\nabla_w J\left(w_k\right)$ is the true gradient
Problem: how do we compute the gradient of the expectation?
- With a model (the distribution of $X$), compute the expectation analytically
- Without a model, estimate it from data, which leads to Solution 2
✨Solution 2: batch gradient descent (BGD)
$$\begin{aligned} & \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right] \approx \frac{1}{n} \sum_{i=1}^n \nabla_w f\left(w_k, x_i\right) \\ & w_{k+1}=w_k-\alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f\left(w_k, x_i\right) . \end{aligned}$$
Problem: each update at step $k$ requires collecting $n$ samples.
✨Solution 3: stochastic gradient descent (SGD)
$$w_{k+1}=w_k-\alpha_k \nabla_w f\left(w_k, x_k\right),$$
- Compared with GD: the true gradient $\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]$ is replaced by the stochastic gradient $\nabla_w f\left(w_k, x_k\right)$
- Compared with BGD: set $n=1$
✨Example:
$$\min _w \quad J(w)=\mathbb{E}[f(w, X)]=\mathbb{E}\left[\frac{1}{2}\|w-X\|^2\right],$$
where $f(w, X)=\|w-X\|^2 / 2$ and $\nabla_w f(w, X)=w-X$.
【Question 1】: Is the optimal solution $w^*=\mathbb{E}[X]$?
$$\nabla_w J(w)=0 \Rightarrow \mathbb{E}[\underbrace{\nabla_w f(w, X)}_{w-X}]=0 \Rightarrow \mathbb{E}[w-X]=0 \Rightarrow w=\mathbb{E}[X]$$
【Question 2】: Write down the GD algorithm for this problem.
$$\begin{aligned} w_{k+1} & =w_k-\alpha_k \nabla_w J\left(w_k\right) \\ & =w_k-\alpha_k \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right] \\ & =w_k-\alpha_k \mathbb{E}\left[w_k-X\right] \end{aligned}$$
【Question 3】: Write down the SGD algorithm for this problem.
$$w_{k+1}=w_k-\alpha_k \nabla_w f\left(w_k, x_k\right)=w_k-\alpha_k\left(w_k-x_k\right)$$
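A minimal sketch of this SGD iteration; the distribution $X \sim \mathcal{N}(2,1)$ and the step size $\alpha_k=1/k$ are illustrative assumptions (with this particular step size the update coincides with the incremental mean estimator from the first section):

```python
import numpy as np

# Minimal sketch of SGD for mean estimation: w_{k+1} = w_k - alpha_k * (w_k - x_k).
# Demo assumptions: X ~ N(2, 1), so w* = E[X] = 2, and alpha_k = 1/k.
rng = np.random.default_rng(1)
samples = rng.normal(2.0, 1.0, size=2000)

w = 10.0  # deliberately poor initial guess
for k, x_k in enumerate(samples, start=1):
    w -= (1.0 / k) * (w - x_k)  # stochastic gradient: grad_w f(w, x_k) = w - x_k

print(w)  # converges to E[X] = 2
```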
✨Convergence of SGD:
$$\begin{gathered} w_{k+1}=w_k-\alpha_k \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right] \\ \Downarrow \\ w_{k+1}=w_k-\alpha_k \nabla_w f\left(w_k, x_k\right) \end{gathered}$$
SGD is obtained from GD: since the model (the distribution of $X$) is unknown, the expectation is approximated by a random sample. The sampled gradient $\nabla_w f\left(w_k, x_k\right)$ is called the stochastic gradient, and the expectation it replaces is called the true gradient:
$$\nabla_w f\left(w_k, x_k\right)=\mathbb{E}\left[\nabla_w f(w, X)\right]+\underbrace{\nabla_w f\left(w_k, x_k\right)-\mathbb{E}\left[\nabla_w f(w, X)\right]}_\eta$$
Because the stochastic gradient only approximates the true gradient, there is an error term $\eta$.
Question: since $\nabla_w f\left(w_k, x_k\right) \neq \mathbb{E}\left[\nabla_w f(w, X)\right]$, does SGD still guarantee $w_k \rightarrow w^*$ as $k \rightarrow \infty$?
Answer: yes. SGD is a special case of the RM algorithm, so the convergence results for RM carry over to SGD.
✨Convergence behavior of SGD:
Question: the stochastic gradient is random, so the approximation may be inaccurate; is the convergence of SGD slow or erratic?
Answer: consider the relative error between the stochastic and true gradients:
$$\delta_k \doteq \frac{\left|\nabla_w f\left(w_k, x_k\right)-\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\right|}{\left|\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\right|}$$
- The numerator is the absolute difference between the stochastic gradient and the true gradient

Since $\mathbb{E}\left[\nabla_w f\left(w^*, X\right)\right]=0$, we can rewrite the expression above as:
$$\delta_k=\frac{\left|\nabla_w f\left(w_k, x_k\right)-\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\right|}{\left|\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]-\mathbb{E}\left[\nabla_w f\left(w^*, X\right)\right]\right|}=\frac{\left|\nabla_w f\left(w_k, x_k\right)-\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\right|}{\left|\mathbb{E}\left[\nabla_w^2 f\left(\tilde{w}_k, X\right)\left(w_k-w^*\right)\right]\right|}$$
Here the first equality substitutes $\mathbb{E}\left[\nabla_w f\left(w^*, X\right)\right]=0$ into the denominator, and the second simplifies it via the mean value theorem, where $\tilde{w}_k$ lies between $w_k$ and $w^*$. Assuming $\nabla_w^2 f \geq c>0$ for all $w$ and $X$, the denominator can be bounded:
$$\begin{aligned} \left|\mathbb{E}\left[\nabla_w^2 f\left(\tilde{w}_k, X\right)\left(w_k-w^*\right)\right]\right| & =\left|\mathbb{E}\left[\nabla_w^2 f\left(\tilde{w}_k, X\right)\right]\left(w_k-w^*\right)\right| \\ & =\left|\mathbb{E}\left[\nabla_w^2 f\left(\tilde{w}_k, X\right)\right]\right|\left|w_k-w^*\right| \geq c\left|w_k-w^*\right| \end{aligned}$$
Substituting this bound into the denominator gives:
$$\delta_k \leq \frac{\left|\nabla_w f\left(w_k, x_k\right)-\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\right|}{c\left|w_k-w^*\right|}$$
Let us examine this expression more closely:
$$\delta_k \leq \frac{\left|\overbrace{\nabla_w f\left(w_k, x_k\right)}^{\text {stochastic gradient }}-\overbrace{\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]}^{\text {true gradient }}\right|}{\underbrace{c\left|w_k-w^*\right|}_{\text {distance to the optimal solution }}} .$$
- The numerator is the absolute error of the stochastic gradient
- The denominator is the distance from $w_k$ to $w^*$
- When $w_k$ is far from $w^*$, $\delta_k$ is small and SGD behaves much like ordinary gradient descent; when $w_k$ is close to $w^*$, $\delta_k$ can be large and the iterates show more randomness in the neighborhood of $w^*$
✨SGD convergence example:
Suppose the samples are drawn from a $20 \times 20$ region; the simulation shows the convergence behavior described above (figure not reproduced here). A sketch of such a simulation follows.
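The sketch below mimics such a simulation under assumptions the source does not specify: samples uniform on a $20 \times 20$ square centered at the origin (so $w^*=\mathbb{E}[X]=(0,0)$), a constant step size, and a deliberately distant initial point. It prints the relative error $\delta_k$, which is small while $w_k$ is far from $w^*$ and grows as $w_k$ approaches it:

```python
import numpy as np

# Sketch of the SGD convergence behavior for f(w, X) = ||w - X||^2 / 2.
# Demo assumptions (not in the source): X uniform on [-10, 10]^2, so
# w* = E[X] = 0 and the true gradient is E[w - X] = w; constant step
# size 0.1; far initial point.
rng = np.random.default_rng(2)
w = np.array([100.0, 100.0])

for k in range(1, 201):
    x_k = rng.uniform(-10.0, 10.0, size=2)
    # stochastic gradient (w - x_k) minus true gradient (w) equals -x_k
    delta_k = np.linalg.norm(x_k) / np.linalg.norm(w)
    if k in (1, 10, 50, 200):
        print(f"k={k:3d}  |w_k - w*|={np.linalg.norm(w):8.3f}  delta_k={delta_k:.3f}")
    w -= 0.1 * (w - x_k)  # SGD update
```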
【BGD, MBGD, SGD】
Objective function: $J(w)=\mathbb{E}[f(w, X)]$, with random samples $\left\{x_i\right\}_{i=1}^n$ of $X$:
$$\begin{aligned} & w_{k+1}=w_k-\alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f\left(w_k, x_i\right), && (\text{BGD}) \\ & w_{k+1}=w_k-\alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} \nabla_w f\left(w_k, x_j\right), && (\text{MBGD}) \\ & w_{k+1}=w_k-\alpha_k \nabla_w f\left(w_k, x_k\right). && (\text{SGD}) \end{aligned}$$
- BGD: uses all $n$ samples and averages their gradients
- MBGD: averages the gradients over a mini-batch $\mathcal{I}_k$ of $m$ randomly drawn samples
- SGD: uses a single randomly drawn sample to form the stochastic gradient
✨Comparison of BGD, MBGD, and SGD (see the sketch below):
MBGD interpolates between BGD and SGD: when $m$ is small it behaves like SGD, and when $m$ is large it approaches BGD.
If $m=1$, MBGD coincides with SGD.
If $m=n$, MBGD is still not identical to BGD, because MBGD draws its $m$ samples randomly (possibly with repetition), so some samples may never be selected.
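A minimal sketch contrasting the three update rules on the mean-estimation problem, where $\nabla_w f(w, x)=w-x$; the dataset, the mini-batch size $m=10$, and the step size $\alpha_k=1/k$ are illustrative assumptions. The mini-batch is drawn with replacement, which matches the remark above that $m=n$ does not recover BGD:

```python
import numpy as np

# Minimal sketch of BGD / MBGD / SGD on mean estimation (grad f(w, x) = w - x).
# The dataset, m = 10, and alpha_k = 1/k are illustrative assumptions.
rng = np.random.default_rng(3)
data = rng.normal(2.0, 1.0, size=100)  # {x_i}_{i=1}^n with E[X] = 2
m = 10

w_bgd = w_mbgd = w_sgd = 10.0
for k in range(1, 501):
    alpha_k = 1.0 / k
    w_bgd -= alpha_k * np.mean(w_bgd - data)        # BGD: all n samples
    batch = rng.choice(data, size=m)                # MBGD: m draws with replacement
    w_mbgd -= alpha_k * np.mean(w_mbgd - batch)
    w_sgd -= alpha_k * (w_sgd - rng.choice(data))   # SGD: one random sample

print(w_bgd, w_mbgd, w_sgd)  # all approach the dataset mean; SGD is the noisiest
```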
【Summary】
- Mean estimation: use a sequence of samples $\left\{x_k\right\}$ to estimate $\mathbb{E}[X]$:
  $$w_{k+1}=w_k-\frac{1}{k}\left(w_k-x_k\right)$$
- RM iteration: use noisy measurements $\left\{\tilde{g}\left(w_k, \eta_k\right)\right\}$ to solve $g(w)=0$:
  $$w_{k+1}=w_k-a_k \tilde{g}\left(w_k, \eta_k\right)$$
- SGD iteration: use sampled gradients $\left\{\nabla_w f\left(w_k, x_k\right)\right\}$ to minimize $J(w)=\mathbb{E}[f(w, X)]$:
  $$w_{k+1}=w_k-\alpha_k \nabla_w f\left(w_k, x_k\right)$$