# A Proof of SGD Convergence


## Preliminaries

Let $\theta^*$ denote a minimizer of the loss:

$$\theta^* = \arg\min_{\theta\in \mathbb{R}^d} L(x,\theta)$$

SGD updates the parameters with step size $\eta^t$:

$$\theta^{t+1} = \theta^t - \eta^t g(x^t,\theta^t)$$

Unrolling the recursion gives

$$\theta^t = \theta^1-\sum_{s=1}^{t-1} \eta^s g(x^s,\theta^s)$$

The average loss over $T$ steps is

$$L(T) = \frac{1}{T}\sum_{t=1}^{T}L(x^t,\theta^t)$$

and its gap to the optimum defines the average regret:

$$\frac{1}{T}\sum_{t=1}^{T}L(x^t,\theta^t)-\frac{1}{T}\sum_{t=1}^{T}L(x^t,\theta^*)=\frac{1}{T}R(T)\ \ \ \ \ \ (1)$$

$$R(T)=\sum_{t=1}^{T}L(x^t,\theta^t)-\sum_{t=1}^{T}L(x^t,\theta^*)$$

Convergence, in this averaged sense, means showing

$$\lim_{T\rightarrow \infty}\frac{1}{T}R(T)=0$$
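These quantities are easy to simulate. The sketch below is a toy setup, not anything from the original post: the loss $L(x,\theta)=(\theta-x)^2$ with the "data" $x$ fixed at $0$, and the helper names `loss`, `grad`, and `avg_regret` are all hypothetical, chosen only to make the average regret concrete.

```python
import math

# Toy convex problem (hypothetical, for illustration):
# L(x, theta) = (theta - x)^2 with x fixed at 0, so theta* = 0
# and the gradient is g(x, theta) = 2 * (theta - x).
def loss(x, theta):
    return (theta - x) ** 2

def grad(x, theta):
    return 2.0 * (theta - x)

def avg_regret(T, theta1=5.0, x=0.0, theta_star=0.0, C=0.2):
    """Run T SGD steps with eta^t = C / sqrt(t) and return R(T) / T."""
    theta, regret = theta1, 0.0
    for t in range(1, T + 1):
        regret += loss(x, theta) - loss(x, theta_star)
        theta -= (C / math.sqrt(t)) * grad(x, theta)
    return regret / T

print(avg_regret(100), avg_regret(10000))  # average regret shrinks as T grows
```

The printed values shrink toward $0$ as the horizon grows, which is exactly the $\frac{1}{T}R(T)\rightarrow 0$ behavior the proof below establishes.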

## Proof

The argument follows the regret analysis of [1].

Since $L(x^t,\theta)$ is convex in $\theta$, the first-order condition gives, for any $\theta^i,\theta^j$:

$$L(x^t,\theta^i)-L(x^t,\theta^j)\geq (\theta^i-\theta^j)\cdot g(x^t,\theta^j)\ \ \ \ \ (2)$$
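A quick numeric sanity check of (2): on a convex quadratic the gap between the two sides is exactly the squared distance between the points, hence never negative. The toy loss and the name `check_convexity_gap` are hypothetical.

```python
# For the convex toy loss L(x, theta) = (theta - x)^2 with gradient
# 2*(theta - x), the gap in inequality (2) equals (ti - tj)^2 >= 0.
def check_convexity_gap(x, ti, tj):
    lhs = (ti - x) ** 2 - (tj - x) ** 2   # L(x, ti) - L(x, tj)
    rhs = (ti - tj) * 2.0 * (tj - x)      # (ti - tj) * g(x, tj)
    return lhs - rhs

gaps = [check_convexity_gap(x, ti, tj)
        for x in (-1.0, 0.0, 2.0)
        for ti in (-3.0, 0.5, 4.0)
        for tj in (-2.0, 1.0, 3.0)]
print(min(gaps))  # non-negative on the whole grid, so (2) holds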

Applying (2) with $\theta^i=\theta^*$ and $\theta^j=\theta^t$, then negating both sides, bounds the regret:

$$R(T)=\sum_{t=1}^{T}L(x^t,\theta^t)-\sum_{t=1}^{T}L(x^t,\theta^*)=\sum_{t=1}^{T}\left(L(x^t,\theta^t)-L(x^t,\theta^*)\right)\leq \sum_{t=1}^{T}(\theta^t-\theta^*)\cdot g(x^t,\theta^t)$$

$$R(T)\leq \sum_{t=1}^{T}\left \langle \theta^t-\theta^*,g(x^t,\theta^t) \right \rangle \ \ \ \ \ (3)$$

To bound each inner product, expand the squared distance to $\theta^*$ after one update:

$$\theta^{t+1} = \theta^t - \eta^t g(x^t,\theta^t)\\ \theta^{t+1}-\theta^* = \theta^t-\theta^*-\eta^t g(x^t,\theta^t)\\ \left \| \theta^{t+1}-\theta^* \right \|^2 = \left \| \theta^t-\theta^*-\eta^t g(x^t,\theta^t) \right \|^2 \\ \left \| \theta^{t+1}-\theta^* \right \|^2 = \left \| \theta^{t}-\theta^* \right \|^2 +(\eta^t)^2 \left \| g(x^t,\theta^t) \right \|^2 -2\eta^t \left \langle \theta^t-\theta^*,g(x^t,\theta^t) \right \rangle$$

Solving for the inner product:

$$\left \langle \theta^t-\theta^*,g(x^t,\theta^t) \right \rangle = \frac{1}{2\eta^t} \left [ \left \| \theta^{t}-\theta^* \right \|^2 -\left \| \theta^{t+1}-\theta^* \right \|^2 \right ] + \frac{\eta^t}{2} \left \| g(x^t,\theta^t) \right \|^2$$

Substituting into (3):

$$R(T)\leq \underbrace{\sum_{t=1}^{T} \frac{1}{2\eta^t} \left [ \left \| \theta^{t}-\theta^* \right \|^2 -\left \| \theta^{t+1}-\theta^* \right \|^2 \right ] }_{(a)}+ \underbrace{\sum_{t=1}^{T} \frac{\eta^t}{2} \left \| g(x^t,\theta^t) \right \|^2}_{(b)}$$
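The squared-norm expansion used above is a general identity, which a small numeric check confirms (the vectors here are arbitrary stand-ins):

```python
# Check ||a - eta*g||^2 = ||a||^2 + eta^2*||g||^2 - 2*eta*<a, g>,
# with a standing in for theta^t - theta^* and g for g(x^t, theta^t).
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

a = [3.0, -1.0, 2.0]
g = [0.5, 2.0, -1.5]
eta = 0.1

diff = [ai - eta * gi for ai, gi in zip(a, g)]
lhs = dot(diff, diff)
rhs = dot(a, a) + eta ** 2 * dot(g, g) - 2 * eta * dot(a, g)
print(abs(lhs - rhs))  # ~0: the expansion is exact
```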

$$(a) = \sum_{t=1}^{T} \frac{1}{2\eta^t} \left [ \left \| \theta^{t}-\theta^* \right \|^2 -\left \| \theta^{t+1}-\theta^* \right \|^2 \right ]\\ = \frac{1}{2\eta^1} \left [ \left \| \theta^{1}-\theta^* \right \|^2 -\left \| \theta^{2}-\theta^* \right \|^2 \right ] + ... + \frac{1}{2\eta^T} \left [ \left \| \theta^{T}-\theta^* \right \|^2 -\left \| \theta^{T+1}-\theta^* \right \|^2 \right ]\\ = \frac{1}{2\eta^1}\left \| \theta^{1}-\theta^* \right \|^2-\frac{1}{2\eta^T}\left \| \theta^{T+1}-\theta^* \right \|^2 + \sum_{t=2}^{T} \left \| \theta^{t}-\theta^* \right \|^2\left(\frac{1}{2\eta^t}-\frac{1}{2\eta^{t-1}}\right)$$

Two assumptions are needed here:

• The learning-rate sequence $\eta^t$ is non-increasing, i.e. $\eta^{t+1}\leq \eta^t,\forall t\geq 1$
• $D = \max_t\{\left \| \theta^{t}-\theta^* \right \|\}<\infty$

Since $\eta^t$ is non-increasing, each coefficient $\frac{1}{2\eta^t}-\frac{1}{2\eta^{t-1}}$ is non-negative, so every squared distance can be replaced by $D^2$; dropping the negative $-\frac{1}{2\eta^T}\|\theta^{T+1}-\theta^*\|^2$ term:

$$(a)\leq \frac{1}{2\eta^1}D^2 + \sum_{t=2}^{T} D^2\left(\frac{1}{2\eta^t}-\frac{1}{2\eta^{t-1}}\right)\\ =\frac{1}{2\eta^1}D^2 + D^2\left(\frac{1}{2\eta^T}-\frac{1}{2\eta^{1}}\right)\\ = \frac{D^2}{2\eta^T}$$
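This telescoping bound holds for any bounded distance sequence and any non-increasing step schedule, which can be checked numerically (the schedule $\eta^t = 1/\sqrt{t}$ and the random distances are just an example):

```python
import math
import random

# Check (a) <= D^2 / (2 * eta^T): eta[t] plays eta^{t+1} (non-increasing),
# d2[t] plays ||theta^{t+1} - theta*||^2, drawn arbitrarily but bounded.
random.seed(0)
T = 50
eta = [1.0 / math.sqrt(t + 1) for t in range(T)]
d2 = [random.uniform(0.0, 4.0) for _ in range(T + 1)]  # bounded by D^2 = 4

a_sum = sum((d2[t] - d2[t + 1]) / (2 * eta[t]) for t in range(T))
bound = max(d2) / (2 * eta[-1])
print(a_sum <= bound)  # True
```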

For the second term,

$$(b) = \sum_{t=1}^{T} \frac{\eta^t}{2} \left \| g(x^t,\theta^t) \right \|^2$$

assume the gradient norm is bounded:

$$G = \max_t\left \| g(x^t,\theta^t) \right \|<\infty$$

$$(b)\leq \sum_{t=1}^{T}\frac{G^2}{2}\eta^t = \frac{G^2}{2}\sum_{t=1}^{T}\eta^t$$

Combining the bounds on $(a)$ and $(b)$:

$$R(T)=\sum_{t=1}^{T}L(x^t,\theta^t)-\sum_{t=1}^{T}L(x^t,\theta^*)\leq \frac{D^2}{2\eta^T} + \frac{G^2}{2}\sum_{t=1}^{T}\eta^t\ \ \ \ \ (4)$$

With the decaying schedule $\eta^t = C/\sqrt{t}$ for a constant $C>0$, (4) gives

$$R(T)\leq \frac{D^2\sqrt{T}}{2C} + \frac{CG^2}{2}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\\ \leq \frac{D^2\sqrt{T}}{2C} + \frac{CG^2}{2}\sum_{t=1}^{T}\frac{2}{\sqrt{t-1}+\sqrt{t}}\\ = \frac{D^2\sqrt{T}}{2C} + \frac{CG^2}{2}\sum_{t=1}^{T}2(\sqrt{t}-\sqrt{t-1})\\ = \frac{D^2\sqrt{T}}{2C} + CG^2\sqrt{T}$$
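The key partial-sum estimate in this chain, $\sum_{t=1}^{T} 1/\sqrt{t} \leq 2\sqrt{T}$, is easy to verify:

```python
import math

# Check sum_{t=1}^T 1/sqrt(t) <= 2*sqrt(T) for several horizons T.
for T in (10, 100, 10000):
    s = sum(1.0 / math.sqrt(t) for t in range(1, T + 1))
    print(T, round(s, 3), s <= 2.0 * math.sqrt(T))
```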

Dividing by $T$:

$$\frac{1}{T}R(T) \leq \frac{D^2}{2C\sqrt{T}} + \frac{CG^2}{\sqrt{T}}$$

which tends to $0$ as $T\rightarrow\infty$, establishing convergence at an $O(1/\sqrt{T})$ rate.

If instead a constant learning rate $\eta^t=\eta$ is used, the AM-GM inequality shows the right-hand side of (4) is at best $DG\sqrt{T}$:

$$\frac{D^2}{2\eta^T} + \frac{G^2}{2}\sum_{t=1}^{T}\eta^t = \frac{D^2}{2\eta} + \frac{G^2}{2}T\eta\geq 2\sqrt{\frac{D^2}{2\eta}\cdot\frac{G^2}{2}T\eta}=DG\sqrt{T}$$

with equality at $\eta = \frac{D}{G\sqrt{T}}$. Choosing this optimal constant step size therefore gives

$$\frac{1}{T}R(T)\leq \frac{DG}{\sqrt{T}}$$
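A quick check that $\eta = D/(G\sqrt{T})$ indeed minimizes the right-hand side of (4) for a constant step, with minimum value $DG\sqrt{T}$ (the values of $D$, $G$, $T$ are arbitrary):

```python
import math

D, G, T = 2.0, 3.0, 400

def rhs(eta):
    # Right-hand side of (4) with a constant learning rate eta.
    return D * D / (2 * eta) + (G * G / 2) * T * eta

eta_star = D / (G * math.sqrt(T))
vals = {f: rhs(f * eta_star) for f in (0.5, 0.9, 1.0, 1.1, 2.0)}
print(vals[1.0])  # smallest value among the candidates, equals D*G*sqrt(T)
```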

## Experiment

Given how tightly the bound (4) depends on the learning rate, it is no surprise that SGD is so sensitive to this hyperparameter. I ran a simple comparison under different learning rates, each set to a constant; the loss curves in the different settings differ substantially.
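The experiment is easy to re-create in spirit. The sketch below is a hypothetical toy setup (loss $L(\theta)=\theta^2$), not the author's original one; it runs SGD with several constant learning rates and shows how differently the loss evolves, including divergence when the step is too large:

```python
def run_sgd(eta, T=100, theta0=5.0):
    """SGD on L(theta) = theta^2 (gradient 2*theta) with constant step eta."""
    theta, losses = theta0, []
    for _ in range(T):
        losses.append(theta ** 2)
        theta -= eta * 2.0 * theta
    return losses

for eta in (0.01, 0.1, 0.4, 1.1):
    print(eta, run_sgd(eta)[-1])  # eta = 1.1 makes the iterates diverge
```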

## Summary

The proof relies on the following assumptions:

• The loss $L(x^t,\theta),\forall t\geq 1$ is convex in $\theta$
• The learning-rate sequence $\eta^t$ is non-increasing, i.e. $\eta^{t+1}\leq \eta^t,\forall t\geq 1$, which is very reasonable
• $D = \max_t\{\left \| \theta^{t}-\theta^* \right \|\}<\infty$; in effect this requires the parameters to stay within a bounded subset of the linear space
• $G = \max_t\left \| g(x^t,\theta^t) \right \|<\infty$, i.e. the gradients are bounded

Under these assumptions:

$$\frac{1}{T}\sum_{t=1}^{T}L(x^t,\theta^t)-\frac{1}{T}\sum_{t=1}^{T}L(x^t,\theta^*)=\frac{1}{T}R(T)$$

$$\frac{1}{T}R(T)=\frac{1}{T}\sum_{t=1}^{T}L(x^t,\theta^t)-\frac{1}{T}\sum_{t=1}^{T}L(x^t,\theta^*)\leq \frac{D^2}{2T\eta^T} + \frac{G^2}{2T}\sum_{t=1}^{T}\eta^t$$

## References

[1] M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” in Proceedings of the 20th international conference on machine learning (ICML-03), 2003, pp. 928– 936.

