Paper reading: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Paper link: https://arxiv.org/pdf/1706.02677
Background
1) Larger networks and larger datasets need longer training times.
Solution: distributed synchronous SGD, which divides SGD minibatches over a pool of parallel workers.
2) The main issue with large minibatches is optimization difficulty, not poor generalization.
Solution: the authors provide strategies for using large minibatches in place of small minibatches while maintaining training and generalization accuracy. Since large minibatches make optimization harder, they propose several tips to address this.
Large minibatch SGD
First, recall the SGD formulas:
1) Loss function:

L(w)=\frac{1}{|X|}\sum_{x \in X} l(x,w)

where X is the full training set, w are the network weights, and l(x,w) is the loss of a single sample x.
2) Weight update:

w_{t+1}=w_{t} - \eta \frac{1}{n} \sum_{x \in B} \nabla l(x,w_{t})

where B is a minibatch sampled from X, n = |B| is the minibatch size, \eta is the learning rate, and t is the iteration index.
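The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; `grad_fn` and the quadratic toy loss are hypothetical stand-ins for a real model's per-sample gradient.

```python
import numpy as np

def sgd_step(w, grad_fn, batch, lr):
    """One minibatch SGD step: w_{t+1} = w_t - lr * (1/n) * sum of per-sample gradients."""
    grads = np.stack([grad_fn(x, w) for x in batch])  # per-sample gradients ∇l(x, w_t)
    return w - lr * grads.mean(axis=0)               # average over the minibatch B

# Toy loss l(x, w) = 0.5 * (w - x)^2, so ∇l(x, w) = w - x.
grad_fn = lambda x, w: w - x
w = np.array([0.0])
batch = np.array([1.0, 2.0, 3.0])
w = sgd_step(w, grad_fn, batch, lr=0.5)
print(w)  # [1.] — the weight moves toward the batch mean 2.0
```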
Large minibatch
Linear Scaling Rule: when the minibatch size is multiplied by k, multiply the learning rate by k.
Warmup: ramp the learning rate gradually (the paper uses the first 5 epochs) from a small value up to the scaled rate, to avoid instability early in training.
BN: keep the per-worker minibatch size fixed so that the Batch Normalization statistics, which are computed per worker, stay consistent as the total minibatch grows.
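The linear scaling rule and gradual warmup combine into a simple learning-rate schedule. The sketch below assumes iteration-based warmup; `base_lr=0.1` and `warmup_iters=500` are illustrative values, not the paper's exact hyperparameters (the paper warms up over 5 epochs).

```python
def lr_schedule(iteration, k, base_lr=0.1, warmup_iters=500):
    """Linear scaling rule with gradual warmup.

    k: factor by which the minibatch size was multiplied.
    The target rate is k * base_lr; warmup ramps linearly up from base_lr.
    """
    target_lr = k * base_lr  # linear scaling rule
    if iteration < warmup_iters:
        alpha = iteration / warmup_iters          # fraction of warmup completed
        return base_lr + alpha * (target_lr - base_lr)
    return target_lr

print(lr_schedule(0, k=8))     # 0.1  (start of warmup)
print(lr_schedule(250, k=8))   # 0.45 (halfway through warmup)
print(lr_schedule(1000, k=8))  # 0.8  (scaled rate, warmup finished)
```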
Tips
Communication
The gradient of every parameter is aggregated via an allreduce operation. Before the allreduce, each GPU computes its own gradients; after the allreduce, every GPU holds the sum of the gradients from all GPUs.
Recommended references:
https://www.zhihu.com/question/60874090
https://www.jianshu.com/p/738ff3628543