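1. Background
Neural networks are currently trained, for the most part, with mini-batch gradient descent, and a large body of experiments shows that the batch size hyperparameter affects both convergence speed (training time) and final model quality.
In general, a smaller batch size gives noisier gradient estimates and slower convergence, while a larger batch size converges faster and often performs somewhat better, although this is an empirical tendency rather than a guarantee. The effect of batch size can be explored interactively in TensorFlow Playground.
However, device memory is limited, so the batch size cannot be increased indefinitely. The question this article addresses is therefore how to widen the range of usable batch sizes (which makes hyperparameter tuning easier) when device memory is constrained.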
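2. Main idea
The approach is the one named in the article title: within a batch, first accumulate the gradients computed on several minibatches, then apply a single parameter update. That is:
(1) Split the whole dataset into batches.
(2) Split each batch into minibatches. Feed each minibatch to the network, compute the loss and its gradients, and store the gradients without updating the parameters yet.
(3) Sum the gradients of all minibatches in the batch and use the sum for one parameter update.
Provided the loss is summed over samples and the network has no layers that depend on batch statistics (such as batch normalization), the result is identical to feeding the whole batch to the network, computing the loss and gradients, and updating the parameters once.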
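The equivalence can be checked by hand on a toy problem. The sketch below is illustrative only: it assumes a scalar weight `w` and the summed loss `sum(w * w * x)`, whose gradient with respect to `w` is `2 * w * sum(x)`. The helper names `grad_sum_loss` and `accumulated_grad` are hypothetical and are not part of any library.

```python
def grad_sum_loss(w, xs):
    """Analytic gradient of sum(w * w * x for x in xs) with respect to w."""
    return 2.0 * w * sum(xs)


def accumulated_grad(w, batch, minibatch_size):
    """Sum the gradients of consecutive minibatches while w stays fixed."""
    total = 0.0
    for start in range(0, len(batch), minibatch_size):
        total += grad_sum_loss(w, batch[start:start + minibatch_size])
    return total


w = 4.0
batch = [1.0, 2.0, 3.0, 4.0]

full = grad_sum_loss(w, batch)                        # one pass over the whole batch
accum = accumulated_grad(w, batch, minibatch_size=2)  # two passes of two samples each
print(full, accum)
```

The two values agree because the weight is held fixed while the minibatch gradients are summed; the update happens only once, after the last minibatch.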
Benefit: training with any batch size becomes possible regardless of available memory; the batch size can range over [1, len(dataset)]. With more memory, use a larger minibatch size; with less memory, use a smaller one.
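Many training setups average the loss over samples instead of summing it. That case goes beyond the summed loss used in this article, so treat the following as an assumption-laden sketch: each minibatch gradient is weighted by its share of the batch, which also handles a last minibatch that is smaller than the others. The helper names are hypothetical.

```python
def grad_mean_loss(w, xs):
    """Gradient of mean(w * w * x for x in xs) with respect to w."""
    return 2.0 * w * sum(xs) / len(xs)


def accumulated_mean_grad(w, batch, minibatch_size):
    """Weight each minibatch gradient by len(minibatch) / len(batch)."""
    total = 0.0
    for start in range(0, len(batch), minibatch_size):
        chunk = batch[start:start + minibatch_size]
        total += grad_mean_loss(w, chunk) * len(chunk) / len(batch)
    return total


w = 4.0
batch = [1.0, 2.0, 3.0, 4.0, 5.0]  # length is not a multiple of the minibatch size

full = grad_mean_loss(w, batch)
accum = accumulated_mean_grad(w, batch, minibatch_size=2)
print(full, accum)
```

The two values should agree up to floating-point rounding. When all minibatches have the same size, the weighting reduces to dividing the accumulated gradient by the number of minibatches.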
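3. TensorFlow implementation
In PyTorch, gradients accumulate by default until they are explicitly zeroed, so the scheme above is easy to implement. In TensorFlow it takes more work. Without further ado, here is the program, written against the TensorFlow 1.x graph API (tf.placeholder, tf.Session).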
import os

import numpy as np
import tensorflow as tf

# Machine-specific: expose only GPU 1 to TensorFlow; adjust or remove as needed
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

x_data = np.array(range(1, 20))
num_dataset = len(x_data)
batch_size = 4
minibatch_size = 2
num_batches = num_dataset // batch_size         # the incomplete last batch is dropped
num_minibatches = batch_size // minibatch_size  # accumulation steps per batch

with tf.Graph().as_default():
    x = tf.placeholder(dtype='float32', shape=None)
    w = tf.Variable(initial_value=4., dtype='float32')
    # Vector-valued loss; the gradient is taken of the sum of its elements
    loss = w * w * x

    # Optimizer definition - nothing different from any classical example
    opt = tf.train.GradientDescentOptimizer(0.1)

    # Retrieve all trainable variables you defined in your graph
    tvs = tf.trainable_variables()
    # Creation of a list of variables with the same shape as the trainable ones
    # initialized with zeros
    accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in tvs]
    zero_ops = [av.assign(tf.zeros_like(av)) for av in accum_vars]

    # Calls the compute_gradients function of the optimizer to obtain the list of gradients
    gvs = opt.compute_gradients(loss, tvs)

    # Adds to each element from the list you initialized earlier with zeros its gradient
    # (works because accum_vars and gvs are in the same order)
    accum_ops = [accum_vars[i].assign_add(gv[0]) for i, gv in enumerate(gvs)]

    # Define the training step (part with variable value update)
    train_step = opt.apply_gradients([(accum_vars[i], gv[1]) for i, gv in enumerate(gvs)])

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        for batch_count in range(num_batches):
            # Before running each batch, reset the gradients accumulated for the previous batch
            sess.run(zero_ops)

            batch_data = x_data[batch_count*batch_size: (batch_count+1)*batch_size]

            # Accumulate the gradients 'num_minibatches' times in accum_vars using accum_ops
            for minibatch_count in range(num_minibatches):
                minibatch_data = batch_data[minibatch_count*minibatch_size: (minibatch_count+1)*minibatch_size]
                accum_array = sess.run(accum_ops, feed_dict={x: minibatch_data})
                print("[%d][%d]" % (batch_count, minibatch_count), accum_array)
                print(sess.run(tvs))

            # Run the train_step ops to update the weights based on your accumulated gradients
            sess.run(train_step)
Output:
[0][0] [24.0]
[4.0]
[0][1] [80.0]
[4.0]
[1][0] [-88.0]
[-4.0]
[1][1] [-208.0]
[-4.0]
[2][0] [638.4]
[16.800001]
[2][1] [1411.2001]
[16.800001]
[3][0] [-6713.2803]
[-124.32001]
[3][1] [-14421.121]
[-124.32001]
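These numbers can be reproduced without TensorFlow. Because `loss = w * w * x` is a vector, the gradient is taken of the sum of its elements, so each minibatch contributes `2 * w * sum(x)`. The following pure-Python sketch replays the run under that assumption, with the same data, batch sizes, and learning rate of 0.1; the `history` list is a hypothetical addition used only to record the values. The results should match the log above up to float32 rounding.

```python
x_data = list(range(1, 20))
batch_size, minibatch_size, lr = 4, 2, 0.1

w = 4.0
history = []  # (batch index, minibatch index, accumulated gradient, current w)
for b in range(len(x_data) // batch_size):
    batch = x_data[b * batch_size:(b + 1) * batch_size]
    accum = 0.0  # reset the accumulator at the start of each batch
    for m in range(batch_size // minibatch_size):
        mini = batch[m * minibatch_size:(m + 1) * minibatch_size]
        accum += 2.0 * w * sum(mini)  # d/dw of sum(w * w * x)
        history.append((b, m, accum, w))
        print("[%d][%d]" % (b, m), round(accum, 3), round(w, 3))
    w -= lr * accum  # one gradient-descent step per batch
```

The weight grows in magnitude from batch to batch, so this toy problem diverges with a learning rate of 0.1; the run demonstrates the accumulation mechanics rather than successful training.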
References:
[1]: https://stackoverflow.com/questions/46772685/how-to-accumulate-gradients-in-tensorflow
[2]: https://www.zhihu.com/question/320783553