Accumulating gradients over multiple minibatches in TensorFlow before updating the weights

dekiang

Article link: https://blog.csdn.net/weixin_41560402/article/details/106930463

1. Background

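Neural networks are nowadays usually trained with mini-batch gradient descent, and many experiments show that the batch size hyperparameter affects both convergence speed (training time) and final model quality.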

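As a rule of thumb, a smaller batch size converges more slowly, while a larger batch size converges faster and often performs somewhat better, though this depends on the model and the learning rate. You can explore the effect of batch size interactively in the TensorFlow Playground.
However, device memory is limited, so the batch size cannot be increased indefinitely. The question this post addresses is how to widen the usable range of batch sizes (and so make tuning easier) when GPU memory is the constraint.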

2. Main idea

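The idea is the one stated in the title: within one batch, accumulate the gradients computed on several minibatches, then apply a single weight update. That is:
(1) Split the whole dataset into batches.
(2) Split each batch into minibatches. Feed each minibatch to the network, compute the loss and its gradients, and store the gradients without updating the weights yet.
(3) Sum the gradients of all minibatches in the batch and apply one weight update with the sum.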

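The result is identical to feeding the whole batch to the network at once, computing the loss and the gradients, and then updating the weights, provided the loss is summed over samples (or the minibatch gradients are averaged consistently) and the model has no batch-dependent layers such as batch normalization.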
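The equivalence can be checked by hand on the toy loss used later in this post, loss = w * w * x. The following is a minimal pure-Python sketch, not part of the original article; the helper names `grad_sum`, `accumulated_grad`, and `grad_mean` are made up for illustration. It also shows the weighting that would be needed if the loss were averaged instead of summed.

```python
# Toy model from this post: per-sample loss is w * w * x, so d(loss)/dw = 2 * w * x.
def grad_sum(w, xs):
    """Gradient of sum(w * w * x for x in xs) with respect to w."""
    return sum(2.0 * w * x for x in xs)


def accumulated_grad(w, batch, minibatch_size):
    """Sum the gradients of consecutive minibatches without changing w."""
    total = 0.0
    for start in range(0, len(batch), minibatch_size):
        total += grad_sum(w, batch[start:start + minibatch_size])
    return total


w = 4.0
batch = [1.0, 2.0, 3.0, 4.0]

full = grad_sum(w, batch)               # one pass over the whole batch
accum = accumulated_grad(w, batch, 2)   # two minibatches of size 2
print(full, accum)


# With a mean-reduced loss, each minibatch gradient has to be weighted by
# its share of the batch before the gradients are summed.
def grad_mean(w, xs):
    return grad_sum(w, xs) / len(xs)


mean_full = grad_mean(w, batch)
mean_accum = sum(
    grad_mean(w, batch[s:s + 2]) * len(batch[s:s + 2]) / len(batch)
    for s in range(0, len(batch), 2)
)
print(mean_full, mean_accum)
```

Both pairs of values agree, which is the property the accumulation scheme relies on: w is held fixed while the minibatches of one batch are processed, so the gradients simply add up.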

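Benefit: you can train with any batch size regardless of how much memory you have; the batch size can be anywhere in [1, len(dataset)]. If you have more memory, use a larger minibatch size; if memory is tight, use a smaller one.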
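To make the batch/minibatch split concrete, here is a small pure-Python sketch. The helper `iter_batches` is hypothetical and not from the original post; it is one possible way to slice the data, and unlike the script in section 3 it keeps the final partial batch.

```python
def iter_batches(data, batch_size, minibatch_size):
    """Yield each batch as a list of minibatches (hypothetical helper)."""
    for b in range(0, len(data), batch_size):
        batch = data[b:b + batch_size]
        yield [
            batch[m:m + minibatch_size]
            for m in range(0, len(batch), minibatch_size)
        ]


data = list(range(1, 20))  # 19 samples, the same values as in section 3
plan = list(iter_batches(data, batch_size=4, minibatch_size=2))

for i, minibatches in enumerate(plan):
    print(i, minibatches)
```

Only one minibatch needs to fit in memory at a time, so `minibatch_size` is the knob tied to the hardware, while `batch_size` stays a free hyperparameter.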

3. TensorFlow implementation

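In PyTorch, gradients accumulate by default until they are explicitly zeroed, so the scheme above is easy to implement. In TensorFlow it takes a bit more work. Without further ado, here is the code.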

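# Written against the TensorFlow 1.x graph API (tf.placeholder, tf.Session).
# On TensorFlow 2.x, `import tensorflow.compat.v1 as tf` followed by
# `tf.disable_v2_behavior()` should provide the same names.
import tensorflow as tf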
import numpy as np
import os

# Pin the process to GPU 1; adjust or remove this line to match your machine
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

x_data = np.array(range(1, 20))
num_dataset = len(x_data)
batch_size = 4
minibatch_size = 2
with tf.Graph().as_default():
    x = tf.placeholder(dtype='float32', shape=None)
    w = tf.Variable(initial_value=4., dtype='float32')
    loss = w * w * x

    # Optimizer definition - nothing different from any classical example
    opt = tf.train.GradientDescentOptimizer(0.1)

    # Retrieve all trainable variables you defined in your graph
    tvs = tf.trainable_variables()

    # Creation of a list of variables with the same shape as the trainable ones
    # initialized with zeros
    accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in tvs]
    zero_ops = [tv.assign(tf.zeros_like(tv)) for tv in accum_vars]

    # Calls the compute_gradients function of the optimizer to obtain the list of gradients
    gvs = opt.compute_gradients(loss, tvs)

    # Adds to each element from the list you initialized earlier with zeros its gradient
    # (works because accum_vars and gvs are in the same order)
    accum_ops = [accum_vars[i].assign_add(gv[0]) for i, gv in enumerate(gvs)]

    # Define the training step (part with variable value update)
    train_step = opt.apply_gradients([(accum_vars[i], gv[1]) for i, gv in enumerate(gvs)])

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

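        # Iterate over the full batches (a trailing partial batch is skipped)
        for batch_count in range(num_dataset // batch_size):
            # Before each batch, zero the gradients accumulated for the previous batch
            sess.run(zero_ops)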

            batch_data = x_data[batch_count*batch_size: (batch_count+1)*batch_size]
            # Accumulate the gradients of every minibatch of this batch in accum_vars using accum_ops
            for minibatch_count in range(batch_size // minibatch_size):
                minibatch_data = batch_data[minibatch_count*minibatch_size: (minibatch_count+1)*minibatch_size]
                accum_array = sess.run(accum_ops, feed_dict={x: minibatch_data})
                print("[%d][%d]" % (batch_count, minibatch_count), accum_array)
                print(sess.run(tvs))
            # Run the train_step ops to update the weights based on your accumulated gradients
            sess.run(train_step)

Output (for each minibatch: the accumulated gradient so far, then the current value of w):

[0][0] [24.0]
[4.0]
[0][1] [80.0]
[4.0]
[1][0] [-88.0]
[-4.0]
[1][1] [-208.0]
[-4.0]
[2][0] [638.4]
[16.800001]
[2][1] [1411.2001]
[16.800001]
[3][0] [-6713.2803]
[-124.32001]
[3][1] [-14421.121]
[-124.32001]
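For readers on TensorFlow 2.x, the same loop can be written in eager mode with `tf.GradientTape`. This is a minimal sketch under a few assumptions, not the author's implementation: the variable names are mine, the loss is summed explicitly with `tf.reduce_sum`, and a hand-written SGD step (`w.assign_sub`) stands in for `GradientDescentOptimizer`.

```python
import tensorflow as tf

x_data = tf.range(1.0, 20.0)  # 1.0 ... 19.0, the same data as the script above
batch_size, minibatch_size, lr = 4, 2, 0.1

w = tf.Variable(4.0)
accum = tf.Variable(0.0, trainable=False)  # accumulator with the same shape as w

history = []
for b in range(x_data.shape[0] // batch_size):  # full batches only
    accum.assign(0.0)  # reset the accumulator before each batch
    batch = x_data[b * batch_size:(b + 1) * batch_size]
    for m in range(batch_size // minibatch_size):
        mini = batch[m * minibatch_size:(m + 1) * minibatch_size]
        with tf.GradientTape() as tape:
            loss = tf.reduce_sum(w * w * mini)
        accum.assign_add(tape.gradient(loss, w))  # accumulate, do not update w
    w.assign_sub(lr * accum)  # one plain SGD step per batch
    history.append(float(w.numpy()))

print(history)
```

Up to float32 rounding, the first three entries of `history` should match the weights printed in the log above (-4.0, 16.8, -124.32). The weight oscillates and grows because a learning rate of 0.1 is far too large for this toy loss; the example only demonstrates the accumulation mechanics, not a converging training run.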