Group Normalization vs Batch Normalization

masonwang_513

Article link: https://blog.csdn.net/reform513/article/details/104636743

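This article examines the problems of batch normalization (BN): its dependence on a large batch size, the train/test inconsistency, and its requirement that the samples within a batch be i.i.d. It then describes how group normalization (GN) addresses them by removing the batch-size dependence and the train/test inconsistency, which gives it a potential advantage at small batch sizes. Finally, it lists tricks for making BN work better with small batch sizes.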

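What problems does BN have?

1. BN depends on a large batch size. When the batch size is too small, the batch statistics become inaccurate, yet GPU memory limits how large the batch can grow, especially for memory-hungry models such as detection and segmentation. Scaling up the batch size is itself an engineering problem; after all, in last year's COCO challenge Face++ won mainly by training with a large batch. This is the most important motivation.

2. BN requires a well-behaved batch distribution. BN computes its statistics over [N, H, W], and in complex tasks the samples in a batch are not i.i.d. Examples are consecutive frames in a video, or a detection box/mask head, where the 512 proposals in a batch are highly correlated or even duplicated.

3. Train/test inconsistency. The running_mean and running_var accumulated during training with an exponential moving average (EMA) do converge eventually, but the test data and the training data often do not follow exactly the same distribution, which leads to a performance gap between training and testing.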
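To make the first point concrete, here is a minimal NumPy sketch (not from the original post) that estimates a channel's mean from batches of different sizes. The synthetic Gaussian data, the helper name `batch_mean_spread`, and the chosen batch sizes are assumptions made purely for illustration.

```python
import numpy as np

def batch_mean_spread(batch_size, num_batches=2000, seed=0):
    """Std. dev. of per-batch mean estimates for one channel.

    Hypothetical helper: the "activations" are synthetic N(0, 1) values,
    one per sample, so the true mean is 0 and the true variance is 1.
    """
    rng = np.random.default_rng(seed)
    batches = rng.normal(loc=0.0, scale=1.0, size=(num_batches, batch_size))
    batch_means = batches.mean(axis=1)  # the statistic BN uses while training
    return batch_means.std()

spread_small = batch_mean_spread(batch_size=2)
spread_large = batch_mean_spread(batch_size=32)
print(f"batch size  2: spread of batch mean = {spread_small:.3f}")
print(f"batch size 32: spread of batch mean = {spread_large:.3f}")
```

For i.i.d. samples the spread shrinks roughly as 1/sqrt(batch size), so a batch of 2 gives a much noisier estimate than a batch of 32. Real feature maps also average over H and W, which this sketch deliberately leaves out.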

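What does GN do?

1. GN does not depend on the batch size. "Group" refers to grouping the channels; the mean and variance are then computed over [H, W, C/G], which removes the dependence on N. Because the statistics are computed per sample, the N samples in a batch no longer need to be i.i.d.

2. At test time GN also computes the mean and variance from each input, rather than reusing statistics from training as BN does, so there is no train/test inconsistency.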
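The two points above can be checked with a small NumPy sketch. It is an illustration under stated assumptions, not the post's implementation: the layout is assumed to be [N, C, H, W], the learnable gamma/beta are omitted, and the function names `group_norm` and `batch_norm_train` are made up for this example.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """GN without the learnable gamma/beta. x has layout [N, C, H, W]."""
    n, c, h, w = x.shape
    assert c % num_groups == 0
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)  # one mean per (sample, group)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

def batch_norm_train(x, eps=1e-5):
    """BN in training mode (batch statistics), also without gamma/beta."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)  # one mean per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 5, 5))

# GN: the first sample is normalized the same way alone or inside a batch.
gn_same = np.allclose(group_norm(x, 4)[:1], group_norm(x[:1], 4))
# BN: the result for the first sample changes with the rest of the batch.
bn_same = np.allclose(batch_norm_train(x)[:1], batch_norm_train(x[:1]))
print("GN output independent of batch:", gn_same)
print("BN output independent of batch:", bn_same)
```

Because GN's statistics never touch the N axis, the same function can be used unchanged at training and test time.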

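Is GN really better than BN?

With large batches BN still has a clear edge. With small batches the paper claims that GN has the advantage, but the actual effect needs to be verified case by case.

Implementations of GN and BN (TensorFlow version)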

import tensorflow as tf  # written against the TF 1.x graph API (tf.get_variable, tf.variable_scope)

def GroupNorm(x, group, gamma_initializer=tf.constant_initializer(1.)):
    """
    Group Normalization for a 4D NCHW input.
    https://arxiv.org/abs/1803.08494
    """
    shape = x.get_shape().as_list()
    ndims = len(shape)
    assert ndims == 4, shape
    chan = shape[1]  # channel axis (NCHW layout)
    assert chan % group == 0, chan
    group_size = chan // group

    orig_shape = tf.shape(x)
    h, w = orig_shape[2], orig_shape[3]

    # Split the channels into groups: [N, C, H, W] -> [N, G, C/G, H, W]
    x = tf.reshape(x, tf.stack([-1, group, group_size, h, w]))

    # One mean/variance per (sample, group), computed over [C/G, H, W]
    mean, var = tf.nn.moments(x, [2, 3, 4], keep_dims=True)

    # gamma and beta are per channel; reshape them to broadcast against x
    new_shape = [1, group, group_size, 1, 1]

    beta = tf.get_variable('beta', [chan], initializer=tf.constant_initializer())
    beta = tf.reshape(beta, new_shape)

    gamma = tf.get_variable('gamma', [chan], initializer=gamma_initializer)
    gamma = tf.reshape(gamma, new_shape)

    out = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-5, name='output')
    return tf.reshape(out, orig_shape, name='output')

def BatchNorm(x, n_out, phase_train, scope='bn'):
    """
    Batch normalization on convolutional maps.
    Note: this function expects channels-last input, unlike GroupNorm above.
    Args:
        x:           Tensor, 4D BHWD (i.e. NHWC) input maps
        n_out:       integer, depth of input maps
        phase_train: boolean tf.Variable, true indicates training phase
        scope:       string, variable scope
    Return:
        normed:      batch-normalized maps
    """
    with tf.variable_scope(scope):
        beta = tf.Variable(tf.constant(0.0, shape=[n_out]),
                                     name='beta', trainable=True)
        gamma = tf.Variable(tf.constant(1.0, shape=[n_out]),
                                      name='gamma', trainable=True)
        # One mean/variance per channel, computed over [N, H, W]
        batch_mean, batch_var = tf.nn.moments(x, [0,1,2], name='moments')
        # decay=0.5 weights recent batches heavily; BN implementations
        # commonly use a decay around 0.9 to 0.999
        ema = tf.train.ExponentialMovingAverage(decay=0.5)

        def mean_var_with_update():
            # Training: update the moving averages, then use the batch statistics
            ema_apply_op = ema.apply([batch_mean, batch_var])
            with tf.control_dependencies([ema_apply_op]):
                return tf.identity(batch_mean), tf.identity(batch_var)

        # Testing: use the moving averages collected during training
        mean, var = tf.cond(phase_train,
                            mean_var_with_update,
                            lambda: (ema.average(batch_mean), ema.average(batch_var)))
        normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3)
    return normed
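The BatchNorm function above relies on an exponential moving average to collect the statistics used at test time. The NumPy sketch below illustrates the basic update rule, running = decay * running + (1 - decay) * value, and how the decay affects the stability of the running estimate. It is a simplified illustration and does not attempt to reproduce every detail of `tf.train.ExponentialMovingAverage`; the helper `ema_trace` and the synthetic per-batch means are assumptions.

```python
import numpy as np

def ema_trace(values, decay):
    """Running average: running = decay * running + (1 - decay) * value."""
    running = values[0]
    trace = []
    for v in values:
        running = decay * running + (1.0 - decay) * v
        trace.append(running)
    return np.array(trace)

rng = np.random.default_rng(0)
# Hypothetical noisy per-batch means of one channel (true mean is 1.0).
batch_means = 1.0 + 0.5 * rng.normal(size=2000)

for decay in (0.5, 0.99):
    tail = ema_trace(batch_means, decay)[-1000:]
    print(f"decay={decay}: tail mean={tail.mean():.3f}, tail std={tail.std():.3f}")
```

A low decay tracks the most recent batches closely, so the running estimate keeps fluctuating; a decay close to 1 averages over many batches and yields a steadier value to use at test time.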

BN tricks for small batch sizes

1. Enlarge the range over which BN computes its statistics

f = f.reshape([N, H, W * G, C//G])  # fold G channels into the width axis
f = BN(f)  # standard BN
f = f.reshape([N, H, W, C])  # restore the original shape

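BN computes a separate mean and variance for each channel. The idea of this trick is to compute one mean and variance per channel group, which is similar in motivation to GN. When the batch size is small, it forcibly enlarges the number of values behind each BN statistic (from N*H*W to N*H*W*G), which makes the statistics more stable and can mitigate the side effects of a small batch size.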
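The effect of the reshape can be verified with a short NumPy sketch. The sizes below are arbitrary illustrative choices, the layout is assumed to be NHWC as in the snippet, and only the mean is computed (no gamma/beta, no variance) to keep the example small.

```python
import numpy as np

N, H, W, C, G = 2, 3, 3, 8, 4   # illustrative sizes, NHWC layout
rng = np.random.default_rng(0)
f = rng.normal(size=(N, H, W, C))

# The reshape from the trick: fold G channels into the width axis.
f_wide = f.reshape(N, H, W * G, C // G)

# Standard BN statistics: one mean per (new) channel over axes [N, H, W*G].
wide_mean = f_wide.mean(axis=(0, 1, 2))

# With a row-major reshape, new channel k pools the original channels
# k, k + C//G, k + 2*C//G, ... (G channels in total).
pooled_mean = np.array([f[..., k::C // G].mean() for k in range(C // G)])

print("values per statistic, plain BN:", N * H * W)
print("values per statistic, trick   :", f_wide[..., 0].size)
print("means agree:", np.allclose(wide_mean, pooled_mean))
```

Note that under a plain row-major reshape the channels pooled together are those sharing the same index modulo C//G, not contiguous blocks as in GN; the learnable gamma and beta of the inner BN would likewise have C//G entries.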

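2. Synchronized BN

Regular BN computes the mean and variance separately on each GPU, whereas Synchronized BN computes them across GPUs, which reduces the side effects of a small per-GPU batch size.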
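One way to picture the cross-GPU computation is sketched below in NumPy, with the devices simulated as a list of arrays. This is an assumption-laden illustration of the arithmetic only: the helpers `local_sums` and `sync_stats` are hypothetical, and a real implementation would exchange the partial sums through an all-reduce rather than a Python loop.

```python
import numpy as np

def local_sums(x):
    """Per-device partial results for an NHWC tensor: count, sum, sum of squares."""
    count = x.shape[0] * x.shape[1] * x.shape[2]
    return count, x.sum(axis=(0, 1, 2)), (x ** 2).sum(axis=(0, 1, 2))

def sync_stats(per_device):
    """Combine partial results as a sum all-reduce would, then derive mean/var."""
    count = sum(p[0] for p in per_device)
    total = sum(p[1] for p in per_device)
    total_sq = sum(p[2] for p in per_device)
    mean = total / count
    var = total_sq / count - mean ** 2  # E[x^2] - E[x]^2
    return mean, var

rng = np.random.default_rng(0)
# Hypothetical setup: 4 devices, 2 samples each, NHWC layout.
shards = [rng.normal(size=(2, 5, 5, 3)) for _ in range(4)]

mean, var = sync_stats([local_sums(s) for s in shards])

full = np.concatenate(shards, axis=0)  # the batch of 8 a single device would see
print(np.allclose(mean, full.mean(axis=(0, 1, 2))))
print(np.allclose(var, full.var(axis=(0, 1, 2))))
```

Each device only needs to share a count, a sum, and a sum of squares per channel, yet the resulting statistics match those of the full batch.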