(2) Adding a BN layer

Should batch normalization be placed before or after the nonlinear activation layer?
Author: 论智
Link: https://www.zhihu.com/question/283715823/answer/438882036
Source: Zhihu

In the original BN paper, BN is placed before the nonlinear activation (arXiv:1502.03167v3, p. 5): "We add the BN transform immediately **before** the nonlinearity" (note: the emphasis on "before" is mine). However, François Chollet has pointed out that Christian, one of the BN paper's authors, puts BN after ReLU (the text quoted in your question mentions this as well): "I can guarantee that recent code written by Christian applies relu before BN." In addition, Jeremy Howard explicitly advocates placing BN after the nonlinearity: "You want the batchnorm after the non-linearity, and before the dropout."

So, "should" it go before or after? That "should" really has two readings: which placement works better, and why put it before or after? On the first question, current practice leans toward putting BN after ReLU, and some benchmarks also show that BN after ReLU performs better. On the second, we still do not understand BN's mechanism particularly well, so what follows is only a tentative (somewhat speculative) explanation and may not be correct.

BN, i.e. Batch Normalization, naturally brings to mind ordinary normalization, the normalization applied to the inputs before they are fed to the network. That normalization operates on the inputs, ahead of the input layer. From this angle, Batch Normalization can be viewed as normalizing the inputs passed to a hidden layer. Imagine cutting away all the layers in front of some hidden layer: that hidden layer becomes an input layer, and the inputs fed to it need normalization right at that spot, which is exactly where the BN layer sits. Seen this way, putting the BN layer after the nonlinear activation is quite natural.

Next, consider some specific activation functions. For both tanh

[figure: tanh curve]

and sigmoid

[figure: sigmoid curve]

the two ends of the curve change very little in y relative to changes in x (unsurprising, since tanh is just a stretched sigmoid). In other words, gradients decay easily there. So applying some normalization before tanh or sigmoid can alleviate that gradient decay, and I suspect this may be why the original BN paper chose to place BN before the nonlinearity. ReLU, however, looks nothing like them.

[figure: ReLU curve]

In fact, although the original BN paper also ran experiments on Inception with ReLU, it studied sigmoid activations first. For the ReLU experiments, my guess is that the authors simply carried over the BN-before-activation configuration rather than treating ReLU separately. To sum up, BN's mechanism may be that by smoothing the distribution of a hidden layer's inputs it helps stochastic gradient descent along and softens the negative effect that SGD weight updates have on subsequent layers. So it can probably play that role whether it is placed before or after the nonlinearity; the effect may just differ a little depending on the activation (for sigmoid and tanh, putting BN before the activation may also relieve their gradient decay, while for ReLU that smoothing effect may be somewhat weakened after being "distorted" by ReLU).
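
To make the gradient-decay point concrete, here is a minimal numpy check (my own illustration, not part of the original answer) of how small the sigmoid and tanh derivatives become once the pre-activation moves away from zero:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 5.0]:
    s = sigmoid(z)
    # sigmoid'(z) = s * (1 - s); tanh'(z) = 1 - tanh(z)**2
    print(z, s * (1 - s), 1 - np.tanh(z) ** 2)
# at z = 0 the derivatives are 0.25 and 1.0; by z = 5 they fall to roughly 0.0066 and 0.00018,
# so pre-activations pushed into the tails pass back almost no gradient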

**Summary:** In practice BN is used both before and after the activation; which works better has to be decided by experiment. The original authors placed it before the activation, motivated by alleviating vanishing gradients, but most experiments suggest placing it after works better, speeding up convergence and helping against overfitting.
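
As a quick sketch of what the two placements look like in code (using TensorFlow 1.x's built-in tf.layers.batch_normalization for brevity rather than the hand-rolled batch_norm defined further below; the two helper functions here are illustrative, not from the original post), the only difference is the order of the BN and ReLU calls:

import tensorflow as tf

def conv_bn_before(x, filters, training):
    # BN before the nonlinearity, as in the original paper
    net = tf.layers.conv2d(x, filters, 3, padding='same', use_bias=False)
    net = tf.layers.batch_normalization(net, training=training)
    return tf.nn.relu(net)

def conv_bn_after(x, filters, training):
    # BN after the nonlinearity, the ordering most of the experiments above favor
    net = tf.layers.conv2d(x, filters, 3, padding='same')
    net = tf.nn.relu(net)
    return tf.layers.batch_normalization(net, training=training)

# note: tf.layers.batch_normalization adds its moving-average update ops to
# tf.GraphKeys.UPDATE_OPS, so wrap the train op in
# tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)) when training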

Here BN is added before the activation function:

# coding=utf-8
import os
import cv2
import tensorflow as tf
import random
import glob
import numpy as np

class_num = 2
lr = 0.01
epochs = 1000
batchsize = 4
imgsize = 128
datadir = 'D:/lizhenqi/catanddog/train'

classification = ['cat',
                  'dog']
class DataSet:
    def __init__(self,path,classification):
        self.path = path
        self.classification = classification
    #@staticmethod
    def get_imgdataandlabel(self):
        img_data = []
        img_labels = []
        idx = 0 # integer index used as the class label
        for classname in self.classification:
            img_list = glob.glob(os.path.join(self.path,classname) + '/*')
            img_label = [idx for i in range(len(img_list))]
            img_data += img_list
            img_labels += img_label
            idx += 1
        return img_data, img_labels
path = datadir
dataset = DataSet(path,classification)
img_data, imglabels = dataset.get_imgdataandlabel()
print(imglabels)
imgs = []
for img in img_data:
    data = cv2.imread(img)
    data = cv2.resize(data,(imgsize,imgsize))
    imgs.append(data)
# one-hot encode the labels (2 classes)
labels = []
for l in imglabels:
    label = np.zeros([2])
    label[l] = 1
#    print(label)
    labels.append(label)
print(len(imgs),len(labels))
print(labels)

indexs = [i for i in range(len(imgs))]
#print(indexs,len(labels))
random.shuffle(indexs)
inputs = []
true_labels = []
for index in indexs:
    inputs.append(imgs[index])
    true_labels.append(labels[index])
#print(inputs)
# repeat the shuffled data so each epoch walks through the full (shuffled) set
inputs = inputs*epochs
true_labels = true_labels*epochs
print(inputs[0:1])
#print(true_labels)

x = tf.placeholder(tf.float32,[batchsize,128,128,3],'x')
y = tf.placeholder(tf.float32,[batchsize,class_num],'y')

# convolution kernel, shape = (height, width, in_channels, out_channels)
def weight_variable(shape):
    return tf.Variable(tf.truncated_normal(shape,stddev=0.1))

# bias, shape = (out_channels,)
def bias_variable(shape):
    return tf.Variable(tf.constant(0.1,shape=shape))

def conv2d(x,W):
    return tf.nn.conv2d(x,W,strides=[1,1,1,1],padding='SAME') # sliding-window convolution; SAME padding preserves the spatial size

def max_pool_2x2(x):
    return tf.nn.max_pool(x,ksize=[1,2,2,1],strides=[1,2,2,1],padding="SAME")

def loss(labels,logits):    
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=labels,logits=logits))
    return cross_entropy

def optimizer1(loss,lr):
    train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss) # adds ops to the graph that minimize the loss by updating the trainable variables
    return train_op

def optimizer2(loss,lr):
    train_op = tf.train.AdamOptimizer(lr).minimize(loss)
    return train_op

# hand-rolled batch normalization, for illustration (tf.layers.batch_normalization does the same job)
def batch_norm(x,gamma,theta,istrain,momentum=0.9,eps=1e-8):
    # per-channel statistics over batch, height and width
    x_mean, x_var = tf.nn.moments(x, axes=[0,1,2])
    channels = x.get_shape().as_list()[-1]

    # moving averages, used at test time
    moving_mean = tf.Variable(tf.zeros([channels]),dtype=tf.float32,trainable=False)
    moving_var = tf.Variable(tf.ones([channels]),dtype=tf.float32,trainable=False)
    update_mean = tf.assign(moving_mean, momentum * moving_mean + (1 - momentum) * x_mean)
    update_var = tf.assign(moving_var, momentum * moving_var + (1 - momentum) * x_var)

    if not istrain:
        x_normalization = (x - moving_mean) / tf.sqrt(moving_var + eps)
    else:
        # make sure the moving averages are actually updated during training
        with tf.control_dependencies([update_mean, update_var]):
            x_normalization = (x - x_mean) / tf.sqrt(x_var + eps)
    return x_normalization * gamma + theta, moving_mean, moving_var
    
#three-layer convolutional network; assumes input images of shape (batchsize, 128, 128, 3)
def simplenet(inputs,class_num,istrain=True):
    with tf.variable_scope('conv1'):
        W_conv1 = weight_variable([5,5,3,32])
        b_conv1 = bias_variable([32]) # size matches the number of output channels
        gamma_conv1 = tf.Variable(tf.constant(1.0,shape=[1]))
        theta_conv1 = tf.Variable(tf.constant(0.0,shape=[1]))
    net = conv2d(inputs,W_conv1) + b_conv1  # output size: (128 - 5 + 2*2)/1 + 1 = 128; SAME padding keeps it equal to the input size
    net, _, _ = batch_norm(net,gamma_conv1,theta_conv1,istrain)
    net = tf.nn.relu(net)
    net = max_pool_2x2(net) # shape(batchsize,64,64,32)
    
    with tf.variable_scope('conv2'):
        W_conv2 = weight_variable([3,3,32,64])
        b_conv2 = bias_variable([64])
        gamma_conv2 = tf.Variable(tf.constant(1.0,shape=[1]))
        theta_conv2 = tf.Variable(tf.constant(0.0,shape=[1]))
    net = conv2d(net,W_conv2) + b_conv2 
    net, _, _ = batch_norm(net,gamma_conv2,theta_conv2,istrain)
    net = tf.nn.relu(net)
    net = max_pool_2x2(net) # shape(batchsize,32,32,64)
    
    with tf.variable_scope('conv3'):
        W_conv3 = weight_variable([3,3,64,128])
        b_conv3 = bias_variable([128])
        gamma_conv3 = tf.Variable(tf.constant(1.0,shape=[1]))
        theta_conv3 = tf.Variable(tf.constant(0.0,shape=[1]))
    net = conv2d(net,W_conv3) + b_conv3 
    net, _, _ = batch_norm(net,gamma_conv3,theta_conv3,istrain)
    net = tf.nn.relu(net)
    net = max_pool_2x2(net) # shape(batchsize,16,16,128)
    
    net_flat = tf.reshape(net,[batchsize,16*16*128])
    with tf.variable_scope('fc'):
        w_fc1 = weight_variable([16*16*128,class_num])
        b_fc1 = bias_variable([class_num])
    logits = tf.matmul(net_flat,w_fc1) + b_fc1 # raw logits, shape (batchsize, class_num); no ReLU here, softmax_cross_entropy_with_logits expects unbounded logits
    return logits

logits = simplenet(x,class_num)
loss = loss(y,logits)
train_op = optimizer1(loss,lr)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    steps = epochs * len(img_data) // batchsize
    for step in range(steps):
        batch_inputs = inputs[step*batchsize:(step+1)*batchsize]
        batch_labels = true_labels[step*batchsize:(step+1)*batchsize]
        los, _ = sess.run([loss,train_op],feed_dict={x:batch_inputs,y:batch_labels})
        if step%100 == 0:
            print(step,' step: ',' loss is ',los)
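
For comparison, here is a sketch (untested, reusing the helpers defined above; simplenet_bn_after is an illustrative name) of the same first block with BN placed after the activation instead; the remaining blocks would follow the same relu-then-batch_norm order:

def simplenet_bn_after(inputs,class_num,istrain=True):
    with tf.variable_scope('conv1_after'):
        W_conv1 = weight_variable([5,5,3,32])
        b_conv1 = bias_variable([32])
        gamma_conv1 = tf.Variable(tf.constant(1.0,shape=[1]))
        theta_conv1 = tf.Variable(tf.constant(0.0,shape=[1]))
    net = conv2d(inputs,W_conv1) + b_conv1
    net = tf.nn.relu(net)                                         # activation first
    net, _, _ = batch_norm(net,gamma_conv1,theta_conv1,istrain)   # then BN
    net = max_pool_2x2(net)
    # conv2, conv3 and the final fully connected layer are unchanged apart from this ordering
    return net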