(2) Adding a BN layer

Should batch normalization be placed before or after the nonlinear activation layer?
Author: 论智
Link: https://www.zhihu.com/question/283715823/answer/438882036
Source: Zhihu

In the original BN paper, BN is placed before the nonlinear activation (arXiv:1502.03167v3, p. 5): "We add the BN transform immediately **before** the nonlinearity" (note: the emphasis on "before" is mine). However, François Chollet has pointed out that Christian, one of the BN paper's authors, puts BN after ReLU (the text quoted in your question mentions this as well): "I can guarantee that recent code written by Christian applies relu before BN." In addition, Jeremy Howard explicitly advocates placing BN after the nonlinearity: "You want the batchnorm after the non-linearity, and before the dropout."

So, "should" it go before or after? That "should" really has two readings: which placement works better, and why put it before or after? On the first question, current practice leans toward putting BN after ReLU, and some benchmarks also show that BN after ReLU performs better. On the second, we still do not understand BN's mechanism particularly well, so what follows is only a tentative (somewhat speculative) explanation and may not be correct.

BN, i.e. Batch Normalization, naturally brings to mind ordinary normalization, the normalization applied to the inputs before they are fed to the network. That normalization operates on the inputs, ahead of the input layer. From this angle, Batch Normalization can be viewed as normalizing the inputs passed to a hidden layer. Imagine cutting away all the layers in front of some hidden layer: that hidden layer becomes an input layer, and the inputs fed to it need normalization right at that spot, which is exactly where the BN layer sits. Seen this way, putting the BN layer after the nonlinear activation is quite natural.

Next, consider some specific activation functions. For both tanh

[figure: tanh curve]

and sigmoid

[figure: sigmoid curve]

the two ends of the curve change very little in y relative to changes in x (unsurprising, since tanh is just a stretched sigmoid). In other words, gradients decay easily there. So applying some normalization before tanh or sigmoid can alleviate that gradient decay, and I suspect this may be why the original BN paper chose to place BN before the nonlinearity. ReLU, however, looks nothing like them.

[figure: ReLU curve]

In fact, although the original BN paper also ran experiments on Inception with ReLU, it studied sigmoid activations first. For the ReLU experiments, my guess is that the authors simply carried over the BN-before-activation configuration rather than treating ReLU separately. To sum up, BN's mechanism may be that by smoothing the distribution of a hidden layer's inputs it helps stochastic gradient descent along and softens the negative effect that SGD weight updates have on subsequent layers. So it can probably play that role whether it is placed before or after the nonlinearity; the effect may just differ a little depending on the activation (for sigmoid and tanh, putting BN before the activation may also relieve their gradient decay, while for ReLU that smoothing effect may be somewhat weakened after being "distorted" by ReLU).
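
To make the gradient-decay point concrete, here is a minimal numpy check (my own illustration, not part of the original answer) of how small the sigmoid and tanh derivatives become once the pre-activation moves away from zero:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 5.0]:
    s = sigmoid(z)
    # sigmoid'(z) = s * (1 - s); tanh'(z) = 1 - tanh(z)**2
    print(z, s * (1 - s), 1 - np.tanh(z) ** 2)
# at z = 0 the derivatives are 0.25 and 1.0; by z = 5 they fall to roughly 0.0066 and 0.00018,
# so pre-activations pushed into the tails pass back almost no gradient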

**Summary:** In practice BN is used both before and after the activation; which works better has to be decided by experiment. The original authors placed it before the activation, motivated by alleviating vanishing gradients, but most experiments suggest placing it after works better, speeding up convergence and helping against overfitting.
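
As a quick sketch of what the two placements look like in code (using TensorFlow 1.x's built-in tf.layers.batch_normalization for brevity rather than the hand-rolled batch_norm defined further below; the two helper functions here are illustrative, not from the original post), the only difference is the order of the BN and ReLU calls:

import tensorflow as tf

def conv_bn_before(x, filters, training):
    # BN before the nonlinearity, as in the original paper
    net = tf.layers.conv2d(x, filters, 3, padding='same', use_bias=False)
    net = tf.layers.batch_normalization(net, training=training)
    return tf.nn.relu(net)

def conv_bn_after(x, filters, training):
    # BN after the nonlinearity, the ordering most of the experiments above favor
    net = tf.layers.conv2d(x, filters, 3, padding='same')
    net = tf.nn.relu(net)
    return tf.layers.batch_normalization(net, training=training)

# note: tf.layers.batch_normalization adds its moving-average update ops to
# tf.GraphKeys.UPDATE_OPS, so wrap the train op in
# tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)) when training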

Here BN is added before the activation function:

# coding=utf-8
import os
import cv2
import tensorflow as tf
import random
import glob
import numpy as np

class_num = 2
lr = 0.01
epochs = 1000
batchsize = 4
imgsize = 128
datadir = 'D:/lizhenqi/catanddog/train'

classification = ['cat',
                  'dog']
class DataSet:
    def __init__(self,path,classification):
        self.path = path
        self.classification = classification
    #@staticmethod
    def get_imgdataandlabel(self):
        img_data = []
        img_labels = []
        idx = 0 # integer index used as the class label
        for classname in self.classification:
            img_list = glob.glob(os.path.join(self.path,classname) + '/*')
            img_label = [idx for i in range(len(img_list))]
            img_data += img_list
            img_labels += img_label
            idx += 1
        return img_data, img_labels
path = datadir
dataset = DataSet(path,classification)
img_data, imglabels = dataset.get_imgdataandlabel()
print(imglabels)
imgs = []
for img in img_data:
    data = cv2.imread(img)
    data = cv2.resize(data,(imgsize,imgsize))
    imgs.append(data)
# one-hot encode the labels (2 classes)
labels = []
for l in imglabels:
    label = np.zeros([2])
    label[l] = 1
#    print(label)
    labels.append(label)
print(len(imgs),len(labels))
print(labels)

indexs = [i for i in range(len(imgs))]
#print(indexs,len(labels))
random.shuffle(indexs)
inputs = []
true_labels = []
for index in indexs:
    inputs.append(imgs[index])
    true_labels.append(labels[index])
#print(inputs)
# repeat the shuffled data so each epoch walks through the full (shuffled) set
inputs = inputs*epochs
true_labels = true_labels*epochs
print(inputs[0:1])
#print(true_labels)

x = tf.placeholder(tf.float32,[batchsize,128,128,3],'x')
y = tf.placeholder(tf.float32,[batchsize,class_num],'y')

# convolution kernel, shape = (height, width, in_channels, out_channels)
def weight_variable(shape):
    return tf.Variable(tf.truncated_normal(shape,stddev=0.1))

# bias, shape = (out_channels,)
def bias_variable(shape):
    return tf.Variable(tf.constant(0.1,shape=shape))

def conv2d(x,W):
    return tf.nn.conv2d(x,W,strides=[1,1,1,1],padding='SAME') # sliding-window convolution; SAME padding preserves the spatial size

def max_pool_2x2(x):
    return tf.nn.max_pool(x,ksize=[1,2,2,1],strides=[1,2,2,1],padding="SAME")

def loss(labels,logits):    
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=labels,logits=logits))
    return cross_entropy

def optimizer1(loss,lr):
    train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss) # adds ops to the graph that minimize the loss by updating the trainable variables
    return train_op

def optimizer2(loss,lr):
    train_op = tf.train.AdamOptimizer(lr).minimize(loss)
    return train_op

# hand-rolled batch normalization, for illustration (tf.layers.batch_normalization does the same job)
def batch_norm(x,gamma,theta,istrain,momentum=0.9,eps=1e-8):
    # per-channel statistics over batch, height and width
    x_mean, x_var = tf.nn.moments(x, axes=[0,1,2])
    channels = x.get_shape().as_list()[-1]

    # moving averages, used at test time
    moving_mean = tf.Variable(tf.zeros([channels]),dtype=tf.float32,trainable=False)
    moving_var = tf.Variable(tf.ones([channels]),dtype=tf.float32,trainable=False)
    update_mean = tf.assign(moving_mean, momentum * moving_mean + (1 - momentum) * x_mean)
    update_var = tf.assign(moving_var, momentum * moving_var + (1 - momentum) * x_var)

    if not istrain:
        x_normalization = (x - moving_mean) / tf.sqrt(moving_var + eps)
    else:
        # make sure the moving averages are actually updated during training
        with tf.control_dependencies([update_mean, update_var]):
            x_normalization = (x - x_mean) / tf.sqrt(x_var + eps)
    return x_normalization * gamma + theta, moving_mean, moving_var
    
#three-layer convolutional network; assumes input images of shape (batchsize, 128, 128, 3)
def simplenet(inputs,class_num,istrain=True):
    with tf.variable_scope('conv1'):
        W_conv1 = weight_variable([5,5,3,32])
        b_conv1 = bias_variable([32]) # size matches the number of output channels
        gamma_conv1 = tf.Variable(tf.constant(1.0,shape=[1]))
        theta_conv1 = tf.Variable(tf.constant(0.0,shape=[1]))
    net = conv2d(inputs,W_conv1) + b_conv1  # output size: (128 - 5 + 2*2)/1 + 1 = 128; SAME padding keeps it equal to the input size
    net, _, _ = batch_norm(net,gamma_conv1,theta_conv1,istrain)
    net = tf.nn.relu(net)
    net = max_pool_2x2(net) # shape(batchsize,64,64,32)
    
    with tf.variable_scope('conv2'):
        W_conv2 = weight_variable([3,3,32,64])
        b_conv2 = bias_variable([64])
        gamma_conv2 = tf.Variable(tf.constant(1.0,shape=[1]))
        theta_conv2 = tf.Variable(tf.constant(0.0,shape=[1]))
    net = conv2d(net,W_conv2) + b_conv2 
    net, _, _ = batch_norm(net,gamma_conv2,theta_conv2,istrain)
    net = tf.nn.relu(net)
    net = max_pool_2x2(net) # shape(batchsize,32,32,64)
    
    with tf.variable_scope('conv3'):
        W_conv3 = weight_variable([3,3,64,128])
        b_conv3 = bias_variable([128])
        gamma_conv3 = tf.Variable(tf.constant(1.0,shape=[1]))
        theta_conv3 = tf.Variable(tf.constant(0.0,shape=[1]))
    net = conv2d(net,W_conv3) + b_conv3 
    net, _, _ = batch_norm(net,gamma_conv3,theta_conv3,istrain)
    net = tf.nn.relu(net)
    net = max_pool_2x2(net) # shape(batchsize,16,16,128)
    
    net_flat = tf.reshape(net,[batchsize,16*16*128])
    with tf.variable_scope('fc'):
        w_fc1 = weight_variable([16*16*128,class_num])
        b_fc1 = bias_variable([class_num])
    logits = tf.matmul(net_flat,w_fc1) + b_fc1 # raw logits, shape (batchsize, class_num); no ReLU here, softmax_cross_entropy_with_logits expects unbounded logits
    return logits

logits = simplenet(x,class_num)
loss = loss(y,logits)
train_op = optimizer1(loss,lr)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    steps = epochs * len(img_data) // batchsize
    for step in range(steps):
        batch_inputs = inputs[step*batchsize:(step+1)*batchsize]
        batch_labels = true_labels[step*batchsize:(step+1)*batchsize]
        los, _ = sess.run([loss,train_op],feed_dict={x:batch_inputs,y:batch_labels})
        if step%100 == 0:
            print(step,' step: ',' loss is ',los)
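
For comparison, here is a sketch (untested, reusing the helpers defined above; simplenet_bn_after is an illustrative name) of the same first block with BN placed after the activation instead; the remaining blocks would follow the same relu-then-batch_norm order:

def simplenet_bn_after(inputs,class_num,istrain=True):
    with tf.variable_scope('conv1_after'):
        W_conv1 = weight_variable([5,5,3,32])
        b_conv1 = bias_variable([32])
        gamma_conv1 = tf.Variable(tf.constant(1.0,shape=[1]))
        theta_conv1 = tf.Variable(tf.constant(0.0,shape=[1]))
    net = conv2d(inputs,W_conv1) + b_conv1
    net = tf.nn.relu(net)                                         # activation first
    net, _, _ = batch_norm(net,gamma_conv1,theta_conv1,istrain)   # then BN
    net = max_pool_2x2(net)
    # conv2, conv3 and the final fully connected layer are unchanged apart from this ordering
    return net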