[TensorFlow] Preventing Overfitting with Regularization

References
https://stackoverflow.com/questions/41841050/tensorflow-adding-regularization-to-lstm?noredirect=1&lq=1
https://blog.csdn.net/huqinweI987/article/details/82957034

Principles

1. How L2 regularization works

The mechanism of overfitting: to pass through more of the training points, the fitted curve oscillates strongly, its derivatives become large, and so the weights w end up with relatively large values.
So reducing overfitting comes down to damping this oscillation, which can be achieved by shrinking the values of w.
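A quick numeric illustration of this claim (my own sketch using numpy.polyfit; the data are made up): fitting the same noisy, roughly linear points with a degree-9 polynomial yields far larger coefficients than a degree-1 fit.

import numpy as np

np.random.seed(0)
x = np.linspace(0, 1, 10)
y = x + 0.1 * np.random.randn(10)          # roughly a straight line plus noise

print(np.abs(np.polyfit(x, y, 1)).max())   # degree-1 fit: coefficients stay around 1
print(np.abs(np.polyfit(x, y, 9)).max())   # degree-9 fit: coefficients blow up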
How L2 regularization works during training: add to the loss the sum of the squared weights w, multiplied by a coefficient λ. Training then suppresses the values of w; small (absolute) values of w mean lower model complexity, a smoother curve, and less overfitting (Occam's razor). The formula is:
total_loss = cross_entropy + λ * Σᵢ wᵢ²
Regularization does not stop you from fitting the curve, and it does not blindly suppress every parameter. In practice it is a dynamic process, a tug-of-war between the data loss (cross_entropy) and the L2 loss: training pushes w toward values that fit the data, regularization pulls them back down, and the weights settle in a "middle ground" that generalizes better.
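A minimal sketch of that combined loss (TF1-style; the toy linear model, the shapes, and the 0.001 value of λ are illustrative assumptions, not from this post):

import tensorflow as tf

# Hypothetical toy model: one weight matrix, mean-squared-error data loss.
x = tf.placeholder(tf.float32, [None, 3])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.truncated_normal([3, 1], stddev=0.1))

data_loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
l2_lambda = 0.001                          # λ in the formula above (assumed value)
l2_penalty = l2_lambda * tf.nn.l2_loss(w)  # λ * sum(w**2) / 2
total_loss = data_loss + l2_penalty        # what the optimizer actually minimizes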

2. L1 and feature selection

L1 replaces the squared wᵢ in the L2 formula with the absolute value of wᵢ. Because of this mathematical difference, the wᵢ are shrunk unevenly: some stay large while others are driven all the way to zero, giving a sparse solution, which amounts to feature selection.
Why does L1 shrink weights less evenly than L2? The intuition is simple. To lower the penalty, shrinking w1 from 0.1 to 0 and shrinking w2 from 1.0 to 0.9 look exactly the same to the optimizer under L1 (both save 0.1). With squares, however, the former saves only 0.01 - 0 = 0.01 while the latter saves 1 - 0.81 = 0.19, so shrinking the larger w2 is clearly the better deal (the small numeric example after the figures below spells out this arithmetic). The figure below illustrates the geometry: the axes are w1 and w2 and the contours are loss values. In the left plot the solution sits at w1 = 0 with w2 at its largest value, a typical sparse solution that discards w1; in the right plot a balance is struck between w1 and w2. Discarding w1 means that where you could have fitted a curve you now get a straight line: overfitting goes down, but fitting capacity (expressiveness) drops with it.
(Figure omitted: loss contours over w1 and w2; left, the L1 penalty yields the sparse corner solution w1 = 0; right, the L2 penalty balances w1 and w2.)
L1 and L2 regularization also go by the names Lasso and ridge; "ridge" evokes a curve that slopes down gently and smoothly.
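The toy arithmetic below (plain Python, using the numbers from the paragraph above) makes the comparison explicit:

w1_old, w1_new = 0.1, 0.0
w2_old, w2_new = 1.0, 0.9

l1_drop_w1 = abs(w1_old) - abs(w1_new)   # 0.1
l1_drop_w2 = abs(w2_old) - abs(w2_new)   # 0.1  -> same saving, L1 is indifferent
l2_drop_w1 = w1_old**2 - w1_new**2       # ≈ 0.01
l2_drop_w2 = w2_old**2 - w2_new**2       # ≈ 0.19 -> shrinking w2 is the better deal for L2

print(l1_drop_w1, l1_drop_w2, l2_drop_w1, l2_drop_w2)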

Training

Ways to write it

1. Approach 1

# var is the tf.Variable to regularize, wd is its decay coefficient (the λ above);
# the penalty must be built from the variable itself, not from its initializer tensor
weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
tf.add_to_collection('losses', weight_decay)
tf.add_to_collection('losses', cross_entropy)
total_loss = tf.add_n(tf.get_collection('losses'))  # collect every loss term; training on total_loss implements the formula above

2. Approach 2
First, get all weights and biases with tf.trainable_variables();
then compute each variable's L2 penalty with tf.nn.l2_loss();
finally, sum them up and add the result to the original cost function as the regularization term.

tv = tf.trainable_variables()  # all trainable parameters, i.e. every tf.Variable / tf.get_variable created with trainable=True
regularization_cost = 0.001 * tf.reduce_sum([tf.nn.l2_loss(v) for v in tv])  # 0.001 is the lambda hyperparameter
cost = original_cost_function + regularization_cost
optimizer = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)
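One refinement worth noting (my own sketch, not from the original post): tf.trainable_variables() also returns the biases, which are usually left out of the L2 penalty. Assuming the bias variables were created with 'bias' somewhere in their names, they can be filtered out like this:

weights_only = [v for v in tf.trainable_variables() if 'bias' not in v.name.lower()]
regularization_cost = 0.001 * tf.add_n([tf.nn.l2_loss(v) for v in weights_only])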

3. Approach 3
This is not fundamentally different from Approach 1.
Here is what the regularizer ops compute when run on their own (the code that adds the result to the loss is omitted here for brevity; it simply replaces the corresponding snippets above):

import tensorflow as tf

CONST_SCALE = 0.5
w = tf.constant([[5.0, -2.0], [-3.0, 1.0]])
with tf.Session() as sess:
    # L1: sum of absolute values, then scaled by CONST_SCALE
    print(sess.run(tf.abs(w)))
    print('preprocessing:', sess.run(tf.reduce_sum(tf.abs(w))))
    print('manual computation:', sess.run(tf.reduce_sum(tf.abs(w)) * CONST_SCALE))
    print('l1_regularizer:', sess.run(tf.contrib.layers.l1_regularizer(CONST_SCALE)(w)))  # = sum(|w|) * CONST_SCALE

    # L2: sum of squares divided by 2, then scaled by CONST_SCALE
    print(sess.run(w**2))
    print(sess.run(tf.reduce_sum(w**2)))
    print('preprocessing:', sess.run(tf.reduce_sum(w**2) / 2))  # note the division by 2
    print('manual computation:', sess.run(tf.reduce_sum(w**2) / 2 * CONST_SCALE))
    print('l2_regularizer:', sess.run(tf.contrib.layers.l2_regularizer(CONST_SCALE)(w)))  # = 19.5 * CONST_SCALE

============== output ====================
[[5. 2.]
 [3. 1.]]
preprocessing: 11.0
manual computation: 5.5
l1_regularizer: 5.5
[[25.  4.]
 [ 9.  1.]]
39.0
preprocessing: 19.5
manual computation: 9.75
l2_regularizer: 9.75

Note: for L2, the preprocessing step is the sum of squares divided by 2. The 1/2 is just a convenience factor, because differentiating w² produces a factor of 2; with or without it the optimization objective is the same (minimizing a and minimizing 10a drive the weights the same way). If the ratio of the regularization term to the main loss is off, adjust the decay coefficient instead.
In a complex system, rather than writing the whole formula out in one expression, it is more convenient to throw the base loss and every regularization term into a collection. You may also want different decay coefficients for different weights, which would be tedious to express as a single formula. For these reasons Approach 1 is the recommended one; a sketch of the per-weight idea follows.
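A minimal sketch of that per-weight decay idea (the helper below is hypothetical, modeled on the weight_variable() used in the full code later in this post; the shapes and wd values are made up):

import tensorflow as tf

def make_weight(shape, wd, name):
    var = tf.Variable(tf.truncated_normal(shape, stddev=0.1), name=name)
    if wd:  # wd may be None or 0.0 to disable the penalty for this weight
        tf.add_to_collection('losses', tf.multiply(tf.nn.l2_loss(var), wd))
    return var

w_fc1 = make_weight([784, 1024], wd=0.0005, name='w_fc1')  # lighter decay
w_fc2 = make_weight([1024, 10], wd=0.004, name='w_fc2')    # heavier decay
# later: tf.add_to_collection('losses', cross_entropy)
#        total_loss = tf.add_n(tf.get_collection('losses'))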

Experiment

Goal: train an MNIST classifier and compare training on cross_entropy alone against training on the "total loss" that includes L2 regularization.

Code notes:

  • Only one conv layer is actually used (note that FC1's input is h_pool1, so conv2 is bypassed); the two-conv variant can serve as a control group.
  • The first 1000 training samples are taken directly as the validation set, and the first 1000 test samples as the test set.
  • A basic CONV + FC network predicts labels for the images; cross_entropy measures performance and drives the training.

Steps:

  • Apply l2_loss directly to every weight that needs regularization
  • Put both cross_entropy and the L2 loss into the collection 'losses'

wd is the λ from the formula. The larger wd is, the heavier the penalty: less overfitting, but also weaker fitting capacity, so it should be neither too large nor too small. Many people default it to 0.004, and that is usually fine, since it reflects accumulated experience. In practice, though, the value is not fixed, especially once you customize the loss function: if, say, your weighted cross entropy becomes ten times larger than before and wd stays the same, then 0.004 acts like 0.0004 did before. It is the same story as switching the loss to reduce_sum: the gradients scale with it, and many other things must be adjusted to match.
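A tiny numeric sketch of that scaling effect (all numbers invented for illustration):

base_loss, l2_term, wd = 2.0, 50.0, 0.004
ratio_before = wd * l2_term / base_loss             # penalty relative to the data loss

scaled_loss = base_loss * 10                        # data loss becomes 10x larger
ratio_same_wd = wd * l2_term / scaled_loss          # wd now acts like 0.0004 did before
ratio_rescaled = (wd * 10) * l2_term / scaled_loss  # scaling wd by 10 restores the ratio

print(ratio_before, ratio_same_wd, ratio_rescaled)  # 0.1, 0.01, 0.1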

The full code:

 
from __future__ import print_function
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
# MNIST digit data (0-9), labels one-hot encoded
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
 
def compute_accuracy(v_xs, v_ys):
    global prediction
    y_pre = sess.run(prediction, feed_dict={xs: v_xs, keep_prob: 1})
    correct_prediction = tf.equal(tf.argmax(y_pre,1), tf.argmax(v_ys,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    #result = sess.run(accuracy, feed_dict={xs: v_xs, ys: v_ys, keep_prob: 1})
    result = sess.run(accuracy, feed_dict={})
    return result
 
def weight_variable(shape, wd):
    initial = tf.truncated_normal(shape, stddev=0.1)
    var = tf.Variable(initial)

    if wd is not None:
        print('adding L2 weight decay, wd =', wd)
        # the penalty must be built from the variable itself (not from the
        # initializer tensor), otherwise it never affects the trained weights
        weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', weight_decay)

    return var
 
def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
 
def conv2d(x, W):
    # stride [1, x_movement, y_movement, 1]
    # Must have strides[0] = strides[3] = 1
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
 
def max_pool_2x2(x):
    # stride [1, x_movement, y_movement, 1]
    return tf.nn.max_pool(x, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')
 
# define placeholder for inputs to network
xs = tf.placeholder(tf.float32, [None, 784])/255.   # 28x28
ys = tf.placeholder(tf.float32, [None, 10])
keep_prob = tf.placeholder(tf.float32)
x_image = tf.reshape(xs, [-1, 28, 28, 1])
# print(x_image.shape)  # [n_samples, 28,28,1]
 
## conv1 layer ##
W_conv1 = weight_variable([5,5, 1,32], 0.) # patch 5x5, in size 1, out size 32
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1) # output size 28x28x32
h_pool1 = max_pool_2x2(h_conv1)                                         # output size 14x14x32
 
## conv2 layer ##
W_conv2 = weight_variable([5,5, 32, 64], 0.) # patch 5x5, in size 32, out size 64
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2) # output size 14x14x64
h_pool2 = max_pool_2x2(h_conv2)                                         # output size 7x7x64
 
###############################################################################################################
## fc1 layer ##
W_fc1 = weight_variable([14*14*32, 1024], wd = 0.)#do not use conv2
#W_fc1 = weight_variable([7*7*64, 1024], wd = 0.00)#use conv2
b_fc1 = bias_variable([1024])
# [n_samples, 7, 7, 64] ->> [n_samples, 7*7*64]
h_pool2_flat = tf.reshape(h_pool1, [-1, 14*14*32])#do not use conv2
#h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])#use conv2
##################################################################################################################
 
 
 
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
 
## fc2 layer ##
W_fc2 = weight_variable([1024, 10], wd = 0.)
b_fc2 = bias_variable([10])
prediction = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
 
 
# the error between prediction and real data
cross_entropy = tf.reduce_mean(-tf.reduce_sum(ys * tf.log(prediction),
                                              reduction_indices=[1]))       # loss
 
tf.add_to_collection('losses', cross_entropy)
total_loss = tf.add_n(tf.get_collection('losses'))
print(total_loss)
 
train_op = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
train_op_with_l2_norm = tf.train.AdamOptimizer(1e-4).minimize(total_loss)
 
sess = tf.Session()
# important step
# tf.initialize_all_variables() no long valid from
# 2017-03-02 if using tensorflow >= 0.12
if int((tf.__version__).split('.')[1]) < 12 and int((tf.__version__).split('.')[0]) < 1:
    init = tf.initialize_all_variables()
else:
    init = tf.global_variables_initializer()
sess.run(init)
 
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 1})
    # sess.run(train_op_with_l2_norm, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 1})
    # sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 0.5})#dropout
    if i % 100 == 0:
        print('train accuracy',compute_accuracy(
            mnist.train.images[:1000], mnist.train.labels[:1000]))
        print('test accuracy',compute_accuracy(
            mnist.test.images[:1000], mnist.test.labels[:1000]))
 
 

Training without dropout and without L2

No dropout, no L2 regularization, 1000 training steps

weight_variable([1024,10],wd=0.)

Train accuracy is clearly higher than test accuracy at every step (often by about 0.01): overfitting is happening!

train accuracy 0.094
test accuracy 0.089
train accuracy 0.892
test accuracy 0.874
train accuracy 0.91
test accuracy 0.893
train accuracy 0.925
test accuracy 0.925
train accuracy 0.945
test accuracy 0.935
train accuracy 0.954
test accuracy 0.944
train accuracy 0.961
test accuracy 0.951
train accuracy 0.965
test accuracy 0.955
train accuracy 0.964
test accuracy 0.959
train accuracy 0.962
test accuracy 0.956

No dropout, L2 added on the FC layers, wd set to 0.004

weight_variable([1024,10],wd=0.004)

Overfitting is noticeably reduced, and at times the test set even beats the training set (given the small size of the validation/test subsets, this only shows the rough trend).

train accuracy 0.107
test accuracy 0.145
train accuracy 0.876
test accuracy 0.861
train accuracy 0.91
test accuracy 0.909
train accuracy 0.923
test accuracy 0.919
train accuracy 0.931
test accuracy 0.927
train accuracy 0.936
test accuracy 0.939
train accuracy 0.956
test accuracy 0.949
train accuracy 0.958
test accuracy 0.954
train accuracy 0.947
test accuracy 0.95
train accuracy 0.947
test accuracy 0.953

Control group: no L2 regularization, dropout only. Overfitting is reduced here as well.

W_fc1 = weight_variable([14*14*32,1024],wd = 0.)
W_fc2 = weight_variable([1024,10],wd = 0.)
sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 0.5})

The results:

train accuracy 0.132
test accuracy 0.104
train accuracy 0.869
test accuracy 0.859
train accuracy 0.898
test accuracy 0.889
train accuracy 0.917
test accuracy 0.906
train accuracy 0.923
test accuracy 0.917
train accuracy 0.928
test accuracy 0.925
train accuracy 0.938
test accuracy 0.94
train accuracy 0.94
test accuracy 0.942
train accuracy 0.947
test accuracy 0.941
train accuracy 0.944
test accuracy 0.947

Similar techniques such as batch normalization and dropout also have a "noise injection" effect and help prevent overfitting to some extent. Note, though, that L1/L2 regularization is not the same thing as the L1 norm and L2 norm themselves. A norm is a way of measuring distance (think of the absolute value, or the squared distance); it is not itself regularization. L1 regularization and L2 regularization are best understood as regularization that uses the L1 norm and the L2 norm, respectively, as the penalty term.
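A tiny sketch of that distinction (TF1-style; the 0.001 coefficient is arbitrary):

import tensorflow as tf

w = tf.constant([3.0, 4.0])
l2_norm = tf.norm(w)                   # sqrt(3^2 + 4^2) = 5.0, just a distance
l2_penalty = 0.001 * tf.nn.l2_loss(w)  # 0.001 * (3^2 + 4^2) / 2, a term added to the training loss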
