Problems and Tricks in Training DNNs
First, a few concepts.
lower layers: the shallow layers close to the input, which mainly extract simple features such as edges and corners. deeper layers: the deep layers, which extract much more complex features. When using DNNs we may run into the following problems:
Vanishing or exploding gradients, which make DNNs harder to train; very slow training for large networks; overfitting when the network has too many parameters.
Vanishing/Exploding Gradients Problems
Backpropagation uses gradient information to update the connection weights between layers. Sometimes the gradient-descent updates in the lower layers (those close to the input) become so small that their weights barely change, and the training algorithm cannot converge to a good solution; this is the vanishing gradient problem. At other times the gradients keep growing, the weights between layers diverge, and the algorithm diverges as well; this is the exploding gradient problem. These problems are especially common when training RNNs; with DNNs one also often sees very different learning speeds across layers. With the logistic activation function, a large input magnitude saturates the output, whose gradient becomes tiny, so during backpropagation almost nothing reaches the layers near the input (a quick numerical check of this saturation follows). Several remedies are described below.
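Plain numpy; the chosen inputs and the printed values are only illustrative:
import numpy as np
# derivative of the logistic function: sigma'(z) = sigma(z) * (1 - sigma(z))
z = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
sig = 1.0 / (1.0 + np.exp(-z))
print(sig * (1.0 - sig))   # roughly [4.5e-05, 6.6e-03, 0.25, 6.6e-03, 4.5e-05]: almost no gradient for large |z|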
Parameter Initialization
In the paper Understanding the Difficulty of Training Deep Feedforward Neural Networks, Xavier Glorot et al. point out that to avoid these problems we want the variance of each layer's outputs to stay roughly equal to the variance of its inputs; this cannot hold exactly in practice because a layer's number of inputs and outputs ($n_i$ and $n_o$) are generally different. The paper nevertheless gives a widely used initialization scheme. For different activation functions the parameters are initialized as follows:
Logistic: normal distribution with $\sigma = \sqrt{2/(n_i + n_o)}$, or a uniform distribution on $[-r, r]$ with $r = \sqrt{6/(n_i + n_o)}$
Hyperbolic tangent: normal distribution with $\sigma = 4\sqrt{2/(n_i + n_o)}$, or a uniform distribution on $[-r, r]$ with $r = 4\sqrt{6/(n_i + n_o)}$
ReLU (He initialization): normal distribution with $\sigma = \sqrt{2}\,\sqrt{2/(n_i + n_o)}$, or a uniform distribution on $[-r, r]$ with $r = \sqrt{2}\,\sqrt{6/(n_i + n_o)}$
import warnings
warnings.filterwarnings("ignore" )
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import os
gpu_options = tf.GPUOptions(allow_growth=True )
def reset_graph (seed=42 ) :
tf.reset_default_graph()
tf.set_random_seed(seed)
np.random.seed(seed)
return
with tf.Session( config=tf.ConfigProto(gpu_options=gpu_options) ) as sess:
print( sess.run( tf.constant(1 ) ) )
1
By default, TF's fully connected layers use uniform (Xavier/Glorot) initialization; we can override the default as follows.
reset_graph()
n_inputs = 28 * 28
n_hidden1 = 300
X = tf.placeholder(tf.float32, shape=(None , n_inputs), name="X" )
he_init = tf.contrib.layers.variance_scaling_initializer( )
hidden1 = tf.contrib.layers.fully_connected( X, n_hidden1, weights_initializer=he_init, scope="h1" )
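The same initializer can also be passed to the tf.layers API used in the rest of this notebook (a one-line sketch; the layer name here is arbitrary):
hidden1_dense = tf.layers.dense(X, n_hidden1, kernel_initializer=he_init, name="h1_dense")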
Nonsaturating Activation Functions
Although biological neural networks use something close to the logistic function, there are better activation functions to choose from, such as ReLU. ReLU does not saturate for large positive inputs, but it is exactly 0 for $z < 0$, so once a neuron's input stays negative its output and gradient are both 0 and the neuron can no longer contribute. To avoid this, the leaky ReLU was introduced, with the following formula:
$\mathrm{LeakyReLU}_\alpha(z) = \max(\alpha z, z) \qquad (4)$
The hyperparameter $\alpha$ is usually set to 0.01; for inputs below 0 there is still a nonzero gradient, so the neuron can keep learning. In practice $\alpha$ can also be chosen at random, or learned along with the other parameters, in which case it is no longer treated as a hyperparameter. Another option is the ELU activation function, defined as follows:
$\mathrm{ELU}_\alpha(z) = \begin{cases} \alpha(\exp(z) - 1) & \text{if } z < 0 \\ z & \text{if } z \ge 0 \end{cases} \qquad (5)$
Ignoring computational cost, the usual preference order for activation functions is ELU > leaky ReLU > ReLU > tanh > logistic; if runtime performance matters, leaky ReLU is a good first choice.
def leaky_relu (z, alpha=0.01 ) :
return np.maximum(alpha*z, z)
def elu ( z, alpha=1 ) :
return np.where(z < 0 , alpha * (np.exp(z) - 1 ), z)
x = np.linspace(-4 ,4 ,50 )
y = leaky_relu( x, 0.2 )
y2 = elu( x )
plt.figure( figsize=(8 ,3 ) )
plt.subplot( 121 )
plt.plot( x, y )
plt.title("Leaky RELU" )
plt.grid()
plt.subplot( 122 )
plt.plot( x, y2 )
plt.title("ELU" )
plt.grid()
plt.show()
def leaky_relu (z, name=None) :
return tf.maximum(0.01 * z, z, name=name)
reset_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
X = tf.placeholder(tf.float32, shape=(None , n_inputs), name="X" )
y = tf.placeholder(tf.int64, shape=(None ), name="y" )
with tf.name_scope("dnn" ):
hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu, name="hidden1" )
hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=leaky_relu, name="hidden2" )
logits = tf.layers.dense(hidden2, n_outputs, name="outputs" )
with tf.name_scope("loss" ):
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name="loss" )
learning_rate = 0.01
with tf.name_scope("train" ):
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
training_op = optimizer.minimize(loss)
with tf.name_scope("eval" ):
correct = tf.nn.in_top_k(logits, y, 1 )
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
init = tf.global_variables_initializer()
saver = tf.train.Saver()
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("dataset/mnist" )
n_epochs = 60
batch_size = 500
with tf.Session() as sess:
init.run()
for epoch in range(n_epochs):
for iteration in range(mnist.train.num_examples // batch_size):
X_batch, y_batch = mnist.train.next_batch(batch_size)
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
if epoch % 10 == 0 :
acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
acc_test = accuracy.eval(feed_dict={X: mnist.validation.images, y: mnist.validation.labels})
print(epoch, "Batch accuracy:" , acc_train, "Validation accuracy:" , acc_test)
save_path = saver.save(sess, "./models/mnist/mnist_model_final.ckpt" )
Extracting dataset/mnist/train-images-idx3-ubyte.gz
Extracting dataset/mnist/train-labels-idx1-ubyte.gz
Extracting dataset/mnist/t10k-images-idx3-ubyte.gz
Extracting dataset/mnist/t10k-labels-idx1-ubyte.gz
0 Batch accuracy: 0.678 Validation accuracy: 0.6632
10 Batch accuracy: 0.906 Validation accuracy: 0.9084
20 Batch accuracy: 0.918 Validation accuracy: 0.9268
30 Batch accuracy: 0.932 Validation accuracy: 0.9346
40 Batch accuracy: 0.936 Validation accuracy: 0.9402
50 Batch accuracy: 0.948 Validation accuracy: 0.9454
Batch Normalization
Batch Normalization standardizes each layer's inputs over the current mini-batch and then rescales and shifts them with learned parameters; it strongly mitigates the vanishing/exploding gradient problem and allows larger learning rates. With tf.layers.batch_normalization, a training flag must be fed and the moving-average update ops (collected below as extra_update_ops) must be run along with the training op.
from functools import partial
reset_graph()
batch_norm_momentum = 0.9
X = tf.placeholder(tf.float32, shape=(None , n_inputs), name="X" )
y = tf.placeholder(tf.int64, shape=(None ), name="y" )
training = tf.placeholder_with_default(False , shape=(), name='training' )
with tf.name_scope("dnn" ):
he_init = tf.contrib.layers.variance_scaling_initializer()
my_batch_norm_layer = partial(
tf.layers.batch_normalization,
training=training,
momentum=batch_norm_momentum)
my_dense_layer = partial(
tf.layers.dense,
kernel_initializer=he_init)
hidden1 = my_dense_layer(X, n_hidden1, name="hidden1" )
bn1 = tf.nn.elu(my_batch_norm_layer(hidden1))
hidden2 = my_dense_layer(bn1, n_hidden2, name="hidden2" )
bn2 = tf.nn.elu(my_batch_norm_layer(hidden2))
logits_before_bn = my_dense_layer(bn2, n_outputs, name="outputs" )
logits = my_batch_norm_layer(logits_before_bn)
with tf.name_scope("loss" ):
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name="loss" )
with tf.name_scope("train" ):
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
training_op = optimizer.minimize(loss)
with tf.name_scope("eval" ):
correct = tf.nn.in_top_k(logits, y, 1 )
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
init = tf.global_variables_initializer()
saver = tf.train.Saver()
n_epochs = 20
batch_size = 1000
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.Session() as sess:
init.run()
for epoch in range(n_epochs):
for iteration in range(mnist.train.num_examples // batch_size):
X_batch, y_batch = mnist.train.next_batch(batch_size)
sess.run([training_op, extra_update_ops],
feed_dict={training: True , X: X_batch, y: y_batch})
accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
y: mnist.test.labels})
print(epoch, "Test accuracy:" , accuracy_val)
0 Test accuracy: 0.7409
1 Test accuracy: 0.8134
2 Test accuracy: 0.8386
3 Test accuracy: 0.8558
4 Test accuracy: 0.8673
5 Test accuracy: 0.8759
6 Test accuracy: 0.8818
7 Test accuracy: 0.8891
8 Test accuracy: 0.8929
9 Test accuracy: 0.8974
10 Test accuracy: 0.9001
11 Test accuracy: 0.9042
12 Test accuracy: 0.9081
13 Test accuracy: 0.9113
14 Test accuracy: 0.9136
15 Test accuracy: 0.9171
16 Test accuracy: 0.9176
17 Test accuracy: 0.9201
18 Test accuracy: 0.9211
19 Test accuracy: 0.9224
Note that the trainable variables do not include moving_variance and moving_mean; if you need to save or restore the model, make sure these variables are covered as well, since they are global variables but not trainable ones (a short sketch follows the printout).
print( [v.name for v in tf.trainable_variables()] )
print( [v.name for v in tf.global_variables()] )
['hidden1/kernel:0', 'hidden1/bias:0', 'batch_normalization/gamma:0', 'batch_normalization/beta:0', 'hidden2/kernel:0', 'hidden2/bias:0', 'batch_normalization_1/gamma:0', 'batch_normalization_1/beta:0', 'outputs/kernel:0', 'outputs/bias:0', 'batch_normalization_2/gamma:0', 'batch_normalization_2/beta:0']
['hidden1/kernel:0', 'hidden1/bias:0', 'batch_normalization/gamma:0', 'batch_normalization/beta:0', 'batch_normalization/moving_mean:0', 'batch_normalization/moving_variance:0', 'hidden2/kernel:0', 'hidden2/bias:0', 'batch_normalization_1/gamma:0', 'batch_normalization_1/beta:0', 'batch_normalization_1/moving_mean:0', 'batch_normalization_1/moving_variance:0', 'outputs/kernel:0', 'outputs/bias:0', 'batch_normalization_2/gamma:0', 'batch_normalization_2/beta:0', 'batch_normalization_2/moving_mean:0', 'batch_normalization_2/moving_variance:0']
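A small sketch of the point above, using only calls already present in this notebook: building the Saver from the global variables keeps the moving statistics, whereas a Saver built only from tf.trainable_variables() would silently skip them.
# the default tf.train.Saver() already covers all global variables, including
# moving_mean / moving_variance; making the var_list explicit shows why
bn_saver = tf.train.Saver(var_list=tf.global_variables())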
Gradient Clipping
Gradient clipping is used to prevent exploding gradients. A threshold hyperparameter limits the magnitude of the gradients. So far we have called optimizer.minimize(loss) directly, which fuses computing and applying the gradients and performs no clipping; to clip, we split this into compute_gradients and apply_gradients and insert a clipping step in between.
reset_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_hidden3 = 50
n_hidden4 = 50
n_hidden5 = 50
n_outputs = 10
X = tf.placeholder(tf.float32, shape=(None , n_inputs), name="X" )
y = tf.placeholder(tf.int64, shape=(None ), name="y" )
with tf.name_scope("dnn" ):
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1" )
hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2" )
hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name="hidden3" )
hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="hidden4" )
hidden5 = tf.layers.dense(hidden4, n_hidden5, activation=tf.nn.relu, name="hidden5" )
logits = tf.layers.dense(hidden5, n_outputs, name="outputs" )
with tf.name_scope("loss" ):
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name="loss" )
learning_rate = 0.01
threshold = 1.0
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)
with tf.name_scope("eval" ):
correct = tf.nn.in_top_k(logits, y, 1 )
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy" )
init = tf.global_variables_initializer()
saver = tf.train.Saver()
n_epochs = 60
batch_size = 1000
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.Session() as sess:
init.run()
for epoch in range(n_epochs):
for iteration in range(mnist.train.num_examples // batch_size):
X_batch, y_batch = mnist.train.next_batch(batch_size)
sess.run([training_op, extra_update_ops],
feed_dict={X: X_batch, y: y_batch})
accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
y: mnist.test.labels})
if ( epoch % 10 ) == 0 :
print(epoch, "Test accuracy:" , accuracy_val)
save_path = saver.save(sess, "./models/mnist/gradient_clip.ckpt" )
0 Test accuracy: 0.1775
10 Test accuracy: 0.8273
20 Test accuracy: 0.9067
30 Test accuracy: 0.9218
40 Test accuracy: 0.9325
50 Test accuracy: 0.9403
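As a side note (not used in the run above), TensorFlow also provides tf.clip_by_global_norm, which rescales all gradients jointly by their global norm instead of clipping each value independently; a hedged sketch of swapping it into the code above:
# alternative: clip the global norm of all gradients instead of each value separately
grads, variables = zip(*optimizer.compute_gradients(loss))
clipped_grads, global_norm = tf.clip_by_global_norm(grads, threshold)
training_op = optimizer.apply_gradients(list(zip(clipped_grads, variables)))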
Reusing Pretrained Layers
reset_graph()
saver = tf.train.import_meta_graph("./models/mnist/mnist_model_final.ckpt.meta" )
for op in tf.get_default_graph().get_operations():
print( op.name )
X
y
hidden1/kernel/Initializer/random_uniform/shape
hidden1/kernel/Initializer/random_uniform/min
hidden1/kernel/Initializer/random_uniform/max
hidden1/kernel/Initializer/random_uniform/RandomUniform
hidden1/kernel/Initializer/random_uniform/sub
hidden1/kernel/Initializer/random_uniform/mul
hidden1/kernel/Initializer/random_uniform
hidden1/kernel
hidden1/kernel/Assign
hidden1/kernel/read
hidden1/bias/Initializer/zeros
hidden1/bias
hidden1/bias/Assign
hidden1/bias/read
dnn/hidden1/MatMul
dnn/hidden1/BiasAdd
dnn/hidden1/mul/x
dnn/hidden1/mul
dnn/hidden1/Maximum
hidden2/kernel/Initializer/random_uniform/shape
hidden2/kernel/Initializer/random_uniform/min
hidden2/kernel/Initializer/random_uniform/max
hidden2/kernel/Initializer/random_uniform/RandomUniform
hidden2/kernel/Initializer/random_uniform/sub
hidden2/kernel/Initializer/random_uniform/mul
hidden2/kernel/Initializer/random_uniform
hidden2/kernel
hidden2/kernel/Assign
hidden2/kernel/read
hidden2/bias/Initializer/zeros
hidden2/bias
hidden2/bias/Assign
hidden2/bias/read
dnn/hidden2/MatMul
dnn/hidden2/BiasAdd
dnn/hidden2/mul/x
dnn/hidden2/mul
dnn/hidden2/Maximum
outputs/kernel/Initializer/random_uniform/shape
outputs/kernel/Initializer/random_uniform/min
outputs/kernel/Initializer/random_uniform/max
outputs/kernel/Initializer/random_uniform/RandomUniform
outputs/kernel/Initializer/random_uniform/sub
outputs/kernel/Initializer/random_uniform/mul
outputs/kernel/Initializer/random_uniform
outputs/kernel
outputs/kernel/Assign
outputs/kernel/read
outputs/bias/Initializer/zeros
outputs/bias
outputs/bias/Assign
outputs/bias/read
dnn/outputs/MatMul
dnn/outputs/BiasAdd
loss/SparseSoftmaxCrossEntropyWithLogits/Shape
loss/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits
loss/Const
loss/loss
train/gradients/Shape
train/gradients/Const
train/gradients/Fill
train/gradients/loss/loss_grad/Reshape/shape
train/gradients/loss/loss_grad/Reshape
train/gradients/loss/loss_grad/Shape
train/gradients/loss/loss_grad/Tile
train/gradients/loss/loss_grad/Shape_1
train/gradients/loss/loss_grad/Shape_2
train/gradients/loss/loss_grad/Const
train/gradients/loss/loss_grad/Prod
train/gradients/loss/loss_grad/Const_1
train/gradients/loss/loss_grad/Prod_1
train/gradients/loss/loss_grad/Maximum/y
train/gradients/loss/loss_grad/Maximum
train/gradients/loss/loss_grad/floordiv
train/gradients/loss/loss_grad/Cast
train/gradients/loss/loss_grad/truediv
train/gradients/zeros_like
train/gradients/loss/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits_grad/PreventGradient
train/gradients/loss/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits_grad/ExpandDims/dim
train/gradients/loss/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits_grad/ExpandDims
train/gradients/loss/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits_grad/mul
train/gradients/dnn/outputs/BiasAdd_grad/BiasAddGrad
train/gradients/dnn/outputs/BiasAdd_grad/tuple/group_deps
train/gradients/dnn/outputs/BiasAdd_grad/tuple/control_dependency
train/gradients/dnn/outputs/BiasAdd_grad/tuple/control_dependency_1
train/gradients/dnn/outputs/MatMul_grad/MatMul
train/gradients/dnn/outputs/MatMul_grad/MatMul_1
train/gradients/dnn/outputs/MatMul_grad/tuple/group_deps
train/gradients/dnn/outputs/MatMul_grad/tuple/control_dependency
train/gradients/dnn/outputs/MatMul_grad/tuple/control_dependency_1
train/gradients/dnn/hidden2/Maximum_grad/Shape
train/gradients/dnn/hidden2/Maximum_grad/Shape_1
train/gradients/dnn/hidden2/Maximum_grad/Shape_2
train/gradients/dnn/hidden2/Maximum_grad/zeros/Const
train/gradients/dnn/hidden2/Maximum_grad/zeros
train/gradients/dnn/hidden2/Maximum_grad/GreaterEqual
train/gradients/dnn/hidden2/Maximum_grad/BroadcastGradientArgs
train/gradients/dnn/hidden2/Maximum_grad/Select
train/gradients/dnn/hidden2/Maximum_grad/Select_1
train/gradients/dnn/hidden2/Maximum_grad/Sum
train/gradients/dnn/hidden2/Maximum_grad/Reshape
train/gradients/dnn/hidden2/Maximum_grad/Sum_1
train/gradients/dnn/hidden2/Maximum_grad/Reshape_1
train/gradients/dnn/hidden2/Maximum_grad/tuple/group_deps
train/gradients/dnn/hidden2/Maximum_grad/tuple/control_dependency
train/gradients/dnn/hidden2/Maximum_grad/tuple/control_dependency_1
train/gradients/dnn/hidden2/mul_grad/Shape
train/gradients/dnn/hidden2/mul_grad/Shape_1
train/gradients/dnn/hidden2/mul_grad/BroadcastGradientArgs
train/gradients/dnn/hidden2/mul_grad/mul
train/gradients/dnn/hidden2/mul_grad/Sum
train/gradients/dnn/hidden2/mul_grad/Reshape
train/gradients/dnn/hidden2/mul_grad/mul_1
train/gradients/dnn/hidden2/mul_grad/Sum_1
train/gradients/dnn/hidden2/mul_grad/Reshape_1
train/gradients/dnn/hidden2/mul_grad/tuple/group_deps
train/gradients/dnn/hidden2/mul_grad/tuple/control_dependency
train/gradients/dnn/hidden2/mul_grad/tuple/control_dependency_1
train/gradients/AddN
train/gradients/dnn/hidden2/BiasAdd_grad/BiasAddGrad
train/gradients/dnn/hidden2/BiasAdd_grad/tuple/group_deps
train/gradients/dnn/hidden2/BiasAdd_grad/tuple/control_dependency
train/gradients/dnn/hidden2/BiasAdd_grad/tuple/control_dependency_1
train/gradients/dnn/hidden2/MatMul_grad/MatMul
train/gradients/dnn/hidden2/MatMul_grad/MatMul_1
train/gradients/dnn/hidden2/MatMul_grad/tuple/group_deps
train/gradients/dnn/hidden2/MatMul_grad/tuple/control_dependency
train/gradients/dnn/hidden2/MatMul_grad/tuple/control_dependency_1
train/gradients/dnn/hidden1/Maximum_grad/Shape
train/gradients/dnn/hidden1/Maximum_grad/Shape_1
train/gradients/dnn/hidden1/Maximum_grad/Shape_2
train/gradients/dnn/hidden1/Maximum_grad/zeros/Const
train/gradients/dnn/hidden1/Maximum_grad/zeros
train/gradients/dnn/hidden1/Maximum_grad/GreaterEqual
train/gradients/dnn/hidden1/Maximum_grad/BroadcastGradientArgs
train/gradients/dnn/hidden1/Maximum_grad/Select
train/gradients/dnn/hidden1/Maximum_grad/Select_1
train/gradients/dnn/hidden1/Maximum_grad/Sum
train/gradients/dnn/hidden1/Maximum_grad/Reshape
train/gradients/dnn/hidden1/Maximum_grad/Sum_1
train/gradients/dnn/hidden1/Maximum_grad/Reshape_1
train/gradients/dnn/hidden1/Maximum_grad/tuple/group_deps
train/gradients/dnn/hidden1/Maximum_grad/tuple/control_dependency
train/gradients/dnn/hidden1/Maximum_grad/tuple/control_dependency_1
train/gradients/dnn/hidden1/mul_grad/Shape
train/gradients/dnn/hidden1/mul_grad/Shape_1
train/gradients/dnn/hidden1/mul_grad/BroadcastGradientArgs
train/gradients/dnn/hidden1/mul_grad/mul
train/gradients/dnn/hidden1/mul_grad/Sum
train/gradients/dnn/hidden1/mul_grad/Reshape
train/gradients/dnn/hidden1/mul_grad/mul_1
train/gradients/dnn/hidden1/mul_grad/Sum_1
train/gradients/dnn/hidden1/mul_grad/Reshape_1
train/gradients/dnn/hidden1/mul_grad/tuple/group_deps
train/gradients/dnn/hidden1/mul_grad/tuple/control_dependency
train/gradients/dnn/hidden1/mul_grad/tuple/control_dependency_1
train/gradients/AddN_1
train/gradients/dnn/hidden1/BiasAdd_grad/BiasAddGrad
train/gradients/dnn/hidden1/BiasAdd_grad/tuple/group_deps
train/gradients/dnn/hidden1/BiasAdd_grad/tuple/control_dependency
train/gradients/dnn/hidden1/BiasAdd_grad/tuple/control_dependency_1
train/gradients/dnn/hidden1/MatMul_grad/MatMul
train/gradients/dnn/hidden1/MatMul_grad/MatMul_1
train/gradients/dnn/hidden1/MatMul_grad/tuple/group_deps
train/gradients/dnn/hidden1/MatMul_grad/tuple/control_dependency
train/gradients/dnn/hidden1/MatMul_grad/tuple/control_dependency_1
train/GradientDescent/learning_rate
train/GradientDescent/update_hidden1/kernel/ApplyGradientDescent
train/GradientDescent/update_hidden1/bias/ApplyGradientDescent
train/GradientDescent/update_hidden2/kernel/ApplyGradientDescent
train/GradientDescent/update_hidden2/bias/ApplyGradientDescent
train/GradientDescent/update_outputs/kernel/ApplyGradientDescent
train/GradientDescent/update_outputs/bias/ApplyGradientDescent
train/GradientDescent
eval/in_top_k/InTopKV2/k
eval/in_top_k/InTopKV2
eval/Cast
eval/Const
eval/Mean
init
save/Const
save/SaveV2/tensor_names
save/SaveV2/shape_and_slices
save/SaveV2
save/control_dependency
save/RestoreV2/tensor_names
save/RestoreV2/shape_and_slices
save/RestoreV2
save/Assign
save/RestoreV2_1/tensor_names
save/RestoreV2_1/shape_and_slices
save/RestoreV2_1
save/Assign_1
save/RestoreV2_2/tensor_names
save/RestoreV2_2/shape_and_slices
save/RestoreV2_2
save/Assign_2
save/RestoreV2_3/tensor_names
save/RestoreV2_3/shape_and_slices
save/RestoreV2_3
save/Assign_3
save/RestoreV2_4/tensor_names
save/RestoreV2_4/shape_and_slices
save/RestoreV2_4
save/Assign_4
save/RestoreV2_5/tensor_names
save/RestoreV2_5/shape_and_slices
save/RestoreV2_5
save/Assign_5
save/restore_all
X = tf.get_default_graph().get_tensor_by_name("X:0" )
y = tf.get_default_graph().get_tensor_by_name("y:0" )
accuracy = tf.get_default_graph().get_tensor_by_name("eval/Mean:0" )
training_op = tf.get_default_graph().get_operation_by_name("train/GradientDescent" )
with tf.Session() as sess:
saver.restore(sess, "./models/mnist/mnist_model_final.ckpt" )
for epoch in range(n_epochs):
for iteration in range(mnist.train.num_examples // batch_size):
X_batch, y_batch = mnist.train.next_batch(batch_size)
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
y: mnist.test.labels})
if epoch % 10 == 0 :
print(epoch, "Test accuracy:" , accuracy_val)
save_path = saver.save(sess, "./mnist_new_model_final.ckpt" )
INFO:tensorflow:Restoring parameters from ./models/mnist/mnist_model_final.ckpt
0 Test accuracy: 0.9472
10 Test accuracy: 0.9489
20 Test accuracy: 0.9505
30 Test accuracy: 0.9523
40 Test accuracy: 0.9543
50 Test accuracy: 0.9555
You can also reuse another network's parameters in your own network, or freeze the reused layers entirely (freezing). If the code that defines the original layers is available, there is no need to load the .meta file; simply copy the layer definitions over.
reset_graph()
original_w = np.zeros( (n_inputs, n_hidden1) )
original_b = np.ones( (n_hidden1) )
X = tf.placeholder( tf.float32, shape=(None , n_inputs), name="X" )
hidden1 = tf.contrib.layers.fully_connected( X, n_hidden1, scope="hidden1" )
with tf.variable_scope("" , default_name="" , reuse=True ):
hidden1_weights = tf.get_variable( "hidden1/weights" )
hidden1_biases = tf.get_variable( "hidden1/biases" )
original_weights = tf.placeholder(tf.float32, shape=(n_inputs, n_hidden1))
original_biases = tf.placeholder(tf.float32, shape=(n_hidden1))
assign_hidden1_weights = tf.assign(hidden1_weights, original_weights)
assign_hidden1_biases = tf.assign(hidden1_biases, original_biases)
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
sess.run(assign_hidden1_weights, feed_dict={original_weights: original_w})
sess.run(assign_hidden1_biases, feed_dict={original_biases: original_b})
print( np.sum(hidden1_weights.eval()) )
print( np.sum(hidden1_biases.eval()) )
0.0
300.0
reset_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_hidden3 = 50
n_hidden4 = 20
n_outputs = 10
X = tf.placeholder(tf.float32, shape=(None , n_inputs), name="X" )
y = tf.placeholder(tf.int64, shape=(None ), name="y" )
with tf.name_scope("dnn" ):
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1" )
hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2" )
hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name="hidden3" )
hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="hidden4" )
logits = tf.layers.dense(hidden4, n_outputs, name="outputs" )
with tf.name_scope("loss" ):
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name="loss" )
with tf.name_scope("eval" ):
correct = tf.nn.in_top_k(logits, y, 1 )
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy" )
with tf.name_scope("train" ):
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
training_op = optimizer.minimize(loss)
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
scope="hidden[123]" )
reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars])
restore_saver = tf.train.Saver(reuse_vars_dict)
init = tf.global_variables_initializer()
saver = tf.train.Saver()
with tf.Session() as sess:
init.run()
restore_saver.restore(sess, "./models/mnist/gradient_clip.ckpt" )
for epoch in range(n_epochs):
for iteration in range(mnist.train.num_examples // batch_size):
X_batch, y_batch = mnist.train.next_batch(batch_size)
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
y: mnist.test.labels})
if epoch % 10 == 0 :
print(epoch, "Test accuracy:" , accuracy_val)
save_path = saver.save(sess, "./models/mnist/gradient_clip.ckpt" )
INFO:tensorflow:Restoring parameters from ./models/mnist/gradient_clip.ckpt
0 Test accuracy: 0.9535
10 Test accuracy: 0.9629
20 Test accuracy: 0.9643
30 Test accuracy: 0.9649
40 Test accuracy: 0.9662
50 Test accuracy: 0.9671
If the lower layers are frozen, there is no need to push the inputs through them again in every epoch. Given enough RAM, you can run every training example through the frozen layers once, cache the outputs, and then train the remaining layers using those cached outputs as their input.
reset_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_hidden3 = 50
n_hidden4 = 20
n_outputs = 10
X = tf.placeholder(tf.float32, shape=(None , n_inputs), name="X" )
y = tf.placeholder(tf.int64, shape=(None ), name="y" )
with tf.name_scope("dnn" ):
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
name="hidden1" )
hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
name="hidden2" )
hidden2_stop = tf.stop_gradient(hidden2)
hidden3 = tf.layers.dense(hidden2_stop, n_hidden3, activation=tf.nn.relu,
name="hidden3" )
hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu,
name="hidden4" )
logits = tf.layers.dense(hidden4, n_outputs, name="outputs" )
with tf.name_scope("loss" ):
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name="loss" )
with tf.name_scope("eval" ):
correct = tf.nn.in_top_k(logits, y, 1 )
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy" )
with tf.name_scope("train" ):
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
training_op = optimizer.minimize(loss)
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
scope="hidden[123]" )
reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars])
restore_saver = tf.train.Saver(reuse_vars_dict)
init = tf.global_variables_initializer()
saver = tf.train.Saver()
train_number = 25000
n_batches = train_number // batch_size
with tf.Session() as sess:
init.run()
restore_saver.restore(sess, "./models/mnist/gradient_clip.ckpt" )
h2_cache = sess.run(hidden2, feed_dict={X: mnist.train.images[:train_number,:]})
h2_cache_test = sess.run(hidden2, feed_dict={X: mnist.test.images})
for epoch in range(n_epochs):
shuffled_idx = np.random.permutation(train_number)
hidden2_batches = np.array_split(h2_cache[shuffled_idx], n_batches)
y_batches = np.array_split(mnist.train.labels[shuffled_idx], n_batches)
for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):
sess.run(training_op, feed_dict={hidden2:hidden2_batch, y:y_batch})
accuracy_val = accuracy.eval(feed_dict={hidden2: h2_cache_test,
y: mnist.test.labels})
if epoch % 10 == 0 :
print(epoch, "Test accuracy:" , accuracy_val)
save_path = saver.save(sess, "./models/mnist/gradient_clip.ckpt" )
INFO:tensorflow:Restoring parameters from ./models/mnist/gradient_clip.ckpt
0 Test accuracy: 0.4361
10 Test accuracy: 0.8888
20 Test accuracy: 0.9183
30 Test accuracy: 0.9266
40 Test accuracy: 0.9311
50 Test accuracy: 0.9338
reset_graph()
w1 = tf.Variable( [[1 ,2 ]] )
w2 = tf.Variable( [[3 ,4 ]] )
res = tf.matmul( w1, [[2 ],[1 ]] )
grads = tf.gradients( res, [w1] )
with tf.Session() as sess:
tf.global_variables_initializer().run()
print( sess.run( grads ) )
with tf.variable_scope("" , reuse=tf.AUTO_REUSE):
w1 = tf.get_variable('w1' , shape=[3 ])
w2 = tf.get_variable('w2' , shape=[3 ])
w3 = tf.get_variable('w3' , shape=[3 ])
w4 = tf.get_variable('w4' , shape=[3 ])
z1 = w1+w2+w3+w4
z2 = w3+w4
grads = tf.gradients( [z1, z2], [w1, w2, w3, w4], grad_ys=[ tf.convert_to_tensor([2. ,2 ,3 ]), tf.convert_to_tensor([3. ,2 ,4 ]) ] )
grads_1 = tf.gradients( [z1, z2], [w1, w2, w3, w4] )
with tf.Session() as sess:
tf.global_variables_initializer().run()
print( sess.run( grads ) )
print( sess.run( grads_1 ) )
reset_graph()
w1 = tf.Variable(2.0 )
w2 = tf.Variable(3.0 )
a = tf.multiply( w1, 3.0 )
a_stopped = tf.stop_gradient( a )
b = tf.multiply( a_stopped, w2 )
grad_2 = tf.gradients( b, xs=[w1, w2] )
grad_3 = tf.gradients(b, w2)
with tf.Session() as sess:
tf.global_variables_initializer().run()
print( sess.run( a ) )
print( sess.run( a_stopped ) )
print( sess.run(grad_3) )
[array([[2, 1]], dtype=int32)]
[array([2., 2., 3.], dtype=float32), array([2., 2., 3.], dtype=float32), array([5., 4., 7.], dtype=float32), array([5., 4., 7.], dtype=float32)]
[array([1., 1., 1.], dtype=float32), array([1., 1., 1.], dtype=float32), array([2., 2., 2.], dtype=float32), array([2., 2., 2.], dtype=float32)]
6.0
6.0
[6.0]
Unsupervised Pretraining
If there is little labeled training data and most of the data is unlabeled, the unlabeled data can be used for pretraining, for example with RBMs (Restricted Boltzmann Machines) or autoencoders. The pretrained parameters are then used as the starting point for supervised backpropagation training on the labeled data. A sketch of the autoencoder variant follows.
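The layer sizes, epoch count and checkpoint path below are arbitrary choices, the labels are deliberately ignored, and only calls that already appear in this notebook are used: pretrain hidden1 with a reconstruction loss, save it, and later restore it into the supervised network with a restricted Saver.
reset_graph()
n_inputs = 28 * 28
n_hidden1 = 300
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
# encoder / decoder pair; only hidden1 will be reused afterwards
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
reconstruction = tf.layers.dense(hidden1, n_inputs, name="reconstruction")
reconstruction_loss = tf.reduce_mean(tf.square(reconstruction - X))
pretrain_op = tf.train.AdamOptimizer(0.001).minimize(reconstruction_loss)
init = tf.global_variables_initializer()
hidden1_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="hidden1")
pretrain_saver = tf.train.Saver({v.op.name: v for v in hidden1_vars})
with tf.Session() as sess:
    init.run()
    for epoch in range(5):  # a few passes over the images, treated as unlabeled data
        for iteration in range(mnist.train.num_examples // 500):
            X_batch, _ = mnist.train.next_batch(500)
            sess.run(pretrain_op, feed_dict={X: X_batch})
    pretrain_saver.save(sess, "./models/mnist/pretrained_hidden1.ckpt")  # hypothetical path
# the classifier is then built as usual and hidden1 is restored with this saver before fine-tuning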
Pretraining on an Auxiliary Task
If labeled data for the actual task is hard to obtain, you can first find an auxiliary task for which labels are easy to collect, train a network on that task, and then reuse the lower layers of that network in the network for the real task.
Optimizers That Speed Up Training
Momentum Optimization
Nesterov Accelerated Gradient
learning_rate = 0.1
optimizer= tf.train.MomentumOptimizer( learning_rate=learning_rate, momentum=0.9 )
optimizer= tf.train.MomentumOptimizer( learning_rate=learning_rate, momentum=0.9 , use_nesterov=True )
AdaGrad
RMSProp
Adam Optimization
optimizer = tf.train.AdagradOptimizer( learning_rate=learning_rate )
optimizer = tf.train.RMSPropOptimizer( learning_rate=learning_rate, momentum=0.9 , decay=0.9 , epsilon=1e-10 )
optimizer = tf.train.AdamOptimizer( learning_rate=learning_rate )
Learning Rate Scheduling
If the learning rate is too large, training oscillates; if it is too small, training converges very slowly. A common strategy is therefore to start with a large learning rate and reduce it over time, or to reduce it whenever convergence slows down. The main scheduling methods are:
Predetermined schedule: set an initial value and reduce the learning rate after a given number of epochs. Performance scheduling: measure the validation error every N steps and reduce the learning rate when the error stops dropping. Exponential scheduling: $\eta(t) = \eta_0 \, 10^{-t/r}$. Power scheduling: $\eta(t) = \eta_0 (1 + t/r)^{-c}$, whose learning rate decays more slowly than exponential scheduling. Andrew Senior et al. report in a 2013 paper that performance scheduling and exponential scheduling both work well, but exponential scheduling is easier to implement and tune.
reset_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10
X = tf.placeholder(tf.float32, shape=(None , n_inputs), name="X" )
y = tf.placeholder(tf.int64, shape=(None ), name="y" )
with tf.name_scope("dnn" ):
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1" )
hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2" )
logits = tf.layers.dense(hidden2, n_outputs, name="outputs" )
with tf.name_scope("loss" ):
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name="loss" )
with tf.name_scope("eval" ):
correct = tf.nn.in_top_k(logits, y, 1 )
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy" )
with tf.name_scope("train" ):
initial_learning_rate = 0.1
decay_steps = 10000
decay_rate = 1 /10
global_step = tf.Variable(0 , trainable=False , name="global_step" )
learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
decay_steps, decay_rate)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9 )
training_op = optimizer.minimize(loss, global_step=global_step)
init = tf.global_variables_initializer()
n_epochs = 5
batch_size = 500
with tf.Session() as sess:
init.run()
for epoch in range(n_epochs):
for iteration in range(mnist.train.num_examples // batch_size):
X_batch, y_batch = mnist.train.next_batch(batch_size)
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
y: mnist.test.labels})
print(epoch, "Test accuracy:" , accuracy_val)
0 Test accuracy: 0.9499
1 Test accuracy: 0.9628
2 Test accuracy: 0.9702
3 Test accuracy: 0.9741
4 Test accuracy: 0.9752
Techniques to Avoid Overfitting
Early Stopping
Stop training when the model's performance on the validation set starts to degrade. Every N steps, compare the current model with the best model seen so far and keep the better one as the new best, continuing until the maximum number of iterations is reached. Early stopping can be combined with other regularization techniques for better performance; a sketch follows.
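The patience value and checkpoint path below are arbitrary; everything else reuses the graph and data objects defined earlier in this notebook.
best_acc = 0.0
checks_without_progress = 0
patience = 10  # stop after this many epochs without improvement
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_val = accuracy.eval(feed_dict={X: mnist.validation.images, y: mnist.validation.labels})
        if acc_val > best_acc:
            best_acc, checks_without_progress = acc_val, 0
            saver.save(sess, "./models/mnist/best_model.ckpt")  # remember the best model so far
        else:
            checks_without_progress += 1
            if checks_without_progress >= patience:
                print("Early stopping at epoch", epoch)
                break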
L1 and L2 Regularization
Add the sum of the absolute values (L1) or the sum of the squares (L2) of the weights to the loss function. This keeps the weights from growing too large and is an effective way to limit overfitting; L1 regularization additionally drives small weights exactly to 0, producing a sparse weight matrix. A sketch using TF's built-in regularizers follows; the runnable example further below sums |W| over the kernels by hand.
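In this sketch the scale value is arbitrary and the layer names mirror the manual example below:
# sketch: tf.layers.dense registers the L1 penalty of each kernel in the REGULARIZATION_LOSSES collection
l1_reg = tf.contrib.layers.l1_regularizer(scale=0.001)
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, kernel_regularizer=l1_reg, name="hidden1")
logits = tf.layers.dense(hidden1, n_outputs, kernel_regularizer=l1_reg, name="outputs")
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.add_n([tf.reduce_mean(xentropy)] + reg_losses, name="loss")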
Dropout
This is the most widely used regularization technique for DNNs; adding dropout typically improves accuracy by 1%-2%. The main idea: at every training step, every neuron (input neurons and hidden neurons, but not output neurons) is dropped out with probability $p$, i.e. ignored for that step. The dropout probability $p$ is a hyperparameter, commonly set to 50%. Dropout is only applied during training; at validation or test time all neurons are used. Note that after training, each neuron's incoming connection weights must be multiplied by the keep probability $(1 - p)$, because during training only part of the inputs was active; equivalently, each neuron's output can be divided by the keep probability $(1 - p)$ during training. Dropout slows convergence somewhat, but the final model usually generalizes better. A tiny sketch of the "divide during training" (inverted dropout) variant follows.
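In this numpy sketch, p is the drop probability; it is an illustration of the idea, not the tf.layers.dropout implementation used below.
import numpy as np
def inverted_dropout(activations, p, training=True):
    if not training:          # at test time every neuron is used and nothing is rescaled
        return activations
    keep_prob = 1.0 - p
    mask = np.random.rand(*activations.shape) < keep_prob
    # dividing by keep_prob keeps the expected activation unchanged,
    # so the weights do not need to be rescaled after training
    return activations * mask / keep_prob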
reset_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_outputs = 10
X = tf.placeholder(tf.float32, shape=(None , n_inputs), name="X" )
y = tf.placeholder(tf.int64, shape=(None ), name="y" )
with tf.name_scope("dnn" ):
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1" )
logits = tf.layers.dense(hidden1, n_outputs, name="outputs" )
W1 = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0" )
W2 = tf.get_default_graph().get_tensor_by_name("outputs/kernel:0" )
scale = 0.001
with tf.name_scope("loss" ):
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
logits=logits)
base_loss = tf.reduce_mean(xentropy, name="avg_xentropy" )
reg_losses = tf.reduce_sum(tf.abs(W1)) + tf.reduce_sum(tf.abs(W2))
loss = tf.add(base_loss, scale * reg_losses, name="loss" )
with tf.name_scope("eval" ):
correct = tf.nn.in_top_k(logits, y, 1 )
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy" )
learning_rate = 0.01
with tf.name_scope("train" ):
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
training_op = optimizer.minimize(loss)
init = tf.global_variables_initializer()
n_epochs = 20
batch_size = 200
with tf.Session() as sess:
init.run()
for epoch in range(n_epochs):
for iteration in range(mnist.train.num_examples // batch_size):
X_batch, y_batch = mnist.train.next_batch(batch_size)
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
y: mnist.test.labels})
print(epoch, "Test accuracy:" , accuracy_val)
0 Test accuracy: 0.8355
1 Test accuracy: 0.8711
2 Test accuracy: 0.8827
3 Test accuracy: 0.8908
4 Test accuracy: 0.8953
5 Test accuracy: 0.8987
6 Test accuracy: 0.9003
7 Test accuracy: 0.9042
8 Test accuracy: 0.9053
9 Test accuracy: 0.9048
10 Test accuracy: 0.907
11 Test accuracy: 0.9063
12 Test accuracy: 0.9069
13 Test accuracy: 0.907
14 Test accuracy: 0.9076
15 Test accuracy: 0.9084
16 Test accuracy: 0.9074
17 Test accuracy: 0.9066
18 Test accuracy: 0.9067
19 Test accuracy: 0.9068
reset_graph()
X = tf.placeholder( tf.float32, shape=(None , n_inputs), name="X" )
y = tf.placeholder(tf.int64, shape=(None ), name="y" )
training = tf.placeholder_with_default( False , shape=(), name="training" )
dropout_rate = 0.5
X_drop = tf.layers.dropout(X, dropout_rate, training=training)
with tf.name_scope( "dnn" ):
hidden1 = tf.layers.dense( X_drop, n_hidden1, activation=tf.nn.relu, name="hidden1" )
hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
hidden2 = tf.layers.dense( hidden1_drop, n_hidden2, activation=tf.nn.relu, name="hidden2" )
hidden2_drop = tf.layers.dropout( hidden2, dropout_rate, training=training )
logits = tf.layers.dense( hidden2_drop, n_outputs, name="outputs" )
with tf.name_scope("loss" ):
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits( labels=y, logits=logits )
loss = tf.reduce_mean(xentropy, name="loss" )
with tf.name_scope("train" ):
optimizer = tf.train.MomentumOptimizer( learning_rate, momentum=0.9 )
training_op = optimizer.minimize( loss )
with tf.name_scope("eval" ):
correct = tf.nn.in_top_k( logits, y, 1 )
accuracy = tf.reduce_mean( tf.cast(correct, tf.float32) )
init = tf.global_variables_initializer()
n_epochs = 20
batch_size = 100
with tf.Session() as sess:
init.run()
for epoch in range( n_epochs ):
for iteration in range( mnist.train.num_examples // batch_size ):
X_batch, y_batch = mnist.train.next_batch( batch_size )
sess.run( training_op, feed_dict={training:True , X:X_batch, y:y_batch} )
acc_test = accuracy.eval( feed_dict={X:mnist.test.images, y:mnist.test.labels} )
print( epoch, "test accuracy : " , acc_test )
0 test accuracy : 0.9041
1 test accuracy : 0.9242
2 test accuracy : 0.9375
3 test accuracy : 0.9436
4 test accuracy : 0.9506
5 test accuracy : 0.9494
6 test accuracy : 0.9529
7 test accuracy : 0.9574
8 test accuracy : 0.9592
9 test accuracy : 0.963
10 test accuracy : 0.9633
11 test accuracy : 0.963
12 test accuracy : 0.9665
13 test accuracy : 0.9651
14 test accuracy : 0.9638
15 test accuracy : 0.9668
16 test accuracy : 0.9647
17 test accuracy : 0.9677
18 test accuracy : 0.9652
19 test accuracy : 0.9672
Max-Norm Regularization
Max-norm regularization constrains the norm of each neuron's weight vector: $\|w\|_2 < r$, where $r$ is a hyperparameter.
For a dense layer, each node's parameters correspond to one row of the weight matrix w, which is why axes=1 is used below.
参考链接:https://www.tensorflow.org/api_docs/python/tf/clip_by_norm
Typically, after each training step w is rescaled: $w \leftarrow w \, \dfrac{r}{\|w\|_2}$ whenever $\|w\|_2 > r$. The smaller $r$ is, the stronger the regularization and the less overfitting. Even without Batch Normalization, this technique also reduces the risk of vanishing or exploding gradients.
a = np.random.rand(3 ,5 )
print( np.sum(a, axis=0 ) )
print( np.sum(a, axis=1 ) )
print( np.sum(a, axis=None ) )
[0.63031862 0.98162226 0.89974413 0.8518089 2.10474696]
[1.9078224 1.87284878 1.68756969]
5.468240874930748
reset_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10
learning_rate = 0.01
momentum = 0.9
X = tf.placeholder(tf.float32, shape=(None , n_inputs), name="X" )
y = tf.placeholder(tf.int64, shape=(None ), name="y" )
with tf.name_scope("dnn" ):
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1" )
hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2" )
logits = tf.layers.dense(hidden2, n_outputs, name="outputs" )
with tf.name_scope("loss" ):
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name="loss" )
with tf.name_scope("train" ):
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
training_op = optimizer.minimize(loss)
with tf.name_scope("eval" ):
correct = tf.nn.in_top_k(logits, y, 1 )
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
threshold = 1.0
weights = tf.get_default_graph().get_tensor_by_name( "hidden1/kernel:0" )
clipped_weights = tf.clip_by_norm( weights, clip_norm=threshold, axes=1 )
clip_weights = tf.assign( weights, clipped_weights )
weights2 = tf.get_default_graph().get_tensor_by_name( "hidden2/kernel:0" )
clipped_weights2 = tf.clip_by_norm( weights2, clip_norm=threshold, axes=1 )
clip_weights2 = tf.assign( weights2, clipped_weights2 )
init = tf.global_variables_initializer()
n_epochs = 20
batch_size = 500
with tf.Session() as sess:
init.run()
for epoch in range( n_epochs ):
for iteration in range( mnist.train.num_examples // batch_size ):
X_batch, y_batch = mnist.train.next_batch( batch_size )
sess.run( training_op, feed_dict={X:X_batch, y:y_batch} )
clip_weights.eval()
clip_weights2.eval()
acc_test = sess.run( accuracy, feed_dict={X:mnist.test.images, y:mnist.test.labels} )
print(epoch, "Test accuracy:" , acc_test)
0 Test accuracy: 0.8962
1 Test accuracy: 0.9206
2 Test accuracy: 0.9296
3 Test accuracy: 0.9357
4 Test accuracy: 0.941
5 Test accuracy: 0.9462
6 Test accuracy: 0.9504
7 Test accuracy: 0.9535
8 Test accuracy: 0.9562
9 Test accuracy: 0.9587
10 Test accuracy: 0.9616
11 Test accuracy: 0.9625
12 Test accuracy: 0.9635
13 Test accuracy: 0.9655
14 Test accuracy: 0.9678
15 Test accuracy: 0.967
16 Test accuracy: 0.9694
17 Test accuracy: 0.9698
18 Test accuracy: 0.9712
19 Test accuracy: 0.9712
def max_norm_regularizer ( threshold, axes=1 , name="max_norm" , collection="max_norm" ) :
def max_norm ( weights ) :
clipped = tf.clip_by_norm( weights, clip_norm=threshold, axes=axes )
clip_weights = tf.assign( weights, clipped, name=name )
tf.add_to_collection( collection, clip_weights )
return None  # max-norm adds no loss term to the cost function
return max_norm
reset_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10
learning_rate = 0.01
momentum = 0.9
X = tf.placeholder(tf.float32, shape=(None , n_inputs), name="X" )
y = tf.placeholder(tf.int64, shape=(None ), name="y" )
max_norm_reg = max_norm_regularizer( threshold=1.0 )
with tf.name_scope("dnn" ):
hidden1 = tf.layers.dense( X, n_hidden1, activation=tf.nn.relu, kernel_regularizer=max_norm_reg, name="hidden1" )
hidden2 = tf.layers.dense( hidden1, n_hidden2, activation=tf.nn.relu, kernel_regularizer=max_norm_reg, name="hidden2" )
logits = tf.layers.dense( hidden2, n_outputs, name="outputs" )
with tf.name_scope("loss" ):
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name="loss" )
with tf.name_scope("train" ):
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
training_op = optimizer.minimize(loss)
with tf.name_scope("eval" ):
correct = tf.nn.in_top_k(logits, y, 1 )
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
init = tf.global_variables_initializer()
n_epochs = 20
batch_size = 500
clip_all_weights = tf.get_collection( "max_norm" )
with tf.Session() as sess:
init.run()
for epoch in range( n_epochs ):
for iteration in range( mnist.train.num_examples // batch_size ):
X_batch, y_batch = mnist.train.next_batch( batch_size )
sess.run( training_op, feed_dict={X:X_batch, y:y_batch} )
sess.run( clip_all_weights )
acc_test = sess.run( accuracy, feed_dict={X:mnist.test.images, y:mnist.test.labels} )
print(epoch, "Test accuracy:" , acc_test)
0 Test accuracy: 0.8953
1 Test accuracy: 0.9168
2 Test accuracy: 0.9279
3 Test accuracy: 0.9352
4 Test accuracy: 0.9415
5 Test accuracy: 0.9457
6 Test accuracy: 0.9511
7 Test accuracy: 0.9531
8 Test accuracy: 0.9562
9 Test accuracy: 0.9576
10 Test accuracy: 0.9593
11 Test accuracy: 0.9622
12 Test accuracy: 0.964
13 Test accuracy: 0.9652
14 Test accuracy: 0.9665
15 Test accuracy: 0.967
16 Test accuracy: 0.9687
17 Test accuracy: 0.9692
18 Test accuracy: 0.9694
19 Test accuracy: 0.9706
Data Augmentation
Data augmentation means artificially generating additional training instances, mixing them with the original training set, and training on the combined data. You can resize, rotate, or shift the original training examples; the resulting examples are still learnable. Transformations that cannot be learned from should be avoided: for example, adding white noise to the original training set does not help training. A small sketch follows.
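This sketch shifts the MNIST images used throughout this notebook; it assumes scipy is available (not imported elsewhere here), and one-pixel offsets are an arbitrary choice.
from scipy.ndimage.interpolation import shift
def shift_image(image, dx, dy):
    # image is a flattened 28x28 MNIST image; dx / dy are pixel offsets
    shifted = shift(image.reshape(28, 28), [dy, dx], cval=0, mode="constant")
    return shifted.reshape(-1)
# augmented training set: the originals plus one-pixel shifts in each direction
X_aug = [mnist.train.images]
y_aug = [mnist.train.labels]
for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    X_aug.append(np.apply_along_axis(shift_image, axis=1, arr=mnist.train.images, dx=dx, dy=dy))
    y_aug.append(mnist.train.labels)
X_train_augmented = np.concatenate(X_aug)
y_train_augmented = np.concatenate(y_aug)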
Some Practical Tips
A reasonable default configuration when training a DNN (a sketch assembling these defaults follows the list):
initialization: He initialization; activation function: ELU; normalization: Batch Normalization; regularization: dropout; optimizer: Adam; learning rate: constant
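The sketch below carries the layer sizes over from earlier cells and is not tuned; it only combines calls already used in this notebook.
reset_graph()
he_init = tf.contrib.layers.variance_scaling_initializer()            # He initialization
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")
training = tf.placeholder_with_default(False, shape=(), name="training")
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, kernel_initializer=he_init, name="hidden1")
    bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
    act1 = tf.nn.elu(bn1)                                              # ELU activation
    drop1 = tf.layers.dropout(act1, rate=0.5, training=training)       # dropout regularization
    logits = tf.layers.dense(drop1, n_outputs, kernel_initializer=he_init, name="outputs")
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
with tf.name_scope("train"):
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001)            # Adam, constant learning rate
    extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)      # BN moving-statistics updates
    with tf.control_dependencies(extra_update_ops):
        training_op = optimizer.minimize(loss)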
Some common problems and how to address them
If the model performs poorly across many learning rates, try a learning rate schedule instead of a constant rate. If training data is scarce, use data augmentation. If you need a sparse model, use L1 regularization. If the network must be lightweight, you can drop Batch Normalization and replace ELU with leaky ReLU; a sparse model also reduces the network's size.