Chapter 11 Training Deep Neural Nets
O'Reilly. Hands-On Machine Learning with Scikit-Learn and TensorFlow: reading notes
Every word of this chapter is worth reading, so these notes may miss some important content.
Problems that arise when training a DNN:
- vanishing/exploding gradients
- extremely slow training
- overfitting
11.1 Vanishing/Exploding Gradients Problems
Vanishing gradients problem: gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the weights of the lower layers stay virtually unchanged during training, and the algorithm never converges to a good solution.
Exploding gradients problem: the gradients grow larger and larger, so some layers get insanely large weight updates and training diverges.
Some suspects for vanishing gradients: the logistic sigmoid activation function and random initialization using a normal distribution with a mean of 0 and a standard deviation of 1.
With this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing after each layer until the activation function saturates at the top layers. This is actually made worse by the fact that the logistic function has a mean of 0.5, not 0.
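As a rough illustration of this effect (a sketch, not taken from the book), the NumPy snippet below pushes random data through a few fully connected logistic layers whose weights are drawn from N(0, 1). The layer width of 100, the batch of 1,000 instances and the 0.01/0.99 saturation cut-offs are arbitrary assumptions made for this example.

```python
import numpy as np

rng = np.random.RandomState(0)
n_units = 100                                  # assumed layer width
a = rng.normal(size=(1000, n_units))           # inputs with variance close to 1

pre_activation_vars = []
saturated_fractions = []
for layer in range(5):
    W = rng.normal(0.0, 1.0, size=(n_units, n_units))  # N(0, 1) initialization
    z = a @ W                                  # weighted sums fed to the sigmoid
    a = 1.0 / (1.0 + np.exp(-z))               # logistic activation
    pre_activation_vars.append(z.var())
    # fraction of outputs stuck near 0 or 1, where the sigmoid's gradient is almost 0
    saturated_fractions.append(np.mean((a < 0.01) | (a > 0.99)))

for layer, (v, s) in enumerate(zip(pre_activation_vars, saturated_fractions), 1):
    print("layer %d: var(z) = %.1f, saturated = %.0f%%" % (layer, v, 100 * s))
```

With this setup the weighted sums of the first layer have a variance roughly n_units times larger than that of the inputs, and a large share of the sigmoid outputs sits in the flat regions of the function, which is where gradients vanish.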
11.1.1 Xavier and He Initialization
We need the signal to flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. We don’t want the signal to die out (approach 0), nor do we want it to explode (approach infinity) and saturate (stay at a constant).
We need the variance of the outputs of each layer to be equal to the variance of its inputs, and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction.
The connection weights must be initialized randomly as described in Equation 11-1, where $n_{inputs}$ and $n_{outputs}$ are the number of input and output connections for the layer whose weights are being initialized (also called fan-in and fan-out). This initialization strategy is often called Xavier initialization (after the author’s first name), or sometimes Glorot initialization.
Equation 11-1. Xavier initialization (when using the logistic activation function)
$$
\begin{aligned}
&\textrm{Normal distribution with mean 0 and standard deviation }\sigma=\sqrt{\frac{2}{n_{inputs}+n_{outputs}}}\\
&\textrm{Or a uniform distribution between }-r\textrm{ and }+r\textrm{, with }r=\sqrt{\frac{6}{n_{inputs}+n_{outputs}}}
\end{aligned}
$$
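The constants in Equation 11-1 can be motivated informally as follows. This is only a sketch under the usual simplifying assumptions (zero-mean, independent weights and inputs, and an activation that is roughly linear around 0); the symbols $w$, $x$ and $z$ are introduced here for illustration and do not appear in the notes.

```latex
\begin{aligned}
z=\sum_{i=1}^{n_{inputs}} w_i x_i \;&\Rightarrow\; \mathrm{Var}(z)=n_{inputs}\,\mathrm{Var}(w)\,\mathrm{Var}(x)\\
\text{forward pass keeps the variance if}\quad & n_{inputs}\,\mathrm{Var}(w)=1\\
\text{backward pass keeps the variance if}\quad & n_{outputs}\,\mathrm{Var}(w)=1\\
\text{compromise:}\quad & \mathrm{Var}(w)=\sigma^2=\frac{2}{n_{inputs}+n_{outputs}}\\
w\sim U(-r,r) \;&\Rightarrow\; \mathrm{Var}(w)=\frac{r^2}{3}
\;\Rightarrow\; r=\sqrt{3}\,\sigma=\sqrt{\frac{6}{n_{inputs}+n_{outputs}}}
\end{aligned}
```

The last line shows that the normal and the uniform variants of Equation 11-1 give the weights the same variance.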
Table 11-1. Initialization parameters for each type of activation function
Activation function | Uniform distribution [–r, r] | Normal distribution |
---|---|---|
Logistic | $r=\sqrt{\frac{6}{n_{inputs}+n_{outputs}}}$ | $\sigma=\sqrt{\frac{2}{n_{inputs}+n_{outputs}}}$ |
Hyperbolic tangent | $r=4\sqrt{\frac{6}{n_{inputs}+n_{outputs}}}$ | $\sigma=4\sqrt{\frac{2}{n_{inputs}+n_{outputs}}}$ |
ReLU (and its variants) | $r=\sqrt{2}\sqrt{\frac{6}{n_{inputs}+n_{outputs}}}$ | $\sigma=\sqrt{2}\sqrt{\frac{2}{n_{inputs}+n_{outputs}}}$ |
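import tensorflow as tf
n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
# Note: with its default scale=1.0 this initializer uses a variance of 1/fan_in;
# passing scale=2.0 gives the variance of 2/fan_in proposed by He et al.
he_init = tf.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name="hidden1")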
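To make Table 11-1 concrete, here is a small plain-Python helper that returns both parameters for a given layer. The function name init_params and the activation labels "logistic", "tanh" and "relu" are hypothetical names chosen for this sketch; the multipliers are copied from the table.

```python
import math

# Multipliers from Table 11-1 (the dictionary keys are labels chosen for this sketch).
_SCALE = {"logistic": 1.0, "tanh": 4.0, "relu": math.sqrt(2.0)}

def init_params(n_inputs, n_outputs, activation="logistic"):
    """Return (r, sigma): the bound of the uniform distribution [-r, r]
    and the standard deviation of the zero-mean normal distribution."""
    scale = _SCALE[activation]
    fan_sum = n_inputs + n_outputs
    r = scale * math.sqrt(6.0 / fan_sum)
    sigma = scale * math.sqrt(2.0 / fan_sum)
    return r, sigma

# Example: the first hidden layer of the MNIST network used in these notes.
r, sigma = init_params(28 * 28, 300, activation="relu")
print("uniform bound r = %.4f, normal stddev sigma = %.4f" % (r, sigma))
```

For every row of the table, r equals sigma times the square root of 3, so both columns describe weights with the same variance.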
11.1.2 Nonsaturating Activation Functions
The vanishing/exploding gradients problems were in part due to a poor choice of activation function. The ReLU activation function behaves better because it does not saturate for positive values (and it is also quite fast to compute).
The ReLU activation function is not perfect, however. It suffers from a problem known as dying ReLUs: during training, some neurons effectively die, meaning they stop outputting anything other than 0. Once the weighted sum of a neuron’s inputs is negative for every training instance, the gradient of ReLU is 0 and the neuron is unlikely to come back to life.
The solution is to use a variant of ReLU:
- leaky ReLU. The function is defined as $\textrm{LeakyReLU}_\alpha(z)=\max(\alpha z,z)$. Setting $\alpha = 0.2$ (huge leak) seemed to result in better performance than $\alpha = 0.01$ (small leak).
- randomized leaky ReLU (RReLU), where $\alpha$ is picked randomly in a given range during training, and it is fixed to an average value during testing.
- parametric leaky ReLU (PReLU), where $\alpha$ is authorized to be learned during training (instead of being a hyperparameter).
- exponential linear unit (ELU):
Equation 11-2. ELU activation function
$$
\textrm{ELU}_\alpha(z)=\left\{\begin{array}{ll}\alpha(\exp(z)-1)&\textrm{ if }z<0\\z&\textrm{ if } z\ge 0\end{array}\right.
$$
So which activation function should you use for the hidden layers of your deep neural networks? In general ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If you care a lot about runtime (testing) performance, then you may prefer leaky ReLUs over ELUs. If you don’t want to tweak yet another hyperparameter, you may just use the default $\alpha$ values suggested earlier (0.01 for the leaky ReLU, and 1 for ELU). If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular RReLU if your network is overfitting, or PReLU if you have a huge training set.
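# ELU is built into TensorFlow:
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name="hidden1")
# Alternatively, define a leaky ReLU by hand and pass it as the activation:
def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)
hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu, name="hidden1")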
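For readers who want to check the two formulas numerically without building a graph, the following plain-Python sketch evaluates them on scalars. The defaults (0.01 for the leak, 1 for ELU) are the ones mentioned above; the scalar-only signatures are a simplification made for this example.

```python
import math

def leaky_relu(z, alpha=0.01):
    """LeakyReLU_alpha(z) = max(alpha * z, z); valid for 0 <= alpha < 1."""
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    """ELU_alpha(z) from Equation 11-2."""
    return alpha * (math.exp(z) - 1.0) if z < 0 else z

for z in (-5.0, -1.0, 0.0, 2.0):
    print("z = %5.1f  leaky_relu = %8.4f  elu = %8.4f" % (z, leaky_relu(z), elu(z)))
```

Unlike ReLU, both functions return a nonzero value for negative inputs, so the gradient never becomes exactly 0 there, and ELU levels off at -alpha for very negative z.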
11.1.3 Batch Normalization
Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it doesn’t guarantee that they won’t come back during training.
Batch Normalization (BN) was proposed to address the vanishing/exploding gradients problems, and more generally the problem that the distribution of each layer’s inputs changes during training as the parameters of the previous layers change (which the authors call the Internal Covariate Shift problem).
The technique consists of adding an operation in the model just before the activation function of each layer, simply zero-centering and normalizing the inputs, then scaling and shifting the result using two new parameters per layer (one for scaling, the other for shifting). In other words, this operation lets the model learn the optimal scale and mean of the inputs for each layer.
In order to zero-center and normalize the inputs, the algorithm needs to estimate the inputs’ mean and standard deviation. It does so by evaluating the mean and standard deviation of the inputs over the current mini-batch (hence the name “Batch Normalization”). The whole operation is summarized in Equation 11-3.
Equation 11-3. Batch Normalization algorithm
$$
\begin{aligned}
1.\quad & \mu_B = \frac{1}{m_B}\sum_{i=1}^{m_B}\textbf x^{(i)}\\
2.\quad & \sigma_B^2 = \frac{1}{m_B}\sum_{i=1}^{m_B}\left(\textbf x^{(i)}-\mu_B\right)^2\\
3.\quad & \widehat{\textbf x}^{(i)} = \frac{\textbf x^{(i)}-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}\\
4.\quad & \textbf z^{(i)} = \gamma\,\widehat{\textbf x}^{(i)}+\beta
\end{aligned}
$$
- $\mu_B$ is the empirical mean, evaluated over the whole mini-batch $B$.
- $\sigma_B$ is the empirical standard deviation, also evaluated over the whole mini-batch.
- $m_B$ is the number of instances in the mini-batch.
- $\widehat{\textbf x}^{(i)}$ is the zero-centered and normalized input.
- $\gamma$ is the scaling parameter for the layer.
- $\beta$ is the shifting parameter (offset) for the layer.
- $\epsilon$ is a tiny number to avoid division by zero (typically $10^{-3}$). This is called a smoothing term.
- $\textbf z^{(i)}$ is the output of the BN operation: it is a scaled and shifted version of the inputs.
At test time, there is no mini-batch to compute the empirical mean and standard deviation, so instead you simply use the whole training set’s mean and standard deviation. These are typically efficiently computed during training using a moving average. So, in total, four parameters are learned for each batch-normalized layer: $\gamma$ (scale), $\beta$ (offset), $\mu$ (mean), and $\sigma$ (standard deviation).
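The following NumPy class is a minimal sketch of the forward pass described above, including the moving averages used at test time. The class name BatchNorm1D, the initial values of the moving statistics and the toy data are assumptions made for this example; it is not how TensorFlow implements the operation, and it leaves out the gradient updates of the scale and offset.

```python
import numpy as np

class BatchNorm1D:
    """Toy batch-normalization layer (forward pass only, illustrative)."""

    def __init__(self, n_features, momentum=0.9, epsilon=1e-3):
        self.gamma = np.ones(n_features)         # scale, learned by backprop in a real layer
        self.beta = np.zeros(n_features)         # offset, learned by backprop in a real layer
        self.moving_mean = np.zeros(n_features)  # estimate of the training-set mean
        self.moving_var = np.ones(n_features)    # estimate of the training-set variance
        self.momentum = momentum
        self.epsilon = epsilon

    def __call__(self, X, training):
        if training:
            mu = X.mean(axis=0)                  # mini-batch mean
            var = X.var(axis=0)                  # mini-batch variance
            # update the moving averages that will be used at test time
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mu
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            mu, var = self.moving_mean, self.moving_var
        X_hat = (X - mu) / np.sqrt(var + self.epsilon)   # zero-center and normalize
        return self.gamma * X_hat + self.beta            # scale and shift

rng = np.random.RandomState(0)
X = rng.normal(5.0, 3.0, size=(200, 4))          # toy inputs: mean 5, std 3
bn = BatchNorm1D(n_features=4)
for _ in range(100):                             # repeated training steps on the same batch
    Z_train = bn(X, training=True)
Z_test = bn(X, training=False)                   # uses the moving averages instead
```

After enough training steps the moving averages approach the batch statistics, so the test-time output is close to the training-time output. After only a few steps they are still dominated by their initial values, which is one reason the momentum setting matters.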
Advantages (as reported by the authors of the technique):
- It considerably improved all the deep neural networks they experimented with.
- The vanishing gradients problem was strongly reduced, to the point that they could use saturating activation functions such as the tanh and even the logistic activation function.
- The networks were also much less sensitive to the weight initialization.
- They were able to use much larger learning rates, significantly speeding up the learning process.
- It improved upon the best published result on ImageNet classification, reaching 4.9% top-5 validation error (and 4.8% test error) and exceeding the accuracy of human raters.
- Batch Normalization also acts like a regularizer, reducing the need for other regularization techniques (such as dropout).
Disadvantages:
- It adds some complexity to the model (although it removes the need for normalizing the input data since the first hidden layer will take care of that, provided it is batch-normalized).
- Moreover, there is a runtime penalty: the neural network makes slower predictions due to the extra computations required at each layer. So if you need predictions to be lightning-fast, you may want to check how well plain ELU + He initialization perform before playing with Batch Normalization.
11.1.3.1 Implementing Batch Normalization with TensorFlow
In the construction phase, the code that uses placeholder nodes to represent the training data and builds the deep neural network looks as follows.
import tensorflow as tf
tf.reset_default_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
# False by default; fed as True during training so that BN uses mini-batch statistics
training = tf.placeholder_with_default(False, shape=(), name="training")
# Each dense layer has no activation: BN is applied first, then ELU
hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)
hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = tf.layers.batch_normalization(hidden2, training=training, momentum=0.9)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = tf.layers.batch_normalization(logits_before_bn, training=training, momentum=0.9)
To avoid repeating the same parameters over and over again, we can use Python’s partial() function instead.
from functools import partial
my_batch_norm_layer = partial(tf.layers.batch_normalization,
                              training=training, momentum=0.9)
hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = my_batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)
hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = my_batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = my_batch_norm_layer(logits_before_bn)
The whole code is as follows.
import tensorflow as tf
from functools import partial

tf.reset_default_graph()

batch_norm_momentum = 0.9
learning_rate = 0.01  # assumed value; the original snippet does not define it

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
training = tf.placeholder_with_default(False, shape=(), name='training')

with tf.name_scope("dnn"):
    he_init = tf.variance_scaling_initializer()
    my_batch_norm_layer = partial(
        tf.layers.batch_normalization,
        training=training,
        momentum=batch_norm_momentum)
    my_dense_layer = partial(
        tf.layers.dense,
        kernel_initializer=he_init)
    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    bn1 = tf.nn.elu(my_batch_norm_layer(hidden1))
    hidden2 = my_dense_layer(bn1, n_hidden2, name="hidden2")
    bn2 = tf.nn.elu(my_batch_norm_layer(hidden2))
    logits_before_bn = my_dense_layer(bn2, n_outputs, name="outputs")
    logits = my_batch_norm_layer(logits_before_bn)

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 20
batch_size = 200

# The update operations needed by batch normalization are added to the
# UPDATE_OPS collection; you need to run them explicitly during training.
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

# shuffle_batch, X_train, y_train, X_valid and y_valid are assumed to be defined elsewhere.
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)
    save_path = saver.save(sess, "./my_model_final.ckpt")
0 Validation accuracy: 0.894
1 Validation accuracy: 0.9178
2 Validation accuracy: 0.9294
3 Validation accuracy: 0.9402
4 Validation accuracy: 0.947
5 Validation accuracy: 0.9512
6 Validation accuracy: 0.9566
7 Validation accuracy: 0.9592
8 Validation accuracy: 0.9616
9 Validation accuracy: 0.9638
10 Validation accuracy: 0.9682
11 Validation accuracy: 0.9676
12 Validation accuracy: 0.9688
13 Validation accuracy: 0.9692
14 Validation accuracy: 0.9724
15 Validation accuracy: 0.972
16 Validation accuracy: 0.9722
17 Validation accuracy: 0.973
18 Validation accuracy: 0.9736
19 Validation accuracy: 0.9728
If you train for longer it will get much better accuracy, but with such a shallow network, Batch Norm and ELU are unlikely to have very positive impact: they shine mostly for much deeper nets.
11.1.4 Gradient Clipping
A popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold. This is called Gradient Clipping. In general people now prefer Batch Normalization, but it’s still useful to know about Gradient Clipping and how to implement it.
In TensorFlow, the optimizer’s minimize() function takes care of both computing the gradients and applying them, so you must instead call the optimizer’s compute_gradients() method first, then create an operation to clip the gradients using the clip_by_value() function, and finally create an operation to apply the clipped gradients using the optimizer’s apply_gradients() method:
threshold = 1.0
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)
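To see what clipping by value does to a gradient vector, here is a plain-Python sketch on a list of numbers. The second function, clip_by_norm, is a different variant that is not used in the code above (TensorFlow offers it as tf.clip_by_norm); it is included only for comparison, and both function bodies are illustrative reimplementations rather than TensorFlow's code.

```python
import math

def clip_by_value(grads, threshold):
    """Clip every gradient component to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grads]

def clip_by_norm(grads, threshold):
    """Rescale the whole gradient vector if its L2 norm exceeds threshold."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)
    return [g * threshold / norm for g in grads]

grads = [0.5, -3.0, 12.0]
clipped_value = clip_by_value(grads, 1.0)   # [0.5, -1.0, 1.0]
clipped_norm = clip_by_norm(grads, 1.0)     # same direction as grads, norm 1
print(clipped_value, clipped_norm)
```

Clipping by value treats each component independently, so it can change the direction of the gradient vector; clipping by norm keeps the direction and only shortens the vector.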
11.2 Reusing Pretrained Layers
Transfer learning means reusing the lower layers of an existing neural network that accomplishes a similar task instead of training a new model from scratch. It not only speeds up training considerably but also requires much less training data. The inputs of the new task must have the same size as those of the original task, and more generally transfer learning works well only if the inputs have similar low-level features.
11.2.1 Reusing a TensorFlow Model
- Restore the whole model:
In the first case, the original code is not available and you have only the saved model at hand. To reuse the model, first call the import_meta_graph() function to load the graph structure into the default graph. The graph structure contains the nodes and operations of the original model. The function returns a Saver that you can use to restore the model’s state, i.e., the model parameters. The new model then starts training from the parameters of a trained model, which speeds up training and requires less data.
In the second case, you have access to the Python code that built the original graph, so you can use it instead of import_meta_graph(). The code for training the new model is the same as for the original model except for the execution phase, where the call to init.run() that initializes the parameters is replaced by saver.restore(sess,"./my_model_final.ckpt").
(i) First case
First you need to load the graph’s structure. The import_meta_graph() function does just that, loading the graph’s operations into the default graph, and returning a Saver that you can then use to restore the model’s state. Note that by default, a Saver saves the structure of the graph into a .meta file, so that’s the file you should load:
import tensorflow as tf
tf.reset_default_graph()
saver = tf.train.import_meta_graph("./my_model_final.ckpt.meta")
Next you need to get a handle on all the operations you will need for training. If you don’t know the graph’s structure, you can list all the operations:
for op in tf.get_default_graph().get_operations():
    print(op.name)
# Once you know which operations you need, get a handle on them by name:
X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")
accuracy = tf.get_default_graph().get_tensor_by_name("eval/accuracy:0")
training_op = tf.get_default_graph().get_operation_by_name("GradientDescent")
If you are the author of the original model, you could make things easier for people who will reuse your model by giving operations very clear names and documenting them. Another approach is to create a collection containing all the important operations that people will want to get a handle on:
for op in (X, y, accuracy, training_op):
    tf.add_to_collection("my_important_ops", op)
This way people who reuse your model will be able to simply write:
X, y, accuracy, training_op = tf.get_collection("my_important_ops")
Now you can start a session, restore the model’s state and continue training on your data:
with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)
    save_path = saver.save(sess, "./my_new_model_final.ckpt")
- Reuse only part of the original model
In general you will want to reuse only the lower layers. If you are using import_meta_graph() it will load the whole graph, but you can simply ignore the parts you do not need. In this example, we add a new 4th hidden layer on top of the pretrained 3rd layer (ignoring the old 4th hidden layer). We also build a new output layer, the loss for this new output, and a new optimizer to minimize it. We also need another saver to save the whole graph (containing both the entire old graph plus the new operations), and an initialization operation to initialize all the new variables:
tf.reset_default_graph()
n_hidden4 = 20  # new layer
n_outputs = 10  # new layer
saver = tf.train.import_meta_graph("./my_model_final.ckpt.meta")
X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")
hidden3 = tf.get_default_graph().get_tensor_by_name("dnn/hidden3/Relu:0")
new_hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="new_hidden4")
new_logits = tf.layers.dense(new_hidden4, n_outputs, name="new_outputs")
with tf.name_scope("new_loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=new_logits)
    loss = tf.reduce_mean(xentropy, name="loss")
with tf.name_scope("new_eval"):
    correct = tf.nn.in_top_k(new_logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")
with tf.name_scope("new_train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)
init = tf.global_variables_initializer()
# new_saver = tf.train.Saver()  # did not work for the author of these notes: it raised an error
with tf.Session() as sess:
    init.run()  # initializes all variables, including the new ones
    saver.restore(sess, "./my_model_final.ckpt")  # overwrites the reused ones
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)
    # save_path = new_saver.save(sess, "./my_new_model_final.ckpt")
    # Note: the Saver returned by import_meta_graph() only covers the original graph's variables.
    save_path = saver.save(sess, "./my_new_model_final.ckpt")
If you have access to the Python code that built the original graph, you can just reuse the parts you need and drop the rest:
tf.reset_default_graph()
n_inputs = 28 * 28 # MNIST
n_hidden1 = 300 # reused
n_hidden2 = 50 # reused
n_hidden3 = 50 # reused
n_hidden4 = 20 # new!
n_outputs = 10 # new!
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")  # reused
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")  # reused
    hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name="hidden3")  # reused
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="hidden4")  # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs")  # new!
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)
Because the new model contains layers that share names with layers of the original model, restoring the whole model may cause conflicts. Instead, create a Saver that restores only the layers you want to reuse:
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]")  # regular expression
restore_saver = tf.train.Saver(reuse_vars)  # to restore layers 1-3
init = tf.global_variables_initializer()
saver = tf.train.Saver()
with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")
    for epoch in range(n_epochs):  # not shown in the book
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):  # not shown
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})  # not shown
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})  # not shown
        print(epoch, "Validation accuracy:", accuracy_val)  # not shown
    save_path = saver.save(sess, "./my_new_model_final.ckpt")
11.2.2 Reusing Models from Other Frameworks
In this example, for each variable we want to reuse, we find its initializer’s assignment operation, and we get its second input, which corresponds to the initialization value. When we run the initializer, we replace the initialization values with the ones we want, using a feed_dict:
tf.reset_default_graph()
n_inputs = 2
n_hidden1 = 3
original_w = [[1., 2., 3.], [4., 5., 6.]] # Load the weights from the other framework
original_b = [7., 8., 9.] # Load the biases from the other framework
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,name="hidden1")
# [...] Build the rest of the model
# Get a handle on the assignment nodes for the hidden1 variables
graph = tf.get_default_graph()
assign_kernel = graph.get_operation_by_name("hidden1/kernel/Assign")
assign_bias = graph.get_operation_by_name("hidden1/bias/Assign")
init_kernel = assign_kernel.inputs[1]
init_bias = assign_bias.inputs[1]
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init, feed_dict={init_kernel: original_w, init_bias: original_b})
    # [...] Train the model on your new task
    print(hidden1.eval(feed_dict={X: [[10.0, 11.0]]}))  # not shown in the book
11.2.3 Freezing the Lower Layers
tf.reset_default_graph()
n_inputs = 28 * 28 # MNIST
n_hidden1 = 300 # reused
n_hidden2 = 50 # reused
n_hidden3 = 50 # reused
n_hidden4 = 20 # new!
n_outputs = 10 # new!
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")  # reused
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")  # reused
    hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name="hidden3")  # reused
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="hidden4")  # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs")  # new!
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")
with tf.name_scope("train"):  # not shown in the book
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)  # not shown
    # Get the list of all trainable variables in hidden layers 3 and 4 and in the
    # output layer. Leaving out hidden layers 1 and 2 freezes them.
    train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                   scope="hidden[34]|outputs")
    training_op = optimizer.minimize(loss, var_list=train_vars)
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]")  # regular expression
restore_saver = tf.train.Saver(reuse_vars)  # to restore layers 1-3
init = tf.global_variables_initializer()
saver = tf.train.Saver()
with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)
    save_path = saver.save(sess, "./my_new_model_final.ckpt")
Note that the code within the name scope “train” determines the frozen layers (hidden1 and hidden2), whereas restore_saver determines the reused layers (hidden1, hidden2, and hidden3).
Alternatively, use tf.stop_gradient() to block gradients from flowing below a given layer, which freezes that layer and all the layers below it. The training code under the name scope “train” then stays as usual and needs no modification.
hidden2_stop=tf.stop_gradient(hidden2)
hidden3 = tf.layers.dense(hidden2_stop, n_hidden3, activation=tf.nn.relu, name="hidden3")
11.2.4 Caching the Frozen Layers
Since the frozen layers won’t change, it is possible to cache the output of the topmost frozen layer for each training instance. Since training goes through the whole dataset many times, this will give you a huge speed boost as you will only need to go through the frozen layers once per training instance (instead of once per epoch). For example, you could first run the whole training set through the lower layers (assuming you have enough RAM):
tf.reset_default_graph()
n_inputs = 28 * 28 # MNIST
n_hidden1 = 300 # reused
n_hidden2 = 50 # reused
n_hidden3 = 50 # reused
n_hidden4 = 20 # new!
n_outputs = 10 # new!
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              name="hidden1")  # reused, frozen
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              name="hidden2")  # reused, frozen & cached
    # another way to freeze the lower layers
    hidden2_stop = tf.stop_gradient(hidden2)
    hidden3 = tf.layers.dense(hidden2_stop, n_hidden3, activation=tf.nn.relu,
                              name="hidden3")  # reused, not frozen
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu,
                              name="hidden4")  # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs")  # new!
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]")  # regular expression
restore_saver = tf.train.Saver(reuse_vars)  # to restore layers 1-3
init = tf.global_variables_initializer()
saver = tf.train.Saver()
import numpy as np
n_batches = len(X_train) // batch_size
with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")
    # run the whole training and validation sets through the frozen layers once
    h2_cache = sess.run(hidden2, feed_dict={X: X_train})
    h2_cache_valid = sess.run(hidden2, feed_dict={X: X_valid})  # not shown in the book
    for epoch in range(n_epochs):
        shuffled_idx = np.random.permutation(len(X_train))
        hidden2_batches = np.array_split(h2_cache[shuffled_idx], n_batches)
        y_batches = np.array_split(y_train[shuffled_idx], n_batches)
        for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):
            # feed the cached outputs directly into hidden2
            sess.run(training_op, feed_dict={hidden2: hidden2_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={hidden2: h2_cache_valid,
                                                y: y_valid})  # not shown
        print(epoch, "Validation accuracy:", accuracy_val)  # not shown
    save_path = saver.save(sess, "./my_new_model_final.ckpt")
11.2.5 Tweaking, Dropping, or Replacing the Upper Layers
You want to find the right number of layers to reuse. Try freezing all the copied layers first, then train your model and see how it performs. Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. The more training data you have, the more layers you can unfreeze.
If you still cannot get good performance, and you have little training data, try dropping the top hidden layer(s) and freeze all remaining hidden layers again. You can iterate until you find the right number of layers to reuse. If you have plenty of training data, you may try replacing the top hidden layers instead of dropping them, and even add more hidden layers.
11.2.6 Model Zoos
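Where can you find a neural network trained for a task similar to yours? The first place to look is your own catalog of models: this is one good reason to save all your models and organize them so you can retrieve them easily later. Another option is to search in a model zoo.
TensorFlow has its own model zoo available at https://github.com/tensorflow/models.
Caffe also has a model zoo; a converter from Caffe models to TensorFlow is available at https://github.com/ethereon/caffe-tensorflow.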
11.2.7 Unsupervised Pretraining
Unsupervised Pretraining: If you have plenty of unlabeled training data, you can try to train the layers one by one, starting with the lowest layer and then going up, using an unsupervised feature detector algorithm such as Restricted Boltzmann Machines (RBMs; see Appendix E) or autoencoders (see Chapter 15). Each layer is trained on the output of the previously trained layers (all layers except the one being trained are frozen). Once all layers have been trained this way, you can finetune the network using supervised learning (i.e., with backpropagation).
11.2.8 Pretraining on an Auxiliary Task
One last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first neural network’s lower layers will learn feature detectors that will likely be reusable by the second neural network.
It is often rather cheap to gather unlabeled training examples, but quite expensive to label them. In this situation, a common technique is to label all your training examples as “good,” then generate many new training instances by corrupting the good ones, and label these corrupted instances as “bad”. Then you can train a first neural network to classify instances as good or bad.
Another approach is to train a first network to output a score for each training instance, and use a cost function that ensures that a good instance’s score is greater than a bad instance’s score by at least some margin. This is called max margin learning.
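The margin idea can be made concrete with a small NumPy sketch. The notes do not spell out the exact cost function, so the hinge-style formula below is an assumption (one common way to express "good beats bad by at least some margin"), and the function name `max_margin_loss` and the sample scores are made up for illustration.

```python
import numpy as np

def max_margin_loss(good_scores, bad_scores, margin=1.0):
    # Penalize every pair where the good score does not exceed the bad score
    # by at least `margin`; pairs that satisfy the margin contribute 0.
    return np.mean(np.maximum(0.0, margin - (good_scores - bad_scores)))

# Hypothetical scores output by the first network for three good/bad pairs.
good = np.array([2.0, 0.5, 3.0])
bad = np.array([0.0, 0.2, 3.5])
loss = max_margin_loss(good, bad)
print(loss)
```

Only the second and third pairs contribute to the loss here, because the first pair already satisfies the margin.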
11.3 Faster Optimizers
Ways to speed up training:
- applying a good initialization strategy for the connection weights
- using a good activation function
- using Batch Normalization
- reusing parts of a pretrained network
- using a faster optimizer than the regular Gradient Descent optimizer
  - Momentum optimization
  - Nesterov Accelerated Gradient
  - AdaGrad
  - RMSProp
  - Adam optimization
11.3.1 Momentum Optimization
Gradient Descent $\theta\leftarrow \theta-\eta\nabla_\theta J(\theta)$ does not care about what the earlier gradients were.
Momentum optimization cares a great deal about what previous gradients were: at each iteration, it adds the local gradient to the momentum vector $\textbf m$ (multiplied by the learning rate $\eta$), and it updates the weights by simply subtracting this momentum vector (see Equation 11-4). In other words, the gradient is used as an acceleration, not as a speed. To simulate some sort of friction mechanism and prevent the momentum from growing too large, the algorithm introduces a new hyperparameter $\beta$, simply called the momentum, which must be set between 0 (high friction) and 1 (no friction). A typical momentum value is 0.9.
Equation 11-4. Momentum algorithm
$$
\begin{aligned}
1.\quad & \textbf m\leftarrow \beta\textbf m +\eta\nabla_\theta J(\theta)\\
2.\quad & \theta \leftarrow \theta -\textbf m
\end{aligned}
$$
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9)
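As a rough illustration of Equation 11-4, the two update steps can be written in a few lines of NumPy. This is a minimal sketch, not TensorFlow's implementation: the quadratic cost, its `grad` function, the starting point, and the hyperparameter values are all assumptions chosen for the example.

```python
import numpy as np

def grad(theta):
    # Gradient of the assumed cost J(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2),
    # an elongated bowl that is 10x steeper along the second dimension.
    return np.array([1.0, 10.0]) * theta

def momentum_descent(theta0, eta=0.01, beta=0.9, n_steps=500):
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)
    for _ in range(n_steps):
        m = beta * m + eta * grad(theta)  # step 1: accumulate the gradient into m
        theta = theta - m                 # step 2: move by the momentum vector
    return theta

theta_final = momentum_descent([5.0, 5.0])
print(theta_final)
```

With `beta=0` the loop reduces to plain Gradient Descent, which makes it easy to compare the two on the same cost.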
11.3.2 Nesterov Accelerated Gradient
The idea of Nesterov Momentum optimization, or Nesterov Accelerated Gradient (NAG), is to measure the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum (see Equation 11-5). NAG is almost always faster than vanilla Momentum optimization.
Equation 11-5. Nesterov Accelerated Gradient algorithm
$$
\begin{aligned}
1.\quad & \textbf m\leftarrow \beta\textbf m +\eta\nabla_\theta J(\theta-\beta\textbf m)\\
2.\quad & \theta \leftarrow \theta -\textbf m
\end{aligned}
$$
This small tweak works because in general the momentum vector will be pointing in the right direction (i.e., toward the optimum), so it will be slightly more accurate to use the gradient measured a bit farther in that direction rather than using the gradient at the original position. Moreover, it helps reduce oscillations and thus converges faster.
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9, use_nesterov=True)
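The only difference from plain momentum is where the gradient is measured. The NumPy sketch below assumes the same sign convention as Equation 11-4 (the weights move by $-\textbf m$), so the look-ahead point is $\theta-\beta\textbf m$; the quadratic cost and all hyperparameter values are assumptions for illustration only.

```python
import numpy as np

def grad(theta):
    # Gradient of the assumed cost J(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2).
    return np.array([1.0, 10.0]) * theta

def nesterov_descent(theta0, eta=0.01, beta=0.9, n_steps=500):
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)
    for _ in range(n_steps):
        lookahead = theta - beta * m          # position slightly ahead along the momentum
        m = beta * m + eta * grad(lookahead)  # gradient measured at the look-ahead point
        theta = theta - m
    return theta

theta_final = nesterov_descent([5.0, 5.0])
print(theta_final)
```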
11.3.3 AdaGrad
Gradient Descent starts by quickly going down the steepest direction, then slowly goes down to the optimum. AdaGrad tries to detect the right direction to the optimum early.
The AdaGrad algorithm achieves this by scaling down the gradient vector along the steepest dimensions (see Equation 11-6):
Equation 11-6. AdaGrad algorithm
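$$
\begin{aligned}
1.\quad & \textbf s\leftarrow \textbf s +\nabla_\theta J(\theta)\otimes\nabla_\theta J(\theta)\\
2.\quad & \theta \leftarrow \theta -\eta\nabla_\theta J(\theta)\oslash\sqrt{\textbf s+\epsilon}
\end{aligned}
$$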
The first step accumulates the square of the gradients into the vector $\textbf s$ (the $\otimes$ symbol represents the element-wise multiplication). This vectorized form is equivalent to computing $s_i \leftarrow s_i+(\partial J(\theta)/\partial \theta_i)^2$ for each element $s_i$ of the vector $\textbf s$; in other words, each $s_i$ accumulates the squares of the partial derivative of the cost function with regards to parameter $\theta_i$. If the cost function is steep along the $i$th dimension, then $s_i$ will get larger and larger at each iteration.
The second step is almost identical to Gradient Descent, but with one big difference: the gradient vector is scaled down by a factor of $\sqrt{\textbf s+\epsilon}$ (the $\oslash$ symbol represents the element-wise division, and $\epsilon$ is a smoothing term to avoid division by zero, typically set to $10^{-10}$). This vectorized form is equivalent to computing $\theta_i\leftarrow \theta_i -\eta\,(\partial J(\theta)/\partial \theta_i)/\sqrt{s_i+\epsilon}$ for all parameters $\theta_i$ (simultaneously).
In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. It helps point the resulting updates more directly toward the global optimum (see Figure 11-7). One additional benefit is that it requires much less tuning of the learning rate hyperparameter $\eta$.
optimizer=tf.train.AdagradOptimizer(learning_rate=learning_rate)
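The per-dimension scaling is easy to see in a small NumPy sketch of Equation 11-6. The quadratic cost (10 times steeper along the second dimension), the starting point, and the hyperparameter values are assumptions made for this example; it is not TensorFlow's implementation.

```python
import numpy as np

def grad(theta):
    # Gradient of the assumed cost J(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2).
    return np.array([1.0, 10.0]) * theta

def adagrad(theta0, eta=0.5, eps=1e-10, n_steps=100):
    theta = np.array(theta0, dtype=float)
    s = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad(theta)
        s = s + g * g                               # step 1: accumulate squared gradients
        theta = theta - eta * g / np.sqrt(s + eps)  # step 2: scaled gradient step
    return theta

theta_final = adagrad([5.0, 5.0])
print(theta_final)
```

Both coordinates start at 5.0 and end up at (almost exactly) the same value even though the second gradient is 10 times larger: dividing by $\sqrt{s_i+\epsilon}$ cancels the difference in steepness, so the updates head straight for the optimum instead of rushing down the steep direction first.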
11.3.4 RMSProp
Although AdaGrad slows down a bit too fast and ends up never converging to the global optimum, the RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay in the first step (see Equation 11-7).
Equation 11-7. RMSProp algorithm
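$$
\begin{aligned}
1.\quad & \textbf s\leftarrow \beta\textbf s +(1-\beta)\nabla_\theta J(\theta)\otimes\nabla_\theta J(\theta)\\
2.\quad & \theta \leftarrow \theta -\eta\nabla_\theta J(\theta)\oslash\sqrt{\textbf s+\epsilon}
\end{aligned}
$$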
optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate,
                                      momentum=0.9, decay=0.9, epsilon=1e-10)
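A minimal NumPy sketch of the RMSProp update (exponentially decaying average of squared gradients, without the extra `momentum` term used in the TensorFlow call above) might look like this. The quadratic cost and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def grad(theta):
    # Gradient of the assumed cost J(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2).
    return np.array([1.0, 10.0]) * theta

def rmsprop(theta0, eta=0.05, beta=0.9, eps=1e-10, n_steps=500):
    theta = np.array(theta0, dtype=float)
    s = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad(theta)
        s = beta * s + (1 - beta) * g * g           # step 1: decaying average of g**2
        theta = theta - eta * g / np.sqrt(s + eps)  # step 2: scaled gradient step
    return theta

theta_final = rmsprop([5.0, 5.0])
print(theta_final)
```

Because old gradients are forgotten, the effective step size does not shrink toward zero as it does with AdaGrad; with a constant learning rate the iterate ends up hovering close to the optimum rather than stalling far from it.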
11.3.5 Adam Optimization
Adam, which stands for adaptive moment estimation, combines the ideas of Momentum optimization and RMSProp: just like Momentum optimization it keeps track of an exponentially decaying average of past gradients, and just like RMSProp it keeps track of an exponentially decaying average of past squared gradients (see Equation 11-8).
Equation 11-8. Adam algorithm
$$
\begin{aligned}
1.\quad & \textbf m\leftarrow \beta_1\textbf m +(1-\beta_1)\nabla_\theta J(\theta)\\
2.\quad & \textbf s\leftarrow \beta_2\textbf s +(1-\beta_2)\nabla_\theta J(\theta)\otimes\nabla_\theta J(\theta)\\
3.\quad & \textbf m\leftarrow \frac{\textbf m}{1-\beta_1^T}\\
4.\quad & \textbf s\leftarrow \frac{\textbf s}{1-\beta_2^T}\\
5.\quad & \theta \leftarrow \theta -\eta\textbf m\oslash\sqrt{\textbf s+\epsilon}
\end{aligned}
$$
- $T$ represents the iteration number (starting at 1).
optimizer=tf.train.AdamOptimizer(learning_rate=learning_rate)
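The five steps of Equation 11-8 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the quadratic cost and the values of `eta`, `beta1`, `beta2`, and `eps` are chosen for the example, and the bias-corrected estimates are kept in separate variables (`m_hat`, `s_hat`) so that the running averages themselves are not overwritten between iterations.

```python
import numpy as np

def grad(theta):
    # Gradient of the assumed cost J(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2).
    return np.array([1.0, 10.0]) * theta

def adam(theta0, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)
    s = np.zeros_like(theta)
    for t in range(1, n_steps + 1):       # t plays the role of T (starts at 1)
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g       # step 1: decaying average of gradients
        s = beta2 * s + (1 - beta2) * g * g   # step 2: decaying average of squared gradients
        m_hat = m / (1 - beta1 ** t)          # step 3: bias correction for m
        s_hat = s / (1 - beta2 ** t)          # step 4: bias correction for s
        theta = theta - eta * m_hat / np.sqrt(s_hat + eps)  # step 5
    return theta

theta_final = adam([5.0, 5.0])
print(theta_final)
```

The bias correction matters mostly in the first iterations: `m` and `s` start at zero, so without steps 3 and 4 the early estimates would be biased toward zero.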
Training Sparse Models
If you need a blazingly fast model at runtime, or if you need it to take up less memory, you may prefer to end up with a sparse model instead.
One trivial way to achieve this is to train the model as usual, then get rid of the tiny weights (set them to 0).
Another option is to apply strong $\ell_1$ regularization during training, as it pushes the optimizer to zero out as many weights as it can (as discussed in Chapter 4 about Lasso Regression).
However, in some cases these techniques may remain insufficient. One last option is to apply Dual Averaging, often called Follow The Regularized Leader (FTRL), a technique proposed by Yurii Nesterov. When used with $\ell_1$ regularization, this technique often leads to very sparse models. TensorFlow implements a variant of FTRL called FTRL-Proximal in the tf.train.FtrlOptimizer class.
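The first, trivial approach mentioned above (zeroing out tiny weights after training) can be sketched with NumPy. The function name `prune_small_weights`, the threshold value, and the sample weight matrix are hypothetical and only serve the illustration.

```python
import numpy as np

def prune_small_weights(weights, threshold=1e-2):
    # Set every weight whose magnitude is below `threshold` to exactly 0
    # and report the resulting fraction of zero weights.
    pruned = np.where(np.abs(weights) < threshold, 0.0, weights)
    sparsity = np.mean(pruned == 0.0)
    return pruned, sparsity

# Hypothetical trained weights.
w = np.array([[0.5, -0.003, 0.0],
              [0.009, -1.2, 0.02]])
pruned, sparsity = prune_small_weights(w)
print(pruned)
print(sparsity)
```

In practice the threshold would need to be chosen by checking how much validation accuracy is lost as more weights are removed.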
11.3.6 Learning Rate Scheduling
You can do better than a constant learning rate: if you start with a high learning rate and then reduce it once it stops making fast progress, you can reach a good solution faster than with the optimal constant learning rate. There are many different strategies to reduce the learning rate during training. These strategies are called learning schedules, the most common of which are:
Predetermined piecewise constant learning rate
For example, set the learning rate to $\eta_0 = 0.1$ at first, then to $\eta_1 = 0.001$ after 50 epochs. Although this solution can work very well, it often requires fiddling around to figure out the right learning rates and when to use them.
Performance scheduling
Measure the validation error every $N$ steps (just like for early stopping) and reduce the learning rate by a factor of $\lambda$ when the error stops dropping.
Exponential scheduling
Set the learning rate to a function of the iteration number $t$: $\eta(t) = \eta_0 10^{-t/r}$. This works great, but it requires tuning $\eta_0$ and $r$. The learning rate will drop by a factor of 10 every $r$ steps.
Power scheduling
Set the learning rate to $\eta(t) = \eta_0 (1 + t/r)^{-c}$. The hyperparameter $c$ is typically set to 1. This is similar to exponential scheduling, but the learning rate drops much more slowly.
import tensorflow as tf
import numpy as np
tf.reset_default_graph()
n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")
with tf.name_scope("train"):  # not shown in the book
    initial_learning_rate = 0.1
    decay_steps = 10000
    decay_rate = 1/10
    # global_step is incremented by minimize() at every training step
    global_step = tf.Variable(0, trainable=False, name="global_step")
    learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                               decay_steps, decay_rate)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss, global_step=global_step)
init = tf.global_variables_initializer()
saver = tf.train.Saver()
def shuffle_batch(X, y, batch_size):
    # Yield shuffled mini-batches covering the whole training set once
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]
n_epochs = 5
batch_size = 50
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)
        # The learning rate depends only on global_step, so no feed_dict is needed
        print(global_step.eval(), learning_rate.eval())
    save_path = saver.save(sess, "./my_model_final.ckpt")
0 Validation accuracy: 0.956
1100 0.077624716
1 Validation accuracy: 0.9702
2200 0.060255956
2 Validation accuracy: 0.9768
3300 0.04677352
3 Validation accuracy: 0.9806
4400 0.036307808
4 Validation accuracy: 0.9818
5500 0.02818383
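The schedules described in this section can also be written as plain Python functions, which makes them easy to inspect outside of a TensorFlow graph. This is only a sketch: the function names and default values are assumptions, and the exponential schedule uses $\eta_0 = 0.1$ and $r = 10000$ to match `initial_learning_rate`, `decay_steps`, and `decay_rate = 1/10` in the listing above.

```python
def piecewise_constant(epoch, boundaries=(50,), values=(0.1, 0.001)):
    # Return values[i] while epoch is below boundaries[i], else the last value.
    for boundary, value in zip(boundaries, values):
        if epoch < boundary:
            return value
    return values[-1]

def exponential_schedule(t, eta0=0.1, r=10000):
    # eta(t) = eta0 * 10^(-t/r): drops by a factor of 10 every r steps.
    return eta0 * 10 ** (-t / r)

def power_schedule(t, eta0=0.1, r=10000, c=1):
    # eta(t) = eta0 * (1 + t/r)^(-c): drops much more slowly.
    return eta0 * (1 + t / r) ** (-c)

for step in (1100, 2200, 3300, 4400, 5500):
    print(step, exponential_schedule(step))
```

The values printed by the loop agree with the learning rates logged after each epoch above (up to float32 rounding), for example about 0.0776 at step 1100 and about 0.0282 at step 5500.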
Since AdaGrad, RMSProp, and Adam optimization automatically reduce the learning rate during training, it is not necessary to add an extra learning schedule. For other optimization algorithms, using exponential decay or performance scheduling can considerably speed up convergence.
11.4 Avoiding Overfitting Through Regularization
11.4.1 Early Stopping
11.4.2 $\ell_1$ and $\ell_2$ Regularization
import tensorflow as tf
import numpy as np
# shuffle_batch, X_train, y_train, X_valid and y_valid come from the
# learning rate scheduling listing in the previous section
tf.reset_default_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_outputs = 10
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    logits = tf.layers.dense(hidden1, n_outputs, activation=None, name="outputs")
# Fetch the weight matrices of both layers by name
W1 = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
W2 = tf.get_default_graph().get_tensor_by_name("outputs/kernel:0")
scale = 0.001  # l1 regularization hyperparameter
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
    base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")
    # l1 penalty: sum of the absolute values of all connection weights
    reg_losses = tf.reduce_sum(tf.abs(W1)) + tf.reduce_sum(tf.abs(W2))
    loss = tf.add(base_loss, scale * reg_losses, name="loss")
learning_rate = 0.1
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    training_op = optimizer.minimize(loss)
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
init = tf.global_variables_initializer()
saver = tf.train.Saver()
n_epochs = 20
batch_size = 200
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)
    save_path = saver.save(sess, save_path="./my_final_model.ckpt")
11.4.3 Dropout
The most popular regularization technique for deep neural networks is arguably dropout. It was proposed by G. E. Hinton in 2012 and further detailed in a paper by Nitish Srivastava et al., and it has proven to be highly successful: even the state-of-the-art neural networks got a 1–2% accuracy boost simply by adding dropout. This may not sound like a lot, but when a model already has 95% accuracy, getting a 2% accuracy boost means dropping the error rate by almost 40% (going from 5% error to roughly 3%).
It is a fairly simple algorithm: at every training step, every neuron (including the input neurons but excluding the output neurons) has a probability $p$ of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step (see Figure 11-9). The hyperparameter $p$ is called the dropout rate, and it is typically set to 50%. After training, neurons don’t get dropped anymore.
There is one small but important technical detail. Suppose $p$ = 50%, in which case during testing a neuron will be connected to twice as many input neurons as it was (on average) during training. To compensate for this fact, we need to multiply each neuron’s input connection weights by 0.5 after training. If we don’t, each neuron will get a total input signal roughly twice as large as what the network was trained on, and it is unlikely to perform well. More generally, we need to multiply each input connection weight by the keep probability $(1 - p)$ after training. Alternatively, we can divide each neuron’s output by the keep probability during training (these alternatives are not perfectly equivalent, but they work equally well).
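The second alternative (dividing by the keep probability during training) can be illustrated with a short NumPy sketch. The function name `dropout_train`, the random seed, and the all-ones activations are assumptions made for the example.

```python
import numpy as np

def dropout_train(activations, p, rng):
    # Drop each activation with probability p, then divide the survivors
    # by the keep probability so the expected value is unchanged.
    keep_prob = 1.0 - p
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
a = np.ones((1000, 100))
dropped = dropout_train(a, p=0.5, rng=rng)
print(np.unique(dropped), dropped.mean())
```

With $p$ = 50%, every surviving activation is doubled and the rest are zero, so the average signal stays close to 1. At test time nothing is dropped and no rescaling is needed.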
import tensorflow as tf
import numpy as np
# shuffle_batch, X_train, y_train, X_valid and y_valid come from the
# learning rate scheduling listing in the previous section
tf.reset_default_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
# Dropout is only active when this placeholder is fed True
training = tf.placeholder_with_default(False, shape=(), name="training")
dropout_rate = 0.5  # == 1 - keep_prob
X_drop = tf.layers.dropout(X, dropout_rate, training=training)
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu,
                              name="hidden1")
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
    hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu,
                              name="hidden2")
    hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
    logits = tf.layers.dense(hidden2_drop, n_outputs, name="outputs")
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
    loss = tf.reduce_mean(xentropy, name="loss")
learning_rate = 0.01
with tf.name_scope("train"):
    optimizer = tf.train.MomentumOptimizer(momentum=0.9, learning_rate=learning_rate)
    training_op = optimizer.minimize(loss)
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
init = tf.global_variables_initializer()
saver = tf.train.Saver()
n_epochs = 20
batch_size = 50
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            # training must be set to True here, otherwise dropout is never applied
            sess.run(training_op, feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)
    save_path = saver.save(sess, "./my_final_model.ckpt")
11.4.4 Max-Norm Regularization
Another regularization technique that is quite popular for neural networks is called max-norm regularization: for each neuron, it constrains the weights $\textbf w$ of the incoming connections such that $||\textbf w||_2\le r$, where $r$ is the max-norm hyperparameter and $||\cdot||_2$ is the $\ell_2$ norm.
We typically implement this constraint by computing $||\textbf w||_2$ after each training step and clipping $\textbf w$ if needed ($\textbf w \leftarrow \textbf w\frac{r}{||\textbf w||_2}$).
Reducing r increases the amount of regularization and helps reduce overfitting. Max-norm regularization can also help alleviate the vanishing/exploding gradients problems (if you are not using Batch Normalization).
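The clipping rule can be sketched in NumPy as follows. The sketch assumes a weight matrix of shape (n_inputs, n_neurons), so that each column holds the incoming weights of one neuron; this axis choice, the function name `clip_max_norm`, and the sample values are assumptions of the example.

```python
import numpy as np

def clip_max_norm(W, r):
    # l2 norm of each neuron's incoming weight vector (one norm per column).
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    # Rescale only the columns whose norm exceeds r; leave the others untouched.
    scale = np.minimum(1.0, r / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])
W_clipped = clip_max_norm(W, r=1.0)
print(W_clipped)
```

The first column has norm 5 and is rescaled to norm 1, while the second column (norm 0.5) already satisfies the constraint and is left as is.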
import tensorflow as tf
import numpy as np
# shuffle_batch, X_train, y_train, X_valid and y_valid come from the
# learning rate scheduling listing in the previous section
tf.reset_default_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
    loss = tf.reduce_mean(xentropy, name="loss")
learning_rate = 0.01
momentum = 0.9
with tf.name_scope("train"):
    optimizer = tf.train.MomentumOptimizer(momentum=momentum, learning_rate=learning_rate)
    training_op = optimizer.minimize(loss)
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
threshold = 1.0  # max-norm hyperparameter r
# Build one clipping op per hidden layer; they are run after each training step
weights = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
clipped_weights = tf.clip_by_norm(weights, clip_norm=threshold, axes=1)
clip_weights = tf.assign(weights, clipped_weights)
weights2 = tf.get_default_graph().get_tensor_by_name("hidden2/kernel:0")
clipped_weights2 = tf.clip_by_norm(weights2, clip_norm=threshold, axes=1)
clip_weights2 = tf.assign(weights2, clipped_weights2)
init = tf.global_variables_initializer()
saver = tf.train.Saver()
n_epochs = 20
batch_size = 50
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            clip_weights.eval()
            clip_weights2.eval()
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)
    save_path = saver.save(sess, "./my_final_model.ckpt")
A cleaner solution is to create a max_norm_regularizer() function and use it just like the earlier l1_regularizer() function. This function returns a parametrized max_norm() function that you can use like any other regularizer.
import tensorflow as tf
import numpy as np
# shuffle_batch, X_train, y_train, X_valid and y_valid come from the
# learning rate scheduling listing in the previous section
tf.reset_default_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
def max_norm_regularizer(threshold, axes=1, name="max_norm",
                         collection="max_norm"):
    def max_norm(weights):
        # Create the clipping op and register it in a collection for later use
        clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
        clip_weights = tf.assign(weights, clipped, name=name)
        tf.add_to_collection(collection, clip_weights)
        return None  # there is no regularization loss term
    return max_norm
max_norm_reg = max_norm_regularizer(threshold=1.0)
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              kernel_regularizer=max_norm_reg, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              kernel_regularizer=max_norm_reg, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
    loss = tf.reduce_mean(xentropy, name="loss")
learning_rate = 0.01
momentum = 0.9
with tf.name_scope("train"):
    optimizer = tf.train.MomentumOptimizer(momentum=momentum, learning_rate=learning_rate)
    training_op = optimizer.minimize(loss)
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
init = tf.global_variables_initializer()
saver = tf.train.Saver()
n_epochs = 20
batch_size = 50
# All clipping ops registered by the regularizer
clip_all_weights = tf.get_collection("max_norm")
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            sess.run(clip_all_weights)
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)
    save_path = saver.save(sess, "./my_final_model.ckpt")
11.4.5 Data Augmentation
It is often preferable to generate training instances on the fly during training rather than wasting storage space and network bandwidth. TensorFlow offers several image manipulation operations such as transposing (shifting), rotating, resizing, flipping, and cropping, as well as adjusting the brightness, contrast, saturation, and hue (see the API documentation for more details). This makes it easy to implement data augmentation for image datasets.
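Generating variants on the fly can be sketched with NumPy alone. The example below is a hypothetical illustration, not TensorFlow's image API: the function name `random_augment`, the choice of a random horizontal flip followed by a random shift of up to `max_shift` pixels (implemented by zero-padding and cropping), and the stand-in image are all assumptions.

```python
import numpy as np

def random_augment(image, rng, max_shift=2):
    # Flip the image horizontally with probability 0.5.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Shift by up to max_shift pixels in each direction: pad with zeros,
    # then crop a window of the original size at a random offset.
    h, w = image.shape
    padded = np.pad(image, max_shift, mode="constant")
    dy, dx = rng.integers(0, 2 * max_shift + 1, size=2)
    return padded[dy:dy + h, dx:dx + w]

rng = np.random.default_rng(0)
image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)  # stand-in for an MNIST digit
augmented = random_augment(image, rng)
print(augmented.shape)
```

Calling such a function inside the batch loop yields a slightly different version of each image at every epoch, without storing any extra data. Note that flipping is not appropriate for every dataset (a mirrored digit is usually not a valid digit), so the set of transformations should be chosen per task.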
11.5 Practical Guidelines
Table 11-2. Default DNN configuration
Hyperparameter | Default |
---|---|
Initialization | He initialization |
Activation function | ELU |
Normalization | Batch Normalization |
Regularization | Dropout |
Optimizer | Adam |
Learning rate schedule | None |
This default configuration may need to be tweaked:
- If you can’t find a good learning rate (convergence was too slow, so you increased the learning rate, and now convergence is fast but the network’s accuracy is suboptimal), then you can try adding a learning schedule such as exponential decay.
- If your training set is a bit too small, you can implement data augmentation.
- If you need a sparse model, you can add some $\ell_1$ regularization to the mix (and optionally zero out the tiny weights after training). If you need an even sparser model, you can try using FTRL instead of Adam optimization, along with $\ell_1$ regularization.
- If you need a lightning-fast model at runtime, you may want to drop Batch Normalization, and possibly replace the ELU activation function with the leaky ReLU. Having a sparse model will also help.