Chapter 11 Training Deep Neural Nets

Reading notes on O'Reilly's Hands-On Machine Learning with Scikit-Learn and TensorFlow.

This is a chapter where every word counts, so this note inevitably omits some important content.

Problems that arise when training deep neural networks:

  • vanishing/exploding gradients
  • extremely slow training
  • overfitting

11.1 Vanishing/Exploding Gradients Problems

Vanishing gradients problem: gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the connection weights of the lower layers are left virtually unchanged during training, and the algorithm never converges to a good solution.

Exploding gradients problem: the gradients grow larger and larger, so many layers get insanely large weight updates and training diverges.

Some suspects for vanishing gradients: the logistic sigmoid activation function and random initialization using a normal distribution with a mean of 0 and a standard deviation of 1.

With this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing after each layer until the activation function saturates at the top layers. This is actually made worse by the fact that the logistic function has a mean of 0.5, not 0.
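
To see this concretely, here is a minimal NumPy sketch (my own illustration, not from the book) that pushes a batch through a few logistic layers initialized from a standard normal distribution; the pre-activation variance is far larger than the input variance, and the outputs pile up near 0 and 1, where the logistic function's gradient is close to 0:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(42)
x = rng.randn(1000, 100)     # a batch of 1,000 instances with 100 features
for layer in range(5):
    W = rng.randn(100, 100)  # weights with mean 0 and standard deviation 1
    z = x @ W                # the variance of z grows with the fan-in
    print("layer", layer, "pre-activation variance:", z.var())
    x = sigmoid(z)           # saturates: outputs pile up near 0 and 1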

11.1.1 Xavier and He Initialization

We need the signal to flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. We don't want the signal to die out (approach 0), nor do we want it to explode (approach infinity) or saturate (stay stuck at a constant).

We need the variance of the outputs of each layer to be equal to the variance of its inputs, and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction.

The connection weights must be initialized randomly as described in Equation 11-1, where $n_{inputs}$ and $n_{outputs}$ are the number of input and output connections for the layer whose weights are being initialized (also called fan-in and fan-out). This initialization strategy is often called Xavier initialization (after the author's first name), or sometimes Glorot initialization.

Equation 11-1. Xavier initialization (when using the logistic activation function)
$$
\textrm{Normal distribution with mean 0 and standard deviation }\sigma=\sqrt{\frac{2}{n_{inputs}+n_{outputs}}}\\
\textrm{Or a uniform distribution between }-r\textrm{ and }+r\textrm{, with }r=\sqrt{\frac{6}{n_{inputs}+n_{outputs}}}
$$
Table 11-1. Initialization parameters for each type of activation function

| Activation function | Uniform distribution [−r, r] | Normal distribution |
| --- | --- | --- |
| Logistic | $r=\sqrt{\frac{6}{n_{inputs}+n_{outputs}}}$ | $\sigma=\sqrt{\frac{2}{n_{inputs}+n_{outputs}}}$ |
| Hyperbolic tangent | $r=4\sqrt{\frac{6}{n_{inputs}+n_{outputs}}}$ | $\sigma=4\sqrt{\frac{2}{n_{inputs}+n_{outputs}}}$ |
| ReLU (and its variants) | $r=\sqrt{2}\sqrt{\frac{6}{n_{inputs}+n_{outputs}}}$ | $\sigma=\sqrt{2}\sqrt{\frac{2}{n_{inputs}+n_{outputs}}}$ |

import tensorflow as tf
n_inputs=28*28 #MNIST
n_hidden1=300

X=tf.placeholder(tf.float32,shape=(None,n_inputs),name="X")

he_init=tf.variance_scaling_initializer() # He initialization
hidden1=tf.layers.dense(X,n_hidden1,activation=tf.nn.relu,
                        kernel_initializer=he_init,name="hidden1")

11.1.2 Nonsaturating Activation Functions

The vanishing/exploding gradients problems were in part due to a poor choice of activation function. The ReLU activation function behaves much better, as it does not saturate for positive values (and it is quite fast to compute).

The ReLU activation function is not perfect, however. It suffers from a problem known as dying ReLUs: during training, some neurons effectively die, meaning they stop outputting anything other than 0 (this happens when a neuron's weighted input sum is negative for all training instances, so its gradient stays at 0).

The solution is to use a variant of ReLU:

  • leaky ReLU, defined as $\textrm{LeakyReLU}_\alpha(z)=\max(\alpha z, z)$. Setting $\alpha = 0.2$ (huge leak) seemed to result in better performance than $\alpha = 0.01$ (small leak).
  • randomized leaky ReLU (RReLU), where $\alpha$ is picked randomly in a given range during training and fixed to an average value during testing.
  • parametric leaky ReLU (PReLU), where $\alpha$ is allowed to be learned during training (instead of being a hyperparameter).
  • exponential linear unit (ELU):

Equation 11-2. ELU activation function
$$
\textrm{ELU}_\alpha(z)=\begin{cases}\alpha(\exp(z)-1) & \textrm{if } z<0\\ z & \textrm{if } z\ge 0\end{cases}
$$
So which activation function should you use for the hidden layers of your deep neural networks? In general ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If you care a lot about runtime (testing) performance, then you may prefer leaky ReLUs over ELUs. If you don't want to tweak yet another hyperparameter, you may just use the default $\alpha$ values suggested earlier (0.01 for the leaky ReLU, and 1 for ELU). If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular RReLU if your network is overfitting, or PReLU if you have a huge training set.

hidden1=tf.layers.dense(X,n_hidden1,activation=tf.nn.elu,name="hidden1")
def leaky_relu(z, name=None):
	return tf.maximum(0.01 * z, z, name=name)
hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu,name="hidden1")

11.1.3 Batch Normalization

Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it doesn’t guarantee that they won’t come back during training.

Batch Normalization (BN) is proposed to address the vanishing/exploding gradients problems,
and more generally the problem that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change (which they call the Internal Covariate Shift problem).

The technique consists of adding an operation in the model just before the activation function of each layer, simply zero-centering and normalizing the inputs, then scaling and shifting the result using two new parameters per layer (one for scaling, the other for shifting). In other words, this operation lets the model learn the optimal scale and mean of the inputs for each layer.

In order to zero-center and normalize the inputs, the algorithm needs to estimate the inputs’ mean and standard deviation. It does so by evaluating the mean and standard deviation of the inputs over the current mini-batch (hence the name “Batch Normalization”). The whole operation is summarized in Equation 11-3.

Equation 11-3. Batch Normalization algorithm
$$
\begin{aligned}
1.\quad & \mu_B = \frac{1}{m_B}\sum_{i=1}^{m_B}\textbf x^{(i)}\\
2.\quad & \sigma_B^2 = \frac{1}{m_B}\sum_{i=1}^{m_B}\left(\textbf x^{(i)}-\mu_B\right)^2\\
3.\quad & \widehat{\textbf x}^{(i)} = \frac{\textbf x^{(i)}-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}\\
4.\quad & \textbf z^{(i)} = \gamma\widehat{\textbf x}^{(i)}+\beta
\end{aligned}
$$

  • $\mu_B$ is the empirical mean, evaluated over the whole mini-batch $B$.
  • $\sigma_B$ is the empirical standard deviation, also evaluated over the whole mini-batch.
  • $m_B$ is the number of instances in the mini-batch.
  • $\widehat{\textbf x}^{(i)}$ is the zero-centered and normalized input.
  • $\gamma$ is the scaling parameter for the layer.
  • $\beta$ is the shifting parameter (offset) for the layer.
  • $\epsilon$ is a tiny number to avoid division by zero (typically $10^{-3}$). This is called a smoothing term.
  • $\textbf z^{(i)}$ is the output of the BN operation: it is a scaled and shifted version of the inputs.

At test time, there is no mini-batch to compute the empirical mean and standard deviation, so instead you simply use the whole training set's mean and standard deviation. These are typically estimated efficiently during training using a moving average. So, in total, four parameters are learned for each batch-normalized layer: $\gamma$ (scale), $\beta$ (offset), $\mu$ (mean), and $\sigma$ (standard deviation).
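
The momentum argument used in the implementation section below controls these running estimates. A tiny sketch of the idea (my own, not TensorFlow's actual implementation):

momentum = 0.9
moving_mean = 0.0
for batch_mean in [0.8, 1.1, 0.9, 1.0]:  # hypothetical per-batch means
    moving_mean = momentum * moving_mean + (1 - momentum) * batch_mean
print(moving_mean)  # this running estimate replaces the batch mean at test time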

Advantages:

  • considerably improved all the deep neural networks they experimented with
  • The vanishing gradients problem was strongly reduced, to the point that they could use saturating activation functions such as the tanh and even the logistic activation function.
  • The networks were also much less sensitive to the weight initialization.
  • They were able to use much larger learning rates, significantly speeding up the learning process.
  • improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
  • Batch Normalization also acts like a regularizer, reducing the need for other regularization techniques (such as dropout).

Disadvantages:

  • add some complexity to the model (although it removes the need for normalizing the input data since the first hidden layer will take care of that, provided it is batch-normalized).
  • Moreover, there is a runtime penalty: the neural network makes slower predictions due to the extra computations required at each layer. So if you need predictions to be lightning-fast, you may want to check how well plain ELU + He initialization performs before playing with Batch Normalization.
11.1.3.1 Implementing Batch Normalization with TensorFlow

Normally, in the construction phase, the code uses placeholder nodes to represent the training data and targets, and constructs the deep neural network as follows.

import tensorflow as tf
tf.reset_default_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X=tf.placeholder(tf.float32,shape=(None,n_inputs),name="X")

training=tf.placeholder_with_default(False,shape=(),name="training")

hidden1=tf.layers.dense(X,n_hidden1,name="hidden1")
bn1=tf.layers.batch_normalization(hidden1,training=training,momentum=0.9)
bn1_act=tf.nn.elu(bn1)

hidden2=tf.layers.dense(bn1_act,n_hidden2,name="hidden2")
bn2=tf.layers.batch_normalization(hidden2,training=training,momentum=0.9)
bn2_act=tf.nn.elu(bn2)

logits_before_bn=tf.layers.dense(bn2_act,n_outputs,name="outputs")
logits=tf.layers.batch_normalization(logits_before_bn,training=training,momentum=0.9)

To prevent repeating the same parameters over and over again, we can use Python’s partial() function instead.

from functools import partial

my_batch_norm_layer = partial(tf.layers.batch_normalization,
                              training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = my_batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)
hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = my_batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = my_batch_norm_layer(logits_before_bn)

The whole code is as follows.

import tensorflow as tf
from functools import partial
tf.reset_default_graph()

learning_rate = 0.01
batch_norm_momentum = 0.9

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
training = tf.placeholder_with_default(False, shape=(), name='training')

with tf.name_scope("dnn"):
    he_init = tf.variance_scaling_initializer()

    my_batch_norm_layer = partial(
            tf.layers.batch_normalization,
            training=training,
            momentum=batch_norm_momentum)

    my_dense_layer = partial(
            tf.layers.dense,
            kernel_initializer=he_init)

    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    bn1 = tf.nn.elu(my_batch_norm_layer(hidden1))
    hidden2 = my_dense_layer(bn1, n_hidden2, name="hidden2")
    bn2 = tf.nn.elu(my_batch_norm_layer(hidden2))
    logits_before_bn = my_dense_layer(bn2, n_outputs, name="outputs")
    logits = my_batch_norm_layer(logits_before_bn)

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 20
batch_size = 200
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
# The update operations needed by Batch Normalization (the moving averages)
# are added to the UPDATE_OPS collection, and you need to run them explicitly
# during training.
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")
0 Validation accuracy: 0.894
1 Validation accuracy: 0.9178
2 Validation accuracy: 0.9294
3 Validation accuracy: 0.9402
4 Validation accuracy: 0.947
5 Validation accuracy: 0.9512
6 Validation accuracy: 0.9566
7 Validation accuracy: 0.9592
8 Validation accuracy: 0.9616
9 Validation accuracy: 0.9638
10 Validation accuracy: 0.9682
11 Validation accuracy: 0.9676
12 Validation accuracy: 0.9688
13 Validation accuracy: 0.9692
14 Validation accuracy: 0.9724
15 Validation accuracy: 0.972
16 Validation accuracy: 0.9722
17 Validation accuracy: 0.973
18 Validation accuracy: 0.9736
19 Validation accuracy: 0.9728

If you train for longer it will get much better accuracy, but with such a shallow network Batch Norm and ELU are unlikely to have a very positive impact: they shine mostly for much deeper nets.

11.1.4 Gradient Clipping

A popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold. This is called Gradient Clipping. In general people now prefer Batch Normalization, but it's still useful to know about Gradient Clipping and how to implement it.

In TensorFlow, the optimizer’s minimize() function takes care of both computing the gradients and applying them, so you must instead call the optimizer’s compute_gradients() method first, then create an operation to clip the gradients using the clip_by_value() function, and finally create an operation to apply the clipped gradients using the optimizer’s apply_gradients() method:

threshold = 1.0

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)

11.2 Reusing Pretrained Layers

Transfer learning means not training a model from scratch, but reusing the lower layers of a neural network trained on a similar task. It not only speeds up training considerably, but also requires much less training data. Transfer learning requires the new task to use inputs of the same size and, more generally, works well only if the inputs have similar low-level features.

11.2.1 Reusing a TensorFlow Model

  1. Restore the whole model:

In the first case, the original code is not available; you have only the saved model at hand.

To reuse the model, first use the import_meta_graph() function to load the graph structure into the default graph. The graph structure contains the nodes and operations of the original model. The function returns a Saver that you can use to restore the model's state, i.e., the model parameters. You can then train the new model with parameters initialized from the original model; since it starts from a trained state, training is faster and requires less data.

In the second case, if you have access to the Python code that built the original graph, you can use it instead of import_meta_graph(). The code for training the new model is the same as the original model's, except in the execution phase, where the call to init.run() for parameter initialization is replaced by saver.restore(sess,"./my_model_final.ckpt").

(i) First case

First you need to load the graph’s structure. The import_meta_graph() function does just that, loading the graph’s operations into the default graph, and returning a Saver that you can then use to restore the model’s state. Note that by default, a Saver saves the structure of the graph into a .meta file, so that’s the file you should load:

import tensorflow as tf
tf.reset_default_graph()
saver = tf.train.import_meta_graph("./my_model_final.ckpt.meta")

Next you need to get a handle on all the operations you will need for training. If you don’t know the graph’s structure, you can list all the operations:

for op in tf.get_default_graph().get_operations():
    print(op.name)
X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")
accuracy = tf.get_default_graph().get_tensor_by_name("eval/accuracy:0")
training_op = tf.get_default_graph().get_operation_by_name("GradientDescent")

If you are the author of the original model, you could make things easier for people who will reuse your model by giving operations very clear names and documenting them. Another approach is to create a collection containing all the important operations that people will want to get a handle on:

for op in (X, y, accuracy, training_op):
    tf.add_to_collection("my_important_ops", op)

This way people who reuse your model will be able to simply write:

X, y, accuracy, training_op = tf.get_collection("my_important_ops")

Now you can start a session, restore the model’s state and continue training on your data:

with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")

    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_new_model_final.ckpt")   
  2. Reuse only part of the original model

In general you will want to reuse only the lower layers. If you are using import_meta_graph() it will load the whole graph, but you can simply ignore the parts you do not need. In this example, we add a new 4th hidden layer on top of the pretrained 3rd layer (ignoring the old 4th hidden layer). We also build a new output layer, the loss for this new output, and a new optimizer to minimize it. We also need another saver to save the whole graph (containing both the entire old graph plus the new operations), and an initialization operation to initialize all the new variables:

tf.reset_default_graph()

n_hidden4 = 20  # new layer
n_outputs = 10  # new layer

saver = tf.train.import_meta_graph("./my_model_final.ckpt.meta")

X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")

hidden3 = tf.get_default_graph().get_tensor_by_name("dnn/hidden3/Relu:0")

new_hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="new_hidden4")
new_logits = tf.layers.dense(new_hidden4, n_outputs, name="new_outputs")

with tf.name_scope("new_loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=new_logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("new_eval"):
    correct = tf.nn.in_top_k(new_logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

with tf.name_scope("new_train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
# Note: creating a second Saver here (new_saver = tf.train.Saver())
# raised an error in my run, so the Saver returned by import_meta_graph()
# is reused to save the new model below.
with tf.Session() as sess:
    init.run()                                    # initialize the new nodes
    saver.restore(sess, "./my_model_final.ckpt")  # restore the reused nodes

    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_new_model_final.ckpt")

If you have access to the Python code that built the original graph, you can just reuse the parts you need and drop the rest:

tf.reset_default_graph()

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300 # reused
n_hidden2 = 50  # reused
n_hidden3 = 50  # reused
n_hidden4 = 20  # new!
n_outputs = 10  # new!

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")       # reused
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2") # reused
    hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name="hidden3") # reused
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="hidden4") # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs")                         # new!

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

As the new model contains layers that share names with layers in the original model, restoring the whole model may raise conflicts, so we restore only the variables of the reused layers:

reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]") # regular expression
restore_saver = tf.train.Saver(reuse_vars) # to restore layers 1-3

init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")
    for epoch in range(n_epochs): # not shown in the book
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size): # not shown
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})        # not shown
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})     # not shown
        print(epoch, "Validation accuracy:", accuracy_val)# not shown
    save_path = saver.save(sess, "./my_new_model_final.ckpt")

11.2.2 Reusing Models from Other Frameworks

In this example, for each variable we want to reuse, we find its initializer’s assignment operation, and we get its second input, which corresponds to the initialization value. When we run the initializer, we replace the initialization values with the ones we want, using a feed_dict:

tf.reset_default_graph()

n_inputs = 2
n_hidden1 = 3

original_w = [[1., 2., 3.], [4., 5., 6.]] # Load the weights from the other framework
original_b = [7., 8., 9.]                 # Load the biases from the other framework

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,name="hidden1")
# [...] Build the rest of the model

# Get a handle on the assignment nodes for the hidden1 variables
graph = tf.get_default_graph()
assign_kernel = graph.get_operation_by_name("hidden1/kernel/Assign")
assign_bias = graph.get_operation_by_name("hidden1/bias/Assign")
init_kernel = assign_kernel.inputs[1]
init_bias = assign_bias.inputs[1]

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init, feed_dict={init_kernel: original_w, init_bias: original_b})
    # [...] Train the model on your new task
    print(hidden1.eval(feed_dict={X: [[10.0, 11.0]]}))  # not shown in the book

11.2.3 Freezing the Lower Layers

tf.reset_default_graph()

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300 # reused
n_hidden2 = 50  # reused
n_hidden3 = 50  # reused
n_hidden4 = 20  # new!
n_outputs = 10  # new!

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")       # reused
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2") # reused
    hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name="hidden3") # reused
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="hidden4") # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs")                         # new!

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

with tf.name_scope("train"): # not shown in the book
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)# not shown
    train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                   scope="hidden[34]|outputs")
#gets the list of all trainable variables in hidden layers 3 and 4 and in the #output layer. This frozen the hidden layers 1 and 2. 
    training_op = optimizer.minimize(loss, var_list=train_vars)

reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]") # regular expression
restore_saver = tf.train.Saver(reuse_vars) # to restore layers 1-3

init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")

    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_new_model_final.ckpt")

Note that, the code within the name scope “train” determines the frozen layers (hidden1 and hidden2), whereas the code of restore_saver determines the reused layers (hidden1, hidden2, and hidden3).

Alternatively, you can use tf.stop_gradient() to block gradients from propagating below a given layer's output, which freezes that layer and all the layers below it. The training code under the name scope "train" then stays as usual and needs no modification.

hidden2_stop=tf.stop_gradient(hidden2)
hidden3 = tf.layers.dense(hidden2_stop, n_hidden3, activation=tf.nn.relu, name="hidden3")

11.2.4 Caching the Frozen Layers

Since the frozen layers won’t change, it is possible to cache the output of the topmost frozen layer for each training instance. Since training goes through the whole dataset many times, this will give you a huge speed boost as you will only need to go through the frozen layers once per training instance (instead of once per epoch). For example, you could first run the whole training set through the lower layers (assuming you have enough RAM):

tf.reset_default_graph()

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300 # reused
n_hidden2 = 50  # reused
n_hidden3 = 50  # reused
n_hidden4 = 20  # new!
n_outputs = 10  # new!

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              name="hidden1") # reused frozen
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              name="hidden2") # reused frozen & cached
    hidden2_stop = tf.stop_gradient(hidden2)
    # stop_gradient is another way to freeze hidden2 and all the layers below it
    hidden3 = tf.layers.dense(hidden2_stop, n_hidden3, activation=tf.nn.relu,
                              name="hidden3") # reused, not frozen
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu,
                              name="hidden4") # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs") # new!

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]") # regular expression
restore_saver = tf.train.Saver(reuse_vars) # to restore layers 1-3

init = tf.global_variables_initializer()
saver = tf.train.Saver()

import numpy as np

n_batches = len(X_train) // batch_size

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")
    
    h2_cache = sess.run(hidden2, feed_dict={X: X_train})
    h2_cache_valid = sess.run(hidden2, feed_dict={X: X_valid}) # not shown in the book

    for epoch in range(n_epochs):
        shuffled_idx = np.random.permutation(len(X_train))
        hidden2_batches = np.array_split(h2_cache[shuffled_idx], n_batches)
        y_batches = np.array_split(y_train[shuffled_idx], n_batches)
        for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):
            sess.run(training_op, feed_dict={hidden2:hidden2_batch, y:y_batch})

        accuracy_val = accuracy.eval(feed_dict={hidden2: h2_cache_valid,
                                                y: y_valid})# not shown
        print(epoch, "Validation accuracy:", accuracy_val) # not shown

    save_path = saver.save(sess, "./my_new_model_final.ckpt")

11.2.5 Tweaking, Dropping, or Replacing the Upper Layers

You want to find the right number of layers to reuse. Try freezing all the copied layers first, then train your model and see how it performs. Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. The more training data you have, the more layers you can unfreeze.

If you still cannot get good performance, and you have little training data, try dropping the top hidden layer(s) and freeze all remaining hidden layers again. You can iterate until you find the right number of layers to reuse. If you have plenty of training data, you may try replacing the top hidden layers instead of dropping them, and even add more hidden layers.
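
As a sketch of this iteration (my own, reusing the var_list pattern from the freezing example and its scope names), you can control which layers get trained simply by changing the scope regular expression between runs:

# first run: all reused layers frozen, only the new layers get trained
train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                               scope="hidden4|outputs")
# if performance plateaus, unfreeze the top reused layer as well:
# train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
#                                scope="hidden[34]|outputs")
training_op = optimizer.minimize(loss, var_list=train_vars)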

11.2.6 Model Zoos

Where can you find a neural network trained on a similar task? One place is your own catalog of models: this is a good reason to save all your models and organize them so you can retrieve them easily later. Another option is to search in a model zoo.

TensorFlow has its own model zoo available at https://github.com/tensorflow/models.

Caffe also has a model zoo; the caffe-tensorflow converter (https://github.com/ethereon/caffe-tensorflow) lets you use Caffe models in TensorFlow.

11.2.7 Unsupervised Pretraining

Unsupervised Pretraining: If you have plenty of unlabeled training data, you can try to train the layers one by one, starting with the lowest layer and then going up, using an unsupervised feature detector algorithm such as Restricted Boltzmann Machines (RBMs; see Appendix E) or autoencoders (see Chapter 15). Each layer is trained on the output of the previously trained layers (all layers except the one being trained are frozen). Once all layers have been trained this way, you can fine-tune the network using supervised learning (i.e., with backpropagation).

11.2.8 Pretraining on an Auxiliary Task

One last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first neural network’s lower layers will learn feature detectors that will likely be reusable by the second neural network.

It is often rather cheap to gather unlabeled training examples, but quite expensive to label them. In this situation, a common technique is to label all your training examples as “good,” then generate many new training instances by corrupting the good ones, and label these corrupted instances as “bad”. Then you can train a first neural network to classify instances as good or bad.

Another approach is to train a first network to output a score for each training instance, and use a cost function that ensures that a good instance’s score is greater than a bad instance’s score by at least some margin. This is called max margin learning.
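
A minimal sketch of such a margin cost (my own illustration, not from the book): given scores for a good instance and a corrupted one, the loss is zero only once the good score wins by at least the margin:

score_good = tf.placeholder(tf.float32, shape=(None,), name="score_good")
score_bad = tf.placeholder(tf.float32, shape=(None,), name="score_bad")
margin = 1.0
# hinge-style loss: positive whenever score_good < score_bad + margin
margin_loss = tf.reduce_mean(tf.maximum(0.0, margin - (score_good - score_bad)))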

11.3 Faster Optimizers

Ways to speed up training:

  • applying a good initialization strategy for the connection weights
  • using a good activation function
  • using Batch Normalization
  • reusing parts of a pretrained network
  • using a faster optimizer than the regular Gradient Descent optimizer
    • Momentum optimization
    • Nesterov Accelerated Gradient
    • AdaGrad
    • RMSProp
    • Adam optimization

11.3.1 Momentum Optimization

Gradient Descent ($\theta\leftarrow \theta-\eta\nabla_\theta J(\theta)$) does not care about what the earlier gradients were.

Momentum optimization cares a great deal about what previous gradients were: at each iteration, it adds the local gradient (multiplied by the learning rate $\eta$) to the momentum vector $\textbf m$, and it updates the weights by simply subtracting this momentum vector (see Equation 11-4). In other words, the gradient is used as an acceleration, not as a speed. To simulate some sort of friction mechanism and prevent the momentum from growing too large, the algorithm introduces a new hyperparameter $\beta$, simply called the momentum, which must be set between 0 (high friction) and 1 (no friction). A typical momentum value is 0.9.

Equation 11-4. Momentum algorithm
$$
\begin{aligned}
1.\quad & \textbf m \leftarrow \beta\textbf m + \eta\nabla_\theta J(\theta)\\
2.\quad & \theta \leftarrow \theta - \textbf m
\end{aligned}
$$
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9)

11.3.2 Nesterov Accelerated Gradient

The idea of Nesterov Momentum optimization, or Nesterov Accelerated Gradient (NAG), is to measure the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum (see Equation 11-5). NAG is faster than vanilla Momentum optimization.

Equation 11-5. Nesterov Accelerated Gradient algorithm
$$
\begin{aligned}
1.\quad & \textbf m \leftarrow \beta\textbf m + \eta\nabla_\theta J(\theta + \beta\textbf m)\\
2.\quad & \theta \leftarrow \theta - \textbf m
\end{aligned}
$$
This small tweak works because in general the momentum vector will be pointing in the right direction (i.e., toward the optimum), so it will be slightly more accurate to use the gradient measured a bit farther in that direction rather than using the gradient at the original position. Moreover, it helps reduce oscillations and thus converges faster.

optimizer=tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                    momentum=0.9, use_nesterov=True)

11.3.3 AdaGrad

Gradient Descent starts by quickly going down the steepest slope, then slowly descends toward the optimum. It would be nice if the algorithm could correct its direction earlier to point a bit more toward the global optimum; AdaGrad does just that.

The AdaGrad algorithm achieves this by scaling down the gradient vector along the steepest dimensions (see Equation 11-6):

Equation 11-6. AdaGrad algorithm
$$
\begin{aligned}
1.\quad & \textbf s \leftarrow \textbf s + \nabla_\theta J(\theta)\otimes\nabla_\theta J(\theta)\\
2.\quad & \theta \leftarrow \theta - \eta\nabla_\theta J(\theta)\oslash\sqrt{\textbf s + \epsilon}
\end{aligned}
$$
The first step accumulates the squares of the gradients into the vector $\textbf s$ (the $\otimes$ symbol represents element-wise multiplication). This vectorized form is equivalent to computing $s_i \leftarrow s_i + (\partial J(\theta)/\partial \theta_i)^2$ for each element $s_i$ of the vector $\textbf s$; in other words, each $s_i$ accumulates the squares of the partial derivative of the cost function with regard to parameter $\theta_i$. If the cost function is steep along the $i^{th}$ dimension, then $s_i$ will get larger and larger at each iteration.

The second step is almost identical to Gradient Descent, but with one big difference: the gradient vector is scaled down by a factor of $\sqrt{\textbf s+\epsilon}$ (the $\oslash$ symbol represents element-wise division, and $\epsilon$ is a smoothing term to avoid division by zero, typically set to $10^{-10}$). This vectorized form is equivalent to computing $\theta_i \leftarrow \theta_i - \eta\,(\partial J(\theta)/\partial\theta_i)/\sqrt{s_i+\epsilon}$ for all parameters $\theta_i$ (simultaneously).

In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. It helps point the resulting updates more directly toward the global optimum (see Figure 11-7). One additional benefit is that it requires much less tuning of the learning rate hyperparameter $\eta$.

optimizer=tf.train.AdagradOptimizer(learning_rate=learning_rate)

11.3.4 RMSProp

AdaGrad often slows down a bit too fast and ends up never converging to the global optimum. The RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay in the first step (see Equation 11-7).

Equation 11-7. RMSProp algorithm
$$
\begin{aligned}
1.\quad & \textbf s \leftarrow \beta\textbf s + (1-\beta)\nabla_\theta J(\theta)\otimes\nabla_\theta J(\theta)\\
2.\quad & \theta \leftarrow \theta - \eta\nabla_\theta J(\theta)\oslash\sqrt{\textbf s + \epsilon}
\end{aligned}
$$

The decay rate $\beta$ is typically set to 0.9.

optimizer=tf.train.RMSPropOptimizer(learning_rate=learning_rate,
                                    momentum=0.9,decay=0.9,epsilon=1e-10)

11.3.5 Adam Optimization

Adam, which stands for adaptive moment estimation, combines the ideas of Momentum optimization and RMSProp: just like Momentum optimization it keeps track of an exponentially decaying average of past gradients, and just like RMSProp it keeps track of an exponentially decaying average of past squared gradients (see Equation 11-8).

Equation 11-8. Adam algorithm
$$
\begin{aligned}
1.\quad & \textbf m \leftarrow \beta_1\textbf m + (1-\beta_1)\nabla_\theta J(\theta)\\
2.\quad & \textbf s \leftarrow \beta_2\textbf s + (1-\beta_2)\nabla_\theta J(\theta)\otimes\nabla_\theta J(\theta)\\
3.\quad & \textbf m \leftarrow \frac{\textbf m}{1-\beta_1^T}\\
4.\quad & \textbf s \leftarrow \frac{\textbf s}{1-\beta_2^T}\\
5.\quad & \theta \leftarrow \theta - \eta\,\textbf m \oslash \sqrt{\textbf s+\epsilon}
\end{aligned}
$$

  • $T$ represents the iteration number (starting at 1).
optimizer=tf.train.AdamOptimizer(learning_rate=learning_rate)

Training Sparse Models

If you need a blazingly fast model at runtime, or if you need it to take up less memory, you may prefer to end up with a sparse model instead.

One trivial way to achieve this is to train the model as usual, then get rid of the tiny weights (set them to 0).
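
For example, a sketch of my own (the threshold is a placeholder, and the layer name assumes the earlier listings) that zeroes out the small weights of one layer after training:

threshold = 0.01
W1 = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
prune_W1 = tf.assign(W1, tf.where(tf.abs(W1) < threshold,
                                  tf.zeros_like(W1), W1))
# after training: sess.run(prune_W1)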

Another option is to apply strong $\ell_1$ regularization during training, as it pushes the optimizer to zero out as many weights as it can (as discussed in Chapter 4 about Lasso Regression).

However, in some cases these techniques may remain insufficient. One last option is to apply Dual Averaging, often called Follow The Regularized Leader (FTRL), a technique proposed by Yurii Nesterov. When used with $\ell_1$ regularization, this technique often leads to very sparse models. TensorFlow implements a variant of FTRL called FTRL-Proximal in the FtrlOptimizer class.
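
A sketch of using it (the regularization strength here is a placeholder):

optimizer = tf.train.FtrlOptimizer(learning_rate=learning_rate,
                                   l1_regularization_strength=0.001)
training_op = optimizer.minimize(loss)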

11.3.6 Learning Rate Scheduling

You can do better than a constant learning rate: if you start with a high learning rate and then reduce it once it stops making fast progress, you can reach a good solution faster than with the optimal constant learning rate. There are many different strategies to reduce the learning rate during training. These strategies are called learning schedules, the most common of which are:

Predetermined piecewise constant learning rate
For example, set the learning rate to $\eta_0 = 0.1$ at first, then to $\eta_1 = 0.001$ after 50 epochs. Although this solution can work very well, it often requires fiddling around to figure out the right learning rates and when to use them.
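
A sketch using tf.train.piecewise_constant (boundaries are expressed in training steps, so the epoch counts must be converted for your batch size; the numbers are placeholders):

global_step = tf.Variable(0, trainable=False, name="global_step")
boundaries = [50000]   # e.g., 50 epochs of 1,000 steps each
values = [0.1, 0.001]  # eta_0 at first, then eta_1
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)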

Performance scheduling
Measure the validation error every $N$ steps (just like for early stopping) and reduce the learning rate by a factor of $\lambda$ when the error stops dropping.

Exponential scheduling
Set the learning rate to a function of the iteration number $t$: $\eta(t) = \eta_0 10^{-t/r}$. This works great, but it requires tuning $\eta_0$ and $r$. The learning rate will drop by a factor of 10 every $r$ steps.

Power scheduling
Set the learning rate to $\eta(t) = \eta_0 (1 + t/r)^{-c}$. The hyperparameter $c$ is typically set to 1. This is similar to exponential scheduling, but the learning rate drops much more slowly.
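
As a plain-Python sketch of power scheduling (my own; the hyperparameter values are placeholders):

def power_schedule(t, eta0=0.1, r=10000, c=1.0):
    return eta0 * (1 + t / r) ** (-c)

print(power_schedule(0))      # 0.1
print(power_schedule(10000))  # 0.05: halved after r steps, not divided by 10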

import tensorflow as tf
import numpy as np
tf.reset_default_graph()

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")
    
with tf.name_scope("train"):       # not shown in the book
    initial_learning_rate = 0.1
    decay_steps = 10000
    decay_rate = 1/10
    global_step = tf.Variable(0, trainable=False, name="global_step")
    learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                               decay_steps, decay_rate)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss, global_step=global_step)
    
init = tf.global_variables_initializer()
saver = tf.train.Saver()

def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

n_epochs = 5
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)
        print(global_step.eval(),learning_rate.eval(feed_dict={X: X_batch, y: y_batch}))
        
    save_path = saver.save(sess, "./my_model_final.ckpt")
0 Validation accuracy: 0.956
1100 0.077624716
1 Validation accuracy: 0.9702
2200 0.060255956
2 Validation accuracy: 0.9768
3300 0.04677352
3 Validation accuracy: 0.9806
4400 0.036307808
4 Validation accuracy: 0.9818
5500 0.02818383

Since AdaGrad, RMSProp, and Adam optimization automatically reduce the learning rate during training, it is not necessary to add an extra learning schedule. For other optimization algorithms, using exponential decay or performance scheduling can considerably speed up convergence.

11.4 Avoiding Overfitting Through Regularization

11.4.1 Early Stopping
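
The book implements early stopping by evaluating the model on a validation set at regular intervals, saving a "winner" snapshot whenever it improves, and interrupting training when there has been no progress for a while, then rolling back to the winner. A sketch along those lines, reusing the training loop of the earlier listings (the patience value is a placeholder):

best_acc = 0.0
epochs_without_progress = 0
patience = 5

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        if accuracy_val > best_acc:
            best_acc = accuracy_val
            epochs_without_progress = 0
            save_path = saver.save(sess, "./my_best_model.ckpt")  # snapshot the winner
        else:
            epochs_without_progress += 1
            if epochs_without_progress >= patience:
                print("Early stopping at epoch", epoch)
                break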

11.4.2 $\ell_1$ and $\ell_2$ Regularization

import tensorflow as tf
import numpy as np
tf.reset_default_graph()

n_inputs=28*28
n_hidden1=300
n_outputs=10

X=tf.placeholder(tf.float32,shape=(None,n_inputs),name="X")
y=tf.placeholder(tf.int32,shape=(None),name="y")

with tf.name_scope("dnn"):
    hidden1=tf.layers.dense(X,n_hidden1,activation=tf.nn.relu,name="hidden1")
    logits=tf.layers.dense(hidden1,n_outputs,activation=None,name="outputs")

W1=tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
W2=tf.get_default_graph().get_tensor_by_name("outputs/kernel:0")
scale = 0.001 # l1 regularization hyperparameter


with tf.name_scope("loss"):
    xentropy=tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,labels=y)
    base_loss=tf.reduce_mean(xentropy,name="avg_xentropy")
    reg_losses=tf.reduce_sum(tf.abs(W1))+tf.reduce_sum(tf.abs(W2))
    loss=tf.add(base_loss,scale*reg_losses,name="loss")
    
learning_rate=0.1
with tf.name_scope("train"):
    optimizer=tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    training_op=optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct=tf.nn.in_top_k(logits,y,1)
    accuracy=tf.reduce_mean(tf.cast(correct,tf.float32))

init=tf.global_variables_initializer()
saver=tf.train.Saver()

n_epochs = 20
batch_size = 200
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        for X_batch,y_batch in shuffle_batch(X_train,y_train,batch_size):
            sess.run(training_op,feed_dict={X:X_batch,y:y_batch})
        accuracy_val=accuracy.eval(feed_dict={X:X_valid,y:y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)
    save_path=saver.save(sess,save_path="./my_final_model.ckpt")
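
A cleaner approach, also shown in the book: pass an l1 regularizer to each layer's kernel_regularizer argument and let TensorFlow collect the regularization losses for you. A sketch meant to replace the "dnn" and "loss" sections of the listing above (after resetting the graph):

from functools import partial

scale = 0.001  # l1 regularization hyperparameter

my_dense_layer = partial(tf.layers.dense, activation=tf.nn.relu,
                         kernel_regularizer=tf.contrib.layers.l1_regularizer(scale))

with tf.name_scope("dnn"):
    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    logits = my_dense_layer(hidden1, n_outputs, activation=None, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")
    # the layers stored their regularization losses in this collection
    reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    loss = tf.add_n([base_loss] + reg_losses, name="loss")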

11.4.3 Dropout

The most popular regularization technique for deep neural networks is arguably dropout. It was proposed by G. E. Hinton in 2012 and further detailed in a paper by Nitish Srivastava et al., and it has proven to be highly successful: even the state-of-the-art neural networks got a 1–2% accuracy boost simply by adding dropout. This may not sound like a lot, but when a model already has 95% accuracy, getting a 2% accuracy boost means dropping the error rate by almost 40% (going from 5% error to roughly 3%).

It is a fairly simple algorithm: at every training step, every neuron (including the input neurons but excluding the output neurons) has a probability $p$ of being temporarily "dropped out," meaning it will be entirely ignored during this training step, but it may be active during the next step (see Figure 11-9). The hyperparameter $p$ is called the dropout rate, and it is typically set to 50%. After training, neurons don't get dropped anymore.

There is one small but important technical detail. Suppose $p = 50\%$, in which case during testing a neuron will be connected to twice as many input neurons as it was (on average) during training. To compensate for this fact, we need to multiply each neuron's input connection weights by 0.5 after training. If we don't, each neuron will get a total input signal roughly twice as large as what the network was trained on, and it is unlikely to perform well. More generally, we need to multiply each input connection weight by the keep probability $(1 - p)$ after training. Alternatively, we can divide each neuron's output by the keep probability during training (these alternatives are not perfectly equivalent, but they work equally well).
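
The "divide during training" alternative is the one tf.layers.dropout implements. A tiny NumPy sketch of it (my own illustration):

import numpy as np

rng = np.random.RandomState(42)
p = 0.5                         # dropout rate
x = rng.randn(4, 3)
mask = rng.rand(*x.shape) >= p  # keep each unit with probability 1 - p
x_drop = x * mask / (1 - p)     # scale up so the expected total input is unchanged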

import tensorflow as tf
import numpy as np

tf.reset_default_graph()
n_inputs=28*28
n_hidden1=300
n_hidden2=100
n_outputs=10

X=tf.placeholder(tf.float32,shape=(None,n_inputs),name="X")
y=tf.placeholder(tf.int32,shape=(None),name="y")

training=tf.placeholder_with_default(False,shape=(),name="training")

dropout_rate=0.5 # ==1-keep_prob
X_drop=tf.layers.dropout(X,dropout_rate,training=training)

with tf.name_scope("dnn"):
    hidden1=tf.layers.dense(X_drop,n_hidden1,activation=tf.nn.relu,
                            name="hidden1")
    hidden1_drop=tf.layers.dropout(hidden1,dropout_rate,training=training)
    hidden2=tf.layers.dense(hidden1_drop,n_hidden2,activation=tf.nn.relu,
                           name="hidden2")
    hidden2_drop=tf.layers.dropout(hidden2,dropout_rate,training=training)
    logits=tf.layers.dense(hidden2_drop,n_outputs,name="outputs")

with tf.name_scope("loss"):
    xentropy=tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,labels=y)
    loss=tf.reduce_mean(xentropy,name="loss")

learning_rate=0.01
with tf.name_scope("train"):
    optimizer=tf.train.MomentumOptimizer(momentum=0.9,learning_rate=learning_rate)
    training_op=optimizer.minimize(loss)
    
with tf.name_scope("eval"):
    correct=tf.nn.in_top_k(logits,y,1)
    accuracy=tf.reduce_mean(tf.cast(correct,tf.float32))
    
init=tf.global_variables_initializer()
saver=tf.train.Saver()

n_epochs=20
batch_size=50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch,y_batch in shuffle_batch(X_train,y_train,batch_size):
            sess.run(training_op,feed_dict={X:X_batch,y:y_batch})
        accuracy_val=accuracy.eval(feed_dict={X:X_valid,y:y_valid})
        print(epoch,"Validation accuray:", accuracy_val)
    save_path=saver.save(sess,"./my_final_model.ckpt")

11.4.4 Max-Norm Regularization

Another regularization technique that is quite popular for neural networks is called max-norm regularization: for each neuron, it constrains the weights $\textbf w$ of the incoming connections such that $\|\textbf w\|_2 \le r$, where $r$ is the max-norm hyperparameter and $\|\cdot\|_2$ is the $\ell_2$ norm.

We typically implement this constraint by computing $\|\textbf w\|_2$ after each training step and clipping $\textbf w$ if needed ($\textbf w \leftarrow \textbf w \frac{r}{\|\textbf w\|_2}$).

Reducing r increases the amount of regularization and helps reduce overfitting. Max-norm regularization can also help alleviate the vanishing/exploding gradients problems (if you are not using Batch Normalization).

import tensorflow as tf
import numpy as np

tf.reset_default_graph()

n_inputs=28*28
n_hidden1=300
n_hidden2=50
n_outputs=10

X=tf.placeholder(tf.float32,shape=(None,n_inputs),name="X")
y=tf.placeholder(tf.int32,shape=(None),name="y")

with tf.name_scope("dnn"):
    hidden1=tf.layers.dense(X,n_hidden1,activation=tf.nn.relu,name="hidden1")
    hidden2=tf.layers.dense(hidden1,n_hidden2,activation=tf.nn.relu,name="hidden2")
    logits=tf.layers.dense(hidden2,n_outputs,name="outputs")

with tf.name_scope("loss"):
    xentropy=tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,labels=y)
    loss=tf.reduce_mean(xentropy,name="loss")
    
learning_rate=0.01
momentum=0.9
with tf.name_scope("train"):
    optimizer=tf.train.MomentumOptimizer(momentum=momentum,learning_rate=learning_rate)
    training_op=optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct=tf.nn.in_top_k(logits,y,1)
    accuracy=tf.reduce_mean(tf.cast(correct,tf.float32))
    
threshold=1.0
weights=tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
clipped_weights=tf.clip_by_norm(weights,clip_norm=threshold,axes=1)
clip_weights=tf.assign(weights,clipped_weights)

weights2=tf.get_default_graph().get_tensor_by_name("hidden2/kernel:0")
clipped_weights2=tf.clip_by_norm(weights2,clip_norm=threshold,axes=1)
clip_weights2=tf.assign(weights2,clipped_weights2)

init=tf.global_variables_initializer()
saver=tf.train.Saver()

n_epochs=20
batch_size=50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch,y_batch in shuffle_batch(X_train,y_train,batch_size):
            sess.run(training_op,feed_dict={X:X_batch,y:y_batch})
            clip_weights.eval()
            clip_weights2.eval()
        accuracy_val=accuracy.eval(feed_dict={X:X_valid,y:y_valid})
        print(epoch,"Acccuracy Validation:",accuracy_val)
    save_path=saver.save(sess,"./my_final_model.chpt")

A cleaner solution is to create a max_norm_regularizer() function and use it just like the earlier l1_regularizer() function. This function returns a parametrized max_norm() function that you can use like any other regularizer.

import tensorflow as tf
import numpy as np

tf.reset_default_graph()

n_inputs=28*28
n_hidden1=300
n_hidden2=50
n_outputs=10

X=tf.placeholder(tf.float32,shape=(None,n_inputs),name="X")
y=tf.placeholder(tf.int32,shape=(None),name="y")

def max_norm_regularizer(threshold,axes=1,name="max_norm",
                        collection="max_norm"):
    def max_norm(weights):
        clipped=tf.clip_by_norm(weights,clip_norm=threshold,axes=axes)
        clip_weights=tf.assign(weights,clipped,name=name)
        tf.add_to_collection(collection,clip_weights)
        return None #there is no regularization loss term
    return max_norm

max_norm_reg=max_norm_regularizer(threshold=1.0)

with tf.name_scope("dnn"):
    hidden1=tf.layers.dense(X,n_hidden1,activation=tf.nn.relu,
                            kernel_regularizer=max_norm_reg,name="hidden1")
    hidden2=tf.layers.dense(hidden1,n_hidden2,activation=tf.nn.relu,
                            kernel_regularizer=max_norm_reg,name="hidden2")
    logits=tf.layers.dense(hidden2,n_outputs,name="outputs")

with tf.name_scope("loss"):
    xentropy=tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,labels=y)
    loss=tf.reduce_mean(xentropy,name="loss")
    
learning_rate=0.01
momentum=0.9
with tf.name_scope("train"):
    optimizer=tf.train.MomentumOptimizer(momentum=momentum,learning_rate=learning_rate)
    training_op=optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct=tf.nn.in_top_k(logits,y,1)
    accuracy=tf.reduce_mean(tf.cast(correct,tf.float32))
    
init=tf.global_variables_initializer()
saver=tf.train.Saver()

n_epochs=20
batch_size=50

clip_all_weights= tf.get_collection("max_norm")

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch,y_batch in shuffle_batch(X_train,y_train,batch_size):
            sess.run(training_op,feed_dict={X:X_batch,y:y_batch})
            sess.run(clip_all_weights)
        accuracy_val=accuracy.eval(feed_dict={X:X_valid,y:y_valid})
        print(epoch,"Acccuracy Validation:",accuracy_val)
    save_path=saver.save(sess,"./my_final_model.chpt")

11.4.5 Data Augmentation

It is often preferable to generate training instances on the fly during training rather than wasting storage space and network bandwidth. TensorFlow offers several image manipulation operations such as transposing (shifting), rotating, resizing, flipping, and cropping, as well as adjusting the brightness, contrast, saturation, and hue (see the API documentation for more details). This makes it easy to implement data augmentation for image datasets.
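
A sketch of wiring a few of these ops into the graph (parameter values are placeholders; pick transformations that yield plausible instances for your dataset, since horizontally flipping digits, for example, would change their meaning):

image = tf.placeholder(tf.float32, shape=(32, 32, 3), name="image")
augmented = tf.image.random_flip_left_right(image)
augmented = tf.image.random_brightness(augmented, max_delta=0.1)
augmented = tf.image.random_contrast(augmented, lower=0.9, upper=1.1)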

11.5 Practical Guidelines

Table 11-2. Default DNN configuration

| Hyperparameter | Default value |
| --- | --- |
| Initialization | He initialization |
| Activation function | ELU |
| Normalization | Batch Normalization |
| Regularization | Dropout |
| Optimizer | Adam |
| Learning rate schedule | None |

This default configuration may need to be tweaked:

  • If you can't find a good learning rate (convergence was too slow, so you increased the learning rate, and now convergence is fast but the network's accuracy is suboptimal), then you can try adding a learning schedule such as exponential decay.
  • If your training set is a bit too small, you can implement data augmentation.
  • If you need a sparse model, you can add some $\ell_1$ regularization to the mix (and optionally zero out the tiny weights after training). If you need an even sparser model, you can try using FTRL instead of Adam optimization, along with $\ell_1$ regularization.
  • If you need a lightning-fast model at runtime, you may want to drop Batch Normalization, and possibly replace the ELU activation function with the leaky ReLU. Having a sparse model will also help.