Chapter 15 Autoencoders

Reading notes on O'Reilly's Hands-On Machine Learning with Scikit-Learn and TensorFlow.

Autoencoders are artificial neural networks capable of learning efficient representations of the input data, called codings, without any supervision (i.e., the training set is unlabeled).

Autoencoders can be used for:

  • dimensionality reduction
  • learning powerful feature detectors
  • unsupervised pretraining of deep neural networks
  • building generative models capable of randomly generating new data that resembles the training data

The codings are byproducts of the autoencoder’s attempt to learn the identity function under some constraints.

15.1 Efficient Data Representations

Constraining an autoencoder during training pushes it to discover and exploit patterns in the data.

An autoencoder is always composed of two parts: an encoder (or recognition network) that converts the inputs to an internal representation, followed by a decoder (or generative network) that converts the internal representation to the outputs (see Figure 15-1). The outputs are often called the reconstructions since the autoencoder tries to reconstruct the inputs, and the cost function contains a reconstruction loss that penalizes the model when the reconstructions are different from the inputs.

Because the internal representation has a lower dimensionality than the input data (in the autoencoder of Figure 15-1, the codings are 2D while the inputs are 3D), the autoencoder is said to be undercomplete. An undercomplete autoencoder cannot trivially copy its inputs to the codings, yet it must find a way to output a copy of its inputs. It is forced to learn the most important features in the input data (and drop the unimportant ones).

15.2 Performing PCA with an Undercomplete Linear Autoencoder

If the autoencoder uses only linear activations and the cost function is the Mean Squared Error (MSE), then it can be shown that it ends up performing Principal Component Analysis (see Chapter 8).

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import numpy.random as rnd
from sklearn.preprocessing import StandardScaler

#generating data
rnd.seed(4)
m = 200 
w1, w2 = 0.1, 0.3
noise = 0.1

angles = rnd.rand(m) * 3 * np.pi / 2 - 0.5
data = np.empty((m, 3))
data[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * rnd.randn(m) / 2
data[:, 1] = np.sin(angles) * 0.7 + noise * rnd.randn(m) / 2
data[:, 2] = data[:, 0] * w1 + data[:, 1] * w2 + noise * rnd.randn(m)

# standardize the data; use the first 100 instances for training, the rest for testing
scaler = StandardScaler()
X_train = scaler.fit_transform(data[:100])
X_test = scaler.transform(data[100:])

tf.reset_default_graph()

n_inputs = 3 # 3D inputs
n_hidden = 2 # 2D codings
n_outputs = n_inputs

learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = tf.layers.dense(X, n_hidden)          # no activation function: linear autoencoder
outputs = tf.layers.dense(hidden, n_outputs)

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(reconstruction_loss)

init = tf.global_variables_initializer()

n_iterations = 1000
codings = hidden   # the output of the hidden layer provides the codings

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        training_op.run(feed_dict={X: X_train})   # no labels (unsupervised)
    codings_val = codings.eval(feed_dict={X: X_test})

fig = plt.figure(figsize=(4, 3))
plt.plot(codings_val[:, 0], codings_val[:, 1], "b.")
plt.xlabel("$z_1$", fontsize=18)
plt.ylabel("$z_2$", fontsize=18, rotation=0)
plt.show()

15.3 Stacked Autoencoders

Stacked autoencoders (or deep autoencoders) are autoencoders with multiple hidden layers.

The architecture of a stacked autoencoder is typically symmetrical with regard to the central hidden layer (the coding layer).

15.3.1 TensorFlow Implementation

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch
        
tf.reset_default_graph()

from functools import partial

n_inputs= 28*28
n_hidden1= 300
n_hidden2= 150
n_hidden3= n_hidden1
n_outputs=n_inputs

learning_rate = 0.01
l2_reg =0.0001

X = tf.placeholder(tf.float32,[None, n_inputs])

he_init = tf.variance_scaling_initializer()
l2_regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
my_dense_layer = partial(tf.layers.dense,activation=tf.nn.elu,
                        kernel_initializer=he_init,
                        kernel_regularizer=l2_regularizer)

hidden1 = my_dense_layer(X,n_hidden1)
hidden2 = my_dense_layer(hidden1,n_hidden2)
hidden3 = my_dense_layer(hidden2,n_hidden3)
outputs = my_dense_layer(hidden3,n_outputs, activation=None)

reconstruction_loss = tf.reduce_mean(tf.square(outputs-X))
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([reconstruction_loss]+reg_losses)

optimizer =tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver=tf.train.Saver()
n_epochs = 5
batch_size=150
codings=hidden2

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch,y_batch  in shuffle_batch(X_train,y_train,batch_size):
            sess.run(training_op,feed_dict={X:X_batch})
        loss_train=reconstruction_loss.eval(feed_dict={X:X_batch})
        print("\r{}".format(epoch), "Train MSE:", loss_train)
        saver.save(sess, "./my_model_all_layers.ckpt")
        
        
def plot_image(image, shape=[28, 28]):
    plt.imshow(image.reshape(shape), cmap="Greys", interpolation="nearest")
    plt.axis("off")
    
def show_reconstructed_digits(X, outputs, model_path = None, n_test_digits = 2):
    with tf.Session() as sess:
        if model_path:
            saver.restore(sess, model_path)
        X_t2 = X_test[:n_test_digits]
        outputs_val = outputs.eval(feed_dict={X: X_t2})

        fig = plt.figure(figsize=(8, 3 * n_test_digits))
        for digit_index in range(n_test_digits):
            plt.subplot(n_test_digits, 2, digit_index * 2 + 1)
            plot_image(X_test[digit_index])
            plt.subplot(n_test_digits, 2, digit_index * 2 + 2)
            plot_image(outputs_val[digit_index])
        
show_reconstructed_digits(X, outputs, "./my_model_all_layers.ckpt")

15.3.2 Tying Weights

When an autoencoder is neatly symmetrical, like the one we just built, a common technique is to tie the weights of the decoder layers to the weights of the encoder layers. This halves the number of weights in the model, speeding up training and limiting the risk of overfitting. Specifically, if the autoencoder has a total of $N$ layers (not counting the input layer), and $\mathbf{W}_L$ represents the connection weights of the $L^{\mathrm{th}}$ layer, then the decoder layer weights can be defined simply as $\mathbf{W}_{N-L+1} = \mathbf{W}_L^{T}$ (with $L = 1, 2, \cdots, \frac{N}{2}$).

activation =tf.nn.elu
initializer = tf.variance_scaling_initializer()
regularizer = tf.contrib.layers.l2_regularizer(l2_reg)

X=tf.placeholder(tf.float32, [None,n_inputs])
weights1_init = initializer([n_inputs,n_hidden1])
weights2_init = initializer([n_hidden1, n_hidden2])
weights1=tf.Variable(weights1_init,dtype=tf.float32,name="weights1")
weights2=tf.Variable(weights2_init,dtype=tf.float32,name="weights2")
weights3=tf.transpose(weights2,name="weights3")
weights4=tf.transpose(weights1,name="weights4")

biases1=tf.Variable(tf.zeros(n_hidden1),name="biases1")
biases2=tf.Variable(tf.zeros(n_hidden2),name="biases2")
biases3=tf.Variable(tf.zeros(n_hidden3),name="biases3")
biases4=tf.Variable(tf.zeros(n_outputs),name="biases4")

hidden1=activation(tf.matmul(X,weights1)+biases1)
hidden2=activation(tf.matmul(hidden1,weights2)+biases2)
hidden3=activation(tf.matmul(hidden2,weights3)+biases3)
outputs=tf.matmul(hidden3,weights4)+biases4

reconstruction_loss = tf.reduce_mean(tf.square(outputs-X))
reg_losses =regularizer(weights1)+regularizer(weights2)
loss=reconstruction_loss+reg_losses

optimizer =tf.train.AdamOptimizer(learning_rate)
training_op=optimizer.minimize(loss)

init=tf.global_variables_initializer()
saver=tf.train.Saver()

n_epochs = 5
batch_size=150
codings=hidden2

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch,y_batch  in shuffle_batch(X_train,y_train,batch_size):
            sess.run(training_op,feed_dict={X:X_batch})
        loss_train=reconstruction_loss.eval(feed_dict={X:X_batch})
        print("\r{}".format(epoch), "Train MSE:", loss_train)           # not shown
        saver.save(sess, "./my_tying_weights.ckpt")  

15.3.3 Training One Autoencoder at a Time

Rather than training the whole stacked autoencoder in one go like we just did, it is often much faster to train one shallow autoencoder at a time, then stack all of them into a single stacked autoencoder (hence the name), as shown in Figure 15-4. This is especially useful for very deep autoencoders.

During the first phase of training, the first autoencoder learns to reconstruct the inputs. During the second phase, the second autoencoder learns to reconstruct the output of the first autoencoder’s hidden layer.

To implement this multiphase training algorithm, the simplest approach is to use a different TensorFlow graph for each phase. After training an autoencoder, you just run the training set through it and capture the output of the hidden layer. This output then serves as the training set for the next autoencoder. Once all autoencoders have been trained this way, you simply copy the weights and biases from each autoencoder and use them to build the stacked autoencoder.

#training one autoencoder at a time in multiple graphs
tf.reset_default_graph()

#train one autoencoder and return the transformed training set
def train_autoencoder(X_train,n_neurons,n_epochs,batch_size,
                     learning_rate=0.01,l2_reg=0.0005,seed=42,
                     hidden_activation=tf.nn.elu,
                     output_activation=tf.nn.elu):
    graph=tf.Graph()
    with graph.as_default():
        tf.set_random_seed(seed)
        
        n_inputs=X_train.shape[1]
        
        X=tf.placeholder(tf.float32,[None,n_inputs])
        
        my_dense_layer = partial(
            tf.layers.dense,
            kernel_initializer=tf.variance_scaling_initializer(),
            kernel_regularizer=tf.contrib.layers.l2_regularizer(l2_reg))

        hidden=my_dense_layer(X,n_neurons,activation=hidden_activation,name="hidden")
        outputs=my_dense_layer(hidden,n_inputs,activation=output_activation,name="outputs")
        
        reconstruction_loss=tf.reduce_mean(tf.square(outputs-X))
        reg_losses=tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
        loss=tf.add_n([reconstruction_loss]+reg_losses)
        
        optimizer=tf.train.AdamOptimizer(learning_rate)
        training_op=optimizer.minimize(loss)
        
        init=tf.global_variables_initializer()
        
    with tf.Session(graph=graph) as sess:
        init.run()
        for epoch in range(n_epochs):
            for X_batch,y_batch in shuffle_batch(X_train,y_train,batch_size):
                sess.run(training_op,feed_dict={X:X_batch})            
            loss_train = reconstruction_loss.eval(feed_dict={X: X_batch})
            print("\r{}".format(epoch), "Train MSE:", loss_train)
            
        params =dict([(var.name, var.eval()) 
                      for var in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)])
        hidden_val = hidden.eval(feed_dict={X: X_train})
        return hidden_val, params["hidden/kernel:0"], params["hidden/bias:0"], params["outputs/kernel:0"], params["outputs/bias:0"]
    
hidden_output, W1, b1, W4, b4 = train_autoencoder(X_train, n_neurons=300, n_epochs=4, batch_size=150,
                                                  output_activation=None)
_, W2, b2, W3, b3 = train_autoencoder(hidden_output, n_neurons=150, n_epochs=4, batch_size=150)

tf.reset_default_graph()
n_inputs=28*28

X=tf.placeholder(tf.float32,[None,n_inputs])
hidden1=tf.nn.elu(tf.matmul(X,W1)+b1)
hidden2 = tf.nn.elu(tf.matmul(hidden1, W2) + b2)
hidden3 = tf.nn.elu(tf.matmul(hidden2, W3) + b3)
outputs = tf.matmul(hidden3, W4) + b4

show_reconstructed_digits(X, outputs)

Another approach is to use a single graph containing the whole stacked autoencoder, plus some extra operations to perform each training phase, as shown in Figure 15-5.

tf.reset_default_graph()

from functools import partial

n_inputs= 28*28
n_hidden1= 300
n_hidden2= 150
n_hidden3= n_hidden1
n_outputs=n_inputs

learning_rate = 0.01
l2_reg =0.0001

activation = tf.nn.elu
regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
initializer = tf.variance_scaling_initializer()

X = tf.placeholder(tf.float32,[None, n_inputs])

weights1_init = initializer([n_inputs, n_hidden1])
weights2_init = initializer([n_hidden1, n_hidden2])
weights3_init = initializer([n_hidden2, n_hidden3])
weights4_init = initializer([n_hidden3, n_outputs])

weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
weights3 = tf.Variable(weights3_init, dtype=tf.float32, name="weights3")
weights4 = tf.Variable(weights4_init, dtype=tf.float32, name="weights4")

biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
biases4 = tf.Variable(tf.zeros(n_outputs), name="biases4")

hidden1 = activation(tf.matmul(X, weights1) + biases1)
hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
hidden3 = activation(tf.matmul(hidden2, weights3) + biases3)
outputs = tf.matmul(hidden3, weights4) + biases4

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))


optimizer =tf.train.AdamOptimizer(learning_rate)

with tf.name_scope("phase1"):
    phase1_outputs = tf.matmul(hidden1, weights4) + biases4  # bypass hidden2 and hidden3
    phase1_reconstruction_loss = tf.reduce_mean(tf.square(phase1_outputs - X))
    phase1_reg_loss = regularizer(weights1) + regularizer(weights4)
    phase1_loss = phase1_reconstruction_loss + phase1_reg_loss
    phase1_training_op = optimizer.minimize(phase1_loss)

with tf.name_scope("phase2"):
    phase2_reconstruction_loss = tf.reduce_mean(tf.square(hidden3 - hidden1))
    phase2_reg_loss = regularizer(weights2) + regularizer(weights3)
    phase2_loss = phase2_reconstruction_loss + phase2_reg_loss
    train_vars = [weights2, biases2, weights3, biases3]
    phase2_training_op = optimizer.minimize(phase2_loss, var_list=train_vars) # freeze hidden1

init = tf.global_variables_initializer()
saver = tf.train.Saver()

training_ops = [phase1_training_op, phase2_training_op]
reconstruction_losses = [phase1_reconstruction_loss, phase2_reconstruction_loss]
n_epochs = [4, 4]
batch_sizes = [150, 150]

with tf.Session() as sess:
    init.run()
    for phase in range(2):
        print("Training phase #{}".format(phase + 1))
        for epoch in range(n_epochs[phase]):
            for X_batch,y_batch in shuffle_batch(X_train,y_train,batch_sizes[phase]):
                sess.run(training_ops[phase], feed_dict={X: X_batch})
            loss_train = reconstruction_losses[phase].eval(feed_dict={X: X_batch})
            print("\r{}".format(epoch), "Train MSE:", loss_train)
            saver.save(sess, "./my_model_one_at_a_time.ckpt")
        loss_test = reconstruction_loss.eval(feed_dict={X: X_test})
        print("Test MSE:", loss_test)

The first phase is rather straightforward: we just create an output layer that skips hidden layers 2 and 3, then build the training operations to minimize the distance between the outputs and the inputs (plus some regularization).

The second phase just adds the operations needed to minimize the distance between the output of hidden layer 3 and hidden layer 1 (also with some regularization). Most importantly, passing the list of variables to train to the minimize() method leaves out weights1 and biases1; this effectively freezes hidden layer 1 during phase 2.

During the execution phase, all you need to do is run the phase 1 training op for a number of epochs, then the phase 2 training op for some more epochs. Since hidden layer 1 is frozen during phase 2, its output will always be the same for any given training instance. To avoid having to recompute the output of hidden layer 1 at every single epoch, you can compute it for the whole training set at the end of phase 1, then directly feed the cached output of hidden layer 1 during phase 2. This can give you a nice performance boost.
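Here is a minimal sketch of that caching trick, assuming the single-graph setup above (same names: hidden1, phase1_training_op, phase2_training_op, n_epochs, batch_sizes); in TensorFlow 1.x any tensor, including hidden1, can be fed directly through feed_dict:

# Sketch: train phase 1 normally, cache hidden1's output for the whole
# training set, then feed the cached codings directly during phase 2.
with tf.Session() as sess:
    init.run()

    # Phase 1: train hidden layer 1 and the output layer.
    for epoch in range(n_epochs[0]):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_sizes[0]):
            sess.run(phase1_training_op, feed_dict={X: X_batch})

    # Compute hidden layer 1's output once for the whole training set.
    hidden1_cache = hidden1.eval(feed_dict={X: X_train})

    # Phase 2: feed the cached activations instead of recomputing hidden1.
    for epoch in range(n_epochs[1]):
        indices = np.random.permutation(len(X_train))
        n_batches = len(X_train) // batch_sizes[1]
        for batch_indices in np.array_split(indices, n_batches):
            hidden1_batch = hidden1_cache[batch_indices]
            sess.run(phase2_training_op, feed_dict={hidden1: hidden1_batch})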

15.3.4 Visualizing the Reconstructions

One way to ensure that an autoencoder is properly trained is to compare the inputs and the outputs. They must be fairly similar, and the differences should be unimportant details.

n_test_digits = 2
X_test_2 = X_test[:n_test_digits]

with tf.Session() as sess:
    saver.restore(sess, "./my_model_one_at_a_time.ckpt") # not shown in the book
    outputs_val = outputs.eval(feed_dict={X: X_test_2})

def plot_image(image, shape=[28, 28]):
    plt.imshow(image.reshape(shape), cmap="Greys", interpolation="nearest")
    plt.axis("off")

for digit_index in range(n_test_digits):
    plt.subplot(n_test_digits, 2, digit_index * 2 + 1)
    plot_image(X_test_2[digit_index])
    plt.subplot(n_test_digits, 2, digit_index * 2 + 2)
    plot_image(outputs_val[digit_index])

15.3.5 Visualizing Features

Arguably the simplest technique is to consider each neuron in every hidden layer, and find the training instances that activate it the most.
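A minimal sketch of that idea, assuming the graph and checkpoint from Section 15.3.3 (the neuron index and the number of images shown are arbitrary illustrative choices):

# Sketch: find the training images that most strongly activate one neuron
# of hidden layer 1.
neuron_index = 0   # illustrative: any neuron of the hidden/coding layer
top_k = 5

with tf.Session() as sess:
    saver.restore(sess, "./my_model_one_at_a_time.ckpt")
    activations = hidden1.eval(feed_dict={X: X_train})   # shape [m, n_hidden1]

# indices of the top_k training instances with the highest activation
top_instances = np.argsort(activations[:, neuron_index])[-top_k:][::-1]
for position, instance_index in enumerate(top_instances):
    plt.subplot(1, top_k, position + 1)
    plot_image(X_train[instance_index])
plt.show()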

Let’s look at another technique. For each neuron in the first hidden layer, you can create an image where each pixel’s intensity corresponds to the weight of the connection from that pixel to the neuron. Plotting these weight vectors as images reveals the pattern that each neuron responds to.

with tf.Session() as sess:
    saver.restore(sess, "./my_model_one_at_a_time.ckpt") # not shown in the book
    weights1_val = weights1.eval()

for i in range(5):
    plt.subplot(1, 5, i + 1)
    plot_image(weights1_val.T[i])

plt.show()                          # not shown

Another technique is to feed the autoencoder a random input image, measure the activation of the neuron you are interested in, and then perform backpropagation to tweak the image in such a way that the neuron will activate even more. If you iterate several times (performing gradient ascent), the image will gradually turn into the most exciting image (for the neuron). This is a useful technique to visualize the kinds of inputs that a neuron is looking for.
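A minimal sketch of this gradient ascent procedure, under the same assumptions as above (the neuron index, step size, and number of iterations are illustrative):

# Sketch: tweak a random input image by gradient ascent so that it maximizes
# the activation of one neuron in hidden layer 1.
neuron_index = 0
neuron_activation = hidden1[0, neuron_index]        # activation for a single image
gradient = tf.gradients(neuron_activation, X)[0]    # d(activation) / d(input pixels)

with tf.Session() as sess:
    saver.restore(sess, "./my_model_one_at_a_time.ckpt")
    image = np.random.rand(1, n_inputs)             # start from a random image
    for step in range(100):
        grad_val = gradient.eval(feed_dict={X: image})
        image += 0.1 * grad_val                     # gradient ascent step
        image = np.clip(image, 0.0, 1.0)            # keep pixel values valid

plot_image(image[0])
plt.show()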

Finally, if you are using an autoencoder to perform unsupervised pretraining—for example, for a classification task—a simple way to verify that the features learned by the autoencoder are useful is to measure the performance of the classifier.

15.4 Unsupervised Pretraining Using Stacked Autoencoders

As we discussed in Chapter 11, if you are tackling a complex supervised task but you do not have a lot of labeled training data, one solution is to find a neural network that performs a similar task, and then reuse its lower layers. This makes it possible to train a high-performance model using only little training data because your neural network won’t have to learn all the low-level features; it will just reuse the feature detectors learned by the existing net.

Similarly, if you have a large dataset but most of it is unlabeled, you can first train a stacked autoencoder using all the data, then reuse the lower layers to create a neural network for your actual task, and train it using the labeled data.
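A minimal sketch of this reuse, assuming the encoder variables (weights1, biases1, weights2, biases2) from the stacked autoencoder in Section 15.3.3 are in the current graph and already trained; the classifier head and its names are illustrative additions, not the book's code:

# Sketch: reuse the pretrained encoder layers as the lower layers of a classifier,
# then train only the new softmax output layer on the labeled data.
n_classes = 10  # MNIST digit classes

y = tf.placeholder(tf.int32, shape=[None])

hidden1_clf = tf.nn.elu(tf.matmul(X, weights1) + biases1)            # reused encoder layer 1
hidden2_clf = tf.nn.elu(tf.matmul(hidden1_clf, weights2) + biases2)  # reused encoder layer 2
logits = tf.layers.dense(hidden2_clf, n_classes, name="softmax_logits")  # new output layer

xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
clf_loss = tf.reduce_mean(xentropy)

# Train only the new layer (drop var_list to fine-tune the reused layers as well).
new_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="softmax_logits")
clf_training_op = tf.train.AdamOptimizer(learning_rate).minimize(clf_loss, var_list=new_vars)

In a session, restore the autoencoder checkpoint first so the reused weights hold their pretrained values, then run clf_training_op on the labeled batches.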

15.5 Denoising Autoencoders

Another way to force the autoencoder to learn useful features is to add noise to its inputs, training it to recover the original, noise-free inputs. This prevents the autoencoder from trivially copying its inputs to its outputs, so it ends up having to find patterns in the data.

The noise can be pure Gaussian noise added to the inputs, or it can consist of randomly switching off inputs, just as in dropout (a sketch of the dropout variant follows the Gaussian implementation below).

15.5.1 TensorFlow Implementation

tf.reset_default_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 150  # codings
n_hidden3 = n_hidden1
n_outputs = n_inputs

learning_rate = 0.01

noise_level = 1.0

X = tf.placeholder(tf.float32, [None, n_inputs])
X_noisy = X + noise_level * tf.random_normal(tf.shape(X))  # add Gaussian noise to the inputs

hidden1 = tf.layers.dense(X_noisy, n_hidden1, activation=tf.nn.relu,
                          name="hidden1")  # the encoder sees the noisy inputs
hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                          name="hidden2")
hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu,
                          name="hidden3")
outputs = tf.layers.dense(hidden3, n_outputs, name="outputs")

reconstruction_loss = tf.reduce_mean(tf.square(outputs-X))
optimizer =tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(reconstruction_loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 10
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch,y_batch in shuffle_batch(X_train,y_train,batch_size):
            sess.run(training_op, feed_dict={X: X_batch})
        loss_train = reconstruction_loss.eval(feed_dict={X: X_batch})
        print("\r{}".format(epoch), "Train MSE:", loss_train)
        saver.save(sess, "./my_model_stacked_denoising_gaussian.ckpt")
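The dropout variant mentioned at the start of this section changes only how the inputs are corrupted; here is a minimal sketch mirroring the model above (the training placeholder ensures inputs are switched off only during training):

# Sketch: dropout-style corruption instead of Gaussian noise.
tf.reset_default_graph()

dropout_rate = 0.3

X = tf.placeholder(tf.float32, [None, n_inputs])
training = tf.placeholder_with_default(False, shape=(), name="training")
X_drop = tf.layers.dropout(X, dropout_rate, training=training)   # randomly switch off inputs

hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu, name="hidden1")
hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name="hidden3")
outputs = tf.layers.dense(hidden3, n_outputs, name="outputs")

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))      # compare to the clean inputs
training_op = tf.train.AdamOptimizer(learning_rate).minimize(reconstruction_loss)
# During training, feed training=True so the dropout mask is applied:
#   sess.run(training_op, feed_dict={X: X_batch, training: True})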

15.6 Sparse Autoencoder

Another kind of constraint that often leads to good feature extraction is sparsity: by adding an appropriate term to the cost function, the autoencoder is pushed to reduce the number of active neurons in the coding layer. The fewer neurons it can activate at once, the more each active neuron ends up representing a specific, useful feature.

In order to favor sparse models, we must first measure the actual sparsity of the coding layer at each training iteration. We do so by computing the average activation of each neuron in the coding layer, over the whole training batch. The batch size must not be too small, or else the mean will not be accurate.

Once we have the mean activation per neuron, we want to penalize the neurons that are too active by adding a sparsity loss to the cost function. A good choice for this penalty is the Kullback–Leibler (KL) divergence, which has much stronger gradients than the Mean Squared Error.

Equation 15-1. Kullback–Leibler divergence

$D_{\mathrm{KL}}(P \parallel Q) = \sum_i P(i) \log \dfrac{P(i)}{Q(i)}$

In our case, we want to measure the divergence between the target probability $p$ that a neuron in the coding layer will activate and the actual probability $q$ (i.e., the mean activation over the training batch).

Equation 15-2. KL divergence between the target sparsity p and the actual sparsity q

$D_{\mathrm{KL}}(p \parallel q) = p \log \dfrac{p}{q} + (1 - p) \log \dfrac{1 - p}{1 - q}$
Once we have computed the sparsity loss for each neuron in the coding layer, we just sum up these losses, and add the result to the cost function. In order to control the relative importance of the sparsity loss and the reconstruction loss, we can multiply the sparsity loss by a sparsity weight hyperparameter. If this weight is too high, the model will stick closely to the target sparsity, but it may not reconstruct the inputs properly, making the model useless. Conversely, if it is too low, the model will mostly ignore the sparsity objective and it will not learn any interesting features.

15.6.1 TensorFlow Implementation

tf.reset_default_graph()

n_inputs = 28 * 28
n_hidden1 = 1000  # sparse codings
n_outputs = n_inputs

def kl_divergence(p,q):
    return p*tf.log(p/q)+(1-p)*tf.log((1-p)/(1-q))

learning_rate = 0.01
sparsity_target = 0.1
sparsity_weight = 0.2

X = tf.placeholder(tf.float32, shape=[None, n_inputs])            # not shown in the book

hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.sigmoid) # not shown
outputs = tf.layers.dense(hidden1, n_outputs)                     # not shown

hidden1_mean= tf.reduce_mean(hidden1,axis=0) #batch mean
sparsity_loss=tf.reduce_sum(kl_divergence(sparsity_target,hidden1_mean))
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X)) # MSE
loss = reconstruction_loss + sparsity_weight * sparsity_loss

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 100
batch_size = 1000

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch,y_batch in shuffle_batch(X_train,y_train,batch_size):
            sess.run(training_op, feed_dict={X: X_batch})
        reconstruction_loss_val, sparsity_loss_val, loss_val = sess.run([reconstruction_loss, sparsity_loss, loss], feed_dict={X: X_batch})
        print("\r{}".format(epoch), "Train MSE:", reconstruction_loss_val, "\tSparsity loss:", sparsity_loss_val, "\tTotal loss:", loss_val)
        saver.save(sess, "./my_model_sparse.ckpt")

15.7 Variational Autoencoders

  • They are probabilistic autoencoders, meaning that their outputs are partly determined by chance, even after training (as opposed to denoising autoencoders, which use randomness only during training).
  • Most importantly, they are generative autoencoders, meaning that they can generate new instances that look like they were sampled from the training set.

There is a twist: instead of directly producing a coding for a given input, the encoder produces a mean coding $\mu$ and a standard deviation $\sigma$. The actual coding is then sampled randomly from a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$, and the decoder decodes the sampled coding normally. In other words: the encoder produces $\mu$ and $\sigma$, a coding is sampled randomly (notice that it is not exactly located at $\mu$), this coding is decoded, and the final output resembles the training instance.

Although the inputs may have a very convoluted distribution, a variational autoencoder tends to produce codings that look as though they were sampled from a simple Gaussian distribution: during training, the cost function (discussed next) pushes the codings to gradually migrate within the coding space (also called the latent space) to occupy a roughly (hyper)spherical region that looks like a cloud of Gaussian points. One great consequence is that after training a variational autoencoder, you can very easily generate a new instance: just sample a random coding from the Gaussian distribution, decode it, and voilà!

So let’s look at the cost function. It is composed of two parts. The first is the usual reconstruction loss that pushes the autoencoder to reproduce its inputs (we can use cross entropy for this, as discussed earlier). The second is the latent loss that pushes the autoencoder to have codings that look as though they were sampled from a simple Gaussian distribution, for which we use the KL divergence between the target distribution (the Gaussian distribution) and the actual distribution of the codings.

from functools import partial
tf.reset_default_graph()

n_inputs = 28 * 28 # for MNIST
n_hidden1 = 500
n_hidden2 = 500
n_hidden3 = 20 # codings
n_hidden4 = n_hidden2
n_hidden5 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.001

initializer= tf.variance_scaling_initializer()

my_dense_layer = partial(
    tf.layers.dense,
    activation=tf.nn.elu,
    kernel_initializer=initializer)

X= tf.placeholder(tf.float32,[None,n_inputs])
hidden1=my_dense_layer(X,n_hidden1)
hidden2=my_dense_layer(hidden1,n_hidden2)
hidden3_mean=my_dense_layer(hidden2,n_hidden3,activation=None)
#hidden3_sigma= my_dense_layer(hidden2, n_hidden3, activation=None)
#noise = tf.random_normal(tf.shape(hidden3_sigma), dtype=tf.float32)
#hidden3 = hidden3_mean + hidden3_sigma * noise
hidden3_gamma = my_dense_layer(hidden2, n_hidden3, activation=None)
noise = tf.random_normal(tf.shape(hidden3_gamma), dtype=tf.float32)
hidden3 = hidden3_mean + tf.exp(0.5 * hidden3_gamma) * noise
hidden4 = my_dense_layer(hidden3, n_hidden4)
hidden5 = my_dense_layer(hidden4, n_hidden5)
logits = my_dense_layer(hidden5, n_outputs, activation=None)
outputs = tf.sigmoid(logits)

xentropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits)
reconstruction_loss = tf.reduce_sum(xentropy)

eps = 1e-10 # smoothing term to avoid computing log(0) which is NaN
#latent_loss =0.5 * tf.reduce_sum(
#    tf.square(hidden3_sigma) + tf.square(hidden3_mean)
#    -1-tf.log(eps+tf.square(hidden3_sigma)))
latent_loss = 0.5 * tf.reduce_sum(
    tf.exp(hidden3_gamma) + tf.square(hidden3_mean) - 1 - hidden3_gamma)
    
loss =reconstruction_loss + latent_loss

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

# 15.7.1 Generating Digits 
n_digits = 60
n_epochs = 50
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train,y_train,batch_size):
            sess.run(training_op, feed_dict={X: X_batch})
        loss_val, reconstruction_loss_val, latent_loss_val = sess.run([loss, reconstruction_loss, latent_loss], feed_dict={X: X_batch})
        print("\r{}".format(epoch), "Train total loss:", loss_val, "\tReconstruction loss:", reconstruction_loss_val, "\tLatent loss:", latent_loss_val)
        saver.save(sess, "./my_model_variational.ckpt")
    
    codings_rnd = np.random.normal(size=[n_digits, n_hidden3])
    outputs_val = outputs.eval(feed_dict={hidden3: codings_rnd})
plt.figure(figsize=(8,50)) # not shown in the book
for iteration in range(n_digits):
    plt.subplot(n_digits, 10, iteration + 1)
    plot_image(outputs_val[iteration])

15.7.1 Generating Digits

Once the variational autoencoder is trained, new digits are generated at the end of the code above simply by sampling random codings from a Gaussian distribution (codings_rnd) and feeding them in place of hidden3, so that only the decoder runs.

15.8 Other Autoencoders

Contractive autoencoder (CAE)

The autoencoder is constrained during training so that the derivatives of the codings with respect to the inputs are small. In other words, two similar inputs must have similar codings (a sketch of this penalty appears at the end of this section).

Stacked convolutional autoencoders

Autoencoders that learn to extract visual features by reconstructing images processed through convolutional layers.

Generative stochastic network (GSN)

A generalization of denoising autoencoders, with the added capability to generate data.

Winner-take-all (WTA) autoencoder

During training, after computing the activations of all the neurons in the coding layer, only the top k% of activations for each neuron over the training batch are preserved, and the rest are set to zero. Naturally this leads to sparse codings. Moreover, a similar WTA approach can be used to produce sparse convolutional autoencoders (a sketch of the batch-wise top-k operation appears at the end of this section).

Adversarial autoencoders

One network is trained to reproduce its inputs, and at the same time another is trained to find inputs that the first network is unable to properly reconstruct. This pushes the first autoencoder to learn robust codings.
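For the contractive autoencoder, the Jacobian penalty has a simple closed form when the coding layer uses the sigmoid activation. Here is a minimal sketch, under the assumption that X, weights1, biases1 and reconstruction_loss are defined as in the earlier stacked autoencoder code (the penalty weight is arbitrary):

# Sketch: contractive penalty for a sigmoid coding layer h = sigmoid(X·W + b).
# Since dh_j/dx_i = h_j * (1 - h_j) * W_ij, the squared Frobenius norm of the
# Jacobian factorizes as below (Rifai et al., 2011).
contractive_weight = 0.1

h = tf.nn.sigmoid(tf.matmul(X, weights1) + biases1)           # codings
col_sq_norms = tf.reduce_sum(tf.square(weights1), axis=0)     # ||W[:, j]||^2 per coding unit
jacobian_sq_norm = tf.reduce_sum(tf.square(h * (1 - h)) * col_sq_norms, axis=1)
cae_loss = reconstruction_loss + contractive_weight * tf.reduce_mean(jacobian_sq_norm)

For the winner-take-all autoencoder, the core operation is batch-wise: keep only the top k% activations of each coding neuron over the training batch. A NumPy sketch of just that operation (illustrative, not the book's code):

# Sketch: zero out all but the top k% activations of each neuron over the batch.
def wta_sparsify(codings, k_percent=0.05):
    m = codings.shape[0]
    k = max(1, int(np.ceil(k_percent * m)))          # number of "winners" per neuron
    thresholds = np.sort(codings, axis=0)[-k, :]     # k-th largest activation per column
    return np.where(codings >= thresholds, codings, 0.0)

batch_codings = np.random.rand(150, 1000)            # e.g., a batch of 150, 1000 coding neurons
sparse_codings = wta_sparsify(batch_codings)         # roughly 5% of each column stays nonzero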
