17_Representation Tying Weights CNN RNN denoising Sparse Autoencoder_latent loss_accuracy_TSNE_KL Divergence_L1

     Autoencoders are artificial neural networks capable of learning dense representations of the input data, called latent representations or codings, without any supervision (i.e., the training set is unlabeled). These codings typically have a much lower dimensionality than the input data, making autoencoders useful for dimensionality reduction (see Cp8 https://blog.csdn.net/Linli522362242/article/details/105139547), especially for visualization purposes. Autoencoders also act as feature detectors, and they can be used for unsupervised pretraining of deep neural networks (as we discussed in Cp11 https://blog.csdn.net/Linli522362242/article/details/106935910). Lastly, some autoencoders are generative models: they are capable of randomly generating new data that looks very similar to the training data. For example, you could train an autoencoder on pictures of faces, and it would then be able to generate new faces. However, the generated images are usually fuzzy and not entirely realistic.

     In contrast, faces generated by generative adversarial networks (GANs) are now so convincing that it is hard to believe that the people they represent do not exist. You can judge so for yourself by visiting https://thispersondoesnotexist.com/, a website that shows faces generated by a recent GAN architecture called StyleGAN (you can also check out https://thisrentaldoesnotexist.com/ to see some generated Airbnb bedrooms). GANs are now widely used for super resolution (increasing the resolution of an image), colorization, powerful image editing (e.g., replacing photobombers with realistic background), turning a simple sketch into a photorealistic image, predicting the next frames in a video, augmenting a dataset (to train other models; see Augmentation https://blog.csdn.net/Linli522362242/article/details/108396485), generating other types of data (such as text, audio, and time series), identifying the weaknesses in other models and strengthening them, and more.

     Autoencoders and GANs are both unsupervised, they both learn dense representations, they can both be used as generative models, and they have many similar applications. However, they work very differently:

  • Autoencoders simply learn to copy their inputs to their outputs. This may sound like a trivial task, but we will see that constraining the network in various ways can make it rather difficult. For example, you can limit the size of the latent representations, or you can add noise to the inputs and train the network to recover the original inputs. These constraints prevent the autoencoder from trivially copying the inputs directly to the outputs, which forces it to learn efficient ways of representing the data. In short, the codings are byproducts of the autoencoder learning the identity function under some constraints.
  • GANs are composed of two neural networks: a generator that tries to generate data that looks similar to the training data, and a discriminator that tries to tell real data from fake data. This architecture is very original in Deep Learning in that the generator and the discriminator compete against each other during training: the generator is often compared to a criminal trying to make realistic counterfeit money, while the discriminator is like the police investigator trying to tell real money from fake. Adversarial training (training competing neural networks) is widely considered as one of the most important ideas in recent years. In 2016, Yann LeCun even said that it was “the most interesting idea in the last 10 years in Machine Learning.”

     We will start by exploring in more depth how autoencoders work and how to use them for dimensionality reduction, feature extraction, unsupervised pretraining, or as generative models. This will naturally lead us to GANs. We will then build a simple GAN to generate fake images, but we will see that training is often quite difficult. We will discuss the main difficulties you will encounter with adversarial training, as well as some of the main techniques to work around these difficulties. Let’s start with autoencoders!

Efficient Data Representations

Which of the following number sequences do you find the easiest to memorize?

  • 40, 27, 25, 36, 81, 57, 10, 73, 19, 68
  • 50, 48, 46, 44, 42, 40, 38, 36, 34, 32, 30, 28, 26, 24, 22, 20, 18, 16, 14

     At first glance, it would seem that the first sequence should be easier, since it is much shorter. However, if you look carefully at the second sequence, you will notice that it is just the list of even numbers from 50 down to 14. Once you notice this pattern, the second sequence becomes much easier to memorize than the first because you only need to remember the pattern (i.e., decreasing even numbers) and the starting and ending numbers (i.e., 50 and 14). Note that if you could quickly and easily memorize very long sequences, you would not care much about the existence of a pattern in the second sequence. You would just learn every number by heart, and that would be that. The fact that it is hard to memorize long sequences is what makes it useful to recognize patterns, and hopefully this clarifies why constraining an autoencoder during training pushes it to discover and exploit patterns in the data.

     The relationship between memory, perception, and pattern matching was famously studied by William Chase and Herbert Simon in the early 1970s.(William G. Chase and Herbert A. Simon, “Perception in Chess,” Cognitive Psychology 4, no. 1 (1973): 55–81.) They observed that expert chess players were able to memorize the positions of all the pieces in a game by looking at the board for just five seconds, a task that most people would find impossible. However, this was only the case when the pieces were placed in realistic positions (from actual games), not when the pieces were placed randomly. Chess experts don’t have a much better memory than you and I; they just see chess patterns more easily, thanks to their experience with the game. Noticing patterns helps them store information efficiently.

     Just like the chess players in this memory experiment, an autoencoder looks at the inputs, converts them to an efficient latent representation, and then spits out something that (hopefully) looks very close to the inputs. An autoencoder is always composed of two parts: an encoder (or recognition network) that converts the inputs to a latent representation, followed by a decoder (or generative network) that converts the internal representation to the outputs (see Figure 17-1).
Figure 17-1. The chess memory experiment (left) and a simple autoencoder (right)

     As you can see, an autoencoder typically has the same architecture as a Multi-Layer Perceptron (MLP; see Cp10 https://blog.csdn.net/Linli522362242/article/details/111940633
), except that the number of neurons in the output layer must be equal to the number of inputs. In this example, there is just one hidden layer composed of two neurons (the encoder), and one output layer composed of three neurons (the decoder). The outputs are often called the reconstructions because the autoencoder tries to reconstruct the inputs, and the cost function contains a reconstruction loss that penalizes the model when the reconstructions are different from the inputs.

     Because the internal representation(OR latent representation) has a lower dimensionality than the input data (it is 2D instead of 3D), the autoencoder is said to be undercomplete. An undercomplete autoencoder cannot trivially copy its inputs to the codings(OR latent representations), yet it must find a way to output a copy of its inputs. It is forced to learn the most important features in the input data (and drop the unimportant ones).

Let’s see how to implement a very simple undercomplete autoencoder for dimensionality reduction.

Performing PCA with an Undercomplete Linear Autoencoder

     If the autoencoder uses only linear activations and the cost function is the mean squared error (MSE), then it ends up performing Principal Component Analysis (PCA; see Cp8 https://blog.csdn.net/Linli522362242/article/details/105139547).

PCA with a linear Autoencoder

Build 3D dataset:

The following code builds a simple linear autoencoder to perform PCA on a 3D dataset, projecting it to 2D:

import numpy as np

def generate_3d_data( m, w1=0.1, w2=0.3, noise=0.1 ):
    angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
    
    data = np.empty( (m, 3) )
    data[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(m) / 2
    data[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
    data[:, 2] = data[:, 0] * w1 + data[:, 1] * w2 + noise * np.random.randn(m)
    return data

X_train = generate_3d_data(60)
X_train = X_train - X_train.mean(axis=0, keepdims=0) # center the data (broadcasting subtracts the per-column means)

Now let's build the Autoencoder

from tensorflow import keras
import tensorflow as tf

np.random.seed(42)
tf.random.set_seed(42)

encoder = keras.models.Sequential([
    keras.layers.Dense(2, input_shape=[3]),
])

decoder = keras.models.Sequential([
    keras.layers.Dense(3, input_shape=[2]),
])

autoencoder = keras.models.Sequential([encoder, decoder])

autoencoder.compile( loss="mse", optimizer=keras.optimizers.SGD(lr=1.5))

This code is really not very different from all the MLPs we built in past chapters, but there are a few things to note:

  • We organized the autoencoder into two subcomponents: the encoder and the decoder. Both are regular Sequential models with a single Dense layer each, and the autoencoder is a Sequential model containing the encoder followed by the decoder (remember that a model can be used as a layer in another model).
  • The autoencoder’s number of outputs is equal to the number of inputs (i.e., 3).
  • To perform simple PCA, we do not use any activation function (i.e., all neurons are linear), and the cost function is the MSE. We will see more complex autoencoders shortly.

Now let’s train the model on a simple generated 3D dataset and use it to encode that same dataset (i.e., project it to 2D): 

history = autoencoder.fit(X_train, X_train, epochs=20)

codings = encoder.predict(X_train) #latent representations

     Note that the same dataset, X_train, is used as both the inputs and the targets. Figure 17-2 shows the original 3D dataset (on the left) and the output of the autoencoder’s hidden layer (i.e., the coding layer, on the right). As you can see, the autoencoder found the best 2D plane to project the data onto, preserving as much variance in the data as it could (just like PCA).
Figure 17-2. PCA performed by an undercomplete linear autoencoder
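The figure can be reproduced with a short plotting sketch like the following (a minimal sketch assuming the X_train and codings arrays from the cells above; the notebook's own plotting code may differ in styling):

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection (not needed on recent matplotlib)

fig = plt.figure(figsize=(9, 4))

# left: the original (centered) 3D dataset
ax3d = fig.add_subplot(1, 2, 1, projection="3d")
ax3d.scatter(X_train[:, 0], X_train[:, 1], X_train[:, 2], s=10)
ax3d.set_title("Original 3D dataset")

# right: the 2D codings produced by the encoder
ax2d = fig.add_subplot(1, 2, 2)
ax2d.scatter(codings[:, 0], codings[:, 1], s=10)
ax2d.set_xlabel("$z_1$")
ax2d.set_ylabel("$z_2$")
ax2d.set_title("2D codings (latent representations)")

plt.show()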

     You can think of autoencoders as a form of self-supervised learning (i.e., using a supervised learning technique with automatically generated labels, in this case simply equal to the inputs; e.g., X_train is used as both the inputs and the targets).

     Self-supervised learning is when you automatically generate the labels from the data itself, then you train a model on the resulting “labeled” dataset using supervised learning techniques. Since this approach requires no human labeling whatsoever, it is best classified as a form of unsupervised learning.

Stacked Autoencoders

     Just like other neural networks we have discussed, autoencoders can have multiple hidden layers. In this case they are called stacked autoencoders (or deep autoencoders). Adding more layers helps the autoencoder learn more complex codings. That said, one must be careful not to make the autoencoder too powerful. Imagine an encoder so powerful that it just learns to map each input to a single arbitrary number (and the decoder learns the reverse mapping). Obviously such an autoencoder will reconstruct the training data perfectly, but it will not have learned any useful data representation in the process (and it is unlikely to generalize well to new instances).

     The architecture of a stacked autoencoder is typically symmetrical with regard to the central hidden layer (the coding layer). To put it simply, it looks like a sandwich. For example, an autoencoder for MNIST (introduced in Cp3 https://blog.csdn.net/Linli522362242/article/details/103786116) may have 784 inputs, followed by a hidden layer with 100 neurons, then a central hidden layer of 30 neurons, then another hidden layer with 100 neurons, and an output layer with 784 neurons. This stacked autoencoder is represented in Figure 17-3.
Figure 17-3. Stacked autoencoder

Implementing a Stacked Autoencoder Using Keras

     You can implement a stacked autoencoder very much like a regular deep MLP. In particular, the same techniques we used in Cp11 https://blog.csdn.net/Linli522362242/article/details/106935910 for training deep nets can be applied. For example, the following code builds a stacked autoencoder for Fashion MNIST (loaded and normalized as in Cp10 https://blog.csdn.net/Linli522362242/article/details/106562190), using the SELU activation function:

Let's use Fashion MNIST:

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

X_train_full.shape, y_train_full.shape

X_train_full = X_train_full.astype( np.float32 )/255
X_test = X_test.astype( np.float32 )/255
X_train, X_valid = X_train_full[:-5000], X_train_full[-5000:]
y_train, y_valid = y_train_full[:-5000], y_train_full[-5000:]

Train all layers at once

Let's build a stacked Autoencoder with 3 hidden layers and 1 output layer (i.e., 2 stacked Autoencoders).

def rounded_accuracy(y_true, y_pred):
    return keras.metrics.binary_accuracy( tf.round(y_true), tf.round(y_pred) )# default threshold=0.5

accuracy: https://zhuanlan.zhihu.com/p/95293440 
     Losses and metrics are conceptually not the same thing: losses (e.g., cross entropy) are used by Gradient Descent (the general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function, and a cost function measures how bad your model is; https://blog.csdn.net/Linli522362242/article/details/104005906) to train a model, so they must be differentiable (at least where they are evaluated), and their gradients should not be 0 everywhere. Plus, it’s OK if they are not easily interpretable by humans. In contrast, metrics (e.g., accuracy) are used to evaluate a model: they must be more easily interpretable, and they can be non-differentiable or have 0 gradients everywhere. https://blog.csdn.net/Linli522362242/article/details/107294292
https://blog.csdn.net/Linli522362242/article/details/108414534

https://blog.csdn.net/Linli522362242/article/details/106849041

tf.random.set_seed(42)
np.random.seed(42)

stacked_encoder = keras.models.Sequential([
    keras.layers.Flatten( input_shape=[28,28] ), #==>28*28 to match dense's requirement
    keras.layers.Dense(100, activation='selu'), # default kernel_initializer="glorot_uniform" for weights or kernel
    keras.layers.Dense(30, activation='selu'),
])

stacked_decoder = keras.models.Sequential([
    keras.layers.Dense(100, activation='selu', input_shape=[30]),
    keras.layers.Dense(28*28, activation='sigmoid'),
    keras.layers.Reshape([28,28])
])

stacked_ae = keras.models.Sequential([stacked_encoder, stacked_decoder])
stacked_ae.compile( loss='binary_crossentropy', # https://blog.csdn.net/Linli522362242/article/details/108414534
                    optimizer=keras.optimizers.SGD(lr=1.5), 
                    metrics=[rounded_accuracy] )
history = stacked_ae.fit( X_train, X_train, epochs=20, 
                          validation_data=(X_valid, X_valid) )

Let’s go through this code:

  • Just like earlier, we split the autoencoder model into two submodels: the encoder and the decoder.
  • The encoder takes 28 × 28–pixel grayscale images, flattens them so that each image is represented as a vector of size 784, then processes these vectors through two Dense layers of diminishing sizes (100 units then 30 units), both using the SELU activation function (SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic sigmoid https://blog.csdn.net/Linli522362242/article/details/106935910)
    (you may want to add LeCun normal initialization for kernel_initializer, which samples the weights from a normal distribution with mean 0 and standard deviation σ = sqrt(1 / fan_in); see the short sketch after this list. The network is not very deep, though, so it won’t make a big difference). For each input image, the encoder outputs a vector of size 30.
  • The decoder takes codings of size 30 (output by the encoder) and processes them through two Dense layers of increasing sizes (100 units then 784 units), and it reshapes the final vectors into 28 × 28 arrays so the decoder’s outputs have the same shape as the encoder’s inputs.
  • When compiling the stacked autoencoder, we use the binary cross-entropy loss instead of the mean squared error. We are treating the reconstruction task as a multilabel binary classification problem (https://blog.csdn.net/Linli522362242/article/details/103866244): each pixel’s (or each label’s) intensity represents the probability that the pixel should be black. Framing it this way (rather than as a regression problem) tends to make the model converge faster. (You might be tempted to use the accuracy metric, but it would not work properly, since this metric expects the labels to be either 0 or 1 for each pixel. You can easily work around this problem by creating a custom metric that computes the accuracy after rounding the targets and predictions to 0 or 1.)
  • Finally, we train the model using X_train as both the inputs and the targets (and similarly, we use X_valid as both the validation inputs and targets).
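Here is the short sketch mentioned above (my own illustration, not part of the original notebook): the encoder's Dense layers with LeCun normal initialization spelled out explicitly, which is the initializer that pairs with SELU.

# Hedged sketch: same encoder as above, but with explicit LeCun normal initialization.
# Not required here (the network is shallow), shown only for completeness.
lecun_encoder = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(30, activation="selu", kernel_initializer="lecun_normal"),
])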


... ...

Visualizing the Reconstructions

     One way to ensure that an autoencoder is properly trained is to compare the inputs and the outputs: the differences should not be too significant. Let’s plot a few images from the validation set, as well as their reconstructions:

This function processes a few test images through the autoencoder and displays the original images and their reconstructions:

def plot_image(image):
    plt.imshow(image, cmap="binary")
    plt.axis("off")

def show_reconstructions(model, images=X_valid, n_images=5):
    reconstructions = model.predict( images[:n_images] )
    fig = plt.figure( figsize=(n_images*1.5, 3) )
    for image_index in range(n_images):
        plt.subplot( 2, n_images, 1+image_index)
        plot_image( images[image_index] )
        
        plt.subplot( 2, n_images, n_images+1+image_index )
        plot_image( reconstructions[image_index] )

show_reconstructions( stacked_ae )

Figure 17-4 shows the resulting images.
Figure 17-4. Original images (top) and their reconstructions (bottom)

     The reconstructions are recognizable, but a bit too lossy. We may need to train the model for longer, or make the encoder and decoder deeper, or make the codings larger. But if we make the network too powerful, it will manage to make perfect reconstructions without having learned any useful patterns in the data. For now, let’s go with this model.

Visualizing the Fashion MNIST Dataset

     Now that we have trained a stacked autoencoder, we can use it to reduce the dataset’s dimensionality. For visualization, this does not give great results compared to other dimensionality reduction algorithms (such as those we discussed in Cp8 https://blog.csdn.net/Linli522362242/article/details/105139547), but one big advantage of autoencoders is that they can handle large datasets, with many instances and many features. So one strategy is to use an autoencoder to reduce the dimensionality down to a reasonable level, then use another dimensionality reduction algorithm for visualization. Let’s use this strategy to visualize Fashion MNIST.

  • First, we use the encoder from our stacked autoencoder to reduce the dimensionality down to 30,
  • then we use Scikit-Learn’s implementation of the t-SNE algorithm to reduce the dimensionality down to 2 for visualization:
    np.random.seed(42)
    
    from sklearn.manifold import TSNE
    
    X_valid_compressed = stacked_encoder.predict(X_valid) # reduce the dimensionality down to 30
    tsne = TSNE()
    X_valid_2D = tsne.fit_transform( X_valid_compressed )
    X_valid_2D = ( X_valid_2D-X_valid_2D.min() ) / ( X_valid_2D.max()-X_valid_2D.min() ) # min-max scaling
    t-SNE : https://blog.csdn.net/Linli522362242/article/details/105722461
plt.scatter( X_valid_2D[:, 0], X_valid_2D[:, 1], c=y_valid, s=10, cmap='tab10')
plt.axis("off")
plt.show()

 
Let's make this diagram a bit prettier:

# adapted from https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html
import matplotlib as mpl
%matplotlib inline

plt.figure( figsize=(10,8) )
cmap = plt.cm.tab10

plt.scatter( X_valid_2D[:,0], X_valid_2D[:,1], c=y_valid, s=10, cmap=cmap )

image_positions = np.array([ [1., 1.] ])
for index, position in enumerate( X_valid_2D ):
  dist = np.sum( (position-image_positions)**2, 
                 axis=1 ) # broadcasting
  if np.min(dist) > 0.02: # if far enough from 'other images'
      image_positions = np.r_[ image_positions, 
                               [position] ]# append current image's position
      imagebox = mpl.offsetbox.AnnotationBbox(
                        mpl.offsetbox.OffsetImage( X_valid[index],
                                                   cmap='binary' 
                                                 ),
                        position, #######
                        bboxprops ={ 'edgecolor': cmap(y_valid[index]), 'lw':2 }                           
                 )
      plt.gca().add_artist( imagebox ) # Add any Artist to the container box
plt.axis('off')
plt.show()

 Figure 17-5. Fashion MNIST visualization using an autoencoder followed by t-SNE

Figure 17-5 shows the resulting scatterplot (beautified a bit by displaying some of the images). The t-SNE algorithm identified several clusters which match the classes reasonably well (each class is represented with a different color).

So, autoencoders can be used for dimensionality reduction. Another application is for unsupervised pretraining.

Unsupervised Pretraining Using Stacked Autoencoders

     As we discussed in Chapter 11, if you are tackling a complex supervised task but you do not have a lot of labeled training data, one solution is to find a neural network that performs a similar task and reuse its lower layers(
########################################
transfer learning , unsupervised pretraining and Pretraining on an Auxiliary Task https://blog.csdn.net/Linli522362242/article/details/106982127

Transfer Learning with Keras

Figure 11-4. Reusing pretrained layers https://blog.csdn.net/Linli522362242/article/details/106935910

  1. If the input pictures of your new task don’t have the same size as the ones used in the original task, you will usually have to add a preprocessing step to resize them to the size expected by the original model. More generally, transfer learning will work best when the inputs have similar low-level features.
    X_train_A.shape, X_train_B.shape
  2. The output layer of the original model should usually be replaced because it is most likely not useful at all for the new task, and it may not even have the right number of outputs for the new task.
    import tensorflow as tf
    import numpy as np
     
    tf.random.set_seed(42)
    np.random.seed(42)
     
    #Construction Phase
    model_A = keras.models.Sequential()
    model_A.add( keras.layers.Flatten(input_shape=[28,28]) ) # to 1D
    for n_hidden in (300, 100, 50, 50, 50):
        model_A.add( keras.layers.Dense(n_hidden, activation="selu") )#hidden layers
        
    model_A.add(keras.layers.Dense(8, activation="softmax")) # output layers #########
     
    model_A.compile( loss="sparse_categorical_crossentropy", 
                     optimizer=keras.optimizers.SGD(lr=1e-3),
                     metrics=["accuracy"] )
    
    history = model_A.fit(X_train_A, y_train_A, epochs=20, validation_data=(X_valid_A, y_valid_A))
  3. Similarly, the upper hidden layers of the original model are less likely to be as useful as the lower layers, since the high-level features that are most useful for the new task may differ significantly from the ones that were most useful for the original task. You want to find the right number of layers to reuse.
    ... ...
  4. # First, you need to load model A and create a new model based on that model’s layers. 
    
    model_A = keras.models.load_model("my_model_A.h5")
    # reuse all the layers except for the output layer
    model_B_on_A = keras.models.Sequential( model_A.layers[:-1] )
    model_B_on_A.add( keras.layers.Dense(1, activation='sigmoid') )###########
  5. Note that model_A and model_B_on_A now share some layers. When you train model_B_on_A, it will also affect model_A. If you want to avoid that, you need to clone model_A before you reuse its layers. To do this, you clone model A’s architecture with clone_model(), then copy its weights (since clone_model() does not clone the weights):
    model_A_clone = keras.models.clone_model( model_A )
    model_A_clone.set_weights( model_A.get_weights() )
  6. Try freezing all the reused layers first (i.e., make their weights non-trainable so that Gradient Descent won’t modify them), then train your model and see how it performs.
    # set every layer’s trainable attribute to False and compile the model:
    for layer in model_B_on_A.layers[:-1]:
        layer.trainable = False
    
    # Note: You must always compile your model after you freeze or unfreeze layers.    
    model_B_on_A.compile( loss="binary_crossentropy",
                          optimizer=keras.optimizers.SGD(lr=1e-3),
                          metrics=['accuracy']
                        )
    
    history = model_B_on_A.fit( X_train_B, y_train_B, epochs=4,
                                validation_data=(X_valid_B, y_valid_B) )
  7. Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. The more training data you have, the more layers you can unfreeze. It is also useful to reduce the learning rate when you unfreeze reused layers: this will avoid wrecking their fine-tuned weights. 
for layer in model_B_on_A.layers[:-1]: #unfreeze
    layer.trainable=True
    
model_B_on_A.compile( loss="binary_crossentropy",
                      optimizer=keras.optimizers.SGD(lr=1e-3),
                      metrics=['accuracy']
                    )
 
history = model_B_on_A.fit( X_train_B, y_train_B, epochs=16,
                            validation_data=(X_valid_B, y_valid_B) )
model_B_on_A.evaluate(X_test_B, y_test_B)

 

     If you still cannot get good performance, and you have little training data, try dropping the top hidden layer(s) and freezing all the remaining hidden layers again. You can iterate until you find the right number of layers to reuse. If you have plenty of training data, you may try replacing the top hidden layers instead of dropping them, and even adding more hidden layers.

Unsupervised Pretraining

Figure 11-5. In unsupervised training, a model is trained on the unlabeled data (or on all the data) using an unsupervised learning technique, then it is fine-tuned for the final task on the labeled data using a supervised learning technique; the unsupervised part may train one layer at a time as shown here, or it may train the full model directly.

     Note that in the early days of Deep Learning it was difficult to train deep models, so people would use a technique called greedy layer-wise pretraining (depicted in Figure 11-5). They would

  1. first train an unsupervised model with a single layer(Hidden1), typically an RBM(restricted Boltzmann machines),
  2. then they would freeze that layer( Hidden1 ) and add another one( Hidden2 ) on top of it, then train the model again (effectively just training the new layer),
  3. then freeze the new layer( Hidden2 ) and add another layer( Hidden3 ) on top of it, train the model again,
  4. and so on( Finally, it is fine-tuned for the final task on the labeled data using a supervised learning technique).

Nowadays, things are much simpler: people generally train the full unsupervised model in one shot (i.e., in Figure 11-5, just start directly at step three) and use autoencoders or GANs rather than RBMs(https://blog.csdn.net/Linli522362242/article/details/106982127).

Pretraining on an Auxiliary Task

     If you do not have much labeled training data, one last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first neural network’s lower layers will learn feature detectors that will likely be reusable by the second neural network.

     For example, if you want to build a system to recognize faces, you may only have a few pictures of each individual—clearly not enough to train a good classifier. Gathering hundreds of pictures of each person would not be practical. You could, however, gather a lot of pictures of random people on the web and train a first neural network to detect whether or not two different pictures feature the same person. Such a network would learn good feature detectors for faces, so reusing its lower layers would allow you to train a good face classifier that uses little training data.

########################################https://blog.csdn.net/Linli522362242/article/details/106982127
). This makes it possible to train a high-performance model using little training data because your neural network won’t have to learn all the low-level features; it will just reuse the feature detectors learned by the existing network.

     Similarly, if you have a large dataset but most of it is unlabeled, you can first train a stacked autoencoder using all the data, then reuse the lower layers to create a neural network for your actual task and train it using the labeled data. For example, Figure 17-6 shows how to use a stacked autoencoder to perform unsupervised pretraining for a classification neural network. When training the classifier, if you really don’t have much labeled training data, you may want to freeze the pretrained layers (at least the lower ones).
Figure 17-6. Unsupervised pretraining using autoencoders

     Having plenty of unlabeled data and little labeled data is common. Building a large unlabeled dataset is often cheap (e.g., a simple script can download millions of images off the internet), but labeling those images (e.g., classifying them as cute or not) can usually be done reliably only by humans. Labeling instances is time-consuming and costly, so it’s normal to have only a few thousand human-labeled instances.

     There is nothing special about the implementation: just train an autoencoder (which learns useful data representations in the process) using all the training data (labeled plus unlabeled), then reuse its encoder layers to create a new neural network (see the exercises at the end of this chapter for an example, and the hedged sketch below).
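As a rough sketch of what this looks like in code (my own illustration, not from the original text; it reuses the stacked_encoder trained above and assumes the labeled Fashion MNIST splits defined earlier):

# Hedged sketch: reuse the pretrained encoder as the lower layers of a classifier.
pretrained_clf = keras.models.Sequential([
    stacked_encoder,                               # Flatten + Dense(100) + Dense(30), already trained
    keras.layers.Dense(10, activation="softmax"),  # new classification head (10 Fashion MNIST classes)
])

# With little labeled data, freeze the pretrained layers first (recompile after freezing).
stacked_encoder.trainable = False
pretrained_clf.compile(loss="sparse_categorical_crossentropy",
                       optimizer=keras.optimizers.SGD(lr=0.02),
                       metrics=["accuracy"])

# Train on whatever labeled subset is available, e.g. only 500 labeled images:
# history = pretrained_clf.fit(X_train[:500], y_train[:500], epochs=20,
#                              validation_data=(X_valid, y_valid))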

     Next, let’s look at a few techniques for training stacked autoencoders.

Tying Weights

     When an autoencoder is neatly symmetrical, like the one we just built, a common technique is to tie the weights of the decoder layers to the weights of the encoder layers (by simply using the transpose of the encoder's weights as the decoder weights; for this, we need to use a custom layer). This halves the number of weights in the model, speeding up training and limiting the risk of overfitting. Specifically, if the autoencoder has a total of N layers (not counting the input layer), and W_L represents the connection weights of the Lth layer (e.g., layer 1 is the first hidden layer, layer N/2 is the coding layer, and layer N is the output layer), then the decoder layer weights can be defined simply as W_(N–L+1) = W_L^T (with L = 1, 2, …, N/2).

e.g

  • Our autoencoder has N = 4 layers (not counting the input layer),
  • the 1st decoder hidden layer is layer 3 of the autoencoder, and its weights are W_3 = W_2^T (since 4 − 3 + 1 = 2, and W_2 comes from the 2nd hidden layer of the encoder),
  • and the 2nd decoder layer (the output layer) is layer 4, and its weights are W_4 = W_1^T (since 4 − 4 + 1 = 1, and W_1 comes from the 1st hidden layer of the encoder).

To tie weights between layers using Keras, let’s define a custom layer:

class DenseTranspose( keras.layers.Layer ):
  def __init__( self, dense, activation=None, **kwargs ):
    self.dense = dense
    self.activation = keras.activations.get( activation )
    super().__init__( **kwargs )

  def build( self, batch_input_shape ):
    self.biases = self.add_weight( name="bias",
                                   shape=[ self.dense.input_shape[-1] ], # uses 100 for the DenseTranspose( dense_2, activation="selu" )
                                   initializer='zeros' )
    super().build( batch_input_shape ) #batch_input_shape= (batch_size, input_dimensions)

  def call( self, inputs ):
    # for the DenseTranspose( dense_2, activation="selu" )
    # inputs:  Tensor("Placeholder:0", shape=(None, 30), dtype=float32)
    # self.dense.weights:  
    #           [<tf.Variable 'dense_1/kernel:0' shape=(100, 30) dtype=float32>, 
    #            <tf.Variable 'dense_1/bias:0' shape=(30,) dtype=float32> # can't be used #############################################
    #           ]
    z = tf.matmul( inputs, self.dense.weights[0], 
                   transpose_b = True # for the second argument is transposed 
                 )                    # before multiplication # self.dense.weights[0] ==> (30,100)        
    return self.activation( z+self.biases )

     This custom layer acts like a regular Dense layer, but it uses another Dense layer’s weights, transposed (setting transpose_b=True is equivalent to transposing the second argument, but it’s more efficient as it performs the transposition on the fly within the matmul() operation). However, it uses its own bias vector (not the other Dense layer’s bias). Next, we can build a new stacked autoencoder, much like the previous one, but with the decoder’s Dense layers tied to the encoder’s Dense layers:

keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

dense_1 = keras.layers.Dense( 100, activation="selu" ) 
dense_2 = keras.layers.Dense( 30, activation="selu" )

tied_encoder = keras.models.Sequential([
                                keras.layers.Flatten( input_shape=[28,28] ), # 784=28*28
                                dense_1, # weight shape: (784, 100) # input_shape(?,784 neurons)
                                dense_2, # weight shape: (100, 30)  # input_shape(?,100 neurons)                                      
               ]) # output==> (batch_size, 30)
tied_decoder = keras.models.Sequential([
                                DenseTranspose( dense_2, activation="selu" ),
                                DenseTranspose( dense_1, activation="sigmoid" ),
                                keras.layers.Reshape([28,28])
               ])

tied_ae = keras.models.Sequential([ tied_encoder, tied_decoder ])

tied_ae.compile( loss="binary_crossentropy",
                 optimizer = keras.optimizers.SGD(lr=1.5),
                 metrics=[rounded_accuracy] )
history = tied_ae.fit( X_train, X_train, epochs=10, 
                       validation_data=(X_valid, X_valid) )

     As before, we compile the tied autoencoder with the binary cross-entropy loss (treating the reconstruction as a multilabel binary classification problem) and the custom rounded_accuracy metric defined earlier.

This model achieves a very slightly lower reconstruction error than the previous model, with almost half as many parameters.
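To check that claim, you can compare the two models' parameter counts (a quick hedged check, assuming both stacked_ae and tied_ae are still defined in the session):

# The tied model keeps only the encoder's kernels (plus separate bias vectors),
# so its parameter count is close to half of the untied model's.
print("stacked_ae parameters:", stacked_ae.count_params())
print("tied_ae parameters:   ", tied_ae.count_params())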

import matplotlib.pyplot as plt

def plot_image(image):
    plt.imshow(image, cmap="binary")
    plt.axis("off")

def show_reconstructions(model, images=X_valid, n_images=5):
    reconstructions = model.predict( images[:n_images] )
    fig = plt.figure( figsize=(n_images*1.5, 3) )
    for image_index in range(n_images):
        plt.subplot( 2, n_images, 1+image_index)
        plot_image( images[image_index] )
        
        plt.subplot( 2, n_images, n_images+1+image_index )
        plot_image( reconstructions[image_index] )

show_reconstructions(tied_ae)
plt.show()

Training One Autoencoder at a Time

     Rather than training the whole stacked autoencoder in one go like we just did, it is possible to train one shallow autoencoder at a time, then stack all of them into a single stacked autoencoder (hence the name), as shown in Figure 17-7. This technique is not used as much these days, but you may still run into papers that talk about “greedy layerwise training,” so it’s good to know what it means.
Figure 17-7. Training one autoencoder at a time

def train_autoencoder( n_neurons, X_train, X_valid, loss, optimizer,
                       n_epochs=10, output_activation=None, metrics=None ):
  n_inputs = X_train.shape[-1]

  encoder = keras.models.Sequential([
      keras.layers.Dense( n_neurons, activation="selu", input_shape=[n_inputs] )                                     
  ])

  decoder = keras.models.Sequential([
      keras.layers.Dense( n_inputs, activation=output_activation ),                               
  ])

  autoencoder = keras.models.Sequential([ encoder, decoder ])

  autoencoder.compile( optimizer, loss, metrics=metrics )
  autoencoder.fit( X_train, X_train, epochs=n_epochs,
                   validation_data=( X_valid, X_valid )
                 )
  return encoder, decoder, encoder( X_train ), encoder( X_valid )

     During the first phase of training,

  • the first autoencoder learns to reconstruct the inputs.
  • Then we encode the whole training set using this first autoencoder, and this gives us a new (compressed) training set.

the second phase of training

  • We then train a second autoencoder on this new dataset.
tf.random.set_seed(42)
np.random.seed(42)

K = keras.backend
X_train_flat = K.batch_flatten( X_train ) # equivalent to .reshape(-1, 28 * 28)
X_valid_flat = K.batch_flatten( X_valid )

# 1st phase of training
enc1, dec1, X_train_enc1, X_valid_enc1 = train_autoencoder( 
    100, X_train_flat, X_valid_flat, "binary_crossentropy",
     keras.optimizers.SGD( lr=1.5 ), output_activation="sigmoid", metrics=[rounded_accuracy] )

# 2nd phase of training
# X_train_enc1, X_valid_enc1 ==>a new (compressed) training/valid set
enc2, dec2, _, _ = train_autoencoder(
    30, X_train_enc1, X_valid_enc1, "mse", 
    keras.optimizers.SGD( lr=0.05 ), output_activation = "selu",
)

1st phase of training in 1st autoencoder 

 
2nd phase of training in 2nd autoencoder 

Finally, we build a big sandwich using all these autoencoders, as shown in Figure 17-7 (i.e., we first stack the hidden layers of each autoencoder, then the output layers in reverse order). This gives us the final stacked autoencoder (see the “Training One Autoencoder at a Time” section in  the notebook for an implementation). We could easily train more autoencoders this way, building a very deep stacked autoencoder.

stacked_ae_1_by_1 = keras.models.Sequential([
                                    keras.layers.Flatten( input_shape=[28,28] ),
                                    enc1, 
                                    enc2, 
                                    dec2, 
                                    dec1,
                                    keras.layers.Reshape([ 28,28 ]),                                             
                    ])

show_reconstructions( stacked_ae_1_by_1 )
plt.show()

 

stacked_ae_1_by_1.compile( loss="binary_crossentropy",
                           optimizer = keras.optimizers.SGD( lr=0.1 ), 
                           metrics=[rounded_accuracy] 
                         )
history = stacked_ae_1_by_1.fit( X_train, X_train, epochs=10,
                                 validation_data=(X_valid, X_valid)
                               )

 

show_reconstructions( stacked_ae_1_by_1 )
plt.show()

 

     As we discussed earlier, one of the triggers of the current tsunami of interest in Deep Learning was the discovery in 2006 by Geoffrey Hinton et al. that deep neural networks can be pretrained in an unsupervised fashion, using this greedy layerwise approach. They used restricted Boltzmann machines (RBMs; see https://blog.csdn.net/Linli522362242/article/details/106982127) for this purpose, but in 2007 Yoshua Bengio et al. showed (Yoshua Bengio et al., “Greedy Layer-Wise Training of Deep Networks,” Proceedings of the 19th International Conference on Neural Information Processing Systems (2006): 153–160.) that autoencoders worked just as well. For several years this was the only efficient way to train deep nets, until many of the techniques introduced in Cp11 https://blog.csdn.net/Linli522362242/article/details/106935910 made it possible to just train a deep net in one shot.

Convolutional Autoencoders

     If you are dealing with images, then the autoencoders we have seen so far will not work well (unless the images are very small): as we saw in Cp14 https://blog.csdn.net/Linli522362242/article/details/108669444, convolutional neural networks are far better suited than dense networks to work with images. So if you want to build an autoencoder for images (e.g., for unsupervised pretraining or dimensionality reduction), you will need to build a convolutional autoencoder.(Jonathan Masci et al., “Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction,” Proceedings of the 21st International Conference on Artificial Neural Networks 1 (2011): 52–59.) The encoder is a regular CNN composed of convolutional layers and pooling layers. It typically reduces the spatial dimensionality of the inputs (i.e., height and width) while increasing the depth (i.e., the number of feature maps). The decoder must do the reverse (upscale the image and reduce its depth back to the original dimensions), and for this you can use transpose convolutional layers (alternatively, you could combine upsampling layers with convolutional layers). Here is a simple convolutional autoencoder for Fashion MNIST:

Let's build a convolutional autoencoder for Fashion MNIST (a convolutional encoder followed by a transposed-convolutional decoder).
Convolutional output shape: with padding="SAME" and stride s, each spatial dimension becomes ceil(input size / s); with padding="VALID", it becomes floor((input size − kernel size) / s) + 1.

Transposed convolution padding: https://towardsdatascience.com/understand-transposed-convolutions-and-build-your-own-transposed-convolution-layer-from-scratch-4f5d97b2967; with padding="SAME", output size = input size × stride (e.g., 7 × 2 = 14) https://blog.csdn.net/Linli522362242/article/details/108669444

tf.random.set_seed( 42 )
np.random.seed( 42 )

conv_encoder = keras.models.Sequential([
  keras.layers.Reshape( [28, 28, 1], input_shape=[28,28] ),

  keras.layers.Conv2D( filters=16, kernel_size=3, padding="SAME", 
                       activation="selu" ),                         #==> (None, 28, 28, 16)
  keras.layers.MaxPool2D(pool_size=2),#strides default to pool_size #==> (None, 14, 14, 16)
  # max pooling preserves only the strongest features, getting rid of all the meaningless ones

  keras.layers.Conv2D( filters=32, kernel_size=3, padding="SAME", 
                       activation="selu" ),                         #==> (None, 14, 14, 32)
  keras.layers.MaxPool2D(pool_size=2),                              #==> (None, 7, 7, 32)

  keras.layers.Conv2D( filters=64, kernel_size=3, padding="SAME",   #==> (None, 7, 7, 64)
                       activation="selu" ),
  keras.layers.MaxPool2D(pool_size=2),                              #==> (None, 3, 3, 64)                 
]) # output ==> (None, 3, 3, 64)
                                         
conv_decoder = keras.models.Sequential([
  keras.layers.Conv2DTranspose( 32, kernel_size=3, strides=2, padding="VALID", # padding="VALID" since 3*2 !=7
                                activation="selu", input_shape=[3,3,64] ), #==> (None, 7, 7, 32) 
  keras.layers.Conv2DTranspose( 16, kernel_size=3, strides=2, padding="SAME",  # 7x2 (strides)=14 and padding=0.5
                                activation="selu" ),                       #==> (None, 14, 14, 16)
  keras.layers.Conv2DTranspose( 1, kernel_size=3, strides=2, padding="SAME",   # 14x2(strides)=28
                                activation="sigmoid" ),                    #==> (None, 28, 28, 1)
  keras.layers.Reshape([28,28])                                                                                                                            
])

conv_ae = keras.models.Sequential([ conv_encoder, conv_decoder ])
conv_ae.compile( loss="binary_crossentropy",
                 optimizer=keras.optimizers.SGD( lr=1.0 ),
                 metrics=[rounded_accuracy]
               )
history = conv_ae.fit( X_train, X_train, epochs=5, 
                       validation_data=(X_valid, X_valid)
                     )

conv_encoder.summary()

conv_decoder.summary()

show_reconstructions( conv_ae )
plt.show()

Recurrent Autoencoders

     If you want to build an autoencoder for sequences, such as time series or text (e.g., for unsupervised learning or dimensionality reduction), then recurrent neural networks (see Cp15 https://blog.csdn.net/Linli522362242/article/details/114941730) may be better suited than dense networks. Building a recurrent autoencoder is straightforward:

  • the encoder is typically a sequence-to-vector RNN which compresses the input sequence down to a single vector.
  • The decoder is a vector-to-sequence RNN that does the reverse:
recurrent_encoder = keras.models.Sequential([
  keras.layers.LSTM( 100, return_sequences=True, input_shape=[28,28] ), #[None, 28,28] ==> (None, 28, 100) 
  keras.layers.LSTM( 30 ) #return_sequences=False                #==> (None, 30)                                       
])

recurrent_decoder = keras.models.Sequential([
  keras.layers.RepeatVector( 28, input_shape=[30] ), #==> (None, 28, 30)
  keras.layers.LSTM( 100, return_sequences=True ),   #==> (None, 28, 100) 
  keras.layers.TimeDistributed( keras.layers.Dense( 28, activation="sigmoid" ) )                                           
])                                                   #==> (None, 28, 28)

recurrent_ae = keras.models.Sequential([ recurrent_encoder, recurrent_decoder ])

recurrent_ae.compile( loss="binary_crossentropy", 
                      optimizer=keras.optimizers.SGD(0.1),
                      metrics=[rounded_accuracy] 
                    )

     This recurrent autoencoder can process sequences of any length, with 28 dimensions per time step. Conveniently, this means it can process Fashion MNIST images by treating each image as a sequence of rows: at each time step, the RNN will process a single row of 28 pixels. Obviously, you could use a recurrent autoencoder for any kind of sequence. Note that we use a RepeatVector layer as the first layer of the decoder, to ensure that its input vector gets fed to the decoder at each time step.

history = recurrent_ae.fit( X_train, X_train, epochs=10, 
                            validation_data=(X_valid, X_valid) )

show_reconstructions( recurrent_ae )
plt.show()

 

     OK, let’s step back for a second. So far we have seen various kinds of autoencoders (basic, stacked, convolutional, and recurrent), and we have looked at how to train them (either in one shot or layer by layer). We also looked at a couple applications: data visualization and unsupervised pretraining.

     Up to now, in order to force the autoencoder to learn interesting features, we have limited the size of the coding layer, making it undercomplete. There are actually many other kinds of constraints that can be used, including ones that allow the coding layer to be just as large as the inputs, or even larger, resulting in an overcomplete autoencoder. Let’s look at some of those approaches now.

Denoising Autoencoders (Gaussian noise or dropout)

     Another way to force the autoencoder to learn useful features is to add noise to its inputs, training it to recover the original, noise-free inputs. This idea has been around since the 1980s (e.g., it is mentioned in Yann LeCun’s 1987 master’s thesis). In a 2008 paper,(Pascal Vincent et al., “Extracting and Composing Robust Features with Denoising Autoencoders,” Proceedings of the 25th International Conference on Machine Learning (2008): 1096–1103.) Pascal Vincent et al. showed that autoencoders could also be used for feature extraction. In a 2010 paper,(Pascal Vincent et al., “Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion,” Journal of Machine Learning Research 11 (2010): 3371–3408.) Vincent et al. introduced stacked denoising autoencoders.

     The noise can be pure Gaussian noise added to the inputs, or it can be randomly switched-off inputs, just like in dropout (introduced in Cp11: neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end, you get a more robust network that generalizes better. https://blog.csdn.net/Linli522362242/article/details/107164478). Figure 17-8 shows both options.
Figure 17-8. Denoising autoencoders, with Gaussian noise (left) or dropout (right)

     The implementation is straightforward: it is a regular stacked autoencoder with an additional Dropout layer applied to the encoder’s inputs (or you could use a GaussianNoise layer instead). Recall that the Dropout layer is only active during training (and so is the GaussianNoise layer):

Using Gaussian noise:

tf.random.set_seed(42)
np.random.seed(42)

denoising_encoder = keras.models.Sequential([
                                    keras.layers.Flatten( input_shape=[28,28] ),
                                    keras.layers.GaussianNoise(0.2),
                                    keras.layers.Dense(100, activation="selu"),
                                    keras.layers.Dense(30, activation="selu")                                             
                    ])

denoising_decoder = keras.models.Sequential([
  keras.layers.Dense( 100, activation="selu", input_shape=[30] ),
  keras.layers.Dense( 28*28, activation="sigmoid" ),
  keras.layers.Reshape( [28,28] )                                           
])

denoising_ae = keras.models.Sequential([ denoising_encoder, denoising_decoder ])
denoising_ae.compile( loss="binary_crossentropy", 
                      optimizer=keras.optimizers.SGD(lr=1.0),
                      metrics=[rounded_accuracy],
                    )
history = denoising_ae.fit( X_train, X_train, epochs=10,
                            validation_data=(X_valid, X_valid) 
                          )  

tf.random.set_seed(42)
np.random.seed(42)

noise = keras.layers.GaussianNoise( 0.2 )
show_reconstructions( denoising_ae, noise(X_valid, training=True) ) # training : Python boolean indicating 
# whether the layer should behave in training mode (adding noise) or in inference mode (doing nothing).
plt.show()

Top: inputs with Gaussian noise added; bottom: their reconstructions

Using dropout:

tf.random.set_seed(42)
np.random.seed(42)

dropout_encoder = keras.models.Sequential([
  keras.layers.Flatten( input_shape=[28,28] ),
  keras.layers.Dropout( 0.5 ),
  keras.layers.Dense( 100, activation="selu" ),
  keras.layers.Dense( 30, activation="selu" )                                             
])

dropout_decoder = keras.models.Sequential([
  keras.layers.Dense( 100, activation="selu", input_shape=[30] ),
  keras.layers.Dense( 28*28, activation="sigmoid" ),
  keras.layers.Reshape( [28, 28] )                                        
])

dropout_ae = keras.models.Sequential( [dropout_encoder, dropout_decoder] )
dropout_ae.compile( loss="binary_crossentropy",
                    optimizer = keras.optimizers.SGD(lr=1.0),
                    metrics=[rounded_accuracy]
                  )
history = dropout_ae.fit( X_train, X_train, epochs=10,
                          validation_data=(X_valid, X_valid)
                        )

tf.random.set_seed(42)
np.random.seed(42)

dropout = keras.layers.Dropout(0.5)
show_reconstructions( dropout_ae, dropout(X_valid, training=True) )
plt.show()

Figure 17-9. Noisy images (top) and their reconstructions (bottom)

     Figure 17-9 shows a few noisy images (with half the pixels turned off), and the images reconstructed by the dropout-based denoising autoencoder. Notice how the autoencoder guesses details that are actually not in the input, such as the top of the white shirt (bottom row, fourth image). As you can see, not only can denoising autoencoders be used for data visualization or unsupervised pretraining, like the other autoencoders we’ve discussed so far, but they can also be used quite simply and efficiently to remove noise from images.
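For example, denoising new images at inference time just means passing the noisy images through the trained autoencoder (a minimal sketch, assuming the denoising_ae trained above and X_valid):

# Hedged sketch: clean up freshly corrupted images with the trained denoising autoencoder.
noisy_images = X_valid[:5] + np.random.normal(0.0, 0.2, size=X_valid[:5].shape)
noisy_images = np.clip(noisy_images, 0.0, 1.0)   # keep pixel values in [0, 1]
denoised_images = denoising_ae.predict(noisy_images)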

Sparse Autoencoders

     Another kind of constraint that often leads to good feature extraction is sparsity: by adding an appropriate term to the cost function, the autoencoder is pushed to reduce the number of active neurons in the coding layer. For example, it may be pushed to have on average only 5% significantly active neurons in the coding layer. This forces the autoencoder (by using sparser latent representations, or codings) to represent each input as a combination of a small number of activations. As a result, each neuron in the coding layer typically ends up representing a useful feature (if you could speak only a few words per month, you would probably try to make them worth listening to).

     A simple approach is to use the sigmoid activation function in the coding layer (to constrain the codings to values between 0 and 1), use a large coding layer (e.g., with 300 units), and add some regularization to the coding layer’s activations (the decoder is just a regular decoder):

##########################https://blog.csdn.net/Linli522362242/article/details/108230328

     Now let's discuss L1 regularization and sparsity. The main concept behind L1 regularization is similar to what we have discussed here. However, since the L1 penalty is the sum of the absolute weight coefficients (remember that the L2 term is quadratic), we can represent it as a diamond shape budget, as shown in the following figure:
Remember that our goal is to find the combination of weight coefficients that minimize the cost function for the training data, as shown in the figure (the point in the center of the ellipses).

     Now, we can think of regularization as adding a penalty term to the unpenalized cost function to encourage smaller weights; or, in other words, we penalize large weights(e.g. w_2 in the point in the center of the ellipses(Minimize cost) >> w_2 in the diamond at    ).

     Thus, by increasing the regularization strength via the regularization parameter  , we shrink the weights towards zero so the we decrease the dependence of our model on the training data (###The larger the value of the weight coefficient, the more the model depends on the data###).  The larger the value of the regularization parameter  gets, the faster the penalized cost functiongrows, which leads to a narrower L1 diamond (Note here just 2 weight coefficients since we only have 2 features in our data).  For example, if we increase the regularization parameter towards infinity, the weight coefficients will become effectively zero, denoted by the center of the L1 diamond . To summarize the main message of the example: our goal is to minimize the sum of the unpenalized cost function plus the penalty term, which can be understood as adding bias and preferring a simpler model to reduce the variance(try to underfit) in the absence of sufficient training data to fit the model.

     In the preceding figure, we can see that the contour of the unpenalized cost function touches the L1 diamond at  . Since the contours of an L1 regularized system are sharp, it is more likely that the optimum—that is, the intersection between the ellipses of the cost function and the boundary of the L1 diamond—is located on the axes( when), which encourages sparsity. The mathematical details of why L1 regularization can lead to sparse solutions are beyond the scope of this book. If you are interested, an excellent section on L2 versus L1 regularization can be found in section 3.4 of The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, and Jerome Friedman, Springer.

     Here, we simply replaced the square of the weights by the sum of the absolute values of the weights. In contrast to L2 regularization, L1 regularization usually yields sparse[spɑrs] feature vectors; most feature weights will be zero. Sparsity['spɑ:sɪtɪ] can be useful in practice if we have a high-dimensional dataset with many features that are irrelevant[ɪˈreləvənt], especially in cases where we have more irrelevant dimensions than samples. In this sense, L1 regularization can be understood as a technique for feature selection.
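
     As a quick illustration (my own example, not part of the original text), you can see this sparsifying effect with scikit-learn's LogisticRegression: with an L1 penalty many coefficients end up exactly zero, while an L2 penalty only shrinks them:

# My own sketch: compare how many weights survive under L1 vs L2 regularization.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)      # 30 features
X = StandardScaler().fit_transform(X)

l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

print("non-zero weights with L1:", np.sum(l1_model.coef_ != 0))   # typically only a few
print("non-zero weights with L2:", np.sum(l2_model.coef_ != 0))   # typically all 30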

BinaryCrossentropy() # https://blog.csdn.net/Linli522362242/article/details/96480059  

sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$

  • from_logits=False: the predictions passed in are probabilities, e.g. 1/(1+np.exp(-0.8)) ≈ 0.69
  • from_logits=True: the predictions passed in are raw logits, e.g. logits = tf.constant([0.8]) # z = w^T * X (see the small check below)
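
     The following small check (my own sketch, based on the note above) confirms that both calling conventions give the same loss when the probability is the sigmoid of the logit:

# My own sketch: BinaryCrossentropy with raw logits vs. pre-computed probabilities.
import tensorflow as tf
from tensorflow import keras

logits = tf.constant([[0.8]])                 # z = w^T * X
probas = tf.sigmoid(logits)                   # 1/(1+exp(-0.8)) ≈ 0.69
y_true = tf.constant([[1.0]])

bce_logits = keras.losses.BinaryCrossentropy(from_logits=True)
bce_probas = keras.losses.BinaryCrossentropy(from_logits=False)

print(bce_logits(y_true, logits).numpy())     # ≈ 0.371
print(bce_probas(y_true, probas).numpy())     # same value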

##########################https://blog.csdn.net/Linli522362242/article/details/108414534

     A simple approach is to use the sigmoid activation function in the coding layer (to constrain the codings to values between 0 and 1)

tf.random.set_seed(42)
np.random.seed(42)

simple_encoder = keras.models.Sequential([
  keras.layers.Flatten( input_shape=[28,28] ),
  keras.layers.Dense(100, activation="selu"),
  keras.layers.Dense(30, activation="sigmoid"), # use the sigmoid activation function for the coding layer, to ensure that the coding values range from 0 to 1                                          
])

simple_decoder = keras.models.Sequential([
  keras.layers.Dense(100, activation="selu", input_shape=[30]),
  keras.layers.Dense(28*28, activation="sigmoid"),
  keras.layers.Reshape([28,28])                                        
])

simple_ae = keras.models.Sequential([simple_encoder, simple_decoder])

simple_ae.compile( loss="binary_crossentropy", 
                   optimizer=keras.optimizers.SGD(lr=1.),
                   metrics=[rounded_accuracy]
                  )
history = simple_ae.fit( X_train, X_train, epochs=10, 
                         validation_data=(X_valid, X_valid)
                       )

show_reconstructions(simple_ae)
plt.show()

Let's create a couple functions to print nice activation histograms:

import matplotlib as mpl
%matplotlib inline

def plot_percent_hist( ax, data, bins ):
  counts, _ = np.histogram( data, bins=bins ) # _ is bins
  width_list = bins[1:] - bins[:-1]
  x = bins[:-1] + width_list/2 # bar middle position
  ax.bar( x, counts/len(data), width=width_list*0.8 ) # 0.8 for spacing

  ax.xaxis.set_ticks( bins )
  ax.yaxis.set_major_formatter( mpl.ticker.FuncFormatter( 
                                  lambda y, position: "{}%".format( int( np.round(100*y) ) )
                                                        )              # format y major
                              )
  ax.grid(True)

def plot_activations_histogram( encoder, height=1, n_bins=10 ):
  X_valid_codings = encoder( X_valid ).numpy() # latent representations
  activation_means = X_valid_codings.mean( axis=0 ) # X_valid_codings shape: (None, 30) # the mean activation over the training batch 
  mean = activation_means.mean()# [30 number in the activation_means_over_batch]==>a mean value
  bins = np.linspace( 0,1, n_bins+1 ) #==>array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
  
  fig, [ax1, ax2] = plt.subplots( figsize=(10,3), nrows=1, ncols=2, sharey=True )

  plot_percent_hist( ax1, X_valid_codings.ravel(), bins )
  ax1.plot( [mean, mean], [0, height], "k--",
            label="Overal Mean = {:.2f}".format(mean) )
  ax1.legend( loc="upper center", fontsize=14 )
  ax1.set_xlabel( "Activation" )
  ax1.set_ylabel( "% Activations" ) #percentage of Activation amounts
  ax1.axis( [0,1, 0,height] )

  plot_percent_hist( ax2, activation_means, bins )
  ax2.plot([mean, mean], [0,height], "k--")
  ax2.set_xlabel("Neuron Mean Activation")
  ax2.set_ylabel("% Neurons")
  ax2.axis([0,1, 0,height])

 Let's use these functions to plot histograms of the activations of the encoding layer.

plot_activations_histogram( simple_encoder, height=0.35 )
plt.show()

      The histogram on the left shows the distribution of all the activations. You can see that values close to 0 or 1 are more frequent overall, which is consistent with the saturating nature of the sigmoid function(https://blog.csdn.net/Linli522362242/article/details/113311720).

     The histogram on the right shows the distribution of mean neuron activations (30 neurons in total, ###activation_means = X_valid_codings.mean( axis=0 ) # X_valid_codings shape: (None, 30)###): you can see that most neurons have a mean activation close to 0.49 (e.g., about 23% of all neurons have a mean activation between 0.4 and 0.5).

     Both histograms tell us that each neuron tends to either fire close to 0 or 1, with about 50% probability each. However, some neurons fire almost all the time (right side of the right histogram).

Adding an L1 penalty (added to the cost function) to the coding layer's activations

     A simple approach is to use the sigmoid activation function in the coding layer (to constrain the codings to values between 0 and 1), use a large coding layer (e.g., with 300 units), and add some regularization to the coding layer’s activations (the decoder is just a regular decoder):

tf.random.set_seed(42)
np.random.seed(42)

sparse_l1_encoder = keras.models.Sequential([
  keras.layers.Flatten( input_shape=[28,28] ),
  keras.layers.Dense(100, activation="selu"),
  keras.layers.Dense(300, activation="sigmoid"),
  keras.layers.ActivityRegularization(l1=1e-3) # Alternatively, you could add
                                               # activity_regularizer=keras.regularizers.l1(1e-3)
                                               # to the previous layer(keras.layers.Dense(300, activation="sigmoid")).                                             
]) # add to the loss (or cost function) for training 

sparse_l1_decoder = keras.models.Sequential([
  keras.layers.Dense( 100, activation="selu", input_shape=[300] ),
  keras.layers.Dense( 28*28, activation="sigmoid" ),
  keras.layers.Reshape([28,28])                                           
])

sparse_l1_ae = keras.models.Sequential( [sparse_l1_encoder, sparse_l1_decoder] )
sparse_l1_ae.compile( loss="binary_crossentropy", 
                      optimizer=keras.optimizers.SGD(lr=1.0),
                      metrics=[rounded_accuracy]
                    )
history = sparse_l1_ae.fit( X_train, X_train, epochs=10,
                            validation_data=(X_valid, X_valid)
                          )

     This ActivityRegularization layer just returns its inputs, but as a side effect it adds a training loss equal to the sum of absolute values of its inputs (this layer only has an effect during training). Equivalently, you could remove the ActivityRegularization layer and set activity_regularizer=keras.regularizers.l1(1e-3) in the previous layer. This penalty will encourage the neural network to produce codings (after the activation function) close to 0 (e.g., an overall mean of about 0.02 in the histogram below), but since it will also be penalized if it does not reconstruct the inputs correctly, it will have to output at least a few nonzero values. Using the ℓ1 norm rather than the ℓ2 norm will push the neural network to preserve the most important codings while eliminating the ones that are not needed for the input image (rather than just reducing all codings). 

show_reconstructions( sparse_l1_ae )

plot_activations_histogram( sparse_l1_encoder, height=1. )
plt.show()

Sparsity: by adding an appropriate term to the cost function, the autoencoder is pushed to reduce the number of active neurons in the coding layer. Without the L1 penalty, each neuron tends to fire close to 0 or 1 with about 50% probability each; after adding the L1 penalty during training, the coding layer generates a much sparser vector for each input instance, and most neurons now have a low mean activation (left side of the right histogram). In other words, the autoencoder is pushed to have only a few significantly active neurons in the coding layer, which forces it to represent each input as a combination of a small number of activations.

     Another approach, which often yields better results, is to measure the actual sparsity of the coding layer at each training iteration, and penalize the model when the measured sparsity differs from a target sparsity. We do so by computing the average activation of each neuron in the coding layer, over the whole training batch. The batch size must not be too small, or else the mean will not be accurate.

     Once we have the mean activation per neuron, we want to penalize the neurons that are too active, or not active enough, by adding a sparsity loss to the cost function. For example, if we measure that a neuron has an average activation of 0.3, but the target sparsity is 0.1, it must be penalized to activate less. One approach could be simply adding the squared error to the cost function, but in practice a better approach is to use the Kullback–Leibler (KL) divergence (briefly discussed in Cp4 https://blog.csdn.net/Linli522362242/article/details/104124771

##################################################

CROSS ENTROPY

(cross) entropy: $I_H(t) = -\sum_{i=1}^{c} p(i \mid t)\,\log_2 p(i \mid t)$, with $0 \le I_H(t) \le 1$ in a binary class setting.
     Here, $p(i \mid t)$ is the proportion of the samples that belong to class i for a particular node t.

  • The entropy is therefore 0 if all samples at a node belong to the same class.

    For example, in a binary class setting, the entropy is $0 = -(1\cdot\log_2(1) + 0\cdot\log_2(0))$ if $p(i=1 \mid t) = 1$,
    and likewise the entropy is $0 = -(0\cdot\log_2(0) + 1\cdot\log_2(1))$ if $p(i=1 \mid t) = 0$.
     
  • And the entropy is maximal if we have a uniform class distribution.

    If the classes are distributed uniformly with $p(i=1 \mid t) = 0.5$ and $p(i=0 \mid t) = 0.5$, the entropy is $1 = -(0.5\cdot\log_2(0.5) + 0.5\cdot\log_2(0.5)) = -(0.5\cdot(-1) + 0.5\cdot(-1))$.
     
  • Therefore, we can say that the entropy criterion attempts to maximize the mutual information in the tree.
  • If the log is taken base 2, we can interpret this expression as the minimum number of bits needed, on average, to encode this probability distribution; it also shows that information entropy is, at its core, an expectation.

     Cross entropy originated from information theory. Suppose you want to efficiently transmit information about the weather every day. If there are eight options (sunny, rainy, etc.), you could encode each option using 3 bits since $2^3 = 8$. However, if you think it will be sunny almost every day, it would be much more efficient to code “sunny” on just one bit (0_ _ _) and the other seven options on 4 bits (starting with a 1 _ _ _). Cross entropy measures the average number of bits you actually send per option. If your assumption about the weather is perfect, cross entropy will just be equal to the entropy of the weather itself (i.e., its intrinsic unpredictability内在的不可预测性). But if your assumptions are wrong (e.g., if it rains often), cross entropy will be greater by an amount called the Kullback–Leibler divergence.
https://blog.csdn.net/Linli522362242/article/details/107755405
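
     As a quick back-of-the-envelope check (my own numbers, not from the original), suppose it is sunny 80% of the time and the seven other options share the remaining 20%:

# My own sketch: average bits sent by the "sunny = 1 bit, others = 4 bits" code
# versus the entropy (the theoretical minimum average number of bits).
import numpy as np

p = np.array([0.8] + [0.2 / 7] * 7)     # assumed weather distribution
bits = np.array([1] + [4] * 7)          # bits used per option by this code

avg_bits_sent = np.sum(p * bits)        # 0.8*1 + 0.2*4 = 1.6 bits per day actually sent
entropy = -np.sum(p * np.log2(p))       # ≈ 1.28 bits: the unavoidable minimum
print(avg_bits_sent, entropy)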

     We start with a dataset $D_p$ at the parent node that consists of 40 samples from class 1 and 40 samples from class 2, which we split into two datasets, $D_{\text{left}}$ and $D_{\text{right}}$.

      The entropy criterion would favor the split that produces a pure child node over the other scenario, since the entropy is 0 if all samples at a node belong to the same class.

The cross entropy between two discrete probability distributions p and q is defined as $H(p, q) = -\sum_{x} p(x)\,\log q(x)$.
The KL divergence is the information that is lost when a distribution q(x) is used to approximate p(x).
If we use distribution q to represent distribution p, the loss of information is $D_{KL}(p \parallel q) = \sum_x p(x)\,\log\frac{p(x)}{q(x)} = H(p, q) - H(p)$.
In other words, it measures how well q(x) can express the information contained in p(x): the larger the KL divergence, the worse the approximation.
The expression can also be written in the form of an expectation: $D_{KL}(p \parallel q) = \mathbb{E}_{x \sim p}\left[\log\frac{p(x)}{q(x)}\right]$.
Equation 9-4. KL divergence from q(z) to p(z|X) https://blog.csdn.net/Linli522362242/article/details/105973507

Some simple examples:
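
     For instance (my own numbers, standing in for the original figure examples), here is the computation for two small discrete distributions:

# My own sketch: entropy, cross entropy and KL divergence for discrete p and q.
import numpy as np

p = np.array([0.5, 0.5])     # true distribution
q = np.array([0.9, 0.1])     # approximating distribution

entropy_p     = -np.sum(p * np.log2(p))        # H(p)       = 1.0 bit
cross_entropy = -np.sum(p * np.log2(q))        # H(p, q)   ≈ 1.737 bits
kl_divergence =  np.sum(p * np.log2(p / q))    # D_KL(p||q) = H(p,q) - H(p) ≈ 0.737
print(entropy_p, cross_entropy, kl_divergence)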


##################################################
), which has much stronger gradients than the mean squared error, as you can see in Figure 17-10.


import matplotlib.pyplot as plt

p = 0.1                            # target probability p
q = np.linspace(0.001, 0.999, 500) # actual probability q (i.e., the mean activation over the training batch)
kl_div = p*np.log(p/q) + (1-p)*np.log( (1-p)/(1-q) )
mse = (p-q)**2
mae = np.abs(p-q)

plt.plot( [p,p], [0, 0.3], "k:" )
plt.text( 0.05, 0.3+0.02, "Target\nsparsity", fontsize=14 )

plt.plot( q, kl_div, "b-", label="KL divergence" )
plt.plot( q, mae, "g--", label=r"MAE ($\ell_1$)" )
plt.plot( q, mse, "r--", label=r"MSE ($\ell_2$)", lw=1 )

plt.legend( loc="upper left", fontsize=14 )
plt.xlabel("Actual sparsity")
plt.ylabel("Cost", rotation=0)
plt.axis([0,1, 0,0.95])
plt.show()

 

     Given two discrete probability distributions P and Q, the KL divergence between these distributions, noted $D_{KL}(P \parallel Q)$, can be computed using Equation 17-1.

Equation 17-1. Kullback–Leibler divergence: $D_{KL}(P \parallel Q) = \sum_i P(i)\,\log\frac{P(i)}{Q(i)}$

     In our case, we want to measure the divergence between the target probability p that a neuron in the coding layer will activate and the actual probability q (i.e., the mean activation over the training batch). So the KL divergence simplifies to Equation 17-2.

Equation 17-2. KL divergence between the target sparsity p and the actual sparsity q: $D_{KL}(p \parallel q) = p\,\log\frac{p}{q} + (1-p)\,\log\frac{1-p}{1-q}$

     Once we have computed the sparsity loss for each neuron in the coding layer, we sum up these losses and add the result to the cost function. In order to control the relative importance of the sparsity loss and the reconstruction loss, we can multiply the sparsity loss by a sparsity weight hyperparameter. If this weight is too high, the model will stick closely to the target sparsity, but it may not reconstruct the inputs properly, making the model useless. Conversely, if it is too low, the model will mostly ignore the sparsity objective and will not learn any interesting features.

     We now have all we need to implement a sparse autoencoder based on the KL divergence. First, let’s create a custom regularizer to apply KL divergence regularization:

K = keras.backend
kl_divergence = keras.losses.kullback_leibler_divergence

class KLDivergenceRegularizer( keras.regularizers.Regularizer ):
  def __init__( self, weight, target=0.1 ):
    self.weight = weight
    self.target = target
  def __call__( self, inputs ):
    # mean_activities for each neuron
    mean_activities = K.mean( inputs, axis=0 ) # the mean activation over the training batch
    return self.weight * (
                            kl_divergence( self.target, mean_activities ) + 
                            kl_divergence( 1.-self.target, 1.-mean_activities )
           )

 Now we can build the sparse autoencoder, using the KLDivergenceRegularizer for the coding layer’s activations:

tf.random.set_seed(42)
np.random.seed(42)

kld_reg = KLDivergenceRegularizer( weight=0.05, target=0.1 )
sparse_kl_encoder = keras.models.Sequential([
  keras.layers.Flatten( input_shape=[28,28] ),
  keras.layers.Dense( 100, activation="selu" ),
  keras.layers.Dense( 300, activation="sigmoid", activity_regularizer=kld_reg)                                            
])

sparse_kl_decoder = keras.models.Sequential([
  keras.layers.Dense(100, activation="selu", input_shape=[300]),
  keras.layers.Dense( 28*28, activation="sigmoid" ),
  keras.layers.Reshape([28,28])                                           
])

sparse_kl_ae = keras.models.Sequential( [sparse_kl_encoder, sparse_kl_decoder] )
sparse_kl_ae.compile( loss="binary_crossentropy", 
                      optimizer=keras.optimizers.SGD(lr=1.0),
                      metrics=[rounded_accuracy]
                    )
history = sparse_kl_ae.fit( X_train, X_train, epochs=10,
                            validation_data=(X_valid, X_valid) )

show_reconstructions( sparse_kl_ae )

plot_activations_histogram( sparse_kl_encoder )
plt.show()

     After training this sparse autoencoder on Fashion MNIST, the activations of the neurons in the coding layer are mostly close to 0: about 70% of all activations are lower than 0.1, and the overall mean is around 0.11, because the KL divergence penalty encourages the network to produce codings (after the activation function) close to 0. Moreover, all neurons have a mean activation around 0.1 (about 90% of all neurons have a mean activation between 0.1 and 0.2). In other words, adding the KL divergence term to the cost during training makes the coding layer generate a much sparser vector for each input instance and reduces the number of active neurons in the coding layer, as shown in Figure 17-11. 

Figure 17-11. Distribution of all the activations in the coding layer (left) and distribution of the mean activation per neuron (right)

My summary:

     Codings (or latent representations) ==> sigmoid (logistic) activation function ==> values compressed into the range [0, 1] ==> add an L1 or KL divergence penalty to the cost function during training ==> the network is encouraged to produce codings (after the activation function) close to 0, which reduces the number of active neurons in the coding layer (###i.e., it forces the autoencoder to use sparser latent representations, representing each input as a combination of a small number of activations###) and pushes it to preserve the most important codings while eliminating the ones that are not needed for the input image (rather than just reducing all codings).

     In each layer, the neurons receive the previous layer's output as input ($x_0, x_1, ..., x_n$; n = 100 if the previous layer has 100 neurons), compute the net input $Z = W^T x + b$ with the current layer's weights (weight shape = (100, 30)), which gives 30 net inputs, pass them through 30 identical activation functions (such as sigmoid) that determine whether each neuron fires, and produce the current layer's output (shape = (batch_size, 30) if the current layer has 30 neurons). Adding a sparsity term to the cost function makes this output sparser (fewer active output neurons).
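
     The shape flow described above can be checked with a tiny NumPy sketch (my own illustration, with made-up values):

# My own sketch: a batch of 32 inputs with 100 features through a 30-unit sigmoid layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

batch = np.random.rand(32, 100)        # previous layer's output: (batch_size, 100)
W = np.random.randn(100, 30) * 0.1     # current layer's weights: (100, 30)
b = np.zeros(30)                       # biases: (30,)

Z = batch @ W + b                      # net input: (32, 30)
activations = sigmoid(Z)               # coding layer output: (32, 30), values in (0, 1)
print(activations.shape)               # (32, 30)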

Variational Autoencoders

     Another important category of autoencoders was introduced in 2013 by Diederik Kingma and Max Welling and quickly became one of the most popular types of autoencoders: variational autoencoders.(Diederik Kingma and Max Welling, “Auto-Encoding Variational Bayes,” arXiv preprint arXiv:1312.6114(2013).)

They are quite different from all the autoencoders we have discussed so far, in these particular ways:

  • They are probabilistic autoencoders, meaning that their outputs are partly determined by chance, even after training (as opposed to denoising autoencoders, which use randomness only during training).
  • Most importantly, they are generative autoencoders, meaning that they can generate new instances that look like they were sampled from the training set.

     Both these properties make them rather similar to RBMs(restricted Boltzmann machines), but they are easier to train, and the sampling process is much faster (with RBMs you need to wait for the network to stabilize into a “thermal equilibrium” before you can sample a new instance). Indeed, as their name suggests, variational autoencoders perform variational Bayesian inference (introduced in Cp9 https://blog.csdn.net/Linli522362242/article/details/105973507), which is an efficient way to perform approximate Bayesian inference.
Figure 17-12. Variational autoencoder (left) and an instance going through it (right)

     Let’s take a look at how they work. Figure 17-12 (left) shows a variational autoencoder. You can recognize the basic structure of all autoencoders, with an encoder followed by a decoder (in this example, they both have two hidden layers), but there is a twist: instead of directly producing a coding for a given input, the encoder produces a mean coding μ and a standard deviation σ. The actual coding is then sampled randomly from a Gaussian distribution with mean μ and standard deviation σ. After that the decoder decodes the sampled coding normally. The right part of the diagram shows a training instance going through this autoencoder.

  • First, the encoder produces μ and σ,
  • then a coding is sampled randomly (notice that it is not exactly located at μ), and
  • finally this coding is decoded; the final output resembles类似 the training instance.

     As you can see in the diagram, although the inputs may have a very convoluted[ˈkɑnvəlutɪd]卷曲的 distribution, a variational autoencoder tends to produce codings that look as though好像 they were sampled from a simple Gaussian distribution (variational autoencoders are actually more general; the codings are not limited to Gaussian distributions): during training, the cost function (discussed next) pushes the codings to gradually migrate[ˈmaɪɡreɪt]迁移 within the coding space (also called the latent space) to end up looking like a cloud of Gaussian points. One great consequence is that after training a variational autoencoder, you can very easily generate a new instance: just sample a random coding from the Gaussian distribution, decode it, and voilà!

     Now, let’s look at the cost function. It is composed of two parts.

  • The first is the usual reconstruction loss that pushes the autoencoder to reproduce its inputs (we can use cross entropy for this, as discussed earlier).
  • The second is the latent loss that pushes the autoencoder to have codings that look as though they were sampled from a simple Gaussian distribution: it is the KL divergence between the target distribution (i.e., the Gaussian distribution) and the actual distribution of the codings. The math is a bit more complex than with the sparse autoencoder, in particular because of the Gaussian noise, which limits the amount of information that can be transmitted to the coding layer (thus pushing the autoencoder to learn useful features). Luckily, the equations simplify, so the latent loss can be computed quite simply using Equation 17-3:

Equation 17-3. Variational autoencoder’s latent loss: $\mathcal{L} = -\frac{1}{2}\sum_{i=1}^{K}\left[1 + \log(\sigma_i^2) - \sigma_i^2 - \mu_i^2\right]$

     In this equation, ℒ is the latent loss, K is the codings’ dimensionality, and $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the $i^{th}$ component of the codings. The vectors μ and σ (which contain all the $\mu_i$ and $\sigma_i$) are output by the encoder, as shown in Figure 17-12 (left).

     A common tweak to the variational autoencoder’s architecture is to make the encoder output γ = log(σ²) rather than σ. The latent loss can then be computed as shown in Equation 17-4. This approach is more numerically stable and speeds up training.

Equation 17-4. Variational autoencoder’s latent loss, rewritten using γ = log(σ²): $\mathcal{L} = -\frac{1}{2}\sum_{i=1}^{K}\left[1 + \gamma_i - \exp(\gamma_i) - \mu_i^2\right]$

     Let’s start building a variational autoencoder for Fashion MNIST (as shown in Figure 17-12, but using the γ tweak). First, we will need a custom layer to sample the
codings, given μ and γ:

from tensorflow import keras

K = keras.backend
class Sampling( keras.layers.Layer):
    def call(self, inputs):
        mean, log_var = inputs
        # Gaussian noise: random_normal <== default mean=0.0, stddev=1.0
        # log_var = gamma = log( sigma^2 )
        # exp(log_var/2) ==> exp( log(sigma^2)/2 ) ==> exp( 2*log(sigma)/2 ) ==> sigma
        return K.random_normal( tf.shape(log_var) ) * K.exp(log_var/2) + mean

     This Sampling layer takes two inputs: mean (μ) and log_var (γ). It uses the function K.random_normal() to sample a random vector (of the same shape as γ) from the Normal distribution, with mean 0 and standard deviation 1. Then it multiplies it by exp(γ / 2) (which is equal to σ, as you can verify), and finally it adds μ and returns the result. This samples a codings vector from the Normal distribution with mean μ and standard deviation σ.
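
     A quick sanity check (my own sketch, with arbitrary values for μ and σ, using the Sampling layer defined above) confirms that the sampled codings have roughly the requested mean and standard deviation:

# My own sketch: feed a constant mean and log-variance and inspect the sample statistics.
import numpy as np
import tensorflow as tf

mu, sigma = 2.0, 0.5
gamma = np.log(sigma ** 2).astype(np.float32)       # gamma = log(sigma^2)

mean_batch    = tf.fill([10000, 1], mu)             # shape (10000, 1)
log_var_batch = tf.fill([10000, 1], gamma)

samples = Sampling()([mean_batch, log_var_batch]).numpy()
print(samples.mean(), samples.std())                # ≈ 2.0 and ≈ 0.5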

Next, we can create the encoder, using the Functional API because the model is not entirely sequential:

import numpy as np
import tensorflow as tf

codings_size = 10

inputs = keras.layers.Input( shape=[28,28] )
z = keras.layers.Flatten()(inputs)
z = keras.layers.Dense(150, activation="selu")(z)
z = keras.layers.Dense(100, activation="selu")(z)

codings_mean = keras.layers.Dense( codings_size )(z)
codings_log_var = keras.layers.Dense( codings_size )(z)
codings = Sampling()([ codings_mean, codings_log_var ])

variational_encoder = keras.models.Model( inputs=[inputs],
                                          outputs=[codings_mean, codings_log_var, codings]
                                        )

decoder_inputs = keras.layers.Input( shape=[codings_size] )
x = keras.layers.Dense(100, activation="selu")(decoder_inputs)
x = keras.layers.Dense(150, activation="selu")(x)
x = keras.layers.Dense( 28*28, activation="sigmoid" )(x)
outputs = keras.layers.Reshape([28,28])(x)
variational_decoder = keras.models.Model( inputs=[decoder_inputs],
                                          outputs=[outputs]
                                        )
  • encoder:

    Note that the Dense layers that output codings_mean (μ) and codings_log_var (γ) have the same inputs (i.e., the outputs of the second Dense layer). We then pass both codings_mean and codings_log_var to the Sampling layer. Finally, the variational_encoder model has three outputs, in case you want to inspect the values of codings_mean and codings_log_var. The only output we will use is the last one (codings)
     
  • decoder:

    we could have used the Sequential API instead of the Functional API, since it is really just a simple stack of layers, virtually identical to many of the decoders we have built so far.
variational_encoder.summary()
variational_decoder.summary()

 

Finally, let’s build the variational autoencoder model: 

_, _, codings = variational_encoder( inputs )
reconstructions = variational_decoder( codings )
variational_ae = keras.models.Model( inputs=[inputs],
                                     outputs=[reconstructions]
                                   )
variational_ae.summary()

Note that we ignore the first two outputs of the encoder (we only want to feed the codings to the decoder). Lastly, we must add the latent loss and the reconstruction loss: 

 

Equation 17-4. Variational autoencoder’s latent loss, rewritten using γ = log(σ²): $\mathcal{L} = -\frac{1}{2}\sum_{i=1}^{K}\left[1 + \gamma_i - \exp(\gamma_i) - \mu_i^2\right]$

# codings_log_var = gamma = log( sigma^2 )
latent_loss = -0.5*K.sum( 1 + codings_log_var - K.exp(codings_log_var) - K.square(codings_mean), 
                          axis=-1
                        ) # codings_mean.shape : TensorShape([None_batch_size, 10])
# latent_loss.shape : TensorShape([None_batch_size])
variational_ae.add_loss( K.mean(latent_loss)/784. )

variational_ae.compile( loss="binary_crossentropy", optimizer="rmsprop", 
                        metrics=[rounded_accuracy] )
history = variational_ae.fit( X_train, X_train, epochs=25, batch_size=128, 
                              validation_data = (X_valid, X_valid) )
  • We first apply Equation 17-4 to compute the latent loss for each instance in the batch (we sum over the last axis).
  • Then we compute the mean loss over all the instances in the batch,
  • and we divide the result by 784 to ensure it has the appropriate scale compared to the reconstruction loss.
  • Indeed, the variational autoencoder’s reconstruction loss is supposed to be the sum of the pixel reconstruction errors,
  • but when Keras computes the "binary_crossentropy" loss, it computes the mean over all 784 pixels rather than the sum. So, the reconstruction loss is 784 times smaller than we need it to be. We could define a custom loss to compute the sum rather than the mean (a sketch follows this list), but it is simpler to divide the latent loss by 784 (the final loss will be 784 times smaller than it should be, but this just means that we should use a larger learning rate).
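
     Here is a sketch of that alternative (my own, not the code used above): a custom reconstruction loss that sums the per-pixel binary cross entropy for each instance, so the latent loss can be added at its natural scale:

# My own sketch: sum the pixel-wise binary cross entropy instead of averaging it.
def sum_binary_crossentropy(y_true, y_pred):
    # keras.losses.binary_crossentropy averages over the last axis (the 784 pixels
    # after reshaping), so multiplying by 784 turns that mean back into a sum.
    y_true = tf.reshape(y_true, [-1, 28 * 28])
    y_pred = tf.reshape(y_pred, [-1, 28 * 28])
    return 784. * keras.losses.binary_crossentropy(y_true, y_pred)

# With this loss, the latent loss would be added without the /784. factor:
# variational_ae.add_loss(K.mean(latent_loss))
# variational_ae.compile(loss=sum_binary_crossentropy, optimizer="rmsprop",
#                        metrics=[rounded_accuracy])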

Note that we use the RMSprop optimizer(https://blog.csdn.net/Linli522362242/article/details/106982127), which works well in this case. And finally we can train the autoencoder!

... ...

show_reconstructions( variational_ae )
plt.show()

 

Generating Fashion MNIST Images

     Now let’s use this variational autoencoder to generate images that look like fashion items. All we need to do is sample random codings from a Gaussian distribution and decode them and plot the resulting images:

def plot_multiple_images( images, n_cols=None ):
  n_cols = n_cols or len(images)
  n_rows = ( len(images)-1 )//n_cols + 1
  if images.shape[-1] == 1:
    images = np.squeeze( images, axis=-1 )
    
  plt.figure( figsize=(n_cols, n_rows) )
  for index, image in enumerate(images):
    plt.subplot( n_rows, n_cols, index+1 )
    plt.imshow( image, cmap="binary" )
    plt.axis("off")

Figure 17-13 shows the 12 generated images. 

tf.random.set_seed(42)

codings = tf.random.normal( shape=[12, codings_size] )# codings_size = 10
images = variational_decoder(codings).numpy()
plot_multiple_images( images, 4 )

Figure 17-13. Fashion MNIST images generated by the variational autoencoder

     The majority of these images look fairly convincing, if a bit too fuzzy. The rest are not great, but don’t be too harsh[hɑːrʃ]严厉的 on the autoencoder: it only had a few minutes to learn! Give it a bit more fine-tuning and training time, and those images should look better.

     Variational autoencoders make it possible to perform semantic interpolation: instead of interpolating two images at the pixel level (which would look as if the two images were overlaid), we can interpolate at the codings level.

  • We first run both images through the encoder,
  • then we interpolate the two codings we get, and
  • finally we decode the interpolated codings to get the final image. It will look like a regular Fashion MNIST image, but it will be an intermediate between the original images.

In the following code example, we take the 12 codings we just generated, we organize them in a 3 × 4 grid, and we use TensorFlow’s tf.image.resize() function to resize this grid to 5 × 7. By default, the resize() function will perform bilinear interpolation(https://blog.csdn.net/Linli522362242/article/details/108669444), so every other row and column will contain interpolated codings. We then use the decoder to produce all the images:

Now let's perform semantic interpolation between these images:

tf.random.set_seed(42)
np.random.seed(42)

# codings.shape : TensorShape([12, 10=codings_size])
codings_grid = tf.reshape( codings, [1,3,4, codings_size] ) # output shape: [batch_size=1,height=3,width=4, channels=10]
larger_grid = tf.image.resize( codings_grid, size=[5,7] ) # method=ResizeMethod.BILINEAR
# larger_grid.shape # TensorShape([1, 5, 7, 10])
interpolated_codings = tf.reshape( larger_grid, [-1, codings_size] )
images = variational_decoder( interpolated_codings ).numpy()

plt.figure( figsize=(7,5) )
for index, image in enumerate(images):
  plt.subplot(5,7, index+1)
  # if index%7%2==0 and index//7%2==0:
  #   plt.gca().get_xaxis().set_visible(False)
  #   plt.gca().get_yaxis().set_visible(False)
  # else:
  plt.axis("off")
  plt.imshow(image, cmap="binary")

tf.random.set_seed(42)
np.random.seed(42)

# codings.shape              # TensorShape([12, 10=codings_size])
codings_grid = tf.reshape( codings, [1,3,4, codings_size] ) # output shape: [batch_size=1,height=3,width=4, channels=10]
larger_grid = tf.image.resize( codings_grid, size=[5,7] ) # method=ResizeMethod.BILINEAR
# larger_grid.shape          # TensorShape([1, 5, 7, 10])
interpolated_codings = tf.reshape( larger_grid, [-1, codings_size] )
# interpolated_codings.shape # TensorShape([35, 10])
images = variational_decoder( interpolated_codings ).numpy()
# images.shape               # (35, 28, 28)

plt.figure( figsize=(7,5) )
for index, image in enumerate(images):
  plt.subplot(5,7, index+1)
  if index%7%2==0:
    plt.gca().get_xaxis().set_visible(False) # hide the x-axis
    plt.gca().get_yaxis().set_visible(False) # hide the y-axis
  else:
    plt.axis("off")
  plt.imshow(image, cmap="binary")

(The resulting 5×7 grid of images, with columns indexed 0–6 from left to right.)

tf.random.set_seed(42)
np.random.seed(42)

# codings.shape              # TensorShape([12, 10=codings_size])
codings_grid = tf.reshape( codings, [1,3,4, codings_size] ) # output shape: [batch_size=1,height=3,width=4, channels=10]
larger_grid = tf.image.resize( codings_grid, size=[5,7] ) # method=ResizeMethod.BILINEAR
# larger_grid.shape          # TensorShape([1, 5, 7, 10])
interpolated_codings = tf.reshape( larger_grid, [-1, codings_size] )
# interpolated_codings.shape # TensorShape([35, 10])
images = variational_decoder( interpolated_codings ).numpy()
# images.shape               # (35, 28, 28)

plt.figure( figsize=(7,5) )
for index, image in enumerate(images):
  plt.subplot(5,7, index+1)
  if index%7%2==0 and index//7%2==0:
    plt.gca().get_xaxis().set_visible(False)
    plt.gca().get_yaxis().set_visible(False)
  else:
    plt.axis("off")
  plt.imshow(image, cmap="binary")

     Figure 17-14 shows the resulting images. The original images are framed, and the rest are the result of semantic interpolation between the nearby images. Notice, for example, how the T-shirt in the 1st row and 6th column is a nice interpolation between the two T-shirts located to its left and right.

Figure 17-14. Semantic interpolation(3x4 ==> 5x7)

Hashing Using a Binary Autoencoder

Let's load the Fashion MNIST dataset again:

from tensorflow import keras
import tensorflow as tf
import numpy as np

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

X_train_full = X_train_full.astype( np.float32 ) / 255.
X_test = X_test.astype( np.float32 )/255.

X_train, X_valid = X_train_full[:-5000], X_train_full[-5000:]
y_train, y_valid = y_train_full[:-5000], y_train_full[-5000:]

     For several years, variational autoencoders were quite popular, but GANs eventually took the lead, in particular because they are capable of generating much more realistic and crisp清晰分明的 images. So let’s turn our attention to GANs.

     Let's train an autoencoder where the encoder has a 16-neuron output layer, using the sigmoid activation function, and heavy Gaussian noise ### keras.layers.GaussianNoise(15.), ### just before it. During training, the noise layer will encourage the previous layer to learn to output large values, since small values will just be crushed by the noise. In turn, this means that the output layer will output values close to 0 or 1, thanks to the sigmoid activation function. Once we round the output values to 0s and 1s, we get a 16-bit "semantic" hash. If everything works well, images that look alike will have the same hash.

     This can be very useful for search engines: for example, if we store each image on a server identified by the image's semantic hash, then all similar images will end up on the same server. Users of the search engine can then provide an image to search for, and the search engine will compute the image's hash using the encoder, and quickly return all the images on the server identified by that hash.
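
     Here is a small sketch of that lookup idea (my own illustration, using the hashing_encoder that is trained just below and an in-memory dict in place of a fleet of servers):

# My own sketch: group image indices by their 16-bit semantic hash, then answer a
# query by hashing the query image and returning everything stored under that hash.
from collections import defaultdict
import numpy as np

def semantic_hashes(encoder, images):
    # Round the 16 sigmoid outputs to 0/1 and pack each row into a hashable tuple.
    codes = np.round(encoder.predict(images)).astype(np.int32)
    return [tuple(code) for code in codes]

hash_index = defaultdict(list)                   # hash -> list of image indices
for i, h in enumerate(semantic_hashes(hashing_encoder, X_valid)):
    hash_index[h].append(i)

query_hash = semantic_hashes(hashing_encoder, X_valid[:1])[0]
similar_image_indices = hash_index[query_hash]   # candidates sharing the query's hash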

tf.random.set_seed(42)
np.random.seed(42)

hashing_encoder = keras.models.Sequential([
  keras.layers.Flatten( input_shape=[28,28] ),
  keras.layers.Dense( 100, activation="selu" ),
  keras.layers.GaussianNoise(15.),
  keras.layers.Dense( 16, activation="sigmoid" ),                                          
])

hashing_decoder = keras.models.Sequential([
  keras.layers.Dense( 100, activation="selu" ),
  keras.layers.Dense( 28*28, activation="sigmoid" ),
  keras.layers.Reshape([28,28])                                        
])

def rounded_accuracy(y_true, y_pred):
    return keras.metrics.binary_accuracy( tf.round(y_true), tf.round(y_pred) )# default threshold=0.5

hashing_ae = keras.models.Sequential( [hashing_encoder, hashing_decoder] )
hashing_ae.compile( loss="binary_crossentropy", 
                    optimizer=keras.optimizers.Nadam(),
                    metrics=[rounded_accuracy]
                  )
history = hashing_ae.fit( X_train, X_train, epochs=10,
                          validation_data=(X_valid, X_valid)
                        )

 

def plot_image(image):
    plt.imshow(image, cmap="binary")
    plt.axis("off")
 
def show_reconstructions(model, images=X_valid, n_images=5):
    reconstructions = model.predict( images[:n_images] )
    fig = plt.figure( figsize=(n_images*1.5, 3) )
    for image_index in range(n_images):
        plt.subplot( 2, n_images, 1+image_index)
        plot_image( images[image_index] )
        
        plt.subplot( 2, n_images, n_images+1+image_index )
        plot_image( reconstructions[image_index] )
 
show_reconstructions( hashing_ae )

 

import matplotlib as mpl
%matplotlib inline
 
def plot_percent_hist( ax, data, bins ):
  counts, _ = np.histogram( data, bins=bins ) # _ is bins
  width_list = bins[1:] - bins[:-1]
  x = bins[:-1] + width_list/2 # bar middle position
  ax.bar( x, counts/len(data), width=width_list*0.8 ) # 0.8 for spacing
 
  ax.xaxis.set_ticks( bins )
  ax.yaxis.set_major_formatter( mpl.ticker.FuncFormatter( 
                                  lambda y, position: "{}%".format( int( np.round(100*y) ) )
                                                        )              # format y major
                              )
  ax.grid(True)
 
def plot_activations_histogram( encoder, height=1, n_bins=10 ):
  X_valid_codings = encoder( X_valid ).numpy() # latent representations
  activation_means = X_valid_codings.mean( axis=0 ) # X_valid_codings shape: (None, 16) # the mean activation of each coding neuron over the validation set
  mean = activation_means.mean() # [16 numbers in activation_means] ==> one overall mean value
  bins = np.linspace( 0,1, n_bins+1 ) #==>array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
  
  fig, [ax1, ax2] = plt.subplots( figsize=(10,3), nrows=1, ncols=2, sharey=True )
 
  plot_percent_hist( ax1, X_valid_codings.ravel(), bins )
  ax1.plot( [mean, mean], [0, height], "k--",
            label="Overal Mean = {:.2f}".format(mean) )
  ax1.legend( loc="upper center", fontsize=14 )
  ax1.set_xlabel( "Activation" )
  ax1.set_ylabel( "% Activations" ) #percentage of Activation amounts
  ax1.axis( [0,1, 0,height] )
 
  plot_percent_hist( ax2, activation_means, bins )
  ax2.plot([mean, mean], [0,height], "k--")
  ax2.set_xlabel("Neuron Mean Activation")
  ax2.set_ylabel("% Neurons")
  ax2.axis([0,1, 0,height])


plot_activations_histogram(hashing_encoder)
plt.show()

Notice that the outputs are indeed very close to 0 or 1 (left graph). During training, the heavy Gaussian noise encourages the previous layer to learn to output large values for the important features (autoencoders can also be used for feature extraction), since small values are simply crushed by the noise; the 16-neuron output layer, using the sigmoid activation function, therefore outputs values close to 0 or 1, and once we round these values to 0s and 1s we get a 16-bit "semantic" hash: 

       The histogram on the left shows the distribution of all the activations. You can see that values close to 0 or 1 are more frequent overall, which is consistent with the saturating nature of the sigmoid function(https://blog.csdn.net/Linli522362242/article/details/113311720).

     The histogram on the right shows the distribution of mean neuron activations (16 neurons in total, ###activation_means = X_valid_codings.mean( axis=0 ) # X_valid_codings shape: (None, 16)###): you can see that, unlike before, most neurons no longer have a mean activation close to 0.5.

     Both histograms tell us that each neuron tends to fire either close to 0 or close to 1; however, some neurons fire almost all the time (right side of the right histogram), and others almost never.

Now let's see what the hashes look like for the first few images in the validation set:

hashes = np.round( hashing_encoder.predict(X_valid) ).astype( np.int32 )
# hashes.shape : (5000, 16)

# hashes[:5] # from left to right
# array([[1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
#        [1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
#        [1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0],
#        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0],
#        [0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]], dtype=int32)

# [2**bit for bit in range(16)]
# [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
hashes *= np.array([ [2**bit for bit in range(16)] ])
hashes = hashes.sum( axis=1 )
print(hashes[:5])


for h in hashes[:5]:
  print( "{:016b}".format(h) ) # convert integer to 16-bit "semantic" hash
print("...")

 

from collections import Counter

n_hashes = 10 # rows
n_images = 8  # columns

top_hashes = Counter(hashes).most_common(n_hashes) #autoencoders could also be used for feature extraction

plt.figure(figsize=(n_images, n_hashes))
for hash_index, (image_hash, hash_count) in enumerate(top_hashes):
    indices = (hashes == image_hash)
    for index, image in enumerate(X_valid[indices][:n_images]): # select first 8 images
                                      # row_index
        plt.subplot(n_hashes, n_images, hash_index * n_images + index + 1)
        plt.imshow(image, cmap="binary")
        plt.axis("off")

 

Generative Adversarial Networks

https://blog.csdn.net/Linli522362242/article/details/117211917
