Note
This section assumes the reader has already read through Classifying MNIST digits using Logistic Regression, Multilayer Perceptron, and Restricted Boltzmann Machines (RBM). Additionally, it uses the following Theano functions and concepts: T.tanh, shared variables, basic arithmetic ops, T.grad, Random numbers, floatX. If you intend to run the code on a GPU, also read GPU.
Note
The code for this section is available for download here.
Deep Belief Networks
[Hinton06] showed that RBMs can be stacked and trained in a greedy manner to form so-called Deep Belief Networks (DBN). DBNs are graphical models which learn to extract a deep hierarchical representation of the training data. They model the joint distribution between the observed vector $x$ and the $\ell$ hidden layers $h^{(k)}$ as follows:
$$P(x, h^{(1)}, \ldots, h^{(\ell)}) = \left(\prod_{k=0}^{\ell-2} P(h^{(k)} | h^{(k+1)})\right) P(h^{(\ell-1)}, h^{(\ell)})$$ (1)
where $x = h^{(0)}$, $P(h^{(k-1)} | h^{(k)})$ is a conditional distribution for the visible units conditioned on the hidden units of the RBM at level $k$, and $P(h^{(\ell-1)}, h^{(\ell)})$ is the visible-hidden joint distribution in the top-level RBM. This is illustrated in the figure below.
The principle of greedy layer-wise unsupervised training can be applied to DBNs with RBMs as the building blocks for each layer [Hinton06], [Bengio07]. The process is as follows:
1. Train the first layer as an RBM that models the raw input as its visible layer.
2. Use that first layer to obtain a representation of the input that will be used as data for the second layer. Two common choices exist: the mean activations $p(h^{(1)} = 1 | h^{(0)})$ or samples of $p(h^{(1)} | h^{(0)})$.
3. Train the second layer as an RBM, taking the transformed data (samples or mean activations) as training examples (for the visible layer of that RBM).
4. Iterate (2 and 3) for the desired number of layers, each time propagating upward either samples or mean values.
5. Fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions, e.g. a linear classifier).
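The greedy procedure listed above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the Theano implementation used in this tutorial: the class TinyRBM, the function greedy_pretrain, the toy data and all hyper-parameter values are hypothetical. Each RBM is trained with full-batch CD-1, mean activations are propagated upward, and the fine-tuning of step 5 is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyRBM:
    """Hypothetical minimal binary RBM trained with CD-1 (illustration only)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.hbias = np.zeros(n_hidden)
        self.vbias = np.zeros(n_visible)

    def mean_hidden(self, v):
        # p(h = 1 | v): the "mean activation" representation of step 2
        return sigmoid(v @ self.W + self.hbias)

    def cd1_step(self, v0, lr):
        # Positive phase: hidden means and a binary sample given the data
        h0_mean = self.mean_hidden(v0)
        h0_sample = (self.rng.random(h0_mean.shape) < h0_mean).astype(float)
        # Negative phase: one Gibbs step, using mean values for the
        # reconstruction (a common simplification, assumed here)
        v1_mean = sigmoid(h0_sample @ self.W.T + self.vbias)
        h1_mean = self.mean_hidden(v1_mean)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0_mean - v1_mean.T @ h1_mean) / n
        self.hbias += lr * (h0_mean - h1_mean).mean(axis=0)
        self.vbias += lr * (v0 - v1_mean).mean(axis=0)


def greedy_pretrain(data, layer_sizes, epochs=5, lr=0.1, seed=0):
    """Train one RBM per layer, feeding each the output of the one below."""
    rng = np.random.default_rng(seed)
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm = TinyRBM(layer_input.shape[1], n_hidden, rng)  # steps 1 and 3
        for _ in range(epochs):
            rbm.cd1_step(layer_input, lr)
        layer_input = rbm.mean_hidden(layer_input)          # steps 2 and 4
        rbms.append(rbm)
    return rbms, layer_input


# Toy binary data: 20 examples with 6 visible units
toy = (np.random.default_rng(1).random((20, 6)) > 0.5).astype(float)
rbms, top = greedy_pretrain(toy, layer_sizes=[4, 3])
```

Here top holds the representation produced by the last hidden layer, which is what a supervised classifier would receive as input during fine-tuning.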
In this tutorial, we focus on fine-tuning via supervised gradient descent. Specifically, we use a logistic regression classifier to classify the input based on the output of the last hidden layer of the DBN. Fine-tuning is then performed via supervised gradient descent of the negative log-likelihood cost function. Since the supervised gradient is only non-null for the weights and hidden layer biases of each layer (i.e. null for the visible biases of each RBM), this procedure is equivalent to initializing the parameters of a deep MLP with the weights and hidden layer biases obtained with the unsupervised training strategy.
Justifying Greedy Layer-Wise Pre-Training
Why does such an algorithm work? Taking as an example a 2-layer DBN with hidden layers $h^{(1)}$ and $h^{(2)}$ (with respective weight parameters $W^{(1)}$ and $W^{(2)}$), [Hinton06] established (see also [Bengio09] for a detailed derivation) that $\log p(x)$ can be rewritten as
$$\log p(x) = KL(Q(h^{(1)}|x) \,||\, p(h^{(1)}|x)) + H_{Q(h^{(1)}|x)} + \sum_{h^{(1)}} Q(h^{(1)}|x) \left(\log p(h^{(1)}) + \log p(x|h^{(1)})\right)$$ (2)
$KL(Q(h^{(1)}|x) \,||\, p(h^{(1)}|x))$ represents the KL divergence between the posterior $Q(h^{(1)}|x)$ of the first RBM if it were standalone, and the probability $p(h^{(1)}|x)$ for the same layer but defined by the entire DBN (i.e. taking into account the prior $p(h^{(1)}, h^{(2)})$ defined by the top-level RBM). $H_{Q(h^{(1)}|x)}$ is the entropy of the distribution $Q(h^{(1)}|x)$.
It can be shown that if we initialize both hidden layers such that $W^{(2)} = {W^{(1)}}^T$, then $Q(h^{(1)}|x) = p(h^{(1)}|x)$ and the KL divergence term is null. If we learn the first level RBM and then keep its parameters $W^{(1)}$ fixed, optimizing Eq. (2) with respect to $W^{(2)}$ can thus only increase the likelihood $p(x)$.
Also, notice that if we isolate the terms which depend only on $W^{(2)}$, we get:
$$\sum_{h^{(1)}} Q(h^{(1)}|x) \log p(h^{(1)})$$
Optimizing this with respect to $W^{(2)}$ amounts to training a second-stage RBM, using the output of $Q(h^{(1)}|x)$ as the training distribution, when $x$ is sampled from the training distribution for the first RBM.
Implementation
To implement DBNs in Theano, we will use the class defined in the Restricted Boltzmann Machines (RBM) tutorial. One can also observe that the code for the DBN is very similar to the one for SdA, because both involve the principle of unsupervised layer-wise pre-training followed by supervised fine-tuning as a deep MLP. The main difference is that we use the RBM class instead of the dA class.
We start off by defining the DBN class which will store the layers of the MLP, along with their associated RBMs. Since we take the viewpoint of using the RBMs to initialize an MLP, the code will reflect this by separating as much as possible the RBMs used to initialize the network and the MLP used for classification.
self.sigmoid_layers will store the feed-forward graphs which together form the MLP, while self.rbm_layers will store the RBMs used to pretrain each layer of the MLP.
Next, we construct n_layers sigmoid layers (we use the HiddenLayer class introduced in Multilayer Perceptron, with the only modification that we replaced the non-linearity from tanh to the logistic function $s(x) = \frac{1}{1+e^{-x}}$) and n_layers RBMs, where n_layers is the depth of our model. We link the sigmoid layers such that they form an MLP, and construct each RBM such that it shares the weight matrix and the hidden bias with its corresponding sigmoid layer.
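The parameter sharing described here can be illustrated with a plain-Python analogue. This is only a sketch: in the tutorial the sharing happens through Theano shared variables, whereas the classes SigmoidLayer and SharedRBM below, and the method fake_update, are hypothetical stand-ins that share ordinary Python lists by reference.

```python
class SigmoidLayer:
    """Hypothetical stand-in for a feed-forward layer that owns W and b."""

    def __init__(self, n_in, n_out):
        self.W = [[0.0] * n_out for _ in range(n_in)]
        self.b = [0.0] * n_out


class SharedRBM:
    """Hypothetical stand-in for an RBM that reuses a layer's parameters."""

    def __init__(self, W, hbias, n_visible):
        self.W = W                      # same object as the layer's W
        self.hbias = hbias              # same object as the layer's b
        self.vbias = [0.0] * n_visible  # visible bias belongs to the RBM only

    def fake_update(self, delta):
        # Stand-in for a pre-training step: modify W in place
        for row in self.W:
            for j in range(len(row)):
                row[j] += delta


layer_sizes = [4, 3, 2]  # input size followed by two hidden layer sizes
sigmoid_layers, rbm_layers = [], []
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    layer = SigmoidLayer(n_in, n_out)
    rbm = SharedRBM(W=layer.W, hbias=layer.b, n_visible=n_in)
    sigmoid_layers.append(layer)
    rbm_layers.append(rbm)

# Updating the first RBM changes the first sigmoid layer too
rbm_layers[0].fake_update(0.5)
```

Because both objects refer to the same parameters, whatever pre-training does to an RBM is immediately visible in the MLP layer, which is why no explicit copy is needed before fine-tuning.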
All that is left is to stack one last logistic regression layer in order to form an MLP. We will use the LogisticRegression class introduced in Classifying MNIST digits using Logistic Regression.
The class also provides a method which generates training functions for each of the RBMs. They are returned as a list, where element i is a function which implements one step of training for the RBM at layer i.
In order to be able to change the learning rate during training, we associate a Theano variable to it that has a default value.
Now any function pretrain_fns[i] takes as arguments index and optionally lr (the learning rate). Note that the names of the parameters are the names given to the Theano variables (e.g. lr) when they are constructed, and not the names of the Python variables (e.g. learning_rate). Keep this in mind when working with Theano. Optionally, if you provide k (the number of Gibbs steps to perform in CD or PCD), this will also become an argument of your function.
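The calling convention of these functions can be mimicked in plain Python. The sketch below is not Theano code: make_pretraining_functions and train_step are hypothetical names, and each function only reports which rows of the training set it would use instead of performing a CD-k update. It shows a list with one callable per layer, each taking a mini-batch index and an optional lr with a default value.

```python
def make_pretraining_functions(n_layers, n_batches, batch_size, default_lr=0.1):
    """Return one callable per layer; each takes index and an optional lr."""
    fns = []
    for layer in range(n_layers):
        # _layer=layer binds the current layer number to this function;
        # without it every function would see the last value of the loop.
        def train_step(index, lr=default_lr, _layer=layer):
            if not 0 <= index < n_batches:
                raise IndexError("mini-batch index out of range")
            start = index * batch_size
            end = start + batch_size
            # A real implementation would run CD-k on rows start:end here.
            return {"layer": _layer, "rows": (start, end), "lr": lr}
        fns.append(train_step)
    return fns


pretrain_fns = make_pretraining_functions(n_layers=3, n_batches=5, batch_size=10)
with_default = pretrain_fns[2](index=1)            # uses the default lr
with_override = pretrain_fns[0](index=0, lr=0.01)  # overrides lr for this call
```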
In the same fashion, the DBN class includes a method for building the functions required for finetuning (a train_model, a validate_model and a test_model function).
Note that the returned valid_score and test_score are not Theano functions, but rather Python functions. These loop over the entire validation set and the entire test set to produce a list of the losses obtained over these sets.
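A minimal sketch of such a scoring function is given below. The names make_score_function and batch_loss are hypothetical, and the lambda stands in for a compiled per-batch function; the point is only that the returned object is an ordinary Python closure that loops over mini-batches and collects one loss per batch.

```python
def make_score_function(batch_loss, n_batches):
    """Wrap a per-batch loss function into a function scoring a whole set."""
    def score():
        # One loss value per mini-batch, in order
        return [batch_loss(i) for i in range(n_batches)]
    return score


# Toy stand-in for a compiled function returning the error on batch i
toy_errors = [0.10, 0.20, 0.30]
valid_score = make_score_function(lambda i: toy_errors[i],
                                  n_batches=len(toy_errors))

losses = valid_score()
mean_loss = sum(losses) / len(losses)
```

The caller typically averages the returned list, as done with mean_loss here, to obtain a single validation or test error.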
Putting it all together
The few lines of code below construct the deep belief network:
There are two stages in training this network: (1) a layer-wise pre-training and (2) a fine-tuning stage.
For the pre-training stage, we loop over all the layers of the network. For each layer, we use the compiled Theano function which determines the input to the i-th level RBM and performs one step of CD-k within this RBM. This function is applied to the training set for a fixed number of epochs given by pretraining_epochs.
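The structure of this pre-training loop can be sketched as three nested loops: over layers, over epochs, and over mini-batches. The code below is an illustration only: run_pretraining and make_stub are hypothetical names, and the stub functions return made-up costs in place of the compiled Theano functions.

```python
def run_pretraining(pretraining_fns, n_train_batches, pretraining_epochs,
                    pretrain_lr):
    """Pre-train layer by layer; return the mean cost per (layer, epoch)."""
    mean_costs = []
    for i, train_fn in enumerate(pretraining_fns):   # one layer at a time
        for epoch in range(pretraining_epochs):      # several passes over the data
            costs = [train_fn(index=batch, lr=pretrain_lr)
                     for batch in range(n_train_batches)]
            mean_costs.append((i, epoch, sum(costs) / len(costs)))
    return mean_costs


def make_stub(layer):
    # Hypothetical stand-in for a compiled training function: fake cost only
    def train_fn(index, lr):
        return float(layer) + lr * index
    return train_fn


log = run_pretraining([make_stub(0), make_stub(1)], n_train_batches=4,
                      pretraining_epochs=2, pretrain_lr=0.5)
```

Note that a layer is trained for all of its epochs before the next layer is touched, which matches the greedy, bottom-up order of the procedure.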
The fine-tuning loop is very similar to the one in the Multilayer Perceptron tutorial; the only difference is that we now use the functions given by build_finetune_functions.
Running the Code
The user can run the code by calling:
With the default parameters, the code runs for 100 pre-training epochs with mini-batches of size 10. This corresponds to performing 500,000 unsupervised parameter updates. We use an unsupervised learning rate of 0.01, with a supervised learning rate of 0.1. The DBN itself consists of three hidden layers with 1000 units per layer. With early-stopping, this configuration achieved a minimal validation error of 1.27% with corresponding test error of 1.34% after 46 supervised epochs.
On an Intel(R) Xeon(R) CPU X5560 running at 2.80GHz, using a multi-threaded MKL library (running on 4 cores), pretraining took 615 minutes with an average of 2.05 mins/(layer * epoch). Fine-tuning took only 101 minutes or approximately 2.20 mins/epoch.
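The figures quoted above can be checked with a few lines of arithmetic. One value below is an assumption that is not stated in this section: a training set of 50,000 examples. Under that assumption, the 500,000 updates correspond to the pre-training of a single layer.

```python
n_train_examples = 50_000   # assumption: not stated in this section
batch_size = 10
pretraining_epochs = 100
n_hidden_layers = 3

# Mini-batch updates performed while pre-training one layer
updates_per_layer = (n_train_examples // batch_size) * pretraining_epochs

# Reported wall-clock times, in minutes
pretrain_rate = 615 / (n_hidden_layers * pretraining_epochs)  # per layer and epoch
finetune_rate = 101 / 46                                      # per supervised epoch

print(updates_per_layer, round(pretrain_rate, 2), round(finetune_rate, 2))
```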
Hyper-parameters were selected by optimizing on the validation error. We tested unsupervised learning rates in and supervised learning rates in . We did not use any form of regularization besides early-stopping, nor did we optimize over the number of pretraining updates.
Tips and Tricks
One way to improve the running time of your code (given that you have sufficient memory available) is to compute the representation of the entire dataset at layer i in a single pass, once the weights of the layers up to the (i-1)-th have been fixed. Namely, start by training your first layer RBM. Once it is trained, you can compute the hidden unit values for every example in the dataset and store this as a new dataset which is used to train the 2nd layer RBM. Once you have trained the RBM for layer 2, you compute, in a similar fashion, the dataset for layer 3, and so on. This avoids calculating the intermediate (hidden layer) representations pretraining_epochs times, at the expense of increased memory usage.
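This caching idea can be sketched in NumPy as follows. The sketch makes several assumptions: propagate_up is a hypothetical helper, the weights are random placeholders standing in for already trained RBM parameters, and mean activations are used as the stored representation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def propagate_up(data, W, hbias):
    # Mean hidden activations of one layer for every example at once
    return sigmoid(data @ W + hbias)


rng = np.random.default_rng(0)
dataset = rng.random((8, 5))  # toy data: 8 examples, 5 inputs

# Placeholder parameters standing in for two trained (and now fixed) RBMs
layer_params = [
    (rng.normal(size=(5, 4)), np.zeros(4)),
    (rng.normal(size=(4, 3)), np.zeros(3)),
]

cached = [dataset]  # cached[i] is the training set for the RBM at layer i + 1
for W, hbias in layer_params:
    # In a real run, the RBM for this layer would be trained on cached[-1]
    # before its output is computed once and stored.
    cached.append(propagate_up(cached[-1], W, hbias))

# Recomputing from the raw input gives the same result, at a higher cost
recomputed = propagate_up(propagate_up(dataset, *layer_params[0]),
                          *layer_params[1])
```

Each entry of cached costs memory proportional to the dataset size times the layer width, which is the trade-off mentioned above.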