RestrictedBoltzmannMachines.py源码剖析

最新推荐文章于 2024-06-18 16:48:36 发布

yyHaker

最新推荐文章于 2024-06-18 16:48:36 发布

阅读量343

点赞数 1

分类专栏：深度学习

本文链接：https://blog.csdn.net/yyHaker/article/details/78434651

版权

深度学习专栏收录该内容

6 篇文章 0 订阅

订阅专栏

RestrictedBoltzmannMachines.py源码剖析

这篇博客是承接我的上一篇Restricted Boltzmann Machines的一个实现RBM的例子，代码详见我的github：RestrictedBoltzmannMachines.py.

RBM.class

We construct an RBM class. The parameters of the network can either be initialized by the constructor or can be passed as arguments. This option is useful when an RBM is used as the building block of a deep network, in which case the weight matrix and the hidden layer bias is shared with the corresponding sigmoidal layer of an MLP network.

class RBM(object):
    """Restricted Bolzmann Machines. (RBM)"""
    def __init__(self, input=None, n_visible=28 * 28, n_hidden=500, w=None, hbias=None, vbias=None,
                 numpy_rng=None, theano_rng=None):
        """
        RBM constructor. Define the parameters of the model along with basic operations for inferring
        hidden from visible (and vice-versa), as well as for performing CD updates.
        :param input: None for standalone RBMs or symbolic variable if RBM is part of a larger graph.
        :param n_visible: number of visible units
        :param n_hidden: number of hidden units
        :param w: None for standalone RBMs or symbolic variable pointing to a shared weights matrix in case
          RBM is part of a DBN network; in a DBN , the weights are shared between RBMs and layers of a MLP
        :param hbias: None for standalone RBMs or symbolic variable pointing to  a shared hidden units bias
          vector in case RBM is part of a different network
        :param vbias: None for standalone RBMs or a symbolic variable pointing to a shared visible units bias
        :param numpy_rng:
        :param theano_rng:
        """
        self.n_visible = n_visible
        self.n_hidden = n_hidden

        # create a number generator
        if numpy_rng is None:
            numpy_rng = numpy.random.RandomState(1234)
        if theano_rng is None:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

        # initial w and hbias, vbias, create theano shared variables and bias
        if w is None:
            initial_w = numpy.asarray(numpy_rng.uniform(
                low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                size=(n_visible, n_hidden)
            ), dtype=theano.config.floatX)
        w = theano.shared(value=initial_w, name='w', borrow=True)

        if hbias is None:
            hbias = theano.shared(value=numpy.zeros(n_hidden, dtype=theano.config.floatX),
                                  name='hbias', borrow=True)
        if vbias is None:
            vbias = theano.shared(value=numpy.zeros(n_visible, dtype=theano.config.floatX),
                                  name='vbias', borrow=True)

        # initialize input layer for standalone RBMs or layer0 for DBN
        self.input = input
        if not input:
            self.input = T.matrix('input')

        self.w = w
        self.vbias = vbias
        self.hbias = hbias
        self.theano_rng = theano_rng

        self.params = [self.w, self.hbias, self.vbias]

Next step is to define functions which construct the symbolic graph associated with Eqs.(12)-(13),the code is as follows:

def propup(self, vis):
        """
        This function propagates the visible units activation upwards to the hidden units.
        :param vis:
        :return:
        """
        pre_sigmoid_activation = T.dot(vis, self.w) + self.hbias
        return [pre_sigmoid_activation, T.nnet.sigmoid(pre_sigmoid_activation)]

def sample_h_given_v(self, v0_example):
        """
        This function infers state of hidden units given visible units.
        :param v0_example:
        :return:
        """
        # p(h|v), compute the activation of the hidden units given a sample of the visibles.
        pre_sigmoid_h1, h1_mean = self.propup(v0_example)

        # get a sample of the hiddens given their activation
        # 利用p(h|v)采样h
        h1_sample = self.theano_rng.binomial(size=h1_mean.shape, n=1, p=h1_mean, dtype=theano.config.floatX)
        return [pre_sigmoid_h1, h1_mean, h1_sample]

def propdown(self, hid):
        """
        This function propagates the hidden units activation down to the visible  units.
        :param hid:
        :return:
        """
        pre_sigmoid_activation = T.dot(hid, self.w.T) + self.vbias
        return [pre_sigmoid_activation, T.nnet.sigmoid(pre_sigmoid_activation)]

def sample_v_given_h(self, h0_sample):
        """
        This function infers state of visible units given hidden units.
        :param h0_sample:
        :return:
        """
        # p(v|h), compute the activation of the visible units given the hidden sample
        pre_sigmoid_v1, v1_mean = self.propdown(h0_sample)

        # get a sample of the visible given their activation
        v1_example = self.theano_rng.binomial(size=v1_mean.shape, n=1, p=v1_mean, dtype=theano.config.floatX)
        return [pre_sigmoid_v1, v1_mean, v1_example]

We can then use these functions to define the symbolic graph for a Gibbs sampling step. We define two functions:

gibbs_vhv which performs a step of Gibbs sampling starting from the visible units. As we shall see, this will be useful for sampling from the RBM.
gibbs_hvh which performs a step of Gibbs sampling starting from the hidden units. This function will be useful for performing CD and PCD updates.

The code is as follows:

def gibbs_hvh(self, h0_sample):
        """
        This function implements one step of Gibbs sampling, starting from the hidden state.
        :param h0_sample:
        :return:
        """
        pre_sigmoid_v1, v1_mean, v1_sample = self.sample_v_given_h(h0_sample)
        pre_sigmoid_h1, h1_mean, h1_sample = self.sample_h_given_v(v1_sample)
        return [pre_sigmoid_v1, v1_mean, v1_sample,
                pre_sigmoid_h1, h1_mean, h1_sample]

def gibbs_vhv(self, v0_sample):
        """
        This function implements one step of Gibbs sampling, starting from the visible state.
        :param v0_sample:
        :return:
        """
        pre_sigmoid_h1, h1_mean, h1_sample = self.sample_h_given_v(v0_sample)
        pre_sigmoid_v1, v1_mean, v1_sample = self.sample_v_given_h(h1_sample)
        return [pre_sigmoid_h1, h1_mean, h1_sample,
                pre_sigmoid_v1, v1_mean, v1_sample]

The class also has a function that computes the free energy of the model, needed for computing the gradient of the parameters (see Eq. (14) (6)). Note that we also return the pre-sigmoid.

F (v) = - b' v - \sum i log (1 + e (c i + W i v)) . (14)

$\mathcal{F}(v)= - b'v - \sum_i \log(1 + e^{(c_i + W_i v)}). \tag{14}$

- \partial l o g P ( x ) \partial θ = \partial F ( x ) \partial θ - \sum x ¯ P (x ¯) \partial F ( x ¯ ) \partial θ (6)

$- \frac{\partial logP(x)}{\partial \theta} = \frac{\partial \mathcal F(x)}{\partial \theta} - \sum_\bar x P(\bar x)\frac{\partial \mathcal F(\bar x)}{\partial \theta} \tag 6$

The code is as follows:

def free_energy(self, v_sample):
        """
        Function to compute the free energy.
        :param v_sample:
        :return:
        """
        wx_b = T.dot(v_sample, self.w) + self.hbias
        vbias_term = T.dot(v_sample, self.vbias)
        hidden_term = T.sum(T.log(1 + T.exp(wx_b)), axis=1)
        return -hidden_term - vbias_term

We then add a get_cost_updates method, whose purpose is to generate the symbolic gradients for CD-k and PCD-k updates.

 def get_cost_updates(self, learning_rate=0.1, persistent=None, k=1):
        """
        This function implements one step of CD-k or PCD-k.
        :param learning_rate: learning rate used to train the RBM.
        :param persistent: None for CD. For PCD, shared variables containing old state of Gibbs
        chain. This must be a shared variable of size (batch_size, number of hidden units).
        :param k: number of Gibbs steps to do in CD-k / PCD-k
        :return: a proxy for the cost and the updates dictionary.The
        dictionary contains the update rules for weights and biases but
        also an update of the shared variable used to store the persistent
        chain, if one is used.
        """

        # compute positive phase
        pre_sigmoid_ph, ph_mean, ph_sample = self.sample_h_given_v(self.input)

        # decide how to initialize persistent chain: for CD, we use the newly generate hidden sample,
        # for PCD, we initialize from the old state of the chain
        if persistent is None:
            chain_start = ph_sample
        else:
            chain_start = persistent

        # perform actual negative phase
        # use "scan" to scan over the function that implements on gibbs step k times
        # the scan will return the entire Gibbs chain
        # the None are place holders, saying that chain_start is the initial state corresponding to the 6th
        ([pre_sigmoid_nvs, nv_means, nv_samples, pre_sigmoid_nhs, nh_means, nh_samples], updates) = \
        theano.scan(self.gibbs_hvh, outputs_info=[None, None, None, None, None, chain_start], n_steps=k,
                    name="gibbs_hvh")

        # determine gradients on RBM parameters
        # note that we only need the sample at the end of the chain
        chain_end = nv_samples[-1]

        cost = T.mean(self.free_energy(self.input)) - T.mean(self.free_energy(chain_end))
        # we must not compute the gradient through the gibbs sampling
        gparams = T.grad(cost, self.params, consider_constant=[chain_end])

        # construct the update dictionary
        for gparam, param in zip(gparams, self.params):
            updates[param] = param - gparam * T.cast(learning_rate, theano.config.floatX)
        if persistent:
            updates[persistent] = nh_samples[-1]
            # pseudo-likelihood cross-entropy is a better proxy for PCD
            monitoring_cost = self.get_pseudo_likelihood_cost(updates)
        else:
            # reconstruction cross-entropy is a better proxy for CD
            monitoring_cost = self.get_reconstruction_cost(updates, pre_sigmoid_nvs[-1])

        return monitoring_cost, updates

如果使用CD-k sampling，那么monitoring_cost 直接使用reconstruction cross-entropy，代码如下：

def get_reconstruction_cost(self, updates, pre_sigmoid_nv):
        """
        Approximation to the reconstruction error.
        :param updates:
        :param pre_sigmoid_nv:
        :return:
        """
        cross_entropy = T.mean(T.sum(self.input * T.log(T.nnet.sigmoid(pre_sigmoid_nv)) +
                                     (1 - self.input) * T.log(1 - T.nnet.sigmoid(pre_sigmoid_nv)), axis=1))
        return cross_entropy

注意这里就是一个pre_sigmoid_nv被返回的原因，是为了theano能够计算优化。

如果使用PCD-k sampling,那么monitoring_cost使用pseudo-likelihood。
Pseudo-likelihood (PL) is much less expensive to compute, as it assumes that all bits are independent. Therefore,

P L (x) = \prod i P (x i | x - i) and log P L (x) = \sum i log P (x i | x - i)

$PL(x) = \prod_i P(x_i | x_{-i}) \text{ and } \log PL(x) = \sum_i \log P(x_i | x_{-i})$
Here

x−i $x_{-i}$ denotes the set of all bits of

x $x$ except bit i. The log-PL is therefore the sum of the log-probabilities of each bit

xi $x_i$ , conditioned on the state of all other bits. For MNIST, this would involve summing over the 784 input dimensions, which remains rather expensive. For this reason, we use the following stochastic approximation to log-PL:

g = N \cdot log P (x i | x - i), where i \sim U (0, N), and E [g] = log P L (x)

$g = N \cdot \log P(x_i | x_{-i}) \text{, where } i \sim U(0,N), \text{ and}\\ E[ g ] = \log PL(x)$

where the expectation is taken over the uniform random choice of index i, and N is the number of visible units. In order to work with binary units, we further introduce the notation $\tilde{x}_i$ to refer to x with bit-i being flipped (1->0, 0->1). The log-PL for an RBM with binary units is then written as:

log P L (x) \approx N \cdot log P (x i | x - i) \approx N \cdot log P ( x i , x - i ) P ( x - i ) \approx N \cdot log P ( x ) P ( x i , x - i ) + P ( x ~ i , x - i ) \approx N \cdot log e - F E ( x ) e - F E ( x ) + e - F E ( x ~ i ) \approx N \cdot log [s i g m o i d (F E (x ~ i) - F E (x))]

$\log PL(x) \approx N \cdot \log P(x_i|x_{-i}) \\ \approx N \cdot \log \frac{P(x_i, x_{-i})}{P(x_{-i})}\\ \approx N \cdot \log \frac{P(x)}{P(x_i, x_{-i}) + P(\tilde{x}_i, x_{-i}) } \\ \approx N \cdot \log \frac {e^{-FE(x)}} {e^{-FE(x)} + e^{-FE(\tilde{x}_i)}} \\ \approx N \cdot \log[ sigmoid (FE(\tilde{x}_i) - FE(x)) ]$

 def get_pseudo_likelihood_cost(self, updates):
        """
        Stochastic approximation to the pseudo-likelihood.
        :param updates:
        :return:
        """
        # index of bit i in expression p(xi|x-i)
        bit_i_idx = theano.shared(value=0, name='bit_i_idx')

        # binarize the input image by rounding to nearest integer
        xi = T.round(self.input)

        # calculate free energy for the given bit configuration
        fe_xi = self.free_energy(xi)

        # flip bit x_i of matrix xi and preserve all other bits x-i
        xi_flip = T.set_subtensor(xi[:, bit_i_idx], 1 - xi[:, bit_i_idx])

        # calculate free energy for given bit flipped
        fe_xi_flip = self.free_energy(xi_flip)

        cost = T.mean(self.n_visible * T.log(T.nnet.sigmoid(fe_xi_flip - fe_xi)))

        # increment bit_i_idx % number as part of updates
        updates[bit_i_idx] = (bit_i_idx + 1) % self.n_visible

        return cost

Main Loop

We now have all the necessary ingredients to start training our network.

Before going over the training loop however, the reader should familiarize himself with the function tile_raster_images (see tile_raster_images). Since RBMs are generative models, we are interested in sampling from them and plotting/visualizing these samples. We also want to visualize the filters (weights) learnt by the RBM, to gain insights into what the RBM is actually doing. Bear in mind however, that this does not provide the entire story, since we neglect the biases and plot the weights up to a multiplicative constant (weights are converted to values between 0 and 1).
Having these utility functions, we can start training the RBM and plot/save the filters after each training epoch. We train the RBM using PCD, as it has been shown to lead to a better generative model ([Tieleman08]).

def test_rbm(learning_rate=0.1, training_epochs=15, dataset='mnist.pkl.gz', batch_size=20, n_chains=20,
             n_samples=10, output_folder='rbm_plots', n_hidden=600):
    """
    Demonstrate how to train and afterwards sample from it using Theano.
    :param learning_rate:
    :param training_epochs:
    :param dataset:
    :param batch_size:
    :param n_chains: number of parallel Gibbs chains used to train the RBM
    :param n_samples: number of samples to plot for each chain
    :param output_folder:
    :param n_hidden:
    :return:
    """
    datasets = load_data(dataset)

    train_set_x, train_set_y = datasets[0]
    test_set_x, test_set_y = datasets[2]

    n_train_batches = train_set_x.get_value(borrow=True).shape[0] // batch_size

    # allocate symbolic variables for the data
    index = T.lscalar('index')  # index to a minibatch
    x = T.matrix('x')  # the data is presented as rasterized images

    rng = numpy.random.RandomState(123)
    theano_rng = RandomStreams(rng.randint(2 ** 30))

    # initialize storage for the persistent chain (state = hidden layer of chain)
    persistent_chain = theano.shared(numpy.zeros((batch_size, n_hidden), dtype=theano.config.floatX),
                                     borrow=True)
    # construct the RBM class
    rbm = RBM(input=x, n_visible=28 * 28, n_hidden=n_hidden, numpy_rng=rng, theano_rng=theano_rng)

    # get the cost and gradient corresponding to one step of CD-15
    cost, updates = rbm.get_cost_updates(learning_rate=learning_rate, persistent=persistent_chain, k=15)

    """
    Train the RBM
    """
    if not os.path.isdir(output_folder):
        os.makedirs(output_folder)
    os.chdir(output_folder)

    train_rbm = theano.function(inputs=[index], outputs=cost, updates=updates,
                                givens={
                                    x: train_set_x[index * batch_size: (index + 1) * batch_size]
                                }, name='train_rbm')

    plotting_time = 0.
    start_time = timeit.default_timer()

    # go through training epochs
    for epoch in range(training_epochs):
        mean_cost = []
        for batch_index in range(n_train_batches):
            mean_cost += [train_rbm(batch_index)]
        print "Training epoch %d, cost is " % epoch, numpy.mean(mean_cost)

        # plot filters after each training epoch
        plotting_start = timeit.default_timer()

        # construct image from the weight matrix
        image = Image.fromarray(tile_raster_images(X=rbm.w.get_value(borrow=True).T, img_shape=(28, 28),
                                                   tile_shape=(10, 10), tile_spacing=(1, 1)))
        image.save('filters_at_epoch_%i.png' % epoch)
        plotting_stop = timeit.default_timer()
        plotting_time += (plotting_stop - plotting_start)

    end_time = timeit.default_timer()

    pretraining_time = (end_time - start_time) - plotting_time

    print "Training took %f minutes " % (pretraining_time / 60.)

visualize the filters (weights) learnt by the RBM

经过15次迭代后我们得到了‘filters_at_epoch_14.png’，如下图所示：

这里写图片描述

Sampling from the RBM

Once the RBM is trained, we can then use the gibbs_vhv function to implement the Gibbs chain required for sampling. We initialize the Gibbs chain starting from test examples (although we could as well pick it from the training set) in order to speed up convergence and avoid problems with random initialization. We again use Theano’s scan op to do 1000 steps before each plotting.

 """
    Sampling from the RBM
    """
    # find out the number of test samples
    number_of_test_samples = test_set_x.get_value(borrow=True).shape[0]

    # pick random test examples, with which to initialize the persistent chain
    test_idx = rng.randint(number_of_test_samples - n_chains)
    persistent_vis_chain = theano.shared(numpy.asarray(test_set_x.get_value(borrow=True)[test_idx: test_idx + n_chains],
                                                       dtype=theano.config.floatX))

Next we create the 20 persistent chains in parallel to get our samples. To do so, we compile a theano function which performs one Gibbs step and updates the state of the persistent chain with the new visible sample. We apply this function iteratively for a large number of steps, plotting the samples at every 1000 steps.

plot_every = 1000
    # compile a theano function which performs one Gibbs step and updates the state of the persistent
    # chain with the new visible sample.
    # define one step of Gibbs sampling (mf = mean - field)
    # define a function that does 'plot_every' steps before returning the sample for plotting
    # gibbs_vhv return: [pre_sigmoid_h1, h1_mean, h1_sample, pre_sigmoid_v1, v1_mean, v1_sample]
    ([presig_hids, hid_mfs, hid_samples, presig_vis, vis_mfs, vis_samples], updates) = \
    theano.scan(rbm.gibbs_vhv, outputs_info=[None, None, None, None, None, persistent_vis_chain], n_steps=plot_every,
                name='gibbs_vhv')

    # add to updates the shared variable that takes care of our persistent chain
    updates.update({persistent_vis_chain: vis_samples[-1]})

    # construct the function that implements our persistent chain.
    # we generate the "mean field" activations for plotting and the actual
    # samples for reinitializing the state of our persistent chain
    sample_fn = theano.function(inputs=[], outputs=[vis_mfs[-1], vis_samples[-1]], updates=updates, name='sample_fn')

    # create a space to store the image for plotting (we need to leave room for the tile_spacing as well)
    image_data = numpy.zeros((29 * n_samples + 1, 29 * n_chains - 1), dtype='uint8')

    # vis_mfs(n_chains, nv)
    for idx in range(n_samples):
        # generate `plot_every` intermediate samples that we discard,
        # because successive samples in the chain are too correlated
        vis_mf, vis_sample = sample_fn()
        print(' ... plotting sample %d' % idx)
        image_data[29 * idx:29 * idx + 28, :] = tile_raster_images(
            X=vis_mf,
            img_shape=(28, 28),
            tile_shape=(1, n_chains),
            tile_spacing=(1, 1)
        )

        # construct image
    image = Image.fromarray(image_data)
    image.save('samples.png')
    # end-snippet-7
    os.chdir('../')