DeepLearning——Restricted Boltzmann Machines

Restricted Boltzmann Machines


Energy-Based Models (EBM)

   Energy-based models associate a scalar energy to each configuration of the variables of interest. Learning corresponds to modifying that energy function so that its shape has desirable properties. For example, we would like plausible or desirable configurations to have low energy. Energy-based probabilistic models define a probability distribution through an energy function, as follows:

$$P(x) = \frac{e^{-E(x)}}{Z} \tag{1}$$

The normalizing factor Z is called the partition function by analogy with physical systems.
$$Z = \sum_x e^{-E(x)} \tag{2}$$
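As a toy illustration of Eqs. (1) and (2) (my own example, with an arbitrary made-up energy function, not part of the tutorial), a few lines of Python suffice to turn an energy into a probability distribution:

```python
import numpy as np

# Toy discrete domain: x ranges over the integers 0..4 (an assumption for illustration).
xs = np.arange(5)

def energy(x):
    # Lower energy for configurations near x = 2, so they should be more probable.
    return (x - 2.0) ** 2

# Partition function Z = sum_x exp(-E(x))   (Eq. 2)
Z = np.sum(np.exp(-energy(xs)))

# Probability of each configuration P(x) = exp(-E(x)) / Z   (Eq. 1)
P = np.exp(-energy(xs)) / Z
print(P, P.sum())  # probabilities peak at x = 2 and sum to 1
```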

EBMs with Hidden Units

  In many cases of interest, we do not observe the example x fully, or we want to introduce some non-observed variables to increase the expressive power of the model. So we consider an observed part (still denoted x here) and a hidden part h. We can then write:

$$P(x) = \sum_h P(x,h) = \sum_h \frac{e^{-E(x,h)}}{Z} \tag{3}$$

In such cases, to map this formulation to one similar to Eq. (1), we introduce the notation (inspired from physics) of free energy, defined as follows:
$$F(x) = -\log \sum_h e^{-E(x,h)} \tag{4}$$

which allows us to write:
$$P(x) = \frac{e^{-F(x)}}{Z} \quad \text{with} \quad Z = \sum_x e^{-F(x)} \tag{5}$$

The data negative log-likelihood gradient then has a particularly interesting form:
$$-\frac{\partial \log P(x)}{\partial \theta} = \frac{\partial F(x)}{\partial \theta} - \sum_{\bar{x}} P(\bar{x}) \frac{\partial F(\bar{x})}{\partial \theta} \tag{6}$$

The derivation is as follows (note that $x$ and $\bar{x}$ are different):

$$
\begin{aligned}
-\frac{\partial \log P(x)}{\partial \theta}
&= \frac{\partial \log \sum_{\bar{x}} e^{-F(\bar{x})}}{\partial \theta} - \frac{\partial \log e^{-F(x)}}{\partial \theta} \\
&= \frac{\partial F(x)}{\partial \theta} - \frac{1}{\sum_{\bar{x}} e^{-F(\bar{x})}} \sum_{\bar{x}} e^{-F(\bar{x})} \frac{\partial F(\bar{x})}{\partial \theta} \\
&= \frac{\partial F(x)}{\partial \theta} - \sum_{\bar{x}} P(\bar{x}) \frac{\partial F(\bar{x})}{\partial \theta}
\end{aligned}
$$

  Notice that the above gradient contains two terms, which are referred to as the positive and negative phase. The terms positive and negative do not refer to the sign of each term in the equation, but rather reflect their effect on the probability density defined by the model. The first term increases the probability of training data (by reducing the corresponding free energy), while the second term decreases the probability of samples generated by the model.
  It is usually difficult to determine this gradient analytically, as it involves the computation of $E_P\!\left[\frac{\partial F(x)}{\partial \theta}\right]$. This is nothing less than an expectation over all possible configurations of the input $x$ (under the distribution $P$ formed by the model)!
  The first step in making this computation tractable is to estimate the expectation using a fixed number of model samples. Samples used to estimate the negative phase gradient are referred to as negative particles, which are denoted as $\mathcal{N}$. The gradient can then be written as:

$$-\frac{\partial \log p(x)}{\partial \theta} \approx \frac{\partial F(x)}{\partial \theta} - \frac{1}{|\mathcal{N}|} \sum_{\bar{x} \in \mathcal{N}} \frac{\partial F(\bar{x})}{\partial \theta} \tag{7}$$

where we would ideally like the elements $\bar{x}$ of $\mathcal{N}$ to be sampled according to $P$ (i.e., we are doing Monte Carlo).
  With the formulas above, we essentially have everything needed to train an EBM. The only remaining question is how to extract these negative particles $\mathcal{N}$; the next sections introduce a way to do this with Markov Chain Monte Carlo methods.

Restricted Boltzmann Machines (RBM)

    Boltzmann Machines (BMs) are a particular form of log-linear Markov Random Field (MRF), i.e., for which the energy function is linear in its free parameters. To make them powerful enough to represent complicated distributions (i.e., go from the limited parametric setting to a non-parametric one), we consider that some of the variables are never observed (they are called hidden). By having more hidden variables (also called hidden units), we can increase the modeling capacity of the Boltzmann Machine (BM). Restricted Boltzmann Machines further restrict BMs to those without visible-visible and hidden-hidden connections. A graphical depiction of an RBM is shown below.

[Figure: graphical depiction of an RBM]

The energy function E(v,h) of an RBM is defined as:

$$E(v,h) = -b^{\top}v - c^{\top}h - h^{\top}Wv \tag{8}$$

where $W \in \mathbb{R}^{n_h \times n_v}$ represents the weights connecting hidden and visible units, and $b$, $c$ are the offsets of the visible and hidden layers respectively.

This translates directly to the following free energy formula (where $W_i$ denotes the $i$-th row of $W$, and the sum over $i$ runs over all hidden units):

$$F(v) = -b^{\top}v - \sum_i \log \sum_{h_i} e^{h_i (c_i + W_i v)} \tag{9}$$
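Eq. (9) follows by substituting the energy (8) into the free-energy definition (4): since $-E(v,h) = b^{\top}v + \sum_i h_i (c_i + W_i v)$ decomposes into per-hidden-unit terms, the sum over all configurations of $h$ factorizes into a product:

$$
\begin{aligned}
F(v) &= -\log \sum_h e^{-E(v,h)}
= -\log \sum_h e^{\,b^{\top}v + \sum_i h_i (c_i + W_i v)} \\
&= -b^{\top}v - \log \prod_i \sum_{h_i} e^{h_i (c_i + W_i v)}
= -b^{\top}v - \sum_i \log \sum_{h_i} e^{h_i (c_i + W_i v)}
\end{aligned}
$$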

Because of the specific structure of RBMs, visible and hidden units are conditionally independent given one-another. Using this property, we can write:

$$p(h \mid v) = \prod_i p(h_i \mid v) \tag{10}$$

$$p(v \mid h) = \prod_j p(v_j \mid h) \tag{11}$$

RBMs with binary units

    In the commonly studied case of using binary units (where $v_j, h_i \in \{0,1\}$), we obtain from Eqs. (8), (10) and (11) a probabilistic version of the usual neuron activation function:

$$P(h_i = 1 \mid v) = \operatorname{sigmoid}(c_i + W_i v) \tag{12}$$

$$P(v_j = 1 \mid h) = \operatorname{sigmoid}(b_j + W_{\cdot j}^{\top} h) \tag{13}$$

The derivation of $P(v_j = 1 \mid h)$ is as follows.
First, we write Eq. (8) in scalar form:

$$E(v,h) = -\sum_{j=1}^{n_v} b_j v_j - \sum_{i=1}^{n_h} c_i h_i - \sum_{i=1}^{n_h}\sum_{j=1}^{n_v} h_i W_{ij} v_j \tag{8.1}$$

Therefore (writing $v_{-j}$ for all visible units except $v_j$):

$$
\begin{aligned}
P(v_j = 1 \mid h) &= P(v_j = 1 \mid v_{-j}, h) \\
&= \frac{P(v_j = 1, v_{-j}, h)}{P(v_{-j}, h)} \\
&= \frac{P(v_j = 1, v_{-j}, h)}{P(v_j = 1, v_{-j}, h) + P(v_j = 0, v_{-j}, h)} \\
&= \frac{\frac{1}{Z} e^{-E(v_j = 1, v_{-j}, h)}}{\frac{1}{Z} e^{-E(v_j = 1, v_{-j}, h)} + \frac{1}{Z} e^{-E(v_j = 0, v_{-j}, h)}} \\
&= \frac{1}{1 + e^{-E(v_j = 0, v_{-j}, h) + E(v_j = 1, v_{-j}, h)}} \\
&= \frac{1}{1 + e^{-\left(b_j + \sum_{i=1}^{n_h} h_i W_{ij}\right)}} \\
&= \operatorname{sigmoid}(b_j + W_{\cdot j}^{\top} h)
\end{aligned}
$$
Note: $W_{\cdot j}$ denotes the $j$-th column of $W$, while $W_i$ denotes the $i$-th row.
Eq. (12) can be derived in exactly the same way.
The free energy of an RBM with binary units further simplifies to:

$$F(v) = -b^{\top}v - \sum_i \log\left(1 + e^{(c_i + W_i v)}\right) \tag{14}$$

This is because $h_i$ can only take the two values 0 and 1.
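As a concrete illustration of Eqs. (12)-(14), here is a minimal NumPy sketch (my own, with toy dimensions and the $n_h \times n_v$ weight shape assumed in Eq. (8)) of the binary-RBM free energy and conditional probabilities:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed toy dimensions and randomly initialized parameters.
n_v, n_h = 6, 3
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((n_h, n_v))  # weights, shape (n_h, n_v)
b = np.zeros(n_v)                          # visible offsets
c = np.zeros(n_h)                          # hidden offsets

def free_energy(v):
    """F(v) = -b'v - sum_i log(1 + exp(c_i + W_i v))   (Eq. 14)."""
    return -v @ b - np.sum(np.log1p(np.exp(c + W @ v)))

def p_h_given_v(v):
    """P(h_i = 1 | v) = sigmoid(c_i + W_i v)   (Eq. 12)."""
    return sigmoid(c + W @ v)

def p_v_given_h(h):
    """P(v_j = 1 | h) = sigmoid(b_j + W_{.j}' h)   (Eq. 13)."""
    return sigmoid(b + W.T @ h)

v = rng.integers(0, 2, size=n_v).astype(float)  # a random binary visible vector
print(free_energy(v), p_h_given_v(v))
```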
  In principle, Eq. (6) gives the gradient of the log-likelihood with respect to each RBM parameter ($b$, $c$, $W$), but evaluating it exactly is far too expensive; instead, we approximate it by sampling.
$$-\frac{\partial \log P(x)}{\partial \theta} = \frac{\partial F(x)}{\partial \theta} - \sum_{\bar{x}} P(\bar{x}) \frac{\partial F(\bar{x})}{\partial \theta} \tag{6}$$

Sampling in an RBM

  Samples of p(x) can be obtained by running a Markov chain to convergence, using Gibbs sampling as the transition operator.

  Gibbs sampling of the joint of $N$ random variables $S = (S_1, \ldots, S_N)$ is done through a sequence of $N$ sampling sub-steps of the form $S_i \sim p(S_i \mid S_{-i})$, where $S_{-i}$ contains the $N-1$ other random variables in $S$, excluding $S_i$.

  For RBMs, S consists of the set of visible and hidden units. However, since they are conditionally independent, one can perform block Gibbs sampling. In this setting, visible units are sampled simultaneously given fixed values of the hidden units. Similarly, hidden units are sampled simultaneously given the visibles. A step in the Markov chain is thus taken as follows:

$$h^{(n+1)} \sim P(h \mid v^{(n)}) = \operatorname{sigm}(W v^{(n)} + c),$$

$$v^{(n+1)} \sim P(v \mid h^{(n+1)}) = \operatorname{sigm}(W^{\top} h^{(n+1)} + b).$$

Note that both $h$ and $v$ are vectors; all of their components are sampled simultaneously.

  where $h^{(n)}$ refers to the set of all hidden units at the $n$-th step of the Markov chain. What this means is that, for example, $h_i^{(n+1)}$ is randomly chosen to be 1 (versus 0) with probability $\operatorname{sigmoid}(W_i v^{(n)} + c_i)$, and similarly, $v_j^{(n+1)}$ is randomly chosen to be 1 (versus 0) with probability $\operatorname{sigmoid}(b_j + W_{\cdot j}^{\top} h^{(n+1)})$.

This can be illustrated graphically:

[Figure: markov_chain, the Gibbs chain alternating between sampling $h$ and $v$]
As $t \to \infty$, samples $(v^{(t)}, h^{(t)})$ are guaranteed to be accurate samples of $p(v,h)$. [See reference 3 below for a detailed theoretical derivation.]
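For concreteness, here is a minimal NumPy sketch of one such block Gibbs step (my own sketch; it takes the conditional-probability functions, such as the hypothetical p_h_given_v / p_v_given_h helpers from the earlier sketch, as arguments):

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_step(v, p_h_given_v, p_v_given_h):
    """One block Gibbs step: sample h^(n+1) given v^(n), then v^(n+1) given h^(n+1)."""
    h_prob = p_h_given_v(v)                                      # P(h_i = 1 | v^(n)) for all i
    h_new = (rng.random(h_prob.shape) < h_prob).astype(float)    # sample all hidden units at once
    v_prob = p_v_given_h(h_new)                                  # P(v_j = 1 | h^(n+1)) for all j
    v_new = (rng.random(v_prob.shape) < v_prob).astype(float)    # sample all visible units at once
    return v_new, h_new
```

Calling gibbs_step repeatedly produces the chain $v^{(0)} \to h^{(1)} \to v^{(1)} \to \ldots$ illustrated above.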

Contrastive Divergence (CD-k)

  Contrastive Divergence uses the following tricks to speed up the sampling process:

  • since we eventually want $p(v) \approx p_{\text{train}}(v)$ (the true, underlying distribution of the data), we initialize the Markov chain with a training example (i.e., from a distribution that is expected to be close to $p$, so that the chain will already be close to having converged to its final distribution $p$). In other words, we use training samples to initialize the Markov chain, so that the model ends up matching the distribution of the training data;
  • CD does not wait for the chain to converge. Samples are obtained after only k steps of Gibbs sampling. In practice, k = 1 has been shown to work surprisingly well;
  • a chain is restarted for each observed example: once the chain has been run for k steps and the parameter gradients have been computed, the Markov chain is re-initialized from the next training example, the gradients are computed again, and so on.

    We can use the CD-k algorithm to estimate the gradient in Eq. (6):

$$-\frac{\partial \log P(x)}{\partial \theta} = \frac{\partial F(x)}{\partial \theta} - \sum_{\bar{x}} P(\bar{x}) \frac{\partial F(\bar{x})}{\partial \theta} \tag{6}$$

  Here $x$ is an input training example, while $\bar{x}$ is a sample obtained from the Markov chain; it is used to estimate the second term of Eq. (6), and thus the gradient with respect to each parameter.
  For details, see the code fragment below:

# determine gradients on RBM parameters
# note that we only need the sample at the end of the chain
chain_end = nv_samples[-1]

# approximate cost: mean free energy of the data (positive phase)
# minus mean free energy of the chain end (negative phase)
cost = T.mean(self.free_energy(self.input)) - T.mean(
    self.free_energy(chain_end))
# we must not compute the gradient through the Gibbs sampling steps,
# so the chain end is treated as a constant
gparams = T.grad(cost, self.params, consider_constant=[chain_end])
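For readers who want to see the update without Theano's symbolic gradients, here is a minimal NumPy sketch of a CD-1 step for a binary RBM (my own illustration, reusing the parameter convention of the earlier sketch; the analytic free-energy gradients follow from Eq. (14)):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1):
    """One CD-1 update for a binary RBM (a sketch; parameters are updated in place).

    From Eq. (14): dF/dW_ij = -P(h_i=1|v) v_j, dF/db_j = -v_j, dF/dc_i = -P(h_i=1|v).
    Eq. (7) is estimated with a single negative particle obtained by one
    Gibbs step started from the training example v0.
    """
    ph0 = sigmoid(c + W @ v0)                          # positive phase: P(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample h0
    pv1 = sigmoid(b + W.T @ h0)                        # reconstruct the visible units
    v1 = (rng.random(pv1.shape) < pv1).astype(float)   # sample v1 (the chain end)
    ph1 = sigmoid(c + W @ v1)                          # negative phase: P(h = 1 | v1)

    # gradient ascent on the log-likelihood: positive phase minus negative phase
    W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
```

The two terms mirror the `cost` in the Theano fragment above: statistics from the training example minus statistics from the end of the Gibbs chain.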

Persistent CD

  Persistent CD [Tieleman08] (see the paper for details) uses another approximation for sampling from $p(v,h)$. It relies on a single Markov chain which has a persistent state (i.e., a chain is not restarted for each observed example). For each parameter update, we extract new samples by simply running the chain for k steps. The state of the chain is then preserved for subsequent updates.

  The general intuition is that if parameter updates are small enough compared to the mixing rate of the chain, the Markov chain should be able to “catch up” to changes in the model.
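A minimal sketch of how this differs from CD-k, under the same assumptions as the hypothetical CD-1 sketch above: the negative-phase (fantasy) particle is carried over from one update to the next instead of being re-initialized from the data.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_update(v0, v_persistent, W, b, c, lr=0.1, k=1):
    """One PCD-k update: positive phase from the data, negative phase from a persistent chain."""
    ph0 = sigmoid(c + W @ v0)                    # positive phase statistics from a training example
    vk = v_persistent
    for _ in range(k):                           # run the persistent chain for k Gibbs steps
        hk = (rng.random(W.shape[0]) < sigmoid(c + W @ vk)).astype(float)
        vk = (rng.random(W.shape[1]) < sigmoid(b + W.T @ hk)).astype(float)
    phk = sigmoid(c + W @ vk)                    # negative phase statistics from the chain end
    W += lr * (np.outer(ph0, v0) - np.outer(phk, vk))
    b += lr * (v0 - vk)
    c += lr * (ph0 - phk)
    return vk  # reused as v_persistent for the next update; never reset from the data
```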

An RBM example

For the full code, see RestrictedBoltzmannMachines.py on my GitHub.

References


【1】DeepLearning Tutorial: Restricted Boltzmann Machines (RBM), a tutorial on implementing deep learning algorithms with Theano.
【2】Section 5 of Learning Deep Architectures for AI (Yoshua Bengio), which explains how to train RBMs.
【3】受限玻尔兹曼机(RBM)学习笔记, a very detailed and accessible explanation of the theory behind RBMs.
