DeepLearning——Restricted Boltzmann Machines

Restricted Boltzmann Machines


Energy-Based Models (EBM)

   Energy-based models associate a scalar energy to each configuration of the variables of interest. Learning corresponds to modifying that energy function so that its shape has desirable properties. For example, we would like plausible or desirable configurations to have low energy. Energy-based probabilistic models define a probability distribution through an energy function, as follows:

$$P(x) = \frac{e^{-E(x)}}{Z} \tag{1}$$

The normalizing factor Z is called the partition function by analogy with physical systems.
$$Z = \sum_x e^{-E(x)} \tag{2}$$
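As a toy illustration of Eqs. (1) and (2) (my own example, with an arbitrary made-up energy function, not part of the tutorial), a few lines of Python suffice to turn an energy into a probability distribution:

```python
import numpy as np

# Toy discrete domain: x ranges over the integers 0..4 (an assumption for illustration).
xs = np.arange(5)

def energy(x):
    # Lower energy for configurations near x = 2, so they should be more probable.
    return (x - 2.0) ** 2

# Partition function Z = sum_x exp(-E(x))   (Eq. 2)
Z = np.sum(np.exp(-energy(xs)))

# Probability of each configuration P(x) = exp(-E(x)) / Z   (Eq. 1)
P = np.exp(-energy(xs)) / Z
print(P, P.sum())  # probabilities peak at x = 2 and sum to 1
```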

EBMs with Hidden Units

  In many cases of interest, we do not observe the example x fully, or we want to introduce some non-observed variables to increase the expressive power of the model. So we consider an observed part (still denoted x here) and a hidden part h. We can then write:

$$P(x) = \sum_h P(x,h) = \sum_h \frac{e^{-E(x,h)}}{Z} \tag{3}$$

In such cases, to map this formulation to one similar to Eq. (1), we introduce the notation (inspired from physics) of free energy, defined as follows:
$$F(x) = -\log \sum_h e^{-E(x,h)} \tag{4}$$

which allows us to write:
$$P(x) = \frac{e^{-F(x)}}{Z} \quad \text{with} \quad Z = \sum_x e^{-F(x)} \tag{5}$$

The data negative log-likelihood gradient then has a particularly interesting form:
$$-\frac{\partial \log P(x)}{\partial \theta} = \frac{\partial F(x)}{\partial \theta} - \sum_{\bar{x}} P(\bar{x}) \frac{\partial F(\bar{x})}{\partial \theta} \tag{6}$$

The derivation is as follows (note that $x$ and $\bar{x}$ are different):

$$
\begin{aligned}
-\frac{\partial \log P(x)}{\partial \theta}
&= \frac{\partial \log \sum_{\bar{x}} e^{-F(\bar{x})}}{\partial \theta} - \frac{\partial \log e^{-F(x)}}{\partial \theta} \\
&= \frac{\partial F(x)}{\partial \theta} - \frac{1}{\sum_{\bar{x}} e^{-F(\bar{x})}} \sum_{\bar{x}} e^{-F(\bar{x})} \frac{\partial F(\bar{x})}{\partial \theta} \\
&= \frac{\partial F(x)}{\partial \theta} - \sum_{\bar{x}} P(\bar{x}) \frac{\partial F(\bar{x})}{\partial \theta}
\end{aligned}
$$

  Notice that the above gradient contains two terms, which are referred to as the positive and negative phase. The terms positive and negative do not refer to the sign of each term in the equation, but rather reflect their effect on the probability density defined by the model. The first term increases the probability of training data (by reducing the corresponding free energy), while the second term decreases the probability of samples generated by the model.
  It is usually difficult to determine this gradient analytically, as it involves the computation of $E_P\!\left[\frac{\partial F(x)}{\partial \theta}\right]$. This is nothing less than an expectation over all possible configurations of the input $x$ (under the distribution $P$ formed by the model)!
  The first step in making this computation tractable is to estimate the expectation using a fixed number of model samples. Samples used to estimate the negative phase gradient are referred to as negative particles, which are denoted as $\mathcal{N}$. The gradient can then be written as:

$$-\frac{\partial \log p(x)}{\partial \theta} \approx \frac{\partial F(x)}{\partial \theta} - \frac{1}{|\mathcal{N}|} \sum_{\bar{x} \in \mathcal{N}} \frac{\partial F(\bar{x})}{\partial \theta} \tag{7}$$

where we would ideally like the elements $\bar{x}$ of $\mathcal{N}$ to be sampled according to $P$ (i.e., we are doing Monte Carlo).
  With the formulas above, we essentially have everything needed to train an EBM. The only remaining question is how to extract these negative particles $\mathcal{N}$; the next sections introduce a way to do this with Markov Chain Monte Carlo methods.

Restricted Boltzmann Machines (RBM)

    Boltzmann Machines (BMs) are a particular form of log-linear Markov Random Field (MRF), i.e., for which the energy function is linear in its free parameters. To make them powerful enough to represent complicated distributions (i.e., go from the limited parametric setting to a non-parametric one), we consider that some of the variables are never observed (they are called hidden). By having more hidden variables (also called hidden units), we can increase the modeling capacity of the Boltzmann Machine (BM). Restricted Boltzmann Machines further restrict BMs to those without visible-visible and hidden-hidden connections. A graphical depiction of an RBM is shown below.

[Figure: graphical depiction of an RBM]

The energy function E(v,h) of an RBM is defined as:

$$E(v,h) = -b^{\top}v - c^{\top}h - h^{\top}Wv \tag{8}$$

where $W \in \mathbb{R}^{n_h \times n_v}$ represents the weights connecting hidden and visible units, and $b$, $c$ are the offsets of the visible and hidden layers respectively.

This translates directly to the following free energy formula (where $W_i$ denotes the $i$-th row of $W$, and the sum over $i$ runs over all hidden units):

$$F(v) = -b^{\top}v - \sum_i \log \sum_{h_i} e^{h_i (c_i + W_i v)} \tag{9}$$
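Eq. (9) follows by substituting the energy (8) into the free-energy definition (4): since $-E(v,h) = b^{\top}v + \sum_i h_i (c_i + W_i v)$ decomposes into per-hidden-unit terms, the sum over all configurations of $h$ factorizes into a product:

$$
\begin{aligned}
F(v) &= -\log \sum_h e^{-E(v,h)}
= -\log \sum_h e^{\,b^{\top}v + \sum_i h_i (c_i + W_i v)} \\
&= -b^{\top}v - \log \prod_i \sum_{h_i} e^{h_i (c_i + W_i v)}
= -b^{\top}v - \sum_i \log \sum_{h_i} e^{h_i (c_i + W_i v)}
\end{aligned}
$$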

Because of the specific structure of RBMs, visible and hidden units are conditionally independent given one-another. Using this property, we can write:

$$p(h \mid v) = \prod_i p(h_i \mid v) \tag{10}$$

$$p(v \mid h) = \prod_j p(v_j \mid h) \tag{11}$$

RBMs with binary units

    In the commonly studied case of using binary units (where $v_j, h_i \in \{0,1\}$), we obtain from Eqs. (8), (10) and (11) a probabilistic version of the usual neuron activation function:

$$P(h_i = 1 \mid v) = \operatorname{sigmoid}(c_i + W_i v) \tag{12}$$

$$P(v_j = 1 \mid h) = \operatorname{sigmoid}(b_j + W_{\cdot j}^{\top} h) \tag{13}$$

The derivation of $P(v_j = 1 \mid h)$ is as follows.
First, we write Eq. (8) in scalar form:

$$E(v,h) = -\sum_{j=1}^{n_v} b_j v_j - \sum_{i=1}^{n_h} c_i h_i - \sum_{i=1}^{n_h}\sum_{j=1}^{n_v} h_i W_{ij} v_j \tag{8.1}$$

Therefore (writing $v_{-j}$ for all visible units except $v_j$):

$$
\begin{aligned}
P(v_j = 1 \mid h) &= P(v_j = 1 \mid v_{-j}, h) \\
&= \frac{P(v_j = 1, v_{-j}, h)}{P(v_{-j}, h)} \\
&= \frac{P(v_j = 1, v_{-j}, h)}{P(v_j = 1, v_{-j}, h) + P(v_j = 0, v_{-j}, h)} \\
&= \frac{\frac{1}{Z} e^{-E(v_j = 1, v_{-j}, h)}}{\frac{1}{Z} e^{-E(v_j = 1, v_{-j}, h)} + \frac{1}{Z} e^{-E(v_j = 0, v_{-j}, h)}} \\
&= \frac{1}{1 + e^{-E(v_j = 0, v_{-j}, h) + E(v_j = 1, v_{-j}, h)}} \\
&= \frac{1}{1 + e^{-\left(b_j + \sum_{i=1}^{n_h} h_i W_{ij}\right)}} \\
&= \operatorname{sigmoid}(b_j + W_{\cdot j}^{\top} h)
\end{aligned}
$$
Note: $W_{\cdot j}$ denotes the $j$-th column of $W$, while $W_i$ denotes the $i$-th row.
Eq. (12) can be derived in exactly the same way.
The free energy of an RBM with binary units further simplifies to:

$$F(v) = -b^{\top}v - \sum_i \log\left(1 + e^{(c_i + W_i v)}\right) \tag{14}$$

This is because $h_i$ can only take the two values 0 and 1.
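As a concrete illustration of Eqs. (12)-(14), here is a minimal NumPy sketch (my own, with toy dimensions and the $n_h \times n_v$ weight shape assumed in Eq. (8)) of the binary-RBM free energy and conditional probabilities:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed toy dimensions and randomly initialized parameters.
n_v, n_h = 6, 3
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((n_h, n_v))  # weights, shape (n_h, n_v)
b = np.zeros(n_v)                          # visible offsets
c = np.zeros(n_h)                          # hidden offsets

def free_energy(v):
    """F(v) = -b'v - sum_i log(1 + exp(c_i + W_i v))   (Eq. 14)."""
    return -v @ b - np.sum(np.log1p(np.exp(c + W @ v)))

def p_h_given_v(v):
    """P(h_i = 1 | v) = sigmoid(c_i + W_i v)   (Eq. 12)."""
    return sigmoid(c + W @ v)

def p_v_given_h(h):
    """P(v_j = 1 | h) = sigmoid(b_j + W_{.j}' h)   (Eq. 13)."""
    return sigmoid(b + W.T @ h)

v = rng.integers(0, 2, size=n_v).astype(float)  # a random binary visible vector
print(free_energy(v), p_h_given_v(v))
```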
  In principle, Eq. (6) gives the gradient of the log-likelihood with respect to each RBM parameter ($b$, $c$, $W$), but evaluating it exactly is far too expensive; instead, we approximate it by sampling.
$$-\frac{\partial \log P(x)}{\partial \theta} = \frac{\partial F(x)}{\partial \theta} - \sum_{\bar{x}} P(\bar{x}) \frac{\partial F(\bar{x})}{\partial \theta} \tag{6}$$

Sampling in an RBM

  Samples of p(x) can be obtained by running a Markov chain to convergence, using Gibbs sampling as the transition operator.

  Gibbs sampling of the joint of $N$ random variables $S = (S_1, \ldots, S_N)$ is done through a sequence of $N$ sampling sub-steps of the form $S_i \sim p(S_i \mid S_{-i})$, where $S_{-i}$ contains the $N-1$ other random variables in $S$, excluding $S_i$.

  For RBMs, S consists of the set of visible and hidden units. However, since they are conditionally independent, one can perform block Gibbs sampling. In this setting, visible units are sampled simultaneously given fixed values of the hidden units. Similarly, hidden units are sampled simultaneously given the visibles. A step in the Markov chain is thus taken as follows:

$$h^{(n+1)} \sim P(h \mid v^{(n)}) = \operatorname{sigm}(W v^{(n)} + c),$$

$$v^{(n+1)} \sim P(v \mid h^{(n+1)}) = \operatorname{sigm}(W^{\top} h^{(n+1)} + b).$$

Note that both $h$ and $v$ are vectors; all of their components are sampled simultaneously.

  where $h^{(n)}$ refers to the set of all hidden units at the $n$-th step of the Markov chain. What this means is that, for example, $h_i^{(n+1)}$ is randomly chosen to be 1 (versus 0) with probability $\operatorname{sigmoid}(W_i v^{(n)} + c_i)$, and similarly, $v_j^{(n+1)}$ is randomly chosen to be 1 (versus 0) with probability $\operatorname{sigmoid}(b_j + W_{\cdot j}^{\top} h^{(n+1)})$.

This can be illustrated graphically:

[Figure: markov_chain, the Gibbs chain alternating between sampling $h$ and $v$]
As $t \to \infty$, samples $(v^{(t)}, h^{(t)})$ are guaranteed to be accurate samples of $p(v,h)$. [See reference 3 below for a detailed theoretical derivation.]
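For concreteness, here is a minimal NumPy sketch of one such block Gibbs step (my own sketch; it takes the conditional-probability functions, such as the hypothetical p_h_given_v / p_v_given_h helpers from the earlier sketch, as arguments):

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_step(v, p_h_given_v, p_v_given_h):
    """One block Gibbs step: sample h^(n+1) given v^(n), then v^(n+1) given h^(n+1)."""
    h_prob = p_h_given_v(v)                                      # P(h_i = 1 | v^(n)) for all i
    h_new = (rng.random(h_prob.shape) < h_prob).astype(float)    # sample all hidden units at once
    v_prob = p_v_given_h(h_new)                                  # P(v_j = 1 | h^(n+1)) for all j
    v_new = (rng.random(v_prob.shape) < v_prob).astype(float)    # sample all visible units at once
    return v_new, h_new
```

Calling gibbs_step repeatedly produces the chain $v^{(0)} \to h^{(1)} \to v^{(1)} \to \ldots$ illustrated above.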

Contrastive Divergence (CD-k)

  Contrastive Divergence uses the following tricks to speed up the sampling process:

  • since we eventually want $p(v) \approx p_{\text{train}}(v)$ (the true, underlying distribution of the data), we initialize the Markov chain with a training example (i.e., from a distribution that is expected to be close to $p$, so that the chain will already be close to having converged to its final distribution $p$). In other words, we use training samples to initialize the Markov chain, so that the model ends up matching the distribution of the training data;
  • CD does not wait for the chain to converge. Samples are obtained after only k steps of Gibbs sampling. In practice, k = 1 has been shown to work surprisingly well;
  • a chain is restarted for each observed example: once the chain has been run for k steps and the parameter gradients have been computed, the Markov chain is re-initialized from the next training example, the gradients are computed again, and so on.

    We can use the CD-k algorithm to estimate the gradient in Eq. (6):

$$-\frac{\partial \log P(x)}{\partial \theta} = \frac{\partial F(x)}{\partial \theta} - \sum_{\bar{x}} P(\bar{x}) \frac{\partial F(\bar{x})}{\partial \theta} \tag{6}$$

  Here $x$ is an input training example, while $\bar{x}$ is a sample obtained from the Markov chain; it is used to estimate the second term of Eq. (6), and thus the gradient with respect to each parameter.
  For details, see the code fragment below:

# determine gradients on RBM parameters
# note that we only need the sample at the end of the chain
chain_end = nv_samples[-1]

# approximate cost: mean free energy of the data (positive phase)
# minus mean free energy of the chain end (negative phase)
cost = T.mean(self.free_energy(self.input)) - T.mean(
    self.free_energy(chain_end))
# we must not compute the gradient through the Gibbs sampling steps,
# so the chain end is treated as a constant
gparams = T.grad(cost, self.params, consider_constant=[chain_end])
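For readers who want to see the update without Theano's symbolic gradients, here is a minimal NumPy sketch of a CD-1 step for a binary RBM (my own illustration, reusing the parameter convention of the earlier sketch; the analytic free-energy gradients follow from Eq. (14)):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1):
    """One CD-1 update for a binary RBM (a sketch; parameters are updated in place).

    From Eq. (14): dF/dW_ij = -P(h_i=1|v) v_j, dF/db_j = -v_j, dF/dc_i = -P(h_i=1|v).
    Eq. (7) is estimated with a single negative particle obtained by one
    Gibbs step started from the training example v0.
    """
    ph0 = sigmoid(c + W @ v0)                          # positive phase: P(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample h0
    pv1 = sigmoid(b + W.T @ h0)                        # reconstruct the visible units
    v1 = (rng.random(pv1.shape) < pv1).astype(float)   # sample v1 (the chain end)
    ph1 = sigmoid(c + W @ v1)                          # negative phase: P(h = 1 | v1)

    # gradient ascent on the log-likelihood: positive phase minus negative phase
    W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
```

The two terms mirror the `cost` in the Theano fragment above: statistics from the training example minus statistics from the end of the Gibbs chain.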

Persistent CD

  Persistent CD [Tieleman08] (see the paper for details) uses another approximation for sampling from $p(v,h)$. It relies on a single Markov chain which has a persistent state (i.e., a chain is not restarted for each observed example). For each parameter update, we extract new samples by simply running the chain for k steps. The state of the chain is then preserved for subsequent updates.

  The general intuition is that if parameter updates are small enough compared to the mixing rate of the chain, the Markov chain should be able to “catch up” to changes in the model.
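A minimal sketch of how this differs from CD-k, under the same assumptions as the hypothetical CD-1 sketch above: the negative-phase (fantasy) particle is carried over from one update to the next instead of being re-initialized from the data.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_update(v0, v_persistent, W, b, c, lr=0.1, k=1):
    """One PCD-k update: positive phase from the data, negative phase from a persistent chain."""
    ph0 = sigmoid(c + W @ v0)                    # positive phase statistics from a training example
    vk = v_persistent
    for _ in range(k):                           # run the persistent chain for k Gibbs steps
        hk = (rng.random(W.shape[0]) < sigmoid(c + W @ vk)).astype(float)
        vk = (rng.random(W.shape[1]) < sigmoid(b + W.T @ hk)).astype(float)
    phk = sigmoid(c + W @ vk)                    # negative phase statistics from the chain end
    W += lr * (np.outer(ph0, v0) - np.outer(phk, vk))
    b += lr * (v0 - vk)
    c += lr * (ph0 - phk)
    return vk  # reused as v_persistent for the next update; never reset from the data
```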

An RBM example

For the full code, see RestrictedBoltzmannMachines.py on my GitHub.

References


【1】DeepLearning Tutorial: Restricted Boltzmann Machines (RBM), a tutorial on implementing deep learning algorithms with Theano.
【2】Section 5 of Learning Deep Architectures for AI (Yoshua Bengio), which explains how to train RBMs.
【3】受限玻尔兹曼机(RBM)学习笔记, a very detailed and accessible explanation of the theory behind RBMs.
