Deep learning: Deep Belief Networks

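A deep belief network (DBN) is a probabilistic generative model proposed in 2006, built by stacking multiple restricted Boltzmann machines (RBMs). Training proceeds by layer-wise unsupervised learning: the RBMs first learn a hierarchical representation of the data, and supervised learning is then used for fine-tuning, which makes the approach suitable for initializing the weights of complex models. DBNs can be used to build deep architectures, and greedy layer-wise training improves learning efficiency.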


Brief introduction:

The DBN is a probabilistic generative model proposed in 2006, built by stacking multiple restricted Boltzmann machines (RBMs) [3].


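  During training, Hinton used a layer-wise unsupervised method to learn the parameters. First, the data vector x and the first hidden layer h1 are treated as an RBM, and the parameters of this RBM are trained (the weights connecting x and h1, the biases of the units in x and h1, and so on). These parameters are then fixed, h1 is treated as the visible vector and h2 as the hidden vector, and a second RBM is trained to obtain its parameters. Those parameters are fixed in turn, and the RBM formed by h2 and h3 is trained. The training algorithm is as follows: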

  

  The rightmost part of the figure above is the generative model obtained at the end of training:

  

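  Expressed as a formula, for the three hidden layers h1, h2 and h3 described above (the \ell = 3 case of Eq. (1) below):

  P(x, h^1, h^2, h^3) = P(x|h^1) P(h^1|h^2) P(h^2, h^3)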

3. Supervised learning with a DBN

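  After the weights between units and the unit biases have been learned with the layer-wise unsupervised method above (that is, after initialization), an extra layer can be added on top of the DBN to represent the desired output. The error between the model's output and the desired output is then computed, and backpropagation is used to further optimize the initial weights. Because the layer-wise unsupervised method has already initialized the weights to values fairly close to a good solution, this addresses the difficulties previously encountered when training multilayer neural networks and can give good results.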


The explanation in English and the corresponding code

Deep Belief Networks

[Hinton06] showed that RBMs can be stacked and trained in a greedy manner to form so-called Deep Belief Networks (DBNs). DBNs are graphical models which learn to extract a deep hierarchical representation of the training data. They model the joint distribution between the observed vector x and the \ell hidden layers h^k as follows:

(1)  P(x, h^1, \ldots, h^{\ell}) = \left(\prod_{k=0}^{\ell-2} P(h^k|h^{k+1})\right) P(h^{\ell-1},h^{\ell})

where x=h^0, P(h^{k-1} | h^k) is a conditional distribution for the visible units conditioned on the hidden units of the RBM at level k, and P(h^{\ell-1}, h^{\ell}) is the visible-hidden joint distribution in the top-level RBM. This is illustrated in the figure below.

_images/DBN3.png
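
As a rough illustration of how Eq. (1) defines a generative process, the sketch below draws one sample from a DBN with two hidden layers. It runs block Gibbs sampling in the top-level RBM to approximate a draw from P(h^{\ell-1}, h^{\ell}), then samples each lower layer from P(h^k | h^{k+1}). The function name `sample_dbn`, the random stand-in weights, the layer sizes and the number of Gibbs steps are all assumptions made for this example, and binary units with sigmoid conditionals are assumed throughout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_bernoulli(p, rng):
    # Draw independent 0/1 values with success probabilities p.
    return (rng.random(p.shape) < p).astype(float)

def sample_dbn(weights, vis_biases, top_hid_bias, n_gibbs, rng):
    """Draw one x from a DBN (hypothetical helper, not from the tutorial).

    weights[k] has shape (size of h^k, size of h^{k+1}), with h^0 = x.
    vis_biases[k] is the bias of layer h^k when it acts as a visible layer.
    """
    W_top = weights[-1]
    # Start the Gibbs chain of the top-level RBM from a random state.
    h_below = sample_bernoulli(np.full(W_top.shape[0], 0.5), rng)
    for _ in range(n_gibbs):
        h_top = sample_bernoulli(sigmoid(h_below @ W_top + top_hid_bias), rng)
        h_below = sample_bernoulli(sigmoid(h_top @ W_top.T + vis_biases[-1]), rng)
    # Directed top-down pass: sample h^k from P(h^k | h^{k+1}) down to x.
    for W, b in zip(reversed(weights[:-1]), reversed(vis_biases[:-1])):
        h_below = sample_bernoulli(sigmoid(h_below @ W.T + b), rng)
    return h_below

rng = np.random.default_rng(0)
# Random stand-ins for trained parameters: x has 12 units, h^1 has 8, h^2 has 4.
weights = [0.1 * rng.standard_normal((12, 8)), 0.1 * rng.standard_normal((8, 4))]
vis_biases = [np.zeros(12), np.zeros(8)]
top_hid_bias = np.zeros(4)

x_sample = sample_dbn(weights, vis_biases, top_hid_bias, n_gibbs=100, rng=rng)
print(x_sample.shape)
```

A finite number of Gibbs steps gives only an approximate sample from the top-level RBM, so the result should be read as an approximation of a draw from the joint distribution in Eq. (1).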

The principle of greedy layer-wise unsupervised training can be applied to DBNs with RBMs as the building blocks for each layer [Hinton06], [Bengio07]. The process is as follows:

1. Train the first layer as an RBM that models the raw input x = h^{(0)} as its visible layer.

2. Use that first layer to obtain a representation of the input that will be used as data for the second layer. Two common solutions exist. This representation can be chosen as being the mean activations p(h^{(1)}=1|h^{(0)}) or samples of p(h^{(1)}|h^{(0)}).

3. Train the second layer as an RBM, taking the transformed data (samples or mean activations) as training examples (for the visible layer of that RBM).

4. Iterate (2 and 3) for the desired number of layers, each time propagating upward either samples or mean values.

5. Fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions, e.g. a linear classifier).
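
Steps 1 to 4 can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that the text does not specify: binary units, one step of contrastive divergence (CD-1) as the RBM training rule, full-batch updates, and mean activations (rather than samples) propagated upward. The names `RBM`, `cd1_update` and `pretrain_dbn`, the hyperparameters and the random toy data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class RBM:
    """Binary RBM trained with CD-1 (an assumed training rule)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_vis = np.zeros(n_visible)
        self.b_hid = np.zeros(n_hidden)

    def hidden_mean(self, v):
        # p(h = 1 | v)
        return sigmoid(v @ self.W + self.b_hid)

    def visible_mean(self, h):
        # p(v = 1 | h)
        return sigmoid(h @ self.W.T + self.b_vis)

    def cd1_update(self, v0, lr):
        # Positive phase: hidden probabilities and a sample given the data.
        h0_mean = self.hidden_mean(v0)
        h0 = (self.rng.random(h0_mean.shape) < h0_mean).astype(float)
        # Negative phase: mean-field reconstruction, then hidden probabilities.
        v1 = self.visible_mean(h0)
        h1_mean = self.hidden_mean(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0_mean - v1.T @ h1_mean) / n
        self.b_vis += lr * (v0 - v1).mean(axis=0)
        self.b_hid += lr * (h0_mean - h1_mean).mean(axis=0)

def pretrain_dbn(data, layer_sizes, epochs=50, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    rbms = []
    layer_input = data
    for n_hidden in layer_sizes:
        # Steps 1 and 3: train an RBM on the current representation.
        rbm = RBM(layer_input.shape[1], n_hidden, rng)
        for _ in range(epochs):
            rbm.cd1_update(layer_input, lr)
        rbms.append(rbm)
        # Step 2: mean activations become the data for the next layer.
        layer_input = rbm.hidden_mean(layer_input)
    return rbms

# Random binary toy data: 64 examples with 12 visible units each.
toy_rng = np.random.default_rng(1)
toy_data = (toy_rng.random((64, 12)) < 0.5).astype(float)

rbms = pretrain_dbn(toy_data, layer_sizes=[8, 4])
print([rbm.W.shape for rbm in rbms])  # [(12, 8), (8, 4)]
```

Each RBM is trained only after the one below it has been fixed, which is what makes the procedure greedy: no layer is revisited until the fine-tuning of step 5.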

In this tutorial, we focus on fine-tuning via supervised gradient descent. Specifically, we use a logistic regression classifier to classify the input x based on the output of the last hidden layer h^{(\ell)} of the DBN. Fine-tuning is then performed via supervised gradient descent on the negative log-likelihood cost function. Since the supervised gradient is only non-null for the weights and hidden layer biases of each layer (i.e. null for the visible biases of each RBM), this procedure is equivalent to initializing the parameters of a deep MLP with the weights and hidden layer biases obtained with the unsupervised training strategy.
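
The equivalence with a deep MLP can be made concrete with a short NumPy sketch: the hidden layers reuse (weight, hidden bias) pairs, the visible biases are dropped, a softmax (logistic regression) layer is added on top, and all parameters are updated by gradient descent on the negative log-likelihood. The weights below are random stand-ins for pretrained RBM parameters, and the function names, toy labels, learning rate and step count are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(x, hidden_params, W_out, b_out):
    # Sigmoid hidden layers followed by a logistic regression output layer.
    activations = [x]
    for W, b in hidden_params:
        activations.append(sigmoid(activations[-1] @ W + b))
    probs = softmax(activations[-1] @ W_out + b_out)
    return activations, probs

def nll(probs, y):
    # Mean negative log-likelihood of the correct classes.
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

def finetune_step(x, y, hidden_params, W_out, b_out, lr):
    """One full-batch gradient step; parameters are updated in place."""
    activations, probs = forward(x, hidden_params, W_out, b_out)
    n = x.shape[0]
    delta = probs.copy()
    delta[np.arange(n), y] -= 1.0
    delta /= n  # gradient of the mean NLL w.r.t. the output pre-activations
    grad_W_out = activations[-1].T @ delta
    grad_b_out = delta.sum(axis=0)
    # Backpropagate through the output weights before they are changed.
    delta = (delta @ W_out.T) * activations[-1] * (1.0 - activations[-1])
    W_out -= lr * grad_W_out
    b_out -= lr * grad_b_out
    for i in reversed(range(len(hidden_params))):
        W, b = hidden_params[i]
        grad_W = activations[i].T @ delta
        grad_b = delta.sum(axis=0)
        if i > 0:
            delta = (delta @ W.T) * activations[i] * (1.0 - activations[i])
        W -= lr * grad_W
        b -= lr * grad_b

rng = np.random.default_rng(0)
x = (rng.random((200, 12)) < 0.5).astype(float)
y = x[:, 0].astype(int)  # toy labels: the class is the first input bit

# Stand-ins for pretrained (weights, hidden biases); visible biases are unused.
hidden_params = [
    (0.1 * rng.standard_normal((12, 8)), np.zeros(8)),
    (0.1 * rng.standard_normal((8, 4)), np.zeros(4)),
]
W_out, b_out = np.zeros((4, 2)), np.zeros(2)

loss_before = nll(forward(x, hidden_params, W_out, b_out)[1], y)
for _ in range(300):
    finetune_step(x, y, hidden_params, W_out, b_out, lr=0.1)
loss_after = nll(forward(x, hidden_params, W_out, b_out)[1], y)
print(loss_before, loss_after)
```

In practice the stand-in weights would be replaced by the parameters learned during unsupervised pre-training, and minibatches with a validation set would typically be used instead of full-batch updates.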

Justifying Greedy Layer-Wise Pre-Training

Why does such an algorithm work? Taking as example a 2-layer DBN with hidden layers h^{(1)} and h^{(2)} (with respective weight parameters W^{(1)} and W^{(2)}), [Hinton06] established (see also [Bengio09] for a detailed derivation) that \log p(x) can be rewritten as,

(2)  \log p(x) = KL(Q(h^{(1)}|x)||p(h^{(1)}|x)) + H_{Q(h^{(1)}|x)} + \sum_h Q(h^{(1)}|x)(\log p(h^{(1)}) + \log p(x|h^{(1)})).

KL(Q(h^{(1)}|x) || p(h^{(1)}|x)) represents the KL divergence between the posterior Q(h^{(1)}|x) of the first RBM if it were standalone, and the probability p(h^{(1)}|x) for the same layer but defined by the entire DBN (i.e. taking into account the prior p(h^{(1)},h^{(2)}) defined by the top-level RBM). H_{Q(h^{(1)}|x)} is the entropy of the distribution Q(h^{(1)}|x).
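
The decomposition in Eq. (2) holds for any distribution Q(h^{(1)}|x), which can be checked numerically on a tiny discrete model. In the sketch below, h^{(1)} is collapsed into a single variable with three states, and the prior, the conditional and Q are made-up numbers chosen only for illustration.

```python
import math

# Hypothetical toy model: h takes three states, x is fixed to the value 1.
p_h = [0.5, 0.3, 0.2]            # prior p(h)
p_x_given_h = [0.9, 0.4, 0.1]    # likelihood p(x = 1 | h)
q = [0.2, 0.5, 0.3]              # an arbitrary Q(h | x = 1)

# Exact marginal and posterior under the model.
joint = [ph * px for ph, px in zip(p_h, p_x_given_h)]
p_x = sum(joint)
posterior = [j / p_x for j in joint]

# The three terms on the right-hand side of Eq. (2).
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior))
entropy = -sum(qi * math.log(qi) for qi in q)
expected = sum(
    qi * (math.log(ph) + math.log(px))
    for qi, ph, px in zip(q, p_h, p_x_given_h)
)

lhs = math.log(p_x)
rhs = kl + entropy + expected
print(lhs, rhs)

# When Q equals the true posterior, the KL term vanishes.
kl_at_posterior = sum(pi * math.log(pi / pi) for pi in posterior)
print(kl_at_posterior)  # 0.0
```

Because the KL term is non-negative, the remaining two terms form a lower bound on \log p(x), and the bound is tight exactly when Q matches the posterior defined by the full model.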

It can be shown that if we initialize both hidden layers such that W^{(2)}={W^{(1)}}^T, then Q(h^{(1)}|x)=p(h^{(1)}|x) and the KL divergence term is null. If we learn the first level RBM and then keep its parameters W^{(1)} fixed, optimizing Eq. (2) with respect to W^{(2)} can thus only increase the likelihood p(x).

Also, notice that if we isolate the terms which depend only on W^{(2)}, we get:

\sum_h Q(h^{(1)}|x) \log p(h^{(1)})

Optimizing this with respect to W^{(2)} amounts to training a second-stage RBM, using the output of Q(h^{(1)}|x) as the training distribution, when x is sampled from the training distribution for the first RBM.

Implementation
