Why Does Unsupervised Pre-training Help Deep Learning?


This article reviews the 2010 paper "Why Does Unsupervised Pre-training Help Deep Learning?" by Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pierre-Antoine Manzagol.


I. Background


Recent deep architectures such as Deep Belief Networks and stacks of auto-encoder variants generally achieve their best results by first performing unsupervised pre-training and then supervised fine-tuning.
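To make that two-stage pipeline concrete, here is a minimal PyTorch sketch of greedy layer-wise pre-training with denoising auto-encoders followed by supervised fine-tuning. The layer sizes, masking-noise level, optimizer settings, and stand-in data are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

# Illustrative layer sizes; the paper's experiments use other configurations.
sizes = [784, 256, 256]                       # input dimension, then two hidden layers
layers = [nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]

def pretrain_layer(layer, data, noise=0.3, epochs=5, lr=0.01):
    """Greedy unsupervised pre-training of one layer as a denoising auto-encoder."""
    decoder = nn.Linear(layer.out_features, layer.in_features)
    opt = torch.optim.SGD(list(layer.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        corrupted = data * (torch.rand_like(data) > noise).float()   # masking noise
        recon = torch.sigmoid(decoder(torch.sigmoid(layer(corrupted))))
        loss = nn.functional.binary_cross_entropy(recon, data)
        opt.zero_grad(); loss.backward(); opt.step()

# Stage 1: unsupervised pre-training, one layer at a time, on unlabeled inputs.
x = torch.rand(512, 784)                      # stand-in for inputs scaled to [0, 1]
h = x
for layer in layers:
    pretrain_layer(layer, h)
    with torch.no_grad():
        h = torch.sigmoid(layer(h))           # representation fed to the next layer

# Stage 2: supervised fine-tuning of the whole stack plus a classifier layer.
y = torch.randint(0, 10, (512,))              # stand-in labels
model = nn.Sequential(*[m for l in layers for m in (l, nn.Sigmoid())],
                      nn.Linear(sizes[-1], 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(10):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```

The key point of the recipe is that the pre-trained weights are kept as the starting point for fine-tuning; only the final classifier layer starts from a random initialization.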


II. Experimental Observations


1. Better Generalization 

When choosing the number of units per layer, the learning rate and the number of training iterations to optimize classification error on the validation set, unsupervised pre-training gives substantially lower test classification error than no pre-training, for the same depth or for smaller depth on various vision data sets.

2. More Robust

These experiments show that the variance of final test error with respect to initialization random seed is larger without pre-training, and this effect is magnified for deeper architectures. It should however be noted that there is a limit to the success of this technique: performance degrades for 5 layers on this problem.

3. In Summary

Better generalization that seems to be robust to random initializations is indeed achieved by pre-trained models, which indicates that unsupervised learning of P(X) is helpful in learning P(Y|X).


III. Explaining the Effect

(1) Does the benefit come from better conditioning? In other words, does unsupervised pre-training simply help us find a better distribution from which to draw the initial weights, rather than the uniform distribution over [−1/k^0.5, 1/k^0.5] that is usually used?

The answer is no.

By "conditioning" the authors mean the following: rather than reusing the pre-trained parameters directly, they build an empirical marginal distribution for each layer from that layer's pre-trained weights and biases, and then sample new initial parameters independently from these per-layer distributions. The original text reads:

“By conditioning, we mean the range and marginal distribution from which we draw initial weights. In other words, could we get the same performance advantage as unsupervised pre-training if we were still drawing the initial weights independently, but from a more suitable distribution than the uniform [−1/k^0.5, 1/k^0.5]?

To verify this, we performed unsupervised pre-training, and computed marginal histograms for each layer’s pre-trained weights and biases (one histogram per each layer’s weights and biases). We then resampled new “initial” random weights and biases according to these histograms (independently for each parameter), and performed fine-tuning from there. 

The resulting parameters have the same marginal statistics as those obtained after unsupervised pre-training, but not the same joint distribution. ”
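A minimal sketch of this histogram-based resampling, assuming the pre-trained weights of one layer are available as a NumPy array; the bin count is an illustrative choice.

```python
import numpy as np

def resample_from_marginal(pretrained_values, n_bins=50, rng=None):
    """Draw new values whose marginal distribution matches a histogram of the
    pre-trained values, discarding all joint structure between parameters."""
    rng = np.random.default_rng() if rng is None else rng
    flat = pretrained_values.ravel()
    counts, edges = np.histogram(flat, bins=n_bins)
    probs = counts / counts.sum()
    # Pick a histogram bin per parameter, then sample uniformly inside that bin.
    bins = rng.choice(n_bins, size=flat.size, p=probs)
    samples = rng.uniform(edges[bins], edges[bins + 1])
    return samples.reshape(pretrained_values.shape)

# Example: build "Histogram" initial weights for one layer from its pre-trained weights.
W_pretrained = np.random.randn(256, 784) * 0.05   # stand-in for pre-trained weights
W_histogram_init = resample_from_marginal(W_pretrained)
```

Because every parameter is resampled independently, the marginal statistics of the pre-trained solution are preserved while any dependence between parameters is destroyed, which is exactly the contrast the experiment relies on.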

What were the results?

With uniform initialization, the average test error is 1.77 (std 0.10). With the method above, called Histogram, it is 1.69 (std 0.11). With unsupervised pre-training (Unsup. pre), it is 1.37 (std 0.09). The Histogram initialization is only slightly better than uniform, so better conditioning cannot explain the effect.

(2) Can the effect be explained by unsupervised pre-training lowering the error on the training set (The Effect of Pre-training on Training Error)? Again the answer is no.

“The remarkable observation is rather that, at a same training cost level, the pre-trained models systematically yield a lower test cost than the randomly initialized ones. The advantage appears to be one of better generalization rather than merely a better optimization procedure. ”

(3) The authors argue that unsupervised pre-training provides a prior (or regularizer) on the parameters. Unlike traditional regularization, this prior has no explicit penalty term and is discovered automatically from the data. In the authors' words:

“unsupervised pre-training appears to have a similar effect to that of a good regularizer or a good “prior” on the parameters, even though no explicit regularization term is apparent in the cost being optimized. As we stated in the hypothesis, it might be reasoned that restricting the possible starting points in parameter space to those that minimize the unsupervised pre-training criterion (as with the SDAE), does in effect restrict the set of possible final configurations for parameter values. Like regularizers in general, unsupervised pre-training (in this case, with denoising auto-encoders) might thus be seen as decreasing the variance and introducing a bias (towards parameter configurations suitable for performing denoising). Unlike ordinary regularizers, unsupervised pre-training does so in a data-dependent manner.”

The next question is where this peculiar, implicit, data-dependent regularizer comes from.

(4) The authors reason that if the benefit really comes from regularization, then, by a typical property of regularizers, its effectiveness should increase as model capacity increases. Their hypothesis is stated as follows:

Another signature characteristic of regularization is that the effectiveness of regularization increases as capacity (e.g., the number of hidden units) increases, effectively trading off one constraint on the model complexity for another. In this experiment we explore the relationship between the number of units per layer and the effectiveness of unsupervised pre-training. The hypothesis that unsupervised pre-training acts as a regularizer would suggest that we should see a trend of increasing effectiveness of unsupervised pre-training as the number of units per layer are increased.

However, the experiments show that this effect appears only when the layer size is large enough (on the order of 100 hidden units per layer) and the network is deep enough; only then does the benefit of unsupervised pre-training grow with model capacity. For small networks, unsupervised pre-training is superfluous and can even hurt. This was an unexpected experimental finding.

“What we observe is a more systematic effect: while unsupervised pre-training helps for larger layers and deeper networks, it also appears to hurt for too small networks.”

“As the model size decreases from 800 hidden units, the generalization error increases, and it increases more with unsupervised pre-training presumably because of the extra regularization effect: small networks have a limited capacity already so further restricting it (or introducing an additional bias) can harm generalization. ”

Besides the general explanation above (a small model needs no extra regularization, because its capacity is already heavily constrained), the authors give the following explanation:

The effect can be explained in terms of the role of unsupervised pre-training as promoting input transformations (in the hidden layers) that are useful at capturing the main variations in the input distribution P(X). It may be that only a small subset of these variations are relevant for predicting the class label Y. When the hidden layers are small it is less likely that the transformations for predicting Y are included in the lot learned by unsupervised pre-training. 

Put simply, when a small network is pre-trained without supervision, the transformations it learns for X may filter out exactly the features that matter most for predicting Y; being unsupervised, it has no way to know which features of X will be useful for Y. A larger network can keep more of these possibilities. This seems reasonable.


(5) The authors do not believe the benefit comes from better optimization (Challenging the Optimization Hypothesis). The intuition behind that hypothesis is that deep networks are notoriously hard to train, so unsupervised pre-training might steer training towards a local minimum with a lower training cost.

The authors question the experimental design of Bengio et al. (2007), which used "early stopping"; they argue that this trick is itself a regularizer, and that the earlier conclusion no longer holds once it is removed.

“Figure 10 shows what happens without early stopping. The training error is still higher for pre-trained networks even though the generalization error is lower. This result now favors the regularization hypothesis against the optimization story. What may have happened is that early stopping prevented the networks without pre-training from moving too much towards their apparent local minimum.”

Since the networks with unsupervised pre-training end up generalizing better while having higher training error, this result favors the regularization hypothesis rather than the optimization hypothesis. Because Bengio et al. (2007) used early stopping, the networks without unsupervised pre-training were also, in effect, regularized, and this kept them from moving too far towards their apparent local minimum.
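Since early stopping carries so much weight in this argument, here is a minimal sketch of validation-based early stopping on a toy task; the model, synthetic data, and patience value are illustrative assumptions, not the paper's protocol.

```python
import torch
import torch.nn as nn

# Tiny synthetic regression task, just to illustrate the mechanism.
x_train, y_train = torch.randn(256, 10), torch.randn(256, 1)
x_val, y_val = torch.randn(64, 10), torch.randn(64, 1)

model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

best_val, best_state, patience, bad_epochs = float("inf"), None, 10, 0
for epoch in range(500):
    loss = loss_fn(model(x_train), y_train)
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        val = loss_fn(model(x_val), y_val).item()
    if val < best_val:
        # Keep the best parameters seen on validation data.
        best_val, bad_epochs = val, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break          # stop before the training error is fully minimized
model.load_state_dict(best_state)
```

Stopping on validation loss means training halts before the training cost is driven to its minimum, which is exactly why the authors treat early stopping as a regularizer rather than a neutral detail of the protocol.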


The next question: now that we know the magic of unsupervised pre-training comes from some prior or regularizer, and that this regularizer has no explicit penalty term, it is hard to say exactly what it looks like, so the authors try to pin it down in the following experiments. Recall (see The Elements of Statistical Learning) that regularization can be viewed through Bayes' theorem: assume a prior distribution over the parameters, combine it with the likelihood, and the posterior over the parameters follows; different priors then lead to different regularization terms. The familiar L1 and L2 penalties correspond to assuming a zero-mean prior on the parameters (for example, a zero-mean normal distribution for L2). The zero-mean assumption reflects a preference for models that are as simple as possible, which in turn is motivated by Occam's razor. With all that said, let us check whether the implicit regularizer obtained from unsupervised pre-training is simply L1 or L2 (spoiler: it is certainly not that simple).
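To make the Bayesian view above concrete: maximum a posteriori (MAP) estimation with a zero-mean Gaussian prior on the parameters recovers the familiar L2 penalty (a zero-mean Laplace prior similarly yields L1):

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta}\; \log P(D \mid \theta) + \log P(\theta),
\qquad
P(\theta) = \prod_i \mathcal{N}(\theta_i \mid 0, \sigma^2)
\;\Longrightarrow\;
\hat{\theta}_{\mathrm{MAP}}
  = \arg\min_{\theta}\; \underbrace{-\log P(D \mid \theta)}_{\text{training loss}}
    \;+\; \frac{1}{2\sigma^2}\,\lVert \theta \rVert_2^2 .
```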

(6) Unmasking the regularizer: is it L1 or L2?

The authors therefore compare networks trained with explicit L1 and L2 penalty terms against networks initialized by unsupervised pre-training.

“We found that while in the case of MNIST a small penalty can in principle help, the gain is nowhere near as large as it is with pre-training. For InfiniteMNIST, the optimal amount of L1 and L2 regularization is zero”

The finding: on a relatively simple task like MNIST, these penalties help a little; on the harder InfiniteMNIST task they are essentially worthless.
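For contrast, explicit L1/L2 penalties are easy to add to the supervised cost. The sketch below, with illustrative penalty coefficients and stand-in data, shows the kind of baseline being compared against, not the paper's exact setup.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.Sigmoid(), nn.Linear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
l1_coef, l2_coef = 1e-5, 1e-4            # illustrative penalty strengths (assumptions)

x = torch.rand(128, 784)                 # stand-in inputs
y = torch.randint(0, 10, (128,))         # stand-in labels

for _ in range(20):
    data_loss = nn.functional.cross_entropy(model(x), y)
    # Explicit penalty terms added to the supervised cost.
    l1 = sum(p.abs().sum() for p in model.parameters())
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    loss = data_loss + l1_coef * l1 + l2_coef * l2
    opt.zero_grad(); loss.backward(); opt.step()
```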

The authors comment:

“This is not an entirely surprising finding: not all regularizers are created equal and these results are consistent with the literature on semi-supervised training that shows that unsupervised learning can be exploited as a particularly effective form of regularization. ”


Not all regularizers are created equal.





(7) Summary

In short, the effect is regularization, and not ordinary regularization at that; it cannot be explained by the optimization hypothesis, nor by the marginal distribution of the initial weights. It is regularization induced by a particular, data-dependent prior, and it operates in a spirit similar to early stopping and semi-supervised learning.
