Learning Deep Architectures for AI: Reading Notes

Author: jiqiujia

Article link: https://blog.csdn.net/jiqiujia/article/details/47977851

Chapter 2

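A function is called compact if it can be expressed with relatively few computational elements (for example, the individual nodes of a neural network);
If a function can be represented compactly by a deep architecture, a shallower architecture may have to be very large to represent the same function.
Reference [62]: there are functions computable by a polynomial-size logic-gate circuit of depth k that require exponential size when the circuit is restricted to depth k-1.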

Theorem 2.1. A monotone weighted threshold circuit of depth $k - 1$ computing a function $f_k \in F_{k,N}$ has size at least $2^{cN}$ for some constant $c > 0$ and $N > N_0$ [63].
monotone weighted threshold circuit: i.e. multilayer neural networks with linear threshold units and positive weights

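This is also why we do not simply use two-layer networks. In theory, a two-layer network can approximate any function as long as it has enough hidden units, but the number of units required can grow exponentially, which quickly becomes unaffordable.

One point worth noting: no single fixed depth can represent every function compactly.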
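
To make the depth versus size trade-off concrete, here is a small sketch using the parity function. It is only an illustration: parity is not the function family $F_{k,N}$ of Theorem 2.1, and the helper names (`parity_deep`, `parity_dnf_terms`, `parity_shallow`) are made up for this example. The deep version is a balanced tree of 2-input XOR gates; the shallow version is a depth-2 OR of ANDs that needs one AND term per odd-weight input.

```python
from itertools import product


def parity_deep(bits):
    """Parity via a balanced tree of 2-input XOR gates.

    Returns (value, number_of_gates, depth_of_tree).
    """
    layer = list(bits)
    gates = 0
    depth = 0
    while len(layer) > 1:
        nxt = []
        for i in range(0, len(layer) - 1, 2):
            nxt.append(layer[i] ^ layer[i + 1])
            gates += 1
        if len(layer) % 2:          # odd element passes through unchanged
            nxt.append(layer[-1])
        layer = nxt
        depth += 1
    return layer[0], gates, depth


def parity_dnf_terms(n):
    """AND terms of the depth-2 (OR of ANDs) form: every input with an odd number of ones."""
    return [x for x in product((0, 1), repeat=n) if sum(x) % 2 == 1]


def parity_shallow(bits, terms):
    """Evaluate the OR of ANDs: an AND term fires only when it matches the input exactly."""
    return int(any(all(b == t for b, t in zip(bits, term)) for term in terms))


n = 8
terms = parity_dnf_terms(n)
value, gates, depth = parity_deep([1, 0, 1, 1, 0, 0, 1, 0])
print(gates, depth, len(terms))
```

For N = 8 the XOR tree uses 7 gates at depth 3, while the depth-2 form needs 128 AND terms, that is $2^{N-1}$: linear growth for the deep circuit against exponential growth for the shallow one.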

Chapter 3

3.1

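An estimator that is local in input space obtains good generalization for a new input x by mostly exploiting training examples in the neighborhood of x (e.g., KNN).
Many existing methods rely on local generalization: the learned function behaves differently in different regions of the data space, and each region needs its own tunable parameters.
Dimensionality itself is not what really hinders generalization; what matters is the number of "variations" in the function we want to learn.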
a function is highly varying when a piecewise approximation (e.g., piecewise-constant or piecewise-linear) of that function would require a large number of pieces.
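Kernel machines implicitly carry a smoothness prior: the target function is assumed to be smooth, or to be well approximated by a piecewise-smooth function. When the target function is not smooth, this prior is not sufficient. One option is therefore to design the kernel:
In [160], a DBN is used to learn a set of features, which improves the results of Gaussian processes.
[165] and [13, 19] show that for a learner with a Gaussian kernel, the number of training examples needed grows linearly with the number of peaks and valleys in the target function; and for extremely varying functions (such as the parity function), the number of examples needed to reach a given error rate with a Gaussian kernel grows exponentially with the input dimension.
So when the complexity of the decision surface grows quickly with the dimension, local kernel methods are no longer suitable;
However, if the "variations" are not related to one another through some underlying regularities, no other learning method can beat local generalization;
Still, it is worth looking for a more compact way to represent these variations, because once found, that method is likely to generalize better, especially to variations that cannot be learned from the training set. Better generalization is possible if and only if the "variations" in the target function follow some underlying regularities (regularities that the variations of the currently learned function do not capture).
[Note: "variations" come up a lot here. My own understanding is this: from the "variations" in the training set we learn the "variations" of a function y that approximates the target function t. In a classification problem, the "variations" of y often correspond to "variations" of the decision surface. These "variations" do not necessarily reflect the true variations in t, and in that case it is worth looking for a more general y that matches the actual t more closely.]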
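
A small sketch of why a purely local learner struggles on the parity function. The setup is an assumption made for illustration: a 1-nearest-neighbour classifier under Hamming distance stands in for a local estimator (it is not the Gaussian-kernel machine analysed in the cited references), and the train/test split (every tenth point held out) is arbitrary. Because every point at Hamming distance 1 from a held-out input has the opposite parity, the nearest training example always votes for the wrong label.

```python
def parity(i):
    """Parity of the bits of the integer i (the label we want to learn)."""
    return bin(i).count("1") % 2


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def one_nn_predict(x, train):
    """Label of the training point closest to x in Hamming distance."""
    nearest = min(train, key=lambda t: hamming(x, t))
    return parity(nearest)


def run(n_bits=10):
    points = range(2 ** n_bits)
    test = [i for i in points if i % 10 == 0]    # held-out inputs
    train = [i for i in points if i % 10 != 0]   # everything else is seen in training
    correct = sum(one_nn_predict(x, train) == parity(x) for x in test)
    return correct / len(test)


accuracy = run()
print(accuracy)
```

Even with about 90% of all inputs in the training set, the accuracy on the held-out inputs is 0: each test point has a training neighbour that differs in one bit, and flipping one bit always flips the parity. Covering such a function with local estimates requires seeing essentially every input.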

3.2

In a distributed representation the input pattern is represented by a set of features that are not mutually exclusive, and might even be statistically independent

[The following paragraph is from: http://www.quora.com/Deep-Learning/What-is-meant-by-a-distributed-representation]
A distributed representation is a way of mapping or encoding information to some physical medium such as a memory or neural network. A traditional representation is non-distributed and is based on storing information in hard-wired spaces with hard-wired locations. For example a record such as "Name: Joe", "Sex: Male" is mapped to a specific location, each part of the record such as "Name" or "Joe" taking specific locations. Every bit is very important for a local representation and also there is a need to keep track of where it is stored. In contrast, distributed representations store information as encoded vectors and encoding spreads the information across the entire dimension of the vector, meaning every bit will have some fragment of the stored information. Not all bits are needed for recall and there is no need to keep track of locations. These vectors are built from codes for fields and values, typically in a bit space or a numeric space using operators such as XOR or a circular convolution and vector algebra.

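Two examples

An integer i ∈ {1, ..., N} can be represented by a vector whose i-th bit is 1 and all other bits are 0 (a local representation), but it can also be represented by a binary code of $\log_2 N$ bits (a distributed representation);
A single decision tree is a local representation, but an ensemble of several decision trees implicitly forms a distributed representation.
(The identity of the leaf node in which the input pattern is associated for each tree forms a tuple that is a very rich description of the input pattern: it can represent a very large number of possible patterns, because the number of intersections of the leaf regions associated with the n trees can be exponential in n.)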
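
Both examples can be sketched in a few lines. The function names are hypothetical, and the "trees" are reduced to depth-1 stumps (one threshold per input coordinate) purely to keep the example small.

```python
import math
from itertools import product


def one_hot(i, n):
    """Local code: n bits, only bit i (1-based) is set."""
    return [1 if j == i else 0 for j in range(1, n + 1)]


def binary_code(i, n):
    """Distributed code: ceil(log2(n)) bits, most significant bit first."""
    width = max(1, math.ceil(math.log2(n)))
    return [((i - 1) >> b) & 1 for b in reversed(range(width))]


def leaf_tuple(x, thresholds):
    """Leaf reached by x in each stump; stump j splits on coordinate x[j]."""
    return tuple(int(xj > t) for xj, t in zip(x, thresholds))


print(one_hot(3, 8))      # 8 bits, exactly one of them active
print(binary_code(3, 8))  # 3 bits, several may be active at once

# n stumps with 2 leaves each: how many distinct leaf tuples (regions) do they carve out?
n_trees = 5
regions = {leaf_tuple(x, [0.5] * n_trees) for x in product((0, 1), repeat=n_trees)}
print(len(regions))
```

With 5 stumps there are only 10 leaves in total, yet the leaf tuples distinguish $2^5 = 32$ regions, which is the exponential count of intersections described in the quoted passage.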

Chapter 4

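Training a neural network with gradient descent easily gets stuck in poor local optima, but pre-training each layer with unsupervised learning gives noticeably better results. Unsupervised pre-training generally works as follows:

Train the parameters layer by layer with an unsupervised learning algorithm (such as an RBM or an auto-encoder), using the output of each trained layer as the input to the next layer; once the parameters of all layers have been trained, fine-tune the whole network with supervised learning.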
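
The greedy layer-wise procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the setup from the book: each layer is a tied-weight sigmoid auto-encoder trained by plain SGD on squared reconstruction error, the toy data and all hyperparameters are arbitrary, and the supervised fine-tuning step is left out.

```python
import math
import random


def sigm(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(W, b, x):
    """Hidden code of one layer: h_j = sigm(sum_i W[j][i] * x[i] + b[j])."""
    return [sigm(sum(w * xi for w, xi in zip(row, x)) + bj) for row, bj in zip(W, b)]


def train_autoencoder(data, n_hidden, epochs=200, lr=0.5, seed=0):
    """Tied-weight auto-encoder: the decoder reuses W transposed."""
    rng = random.Random(seed)
    n_in = len(data[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    b = [0.0] * n_hidden  # encoder bias
    c = [0.0] * n_in      # decoder bias
    for _ in range(epochs):
        for x in data:
            h = encode(W, b, x)
            r = [sigm(sum(W[j][i] * h[j] for j in range(n_hidden)) + c[i])
                 for i in range(n_in)]
            # Gradient of 0.5 * ||r - x||^2 w.r.t. the decoder pre-activations.
            dr = [(r[i] - x[i]) * r[i] * (1.0 - r[i]) for i in range(n_in)]
            # Backpropagate into the encoder pre-activations.
            dh = [sum(W[j][i] * dr[i] for i in range(n_in)) * h[j] * (1.0 - h[j])
                  for j in range(n_hidden)]
            for j in range(n_hidden):
                for i in range(n_in):
                    # W is shared: decoder term plus encoder term.
                    W[j][i] -= lr * (dr[i] * h[j] + dh[j] * x[i])
                b[j] -= lr * dh[j]
            for i in range(n_in):
                c[i] -= lr * dr[i]
    return W, b


def greedy_pretrain(data, layer_sizes):
    """Train one layer at a time; each layer's codes become the next layer's input."""
    layers, inputs = [], data
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(inputs, n_hidden)
        layers.append((W, b))
        inputs = [encode(W, b, x) for x in inputs]
    return layers


def transform(layers, x):
    """Pass an input through all pre-trained layers."""
    for W, b in layers:
        x = encode(W, b, x)
    return x


data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1]]
layers = greedy_pretrain(data, [4, 2])
top_code = transform(layers, data[0])
print(top_code)
```

The returned `layers` would then serve as the initial weights of a network that is fine-tuned on labels, which is the second stage of the procedure.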

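Unsupervised pre-training can be seen as a form of regularization (or a prior): it constrains the region of parameter space in which the solution lies, forcing the solution to stay close to the parameters found by unsupervised training. The hope is that these parameters capture useful statistical structure of the input data (which requires enough training examples).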

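A drawback of backpropagation is that the gradient becomes smaller as it is propagated back to the lower layers than it is at the higher layers, so it carries less useful information (and training gets stuck in local minima); the deeper the network, the harder it is to train. This also suggests that changing the lower-layer parameters affects the result much more than changing the higher-layer parameters, and unsupervised pre-training works better precisely because it gives the lower layers a good initialization.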
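
The shrinking gradient can be seen in a toy case. The sketch below assumes a chain of scalar sigmoid units with a shared weight `w` and bias `b` (hypothetical values, not a real network) and multiplies out the chain rule from the output back to the input.

```python
import math


def sigm(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(depth, w=1.0, b=0.0, x=0.5):
    """d(output)/d(input) for a chain of `depth` scalar units h <- sigm(w * h + b)."""
    h, grad = x, 1.0
    for _ in range(depth):
        h = sigm(w * h + b)
        # Chain rule: d sigm(z)/dz = h * (1 - h), and dz/d(previous h) = w.
        grad *= w * h * (1.0 - h)
    return grad


grads = {d: input_gradient(d) for d in (1, 2, 5, 10)}
for d, g in grads.items():
    print(d, g)
```

Since the sigmoid derivative is at most 0.25, each extra layer with |w| ≤ 1 shrinks the gradient by at least a factor of four, so the signal reaching the lowest layer of a 10-layer chain is below $0.25^{10}$.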

4.4 Deep Generative Architectures

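Most generative models can be expressed as a graphical model: nodes represent random variables, and edges represent the dependencies between them.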

sigmoid belief network

Activation function:

$P(h_i^k = 1 \mid h^{k+1}) = \mathrm{sigm}\big(b_i^k + \sum \cdots\big)$
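
The formula is cut off in these notes. As a hedged sketch, the code below assumes the usual form of a sigmoid belief network, in which the sum runs over the units of the layer above weighted by a matrix `W` (so unit i is on with probability sigm(b[i] + sum_j W[i][j] * h_above[j])). The weight values, layer sizes, and function names are made up for illustration.

```python
import math
import random


def sigm(z):
    return 1.0 / (1.0 + math.exp(-z))


def conditional_probs(h_above, W, b):
    """P(h_i = 1 | layer above), assuming sigm(b[i] + sum_j W[i][j] * h_above[j])."""
    return [sigm(b_i + sum(w_ij * h_j for w_ij, h_j in zip(row, h_above)))
            for row, b_i in zip(W, b)]


def sample_layer(h_above, W, b, rng):
    """Draw each binary unit independently given the layer above."""
    return [int(rng.random() < p) for p in conditional_probs(h_above, W, b)]


def ancestral_sample(top_probs, layers, rng):
    """Sample the top layer from its prior, then sample each lower layer in turn."""
    h = [int(rng.random() < p) for p in top_probs]
    for W, b in layers:
        h = sample_layer(h, W, b, rng)
    return h


rng = random.Random(0)
layers = [
    ([[2.0, -1.0], [0.5, 0.5], [-1.5, 1.0]], [0.0, -0.5, 0.2]),   # 2 units -> 3 units
    ([[1.0, 0.0, -1.0], [0.0, 2.0, 0.0], [-1.0, 1.0, 1.0], [0.5, -0.5, 0.0]],
     [0.0, 0.0, -0.3, 0.1]),                                      # 3 units -> 4 units
]
x = ancestral_sample([0.5, 0.5], layers, rng)
print(x)
```

Sampling proceeds top-down, one layer at a time, which matches the directed edges of the graphical model: each layer depends only on the layer above it.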
