Chapter 10: Unsupervised Learning (3)

Manifold Interpretation of PCA and Linear Auto-Encoders


The goal is to find a projection of x onto a subspace that preserves as much information about x as possible.

Let the encoder be

h = f(x) = W^T (x - \mu)

  • h is a low-dimensional representation of x

Let the decoder be

\hat{x} = g(h) = b + V h

Since both the encoder and the decoder are linear, minimizing the reconstruction error

E[\| x - \hat{x} \|^2]

yields
V = W, \quad \mu = b = E[x]

and the rows of W form an orthonormal basis that spans the same subspace as the principal eigenvectors of the covariance matrix
C = E[(x - \mu)(x - \mu)^T]

For PCA, the rows of W are exactly these eigenvectors, ordered by the magnitude of the corresponding eigenvalues.

The optimal reconstruction error is

\min E[\| x - \hat{x} \|^2] = \sum_{i=d+1}^{D} \lambda_i

where

  • x \in \mathbb{R}^D, where D is the dimension of x
  • h \in \mathbb{R}^d, where d is the dimension of h
  • \lambda_i are the eigenvalues of the covariance matrix.

If the covariance matrix has rank d, the reconstruction error is 0.
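To make this concrete, here is a minimal numpy sketch (my addition, not part of the original notes): it builds the linear encoder/decoder from the top-d eigenvectors of the empirical covariance and checks that the mean squared reconstruction error equals the sum of the discarded eigenvalues. The toy data and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N samples in D dimensions with correlated features.
N, D, d = 5000, 6, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))

# Center the data: mu = b = E[x].
mu = X.mean(axis=0)
Xc = X - mu

# Covariance C = E[(x - mu)(x - mu)^T] and its eigendecomposition.
C = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(C)          # ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Encoder h = W^T (x - mu) and decoder x_hat = mu + W h, built from the
# top-d principal eigenvectors (stored here as the columns of W).
W = eigvecs[:, :d]                            # shape (D, d)
H = Xc @ W
X_hat = mu + H @ W.T

# Mean squared reconstruction error vs. sum of the discarded eigenvalues.
recon_error = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(recon_error, eigvals[d:].sum())         # the two numbers should agree
```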


ICA

Independent Component Analysis
Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994; Hyvärinen, 1999; Hyvärinen et al., 2001

Like probabilistic PCA and factor analysis, it also fits the linear factor model:

  • sample real-valued factors

    h \sim P(h)

  • sample the real-valued observable variables

    x = W h + b + \text{noise}

  • What distinguishes ICA is that, unlike PCA and factor analysis, it does not assume the prior is Gaussian; it only assumes that the prior is factorized, i.e.

    P(h) = \prod_i P(h_i)

If we assume that the latent variables are non-Gaussian, then we can recover them; this is what ICA aims to achieve.
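As an illustration (my addition, not from the notes), the sketch below samples independent Laplace-distributed factors, mixes them as x = Wh + b + noise, and uses scikit-learn's FastICA to recover the factors up to permutation, sign, and scale; the dimensions and noise level are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
N, d = 10000, 3

# Factorized, non-Gaussian prior: independent Laplace factors h_i.
H = rng.laplace(size=(N, d))

# Linear generative model: x = W h + b + noise.
W = rng.normal(size=(d, d))
b = rng.normal(size=d)
X = H @ W.T + b + 0.01 * rng.normal(size=(N, d))

# ICA recovers the factors up to permutation, sign and scale.
ica = FastICA(n_components=d, random_state=0)
H_est = ica.fit_transform(X)

# Cross-correlation between true and estimated factors: each row and column
# should contain exactly one entry close to +1 or -1.
corr = np.corrcoef(H.T, H_est.T)[:d, d:]
print(np.round(corr, 2))
```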


Sparse Coding as a Generative Model

A particularly interesting form of non-Gaussianity arises with distributions that are sparse.

P(h) puts high probability at or around 0, i.e., h is very likely to be close to 0. For instance, the factorized Laplace density prior is

P(h) = \prod_i P(h_i) = \prod_i \frac{\lambda}{2} e^{-\lambda |h_i|}

The Student-t prior is

P(h) = \prod_i P(h_i) \propto \prod_i \frac{1}{\left(1 + \frac{h_i^2}{\nu}\right)^{\frac{\nu + 1}{2}}}
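A short sketch of these priors (my addition; the λ and ν values are arbitrary choices): it evaluates the factorized Laplace and Student-t log-densities and compares how much probability mass the Laplace prior places near 0 relative to a Gaussian of the same variance.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, nu = 1.0, 3.0   # arbitrary Laplace rate and Student-t degrees of freedom

def log_laplace_prior(h, lam=lam):
    # log P(h) = sum_i log( lam/2 * exp(-lam * |h_i|) )
    return np.sum(np.log(lam / 2.0) - lam * np.abs(h), axis=-1)

def log_student_t_prior(h, nu=nu):
    # log P(h) = sum_i -((nu + 1) / 2) * log(1 + h_i^2 / nu), up to a constant
    return np.sum(-(nu + 1) / 2.0 * np.log(1.0 + h ** 2 / nu), axis=-1)

h = np.array([0.0, 0.1, 1.0, 3.0])
print(log_laplace_prior(h), log_student_t_prior(h))

# Sparsity: the Laplace prior puts noticeably more mass near 0 than a
# Gaussian with the same variance (2 / lam^2).
h_laplace = rng.laplace(scale=1.0 / lam, size=100000)
h_gauss = rng.normal(scale=np.sqrt(2.0) / lam, size=100000)
print(np.mean(np.abs(h_laplace) < 0.1), np.mean(np.abs(h_gauss) < 0.1))
```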


Greedy Layerwise Unsupervised Pre-Training

  • Greedy - the layers are not trained jointly with respect to a global objective, so the procedure may end up sub-optimal
  • Layerwise - it trains one layer at a time; when training layer k, the previously trained layers are kept fixed (a sketch of the loop follows this list)
  • Unsupervised - each layer is trained with an unsupervised representation-learning algorithm
  • Pre-Training - it is only a first step; a joint training phase then fine-tunes all the layers together with respect to the criterion of interest
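A minimal sketch of the greedy layer-wise loop (my addition): to keep it runnable, each "layer" here is a linear autoencoder fit with PCA on the frozen output of the layers below it; in practice each layer would be a nonlinear autoencoder, RBM, or similar unsupervised learner, followed by joint fine-tuning.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64)) @ rng.normal(size=(64, 64))  # toy unlabeled data

layer_sizes = [32, 16, 8]   # arbitrary widths for illustration
layers = []                 # the greedily pre-trained stack

H = X
for k, d in enumerate(layer_sizes):
    n_in = H.shape[1]
    # Train only layer k on the output of the already-trained, frozen layers
    # below it; here each layer is a linear autoencoder fit with PCA.
    layer = PCA(n_components=d).fit(H)
    layers.append(layer)
    H = layer.transform(H)   # representation that feeds the next layer
    print(f"layer {k}: {n_in} -> {d} dims")

# The pre-trained stack maps raw inputs to the top-level representation; in a
# deep network these weights would initialize a joint fine-tuning phase.
def encode(x, layers=layers):
    for layer in layers:
        x = layer.transform(x)
    return x

print(encode(X[:5]).shape)   # (5, 8)
```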

Transfer Learning and Domain Adaptation

The objective is to extract information from data in a first setting (dataset A) and exploit it when learning, or even directly making predictions, in a second setting (dataset B).
For example, reviews from different domains (movies, music, books) differ in their specifics but share common structure; adapting what was learned in one domain to another is called domain adaptation.
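A toy sketch of this idea (my addition; the synthetic data, the PCA encoder, and the logistic-regression head are placeholder choices): a representation is learned on unlabeled data from a first domain and then reused, frozen, as a feature extractor for a small supervised task in a related second domain.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Domain A: plenty of unlabeled data.  Domain B: a small labeled set whose
# inputs share structure with A (same mixing matrix, shifted mean).
mix = rng.normal(size=(20, 20))
X_a = rng.normal(size=(5000, 20)) @ mix
X_b = rng.normal(size=(100, 20)) @ mix + 0.5
y_b = (X_b[:, 0] > np.median(X_b[:, 0])).astype(int)   # toy labels for B

# Step 1: unsupervised representation learning on domain A only.
encoder = PCA(n_components=5).fit(X_a)

# Step 2: transfer the frozen encoder and train a small supervised head on B.
H_b = encoder.transform(X_b)
clf = LogisticRegression().fit(H_b, y_b)
print("train accuracy on domain B:", clf.score(H_b, y_b))
```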

Two examples

Extreme forms of transfer learning

  • one-shot learning
  • zero-shot learning
  • zero-data learning
