Chapter 10: Unsupervised Learning (3)

Manifold Interpretation of PCA and Linear Auto-Encoders


The goal is to find a projection of x onto a subspace that preserves as much information about x as possible.

Let the encoder be

h = f(x) = W^T (x - \mu)

  • h is a low-dimensional representation of x

Let the decoder be

\hat{x} = g(h) = b + V h

Since both the encoder and the decoder are linear, minimizing the reconstruction error

E[\| x - \hat{x} \|^2]

yields
V = W, \quad \mu = b = E[x]

and the rows of W form an orthonormal basis that spans the same subspace as the principal eigenvectors of the covariance matrix
C = E[(x - \mu)(x - \mu)^T]

For PCA, the rows of W are exactly these eigenvectors, ordered by the magnitude of the corresponding eigenvalues.

The optimal reconstruction error is

\min E[\| x - \hat{x} \|^2] = \sum_{i=d+1}^{D} \lambda_i

where

  • x \in \mathbb{R}^D, where D is the dimension of x
  • h \in \mathbb{R}^d, where d is the dimension of h
  • \lambda_i are the eigenvalues of the covariance matrix.

If the covariance matrix has rank d, the reconstruction error is 0.
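To make this concrete, here is a minimal numpy sketch (my addition, not part of the original notes): it builds the linear encoder/decoder from the top-d eigenvectors of the empirical covariance and checks that the mean squared reconstruction error equals the sum of the discarded eigenvalues. The toy data and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N samples in D dimensions with correlated features.
N, D, d = 5000, 6, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))

# Center the data: mu = b = E[x].
mu = X.mean(axis=0)
Xc = X - mu

# Covariance C = E[(x - mu)(x - mu)^T] and its eigendecomposition.
C = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(C)          # ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Encoder h = W^T (x - mu) and decoder x_hat = mu + W h, built from the
# top-d principal eigenvectors (stored here as the columns of W).
W = eigvecs[:, :d]                            # shape (D, d)
H = Xc @ W
X_hat = mu + H @ W.T

# Mean squared reconstruction error vs. sum of the discarded eigenvalues.
recon_error = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(recon_error, eigvals[d:].sum())         # the two numbers should agree
```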


ICA

Independent Component Analysis
Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994; Hyvärinen, 1999; Hyvärinen et al., 2001

Like probabilistic PCA and factor analysis, it also fits the linear factor model:

  • sample real-valued factors

    h \sim P(h)

  • sample the real-valued observable variables

    x = W h + b + \text{noise}

  • What distinguishes ICA is that, unlike PCA and factor analysis, it does not assume the prior is Gaussian; it only assumes that the prior is factorized, i.e.

    P(h) = \prod_i P(h_i)

If we assume that the latent variables are non-Gaussian, then we can recover them; this is what ICA aims to achieve.
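As an illustration (my addition, not from the notes), the sketch below samples independent Laplace-distributed factors, mixes them as x = Wh + b + noise, and uses scikit-learn's FastICA to recover the factors up to permutation, sign, and scale; the dimensions and noise level are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
N, d = 10000, 3

# Factorized, non-Gaussian prior: independent Laplace factors h_i.
H = rng.laplace(size=(N, d))

# Linear generative model: x = W h + b + noise.
W = rng.normal(size=(d, d))
b = rng.normal(size=d)
X = H @ W.T + b + 0.01 * rng.normal(size=(N, d))

# ICA recovers the factors up to permutation, sign and scale.
ica = FastICA(n_components=d, random_state=0)
H_est = ica.fit_transform(X)

# Cross-correlation between true and estimated factors: each row and column
# should contain exactly one entry close to +1 or -1.
corr = np.corrcoef(H.T, H_est.T)[:d, d:]
print(np.round(corr, 2))
```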


Sparse Coding as a Generative Model

A particularly interesting form of non-Gaussianity arises with distributions that are sparse.

P(h) puts high probability at or around 0, i.e., h is very likely to be close to 0. For instance, the factorized Laplace density prior is

P(h) = \prod_i P(h_i) = \prod_i \frac{\lambda}{2} e^{-\lambda |h_i|}

The Student-t prior is

P(h) = \prod_i P(h_i) \propto \prod_i \frac{1}{\left(1 + \frac{h_i^2}{\nu}\right)^{\frac{\nu + 1}{2}}}
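A short sketch of these priors (my addition; the λ and ν values are arbitrary choices): it evaluates the factorized Laplace and Student-t log-densities and compares how much probability mass the Laplace prior places near 0 relative to a Gaussian of the same variance.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, nu = 1.0, 3.0   # arbitrary Laplace rate and Student-t degrees of freedom

def log_laplace_prior(h, lam=lam):
    # log P(h) = sum_i log( lam/2 * exp(-lam * |h_i|) )
    return np.sum(np.log(lam / 2.0) - lam * np.abs(h), axis=-1)

def log_student_t_prior(h, nu=nu):
    # log P(h) = sum_i -((nu + 1) / 2) * log(1 + h_i^2 / nu), up to a constant
    return np.sum(-(nu + 1) / 2.0 * np.log(1.0 + h ** 2 / nu), axis=-1)

h = np.array([0.0, 0.1, 1.0, 3.0])
print(log_laplace_prior(h), log_student_t_prior(h))

# Sparsity: the Laplace prior puts noticeably more mass near 0 than a
# Gaussian with the same variance (2 / lam^2).
h_laplace = rng.laplace(scale=1.0 / lam, size=100000)
h_gauss = rng.normal(scale=np.sqrt(2.0) / lam, size=100000)
print(np.mean(np.abs(h_laplace) < 0.1), np.mean(np.abs(h_gauss) < 0.1))
```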


Greedy Layerwise Unsupervised Pre-Training

  • Greedy - the layers are not trained jointly with respect to a global objective, so the procedure may end up sub-optimal
  • Layerwise - it trains one layer at a time; when training layer k, the previously trained layers are kept fixed (a sketch of the loop follows this list)
  • Unsupervised - each layer is trained with an unsupervised representation-learning algorithm
  • Pre-Training - it is only a first step; a joint training phase then fine-tunes all the layers together with respect to the criterion of interest
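A minimal sketch of the greedy layer-wise loop (my addition): to keep it runnable, each "layer" here is a linear autoencoder fit with PCA on the frozen output of the layers below it; in practice each layer would be a nonlinear autoencoder, RBM, or similar unsupervised learner, followed by joint fine-tuning.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64)) @ rng.normal(size=(64, 64))  # toy unlabeled data

layer_sizes = [32, 16, 8]   # arbitrary widths for illustration
layers = []                 # the greedily pre-trained stack

H = X
for k, d in enumerate(layer_sizes):
    n_in = H.shape[1]
    # Train only layer k on the output of the already-trained, frozen layers
    # below it; here each layer is a linear autoencoder fit with PCA.
    layer = PCA(n_components=d).fit(H)
    layers.append(layer)
    H = layer.transform(H)   # representation that feeds the next layer
    print(f"layer {k}: {n_in} -> {d} dims")

# The pre-trained stack maps raw inputs to the top-level representation; in a
# deep network these weights would initialize a joint fine-tuning phase.
def encode(x, layers=layers):
    for layer in layers:
        x = layer.transform(x)
    return x

print(encode(X[:5]).shape)   # (5, 8)
```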

Transfer Learning and Domain Adaptation

The objective is to extract information from data in a first setting (dataset A) and exploit it when learning, or even directly making predictions, in a second setting (dataset B).
For example, reviews from different domains (movies, music, books) differ in their specifics but share common structure; adapting what was learned in one domain to another is called domain adaptation.
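A toy sketch of this idea (my addition; the synthetic data, the PCA encoder, and the logistic-regression head are placeholder choices): a representation is learned on unlabeled data from a first domain and then reused, frozen, as a feature extractor for a small supervised task in a related second domain.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Domain A: plenty of unlabeled data.  Domain B: a small labeled set whose
# inputs share structure with A (same mixing matrix, shifted mean).
mix = rng.normal(size=(20, 20))
X_a = rng.normal(size=(5000, 20)) @ mix
X_b = rng.normal(size=(100, 20)) @ mix + 0.5
y_b = (X_b[:, 0] > np.median(X_b[:, 0])).astype(int)   # toy labels for B

# Step 1: unsupervised representation learning on domain A only.
encoder = PCA(n_components=5).fit(X_a)

# Step 2: transfer the frozen encoder and train a small supervised head on B.
H_b = encoder.transform(X_b)
clf = LogisticRegression().fit(H_b, y_b)
print("train accuracy on domain B:", clf.score(H_b, y_b))
```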

Two examples

Extreme forms of transfer learning

  • one-shot learning
  • zero-shot learning
  • zero-data learning
