PCA, SVD, Eigenvector...

Variance and Covariance

The variance of a scalar variable $x$ measures how much $x$ deviates from its mean:

$$\mathrm{Var}(x) = E\big[(x - \bar{x})^2\big]$$
The covariance of two scalar variables $x_i, x_j$ measures the joint variability of the two variables:
$$\mathrm{Cov}(x_i, x_j) = E\big[(x_i - \bar{x}_i)(x_j - \bar{x}_j)\big]$$
If the covariance is positive, the two variables tend to take on higher values at the same time. The correlation normalizes the contribution of the two variables in order to see to what extent they are correlated.
$$\mathrm{Corr}(x_i, x_j) = \frac{\mathrm{Cov}(x_i, x_j)}{\sigma_{x_i}\sigma_{x_j}}$$
Note that "covariance" is related to the concept of "independence" but is not the same thing. If $\mathrm{Cov}(x_i, x_j)$ is not zero, then the two variables are not independent. However, if $\mathrm{Cov}(x_i, x_j)$ is zero, we cannot conclude that the two variables are independent. "Independence" is a stronger concept than "zero covariance" because it also excludes nonlinear relationships.
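
To make the last point concrete, here is a minimal numpy sketch (the synthetic data and variable names are my own example, not from the text): $y = x^2$ is completely determined by $x$, yet their sample covariance is essentially zero.

```python
import numpy as np

# Zero covariance does not imply independence:
# y = x^2 depends on x, yet Cov(x, y) is (approximately) zero for symmetric x.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x ** 2

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_xy)  # close to 0, even though y is completely determined by x
```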

For a variable $X \in \mathbb{R}^n$, we introduce the "covariance matrix" to represent the covariance of any pair of variables $x_i, x_j$: $\mathrm{Cov}(X) \in \mathbb{R}^{n \times n}$, with $\mathrm{Cov}(X)_{i,j} = \mathrm{Cov}(x_i, x_j)$.
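
As a rough sketch of how this matrix is estimated in practice (numpy, with synthetic data and names of my own choosing), the sample covariance matrix below matches the entry-wise definition above:

```python
import numpy as np

# Estimate the covariance matrix of an n-dimensional variable from N samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))              # N = 500 samples of a 3-dimensional variable

X_centered = X - X.mean(axis=0)            # subtract the per-variable mean
cov = X_centered.T @ X_centered / len(X)   # cov[i, j] = Cov(x_i, x_j), 1/N normalization

# np.cov normalizes by 1/(N-1) by default; bias=True switches it to 1/N.
assert np.allclose(cov, np.cov(X, rowvar=False, bias=True))
```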

Eigenvector, Eigenvalue

Eigenvectors and eigenvalues tell us about the invariant subspaces under the linear transformation $A$:

$$Ax = \lambda x$$
If $A$ is not square, the equation above has no solution, because the dimensions of $Ax$ and $\lambda x$ do not match. When we talk about eigenvalues, we are talking about square matrices. For non-square matrices, we use the Singular Value Decomposition (SVD) instead.
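
As a small aside on that last remark, the sketch below (an arbitrary $2 \times 3$ matrix of my own) shows that while the eigenvalue equation is ill-posed for a non-square matrix, its SVD always exists:

```python
import numpy as np

# For a non-square matrix the eigenvalue equation A x = lambda x is ill-posed,
# but the SVD A = U diag(s) V^T always exists.
A = np.arange(6, dtype=float).reshape(2, 3)          # an arbitrary 2 x 3 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, A)           # the factors reconstruct A

# The singular values are the square roots of the eigenvalues of A A^T.
assert np.allclose(s ** 2, np.linalg.eigvalsh(A @ A.T)[::-1])
```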

It is beneficial to gain some intuitive understanding of eigenvectors.

[Figure: the unit circle (left) and its image under $A$, an ellipse (right), with the eigenvectors $v^{(1)}$ and $v^{(2)}$ marked]

The vectors $v^{(1)}$ and $v^{(2)}$ are two eigenvectors of the matrix $A$. When we apply the linear transformation $A$ to the circle in the left plot, the result is an ellipse. We can see that the transformation stretches the space, and the direction corresponding to the largest eigenvalue is stretched the most.

  • The eigenvalues of a matrix are just the roots of its characteristic polynomial. Thus a square matrix always has n eigenvalues, counted with multiplicity (they may include complex-conjugate pairs and repeated roots). A real symmetric matrix always has real eigenvalues and real eigenvectors, and is always diagonalizable.
  • Eigenvectors corresponding to distinct eigenvalues are linearly independent.
  • The "algebraic multiplicity" of an eigenvalue can be larger than its "geometric multiplicity". The matrix A is diagonalizable if and only if every algebraic multiplicity equals the corresponding geometric multiplicity.
  • The dimension of the null space plus the rank equals the dimension n (rank-nullity).
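
To make some of these facts concrete, here is a short numpy sketch (the symmetric matrix is my own example) checking $Av = \lambda v$ and the diagonalization of a real symmetric matrix:

```python
import numpy as np

# A real symmetric matrix has real eigenvalues, real eigenvectors,
# and is diagonalizable as A = V diag(lambda) V^T.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(A)        # eigh is specialized for symmetric matrices
for lam, v in zip(eigvals, eigvecs.T):      # eigenvectors are the columns of eigvecs
    assert np.allclose(A @ v, lam * v)      # A v = lambda v

assert np.allclose(eigvecs @ np.diag(eigvals) @ eigvecs.T, A)
```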

Why do Principal Component Analysis (PCA)?

High-dimensional data usually lies near a much lower-dimensional subspace. This means it is possible to represent the high-dimensional data with fewer variables, which leads to data compression. Such a subspace need not be linear, but we can use a linear subspace as an approximation. PCA is one technique for selecting such a subspace. Besides compression, the lower-dimensional representation (for example, a 2-dimensional representation) can be used to visualize high-dimensional data.

We can also use PCA to preprocess the data so that the variables have zero mean, unit variance, and zero covariance with each other, i.e. an identity covariance matrix. This process is called "whitening" or "sphering". Note that this is different from "standardizing", which only subtracts the mean and divides by the standard deviation; standardizing does not zero the covariances.
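
Below is a rough sketch of PCA whitening (numpy, with synthetic 2-dimensional data of my own); after the transformation the data has zero mean and an identity covariance matrix, unlike plain standardizing:

```python
import numpy as np

# PCA whitening: rotate into the eigenbasis of the covariance matrix,
# then rescale each direction by 1 / sqrt(eigenvalue).
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[3.0, 1.5], [1.5, 1.0]], size=1000)

X_centered = X - X.mean(axis=0)
S = X_centered.T @ X_centered / len(X)       # data covariance matrix
eigvals, U = np.linalg.eigh(S)               # eigvals > 0 for full-rank data

X_white = X_centered @ U / np.sqrt(eigvals)  # rotate, then rescale each column

# The whitened data has zero mean and (approximately) identity covariance.
assert np.allclose(X_white.mean(axis=0), 0.0)
assert np.allclose(X_white.T @ X_white / len(X), np.eye(2))
```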

Two Non-Probabilistic Views of PCA

PCA as Maximum Variance Projection

Consider the task of finding a projection from $\mathbb{R}^D$ to $\mathbb{R}^M$ (with $M < D$) such that the variance of the projected data is maximized. We restrict the rows of the projection matrix to be mutually orthogonal unit vectors. The variance of the data points after the projection is then the sum of the variances along each orthogonal projection direction. Consider the special case where we project into a 1-dimensional space. We can show that maximizing the variance corresponds to solving the following constrained optimization problem:

$$\max_{u}\; u^T S u \quad \text{s.t.} \quad u^T u = 1$$

where $S = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})^T$ is called the "data covariance matrix". It is called the data covariance matrix because it computes an approximation of the covariance matrix using the existing data samples. Optimizing with a Lagrange multiplier, we find that all the stationary points are eigenvectors of the data covariance matrix. Some authors refer to the eigenvector with the largest eigenvalue as "the direction of maximum variance".
It is worth noting that computing the eigendecomposition of a $D \times D$ matrix costs $O(D^3)$.
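
The following minimal sketch illustrates the 1-dimensional case (numpy; the synthetic data is my own, and I use `eigh` rather than a Lagrange-multiplier derivation): the top eigenvector of the data covariance matrix attains the maximum projected variance, which equals the largest eigenvalue.

```python
import numpy as np

# 1-D maximum-variance projection: the top eigenvector of the data covariance
# matrix S maximizes u^T S u subject to u^T u = 1.
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.2], [1.2, 1.0]], size=2000)

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)                 # data covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
u = eigvecs[:, -1]                     # "direction of maximum variance"

# The variance of the projected data equals the largest eigenvalue.
projected_var = np.var(Xc @ u)
assert np.isclose(projected_var, eigvals[-1])
```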

PCA as Minimum Reconstruction Error

For a vector $x \in \mathbb{R}^n$, we define an encoding function $f: \mathbb{R}^n \to \mathbb{R}^l$ that transforms $x$ into a lower-dimensional representation, and a decoding function $g: \mathbb{R}^l \to \mathbb{R}^n$ that maps the representation back to the original space.
One additional constraint is that we set $g(c) = D^T c$. This means we represent the original vector using a linear subspace spanned by several vectors. If we constrain the encoding and decoding in this way, we can derive exactly the same result as the maximum-variance view.
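
Here is a hedged sketch of this encode/decode view (numpy, synthetic data of my own), following the text's convention $g(c) = D^T c$ and assuming the rows of $D$ are the top-$l$ eigenvectors of the data covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(np.zeros(5), np.diag([5.0, 3.0, 1.0, 0.5, 0.1]), size=1000)

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)                 # data covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)

l = 2
D = eigvecs[:, -l:].T                  # rows = top-l eigenvectors, so D @ D.T = I_l

def encode(x):
    # f(x): R^n -> R^l, project onto the rows of D
    return D @ x

def decode(c):
    # g(c) = D^T c: R^l -> R^n, map back into the original space
    return D.T @ c

x = Xc[0]
x_hat = decode(encode(x))
print("reconstruction error:", np.linalg.norm(x - x_hat))
```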

PCA vs. Fisher Linear Discriminant

Both methods can be viewed as techniques for dimensionality reduction. PCA is unsupervised, i.e. it does not need any label information about the data $x_i$. In contrast, the Fisher Linear Discriminant is supervised: it finds the projection direction along which the two classes are best separated.

Probabilistic PCA

TODO

References

Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
