PCA: Principal Component Analysis

The Wikipedia article on PCA: http://en.wikipedia.org/wiki/Principal_component_analysis. What follows is a short summary of that article, plus some of my own understanding of PCA's properties.


【1】. What PCA does


PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute. Its operation can be thought of as revealing the internal structure of the data in a way that best explains the variance in the data. PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset.


Before performing PCA, the data must be preprocessed by moving its centroid to the origin, that is, by subtracting the mean of the dataset. Why? Because PCA characterizes variance: along the principal component it finds, the variance of the data is maximal. The formal definition and a verbal explanation follow.


Formal definition:

\mathbf{w}_{(1)} = \underset{\Vert \mathbf{w} \Vert = 1}{\operatorname{\arg\,max}}\,\{ \sum_i \left(t_1\right)^2_{(i)} \} = \underset{\Vert \mathbf{w} \Vert = 1}{\operatorname{\arg\,max}}\, \sum_i \left(\mathbf{x}_{(i)} \cdot \mathbf{w} \right)^2

Verbal definition:

Given a set of points in Euclidean space, the first principal component corresponds to a line that passes through the multidimensional mean and minimizes the sum of squares of the distances of the points from the line.


If the centroid is not moved to the origin, the first principal component will be more or less affected by the position of the centroid. Wikipedia explains this as follows:


Mean subtraction (a.k.a. "mean centering") is necessary for performing PCA to ensure that the first principal component describes the direction of maximum variance. If mean subtraction is not performed, the first principal component might instead correspond more or less to the mean of the data. A mean of zero is needed for finding a basis that minimizes the mean square error of the approximation of the data.
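As a concrete illustration, here is a minimal mean-centering sketch in Python/NumPy; the data matrix X, its shape, and the random values are made up for illustration and are not from the original post:

import numpy as np

# Minimal mean-centering sketch; X is an assumed (n, p) data matrix
# with one sample per row.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

X_centered = X - X.mean(axis=0)   # subtract the per-attribute mean

# After centering, every column mean is numerically zero.
assert np.allclose(X_centered.mean(axis=0), 0.0)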


PCA then amounts to finding a matrix that transforms the original data X into a new space, yielding a new dataset T:


\mathbf{T} = \mathbf{X} \mathbf{W}


X is an (n, p) matrix in which each row is a sample (vector); W is a (p, p) matrix whose columns are the (unit) eigenvectors of X' * X. Once W is obtained, the dimensions with small eigenvalues can reasonably be dropped, based on the magnitude of the eigenvalue associated with each eigenvector, turning W into a (p, k) matrix with k <= p. This achieves data compression, or denoising. The meaning of each eigenvalue is:


The singular values (in Σ) are the square roots of the eigenvalues of the matrix X^T X. Each eigenvalue is proportional to the portion of the "variance" (more correctly, of the sum of the squared distances of the points from their multidimensional mean) that is correlated with each eigenvector. The sum of all the eigenvalues is equal to the sum of the squared distances of the points from their multidimensional mean. PCA essentially rotates the set of points around their mean in order to align with the principal components. This moves as much of the variance as possible (using an orthogonal transformation) into the first few dimensions. The values in the remaining dimensions, therefore, tend to be small and may be dropped with minimal loss of information. The eigenvalues represent the distribution of the source data's energy among each of the eigenvectors, where the eigenvectors form a basis for the data.
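A quick numerical check of the claim that the eigenvalues sum to the total squared distance from the mean; this is only a sketch with made-up random data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)                    # centered data

eigvals = np.linalg.eigvalsh(Xc.T @ Xc)    # eigenvalues of X^T X

# The sum of the eigenvalues equals the sum of squared distances
# of the points from their multidimensional mean.
assert np.isclose(eigvals.sum(), (Xc ** 2).sum())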


【2】. How to compute W


Two approaches:


(a)

\mathbf{Q} \propto \mathbf{X}^T \mathbf{X} = \mathbf{W} \mathbf{\Lambda} \mathbf{W}^T


where Λ is the diagonal matrix of eigenvalues λ(k) of X^T X; each λ(k) equals the sum of squares over the dataset associated with component k: λ(k) = Σ_i t_k(i)^2 = Σ_i (x(i) ⋅ w(k))^2.
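A sketch of approach (a) in Python/NumPy, assuming the data matrix has already been mean-centered; the data and names are illustrative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)

# Approach (a): eigendecompose X^T X (proportional to the covariance matrix).
lam, W = np.linalg.eigh(Xc.T @ Xc)   # eigh returns eigenvalues in ascending order
order = np.argsort(lam)[::-1]        # reorder so the largest eigenvalue comes first
lam, W = lam[order], W[:, order]

T = Xc @ W                           # scores in the new coordinate system

# lambda(k) equals the sum of squared scores along component k.
assert np.allclose(lam, (T ** 2).sum(axis=0))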


(b)

\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{W}^T



Comparison with the eigenvector factorisation of X^T X establishes that the right singular vectors W of X are equivalent to the eigenvectors of X^T X, while the singular values σ(k) of X are equal to the square roots of the eigenvalues λ(k) of X^T X.
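This equivalence is easy to verify numerically; a minimal sketch, again with illustrative random data:

import numpy as np

rng = np.random.default_rng(0)
Xc = rng.normal(size=(100, 3))
Xc -= Xc.mean(axis=0)

# Approach (b): SVD of the centered data matrix itself, X = U Sigma W^T.
U, s, Wt = np.linalg.svd(Xc, full_matrices=False)

# The singular values are the square roots of the eigenvalues of X^T X.
lam = np.linalg.eigvalsh(Xc.T @ Xc)[::-1]   # eigenvalues, descending
assert np.allclose(s ** 2, lam)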


From method (b), and using the orthonormality of W (W^T W = I), the following formula can be derived:


\mathbf{T} = \mathbf{X} \mathbf{W} = \mathbf{U}\mathbf{\Sigma}\mathbf{W}^T \mathbf{W} = \mathbf{U}\mathbf{\Sigma}

So when factorizing with [u, s, v] = svd(X), the transformation should use X * W, i.e., X * v. Because I had previously always run svd() on the covariance matrix X' * X, where the resulting u and v are identical, I was in the habit of using X * u; with svd(X) applied to the data matrix directly, however, that is incorrect, since u then holds the left singular vectors rather than the principal directions.
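A sketch of this pitfall, assuming NumPy's np.linalg.svd convention X = U diag(s) Vt and made-up data; the assertion compares absolute values because eigenvector signs are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
Xc = rng.normal(size=(100, 3))
Xc -= Xc.mean(axis=0)

# SVD of the data matrix: the principal directions are the RIGHT singular vectors.
U, s, Wt = np.linalg.svd(Xc, full_matrices=False)
T = Xc @ Wt.T                        # correct: X * W

# SVD of X' * X: here the left and right factors coincide (both equal W),
# which is why X * u happened to give the right answer in that setting.
u2, s2, v2t = np.linalg.svd(Xc.T @ Xc)
assert np.allclose(np.abs(Xc @ u2), np.abs(T))   # equal up to per-column sign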

Comparing the two approaches: efficient algorithms exist for computing the SVD of X directly, so method (b) does not need to form the covariance matrix first the way method (a) does, and is therefore more efficient. Finally, the resulting W can be used for dimensionality reduction:


\mathbf{T}_L = \mathbf{U}_L\mathbf{\Sigma}_L = \mathbf{X} \mathbf{W}_L
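A sketch of the truncated projection; L, the shapes, and the data are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
Xc = rng.normal(size=(100, 5))
Xc -= Xc.mean(axis=0)

U, s, Wt = np.linalg.svd(Xc, full_matrices=False)

L = 2                                 # keep the first L principal components
T_L = Xc @ Wt[:L].T                   # X * W_L

# Identical to the first L columns of U Sigma.
assert np.allclose(T_L, U[:, :L] * s[:L])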

