Variance and Covariance
The variance of a scalar variable $x$ is a measure of how much $x$ deviates from its mean.
For a vector-valued variable $X \in \mathbb{R}^n$, we introduce the “covariance matrix” to represent the covariance of every pair of components $x_i, x_j$: $\mathrm{Cov}(X) \in \mathbb{R}^{n \times n}$, with $\mathrm{Cov}(X)_{i,j} = \mathrm{Cov}(x_i, x_j)$.
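As a quick illustration, here is a minimal numpy sketch (the data and variable names are made up for this example) that estimates the covariance matrix from samples and checks that entry $(i, j)$ approximates $\mathrm{Cov}(x_i, x_j)$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))    # 1000 samples of a 3-dimensional variable
X[:, 1] += 0.8 * X[:, 0]          # introduce correlation between components 0 and 1

# Sample covariance matrix: entry (i, j) approximates Cov(x_i, x_j).
cov = np.cov(X, rowvar=False)     # shape (3, 3); columns are treated as variables
print(cov)
print(np.cov(X[:, 0], X[:, 1]))   # its (0, 1) entry matches cov[0, 1]
```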
Eigenvector, Eigenvalue
Eigenvectors and eigenvalues tell us about the subspaces that are invariant under a linear transformation $A$.
It is beneficial to gain some intuitive understanding about eigenvectors.
The vectors $v^{(1)}$ and $v^{(2)}$ are two eigenvectors of the matrix $A$. When we apply the linear transformation $A$ to the circle in the left plot, the result is an ellipse. We can see that the transformation stretches the space, and the directions corresponding to the largest eigenvalues are stretched the most.
- The eigenvalues of a matrix are the roots of its characteristic polynomial, so an $n \times n$ matrix always has $n$ eigenvalues counted with multiplicity (they may include complex conjugate pairs or repeated roots). A real symmetric matrix always has real eigenvalues and real eigenvectors, and is always diagonalizable (see the numerical check after this list).
- Eigenvectors corresponding to distinct eigenvalues are linearly independent.
- The “algebraic multiplicity” of an eigenvalue can be larger than its “geometric multiplicity”. The matrix $A$ is diagonalizable if and only if every eigenvalue’s algebraic multiplicity equals its geometric multiplicity.
- The dimension of the nullspace plus the rank equals the dimension $n$ (rank-nullity theorem).
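The following numpy sketch (illustrative only) checks a couple of these facts for a small real symmetric matrix: its eigenvalues are real, and its orthonormal eigenvectors diagonalize it.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # a real symmetric matrix

eigvals, eigvecs = np.linalg.eigh(A)  # eigh is specialized to symmetric/Hermitian matrices
print(eigvals)                        # real eigenvalues

# The columns of eigvecs are orthonormal eigenvectors, so Q^T A Q is diagonal.
Q = eigvecs
print(Q.T @ A @ Q)                    # approximately diag(eigvals)
print(Q.T @ Q)                        # approximately the identity matrix
```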
Why do Principal Component Analysis (PCA)?
High-dimensional data usually lives in a much smaller subspace. This means it is possible to represent the high-dimensional data with fewer variables, which leads to data compression. Such a subspace need not be linear, but we can use a linear subspace as an approximation. PCA is one technique for selecting such a subspace. Besides compression, the lower-dimensional representation (for example, a 2-dimensional one) can be used to visualize high-dimensional data.
We can also use PCA to preprocess the data so that the variables have zero mean, unit variance, and zero covariance between different variables. This process is called “whitening” or “sphering”. Note that this is different from “standardizing”, which only subtracts the mean and divides by the standard deviation of each variable: standardizing does not zero out the covariance.
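The difference can be seen in a minimal sketch (illustrative numpy code on synthetic correlated data, not a full preprocessing pipeline): standardizing rescales each variable but leaves the off-diagonal covariance non-zero, while whitening also rotates the data so the covariance becomes approximately the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
X[:, 1] += 0.9 * X[:, 0]                 # correlated variables
Xc = X - X.mean(axis=0)                  # subtract the mean

# Standardizing: divide by each variable's standard deviation.
X_std = Xc / Xc.std(axis=0)
print(np.cov(X_std, rowvar=False))       # off-diagonal entries are still non-zero

# Whitening (sphering): rotate onto the eigenvectors of the covariance, then rescale.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
X_white = (Xc @ eigvecs) / np.sqrt(eigvals)
print(np.cov(X_white, rowvar=False))     # approximately the identity matrix
```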
Two Non-Probabilistic Views of PCA
PCA as Maximum Variance Projection
Consider the task of finding a projection $\mathbb{R}^D \to \mathbb{R}^M$ (with $M < D$) such that the variance of the data in the $\mathbb{R}^M$ subspace is maximized. We restrict the rows of the projection matrix to be orthonormal. The variance of the data points after the projection is then just the sum of the variances along each orthogonal projection direction. Consider the special case where we project onto a 1-dimensional subspace spanned by a unit vector $u_1$. We can show that maximizing the variance corresponds to solving the following constrained optimization problem:

$$\max_{u_1} \; u_1^\top S u_1 \quad \text{subject to} \quad u_1^\top u_1 = 1$$
where $S = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})^\top$ is called the “data covariance matrix”. It is called the data covariance matrix because it computes an approximation of the covariance matrix using the available data samples. Optimizing with a Lagrange multiplier, we find that all the stationary points are eigenvectors of the data covariance matrix, and the maximum is attained at the eigenvector with the largest eigenvalue. Some authors refer to this eigenvector as “the direction of maximum variance”.
It is worth noting that computing the eigendecomposition of a $D \times D$ matrix costs $O(D^3)$.
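Here is a minimal numpy sketch of the procedure described above (illustrative code with made-up data; the helper name `pca_directions` is just for this example): form the data covariance matrix $S$, take its eigendecomposition, and read off the top eigenvectors.

```python
import numpy as np

def pca_directions(X, M):
    """Return the top-M eigenvectors and eigenvalues of the data covariance matrix S."""
    Xc = X - X.mean(axis=0)               # subtract the sample mean x_bar
    S = (Xc.T @ Xc) / X.shape[0]          # S = (1/N) sum_i (x_i - x_bar)(x_i - x_bar)^T
    eigvals, eigvecs = np.linalg.eigh(S)  # O(D^3) for a D x D matrix
    order = np.argsort(eigvals)[::-1]     # sort eigenvalues in descending order
    return eigvecs[:, order[:M]], eigvals[order[:M]]

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])  # most variance along axis 0
U, lam = pca_directions(X, M=1)
print(U)     # u_1, the direction of maximum variance
print(lam)   # the variance of the data projected onto u_1
```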
PCA as Minimizing Reconstruction Error
For a vector $x \in \mathbb{R}^n$, we define an encoding function $f(x): \mathbb{R}^n \to \mathbb{R}^l$ that transforms $x$ into a lower-dimensional representation, and a decoding function $g(c): \mathbb{R}^l \to \mathbb{R}^n$ that maps it back to the original space.
One additional constraint is that we set $g(c) = D^\top c$. This means we represent the original vector using a linear subspace spanned by several vectors. If we constrain the encoding and decoding in this way and minimize the reconstruction error, we can derive exactly the same result as the maximum variance view.
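A minimal numpy sketch of this view (illustrative only; here the rows of the matrix $W$ play the role of $D$ in $g(c) = D^\top c$, and they are taken to be the top eigenvectors of the data covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)

# Top-l eigenvectors of the data covariance matrix, stored as the rows of W.
S = (Xc.T @ Xc) / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(S)
W = eigvecs[:, np.argsort(eigvals)[::-1][:2]].T   # shape (l, n) = (2, 3)

def encode(x, W):
    return W @ x        # f(x): R^n -> R^l

def decode(c, W):
    return W.T @ c      # g(c) = D^T c: R^l -> R^n

x = Xc[0]
x_hat = decode(encode(x, W), W)
print(np.linalg.norm(x - x_hat))  # small: most of x lies in the top-2 subspace
```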
PCA vs. Fisher Linear Discriminant
Both methods can be viewed as techniques for dimensionality reduction. PCA is unsupervised, i.e., it does not need any label information about the data $x_i$. In contrast, the Fisher linear discriminant is supervised: it finds the projection direction along which the two classes are most separated.
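To make the contrast concrete, here is an illustrative sketch on synthetic two-class data (the parameters are made up): PCA picks the direction of largest overall variance while ignoring the labels, whereas the Fisher direction $w \propto S_W^{-1}(m_2 - m_1)$ uses the class means and the within-class covariance.

```python
import numpy as np

rng = np.random.default_rng(4)
# Two classes that overlap along the high-variance axis but separate along the other.
cov = np.array([[4.0, 0.0], [0.0, 0.3]])
X1 = rng.multivariate_normal([0.0, -1.0], cov, size=300)
X2 = rng.multivariate_normal([0.0,  1.0], cov, size=300)
X = np.vstack([X1, X2])

# PCA direction: top eigenvector of the pooled data covariance (labels are ignored).
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
pca_dir = eigvecs[:, np.argmax(eigvals)]

# Fisher direction: w proportional to S_W^{-1} (m2 - m1), which uses the labels.
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
fisher_dir = np.linalg.solve(S_W, m2 - m1)

print(pca_dir)                                  # roughly along the first axis
print(fisher_dir / np.linalg.norm(fisher_dir))  # roughly along the second axis
```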
Probabilistic PCA
TODO
References
Ian Goodfellow et al., Deep Learning
Christopher M. Bishop, Pattern Recognition and Machine Learning