Variance and Covariance
The variance of a scalar variable $x$ is a measure of how much $x$ deviates from its mean.
For a vector-valued variable $X \in \mathbb{R}^n$, we introduce the “covariance matrix” to represent the covariance of every pair of components $x_i, x_j$: $\mathrm{Cov}(X) \in \mathbb{R}^{n \times n}$, with $\mathrm{Cov}(X)_{i,j} = \mathrm{Cov}(x_i, x_j)$.
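As a quick illustration, here is a minimal numpy sketch (the data and variable names are made up for this example) that estimates the covariance matrix from samples and checks that entry $(i, j)$ approximates $\mathrm{Cov}(x_i, x_j)$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))    # 1000 samples of a 3-dimensional variable
X[:, 1] += 0.8 * X[:, 0]          # introduce correlation between components 0 and 1

# Sample covariance matrix: entry (i, j) approximates Cov(x_i, x_j).
cov = np.cov(X, rowvar=False)     # shape (3, 3); columns are treated as variables
print(cov)
print(np.cov(X[:, 0], X[:, 1]))   # its (0, 1) entry matches cov[0, 1]
```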
Eigenvector, Eigenvalue
Eigenvectors and eigenvalues tell us about the subspaces that are invariant under a linear transformation $A$.
It is beneficial to gain some intuitive understanding about eigenvectors.
The vectors $v^{(1)}$ and $v^{(2)}$ are two eigenvectors of the matrix $A$. When we apply the linear transformation $A$ to the circle in the left plot, the result is an ellipse. We can see that the transformation stretches the space, and the directions corresponding to the largest eigenvalues are stretched the most.
- The eigenvalues of a matrix are the roots of its characteristic polynomial, so an $n \times n$ matrix always has $n$ eigenvalues counted with multiplicity (they may include complex conjugate pairs or repeated roots). A real symmetric matrix always has real eigenvalues and real eigenvectors, and is always diagonalizable (see the numerical check after this list).
- Eigenvectors corresponding to distinct eigenvalues are linearly independent.
- The “algebraic multiplicity” of an eigenvalue can be larger than its “geometric multiplicity”. The matrix $A$ is diagonalizable if and only if every eigenvalue’s algebraic multiplicity equals its geometric multiplicity.
- The dimension of the nullspace plus the rank equals the dimension $n$ (rank-nullity theorem).
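The following numpy sketch (illustrative only) checks a couple of these facts for a small real symmetric matrix: its eigenvalues are real, and its orthonormal eigenvectors diagonalize it.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # a real symmetric matrix

eigvals, eigvecs = np.linalg.eigh(A)  # eigh is specialized to symmetric/Hermitian matrices
print(eigvals)                        # real eigenvalues

# The columns of eigvecs are orthonormal eigenvectors, so Q^T A Q is diagonal.
Q = eigvecs
print(Q.T @ A @ Q)                    # approximately diag(eigvals)
print(Q.T @ Q)                        # approximately the identity matrix
```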
Why do Principal Component Analysis (PCA)?
High-dimensional data usually lives in a much smaller subspace. This means it is possible to represent the high-dimensional data with fewer variables, which leads to data compression. Such a subspace need not be linear, but we can use a linear subspace as an approximation. PCA is one technique for selecting such a subspace. Besides compression, the lower-dimensional representation (for example, a 2-dimensional one) can be used to visualize high-dimensional data.
We can also use PCA to preprocess the data so that the variables have zero mean, unit variance, and zero covariance between different variables. This process is called “whitening” or “sphering”. Note that this is different from “standardizing”, which only subtracts the mean and divides by the standard deviation of each variable: standardizing does not zero out the covariance.
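The difference can be seen in a minimal sketch (illustrative numpy code on synthetic correlated data, not a full preprocessing pipeline): standardizing rescales each variable but leaves the off-diagonal covariance non-zero, while whitening also rotates the data so the covariance becomes approximately the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
X[:, 1] += 0.9 * X[:, 0]                 # correlated variables
Xc = X - X.mean(axis=0)                  # subtract the mean

# Standardizing: divide by each variable's standard deviation.
X_std = Xc / Xc.std(axis=0)
print(np.cov(X_std, rowvar=False))       # off-diagonal entries are still non-zero

# Whitening (sphering): rotate onto the eigenvectors of the covariance, then rescale.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
X_white = (Xc @ eigvecs) / np.sqrt(eigvals)
print(np.cov(X_white, rowvar=False))     # approximately the identity matrix
```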
Two Non-Probabilistic Views of PCA
PCA as Maximum Variance Projection
Consider the task of finding a projection $\mathbb{R}^D \to \mathbb{R}^M$ (with $M < D$) such that the variance of the data in the $\mathbb{R}^M$ subspace is maximized. We restrict the rows of the projection matrix to be orthonormal. The variance of the data points after the projection is then just the sum of the variances along each orthogonal projection direction. Consider the special case where we project onto a 1-dimensional subspace spanned by a unit vector $u_1$. We can show that maximizing the variance corresponds to solving the following constrained optimization problem:

$$\max_{u_1} \; u_1^\top S u_1 \quad \text{subject to} \quad u_1^\top u_1 = 1$$
where $S = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})^\top$ is called the “data covariance matrix”. It is called the data covariance matrix because it computes an approximation of the covariance matrix using the available data samples. Optimizing with a Lagrange multiplier, we find that all the stationary points are eigenvectors of the data covariance matrix, and the maximum is attained at the eigenvector with the largest eigenvalue. Some authors refer to this eigenvector as “the direction of maximum variance”.
It is worth noting that computing the eigendecomposition of a $D \times D$ matrix costs $O(D^3)$.
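Here is a minimal numpy sketch of the procedure described above (illustrative code with made-up data; the helper name `pca_directions` is just for this example): form the data covariance matrix $S$, take its eigendecomposition, and read off the top eigenvectors.

```python
import numpy as np

def pca_directions(X, M):
    """Return the top-M eigenvectors and eigenvalues of the data covariance matrix S."""
    Xc = X - X.mean(axis=0)               # subtract the sample mean x_bar
    S = (Xc.T @ Xc) / X.shape[0]          # S = (1/N) sum_i (x_i - x_bar)(x_i - x_bar)^T
    eigvals, eigvecs = np.linalg.eigh(S)  # O(D^3) for a D x D matrix
    order = np.argsort(eigvals)[::-1]     # sort eigenvalues in descending order
    return eigvecs[:, order[:M]], eigvals[order[:M]]

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])  # most variance along axis 0
U, lam = pca_directions(X, M=1)
print(U)     # u_1, the direction of maximum variance
print(lam)   # the variance of the data projected onto u_1
```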
PCA as Minimizing Reconstruction Error
For a vector $x \in \mathbb{R}^n$, we define an encoding function $f(x): \mathbb{R}^n \to \mathbb{R}^l$ that transforms $x$ into a lower-dimensional representation, and a decoding function $g(c): \mathbb{R}^l \to \mathbb{R}^n$ that maps it back to the original space.
One additional constraint is that we set $g(c) = D^\top c$. This means we represent the original vector using a linear subspace spanned by several vectors. If we constrain the encoding and decoding in this way and minimize the reconstruction error, we can derive exactly the same result as the maximum variance view.
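A minimal numpy sketch of this view (illustrative only; here the rows of the matrix $W$ play the role of $D$ in $g(c) = D^\top c$, and they are taken to be the top eigenvectors of the data covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)

# Top-l eigenvectors of the data covariance matrix, stored as the rows of W.
S = (Xc.T @ Xc) / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(S)
W = eigvecs[:, np.argsort(eigvals)[::-1][:2]].T   # shape (l, n) = (2, 3)

def encode(x, W):
    return W @ x        # f(x): R^n -> R^l

def decode(c, W):
    return W.T @ c      # g(c) = D^T c: R^l -> R^n

x = Xc[0]
x_hat = decode(encode(x, W), W)
print(np.linalg.norm(x - x_hat))  # small: most of x lies in the top-2 subspace
```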
PCA vs. Fisher Linear Discriminant
Both methods can be viewed as techniques for dimensionality reduction. PCA is unsupervised, i.e., it does not need any label information about the data $x_i$. In contrast, the Fisher linear discriminant is supervised: it finds the projection direction along which the two classes are most separated.
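To make the contrast concrete, here is an illustrative sketch on synthetic two-class data (the parameters are made up): PCA picks the direction of largest overall variance while ignoring the labels, whereas the Fisher direction $w \propto S_W^{-1}(m_2 - m_1)$ uses the class means and the within-class covariance.

```python
import numpy as np

rng = np.random.default_rng(4)
# Two classes that overlap along the high-variance axis but separate along the other.
cov = np.array([[4.0, 0.0], [0.0, 0.3]])
X1 = rng.multivariate_normal([0.0, -1.0], cov, size=300)
X2 = rng.multivariate_normal([0.0,  1.0], cov, size=300)
X = np.vstack([X1, X2])

# PCA direction: top eigenvector of the pooled data covariance (labels are ignored).
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
pca_dir = eigvecs[:, np.argmax(eigvals)]

# Fisher direction: w proportional to S_W^{-1} (m2 - m1), which uses the labels.
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
fisher_dir = np.linalg.solve(S_W, m2 - m1)

print(pca_dir)                                  # roughly along the first axis
print(fisher_dir / np.linalg.norm(fisher_dir))  # roughly along the second axis
```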
Probabilistic PCA
TODO
References
Ian Goodfellow et al., Deep Learning
Christopher M. Bishop, Pattern Recognition and Machine Learning