1. Motivation I: Data Compression
- Reduce data from 2D to 1D
2. Motivation II: Data Visualization
3. Principal Component Analysis problem formulation
- reduce from 2D to 1D: find a direction onto which to project the data so as to minimize the projection error
- reduce from n-D to k-D: find k vectors onto which to project the data so as to minimize the projection error
- PCA is not linear regression
4. Principal Component Analysis algorithm
- data preprocessing
training set: x^(1), x^(2), ..., x^(m)
preprocessing (feature scaling/mean normalization): compute the mean mu_j of each feature and replace each x_j^(i) with x_j^(i) - mu_j
if different features are on different scales, scale the features (e.g. divide by the standard deviation s_j) so they have a comparable range of values
- PCA algorithm - reduce data from n-D to k-D
compute covariance matrix: Sigma = (1/m) * sum_{i=1}^{m} x^(i) (x^(i))^T
compute eigenvectors of Sigma: [U, S, V] = svd(Sigma), take U_reduce = the first k columns of U, and project z^(i) = U_reduce^T x^(i)
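A minimal NumPy sketch of the preprocessing and PCA steps above (mean normalization, covariance matrix, SVD, projection); the function name pca and its return values are illustrative, not from the lecture:

```python
import numpy as np

def pca(X, k):
    """Reduce X (m examples x n features) to k dimensions."""
    m, n = X.shape
    # preprocessing: mean normalization (add feature scaling if ranges differ)
    mu = X.mean(axis=0)
    X_norm = X - mu
    # covariance matrix: Sigma = (1/m) * sum x^(i) (x^(i))^T, an n x n matrix
    Sigma = (X_norm.T @ X_norm) / m
    # Sigma is symmetric PSD, so svd yields its eigenvectors in U
    U, S, _ = np.linalg.svd(Sigma)
    U_reduce = U[:, :k]      # first k columns: the k principal directions
    Z = X_norm @ U_reduce    # projected data, m x k
    return Z, U_reduce, S, mu
```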
5. Reconstruction from compressed representation
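Reconstruction maps a compressed point z back into the original n-D space via x_approx = U_reduce * z (adding the mean back if mean normalization was applied). A sketch reusing U_reduce and mu from the pca() sketch above:

```python
def reconstruct(Z, U_reduce, mu):
    """Approximate reconstruction: x_approx = U_reduce @ z + mu.

    The result lives in the original n-D space but lies on the
    k-D subspace spanned by the principal directions.
    """
    return Z @ U_reduce.T + mu
```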
6. Choosing the number of principal components
- average squared projection error: (1/m) * sum_{i=1}^{m} ||x^(i) - x_approx^(i)||^2
- total variation in the data: (1/m) * sum_{i=1}^{m} ||x^(i)||^2
- choose k to be the smallest value so that
(average squared projection error) / (total variation) <= 0.01, i.e. 99% of the variance is retained
equivalently, with S from svd(Sigma): (sum_{i=1}^{k} S_ii) / (sum_{i=1}^{n} S_ii) >= 0.99
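A sketch of the usual shortcut for this check: reuse the singular values S returned by svd(Sigma) and pick the smallest k whose cumulative share reaches the retention threshold; the name choose_k and the retain parameter are illustrative:

```python
import numpy as np

def choose_k(S, retain=0.99):
    """Smallest k such that the fraction of variance retained,
    sum(S[:k]) / sum(S), is at least `retain` (0.99 = "99% retained")."""
    retained = np.cumsum(S) / np.sum(S)   # increasing in k
    return int(np.searchsorted(retained, retain) + 1)
```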
7. Advice for applying PCA
- supervised learning speedup: run PCA on the training set only to learn the mapping x -> z
the same mapping can then be applied to the examples in the cross-validation and test sets (see the sketch after this list)
- application of PCA
- compression
- reduce memory/disk needed to store data
speed up learning algorithm
- visualization
- bad use of PCA: trying to prevent overfitting; this might work OK, but it isn't a good way to address overfitting - use regularization instead
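As referenced above, a sketch of the speedup workflow, assuming pca() is the earlier sketch; X_train, X_cv, X_test, and k=100 are placeholder names and values:

```python
# Learn the PCA mapping on the training set only, then reuse
# mu and U_reduce for the cross-validation and test data.
Z_train, U_reduce, S, mu = pca(X_train, k=100)   # fit on training set
Z_cv   = (X_cv   - mu) @ U_reduce                # same mapping, not refit
Z_test = (X_test - mu) @ U_reduce
# train the supervised model on (Z_train, y_train) instead of X_train
```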
Difference between PCA and Linear Regression
- PCA measures the orthogonal distance, whereas linear regression measures the vertical distance between the true value y = g(x) and the estimate f(x) at each point x
- more generally: PCA looks for a surface such that projecting the features {x1, x2, ..., xn} onto it maximizes the variance of the projected points (y is not involved; it finds the surface that best represents the features), whereas Linear Regression is given {x1, x2, ..., xn} and tries to predict y from x, hence the regression
The information-entropy approach belongs to projection pursuit dimensionality reduction, while PCA is a factor-analysis-style method.