Dimensionality Reduction
Motivation I: Data Compression
Compressed data takes up less memory and disk space, and can speed up our learning algorithms.
Motivation II: Data Visualization
When you have trouble understanding high-dimensional data, dimensionality reduction lets you reduce the data from 50 dimensions (or whatever) down to 2 or 3 dimensions, so that you can plot it and understand your data better.
Principal Component Analysis (PCA): problem formulation
- Projection error: the distance from each data point to its projection onto the fitted line, which PCA seeks to minimize.
- Reduce from 2 dimensions to 1 dimension: find a direction (a vector $u^{(1)} \in \mathbb{R}^n$) onto which to project the data so as to minimize the projection error.
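A minimal numpy sketch of the projection error being minimized (the toy data and candidate direction `u` are made-up illustrative values, not from the notes):

```python
import numpy as np

# Toy 2-D data and a candidate unit direction u^(1) (hypothetical values).
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2]])  # m x 2
u = np.array([1.0, 1.0]) / np.sqrt(2.0)             # unit vector

# Project each point onto the line spanned by u, then sum the squared
# distances from each point to its projection (the projection error).
projections = (X @ u)[:, None] * u                  # m x 2 projected points
projection_error = np.sum((X - projections) ** 2)

print(projection_error)
```

PCA chooses the direction `u` for which this quantity is smallest.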
Principal Component Analysis algorithm
- Before applying PCA, there is a data pre-processing step, which you should always do.
- Training set: $x^{(1)}, x^{(2)}, x^{(3)}, \dots, x^{(m)}$
- Preprocessing (feature scaling / mean normalization):
  - $\mu_j = \frac{1}{m}\displaystyle\sum^{m}_{i=1}{x_j^{(i)}}$
  - Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$
  - If different features are on different scales (e.g., $x_1$ = size of house, $x_2$ = number of bedrooms), scale features to have a comparable range of values:
  - $x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}$
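The preprocessing steps above can be sketched in numpy as follows (the feature matrix is a made-up example; here $s_j$ is taken to be the standard deviation, though the range of values also works):

```python
import numpy as np

# Toy feature matrix (hypothetical): column 0 = house size, column 1 = bedrooms.
X = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0], [1416.0, 2.0]])

mu = X.mean(axis=0)          # mu_j = (1/m) * sum_i x_j^(i)
s = X.std(axis=0)            # s_j: standard deviation of feature j
X_norm = (X - mu) / s        # x_j^(i) <- (x_j^(i) - mu_j) / s_j

print(X_norm.mean(axis=0))   # each feature now has (near-)zero mean
```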
Principal Component Analysis (PCA) algorithm
- Reduce data from $n$ dimensions to $k$ dimensions.
- Compute the "covariance matrix": $\Sigma = \frac{1}{m}\displaystyle\sum^{m}_{i=1}{(x^{(i)})(x^{(i)})^T}$
- Compute the "eigenvectors" of the matrix $\Sigma$ (here $\Sigma$, written `Sigma` in code, denotes the covariance matrix, not a summation):
[U,S,V] = svd(Sigma);
- Take the first $k$ columns of $U$ as $U_{reduce}$ and project: $z^{(i)} = U_{reduce}^T\, x^{(i)}$
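A numpy translation of the algorithm, mirroring the Octave `[U,S,V] = svd(Sigma)` line above (the data matrix and the choice $k=2$ are made up for illustration):

```python
import numpy as np

# Toy data: m = 5 examples, n = 3 features (hypothetical values).
X = np.array([[2.0, 0.0, 1.0],
              [0.5, 1.0, 0.0],
              [-1.0, 0.5, -0.5],
              [-2.0, -1.0, -1.0],
              [0.5, -0.5, 0.5]])
m, n = X.shape
X = X - X.mean(axis=0)            # mean normalization (preprocessing step)

Sigma = (X.T @ X) / m             # covariance matrix, n x n
U, S, Vt = np.linalg.svd(Sigma)   # Octave: [U,S,V] = svd(Sigma)

k = 2
U_reduce = U[:, :k]               # first k eigenvectors, n x k
Z = X @ U_reduce                  # rows are z^(i) = U_reduce^T x^(i)

print(Z.shape)                    # data reduced from n to k dimensions
```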
Choosing the number of principal components
- Typically, choose $k$ to be the smallest value such that $\frac{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)} - x_{approx}^{(i)}\|^2}{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}\|^2} \le 0.01$, i.e., 99% of the variance is retained.
- Using the diagonal entries $S_{ii}$ of the `S` matrix from `svd(Sigma)`, this is equivalent to choosing the smallest $k$ with $\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99$.
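The singular-value criterion for choosing $k$ can be sketched as follows (the singular values and the helper name `smallest_k` are illustrative assumptions; numpy's `svd` returns the diagonal of Octave's `S` matrix as a 1-D array):

```python
import numpy as np

# Hypothetical singular values S_ii from svd(Sigma), sorted descending.
S = np.array([5.0, 2.0, 0.3, 0.05, 0.01])

def smallest_k(S, variance_retained=0.99):
    """Smallest k such that sum(S[:k]) / sum(S) >= variance_retained."""
    ratios = np.cumsum(S) / np.sum(S)
    return int(np.searchsorted(ratios, variance_retained) + 1)

k = smallest_k(S)
print(k)
```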
Reconstruction from compressed representation
- Going back from the compressed representation to an approximation of the original high-dimensional data.
- Reconstruction of the original data: $x_{approx}^{(i)} = U_{reduce}\, z^{(i)} \approx x^{(i)}$, i.e., we try to reconstruct the original value of $x$ from the compressed representation $z$.
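A numpy sketch of the round trip, compress then reconstruct (the toy data, chosen to lie almost on a line so that $k=1$ loses little, is a made-up example):

```python
import numpy as np

# Zero-mean toy data (hypothetical) lying almost on the line y = x.
X = np.array([[1.0, 1.02], [2.0, 1.97], [-1.5, -1.51], [-1.5, -1.48]])
m = X.shape[0]

Sigma = (X.T @ X) / m
U, S, Vt = np.linalg.svd(Sigma)

k = 1
U_reduce = U[:, :k]
Z = X @ U_reduce                 # compress:    z^(i) = U_reduce^T x^(i)
X_approx = Z @ U_reduce.T        # reconstruct: x_approx^(i) = U_reduce z^(i)

print(np.max(np.abs(X - X_approx)))   # small reconstruction error
```

Since the data is nearly one-dimensional, the reconstruction recovers each point almost exactly.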