14-1 Motivation I: Data Compression
Data Compression
Reduce data from 2D to 1D: project onto a line, $x_1, x_2 \longrightarrow z_1$
Reduce data from 3D to 2D: project onto a plane, $x_1, x_2, x_3 \longrightarrow z_1, z_2$
14-2 Motivation II: Data Visualization
Data Visualization
14-3 Principal Component Analysis problem formulation
PCA: Principal Component Analysis
Principal Component Analysis (PCA) problem formulation
Reduce from 2-dimension to 1-dimension: Find a direction (a vector $u^{(1)} \in \mathbb{R}^n$) onto which to project the data so as to minimize the projection error.
Reduce from n-dimension to k-dimension: Find $k$ vectors $u^{(1)}, u^{(2)}, \cdots, u^{(k)}$ onto which to project the data, so as to minimize the projection error.
PCA is not linear regression
Linear regression minimizes the vertical distances between the points and the fitted line, i.e., the errors in predicting $y$ from $x$.
PCA minimizes the orthogonal projection distances; the features $x_1, x_2, \cdots$ are all treated symmetrically and there is no $y$ to predict.
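To make the contrast concrete, here is a sketch of the two objectives in the course's notation (with $x_{approx}^{(i)}$ denoting the projection of $x^{(i)}$ onto the chosen subspace):

$$\text{Linear regression:}\quad \min_{\theta}\ \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

$$\text{PCA:}\quad \min_{u^{(1)},\cdots,u^{(k)}}\ \frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)} - x_{approx}^{(i)}\right\|^2$$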
14-4 Principal Component Analysis algorithm
Data preprocessing
Training set: $x^{(1)}, x^{(2)}, \cdots, x^{(m)}$
Preprocessing (feature scaling/mean normalization):
$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$
Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.
If different features are on different scales (e.g., $x_1$ = size of house, $x_2$ = number of bedrooms), scale features to have a comparable range of values.
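A minimal Octave sketch of this preprocessing step, assuming the examples are stored as rows of an $m \times n$ matrix `X` (the variable names here are illustrative):

```matlab
% X is an m x n data matrix: one training example per row (illustrative name)
mu = mean(X);                              % 1 x n vector of per-feature means
X_norm = bsxfun(@minus, X, mu);            % mean normalization: subtract mu from every row
sigma = std(X_norm);                       % 1 x n vector of per-feature standard deviations
X_norm = bsxfun(@rdivide, X_norm, sigma);  % optional feature scaling to a comparable range
```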
Principal Component Analysis (PCA) algorithm
Reduce data from n-dimensions to k-dimensions
Compute the “covariance matrix”:
$$\Sigma = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}\right)\left(x^{(i)}\right)^T$$
Compute the “eigenvectors” of matrix $\Sigma$:
[U,S,V] = svd(Sigma);  % or eig(Sigma)
Sigma is an $n \times n$ matrix.
From [U,S,V] = svd(Sigma), we get:
$U = [u^{(1)}, u^{(2)}, u^{(3)}, \cdots, u^{(n)}] \in \mathbb{R}^{n \times n}$; taking its first $k$ columns gives $U_{reduce} \in \mathbb{R}^{n \times k}$.
After mean normalization (ensuring every feature has zero mean) and optionally feature scaling:
$Sigma = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}\right)\left(x^{(i)}\right)^T$
[U,S,V] = svd(Sigma);
Ureduce = U(:,1:k);
z = Ureduce' * x;
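Putting the steps together, a compact Octave sketch (assuming `X` is the $m \times n$ matrix of already mean-normalized examples, and `m`, `k` are set; variable names are illustrative):

```matlab
Sigma = (1/m) * (X' * X);      % n x n covariance matrix; equivalent to the sum over examples
[U, S, V] = svd(Sigma);        % columns of U are the principal directions u^(1), ..., u^(n)
Ureduce = U(:, 1:k);           % keep the first k columns (n x k)
Z = X * Ureduce;               % project every example at once: Z is m x k
% For a single column-vector example x in R^n:  z = Ureduce' * x;
```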
14-5 Choosing the number of principal components
Choosing k (number of principal components)
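The notes here give only the heading. A common criterion (the one this lecture uses) is to pick the smallest $k$ such that at least 99% of the variance is retained, which can be computed cheaply from the diagonal matrix $S$ returned by `svd`. A sketch:

```matlab
% S is the diagonal matrix from [U, S, V] = svd(Sigma)
sv = diag(S);                        % singular values S_11, ..., S_nn
retained = cumsum(sv) / sum(sv);     % fraction of variance retained for each choice of k
k = find(retained >= 0.99, 1);       % smallest k that retains at least 99% of the variance
```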
14-6 Reconstruction from compressed representation
Reconstruction from compressed representation
$z = U_{reduce}^T x \qquad x_{approx} = U_{reduce}\, z$
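A short Octave sketch of compression and reconstruction for a single example `x` (an $n \times 1$ column vector), reusing the `Ureduce` computed above:

```matlab
z = Ureduce' * x;          % compress: x in R^n  ->  z in R^k
x_approx = Ureduce * z;    % reconstruct: an n x 1 approximation of x in the k-dim subspace
% For a whole data matrix with examples as rows:  X_approx = Z * Ureduce';
```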
14-7 Advice for applying PCA
Supervised learning speedup
$(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})$
Extract inputs:
Unlabeled dataset: $x^{(1)}, x^{(2)}, \cdots, x^{(m)} \in \mathbb{R}^{10000} \longrightarrow z^{(1)}, z^{(2)}, \cdots, z^{(m)} \in \mathbb{R}^{1000}$
New training set:
$(z^{(1)}, y^{(1)}), (z^{(2)}, y^{(2)}), \cdots, (z^{(m)}, y^{(m)})$
Note: the mapping $x^{(i)} \rightarrow z^{(i)}$ should be defined by running PCA only on the training set. The same mapping can then be applied to the examples $x_{cv}^{(i)}$ and $x_{test}^{(i)}$ in the cross-validation and test sets.
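A sketch of this workflow in Octave, with illustrative names (`X_train`, `X_cv`, `X_test`) and illustrative sizes ($n = 10000$, $k = 1000$); the mean `mu` and `Ureduce` are computed from the training set only and then reused everywhere:

```matlab
mu = mean(X_train);                              % training-set feature means
Xn = bsxfun(@minus, X_train, mu);                % mean-normalize using training statistics only
Sigma = (1/m) * (Xn' * Xn);                      % covariance of the normalized training inputs
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);                             % e.g. k = 1000

Z_train = Xn * Ureduce;                          % new inputs for the supervised learner
Z_cv    = bsxfun(@minus, X_cv,   mu) * Ureduce;  % apply the SAME mapping to cross-validation
Z_test  = bsxfun(@minus, X_test, mu) * Ureduce;  % and test examples
```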
Application of PCA
- Compression
  - Reduce memory/disk needed to store data
  - Speed up learning algorithm
- Visualization
Bad use of PCA: To prevent overfitting
Use $z^{(i)}$ instead of $x^{(i)}$ to reduce the number of features to $k < n$.
Thus, fewer features, less likely to overfit
$\color{red}{\times}$
This might work OK, but it isn't a good way to address overfitting. Use regularization instead:
$$\min_{\theta} \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
PCA is sometimes used where it shouldn’t be
Design of ML system:
- Get training set $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$
- Run PCA to reduce $x^{(i)}$ in dimension to get $z^{(i)}$
- Train logistic regression on $\{(z^{(1)}, y^{(1)}), \cdots, (z^{(m)}, y^{(m)})\}$
- Test on test set: map $x_{test}^{(i)}$ to $z_{test}^{(i)}$, then run $h_\theta(z)$ on $\{(z_{test}^{(1)}, y_{test}^{(1)}), \cdots, (z_{test}^{(m)}, y_{test}^{(m)})\}$
How about doing the whole thing without using PCA?
Before implementing PCA, first try running whatever you want to do with the original/raw data $x^{(i)}$. Only if that doesn't do what you want should you implement PCA and consider using $z^{(i)}$.