Study notes for Principal Component Analysis

Motivations for Dimensionality Reduction

  • Data Compression
    • Speed up algorithms. With a very high-dimensional feature space (e.g. 10,000 features), a learning algorithm may be too slow to be useful. With PCA, we can reduce the dimensionality and make training tractable. Typically, you can reduce data dimensionality by 5-10x without a major hit to algorithm performance.
    • Reduce disk/memory space used by data
    • Reduce highly correlated (redundant) features, and hence represent the original data more compactly (by transforming from the x space to the z space).
    • Note that the number of training examples is not reduced; only the dimensionality of the feature vector of each training example (i.e., the number of features) is.
    • In practice, we would normally try something like 1000D --> 100D.
  • Visualization
    • Represent data in 2D/3D space. It is hard to visualize high-dimensional data.
    • Visualization helps us understand and interpret our data because we can focus on the two or three main dimensions of variation.

Principal Component Analysis (PCA)

  • It is the most commonly used technique for the dimensionality reduction problem.
  • You should normally do mean normalization and feature scaling on your data before PCA (a sketch is given at the end of this list).
  • PCA aims to find a lower-dimensional space onto which to project the data such that the sum of squared projection errors is minimized.
  • Formally, to reduce from nD to kD, we find k vectors u(1), u(2), ..., u(k) onto which to project the data so as to minimize the projection error. These vectors span the new space.
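  • A minimal Octave sketch of this preprocessing step, assuming the m training examples are stored as the rows of an m x n matrix X (the variable names are illustrative, not from the original notes):

    % Mean normalization and feature scaling before running PCA.
    % X is m x n: one training example per row, one feature per column.
    mu = mean(X);                           % 1 x n row vector of feature means
    sd = std(X);                            % 1 x n row vector of feature standard deviations
    X_norm = bsxfun(@minus, X, mu);         % subtract each feature's mean
    X_norm = bsxfun(@rdivide, X_norm, sd);  % divide by each feature's standard deviation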
The PCA Algorithm
  1. Compute the covariance matrix: Sigma = (1/m) * sum_{i=1..m} x(i) * (x(i))'
    This is commonly denoted by Σ (the Greek uppercase sigma) - NOT the summation symbol. It is an nxn matrix, while each example x(i) is an nx1 vector. In matlab, with the m examples stacked as the rows of an m x n matrix X, it is implemented by Sigma = (1/m) * X' * X (see the Octave sketch after this list).
    Q: Why calculate the (expensive) covariance matrix?
    • For the covariance matrix, the exact value of each entry is not as important as its sign.
      • A positive value indicates that both dimensions increase or decrease together. E.g. as the number of hours studied increases, the grades in that subject also increase.
      • A negative value indicates that as one increases the other decreases, or vice versa. E.g. active social life vs. performance in computer science.
      • If the covariance is zero, the two dimensions are uncorrelated (they have no linear relationship). E.g. heights of students vs. grades obtained in a subject.
    • Covariance calculations are used to find relationships between dimensions in high-dimensional data sets where visualization is difficult.
  2. Compute the eigenvectors of the covariance matrix: [U, S, V] = svd(Sigma), where U is also an nxn matrix, and it turns out that the columns of U are the u vectors we want. In other words, we take the first k columns of U to form the reduced matrix Ureduce = U(:, 1:k).
  3. Transform x to the z space: z = Ureduce' * x, where z is a k x 1 vector. To recover from z back to x, compute x_approx = Ureduce * z. Note that we lose some information, i.e., not every x can be perfectly recovered from the z space.
  4. How to determine the value of k (= the number of principal components)? Guideline: retain 99% of the variance of the original data. In other words, the ratio of the average squared projection error to the total variation in the data should be at most 0.01:
    [ (1/m) * sum_{i=1..m} ||x(i) - x_approx(i)||^2 ] / [ (1/m) * sum_{i=1..m} ||x(i)||^2 ] <= 0.01
    For implementation, we use the following equivalent check instead:
    sum_{i=1..k} S(i,i) / sum_{i=1..n} S(i,i) >= 0.99
    where the S(i,i) are the diagonal elements of the matrix S returned by svd. Choose the smallest k for which this inequality holds.
    Q: Why choose the first k dimensions, the ones with the greatest variance?
    A: The underlying assumption is that large variances correspond to important dynamics. Hence, principal components with larger associated variances represent interesting structure, while those with lower variances represent noise. In other words, after the feature transformation (from x to z), the most important features in the new space are the first k directions.
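Putting the four steps together, here is a minimal Octave sketch, assuming X is the m x n matrix of mean-normalized (and scaled) training examples, one example per row, and that we want to retain 99% of the variance; the variable names are illustrative. Because each example is a row of X, the per-example projection z = Ureduce' * x becomes the matrix product Z = X * Ureduce:

    [m, n] = size(X);                       % m examples, n features

    % Step 1: covariance matrix (n x n)
    Sigma = (1 / m) * (X' * X);

    % Step 2: eigenvectors via the singular value decomposition
    [U, S, V] = svd(Sigma);                 % columns of U are the principal directions

    % Step 4: pick the smallest k that retains 99% of the variance
    s = diag(S);                            % the diagonal entries S(1,1), ..., S(n,n)
    k = find(cumsum(s) / sum(s) >= 0.99, 1);

    % Step 3: project onto the first k components, then recover an approximation
    Ureduce = U(:, 1:k);                    % n x k
    Z = X * Ureduce;                        % m x k: the data in the z space
    X_approx = Z * Ureduce';                % m x n: lossy reconstruction in the x space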
Related to linear regression?
  • PCA is NOT linear regression. Despite cosmetic similarities, they are very different.
  • Linear regression is a supervised learning algorithm which fits a straight line that minimizes the sum of squared vertical distances between the points and the line. The objective is to find a line that predicts the y values accurately. It works on the training set (x(i), y(i)), where i = 1, ..., m.
  • PCA is an unsupervised learning algorithm which minimizes the orthogonal (shortest) distance from each point to the lower-dimensional subspace, i.e., the projection error. The objective is to reduce the dimensionality of the feature space while retaining as much of the variance of the original data as possible. It works on the feature vectors x(i) = (x1, ..., xn) alone, with no labels y.
Advice for Applying PCA
  • DO NOT use PCA to prevent over-fitting. Although reducing the number of features makes a model somewhat less likely to over-fit, PCA is not a good way to address over-fitting; use regularization instead. One important reason is that PCA throws away some information without looking at the labels, so it cannot tell whether what it discards matters for prediction.
  • Always try your learning algorithm on the original data first. ONLY if you find that it takes too long to train should you use PCA to reduce the dimensionality and speed up the algorithm.
  • PCA is easy enough to add on as a pre-processing step. 