Dimensionality Reduction - Advice for applying PCA

Summary: This article is the transcript of Lesson 121, "Advice for Applying PCA", from Chapter 15, "Dimensionality Reduction", of Andrew Ng's Machine Learning course. I recorded it while studying the videos and lightly edited it to make it more concise and easier to read, for my own future reference; I am sharing it here as well. If you find any errors, corrections are very welcome and sincerely appreciated. I hope it is also helpful for your studies.
————————————————

In an earlier video, I said that PCA can sometimes be used to speed up the running time of a learning algorithm. In this video, I'd like to explain how to actually do that and give some advice about how to apply PCA.

Here's how you can use PCA to speed up a learning algorithm. Speeding up a supervised learning algorithm is actually the most common use I personally make of PCA. Let's say you have a supervised learning problem:

(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)})

Note this is a supervised learning problem with inputs x and labels y. Let's say that your examples x^{(i)} are very high dimensional, say x^{(i)}\in \mathbb{R}^{10,000}.

One example would be a computer vision problem where you have 100x100 images, and so that's 10,000 pixels. If x^{(i)} are feature vectors that contain the 10,000 pixel intensity values, then you have 10,000-dimensional feature vectors. With very high dimensional feature vectors like this, running a learning algorithm can be slow: if you feed 10,000-dimensional feature vectors into logistic regression, or a neural network, or a support vector machine, or whatever you have, it can make your learning algorithm run slowly. Fortunately, with PCA we'll be able to reduce the dimension of this data and so make our algorithms run more efficiently.

Here's how you do that. We first take our labeled training set, extract just the inputs, and temporarily set aside the y's. This gives us an unlabeled training set: x^{(1)}, x^{(2)},...,x^{(m)}\in \mathbb{R}^{10,000}

Then apply PCA and we'll get reduced dimensional data:

z^{(1)},z^{(2)},...,z^{(m)}\in \mathbb{R}^{1000}
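As a concrete illustration of this reduction step, here is a minimal numpy sketch following the mechanics from the earlier PCA videos (mean normalization, covariance matrix, SVD, projection onto U_{reduce}). The data is random and the dimensions are scaled down from 10,000 → 1,000 so the example runs quickly; the shapes are purely illustrative.

```python
import numpy as np

# Illustrative shapes, scaled down from the lecture's 10,000 -> 1,000 example
m, n, k = 200, 100, 10
X = np.random.rand(m, n)          # stand-in for the unlabeled inputs x^(1), ..., x^(m)

# Mean normalization (and, if needed, feature scaling) before PCA
mu = X.mean(axis=0)
X_norm = X - mu

# Covariance matrix Sigma = (1/m) * X'X, then its singular vectors
Sigma = (X_norm.T @ X_norm) / m
U, S, _ = np.linalg.svd(Sigma)

# Keep the first k columns of U and project each x^(i) onto them
U_reduce = U[:, :k]               # n x k
Z = X_norm @ U_reduce             # m x k: the reduced examples z^(1), ..., z^(m)
```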

That's a 10x savings, and it gives me a new training set:

(z^{(1)}, y^{(1)}), (z^{(2)}, y^{(2)}), ..., (z^{(m)}, y^{(m)})

Finally, I can take this reduced-dimension training set, feed it to a learning algorithm, learn a hypothesis h_{\theta }(z), and use it to make predictions.

For example, if I were using logistic regression, I would train a hypothesis that outputs:

h_{\theta }(z)=\frac{1}{1+e^{-\theta ^{T}z}}

If you have a new test example x, you would take this x, map it to the corresponding z using PCA, and feed z to this hypothesis; the hypothesis then makes a prediction for your input x.
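The lecture doesn't tie this pipeline to any particular library, but as one possible rendering, here is a sketch using scikit-learn's PCA and LogisticRegression on synthetic stand-in data; the shapes and the n_components value are illustrative, not from the lecture.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (scaled down from the lecture's 10,000-dimensional example)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 200))
y_train = rng.integers(0, 2, size=500)

# Step 1: run PCA on the training inputs only (labels set aside)
pca = PCA(n_components=50)
Z_train = pca.fit_transform(X_train)       # z^(i) for each training example

# Step 2: train the hypothesis h_theta(z) on the reduced data
clf = LogisticRegression(max_iter=1000)
clf.fit(Z_train, y_train)

# Step 3: for a new example x, map it to z with the SAME fitted PCA, then predict
x_new = rng.normal(size=(1, 200))
z_new = pca.transform(x_new)
print(clf.predict(z_new))
```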

One final note: what PCA does is define a mapping x^{(i)}\rightarrow z^{(i)}. This mapping should be defined by running PCA only on the training set. Having found U_{reduce}, as well as the parameters for feature scaling and mean normalization, you can then apply the same mapping to your cross-validation set and your test set.
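A short sketch of that rule, again with scikit-learn and synthetic placeholder splits: the mapping is fit on the training set only and then re-used, unchanged, on the cross-validation and test sets.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 100))   # placeholders for the real splits
X_cv    = rng.normal(size=(100, 100))
X_test  = rng.normal(size=(100, 100))

# Fit the mapping x -> z (the mean used for normalization plus U_reduce)
# on the training set ONLY.
pca = PCA(n_components=20).fit(X_train)

# Re-use the same fitted mapping on the cross-validation and test sets;
# never re-fit PCA on those sets.
Z_train = pca.transform(X_train)
Z_cv    = pca.transform(X_cv)
Z_test  = pca.transform(X_test)
```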

By the way, in this example I talked about reducing the data from 10,000 dimensions to 1,000 dimensions. This is actually not that unrealistic. For many problems we can reduce the dimensionality of the data by 5x or even 10x and still retain most of the variance, and we can do this while barely affecting performance: in terms of, say, classification accuracy, barely affecting the classification accuracy of the learning algorithm. And by reducing the dimension, our learning algorithm can run much, much faster.

To summarize, we've talked about the following applications of PCA:

  • Compression. We might do this to reduce the memory/disk space needed to store data, or to speed up a learning algorithm. In these applications, we often choose k by figuring out what percentage of the variance is retained. For the learning-algorithm speed-up, we'll often retain 99% of the variance; that's a very typical choice for how to choose k (see the sketch after this list).
  • Visualization. We'll usually choose k=2 or k=3 because we can only plot 2D and 3D data sets.
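As a sketch of the 99%-variance rule for choosing k mentioned above: scikit-learn lets you pass the target fraction of variance directly, or you can compute k yourself from the cumulative explained variance ratio. The data here is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X_train = rng.normal(size=(500, 100))

# Ask PCA for the smallest k that retains 99% of the variance.
pca = PCA(n_components=0.99, svd_solver="full").fit(X_train)
print("chosen k:", pca.n_components_)

# Equivalent check by hand: cumulative explained variance ratio.
full = PCA().fit(X_train)
k = np.searchsorted(np.cumsum(full.explained_variance_ratio_), 0.99) + 1
print("k retaining 99% of variance:", k)
```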

I should mention one frequent misuse of PCA. You sometimes hear about people doing this, hopefully not too often; I just want to mention it so that you know not to do it. The bad use of PCA is trying to use it to prevent over-fitting. It's not that the method works badly: if you use it to reduce the dimensionality of the data to try to prevent over-fitting, it might actually work OK. But it's just not a good way to address over-fitting; if you're worried about over-fitting, a much better way to address it is to use regularization. The reason is that PCA doesn't use the labels y; it only looks at the inputs x^{(i)} to find a lower-dimensional approximation of the data, so PCA actually throws away some information. It turns out that even if you're retaining 99% or 95% of the variance with PCA, regularization will often give you at least as good a method for preventing over-fitting, and it often works better. The reason is that when you apply linear regression or logistic regression or some other method with regularization, the minimization problem actually knows what the values of y are, and so it is less likely to throw away valuable information.
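To make that recommendation concrete, here is a minimal sketch of the alternative the lecture recommends: regularized logistic regression trained on the original features x^{(i)}. The data and the regularization strength are illustrative; in scikit-learn, the parameter C is the inverse of the regularization strength.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 500))   # high-dimensional raw inputs x^(i)
y_train = rng.integers(0, 2, size=200)

# Address over-fitting with regularization on the ORIGINAL features,
# rather than by throwing information away with PCA.
# Smaller C means a stronger L2 penalty on the parameters theta.
clf = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
clf.fit(X_train, y_train)
```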

Finally, one last misuse of PCA: people sometimes use PCA where it shouldn't be used. When people design an ML system, they may write down a plan like this:

  • Get the training set (x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})
  • Run PCA to reduce x^{(i)} in dimension, getting z^{(i)}
  • Train logistic regression on (z^{(1)}, y^{(1)}), ..., (z^{(m)}, y^{(m)})
  • Test on the test set: map each x_{test}^{(i)} to z_{test}^{(i)} and run h_{\theta}(z) there

Actually, before writing out an ML plan like this, one very good question to ask is: what if we were to do the whole thing without using PCA? What I often advise people is: before implementing PCA, first consider doing it with your original raw data x^{(i)}, and only if that doesn't do what you want, then implement PCA and consider using z^{(i)}. That is, implement PCA only if you have a reason to believe the raw data won't work: only if your learning algorithm ends up running too slowly, or only if the memory or disk-space requirement is too large and you therefore want to compress your representation.

<end>
