If you have ever taken an online course on machine learning, you have almost certainly come across Principal Component Analysis (PCA) for dimensionality reduction or, in simple terms, for compressing data. I took such courses too, but I never really understood the graphical significance of PCA because all I saw were matrices and equations. It took me quite a lot of time to piece the concept together from various sources, so I decided to compile it all in one place.

In this article, we will take a visual (graphical) approach to understanding PCA and how it can be used to compress data. Basic knowledge of linear algebra and matrices is assumed. If you are new to these concepts, just follow along; I have tried my best to keep this as simple as possible.

Introduction

These days, datasets with a large number of dimensions are increasingly common and are often difficult to interpret. One example is a database of face photographs of, let’s say, 1,000,000 people. If each photograph is 100x100 pixels, then the data for each face is 10,000-dimensional (there are 100x100 = 10,000 values to store for each face). If 1 byte is required to store each pixel, then 10,000 bytes are required to store 1 face. Since there are 1,000,000 faces in the database, 10,000 x 1,000,000 = 10,000,000,000 bytes = 10 GB will be needed to store the dataset.
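The storage arithmetic above can be checked with a few lines of Python. This is only a sanity check of the numbers in the text; the variable names are mine, and "GB" here means decimal gigabytes (10^9 bytes).

```python
# Storage needed for the face database described in the text.
height, width = 100, 100      # pixels per photograph
bytes_per_pixel = 1           # 1 byte per pixel, as assumed in the text
num_faces = 1_000_000         # photographs in the database

values_per_face = height * width                     # dimensions per face
bytes_per_face = values_per_face * bytes_per_pixel   # bytes per face
total_bytes = bytes_per_face * num_faces             # bytes for the whole dataset

print(values_per_face)           # 10000
print(total_bytes / 1e9, "GB")   # 10.0 GB
```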

Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets by exploiting the fact that the images in them have something in common. For instance, in a dataset of face photographs, every photograph contains facial features such as eyes, a nose and a mouth. Instead of encoding this information pixel by pixel, we could make a template for each type of feature and then combine these templates to generate any face in the dataset. In this approach, each template will still be 100x100 = 10,000-dimensional, but since we reuse the same templates (basis functions) to generate every face in the dataset, the number of templates required will be very small. PCA does essentially this.

How does PCA work?

This part is going to be a bit technical, so bear with me! I will try to explain how PCA works with a simple example. Let’s consider the data shown below, containing 100 points, each 2-dimensional (x and y coordinates are needed to represent each point).

[Figure: the example dataset of 100 two-dimensional points. Image by author.]

Currently, we are using 2 values to represent each point. Let’s describe this situation in a more technical way. We are currently using 2 basis functions, x as (1, 0) and y as (0, 1). Each point in the dataset is represented as a weighted sum of these basis functions. For instance, the point (2, 3) can be represented as 2(1, 0) + 3(0, 1) = (2, 3). If we omit either of these basis functions, we will not be able to represent the points in the dataset accurately. Therefore, both dimensions are necessary, and we can’t just drop one of them to reduce the storage requirement. This set of basis functions is simply the Cartesian coordinate system in 2 dimensions.
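Here is the same weighted-sum idea as a tiny NumPy sketch. The variable names are mine and purely illustrative.

```python
import numpy as np

# The two Cartesian basis functions from the text.
x_basis = np.array([1, 0])
y_basis = np.array([0, 1])

# The point (2, 3) written as a weighted sum of the basis functions.
point = 2 * x_basis + 3 * y_basis
print(point)    # [2 3]

# Omitting the y basis function loses the second coordinate entirely.
without_y = 2 * x_basis
print(without_y)    # [2 0]
```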

If we look closely, we can see that the data lies approximately along a line, shown in red below.


[Figure: the same data with a red line drawn through the points. Image by author.]

Now, let’s rotate the coordinate system so that the x-axis lies along the red line. The y-axis (green line) will then be perpendicular to this red line. Let’s call these new x and y axes the a-axis and the b-axis respectively. This is shown below.

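[Figure: the rotated coordinate system, with the a-axis (red) along the data and the b-axis (green) perpendicular to it. Image by author.]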

Now, if we use a and b as the new set of basis functions (instead of x and y) for this dataset, it wouldn’t be wrong to say that most of the variance in the dataset is along the a-axis. If we drop the b-axis, we can still represent the points in the dataset very accurately using just the a-axis. Therefore, we now need only half as much storage to store the dataset and reconstruct it accurately. This is exactly how PCA works.
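To make the rotation concrete, here is a minimal sketch. It does not use the article's dataset: I generate synthetic points around a line at an assumed angle of 30 degrees, with an assumed spread of 3.0 along the line and 0.2 across it, and then read off the coordinates along the rotated axes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: 100 points spread widely along a line at 30 degrees
# and only slightly in the perpendicular direction.
theta = np.deg2rad(30)
a_true = rng.normal(0.0, 3.0, 100)   # position along the line
b_true = rng.normal(0.0, 0.2, 100)   # small offset from the line

# Unit vectors of the rotated axes, expressed in x-y coordinates.
a_axis = np.array([np.cos(theta), np.sin(theta)])
b_axis = np.array([-np.sin(theta), np.cos(theta)])

# The points in ordinary x-y coordinates, shape (100, 2).
points = np.outer(a_true, a_axis) + np.outer(b_true, b_axis)

# Coordinates of every point in the rotated (a, b) system.
a_coord = points @ a_axis
b_coord = points @ b_axis
print(a_coord.var(), b_coord.var())   # variance along a is far larger than along b

# Keep only the a coordinate (one number per point) and rebuild x-y points from it.
approx = np.outer(a_coord, a_axis)
print(np.abs(points - approx).max())  # largest coordinate error after dropping b
```

Each point is now stored as a single number, and the error that remains is exactly the part of the point that lay along the b-axis.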

PCA is a four-step process. Starting with a dataset containing n dimensions (requiring n axes to represent it):

  • Find a new set of basis functions (n axes) such that a few axes account for most of the variance in the dataset while the others account for very little.

  • Arrange these axes in decreasing order of variance contribution.

  • Pick the top k axes to keep and drop the remaining n-k axes.

  • Project the dataset onto these k axes.


After these 4 steps, the dataset will be compressed from n-dimensions to just k-dimensions (k<n).
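The four steps can be sketched as one small NumPy function. Treat this as a rough sketch rather than the notebook's code: the function name `pca_compress` is mine, I assume the new axes come from the eigenvectors of the covariance matrix, and I centre the data before projecting, which is common practice but not something stated above. The random test data is only there to show the shapes.

```python
import numpy as np

def pca_compress(data, k):
    """Compress data of shape (n_samples, n_dims) down to k dimensions."""
    # Centre the data (an assumption of this sketch).
    mean = data.mean(axis=0)
    centered = data - mean

    # Step 1: find the new axes as eigenvectors of the covariance matrix.
    cov = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Step 2: arrange the axes in decreasing order of variance contribution.
    order = np.argsort(eigenvalues)[::-1]
    eigenvectors = eigenvectors[:, order]

    # Step 3: keep the top k axes and drop the remaining n-k.
    axes = eigenvectors[:, :k]

    # Step 4: project the dataset onto these k axes.
    compressed = centered @ axes
    return compressed, axes, mean

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))          # made-up data: 100 samples, n = 5
compressed, axes, mean = pca_compress(data, k=2)
print(compressed.shape)   # (100, 2)
```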

Steps

For the sake of simplicity, let’s take the above dataset and apply PCA to it. The steps are somewhat technical, and basic knowledge of linear algebra is assumed. You can view the Colab Notebook here:

Step 1

Since this is a 2-dimensional dataset, n=2. The first step is to find the new set of basis functions (a & b). In the explanation above, we saw that the dataset had the maximum variance along a line, and we manually chose that line as the a-axis and the line perpendicular to it as the b-axis. In practice, we want this step to be automated.

To accomplish this, we can find the eigenvalues and eigenvectors of the covariance matrix of the dataset. Since the dataset is 2-dimensional, we will get 2 eigenvalues and their corresponding eigenvectors. The 2 eigenvectors are the two basis functions (new axes), and the two eigenvalues tell us the variance contribution of the corresponding eigenvectors. A larger eigenvalue means that the corresponding eigenvector (axis) contributes more to the total variance of the dataset.
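A minimal sketch of this step with NumPy is below. The data is synthetic (I made up points scattered around the line y = 0.5x), so the numbers will not match the article's figures. I use `np.linalg.eigh` because a covariance matrix is symmetric; it returns the eigenvalues in ascending order and the eigenvectors as columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-dimensional dataset: 100 points scattered around the line y = 0.5x.
t = rng.normal(0.0, 3.0, 100)
data = np.column_stack([t, 0.5 * t + rng.normal(0.0, 0.2, 100)])   # shape (100, 2)

# 2x2 covariance matrix; rowvar=False treats each column as one variable.
cov = np.cov(data, rowvar=False)

# Eigenvalues (ascending) and eigenvectors (one per column) of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

print(eigenvalues)    # variance along each new axis
print(eigenvectors)   # column i is the axis belonging to eigenvalues[i]
```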

[Figure for Step 1. Image by author.]

Step 2

Now, sort the eigenvectors (axes) in order of decreasing eigenvalue. Here, we can see that the eigenvalue for the a-axis is much larger than that of the b-axis, meaning that the a-axis contributes far more to the dataset variance.
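Sorting can be sketched with `np.argsort`. The covariance matrix below is made up for illustration and is not the one from the article's dataset.

```python
import numpy as np

# Made-up symmetric covariance matrix for a 2-dimensional dataset.
cov = np.array([[9.0, 4.5],
                [4.5, 2.3]])

# eigh returns eigenvalues in ascending order, eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Indices that put the largest eigenvalue first.
order = np.argsort(eigenvalues)[::-1]

# Reorder the eigenvalues and the matching eigenvector columns together.
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

a_axis = eigenvectors[:, 0]   # axis with the largest variance contribution
b_axis = eigenvectors[:, 1]   # axis with the smallest variance contribution
print(eigenvalues)
```

The important detail is that the eigenvector columns are reordered with the same index array as the eigenvalues, so each axis stays paired with its own eigenvalue.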

[Figure for Step 2. Image by author.]

The percentage contribution of each axis towards the total dataset variance can be calculated as:
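In symbols, this is the standard explained-variance ratio. The notation is mine: I write λ_i for the eigenvalue of the i-th axis and n for the number of axes.

$$
\text{contribution of axis } i \ (\%) = \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j} \times 100
$$

For the 2-dimensional example, n = 2, so the contributions of the a-axis and the b-axis add up to 100%.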


[Figure: percentage variance contribution of each axis. Image by author.]

The above numbers show that the a-axis contributes about 99.7% of the dataset variance, so we can drop the b-axis and lose only about 0.28% of the variance.

Step 3

Now, we will drop the b-axis and keep only the a-axis.

[Figure for Step 3. Image by author.]

Step 4

Now, reshape the first eigenvector (a-axis) into a 2x1 matrix, called the projection matrix. It will be used to project the original dataset of shape (100, 2) onto the new basis function (a-axis), thus compressing it to (100, 1).
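The projection can be sketched as a single matrix product. As before, the data here is synthetic, and centring the data before projecting is my assumption, not something the text specifies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset of shape (100, 2), scattered around the line y = 0.5x.
t = rng.normal(0.0, 3.0, 100)
data = np.column_stack([t, 0.5 * t + rng.normal(0.0, 0.2, 100)])

# Centre the data (an assumption of this sketch).
mean = data.mean(axis=0)
centered = data - mean

# The a-axis is the eigenvector with the largest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
a_axis = eigenvectors[:, np.argmax(eigenvalues)]

# Reshape the a-axis into a 2x1 projection matrix.
projection_matrix = a_axis.reshape(2, 1)

# (100, 2) @ (2, 1) -> (100, 1): one number per point instead of two.
compressed = centered @ projection_matrix
print(compressed.shape)   # (100, 1)
```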

Reconstruct the data

Now, we can use the projection matrix to expand the data back to its original size, with, of course, a small loss of variance (0.28%).


The reconstructed data is shown below:


[Figure: the reconstructed data. Image by author.]

Note that the variance along the b-axis (0.28%) is lost, as is evident from the above figure.
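Reconstruction can be sketched by multiplying the compressed data by the transpose of the projection matrix. Again, this uses synthetic data and my own assumption that the data was centred first (so the mean is added back at the end); the lost fraction it prints will not be the article's 0.28%.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset of shape (100, 2), scattered around the line y = 0.5x.
t = rng.normal(0.0, 3.0, 100)
data = np.column_stack([t, 0.5 * t + rng.normal(0.0, 0.2, 100)])

# Compress to one dimension along the a-axis (centring is an assumption).
mean = data.mean(axis=0)
centered = data - mean
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
projection_matrix = eigenvectors[:, np.argmax(eigenvalues)].reshape(2, 1)
compressed = centered @ projection_matrix                  # shape (100, 1)

# Expand back: (100, 1) @ (1, 2) -> (100, 2), then restore the mean.
reconstructed = compressed @ projection_matrix.T + mean

# Fraction of the total variance that the reconstruction fails to recover.
lost = ((data - reconstructed) ** 2).sum() / (centered ** 2).sum()
print(reconstructed.shape)   # (100, 2)
print(lost)                  # equals the b-axis share of the eigenvalues
```

All reconstructed points lie exactly on the a-axis line, which is why the spread along the b-axis disappears.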


That’s all folks!

If you made it this far, hats off to you! In this article, we took a graphical approach to understanding how Principal Component Analysis works and how it can be used for data compression. In my next article, I will show how PCA can be used to compress Labeled Faces in the Wild (LFW), a large-scale dataset consisting of 13,233 human-face images.

If you have any suggestions, please leave a comment. I write articles regularly so you should consider following me to get more such articles in your feed.

Translated from: https://towardsdatascience.com/principal-component-analysis-visualized-17701e18f2fa

