


There are a lot of dimensionality reduction techniques available in machine learning, and dimensionality reduction is one of the most integral parts of the data science field. In this article, I am going to describe one of the most important dimensionality reduction techniques in use today: Principal Component Analysis (PCA).

But before doing that, we need to know what dimensionality reduction is and why it is so important.


What is Dimensionality Reduction?

Dimensionality reduction is a technique used to reduce the number of dimensions of the feature space. For example, if a dataset has 100 features (columns) and you want to keep only 10, a dimensionality reduction technique can achieve this. More generally, it transforms a dataset from an n-dimensional space into an n’-dimensional space, where n’ < n.

Why Dimensionality Reduction?

Dimensionality reduction is important in machine learning for many reasons, but the most important of all is the ‘curse of dimensionality’.


In machine learning, we often add as many features as possible at first to get more accurate results. However, beyond a certain point, the performance of the model decreases as the number of features increases, mainly because of overfitting. This is the ‘curse of dimensionality’, and it is why dimensionality reduction is so crucial in machine learning.



Now let’s come to PCA.


Principal Component Analysis (PCA):

PCA is a dimensionality reduction technique that enables us to identify correlations and patterns in a dataset so that it can be transformed into a new dataset of significantly lower dimensionality while retaining as much of the important information (variance) as possible.


Mathematics Behind PCA:

The mathematics of PCA can be divided into 5 steps.


  1. Standardizing the data

  2. Calculating the covariance matrix

  3. Calculating the eigenvectors and eigenvalues

  4. Computing the principal components

  5. Reducing the dimension of the dataset


Let’s talk about each of these steps separately.


  1. Standardizing the Data:


Standardizing is the process of scaling the data in such a way that all the variables and their values lie within a similar range.


The formula for standardization is shown below:

$$z = \frac{x^i - \mu}{\sigma}$$

where x^i = observation or sample, μ = mean, σ = standard deviation.
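As a rough illustration of the formula above, here is a minimal NumPy sketch. The function name `standardize` and the toy height/weight values are made up for this example; the population standard deviation is used, which is also what scikit-learn's `StandardScaler` does by default.

```python
import numpy as np

def standardize(X):
    """Scale each column of X to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)          # per-feature mean
    sigma = X.std(axis=0)        # per-feature (population) standard deviation
    # Constant columns have sigma == 0; divide by 1 instead so they map to 0.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - mu) / sigma

# Toy data (made up): height in cm and weight in kg.
X = np.array([[150.0, 50.0], [160.0, 60.0], [170.0, 70.0], [180.0, 80.0]])
Z = standardize(X)
print(Z.mean(axis=0), Z.std(axis=0))
```

After this step every column has mean 0 and standard deviation 1, so no feature dominates merely because of its units.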

2. Calculating the covariance matrix:

A covariance matrix expresses how the different variables in the dataset vary together. It is essential to identify highly correlated variables, because they contain biased and redundant information that can hamper the overall performance of the model.

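The covariance of two variables x and y is calculated this way:

$$\mathrm{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x^i - \bar{x})(y^i - \bar{y})$$

where x^i = values of the x variable, x̅ = mean of the x variable, y^i = values of the y variable, ȳ = mean of the y variable, and n = number of observations. This is the sample covariance; some texts divide by n instead of n − 1.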

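If our dataset has more than 2 dimensions, it has more than one covariance measurement. For example, for a dataset with 3 dimensions x, y and z, the covariance matrix looks like this:

$$C = \begin{pmatrix} \mathrm{cov}(x, x) & \mathrm{cov}(x, y) & \mathrm{cov}(x, z) \\ \mathrm{cov}(y, x) & \mathrm{cov}(y, y) & \mathrm{cov}(y, z) \\ \mathrm{cov}(z, x) & \mathrm{cov}(z, y) & \mathrm{cov}(z, z) \end{pmatrix}$$

The matrix is symmetric, since cov(x, y) = cov(y, x), and its diagonal holds the variance of each variable.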
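The covariance matrix can be sketched in a few lines of NumPy. This is only an illustration: `covariance_matrix` is a name chosen here, the data is random, and the n − 1 divisor (sample covariance) is an assumption that matches NumPy's `np.cov` default.

```python
import numpy as np

def covariance_matrix(X):
    """Sample covariance matrix of the columns of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    centered = X - X.mean(axis=0)            # subtract each column's mean
    return centered.T @ centered / (n - 1)   # entry (j, k) is cov(feature j, feature k)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # 200 made-up samples of three variables x, y, z
C = covariance_matrix(X)
print(C.shape)
```

The result should agree with `np.cov(X, rowvar=False)`, which is a convenient way to double-check a hand-written version.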

3. Calculating the eigenvectors and eigenvalues:


Eigenvectors are the vectors whose direction does not change when a linear transformation is applied to them.


Eigenvalues are the scalars by which their respective eigenvectors are scaled under that transformation.


Let A be a square matrix, ν a non-zero vector and λ a scalar satisfying Aν = λν; then λ is called the eigenvalue associated with the eigenvector ν of A.
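The definition Aν = λν is easy to check numerically. The sketch below uses a hypothetical 2×2 matrix chosen for illustration (it is not taken from the article), and `is_eigenpair` is a helper name invented here.

```python
import numpy as np

def is_eigenpair(A, v, lam, tol=1e-9):
    """Return True if A @ v equals lam * v for a non-zero vector v."""
    A = np.asarray(A, dtype=float)
    v = np.asarray(v, dtype=float)
    if not np.any(v):            # the zero vector is never an eigenvector
        return False
    return bool(np.allclose(A @ v, lam * v, atol=tol))

# Hypothetical matrix, chosen only for illustration.
A = np.array([[4.0, -2.0],
              [1.0,  1.0]])

print(is_eigenpair(A, [1, 1], 2))   # A @ [1, 1] = [2, 2] = 2 * [1, 1]
print(is_eigenpair(A, [1, 0], 4))   # A @ [1, 0] = [4, 1], not a multiple of [1, 0]
```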

Now, let’s do some math and find the eigenvector and eigenvalue of a sample matrix.


As you can see in the above calculation, [1,1] is the eigenvector and 2 is the eigenvalue. Now, let’s see how we can find the eigenpairs of a sample matrix A.

Substituting the value of our matrix A into the above formula, we get:


With the eigenvalues found, let’s find the corresponding eigenvectors that satisfy AX = λX.

For the eigenvector with λ = 2:

For the eigenvector with λ = 3:

The above shows how we can calculate the eigenvalues and eigenvectors of a matrix.
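The hand calculation can be mirrored in code. The sketch below assumes a hypothetical 2×2 matrix whose eigenvalues happen to be 2 and 3; it is not necessarily the matrix used in the article. It solves the characteristic equation det(A − λI) = 0 for the eigenvalues and then finds a vector in the null space of A − λI for each one. `eigenvector_for` is a helper name invented here.

```python
import numpy as np

# Hypothetical 2x2 matrix (an assumption for illustration only).
A = np.array([[4.0, -2.0],
              [1.0,  1.0]])

# For a 2x2 matrix, det(A - lambda*I) = lambda^2 - trace(A)*lambda + det(A).
coeffs = [1.0, -np.trace(A), np.linalg.det(A)]
eigenvalues = np.sort(np.roots(coeffs).real)

def eigenvector_for(A, lam):
    """Unit-length solution of (A - lam*I) v = 0, taken from the SVD null space."""
    _, _, vt = np.linalg.svd(A - lam * np.eye(A.shape[0]))
    return vt[-1]   # right singular vector for the smallest singular value

eigenvectors = [eigenvector_for(A, lam) for lam in eigenvalues]
for lam, v in zip(eigenvalues, eigenvectors):
    print(lam, v)
```

In practice `np.linalg.eig(A)` (or `np.linalg.eigh` for symmetric matrices such as a covariance matrix) does both steps at once; the explicit version is only meant to follow the manual derivation.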


4. Computing the principal components:

Once we have computed the eigenvectors and eigenvalues as shown above, all we have to do is sort them in descending order of eigenvalue; the eigenvector with the highest eigenvalue is the most significant and therefore forms the first principal component.
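A minimal sketch of this ordering step, assuming the covariance matrix is already available. The 2×2 matrix and the name `sorted_eigenpairs` are made up for the example.

```python
import numpy as np

def sorted_eigenpairs(cov):
    """Eigenvalues in descending order with matching eigenvectors as columns."""
    values, vectors = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(values)[::-1]        # indices from largest to smallest eigenvalue
    return values[order], vectors[:, order]

cov = np.array([[2.0, 0.8],
                [0.8, 0.6]])                # made-up covariance matrix
values, components = sorted_eigenpairs(cov)
first_pc = components[:, 0]                 # direction of largest variance
print(values, first_pc)
```

`np.linalg.eigh` is used here because a covariance matrix is symmetric; for general matrices `np.linalg.eig` would be the counterpart.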


5. Reducing the dimension of the dataset:

In the last step, we project the original (standardized) dataset onto the selected principal components, which capture the largest and most significant share of the information in the dataset.
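One way to sketch this final step is shown below. The 95% variance threshold for choosing how many components to keep is an assumption made for the example, not something the article prescribes, and the names `components_for_variance` and `project` as well as the random data are invented here.

```python
import numpy as np

def components_for_variance(eigenvalues, threshold=0.95):
    """Smallest k whose top-k eigenvalues explain at least `threshold` of the variance."""
    values = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(values) / values.sum()
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return min(k, len(values))              # guard against rounding at threshold == 1

def project(Z, components, k):
    """Project standardized data Z (samples x features) onto the first k components."""
    return Z @ components[:, :k]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # made-up correlated data
Z = (X - X.mean(axis=0)) / X.std(axis=0)

values, vectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]

k = components_for_variance(values, threshold=0.95)
reduced = project(Z, vectors, k)
print(k, reduced.shape)
```

The variance of the first projected column equals the largest eigenvalue, which is why the eigenvalues can be read as "variance explained" by each component.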


PCA in Python:

Now, let’s assemble all of the above steps into Python code.


import numpy as np
import pandas as pd
from scipy.linalg import eigh
from sklearn.preprocessing import StandardScaler

# Load the MNIST data
d0 = pd.read_csv('./mnist_train.csv')
# Save the labels into a variable l
l = d0['label']
# Drop the label feature and store the pixel data in d
d = d0.drop("label", axis=1)
# Pick the first 15K data points
labels = l.head(15000)
data = d.head(15000)

# Data preprocessing: standardizing the data
standardized_data = StandardScaler().fit_transform(data)

# Find the covariance matrix, which is A^T * A (up to a constant factor
# that does not change the eigenvectors)
sample_data = standardized_data
covar_matrix = np.matmul(sample_data.T, sample_data)

# Find the top two eigenvalues and corresponding eigenvectors for projecting
# onto a 2-dim space (subset_by_index replaces the older eigvals argument)
values, vectors = eigh(covar_matrix, subset_by_index=[782, 783])
# eigh returns eigenvalues in ascending order, so reverse the rows to put
# the eigenvector of the largest eigenvalue first
vectors = vectors.T[::-1]

# Compute the principal components
new_coordinates = np.matmul(vectors, sample_data.T)
# Append the labels to the 2-D projected data
new_coordinates = np.vstack((new_coordinates, labels)).T

# New DataFrame with reduced dimension
dataframe = pd.DataFrame(data=new_coordinates, columns=["1st_principal", "2nd_principal", "label"])

Executing the above code produces a data frame holding the two principal-component coordinates and the label of each data point.
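For comparison, the same projection can be obtained with scikit-learn's `PCA` class. The sketch below is only a cross-check under stated assumptions: it uses scikit-learn's small bundled digits dataset as a stand-in, because the `mnist_train.csv` file from the code above may not be available, and it compares the library result against the manual eigen-decomposition route.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled 8x8 digits, not the MNIST CSV used above.
digits = load_digits()
standardized = StandardScaler().fit_transform(digits.data)

# Library route: svd_solver="full" requests an exact decomposition.
pca = PCA(n_components=2, svd_solver="full")
projected = pca.fit_transform(standardized)

# Manual route, mirroring the steps in this article.
values, vectors = np.linalg.eigh(standardized.T @ standardized)
manual = standardized @ vectors[:, ::-1][:, :2]   # two largest eigenvalues first

# Eigenvectors are only defined up to sign, so compare absolute values.
same = np.allclose(np.abs(projected), np.abs(manual), atol=1e-6)
print(projected.shape, same)
```

`pca.explained_variance_ratio_` additionally reports the share of variance captured by each component, which is useful when deciding how many components to keep.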


Limitations of PCA:

Though PCA works well, it has some drawbacks too. Let’s discuss some of its most significant ones.


Independent variables become less interpretable: After applying PCA to the dataset, your original features turn into principal components. Principal components are linear combinations of your original features, and they are not as readable and interpretable as the original features.

Data standardization is a must before PCA: You must standardize your data before applying PCA; otherwise features measured on larger scales dominate the variance, and PCA will not be able to find the optimal principal components.


Information loss: Although principal components try to cover the maximum variance among the features in a dataset, if we don’t select the number of principal components with care, we may lose some information compared to the original list of features.


Not ideal for visualization: PCA works well for dimensionality reduction; however, it may not perform as expected when it comes to data visualization.
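The standardization drawback listed above can be illustrated with a small experiment. The data here is synthetic and the helper name `first_component` is invented for the example: two correlated features are generated, and one of them is expressed in units a thousand times larger.

```python
import numpy as np

def first_component(X):
    """Direction of largest variance of X (columns are features)."""
    values, vectors = np.linalg.eigh(np.cov(X, rowvar=False))
    return vectors[:, -1]       # eigh sorts ascending, so the last column wins

rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.5, size=500)    # correlated with a
X = np.column_stack([a, 1000.0 * b])       # second feature in much larger units

raw_pc = first_component(X)                # computed without standardization
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_pc = first_component(Z)             # computed after standardization

print(np.abs(raw_pc), np.abs(scaled_pc))
```

Without standardization the first component points almost entirely along the large-unit feature; after standardization both features contribute equally, which is the behavior one would usually want.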



In this article, I have shown you what dimensionality reduction is and why it is so effective in the field of machine learning. I have also given you a succinct introduction to PCA. There are a lot of dimensionality reduction techniques available; which technique to use at which point depends on your model and on your business requirements.

I hope you now have a brief overview of PCA!


