Understand Principal Component Analysis and Implement It From Scratch

Principal component analysis is a technique used for dimensionality reduction. It's widely used for data visualization: it extracts the information in a dataset with n features (i.e. n dimensions) and represents that dataset in a space of lower dimension (say m dimensions, with m < n). So it's basically a technique that allows us to reduce dimensions for better visualization while preserving as much information as possible.

Figure: Learning to implement PCA step by step

Actually, a lot of people have a hard time understanding and interpreting this technique. For this reason, I've decided to break it down with a practical implementation through the following steps:

Dataset description:

For this task we choose the classic breast cancer dataset, for its simplicity, in order to show how PCA is useful for visualization. The breast cancer dataset holds 12 features describing characteristics of the cell nuclei present in an image, along with a label classifying the diagnosed tumor type as malignant (0) or benign (1):

Figure: The first 5 lines of the dataset
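
As a minimal setup sketch (the file name data.csv and the diagnosis column are assumptions on my part, mirroring a typical layout rather than quoting the original notebook):

import numpy as np
import pandas as pd

# load the breast cancer dataset (hypothetical file name and column layout)
df = pd.read_csv("data.csv")
print(df.head())                             # the first 5 lines shown above
X = df.drop(columns=["diagnosis"]).values    # the 12 feature columns
y = df["diagnosis"].values                   # 0 = malignant, 1 = benign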

The samples labeled 0 represent 37% of the dataset, while the samples labeled 1 represent more than 62% of the data:

Figure: Class proportions (37% labeled 0, over 62% labeled 1)
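
The class balance can be checked directly (using the hypothetical diagnosis column from above):

# relative frequency of each label
print(df["diagnosis"].value_counts(normalize=True))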

So, the goal of implementing PCA is to reduce the 12 dimensions down to 2, so we can see whether our dataset is visually separated into two classes.

Standardization:

Feature standardization is defined as:

z = (x − μ) / σ, where μ is the mean and σ the standard deviation of the feature.

Standardization is important in that it brings all the features to the same range of values, so we use the formula above to normalize every feature. This is how the data looks after standardization:

Figure: The dataset after standardization
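
A minimal sketch of this step with NumPy, applied to the feature matrix X from earlier:

# standardize each feature: subtract its mean, divide by its standard deviation
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma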

Covariance Matrix:

To understand the covariance matrix, suppose that we have only 5 features a, b, c, d, e:

            a         b         c         d         e
    a    var(a)   cov(a,b)  cov(a,c)  cov(a,d)  cov(a,e)
    b   cov(b,a)   var(b)   cov(b,c)  cov(b,d)  cov(b,e)
    c   cov(c,a)  cov(c,b)   var(c)   cov(c,d)  cov(c,e)
    d   cov(d,a)  cov(d,b)  cov(d,c)   var(d)   cov(d,e)
    e   cov(e,a)  cov(e,b)  cov(e,c)  cov(e,d)   var(e)

The variance values on the diagonal measure the spread of each individual feature:

var(x) = Σ (xᵢ − x̄)² / (n − 1)

The elements off the diagonal are the covariances of each feature with the other 4 features, which measure how two features vary together:

cov(x, y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
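
In NumPy the whole covariance matrix comes from a single call. np.cov expects the features along the rows, hence the transpose of our samples-by-features matrix:

# covariance matrix of the standardized features
covariance_matrix = np.cov(X_std.T)
print(covariance_matrix.shape)   # (12, 12): one row/column per feature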

Singular value decomposition:

We apply singular value decomposition to the covariance matrix to get a matrix of eigenvectors and a diagonal matrix containing the eigenvalue of each eigenvector:

A = U S Vᵀ

Since A is a covariance matrix, it is a symmetric matrix. Thus its eigenvectors are orthogonal, which allows us to obtain new, uncorrelated features. Each eigenvalue represents how much information is retained by the corresponding new feature, which is why we sort the eigenvectors by their eigenvalues in decreasing order and choose the first m components.
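
As a quick side sketch (my illustration, not the article's notebook code): np.linalg.svd already returns the singular values in decreasing order, and for a symmetric covariance matrix they coincide with the eigenvalues, so the information retained by the first components is easy to read off:

# SVD of the symmetric covariance matrix: columns of U are the eigenvectors,
# S holds the eigenvalues, already sorted from largest to smallest
U, S, Vt = np.linalg.svd(covariance_matrix)

# fraction of the total variance (information) retained by each component
explained_variance_ratio = S / S.sum()
print(explained_variance_ratio[:2].sum())   # information kept by the first 2 components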

Now, enough talking, let's move to the code:

Implementation of PCA:

Now that we have an idea of how to get the most important components sorted in decreasing order, we choose the first 2 components, representing a 2D plane, which allows us to retain as much information as possible:

import numpy as np

def compute_pca(X, n_components=2):
    # mean center the data (one mean per feature column)
    X_demeaned = X - X.mean(axis=0)
    # calculate the covariance matrix (features as rows, hence the transpose)
    covariance_matrix = np.cov(X_demeaned.T)
    # calculate eigenvectors & eigenvalues of the covariance matrix
    eigen_vals, eigen_vecs = np.linalg.eigh(covariance_matrix)
    # sort eigenvalues in increasing order (get the indices from the sort)
    idx_sorted = np.argsort(eigen_vals)
    # reverse the order so that it's from highest to lowest
    idx_sorted_decreasing = np.flip(idx_sorted)
    # sort the eigenvalues by idx_sorted_decreasing
    eigen_vals_sorted = eigen_vals[idx_sorted_decreasing]
    # sort the eigenvectors using the idx_sorted_decreasing indices
    eigen_vecs_sorted = eigen_vecs[:, idx_sorted_decreasing]
    # select the first n_components eigenvectors (the desired dimension
    # of the rescaled data array)
    eigen_vecs_subset = eigen_vecs_sorted[:, :n_components]
    # transform the data by multiplying the transpose of the eigenvectors
    # with the transpose of the de-meaned data, then transpose that product
    X_reduced = np.dot(eigen_vecs_subset.T, X_demeaned.T).T
    return X_reduced

In this Python code we've implemented each step explained above in order to compute the first 2 components. Here's the result of projecting the dataset's features onto those 2 components to visualize our dataset:
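
A minimal plotting sketch that would produce the figure below (the matplotlib calls and label names are my own, assuming the X_std and y arrays from earlier):

import matplotlib.pyplot as plt

# project the standardized features onto the first 2 principal components
X_reduced = compute_pca(X_std, n_components=2)

# scatter plot colored by diagnosis (0 = malignant, 1 = benign)
plt.scatter(X_reduced[y == 0, 0], X_reduced[y == 0, 1], label="malignant (0)")
plt.scatter(X_reduced[y == 1, 0], X_reduced[y == 1, 1], label="benign (1)")
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.legend()
plt.show()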

Figure: Scatter plot of the dataset projected onto the first 2 components

As you can see, the data is visually separated into two classes, which means we'd get good results training a basic classification model on this dataset. For the full code, check the notebook on GitHub.

You can also check my last article, Sentiment Analysis From Scratch With Logistic Regression.

Translated from: https://medium.com/swlh/understand-principal-component-analysis-and-implement-it-from-scratch-9b35a12ca0f4
