A Brief Overview of PCA and Its Implementation Using numpy

This is my first article. I hope you guys like it!

In this module, you will learn about another type of unsupervised machine learning technique — Principal Component Analysis (PCA). PCA is widely used to simplify high-dimensional datasets to lower-dimensional ones.

Table of Contents:

  1. Motivation
  2. What is Principal Component Analysis?
  3. Building Blocks of PCA
  4. Illustration — working of PCA
  5. PCA — Algorithm
  6. Checking it using scikit-learn pca function
  7. But how?
  8. Shortcomings of PCA
  9. Summary

Motivation:

Situation 1: A logistic regression setting where you have a lot of correlated variables (high multicollinearity). How do you handle this?

— One way would be doing variable selection (step-wise/forward/backward).

— But each time you drop a variable, aren't you losing some information?

There must be a better way of doing this!

Situation 2: You're doing EDA on a dataset with n records and p variables. You want to visualize this dataset.

— You could look at pairwise scatter plots.

— You'll need to look at p*(p-1)/2 plots.

— But even if p = 20, this would mean 190 plots!

Again, there must be a better way of doing this!

What is Principal Component Analysis (PCA)?

PCA is a statistical procedure that converts observations of possibly highly correlated variables into Principal Components, which:

  1. Are weighted linear combinations of the original variables.
  2. Are perpendicular / independent of each other.
  3. Capture the maximum variance of the data and are ordered.

PCA is an unsupervised technique: there is no 'Y' or dependent/response variable.

A very powerful technique, Principal Component Analysis has several use cases:

  1. Dimensionality reduction.
  2. Data visualization and Exploratory Data Analysis.
  3. Creating uncorrelated features/variables that can be an input to a prediction model (see the sketch after this list).
  4. Uncovering latent variables/themes/concepts.
  5. Noise reduction in the dataset.
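
As a small illustration of point 3, the sketch below builds two strongly correlated toy features (made-up data) and shows that the principal-component scores returned by scikit-learn are uncorrelated:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)   # strongly correlated with x1
X = np.column_stack([x1, x2])

scores = PCA().fit_transform(X)              # principal-component scores
print(np.corrcoef(X.T)[0, 1])                # close to 1: original features are correlated
print(np.corrcoef(scores.T)[0, 1])           # about 0: PC scores are uncorrelated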

Building Blocks of PCA:

  1. Basis vectors: If you have a set of vectors such that any other vector in the space can always be represented as a linear combination of that set, then that set is known as the basis vectors for that data space or dimension. For example, the vectors (2,3) and (3,4) can represent any other vector in 2-D as a linear combination of themselves, and hence they form a set of basis vectors for 2-D space (see the sketch after this list).

  2. Basis transformation: the process of converting your information from one basis to another, i.e. representing your data in new columns different from the original ones, often for convenience and efficiency. Watching a video of a 3-D scene on a 2-D screen is an everyday example of basis transformation.

  3. Variance as information: If two variables are very highly correlated, together they don't add much more information than either does individually, so you can drop one of them. Variance = Information!
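
A minimal numpy sketch of the first two building blocks (the target vector v is a made-up example):

import numpy as np

# Basis vectors (2,3) and (3,4): any 2-D vector v can be written as
# a linear combination c1*(2,3) + c2*(3,4).
B = np.column_stack([(2, 3), (3, 4)])   # basis vectors as columns
v = np.array([7, 10])

c = np.linalg.solve(B, v)   # coordinates of v in the (2,3)/(3,4) basis
print(c)                    # [2. 1.] because 2*(2,3) + 1*(3,4) = (7,10)

# Basis transformation back: B @ c expresses those coordinates in the
# standard basis again and recovers the original vector.
print(B @ c)                # [ 7. 10.]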

With this, we have covered the 3 building blocks needed to understand PCA.

Illustration — working of PCA

Illustration — finding the principal components

X1 and X2 are correlated, but not perfectly correlated.

Objective: to find directions/lines on which the projected data has maximum variance. In other words, the variance in the data points should also show up in their projections.

We have several (infinite, actually) options here.

We saw that a purely horizontal or vertical axis will not suffice, as neither captures variation in both directions. We therefore need a line that is angled.

One such line is the line that is closest to the data.

  • Projections onto this line will retain the maximum variation in the original data points.
  • Note: in fact, PCA can also be considered as finding the lines/planes/surfaces closest to the data points.

This line is our first Principal Component!

We still have some variance left over, in the direction perpendicular to our first PC. This is our 2nd principal component!
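
The idea of scanning directions for maximum projected variance can be sketched directly in numpy (using the same toy dataset as in the algorithm section below; the 1-degree grid of candidate directions is an arbitrary choice):

import numpy as np

X = np.array([[0, 0], [1, 2], [2, 3], [3, 6], [4, 8], [5, 9]], dtype=float)
Xc = X - X.mean(axis=0)                  # centre the data first

# Candidate directions: unit vectors at angles 0..179 degrees
angles = np.deg2rad(np.arange(0, 180))
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# Variance of the data projected onto each candidate direction
proj_var = (Xc @ directions.T).var(axis=0)

best = directions[proj_var.argmax()]
print(best)   # approximately the direction of the first principal component found below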

PCA — Algorithm

In this segment, you'll learn about the algorithm through which PCA works. Originally, PCA used the eigendecomposition route to find the principal components. Much faster algorithms such as SVD have since come up and are predominantly used nowadays. One thing to note here is that SVD is actually a generalization of eigendecomposition, so the two share some key similarities.

The steps involved in the eigendecomposition algorithm are as follows:

  1. From the original matrix that you have, compute its covariance matrix C.
  2. After computing the covariance matrix, do the eigendecomposition and find its eigenvalues and eigenvectors.
  3. Sort the eigenvectors on the basis of the eigenvalues.
  4. These eigenvectors are the principal components of the original matrix.
  5. The eigenvalues denote the amount of variance explained by the eigenvectors. The higher the eigenvalue, the higher the variance explained by the corresponding eigenvector.
  6. These eigenvectors are orthonormal, i.e. they are unit vectors and are perpendicular to each other.

Step 1: Initializing array

import pandas as pd
import numpy as np
# Let's take this dataset
a = [[0,0],[1,2],[2,3],[3,6],[4,8],[5,9]]
b = ['X','Y']
dat = pd.DataFrame(a,columns = b)
dat

Step 2: Calculating covariance, eigenvalues and eigenvectors

# Let's create the covariance matrix here.
# An intuitive reason as to why we're doing this is to capture the variance of the entire dataset.
# Note: np.cov treats each row as a variable, hence the transpose dat.T.
C = np.cov(dat.T)
eigenvalues, eigenvectors = np.linalg.eig(C)

Step 3: Sorting values

# Let's sort them now
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:,idx]

# Let's check them again
eigenvalues
>> array([16.11868923, 0.04797743])
eigenvectors
>> array([[-0.46346747, -0.88611393],
          [-0.88611393,  0.46346747]])

Step 4: These eigenvectors are the principal components of the original matrix

Note: the columns of the eigenvector matrix are to be compared with the rows of the pca.components_ matrix (from the scikit-learn check below). Also, the direction of the second axis is reversed; this doesn't make a difference, because even though the two vectors are anti-parallel, they represent the same 2-D space. For example, X/Y and X/-Y both cover the entire 2-D plane.
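
As a concrete check, compare the transposed eigenvector matrix from Step 3 with pca.components_ from the scikit-learn fit in the next section:

eigenvectors.T
>> array([[-0.46346747, -0.88611393],
          [-0.88611393,  0.46346747]])
pca.components_
>> array([[-0.46346747, -0.88611393],
          [ 0.88611393, -0.46346747]])

The first rows match exactly; the second rows differ only in sign (anti-parallel), so they span the same axis.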

Step 5: Check pca.explained_variance_ratio_ (from the scikit-learn fit below); it is simply the eigenvalues normalized by their sum.
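
As a quick check, normalizing the eigenvalues from Step 3 by their sum reproduces the ratios reported by scikit-learn in the next section:

eigenvalues / eigenvalues.sum()
>> array([0.99703232, 0.00296768])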

Step 6: The dot product of the principal components is zero, i.e. they are perpendicular to each other (orthonormal).

np.dot(pca.components_[0],pca.components_[1])
>> 0.0

Checking it using scikit-learn pca function

from sklearn.decomposition import PCA
pca = PCA(random_state=42)
pca.fit(dat)

# Let's check the components
pca.components_
>> array([[-0.46346747, -0.88611393],
          [ 0.88611393, -0.46346747]])

# Let's check the variance explained
pca.explained_variance_ratio_
>> array([0.99703232, 0.00296768])

But how?

Because the Spectral Theorem exists! Because of this theorem, eigendecomposition of the covariance matrix will always:

  1. Yield eigenvectors that are perpendicular to each other.
  2. Have the maximum variances allocated to them in an ordered way, depending on the magnitude of the eigenvalues.

Spectral Theorem Keypoints: The spectral theorem states that:

  1. When you do the eigendecomposition of the covariance matrix, the corresponding eigenvectors are the principal components of the original matrix.
  2. These eigenvectors are orthonormal to each other, and hence they satisfy the property that PCs need to be perpendicular to each other (a quick numerical check follows below).
  3. They also come in an ordered fashion: the eigenvalues dictate the variance explained by each principal component, so ordering the matrix according to the eigenvalues gives us the resultant principal component matrix.
  4. These eigenvectors are also linear combinations of the original variables.
[Image: Working of PCA]
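
A quick numerical check of points 2 and 3, reusing the eigenvalues and eigenvectors computed in the algorithm section above (a small sketch, not part of the original walkthrough):

import numpy as np

# Orthonormal: V.T @ V should be (numerically) the 2x2 identity matrix
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(2)))   # True

# Ordered: after the argsort step, the eigenvalues are non-increasing
print(eigenvalues)   # [16.11868923  0.04797743]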

Shortcomings of PCA:

Below are some important shortcomings of PCA:

  • PCA is limited to linearity, though we can use non-linear techniques such as t-SNE as well (you can read more about t-SNE in the optional reading material below).
  • PCA needs the components to be perpendicular, though in some cases that may not be the best solution. The alternative technique is to use Independent Component Analysis.
  • PCA assumes that columns with low variance are not useful, which might not be true in prediction setups (especially classification problems with class imbalance).

Summary:

Those were some important points to remember while using PCA. To summarize:

  1. Most software packages use SVD to compute the components and assume that the data is scaled and centered, so it is important to do standardization/normalization (a short sketch follows this list).

  2. PCA is a linear transformation method and works well in tandem with linear models such as linear regression, logistic regression, etc., though it can also be used for computational efficiency with non-linear models.

  3. It should not be used forcefully to reduce dimensionality (when the features are not correlated).
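
To make point 1 concrete, here is a minimal sketch of the standardize-then-PCA workflow on the same toy data, using scikit-learn's StandardScaler (the scaling step is an addition for illustration; the article itself fits PCA on the raw values):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.array([[0, 0], [1, 2], [2, 3], [3, 6], [4, 8], [5, 9]], dtype=float)

# Standardize to zero mean and unit variance so that no column dominates
# purely because of its scale, then fit PCA on the scaled data.
X_std = StandardScaler().fit_transform(X)
pca = PCA(random_state=42)
scores = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)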

External links:

Translated from: https://medium.com/@dhirajmishra57/brief-overview-of-pca-and-implementation-of-same-using-numpy-d864425e4a56
