Python PCA Tutorial

Principal Component Analysis (PCA) is an important method for dimensionality reduction and data cleaning. I have used PCA in the past on this blog for estimating the latent variables that underlie player statistics. For example, I might have two features: average number of offensive rebounds and average number of defensive rebounds. The two features are highly correlated because a latent variable, the player’s rebounding ability, explains common variance in the two features. PCA is a method for extracting these latent variables that explain common variance across features.


In this tutorial I generate fake data in order to help gain insight into the mechanics underlying PCA.


Below I create my first feature by sampling from a normal distribution. I create a second feature by adding a noisy normal distribution to the first feature multiplied by two. Because I generated the data here, I know it's composed of two latent variables, and PCA should be able to identify these latent variables.


I generate the data and plot it below.


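A minimal sketch of that step, assuming NumPy and matplotlib; the sample size, noise scale, seed, and variable names are my assumptions, not the original code:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)  # assumed seed, for reproducibility

# First feature: samples from a standard normal distribution.
feature1 = np.random.normal(0, 1, 200)
# Second feature: the first feature times two, plus normally distributed noise.
feature2 = 2 * feature1 + np.random.normal(0, 1, 200)

# Stack the features into an observations-by-features array.
data = np.column_stack([feature1, feature2])

plt.scatter(data[:, 0], data[:, 1])
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()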

The first step before doing PCA is to normalize the data. This centers each feature (each feature will have a mean of 0) and divides each feature by its standard deviation (changing the standard deviation to 1). Normalizing the data puts all features on the same scale. Having features on the same scale is important because features might be more or less variable because of how they are measured rather than because of the latent variables producing them. For example, in basketball, points are often accumulated in sets of 2s and 3s, while rebounds are accumulated one at a time. The nature of basketball puts points and rebounds on different scales, but this doesn't mean that the latent variables, scoring ability and rebounding ability, are more or less variable.


Below I normalize and plot the data.


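A minimal sketch of the normalization, z-scoring each column of the data array from the snippet above:

# Center each feature at 0 and divide by its standard deviation.
data_std = (data - data.mean(axis=0)) / data.std(axis=0)

plt.scatter(data_std[:, 0], data_std[:, 1])
plt.xlabel('Feature 1 (standardized)')
plt.ylabel('Feature 2 (standardized)')
plt.show()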

After standardizing the data, I need to find the eigenvectors and eigenvalues. The eigenvectors point in the direction of a component and the eigenvalues represent the amount of variance explained by that component. Below, I plot the standardized data along with the eigenvectors, each scaled by its eigenvalue so that a vector's distance from the origin reflects the variance its component explains.


As you can see, the blue eigenvector is longer and points in the direction with the most variability. The purple eigenvector is shorter and points in the direction with less variability.


As expected, one component explains far more variability than the other component (because both my features share variance from a single latent Gaussian distribution).


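A minimal sketch of this step: take the eigendecomposition of the covariance matrix of the standardized data, then draw each eigenvector scaled by its eigenvalue. The plotting details are my reconstruction, not the original code:

# Covariance matrix of the standardized data (np.cov expects variables in rows).
cov_mat = np.cov(data_std.T)

# Eigenvalues and eigenvectors; np.linalg.eig returns eigenvectors as columns.
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

plt.scatter(data_std[:, 0], data_std[:, 1], alpha=0.3)
for val, vec in zip(eig_vals, eig_vecs.T):
    # Draw each eigenvector from the origin, scaled by its eigenvalue.
    plt.plot([0, vec[0] * val], [0, vec[1] * val], linewidth=3)
plt.axis('equal')
plt.show()

# Proportion of variance explained by each component.
print(eig_vals / eig_vals.sum())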

Next I order the eigenvectors according to the magnitude of their eigenvalues. This orders the components so that the components that explain more variability come first. I then transform the data so that it is axis-aligned. This means the first component explains variability on the x-axis and the second component explains variance on the y-axis.


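A minimal sketch of the ordering and transformation, continuing from the arrays above:

# Sort components by eigenvalue, largest first.
order = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[order]
eig_vecs = eig_vecs[:, order]

# Project the standardized data onto the ordered eigenvectors,
# producing axis-aligned components.
transformed = data_std.dot(eig_vecs)

plt.scatter(transformed[:, 0], transformed[:, 1])
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.show()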

Finally, just to make sure the PCA was done correctly, I will call PCA from the sklearn library, run it, and make sure it produces the same results as my analysis.


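A minimal sketch of the check, correlating each hand-computed component with sklearn's (scipy's pearsonr returns a correlation and a p-value):

from scipy.stats import pearsonr
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
sklearn_transformed = pca.fit_transform(data_std)

# Correlate each of my components with the corresponding sklearn component.
for i in range(sklearn_transformed.shape[1]):
    print(pearsonr(transformed[:, i], sklearn_transformed[:, i]))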
(-1.0, 0.0)
(-1.0, 0.0)

Each hand-computed component correlates perfectly with its sklearn counterpart; the correlation of -1.0 rather than +1.0 just reflects the arbitrary sign of an eigenvector.

Translated from: https://www.pybloggers.com/2016/11/pca-tutorial/
