VC: SNE and t-SNE, Who Is My Neighbor?

Many dimensionality reduction techniques attempt to preserve the pairwise distances of the original data. For visualization, however, it can be more useful to preserve the nearest neighbours instead. t-SNE [van der Maaten/Hinton 2008] abstracts away density and distance information; because it preserves the neighbours, it often reveals cluster structure more clearly than other dimensionality reduction techniques. t-SNE is popular in many applications, including the life sciences.

Comparison

We will apply several dimensionality reduction techniques to the same task: clustering the MNIST handwritten-digits dataset for visualization.

[Figure: Left: PCA, Right: ISOMAP (scikit-learn, Manifold learning on handwritten digits)]
[Figure: Left: MDS, Right: t-SNE (scikit-learn, Manifold learning on handwritten digits)]

These images come from the official scikit-learn documentation, where you can find more results. As you can see, t-SNE performs overwhelmingly well. It also shows the semantics of distances: the small cluster of 1s drawn with an underbar at the bottom sits closer to the 2 cluster than to the main 1 cluster, because 2s share the same underbar. Now you know why you should learn t-SNE.

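These plots come from scikit-learn's "Manifold learning on handwritten digits" example. As a rough, hedged sketch of how such a comparison can be reproduced (using scikit-learn's small built-in digits dataset as a stand-in for MNIST; the hyperparameters are illustrative choices, not the official example's):

```python
# Sketch: compare PCA, Isomap, MDS, and t-SNE embeddings of handwritten digits.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE, Isomap

X, y = load_digits(return_X_y=True)  # 8x8 digit images flattened to 64 features

models = {
    "PCA": PCA(n_components=2),
    "Isomap": Isomap(n_neighbors=30, n_components=2),
    "MDS": MDS(n_components=2, n_init=1, max_iter=120),
    "t-SNE": TSNE(n_components=2, perplexity=30, init="pca", random_state=0),
}

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, (name, model) in zip(axes, models.items()):
    Y = model.fit_transform(X)
    ax.scatter(Y[:, 0], Y[:, 1], c=y, cmap="tab10", s=5)
    ax.set_title(name)
plt.show()
```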

SNE

Stochastic Neighbor Embedding (SNE) is the basic idea behind t-SNE, so we need to understand SNE first. SNE defines a distance-based conditional probability that x_i (a data point) would pick x_j (another data point) as its neighbor, P(x_j is a neighbor | x_i). We use an exponential of the squared distance (a Gaussian kernel) to calculate these probabilities.

[Figure: Probability calculation of SNE]

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$

It compares the distance to the selected data point against the sum of the distances to all other points, so it assigns a high probability when the two points are close to each other. The variance of the distribution is set per data point: a sparse region gets a high variance (I will elaborate on this later in the post). Now we need to carry this information into low dimensions. We will use the KL divergence, so we need to calculate the corresponding probability in low dimensions.

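A minimal NumPy sketch of this conditional probability (the standard SNE formula with a Gaussian kernel; the function name and structure are my own illustration):

```python
import numpy as np

def conditional_p(X, i, sigma_i):
    """p_{j|i}: probability that x_i would pick each x_j as its neighbor."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)    # squared Euclidean distances to x_i
    p = np.exp(-d2 / (2.0 * sigma_i ** 2))  # Gaussian kernel: closer -> larger
    p[i] = 0.0                              # a point never picks itself
    return p / p.sum()                      # normalize over all k != i
```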

[Figure: Probability in low dimensions]

$$q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)}$$

The probability function in low dimensions has the same form as the high-dimensional one. Its variance is fixed by you rather than fitted; it effectively chooses how much space the visualization uses.

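The low-dimensional counterpart looks the same; here is a sketch that fixes the variance so that 2σ² = 1 (a common convention in SNE, and an assumption of this sketch; it reuses the NumPy import from the sketch above):

```python
def conditional_q(Y, i):
    """q_{j|i}: low-dim neighbor probability, variance fixed at 2*sigma^2 = 1."""
    d2 = np.sum((Y - Y[i]) ** 2, axis=1)  # squared distances in the embedding
    q = np.exp(-d2)
    q[i] = 0.0                            # no self-neighbors here either
    return q / q.sum()
```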

KL divergence is a method for measuring how far one probability distribution is from another; minimizing it pulls the two distributions toward each other. You should know that KL divergence is asymmetric: KL(P‖Q) ≠ KL(Q‖P).

[Figure: Cost function]

$$C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$

We defined the cost function using KL divergence. Let's look at what it means. If two points are close to each other in high-dimensional space, p is high; if q then gets smaller and smaller, the term p log(p/q) blows up. If the points are far from each other in high-dimensional space, p is small; even if q comes out somewhat higher, the cost stays modest. You can immediately observe the asymmetric relationship: the former case is expensive, while the latter is relatively cheap. This asymmetry shapes the behavior of SNE: it keeps neighbors close together, but it does not care much about widely separated points.

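A tiny numeric sketch of this asymmetry (the toy distributions below are made up purely for illustration):

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) = sum_j p_j * log(p_j / q_j)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.9, 0.1])  # high dimensions: x_j is very likely a neighbor
q = np.array([0.5, 0.5])  # embedding spreads the probability out

print(kl(p, q))  # ~0.37
print(kl(q, p))  # ~0.51 -> KL(P||Q) != KL(Q||P)
```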

Select the Variance with Perplexity

Using a fixed σ does not work well when the densities in high-dimensional space vary. Instead of a fixed σ, SNE uses a desired perplexity that the user chooses.

[Figure: Perplexity function with entropy]

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$$

It intuitively reflects the effective number of neighbors; you can think of it as playing a role similar to the k in kNN. You can choose the perplexity freely, but you should not pick a number greater than the number of data points. If you ask when the entropy reaches its maximum, it is when every probability has the same value, and in that case the perplexity is n.

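In practice, σ_i is found for each data point by a binary search until the perplexity of P_i matches the user's target. A sketch building on the conditional_p helper above (the search bounds and iteration count are arbitrary choices):

```python
def sigma_for_perplexity(X, i, target_perplexity, n_iter=50):
    """Binary-search sigma_i so that 2^H(P_i) matches the desired perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = (lo + hi) / 2.0
        p = conditional_p(X, i, sigma)
        nz = p[p > 0]
        entropy = -np.sum(nz * np.log2(nz))    # H(P_i) in bits
        if 2.0 ** entropy > target_perplexity:
            hi = sigma                         # too many effective neighbors: shrink
        else:
            lo = sigma                         # too few: widen the kernel
    return sigma
```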

[Figure: σ changes with the perplexity the user chooses]

You can see how the probability of becoming neighbors changes depending on the selected perplexity. Each of these distributions belongs to a single data point, which sits at zero in the plot. To sum up, the selected perplexity controls how many data points are treated as neighbors; it is computed per data point from that point's probability distribution, whose entropy determines the perplexity.

Optimization

[Figure: Optimization function]

$$\frac{\partial C}{\partial y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)(y_i - y_j)$$

  1. Initialize y (the data points in low dimensions) randomly.
  2. Iteratively shift y along this gradient with a learning rate.

The learning rate controls how fast the optimization proceeds by scaling the step we take along the gradient, and the gradient multiplies the probability mismatch by the displacement between y_i and y_j to emphasize the difference between the probabilities of the two distributions.

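A sketch of one such update, using the standard SNE gradient and reusing the conditional_q helper from above (stacking the high-dimensional conditionals p_{j|i} into a matrix P is an implementation choice of this sketch):

```python
def sne_step(Y, P, learning_rate=10.0):
    """One gradient-descent update of the embedding Y.

    P[i, j] = p_{j|i} is precomputed in high dimensions; Q is recomputed each step.
    Gradient: dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) * (y_i - y_j)
    """
    n = Y.shape[0]
    Q = np.stack([conditional_q(Y, i) for i in range(n)])
    coeff = (P - Q) + (P - Q).T                        # both directions of the mismatch
    diff = Y[:, None, :] - Y[None, :, :]               # diff[i, j] = y_i - y_j
    grad = 2.0 * np.einsum("ij,ijk->ik", coeff, diff)  # sum over j for each point i
    return Y - learning_rate * grad
```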

In practice, we use other tricks to boost the speed or the quality of the result:

  • Decreasing the learning rate over time.
  • Randomly perturbing points to escape poor local optima.
  • Including a momentum term that keeps driving points in the direction they moved previously.
  • Early compression and early exaggeration.

These methods are also widely used in deep learning, so I believe you are familiar with them.

The Crowding Problem

[Figure: A Swiss roll. Do you like the roll cake?]

ISOMAP solves the Swiss roll problem perfectly because the data is intrinsically two-dimensional. But what if the intrinsic dimensionality is greater than two? Think of a sphere in 3D and try to project all of its points down to 2D: many points will be collapsed together. This crowding problem occurs in SNE too, and t-SNE was developed to solve it.

t-SNE

[Figure: The t-distribution used by t-SNE in low dimensions]

The problem occurs because SNE makes the area that represents neighbors too small. t-SNE therefore swaps the low-dimensional distribution for a Student t-distribution (the high-dimensional similarities remain Gaussian). Heavy tails allow points to sit further apart in low dimensions than in high dimensions, and the t-distribution is a well-known heavy-tailed distribution.

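A sketch of this heavy-tailed low-dimensional similarity (a Student-t kernel with one degree of freedom; the joint normalization over all pairs anticipates the symmetric formulation in the next section):

```python
def tsne_q(Y):
    """q_{ij}: joint low-dim similarities under a Student-t (1 d.o.f.) kernel."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    num = 1.0 / (1.0 + d2)      # heavy tails: (1 + ||y_i - y_j||^2)^(-1)
    np.fill_diagonal(num, 0.0)  # no self-similarity
    return num / num.sum()      # normalize over all pairs
```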

Symmetric Cost Function

SNE has an asymmetric cost function, but t-SNE changes it to a symmetric one.

[Figure: Symmetric cost function]

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

It changes p and q from conditional to joint probabilities, which is why the cost is symmetric. p is symmetrized and normalized by the number of points. q is also changed: it no longer uses exponential terms, which makes its computation more convenient.

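A sketch of the symmetrization, assuming the conditionals p_{j|i} have already been stacked into a matrix P as in the earlier sketches:

```python
def symmetric_p(P):
    """p_{ij} = (p_{j|i} + p_{i|j}) / (2n): symmetrized joint probabilities."""
    n = P.shape[0]
    return (P + P.T) / (2.0 * n)
```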

[Figure: New optimization function]

$$\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}$$

The gradient used for optimization changes as well, because the cost function changed.

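A sketch of the corresponding t-SNE gradient step (the learning rate is an arbitrary illustrative value; note how the (1 + ||y_i − y_j||²)⁻¹ factor from the t-kernel reappears in the gradient):

```python
def tsne_step(Y, P, learning_rate=100.0):
    """One t-SNE gradient-descent update with the Student-t gradient."""
    diff = Y[:, None, :] - Y[None, :, :]           # diff[i, j] = y_i - y_j
    inv = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))  # (1 + ||y_i - y_j||^2)^(-1)
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                            # joint low-dim similarities
    grad = 4.0 * np.einsum("ij,ijk->ik", (P - Q) * inv, diff)
    return Y - learning_rate * grad
```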

Computational Cost

A naive implementation requires Θ(n²) work. Can we improve on this? [van der Maaten 2014] gives an answer.

[Figure: Make grids and cluster the points for computation]

The strategy is:

  • Only compute p (in high dimensions) for a fixed number of nearest neighbors: choose the perplexity and multiply it by 3; only the data points within that 3·perplexity neighborhood are considered.
  • Approximate other remote clusters of points with a single value.
  • For computing q, put the embedding points into a spatial acceleration data structure that helps identify groups of points; only the distance between each group's representative point and the target point is calculated, weighted by the number of data points in the group (see the sketch after this list).
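
In practice you rarely implement this yourself: scikit-learn's TSNE uses this Barnes-Hut approximation by default. A small usage sketch (angle trades speed against accuracy; 0.5 below is simply the library default made explicit):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

# method="barnes_hut" is the O(n log n) approximation from [van der Maaten 2014];
# method="exact" falls back to the Theta(n^2) computation.
Y = TSNE(n_components=2, perplexity=30, method="barnes_hut", angle=0.5).fit_transform(X)
```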

Example in Life Science

[Figures: Amir et al., Nature Biotechnology 2013]

They used t-SNE to distinguish cell types from high-dimensional protein expression profiles. They found that some proteins that had been considered important are not actually that important for distinguishing cell types: after removing such a protein and re-running t-SNE, the shape of the map remains similar.

Warning!

This site helps you understand t-SNE through practical experiments.

[Figure: t-SNE does not preserve densities and cluster spreads]

This picture is unsurprising, because t-SNE only cares about neighbors. The densities and spreads of the clusters in the original data are therefore not preserved.

[Figure: The distances between clusters can be different]

Distance is affected by perplexity. A low perplexity means a small σ, so only a small region is treated as the neighborhood. If the perplexity is big enough that a point can see other clusters, then t-SNE can take the distances between them into account.

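A quick way to see this on your own data is to run t-SNE at several perplexities and compare the maps; a hedged sketch:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perp in zip(axes, [2, 30, 100]):
    Y = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    ax.scatter(Y[:, 0], Y[:, 1], c=y, cmap="tab10", s=5)
    ax.set_title(f"perplexity = {perp}")  # cluster distances shift with perplexity
plt.show()
```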

[Figure: Clusters in t-SNE can be a hallucination]

The perplexity-2 case shows many clusters, but they are not actual clusters in the original data: too low a perplexity exaggerates the randomness and produces a hallucination.

Please keep these tips in mind when you use t-SNE.

This post was published on 9/10/2020.

Translated from: https://medium.com/@jeheonpark93/vc-sne-and-t-sne-who-is-my-neighbor-34e738bf9e71
