The Curse of Dimensionality

How do machines ‘see’? Or, in general, how can computers reduce an input of complex, high-dimensional data into a more manageable number of features?

Extend your open hand in front of a nearby light-source, so that it casts a shadow against the nearest surface. Rotate your hand and study how its shadow changes. Note that from some angles it casts a narrow, thin shadow. Yet from other angles, the shadow looks much more recognizably like the shape of a hand.

See if you can find the angle which best projects your hand. Preserve as much of the information about its shape as possible.

Behind all the linear algebra and computational methods, this is what dimensionality reduction seeks to do with high-dimensional data. Through rotation you can find the optimal angle which represents your 3-D hand as a 2-D shadow.

There are statistical techniques which can find the best representation of data in a lower-dimensional space than that in which it was originally provided.

In this article, we will see why this is an often necessary procedure, via a tour of mind-bending geometry and combinatorics. Then, we will examine the code behind a range of useful dimensionality reduction algorithms, step-by-step.

My aim is to make these often difficult concepts more accessible to the general reader — anyone with an interest in how data science and machine learning techniques are fast changing the world as we know it.

Semi-supervised machine learning is a hot topic in the field of data science, and for good reason. Combining the latest theoretical advances with today’s powerful hardware is a recipe for exciting breakthroughs and science-fiction-evoking headlines.

We may attribute some of its appeal to how it approximates our own human experience of learning about the world around us.

The high-level idea is straightforward: given information about a set of labelled “training” data, how can we generalize and make accurate inferences about a set of previously “unseen” data?

Machine learning algorithms are designed to implement this idea. They use a range of different assumptions and input data types, and may be as simple as K-means clustering or as complex as Latent Dirichlet Allocation.

Behind all semi-supervised algorithms though are two key assumptions: continuity and embedding. These relate to the nature of the feature space in which the data are described. Below is a visual representation of data points in a 3-D feature space.

Higher dimensional feature spaces can be thought of as scatter graphs with more axes than we can draw or visualize. The math remains more or less the same!

Continuity is the idea that similar data points such as those which are near to each other in ‘feature space’ are more likely to share the same label. Did you notice in the scatter graph above that nearby points are similarly colored? This assumption is the basis for a set of machine learning algorithms called clustering algorithms.
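
As a rough illustration of the continuity assumption in action, here is a minimal clustering sketch in R, using the built-in iris data purely as an example:

# group the iris measurements into 3 clusters based on proximity in feature space
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# nearby points tend to land in the same cluster, which lines up reasonably well with the species labels
table(km$cluster, iris$Species)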

Embedding is the assumption that although the data may be described in a high-dimensional feature space such as a ‘scatter-graph-with-too-many-axes-to-draw’, the underlying structure of the data is likely much lower-dimensional.

For example, in the scatter graph above we have shown the data in 3-D feature space. But the points fall more or less along a 2-D plane.

Embedding allows us to effectively simplify our data by looking for its underlying structure.

So, about this curse…?

Apart from having both the coolest and scariest sounding name in all data science, the phenomena collectively known as the Curse of Dimensionality also pose real challenges to practitioners in the field.

Although somewhat on the melodramatic side, the name reflects an unavoidable reality of working with high-dimensional data sets: those where each data point is described by many measurements, or ‘features’.

The general theme is simple — the more dimensions you work with, the less effective standard computational and statistical techniques become. This has repercussions that need some serious workarounds when machines are dealing with Big Data. Before we dive into some of these solutions, let’s discuss the challenges raised by high-dimensional data in the first place.

Computational Workload

Working with data becomes more demanding as the number of dimensions increases. Like many challenges in data science, this boils down to combinatorics.

Picture hunting for treasure hidden in one of a set of boxes, with 5 boxes along each dimension. With n = 1, there are only 5 boxes to search. With n = 2, there are now 25 boxes; and with n = 3, there are 125 boxes to search. As n gets bigger, it becomes difficult to sample all the boxes. This makes the treasure harder to find — especially as many of the boxes are likely to be empty!

In general, with n dimensions each allowing for m states, we will have m^n possible combinations. Try plugging in a few different values and you will be convinced that this presents a workload-versus-sampling challenge to machines tasked with repeatedly sampling different combinations of variables.
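
To get a feel for how quickly this blows up, here is a quick sketch in R (the particular values of m and n are arbitrary):

m <- 5                                         # states per dimension
n <- 1:10                                      # number of dimensions
data.frame(dimensions = n, combinations = m^n)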

With high-dimensional data, we simply cannot comprehensively sample all the possible combinations, leaving vast regions of feature space in the dark.

Dimensional Redundancy

We may not even need to subject our machines to such demanding work. Having many dimensions is no guarantee that every dimension is especially useful. A lot of the time, we may be measuring the same underlying pattern in several different ways.

For instance, we could look at data about professional football or soccer players. We may describe each player in six dimensions.

This could be in terms of:

  • number of goals scored
  • number of shots attempted
  • number of chances created
  • number of tackles won
  • number of blocks made
  • number of clearances made

There are six dimensions. Yet you might see that we are actually only describing two underlying qualities — offensive and defensive ability — from a number of angles.

This is an example of the embedding assumption we discussed earlier. High dimensional data often has a much lower-dimensional underlying structure.

In this case, we’d expect to see strong correlations between some of our dimensions. Goals scored and shots attempted are unlikely to be independent of one another. Much of the information in each dimension is already contained in some of the others.
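
We can mimic this redundancy with a quick, made-up simulation in R. The numbers below are invented purely for illustration: a single hidden "attacking ability" drives both goals and shots:

set.seed(1)
attacking <- rnorm(100)                          # hidden underlying quality
goals <- 10 + 3 * attacking + rnorm(100, sd = 1) # two noisy measurements of it
shots <- 40 + 8 * attacking + rnorm(100, sd = 3)
cor(goals, shots)                                # strongly correlated dimensions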

Often high-dimensional data will show such behavior. Many of the dimensions are, in some sense, redundant.

Highly correlated dimensions can harmfully impact other statistical techniques which rely upon assumptions of independence. This could lead to much-dreaded problems such as over-fitting.

Many high-dimensional data sets are actually the results of lower-dimensional generative processes. The classic example is the human voice. It can produce very high-dimensional data from the movement of only a small number of vocal cords.

High-dimensionality can mask the generative processes. These are often what we’re interested in learning more about.

Not only does high-dimensionality pose computational challenges, it often does so without bringing much new information to the show.

And there’s more! Here’s where things start getting bizarre.

Geometric Insanity

Another problem arising from high-dimensional data concerns the effectiveness of different distance metrics, and the statistical techniques which depend upon them.

This is a tricky concept to grasp, because we’re so used to thinking in everyday terms of three spatial dimensions. This can be a bit of a hindrance for us humans.

Geometry starts getting weird in high-dimensional space. Not only hard-to-visualize weird, but more “WTF-is-that?!” weird.

Let’s begin with an example in a more familiar number of dimensions. Say you’re mailing a disc with a diameter of 10cm to a friend who likes discs. You could fit it snugly into a square envelope with sides of 10cm, leaving only the corners unused. What percentage of space in the envelope remains unused?

Well, the envelope has an area of 100cm² inside it, and the disc takes up 78.5398… cm² (recall the area of a circle equals πr²). In other words, the disc takes up ~78.5% of the space available. Less than a quarter remains empty in the four corners.

Now say you’re packaging up a ball which also has a diameter of 10cm, this time into a cube shaped box with sides of 10cm. The box has a total volume of 10³ = 1000cm³, while the ball has a volume of 523.5988… cm³ (the volume of a 3-D sphere can be calculated using 4/3 * πr³). This represents almost 52.4% of the total volume available. In other words, almost half of the box’s volume is empty space in the eight corners.
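
You can verify both percentages in a couple of lines of R:

r <- 5                      # radius in cm
pi * r^2 / 10^2             # disc fills ~78.5% of the square envelope
(4/3) * pi * r^3 / 10^3     # ball fills ~52.4% of the cube-shaped box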

See these examples below:

The sphere fills a smaller proportion of the cube in the 3-D example than the circle fills of the square in the 2-D example. In other words, the middle of the box accounts for relatively less of the total space once we add a dimension. Does this pattern continue in more than three dimensions, when we’re dealing with hyper-spheres and hyper-cubes? Where do we even begin?

Let us think about what a sphere actually is, mathematically speaking. We can define an n-dimensional sphere as the surface formed by rotating a radius of fixed length r about a central point in (n+1)-dimensional space.

In 2-D, this traces out the edge of a circle, which is a 1-D line. In 3-D, this traces out the 2-D surface of an everyday sphere. In 4-D and above, which we cannot easily visualize, this process draws out a hyper-sphere.

It’s harder to picture this concept in higher dimensions, but the pattern which we saw earlier continues. The relative volume of the sphere diminishes.

The generalized formula for the volume of a hyper-sphere with radius r in n dimensions is shown below:

V_n(r) = π^(n/2) * r^n / Γ(n/2 + 1)

Γ is the Gamma function, described here. Technically, we should be calling volume in > 3 dimensions hyper-content.

The volume of a hyper-cube with sides of length 2r in n dimensions is simply (2r)^n. If we extend our sphere-packaging example into higher dimensions, we find the percentage of overall space filled can be found by the general formula:

proportion filled = V_n(r) / (2r)^n = π^(n/2) / (2^n * Γ(n/2 + 1))

We’ve taken the first formula, multiplied by 1 / (2r)^n and then cancelled where r^n appears on both sides of the fraction.

Look at how we have n/2 and n as exponents on the numerator (“top”) and denominator (“bottom”) of that fraction respectively. We can see that as n increases, the denominator will grow quicker than the numerator. This means the fraction gets smaller and smaller. That’s not to mention the fact the denominator also contains a Gamma function featuring n.

The Gamma function is like the factorial function… you know, the one where n! = 1 x 2 x 3 x … x n. The Gamma function also tends to grow really quickly. In fact, Γ(n) = (n-1)!.

This means that as the number of dimensions increases, the denominator grows much faster than the numerator. So the volume of the hyper-sphere decreases towards zero.

In case you don’t much feel like calculating Gamma functions and hyper-volumes in high dimensional space, I’ve made a quick graph:
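
A few lines of R along these lines will produce such a graph, using the ratio formula above:

n <- 1:20
fraction <- pi^(n/2) / (2^n * gamma(n/2 + 1))   # share of the hyper-cube filled by the hyper-sphere
plot(n, fraction, type = "b", pch = 20,
  xlab = "Number of dimensions", ylab = "Proportion of volume filled")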

The volume of the hyper-sphere (relative to the space in which it lives) rapidly plummets towards zero. This has serious repercussions in the world of Big Data.

…Why?

Recall our 2-D and 3-D examples. The empty space corresponded to the “corners” or “outlying regions” of the overall space.

For the 2-D case, our square had 4 corners which were 21.5% of the total space.

In the 3-D case, our cube now had 8 corners which accounted for 47.6% of the total space.

As we move into higher dimensions, we will find even more corners, and these will make up an ever-increasing percentage of the total space.

Now imagine we have data spread across some multidimensional space. The higher the dimensionality, the greater the proportion of our data that will be “flung out” into the corners, and the closer together the minimum and maximum distances between points will become.

In higher dimensions our data are more sparse and more similarly spaced apart. This makes most distance functions less effective.
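
Here is a rough R sketch of that effect, comparing the spread of pairwise distances for random points in 2 and in 100 dimensions (the sample size of 100 points is arbitrary):

set.seed(42)
for (n in c(2, 100)) {
  points <- matrix(runif(100 * n), ncol = n)   # 100 random points in n dimensions
  d <- dist(points)
  cat(n, "dimensions: min/max distance ratio =", round(min(d) / max(d), 3), "\n")
}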

Escaping the Curse!

There are a number of techniques which can project our high-dimensional data into a lower-dimensional space. Recall the analogy of a 3-D object placed in front of a light source, casting a 2-D shadow against a wall.

By reducing the dimensionality of our data, we make three gains:

  • lighter computational workload
  • less dimensional redundancy
  • more effective distance metrics

No wonder dimensionality reduction is so crucial in advanced machine learning applications such as computer vision, NLP and predictive modelling.

We’ll walk through five methods which are commonly applied to high-dimensional data sets. We’ll be restricting ourselves to feature extraction methods. They try to identify new features underlying the original data.

Feature selection methods choose which of the original features are worth keeping. We’ll leave those for a different article!

This is a long read with plenty of worked examples. So open your favorite code editor, put the kettle on, and let’s get started!

Multidimensional Scaling (MDS)

Visual Summary

MDS refers to a family of techniques used to reduce dimensionality. They project the original data into a lower-dimensional space, while preserving the distances between the points as much as possible. This is usually achieved by minimizing a loss-function (often called stress or strain) via an iterative algorithm.

Stress is a function which measures how much of the original distance between points has been lost. If our projection does a good job at retaining the original distances, the returned value will be low.
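
The exact loss varies between flavours of MDS. One common choice, Kruskal’s Stress-1, looks roughly like this as an R function, where d holds the original distances and d_hat the distances after projection:

stress1 <- function(d, d_hat) {
  sqrt(sum((d - d_hat)^2) / sum(d^2))
}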

Worked Example

If you have R installed, whack it open in your IDE of choice. Otherwise, if you want to follow along anyway, check this R-fiddle.

We’ll be looking at CMDS (Classical MDS) in this example. It will give an identical output to PCA (Principal Components Analysis), which we’ll discuss later.

We’ll be making use of two of R’s strengths in this example:

  • working with matrix multiplication
  • the existence of inbuilt data sets

Start with defining our input data:

M <- as.matrix(UScitiesD)

We want to begin with a distance matrix where each element represents the Euclidean distance (think Pythagoras’ Theorem) between our observations. The UScitiesD and eurodist data sets in R are straight-line and road distance matrices between a selection of U.S. and European cities.

With non-distance input data, we would need a preliminary step to calculate the distance matrix first.

M <- as.matrix(dist(raw_data))

With MDS, we seek to find a low-dimensional projection of the data that best preserves the distances between the points. In Classical MDS, we aim to minimize a loss-function called Strain.

Strain is a function that works out how much a given low-dimensional projection distorts the original distances between the points.

With MDS, iterative approaches (for example, via gradient descent) are usually used to edge our way towards an optimal solution. But with CMDS, there’s an algebraic way of getting there.

Time to bring in some linear algebra. If this stuff is new to you, don’t worry — you’ll pick things up with a little practice. A good starting point is to see matrices as blocks of numbers that we can manipulate all at once, and work from there.

Matrices follow certain rules for operations such as addition and multiplication. They can also be broken down, or decomposed, into eigenvalues and corresponding eigenvectors.

Eigen-what now?

A simple way of thinking about all this eigen-stuff is in terms of transformations. Transformations can change both the direction and length of vectors upon which they act.

Shown below, matrix A describes a transformation, which is applied to two vectors by multiplying A x v. The blue vector’s direction of 1 unit across and 3 units up remains unchanged. Only its length changes: here, it doubles. This makes the blue vector an eigenvector of A with an eigenvalue of 2.

The orange vector does change direction when multiplied by A, so it cannot be an eigenvector of A.
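
The exact matrix behind that illustration isn’t important; any matrix with (1, 3) as an eigenvector behaves the same way. Here is one made-up example you can check in R:

A <- matrix(c(-1, -3, 1, 3), nrow = 2)   # a hypothetical transformation
v <- c(1, 3)                             # the 'blue' vector
A %*% v                                  # gives (2, 6): same direction, twice the length
eigen(A)$values                          # confirms 2 is an eigenvalue of A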

Back to CMDS — our first move is to define a centering matrix that lets us double center our input data. In R, we can implement this as below:

n <- nrow(M)
C <- diag(n) - (1/n) * matrix(rep(1, n^2), nrow = n)

We then use R’s support for matrix multiplication %*% to apply the centering matrix to our original data to form a new matrix, B.

B <- -(1/2) * C %*% M %*% C

Nice! Now we can begin building our 2-D projection matrix. To do this, we define two more matrices using the eigenvectors associated with the two largest eigenvalues of matrix B.

Like so:

E <- eigen(B)$vectors[,1:2]
L <- diag(2) * eigen(B)$values[1:2]

Let’s calculate our 2-D output matrix X, and plot the data according to the new co-ordinates.

X <- E %*% L^(1/2)
plot(-X, pch=4)
text(-X, labels = rownames(M), cex = 0.5)

How does that look? Pretty good, right? We have recovered the underlying 2-D layout of the cities from our original input distance matrix. Of course, this technique lets us use distance matrices calculated from even higher-dimensional data sets.

Learn more about the variety of techniques which come under the label of MDS.

Principal Components Analysis (PCA)

Visual Summary

In a large data set with many dimensions, some of the dimensions may well be correlated and essentially describe the same underlying information. We can use linear algebra to project our data into a lower-dimensional space, while retaining as much of the underlying information as possible.

The visual summary above provides a low-dimensional explanation. In the plot on the left, our data are described by two axes, x and y.

In the middle plot, we rotate the axes through the data in the direction that captures as much variation as possible. The new PC1 axis describes much more of the variation than axis PC2. In fact, we could ignore PC2 and still keep a large percentage of the variation in the data.

Worked Example

Let’s use a small-scale example to illustrate the core idea. In an R session, or in this snippet at R-fiddle, let’s load one of the in-built data sets.

data <- as.matrix(mtcars)
head(data)
dim(data)

Here we have 32 observations of different cars across 11 dimensions. They include features and measurements such as mpg, cylinders, horsepower….

But how many of those 11 dimensions do we actually need? Are some of them correlated?

Let’s calculate the correlation between the number of cylinders and horsepower. Without any prior knowledge, what might we expect to find?

cor(mtcars$cyl, mtcars$hp)

That’s an interesting result. At +0.83, we find the correlation coefficient is pretty high. This suggests that number of cylinders and horsepower are both describing the same underlying feature. Are more of our dimensions doing something similar?

Rather than check every pair one at a time (because life’s too short), let’s correlate all pairs of our dimensions at once and build a correlation matrix.

cor(data)

Each cell contains the correlation coefficient between the dimensions at each row and column. The diagonal always equals 1.

Correlation coefficients near +1 show strong positive correlation. Coefficients near -1 show strong negative correlation. We can see some values close to -1 and +1 in our correlation matrix. This shows we have some correlated dimensions in our data set.

This is cool, but we still have the same number of dimensions we started with. Let’s throw out a few!

To do this, we can bring out the linear algebra again. One of the strong points of the R language is that it is good at linear algebra, and we’re gonna make use of that in our code. Our first step is to take our correlation matrix and find its eigenvalues.

e <- eigen(cor(data))

Let’s inspect the eigenvalues:

e$values

barplot(e$values/sum(e$values),
    main="Proportion Variance explained")

We see 11 values which decrease pretty dramatically on the bar plot! We see that the eigenvector associated with the largest eigenvalue explains about 60% of the variation in our data. The eigenvector associated with the second largest eigenvalue explains about 24% of the variation in our original data. That’s already 84% of the variation in the data, explained by two dimensions!

OK, let’s say we want to keep 90% of the variation in our original data set. How many dimensions do we need to keep to achieve this?

cumulative <- cumsum(e$values/sum(e$values))
print(cumulative)

i <- which(cumulative >= 0.9)[1]
print(i)

We calculate the cumulative sum of our eigenvalues’ relative proportion of the total variance. We see that the eigenvectors associated with the 4 largest eigenvalues can describe 92.3% of the original variation in our data.

This is useful! We can retain >90% of the original structure using only 4 dimensions. Let’s project the original data set onto a 4-D space. To do this, we need to create a matrix of weights, which we’ll call W.

W <- e$vectors[1:ncol(data),1:i]

W is an 11 x 4 matrix. Remember, 11 is the number of dimensions in our original data, and 4 is the number we want to have for our transformed data. Each column in W is given by the eigenvectors corresponding to the four largest eigenvalues we saw earlier.

To get our transformed data, we multiply the original data set by the weights matrix W. In R, we perform matrix multiplication with the %*% operator.

tD <- data %*% W
head(tD)

We can view our transformed data set. Now each car is described in terms of 4 principal components instead of the original 11 dimensions. To get a better understanding of what these principal components are actually describing, we can correlate them against the original 11 dimensions.

cor(data, tD[,1:i])

We see that component 1 is negatively correlated with cylinders, horsepower and displacement. It is also positively correlated with mpg and possessing a straight (as opposed to V-shaped) engine. This suggests that component 1 is a measure of engine type.

Cars with large, powerful engines will have a negative score for component 1. Smaller engines and more-fuel efficient cars will have a positive score. Recall that this component describes approximately 60% of the variation in the original data.

Likewise, we can interpret the remaining components in this manner. It can become trickier (if not impossible) to do so as we proceed. Each subsequent component describes a smaller and smaller proportion of the overall variation in the data. Nothing beats a little domain-specific expertise!

There are several aspects in which PCA implementations can vary from the method described here. You can read an entire book on the subject.
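
For comparison, R’s built-in prcomp() function performs PCA directly. Run with scaling, its loadings should match the eigenvectors we computed, up to sign:

pc <- prcomp(data, scale. = TRUE)
summary(pc)         # proportion of variance explained by each component
head(pc$rotation)   # loadings; compare against the columns of W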

Linear Discriminant Analysis (LDA)

Visual Summary

On the original axis, the red and blue classes overlap. Through rotation, we can find a new axis which better separates the classes. We may choose to use this axis to project our data into a lower-dimensional space.

PCA seeks axes that best describe the variation within the data. Linear Discriminant Analysis (LDA) seeks axes that best discriminate between two or more classes within the data.

This is achieved by calculating two measures:

  • within-class variance
  • between-class variance

The objective is to optimize the ratio between them, so that there is minimal variance within each class and maximal variance between the classes. We can do this with algebraic methods.

As shown above, A is the within-class scatter. B is the between-class scatter.
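
In Fisher’s formulation, this boils down to finding a projection direction w that maximizes the ratio of between-class to within-class scatter:

J(w) = (w^T B w) / (w^T A w)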

How Does It Work?

Let’s generate a simple data set for this example (for the R-fiddle, click here).

require(dplyr)
languages <- data.frame(
  HTML = c(22,20,15, 5, 5, 5, 0, 2, 0),
  JavaScript = c(20,25,25,20,20,15, 5, 5, 0),
  Java = c(15, 5, 0,15,30,30,10,10,15),
  Python = c( 5, 0, 2, 5,10, 5,40,35,30),
  job = c("Web","Web","Web","App","App","App","Data","Data","Data")
  )

View(languages)

We have a fictional data set describing nine developers in terms of the number of hours they spend working in each of four languages:

  • HTML
  • JavaScript
  • Java
  • Python

Each developer is classed in one of three job roles:

  • web developer
  • app developer
  • and data scientist

cor(select(languages, -job))

We use the select() function from the dplyr package to drop the class labels from the data set. This allows us to inspect the correlations between the different languages.

Unsurprisingly, we see some patterns. There is a strong, positive correlation between HTML and JavaScript. This indicates developers who use one of these languages have a tendency to also use the other.

We suspect that there is some lower-dimensional structure beneath this 4-D data set. Remember, four languages = four dimensions.

Let’s use LDA to project our data into a lower-dimensional space that best separates the three classes of job roles.

First, we need to build within-class scatter matrices for each class. Let’s use dplyr’s filter() and select() methods to break down our data by job role.

Web <- as.data.frame(
  scale(filter(languages, job == "Web") %>% 
    select(., -job),T))

App <- as.data.frame(
  scale(filter(languages, job == "App") %>%
    select(., -job),T))

Data <- as.data.frame(
  scale(filter(languages, job == "Data") %>%
    select(., -job),T))

So now we have three new data sets, one for each job role. For each of these, we can find a covariance matrix. This is closely related to the correlation matrix, and it likewise describes how usage of the different languages tends to vary together.

We find the within-class scatter matrix by summing each of the three covariance matrices. This gives us a matrix describing the scatter within each class.

within <- cov(Web) + cov(App) + cov(Data)

Now we want to find the between-class scatter matrix which describes the scatter between the classes. To do this, we must first find the center of each class, by calculating the average features of each. This lets us form a data.frame where each column describes the average developer for each class.

means <- t(data.frame(
  mean_Web = sapply(Web, mean),
  mean_App = sapply(App, mean),
  mean_Data = sapply(Data, mean)))

To get our between-class scatter matrix, we find the covariance of this matrix:

between <- cov(means)

Now we have two matrices:

  • our within-class scatter matrix
  • the between-class scatter matrix

We want to find new axes for our data which minimize the ratio of within-class scatter to between-class scatter.

We do this by finding the eigenvectors of the matrix formed by:

e <- eigen(solve(within) %*% between)

barplot(e$values/sum(e$values),
  main='Variance explained')
  
W <- e$vectors[,1:2]

By plotting the eigenvalues, we can see that the first two eigenvectors will explain more than 95% of the variation in the data.

Let’s transform the original data set and plot the data in its new, lower-dimensional space.

LDA <- scale(select(languages, -job), T) %*% W
  
plot(LDA, pch="", 
  main='Linear Discriminant Analysis')

text(LDA[,1],LDA[,2],cex=0.75,languages$job,
  col=unlist(lapply(c(2,3,4),rep, 3)))

There you go! See how the new axes do an amazing job separating the different classes? This reduces the dimensionality of the data and could also prove useful for classification purposes.

To begin interpreting the new axes, we can correlate them against the original data:

cor(select(languages,-job),LDA)

This reveals how Axis 1 is negatively correlated with JavaScript and HTML, and positively correlated with Python. This axis separates the Data Scientists from the Web and App developers.

Axis 2 is correlated with HTML and Java in opposite directions. This separates the Web developers from the App developers. It would be an interesting insight, if the data weren’t fictional…

We have assumed the three classes are all equal in size, which simplifies things a bit. LDA can be applied to 2 or more classes, and can be used as a classification method as well.

Get the full picture and coverage of LDA’s use in classification.

Non-linear Dimensionality Reduction

The techniques covered so far are pretty good in many use-cases, but they make a key assumption: that we are working in the context of linear geometry.

Sometimes, this is an assumption we need to drop.

Non-linear dimensionality reduction (NLDR) opens up a fascinating world of advanced mathematics and mind-bending possibilities in applications such as computer vision and autonomy.

There are many NLDR methods available. We’ll take a look at a couple of techniques relating to manifold learning. These will approximate the underlying structure of high-dimensional data. Manifolds are one of the many mathematical concepts that might sound impenetrable but which are actually seen every day.

Take this map of the world:

We’re all fine with the idea of representing the surface of a sphere on a flat sheet of paper. Recall from before that a sphere is defined as a 2-D surface traced a fixed distance around a point in 3-D space. The earth’s surface is a 2-D manifold embedded, or wrapped around, in a 3-D space.

With high-dimensional data, we can use the concept of manifolds to reduce the number of dimensions we need to describe the data.

Think back to the surface of the earth. Earth exists in a 3-D space, so we should describe the location, such as a city, in three dimensions. However, we have no trouble using only two dimensions of latitude and longitude instead.

Manifolds can be more complex and higher-dimensional than the earth example here. Isomap and Laplacian Eigenmapping are two closely related methods used to apply this thinking to high-dimensional data.

Isomap

Visual Summary

We can see our original data as a U-shaped underlying structure. The straight-line distance, as shown by the black arrow, between A and B won’t reflect the fact they lie at opposite ends, as shown by the red line.

We can build a nearest-neighbors graph to find the shortest path between the points. This lets us build a distance matrix that can be used as an input for MDS to find a lower-dimensional representation of the original data that preserves the non-linear structure.

We can approximate distances on the manifold using techniques in graph theory. We can do this by building a graph or network by connecting each of our original data points to a set of neighboring points.

By using a shortest-paths algorithm, we can find the geodesic distance between each point. We can use this to form a distance matrix which can be an input for a linear dimensionality reduction method.

Worked Example

We’re going to implement a simple Isomap algorithm using an artificially generated data set. We’ll keep things in low dimensions, to help visualize what is going on. Here’s the code.

Let’s start by generating some data:

x <- y <- c(); a <- b <- 1

for(i in 1:1000){
  theta <- 0.01 * i
  x <- append(x,(a+b*theta)*(cos(theta)+runif(1,-1,1)))
  y <- append(y,(a+b*theta)*(sin(theta)+runif(1,-1,1)))
}

color <- rainbow(1200)[1:1000]
spiral <- data.frame(x,y,color)
plot(y~x, pch=20, col=color)

Nice! That’s an interesting shape, with a clear, non-linear structure. Our data could be seen as scattered along a 1-D line, running between red and violet, coiled up (or embedded) in a 2-D space. Under the assumption of linearity, distance metrics and other statistical techniques won’t take this into account.

How can we unravel the data to find its underlying 1-D structure?

pc <- prcomp(spiral[,1:2])
plot(data.frame(
  pc$x[,1],1),col=as.character(spiral$color))

PCA won’t help us, as it is a linear dimensionality reduction technique. See how it has collapsed all the points onto an axis running through the spiral? Instead of revealing the underlying red-violet spectrum of points, we only see the blue points scattered along the whole axis.

Let’s try implementing an Isomap algorithm. We begin by building a graph from our data points, by connecting each to its n-nearest neighboring points. n is a hyper-parameter that we need to set in advance of running the algorithm. For now, let’s use n = 5.

We can represent the n-nearest neighbors graph as an adjacency matrix A.

The element at the intersection of each row and column can be either 1 or 0 depending on whether the corresponding points are connected.

Let’s build this with the code below:

n <- 5
distance <- as.matrix(dist(spiral[,1:2]))
A <- matrix(0,ncol=ncol(distance),nrow=nrow(distance))

for(i in 1:nrow(A)){
  neighbours <- as.integer(
    names(sort(distance[i,])[2:(n + 1)]))
  A[i,neighbours] <- 1
}

Now we have our n-nearest neighbors graph, we can start working with the data in a non-linear way. For example, we can begin to approximate the distances between points on the spiral by finding their geodesic distance — calculating the length of the shortest path between them.

Dijkstra’s algorithm is a famous algorithm which can be used to find the shortest path between any two points in a connected graph. We could implement our own version here but to remain on-topic, I will use the distances() function from R’s igraph library.

install.packages('igraph'); require(igraph)

graph <- graph_from_adjacency_matrix(A)
geo <- distances(graph, algorithm = 'dijkstra')

This gives us a distance matrix. Each element represents the shortest number of edges or links required to get from one point to another.

Here’s an idea… why not use MDS to find some co-ordinates for the points represented in this distance matrix? It worked earlier for the cities data.

We could wrap our earlier MDS example in a function and apply our own, homemade version. However, you’ll be pleased to know that R provides an in-built MDS function we can use as well. Let’s scale to one dimension.

md <- data.frame(
  'scaled'=cmdscale(geo,1),
  'color'=spiral$color)

plot(data.frame(
  md$scaled,1), col=as.character(md$color), pch=20)

We’ve reduced from 2-D to 1-D, without ignoring the underlying manifold structure.

For advanced, non-linear machine learning purposes, this is a big deal. Often enough, high-dimensional data arises as a result of a lower-dimensional generative process. Our spiral example illustrates this.

The original spiral was plotted as a data.frame of x and y co-ordinates. But we generated those with a for-loop, in which our index variable i incremented by +1 each iteration.

By applying our Isomap algorithm, we have recapitulated the steady increase in i with each iteration of the loop. Pretty good going.

The version of Isomap we implemented here has been a little simplified in parts. For example, we could have weighted our adjacency matrix to account for Euclidean distances between the points. This would give us a more nuanced measure of geodesic distance.
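
A minimal sketch of that weighted variant, reusing the A and distance matrices from above, might look like this:

W_adj <- A * distance                                   # keep neighbour edges, weighted by their length
graph_w <- graph_from_adjacency_matrix(W_adj, weighted = TRUE)
geo_w <- distances(graph_w, algorithm = "dijkstra")     # geodesic distances in the original units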

One drawback of methods like this include the need to establish suitable hyper-parameter values. If the nearest-neighbors threshold n is too low, you will end up with a fragmented graph. If it is too high, the algorithm will be insensitive to detail. That spiral could become an ellipse if we start connecting points on different layers.

This means these methods work best with dense data. That requires the manifold structure to be pretty well defined in the first place.

Laplacian Eigenmapping

Visual Summary

Using ideas from Spectral Graph Theory, we can find a lower dimensional projection of the data while retaining the non-linear structure.

Again, we can approximate distances on the manifold using techniques in graph theory. We can do this by building a graph connecting each of our original data points to a set of neighboring points.

Laplacian Eigenmapping takes this graph and applies ideas from spectral graph theory to find a lower-dimensional embedding of the original data.

Worked Example

OK, you’ve made it this far. Your reward is the chance to nerd out with our fifth and final dimensionality reduction algorithm. We’ll be exploring another non-linear technique. Like Isomap, it uses graph theory to approximate the underlying structure of the manifold. Check out the code.

Let’s start with similar spiral-shaped data to that we used before. But let’s make it even more tightly wound.

set.seed(100)

x <- y <- c();
a <- b <- 1

for(i in 1:1000){
  theta <- 0.02 * i
  x <- append(x,(a+b*theta)*(cos(theta)+runif(1,-1,1)))
  y <- append(y,(a+b*theta)*(sin(theta)+runif(1,-1,1)))
}

color <- rainbow(1200)[1:1000]
spiral <- data.frame(x,y,color)
plot(y~x, pch=20, col=color)

The naive straight-line distance between A and B is much shorter than the distance from one end of the spiral to the other. Linear techniques won’t stand a chance!

Again, we begin by constructing the adjacency matrix A of an n-nearest neighbors graph. n is a hyper-parameter we need to choose in advance.

Let’s try n = 10:

n <- 10
distance <- as.matrix(dist(spiral[,1:2]))
A <- matrix(0,ncol=ncol(distance),
  nrow=nrow(distance))

for(i in 1:nrow(A)){
  neighbours <- as.integer(
    names(sort(distance[i,])[2:(n + 1)]))
  A[i,neighbours] <- 1
}

for(j in 1:nrow(A)){
  for(k in 1:ncol(A)){
    if(A[j,k] == 1){
      A[k,j] <- 1
    }
  }
}

So far, so much like Isomap. We’ve added an extra few lines of logic to force the matrix to be symmetric. This will allow us to use ideas from spectral graph theory in the next step. We will define the Laplacian matrix of our graph.

We do this by building the degree matrix D.

D <- diag(nrow(A))

for(i in 1:nrow(D)){   
  D[i,i] = sum(A[,i])
}

This is a matrix the same size as A, where every element is equal to zero — except those on the diagonal, which equal the sum of the corresponding column of matrix A.

Next, we form the Laplacian matrix L with the simple subtraction:

L = D - A

The Laplacian matrix is another matrix representation of our graph particularly suited to linear algebra. It allows us to calculate a whole range of interesting properties.

To find our 1-D embedding of the original data, we need to find a vector x and eigenvalue λ.

This will solve the generalized eigenvalue problem:

Lx = λDx

Thankfully, you can put away the pencil and paper, because R provides a package to help us do this.

install.packages('geigen'); require(geigen)
eig <- geigen(L,D)
eig$values[1:10]

We see that the geigen() function has returned the eigenvalue solutions from smallest to largest. Note how the first value is practically zero.

This is one of the properties of the Laplacian matrix — its number of zero eigenvalues tell us how many connected components we have in the graph. Had we used a lower value for n, we might have built a fragmented graph in say, three separate, disconnected parts — in which case, we’d have found three zero eigenvalues.
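
As a quick sanity check, we can count the connected components directly with the igraph package we loaded for the Isomap example:

g <- graph_from_adjacency_matrix(A, mode = "undirected")
components(g)$no    # number of connected components; should equal the number of zero eigenvalues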

To find our low-dimensional embedding, we can take the eigenvectors associated with the lowest non-zero eigenvalues. Since we are projecting from 2-D into 1-D, we will only need one such eigenvector.

embedding <- eig$vectors[,2]
plot(data.frame(embedding,1), col=as.character(spiral$color), pch=20)

And there we have it — another non-linear data set successfully embedded in lower dimensions. Perfect!

We have implemented a simplified version of Laplacian Eigenmapping. We ignored choosing another hyper-parameter t, which would have had the effect of weighting our nearest-neighbors graph.

Take a look at the original paper for the full details and mathematical justification.

Conclusion

There we are — a run through of five dimensionality reduction techniques that we can apply to linear and non-linear data. Don’t worry if you didn’t quite follow all the math (although congrats if you did!). Remember, we always need to strike a balance between theory and practice when it comes to data science.

These algorithms and several others are available in various packages of R, and in scikit-learn for Python.
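
For day-to-day work you would usually reach for the built-in or package implementations rather than rolling your own. A rough sketch of the sort of calls involved in R (exact arguments may need tweaking for your data):

cmdscale(UScitiesD, k = 2)              # classical MDS, base R
prcomp(mtcars, scale. = TRUE)           # PCA, base R
MASS::lda(job ~ ., data = languages)    # LDA via the MASS package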

Why, then, did we run through each one step-by-step? In my experience, rebuilding something from scratch is a great way to understand how it works.

Dimensionality reduction touches upon several branches of mathematics which are useful within data science and other disciplines. Putting these into practice is a great exercise for turning theory into application.

There are, of course, other techniques that we haven’t covered. But if you still have an appetite for more machine learning, then try out the links below:

Linear techniques:

Non-linear:

Thanks for reading! If you have any feedback or questions, please leave a response below!

Translated from: https://www.freecodecamp.org/news/the-curse-of-dimensionality-how-we-can-save-big-data-from-itself-d9fa0f872335/
