The Curse of Dimensionality (Minus the Curse of Jargon)

The curse of dimensionality! What on earth is that? Besides being a prime example of shock-and-awe names in machine learning jargon (which often sound far fancier than they are), it’s a reference to the effect that adding more features has on your dataset. In a nutshell, the curse of dimensionality is all about loneliness.

In a nutshell, the curse of dimensionality is all about loneliness.

Before I explain myself, let’s get some basic jargon out of the way. What’s a feature? It’s the machine learning word for what other disciplines might call a predictor / (independent) variable / attribute / signal. Information about each datapoint, in other words. Here’s a jargon intro if none of those words felt familiar.

Data social distancing is easy: just add a dimension. But for some algorithms, you may find that this is a curse…

When a machine learning algorithm is sensitive to the curse of dimensionality, it means the algorithm works best when your datapoints are surrounded in space by their friends. The fewer friends they have around them in space, the worse things get. Let’s take a look.

One dimension

Imagine you’re sitting in a large classroom, surrounded by your buddies.

You’re a datapoint, naturally. Let’s put you in one dimension by making the room dark and shining a bright light from the back of the room at you. Your shadow is projected onto a line on the front wall. On that line, it’s not lonely at all. You and your crew are sardines in a can, all lumped together. It’s cozy in one dimension! Perhaps a little too cozy.

Two dimensions

To give you room to breathe, let’s add a dimension. We’re in 2D and the plane is the floor of the room. In this space, you and your friends are more spread out. Personal space is a thing again.

Note: If you prefer to follow along in an imaginary spreadsheet, think of adding/removing a dimension as inserting/deleting a column of numbers.

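Sticking with that spreadsheet picture, here's a minimal sketch (NumPy, with made-up numbers) of adding and dropping a dimension as column operations:

```python
import numpy as np

# Four students (rows), one feature each: position along the line on the wall.
X = np.array([[0.1], [0.2], [0.3], [0.4]])
print(X.shape)  # (4, 1): one dimension

# "Adding a dimension" = appending a column (say, a hypothetical floor number).
floors = np.array([[1], [3], [2], [5]])
X_2d = np.hstack([X, floors])
print(X_2d.shape)  # (4, 2): two dimensions

# "Removing a dimension" = deleting a column.
X_1d = np.delete(X_2d, 1, axis=1)
print(X_1d.shape)  # (4, 1) again
```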
Three dimensions

Let’s add a third dimension by randomly sending each of you to one of the floors of the 5-floor building you were in.

All of a sudden, you’re not so densely surrounded by friends anymore. It’s lonely around you. If you enjoyed the feeling of a student in nearly every seat, chances are you’re now mournfully staring at quite a few empty chairs. You’re beginning to get misty eyed, but at least one of your buddies is probably still near you…

Four dimensions

Not for long! Let’s add another dimension. Time.

The students are spread among 60min sections of this class (on various floors) at various times — let’s limit ourselves to 9 sessions because lecturers need sleep and, um, lives. So, if you were lucky enough to still have a companion for emotional support before, I’m fairly confident you’re socially distanced now. If you can’t be effective when you’re lonely, boom! We have our problem. The curse of dimensionality has struck!

MOAR dimensions

As we add dimensions, you get lonely very, very quickly. If we want to make sure that every student is just as surrounded by friends as they were in 2D, we’re going to need students. Lots of them.

The most important idea here is that we have to recruit more friends exponentially, not linearly, to keep your blues at bay.

If we add two dimensions, we can’t simply compensate with two more students… or even two more classrooms’ worth of students. If we started with 50 students in the room originally and we added 5 floors and 9 classes, we need 5x9=45 times more students to keep one another as much company as 50 could have done. So, we need 45x50=2,250 students to avoid loneliness. That’s a whole lot more than one extra student per dimension! Data requirements go up quickly.

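The back-of-the-envelope arithmetic is worth seeing in code (the classroom numbers come straight from the story, nothing canonical about them):

```python
students = 50   # enough company in the original 2D classroom
floors = 5      # the third dimension spreads everyone over 5 slots
sessions = 9    # the fourth dimension adds 9 more

# Each new dimension multiplies the number of cells your friends are spread
# over, so keeping the same density multiplies the students you need.
multiplier = floors * sessions
needed = students * multiplier
print(multiplier, needed)  # 45 2250

# More generally (a simplification): keeping the same density along d extra
# axes with s slots each needs students * s**d people -- exponential in d.
for d in range(1, 5):
    print(d, students * 5 ** d)
```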
When you add dimensions, minimum data requirements can grow rapidly.

We need to recruit many, many more students (datapoints) every time we go up a dimension. If data are expensive for you, this curse is really no joke!

Dimensional divas

Not all machine learning algorithms get so emotional when confronted with a bit of me-time. Methods like k-NN are complete divas, of course. It’s hardly a surprise for a method whose name abbreviation stands for k-Nearest Neighbors — it’s about computing things about neighboring datapoints, so it’s rather important that the datapoints are neighborly.

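For the numerically inclined, here's a quick sketch of the diva behavior (plain NumPy, uniform random points in a unit hypercube, which is a simplification of real data): with a fixed budget of datapoints, the average distance to the nearest neighbor balloons as dimensions pile up, and that's exactly what k-NN leans on.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_nn_distance(n_points, n_dims):
    """Average distance from each point to its nearest neighbor."""
    X = rng.random((n_points, n_dims))
    # All pairwise distances; a point must not count as its own neighbor.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

for dims in (1, 2, 10, 100):
    print(dims, round(mean_nn_distance(200, dims), 3))
# Nearest neighbors drift further away at every step up in dimension.
```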
Other methods are a lot more robust when it comes to dimensions. If you’ve taken a class on linear regression, for example, you’ll know that once you have a respectable number of datapoints, gaining or dropping a dimension isn’t going to make anything implode catastrophically. There’s still a price: it’s just more affordable.*

*Which doesn’t mean it is resilient to all abuse! If you’ve never known the chaos that including a single outlier or adding one near-duplicate feature can unleash on the least squares approach (the Napoleon of crime, Multicollinearity, strikes again!) then consider yourself warned. No method is perfect for every situation. And, yes, that includes neural networks.

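For the curious, here's a tiny synthetic sketch of that multicollinearity strike (NumPy only, made-up data): duplicate a feature with a whisper of noise and the design matrix's condition number explodes, which is what destabilizes least-squares coefficients.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 200
x1 = rng.normal(size=n)
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

# One feature plus an intercept: a perfectly tame least-squares problem.
A = np.column_stack([x1, np.ones(n)])
coef_clean, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef_clean[0])     # recovers a slope close to the true 3.0

# Now add x2 = x1 plus tiny noise: a near-duplicate feature.
x2 = x1 + rng.normal(scale=1e-6, size=n)
B = np.column_stack([x1, x2, np.ones(n)])

print(np.linalg.cond(A))  # small: well-conditioned
print(np.linalg.cond(B))  # enormous: near-singular, coefficients go haywire
```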
What should you do about it?

What are you going to do about the curse of dimensionality in practice? If you’re a machine learning researcher, you’d better know if your algorithm has this problem… but I’m sure you already do. You’re probably not reading this article, so we’ll just talk about you behind your back, shall we? But yeah, you might like to think about whether it’s possible to design the algorithm you’re inventing to be less sensitive to dimension. Many of your customers like their matrices on the full-figured side**, especially if things are getting textual.

**Conventionally, we arrange data in a matrix so that the rows are examples and the columns are features. In that case, a tall and skinny matrix has lots of examples spread over few dimensions.

If you’re an applied data science enthusiast, you’ll do what you always do — get a benchmark of the algorithm’s performance using just one or a few promising features before attempting to throw the kitchen sink at it. (I’ll explain why you need that habit in another post, if you want a clue in the meantime, look up the term overfitting.)

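That habit is easy to demonstrate with synthetic data (hypothetical numbers, and plain NumPy least squares standing in for your model of choice): one genuinely informative feature plus forty noise columns, scored on held-out data.

```python
import numpy as np

rng = np.random.default_rng(1)

# y depends only on column 0; the other 40 columns are pure noise.
n_train, n_test, n_noise = 50, 200, 40
X_train = rng.normal(size=(n_train, 1 + n_noise))
X_test = rng.normal(size=(n_test, 1 + n_noise))
y_train = 2.0 * X_train[:, 0] + rng.normal(scale=0.5, size=n_train)
y_test = 2.0 * X_test[:, 0] + rng.normal(scale=0.5, size=n_test)

def holdout_mse(cols):
    """Fit least squares on the chosen columns, score on held-out data."""
    coef, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
    resid = y_test - X_test[:, cols] @ coef
    return float(np.mean(resid ** 2))

baseline = holdout_mse([0])                        # one promising feature
kitchen_sink = holdout_mse(list(range(1 + n_noise)))  # throw in everything

# Overfitting usually makes the kitchen sink lose on held-out data here.
print(baseline, kitchen_sink)
```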
Some methods only work well on tall, skinny datasets, so you might need to put your dataset on a diet if you’re feeling cursed.

If your method works decently on a limited number of features and then blows a raspberry at you when you increase the dimensions, that’s your cue to either stick to a few features you handpick (or even stepwise-select if you’re getting crafty), or to first make a few superfeatures out of your original kitchen sink by running some cute feature engineering techniques. You could try anything from old school approaches like principal component analysis (PCA), still relevant today since eigenvectors never go out of fashion, to more modern options like autoencoders and other neural network funtimes. You don’t really need to know the term curse of dimensionality to get your work done, because your process (start small and build up the complexity) should take care of it for you. But if it was bothering you… now you can shrug off the worry.

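As an illustration of the PCA flavor of that idea, here's a minimal sketch (NumPy's SVD on synthetic data with three hidden signals smeared across twenty redundant columns; all names and numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(7)

# 100 datapoints, 20 redundant features generated from just 3 hidden signals.
n, d, k = 100, 20, 3
latent = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
X = latent @ mixing + 0.01 * rng.normal(size=(n, d))

# PCA via SVD: center, decompose, keep the top-k principal components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
superfeatures = Xc @ Vt[:k].T   # 20 columns squeezed into 3 "superfeatures"

print(X.shape, superfeatures.shape)  # (100, 20) (100, 3)

# Fraction of variance kept by the top 3 components (nearly all, by design):
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(round(explained, 4))
```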
To summarize: As you add more and more features (columns), you need an exponentially growing number of examples (rows) to overcome how spread out your datapoints are in space. Some methods only work well on tall, skinny datasets, so you might need to put your dataset on a diet if you’re feeling cursed.

[Image caption: “…spherical cow, er, I mean, meow-emitter is… and more a matter of how many packing peanuts it is covered in.” (SOURCE)]

Thanks for reading! Liked the author?

If you’re keen to read more of my writing, most of the links in this article take you to my other musings. Can’t choose? Try this one:

Translated from: https://towardsdatascience.com/the-curse-of-dimensionality-minus-the-curse-of-jargon-520da109fc87
