Statistics for Data Science

Statistics

Statistics refers to the mathematics and techniques with which we understand data. It is a rich, enormous field, more suited to a shelf (or room) in a library rather than a chapter in a book, and so our discussion will necessarily not be a deep one. Instead, I’ll try to teach you just enough to be dangerous, and pique your interest just enough that you’ll go off and learn more.

Describing a Single Set of Data

Through a combination of word-of-mouth and luck, your team has grown to dozens of members, and the VP of Fundraising asks you for some sort of description of how many friends your members have that he can include in his elevator pitches.

You are easily able to produce this data. But now you are faced with the problem of how to describe it.

One obvious description of any data set is simply the data itself:

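For illustration, here is a short, made-up stand-in for the num_friends list (the article’s real data has one entry per member and is much longer):

    num_friends = [100, 49, 41, 40, 25, 21, 21, 19, 19, 18,
                   18, 16, 15, 15, 15, 15, 14, 14, 13, 13]   # toy values only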

For a small enough data set this might even be the best description. But for a larger data set, this is unwieldy and probably opaque. (Imagine staring at a list of 1 million numbers.) For that reason we use statistics to distill and communicate relevant features of our data.

As a first approach you put the friend counts into a histogram using Counter and plt.bar():

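A sketch of that chart, assuming matplotlib is available and using the toy num_friends list above:

    from collections import Counter
    import matplotlib.pyplot as plt

    friend_counts = Counter(num_friends)      # friend count -> number of members
    xs = range(101)                           # largest value is 100
    ys = [friend_counts[x] for x in xs]       # bar height = number of members
    plt.bar(xs, ys)
    plt.axis([0, 101, 0, 25])
    plt.title("Histogram of Friend Counts")
    plt.xlabel("# of friends")
    plt.ylabel("# of people")
    plt.show()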

Unfortunately, this chart is still too difficult to slip into conversations. So you start generating some statistics. Probably the simplest statistic is simply the number of data points:

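In code, that is just the length of the list:

    num_points = len(num_friends)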

You’re probably also interested in the largest and smallest values:

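For example:

    largest_value = max(num_friends)
    smallest_value = min(num_friends)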

which are just special cases of wanting to know the values in specific positions:

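For instance, after sorting the list once:

    sorted_values = sorted(num_friends)
    smallest_value = sorted_values[0]           # same as min(num_friends)
    second_smallest_value = sorted_values[1]
    second_largest_value = sorted_values[-2]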

But we’re only getting started.

Central Tendencies

Usually, we’ll want some notion of where our data is centered. Most commonly we’ll use the mean (or average), which is just the sum of the data divided by its count:

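A minimal version:

    def mean(xs: list[float]) -> float:
        return sum(xs) / len(xs)

    mean(num_friends)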

If you have two data points, the mean is simply the point halfway between them. As you add more points, the mean shifts around, but it always depends on the value of every point.

We’ll also sometimes be interested in the median, which is the middle-most value (if the number of data points is odd) or the average of the two middle-most values (if the number of data points is even).

For instance, if we have five data points in a sorted vector x, the median is x[5 // 2] or x[2]. If we have six data points, we want the average of x[2] (the third point) and x[3] (the fourth point).

Notice that — unlike the mean — the median doesn’t depend on every value in your data. For example, if you make the largest point larger (or the smallest point smaller), the middle points remain unchanged, which means so does the median.

The median function is slightly more complicated than you might expect, mostly because of the “even” case:

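One way to write it is to split the odd and even cases into private helpers:

    def _median_odd(xs: list[float]) -> float:
        """If len(xs) is odd, the median is the middle element."""
        return sorted(xs)[len(xs) // 2]

    def _median_even(xs: list[float]) -> float:
        """If len(xs) is even, it's the average of the two middle elements."""
        sorted_xs = sorted(xs)
        hi_midpoint = len(xs) // 2               # e.g., 3 when len(xs) == 6
        return (sorted_xs[hi_midpoint - 1] + sorted_xs[hi_midpoint]) / 2

    def median(v: list[float]) -> float:
        """Finds the 'middle-most' value of v."""
        return _median_even(v) if len(v) % 2 == 0 else _median_odd(v)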

Clearly, the mean is simpler to compute, and it varies smoothly as our data changes. If we have n data points and one of them increases by some small amount e, then necessarily the mean will increase by e / n. (This makes the mean amenable to all sorts of calculus tricks.) Whereas in order to find the median, we have to sort our data. And changing one of our data points by a small amount e might increase the median by e, by some number less than e, or not at all (depending on the rest of the data).

Note: There are, in fact, nonobvious tricks for efficiently computing medians without sorting the data. However, they are beyond the scope of this article, so here we will just sort the data.

At the same time, the mean is very sensitive to outliers in our data. If our friendliest user had 200 friends (instead of 100), then the mean would rise to 7.82, while the median would stay the same. If outliers are likely to be bad data (or otherwise unrepresentative of whatever phenomenon we’re trying to understand), then the mean can sometimes give us a misleading picture. For example, the story is often told that in the mid-1980s, the major at the University of North Carolina with the highest average starting salary was geography, mostly on account of NBA star (and outlier) Michael Jordan.

A generalization of the median is the quantile, which represents the value less than which a certain percentile of the data lies. (The median represents the value less than which 50% of the data lies.)

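One simple (if slightly rough) version:

    def quantile(xs: list[float], p: float) -> float:
        """Returns the value below which (roughly) a fraction p of the data lies."""
        p_index = int(p * len(xs))
        return sorted(xs)[p_index]

    quantile(num_friends, 0.75)    # the 75th percentile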

Less commonly you might want to look at the mode, or most-common value[s]:

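A sketch, reusing Counter from the histogram snippet:

    def mode(xs: list[float]) -> list[float]:
        """Returns a list, since there might be more than one mode."""
        counts = Counter(xs)
        max_count = max(counts.values())
        return [x_i for x_i, count in counts.items() if count == max_count]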

But most frequently we’ll just use the mean.

Dispersion

Dispersion refers to measures of how spread out our data is. Typically they’re statistics for which values near zero signify not spread out at all and for which large values (whatever that means) signify very spread out. For instance, a very simple measure is the range, which is just the difference between the largest and smallest elements:

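For example:

    def data_range(xs: list[float]) -> float:
        return max(xs) - min(xs)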

The range is zero precisely when the max and min are equal, which can only happen if the elements of x are all the same, which means the data is as undispersed as possible. Conversely, if the range is large, then the max is much larger than the min and the data is more spread out.

Like the median, the range doesn’t really depend on the whole data set. A data set whose points are all either 0 or 100 has the same range as a data set whose values are 0, 100, and lots of 50s. But it seems like the first data set “should” be more spread out.

A more complex measure of dispersion is the variance, which is computed as:

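In code, reusing mean from above, it might look like this:

    def de_mean(xs: list[float]) -> list[float]:
        """Translate xs by subtracting its mean, so the result averages to zero."""
        x_bar = mean(xs)
        return [x - x_bar for x in xs]

    def variance(xs: list[float]) -> float:
        """Almost the average squared deviation from the mean."""
        assert len(xs) >= 2, "variance requires at least two elements"
        n = len(xs)
        deviations = de_mean(xs)
        return sum(d ** 2 for d in deviations) / (n - 1)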

Note: This looks like it is almost the average squared deviation from the mean, except that we’re dividing by n-1 instead of n. In fact, when we’re dealing with a sample from a larger population, x_bar is only an estimate of the actual mean, which means that on average (x_i - x_bar) ** 2 is an underestimate of x_i’s squared deviation from the mean, which is why we divide by n-1 instead of n. See Wikipedia.

Now, whatever units our data is in (e.g., “friends”), all of our measures of central tendency are in that same unit. The range will similarly be in that same unit. The variance, on the other hand, has units that are the square of the original units (e.g., “friends squared”). As it can be hard to make sense of these, we often look instead at the standard deviation:

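For example:

    import math

    def standard_deviation(xs: list[float]) -> float:
        """The standard deviation is the square root of the variance."""
        return math.sqrt(variance(xs))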

Both the range and the standard deviation have the same outlier problem that we saw earlier for the mean. Using the same example, if our friendliest user had instead 200 friends, the standard deviation would be 14.89, more than 60% higher!

A more robust alternative computes the difference between the 75th percentile value and the 25th percentile value:

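Using the quantile function from earlier:

    def interquartile_range(xs: list[float]) -> float:
        """Returns the difference between the 75th and 25th percentile values."""
        return quantile(xs, 0.75) - quantile(xs, 0.25)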

which is quite plainly unaffected by a small number of outliers.

Correlation

Your VP of Growth has a theory that the amount of time people spend on the site is related to the number of friends they have on the site (she’s not a VP for nothing), and she’s asked you to verify this.

After digging through traffic logs, you’ve come up with a list daily_minutes that shows how many minutes per day each user spends on your company site, and you’ve ordered it so that its elements correspond to the elements of our previous num_friends list. We’d like to investigate the relationship between these two metrics.

We’ll first look at covariance, the paired analogue of variance. Whereas variance measures how a single variable deviates from its mean, covariance measures how two variables vary in tandem from their means:

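A sketch, defining along the way the small dot helper that the next paragraph refers to:

    def dot(v: list[float], w: list[float]) -> float:
        """Sums up the products of corresponding pairs of elements."""
        assert len(v) == len(w), "vectors must be the same length"
        return sum(v_i * w_i for v_i, w_i in zip(v, w))

    def covariance(xs: list[float], ys: list[float]) -> float:
        assert len(xs) == len(ys), "xs and ys must have the same number of elements"
        return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)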

Recall that dot sums up the products of corresponding pairs of elements. When corresponding elements of x and y are either both above their means or both below their means, a positive number enters the sum. When one is above its mean and the other below, a negative number enters the sum. Accordingly, a “large” positive covariance means that x tends to be large when y is large and small when y is small. A “large” negative covariance means the opposite — that x tends to be small when y is large and vice versa. A covariance close to zero means that no such relationship exists. Nonetheless, this number can be hard to interpret, for a couple of reasons:

• Its units are the product of the inputs’ units (e.g., friend-minutes-per-day), which can be hard to make sense of. (What’s a “friend-minute-per-day”?)

• If each user had twice as many friends (but the same number of minutes), the covariance would be twice as large. But in a sense the variables would be just as interrelated. Said differently, it’s hard to say what counts as a “large” covariance.

For this reason, it’s more common to look at the correlation, which divides out the standard deviations of both variables:

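Putting the earlier pieces together:

    def correlation(xs: list[float], ys: list[float]) -> float:
        """Measures how much xs and ys vary in tandem about their means."""
        stdev_x = standard_deviation(xs)
        stdev_y = standard_deviation(ys)
        if stdev_x > 0 and stdev_y > 0:
            return covariance(xs, ys) / stdev_x / stdev_y
        else:
            return 0    # if there is no variation, correlation is zero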

The correlation is unitless and always lies between -1 (perfect anti-correlation) and 1 (perfect correlation). A number like 0.25 represents a relatively weak positive correlation.

However, one thing we neglected to do was examine our data. Check out Figure 5–2.

[Figure 5–2: number of friends vs. minutes per day, with the outlier included]

The person with 100 friends (who spends only one minute per day on the site) is a huge outlier, and correlation can be very sensitive to outliers. What happens if we ignore him?

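One way to check, using the functions above (recall that daily_minutes is ordered to line up with num_friends):

    outlier = num_friends.index(100)    # index of the member with 100 friends

    num_friends_good = [x for i, x in enumerate(num_friends) if i != outlier]
    daily_minutes_good = [x for i, x in enumerate(daily_minutes) if i != outlier]

    correlation(num_friends_good, daily_minutes_good)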

Without the outlier, there is a much stronger correlation (Figure 5–3).

[Figure 5–3: number of friends vs. minutes per day, with the outlier removed]

You investigate further and discover that the outlier was actually an internal test account that no one ever bothered to remove. So you feel pretty justified in excluding it.

Simpson’s Paradox

One not uncommon surprise when analyzing data is Simpson’s Paradox, in which correlations can be misleading when confounding variables are ignored. For example, imagine that you can identify all of your members as either East Coast or West Coast. You decide to examine which coast’s members are friendlier:

[Table: number of members and average friend count, by coast]

It certainly looks like the West Coast members are friendlier than the East Coast members. Your coworkers advance all sorts of theories as to why this might be: maybe it’s the sun, or the coffee, or the organic produce, or the laid-back Pacific vibe?

When playing with the data you discover something very strange. If you only look at people with PhDs, the East Coast members have more friends on average. And if you only look at people without PhDs, the East Coast members also have more friends on average!

[Table: average friend count, broken down by coast and by PhD / no PhD]

Once you account for the users’ degrees, the correlation goes in the opposite direction! Bucketing the data as East Coast/West Coast disguised the fact that the East Coast members skew much more heavily toward PhD types.

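To see how the reversal can happen, here is a purely made-up example (the numbers are chosen for illustration only, not taken from the article’s data):

    # East Coast wins within each group...
    west_phd,    east_phd    = [3.0] * 30,  [3.5] * 70
    west_no_phd, east_no_phd = [10.0] * 70, [11.0] * 30

    # ...yet the West Coast wins overall, because its members are mostly non-PhDs.
    mean(west_phd + west_no_phd)    # (30 * 3.0 + 70 * 10.0) / 100 = 7.9
    mean(east_phd + east_no_phd)    # (70 * 3.5 + 30 * 11.0) / 100 = 5.75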

This phenomenon crops up in the real world with some regularity. The key issue is that correlation is measuring the relationship between your two variables all else being equal. If your data classes are assigned at random, as they might be in a well-designed experiment, “all else being equal” might not be a terrible assumption. But when there is a deeper pattern to class assignments, “all else being equal” can be an awful assumption.

The only real way to avoid this is by knowing your data and by doing what you can to make sure you’ve checked for possible confounding factors. Obviously, this is not always possible. If you didn’t have the educational attainment of these 200 members, you might simply conclude that there was something inherently more sociable about the West Coast.

Some Other Correlational Caveats

A correlation of zero indicates that there is no linear relationship between the two variables. However, there may be other sorts of relationships. For example, if:

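(using small made-up values, where each y value is the absolute value of the matching x value)

    x = [-2, -1, 0, 1, 2]
    y = [ 2,  1, 0, 1, 2]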

then x and y have zero correlation. But they certainly have a relationship — each element of y equals the absolute value of the corresponding element of x. What they don’t have is a relationship in which knowing how x_i compares to mean(x) gives us information about how y_i compares to mean(y). That is the sort of relationship that correlation looks for.

In addition, correlation tells you nothing about how large the relationship is. The variables:

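(again small made-up values; here each y is just 100 plus one-hundredth of the matching x)

    x = [-2, -1, 0, 1, 2]
    y = [99.98, 99.99, 100, 100.01, 100.02]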

are perfectly correlated, but (depending on what you’re measuring) it’s quite possible that this relationship isn’t all that interesting.

Correlation and Causation

You have probably heard at some point that “correlation is not causation,” most likely by someone looking at data that posed a challenge to parts of his worldview that he was reluctant to question. Nonetheless, this is an important point — if x and y are strongly correlated, that might mean that x causes y, that y causes x, that each causes the other, that some third factor causes both, or it might mean nothing.

Consider the relationship between num_friends and daily_minutes. It’s possible that having more friends on the site causes your company users to spend more time on the site. This might be the case if each friend posts a certain amount of content each day, which means that the more friends you have, the more time it takes to stay current with their updates.

However, it’s also possible that the more time you spend arguing in the company forums, the more you encounter and befriend like-minded people. That is, spending more time on the site causes users to have more friends.

A third possibility is that some third factor drives both: the users who are most passionate about the site spend more time on it (because they find it more interesting) and also more actively collect friends there (because they don’t want to associate with anyone else).

One way to feel more confident about causality is by conducting randomized trials. If you can randomly split your users into two groups with similar demographics and give one of the groups a slightly different experience, then you can often feel pretty good that the different experiences are causing the different outcomes.

For instance, if you don’t mind being angrily accused of experimenting on your users, you could randomly choose a subset of your users and show them content from only a fraction of their friends. If this subset subsequently spent less time on the site, this would give you some confidence that having more friends causes more time on the site.

I hope you found this article useful. Thank you for reading this far. If you have any questions and/or suggestions, let me know in the comments. You can also get in touch with me directly through email and LinkedIn.

References and Further Reading

Linear Algebra for Data Science

Translated from: https://medium.com/analytics-vidhya/statistics-for-data-science-1ee97be0c79f
