Statistics for Data Science

Statistics

Statistics refers to the mathematics and techniques with which we understand data. It is a rich, enormous field, more suited to a shelf (or room) in a library rather than a chapter in a book, and so our discussion will necessarily not be a deep one. Instead, I’ll try to teach you just enough to be dangerous, and pique your interest just enough that you’ll go off and learn more.

Describing a Single Set of Data

Through a combination of word-of-mouth and luck, your team has grown to dozens of members, and the VP of Fundraising asks you for some sort of description of how many friends your members have that he can include in his elevator pitches.

You are easily able to produce this data. But now you are faced with the problem of how to describe it.

One obvious description of any data set is simply the data itself:

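For illustration, here is a short, made-up stand-in for the num_friends list (the article’s real data has one entry per member and is much longer):

    num_friends = [100, 49, 41, 40, 25, 21, 21, 19, 19, 18,
                   18, 16, 15, 15, 15, 15, 14, 14, 13, 13]   # toy values only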

For a small enough data set this might even be the best description. But for a larger data set, this is unwieldy and probably opaque. (Imagine staring at a list of 1 million numbers.) For that reason we use statistics to distill and communicate relevant features of our data.

As a first approach you put the friend counts into a histogram using Counter and plt.bar():

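A sketch of that chart, assuming matplotlib is available and using the toy num_friends list above:

    from collections import Counter
    import matplotlib.pyplot as plt

    friend_counts = Counter(num_friends)      # friend count -> number of members
    xs = range(101)                           # largest value is 100
    ys = [friend_counts[x] for x in xs]       # bar height = number of members
    plt.bar(xs, ys)
    plt.axis([0, 101, 0, 25])
    plt.title("Histogram of Friend Counts")
    plt.xlabel("# of friends")
    plt.ylabel("# of people")
    plt.show()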

Unfortunately, this chart is still too difficult to slip into conversations. So you start generating some statistics. Probably the simplest statistic is simply the number of data points:

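In code, that is just the length of the list:

    num_points = len(num_friends)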

You’re probably also interested in the largest and smallest values:

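For example:

    largest_value = max(num_friends)
    smallest_value = min(num_friends)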

which are just special cases of wanting to know the values in specific positions:

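For instance, after sorting the list once:

    sorted_values = sorted(num_friends)
    smallest_value = sorted_values[0]           # same as min(num_friends)
    second_smallest_value = sorted_values[1]
    second_largest_value = sorted_values[-2]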

But we’re only getting started.

Central Tendencies

Usually, we’ll want some notion of where our data is centered. Most commonly we’ll use the mean (or average), which is just the sum of the data divided by its count:

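A minimal version:

    def mean(xs: list[float]) -> float:
        return sum(xs) / len(xs)

    mean(num_friends)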

If you have two data points, the mean is simply the point halfway between them. As you add more points, the mean shifts around, but it always depends on the value of every point.

We’ll also sometimes be interested in the median, which is the middle-most value (if the number of data points is odd) or the average of the two middle-most values (if the number of data points is even).

For instance, if we have five data points in a sorted vector x, the median is x[5 // 2] or x[2]. If we have six data points, we want the average of x[2] (the third point) and x[3] (the fourth point).

Notice that — unlike the mean — the median doesn’t depend on every value in your data. For example, if you make the largest point larger (or the smallest point smaller), the middle points remain unchanged, which means so does the median.

The median function is slightly more complicated than you might expect, mostly because of the “even” case:

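One way to write it is to split the odd and even cases into private helpers:

    def _median_odd(xs: list[float]) -> float:
        """If len(xs) is odd, the median is the middle element."""
        return sorted(xs)[len(xs) // 2]

    def _median_even(xs: list[float]) -> float:
        """If len(xs) is even, it's the average of the two middle elements."""
        sorted_xs = sorted(xs)
        hi_midpoint = len(xs) // 2               # e.g., 3 when len(xs) == 6
        return (sorted_xs[hi_midpoint - 1] + sorted_xs[hi_midpoint]) / 2

    def median(v: list[float]) -> float:
        """Finds the 'middle-most' value of v."""
        return _median_even(v) if len(v) % 2 == 0 else _median_odd(v)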

Clearly, the mean is simpler to compute, and it varies smoothly as our data changes. If we have n data points and one of them increases by some small amount e, then necessarily the mean will increase by e / n. (This makes the mean amenable to all sorts of calculus tricks.) Whereas in order to find the median, we have to sort our data. And changing one of our data points by a small amount e might increase the median by e, by some number less than e, or not at all (depending on the rest of the data).

Note: There are, in fact, nonobvious tricks for efficiently computing medians without sorting the data. However, they are beyond the scope of this article, so here we will just sort the data.

At the same time, the mean is very sensitive to outliers in our data. If our friendliest user had 200 friends (instead of 100), then the mean would rise to 7.82, while the median would stay the same. If outliers are likely to be bad data (or otherwise unrepresentative of whatever phenomenon we’re trying to understand), then the mean can sometimes give us a misleading picture. For example, the story is often told that in the mid-1980s, the major at the University of North Carolina with the highest average starting salary was geography, mostly on account of NBA star (and outlier) Michael Jordan.

A generalization of the median is the quantile, which represents the value less than which a certain percentile of the data lies. (The median represents the value less than which 50% of the data lies.)

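One simple (if slightly rough) version:

    def quantile(xs: list[float], p: float) -> float:
        """Returns the value below which (roughly) a fraction p of the data lies."""
        p_index = int(p * len(xs))
        return sorted(xs)[p_index]

    quantile(num_friends, 0.75)    # the 75th percentile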

Less commonly you might want to look at the mode, or most-common value[s]:

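A sketch, reusing Counter from the histogram snippet:

    def mode(xs: list[float]) -> list[float]:
        """Returns a list, since there might be more than one mode."""
        counts = Counter(xs)
        max_count = max(counts.values())
        return [x_i for x_i, count in counts.items() if count == max_count]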

But most frequently we’ll just use the mean.

Dispersion

Dispersion refers to measures of how spread out our data is. Typically they’re statistics for which values near zero signify not spread out at all and for which large values (whatever that means) signify very spread out. For instance, a very simple measure is the range, which is just the difference between the largest and smallest elements:

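For example:

    def data_range(xs: list[float]) -> float:
        return max(xs) - min(xs)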

The range is zero precisely when the max and min are equal, which can only happen if the elements of x are all the same, which means the data is as undispersed as possible. Conversely, if the range is large, then the max is much larger than the min and the data is more spread out.

Like the median, the range doesn’t really depend on the whole data set. A data set whose points are all either 0 or 100 has the same range as a data set whose values are 0, 100, and lots of 50s. But it seems like the first data set “should” be more spread out.

A more complex measure of dispersion is the variance, which is computed as:

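In code, reusing mean from above, it might look like this:

    def de_mean(xs: list[float]) -> list[float]:
        """Translate xs by subtracting its mean, so the result averages to zero."""
        x_bar = mean(xs)
        return [x - x_bar for x in xs]

    def variance(xs: list[float]) -> float:
        """Almost the average squared deviation from the mean."""
        assert len(xs) >= 2, "variance requires at least two elements"
        n = len(xs)
        deviations = de_mean(xs)
        return sum(d ** 2 for d in deviations) / (n - 1)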

Note: This looks like it is almost the average squared deviation from the mean, except that we’re dividing by n-1 instead of n. In fact, when we’re dealing with a sample from a larger population, x_bar is only an estimate of the actual mean, which means that on average (x_i - x_bar) ** 2 is an underestimate of x_i’s squared deviation from the mean, which is why we divide by n-1 instead of n. See Wikipedia.

Now, whatever units our data is in (e.g., “friends”), all of our measures of central tendency are in that same unit. The range will similarly be in that same unit. The variance, on the other hand, has units that are the square of the original units (e.g., “friends squared”). As it can be hard to make sense of these, we often look instead at the standard deviation:

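For example:

    import math

    def standard_deviation(xs: list[float]) -> float:
        """The standard deviation is the square root of the variance."""
        return math.sqrt(variance(xs))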

Both the range and the standard deviation have the same outlier problem that we saw earlier for the mean. Using the same example, if our friendliest user had instead 200 friends, the standard deviation would be 14.89, more than 60% higher!

A more robust alternative computes the difference between the 75th percentile value and the 25th percentile value:

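Using the quantile function from earlier:

    def interquartile_range(xs: list[float]) -> float:
        """Returns the difference between the 75th and 25th percentile values."""
        return quantile(xs, 0.75) - quantile(xs, 0.25)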

which is quite plainly unaffected by a small number of outliers.

Correlation

Your VP of Growth has a theory that the amount of time people spend on the site is related to the number of friends they have on the site (she’s not a VP for nothing), and she’s asked you to verify this.

After digging through traffic logs, you’ve come up with a list daily_minutes that shows how many minutes per day each user spends on your company site, and you’ve ordered it so that its elements correspond to the elements of our previous num_friends list. We’d like to investigate the relationship between these two metrics.

We’ll first look at covariance, the paired analogue of variance. Whereas variance measures how a single variable deviates from its mean, covariance measures how two variables vary in tandem from their means:

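A sketch, defining along the way the small dot helper that the next paragraph refers to:

    def dot(v: list[float], w: list[float]) -> float:
        """Sums up the products of corresponding pairs of elements."""
        assert len(v) == len(w), "vectors must be the same length"
        return sum(v_i * w_i for v_i, w_i in zip(v, w))

    def covariance(xs: list[float], ys: list[float]) -> float:
        assert len(xs) == len(ys), "xs and ys must have the same number of elements"
        return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)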

Recall that dot sums up the products of corresponding pairs of elements. When corresponding elements of x and y are either both above their means or both below their means, a positive number enters the sum. When one is above its mean and the other below, a negative number enters the sum. Accordingly, a “large” positive covariance means that x tends to be large when y is large and small when y is small. A “large” negative covariance means the opposite — that x tends to be small when y is large and vice versa. A covariance close to zero means that no such relationship exists. Nonetheless, this number can be hard to interpret, for a couple of reasons:

• Its units are the product of the inputs’ units (e.g., friend-minutes-per-day), which can be hard to make sense of. (What’s a “friend-minute-per-day”?)

• If each user had twice as many friends (but the same number of minutes), the covariance would be twice as large. But in a sense the variables would be just as interrelated. Said differently, it’s hard to say what counts as a “large” covariance.

For this reason, it’s more common to look at the correlation, which divides out the standard deviations of both variables:

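Putting the earlier pieces together:

    def correlation(xs: list[float], ys: list[float]) -> float:
        """Measures how much xs and ys vary in tandem about their means."""
        stdev_x = standard_deviation(xs)
        stdev_y = standard_deviation(ys)
        if stdev_x > 0 and stdev_y > 0:
            return covariance(xs, ys) / stdev_x / stdev_y
        else:
            return 0    # if there is no variation, correlation is zero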

The correlation is unitless and always lies between -1 (perfect anti-correlation) and 1 (perfect correlation). A number like 0.25 represents a relatively weak positive correlation.

However, one thing we neglected to do was examine our data. Check out Figure 5–2.

[Figure 5–2: number of friends vs. minutes per day, with the outlier included]

The person with 100 friends (who spends only one minute per day on the site) is a huge outlier, and correlation can be very sensitive to outliers. What happens if we ignore him?

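One way to check, using the functions above (recall that daily_minutes is ordered to line up with num_friends):

    outlier = num_friends.index(100)    # index of the member with 100 friends

    num_friends_good = [x for i, x in enumerate(num_friends) if i != outlier]
    daily_minutes_good = [x for i, x in enumerate(daily_minutes) if i != outlier]

    correlation(num_friends_good, daily_minutes_good)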

Without the outlier, there is a much stronger correlation (Figure 5–3).

[Figure 5–3: number of friends vs. minutes per day, with the outlier removed]

You investigate further and discover that the outlier was actually an internal test account that no one ever bothered to remove. So you feel pretty justified in excluding it.

Simpson’s Paradox

One not uncommon surprise when analyzing data is Simpson’s Paradox, in which correlations can be misleading when confounding variables are ignored. For example, imagine that you can identify all of your members as either East Coast or West Coast. You decide to examine which coast’s members are friendlier:

[Table: number of members and average friend count, by coast]

It certainly looks like the West Coast members are friendlier than the East Coast members. Your coworkers advance all sorts of theories as to why this might be: maybe it’s the sun, or the coffee, or the organic produce, or the laid-back Pacific vibe?

When playing with the data you discover something very strange. If you only look at people with PhDs, the East Coast members have more friends on average. And if you only look at people without PhDs, the East Coast members also have more friends on average!

[Table: average friend count, broken down by coast and by PhD / no PhD]

Once you account for the users’ degrees, the correlation goes in the opposite direction! Bucketing the data as East Coast/West Coast disguised the fact that the East Coast members skew much more heavily toward PhD types.

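To see how the reversal can happen, here is a purely made-up example (the numbers are chosen for illustration only, not taken from the article’s data):

    # East Coast wins within each group...
    west_phd,    east_phd    = [3.0] * 30,  [3.5] * 70
    west_no_phd, east_no_phd = [10.0] * 70, [11.0] * 30

    # ...yet the West Coast wins overall, because its members are mostly non-PhDs.
    mean(west_phd + west_no_phd)    # (30 * 3.0 + 70 * 10.0) / 100 = 7.9
    mean(east_phd + east_no_phd)    # (70 * 3.5 + 30 * 11.0) / 100 = 5.75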

This phenomenon crops up in the real world with some regularity. The key issue is that correlation is measuring the relationship between your two variables all else being equal. If your data classes are assigned at random, as they might be in a well-designed experiment, “all else being equal” might not be a terrible assumption. But when there is a deeper pattern to class assignments, “all else being equal” can be an awful assumption.

The only real way to avoid this is by knowing your data and by doing what you can to make sure you’ve checked for possible confounding factors. Obviously, this is not always possible. If you didn’t have the educational attainment of these 200 members, you might simply conclude that there was something inherently more sociable about the West Coast.

Some Other Correlational Caveats

A correlation of zero indicates that there is no linear relationship between the two variables. However, there may be other sorts of relationships. For example, if:

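(using small made-up values, where each y value is the absolute value of the matching x value)

    x = [-2, -1, 0, 1, 2]
    y = [ 2,  1, 0, 1, 2]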

then x and y have zero correlation. But they certainly have a relationship — each element of y equals the absolute value of the corresponding element of x. What they don’t have is a relationship in which knowing how x_i compares to mean(x) gives us information about how y_i compares to mean(y). That is the sort of relationship that correlation looks for.

In addition, correlation tells you nothing about how large the relationship is. The variables:

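(again small made-up values; here each y is just 100 plus one-hundredth of the matching x)

    x = [-2, -1, 0, 1, 2]
    y = [99.98, 99.99, 100, 100.01, 100.02]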

are perfectly correlated, but (depending on what you’re measuring) it’s quite possible that this relationship isn’t all that interesting.

Correlation and Causation

You have probably heard at some point that “correlation is not causation,” most likely by someone looking at data that posed a challenge to parts of his worldview that he was reluctant to question. Nonetheless, this is an important point — if x and y are strongly correlated, that might mean that x causes y, that y causes x, that each causes the other, that some third factor causes both, or it might mean nothing.

Consider the relationship between num_friends and daily_minutes. It’s possible that having more friends on the site causes your company users to spend more time on the site. This might be the case if each friend posts a certain amount of content each day, which means that the more friends you have, the more time it takes to stay current with their updates.

However, it’s also possible that the more time you spend arguing in the company forums, the more you encounter and befriend like-minded people. That is, spending more time on the site causes users to have more friends.

A third possibility is that some third factor drives both: the users who are most passionate about the site spend more time on it (because they find it more interesting) and also more actively collect friends there (because they don’t want to associate with anyone else).

One way to feel more confident about causality is by conducting randomized trials. If you can randomly split your users into two groups with similar demographics and give one of the groups a slightly different experience, then you can often feel pretty good that the different experiences are causing the different outcomes.

For instance, if you don’t mind being angrily accused of experimenting on your users, you could randomly choose a subset of your users and show them content from only a fraction of their friends. If this subset subsequently spent less time on the site, this would give you some confidence that having more friends causes more time on the site.

I hope you found this article useful. Thank you for reading this far. If you have any questions and/or suggestions, let me know in the comments. You can also get in touch with me directly through email and LinkedIn.

References and Further Reading

Linear Algebra for Data Science

Translated from: https://medium.com/analytics-vidhya/statistics-for-data-science-1ee97be0c79f
