Quartiles vs. Mean and Standard Deviation: When the Median Is Favorable to the Mean

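This article discusses when the median reflects the characteristics of the data better than the mean, introduces quartiles as an important statistical measure, and compares quartiles with the mean and standard deviation for describing data.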


The mean and the median are two of the most common statistics used to describe numerical data. Both are measures of central tendency, meaning they describe a set of data by indicating where its center lies. The mean is the average value: add up all of the data and divide the total by the number of points in the dataset. The median, on the other hand, is the middle number in a set of data once it has been ordered from smallest to largest (with an even number of points, it is the average of the two middle values).


Data: 1, 8, 3
• Mean --> (1 + 8 + 3) / 3 = 4
• Median --> 1, 3, 8 --> 3
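The same calculation can be sketched in Python. This is only an illustration using the standard library's `statistics` module; the variable names are chosen here and are not from the original article.

```python
import statistics

data = [1, 8, 3]

# Mean: add up all values and divide by the number of points.
mean_value = sum(data) / len(data)      # (1 + 8 + 3) / 3

# Median: the middle value once the data is sorted (1, 3, 8).
median_value = statistics.median(data)

print(mean_value)    # 4.0
print(median_value)  # 3
```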

While the mean may seem like the logical measure to use when describing your data, this is not always the case. The mean has one key disadvantage: it is very susceptible to outliers in the data. Take the data graphed in the chart above, for example, which represents the cost of sneaker orders. As we can see, the vast majority of the data sits at the far left side of the chart.


[Image: Description of the data]

The description of the data above shows that 75% of sneaker purchases cost $390 or less. The mean of this data, however, is $3,145.13, which is clearly not an accurate representation of a typical order. A few drastic outliers (the discrepancy is visible in the max value in the description) are greatly inflating the mean, so the median is the better metric to report for this dataset. The median is $284, which, based on domain knowledge of the sneaker market, is a much better representation of both this data and sneaker sales in general.
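To see how a single outlier can pull the mean away from the bulk of the data, here is a minimal sketch. The order costs below are made-up values for illustration only; they are not the sneaker dataset described in the article, so the results differ from the figures quoted above.

```python
import statistics

# Hypothetical order costs in dollars (NOT the article's data):
# seven ordinary orders and one extreme outlier.
orders = [120, 150, 180, 200, 220, 250, 300, 25000]

mean_cost = statistics.mean(orders)
median_cost = statistics.median(orders)

# Three cut points that split the sorted data into four equal parts.
quartiles = statistics.quantiles(orders, n=4)

print(f"mean:      {mean_cost}")    # 3302.5, larger than every ordinary order
print(f"median:    {median_cost}")  # 210.0, close to a typical order
print(f"quartiles: {quartiles}")

# Dropping the outlier barely moves the median but collapses the mean.
without_outlier = orders[:-1]
print(statistics.mean(without_outlier), statistics.median(without_outlier))
```

The mean lands above every order except the outlier itself, while the median stays inside the range where most orders fall.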


It is therefore very important to look at the distribution of your data before deciding which metric to use to represent it. If the data is roughly normally distributed (symmetric, without a long tail), the mean will likely be an appropriate descriptor. However, if the data is skewed like the data we looked at here, the median may be the better option.
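One rough way to turn this advice into a check is to compare the mean and the median relative to the spread of the data. The sketch below uses Pearson's second skewness coefficient, 3 * (mean - median) / standard deviation. The function name `suggest_center` and the 0.5 threshold are assumptions made for this example, not something the article prescribes, and plotting the distribution remains the more reliable approach.

```python
import statistics

def suggest_center(values, threshold=0.5):
    """Suggest 'mean' or 'median' from a simple skewness heuristic.

    The threshold is an arbitrary illustrative cutoff.
    Requires at least two data points (needed by statistics.stdev).
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.stdev(values)
    if spread == 0:
        # All values are identical, so mean and median coincide.
        return "mean"
    # Pearson's second skewness coefficient.
    skew = 3 * (mean - median) / spread
    return "median" if abs(skew) > threshold else "mean"

symmetric = [1, 2, 3, 4, 5]                           # mean equals median
skewed = [120, 150, 180, 200, 220, 250, 300, 25000]   # hypothetical values

print(suggest_center(symmetric))  # mean
print(suggest_center(skewed))     # median
```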


Translated from: https://towardsdatascience.com/when-the-median-is-favorable-to-the-mean-c5b01b149ec0
