

Data Science, Machine Learning

In this article, we will go through two important concepts in descriptive statistics: skewness and kurtosis. By the end, you will have answers to questions such as what skewness and kurtosis are, what right and left skewness mean, how skewness and kurtosis are measured, and how they are useful.

Skewness

‘Skewness’ is a measure of the asymmetry of the probability distribution of a real-valued random variable.


Negative Skewness

The data are concentrated on the right of the figure, as you can see below, so there is a long tail on the left side. Such a distribution is also called left-skewed or left-tailed.

Positive Skewness

The data are concentrated on the left of the figure, as you can see below, so there is a long tail on the right side. Such a distribution is also called right-skewed or right-tailed.

[Figure: negative and positive skew. Source: Wikipedia]
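To make the sign convention concrete, here is a minimal Python sketch that computes skewness as the third standardized moment (the population form, without bias correction). The helper name `skewness` and the sample values are illustrative assumptions, not taken from the original article. Libraries such as SciPy (`scipy.stats.skew`) and pandas (`Series.skew`) ship their own implementations; pandas applies a bias correction, so its numbers can differ slightly from this sketch.

```python
def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Illustrative samples, made up for this sketch.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 15]  # long tail on the right
left_tailed = [-x for x in right_tailed]     # mirror image: long tail on the left
symmetric = [1, 2, 3, 4, 5]

print("right-tailed:", skewness(right_tailed))  # positive value
print("left-tailed: ", skewness(left_tailed))   # negative value
print("symmetric:   ", skewness(symmetric))     # zero
```

A long right tail gives a positive value, a long left tail gives a negative value, and a symmetric sample gives zero.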

How to interpret skewness

A rule of thumb says:


  • If the skewness is between -0.5 and 0.5, the data are fairly symmetrical (as a normal distribution would be).

  • If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the data are moderately skewed.

  • If the skewness is less than -1 (negatively skewed) or greater than 1 (positively skewed), the data are highly skewed.
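
The rule of thumb above can be sketched as a small helper. The function name `describe_skewness` is hypothetical, and how the exact boundary values (±0.5 and ±1) are assigned is an assumption of this sketch, since the rule itself does not say.

```python
def describe_skewness(skew):
    """Label a skewness value using the rule of thumb (boundaries are a choice)."""
    magnitude = abs(skew)
    if magnitude <= 0.5:
        return "fairly symmetrical"
    direction = "positively" if skew > 0 else "negatively"
    degree = "moderately" if magnitude <= 1 else "highly"
    return f"{degree} skewed ({direction})"


print(describe_skewness(1.12))   # highly skewed (positively)
print(describe_skewness(-0.11))  # fairly symmetrical
print(describe_skewness(-0.8))   # moderately skewed (negatively)
```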


If the data follow a normal distribution, their skewness is zero. The converse does not hold: zero skewness only indicates symmetry, not normality. Real-world data rarely follow a normal distribution perfectly, so in practice we do not expect a skewness of exactly zero, only a value close to zero.

Why study skewness

Consider the example below. Here total_bill is positively skewed, and the data points are concentrated on the left side. If we were to build a model on these data, it would likely make better predictions for lower values of total_bill than for higher ones, because it sees far fewer examples of high bills.

[Figure: distribution of total_bill. Image by author]

Skewness tells us about the direction of the outliers. From the distribution above, we can see that the outliers lie on the right side of the distribution.

How to deal with skewed data

Many statistical tests and some machine learning models rely on normality assumptions. Significant skewness means that the data are not normal, which may affect your statistical tests or your model's predictive power. In such cases, we can transform the data to bring them closer to normal. Some common techniques for treating skewed data:

  • Log transformation

  • Square root transformation

  • Power transformation

  • Exponential transformation

  • Box-Cox transformation, etc
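
As a rough illustration of a few of the transformations listed above, the sketch below applies a log, a square root, and a Box-Cox transform with a fixed parameter to a handful of made-up positive values. The helper `box_cox` and the choice of `lam = 0.5` are assumptions of this sketch; in practice the Box-Cox parameter is usually estimated from the data (for example, `scipy.stats.boxcox` estimates it when no value is supplied). Log and Box-Cox both require strictly positive inputs.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of one strictly positive value for a fixed lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)        # the lambda = 0 case is defined as the log
    return (x ** lam - 1) / lam


values = [1.0, 4.0, 9.0, 100.0]   # made-up values with one large observation

log_values = [math.log(x) for x in values]
sqrt_values = [math.sqrt(x) for x in values]
boxcox_half = [box_cox(x, 0.5) for x in values]  # same as 2 * (sqrt(x) - 1)

print("log:        ", log_values)
print("square root:", sqrt_values)
print("Box-Cox 0.5:", boxcox_half)
```

All three compress large values more than small ones, which is why they tend to pull in a long right tail.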


As an example, consider the tips dataset from the Seaborn library. total_bill has a skewness of 1.12, which means it is highly skewed. The distribution plot also shows that the data are positively skewed. After a log transformation of total_bill, the skewness is reduced to -0.11, which means the transformed data are fairly symmetrical.
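
The original code for this example is not available, so the following is only a sketch of the same idea on synthetic data. It draws a right-skewed, strictly positive sample as a stand-in for a column like total_bill, then compares the skewness before and after a log transformation. The `skewness` helper, the seed, and the distribution parameters are all assumptions, and the printed numbers will not match the 1.12 and -0.11 quoted above.

```python
import math
import random


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


random.seed(0)
# Synthetic stand-in for a right-skewed, strictly positive column.
bills = [random.lognormvariate(3.0, 0.75) for _ in range(5000)]
log_bills = [math.log(b) for b in bills]

before = skewness(bills)
after = skewness(log_bills)
print(f"skewness before log: {before:.2f}")
print(f"skewness after log:  {after:.2f}")
```

With the real dataset, the equivalent would presumably be along the lines of loading it with `seaborn.load_dataset("tips")` and calling `.skew()` on the total_bill column before and after applying a log.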

Kurtosis

‘Kurtosis’ is a measure of the ‘tailedness’ of the probability distribution of a real-valued random variable. It is often used to gauge how prone a dataset is to outliers (extreme values), because it is driven mainly by the extreme values at both ends of the distribution.

Types of Kurtosis and how to interpret

  1. Mesokurtic (Kurtosis = 3): The kurtosis is close to 3, so the excess kurtosis (kurtosis minus 3) is near zero. The distribution of extreme values (outliers) is similar to that of a normal distribution.

  2. Leptokurtic (Kurtosis > 3): This distribution has greater kurtosis than a mesokurtic one. The peak is typically drawn higher and sharper than the mesokurtic peak, and the tails on either side are heavy, which indicates large outliers. In the investment world, a leptokurtic return distribution is usually read as a sign of a higher-risk investment, since extreme returns are more likely.

  3. Platykurtic (Kurtosis < 3): This distribution has lower kurtosis than a mesokurtic one. The peak is typically drawn lower and broader than the mesokurtic peak, and the tails on either side are thin, which indicates few or small outliers. In the investment world, a platykurtic return distribution is usually read as a sign of a lower-risk investment, since extreme returns are less likely.

Source: tutorialspoint.com

The example below shows how to calculate kurtosis:
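
The original example is not available, so here is a minimal sketch under stated assumptions. It computes kurtosis as the fourth standardized moment (population form) and labels the result using the three types described above. The helper names, the tolerance `tol` used to decide what counts as "about 3", the seed, and the synthetic samples are all choices made for this sketch. Note that `scipy.stats.kurtosis` and pandas `Series.kurt` report excess kurtosis (kurtosis minus 3) by default, so a normal sample comes out near 0 there rather than near 3.

```python
import random


def kurtosis(values):
    """Population kurtosis: fourth central moment divided by variance squared."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2


def kurtosis_type(k, tol=0.2):
    """Label a kurtosis value; `tol` is an arbitrary tolerance for this sketch."""
    if abs(k - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if k > 3 else "platykurtic"


random.seed(0)
n = 20000
samples = {
    "normal": [random.gauss(0, 1) for _ in range(n)],
    "uniform": [random.uniform(-1, 1) for _ in range(n)],
    # The difference of two exponential draws follows a Laplace distribution.
    "laplace": [random.expovariate(1) - random.expovariate(1) for _ in range(n)],
}

for name, data in samples.items():
    k = kurtosis(data)
    print(f"{name:8s} kurtosis={k:.2f} excess={k - 3:.2f} -> {kurtosis_type(k)}")
```

The theoretical kurtosis values are 3 for the normal, 1.8 for the uniform, and 6 for the Laplace distribution, so the three samples should land in the mesokurtic, platykurtic, and leptokurtic groups respectively.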


Thank you for reading this article. You can reach me at https://www.linkedin.com/in/chetanambi/

Translated from: https://medium.com/towards-artificial-intelligence/what-are-skewness-and-kurtosis-3e854a01808c








