Boxplot for anomaly detection

In the previous article, I wrote about outlier detection using a simple statistical technique called the Z-score. While that's an easy way to create a filter for screening outliers, there's an even better way to do it: using boxplots.

Boxplots are an excellent statistical technique for understanding the distribution, dispersion and variation of univariate and categorical data, all in a single plot.

The purpose of this article is to introduce boxplot as a tool for outlier detection, and I’m doing so focusing on the following areas:

  • the statistical intuition behind boxplots
  • how they are used in outlier detection
  • a tiny bit of programming

Boxplot: an intuition

The boxplot is an effective tool to visualize the spread of data with respect to central values. I really don’t think you need to learn a lot of details, but below is a brief description to give a bit of intuition of how it works under the hood. Don’t feel bad if you don’t get it 100%.

A picture is worth a thousand words, so instead of describing the concept in words just take a look at the following figure top-to-bottom to build your own intuition. It all starts with a small dataset of seven observations: 1, 6, 5, 4, 4, 7, 8.

[Figure: building a boxplot step by step from the seven observations]

If you re-arrange the data from small to large, the mid-point is the median. The median splits the data into two halves. The mid-point of each half is called a "quartile". So we get two quartiles: the 1st quartile is the mid-point of the first half and the 3rd quartile is the mid-point of the second half. As you walk through the steps from the top, the final part of the figure shows a boxplot and the data it contains.
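
The steps above can be sketched in a few lines of Python. This is only an illustration of the "median of each half" idea on the seven observations; the helper name `quartiles_by_halves` is made up here, and leaving the middle value out of both halves when the count is odd is an assumption (other quartile conventions exist, and libraries that interpolate can return slightly different values).

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumption: when the number of values is odd, the middle value
    is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]           # values before the middle position
    upper_half = ordered[half + n % 2:]   # values after it
    return median(lower_half), median(ordered), median(upper_half)

data = [1, 6, 5, 4, 4, 7, 8]
q1, med, q3 = quartiles_by_halves(data)
print(q1, med, q3)  # 4 5 7
```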

Statistically speaking, a boxplot provides several pieces of information. Two important ones are the 1st and 3rd quartiles, represented by the two ends of the box. The distance between these two quartiles is called the Interquartile Range (IQR).

In the boxplot below, the length of the box is the IQR, and the "minimum" and "maximum" values are represented by the whiskers. The whiskers generally extend at most 1.5*IQR beyond either end of the box, that is, down to Q1 - 1.5*IQR and up to Q3 + 1.5*IQR. All data points outside these limits are flagged as outliers.
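
As a rough worked example of the 1.5*IQR rule, take the seven observations from earlier (1, 4, 4, 5, 6, 7, 8), whose quartiles by the mid-point method are 4 and 7. The function name `iqr_fences` and the parameter `k` are hypothetical, chosen only for this sketch.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a point counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of 1, 4, 4, 5, 6, 7, 8 taken as the mid-points of each half
lower, upper = iqr_fences(q1=4, q3=7)
print(lower, upper)  # -0.5 11.5
```

With these limits none of the seven observations is flagged, since they all lie between -0.5 and 11.5.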

[Figure: Statistical concepts associated with boxplots and positions of outliers]

If you've got the intuition roughly right, understanding how an "outlier" comes into play isn't that difficult. Check out the following figure.

[Figure: outliers plotted beyond the whiskers of a boxplot]

Generally, any data point beyond the min and max limits (represented by the whiskers at both ends of the box) is treated as an outlier.

Example in Python

Again, if you didn’t understand the statistical concept 100%, no hard feelings. We can drive a car without understanding a lot of its mechanics. But we do have to know how to drive!

Just like knowing how to drive, understanding how to implement an algorithm is the most important part of the business. Below is a small snippet to build that programming intuition in Python.

# import libraries
import numpy as np
import seaborn as sns
sns.set_style("whitegrid")

# data
data = [1, 4, 4, 5, 6, 7, 8, 13]

# create boxplot
sns.boxplot(y=data)
[Figure: boxplot of the data, with one point plotted above the upper whisker]

As you can see, one outlier is pretty clearly visible in this boxplot and we can easily filter that. We don’t know the exact value of the outlier but we know that it’s greater than 12. So let’s filter that outlier value.

# filter outliers
outliers = [i for i in data if i > 12]
print("Outliers are: ", outliers)

There you have it: the boxplot detects 13 as an outlier in the dataset. Whether this outlier is an anomaly is, of course, a different question that can only be answered separately using domain knowledge and additional techniques.
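
The threshold of 12 above was read off the plot by eye. As a possible alternative, the limits can be computed from the data with the 1.5*IQR rule. The sketch below assumes NumPy's default percentile method (linear interpolation); other quartile conventions can shift the limits slightly.

```python
import numpy as np

data = [1, 4, 4, 5, 6, 7, 8, 13]

# Quartiles with NumPy's default (linear interpolation) percentile method
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the points that fall outside the computed limits
outliers = [x for x in data if x < lower or x > upper]
print("Upper limit:", upper)      # 12.125
print("Outliers are:", outliers)  # [13]
```

Here the upper limit comes out just above 12, which agrees with the visual reading of the plot.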

Conclusion

The purpose of this article was to give the statistical intuition behind the boxplot and demonstrate how it works with a small programming example. The power of boxplots lies in the fact that you can "see" the extreme values and decide on the threshold for outliers by visual interpretation. The demo here was based on univariate data, but it would work in a similar fashion for a multivariate dataset and categorical values.
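
One way to read "a similar fashion" for categorical values is to apply the same 1.5*IQR rule separately within each category. The sketch below is only an illustration of that idea: the category names and the values for "B" are made up, and `iqr_outliers` is a hypothetical helper, not part of any library.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying beyond k*IQR from the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements for two categories (invented for illustration)
groups = {
    "A": [1, 4, 4, 5, 6, 7, 8, 13],
    "B": [10, 11, 12, 12, 13, 14, 30],
}

# Apply the rule to each category independently
flagged = {name: iqr_outliers(vals) for name, vals in groups.items()}
print(flagged)  # {'A': [13], 'B': [30]}
```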

If you liked the article feel free to follow me on Medium or Twitter.

Translated from: https://towardsdatascience.com/boxplot-for-anomaly-detection-9eac783382fd
