

In the previous article, I wrote about outlier detection using a simple statistical technique called the Z-score. While that’s an easy way to create a filter for screening outliers, there’s an even better way to do it: using boxplots.

Boxplots are an excellent statistical tool for understanding the distribution, dispersion and variation of univariate data, including data grouped by category, all in a single plot.


The purpose of this article is to introduce the boxplot as a tool for outlier detection, focusing on the following areas:


  • the statistical intuition behind boxplots

  • how they are used in outlier detection

  • a tiny bit of programming


The boxplot is an effective tool to visualize the spread of data with respect to central values. I really don’t think you need to learn a lot of details, but below is a brief description to give a bit of intuition of how it works under the hood. Don’t feel bad if you don’t get it 100%.

A picture is worth a thousand words, so instead of describing the concept in words just take a look at the following figure top-to-bottom to build your own intuition. It all starts with a small dataset of seven observations: 1, 6, 5, 4, 4, 7, 8.

If you re-arrange the data from small to large (1, 4, 4, 5, 6, 7, 8), the mid-point is the median, which is 5 here. The median splits the data into two halves. The mid-point of each half is called a “quartile”. So we get two quartiles: the 1st quartile is the mid-point of the first half and the 3rd quartile is the mid-point of the second half. As you walk through the steps from the top, in the final part of the figure you have a boxplot and the data it contains.
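If you’d like to check those steps by hand, here is a minimal sketch using only Python’s standard library. One assumption on my side: with an odd number of observations, I leave the median itself out of both halves. Other conventions exist and can give slightly different quartiles.

```python
from statistics import median

# the seven observations from the example
data = [1, 6, 5, 4, 4, 7, 8]

# step 1: re-arrange small to large
ordered = sorted(data)
n = len(ordered)
half = n // 2

# step 2: the mid-point of the sorted data is the median
med = median(ordered)

# step 3: split into two halves
# (assumption: for an odd count, the median is excluded from both halves)
lower_half = ordered[:half]
upper_half = ordered[half + 1:] if n % 2 else ordered[half:]

# step 4: the mid-point of each half is a quartile
q1 = median(lower_half)
q3 = median(upper_half)

print("Sorted:", ordered)
print("Median:", med, "| Q1:", q1, "| Q3:", q3)
```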

Statistically speaking, a boxplot provides several pieces of information. Two important ones are the quartiles, represented by the two ends of the box. The distance between these two quartiles is called the Interquartile Range (IQR).

In the boxplot below, the length of the box is the IQR, and the whiskers mark the minimum and maximum values that are not considered outliers. The whiskers generally extend up to 1.5*IQR beyond either end of the box. Therefore, all data points that lie farther than 1.5*IQR from the box are flagged as outliers.
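To make the 1.5*IQR rule concrete, here is a rough sketch for the seven observations used earlier. I’m assuming the interpolating quartile method from Python’s `statistics.quantiles` (with `method="inclusive"`), so the quartiles may differ slightly from the ones you get by taking the mid-point of each half by hand. The variable names are mine, purely for illustration.

```python
from statistics import quantiles

# the seven observations from the example
data = [1, 6, 5, 4, 4, 7, 8]

# quartiles (assumption: interpolating "inclusive" method)
q1, _, q3 = quantiles(data, n=4, method="inclusive")

# the length of the box
iqr = q3 - q1

# the farthest a whisker is allowed to reach on either side
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# whiskers stop at the most extreme data points inside the fences
whisker_low = min(x for x in data if x >= lower_fence)
whisker_high = max(x for x in data if x <= upper_fence)

print("Q1:", q1, "| Q3:", q3, "| IQR:", iqr)
print("Fences:", lower_fence, "to", upper_fence)
print("Whiskers:", whisker_low, "to", whisker_high)
```

Notice that the whiskers end at actual data points (1 and 8 here), not at the fences themselves, and that this small dataset has no points beyond the fences.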

Statistical concepts associated with boxplots and positions of outliers

If you’ve got the intuition about right, understanding how an “outlier” comes into play isn’t that difficult. Check out the following figure.

Generally, any data point outside the min and max values (represented by the whiskers at both ends of the box) is treated as an outlier.


Again, if you didn’t understand the statistical concept 100%, no hard feelings. We can drive a car without understanding a lot of its mechanics. But we do have to know how to drive!

Just like knowing how to drive, understanding how to implement an algorithm is the most important part of the business. Below is a small snippet to build that programming intuition in Python.

# import libraries
import numpy as np
import seaborn as sns
sns.set_style("whitegrid")

# data
data = [1, 4, 4, 5, 6, 7, 8, 13]

# create boxplot
sns.boxplot(y = data)

As you can see, one outlier is pretty clearly visible in this boxplot and we can easily filter that. We don’t know the exact value of the outlier but we know that it’s greater than 12. So let’s filter that outlier value.

# filter outliers
outliers = [i for i in data if i > 12]
print("Outliers are: ", outliers)

There you have it: the boxplot detects 13 as an outlier in the dataset. Whether this outlier is an anomaly or not is, of course, a different question that can only be answered separately using domain knowledge and additional techniques.
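If you’d rather not read the threshold off the plot, the same 1.5*IQR rule can be computed directly. Below is a minimal sketch, not the only way to do it: `iqr_outliers` is a hypothetical helper I made up for this article, and I’m assuming the interpolating quartile method from `statistics.quantiles`, which should line up with what the plot draws by default.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the points beyond k*IQR from the box, plus the two fences.

    Hypothetical helper for illustration; k=1.5 is the usual whisker length.
    """
    # quartiles (assumption: interpolating "inclusive" method)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)

# same data as in the boxplot above
data = [1, 4, 4, 5, 6, 7, 8, 13]

outliers, (lower, upper) = iqr_outliers(data)
print("Fences:", lower, "to", upper)
print("Outliers are: ", outliers)
```

The upper fence comes out just above 12, which agrees with what we eyeballed from the plot, and this version also checks the lower side for free.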


The purpose of this article was to give the statistical intuition behind the boxplot and demonstrate how it works with a small programming example. The power of boxplots lies in the fact that you can “see” the extreme values and decide on a threshold for outliers by visual interpretation. The demo here was based on univariate data, but it works in a similar fashion when you draw one boxplot per variable of a multivariate dataset or per group of a categorical variable.

