[Python] Data Analysis Section 6.3: Box and Whisker Plots | from Coursera "Applied Data Science with Python"

A box plot, sometimes called a box-and-whisker plot, is a method of showing aggregate statistics of various samples in a concise manner. The goal of the boxplot is to summarize the distribution of your data through a visualization of what's called the five-number summary: the extremes (often the minimum and maximum values), the center (usually the median of the data), and the first and third quartiles of your data. The quartiles of your data break it into four roughly equal-sized buckets, and so the first and third quartile markers -- sometimes called hinges -- show you the middle 50% of your data. Through the box plot we can get a sense of the weighting of the data in a fairly compact visual representation. Let's take a look.
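
Before plotting anything, it can help to see the five-number summary as plain numbers. The following is a minimal sketch, not part of the original lecture: the helper name `five_number_summary` is hypothetical, and it assumes NumPy's default (linear interpolation) percentile method.

```python
import numpy as np

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    values = np.asarray(values, dtype=float)
    # Percentiles 0 and 100 are the extremes; 25, 50 and 75 are the
    # first quartile, the median and the third quartile.
    minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
    return minimum, q1, median, q3, maximum

# A tiny sample that is easy to check by hand: the values 1 through 9
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```

For this sample the hinges are 3 and 7, so the box of a box plot would span the middle of the data from 3 to 7, with the median line at 5.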

# First we'll bring in our libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Now let's create three different samples with NumPy - one from the normal distribution,
# one uniform random sample on [0, 1), and one from a gamma distribution.

normal_sample = np.random.normal(loc=0.0, scale=1.0, size=10000)
random_sample = np.random.random(size=10000)
gamma_sample = np.random.gamma(2, size=10000)

# Next, let's put those in a pandas DataFrame.
df = pd.DataFrame({'normal': normal_sample, 
                   'random': random_sample, 
                   'gamma': gamma_sample})
df

# Now we can use the pandas describe function to see some summary statistics about our data frame.
# Each column has 10,000 entries. The mean values and standard deviations vary heavily.
df.describe()

This function shows the minimum and maximum values and three different percentage values. The 25% and 75% values bound what's called the interquartile range. There are four different quarters of the data. The first is between the minimum value and the 25% mark, and this 25% value is called the first quartile. The second quarter of the data is between the 25% mark and the 50% mark. The third is between the 50% and 75% marks, and the 75% mark is called the third quartile. And the final quarter is between the 75% mark and the maximum of the data.
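
The quartile rows reported by describe can also be used directly. As a small sketch (the `demo` frame, the seed, and the variable names are illustrative assumptions rather than the notebook's own `df`), the interquartile range is simply the 75% row minus the 25% row:

```python
import numpy as np
import pandas as pd

# An illustrative frame; the seed is arbitrary and only makes the run repeatable
rng = np.random.default_rng(0)
demo = pd.DataFrame({'normal': rng.normal(size=10000),
                     'gamma': rng.gamma(2, size=10000)})

stats = demo.describe()

# The interquartile range is the distance between the third and first quartiles
iqr = stats.loc['75%'] - stats.loc['25%']
print(iqr)
```

For a standard normal sample of this size the result should land near 1.35, the theoretical interquartile range of that distribution.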

Like standard deviation, the interquartile range is a measure of the variability of data, and it's common to plot this using a box plot. In a box plot, the median of the data is plotted as a straight line. Two boxes are formed, one above, which represents the 50% to 75% data group, and one below, which represents the 25% to 50% data group. Thin lines are then drawn out to the minimum and maximum values.

# To see a boxplot we just choose the column of the dataframe we are interested in and pass it
# to pyplot's boxplot function. matplotlib uses numpy arrays for data, but since pandas is built
# on top of numpy things work fluidly
plt.boxplot(df['normal'])

You'll see that matplotlib actually prints out a bunch of information about artists. Often we don't really want to see this, but it can be handy at times. To suppress this we simply put a semicolon at the end of our last statement. This is a Jupyter notebook trick that I've actually used a few times and haven't told you about -- it suppresses printing the last value in a cell. Be warned, this output suppression is Jupyter behavior, not standard Python!

# Now just the image
plt.boxplot(df['normal']);

Great, this gives us a basic box plot. Now let's add the other two samples to it. Unfortunately we can't just pass a whole pandas data frame to matplotlib. Instead we need to pull out each column and send them in as a list of values.

# plot boxplots for all three of df's columns
plt.boxplot([ df['normal'], df['random'], df['gamma'] ], whis=[0,100]);

All right, that gives us our three distributions. Now, we didn't normalize the scale, so that's a little wonky. But if we look at the gamma distribution, for instance, we see the tail of it is very, very long. So the maximum values are very far out. Let's take a look at this by itself in a histogram.

plt.hist(df['gamma'], bins=100);

Interesting, we see it starts at a moderate level, spikes up, then drops off much more gradually and does indeed have a very long tail. Let's add this to our box plot, and I'm going to take this as an opportunity to demonstrate something called inset axes.

Recall that we have one figure with one subplot. Since we didn't do anything fancy with subplots, that means we only have one axes object. We can actually overlay an axes on top of another within a figure. We do this by calling the inset_axes function on the existing axes and sending in details of the new axes that we want to create. The details we send are a position in x/y space and the width and height of the new plot.

plt.figure(figsize=(9,9))
# Our main figure is our boxplot
plt.boxplot([ df['normal'], df['random'], df['gamma'] ], whis=[0,100])
# Now let's add a new axes object on top of that axes! It will be overlaid on
# top, and we provide a bounding box of (0,0.6) as the bottom left, and
# (0.6,0.4) as width and height. These are fractions of the parent axes
ax2 = plt.gca().inset_axes([0,0.6,0.6,0.4])
# Now we can just plot our histogram right on there
ax2.hist(df['gamma'], bins=100, density=True)
# And since the histogram will have tick labels on the left and clash with
# the main figure, we can flip them to the right
ax2.yaxis.tick_right();

Pretty cool, isn't it? So in one figure here we have our boxplots of three distributions, and then we have a nice little inset image showing the histogram of the far right boxplot.

Remember again that each boxplot is our five-number summary -- a colored median line in the middle (red or orange, depending on your matplotlib version), then two boxes on either side of it which each represent 25% of the population, then whiskers which run out to the maximum and minimum of the data. This would be a great time to pause the video and play with the notebook, to clean up this figure by adding titles, legends, and the like. How would you make it clear to the reader that the inset histogram is about the boxplot on the far right, for instance?
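
One possible way to approach that exercise is sketched below. It is only an illustration, not the lecture's own answer: the title text, axis label, seed, and the `samples` dictionary are all assumptions, and the samples are regenerated so the block runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt

# Regenerate three samples like the ones above (the seed is arbitrary)
rng = np.random.default_rng(0)
samples = {'normal': rng.normal(size=10000),
           'random': rng.random(size=10000),
           'gamma': rng.gamma(2, size=10000)}

fig, ax = plt.subplots(figsize=(9, 9))
ax.boxplot(list(samples.values()), whis=[0, 100])

# Name each box on the x axis so the reader knows which sample it shows
ax.set_xticklabels(list(samples.keys()))
ax.set_title('Box plots of three samples')
ax.set_ylabel('Value')

# Same inset as before, now with a title tying it to the gamma box
inset = ax.inset_axes([0, 0.6, 0.6, 0.4])
inset.hist(samples['gamma'], bins=100, density=True)
inset.set_title('Histogram of the gamma sample')
inset.yaxis.tick_right()
```

Labelling the tick under each box and repeating the sample's name in the inset title is the simplest link between the two; an arrow or matching colors would be other options.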

Now, we often want to look at a boxplot not by seeing the maximum and minimum values but instead by emphasizing outliers. How outliers are detected really depends, and there are various mechanisms to determine whether an observation is an outlier or not. If we look at the documentation though, the default for matplotlib is that outliers are all data points that lie more than 1.5 times the interquartile range (IQR) beyond a hinge (the bottom or top of the box, i.e. the first or third quartile): below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The interquartile range is the distance between the two hinges, which captures the middle 50% of our data. So if we omit the whis parameter from the boxplot call, the whiskers stop at the most extreme data points inside those limits and the outliers are plotted individually.

# Nice big figure
plt.figure(figsize=(10,10))
# Now with outliers
plt.boxplot([ df['normal'], df['random'], df['gamma'] ]);

Each circle in the boxplot is a single outlier observation. The box plot is one of the more common plots that you might use as a data scientist, and matplotlib has significant support for different kinds of box plots. Here the matplotlib documentation is key. You can find links in the course resources to the API, which describes the box plot functionality.
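
The 1.5 * IQR rule described above can also be checked numerically. The sketch below is an assumption-laden illustration: the helper `iqr_outlier_mask` is a hypothetical name, it uses NumPy's default percentile interpolation, and it is meant to mirror the rule rather than reproduce matplotlib's internals exactly.

```python
import numpy as np

def iqr_outlier_mask(values, whis=1.5):
    """Flag points lying more than whis * IQR beyond the first or third quartile."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_limit = q1 - whis * iqr
    upper_limit = q3 + whis * iqr
    return (values < lower_limit) | (values > upper_limit)

# Samples similar to the ones plotted above (the seed is arbitrary)
rng = np.random.default_rng(0)
uniform_sample = rng.random(size=10000)
gamma_sample = rng.gamma(2, size=10000)

print('uniform outliers:', iqr_outlier_mask(uniform_sample).sum())
print('gamma outliers:', iqr_outlier_mask(gamma_sample).sum())
```

The uniform sample cannot produce any outliers under this rule, since its limits fall well outside the interval [0, 1), while the long right tail of the gamma sample yields many, which matches the column of circles above the gamma box.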

I've got one more plot to show you this week - a two-dimensional histogram, which is better known as a heat map. Then we'll look at a couple more advanced features of matplotlib.
