Pitfalls I ran into when drawing box plots in Python

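This post uses a worked example to explain what a box plot is made of and what it is for in data visualization, with a focus on how outliers are decided. The lower and upper edges are derived from Q2 - 1.5×IQR and Q4 + 1.5×IQR, where IQR is the interquartile range (Q2 and Q4 are this post's labels for the lower and upper quartiles). Changing the coefficient changes which points count as outliers. The post shows how the outliers change under different coefficients and asks whether the upper and lower multipliers can be set to different values.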

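I have 5 groups of data and want to plot them like the figure below, because I want to give each box its own upper and lower edge and draw every value outside that range as an outlier. I searched online for a long time without finding a way to do it; the real problem was that I had not understood how a box plot works.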

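As the figure below shows, every box plot consists of an upper edge, a lower edge, a box, and outliers. The top of the box is the upper quartile, the bottom is the lower quartile, and the line inside is the median.
A box plot has 5 parameters (the labels Q1 to Q5 are this post's own numbering of the five values, not the usual quartile names):
Lower edge (Q1);
Lower quartile (Q2), also called the "first quartile": the value at the 25% position once all values in the sample are sorted from smallest to largest;
Median (Q3), also called the "second quartile": the value at the 50% position of the sorted sample;
Upper quartile (Q4), also called the "third quartile": the value at the 75% position of the sorted sample;
Upper edge (Q5).
Outliers: values beyond the upper edge or the lower edge.
Do not assume, as I did, that the upper edge is the maximum and the lower edge is the minimum.
The edges are determined by Q2 - 1.5×IQR and Q4 + 1.5×IQR, where IQR = Q4 - Q2. matplotlib draws each whisker at the most extreme data point that still lies within these limits, so the plotted edge can sit inside the computed limit.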
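The rule above can be sketched in a few lines of NumPy. This is only an illustration: the function name `box_stats` and the toy sample are made up for this example, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Limits, whisker ends and outliers for a box plot with coefficient `whis`."""
    x = np.asarray(values, dtype=float)
    q_low, q_high = np.percentile(x, [25, 75])   # lower / upper quartile
    iqr = q_high - q_low                         # interquartile range
    lower_limit = q_low - whis * iqr
    upper_limit = q_high + whis * iqr
    inside = x[(x >= lower_limit) & (x <= upper_limit)]
    return {
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        # whiskers end at the most extreme data points inside the limits
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": np.sort(x[(x < lower_limit) | (x > upper_limit)]),
    }

# Hypothetical toy sample, not taken from the post
toy = [1, 10, 11, 12, 13, 14, 15, 16, 30]
stats = box_stats(toy)
print(stats)
```

For the toy sample the quartiles are 11 and 15, so the limits are 5 and 21; the whiskers end at 10 and 16, and 1 and 30 are reported as outliers.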

Let's verify this with a program:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data={'neutral':[55,52,52,52,51,51,50,50,50,48,48,48,47,47,47,47,47,46,46,46,46,45,45,45,45,44,44,44,44,44,44,44,43,43,43,43,43,42,42,42,42,42,42,41,41,41,41,41,41,41,40,40,40,40,40,40,40,40,39,39,39,38,38,38,38,38,38,38,38,38,38,37,37,37,37,37,37,37,37,37,37,37,37,37,36,36,36,36,36,36,36,36,36,36,36,36,35,35,35,35,35,35,35,35,35,35,34,34,34,34,34,34,34,34,34,34,34,33,33,33,33,33,33,33,33,32,32,32,32,32,32,32,32,32,32,31,31,31,31,31,31,31,30,30,30,30,30,30,30,30,30,29,29,29,29,29,29,28,28,28,28,28,28,28,27,27,27,27,27,27,27,27,27,26,26,26,26,26,25,25,25,25,25,25,24,24,24,24,23,22,21,21,20,20,20,20,20,18,16,12]}
df = pd.DataFrame(data)
print(df.describe())
df.plot.box(title="Consumer spending in each country",whis=1.5)
plt.grid(linestyle="--", alpha=0.3)
plt.show()

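Result
The plot shows one outlier (judging by its coordinate, it should be 12). The upper edge is the maximum of the series, but the lower edge is not the minimum. Let's work it out:
IQR = Q4 - Q2 = 40.25 - 30 = 10.25
Lower limit: Q2 - 1.5×IQR = 30 - 1.5×10.25 = 14.625. The value 12 falls below this limit, so it is flagged as an outlier. The lower edge itself is drawn at 16, the smallest value that is still at least 14.625.

Upper limit: Q4 + 1.5×IQR = 40.25 + 15.375 = 55.625. No value in the series exceeds this limit (the maximum is only 55), so there are no outliers at the top and the upper edge sits at the maximum, 55.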
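The arithmetic can be checked without drawing anything. The sketch below rebuilds the series from a value-to-count table; the counts were tallied by hand from the `neutral` list in the script, so treat them as an assumption and paste the original list instead if in doubt.

```python
import numpy as np

# value -> number of occurrences in the first 'neutral' list (hand-tallied)
counts = {
    55: 1, 52: 3, 51: 2, 50: 3, 48: 3, 47: 5, 46: 4, 45: 4, 44: 7,
    43: 5, 42: 6, 41: 7, 40: 8, 39: 3, 38: 10, 37: 13, 36: 12, 35: 10,
    34: 11, 33: 8, 32: 10, 31: 7, 30: 9, 29: 6, 28: 7, 27: 9, 26: 5,
    25: 6, 24: 4, 23: 1, 22: 1, 21: 2, 20: 5, 18: 1, 16: 1, 12: 1,
}
x = np.repeat(list(counts.keys()), list(counts.values()))

q_low, q_high = np.percentile(x, [25, 75])
iqr = q_high - q_low
lower_limit = q_low - 1.5 * iqr
upper_limit = q_high + 1.5 * iqr

inside = x[(x >= lower_limit) & (x <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = np.sort(x[(x < lower_limit) | (x > upper_limit)])

print("quartiles:", q_low, q_high, "IQR:", iqr)
print("limits:", lower_limit, upper_limit)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

The quartiles agree with the `df.describe()` figures used above (30 and 40.25), the whiskers land on 16 and 55, and 12 is the only value outside the limits.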

Now let's change the series and look at the result again:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data={'neutral':[80,56,52,52,51,51,50,50,50,48,48,48,47,47,47,47,47,46,46,46,46,45,45,45,45,44,44,44,44,44,44,44,43,43,43,43,43,42,42,42,42,42,42,41,41,41,41,41,41,41,40,40,40,40,40,40,40,40,39,39,39,38,38,38,38,38,38,38,38,38,38,37,37,37,37,37,37,37,37,37,37,37,37,37,36,36,36,36,36,36,36,36,36,36,36,36,35,35,35,35,35,35,35,35,35,35,34,34,34,34,34,34,34,34,34,34,34,33,33,33,33,33,33,33,33,32,32,32,32,32,32,32,32,32,32,31,31,31,31,31,31,31,30,30,30,30,30,30,30,30,30,29,29,29,29,29,29,28,28,28,28,28,28,28,27,27,27,27,27,27,27,27,27,26,26,26,26,26,25,25,25,25,25,25,24,24,24,24,23,22,21,21,20,20,20,20,20,18,16,13]}
df = pd.DataFrame(data)
print(df.describe())
df.plot.box(title="Consumer spending in each country",whis=1.5)
plt.grid(linestyle="--", alpha=0.3)
plt.show()

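As the figure below shows, 80 and 56 are now outliers. The quartiles have not changed (Q2 = 30, Q4 = 40.25), so the limits are still 14.625 and 55.625. Note that 13 is below 14.625, so with whis=1.5 it is also outside the lower limit.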

The coefficient can be changed, though. Let's try 2.0, which makes the limits

Q1 = Q2 - 2×IQR

Q5 = Q4 + 2×IQR

Only the whis argument changes, from whis=1.5 to whis=2.0:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data={'neutral':[80,56,52,52,51,51,50,50,50,48,48,48,47,47,47,47,47,46,46,46,46,45,45,45,45,44,44,44,44,44,44,44,43,43,43,43,43,42,42,42,42,42,42,41,41,41,41,41,41,41,40,40,40,40,40,40,40,40,39,39,39,38,38,38,38,38,38,38,38,38,38,37,37,37,37,37,37,37,37,37,37,37,37,37,36,36,36,36,36,36,36,36,36,36,36,36,35,35,35,35,35,35,35,35,35,35,34,34,34,34,34,34,34,34,34,34,34,33,33,33,33,33,33,33,33,32,32,32,32,32,32,32,32,32,32,31,31,31,31,31,31,31,30,30,30,30,30,30,30,30,30,29,29,29,29,29,29,28,28,28,28,28,28,28,27,27,27,27,27,27,27,27,27,26,26,26,26,26,25,25,25,25,25,25,24,24,24,24,23,22,21,21,20,20,20,20,20,18,16,13]}
df = pd.DataFrame(data)
print(df.describe())
df.plot.box(title="Consumer spending in each country",whis=2.0)
plt.grid(linestyle="--", alpha=0.3)
plt.show()
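To see numerically how the coefficient changes the verdict, the sketch below lists the values that fall outside the limits for whis=1.5 and whis=2.0 on the modified series. The value counts were tallied by hand from the list in the script and the helper name `flagged` is made up for this example, so treat both as assumptions.

```python
import numpy as np

# value -> number of occurrences in the modified 'neutral' list (hand-tallied)
counts = {
    80: 1, 56: 1, 52: 2, 51: 2, 50: 3, 48: 3, 47: 5, 46: 4, 45: 4, 44: 7,
    43: 5, 42: 6, 41: 7, 40: 8, 39: 3, 38: 10, 37: 13, 36: 12, 35: 10,
    34: 11, 33: 8, 32: 10, 31: 7, 30: 9, 29: 6, 28: 7, 27: 9, 26: 5,
    25: 6, 24: 4, 23: 1, 22: 1, 21: 2, 20: 5, 18: 1, 16: 1, 13: 1,
}
x = np.repeat(list(counts.keys()), list(counts.values()))

def flagged(x, whis):
    """Return (lower limit, upper limit, sorted outliers) for coefficient `whis`."""
    q_low, q_high = np.percentile(x, [25, 75])
    iqr = q_high - q_low
    lo, hi = q_low - whis * iqr, q_high + whis * iqr
    return lo, hi, np.sort(x[(x < lo) | (x > hi)])

for whis in (1.5, 2.0):
    lo, hi, out = flagged(x, whis)
    print(f"whis={whis}: limits=({lo}, {hi}), outliers={out.tolist()}")
```

With 1.5 the limits are 14.625 and 55.625; with 2.0 they widen to 9.5 and 60.75, so fewer points are flagged.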

I wonder whether the upper and lower multipliers can be set to different values.
https://www.bilibili.com/video/BV1Jt4y1i76Q?from=search&seid=12376886745584686578
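That is what this video talks about, but I did not really understand it, and I could not work out the relationship by experimenting with the following: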

df.plot.box(title="Consumer spending in each country",whis=(20,100))
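A note on what that call does: in matplotlib, a pair passed as `whis` is read as two percentiles, not as two IQR multipliers, so `whis=(20,100)` puts the whiskers at the 20th and 100th percentiles. One possible way to get different multipliers on each side is to compute the statistics yourself and hand them to `Axes.bxp`. The sketch below is only an illustration under that assumption; `asymmetric_stats`, `k_low`, `k_high` and the sample data are hypothetical names and values, not something from the video.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib import cbook

def asymmetric_stats(values, k_low=1.5, k_high=3.0, label="neutral"):
    """Box-plot statistics with separate IQR multipliers below and above the box."""
    x = np.asarray(values, dtype=float)
    stats = cbook.boxplot_stats(x, whis=1.5)[0]  # start from the default statistics
    stats["label"] = label
    lower_limit = stats["q1"] - k_low * stats["iqr"]
    upper_limit = stats["q3"] + k_high * stats["iqr"]
    inside = x[(x >= lower_limit) & (x <= upper_limit)]
    stats["whislo"] = inside.min()   # whiskers end at data points inside the limits
    stats["whishi"] = inside.max()
    stats["fliers"] = x[(x < lower_limit) | (x > upper_limit)]
    return stats

# Hypothetical sample, not taken from the post
sample = [13, 20, 25, 28, 30, 32, 34, 36, 38, 40, 42, 45, 56, 80]

# 1) separate multipliers: 1.0×IQR below the box, 2.0×IQR above it
custom = asymmetric_stats(sample, k_low=1.0, k_high=2.0)

# 2) percentile pair, as in whis=(20, 100)
by_percentile = cbook.boxplot_stats(sample, whis=(20, 100))[0]

fig, ax = plt.subplots()
ax.bxp([custom])                     # draw a box from precomputed statistics
ax.grid(linestyle="--", alpha=0.3)
# call fig.savefig("box.png") or plt.show() to look at the result
```

For several groups, build one statistics dictionary per group (each with its own multipliers) and pass the whole list to `ax.bxp`.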