Quantile-Based Outlier Detection


lizz2276


Article link: https://blog.csdn.net/lizz2276/article/details/119534619


>>> import numpy as np
>>> import pandas as pd
>>> df_outlier = pd.DataFrame({'age': [1,2,30,33,34,35,40,41,42,43,100,200],
...                            'height': [10,20,30,40,150,155,160,165,170,175,300,400],
...                            'weight': [10,20,22,25,100,110,120,130,140,150,133,141]})
>>> df_outlier
    age  height  weight
0     1      10      10
1     2      20      20
2    30      30      22
3    33      40      25
4    34     150     100
5    35     155     110
6    40     160     120
7    41     165     130
8    42     170     140
9    43     175     150
10  100     300     133
11  200     400     141

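>>> df_outlier = df_outlier.astype(float)  # NaN can only be stored in float columns
>>> for col in df_outlier.columns:
...     # Linear-interpolated percentiles; index 1 is Q1 and index 3 is Q3
...     percentile = np.percentile(df_outlier[col], [0, 25, 50, 75, 100])
...     iqr = percentile[3] - percentile[1]
...     up_limit = percentile[3] + iqr * 1.5
...     down_limit = percentile[1] - iqr * 1.5
...     # Replace values outside the limits with NaN
...     df_outlier.loc[(df_outlier[col] > up_limit) | (df_outlier[col] < down_limit), col] = np.nan
...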

>>> df_outlier
     age  height  weight
0    NaN    10.0    10.0
1    NaN    20.0    20.0
2   30.0    30.0    22.0
3   33.0    40.0    25.0
4   34.0   150.0   100.0
5   35.0   155.0   110.0
6   40.0   160.0   120.0
7   41.0   165.0   130.0
8   42.0   170.0   140.0
9   43.0   175.0   150.0
10   NaN   300.0   133.0
11   NaN     NaN   141.0
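
The per-column loop above can also be written without explicit iteration. The following is a minimal sketch, assuming pandas' default linear interpolation in `DataFrame.quantile` (which matches `np.percentile`'s default); the names `df` and `cleaned` are illustrative and not from the original post.

```python
import pandas as pd

df = pd.DataFrame({'age': [1, 2, 30, 33, 34, 35, 40, 41, 42, 43, 100, 200],
                   'height': [10, 20, 30, 40, 150, 155, 160, 165, 170, 175, 300, 400],
                   'weight': [10, 20, 22, 25, 100, 110, 120, 130, 140, 150, 133, 141]})

# Per-column quartiles; each is a Series indexed by column name.
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# Keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; everything else becomes NaN.
inside = (df >= q1 - 1.5 * iqr) & (df <= q3 + 1.5 * iqr)
cleaned = df.where(inside)

print(cleaned.isna().sum())
```

The comparison `df >= q1 - 1.5 * iqr` aligns the Series of limits with the DataFrame's columns, so each column is checked against its own limits. Unlike the loop, this version leaves `df` unchanged and returns a new DataFrame.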

=======================

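This method detects outliers using the interquartile range (IQR) of a box plot. It is also known as Tukey's test (the limits are often called Tukey's fences). A box plot is defined as follows:

(Image source: https://blog.csdn.net/weixin_39501270/article/details/77369597; will be removed on request.)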

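The interquartile range (IQR) is the difference between the upper quartile (Q3) and the lower quartile (Q1). Taking 1.5 times the IQR as the threshold, any point above Q3 + 1.5 × IQR or below Q1 - 1.5 × IQR is treated as an outlier. The Python code below implements this, mainly using numpy's percentile function.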

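import numpy as np

# df is assumed to be a DataFrame with a numeric 'length' column
Percentile = np.percentile(df['length'], [0, 25, 50, 75, 100])
IQR = Percentile[3] - Percentile[1]      # Q3 - Q1
UpLimit = Percentile[3] + IQR * 1.5      # upper limit
DownLimit = Percentile[1] - IQR * 1.5    # lower limit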
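
As a minimal sketch, the same rule can be wrapped in a small reusable helper. The function name `iqr_fences` and the parameter `k` are hypothetical and not from the original post; the sample values are the `age` column from the first example.

```python
import numpy as np
import pandas as pd


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR.

    Hypothetical helper for illustration; k=1.5 is the multiplier used in the text.
    """
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# The 'age' column from the first example.
age = pd.Series([1, 2, 30, 33, 34, 35, 40, 41, 42, 43, 100, 200])
lower, upper = iqr_fences(age)
is_outlier = (age < lower) | (age > upper)

print(lower, upper)              # 17.25 57.25
print(age[is_outlier].tolist())  # [1, 2, 100, 200]
```

Here Q1 = 32.25 and Q3 = 42.25, so IQR = 10 and the limits are 32.25 - 15 = 17.25 and 42.25 + 15 = 57.25, which is why the values 1, 2, 100 and 200 are flagged.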

You can also show the same thing visually with seaborn's boxplot function:

import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(y='length', data=df, ax=ax)
plt.show()

The diamond-shaped points in the resulting plot are the outliers.

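3 Handling outliers

Once outliers have been detected, they need to be handled in some way. The usual approaches fall roughly into the following categories:

Delete records containing outliers: remove the affected records outright;
Treat as missing values: treat the outliers as missing and apply a missing-value method;
Mean correction: replace the outlier with the mean of the observations immediately before and after it;
No treatment: run the data mining directly on the data set with its outliers;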
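
The options above can be sketched in pandas as follows. This is an illustration under assumptions, not the author's code: the series `s` is made-up data with one obvious outlier, and median imputation stands in for "a missing-value method" purely as an example.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one obvious outlier (200).
s = pd.Series([30.0, 33.0, 34.0, 200.0, 35.0, 40.0, 41.0])

q1, q3 = np.percentile(s, [25, 75])
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# 1. Delete the records that contain outliers.
dropped = s[~is_outlier]

# 2. Treat outliers as missing values, then apply a missing-value method
#    (median imputation here, chosen only as an example).
as_missing = s.mask(is_outlier)
imputed = as_missing.fillna(as_missing.median())

# 3. Mean correction: replace each outlier with the mean of the
#    previous and next observations.
neighbor_mean = (s.shift(1) + s.shift(-1)) / 2
corrected = s.where(~is_outlier, neighbor_mean)

# 4. No treatment: keep using s unchanged.
print(corrected.tolist())
```

The mean correction in step 3 assumes the observations are ordered (for example, a time series) and that each outlier has two non-outlier neighbors. An outlier at the first or last position would become NaN, and adjacent outliers would contaminate each other's replacement.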

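Whether to remove outliers depends on the situation. Some models are not very sensitive to outliers and perform well even when they are present. Others, such as logistic regression (LR), are quite sensitive to them, and leaving outliers untreated can lead to very poor results such as overfitting.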
