Removing outliers from data in Python

I have a data frame as follows:

ID  Value
A   70
A   80
B   75
C   10
B   50
A   1000
C   60
B   2000
..  ..

I would like to group this data by ID, remove the outliers from each group (the points a boxplot would show as outliers), and then calculate the mean.

So far I have:

import pandas as pd

grouped = df.groupby('ID')

statBefore = pd.DataFrame({'mean': grouped['Value'].mean(), 'median': grouped['Value'].median(), 'std': grouped['Value'].std()})

How can I find the outliers, remove them, and get the statistics?

Solution

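I believe the method you're referring to is to remove values that lie more than 1.5 times the interquartile range (IQR) away from the median. (A standard boxplot flags points more than 1.5 times the IQR below the first quartile or above the third quartile, so measuring from the median, as done here, is a stricter rule.) So first, calculate your initial statistics: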

statBefore = pd.DataFrame({'q1': grouped['Value'].quantile(.25),
                           'median': grouped['Value'].median(),
                           'q3': grouped['Value'].quantile(.75)})
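To make these statistics concrete, here is a minimal sketch that builds them for the rows listed in the question (the elided rows are left out) and derives two sets of limits from them: the median-based limits used in this answer, and the conventional quartile-based boxplot fences for comparison. The names `stats`, `median_lower`, `median_upper`, `fence_lower` and `fence_upper` are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Only the rows listed in the question; the elided rows are left out.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "C", "B", "A", "C", "B"],
    "Value": [70, 80, 75, 10, 50, 1000, 60, 2000],
})

# One quantile() call per group, then pivot the quantile level into columns.
stats = (
    df.groupby("ID")["Value"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
    .rename(columns={0.25: "q1", 0.5: "median", 0.75: "q3"})
)
stats["iqr"] = stats["q3"] - stats["q1"]

# Limits used by the rule in this answer: 1.5 * IQR either side of the median.
stats["median_lower"] = stats["median"] - 1.5 * stats["iqr"]
stats["median_upper"] = stats["median"] + 1.5 * stats["iqr"]

# Conventional boxplot fences, measured from the quartiles instead.
stats["fence_lower"] = stats["q1"] - 1.5 * stats["iqr"]
stats["fence_upper"] = stats["q3"] + 1.5 * stats["iqr"]

print(stats)
```

With so few rows per group the quartiles are dominated by the extreme values, so the two rules can disagree: for group A here, 1000 is above the median-based upper limit but still inside the quartile-based fence.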

And then determine whether values in the original DF are outliers:

def is_outlier(row):
    # Look up the quartiles and median of this row's group
    iq_range = statBefore.loc[row.ID, 'q3'] - statBefore.loc[row.ID, 'q1']
    median = statBefore.loc[row.ID, 'median']
    # Outlier: more than 1.5 * IQR above or below the group median
    return abs(row.Value - median) > 1.5 * iq_range

# Apply the function to the original df:
df['outlier'] = df.apply(is_outlier, axis=1)

# Filter to only the non-outliers:
df_no_outliers = df[~df['outlier']]
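Row-wise `apply` can be slow on large frames, and the question also asks for the statistics after removal. The following is a minimal vectorized sketch of the same median-based rule, assuming the columns are named `ID` and `Value` as in the question. The helper name `drop_group_outliers` and its parameters are hypothetical, and the sample frame holds only the rows listed in the question.

```python
import pandas as pd

def drop_group_outliers(df, group_col="ID", value_col="Value", k=1.5):
    """Keep rows within k * IQR of their group's median (hypothetical helper)."""
    values = df.groupby(group_col)[value_col]
    # transform() broadcasts each group statistic back onto the original rows.
    q1 = values.transform(lambda s: s.quantile(0.25))
    q3 = values.transform(lambda s: s.quantile(0.75))
    median = values.transform("median")
    distance = (df[value_col] - median).abs()
    return df[distance <= k * (q3 - q1)]

# Only the rows listed in the question; the elided rows are left out.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "C", "B", "A", "C", "B"],
    "Value": [70, 80, 75, 10, 50, 1000, 60, 2000],
})

df_no_outliers = drop_group_outliers(df)

# Same statistics as in the question, computed after removing outliers.
stat_after = df_no_outliers.groupby("ID")["Value"].agg(["mean", "median", "std"])
print(stat_after)
```

On this sample the rows with values 1000 and 2000 are dropped, so the group means become 75 for A, 62.5 for B and 35 for C. Using quartile-based fences instead would mean comparing `Value` against `q1 - k * (q3 - q1)` and `q3 + k * (q3 - q1)` rather than the distance from the median.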
