python离群点检测_如何从熊猫DataFrame中检测峰点(离群值)

我有一个带有多个速度值的熊猫数据帧,这些速度值是连续移动的值,但它是一个传感器数据,因此我们经常在中间出现误差的情况下,移动平均值似乎也无济于事,所以我可以采用什么方法用于从数据中删除这些离群值或峰点?

例:

data points={0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9}

在此数据中,如果我看到点4、4、5、6完全是离群值,那么在我使用具有5分钟窗框的滚动平均值来平滑这些值之前,但仍然得到了这些类型的亮点,我想删除它,有人可以建议我采取任何技术摆脱这些问题。

我有一张图片,可以更清晰地查看数据:5Z7Nd.png

如果您在此处看到数据如何显示一些必须删除的离群点?有什么想法摆脱这些问题的可能方法是什么?

解决方案

I really think z-score using scipy.stats.zscore() is the way to go here. Have a look at the related issue in this post. There they are focusing on which method to use before removing potential outliers. As I see it, your challenge is a bit simpler, since judging by the data provided, it would be pretty straight forward to identify potential outliers without having to transform the data. Below is a code snippet that does just that. Just remember though, that what does and does not look like outliers will depend entirely on your dataset. And after removing some outliers, what has not looked like an outlier before, suddenly will do so now. Have a look:

importmatplotlib.pyplotaspltimportpandasaspdimportnumpyasnpfromscipyimportstats# your data (as a list)data=[0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]# initial plotdf1=pd.DataFrame(data=data)df1.columns=['data']df1.plot(style='o')# Function to identify and remove outliersdefoutliers(df,level):# 1. temporary dataframedf=df1.copy(deep=True)# 2. Select a level for a Z-score to identify and remove outliersdf_Z=df[(np.abs(stats.zscore(df))

Originial data:

ZFEaT.png

Test run 1 : Z-score = 4:

WwKEK.png

As you can see, no data has been removed because the level was set too high.

Test run 2 : Z-score = 2:

qyuPz.png

Now we're getting somewhere. Two outliers have been removed, but there is still some dubious data left.

Test run 3 : Z-score = 1.2:

ECXVR.png

This is looking really good. The remaining data now seems to be a bit more evenly distributed than before. But now the data point highlighted by the original datapoint is starting to look a bit like a potential outlier. So where to stop? That's going to be entirely up to you!

EDIT: Here's the whole thing for an easy copy&paste:

importmatplotlib.pyplotaspltimportpandasaspdimportnumpyasnpfromscipyimportstats# your data (as a list)data=[0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]# initial plotdf1=pd.DataFrame(data=data)df1.columns=['data']df1.plot(style='o')# Function to identify and remove outliersdefoutliers(df,level):# 1. temporary dataframedf=df1.copy(deep=True)# 2. Select a level for a Z-score to identify and remove outliersdf_Z=df[(np.abs(stats.zscore(df))

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值