Outlier Detection in Python

Latest recommended article published on 2024-08-17 18:30:38

a useful man

Category: Data Science

Article link: https://blog.csdn.net/sinat_23971513/article/details/105243561

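This article introduces several methods for outlier detection in Python: standard deviation, interquartile range, Gaussian mixture models, elliptic envelope, isolation forest, local outlier factor, and DBSCAN clustering. Outlier detection helps find behaviour that does not fit the overall trend of the data, and can be used for data preprocessing or for identifying anomalous behaviour.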

Anomaly Detection

What are Outliers?
Statistical Methods for Univariate Data
Using Gaussian Mixture Models
Fitting an elliptic envelope
Isolation Forest
Local Outlier Factor
Using clustering method like DBSCAN

Outliers

New data points that do not fit the general trend (or distribution) of the whole dataset are known as outliers.
Data points that fit the general trend are known as inliers.
Learning models are affected by the presence of outliers.
Anomaly detection is another use of outlier detection, in which the goal is to find unusual behaviour.
Data points detected as outliers can be deleted from the dataset.
Outliers can also be marked before the data is used in learning methods.
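
The last two points, deleting outliers versus marking them, can be sketched with a boolean mask. This is only an illustration: the toy readings, the variable names, and the "more than 5 units from the median" rule are assumptions made for this example and do not come from the article.

```python
import numpy as np

# Hypothetical toy data: six ordinary readings and one clearly extreme value.
values = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 55.0])

# Assumed detection rule, for this sketch only:
# flag anything more than 5 units away from the median.
is_outlier = np.abs(values - np.median(values)) > 5

# Option 1: delete the flagged points from the dataset.
cleaned = values[~is_outlier]

# Option 2: keep every point and carry the flag alongside it
# (column 0 holds the value, column 1 holds 1.0 for outliers and 0.0 otherwise).
marked = np.column_stack([values, is_outlier.astype(float)])

print(cleaned)
print(marked)
```

Deleting is simpler, while marking keeps the original rows available in case the flagged points turn out to be the interesting ones, as in anomaly detection.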

Statistical Methods for Univariate Data

Using Standard Deviation Method - zscore
Using Interquartile Range Method - IQR

Using Standard Deviation Method

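If univariate data follows a Gaussian distribution, we can use the standard deviation to judge how far each point lies from the mean. Points whose z-score exceeds 3 in absolute value, that is, points more than three standard deviations from the mean, are treated as outliers here.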

import numpy as np
data = np.random.normal(size=1000)

Adding More Outliers

data[-5:] = [3.5,3.6,4,3.56,4.2]

from scipy.stats import zscore

Detecting Outliers

data[np.abs(zscore(data)) > 3]

array([3.05605991, 3.5       , 3.6       , 4.        , 3.56      ,
       4.2       ])
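
To make explicit what `zscore` computes, the sketch below reproduces it with plain NumPy: each value is centred on the sample mean and divided by the sample standard deviation (SciPy's default is `ddof=0`, which matches NumPy's default for `std`). The fixed seed, the use of `np.random.default_rng`, and the name `manual_z` are assumptions added for reproducibility and are not part of the original snippet.

```python
import numpy as np
from scipy.stats import zscore

# Assumed seed so the sketch is reproducible; the article draws unseeded samples.
rng = np.random.default_rng(42)
data = rng.normal(size=1000)

# Same injected outliers as in the article.
data[-5:] = [3.5, 3.6, 4, 3.56, 4.2]

# Manual z-score: subtract the mean, divide by the standard deviation (ddof=0).
manual_z = (data - data.mean()) / data.std()

# The manual version should agree with scipy.stats.zscore.
print(np.allclose(manual_z, zscore(data)))

# Flag points more than three standard deviations from the mean.
outliers = data[np.abs(manual_z) > 3]
print(outliers)
```

Because the mean and standard deviation are themselves computed from data that contains the outliers, very extreme values can inflate the standard deviation and shrink everyone's z-score, which is one reason the threshold of 3 is a convention rather than a guarantee.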

Using Interquartile Range

For univariate data that does not follow a Gaussian distribution, the interquartile range (IQR) is a way to detect outliers. The thresholds are placed 1.5 × IQR beyond the 25th and 75th percentiles.

from scipy.stats import iqr

data = np.random.normal(size=1000)
data[-5:]=[-2,9,11,-3,-21]
iqr_value = iqr(data)
lower_threshold = np.percentile(data,25) - iqr_value*1.5
upper_threshold = np.percentile(data,75