Anomaly Detection
- What are outliers?
- Statistical Methods for Univariate Data
- Using Gaussian Mixture Models
- Fitting an elliptic envelope
- Isolation Forest
- Local Outlier Factor
- Using a clustering method such as DBSCAN
Outliers
- Observations that do not follow the general trend (or distribution) of the rest of the data are known as outliers.
- Observations that do follow the general trend are known as inliers.
- Learning models are affected by the presence of outliers; a few extreme values can noticeably shift the fitted parameters.
- Anomaly detection is an application of outlier detection in which the goal is to find unusual behaviour.
- Observations detected as outliers can be removed from the dataset.
- Alternatively, outliers can be flagged before the data is passed to a learning method.
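The points above about model impact, removal, and flagging can be sketched with a small least-squares fit. This is only an illustration under stated assumptions: the straight-line data, the single corrupted value, and the names `is_outlier`, `slope_clean`, `slope_with_outlier`, and `slope_removed` are all made up for the example, and the outlier is labelled by hand rather than detected.

```python
import numpy as np

# Hypothetical data: an exact line y = 2x + 1 over ten points
x = np.arange(10.0)
y = 2 * x + 1

# Corrupt a single observation to act as an outlier
y_with_outlier = y.copy()
y_with_outlier[-1] = 100.0

# Fit a straight line (degree-1 polynomial) with and without the outlier
slope_clean, _ = np.polyfit(x, y, 1)
slope_with_outlier, _ = np.polyfit(x, y_with_outlier, 1)

# Flag the outlier with a boolean mask (labelled by hand here, an assumption)
is_outlier = np.zeros(x.size, dtype=bool)
is_outlier[-1] = True

# Option 1: remove flagged points before fitting
slope_removed, _ = np.polyfit(x[~is_outlier], y_with_outlier[~is_outlier], 1)

# Option 2: keep every point and carry the flag alongside the data
flagged = np.column_stack([x, y_with_outlier, is_outlier])

print(slope_clean, slope_with_outlier, slope_removed)
```

One corrupted value out of ten is enough to roughly triple the fitted slope here, while dropping the flagged point recovers the original line.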
Statistical Methods for Univariate Data
- Using the standard deviation method (z-score)
- Using the interquartile range method (IQR)
Using Standard Deviation Method
- If univariate data follows a Gaussian distribution, we can use the standard deviation to judge how far each value lies from the mean. Values more than 3 standard deviations away are commonly treated as outliers.
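import numpy as np
from scipy.stats import zscore

# 1000 standard-normal samples, with the last five overwritten by extreme values
data = np.random.normal(size=1000)
data[-5:] = [3.5, 3.6, 4, 3.56, 4.2]

# Keep only the values whose z-score is more than 3 in absolute value
data[np.abs(zscore(data)) > 3]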
array([3.05605991, 3.5, 3.6, 4., 3.56, 4.2])
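For reference, the z-score used above can be computed by hand with NumPy alone. The sketch below assumes a fixed seed (not in the original, added only for reproducibility) and uses the illustrative names `sample`, `z`, and `is_outlier`. `np.std` defaults to `ddof=0`, which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

# Fixed seed is an assumption, used only so the run is reproducible
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
sample[-5:] = [3.5, 3.6, 4, 3.56, 4.2]  # same planted values as the example above

# z-score: distance from the mean measured in standard deviations
# (population standard deviation, ddof=0)
z = (sample - sample.mean()) / sample.std()

# Three-sigma rule: flag values more than 3 standard deviations from the mean
is_outlier = np.abs(z) > 3
flagged = sample[is_outlier]

print(flagged)
```

The threshold of 3 is a convention rather than a fixed rule; a smaller value flags more points and a larger value flags fewer.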
Using Interquartile Range
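- For univariate data that does not follow a Gaussian distribution, the interquartile range (IQR) is a way to detect outliers: values more than 1.5 times the IQR below the first quartile or above the third quartile are flagged.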
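As a minimal sketch of the IQR rule, the helper below builds the two fences from the quartiles and returns a boolean mask. The function name `iqr_outlier_mask`, its `k` parameter, and the eight-value sample are assumptions made for illustration. The sample is chosen so that one extreme value inflates the standard deviation enough that its own z-score stays below 3, while the IQR fences still catch it.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return True where a value lies outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    spread = q3 - q1  # interquartile range
    return (values < q1 - k * spread) | (values > q3 + k * spread)

# Hypothetical small sample with one extreme value
small = np.array([10, 12, 11, 13, 12, 11, 10, 500], dtype=float)

# IQR rule: quartiles are barely affected by the extreme value
iqr_flags = iqr_outlier_mask(small)

# z-score rule on the same data, for comparison
z = (small - small.mean()) / small.std()
z_flags = np.abs(z) > 3

print(small[iqr_flags])  # values outside the IQR fences
print(small[z_flags])    # values with |z| > 3
```

Because quartiles depend on the ranks of the values rather than their magnitudes, the fences move very little when a single value becomes extreme, which is why the rule does not rely on a Gaussian shape.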
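from scipy.stats import iqr

# Normal samples with the last five overwritten by extreme values
data = np.random.normal(size=1000)
data[-5:] = [-2, 9, 11, -3, -21]

# IQR = 75th percentile minus 25th percentile
iqr_value = iqr(data)
# Lower fence: 1.5 * IQR below the first quartile
lower_threshold = np.percentile(data, 25) - iqr_value * 1.5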
upper_threshold = np.percentile(data,75