Background
Preface
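Statistical methods were the usual choice in the past, but they have drawbacks:
- Simplified assumptions about data representation
- Poor algorithmic scalability
- Low focus on interpretability
Data mining has since brought in methods from computer science, which put more weight on computational efficiency and intuitive data analysis than on mathematical precision.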
Introduction to Outlier Analysis
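Definition
On outliers: “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”
That is, the deviation is large enough to raise the suspicion that the observation was produced by a different mechanism.
Outliers are also called: abnormalities, discordants, deviants, anomalies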
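Information carried by an outlier
- Information about the observation itself
- Information about how the observation was generated (i.e. the “different mechanism”)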
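Outputs of outlier detection algorithms
- Outlier score: quantifies how strongly a point deviates
- Binary label: gives the final decision (outlier or not)
In practice the two are linked: a threshold is applied to the outlier scores to obtain the binary labels.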
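The link between scores and labels can be sketched as follows. This is only an illustration: the absolute z-score as outlier score, the threshold of 2.0, the sample data, and the function names are all assumptions made here, not something the notes prescribe.

```python
from statistics import mean, pstdev

def zscore_outlier_scores(values):
    """Outlier score (assumed here): absolute z-score of each value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values identical: nothing deviates.
        return [0.0 for _ in values]
    return [abs(v - mu) / sigma for v in values]

def to_binary_labels(scores, threshold):
    """Binary label: 1 (outlier) if the score exceeds the threshold, else 0."""
    return [1 if s > threshold else 0 for s in scores]

# Hypothetical 1-D sample with one obviously deviating value.
data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 25.0]
scores = zscore_outlier_scores(data)
labels = to_binary_labels(scores, threshold=2.0)
print(labels)
```

Keeping the score and the thresholding step separate makes it easy to re-threshold the same scores when a stricter or looser decision is needed.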
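Noise and anomalies
In practice, where the line between noise and anomalies falls depends on the analysis, but under the same outlier detection model the outlier score of noise is lower than that of anomalies (noise < anomalies).
For a concrete judgment, quantitative measures can be used:
- The sparsity of the underlying region
- Nearest neighbor based distance
- The fit to the underlying data distribution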
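As a rough illustration of the nearest neighbor based measure, the sketch below scores each point by its distance to its k-th nearest neighbor, so points in sparse regions get larger scores. The choice of k = 2, the Euclidean distance, the sample points, and the name `knn_distance_scores` are assumptions made for this example.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical 2-D sample: a tight cluster plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_distance_scores(points, k=2)
most_isolated = scores.index(max(scores))
print(most_isolated, scores)
```

This brute-force version compares every pair of points, so it only suits small data sets; it is meant to show the measure, not an efficient way to compute it.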
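Handling anomalies & noise
Overall approach: remove the noise (noise removal) and detect the anomalies (anomaly detection)
Common methods:
- Unsupervised methods: noise removal & anomaly detection, mostly used to build models in the exploratory phase;
- Supervised methods: anomaly detection tailored to a specific application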
The outlier detection process needs to be sensitive to the nature of the attributes and relationships in the underlying data.