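The LODA paper (Tomáš Pevný. Loda: Lightweight on-line detector of anomalies. Machine Learning, 102(2):275–304, 2016) argues that LODA's simplicity makes it particularly useful in domains where a large number of samples must be processed in real time, or where the data stream is subject to concept drift and the detector has to be updated online. Besides being fast and accurate, LODA can operate and update itself on data with missing variables, which makes it practical in domains with sensor outages. LODA can also identify the features in which a scrutinized sample deviates from the majority, which is useful when the goal is to find out what caused an anomaly. Notably, none of these properties increases LODA's low time and space complexity. The paper compares LODA with several state-of-the-art anomaly detectors in two settings: batch training and online training on data streams. Results on 36 datasets from the UCI repository illustrate these strengths and offer further insight into more general questions about batch and online anomaly detection.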
Another paper that builds on LODA: Saarinen I. Adaptive real-time anomaly detection for multi-dimensional streaming data. 2017.
LODA: API overview
Parameters:
- contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- n_bins (int, optional (default=10)) – The number of bins for the histogram.
- n_random_cuts (int, optional (default=100)) – The number of random cuts.
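To make n_bins and n_random_cuts more concrete, here is a minimal NumPy sketch of a single "random cut", loosely following the fit method in the source listing further down. The toy data, the seed, and the names w, zero_idx, projected, counts and edges are assumptions made for illustration; they are not part of the pyod API.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 9))          # toy data: 500 samples, 9 features

n_bins = 10
n_features = X.shape[1]
n_nonzero = int(np.sqrt(n_features))   # 3 non-zero weights when there are 9 features

# One random cut: a Gaussian direction with all but n_nonzero entries set to zero
w = rng.standard_normal(n_features)
zero_idx = rng.permutation(n_features)[:n_features - n_nonzero]
w[zero_idx] = 0.0

# Project every sample onto the sparse direction and build a 1-D histogram
projected = X @ w
counts, edges = np.histogram(projected, bins=n_bins)

print(counts)       # n_bins counts that sum to the number of samples
print(edges.shape)  # n_bins + 1 bin edges
```

LODA repeats this n_random_cuts times with independent directions and combines the per-histogram negative log-probabilities, so more cuts generally cost more time and memory while a larger n_bins gives finer but noisier histograms.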
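Note:
Deprecated since version 0.6.9: fit_predict and fit_predict_score will be removed in pyod 0.8.0. For consistency, call fit first and then read the labels_ attribute instead. Scoring can be done by calling an evaluation metric such as ROC AUC.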
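Related functions:
1. predict(X): Predict whether a particular sample is an outlier.
Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: outlier_labels – For each observation, whether it should be considered an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Return type: numpy array of shape (n_samples,)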
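The way contamination turns raw scores into 0/1 labels can be sketched with plain NumPy. This assumes the threshold is taken as the (1 − contamination) percentile of the training scores, which matches the description of threshold_ in the class docstring but is not confirmed here against the base class. The array train_scores is a random stand-in for decision_scores_, not real LODA output.

```python
import numpy as np

rng = np.random.default_rng(0)
train_scores = rng.normal(size=1000)   # stand-in for decision_scores_
contamination = 0.1

# Assumed rule: the threshold is the (1 - contamination) percentile of the scores
threshold = np.percentile(train_scores, 100 * (1 - contamination))
train_labels = (train_scores > threshold).astype(int)

# Labelling new samples only needs their scores and the stored threshold
new_scores = np.array([threshold - 1.0, threshold + 1.0])
new_labels = (new_scores > threshold).astype(int)

print(train_labels.sum())  # number of training samples flagged as outliers
print(new_labels)
```

Under this rule, roughly a contamination fraction of the training samples is labelled 1, and any new sample scoring above the same threshold is labelled an outlier.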
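2. predict_proba(X, method='linear'): Predict the probability of a sample being an outlier. Two methods are available:
(1) use min-max conversion to linearly transform the outlier scores into the range [0, 1];
(2) use unifying scores.
Parameters:
(1) X (numpy array of shape (n_samples, n_features)) – The input samples.
(2) method (str, optional (default='linear')) – The probability conversion method, either 'linear' or 'unify'.
Returns: outlier_probability – For each observation, the probability of being an outlier according to the fitted model, in the range [0, 1].
Return type: numpy array of shape (n_samples,)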
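The two conversions can be approximated as follows. The 'linear' part is a plain min-max scaling fitted on the training scores. The 'unify' part assumes a Gaussian scaling through the error function, which is my understanding of the "unifying scores" idea rather than something stated in the text above, so treat it as an assumption. The score arrays are made-up numbers.

```python
import math
import numpy as np

train_scores = np.array([0.5, 1.0, 1.5, 2.0, 4.0])   # hypothetical training scores
test_scores = np.array([0.0, 1.5, 5.0])              # hypothetical new scores

# 'linear': min-max scaling fitted on the training scores, clipped to [0, 1]
lo, hi = train_scores.min(), train_scores.max()
linear = np.clip((test_scores - lo) / (hi - lo), 0.0, 1.0)

# 'unify' (assumed form): standardise with the training mean and standard
# deviation, pass through erf, and clip negative values to 0
mu, sigma = train_scores.mean(), train_scores.std()
unify = np.array([
    max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2))))
    for s in test_scores
])

print(linear)
print(unify)
```

In both cases scores below the typical training range map to 0 and very large scores approach 1, so the output can be read as a rough outlier probability.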
The source code is as follows:
# -*- coding: utf-8 -*-
"""Loda: Lightweight on-line detector of anomalies
Adapted from tilitools (https://github.com/nicococo/tilitools) by
"""
# Author: Yue Zhao <zhaoy@cmu.edu>
# License: BSD 2 clause
from __future__ import division
from __future__ import print_function
import numpy as np
from sklearn.utils.validation import check_is_fitted
from sklearn.utils import check_array
from .base import BaseDetector
class LODA(BaseDetector):
    """Loda: Lightweight on-line detector of anomalies. See
    :cite:`pevny2016loda` for more information.

    Parameters
    ----------
    contamination : float in (0., 0.5), optional (default=0.1)
        The amount of contamination of the data set,
        i.e. the proportion of outliers in the data set. Used when fitting to
        define the threshold on the decision function.

    n_bins : int, optional (default = 10)
        The number of bins for the histogram.

    n_random_cuts : int, optional (default = 100)
        The number of random cuts.

    Attributes
    ----------
    decision_scores_ : numpy array of shape (n_samples,)
        The outlier scores of the training data.
        The higher, the more abnormal. Outliers tend to have higher
        scores. This value is available once the detector is
        fitted.

    threshold_ : float
        The threshold is based on ``contamination``. It is the
        ``n_samples * contamination`` most abnormal samples in
        ``decision_scores_``. The threshold is calculated for generating
        binary outlier labels.

    labels_ : int, either 0 or 1
        The binary labels of the training data. 0 stands for inliers
        and 1 for outliers/anomalies. It is generated by applying
        ``threshold_`` on ``decision_scores_``.
    """

    def __init__(self, contamination=0.1, n_bins=10, n_random_cuts=100):
        super(LODA, self).__init__(contamination=contamination)
        self.n_bins = n_bins
        self.n_random_cuts = n_random_cuts
        # np.float was removed in NumPy 1.24; the builtin float is equivalent
        self.weights = np.ones(n_random_cuts, dtype=float) / n_random_cuts
    def fit(self, X, y=None):
        """Fit detector. y is ignored in unsupervised methods.

        Parameters
        ----------
        X : numpy array of shape (n_samples, n_features)
            The input samples.

        y : Ignored
            Not used, present for API consistency by convention.

        Returns
        -------
        self : object
            Fitted estimator.
        """
        # validate inputs X and y (optional)
        X = check_array(X)
        self._set_n_classes(y)
        pred_scores = np.zeros([X.shape[0], 1])

        n_components = X.shape[1]
        n_nonzero_components = np.sqrt(n_components)
        # np.int was removed in NumPy 1.24; the builtin int truncates the same way
        n_zero_components = n_components - int(n_nonzero_components)

        self.projections_ = np.random.randn(self.n_random_cuts, n_components)
        self.histograms_ = np.zeros((self.n_random_cuts, self.n_bins))
        self.limits_ = np.zeros((self.n_random_cuts, self.n_bins + 1))
        for i in range(self.n_random_cuts):
            # sparsify the i-th projection by zeroing randomly chosen components
            rands = np.random.permutation(n_components)[:n_zero_components]
            self.projections_[i, rands] = 0.
            projected_data = self.projections_[i, :].dot(X.T)
            self.histograms_[i, :], self.limits_[i, :] = np.histogram(
                projected_data, bins=self.n_bins, density=False)
            # smooth empty bins, then normalise the counts to probabilities
            self.histograms_[i, :] += 1e-12
            self.histograms_[i, :] /= np.sum(self.histograms_[i, :])

            # calculate the scores for the training samples
            inds = np.searchsorted(self.limits_[i, :self.n_bins - 1],
                                   projected_data, side='left')
            pred_scores[:, 0] += -self.weights[i] * np.log(
                self.histograms_[i, inds])
        self.decision_scores_ = (pred_scores / self.n_random_cuts).ravel()
        self._process_decision_scores()
        return self
    def decision_function(self, X):
        """Predict raw anomaly score of X using the fitted detector.

        The anomaly score of an input sample is computed based on different
        detector algorithms. For consistency, outliers are assigned with
        larger anomaly scores.

        Parameters
        ----------
        X : numpy array of shape (n_samples, n_features)
            The training input samples. Sparse matrices are accepted only
            if they are supported by the base estimator.

        Returns
        -------
        anomaly_scores : numpy array of shape (n_samples,)
            The anomaly score of the input samples.
        """
        check_is_fitted(self, ['projections_', 'decision_scores_',
                               'threshold_', 'labels_'])

        X = check_array(X)
        pred_scores = np.zeros([X.shape[0], 1])
        for i in range(self.n_random_cuts):
            projected_data = self.projections_[i, :].dot(X.T)
            inds = np.searchsorted(self.limits_[i, :self.n_bins - 1],
                                   projected_data, side='left')
            pred_scores[:, 0] += -self.weights[i] * np.log(
                self.histograms_[i, inds])
        pred_scores /= self.n_random_cuts
        return pred_scores.ravel()
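The bin lookup shared by fit and decision_function can be tried in isolation. The sketch below reuses the same np.histogram and np.searchsorted calls on one made-up projection; the helper name bin_lookup and the toy values are assumptions for illustration. It shows that values far outside the training range still map to a valid histogram index, so the logarithm is always defined.

```python
import numpy as np

rng = np.random.default_rng(0)
projected_train = rng.normal(size=200)   # stand-in for one projection of the training data
n_bins = 10

# Histogram of the projected training data, smoothed and normalised as in fit
counts, limits = np.histogram(projected_train, bins=n_bins, density=False)
probs = counts + 1e-12
probs = probs / probs.sum()

def bin_lookup(values):
    """Hypothetical helper: same searchsorted call as in the listing above."""
    return np.searchsorted(limits[:n_bins - 1], values, side='left')

# Values far below, inside, and far above the training range
new_values = np.array([-100.0, 0.0, 100.0])
inds = bin_lookup(new_values)
scores = -np.log(probs[inds])            # per-projection negative log-probability

print(inds)    # every index lies in [0, n_bins - 1]
print(scores)  # finite, because empty bins were smoothed with 1e-12
```

Because only the first n_bins - 1 edges are searched, the returned index can never exceed n_bins - 1, which is why no explicit clipping appears in the class.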