LODA: Lightweight On-line Detector of Anomalies

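The LODA paper (Tomáš Pevný. Loda: Lightweight on-line detector of anomalies. Machine Learning, 102(2):275–304, 2016) makes the following points. LODA's simplicity is particularly useful in domains where a large number of samples must be processed in real time, or where the data stream is subject to concept drift and the detector has to be updated online. Besides being fast and accurate, LODA can operate on and be updated with data that has missing variables, which makes it practical in domains with sensor outages. LODA can also identify the features in which a scrutinized sample deviates from the majority, which helps when the goal is to find out what caused an anomaly. Notably, none of these properties increases LODA's low time and space complexity. The paper compares LODA with several state-of-the-art anomaly detectors in two settings: batch training and online training on data streams. Results on 36 datasets from the UCI repository illustrate the strengths of the approach and also offer insight into more general questions about batch and online anomaly detection.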

A follow-up paper that improves on LODA: Saarinen I. Adaptive real-time anomaly detection for multi-dimensional streaming data. 2017.

LODA: API overview

Parameters:
  1. contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

  2. n_bins (int, optional (default = 10)) – The number of bins for the histogram.

  3. n_random_cuts (int, optional (default = 100)) – The number of random cuts.
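
To make the role of contamination concrete, the sketch below assumes the threshold is taken as the (1 - contamination) percentile of the training scores, which is consistent with the description of threshold_ in the source docstring further down. The scores array and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical training scores: higher means more abnormal.
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.12, 0.18, 0.22, 0.28, 5.0])
contamination = 0.1

# Assumption: the threshold is the (1 - contamination) percentile of the scores.
threshold = np.percentile(scores, 100 * (1 - contamination))

# Samples scoring above the threshold are labelled 1 (outlier), the rest 0.
labels = (scores > threshold).astype(int)
print(threshold, labels)
```

With contamination=0.1 and ten samples, roughly one sample ends up above the threshold; raising contamination lowers the threshold and flags more samples.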

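Note:

Deprecated since version 0.6.9: fit_predict and fit_predict_score will be removed in pyod 0.8.0. For consistency, call fit first and then read the labels_ attribute instead. Scoring can be done by calling an evaluation method such as ROC AUC.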
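
As a sketch of the scoring step mentioned in the note, the snippet below computes ROC AUC from outlier scores and ground-truth labels with plain NumPy, using the rank-based (Mann-Whitney) formulation. The helper name roc_auc and the toy arrays are illustrative and not part of pyod. In practice sklearn.metrics.roc_auc_score is the usual choice, fed with decision_scores_ after calling fit.

```python
import numpy as np

def roc_auc(y_true, scores):
    """Probability that a random outlier outscores a random inlier.

    Ties count as one half. Illustrative helper, not a pyod function.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]  # scores of true outliers
    neg = scores[y_true == 0]  # scores of true inliers
    # Compare every outlier score with every inlier score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy ground truth and scores (1 = outlier); the values are made up.
y = [0, 0, 0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
auc = roc_auc(y, s)
print(auc)
```

The pairwise comparison is quadratic in the number of samples, so this form is only meant for small illustrative inputs.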

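Related functions:
1. predict(X): predicts whether each sample is an outlier.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.

Returns: outlier_labels – For each observation, whether it should be considered an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Return type: numpy array of shape (n_samples,)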

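2. predict_proba(X, method='linear')

Predicts the probability of each sample being an outlier. Two methods are available:
(1) Linearly convert the outlier scores to the range [0,1] using min-max conversion.
(2) Use unifying scores.

Parameters:
(1) X (numpy array of shape (n_samples, n_features)) – The input samples.
(2) method (str, optional (default='linear')) – The probability conversion method, either 'linear' or 'unify'.

Returns: outlier_probability – For each observation, the probability that it is an outlier according to the fitted model, in the range [0,1].

Return type: numpy array of shape (n_samples,)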
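
The two conversions can be sketched with NumPy as follows. This is an approximation under stated assumptions rather than pyod's exact code: 'linear' is taken to be min-max scaling fitted on the training scores and clipped to [0, 1], and 'unify' is taken to be Gaussian error-function scaling of standardized scores. The function names and sample values are illustrative.

```python
import math
import numpy as np

def linear_proba(train_scores, test_scores):
    """Min-max scale test scores with the training range, clipped to [0, 1]."""
    lo, hi = train_scores.min(), train_scores.max()
    return np.clip((test_scores - lo) / (hi - lo), 0.0, 1.0)

def unify_proba(train_scores, test_scores):
    """Standardize with training mean/std, then apply erf and clip to [0, 1]."""
    mu, sigma = train_scores.mean(), train_scores.std()
    z = (test_scores - mu) / (sigma * math.sqrt(2))
    return np.clip(np.array([math.erf(v) for v in z]), 0.0, 1.0)

train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up training scores
test = np.array([0.0, 3.0, 5.0, 9.0])        # made-up new scores

p_lin = linear_proba(train, test)
p_uni = unify_proba(train, test)
print(p_lin)
print(p_uni)
```

In both variants, scores at or below the typical training level map to values near 0 and scores far above it map to values near 1.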

The source code is as follows:

# -*- coding: utf-8 -*-
"""Loda: Lightweight on-line detector of anomalies
Adapted from tilitools (https://github.com/nicococo/tilitools) by
"""
# Author: Yue Zhao <zhaoy@cmu.edu>
# License: BSD 2 clause

from __future__ import division
from __future__ import print_function

import numpy as np
from sklearn.utils.validation import check_is_fitted
from sklearn.utils import check_array

from .base import BaseDetector


class LODA(BaseDetector):
    """Loda: Lightweight on-line detector of anomalies. See
    :cite:`pevny2016loda` for more information.

    Parameters
    ----------
    contamination : float in (0., 0.5), optional (default=0.1)
        The amount of contamination of the data set,
        i.e. the proportion of outliers in the data set. Used when fitting to
        define the threshold on the decision function.

    n_bins : int, optional (default = 10)
        The number of bins for the histogram.

    n_random_cuts : int, optional (default = 100)
        The number of random cuts.

    Attributes
    ----------
    decision_scores_ : numpy array of shape (n_samples,)
        The outlier scores of the training data.
        The higher, the more abnormal. Outliers tend to have higher
        scores. This value is available once the detector is
        fitted.

    threshold_ : float
        The threshold is based on ``contamination``. It is the
        ``n_samples * contamination`` most abnormal samples in
        ``decision_scores_``. The threshold is calculated for generating
        binary outlier labels.

    labels_ : int, either 0 or 1
        The binary labels of the training data. 0 stands for inliers
        and 1 for outliers/anomalies. It is generated by applying
        ``threshold_`` on ``decision_scores_``.
    """

    def __init__(self, contamination=0.1, n_bins=10, n_random_cuts=100):
        super(LODA, self).__init__(contamination=contamination)
        self.n_bins = n_bins
        self.n_random_cuts = n_random_cuts
        self.weights = np.ones(n_random_cuts, dtype=float) / n_random_cuts

    def fit(self, X, y=None):
        """Fit detector. y is ignored in unsupervised methods.

        Parameters
        ----------
        X : numpy array of shape (n_samples, n_features)
            The input samples.

        y : Ignored
            Not used, present for API consistency by convention.

        Returns
        -------
        self : object
            Fitted estimator.
        """
        # validate inputs X and y (optional)
        X = check_array(X)
        self._set_n_classes(y)
        pred_scores = np.zeros([X.shape[0], 1])

        n_components = X.shape[1]
        n_nonzero_components = np.sqrt(n_components)
        n_zero_components = n_components - int(n_nonzero_components)

        self.projections_ = np.random.randn(self.n_random_cuts, n_components)
        self.histograms_ = np.zeros((self.n_random_cuts, self.n_bins))
        self.limits_ = np.zeros((self.n_random_cuts, self.n_bins + 1))
        for i in range(self.n_random_cuts):
            rands = np.random.permutation(n_components)[:n_zero_components]
            self.projections_[i, rands] = 0.
            projected_data = self.projections_[i, :].dot(X.T)
            self.histograms_[i, :], self.limits_[i, :] = np.histogram(
                projected_data, bins=self.n_bins, density=False)
            self.histograms_[i, :] += 1e-12
            self.histograms_[i, :] /= np.sum(self.histograms_[i, :])

            # calculate the scores for the training samples
            inds = np.searchsorted(self.limits_[i, :self.n_bins - 1],
                                   projected_data, side='left')
            pred_scores[:, 0] += -self.weights[i] * np.log(
                self.histograms_[i, inds])

        self.decision_scores_ = (pred_scores / self.n_random_cuts).ravel()
        self._process_decision_scores()

        return self


    def decision_function(self, X):
        """Predict raw anomaly score of X using the fitted detector.

        The anomaly score of an input sample is computed based on different
        detector algorithms. For consistency, outliers are assigned with
        larger anomaly scores.

        Parameters
        ----------
        X : numpy array of shape (n_samples, n_features)
            The training input samples. Sparse matrices are accepted only
            if they are supported by the base estimator.

        Returns
        -------
        anomaly_scores : numpy array of shape (n_samples,)
            The anomaly score of the input samples.
        """
        check_is_fitted(self, ['projections_', 'decision_scores_',
                               'threshold_', 'labels_'])

        X = check_array(X)
        pred_scores = np.zeros([X.shape[0], 1])
        for i in range(self.n_random_cuts):
            projected_data = self.projections_[i, :].dot(X.T)
            inds = np.searchsorted(self.limits_[i, :self.n_bins - 1],
                                   projected_data, side='left')
            pred_scores[:, 0] += -self.weights[i] * np.log(
                self.histograms_[i, inds])
        pred_scores /= self.n_random_cuts
        return pred_scores.ravel()
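
The listing above is a class that depends on pyod's BaseDetector. To see the core idea in isolation, here is a minimal standalone sketch with NumPy: sparse random projections, one histogram per projection, and the average negative log bin probability as the anomaly score. The function names (loda_fit, loda_score), the seeded generator, and the toy data are assumptions made for illustration. The bin lookup here maps each value onto the interior histogram edges, which is a deliberate simplification and not identical to the searchsorted call in the listing.

```python
import numpy as np

def loda_fit(X, n_bins=10, n_random_cuts=100, seed=0):
    """Build sparse random projections and a normalized histogram for each."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Keep roughly sqrt(d) nonzero weights per projection, zero out the rest.
    n_zero = n_features - int(np.sqrt(n_features))
    projections = rng.standard_normal((n_random_cuts, n_features))
    hists = np.zeros((n_random_cuts, n_bins))
    edges = np.zeros((n_random_cuts, n_bins + 1))
    for i in range(n_random_cuts):
        projections[i, rng.permutation(n_features)[:n_zero]] = 0.0
        counts, edges[i] = np.histogram(X @ projections[i], bins=n_bins)
        counts = counts + 1e-12           # avoid log(0) for empty bins
        hists[i] = counts / counts.sum()  # bin probabilities
    return projections, hists, edges

def loda_score(X, projections, hists, edges):
    """Average negative log bin probability over all projections."""
    scores = np.zeros(X.shape[0])
    for w, h, e in zip(projections, hists, edges):
        # Interior edges map each value to a bin index in [0, n_bins - 1];
        # values outside the training range fall into the first or last bin.
        idx = np.searchsorted(e[1:-1], X @ w, side="right")
        scores += -np.log(h[idx])
    return scores / len(projections)

rng = np.random.default_rng(42)
X_train = rng.standard_normal((500, 4))     # toy inliers
X_test = np.vstack([np.zeros((1, 4)),       # a typical point
                    np.full((1, 4), 6.0)])  # a far-away point
model = loda_fit(X_train)
scores = loda_score(X_test, *model)
print(scores)
```

The far-away point lands in sparsely populated tail bins for most projections, so its score comes out higher than that of the point at the origin.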
