Scikit Learn - Anomaly Detection

This chapter explains what anomaly detection is in Sklearn and how it is used to identify unusual data points.

Anomaly detection is a technique for identifying data points in a dataset that do not fit well with the rest of the data. It has many applications in business, such as fraud detection, intrusion detection, system health monitoring, surveillance, and predictive maintenance. Anomalies, which are also called outliers, can be divided into the following three categories −

  • Point anomalies − These occur when an individual data instance is anomalous with respect to the rest of the data.

  • Contextual anomalies − These are context specific: a data instance is anomalous only in a particular context.

  • Collective anomalies − These occur when a collection of related data instances is anomalous with respect to the entire dataset, even if the individual values are not anomalous on their own.

Methods

Two methods, outlier detection and novelty detection, can be used for anomaly detection. It is important to understand the distinction between them.

Outlier detection

The training data contains outliers, which are defined as observations that are far from the rest of the data. For that reason, outlier detection estimators try to fit the region where the training data is most concentrated while ignoring the deviant observations. It is also known as unsupervised anomaly detection.
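As a rough illustration of outlier detection, the sketch below fits an estimator on training data that already contains outliers and labels that same data. The synthetic dataset, the choice of IsolationForest, and the contamination value are assumptions made for this example only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Hypothetical polluted training set: 200 inliers around the origin
# plus 10 observations placed far away from them.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6.0, high=10.0, size=(10, 2))
X_train = np.vstack([inliers, outliers])

# contamination is our guess at the share of outliers in X_train (an assumption).
detector = IsolationForest(contamination=10 / 210, random_state=0)

# fit_predict learns from the polluted data and labels the same data:
# 1 for inliers, -1 for outliers.
labels = detector.fit_predict(X_train)
n_flagged = int((labels == -1).sum())
print("flagged as outliers:", n_flagged)
```

Because the estimator is trained and evaluated on the same polluted data, no separate clean training set is required.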

Novelty detection

It is concerned with detecting an unobserved pattern in new observations that was not present in the training data. Here, the training data is not polluted by outliers. It is also known as semi-supervised anomaly detection.
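A minimal sketch of novelty detection, assuming a clean synthetic training set and OneClassSVM as the estimator (both are choices made for this example, as are the nu and gamma values). The model is fitted on clean data only and then asked about new observations.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)

# Hypothetical clean training data: no outliers are mixed in.
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# New observations: one close to the training cloud, one far from it.
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])

# nu roughly bounds the share of training points treated as errors (assumed value).
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

# predict returns 1 for observations that look like the training data
# and -1 for observations that look novel.
new_labels = detector.predict(X_new)
print(new_labels)
```

The far-away point should come back labeled -1, since it lies well outside the region covered by the training data.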

scikit-learn provides a set of ML tools that can be used for both outlier detection and novelty detection. These tools first learn from the data in an unsupervised manner by using the fit() method as follows −


estimator.fit(X_train)

New observations can then be sorted as inliers (labeled 1) or outliers (labeled -1) by using the predict() method as follows −


estimator.predict(X_test)
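The fit-then-predict pattern can be sketched with several estimators at once. The data below is synthetic, and the three estimators and their settings are picked only to show that they share the same interface.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(1)
X_train = rng.normal(size=(300, 2))           # hypothetical training data
X_test = np.array([[0.1, -0.2], [9.0, 9.0]])  # a typical point and a distant point

estimators = {
    "EllipticEnvelope": EllipticEnvelope(random_state=1),
    "IsolationForest": IsolationForest(random_state=1),
    "OneClassSVM": OneClassSVM(nu=0.05),
}

predictions = {}
for name, estimator in estimators.items():
    estimator.fit(X_train)                           # unsupervised: no labels passed
    predictions[name] = estimator.predict(X_test)    # array of 1 (inlier) / -1 (outlier)
    print(name, predictions[name])
```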

The estimator first computes a raw scoring function, and the predict() method then applies a threshold to that raw score. We can access the raw scoring function with the score_samples() method and control the threshold with the contamination parameter.
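To see how contamination moves the threshold, the following sketch fits the same model twice with different contamination values and counts how many training points are flagged. IsolationForest and the synthetic data are assumptions for this example; other estimators that accept contamination should behave similarly.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(7)
X_train = rng.normal(size=(400, 2))  # hypothetical training data

flagged = {}
for contamination in (0.05, 0.20):
    estimator = IsolationForest(contamination=contamination, random_state=7)
    estimator.fit(X_train)

    # Raw scores: the lower the score, the more abnormal the observation.
    raw_scores = estimator.score_samples(X_train)

    # predict applies the threshold stored in offset_, which depends on contamination.
    labels = estimator.predict(X_train)
    flagged[contamination] = int((labels == -1).sum())
    print(contamination, estimator.offset_, flagged[contamination])
```

With the same random_state the raw scores are identical in both runs; only the threshold changes, so the larger contamination value flags more points.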

The estimators also provide a decision_function() method, which gives outliers a negative value and inliers a non-negative value.


estimator.decision_function(X_test)
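The relationship between decision_function(), score_samples(), and predict() can be checked on a small example. This sketch assumes IsolationForest and synthetic data; for this estimator the decision function is the raw score shifted by the fitted offset_.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(5)
X_train = rng.normal(size=(300, 2))                    # hypothetical training data
X_test = rng.uniform(low=-6.0, high=6.0, size=(50, 2))  # hypothetical new data

estimator = IsolationForest(contamination=0.1, random_state=5).fit(X_train)

decision = estimator.decision_function(X_test)  # negative -> outlier
raw_scores = estimator.score_samples(X_test)    # unshifted raw scores
predicted = estimator.predict(X_test)           # 1 or -1

# The sign of the decision function should agree with the predicted label.
labels_from_sign = np.where(decision < 0, -1, 1)
print(np.array_equal(predicted, labels_from_sign))
```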

Sklearn algorithms for Outlier Detection

Let us begin by understanding what an elliptic envelope is.

Fitting an elliptic envelope

This algorithm assumes that regular data comes from a known distribution, such as a Gaussian distribution. For outlier detection, Scikit-learn provides an object named covariance.EllipticEnvelope.

This object fits a robust covariance estimate to the data and thus fits an ellipse to the central data points, ignoring the points outside the central mode.
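A short sketch of fitting an elliptic envelope, assuming Gaussian inliers with a made-up mean and covariance plus a handful of distant outliers. The contamination value simply matches the share of outliers injected here.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)

# Hypothetical data: 300 Gaussian inliers and 15 distant outliers.
true_mean = np.array([2.0, -1.0])
true_cov = np.array([[2.0, 0.8], [0.8, 1.0]])
inliers = rng.multivariate_normal(true_mean, true_cov, size=300)
outliers = rng.uniform(low=10.0, high=15.0, size=(15, 2))
X = np.vstack([inliers, outliers])

envelope = EllipticEnvelope(contamination=15 / 315, random_state=0).fit(X)

# Robust estimates of the center and shape of the fitted ellipse.
print(envelope.location_)
print(envelope.covariance_)

# A point at the true center and a point far outside the ellipse.
checks = envelope.predict(np.array([[2.0, -1.0], [12.0, 12.0]]))
print(checks)
```

Because the estimate is robust, the fitted location should stay close to the center of the inliers rather than being pulled toward the outlier cluster.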

Parameters

The following table lists the parameters used by the sklearn.covariance.EllipticEnvelope class −

Sr.No   Parameter & Description

1       store_precision − Boolean, optional, default = True
        Specifies whether the estimated precision is stored.

2       assume_centered − Boolean, optional, default = False
        If set to False, the robust location and covariance are computed directly with the FastMCD algorithm. If set to True, the support of the robust location and covariance estimates is computed, and the covariance estimate is recomputed from it without centering the data.

3       s
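The effect of the first two parameters can be sketched as follows. The data is synthetic, and the use of getattr is a defensive choice for this example so that the check works whether the attribute is absent or set to None.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(3)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # hypothetical data near the origin

default_env = EllipticEnvelope(random_state=3).fit(X)
no_precision_env = EllipticEnvelope(store_precision=False, random_state=3).fit(X)
centered_env = EllipticEnvelope(assume_centered=True, random_state=3).fit(X)

# store_precision=True (default): the precision matrix is kept on the estimator.
print(default_env.precision_)

# store_precision=False: no precision matrix is kept after fitting.
print(getattr(no_precision_env, "precision_", None))

# assume_centered=True: the data is not centered, so the location stays at zero.
print(centered_env.location_)
```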
