Local Outlier Factor for Imbalanced Classification

Introduction

The presence of imbalanced class sizes when discriminating class membership in a body of data can be a large problem if one’s results are not interpreted appropriately. Achieving high accuracy, the so-called “white whale” of most classification problems, becomes a trivial task if an imbalance is not properly addressed. Although it is often better to optimize metrics such as sensitivity and specificity, this can be difficult with many of the popular supervised learning models. For this reason, one might consider turning to unsupervised/semi-supervised methods instead.


One common application of unsupervised/semi-supervised learning is anomaly detection. In this specific context, unsupervised learning focuses on outlier detection, or identifying anomalies within the known data, while semi-supervised learning focuses on novelty detection, or looking for anomalies that come from new data.


Figure 1. Outlier detection (left) identifies anomalies within data. Novelty detection (right) identifies anomalies in new data. Anomalies are circled here.

While there are many techniques for anomaly detection, this post will discuss the Local Outlier Factor (LOF) algorithm. The algorithm will be broken down step by step, generalized to imbalanced classification, and applied to the task of exoplanet discovery.


Local Outlier Factor (LOF)

The Local Outlier Factor algorithm works similarly to density-based clustering methods. LOF compares the local density of a point to that of its k-nearest neighbors. Clusters are identified as points of similar density and outliers are identified as points with considerably lower density than their neighbors.


LOF introduces outlier-ness as a measurable quantity instead of a binary property. The formal notation of the problem, as presented by Breunig et al. in 2000, is as follows:


Equation 1. Formal notation of LOF

The problem can be worked out step by step using the data from Figure 1 as an example.


1 — Identify the k-nearest neighbors and the k-distance: For this problem, k will be equal to 5. Figure 2 displays the 5 nearest neighbors of a given point a (marked with a red dot). The distance between a and the farthest of those neighbors is the k-distance. Here the k-distance of point a is 0.631.


Figure 2. The 5 nearest neighbors of point a (marked with a red dot) are circled.
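This step can be sketched with scikit-learn's NearestNeighbors. The toy dataset below is hypothetical (the 0.631 value above is specific to Figure 2 and is not reproduced here):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # toy 2-D dataset standing in for Figure 2

k = 5
# Ask for k+1 neighbors because each point is its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, idx = nn.kneighbors(X)

a = 0                        # index of a point "a"
neighbors_a = idx[a, 1:]     # its 5 nearest neighbors (self excluded)
k_distance_a = dists[a, -1]  # distance to the farthest of them
print(neighbors_a, k_distance_a)
```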

2 — Calculate reachability-distance: The reachability-distance between two points is the maximum of the true distance between them and the k-distance of the second point (see Equation 2). Consider, for example, point a (from step 1) and point b (which can be any point in the dataset). The reachability-distance from a to b is either the true distance between a and b or the k-distance of point b, whichever is greater.


Equation 2. Reachability-distance (rd)

3 — Calculate local reachability density: Reachability-distance is used to calculate the local reachability density. To find the local reachability density of point a, calculate the mean reachability-distance between a and all of its nearest neighbors. Then, take the inverse of that value (see Equation 3). If the 5 nearest neighbors of point a are {b, c, d, e, f}, then one must calculate 1 divided by the average of:


max(kdist(b), dist(a,b))
max(kdist(c), dist(a,c))
max(kdist(d), dist(a,d))
max(kdist(e), dist(a,e))
max(kdist(f), dist(a,f))


This can be thought of as (the inverse of) the average distance at which a can be “reached” by each of its nearest neighbors. The higher the local reachability density of a, the more “reachable” it is.


Equation 3. Local reachability density (lrd)

4 — Find the local reachability density of each observation in the dataset: In order to find the local outlier factor of points in the dataset, each point's local reachability density must be known.


5 — Calculate the local outlier factor of every observation in the dataset: The local outlier factor is calculated as the ratio of the mean local reachability density of a point's k-nearest neighbors to the point's own local reachability density (see Equation 4). The LOF of point a would be the mean of {lrd(b), lrd(c), lrd(d), lrd(e), lrd(f)} divided by a's own local reachability density, lrd(a).


Equation 4. Local outlier factor (lof)

The LOF of a given point can be thought of as how close the point is to its neighbors compared to how close its neighbors are to their own neighbors. An LOF of 1 implies that the point is just as dense, i.e. just as close to its neighbors, as its neighbors are to theirs. An LOF less than 1 implies that the point is denser than its neighbors. An LOF greater than 1 implies that the point is less dense than its neighbors and is likely an outlier.

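Steps 1 through 5 can be collected into a short NumPy sketch. The helper name and toy data below are illustrative, not from the original notebook:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_outlier_factors(X, k=5):
    """Compute the LOF of every point in X, following steps 1-5."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, idx = nn.kneighbors(X)
    knn = idx[:, 1:]       # step 1: k-nearest neighbors (self excluded)
    k_dist = dists[:, -1]  # step 1: k-distance of every point

    # Steps 2-4: rd(a, b) = max(kdist(b), dist(a, b)), then
    # lrd(a) = 1 / (mean reachability-distance to a's neighbors)
    reach = np.maximum(k_dist[knn], dists[:, 1:])
    lrd = 1.0 / reach.mean(axis=1)

    # Step 5: LOF(a) = mean lrd of a's neighbors / lrd(a)
    return lrd[knn].mean(axis=1) / lrd

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(99, 2)), [[8.0, 8.0]]])  # one isolated point
lof = local_outlier_factors(X)
print(lof[-1])  # well above 1: the isolated point is flagged as an outlier
```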

By comparing the local densities of points in a dataset, LOF identifies observations that are anomalous with respect to the observations nearest them. For example, a given point might not be an outlier if it is some distance away from any point in a sparse cluster. However, if that point is the same distance away from any point in an especially dense cluster, it could very well be considered an outlier.


LOF for Classification

LOF generalizes to classification problems when there is a significant imbalance in class sizes. Instead of treating it as a two-class problem, treat it as a one-class problem in which the majority class forms the main cluster and the minority class is a set of outliers. To do this, the focus of LOF must shift from outlier detection to novelty detection.


Novelty detection determines whether a new observation belongs with the original data or is an outlier. When performing this semi-supervised task, it is important to distinguish training data from testing data. The training data should contain only one class (the majority class), while the testing data can contain observations from both the majority and minority classes. There should be no overlap between the training and testing sets.


The first step is to find the local outlier factor of each point in the training dataset. Once those are calculated, points from the testing set have their local outlier factors calculated independently of one another. Note that none of the local outlier factors in the training set should be recalculated to include new testing observations (i.e., no testing observation should be considered in the k-nearest-neighbor computations of the training set).


Identifying a new or testing observation as an outlier determines its membership in the minority class.

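In scikit-learn, this workflow maps onto LocalOutlierFactor with novelty=True. A minimal sketch on synthetic data (the class locations and sizes here are assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
X_train = rng.normal(0, 1, size=(500, 3))  # training: majority class only
X_test = np.vstack([
    rng.normal(0, 1, size=(40, 3)),        # unseen majority points
    rng.normal(5, 1, size=(10, 3)),        # minority "novelties"
])

# novelty=True freezes the training neighborhoods: each test point is
# scored against the training set, never against other test points
model = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
pred = model.predict(X_test)  # +1 = inlier (majority), -1 = novelty (minority)
print((pred[40:] == -1).mean())  # fraction of minority points flagged
```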

Exoplanet Discovery

To demonstrate the effectiveness of this approach, let us consider data from NASA’s Kepler Mission (as found on Kaggle.com). The mission of the Kepler Telescope is to find stars that are likely to have planets orbiting them (planets that may well lie within their star’s habitable zone). These exoplanets are discovered by pointing the telescope at a star and measuring the luminous flux coming from it.


Stars with planets orbiting them will periodically appear fractionally less bright, as measured by lower flux, as the planet passes between the star and Kepler. This method of discovering exoplanets is referred to as the transit method because it depends on the transit of planets across their star.


Over 2,300 confirmed exoplanets have been discovered by Kepler, and of those, roughly 360 could potentially be within the habitable zone of their star. More information on the Kepler Mission can be found at the mission’s website.


The dataset considered in this post is not the complete body of flux measurements from Kepler, but rather a sample containing 42 confirmed exoplanets out of 5,657 observations. Each observation consists of 3,197 consecutive flux measurements from a given star. Figure 3 offers examples of flux measurements from one star without an exoplanet and one with an exoplanet orbiting it.


Figure 3. Flux measurements from distant stars.

Although there appears to be some pattern by which to discriminate here, it is not so clear in other observations. It should be noted that the classification and anomaly detection methods presented in this post are not the ideal techniques for analyzing data of this form; more careful trend analysis and decomposition is needed to extract signals. However, the imbalance in this dataset does provide a good platform on which to demonstrate the merits of anomaly detection for classification problems. We are, after all, attempting to find anomalies in the luminous flux patterns of observed stars.


The testing set used in this problem contains all 42 exoplanet observations and an additional 48 non-exoplanet observations. The training set for the LOF algorithm contains the remaining 5,567 non-exoplanet observations. The k-nearest neighbors algorithm, LOF’s closest supervised learning analogue, is also fit here for comparison. For that, the training set consists of the 5,657 total observations.


The metrics used to evaluate the performance of the two models are accuracy, sensitivity, and specificity (see Equations 5.a, 5.b, and 5.c). Sensitivity is the proportion of exoplanets that are correctly identified and specificity is the proportion of non-exoplanets correctly identified.


Equation 5. True negatives, false positives, false negatives, and true positives are used to calculate accuracy, sensitivity, and specificity
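Equation 5 amounts to a few lines of arithmetic. A sketch with an illustrative confusion count (labels: 1 = exoplanet, 0 = non-exoplanet; the sample vectors are made up):

```python
import numpy as np

def evaluate(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))  # exoplanets found
    tn = np.sum((y_true == 0) & (y_pred == 0))  # non-exoplanets found
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # proportion of exoplanets correctly identified
    specificity = tn / (tn + fp)  # proportion of non-exoplanets correctly identified
    return accuracy, sensitivity, specificity

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1])
print(evaluate(y_true, y_pred))  # (0.75, 0.666..., 0.8)
```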

Supervised Model:


k-Nearest neighbors was fit to its training set using k=20. Predictions on the testing set yielded an accuracy of 0.533, a sensitivity of 0.0, and a specificity of 1.0. This means that the model correctly identified all non-exoplanets but did not identify any exoplanets. Closer inspection reveals that the model returned “non-exoplanet” for every prediction. This outcome is actually expected, considering that only 0.7% of the training observations are exoplanets.

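This degenerate behavior is easy to reproduce on any synthetic set with a similar imbalance (the data below are made up, not the Kepler measurements):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
# 995 majority vs. 5 minority observations, heavily overlapping
X = np.vstack([rng.normal(0, 1, size=(995, 5)),
               rng.normal(1, 1, size=(5, 5))])
y = np.array([0] * 995 + [1] * 5)

knn = KNeighborsClassifier(n_neighbors=20).fit(X, y)
# With at most 5 minority points in any 20-point neighborhood, the
# majority vote can never favor the minority class
print(set(knn.predict(X)))  # {0}
```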

Semi-Supervised Model — LOF:


Two rounds of LOF are used in the solution presented here. The first round performs LOF on its respective training set and determines outliers that are present within it. These outliers are then removed in order to reduce noise in the training data. The second round performs LOF on the reduced training set and predicts novelties that may be present in the testing set. Testing observations predicted to be novelties are labeled as exoplanet observations and those not are labeled as non-exoplanet observations.

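The two rounds described above can be sketched on synthetic stand-in data (the shapes, n_neighbors, and class locations here are assumptions, not the notebook’s actual settings):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(4)
X_train = rng.normal(size=(1000, 10))                  # majority class only
X_test = np.vstack([rng.normal(size=(48, 10)),         # majority test points
                    rng.normal(4, 1, size=(42, 10))])  # minority test points

# Round 1: outlier detection on the training set, then drop the outliers
mask = LocalOutlierFactor(n_neighbors=20).fit_predict(X_train) == 1
X_train_clean = X_train[mask]

# Round 2: novelty detection fit on the de-noised training set
model = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train_clean)
pred = model.predict(X_test)  # -1 = predicted minority ("exoplanet")
print((pred == -1).sum())
```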

This yielded an accuracy of 0.6889, a specificity of 0.6875, and a sensitivity of 0.6905. Although the accuracy is not particularly high, the model correctly identified 69% of the exoplanets and 69% of the non-exoplanets. This is significantly better than the supervised model, as it catches 2 of every 3 exoplanets instead of 0.


Conclusion

Anomaly detection is an important paradigm in machine learning. In addition to outlier and novelty detection, it has also proven its worth in imbalanced classification. Local Outlier Factor is one of the most popular outlier detection methods thanks to its relatively simple intuition and its effectiveness.


The ability of LOF to identify exoplanets, as demonstrated in this post, is good but not great. This is partly because the model does not perform as well in high-dimensional settings. Other factors include the noisiness of the data as well as its difficult structure (each observation is an entire time-series dataset!).


A Jupyter notebook for the modeling process can be found at the following link: https://github.com/willarliss/Exoplanets


Translated from: https://towardsdatascience.com/local-outlier-factor-for-imbalanced-classification-cbced8f84baf
