Local Outlier Factor Algorithm

皮蛋瘦肉熬不成粥

Last modified 2023-12-07 16:37:54



First published 2023-12-07 16:17:04

Article link: https://blog.csdn.net/mtalab/article/details/134858199


Local Outlier Factor (LOF) Algorithm

1. Algorithm Overview

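LOF is a density-based outlier detection algorithm. It requires that the ratio of outliers to inliers be at most 1:1, that is, outliers may not outnumber inliers (in scikit-learn, a numeric `contamination` value must lie in the range (0, 0.5]). scikit-learn ships a complete implementation, `sklearn.neighbors.LocalOutlierFactor`, and its parameters can be set as the task requires.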

2. How the Algorithm Works

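The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method that measures the local density deviation of a given data point with respect to its neighbors. It treats samples whose density is substantially lower than that of their neighbors as outliers. The algorithm first computes the local reachability density (LRD) of every data point. A point's score is then the mean, over its k nearest neighbors, of the ratio of the neighbor's LRD to the point's own LRD. Scores near 1 indicate an inlier, and scores well above 1 indicate an outlier. If the expected proportion of outliers in the data is supplied as a parameter (`contamination` in scikit-learn), the decision threshold is derived from that proportion; otherwise a LOF of 1.5 is used as the cutoff.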
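The steps described above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: it assumes brute-force neighbor search and Euclidean distance, and the function name `lof_scores` is hypothetical.

```python
import math


def lof_scores(points, k):
    """Return the LOF score of every point (brute force, Euclidean distance)."""
    n = len(points)
    k = max(1, min(k, n - 1))  # the neighbor count must stay below n

    # For each point: its k nearest neighbors as (distance, index), nearest first.
    knn = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        knn.append(dists[:k])

    # k-distance: distance from a point to its k-th nearest neighbor.
    k_dist = [neighbors[-1][0] for neighbors in knn]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(p, o) = max(d(p, o), k_dist(o)).
    lrd = []
    for neighbors in knn:
        reach = [max(d, k_dist[j]) for d, j in neighbors]
        lrd.append(1.0 / (sum(reach) / k + 1e-10))

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors) / (k * lrd[i])
        for i, neighbors in enumerate(knn)
    ]


scores = lof_scores([[-1.1], [0.2], [101.1], [0.3]], k=2)
print([round(s, 2) for s in scores])  # [0.98, 1.04, 73.37, 0.98]
```

The isolated point 101.1 gets a score far above 1, while the three clustered points score close to 1. scikit-learn reports the negated values of these scores.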

3. Walkthrough of the Python Source Code (scikit-learn)

3.1 The main LOF function: `fit`

    def fit(self, X, y=None):
        """Fit the local outlier factor detector from the training dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features) or \
                (n_samples, n_samples) if metric='precomputed'
            Training data.

        y : Ignored
            Not used, present for API consistency by convention.

        Returns
        -------
        self : LocalOutlierFactor
            The fitted local outlier factor detector.
        """

        # Validate the estimator's parameters
        self._validate_params()

        self._fit(X)

        n_samples = self.n_samples_fit_
        if self.n_neighbors > n_samples:
            warnings.warn(
                "n_neighbors (%s) is greater than the "
                "total number of samples (%s). n_neighbors "
                "will be set to (n_samples - 1) for estimation."
                % (self.n_neighbors, n_samples)
            )
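
        # Code walkthrough:
        # Clamp the neighbor count to [1, n_samples - 1] so that it is always
        # smaller than the number of training samples.
        self.n_neighbors_ = max(1, min(self.n_neighbors, n_samples - 1))

        # Use k-nearest-neighbor search to find the k neighbors of each
        # target point (p1). The call returns the distances from the target
        # point to its neighbors and the neighbors' indices.
        # The distance metric is configurable through the estimator's parameters.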
        self._distances_fit_X_, _neighbors_indices_fit_X_ = self.kneighbors(
            n_neighbors=self.n_neighbors_
        )

        if self._fit_X.dtype == np.float32:
            self._distances_fit_X_ = self._distances_fit_X_.astype(
                self._fit_X.dtype,
                copy=False,
            )


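        # Code walkthrough: computing the local reachability density (LRD)
        # 1. Take the actual distance between the target point (p1) and each
        #    of its neighbors.
        # 2. For each neighbor, find its own k nearest neighbors and take the
        #    largest of those k distances (the neighbor's k-distance).
        # 3. The reachability distance is the larger of the actual distance
        #    and the neighbor's k-distance.
        # 4. Doing this for all k neighbors gives k reachability distances.
        # 5. The LRD is the inverse of the mean of those reachability distances.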
        self._lrd = self._local_reachability_density(
            self._distances_fit_X_, _neighbors_indices_fit_X_
        )


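        # Code walkthrough:
        # Compute the ratio of each of the k neighbors' LRD to the target
        # point's LRD. The mean of these ratios is the local outlier factor;
        # it is stored with its sign flipped as negative_outlier_factor_.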
        # Compute lof score over training samples to define offset_:
        lrd_ratios_array = (
            self._lrd[_neighbors_indices_fit_X_] / self._lrd[:, np.newaxis]
        )

        self.negative_outlier_factor_ = -np.mean(lrd_ratios_array, axis=1)

        if self.contamination == "auto":
            # inliers score around -1 (the higher, the less abnormal).
            self.offset_ = -1.5
        else:
            self.offset_ = np.percentile(
                self.negative_outlier_factor_, 100.0 * self.contamination
            )

        return self
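The effect of the final `contamination` branch can be checked against the public API. The snippet below is a small sketch that assumes scikit-learn and NumPy are installed; the variable names are illustrative only.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[-1.1], [0.2], [101.1], [0.3]])

# contamination="auto" (the default): the threshold is fixed at -1.5.
auto = LocalOutlierFactor(n_neighbors=2).fit(X)
print(auto.offset_)

# contamination=0.25: the threshold is the 25th percentile of the training scores.
quarter = LocalOutlierFactor(n_neighbors=2, contamination=0.25).fit(X)
expected = np.percentile(quarter.negative_outlier_factor_, 25.0)
print(quarter.offset_, expected)

# A training sample is labeled -1 (outlier) when its score falls below offset_.
labels = LocalOutlierFactor(n_neighbors=2).fit_predict(X)
manual = np.where(auto.negative_outlier_factor_ < auto.offset_, -1, 1)
print(labels, manual)
```

With `contamination="auto"` the threshold does not depend on the data, whereas a numeric `contamination` always flags roughly that fraction of the training samples.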

3.2 Computing the local reachability density

    def _local_reachability_density(self, distances_X, neighbors_indices):
        """The local reachability density (LRD)

        The LRD of a sample is the inverse of the average reachability
        distance of its k-nearest neighbors.

        Parameters
        ----------
        distances_X : ndarray of shape (n_queries, self.n_neighbors)
            Distances to the neighbors (in the training samples `self._fit_X`)
            of each query point to compute the LRD.

        neighbors_indices : ndarray of shape (n_queries, self.n_neighbors)
            Neighbors indices (of each query point) among training samples
            self._fit_X.

        Returns
        -------
        local_reachability_density : ndarray of shape (n_queries,)
            The local reachability density of each sample.
        """

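        # Code walkthrough:
        # The next line uses NumPy advanced (fancy) indexing: the neighbor
        # indices select rows and n_neighbors_ - 1 selects the last column.
        # The result holds, for every neighbor, the largest of that
        # neighbor's own k distances (its k-distance).
        # The reachability distance is the element-wise maximum of the actual
        # distance from the target point (p1) to the neighbor and that
        # neighbor's k-distance.
        # The LRD is the inverse of the mean reachability distance.
        # The 1e-10 term avoids division by zero.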
        dist_k = self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]
        reach_dist_array = np.maximum(distances_X, dist_k)

        # 1e-10 to avoid `nan' when nb of duplicates > n_neighbors_:
        return 1.0 / (np.mean(reach_dist_array, axis=1) + 1e-10)
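The advanced-indexing step is easier to follow on a tiny hand-built example. The arrays below are illustrative stand-ins for `self._distances_fit_X_` and the neighbor indices, written out by hand for the four 1-D points -1.1, 0.2, 101.1 and 0.3 with k = 2; the variable names are assumptions for this sketch.

```python
import numpy as np

# Sorted distances from each of the 4 points to its 2 nearest neighbors.
distances = np.array([[1.3, 1.4], [0.1, 1.3], [100.8, 100.9], [0.1, 1.4]])
# Row i holds the indices of point i's neighbors, nearest first.
indices = np.array([[1, 3], [3, 0], [3, 1], [1, 0]])
k = 2

# distances[:, k - 1] is each point's k-distance. Indexing with `indices`
# looks up the k-distance of every neighbor, so the result is shaped like `indices`.
dist_k = distances[indices, k - 1]

# Reachability distance: element-wise maximum of actual distance and k-distance.
reach_dist = np.maximum(distances, dist_k)

# Local reachability density: inverse of the mean reachability distance.
lrd = 1.0 / (reach_dist.mean(axis=1) + 1e-10)

print(dist_k)
print(reach_dist)
print(lrd)
```

Because `indices` has shape (4, 2) and the column index is a scalar, `dist_k` also has shape (4, 2), so it lines up element by element with `distances`.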

3.3 Strengths of the Python implementation

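① In the Python implementation, neighbor lookup can use a k-d tree or another search algorithm.

② The neighbor query returns both the indices of the neighbors and their distances.

③ To obtain a reachability distance, LOF takes the maximum of the actual distance between the target point (p1) and a neighbor and that neighbor's k-distance (the largest distance from the neighbor to its own k nearest neighbors).

④ Because everything is index-based, the neighbor search runs only once; the reachability distances are then obtained by index lookups alone.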
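As a rough illustration of point ①, the neighbor search backend can be switched through the `algorithm` parameter of `LocalOutlierFactor`. The sketch below assumes scikit-learn and NumPy are installed and uses synthetic data generated only for this example; it compares the scores produced by the different backends.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic 2-D data, used only for this illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

scores = {}
for algorithm in ("kd_tree", "ball_tree", "brute"):
    clf = LocalOutlierFactor(n_neighbors=20, algorithm=algorithm).fit(X)
    scores[algorithm] = clf.negative_outlier_factor_

# The backends differ in speed, not in the neighbors they are meant to find.
print(np.allclose(scores["kd_tree"], scores["brute"]))
print(np.allclose(scores["ball_tree"], scores["brute"]))
```

The choice of backend mainly affects query time, which matters more as the data grows.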

3.4 Weaknesses of the algorithm

LOF does not cope well with high-dimensional data: neighbor queries and distance computations become expensive.

4. Example

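from sklearn.neighbors import LocalOutlierFactor

# Four 1-D samples; 101.1 lies far away from the other three.
X = [[-1.1], [0.2], [101.1], [0.3]]
clf = LocalOutlierFactor(n_neighbors=2)
clf.fit(X)
# Negated LOF of each training sample: near -1 for inliers, far below -1 for outliers.
score = clf.negative_outlier_factor_
print(score)  # approximately [-0.98, -1.04, -73.37, -0.98]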