Local Outlier Factor (LOF) Algorithm
1. Algorithm Overview
LOF is a density-based outlier detection algorithm. It assumes that outliers are a minority: the number of outliers must not exceed the number of normal points (an outlier-to-inlier ratio of at most 1:1; in scikit-learn the contamination parameter is accordingly capped at 0.5). A complete implementation is available in scikit-learn as sklearn.neighbors.LocalOutlierFactor; set its parameters as needed.
2. Algorithm Principle
The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method that measures the local density deviation of a given data point with respect to its neighbors. Samples whose density is substantially lower than that of their neighbors are treated as outliers. First, the algorithm computes a local reachability density (LRD) for every data point; a point's LOF score is then the mean ratio of its neighbors' LRDs to its own LRD. If the contamination parameter (the expected proportion of outliers) is set, the cut-off for flagging outliers is derived from that proportion as a percentile of the training scores; otherwise an LOF score of 1.5 is used as the boundary.
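The procedure above can be sketched in plain NumPy. This is an illustration only, not scikit-learn's implementation; the choice of k = 2 and the tiny 1e-10 stabilizer mirror the source walked through in section 3.

```python
import numpy as np

def lof_scores(X, k=2):
    """Plain-NumPy LOF for a small dense dataset (illustration only)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)           # a point is not its own neighbor
    knn = np.argsort(D, axis=1)[:, :k]    # indices of the k nearest neighbors
    dist_knn = np.take_along_axis(D, knn, axis=1)  # distances to them, ascending
    k_dist = dist_knn[:, -1]              # k-distance of every point
    # reach-dist(p, o) = max(k-distance(o), d(p, o))
    reach = np.maximum(dist_knn, k_dist[knn])
    lrd = 1.0 / (reach.mean(axis=1) + 1e-10)      # local reachability density
    # LOF(p) = mean over neighbors o of lrd(o) / lrd(p)
    return (lrd[knn] / lrd[:, None]).mean(axis=1)

scores = lof_scores([[-1.1], [0.2], [101.1], [0.3]], k=2)
print(scores)  # the third point stands out with a very large LOF
```

Inliers score close to 1 (their density matches their neighbors'), while the isolated point at 101.1 gets a score orders of magnitude larger.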
3. Walkthrough of the Python (scikit-learn) Source
3.1 The main fit function
def fit(self, X, y=None):
    """Fit the local outlier factor detector from the training dataset.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features) or \
            (n_samples, n_samples) if metric='precomputed'
        Training data.

    y : Ignored
        Not used, present for API consistency by convention.

    Returns
    -------
    self : LocalOutlierFactor
        The fitted local outlier factor detector.
    """
    # Parameter validation
    self._validate_params()
    self._fit(X)
    n_samples = self.n_samples_fit_
    if self.n_neighbors > n_samples:
        warnings.warn(
            "n_neighbors (%s) is greater than the "
            "total number of samples (%s). n_neighbors "
            "will be set to (n_samples - 1) for estimation."
            % (self.n_neighbors, n_samples)
        )
    '''
    Code notes:
    # The assignment below clamps n_neighbors so that the number of
    # neighbors actually used stays strictly smaller than the number of
    # training samples.
    # A k-nearest-neighbor search then finds the k neighbors of each
    # target point (p1) and returns their indices together with their
    # distances to the target; the distance metric is configurable
    # through the estimator's parameters.
    '''
    self.n_neighbors_ = max(1, min(self.n_neighbors, n_samples - 1))
    self._distances_fit_X_, _neighbors_indices_fit_X_ = self.kneighbors(
        n_neighbors=self.n_neighbors_
    )

    if self._fit_X.dtype == np.float32:
        self._distances_fit_X_ = self._distances_fit_X_.astype(
            self._fit_X.dtype,
            copy=False,
        )
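As a sketch of what the neighbor query returns here, the same lookup can be reproduced with sklearn.neighbors.NearestNeighbors; calling kneighbors() with no argument excludes each point from its own neighbor list, just as in fit:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1.1], [0.2], [101.1], [0.3]])
# The search structure (kd-tree, ball tree, or brute force) is chosen
# via the `algorithm` parameter; 'auto' picks one automatically.
nn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = nn.kneighbors()  # X=None: self-matches excluded
print(indices)    # row i holds the indices of point i's two nearest neighbors
print(distances)  # the matching distances, sorted ascending per row
```

Both arrays have shape (n_samples, n_neighbors); the indices array is what later lets the code look up each neighbor's own distances without a second search.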
    '''
    Code notes:
    # Compute the local reachability density (LRD):
    # 1. Take the actual distance between the target point (p1) and each
    #    of its neighbors.
    # 2. For each neighbor, take its k-distance, i.e. the largest of the
    #    distances to its own k nearest neighbors.
    # 3. The reachability distance is the larger of the two: the actual
    #    distance and the neighbor's k-distance.
    # 4. Doing this for all k neighbors yields k reachability distances
    #    for the target point.
    # 5. The LRD is the inverse of the mean of these reachability
    #    distances.
    '''
    self._lrd = self._local_reachability_density(
        self._distances_fit_X_, _neighbors_indices_fit_X_
    )
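The five steps above, on one target point with toy numbers (the distances and k-distances are assumed values for illustration):

```python
import numpy as np

# Step 1: distances from a target point p1 to its k=2 nearest neighbors.
distances_X = np.array([[1.3, 1.4]])
# Step 2: the k-distance of each of those neighbors (the largest distance
# to their own k nearest neighbors), assumed here for illustration.
dist_k = np.array([[1.3, 1.4]])
# Step 3-4: reachability distance = max(actual distance, k-distance).
reach_dist = np.maximum(distances_X, dist_k)
# Step 5: LRD = inverse of the mean reachability distance
# (the 1e-10 guards against division by zero).
lrd = 1.0 / (reach_dist.mean(axis=1) + 1e-10)
print(lrd)  # 1 / 1.35, about 0.7407
```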
    '''
    Code notes:
    # The local outlier factor is the mean ratio of the LRDs of the k
    # neighbors to the LRD of the target point itself.
    '''
    # Compute lof score over training samples to define offset_:
    lrd_ratios_array = (
        self._lrd[_neighbors_indices_fit_X_] / self._lrd[:, np.newaxis]
    )

    self.negative_outlier_factor_ = -np.mean(lrd_ratios_array, axis=1)

    if self.contamination == "auto":
        # inliers score around -1 (the higher, the less abnormal).
        self.offset_ = -1.5
    else:
        self.offset_ = np.percentile(
            self.negative_outlier_factor_, 100.0 * self.contamination
        )

    return self
3.2 Computing the local reachability density
def _local_reachability_density(self, distances_X, neighbors_indices):
    """The local reachability density (LRD)

    The LRD of a sample is the inverse of the average reachability
    distance of its k-nearest neighbors.

    Parameters
    ----------
    distances_X : ndarray of shape (n_queries, self.n_neighbors)
        Distances to the neighbors (in the training samples `self._fit_X`)
        of each query point to compute the LRD.

    neighbors_indices : ndarray of shape (n_queries, self.n_neighbors)
        Neighbors indices (of each query point) among training samples
        self._fit_X.

    Returns
    -------
    local_reachability_density : ndarray of shape (n_queries,)
        The local reachability density of each sample.
    """
    '''
    Code notes:
    # The first line below uses NumPy advanced (fancy) indexing:
    # indexing with the neighbor-index array and column n_neighbors_ - 1
    # returns, for every neighbor, the largest distance among its own
    # k nearest neighbors (its k-distance).
    # Taking the element-wise maximum of the actual target-to-neighbor
    # distances and these k-distances yields the reachability distances.
    # The LRD is the inverse of the mean reachability distance.
    # The 1e-10 term guards against division by zero.
    '''
    dist_k = self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]
    reach_dist_array = np.maximum(distances_X, dist_k)
    # 1e-10 to avoid `nan' when nb of duplicates > n_neighbors_:
    return 1.0 / (np.mean(reach_dist_array, axis=1) + 1e-10)
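The fancy-indexing line in isolation, with toy numbers (the indices and distances correspond to a k=2 neighbor query on a small 1-D dataset, chosen for illustration):

```python
import numpy as np

# distances_fit_X[i, j] = distance from point i to its (j+1)-th nearest neighbor
distances_fit_X = np.array([[1.3, 1.4],
                            [0.1, 1.3],
                            [100.8, 100.9],
                            [0.1, 1.4]])
neighbors_indices = np.array([[1, 3],
                              [3, 0],
                              [3, 1],
                              [1, 0]])
n_neighbors = 2
# For every neighbor o of every point, look up o's k-distance (its
# distance to its own k-th neighbor) in one vectorized step: row
# indices come from neighbors_indices, the column is fixed.
dist_k = distances_fit_X[neighbors_indices, n_neighbors - 1]
print(dist_k)
```

One indexing expression replaces a double loop over points and neighbors, which is why the neighbor search only ever has to run once.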
3.3 Strengths of the Python implementation
① The neighbor search can use a kd-tree or another query algorithm.
② The search returns both the neighbor indices and the distances.
③ When computing reachability distances, LOF needs the maximum of the actual distance from the target point (p1) to a neighbor and the largest distance among that neighbor's own k nearest neighbors.
④ Because indices are returned, the neighbor search only runs once; subsequent reachability-distance computations are simple index lookups.
3.4 Weaknesses of the algorithm
It does not handle high-dimensional data well: neighbor queries and density computations become expensive as dimensionality grows.
4. Example
from sklearn.neighbors import LocalOutlierFactor
X = [[-1.1], [0.2], [101.1], [0.3]]
clf = LocalOutlierFactor(n_neighbors=2)
clf.fit(X)
score = clf.negative_outlier_factor_
print(score)
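The same data with fit_predict, which returns 1 for inliers and -1 for outliers (the score values match the example in the scikit-learn documentation for this exact dataset):

```python
from sklearn.neighbors import LocalOutlierFactor

X = [[-1.1], [0.2], [101.1], [0.3]]
clf = LocalOutlierFactor(n_neighbors=2)
labels = clf.fit_predict(X)          # fit and label in one call
print(labels)                        # [ 1  1 -1  1]
print(clf.negative_outlier_factor_)  # approx. [-0.98, -1.04, -73.37, -0.98]
```

The isolated point at 101.1 receives a hugely negative score and is the only one labeled -1.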
The steps inside fit, in order: query the neighbor indices and distances → look up the maximum distance among each neighbor's own k neighbors (its k-distance) → compute the reachability distances → compute the local reachability density → build the matrix of neighbor-to-target LRD ratios → average to get the LOF score.
5. References
- C implementation of a KdTree
- How KdTrees work
- How the LOF algorithm works