Anomaly Detection Series: Extreme Gradient Boosting Outlier Detection (XGBOD)

Link to this article: https://blog.csdn.net/wjjc1017/article/details/135875489

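In Chapter 1, we discussed how supervised learning handles known outliers better, while unsupervised learning can explore new types of outliers. Can we take full advantage of both? Specifically, since supervised learning usually achieves better accuracy, and the outlier scores from unsupervised learning are good at identifying outliers, can we use the unsupervised outlier scores as features for supervised learning?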

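This idea belongs to a larger concept called representation learning, a family of machine learning methods for discovering data representations of features. In this chapter, I will explain representation learning and then introduce a supervised learning technique called XGBOD (Extreme Gradient Boosting Outlier Detection). I chose to cover XGBOD in this book so that you can also understand other representation learning variants, such as BORE (Bagged Outlier Representation Ensemble).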

(A) Representation learning

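Representation learning is the area of machine learning that studies systematic ways to discover representations of raw data without human intervention. Its goal is to use machine learning algorithms to learn any normal and obscure patterns in the data, so that the raw data can be represented by new features. Many dimensionality reduction techniques, such as PCA and autoencoders, provide this capability. In the literature, representation learning [1] is also called unsupervised feature engineering [2]. Let me use a magnifying glass as an analogy. Imagine magnifying glasses scanning over the data. One magnifying glass enlarges the normal patterns in the data, while another enlarges the irregular patterns. In data science, these different magnifying glasses are called features. They produce new data that represents the raw data.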
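As a minimal sketch of the idea that a dimensionality reduction technique yields new features representing the raw data, the snippet below runs PCA through a singular value decomposition in NumPy. The toy data, the variable names, and the choice of two components are assumptions made for illustration only.

```python
import numpy as np

# Toy raw data (illustrative): 200 observations, 3 features, two of them correlated
rng = np.random.default_rng(1)
z = rng.normal(size=(200, 1))
X_raw = np.hstack([z, 2 * z + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])

# PCA via SVD: center the data, then project onto the top principal directions
X_centered = X_raw - X_raw.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
n_keep = 2                                # number of new features to keep (assumption)
X_new = X_centered @ Vt[:n_keep].T        # the learned representation

# Share of total variance carried by each principal component
explained = S ** 2 / np.sum(S ** 2)
print(X_new.shape, explained.round(3))
```

The two columns of `X_new` are new features learned from the data itself, with no labels and no hand-crafted rules involved.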

(B) The labeled target contains different types of outliers

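Before we discuss supervised learning, let us first understand the target. Outliers can be of different types, yet in a binary classification model they are all labeled "1". Let me give an example. Medicare and Medicaid are two U.S. government programs that cover medical and health-related services. Doctors submit bills to Medicare and Medicaid for the medical services they provide. Although most bills are correct and professional, dishonest bills still exist. Sparrow (2019), in his book "License to Steal: How Fraud Bleeds America's Health Care System" [6], describes different types of healthcare fraud. One type is a doctor colluding with a patient to submit multiple claims to Medicare or Medicaid. Another is a dishonest provider creating bills for many ghost patients. Yet another may involve a criminal ring of patients, doctors, lawyers, and medical suppliers. In data science terms, these are different types of outliers, each drawn from its own distribution. If these claims were plotted as data points on a two-dimensional graph, they might be the outliers O1, O2, a1, and a2 in Figure (A), apart from the correct bills. The prediction problem can still be framed as binary classification, where every type of anomaly is labeled "1" and everything else is labeled "0".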
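The collapse of several outlier types into one binary target can be sketched in a few lines. The type names below are hypothetical labels invented for this illustration; they are not taken from any real claims dataset.

```python
# Hypothetical claim categories for illustration only
claim_types = ["legitimate", "collusion", "ghost_patient", "legitimate", "crime_ring"]

# Every kind of fraud becomes 1; legitimate claims become 0
y_binary = [0 if claim_type == "legitimate" else 1 for claim_type in claim_types]
print(y_binary)
```

The classifier only ever sees the binary target, so it has to learn one decision boundary that covers outliers coming from several different distributions.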

(C) XGBOD

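The supervised learner can be any classification model, such as the logistic regression used by BORE [1]. Building on the representation learning approach, Zhao and Hryniewicki (2018) [3] proposed an XGBoost-based model called XGBOD (Extreme Gradient Boosting Outlier Detection). XGBoost is reported to handle imbalanced data better than other ensemble methods [5], which is attractive for extremely imbalanced targets such as the one described above. XGBoost (eXtreme Gradient Boosting) by Chen and Guestrin [4] is a well-known implementation of the gradient boosted trees algorithm. It mitigates overfitting through a regularization term built into its loss function. Its parallel processing and optimized computation also appeal to many data scientists.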
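For reference, the regularized objective from the XGBoost paper [4] can be written as follows, where $l$ is a differentiable loss, $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, and $w$ is its vector of leaf weights:

```latex
\mathcal{L}(\phi) = \sum_{i} l\left(\hat{y}_i, y_i\right) + \sum_{k} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}
```

The penalty $\Omega$ discourages trees with many leaves and large leaf weights. As far as I understand the library's parameter naming, $\gamma$ and $\lambda$ correspond to the `gamma` and `reg_lambda` hyperparameters, and the implementation additionally offers an L1 penalty through `reg_alpha`.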

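XGBOD has three steps. First, it applies unsupervised learning to create new features called Transformed Outlier Scores (TOS). Second, it concatenates the new features with the original features and applies the Pearson correlation coefficient to retain the useful ones. Third, it trains an XGBoost classifier. Using XGBoost allows feature pruning and provides a ranking of feature importance.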
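The first two steps can be sketched with NumPy alone. This is not the PyOD implementation: the k-nearest-neighbor distance score stands in for the full set of detectors, the toy data is invented, and ranking features by their Pearson correlation with the label is my assumption for illustration (the selection rule in the paper may differ).

```python
import numpy as np

def knn_distance_scores(X, k):
    """Outlier score = distance to the k-th nearest neighbor (excluding the point itself)."""
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))   # pairwise Euclidean distances
    return np.sort(dist, axis=1)[:, k]         # column 0 is the point itself

# Toy labeled data (illustrative): 95 inliers around 0, 5 outliers around 6
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(95, 2)), rng.normal(6.0, 1.0, size=(5, 2))])
y = np.r_[np.zeros(95, dtype=int), np.ones(5, dtype=int)]

# Step 1: one TOS column per hyperparameter value
k_values = [1, 3, 5, 10]
tos = np.column_stack([knn_distance_scores(X, k) for k in k_values])

# Step 2: concatenate original features and TOS, then compute Pearson correlation with the label
X_aug = np.hstack([X, tos])
pearson = np.array([np.corrcoef(X_aug[:, j], y)[0, 1] for j in range(X_aug.shape[1])])

# Step 3 (not run here): fit a classifier such as XGBoost on the selected columns of X_aug
print(X_aug.shape)
print(pearson.round(2))
```

With only five outliers in the toy data, the score for k=10 must reach into the inlier cluster, so that TOS column separates the two groups far more clearly than the score for k=1. This is one reason for generating TOS over a range of hyperparameters.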

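To generate the TOS, XGBOD uses KNN, AvgKNN, LOF, iForest, HBOS, and OCSVM as the default methods unless otherwise specified. This list of methods is not exhaustive. Each model is run with different hyperparameters to generate multiple TOS. The default models and their hyperparameter ranges are listed below.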

KNN, AvgKNN, LOF: the predefined range of n_neighbors is [1, 3, 5, 10, 20, 30, 40, 50]
iForest: the predefined range of the number of estimators is [10, 20, 50, 70, 100, 150, 200]
HBOS: the predefined range of the number of bins is [5, 10, 15, 20, 25, 30, 50]
OCSVM: the predefined range of nu is [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99]
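To get a feel for how many new features these ranges imply, the sketch below simply counts them. It assumes one TOS column per (method, hyperparameter value) pair and no feature selection, which is a simplification; the figure of six original features is a hypothetical example.

```python
# Hyperparameter grid as listed above
tos_grid = {
    "KNN": [1, 3, 5, 10, 20, 30, 40, 50],
    "AvgKNN": [1, 3, 5, 10, 20, 30, 40, 50],
    "LOF": [1, 3, 5, 10, 20, 30, 40, 50],
    "iForest": [10, 20, 50, 70, 100, 150, 200],
    "HBOS": [5, 10, 15, 20, 25, 30, 50],
    "OCSVM": [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99],
}

n_tos = sum(len(values) for values in tos_grid.values())
n_original = 6  # hypothetical number of original features
print(n_tos, n_tos + n_original)
```

Under these assumptions the augmented feature space is several times wider than the original one, which is why a selection step and a learner that tolerates redundant features are useful.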

(D) Modeling procedure

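For the unsupervised learning methods in this book, I used a 1-2-3 modeling procedure: (1) model development, (2) threshold determination, and (3) profiling the normal and outlier groups. In XGBOD, however, we can skip (2) because the target is known.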

(D.1) Step 1: Build the model

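I generated six variables and 500 observations each for the training data and the test data. The percentage of outliers is set to 5% through the contamination rate.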


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyod.utils.data import generate_data
contamination = 0.05 # percentage of outliers
n_train = 500       # number of training points
n_test = 500        # number of testing points
n_features = 6      # number of features
X_train, X_test, y_train, y_test = generate_data(
    n_train=n_train, 
    n_test=n_test, 
    n_features= n_features, 
    contamination=contamination, 
    random_state=123)

# Convert the 2D NumPy array to a pandas DataFrame for easy manipulation
X_train_pd = pd.DataFrame(X_train)
    
# Plot
plt.scatter(X_train_pd[0], X_train_pd[1], c=y_train, alpha=0.8)
plt.title('Scatter plot')
plt.xlabel('x0')
plt.ylabel('x1')
plt.show()

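Figure (D.1) shows a scatter plot of the first two variables. The yellow points are outliers and the purple points are normal data points.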

Figure (D.1)

I use the function decision_function() to assign an outlier score to each observation in "X_train" and "X_test".


from pyod.models.xgbod import XGBOD

# Fit XGBOD on the labeled training data.
# n_components is not an XGBOD parameter, so it is omitted here.
xgbod = XGBOD(random_state=100)
xgbod.fit(X_train, y_train)

# Get the labels and outlier scores of the training data
y_train_pred = xgbod.labels_  # binary labels (0: inliers, 1: outliers)
# The scores computed during fit are also stored in xgbod.decision_scores_
y_train_scores = xgbod.decision_function(X_train)  # outlier scores
# Get the predictions on the test data
y_test_pred = xgbod.predict(X_test)  # outlier labels (0 or 1)
y_test_scores = xgbod.decision_function(X_test)  # outlier scores

def count_stat(vector):
    # Because it is '0' and '1', we can run a count statistic. 
    unique, counts = np.unique(vector, return_counts=True)
    return dict(zip(unique, counts))

print("The training data:", count_stat(y_train_pred))
print("The test data:", count_stat(y_test_pred))

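Because we have the ground truth for the test data, we can verify the model's predictive performance. The confusion matrix is quite satisfactory: the model correctly identified 25 data points and missed only one.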

Actual_pred = pd.DataFrame({'Actual': y_test, 'Pred': y_test_pred})
pd.crosstab(Actual_pred['Actual'],Actual_pred['Pred'])
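A confusion matrix is often summarized by precision and recall. The helper below is a minimal sketch in plain Python; the function names and the toy label vectors are made up for illustration and are not the results of the model above.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 marks an outlier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels for illustration only
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
print(confusion_counts(y_true_toy, y_pred_toy))
print(precision_recall(y_true_toy, y_pred_toy))
```

With the variables from this chapter, the same helpers could be called as `precision_recall(y_test, y_test_pred)`. For a target as imbalanced as this one, precision and recall are usually more informative than plain accuracy.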

Representation learning in XGBOD

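Representation learning is essential in XGBOD. It applies unsupervised learning to create the Transformed Outlier Scores (TOS). We can use .get_params() to print the settings of XGBOD and inspect the unsupervised learners. The output includes the specifications of KNN, AvgKNN, LOF, IForest, HBOS, and OCSVM. Each of these unsupervised models creates TOS as new features for XGBOD, which are added to the original features to build the model.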

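The output also prints the hyperparameters of the extreme gradient boosting model. For example, the XGBoost model has a learning rate of 0.1, a maximum tree depth of 3, and 100 boosted trees.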


xgbod.get_params()

{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bytree': 1,
 'estimator_list': [KNN(algorithm='auto', contamination=0.1, leaf_size=30, method='largest',
    metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=1, p=2,
    radius=1.0),
  LOF(algorithm='auto', contamination=0.1, leaf_size=30, metric='minkowski',
    metric_params=None, n_jobs=1, n_neighbors=1, novelty=True, p=2),
  KNN(algorithm='auto', contamination=0.1, leaf_size=30, method='largest',
    metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=3, p=2,
    radius=1.0),
  LOF(algorithm='auto', contamination=0.1, leaf_size=30, metric='minkowski',
    metric_params=None, n_jobs=1, n_neighbors=3, novelty=True, p=2),
  KNN(algorithm='auto', contamination=0.1, leaf_size=30, method='largest',
    metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2,
    radius=1.0),
  LOF(algorithm='auto', contamination=0.1, leaf_size=30, metric='minkowski',
    metric_params=None, n_jobs=1, n_neighbors=5, novelty=True, p=2),
  KNN(algorithm='auto', contamination=0.1, leaf_size=30, method='largest',
    metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=10, p=2,
    radius=1.0),
  LOF(algorithm='auto', contamination=0.1, leaf_size=30, metric='minkowski',
    metric_params=None, n_jobs=1, n_neighbors=10, novelty=True, p=2),
  KNN(algorithm='auto', contamination=0.1, leaf_size=30, method='largest',
    metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=20, p=2,
    radius=1.0),
  LOF(algorithm='auto', contamination=0.1, leaf_size=30, metric='minkowski',
    metric_params=None, n_jobs=1, n_neighbors=20, novelty=True, p=2),
  KNN(algorithm='auto', contamination=0.1, leaf_size=30, method='largest',
    metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=30, p=2,
    radius=1.0),
  LOF(algorithm='auto', contamination=0.1, leaf_size=30, metric='minkowski',
    metric_params=None, n_jobs=1, n_neighbors=30, novelty=True, p=2),
  KNN(algorithm='auto', contamination=0.1, leaf_size=30, method='largest',
    metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=40, p=2,
    radius=1.0),
  LOF(algorithm='auto', contamination=0.1, leaf_size=30, metric='minkowski',
    metric_params=None, n_jobs=1, n_neighbors=40, novelty=True, p=2),
  KNN(algorithm='auto', contamination=0.1, leaf_size=30, method='largest',
    metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=50, p=2,
    radius=1.0),
  LOF(algorithm='auto', contamination=0.1, leaf_size=30, metric='minkowski',
    metric_params=None, n_jobs=1, n_neighbors=50, novelty=True, p=2),
  HBOS(alpha=0.1, contamination=0.1, n_bins=5, tol=0.5),
  HBOS(alpha=0.1, contamination=0.1, n_bins=10, tol=0.5),
  HBOS(alpha=0.1, contamination=0.1, n_bins=15, tol=0.5),
  HBOS(alpha=0.1, contamination=0.1, n_bins=20, tol=0.5),
  HBOS(alpha=0.1, contamination=0.1, n_bins=25, tol=0.5),
  HBOS(alpha=0.1, contamination=0.1, n_bins=30, tol=0.5),
  HBOS(alpha=0.1, contamination=0.1, n_bins=50, tol=0.5),
  OCSVM(cache_size=200, coef0=0.0, contamination=0.1, degree=3, gamma='auto',
     kernel='rbf', max_iter=-1, nu=0.01, shrinking=True, tol=0.001,
     verbose=False),
  OCSVM(cache_size=200, coef0=0.0, contamination=0.1, degree=3, gamma='auto',
     kernel='rbf', max_iter=-1, nu=0.1, shrinking=True, tol=0.001,
     verbose=False),
  OCSVM(cache_size=200, coef0=0.0, contamination=0.1, degree=3, gamma='auto',
     kernel='rbf', max_iter=-1, nu=0.2, shrinking=True, tol=0.001,
     verbose=False),
  OCSVM(cache_size=200, coef0=0.0, contamination=0.1, degree=3, gamma='auto',
     kernel='rbf', max_iter=-1, nu=0.3, shrinking=True, tol=0.001,
     verbose=False),
  OCSVM(cache_size=200, coef0=0.0, contamination=0.1, degree=3, gamma='auto',
     kernel='rbf', max_iter=-1, nu=0.4, shrinking=True, tol=0.001,
     verbose=False),
  OCSVM(cache_size=200, coef0=0.0, contamination=0.1, degree=3, gamma='auto',
     kernel='rbf', max_iter=-1, nu=0.5, shrinking=True, tol=0.001,
     verbose=False),
  OCSVM(cache_size=200, coef0=0.0, contamination=0.1, degree=3, gamma='auto',
     kernel='rbf', max_iter=-1, nu=0.6, shrinking=True, tol=0.001,
     verbose=False),
  OCSVM(cache_size=200, coef0=0.0, contamination=0.1, degree=3, gamma='auto',
     kernel='rbf', max_iter=-1, nu=0.7, shrinking=True, tol=0.001,
     verbose=False),
  OCSVM(cache_size=200, coef0=0.0, contamination=0.1, degree=3, gamma='auto',
     kernel='rbf', max_iter=-1, nu=0.8, shrinking=True, tol=0.001,
     verbose=False),
  OCSVM(cache_size=200, coef0=0.0, contamination=0.1, degree=3, gamma='auto',
     kernel='rbf', max_iter=-1, nu=0.9, shrinking=True, tol=0.001,
     verbose=False),
  OCSVM(cache_size=200, coef0=0.0, contamination=0.1, degree=3, gamma='auto',
     kernel='rbf', max_iter=-1, nu=0.99, shrinking=True, tol=0.001,
     verbose=False),
  IForest(behaviour='old', bootstrap=False, contamination=0.1, max_features=1.0,
      max_samples='auto', n_estimators=10, n_jobs=1, random_state=100,
      verbose=0),
  IForest(behaviour='old', bootstrap=False, contamination=0.1, max_features=1.0,
      max_samples='auto', n_estimators=20, n_jobs=1, random_state=100,
      verbose=0),
  IForest(behaviour='old', bootstrap=False, contamination=0.1, max_features=1.0,
      max_samples='auto', n_estimators=50, n_jobs=1, random_state=100,
      verbose=0),
  IForest(behaviour='old', bootstrap=False, contamination=0.1, max_features=1.0,
      max_samples='auto', n_estimators=70, n_jobs=1, random_state=100,
      verbose=0),
  IForest(behaviour='old', bootstrap=False, contamination=0.1, max_features=1.0,
      max_samples='auto', n_estimators=100, n_jobs=1, random_state=100,
      verbose=0),
  IForest(behaviour='old', bootstrap=False, contamination=0.1, max_features=1.0,
      max_samples='auto', n_estimators=150, n_jobs=1, random_state=100,
      verbose=0),
  IForest(behaviour='old', bootstrap=False, contamination=0.1, max_features=1.0,
      max_samples='auto', n_estimators=200, n_jobs=1, random_state=100,
      verbose=0)],
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 3,
 'min_child_weight': 1,
 'n_estimators': 100,
 'n_jobs': 1,
 'nthread': None,
 'objective': 'binary:logistic',
 'random_state': 100,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'silent': True,

(D.2) Step 2: Descriptive statistics of the normal and outlier groups

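Descriptive statistics of the features in the two groups, such as the mean and standard deviation, are important for demonstrating the soundness of the model.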

# Let's see how many '0's and '1's.
df_train = pd.DataFrame(X_train)
df_columns = df_train.columns
df_train['pred'] = y_train_pred
df_train['Group'] = np.where(df_train['pred']==1, 'Outlier','Normal')

# Now let's show the summary statistics:
cnt = df_train.groupby('Group')['pred'].count().reset_index().rename(columns={'pred':'Count'})
cnt['Count %'] = (cnt['Count'] / cnt['Count'].sum()) * 100 # The count and count %
stat = df_train.groupby('Group').mean().reset_index() # The avg.
cnt.merge(stat, left_on='Group',right_on='Group') # Put the count and the avg. together

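The table above shows the count and count percentage of the normal and outlier groups. Remember to label the features with their feature names for an effective presentation. The table tells us several important results: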

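**Size of the outlier group:** The outlier group is about 10%. Remember that the size of the outlier group is determined by the threshold. If you choose a higher threshold, the size will shrink.
**Feature statistics in each group:** The table shows that the values of features '0' through '5' in the outlier group are smaller than those in the normal group. In a business application, you may expect the feature values of the outlier group to be either higher or lower than those of the normal group. The feature statistics therefore help in understanding the model results.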
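The group profile can also report standard deviations alongside the means, as mentioned at the start of this step. The sketch below uses a small synthetic DataFrame whose column names, group sizes, and shift are assumptions made for illustration; it does not reproduce the table discussed above.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration: 90 normal rows, 10 outlier rows with shifted features
rng = np.random.default_rng(42)
feature_cols = ["f0", "f1", "f2"]
toy = pd.DataFrame(rng.normal(5.0, 1.0, size=(100, 3)), columns=feature_cols)
toy["Group"] = np.where(np.arange(100) < 90, "Normal", "Outlier")
toy.loc[toy["Group"] == "Outlier", feature_cols] -= 4.0

# Count per group, plus mean and standard deviation of every feature per group
counts = toy["Group"].value_counts()
profile = toy.groupby("Group")[feature_cols].agg(["mean", "std"])
print(counts)
print(profile.round(2))
```

Showing the standard deviation next to the mean helps judge whether a difference between the two groups is large relative to the spread within each group.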

(E) Summary

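Representation learning studies systematic ways to discover representations of raw data without any human intervention.
XGBOD (Extreme Gradient Boosting Outlier Detection) applies different unsupervised outlier detection methods to create new features called Transformed Outlier Scores (TOS). It uses the Pearson correlation coefficient to retain the useful features.
The default unsupervised learning models for representation learning include KNN, AvgKNN, LOF, iForest, HBOS, and OCSVM.
XGBOD adds the TOS to the original features to build the model.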

References

[1] B. Micenková, B. McWilliams, and I. Assent, "Learning Representations for Outlier Detection on a Budget," 29-Jul-2015.
[2] C. C. Aggarwal and S. Sathe, "Outlier Ensembles: An Introduction," 2017.
[3] Y. Zhao and M. K. Hryniewicki, "XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning," in Proc. IJCNN, IEEE, 2018, pp. 1–8. ISBN: 978-1-5090-6014-6.
[4] T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System," in Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Aug. 2016, pp. 785–794.
[5] N. Moniz and P. Branco, "Evaluation of Ensemble Methods in Imbalanced Regression Tasks," Proc. First Int. Work. Learn. with Imbalanced Domains Theory Appl., vol. 74, pp. 129–140, 2017.
[6] M. K. Sparrow, License to Steal: How Fraud Bleeds America's Health Care System, Updated Edition. United Kingdom: Taylor & Francis, 2019.