03 Detecting Outliers with the HBOS Algorithm (a Generalization of Histograms)

Latest recommended article published on 2024-02-14 12:45:55

小王做笔记

Category column: Anomaly Detection. Tags: machine learning, data mining

Article link: https://blog.csdn.net/qsx123432/article/details/114839297


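Anomaly Detection with the HBOS Algorithm

Theory

Definition:

HBOS (Histogram-based Outlier Score) assumes that the dimensions of multidimensional data are independent of one another. For each dimension it first builds a histogram of the data; for categorical data it counts the frequency of each value and computes the relative frequency.

Because it is a combination of univariate methods, it cannot model dependencies between features, but it is fast to compute and scales well to large datasets. Each dimension is divided into intervals (bins); the higher the density of a bin, the lower the anomaly score of the points that fall into it.

Purpose:

Build a histogram for each data dimension. For categorical data, count the frequency of each value and compute the relative frequency.

Applicable scenario:

Semi-supervised anomaly detection (HBOS itself does not need labels, so it also works as an unsupervised detector)

Variants:

Static-width histograms
Dynamic-width histograms
Notes on the two variants above: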
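
As a rough illustration of the two variants (my reading of the usual HBOS description, not code taken from pyod): static-width binning splits the value range into k equal-width intervals, while dynamic-width binning puts roughly the same number of samples into each bin, so the bin widths vary. The helper names and the choice k=10 are hypothetical.

```python
import numpy as np

def static_bin_edges(values, k):
    """k equal-width bins spanning [min, max]."""
    return np.linspace(values.min(), values.max(), k + 1)

def dynamic_bin_edges(values, k):
    """k bins holding roughly the same number of samples (quantile edges)."""
    return np.quantile(values, np.linspace(0.0, 1.0, k + 1))

rng = np.random.default_rng(0)
# 990 "normal" values plus 10 far-away values that stretch the range.
values = np.concatenate([rng.normal(0.0, 1.0, 990), rng.uniform(8.0, 10.0, 10)])

static_edges = static_bin_edges(values, k=10)
dynamic_edges = dynamic_bin_edges(values, k=10)

static_counts, _ = np.histogram(values, bins=static_edges)
dynamic_counts, _ = np.histogram(values, bins=dynamic_edges)

print("static widths :", np.round(np.diff(static_edges), 3))
print("static counts :", static_counts)
print("dynamic widths:", np.round(np.diff(dynamic_edges), 3))
print("dynamic counts:", dynamic_counts)
```

With static widths, a few distant values leave most bins nearly empty; with dynamic widths, the counts stay balanced and the sparse region shows up as a very wide bin instead, which gives it a low density.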

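Procedure:

Build a histogram for each data dimension (assuming that the dimensions are independent of each other), choosing the static or dynamic variant according to the characteristics of the values.
Compute an independent histogram for each dimension, where the height of each bin is an estimate of the density. (The scale of the dimension corresponds to the bin width; the density of the data corresponds to the bin height.)
Normalize each histogram so that every dimension (feature) carries equal weight in the anomaly score, for example by scaling the maximum height to 1.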
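
The steps above, together with the scoring step that follows, can be sketched in plain NumPy. This is a minimal static-width version written for illustration under my own assumptions (the function name hbos_scores, n_bins=10 and the small eps are hypothetical); it is not pyod's implementation.

```python
import numpy as np

def hbos_scores(X, n_bins=10, eps=1e-12):
    """Sum over features of log(1 / normalized histogram height)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    scores = np.zeros(n_samples)
    for j in range(n_features):
        # One independent histogram per dimension; bin height estimates density.
        heights, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        # Normalize so the tallest bin has height 1 (equal weight per feature).
        heights = heights / heights.max()
        # Bin index of every sample; interior edges give indices 0..n_bins-1.
        bin_idx = np.digitize(X[:, j], edges[1:-1])
        # Low density gives a large contribution; eps guards against log(1/0).
        scores += np.log(1.0 / (heights[bin_idx] + eps))
    return scores

rng = np.random.default_rng(42)
inliers = rng.normal(0.0, 1.0, size=(200, 2))
outlier = np.array([[6.0, 6.0]])          # one injected, clearly distant point
X_demo = np.vstack([inliers, outlier])

demo_scores = hbos_scores(X_demo)
print("score of the injected outlier:", demo_scores[-1])
print("median inlier score:", np.median(demo_scores[:-1]))
```

The injected point sits alone in a sparse bin in both dimensions, so it collects a large contribution from each feature and ends up with the highest score.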

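The scoring formula is then:
[Image: HBOS scoring formula, missing from the source]

The higher the probability density, the lower the anomaly score.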
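
Since the formula image did not survive, here is the form I believe the HBOS score takes in the original paper, stated as an assumption rather than a quotation. Here $d$ is the number of dimensions and $\mathrm{hist}_i(p)$ is the normalized height of the bin that sample $p$ falls into in dimension $i$:

```latex
\[
\mathrm{HBOS}(p) \;=\; \sum_{i=1}^{d} \log\!\left(\frac{1}{\mathrm{hist}_i(p)}\right)
\]
```

Because a sum of logarithms is the logarithm of a product, this equals $-\log \prod_i \mathrm{hist}_i(p)$: under the independence assumption the product plays the role of a joint density, so a high density gives a low score.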

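Learning Code

Below we use the pyod library to learn HBOS.

References:

https://blog.csdn.net/Sirow/article/details/112692357

https://blog.csdn.net/Joyceying1007/article/details/112688879

Both articles generate a dataset with the built-in generate_data() function and then use it in an anomaly detection example, but our dataset is not like that. Let us first study the anomaly detection workflow used there.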

# Import the required modules
from pyod.utils.data import evaluate_print, generate_data
from pyod.models.hbos import HBOS
from pyod.utils.example import visualize

# pyod provides two main helpers for generating toy data:
# 1. pyod.utils.data.generate_data()
# 2. pyod.utils.data.generate_data_clusters()

# Generate a toy example
contamination = 0.05  # percentage of outliers
n_train = 1000  # number of training points
n_test = 500  # number of testing points
# behaviour="old" returns X_train, y_train, X_test, y_test (deprecated, see the warning below);
# behaviour="new" returns X_train, X_test, y_train, y_test instead.
X_train, y_train, X_test, y_test = generate_data(
    n_train=n_train, n_test=n_test, contamination=contamination, behaviour="old")

# Initialize the HBOS model and fit it
clf_name = 'HBOS'
clf = HBOS()
clf.fit(X_train)

# get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_  # raw outlier scores

# get the prediction on the test data
y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
y_test_scores = clf.decision_function(X_test)  # outlier scores

# evaluate and print the results
print("\nOn Training Data:")
evaluate_print(clf_name, y_train, y_train_scores)
print("\nOn Test Data:")
evaluate_print(clf_name, y_test, y_test_scores)

# Visualize the detection results on the training and test sets
visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
          y_test_pred, show_figure=True, save_figure=False)

# visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
#           y_test_pred, show_figure=False, save_figure=True)

D:\Anacoder\lib\site-packages\pyod\utils\data.py:185: FutureWarning: behaviour="old" is deprecated and will be removed in version 0.8.0. Please use behaviour="new", which makes the returned datasets in the order of X_train, X_test, y_train, y_test.
  warn('behaviour="old" is deprecated and will be removed '



On Training Data:
HBOS ROC:0.9903, precision @ rank n:0.9167

On Test Data:
HBOS ROC:0.9883, precision @ rank n:0.88

[Figure: visualize() output for the training and test sets]
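
The binary labels come from the raw scores and the contamination rate. The sketch below shows that idea in NumPy, assuming the threshold is the (1 - contamination) percentile of the training scores; this is my understanding of how pyod behaves, not a copy of its code, and the scores are synthetic.

```python
import numpy as np

contamination = 0.05
rng = np.random.default_rng(0)
# Synthetic stand-in for clf.decision_scores_ (higher = more anomalous).
train_scores = rng.exponential(scale=1.0, size=1000)

# Threshold chosen so that about `contamination` of the points lie above it.
threshold = np.percentile(train_scores, 100 * (1 - contamination))
train_labels = (train_scores > threshold).astype(int)  # 1 = outlier, 0 = inlier

print("threshold:", threshold)
print("flagged outliers:", train_labels.sum(), "of", train_labels.size)
```

This is why the number of flagged training points tracks the contamination parameter rather than the true number of anomalies in the data.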

from __future__ import division
from __future__ import print_function

import os
import sys

# temporary solution for relative imports in case pyod is not installed
# if pyod is installed, no need to use the following line
#sys.path.append(
#    os.path.abspath(os.path.join(os.path.dirname("__file__"), '..')))

from pyod.models.hbos import HBOS
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

if __name__ == "__main__":
    contamination = 0.1  # percentage of outliers
    n_train = 200  # number of training points
    n_test = 100  # number of testing points

    # Generate sample data
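    # behaviour="old" is passed explicitly so that the unpacking order below
    # (X_train, y_train, X_test, y_test) stays valid; newer pyod releases
    # default to behaviour="new", which returns X_train, X_test, y_train, y_test.
    X_train, y_train, X_test, y_test = \
        generate_data(n_train=n_train,
                      n_test=n_test,
                      n_features=2,
                      contamination=contamination,
                      behaviour="old",
                      random_state=42)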

    # train HBOS detector
    clf_name = 'HBOS'
    clf = HBOS()
    clf.fit(X_train)

    # get the prediction labels and outlier scores of the training data
    y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
    y_train_scores = clf.decision_scores_  # raw outlier scores

    # get the prediction on the test data
    y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
    y_test_scores = clf.decision_function(X_test)  # outlier scores

    # evaluate and print the results
    print("\nOn Training Data:")
    evaluate_print(clf_name, y_train, y_train_scores)
    print("\nOn Test Data:")
    evaluate_print(clf_name, y_test, y_test_scores)

    # visualize the results
    visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
              y_test_pred, show_figure=True, save_figure=False)

D:\Anacoder\lib\site-packages\pyod\utils\data.py:185: FutureWarning: behaviour="old" is deprecated and will be removed in version 0.8.0. Please use behaviour="new", which makes the returned datasets in the order of X_train, X_test, y_train, y_test.
  warn('behaviour="old" is deprecated and will be removed '



On Training Data:
HBOS ROC:0.9947, precision @ rank n:0.8

On Test Data:
HBOS ROC:0.9744, precision @ rank n:0.6

[Figure: visualize() output for the training and test sets]
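
The "precision @ rank n" value printed by evaluate_print can be read as follows: take the n highest-scoring points, where n is the number of true outliers, and report the fraction of them that really are outliers. Below is a simplified sketch of that metric; the function name is hypothetical and pyod's own implementation may handle ties differently.

```python
import numpy as np

def precision_at_rank_n(y_true, scores):
    """Fraction of true outliers among the n highest-scoring points,
    where n is the number of true outliers."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n = int(y_true.sum())
    top_n = np.argsort(scores)[::-1][:n]  # indices of the n largest scores
    return y_true[top_n].sum() / n

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
scores = np.array([0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.1, 0.4, 0.2, 0.0])
p_at_n = precision_at_rank_n(y_true, scores)
print(p_at_n)
```

In this toy case there are three true outliers, and two of the three highest scores belong to them, so the result is 2/3.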

Hands-on Code

Here we carry out the anomaly detection task on a local dataset.

1. Import the libraries and read the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

import seaborn as sns

path = r"F:/Python_Tensorflow_codes/006group_learning/team-learning-data-mining-master/AnomalyDetection/outlier.txt"
# df_txt1 = pd.read_csv(path, header=None)
# print(df_txt1)

df = pd.read_table(path, sep=" ", header=None)

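2. Inspect the data and set the column names

# print(df)

### How can the existing column names be reset?
# df = df.set_axis(["a", "b", "c", "d", "e"], axis=1)
# print(df)

### Answer to the question above: assign directly to the columns attribute of df
df.columns = ["a", "b", "c", "d", "e"]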
df

	a	b	c	d	e
0	0.036853	0.034390	0.091979	-0.010263	-0.008141
1	-0.001152	0.021750	-0.020401	0.009866	-0.034471
2	-0.012586	0.047364	0.011108	-0.011569	-0.023341
3	-0.028378	0.043980	0.001264	0.023138	0.005426
4	0.022225	0.007152	-0.037135	-0.029387	-0.099154
...	...	...	...	...	...
995	0.006989	-0.039900	0.013303	-0.035476	0.068688
996	-0.004276	-0.028075	-0.000769	-0.051646	-0.057467
997	-0.000544	0.059539	-0.012897	0.032994	-0.089494
998	-0.039697	0.000563	-0.032977	0.015441	0.007677
999	-0.010866	0.022604	0.014564	-0.014077	-0.025783

1000 rows × 5 columns

type(df["a"])

pandas.core.series.Series

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
df[df.columns] = scaler.fit_transform(df[df.columns])
df.head()

	a	b	c	d	e
0	0.705238	0.729569	1.000000	0.480335	0.479880
1	0.510101	0.661092	0.457152	0.586223	0.341049
2	0.451392	0.799855	0.609358	0.473463	0.399738
3	0.370308	0.781523	0.561804	0.656048	0.551413
4	0.630132	0.582011	0.376318	0.379728	0.000000
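
MinMaxScaler with feature_range=(0, 1) rescales each column independently via (x - min) / (max - min), which is why the smallest value of column e becomes 0.000000 and the largest value of column c becomes 1.000000 in the output above. A minimal NumPy sketch of the same transformation, with a hypothetical helper name and toy data:

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns map to 0, no division by zero
    return (X - col_min) / col_range

X_small = np.array([[1.0, 10.0],
                    [2.0, 30.0],
                    [4.0, 20.0]])
X_scaled = min_max_scale(X_small)
print(X_scaled)
```

Scaling to [0, 1] also makes it possible to draw the decision contour later on a fixed grid over the unit square.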

X1 = df['a'].values.reshape(-1,1)
X2 = df['b'].values.reshape(-1,1)
X3 = df["c"].values.reshape(-1,1)
X4 = df['d'].values.reshape(-1,1)
X5 = df["e"].values.reshape(-1,1)


X = np.concatenate((X1,X2,X3,X4,X5),axis=1)

## Additional libraries needed

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

import matplotlib                     ## both imports are needed: this one makes the bare name matplotlib available
from matplotlib import font_manager   ## this one makes matplotlib.font_manager.FontProperties(size=20) resolvable

# Import models
from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF

random_state = np.random.RandomState(42)
outliers_fraction = 0.05
# Define the outlier detectors to compare (only HBOS is enabled here; the others are commented out)
classifiers = {
#         'Angle-based Outlier Detector (ABOD)': ABOD(contamination=outliers_fraction),
#         'Cluster-based Local Outlier Factor (CBLOF)':CBLOF(contamination=outliers_fraction,check_estimator=False, random_state=random_state),
#         'Feature Bagging':FeatureBagging(LOF(n_neighbors=35),contamination=outliers_fraction,check_estimator=False,random_state=random_state),
        'Histogram-base Outlier Detection (HBOS)': HBOS(contamination=outliers_fraction),
#         'Isolation Forest': IForest(contamination=outliers_fraction,random_state=random_state),
#         'K Nearest Neighbors (KNN)': KNN(contamination=outliers_fraction),
#         'Average KNN': KNN(method='mean',contamination=outliers_fraction)
}

### Build the coordinate grid for the plot with numpy.meshgrid()
xx , yy = np.meshgrid(np.linspace(0,1 , 200), np.linspace(0, 1, 200))


for i, (clf_name, clf) in enumerate(classifiers.items()):
    clf.fit(X)
    # predict raw anomaly score
    scores_pred = clf.decision_function(X) * -1
        
    # prediction of a datapoint category outlier or inlier
    y_pred = clf.predict(X)
    n_inliers = len(y_pred) - np.count_nonzero(y_pred)
    n_outliers = np.count_nonzero(y_pred == 1)
    plt.figure(figsize=(10, 10))
    
    # copy of dataframe (df.copy() so that the 'outlier' column is not added to df itself)
    dfx = df.copy()
    dfx['outlier'] = y_pred.tolist()
    
    # IX1 - inlier feature 1,  IX2 - inlier feature 2
    IX1 =  np.array(dfx['a'][dfx['outlier'] == 0]).reshape(-1,1)
    IX2 =  np.array(dfx['b'][dfx['outlier'] == 0]).reshape(-1,1)
#     IX3 =  np.array(dfx['c'][dfx['outlier'] == 0]).reshape(-1,1)
#     IX4 =  np.array(dfx['d'][dfx['outlier'] == 0]).reshape(-1,1)
#     IX5 =  np.array(dfx['e'][dfx['outlier'] == 0]).reshape(-1,1)


# '''
#     #### Attempt to turn the enumeration above into a for loop.
#     #### This is clearly not the right approach: when both the variable being assigned
#     #### and the column being read change on every pass, a plain for loop is a poor fit.
# #     tuple_ = "abcde"
# #     for i in tuple_:
# #         IX_temp =  np.array(dfx[i][dfx['outlier'] == 0]).reshape(-1,1)
# #         if not (i != "a"):
# #             IX1 = IX_temp
# #         if i == "b":
# #             IX2 = IX_temp
# #         if i == "c":
# #             IX3 = IX_temp    
# #         if i == "d":
# #             IX4 = IX_temp    
# #         if i == "e":
# #             IX5 = IX_temp


# #### Put the multiple categories into a single variable

# '''

    
    # OX1 - outlier feature 1, OX2 - outlier feature 2
    OX1 =  dfx['a'][dfx['outlier'] == 1].values.reshape(-1,1)
    OX2 =  dfx['b'][dfx['outlier'] == 1].values.reshape(-1,1)
#     OX3 =  dfx['c'][dfx['outlier'] == 1].values.reshape(-1,1)
#     OX4 =  dfx['d'][dfx['outlier'] == 1].values.reshape(-1,1)
#     OX5 =  dfx['e'][dfx['outlier'] == 1].values.reshape(-1,1)
   
         
    print('OUTLIERS : ',n_outliers,'INLIERS : ',n_inliers, clf_name)
        
    # threshold value to consider a datapoint inlier or outlier
    threshold = stats.scoreatpercentile(scores_pred,100 * outliers_fraction)
        
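    # decision function calculates the raw anomaly score for every point
    # NOTE: clf was fitted on five features, but the grid only has two columns (a, b).
    # Whether this call is accepted, and how Z compares with the five-feature threshold
    # above, depends on the pyod version; treat the contour as a rough illustration.
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1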
    Z = Z.reshape(xx.shape)
          
    # fill blue map colormap from minimum anomaly score to threshold value
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),cmap=plt.cm.Blues_r)
        
    # draw red contour line where anomaly score is equal to thresold
    a = plt.contour(xx, yy, Z, levels=[threshold],linewidths=2, colors='red')
        
    # fill orange contour lines where range of anomaly score is from threshold to maximum anomaly score
    plt.contourf(xx, yy, Z, levels=[threshold, Z.max()],colors='orange')
        
    b = plt.scatter(IX1,IX2, c='white',s=20, edgecolor='k')     ### plots feature a (IX1) against feature b (IX2) for the inliers
    
    c = plt.scatter(OX1,OX2, c='black',s=20, edgecolor='k')
       
    plt.axis('tight')  
    
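    # loc=2 is used for the top left corner
    # legend_elements() replaces a.collections[0], which recent matplotlib releases no longer provide
    threshold_line = a.legend_elements()[0][0]
    plt.legend(
        [threshold_line, b, c],
        ['learned decision function', 'inliers','outliers'],
        prop=matplotlib.font_manager.FontProperties(size=20),
        loc=2)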
      
    plt.xlim((0, 1))
    plt.ylim((0, 1))
    plt.title(clf_name)
    plt.show()

OUTLIERS :  50 INLIERS :  950 Histogram-base Outlier Detection (HBOS)

[Figure: HBOS decision contour with inliers and outliers for features a and b]
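
The contour in the plotting code relies on a common NumPy pattern: build a grid with np.meshgrid, flatten it into a list of points, score every point, and reshape the scores back to the grid. The sketch below shows just that pattern on a tiny grid; toy_score is a hypothetical stand-in for clf.decision_function.

```python
import numpy as np

# A coarse 3 x 4 grid instead of 200 x 200, so the shapes are easy to follow.
xx_small, yy_small = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 3))

# Flatten both coordinate arrays and pair them up: one row per grid point.
grid_points = np.c_[xx_small.ravel(), yy_small.ravel()]

def toy_score(points):
    """Hypothetical stand-in for clf.decision_function: distance from (0.5, 0.5)."""
    return np.linalg.norm(points - 0.5, axis=1)

# Score every grid point, then fold the flat result back into the grid shape
# so that a call such as plt.contourf(xx, yy, Z) can draw it.
Z_small = toy_score(grid_points).reshape(xx_small.shape)

print(xx_small.shape, grid_points.shape, Z_small.shape)
```

The model being evaluated on the grid must accept as many columns as the grid provides, which is worth checking when the detector was fitted on more than two features.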

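The program above plots only two classes of points (inliers and outliers). To plot several categories in the same coordinate system, we can use sns.scatterplot().

Reference: https://www.cnblogs.com/cgmcoding/p/13475462.html

The following example shows how sns.scatterplot() plots multi-category data.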

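import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# index = pd.date_range(start="2000-1-11", periods=100, freq="M", name="date")
# "M" is the month-end frequency (the original used lowercase "m"); pandas 2.2 and later prefer "ME"
index = pd.date_range(start="2020-1-11", periods=100, freq="M", name="date")
#### Compare the two statements. Open question: why does the plot from the second statement also appear to start in 2000?


data = np.random.randn(100, 4).cumsum(axis=0)

wide_df = pd.DataFrame(data, index, ["a", "b", "c", "d"])
sns.scatterplot(data=wide_df)
plt.show()
###### Open question: how can the issue visible in the figure be resolved?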

[Figure: sns.scatterplot(data=wide_df) output]

##### --------- 
sns.scatterplot(data=df)    ## note: this handles all columns at once via data=df
### plt.scatter() is different: it needs explicit x and y; a DataFrame can be passed as data=df only if x and y are given as column names

<matplotlib.axes._subplots.AxesSubplot at 0x2ac885c5310>

[Figure: sns.scatterplot(data=df) output]

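The next problem I face: concatenating the data of the different categories column by column, using a pandas concatenation method.

That is, all the IXi are concatenated into a single df_IX, where df_IX is a DataFrame. ====> ???

Then set the OX values as the index.

After that it should be possible to call sns.scatterplot(data=data) on the result. (Note that sns.scatterplot() has no index parameter; for wide-form data it takes the x values from the DataFrame's own index.)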
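
One possible answer to the concatenation question is pd.concat with axis=1, which places each Series side by side as a column. The sketch below uses small synthetic stand-ins for the scaled features and the HBOS labels (demo, labels and labelled are hypothetical names), and also shows an alternative that keeps all rows and stores the label as an extra column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the scaled features and the HBOS labels.
demo = pd.DataFrame(rng.random((6, 3)), columns=["a", "b", "c"])
labels = np.array([0, 0, 1, 0, 1, 0])

# One Series of inlier values per feature, like IX1, IX2, ... in the loop above.
inlier_cols = [demo.loc[labels == 0, col] for col in ["a", "b", "c"]]
# Column-wise concatenation: each Series becomes one column, aligned on the index.
df_IX = pd.concat(inlier_cols, axis=1)

# Alternative: keep every row and store the label as an extra column.
labelled = demo.assign(outlier=labels)

print(df_IX.shape)
print(labelled.head())
```

With the second form, a call such as sns.scatterplot(data=labelled, x="a", y="b", hue="outlier") colours the points by label, which avoids splitting the data into separate inlier and outlier arrays.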

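Appendix:

Open questions: generate_data():

Utility function to generate synthesized data.
Normal data is generated by a multivariate Gaussian distribution and outliers are generated by a uniform distribution.

Usage:
X_train, y_train, X_test, y_test = generate_data(n_train=n_train, n_test=n_test, contamination=contamination, behaviour="old")

In other words, it is a data generator: given the number of training samples, the number of test samples, and other parameters, it produces a dataset in which normal samples follow a Gaussian distribution and outliers follow a uniform distribution.

We, however, already have a dataset, df_txt, so generate_data() is not needed. Our task is to process the existing data and find the outliers. df_txt has no sample labels, so the question is how to split the dataset.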

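At this point I thought of using

from sklearn.model_selection import train_test_split

# With a single unlabeled array, train_test_split returns exactly two pieces,
# so unpacking into four names (as first attempted) raises a ValueError.
X_train, X_test = train_test_split(df_txt, test_size=0.2)

However, I was unsure what the split actually produced, and the result was poor. (The original attempt unpacked the result into X_train, y_train, X_test, y_test.)
Was train_test_split() being used incorrectly?

Is train_test_split only suitable for supervised learning, or does it also apply to unlabeled datasets? It applies to both: it simply splits every array it is given, with or without labels.

What train_test_split() does: it splits data.

Usage:

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(train_data, train_target, test_size=0.4, random_state=0, stratify=train_target)

When to use it:

Splitting a dataset into a training set and a test set
Simple (hold-out) cross-validation

Data for supervised learning has both features and labels, whereas data for unsupervised learning has only features and no labels.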
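
To make the splitting question concrete: splitting unlabeled data only means shuffling the rows and cutting them into two parts, so there are two outputs rather than four. The NumPy sketch below illustrates the idea with a hypothetical helper, split_unlabeled; it is not how scikit-learn implements train_test_split.

```python
import numpy as np

def split_unlabeled(X, test_size=0.2, seed=0):
    """Shuffle the rows of X and cut them into a train part and a test part."""
    X = np.asarray(X)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(np.ceil(len(X) * test_size))
    return X[order[n_test:]], X[order[:n_test]]

X_all = np.arange(50).reshape(10, 5)  # 10 unlabeled samples, 5 features
X_train_demo, X_test_demo = split_unlabeled(X_all, test_size=0.2)
print(X_train_demo.shape, X_test_demo.shape)
```

For an unsupervised detector such as HBOS a split is optional: it is useful for checking how scores behave on unseen data, but fitting on the whole dataset is also a valid choice.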

So we simply train on the data directly:

path = r"F:/Python_Tensorflow_codes/006group_learning/team-learning-data-mining-master/AnomalyDetection/outlier.txt"
# df_txt1 = pd.read_csv(path, header=None)
# print(df_txt1)

df = pd.read_table(path, sep=" ", header=None)

clf = HBOS()
clf.fit(df)      ## no labels are needed, so we can fit directly; fit() receives the whole dataset

y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_  # raw outlier scores

# print("\nOn Training Data:")
# evaluate_print(clf_name, y_train, y_train_scores)

# print(y_train_pred)
# print(y_train_scores)
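
Once the scores are available, a natural next step is to rank the samples and inspect the most anomalous ones. A small sketch with synthetic scores standing in for clf.decision_scores_ (the variable names are hypothetical):

```python
import numpy as np

# Synthetic stand-in for clf.decision_scores_ (higher = more anomalous).
scores_demo = np.array([0.3, 2.5, 0.1, 1.9, 0.4, 3.2])

top_k = 3
# Row positions of the top_k largest scores, most anomalous first.
most_anomalous = np.argsort(scores_demo)[::-1][:top_k]
print(most_anomalous)
```

Applied to the real data, df.iloc[most_anomalous] would return the corresponding rows, and df[y_train_pred == 1] would return every row flagged as an outlier.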


Using the sample() method:
