03 Detecting outliers with the HBOS algorithm (a generalization of histograms)

Anomaly detection with the HBOS algorithm

Theory

Definition

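HBOS (Histogram-Based Outlier Score) rests on the assumption that the dimensions of multidimensional data are independent of one another. For each dimension it first builds a histogram; for categorical data it counts the frequency of each value and computes the relative frequency.

It is a combination of univariate methods, so it cannot model dependencies between features, but it is fast to compute and works well on large datasets. Its basic assumption is that every dimension of the dataset is independent of the others. Each dimension is then divided into intervals (bins): the higher the density of a bin, the lower the outlier score.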

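Purpose

Build a histogram for every data dimension. For categorical data, count the frequency of each value and compute the relative frequency.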

Applicable scenarios

Unsupervised anomaly detection (HBOS needs no labels for training)

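Types:

  1. Static bin-width histograms
  2. Dynamic bin-width histograms
    Notes on the two:
    A static-width histogram splits the value range into k bins of equal width and uses the number of points in each bin as its height. A dynamic-width histogram sorts the values first and puts a fixed number of consecutive values (about N/k) into each bin, so the bin widths vary while every bin covers the same number of points; this copes better with long tails and large gaps in the value range.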
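
To make the difference between the two binning schemes concrete, here is a minimal pure-Python sketch. The helper names static_edges and dynamic_edges are made up for this illustration and are not part of pyod; the dynamic variant is simplified and does not treat tied values specially.

```python
def static_edges(values, k):
    """Edges of k equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def dynamic_edges(values, k):
    """Edges of k bins that each hold roughly len(values) / k sorted points."""
    ordered = sorted(values)
    n = len(ordered)
    # edge i is the value that starts the i-th group of about n / k points
    edges = [ordered[(i * n) // k] for i in range(k)]
    edges.append(ordered[-1])
    return edges


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]  # one far-away value
print(static_edges(data, 3))
print(dynamic_edges(data, 3))
```

On this toy data the static edges are [1.0, 34.0, 67.0, 100.0], so nine of the ten points land in the first bin, while the dynamic edges are [1, 3, 4, 100], which spreads the points more evenly and leaves the far-away value in a very wide (and therefore low-density) bin.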

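Procedure:

  1. Build a histogram for every data dimension (assuming that the different dimensions are independent of each other), choosing the static or the dynamic variant according to the characteristics of the values.

  2. Compute one independent histogram per dimension, where the height of each bin is an estimate of the density. (Scale of the dimension = width of the bin; density of the data = height of the bin.)

  3. Normalize each histogram so that its maximum height is 1, which ensures that every dimension (every feature) carries equal weight in the outlier score.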

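The score of a point p is then computed as:

$$\mathrm{HBOS}(p) = \sum_{i=1}^{d} \log\left(\frac{1}{\mathrm{hist}_i(p)}\right)$$

where d is the number of dimensions and hist_i(p) is the normalized height of the bin that p falls into in dimension i.

The higher the probability density, the lower the outlier score.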
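
The three steps and the score can be sketched from scratch in a few lines. This is only a rough illustration under simplifying assumptions, not pyod's implementation: it uses static-width bins, clamps values outside the training range into the edge bins, and adds a small eps (an assumption of this sketch) so that empty bins do not cause a division by zero. The names fit_histograms and hbos_score are hypothetical.

```python
import math


def fit_histograms(X, n_bins=10):
    """One static-width histogram per column, heights scaled so the max is 1."""
    models = []
    for column in zip(*X):
        lo, hi = min(column), max(column)
        width = (hi - lo) / n_bins or 1.0  # guard against a constant column
        counts = [0] * n_bins
        for v in column:
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
        top = max(counts)
        models.append((lo, width, [c / top for c in counts]))
    return models


def hbos_score(x, models, eps=1e-9):
    """Sum over dimensions of log(1 / normalized bin height)."""
    score = 0.0
    for v, (lo, width, heights) in zip(x, models):
        idx = int((v - lo) / width)
        idx = max(0, min(idx, len(heights) - 1))  # clamp into the edge bins
        score += math.log(1.0 / (heights[idx] + eps))
    return score


X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.1, 0.1], [0.2, 0.2], [5.0, 5.0]]
models = fit_histograms(X, n_bins=5)
dense = hbos_score([0.1, 0.1], models)
sparse = hbos_score([5.0, 5.0], models)
print(dense, sparse)
```

The point in the crowded bin gets a score close to 0, while the isolated point sits in a bin of normalized height 0.2 in both dimensions and scores about 2 * log(5), which matches the rule that higher density means a lower outlier score.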

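Learning code

Below we use the pyod library to learn HBOS.

Reference articles:

https://blog.csdn.net/Sirow/article/details/112692357

https://blog.csdn.net/Joyceying1007/article/details/112688879

Both articles generate a dataset with the built-in generate_data() function and then use it in an anomaly-detection example, but our dataset is not like that. Let us first learn the anomaly-detection workflow they use.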

# import the required modules
from pyod.utils.data import evaluate_print,generate_data
from pyod.models.hbos import HBOS
from pyod.utils.example import visualize

# pyod's main helpers for generating toy data are:
# 1. pyod.utils.data.generate_data()
# 2. pyod.utils.data.generate_data_clusters()

# so, generate a toy example:
contamination = 0.05  # percentage of outliers
n_train = 1000  # number of training points
n_test = 500  # number of testing points
X_train, y_train, X_test, y_test = generate_data(n_train=n_train, n_test=n_test, contamination=contamination, behaviour="old")  ### function that generates the dataset

# initialize the HBOS model
clf_name = 'HBOS'
clf = HBOS() 
clf.fit(X_train)

# get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_  # raw outlier scores

# get the prediction on the test data
y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
y_test_scores = clf.decision_function(X_test)  # outlier scores

# evaluate and print the results
print("\nOn Training Data:")
evaluate_print(clf_name, y_train, y_train_scores)
print("\nOn Test Data:")
evaluate_print(clf_name, y_test, y_test_scores)

# visualize the detection results on the training and test sets
visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
          y_test_pred, show_figure=True, save_figure=False)

# visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
#           y_test_pred, show_figure=False, save_figure=True)
D:\Anacoder\lib\site-packages\pyod\utils\data.py:185: FutureWarning: behaviour="old" is deprecated and will be removed in version 0.8.0. Please use behaviour="new", which makes the returned datasets in the order of X_train, X_test, y_train, y_test.
  warn('behaviour="old" is deprecated and will be removed '



On Training Data:
HBOS ROC:0.9903, precision @ rank n:0.9167

On Test Data:
HBOS ROC:0.9883, precision @ rank n:0.88


from __future__ import division
from __future__ import print_function

import os
import sys

# temporary solution for relative imports in case pyod is not installed
# if pyod is installed, no need to use the following line
#sys.path.append(
#    os.path.abspath(os.path.join(os.path.dirname("__file__"), '..')))

from pyod.models.hbos import HBOS
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

if __name__ == "__main__":
    contamination = 0.1  # percentage of outliers
    n_train = 200  # number of training points
    n_test = 100  # number of testing points

    # Generate sample data
    X_train, y_train, X_test, y_test = \
        generate_data(n_train=n_train,
                      n_test=n_test,
                      n_features=2,
                      contamination=contamination,
                      random_state=42)

    # train HBOS detector
    clf_name = 'HBOS'
    clf = HBOS()
    clf.fit(X_train)

    # get the prediction labels and outlier scores of the training data
    y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
    y_train_scores = clf.decision_scores_  # raw outlier scores

    # get the prediction on the test data
    y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
    y_test_scores = clf.decision_function(X_test)  # outlier scores

    # evaluate and print the results
    print("\nOn Training Data:")
    evaluate_print(clf_name, y_train, y_train_scores)
    print("\nOn Test Data:")
    evaluate_print(clf_name, y_test, y_test_scores)

    # visualize the results
    visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
              y_test_pred, show_figure=True, save_figure=False)

D:\Anacoder\lib\site-packages\pyod\utils\data.py:185: FutureWarning: behaviour="old" is deprecated and will be removed in version 0.8.0. Please use behaviour="new", which makes the returned datasets in the order of X_train, X_test, y_train, y_test.
  warn('behaviour="old" is deprecated and will be removed '



On Training Data:
HBOS ROC:0.9947, precision @ rank n:0.8

On Test Data:
HBOS ROC:0.9744, precision @ rank n:0.6


Hands-on code

Here we carry out the anomaly-detection task on a local dataset.

1. Import the libraries and read the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

import seaborn as sns

path = r"F:/Python_Tensorflow_codes/006group_learning/team-learning-data-mining-master/AnomalyDetection/outlier.txt"
# df_txt1 = pd.read_csv(path, header=None)
# print(df_txt1)

df = pd.read_table(path, sep=" ", header=None)

2. Inspect the data and set the column names

# print(df)

### How can the existing column names be reset?
# df.set_axis(["a", "b", "c", "d", "e"], inplace=True)
# print(df)

# ### Answer to the question above: assign directly to the columns attribute of df
df.columns = ["a", "b", "c", "d", "e"]
df
            a         b         c         d         e
0    0.036853  0.034390  0.091979 -0.010263 -0.008141
1   -0.001152  0.021750 -0.020401  0.009866 -0.034471
2   -0.012586  0.047364  0.011108 -0.011569 -0.023341
3   -0.028378  0.043980  0.001264  0.023138  0.005426
4    0.022225  0.007152 -0.037135 -0.029387 -0.099154
..        ...       ...       ...       ...       ...
995  0.006989 -0.039900  0.013303 -0.035476  0.068688
996 -0.004276 -0.028075 -0.000769 -0.051646 -0.057467
997 -0.000544  0.059539 -0.012897  0.032994 -0.089494
998 -0.039697  0.000563 -0.032977  0.015441  0.007677
999 -0.010866  0.022604  0.014564 -0.014077 -0.025783

1000 rows × 5 columns

type(df["a"])
pandas.core.series.Series
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
df[df.columns] = scaler.fit_transform(df[df.columns])
df.head()
          a         b         c         d         e
0  0.705238  0.729569  1.000000  0.480335  0.479880
1  0.510101  0.661092  0.457152  0.586223  0.341049
2  0.451392  0.799855  0.609358  0.473463  0.399738
3  0.370308  0.781523  0.561804  0.656048  0.551413
4  0.630132  0.582011  0.376318  0.379728  0.000000
X1 = df['a'].values.reshape(-1,1)
X2 = df['b'].values.reshape(-1,1)
X3 = df["c"].values.reshape(-1,1)
X4 = df['d'].values.reshape(-1,1)
X5 = df["e"].values.reshape(-1,1)


X = np.concatenate((X1,X2,X3,X4,X5),axis=1)
## additional libraries needed

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

import matplotlib                     ## both imports are needed: this one makes the bare name matplotlib available
from matplotlib import font_manager   ## needed for: prop=matplotlib.font_manager.FontProperties(size=20)
# Import models
from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF

random_state = np.random.RandomState(42)
outliers_fraction = 0.05
# Define the outlier detection tools to be compared (only HBOS is enabled here)
classifiers = {
#         'Angle-based Outlier Detector (ABOD)': ABOD(contamination=outliers_fraction),
#         'Cluster-based Local Outlier Factor (CBLOF)':CBLOF(contamination=outliers_fraction,check_estimator=False, random_state=random_state),
#         'Feature Bagging':FeatureBagging(LOF(n_neighbors=35),contamination=outliers_fraction,check_estimator=False,random_state=random_state),
        'Histogram-base Outlier Detection (HBOS)': HBOS(contamination=outliers_fraction),
#         'Isolation Forest': IForest(contamination=outliers_fraction,random_state=random_state),
#         'K Nearest Neighbors (KNN)': KNN(contamination=outliers_fraction),
#         'Average KNN': KNN(method='mean',contamination=outliers_fraction)
}
### build the plotting grid with numpy.meshgrid()
xx , yy = np.meshgrid(np.linspace(0,1 , 200), np.linspace(0, 1, 200))


for i, (clf_name, clf) in enumerate(classifiers.items()):
    clf.fit(X)
    # predict raw anomaly score
    scores_pred = clf.decision_function(X) * -1
        
    # prediction of a datapoint category outlier or inlier
    y_pred = clf.predict(X)
    n_inliers = len(y_pred) - np.count_nonzero(y_pred)
    n_outliers = np.count_nonzero(y_pred == 1)
    plt.figure(figsize=(10, 10))
    
    # work on a copy so the original DataFrame is left untouched
    dfx = df.copy()
    dfx['outlier'] = y_pred.tolist()
    
    # IX1 - inlier feature 1,  IX2 - inlier feature 2
    IX1 =  np.array(dfx['a'][dfx['outlier'] == 0]).reshape(-1,1)
    IX2 =  np.array(dfx['b'][dfx['outlier'] == 0]).reshape(-1,1)
#     IX3 =  np.array(dfx['c'][dfx['outlier'] == 0]).reshape(-1,1)
#     IX4 =  np.array(dfx['d'][dfx['outlier'] == 0]).reshape(-1,1)
#     IX5 =  np.array(dfx['e'][dfx['outlier'] == 0]).reshape(-1,1)


# '''
#     #### Rewriting the enumeration above as a for loop.   ### This is clearly not a good fit: when both the source column and the target variable change on every iteration, a for loop is a poor choice
# #     tuple_ = "abcde"
# #     for i in tuple_:
# #         IX_temp =  np.array(dfx[i][dfx['outlier'] == 0]).reshape(-1,1)
# #         if i == "a":
# #             IX1 = IX_temp
# #         if i == "b":
# #             IX2 = IX_temp
# #         if i == "c":
# #             IX3 = IX_temp    
# #         if i == "d":
# #             IX4 = IX_temp    
# #         if i == "e":
# #             IX5 = IX_temp


# #### put the different categories into a single variable instead

# '''

    
    # OX1 - outlier feature 1, OX2 - outlier feature 2
    OX1 =  dfx['a'][dfx['outlier'] == 1].values.reshape(-1,1)
    OX2 =  dfx['b'][dfx['outlier'] == 1].values.reshape(-1,1)
#     OX3 =  dfx['c'][dfx['outlier'] == 1].values.reshape(-1,1)
#     OX4 =  dfx['d'][dfx['outlier'] == 1].values.reshape(-1,1)
#     OX5 =  dfx['e'][dfx['outlier'] == 1].values.reshape(-1,1)
   
         
    print('OUTLIERS : ',n_outliers,'INLIERS : ',n_inliers, clf_name)
        
    # threshold value to consider a datapoint inlier or outlier
    threshold = stats.scoreatpercentile(scores_pred,100 * outliers_fraction)
        
    # decision function calculates the raw anomaly score for every point
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
    Z = Z.reshape(xx.shape)
          
    # fill blue map colormap from minimum anomaly score to threshold value
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),cmap=plt.cm.Blues_r)
        
    # draw red contour line where anomaly score is equal to threshold
    a = plt.contour(xx, yy, Z, levels=[threshold],linewidths=2, colors='red')
        
    # fill orange contour lines where range of anomaly score is from threshold to maximum anomaly score
    plt.contourf(xx, yy, Z, levels=[threshold, Z.max()],colors='orange')
        
    b = plt.scatter(IX1,IX2, c='white',s=20, edgecolor='k')     ### this plots IX1 against IX2 (the inliers)
    
    c = plt.scatter(OX1,OX2, c='black',s=20, edgecolor='k')
       
    plt.axis('tight')  
    
    # loc=2 is used for the top left corner 
    plt.legend(
        [a.collections[0], b,c],
        ['learned decision function', 'inliers','outliers'],
        prop=matplotlib.font_manager.FontProperties(size=20),
        loc=2)
      
    plt.xlim((0, 1))
    plt.ylim((0, 1))
    plt.title(clf_name)
    plt.show()
OUTLIERS :  50 INLIERS :  950 Histogram-base Outlier Detection (HBOS)
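
The split of 50 outliers and 950 inliers follows from outliers_fraction = 0.05 on 1000 rows: the detector turns scores into labels by cutting at a threshold derived from the contamination value. The sketch below shows the idea with a simple rank-based cut. It is an assumption-laden illustration (pyod itself works with a percentile of the training scores), and label_by_contamination is a hypothetical helper name.

```python
def label_by_contamination(scores, contamination):
    """Label the highest-scoring share of points as outliers (1), the rest as inliers (0)."""
    n_outliers = round(len(scores) * contamination)
    if n_outliers == 0:
        return [0] * len(scores)
    # the n-th largest score acts as the decision threshold
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    return [1 if s >= threshold else 0 for s in scores]


scores = [float(i) for i in range(1000)]  # stand-in for raw outlier scores
labels = label_by_contamination(scores, 0.05)
print("OUTLIERS :", sum(labels), "INLIERS :", len(labels) - sum(labels))
```

With distinct scores this flags exactly 5% of the points; if several points tie at the threshold, slightly more can be flagged.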


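The program above plots two categories of data (inliers and outliers). To plot several categories in the same coordinate system, we can look at sns.scatterplot() and its parameters.

Reference article: https://www.cnblogs.com/cgmcoding/p/13475462.html

The following example shows sns.scatterplot() plotting data with several categories.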

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# index = pd.date_range(start="2000-1-11", periods=100, freq="m", name="date")
index = pd.date_range(start="2020-1-11", periods=100, freq="m", name="date")
#### Compare the two statements: why does the second one also appear to plot from the year 2000?


data = np.random.randn(100, 4).cumsum(axis=0)

wide_df = pd.DataFrame(data, index, ["a", "b", "c", "d"])
sns.scatterplot(data=wide_df)
plt.show()
###### How can the problem visible in the resulting plot be fixed?


##### --------- 
sns.scatterplot(data=df)    ## note: this handles all columns of the data at once, data=df
### plt.scatter() is different: it requires explicit x and y arguments, so passing data=df alone is not enough

<matplotlib.axes._subplots.AxesSubplot at 0x2ac885c5310>


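The next problem I face: concatenating the data of the different categories column-wise, that is, using pandas' concatenation methods.

In other words, combine all the IXi arrays into a single df_IX in DataFrame format. How?

Then set the OX values as the index.

That makes it possible to call sns.scatterplot(data=df_IX): for wide-form data seaborn takes the x axis from the DataFrame's own index (scatterplot itself has no index argument).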
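
One possible way to do the column-wise concatenation is pd.concat with axis=1. The sketch below uses a tiny made-up DataFrame and made-up labels, not the outlier.txt data; it keeps the outlier flag as an ordinary column, so a single seaborn call can colour the two groups instead of building separate IX/OX arrays.

```python
import pandas as pd

# toy stand-ins for two scaled feature columns and the predicted labels
df = pd.DataFrame({"a": [0.1, 0.5, 0.9, 0.4], "b": [0.2, 0.6, 0.95, 0.5]})
outlier = pd.Series([0, 0, 1, 0], name="outlier")

# axis=1 places the objects side by side, aligning rows on the index
labelled = pd.concat([df, outlier], axis=1)

inliers = labelled[labelled["outlier"] == 0]
outliers = labelled[labelled["outlier"] == 1]
print(labelled)

# With seaborn installed, one call can then draw both groups:
# import seaborn as sns
# sns.scatterplot(data=labelled, x="a", y="b", hue="outlier")
```

Passing x, y and hue explicitly uses seaborn's long-form interface, which avoids having to move anything into the index.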

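Appendix:

Open question: generate_data():

Utility function to generate synthesized data.
Normal data is generated by a multivariate Gaussian distribution and outliers are generated by a uniform distribution.

Usage:
X_train, y_train, X_test, y_test = generate_data(n_train=n_train, n_test=n_test, contamination=contamination, behaviour="old")

In other words, it is a data generator: given the number of training samples, the number of test samples and some other parameters, it produces a dataset (normal samples follow a Gaussian distribution, outliers a uniform distribution).

We, however, already have a dataset at hand, df_txt, so generate_data() is not needed. Our task is to process the existing data and find the outliers. But df_txt has no sample labels, so how to split the dataset becomes the question.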

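At this point I thought of using

from sklearn.model_selection import train_test_split

# with a single unlabeled input, train_test_split returns only two pieces
X_train, X_test = train_test_split(df_txt, test_size=0.2)

But I could not tell how good the resulting split was, and the detection results were poor.
So is this the wrong way to use train_test_split()?

Is train_test_split only meant for supervised learning, or does it apply equally to unlabeled datasets?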

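What train_test_split() does: it splits data.

Usage:

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(train_data, train_target, test_size=0.4, random_state=0, stratify=train_target)

When to use it:

  1. Splitting a dataset (into a training set and a test set)
  2. Simple (hold-out) cross-validation

Supervised-learning data has both features and labels, whereas unsupervised-learning data has only features and no labels.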
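
Splitting itself does not depend on labels: with only a feature table there are simply two pieces to return instead of four. The pure-Python sketch below mimics that behaviour with a hypothetical helper, split_unlabeled; it is a stand-in for illustration, not scikit-learn's implementation.

```python
import random


def split_unlabeled(rows, test_size=0.2, seed=0):
    """Shuffle the rows and return (train_rows, test_rows); no labels involved."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


rows = [[i, i * 2] for i in range(10)]   # ten unlabeled samples, two features each
train, test = split_unlabeled(rows)
print(len(train), len(test))
```

For an unsupervised detector the held-out part can only be scored, not evaluated, because there is no ground truth to compare the predictions with.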

So we simply train on the whole dataset directly:

path = r"F:/Python_Tensorflow_codes/006group_learning/team-learning-data-mining-master/AnomalyDetection/outlier.txt"
# df_txt1 = pd.read_csv(path, header=None)
# print(df_txt1)

df = pd.read_table(path, sep=" ", header=None)

clf = HBOS()
clf.fit(df)      ## no labels are needed, so we can train directly; fit() receives the whole dataset

y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_  # raw outlier scores

# print("\nOn Training Data:")
# evaluate_print(clf_name, y_train, y_train_scores)

# print(y_train_pred)
# print(y_train_scores)


Using the sample() method:

