Anomaly Detection Task 2: Basic Usage of the PyOD Library

License: CC 4.0 BY-SA

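This article shows how to generate synthetic data sets with the PyOD library, both Gaussian-distributed and clustered, and how to call the HBOS algorithm for anomaly detection. The functions `pyod.utils.data.generate_data()` and `pyod.utils.data.generate_data_clusters()` create data sets that contain outliers. The generated data is then plotted with matplotlib to distinguish normal points from outliers. These tools are useful for understanding and testing anomaly detection algorithms.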

Learning to generate toy examples with the PyOD library and call HBOS

According to the official documentation, the main functions in pyod for generating toy data are:

pyod.utils.data.generate_data()
pyod.utils.data.generate_data_clusters()

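Their parameters and usage are described below.

How to look up a function:

In Jupyter, type the function call. When you type the opening parenthesis, a short description of the function appears. Click the upward arrow to see the full documentation.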

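1. pyod.utils.data.generate_data()

Normal (inlier) data is generated from a multivariate Gaussian distribution, and outliers are generated from a uniform distribution.

# Function details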
pyod.utils.data.generate_data(
    n_train=1000,
    n_test=500,
    n_features=2,
    contamination=0.1,
    train_only=False,
    offset=10,
    behaviour='old',
    random_state=None,
)
Docstring:
Utility function to generate synthesized data.
Normal data is generated by a multivariate Gaussian distribution and
outliers are generated by a uniform distribution.

Parameters
----------
n_train : int, (default=1000)
    The number of training points to generate.

n_test : int, (default=500)
    The number of test points to generate.

n_features : int, optional (default=2)
    The number of features (dimensions).

contamination : float in (0., 0.5), optional (default=0.1)
    The amount of contamination of the data set, i.e.
    the proportion of outliers in the data set. Used when fitting to
    define the threshold on the decision function.

train_only : bool, optional (default=False)
    If true, generate train data only.

offset : int, optional (default=10)
    Adjust the value range of Gaussian and Uniform.

behaviour : str, default='old'
    Behaviour of the returned datasets which can be either 'old' or
    'new'. Passing ``behaviour='new'`` returns
    "X_train, y_train, X_test, y_test", while passing ``behaviour='old'``
    returns "X_train, X_test, y_train, y_test".

    .. versionadded:: 0.7.0
       ``behaviour`` is added in 0.7.0 for back-compatibility purpose.
    .. deprecated:: 0.7.0
       ``behaviour='old'`` is deprecated in 0.20 and will not be possible
       in 0.7.2.
    .. deprecated:: 0.7.2.
       ``behaviour`` parameter will be deprecated in 0.7.2 and removed in
       0.8.0.

random_state : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator;
    If RandomState instance, random_state is the random number generator;
    If None, the random number generator is the RandomState instance used
    by `np.random`.

Returns
-------
X_train : numpy array of shape (n_train, n_features)
    Training data.

y_train : numpy array of shape (n_train,)
    Training ground truth.

X_test : numpy array of shape (n_test, n_features)
    Test data.

y_test : numpy array of shape (n_test,)
    Test ground truth.
File:      c:\users\huawei\anaconda3\lib\site-packages\pyod\utils\data.py
Type:      function

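Generate a training set with otherwise default parameters and plot it:

import matplotlib.pyplot as plt
import pyod.utils.data

# train_only=True returns just (X_train, y_train), so the indexing below
# does not depend on the return order selected by `behaviour`
a = pyod.utils.data.generate_data(train_only=True)
fig = plt.figure()
data1 = a[0][a[1] == 1, :]  # outliers (label 1)
data2 = a[0][a[1] == 0, :]  # normal points (label 0)
plt.scatter(data1[:, 0], data1[:, 1], c='r', marker='o')
plt.scatter(data2[:, 0], data2[:, 1], c='b', marker='o')
plt.legend(['Outliers', 'Normal'])
plt.show()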

2. pyod.utils.data.generate_data_clusters()

pyod.utils.data.generate_data_clusters(
    n_train=1000,
    n_test=500,
    n_clusters=2,
    n_features=2,
    contamination=0.1,
    size='same',
    density='same',
    dist=0.25,
    random_state=None,
    return_in_clusters=False,
)
Docstring:
Utility function to generate synthesized data in clusters.
   Generated data can involve the low density pattern problem and global
   outliers which are considered as difficult tasks for outliers detection
   algorithms.

Parameters
----------
n_train : int, (default=1000)
    The number of training points to generate.

n_test : int, (default=500)
    The number of test points to generate.

n_clusters : int, optional (default=2)
   The number of centers (i.e. clusters) to generate.

n_features : int, optional (default=2)
   The number of features for each sample.

contamination : float in (0., 0.5), optional (default=0.1)
   The amount of contamination of the data set, i.e.
   the proportion of outliers in the data set.

size : str, optional (default='same')
   Size of each cluster: 'same' generates clusters with same size,
   'different' generate clusters with different sizes.

density : str, optional (default='same')
   Density of each cluster: 'same' generates clusters with same density,
   'different' generate clusters with different densities.

dist: float, optional (default=0.25)
   Distance between clusters. Should be between 0. and 1.0
   It is used to avoid clusters overlapping as much as possible.
   However, if number of samples and number of clusters are too high,
   it is unlikely to separate them fully even if ``dist`` set to 1.0

random_state : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator;
    If RandomState instance, random_state is the random number generator;
    If None, the random number generator is the RandomState instance used
    by `np.random`.

return_in_clusters : bool, optional (default=False)
    If True, the function returns x_train, y_train, x_test, y_test each as
    a list of numpy arrays where each index represents a cluster.
    If False, it returns x_train, y_train, x_test, y_test each as numpy
    array after joining the sequence of clusters arrays,

Returns
-------
X_train : numpy array of shape (n_train, n_features)
    Training data.

y_train : numpy array of shape (n_train,)
    Training ground truth.

X_test : numpy array of shape (n_test, n_features)
    Test data.

y_test : numpy array of shape (n_test,)
    Test ground truth.

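Plotting the result:

import matplotlib.pyplot as plt
import pyod.utils.data

a = pyod.utils.data.generate_data_clusters()
fig = plt.figure()
# a[0] holds the training points and a[2] their labels
data1 = a[0][a[2] == 1, :]  # outliers (label 1)
data2 = a[0][a[2] == 0, :]  # normal points (label 0)
plt.scatter(data1[:, 0], data1[:, 1], c='r', marker='o')
plt.scatter(data2[:, 0], data2[:, 1], c='b', marker='o')
plt.legend(['Outliers', 'Normal'])
plt.show()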
