Anomaly Detection Task 2: Basic Usage of the PyOD Library

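Learn to generate toy examples with the PyOD library and call HBOS

According to the official documentation, the main functions in pyod for generating toy data are: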

  • pyod.utils.data.generate_data()
  • pyod.utils.data.generate_data_clusters()

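Their parameters and usage are given below.

How to look up a function:

  • In Jupyter, type the function name; when you type the opening parenthesis, a short description of the function appears. Click the upward arrow to see the full documentation.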

1. pyod.utils.data.generate_data()

Normal data is generated from a multivariate Gaussian distribution, and outliers are generated from a uniform distribution.
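To make this description concrete, here is a minimal NumPy sketch of the same idea. It is not PyOD's actual implementation: the helper name `toy_gaussian_uniform`, the choice of mean and covariance, and the way `offset` sets the value ranges are all assumptions made for illustration.

```python
import numpy as np


def toy_gaussian_uniform(n_samples=1000, n_features=2, contamination=0.1,
                         offset=10, seed=0):
    """Hypothetical helper: Gaussian inliers plus uniform outliers."""
    rng = np.random.default_rng(seed)
    n_outliers = int(n_samples * contamination)
    n_inliers = n_samples - n_outliers

    # Inliers: multivariate Gaussian centred at `offset` (assumed placement).
    mean = np.full(n_features, float(offset))
    cov = np.eye(n_features)
    inliers = rng.multivariate_normal(mean, cov, size=n_inliers)

    # Outliers: uniform over a wide box around the origin (assumed range).
    outliers = rng.uniform(-offset, offset, size=(n_outliers, n_features))

    # Stack the two groups and label them: 0 = normal, 1 = outlier.
    X = np.vstack([inliers, outliers])
    y = np.concatenate([np.zeros(n_inliers), np.ones(n_outliers)])
    return X, y


X, y = toy_gaussian_uniform()
print(X.shape, y.shape, int(y.sum()))
```

The labels follow the same convention as the PyOD generators used in this post: 1 marks an outlier and 0 marks a normal point.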

# Function details
pyod.utils.data.generate_data(
    n_train=1000,
    n_test=500,
    n_features=2,
    contamination=0.1,
    train_only=False,
    offset=10,
    behaviour='old',
    random_state=None,
)
Docstring:
Utility function to generate synthesized data.
Normal data is generated by a multivariate Gaussian distribution and
outliers are generated by a uniform distribution.

Parameters
----------
n_train : int, (default=1000)
    The number of training points to generate.

n_test : int, (default=500)
    The number of test points to generate.

n_features : int, optional (default=2)
    The number of features (dimensions).

contamination : float in (0., 0.5), optional (default=0.1)
    The amount of contamination of the data set, i.e.
    the proportion of outliers in the data set. Used when fitting to
    define the threshold on the decision function.

train_only : bool, optional (default=False)
    If true, generate train data only.

offset : int, optional (default=10)
    Adjust the value range of Gaussian and Uniform.

behaviour : str, default='old'
    Behaviour of the returned datasets which can be either 'old' or
    'new'. Passing ``behaviour='new'`` returns
    "X_train, y_train, X_test, y_test", while passing ``behaviour='old'``
    returns "X_train, X_test, y_train, y_test".

    .. versionadded:: 0.7.0
       ``behaviour`` is added in 0.7.0 for back-compatibility purpose.
    .. deprecated:: 0.7.0
       ``behaviour='old'`` is deprecated in 0.20 and will not be possible
       in 0.7.2.
    .. deprecated:: 0.7.2.
       ``behaviour`` parameter will be deprecated in 0.7.2 and removed in
       0.8.0.

random_state : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator;
    If RandomState instance, random_state is the random number generator;
    If None, the random number generator is the RandomState instance used
    by `np.random`.

Returns
-------
X_train : numpy array of shape (n_train, n_features)
    Training data.

y_train : numpy array of shape (n_train,)
    Training ground truth.

X_test : numpy array of shape (n_test, n_features)
    Test data.

y_test : numpy array of shape (n_test,)
    Test ground truth.
File:      c:\users\huawei\anaconda3\lib\site-packages\pyod\utils\data.py
Type:      function

Generate data with the default parameters and plot the result

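import matplotlib.pyplot as plt
import pyod.utils.data

# train_only=True returns only (X_train, y_train); other parameters keep their defaults
X_train, y_train = pyod.utils.data.generate_data(train_only=True)

fig = plt.figure()
outliers = X_train[y_train == 1, :]  # label 1 marks outliers
normal = X_train[y_train == 0, :]    # label 0 marks normal points
plt.scatter(outliers[:, 0], outliers[:, 1], c='r', marker='o')
plt.scatter(normal[:, 0], normal[:, 1], c='b', marker='o')
plt.legend(['Outliers', 'Normal'])
plt.show()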

[Figure: scatter plot of the generated data, with outliers in red and normal points in blue]

2. pyod.utils.data.generate_data_clusters()

pyod.utils.data.generate_data_clusters(
    n_train=1000,
    n_test=500,
    n_clusters=2,
    n_features=2,
    contamination=0.1,
    size='same',
    density='same',
    dist=0.25,
    random_state=None,
    return_in_clusters=False,
)
Docstring:
Utility function to generate synthesized data in clusters.
   Generated data can involve the low density pattern problem and global
   outliers which are considered as difficult tasks for outliers detection
   algorithms.

Parameters
----------
n_train : int, (default=1000)
    The number of training points to generate.

n_test : int, (default=500)
    The number of test points to generate.

n_clusters : int, optional (default=2)
   The number of centers (i.e. clusters) to generate.

n_features : int, optional (default=2)
   The number of features for each sample.

contamination : float in (0., 0.5), optional (default=0.1)
   The amount of contamination of the data set, i.e.
   the proportion of outliers in the data set.

size : str, optional (default='same')
   Size of each cluster: 'same' generates clusters with same size,
   'different' generate clusters with different sizes.

density : str, optional (default='same')
   Density of each cluster: 'same' generates clusters with same density,
   'different' generate clusters with different densities.

dist: float, optional (default=0.25)
   Distance between clusters. Should be between 0. and 1.0
   It is used to avoid clusters overlapping as much as possible.
   However, if number of samples and number of clusters are too high,
   it is unlikely to separate them fully even if ``dist`` set to 1.0

random_state : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator;
    If RandomState instance, random_state is the random number generator;
    If None, the random number generator is the RandomState instance used
    by `np.random`.

return_in_clusters : bool, optional (default=False)
    If True, the function returns x_train, y_train, x_test, y_test each as
    a list of numpy arrays where each index represents a cluster.
    If False, it returns x_train, y_train, x_test, y_test each as numpy
    array after joining the sequence of clusters arrays,

Returns
-------
X_train : numpy array of shape (n_train, n_features)
    Training data.

y_train : numpy array of shape (n_train,)
    Training ground truth.

X_test : numpy array of shape (n_test, n_features)
    Test data.

y_test : numpy array of shape (n_test,)
    Test ground truth.

Plot the result

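import matplotlib.pyplot as plt
import pyod.utils.data

# In the returned tuple, index 0 is X_train and index 2 is y_train
a = pyod.utils.data.generate_data_clusters()
fig = plt.figure()
data1 = a[0][a[2] == 1, :]  # outliers (label 1)
data2 = a[0][a[2] == 0, :]  # normal points (label 0)
plt.scatter(data1[:, 0], data1[:, 1], c='r', marker='o')
plt.scatter(data2[:, 0], data2[:, 1], c='b', marker='o')
plt.legend(['Outliers', 'Normal'])
plt.show()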

[Figure: scatter plot of the clustered data, with outliers in red and normal points in blue]
