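Learning to generate toy examples with the PyOD library and call HBOS
According to the official documentation, the main PyOD functions for generating toy data are:
- pyod.utils.data.generate_data()
- pyod.utils.data.generate_data_clusters()
Their parameters and usage are described below.
How to look up a function's documentation:
- In Jupyter, type the function name. When you type the opening parenthesis, a short description of the function appears. Click the up arrow to expand it into the full documentation.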
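1. pyod.utils.data.generate_data()
Normal data (inliers) is generated by a multivariate Gaussian distribution, and outliers are generated by a uniform distribution.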
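To make the idea concrete, here is a rough NumPy-only sketch of the same recipe: Gaussian inliers plus uniform outliers. It is not PyOD's implementation. The centre, covariance, uniform range, and sample counts are all assumptions chosen for illustration. The 0/1 labelling follows the convention used by the plotting code in these notes (1 marks an outlier).

```python
import numpy as np

rng = np.random.default_rng(0)
n_inliers, n_outliers, n_features = 90, 10, 2

# Inliers: a Gaussian cloud centred at the origin with identity covariance.
inliers = rng.multivariate_normal(
    mean=np.zeros(n_features), cov=np.eye(n_features), size=n_inliers
)
# Outliers: spread uniformly over a wider box (the range is an assumption).
outliers = rng.uniform(low=-6.0, high=6.0, size=(n_outliers, n_features))

# Stack both groups; label 0 marks an inlier and 1 marks an outlier.
X = np.vstack([inliers, outliers])
y = np.concatenate([np.zeros(n_inliers), np.ones(n_outliers)])
print(X.shape, y.shape, int(y.sum()))
```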
# Function details
pyod.utils.data.generate_data(
n_train=1000,
n_test=500,
n_features=2,
contamination=0.1,
train_only=False,
offset=10,
behaviour='old',
random_state=None,
)
Docstring:
Utility function to generate synthesized data.
Normal data is generated by a multivariate Gaussian distribution and
outliers are generated by a uniform distribution.
Parameters
----------
n_train : int, (default=1000)
The number of training points to generate.
n_test : int, (default=500)
The number of test points to generate.
n_features : int, optional (default=2)
The number of features (dimensions).
contamination : float in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e.
the proportion of outliers in the data set. Used when fitting to
define the threshold on the decision function.
train_only : bool, optional (default=False)
If true, generate train data only.
offset : int, optional (default=10)
Adjust the value range of Gaussian and Uniform.
behaviour : str, default='old'
Behaviour of the returned datasets which can be either 'old' or
'new'. Passing ``behaviour='new'`` returns
"X_train, y_train, X_test, y_test", while passing ``behaviour='old'``
returns "X_train, X_test, y_train, y_test".
.. versionadded:: 0.7.0
``behaviour`` is added in 0.7.0 for back-compatibility purpose.
.. deprecated:: 0.7.0
``behaviour='old'`` is deprecated in 0.20 and will not be possible
in 0.7.2.
.. deprecated:: 0.7.2.
``behaviour`` parameter will be deprecated in 0.7.2 and removed in
0.8.0.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
Returns
-------
X_train : numpy array of shape (n_train, n_features)
Training data.
y_train : numpy array of shape (n_train,)
Training ground truth.
X_test : numpy array of shape (n_test, n_features)
Test data.
y_test : numpy array of shape (n_test,)
Test ground truth.
File: c:\users\huawei\anaconda3\lib\site-packages\pyod\utils\data.py
Type: function
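Generate data with the default parameters (training split only) and plot it
import matplotlib.pyplot as plt
from pyod.utils.data import generate_data
# train_only=True returns only (X_train, y_train), so the result does not
# depend on the return order controlled by `behaviour`.
X_train, y_train = generate_data(train_only=True)
fig = plt.figure()
outliers = X_train[y_train == 1, :]  # label 1 = outlier
inliers = X_train[y_train == 0, :]   # label 0 = normal point
plt.scatter(outliers[:, 0], outliers[:, 1], c='r', marker='o')
plt.scatter(inliers[:, 0], inliers[:, 1], c='b', marker='o')
plt.legend(['Outliers', 'Normal'])
plt.show()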
2. pyod.utils.data.generate_data_clusters()
pyod.utils.data.generate_data_clusters(
n_train=1000,
n_test=500,
n_clusters=2,
n_features=2,
contamination=0.1,
size='same',
density='same',
dist=0.25,
random_state=None,
return_in_clusters=False,
)
Docstring:
Utility function to generate synthesized data in clusters.
Generated data can involve the low density pattern problem and global
outliers which are considered as difficult tasks for outliers detection
algorithms.
Parameters
----------
n_train : int, (default=1000)
The number of training points to generate.
n_test : int, (default=500)
The number of test points to generate.
n_clusters : int, optional (default=2)
The number of centers (i.e. clusters) to generate.
n_features : int, optional (default=2)
The number of features for each sample.
contamination : float in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e.
the proportion of outliers in the data set.
size : str, optional (default='same')
Size of each cluster: 'same' generates clusters with same size,
'different' generate clusters with different sizes.
density : str, optional (default='same')
Density of each cluster: 'same' generates clusters with same density,
'different' generate clusters with different densities.
dist: float, optional (default=0.25)
Distance between clusters. Should be between 0. and 1.0
It is used to avoid clusters overlapping as much as possible.
However, if number of samples and number of clusters are too high,
it is unlikely to separate them fully even if ``dist`` set to 1.0
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
return_in_clusters : bool, optional (default=False)
If True, the function returns x_train, y_train, x_test, y_test each as
a list of numpy arrays where each index represents a cluster.
If False, it returns x_train, y_train, x_test, y_test each as numpy
array after joining the sequence of clusters arrays,
Returns
-------
X_train : numpy array of shape (n_train, n_features)
Training data.
y_train : numpy array of shape (n_train,)
Training ground truth.
X_test : numpy array of shape (n_test, n_features)
Test data.
y_test : numpy array of shape (n_test,)
Test ground truth.
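Plot the result
import matplotlib.pyplot as plt
from pyod.utils.data import generate_data_clusters
# The indexing below (labels at position 2) assumes the return order
# X_train, X_test, y_train, y_test.
a = generate_data_clusters()
fig = plt.figure()
data1 = a[0][a[2] == 1, :]  # training outliers
data2 = a[0][a[2] == 0, :]  # training normal points
plt.scatter(data1[:, 0], data1[:, 1], c='r', marker='o')
plt.scatter(data2[:, 0], data2[:, 1], c='b', marker='o')
plt.legend(['Outliers', 'Normal'])
plt.show()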