make_classification()
def make_classification(
    n_samples=100,
    n_features=20,
    *,
    n_informative=2,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    weights=None,
    flip_y=0.01,
    class_sep=1.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=None,
):
Generate a random n-class classification problem.
This initially creates clusters of points normally distributed (std=1)
about vertices of an ``n_informative``-dimensional hypercube with sides of
length ``2*class_sep`` and assigns an equal number of clusters to each
class. It introduces interdependence between these features and adds
various types of further noise to the data.
Without shuffling, ``X`` horizontally stacks features in the following
order: the primary ``n_informative`` features, followed by ``n_redundant``
linear combinations of the informative features, followed by ``n_repeated``
duplicates, drawn randomly with replacement from the informative and
redundant features. The remaining features are filled with random noise.
Thus, without shuffling, all useful features are contained in the columns
``X[:, :n_informative + n_redundant + n_repeated]``.
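The column layout described above can be checked directly. The following is a minimal sketch, assuming ``shuffle=False`` and the default ``shift=0.0`` and ``scale=1.0`` so that the linear relationships between columns are preserved; the feature counts and the names ``max_residual`` and ``is_copy`` are chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

n_informative, n_redundant, n_repeated = 3, 2, 2

# shuffle=False keeps the documented column order.
X, y = make_classification(
    n_samples=200,
    n_features=10,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_repeated=n_repeated,
    shuffle=False,
    random_state=0,
)

n_useful = n_informative + n_redundant
informative = X[:, :n_informative]
redundant = X[:, n_informative:n_useful]
repeated = X[:, n_useful:n_useful + n_repeated]

# Redundant columns should be linear combinations of the informative ones,
# so a least-squares fit should reproduce them almost exactly.
coef, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
max_residual = np.abs(informative @ coef - redundant).max()

# Each repeated column should be an exact copy of an earlier useful column.
is_copy = [
    any(np.array_equal(col, X[:, j]) for j in range(n_useful))
    for col in repeated.T
]

print(max_residual, is_copy)
```

The last ``10 - 3 - 2 - 2 = 3`` columns of ``X`` are the noise features in this configuration.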
Parameters
n_samples : int, default=100
The number of samples.
n_features : int, default=20
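The total number of features: n_features = n_informative + n_redundant + n_repeated + (number of useless noise features).
n_informative: the informative features, which actually carry class information,
n_redundant: the redundant features,
n_repeated: the duplicated features,
n_features - n_informative - n_redundant - n_repeated: the remaining useless features, drawn at random.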
n_informative : int, default=2
The number of informative features.
Official note: Each class is composed of a number of Gaussian clusters each located around the vertices of a hypercube in a subspace of dimension ``n_informative``. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.
n_redundant : int, default=2
The number of redundant features. These features are generated as random linear combinations of the informative features.
n_repeated : int, default=0
The number of duplicated features, drawn randomly from the informative and the redundant features.
n_classes : int, default=2
The number of classes (or labels) of the classification problem.
n_clusters_per_class : int, default=2
The number of clusters per class.
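Because every cluster sits on its own hypercube vertex and an ``n_informative``-dimensional hypercube has ``2**n_informative`` vertices, the total number of clusters cannot exceed that count. The sketch below illustrates the constraint; ``fits_on_hypercube`` is a hypothetical helper written for this example, not part of scikit-learn, and the ``ValueError`` reflects the behaviour of the scikit-learn versions I am aware of.

```python
from sklearn.datasets import make_classification


def fits_on_hypercube(n_classes, n_clusters_per_class, n_informative):
    """Hypothetical helper: are there enough vertices for all clusters?"""
    return n_classes * n_clusters_per_class <= 2 ** n_informative


# 3 classes x 2 clusters = 6 clusters, but 2 informative features
# give only 2**2 = 4 vertices, so generation is expected to fail.
try:
    make_classification(n_classes=3, n_clusters_per_class=2, n_informative=2)
    raised = False
except ValueError:
    raised = True

# With 3 informative features there are 2**3 = 8 vertices, which is enough.
X, y = make_classification(
    n_classes=3, n_clusters_per_class=2, n_informative=3, random_state=0
)

print(fits_on_hypercube(3, 2, 2), raised)
print(fits_on_hypercube(3, 2, 3), X.shape)
```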
shuffle : bool, default=True
Whether to shuffle the samples and the features.
weights : array-like of shape (n_classes,) or (n_classes - 1,), default=None
The proportions of samples assigned to each class. If None, then
classes are balanced. Note that if ``len(weights) == n_classes - 1``,
then the last class weight is automatically inferred.
More than ``n_samples`` samples may be returned if the sum of
``weights`` exceeds 1. Note that the actual class proportions will
not exactly match ``weights`` when ``flip_y`` isn't 0.
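As a rough illustration of ``weights``, the sketch below requests a 90/10 class split and turns label noise off with ``flip_y=0.0`` so that the observed proportions follow the requested ones. The sample size and the single cluster per class are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_classes=2,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],
    flip_y=0.0,  # no label noise, so proportions follow the weights
    random_state=0,
)

counts = np.bincount(y, minlength=2)
proportions = counts / counts.sum()
print(counts, proportions)
```

With the default ``flip_y=0.01`` the printed proportions would typically drift slightly from the requested values.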
flip_y : float, default=0.01
The fraction of samples whose class is assigned randomly. Larger
values introduce noise in the labels and make the classification
task harder. Note that the default setting flip_y > 0 might lead
to fewer than ``n_classes`` distinct classes in y in some cases.
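The effect of ``flip_y`` can be estimated by comparing labels generated with and without noise. This is a sketch that assumes ``shuffle=False``, so that both calls return samples in the same cluster order and labels can be compared row by row. Since a randomly reassigned label can land on its original class, roughly ``flip_y * (1 - 1/n_classes)`` of the labels should actually change.

```python
import numpy as np
from sklearn.datasets import make_classification

# Shared settings; shuffle=False keeps samples grouped by cluster in both calls.
common = dict(n_samples=10_000, n_classes=2, shuffle=False, random_state=0)

_, y_clean = make_classification(flip_y=0.0, **common)
_, y_noisy = make_classification(flip_y=0.5, **common)

# Fraction of labels that differ from the noise-free assignment.
changed = np.mean(y_clean != y_noisy)
print(f"fraction of labels that changed: {changed:.3f}")
```

For two classes and ``flip_y=0.5`` the printed fraction should be close to 0.25.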
class_sep : float, default=1.0
The factor multiplying the hypercube size. Larger values spread
out the clusters/classes and make the classification task easier.
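One way to see what ``class_sep`` does is to measure how far apart the class means end up in the informative columns. The sketch below assumes one cluster per class, no label noise, and ``shuffle=False`` so the informative features are the first two columns; ``centroid_distance`` is a hypothetical helper for this example.

```python
import numpy as np
from sklearn.datasets import make_classification


def centroid_distance(class_sep):
    """Distance between the two class means in the informative columns."""
    X, y = make_classification(
        n_samples=2000,
        n_features=5,
        n_informative=2,
        n_redundant=0,
        n_clusters_per_class=1,
        flip_y=0.0,
        class_sep=class_sep,
        shuffle=False,
        random_state=0,
    )
    informative = X[:, :2]
    mean_0 = informative[y == 0].mean(axis=0)
    mean_1 = informative[y == 1].mean(axis=0)
    return np.linalg.norm(mean_0 - mean_1)


for sep in (0.5, 1.0, 3.0):
    print(sep, round(centroid_distance(sep), 2))
```

The printed distances should grow with ``class_sep``, which is why larger values tend to make the classes easier to separate.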
hypercube : bool, default=True
If True, the clusters are put on the vertices of a hypercube. If
False, the clusters are put on the vertices of a random polytope.
shift : float, ndarray of shape (n_features,) or None, default=0.0
Shift features by the specified value. If None, then features
are shifted by a random value drawn in [-class_sep, class_sep].
scale : float, ndarray of shape (n_features,) or None, default=1.0
Multiply features by the specified value. If None, then features
are scaled by a random value drawn in [1, 100]. Note that scaling
happens after shifting.
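The order "shift first, then scale" can be checked on the noise columns. This sketch assumes the noise features start out roughly standard normal (my understanding of the implementation, not something stated above), so after ``shift=5.0`` and ``scale=2.0`` they should have a mean near ``(0 + 5) * 2 = 10`` and a standard deviation near 2. If scaling came first, the mean would instead be near 5.

```python
import numpy as np
from sklearn.datasets import make_classification

X, _ = make_classification(
    n_samples=5000,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    shift=5.0,
    scale=2.0,
    shuffle=False,
    random_state=0,
)

# With shuffle=False the trailing columns are the noise features.
noise = X[:, 2:]
noise_mean = noise.mean(axis=0)
noise_std = noise.std(axis=0)
print(noise_mean.round(2))
print(noise_std.round(2))
```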
random_state : int, RandomState instance or None, default=None
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.
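A short sketch of the reproducibility guarantee: two calls with the same integer seed should return identical arrays, while a different seed should give different data. The seed values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification

X1, y1 = make_classification(random_state=42)
X2, y2 = make_classification(random_state=42)
X3, y3 = make_classification(random_state=7)

# Same seed: identical features and labels.
same_seed_equal = np.array_equal(X1, X2) and np.array_equal(y1, y2)
# Different seed: the feature matrix differs.
different_seed_equal = np.array_equal(X1, X3)

print(same_seed_equal, different_seed_equal)
```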
Returns
X : ndarray of shape (n_samples, n_features)
The generated samples.
y : ndarray of shape (n_samples,)
The integer labels for class membership of each sample.
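A minimal usage sketch showing the shapes and label values of the returned arrays; the parameter values are arbitrary, and ``flip_y=0.0`` is set only so that every class is guaranteed to appear in ``y``.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=150,
    n_features=8,
    n_informative=4,
    n_classes=3,
    flip_y=0.0,
    random_state=0,
)

print(X.shape, y.shape)  # (150, 8) (150,)
print(np.unique(y))      # the distinct integer class labels
```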