Python scikit-learn Library Summary

1. Introduction

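scikit-learn is a Python module for machine learning built on top of SciPy. Many SciPy-based toolkits have been developed for different fields; they are collectively known as SciKits, and scikit-learn is the best known of them. It is open source, and anyone may use or redistribute it free of charge.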

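scikit-learn includes many leading machine learning algorithms. Its basic functionality falls into six main categories: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.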

Official scikit-learn API reference

Notes on the sklearn datasets module

2. Key Functions Explained

sklearn.datasets.make_blobs(n_samples=100, n_features=2, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None)   This function is commonly used to generate test datasets.

Generate isotropic Gaussian blobs for clustering.

Read more in the User Guide.

Parameters:

n_samples : int or array-like, optional (default=100)

If int, it is the total number of points equally divided among clusters. If array-like, each element of the sequence indicates the number of samples per cluster.

n_features : int, optional (default=2)

The number of features for each sample. This determines the number of columns in the output X.

centers : int or array of shape [n_centers, n_features], optional

(default=None) The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples. In other words, this is the number of clusters the generated data forms when plotted.

cluster_std : float or sequence of floats, optional (default=1.0)

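The standard deviation of the clusters. The larger the standard deviation, the more spread out the points; the smaller it is, the tighter the clusters. The default of 1.0 fits the default center_box well. If you change it, scale center_box proportionally so the plotted points keep a reasonable spread; otherwise the clusters end up either too scattered or too tightly packed.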

center_box : pair of floats (min, max), optional (default=(-10.0, 10.0))

The bounding box for each cluster center when centers are generated at random. This controls the range within which the cluster centers, and therefore the generated data, are placed.

shuffle : boolean, optional (default=True)

Shuffle the samples, i.e. randomize their order.

random_state : int, RandomState instance or None (default)

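Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary. In practice this is the random seed: pass a fixed integer to get identical data on every call, which makes experiments repeatable. If no value is passed, each call produces different data. The random_state parameter of the other functions has the same meaning.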

Returns:

X : array of shape [n_samples, n_features]

The generated samples. The shape of the generated data is (n_samples, n_features).

y : array of shape [n_samples]

The integer labels for cluster membership of each sample. With n centers, y is a one-dimensional array whose values range from 0 to n-1.

# Usage example
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, n_features=2, centers=2, random_state=0, cluster_std=1.0)
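The parameter notes above can be checked with a short experiment. The following is only a sketch: the sample counts, the seed, and the fixed center at the origin are arbitrary choices made for illustration, not values from the original post.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same seed, same data: two calls with random_state=0 return identical arrays.
X1, y1 = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
X2, y2 = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
print(X1.shape, y1.shape)      # (n_samples, n_features) and (n_samples,)
print(np.unique(y1))           # labels run from 0 to centers - 1
print(np.array_equal(X1, X2))  # reproducibility check

# Array-like n_samples sets the size of each cluster separately.
X3, y3 = make_blobs(n_samples=[10, 20, 30], n_features=2, random_state=0)
print(np.bincount(y3))         # number of samples per label

# A larger cluster_std spreads points further from the (fixed) center.
tight, _ = make_blobs(n_samples=500, centers=[[0.0, 0.0]], cluster_std=0.5, random_state=0)
wide, _ = make_blobs(n_samples=500, centers=[[0.0, 0.0]], cluster_std=3.0, random_state=0)
print(tight.std(), wide.std())
```

Fixing the center for the last comparison isolates the effect of cluster_std from the random placement of centers inside center_box.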

sklearn.model_selection.train_test_split(*arrays, **options)   Splits the data into a training set and a test set

Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a oneliner.

Read more in the User Guide.

Parameters:

*arrays : sequence of indexables with same length / shape[0]  

Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes. Pass in the dataset arrays, for example X and y.

test_size : float, int or None, optional (default=0.25)  Specifies the share of the dataset that goes to the test set; setting test_size is generally preferred over setting train_size.

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. By default, the value is set to 0.25. The default will change in version 0.21. It will remain 0.25 only if train_size is unspecified, otherwise it will complement the specified train_size.

train_size : float, int, or None, (default=None)  Specifies the share of the dataset that goes to the training set. It can be combined with test_size, but that is uncommon.

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

random_state : int, RandomState instance or None, optional (default=None)   The seed for the random split. Pass a fixed value if the split should be identical on every run.

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

shuffle : boolean, optional (default=True)  Whether to shuffle the dataset

Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

stratify : array-like or None (default=None)  Usually the y array is passed, so that train and test keep the class proportions found in y.

If not None, data is split in a stratified fashion, using this as the class labels.

Returns:

splitting : list, length=2 * len(arrays)

List containing train-test split of inputs.

New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix. Else, output type is the same as the input type.

# Usage example
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, stratify=y)
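To see what stratify=y does, it helps to split a deliberately imbalanced dataset and count the labels on each side. This is a minimal sketch; the 80/20 class sizes and the seeds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Imbalanced toy data: 80 samples of class 0 and 20 samples of class 1.
X, y = make_blobs(n_samples=[80, 20], n_features=2, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10, stratify=y
)

print(X_train.shape, X_test.shape)  # 70 training rows and 30 test rows
print(np.bincount(y_train))         # class counts in the training set
print(np.bincount(y_test))          # class counts in the test set
```

Both subsets keep the 4:1 class ratio of y. Without stratify, the ratio in a small test set can drift noticeably from one random split to the next.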

sklearn.datasets.make_regression(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)   Generates a dataset for regression

Generate a random regression problem.

The input set can either be well conditioned (by default) or have a low rank-fat tail singular profile. See make_low_rank_matrix for more details.

The output is generated by applying a (potentially biased) random linear regression model with n_informative nonzero regressors to the previously generated input and some gaussian centered noise with some adjustable scale.

Read more in the User Guide.

Parameters:

n_samples : int, optional (default=100)  Number of samples to generate

The number of samples.

n_features : int, optional (default=100)  Number of features per sample

The number of features.

n_informative : int, optional (default=10)  Number of features that actually contribute to the model

The number of informative features, i.e., the number of features used to build the linear model used to generate the output.

n_targets : int, optional (default=1)  Number of target (dependent) variables

The number of regression targets, i.e., the dimension of the y output vector associated with a sample. By default, the output is a scalar.

bias : float, optional (default=0.0)  Bias (intercept)

The bias term in the underlying linear model.

effective_rank : int or None, optional (default=None)

if not None:

The approximate number of singular vectors required to explain most of the input data by linear combinations. Using this kind of singular spectrum in the input allows the generator to reproduce the correlations often observed in practice.

if None:

The input set is well conditioned, centered and gaussian with unit variance.

tail_strength : float between 0.0 and 1.0, optional (default=0.5)

The relative importance of the fat noisy tail of the singular values profile if effective_rank is not None.

noise : float, optional (default=0.0)  Noise level, i.e. the standard deviation of the noise

The standard deviation of the gaussian noise applied to the output.

shuffle : boolean, optional (default=True)

Shuffle the samples and the features.

coef : boolean, optional (default=False)  Whether to also return the coefficients; they are not returned by default

If True, the coefficients of the underlying linear model are returned.

random_state : int, RandomState instance or None (default)

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.

Returns:

X : array of shape [n_samples, n_features]

The input samples.

y : array of shape [n_samples] or [n_samples, n_targets]

The output values.

coef : array of shape [n_features] or [n_features, n_targets], optional

The coefficient of the underlying linear model. It is returned only if coef is True.
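The relationship between n_informative, bias, noise and coef can be illustrated as follows. This is a sketch under assumed values (50 samples, 5 features, 2 informative features, a bias of 3.0), none of which come from the original post.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True adds the true coefficients of the linear model to the return value.
X, y, coef = make_regression(
    n_samples=50, n_features=5, n_informative=2,
    bias=3.0, noise=0.0, coef=True, random_state=0,
)

print(X.shape, y.shape, coef.shape)    # (50, 5), (50,), (5,)
print(np.count_nonzero(coef))          # only n_informative coefficients are nonzero
print(np.allclose(X @ coef + 3.0, y))  # with noise=0.0 the targets follow the model exactly
```

Raising noise above 0.0 adds Gaussian noise to y, so the last check would then no longer hold.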

3. Brief Notes on Function Usage

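sklearn function: sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto')
Usage notes: n_neighbors sets the number of neighbors consulted for each prediction and is usually the first parameter to tune. weights controls how neighbors are weighted; 'uniform' gives every neighbor the same weight. algorithm selects the nearest-neighbor search algorithm; 'auto' tries to pick the most suitable one.

sklearn function: x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2, random_state=0)
Usage notes: splits the original data into a training set and a test set; test_size sets the share of the data that goes to the test set, and the rest goes to the training set. (Older releases exposed this function as sklearn.cross_validation.train_test_split.)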
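The two functions above are typically used together. The sketch below assumes a synthetic two-cluster dataset from make_blobs and three arbitrary values of n_neighbors; it shows the pattern only and is not a tuning recommendation.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data, split 80/20 into training and test sets.
X, y = make_blobs(n_samples=200, n_features=2, centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Try a few neighbor counts and record the accuracy on the held-out test set.
scores = {}
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k, weights='uniform', algorithm='auto')
    knn.fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)

print(X_test.shape)
print(scores)
```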
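Scikit-learn provides several naive Bayes classifiers, including Gaussian naive Bayes, multinomial naive Bayes, and Bernoulli naive Bayes. The walkthrough below uses the Gaussian naive Bayes classifier to show how to do Bayesian classification with scikit-learn.

Step 1: Data preparation

First, we need some classification data. Here we use the iris dataset that ships with scikit-learn:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
```

Step 2: Data preprocessing

Before applying the classifier, the data is preprocessed. Here we standardize it with the StandardScaler class from scikit-learn's preprocessing module:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)
```

Step 3: Building the model

Next, we build a Gaussian naive Bayes classifier with scikit-learn's GaussianNB class:

```python
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
```

Step 4: Training the model

Once the model is built, we train it on the training data:

```python
gnb.fit(X, y)
```

Step 5: Prediction

After training, the model can classify new data:

```python
y_pred = gnb.predict(X)
```

Step 6: Evaluation

Finally, we compute the model's accuracy with the accuracy_score function from scikit-learn's metrics module:

```python
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y, y_pred)
print('Accuracy:', accuracy)
```

The complete code:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Data preparation
iris = load_iris()
X = iris.data
y = iris.target

# Data preprocessing
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Build the model
gnb = GaussianNB()

# Train the model
gnb.fit(X, y)

# Predict
y_pred = gnb.predict(X)

# Evaluate
accuracy = accuracy_score(y, y_pred)
print('Accuracy:', accuracy)
```

Note that, to keep the code short, the model is evaluated on the training data here. In practice the model should be evaluated on held-out test data, because accuracy measured on the training data hides overfitting and tends to be overly optimistic.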
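One possible way to evaluate on test data is sketched below. The 70/30 stratified split, the seed, and the use of make_pipeline are assumptions added for illustration; the pipeline is there only so that the scaler is fitted on the training rows alone.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hold out 30% of the iris data for testing, keeping class proportions.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The pipeline fits the scaler and the classifier on the training data only.
model = make_pipeline(StandardScaler(), GaussianNB())
model.fit(X_train, y_train)

# Accuracy is measured on rows the model never saw during training.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print('Test accuracy:', test_accuracy)
```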