0. What is scikit-learn
sklearn provides a clean, uniform way to use many algorithms that come up in day-to-day work.
An introduction to scikit-learn in Python for data science.
One of the best known is Scikit-Learn, a package that provides efficient versions of a large number of common algorithms. Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation. A benefit of this uniformity is that once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is very straightforward.
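To make the uniformity point concrete, here is a minimal sketch (not from the handbook text) that swaps LinearRegression for Ridge without touching the rest of the code:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X_demo = rng.rand(50, 1)                         # 50 samples, 1 feature
y_demo = 2 * X_demo.ravel() - 1 + rng.randn(50)  # noisy linear target
for Model in (LinearRegression, Ridge):
    model = Model().fit(X_demo, y_demo)          # identical fit() interface across models
    print(Model.__name__, model.coef_, model.intercept_)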
1. Installing scikit-learn
scikit-learn is the name used when installing the package (e.g. pip install scikit-learn); in code, you import it as sklearn, as in the example below.
from sklearn.linear_model import LinearRegression
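After installing, you can confirm that the import works and check the installed version (sklearn exposes a __version__ attribute):
import sklearn
print(sklearn.__version__)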
2. The typical workflow for using sklearn
Most commonly, the steps in using the Scikit-Learn estimator API are as follows (we will step through a handful of detailed examples in the sections that follow).
- Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
That is, import the model class you want.
e.g. from sklearn.linear_model import LinearRegression
- Choose model hyperparameters by instantiating this class with desired values.
That is, instantiate the model with your chosen hyperparameter values.
e.g. model = LinearRegression(fit_intercept=True)
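The hyperparameters chosen at instantiation are simply stored on the object; get_params() returns them as a dict. A quick sketch:
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model.get_params()   # a dict of the model's hyperparameters, including 'fit_intercept': True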
- Arrange data into a features matrix and target vector following the discussion above.
That is, arrange your data into a features matrix and a target vector.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(42)
x = 10 * rng.rand(50)          # 50 samples drawn uniformly from [0, 10)
y = 2 * x - 1 + rng.randn(50)  # a linear signal plus standard-normal noise
plt.scatter(x, y)
X = x[:, np.newaxis]  # sklearn expects a 2D features matrix of shape [n_samples, n_features]
X.shape               # (50, 1)
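As a side note, reshape gives the same [n_samples, n_features] layout, in case you prefer it over np.newaxis:
X = x.reshape(-1, 1)   # equivalent to x[:, np.newaxis]
X.shape                # (50, 1)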
- Fit the model to your data by calling the fit() method of the model instance.
That is, call fit() on the model instance created above.
model.fit(X, y)
model.coef_
model.intercept_ # coef_ comes out very close to 2 and intercept_ very close to -1, matching the data-generating process above.
This fit() command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore. In Scikit-Learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example, in this linear model we have the following: model.coef_ and model.intercept_.
The fit() call runs the underlying algorithm, and all of the learned model parameters are then stored on the model object.
- Apply the Model to new data:
That is, apply the trained model to a new dataset.
- For supervised learning, often we predict labels for unknown data using the predict() method.
xfit = np.linspace(-1, 11)
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)
# The two commands below plot the training points (x, y) and the model's fit (xfit, yfit), so we can see how well the fit works
plt.scatter(x, y)
plt.plot(xfit, yfit);
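Regressors also expose a score() method returning the R² of the predictions on the given data; here the model was fit on (X, y), so the value should be close to 1:
model.score(X, y)   # R^2 on the training data; close to 1 for this clean linear fit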
- For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.
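A minimal sketch of this unsupervised pattern on synthetic data (the variables here are illustrative, not part of the example above):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
X_demo = rng.rand(100, 4)                         # synthetic 4-feature data
X_2d = PCA(n_components=2).fit_transform(X_demo)  # transform(): project to 2 dimensions
kmeans = KMeans(n_clusters=3, n_init=10).fit(X_demo)
labels = kmeans.predict(X_demo)                   # predict(): infer cluster labels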
3. An iris classification task with scikit-learn
import seaborn as sns
iris = sns.load_dataset('iris')        # load iris as a pandas DataFrame
X_iris = iris.drop('species', axis=1)  # features matrix: the four measurements
y_iris = iris['species']               # target vector: the species labels
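As a quick sanity check, iris has 150 samples and four numeric features:
X_iris.shape   # (150, 4)
y_iris.shape   # (150,)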
For this task, we will use an extremely simple generative model known as Gaussian naive Bayes, which proceeds by assuming each class is drawn from an axis-aligned Gaussian distribution (see In Depth: Naive Bayes Classification for more details). Because it is so fast and has no hyperparameters to choose, Gaussian naive Bayes is often a good model to use as a baseline classification, before exploring whether improvements can be found through more sophisticated models.
Because it has no hyperparameters to choose and runs fast, Gaussian naive Bayes is often used as a baseline for classification.
3.1 Splitting the dataset into training and test sets
Note that the old cross_validation module has been deprecated; use model_selection instead:
from sklearn.model_selection import train_test_split
If you want to keep the same class proportions in the training and test sets, you can use
from sklearn.model_selection import StratifiedKFold, KFold
Because cross-validation first splits the data into k equal folds, trains on k-1 of them, and validates on the remaining fold, choosing k fixes the sample counts used for training and validation. A sketch of the stratified options follows below.
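A minimal sketch of both options, assuming the X_iris / y_iris defined above (note that train_test_split also accepts a stratify keyword for a single proportion-preserving split):
from sklearn.model_selection import train_test_split, StratifiedKFold

# Option 1: one stratified train/test split
Xtr, Xte, ytr, yte = train_test_split(X_iris, y_iris, stratify=y_iris, random_state=1)

# Option 2: 5-fold cross-validation with class proportions preserved in every fold
skf = StratifiedKFold(n_splits=5)
for train_idx, val_idx in skf.split(X_iris, y_iris):
    Xtr, Xval = X_iris.iloc[train_idx], X_iris.iloc[val_idx]
    ytr, yval = y_iris.iloc[train_idx], y_iris.iloc[val_idx]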
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris, random_state=1) # a single split; the default test size is 0.25
3.2 Training and testing
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB() # 2. instantiate model
model.fit(Xtrain, ytrain) # 3. fit model to data
y_model = model.predict(Xtest) # 4. predict on new data
# compute the prediction accuracy
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
With an accuracy topping 97%, we see that even this very naive classification algorithm is effective for this particular dataset!
This shows that when the dataset is well-behaved, a simple model can be enough; at the same time, a simple model serves as a natural baseline.