Python package: scikit-learn

0. What is scikit-learn

sklearn provides a clean, uniform way to use many algorithms that come up in everyday work.
An introduction to scikit-learn in Python for data science

One of the best known is Scikit-Learn, a package that provides efficient versions of a large number of common algorithms. Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation. A benefit of this uniformity is that once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is very straightforward.

1. Installing scikit-learn

scikit-learn is the name used when installing the package; in your programs you import it as sklearn, as the example below shows.

from sklearn.linear_model import LinearRegression
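
A quick way to verify the installation (a minimal sketch; the install commands in the comment assume pip or conda is available):

# Install from the command line first:
#   pip install scikit-learn    (or: conda install scikit-learn)
import sklearn

print(sklearn.__version__)  # the import name is sklearn, not scikit-learn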

2. The standard scikit-learn workflow


Most commonly, the steps in using the Scikit-Learn estimator API are as follows (we will step through a handful of detailed examples in the sections that follow).

  1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
    Import the desired model class.
    e.g. from sklearn.linear_model import LinearRegression
  2. Choose model hyperparameters by instantiating this class with desired values.
    Instantiate the model with the chosen hyperparameter values.
    e.g. model = LinearRegression(fit_intercept=True)
  3. Arrange data into a features matrix and target vector following the discussion above.
    Arrange your data into a features matrix and a target vector.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(42)
x = 10 * rng.rand(50)          # 50 samples drawn uniformly from [0, 10)
y = 2 * x - 1 + rng.randn(50)  # a line with slope 2 and intercept -1, plus standard-normal noise
plt.scatter(x, y)
X = x[:, np.newaxis]  # sklearn expects a 2-D features matrix of shape (n_samples, n_features)
X.shape               # (50, 1)
  4. Fit the model to your data by calling the fit() method of the model instance.
    Call the fit() method of the model instance created above.
model.fit(X, y)
model.coef_       # very close to 2, the true slope
model.intercept_  # very close to -1, the true intercept

This fit() command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore. In Scikit-Learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example, in this linear model we have model.coef_ and model.intercept_.
The fit() call runs the underlying algorithm, and all of the model's learned parameters are then stored on the model object.

  5. Apply the model to new data:
    Apply the fitted model to new data.
  • For supervised learning, often we predict labels for unknown data using the predict() method.
xfit = np.linspace(-1, 11)
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)
# plot the training points and the fitted line together to see how well the model fits
plt.scatter(x, y)
plt.plot(xfit, yfit);
  • For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.
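
For the unsupervised case, a minimal sketch using PCA (the random data here is only for illustration; any unsupervised estimator follows the same pattern):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X2 = rng.rand(50, 2)             # 50 samples with 2 features

pca = PCA(n_components=1)        # 1. choose model class, 2. set hyperparameters
pca.fit(X2)                      # 3. fit to the data
X2_reduced = pca.transform(X2)   # 4. transform() instead of predict()
print(X2_reduced.shape)          # (50, 1)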

3. An iris classification task with scikit-learn

import seaborn as sns
iris = sns.load_dataset('iris')
X_iris = iris.drop('species', axis=1)  # features matrix: four measurements per flower
y_iris = iris['species']               # target vector: the species label

For this task, we will use an extremely simple generative model known as Gaussian naive Bayes, which proceeds by assuming each class is drawn from an axis-aligned Gaussian distribution (see In Depth: Naive Bayes Classification for more details). Because it is so fast and has no hyperparameters to choose, Gaussian naive Bayes is often a good model to use as a baseline classification, before exploring whether improvements can be found through more sophisticated models.
Because it has no hyperparameters to choose and is very fast, Gaussian naive Bayes is often used as a baseline classifier.
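
To see the "axis-aligned Gaussian per class" assumption concretely, you can inspect the fitted parameters. A small sketch (iris is loaded from sklearn.datasets here so the snippet is self-contained; the variance attribute is var_ in scikit-learn >= 1.0 and sigma_ in older versions):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

# one mean and one variance per (class, feature) pair:
# an axis-aligned Gaussian for each of the 3 classes
print(model.theta_.shape)  # (3, 4) -- per-class feature means
print(model.var_.shape)    # (3, 4) -- per-class feature variances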

3.1 Splitting the dataset into training and test sets

Note that the cross_validation module is deprecated; use model_selection instead.

    from sklearn.model_selection import train_test_split

If you want the training and test sets to preserve the same proportions of each class, you can use

from sklearn.model_selection import StratifiedKFold, KFold

Because cross-validation first splits the data into k equal folds, then trains on k-1 folds and validates on the remaining fold, choosing k fixes the number of samples used for training and validation.
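
A small sketch of both options: train_test_split with the stratify= argument preserves class proportions in a single split, while StratifiedKFold does the same for every fold (5 folds is an arbitrary choice here):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, StratifiedKFold

X, y = load_iris(return_X_y=True)

# single split that keeps the class proportions roughly equal
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=1)
print(np.bincount(ytr), np.bincount(yte))

# k folds: each fold keeps the class ratio, and k fixes the split sizes
skf = StratifiedKFold(n_splits=5)
for train_idx, val_idx in skf.split(X, y):
    print(len(train_idx), len(val_idx))  # 120 train / 30 validation per fold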

from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris, random_state=1)  # a single split; the default test size is 0.25


3.2 Training and testing

from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB()                       # 2. instantiate model
model.fit(Xtrain, ytrain)                  # 3. fit model to data
y_model = model.predict(Xtest)             # 4. predict on new data
# measure the prediction accuracy
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)

With an accuracy topping 97%, we see that even this very naive classification algorithm is effective for this particular dataset!
This shows that when the dataset is well-behaved, a simple model can be enough; such a simple model also makes a good baseline to compare against.
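
To make the accuracy less dependent on a single lucky split, the same model could be scored with cross-validation, e.g. with cross_val_score (a sketch; iris is reloaded from sklearn.datasets so the snippet stands alone, and cv=5 is an arbitrary choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
scores = cross_val_score(GaussianNB(), X, y, cv=5)  # one accuracy score per fold
print(scores.mean(), scores.std())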
