0. What is scikit-learn
sklearn provides a clean, uniform way to use many algorithms that come up in day-to-day work.
An introduction to scikit-learn in Python for data science.
One of the best known is Scikit-Learn, a package that provides efficient versions of a large number of common algorithms. Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation. A benefit of this uniformity is that once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is very straightforward.
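To make the uniformity point concrete, here is a minimal sketch (not from the handbook text) that swaps LinearRegression for Ridge without touching the rest of the code:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X_demo = rng.rand(50, 1)                         # 50 samples, 1 feature
y_demo = 2 * X_demo.ravel() - 1 + rng.randn(50)  # noisy linear target
for Model in (LinearRegression, Ridge):
    model = Model().fit(X_demo, y_demo)          # identical fit() interface across models
    print(Model.__name__, model.coef_, model.intercept_)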
1. Installing scikit-learn
scikit-learn is the name used when installing the package (e.g. pip install scikit-learn); in code, you import it as sklearn, as in the example below.
from sklearn.linear_model import LinearRegression
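After installing, you can confirm that the import works and check the installed version (sklearn exposes a __version__ attribute):
import sklearn
print(sklearn.__version__)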
2. The typical workflow for using sklearn
Most commonly, the steps in using the Scikit-Learn estimator API are as follows (we will step through a handful of detailed examples in the sections that follow).
- Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
That is, import the model class you want.
e.g. from sklearn.linear_model import LinearRegression
- Choose model hyperparameters by instantiating this class with desired values.
That is, instantiate the model with your chosen hyperparameter values.
e.g. model = LinearRegression(fit_intercept=True)
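The hyperparameters chosen at instantiation are simply stored on the object; get_params() returns them as a dict. A quick sketch:
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model.get_params()   # a dict of the model's hyperparameters, including 'fit_intercept': True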
- Arrange data into a features matrix and target vector following the discussion above.
That is, arrange your data into a features matrix and a target vector.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(42)
x = 10 * rng.rand(50)          # 50 samples drawn uniformly from [0, 10)
y = 2 * x - 1 + rng.randn(50)  # a linear signal plus standard-normal noise
plt.scatter(x, y)
X = x[:, np.newaxis]  # sklearn expects a 2D features matrix of shape [n_samples, n_features]
X.shape               # (50, 1)
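As a side note, reshape gives the same [n_samples, n_features] layout, in case you prefer it over np.newaxis:
X = x.reshape(-1, 1)   # equivalent to x[:, np.newaxis]
X.shape                # (50, 1)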
- Fit the model to your data by calling the fit() method of the model instance.
That is, call fit() on the model instance created above.
model.fit(X, y)
model.coef_
model.intercept_ # coef_ comes out very close to 2 and intercept_ very close to -1, matching the data-generating process above.
This fit() command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore. In Scikit-Learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example, in this linear model we have the following: model.coef_ and model.intercept_.
The fit() call runs the underlying algorithm, and all of the learned model parameters are then stored on the model object.
- Apply the Model to new data:
That is, apply the trained model to a new dataset.
- For supervised learning, often we predict labels for unknown data using the predict() method.
xfit = np.linspace(-1, 11)
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)
# The two commands below plot the training points (x, y) and the model's fit (xfit, yfit), so we can see how well the fit works
plt.scatter(x, y)
plt.plot(xfit, yfit);
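Regressors also expose a score() method returning the R² of the predictions on the given data; here the model was fit on (X, y), so the value should be close to 1:
model.score(X, y)   # R^2 on the training data; close to 1 for this clean linear fit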
- For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.
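A minimal sketch of this unsupervised pattern on synthetic data (the variables here are illustrative, not part of the example above):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
X_demo = rng.rand(100, 4)                         # synthetic 4-feature data
X_2d = PCA(n_components=2).fit_transform(X_demo)  # transform(): project to 2 dimensions
kmeans = KMeans(n_clusters=3, n_init=10).fit(X_demo)
labels = kmeans.predict(X_demo)                   # predict(): infer cluster labels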
3. An iris classification task with scikit-learn
import seaborn as sns
iris = sns.load_dataset('iris')        # load iris as a pandas DataFrame
X_iris = iris.drop('species', axis=1)  # features matrix: the four measurements
y_iris = iris['species']               # target vector: the species labels
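As a quick sanity check, iris has 150 samples and four numeric features:
X_iris.shape   # (150, 4)
y_iris.shape   # (150,)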
For this task, we will use an extremely simple generative model known as Gaussian naive Bayes, which proceeds by assuming each class is drawn from an axis-aligned Gaussian distribution (see In Depth: Naive Bayes Classification for more details). Because it is so fast and has no hyperparameters to choose, Gaussian naive Bayes is often a good model to use as a baseline classification, before exploring whether improvements can be found through more sophisticated models.
Because it has no hyperparameters to choose and runs fast, Gaussian naive Bayes is often used as a baseline for classification.
3.1 Splitting the dataset into training and test sets
Note that the old cross_validation module has been deprecated; use model_selection instead:
from sklearn.model_selection import train_test_split
If you want to keep the same class proportions in the training and test sets, you can use
from sklearn.model_selection import StratifiedKFold, KFold
Because cross-validation first splits the data into k equal folds, trains on k-1 of them, and validates on the remaining fold, choosing k fixes the sample counts used for training and validation. A sketch of the stratified options follows below.
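A minimal sketch of both options, assuming the X_iris / y_iris defined above (note that train_test_split also accepts a stratify keyword for a single proportion-preserving split):
from sklearn.model_selection import train_test_split, StratifiedKFold

# Option 1: one stratified train/test split
Xtr, Xte, ytr, yte = train_test_split(X_iris, y_iris, stratify=y_iris, random_state=1)

# Option 2: 5-fold cross-validation with class proportions preserved in every fold
skf = StratifiedKFold(n_splits=5)
for train_idx, val_idx in skf.split(X_iris, y_iris):
    Xtr, Xval = X_iris.iloc[train_idx], X_iris.iloc[val_idx]
    ytr, yval = y_iris.iloc[train_idx], y_iris.iloc[val_idx]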
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris, random_state=1) # a single split; the default test size is 0.25
3.2 Training and testing
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB() # 2. instantiate model
model.fit(Xtrain, ytrain) # 3. fit model to data
y_model = model.predict(Xtest) # 4. predict on new data
# compute the prediction accuracy
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
With an accuracy topping 97%, we see that even this very naive classification algorithm is effective for this particular dataset!
This shows that when the dataset is well-behaved, a simple model can be enough; at the same time, a simple model serves as a natural baseline.