scikit-learn的主要模块和基本使用_prepare a uniform distribution to sample for the a-CSDN博客

Apr 17 2015 更新日期:Apr 19 2015

文章目录

引言

对于一些开始搞机器学习算法有害怕下手的小朋友，该如何快速入门，这让人挺挣扎的。
在从事数据科学的人中，最常用的工具就是R和Python了，每个工具都有其利弊，但是Python在各方面都相对胜出一些，这是因为scikit-learn库实现了很多机器学习算法。

加载数据(Data Loading)

我们假设输入时一个特征矩阵或者csv文件。
首先，数据应该被载入内存中。
scikit-learn的实现使用了NumPy中的arrays，所以，我们要使用NumPy来载入csv文件。
以下是从UCI机器学习数据仓库中下载的数据。

      
      
       
       import numpy 
       
       as np
      
      
      
      
       
       import urllib
      
      
      
      
       
       # url with dataset
      
      
      
      
       
       url = 
       
       "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
      
      
      
      
       
       # download the file
      
      
      
      
       
       raw_data = urllib.urlopen(url)
      
      
      
      
       
       # load the CSV file as a numpy matrix
      
      
      
      
       
       dataset = np.loadtxt(raw_data, delimiter=
       
       ",")
      
      
      
      
       
       # separate the data from the target attributes
      
      
      
      
       
       X = dataset[:,
       
       0:
       
       7]
      
      
      
      
       
       y = dataset[:,
       
       8]

我们要使用该数据集作为例子，将特征矩阵作为X，目标变量作为y。

数据归一化(Data Normalization)

大多数机器学习算法中的梯度方法对于数据的缩放和尺度都是很敏感的，在开始跑算法之前，我们应该进行归一化或者标准化的过程，这使得特征数据缩放到0-1范围中。scikit-learn提供了归一化的方法：

      
      
       
       from sklearn 
       
       import preprocessing
      
      
      
      
       
       # normalize the data attributes
      
      
      
      
       
       normalized_X = preprocessing.normalize(X)
      
      
      
      
       
       # standardize the data attributes
      
      
      
      
       
       standardized_X = preprocessing.scale(X)

特征选择(Feature Selection)

在解决一个实际问题的过程中，选择合适的特征或者构建特征的能力特别重要。这成为特征选择或者特征工程。
特征选择时一个很需要创造力的过程，更多的依赖于直觉和专业知识，并且有很多现成的算法来进行特征的选择。
下面的树算法(Tree algorithms)计算特征的信息量：

      
      
       
       from sklearn 
       
       import metrics
      
      
      
      
       
       from sklearn.ensemble 
       
       import ExtraTreesClassifier
      
      
      
      
       
       model = ExtraTreesClassifier()
      
      
      
      
       
       model.fit(X, y)
      
      
      
      
       
       # display the relative importance of each attribute
      
      
      
      
       
       print(model.feature_importances_)

算法的使用

scikit-learn实现了机器学习的大部分基础算法，让我们快速了解一下。

逻辑回归

大多数问题都可以归结为二元分类问题。这个算法的优点是可以给出数据所在类别的概率。

      
      
       
       from sklearn 
       
       import metrics
      
      
      
      
       
       from sklearn.linear_model 
       
       import LogisticRegression
      
      
      
      
       
       model = LogisticRegression()
      
      
      
      
       
       model.fit(X, y)
      
      
      
      
       
       print(model)
      
      
      
      
       
       # make predictions
      
      
      
      
       
       expected = y
      
      
      
      
       
       predicted = model.predict(X)
      
      
      
      
       
       # summarize the fit of the model
      
      
      
      
       
       print(metrics.classification_report(expected, predicted))
      
      
      
      
       
       print(metrics.confusion_matrix(expected, predicted))

结果：

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, penalty=l2, random_state=None, tol=0.0001)
precision recall f1-score support
   0.0       0.79      0.89      0.84       500
   1.0       0.74      0.55      0.63       268
avg / total 0.77 0.77 0.77 768

[[447 53]
[120 148]]

朴素贝叶斯

这也是著名的机器学习算法，该方法的任务是还原训练样本数据的分布密度，其在多类别分类中有很好的效果。

      
      
       
       from sklearn 
       
       import metrics
      
      
      
      
       
       from sklearn.naive_bayes 
       
       import GaussianNB
      
      
      
      
       
       model = GaussianNB()
      
      
      
      
       
       model.fit(X, y)
      
      
      
      
       
       print(model)
      
      
      
      
       
       # make predictions
      
      
      
      
       
       expected = y
      
      
      
      
       
       predicted = model.predict(X)
      
      
      
      
       
       # summarize the fit of the model
      
      
      
      
       
       print(metrics.classification_report(expected, predicted))
      
      
      
      
       
       print(metrics.confusion_matrix(expected, predicted))

结果：

GaussianNB()
precision recall f1-score support
   0.0       0.80      0.86      0.83       500
    1.0       0.69      0.60      0.64       268
avg / total 0.76 0.77 0.76 768

[[429 71]
[108 160]]

k近邻

k近邻算法常常被用作是分类算法一部分，比如可以用它来评估特征，在特征选择上我们可以用到它。

      
      
       
       from sklearn 
       
       import metrics
      
      
      
      
       
       from sklearn.neighbors 
       
       import KNeighborsClassifier
      
      
      
      
       
       # fit a k-nearest neighbor model to the data
      
      
      
      
       
       model = KNeighborsClassifier()
      
      
      
      
       
       model.fit(X, y)
      
      
      
      
       
       print(model)
      
      
      
      
       
       # make predictions
      
      
      
      
       
       expected = y
      
      
      
      
       
       predicted = model.predict(X)
      
      
      
      
       
       # summarize the fit of the model
      
      
      
      
       
       print(metrics.classification_report(expected, predicted))
      
      
      
      
       
       print(metrics.confusion_matrix(expected, predicted))

结果：

KNeighborsClassifier(algorithm=auto, leaf_size=30, metric=minkowski,
n_neighbors=5, p=2, weights=uniform)
precision recall f1-score support
   0.0       0.82      0.90      0.86       500
    1.0       0.77      0.63      0.69       268
avg / total 0.80 0.80 0.80 768

[[448 52]
[ 98 170]]

决策树

分类与回归树(Classification and Regression Trees ,CART)算法常用于特征含有类别信息的分类或者回归问题，这种方法非常适用于多分类情况。

      
      
       
       from sklearn 
       
       import metrics
      
      
      
      
       
       from sklearn.tree 
       
       import DecisionTreeClassifier
      
      
      
      
       
       # fit a CART model to the data
      
      
      
      
       
       model = DecisionTreeClassifier()
      
      
      
      
       
       model.fit(X, y)
      
      
      
      
       
       print(model)
      
      
      
      
       
       # make predictions
      
      
      
      
       
       expected = y
      
      
      
      
       
       predicted = model.predict(X)
      
      
      
      
       
       # summarize the fit of the model
      
      
      
      
       
       print(metrics.classification_report(expected, predicted))
      
      
      
      
       
       print(metrics.confusion_matrix(expected, predicted))

结果：

DecisionTreeClassifier(compute_importances=None, criterion=gini,
max_depth=None, max_features=None, min_density=None,
min_samples_leaf=1, min_samples_split=2, random_state=None,
splitter=best)
precision recall f1-score support
   0.0       1.00      1.00      1.00       500
    1.0       1.00      1.00      1.00       268
avg / total 1.00 1.00 1.00 768

[[500 0]
[ 0 268]]

支持向量机

SVM是非常流行的机器学习算法，主要用于分类问题，如同逻辑回归问题，它可以使用一对多的方法进行多类别的分类。

      
      
       
       from sklearn 
       
       import metrics
      
      
      
      
       
       from sklearn.svm 
       
       import SVC
      
      
      
      
       
       # fit a SVM model to the data
      
      
      
      
       
       model = SVC()
      
      
      
      
       
       model.fit(X, y)
      
      
      
      
       
       print(model)
      
      
      
      
       
       # make predictions
      
      
      
      
       
       expected = y
      
      
      
      
       
       predicted = model.predict(X)
      
      
      
      
       
       # summarize the fit of the model
      
      
      
      
       
       print(metrics.classification_report(expected, predicted))
      
      
      
      
       
       print(metrics.confusion_matrix(expected, predicted))

结果：

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel=rbf, max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
precision recall f1-score support
   0.0       1.00      1.00      1.00       500
    1.0       1.00      1.00      1.00       268
avg / total 1.00 1.00 1.00 768

[[500 0]
[ 0 268]]

除了分类和回归算法外，scikit-learn提供了更加复杂的算法，比如聚类算法，还实现了算法组合的技术，如Bagging和Boosting算法。

如何优化算法参数

一项更加困难的任务是构建一个有效的方法用于选择正确的参数，我们需要用搜索的方法来确定参数。scikit-learn提供了实现这一目标的函数。
下面的例子是一个进行正则参数选择的程序：

      
      
       
       import numpy 
       
       as np
      
      
      
      
       
       from sklearn.linear_model 
       
       import Ridge
      
      
      
      
       
       from sklearn.grid_search 
       
       import GridSearchCV
      
      
      
      
       
       # prepare a range of alpha values to test
      
      
      
      
       
       alphas = np.array([
       
       1,
       
       0.1,
       
       0.01,
       
       0.001,
       
       0.0001,
       
       0])
      
      
      
      
       
       # create and fit a ridge regression model, testing each alpha
      
      
      
      
       
       model = Ridge()
      
      
      
      
       
       grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
      
      
      
      
       
       grid.fit(X, y)
      
      
      
      
       
       print(grid)
      
      
      
      
       
       # summarize the results of the grid search
      
      
      
      
       
       print(grid.best_score_)
      
      
      
      
       
       print(grid.best_estimator_.alpha)

结果：

GridSearchCV(cv=None,
estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, solver=auto, tol=0.001),
estimatoralpha=1.0, estimatorcopy_X=True,
estimatorfit_intercept=True, estimatormax_iter=None,
estimatornormalize=False, estimatorsolver=auto,
estimator__tol=0.001, fit_params={}, iid=True, loss_func=None,
n_jobs=1,
param_grid={‘alpha’: array([ 1.00000e+00, 1.00000e-01, 1.00000e-02, 1.00000e-03,
1.00000e-04, 0.00000e+00])},
pre_dispatch=2*n_jobs, refit=True, score_func=None, scoring=None,
verbose=0)
0.282118955686
1.0

有时随机从给定区间中选择参数是很有效的方法，然后根据这些参数来评估算法的效果进而选择最佳的那个。

      
      
       
       import numpy 
       
       as np
      
      
      
      
       
       from scipy.stats 
       
       import uniform 
       
       as sp_rand
      
      
      
      
       
       from sklearn.linear_model 
       
       import Ridge
      
      
      
      
       
       from sklearn.grid_search 
       
       import RandomizedSearchCV
      
      
      
      
       
       # prepare a uniform distribution to sample for the alpha parameter
      
      
      
      
       
       param_grid = {
       
       'alpha': sp_rand()}
      
      
      
      
       
       # create and fit a ridge regression model, testing random alpha values
      
      
      
      
       
       model = Ridge()
      
      
      
      
       
       rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=
       
       100)
      
      
      
      
       
       rsearch.fit(X, y)
      
      
      
      
       
       print(rsearch)
      
      
      
      
       
       # summarize the results of the random parameter search
      
      
      
      
       
       print(rsearch.best_score_)
      
      
      
      
       
       print(rsearch.best_estimator_.alpha)

结果：

RandomizedSearchCV(cv=None,
estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, solver=auto, tol=0.001),
estimatoralpha=1.0, estimatorcopy_X=True,
estimatorfit_intercept=True, estimatormax_iter=None,
estimatornormalize=False, estimatorsolver=auto,
estimator__tol=0.001, fit_params={}, iid=True, n_iter=100,
n_jobs=1,
param_distributions={‘alpha’: },
pre_dispatch=2*n_jobs, random_state=None, refit=True,
scoring=None, verbose=0)
0.282118643885
0.988443794636