How to Use scikit-learn

西伯利亚孤狼A

2017-08-19 22:24

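Introduction

For newcomers to machine learning who are hesitant to dive into the algorithms, figuring out how to get started quickly can be a real struggle.

Among people working in data science, the most commonly used tools are R and Python. Each has its strengths and weaknesses, but Python comes out somewhat ahead in many respects, in large part because the scikit-learn library implements many machine learning algorithms.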

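Data Loading

We assume the input is a feature matrix or a CSV file.

First, the data must be loaded into memory.

scikit-learn is built on NumPy arrays, so we use NumPy to load the CSV file.

The data below is downloaded from the UCI Machine Learning Repository.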

import numpy as np
import urllib.request

# URL of the dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file (urllib.request.urlopen replaces Python 2's urllib.urlopen)
raw_data = urllib.request.urlopen(url)
# load the CSV file as a NumPy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the features from the target attribute
X = dataset[:, 0:7]  # columns 0-6 (the end index of a slice is exclusive)
y = dataset[:, 8]    # column 8 holds the class label

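We will use this dataset as the running example, with the feature matrix as X and the target variable as y.

Notes:

(1) You can open the URL in a browser, save the data file locally, and then load it directly with np.loadtxt('data.txt', delimiter=",");

(2) X = dataset[:, 0:7] means: take all rows of dataset and columns 0 through 6 (the end index of a slice is exclusive) and store them in X. The dataset has eight feature columns (0-7) followed by the class label in column 8, so dataset[:, 0:8] would select every feature.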
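The loading and slicing rules in the notes above can be checked on a small in-memory CSV. This is a minimal sketch: the three illustrative rows follow the dataset's layout (eight features plus a label), and io.StringIO is an assumption used only so the example runs without downloading or saving a file.

```python
import io

import numpy as np

# Illustrative rows in the dataset's layout: 8 feature columns + 1 label.
csv_text = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""

# np.loadtxt accepts a file path or any file-like object.
dataset = np.loadtxt(io.StringIO(csv_text), delimiter=",")

X_first7 = dataset[:, 0:7]   # columns 0-6: the end index is exclusive
X_all = dataset[:, 0:8]      # columns 0-7: all eight features
y = dataset[:, 8]            # column 8: the class label

print(dataset.shape, X_first7.shape, X_all.shape, y.shape)
```

Replacing io.StringIO(csv_text) with a local file name such as 'data.txt' gives the call described in note (1).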

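Data Normalization

Many machine learning algorithms, especially those based on gradient methods, are sensitive to the scale of the data. Before running an algorithm we should therefore normalize or standardize the features. In scikit-learn, normalization rescales each sample to unit norm, and standardization rescales each feature to zero mean and unit variance; mapping features onto the 0-1 range is a separate min-max transformation. For details, see http://scikit-learn.org/stable/modules/preprocessing.html: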

from sklearn import preprocessing

# normalize the data attributes (rescale each sample to unit L2 norm)
normalized_X = preprocessing.normalize(X)

# standardize the data attributes (zero mean, unit variance per feature)
standardized_X = preprocessing.scale(X)
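To make the difference between these transformations concrete, here is a small sketch on a made-up 3x2 matrix (X_demo is a hypothetical stand-in, not the dataset above). It also includes MinMaxScaler, which is the scikit-learn transformer that maps each feature onto the 0-1 range.

```python
import numpy as np
from sklearn import preprocessing

# Small made-up feature matrix: 3 samples, 2 features on very different scales.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

# normalize: each ROW is divided by its L2 norm.
normalized = preprocessing.normalize(X_demo)

# scale: each COLUMN is shifted to mean 0 and scaled to unit variance.
standardized = preprocessing.scale(X_demo)

# MinMaxScaler: each COLUMN is mapped linearly onto [0, 1].
min_max = preprocessing.MinMaxScaler().fit_transform(X_demo)

print(np.linalg.norm(normalized, axis=1))                    # row norms
print(standardized.mean(axis=0), standardized.std(axis=0))   # column stats
print(min_max.min(axis=0), min_max.max(axis=0))              # column ranges
```

Which transformation is appropriate depends on the algorithm; the sketch only shows what each one does to the numbers.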

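Feature Selection

When solving a real problem, the ability to choose suitable features, or to construct them, is especially important. This is called feature selection or feature engineering.

Feature engineering is a creative process that relies heavily on intuition and domain knowledge, while feature selection can draw on many ready-made algorithms.

The tree-based algorithm below computes how informative each feature is: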

from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)

The relative importance of each feature is printed:

[ 0.13784722 0.15383598 0.25451389 0.17476852 0.02847222 0.12314815 0.12741402]
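Once importances are available, one possible way to act on them is to keep only the features that score above a threshold. The sketch below assumes synthetic data from make_classification (a stand-in, not the dataset above) and uses scikit-learn's SelectFromModel; the choice of 100 trees and the fixed random seeds are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 8 features, only 3 of which carry signal.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)

# Fit the trees inside the selector; by default, features whose importance
# is at or above the mean importance are kept.
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=0))
selector.fit(X_demo, y_demo)

importances = selector.estimator_.feature_importances_
X_selected = selector.transform(X_demo)

# Rank features from most to least important.
ranking = np.argsort(importances)[::-1]
print("importances:", np.round(importances, 3))
print("ranking:", ranking)
print("kept columns:", selector.get_support(indices=True))
print("shape before/after:", X_demo.shape, X_selected.shape)
```

The importances sum to 1, so each value can be read as that feature's share of the total.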

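Using the Algorithms

scikit-learn implements most of the basic machine learning algorithms. Let's take a quick look at them.

Logistic Regression (official documentation)

Many problems can be reduced to binary classification. An advantage of this algorithm is that it outputs, for each sample, the probability of belonging to each class.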

from sklearn import metrics
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)
print('MODEL')
print(model)
# make predictions (on the training data itself)
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print('RESULT')
print(metrics.classification_report(expected, predicted))
print('CONFUSION MATRIX')
print(metrics.confusion_matrix(expected, predicted))

Result:

MODEL
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)
RESULT
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00         4
        1.0       1.00      1.00      1.00         6

avg / total       1.00      1.00      1.00        10

CONFUSION MATRIX
[[4 0]
 [0 6]]

For the meaning of each parameter in the output, see the official documentation.
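The class probabilities mentioned above are exposed through predict_proba. The following is a minimal sketch on synthetic data (X_demo and y_demo are hypothetical stand-ins, and max_iter=1000 is an assumption made only to avoid convergence warnings).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data standing in for a real dataset.
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_demo, y_demo)

# One row per sample, one column per class (ordered as in model.classes_).
proba = model.predict_proba(X_demo[:5])
labels = model.predict(X_demo[:5])

print(model.classes_)
print(np.round(proba, 3))
print(labels)
```

Each row of proba sums to 1, and predict returns the class with the larger probability.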

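Naive Bayes (official documentation)

This is another well-known machine learning algorithm. Its task is to estimate the probability density of the training data, and it works well for multi-class classification.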

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X, y)
print('MODEL')
print(model)
# make predictions (on the training data itself)
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print('RESULT')
print(metrics.classification_report(expected, predicted))
print('CONFUSION MATRIX')
print(metrics.confusion_matrix(expected, predicted))

Result:

MODEL
GaussianNB()
RESULT
             precision    recall  f1-score   support

        0.0       0.80      1.00      0.89         4
        1.0       1.00      0.83      0.91         6

avg / total       0.92      0.90      0.90        10

CONFUSION MATRIX
[[4 0]
 [1 5]]
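To illustrate what "estimating the density of the training data" means for GaussianNB, the sketch below fits the model on three synthetic Gaussian blobs (a stand-in dataset from make_blobs, not the data above) and prints the per-class feature means and class priors it has learned.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class data: three Gaussian blobs in two dimensions.
X_demo, y_demo = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

model = GaussianNB()
model.fit(X_demo, y_demo)

# theta_ holds the estimated mean of each feature within each class,
# class_prior_ the estimated probability of each class.
print(model.theta_)
print(model.class_prior_)

# The fitted densities yield a probability for every class.
print(np.round(model.predict_proba(X_demo[:3]), 3))
```

The learned means are simply the per-class sample means, which is why the method extends naturally to any number of classes.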

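k-Nearest Neighbors (official documentation)

The k-nearest neighbors algorithm is often used as one component of a larger classification method. For example, it can be used to evaluate features, which makes it useful in feature selection.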

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions (on the training data itself)
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Result:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=5, p=2, weights='uniform')
             precision    recall  f1-score   support

        0.0       0.75      0.75      0.75         4
        1.0       0.83      0.83      0.83         6

avg / total       0.80      0.80      0.80        10

[[3 1]
 [1 5]]
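One way to read "using k-nearest neighbors to evaluate features" is to score candidate feature subsets by the cross-validated accuracy a kNN classifier reaches on them. The sketch below is an illustration under that assumption, on synthetic data where the informative columns are known in advance; it is not necessarily the procedure the original text had in mind.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data; with shuffle=False the informative features come first,
# so columns 0-1 carry the signal and columns 2-5 are noise.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)

knn = KNeighborsClassifier(n_neighbors=5)

# Score a candidate feature subset by cross-validated accuracy.
score_informative = cross_val_score(knn, X_demo[:, :2], y_demo, cv=5).mean()
score_noise = cross_val_score(knn, X_demo[:, 2:], y_demo, cv=5).mean()

print(score_informative, score_noise)
```

A subset that scores well under kNN contains features in which nearby samples tend to share a label.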

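Decision Trees (official documentation)

Classification and Regression Trees (CART) are commonly used for classification or regression problems whose features carry categorical information, and they are well suited to multi-class classification.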

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions (on the training data itself)
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Result:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best')
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00         4
        1.0       1.00      1.00      1.00         6

avg / total       1.00      1.00      1.00        10

[[4 0]
 [0 6]]
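A fitted tree is a set of if/else splitting rules, and it handles more than two classes without any extra machinery. As a small sketch, the example below fits a shallow tree on the built-in three-class iris dataset (a stand-in chosen for illustration) and prints the rules with export_text; max_depth=2 is an assumption made to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris is a small built-in 3-class dataset, used here only as a stand-in.
iris = load_iris()

# Limit the depth so the printed tree stays readable.
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# Print the learned if/else splitting rules as plain text.
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
print(model.predict(iris.data[:3]))
```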

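Support Vector Machines (official documentation)

SVM is a very popular machine learning algorithm used mainly for classification. Like logistic regression, it can handle multi-class classification with the one-vs-rest approach.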

from sklearn import metrics
from sklearn.svm import SVC

# fit a SVM model to the data
model = SVC()
model.fit(X, y)
print(model)
# make predictions (on the training data itself)
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Result:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00         4
        1.0       1.00      1.00      1.00         6

avg / total       1.00      1.00      1.00        10

[[4 0]
 [0 6]]
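The one-vs-rest idea can be made explicit with scikit-learn's OneVsRestClassifier wrapper, which trains one binary SVM per class. This is a sketch on the built-in iris dataset (a stand-in, not the data above); note that SVC on its own trains one-vs-one classifiers internally for multi-class problems, so the wrapper is what actually gives one-vs-rest here.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

iris = load_iris()  # built-in 3-class dataset, used only as a stand-in

# Train one binary SVM per class, each separating that class from the rest.
ovr = OneVsRestClassifier(SVC(kernel="rbf"))
ovr.fit(iris.data, iris.target)

n_binary_models = len(ovr.estimators_)           # one fitted SVC per class
train_accuracy = ovr.score(iris.data, iris.target)  # accuracy on training data

print(n_binary_models)
print(ovr.predict(iris.data[:5]))
print(train_accuracy)
```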

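Besides classification and regression algorithms, scikit-learn offers other families of algorithms such as clustering, and it also implements ensemble techniques such as Bagging and Boosting.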

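How to Optimize Algorithm Parameters

A harder task is finding an effective way to choose the right parameters; in practice this means searching over candidate values. scikit-learn provides functions for this purpose.

The example below selects the regularization parameter:

GridSearchCV: official documentation 1 (module usage), official documentation 2 (detailed explanation)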

import numpy as np
from sklearn.linear_model import Ridge
# GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed in 0.20)
from sklearn.model_selection import GridSearchCV

# prepare a range of alpha values to test
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

Result:

GridSearchCV(cv=None, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, solver='auto', tol=0.001),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'alpha': array([ 1.00000e+00, 1.00000e-01, 1.00000e-02, 1.00000e-03,
        1.00000e-04, 0.00000e+00])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)
-5.59572064238
0.0
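Beyond the single best score, a grid search records the cross-validated score of every candidate in cv_results_. The sketch below shows how to read it, assuming synthetic regression data from make_regression (a stand-in) and an explicit cv=5; both choices are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for a real dataset.
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001])
grid = GridSearchCV(estimator=Ridge(), param_grid={"alpha": alphas}, cv=5)
grid.fit(X_demo, y_demo)

# cv_results_ stores one entry per candidate; the default score for a
# regressor is R^2, averaged over the cross-validation folds.
mean_scores = grid.cv_results_["mean_test_score"]
for params, score in zip(grid.cv_results_["params"], mean_scores):
    print(params["alpha"], round(float(score), 4))

print(grid.best_params_, grid.best_score_)
```

best_score_ is the highest of these mean scores, and best_params_ is the candidate that achieved it.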

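Sometimes it is more effective to sample parameter values at random from a given range, evaluate the algorithm with each sampled value, and keep the best one.

RandomizedSearchCV: official documentation (module usage), official documentation 2 (detailed explanation)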

import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
# RandomizedSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed in 0.20)
from sklearn.model_selection import RandomizedSearchCV

# prepare a uniform distribution on [0, 1] to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)
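Regularization strengths often span several orders of magnitude, so a uniform distribution on [0, 1] may explore only a narrow part of the useful range. As a hedged variation on the search above, the sketch below samples alpha log-uniformly with scipy.stats.loguniform; the bounds 1e-4 and 1e2, the synthetic data, n_iter=20 and the fixed seed are all illustrative assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for a real dataset.
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Sample alpha log-uniformly between 1e-4 and 1e2, so each order of
# magnitude is equally likely to be tried.
param_distributions = {"alpha": loguniform(1e-4, 1e2)}

rsearch = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    random_state=0,  # fixed seed so the sampled alphas are reproducible
)
rsearch.fit(X_demo, y_demo)

print(rsearch.best_params_)
print(rsearch.best_score_)
```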
