# Scikit-learn: Machine Learning in Python


## Loading an example dataset

In [1]: from sklearn import datasets
In [2]: iris = datasets.load_iris()

In [3]: iris.data.shape
Out[3]: (150, 4)

In [5]: iris.target.shape
Out[5]: (150,)

In [6]: import numpy as np

In [7]: np.unique(iris.target)
Out[7]: array([0, 1, 2])

#### An example of reshaping data: the digits dataset

In [8]: digits = datasets.load_digits()

In [9]: digits.images.shape
Out[9]: (1797, 8, 8)

In [10]: import pylab as pl

In [11]: pl.imshow(digits.images[0], cmap=pl.cm.gray_r)
Out[11]: <matplotlib.image.AxesImage at 0x3285b90>

In [13]: pl.show()

In [12]: data = digits.images.reshape((digits.images.shape[0], -1))
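The reshape above flattens each 8×8 image into a 64-element feature vector, which is the `(n_samples, n_features)` layout that scikit-learn estimators expect. A standalone check:

```python
from sklearn import datasets

digits = datasets.load_digits()
# Flatten each 8x8 image into a row of 64 pixel features.
data = digits.images.reshape((digits.images.shape[0], -1))
print(digits.images.shape)   # -> (1797, 8, 8)
print(data.shape)            # -> (1797, 64)
```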

### Learning and predicting

In [14]: from sklearn import svm

In [15]: clf = svm.LinearSVC()

In [16]: clf.fit(iris.data, iris.target) # learn from the data
Out[16]:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
tol=0.0001, verbose=0)

In [17]: clf.predict([[ 5.0,  3.6,  1.3,  0.25]])
Out[17]: array([0], dtype=int32)

In [18]: clf.coef_
Out[18]:
array([[ 0.18424352,  0.45122644, -0.8079467 , -0.45071302],
[ 0.05190619, -0.89423619,  0.40519245, -0.93781587],
[-0.85087844, -0.98667529,  1.38088883,  1.86538111]])

## Classification

### K-nearest neighbors (KNN) classifier

The k-nearest neighbors classifier internally uses a ball tree to represent the samples it is trained on.

A KNN classification example:

In [19]: # Create and fit a nearest-neighbor classifier

In [20]: from sklearn import neighbors

In [21]: knn = neighbors.KNeighborsClassifier()

In [22]: knn.fit(iris.data, iris.target)
Out[22]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, n_neighbors=5, p=2,
warn_on_equidistant=True, weights='uniform')

In [23]: knn.predict([[0.1, 0.2, 0.3, 0.4]])
Out[23]: array([0])

#### Training and test sets

In [24]: perm = np.random.permutation(iris.target.size)

In [25]: iris.data = iris.data[perm]

In [26]: iris.target = iris.target[perm]

In [27]: knn.fit(iris.data[:100], iris.target[:100])
Out[27]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, n_neighbors=5, p=2,
warn_on_equidistant=True, weights='uniform')

In [28]: knn.score(iris.data[100:], iris.target[100:])
/usr/lib/python2.7/site-packages/sklearn/neighbors/classification.py:129: NeighborsWarning: kneighbors: neighbor k+1 and neighbor k have the same distance: results will be dependent on data order.
neigh_dist, neigh_ind = self.kneighbors(X)
Out[28]: 0.95999999999999996

Bonus question: why did we use a random permutation?
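The answer lies in how the dataset is stored: iris.target is sorted by class, so slicing off the first 100 samples without shuffling would leave one class entirely unseen during training. A minimal standalone check:

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
# The targets are stored sorted by class: fifty 0s, fifty 1s, fifty 2s.
print(np.unique(iris.target[:100]))   # -> [0 1]
print(np.unique(iris.target[100:]))   # -> [2]
# Without a random permutation, the first 100 samples would contain no
# class-2 examples at all, and the held-out set nothing but class 2.
```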

### Support vector machines (SVMs) for classification

#### Linear SVMs

SVMs try to construct a maximum-margin hyperplane between two classes of a dataset. They select a subset of the input, called the support vectors: the samples closest to the separating hyperplane.

In [60]: from sklearn import svm

In [61]: svc = svm.SVC(kernel='linear')

In [62]: svc.fit(iris.data, iris.target)
Out[62]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='linear', probability=False, shrinking=True, tol=0.001,
verbose=False)

There are several SVM implementations in scikit-learn. The most commonly used are svm.SVC, svm.NuSVC, and svm.LinearSVC; "SVC" stands for Support Vector Classifier (SVMs for regression also exist, called "SVR" in scikit-learn).
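As a quick sketch of the three variants side by side (assuming a reasonably recent scikit-learn; the exact scores vary slightly between versions, and these are training-set scores, for illustration only):

```python
from sklearn import datasets, svm

iris = datasets.load_iris()
X, y = iris.data, iris.target

# All three classifier variants share the same fit/score API.
scores = {}
for clf in (svm.SVC(kernel='linear'), svm.NuSVC(), svm.LinearSVC()):
    clf.fit(X, y)
    scores[type(clf).__name__] = clf.score(X, y)
    print(type(clf).__name__, scores[type(clf).__name__])
```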

#### Using kernels

• Linear kernel

svc = svm.SVC(kernel='linear')

• Polynomial kernel

svc = svm.SVC(kernel='poly', degree=3) # degree: polynomial degree

• RBF kernel (Radial Basis Function)

svc = svm.SVC(kernel='rbf') # gamma: inverse of size of radial kernel
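A short sketch comparing the three kernels on the iris data (assuming a recent scikit-learn, where gamma defaults to 'scale'; training-set scores only, so read them as illustration rather than evaluation):

```python
from sklearn import datasets, svm

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Same data, same API, different decision functions.
scores = {}
for kernel in ('linear', 'poly', 'rbf'):
    svc = svm.SVC(kernel=kernel)
    svc.fit(X, y)
    scores[kernel] = svc.score(X, y)
    print(kernel, scores[kernel])
```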

## Clustering: grouping observations

### K-means clustering

(An alternative implementation of k-means is available in SciPy's cluster package. The scikit-learn implementation differs by providing an object API and several additional features, including smart initialization.)

In [82]: from sklearn import cluster, datasets

In [84]: k_means = cluster.KMeans(k=3)

In [85]: k_means.fit(iris.data)
Out[85]:
KMeans(copy_x=True, init='k-means++', k=3, max_iter=300, n_init=10, n_jobs=1,
precompute_distances=True,
random_state=<mtrand.RandomState object at 0x7f4d860642d0>, tol=0.0001,
verbose=0)

In [86]: print k_means.labels_[::10]
[1 1 1 1 1 2 2 2 2 2 0 0 0 0 0]

In [87]: print iris.target[::10]
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]

#### Application to image compression

In [95]: from scipy import misc

In [96]: lena = misc.lena().astype(np.float32)

In [97]: X = lena.reshape((-1, 1)) # We need an (n_sample, n_feature) array

In [98]: k_means = cluster.KMeans(5)

In [99]: k_means.fit(X)
Out[99]:
KMeans(copy_x=True, init='k-means++', k=5, max_iter=300, n_init=10, n_jobs=1,
precompute_distances=True,
random_state=<mtrand.RandomState object at 0x7f4d860642d0>, tol=0.0001,
verbose=0)

In [100]: values = k_means.cluster_centers_.squeeze()

In [101]: labels = k_means.labels_

In [102]: lena_compressed = np.choose(labels, values)

In [103]: lena_compressed.shape = lena.shape

In [31]: import matplotlib.pyplot as plt

In [32]: plt.gray()

In [33]: plt.imshow(lena_compressed)
Out[33]: <matplotlib.image.AxesImage at 0x4b2c510>

In [34]: plt.show()


## Dimensionality reduction with principal component analysis (PCA)

In [75]: from sklearn import decomposition

In [76]: pca = decomposition.PCA(n_components=2)

In [77]: pca.fit(iris.data)
Out[77]: PCA(copy=True, n_components=2, whiten=False)

In [78]: X = pca.transform(iris.data)

In [79]: import pylab as pl

In [80]: pl.scatter(X[:, 0], X[:, 1], c=iris.target)
Out[80]: <matplotlib.collections.PathCollection at 0x4104310>

PCA is not only useful for visualizing high-dimensional datasets. It can also serve as a preprocessing step to speed up supervised methods that do not cope efficiently with high dimensions.
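How much information do the two components actually retain? The explained_variance_ratio_ attribute answers this directly (a standalone sketch; on iris the first component typically keeps roughly 92% of the variance):

```python
from sklearn import datasets, decomposition

iris = datasets.load_iris()
pca = decomposition.PCA(n_components=2)
X_reduced = pca.fit_transform(iris.data)

print(X_reduced.shape)                 # -> (150, 2)
# Fraction of the total variance each component retains:
print(pca.explained_variance_ratio_)   # roughly [0.92, 0.05]
```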

## Putting it all together: face recognition

"""
Stripped-down version of the face recognition example by Olivier Grisel

http://scikit-learn.org/dev/auto_examples/applications/face_recognition.html

## original shape of images: 50, 37
"""
import numpy as np
import pylab as pl
from sklearn import cross_val, datasets, decomposition, svm

# ..
lfw_people = datasets.fetch_lfw_people(min_faces_per_person=70, resize=0.4)
perm = np.random.permutation(lfw_people.target.size)
lfw_people.data = lfw_people.data[perm]
lfw_people.target = lfw_people.target[perm]
faces = np.reshape(lfw_people.data, (lfw_people.target.shape[0], -1))
train, test = iter(cross_val.StratifiedKFold(lfw_people.target, k=4)).next()
X_train, X_test = faces[train], faces[test]
y_train, y_test = lfw_people.target[train], lfw_people.target[test]

# ..
# .. dimension reduction ..
pca = decomposition.RandomizedPCA(n_components=150, whiten=True)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# ..
# .. classification ..
clf = svm.SVC(C=5., gamma=0.001)
clf.fit(X_train_pca, y_train)

# ..
# .. predict on new images ..
for i in range(10):
    print lfw_people.target_names[clf.predict(X_test_pca[i])[0]]
    _ = pl.imshow(X_test[i].reshape(50, 37), cmap=pl.cm.gray)
    _ = raw_input()

## Linear models: from regression to sparsity

In [104]: diabetes = datasets.load_diabetes()

In [105]: diabetes_X_train = diabetes.data[:-20]

In [106]: diabetes_X_test  = diabetes.data[-20:]

In [107]: diabetes_y_train = diabetes.target[:-20]

In [108]: diabetes_y_test  = diabetes.target[-20:]

### Sparse models

In [109]: from sklearn import linear_model

In [110]: regr = linear_model.Lasso(alpha=.3)

In [111]: regr.fit(diabetes_X_train, diabetes_y_train)
Out[111]:
Lasso(alpha=0.3, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute='auto', tol=0.0001,
warm_start=False)

In [112]: regr.coef_ # very sparse coefficients
Out[112]:
array([   0.        ,   -0.        ,  497.34075682,  199.17441034,
-0.        ,   -0.        , -118.89291545,    0.        ,
430.9379595 ,    0.        ])

In [113]: regr.score(diabetes_X_test, diabetes_y_test)
Out[113]: 0.55108354530029791

In [114]: lin = linear_model.LinearRegression()

In [115]: lin.fit(diabetes_X_train, diabetes_y_train)
Out[115]: LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [116]: lin.score(diabetes_X_test, diabetes_y_test)
Out[116]: 0.58507530226905713
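The comparison above shows the lasso trading a little test score for a much sparser model. That trade-off is steered by alpha: the larger it is, the more coefficients are driven exactly to zero. A quick sketch (assuming a recent scikit-learn):

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X_train = diabetes.data[:-20]
y_train = diabetes.target[:-20]

# Count the surviving (non-zero) coefficients as alpha grows.
counts = []
for alpha in (0.1, 0.3, 1.0):
    regr = linear_model.Lasso(alpha=alpha)
    regr.fit(X_train, y_train)
    counts.append(int(np.sum(regr.coef_ != 0)))
    print(alpha, counts[-1])
```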

## Model selection: choosing estimators and their parameters

### Grid search and cross-validated estimators

#### Grid search

scikit-learn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters that maximize the cross-validation score. This object takes an estimator during construction and exposes an estimator API:

In [117]: from sklearn import svm, grid_search

In [118]: gammas = np.logspace(-6, -1, 10)

In [119]: svc = svm.SVC()

In [120]: clf = grid_search.GridSearchCV(estimator=svc, param_grid=dict(gamma=gammas),n_jobs=-1)

In [121]: clf.fit(digits.data[:1000], digits.target[:1000])
Out[121]:
GridSearchCV(cv=None,
estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', probability=False, shrinking=True, tol=0.001,
verbose=False),
fit_params={}, iid=True, loss_func=None, n_jobs=-1,
param_grid={'gamma': array([  1.00000e-06,   3.59381e-06,   1.29155e-05,   4.64159e-05,
1.66810e-04,   5.99484e-04,   2.15443e-03,   7.74264e-03,
2.78256e-02,   1.00000e-01])},
pre_dispatch='2*n_jobs', refit=True, score_func=None, verbose=0)

In [122]: clf.best_score
/usr/lib/python2.7/site-packages/sklearn/utils/__init__.py:79: DeprecationWarning: Function best_score is deprecated; GridSearchCV.best_score is deprecated and will be removed in version 0.12. Please use GridSearchCV.best_score_ instead.
warnings.warn(msg, category=DeprecationWarning)
Out[122]: 0.98600097103091122

In [123]: clf.best_estimator.gamma
/usr/lib/python2.7/site-packages/sklearn/utils/__init__.py:79: DeprecationWarning: Function best_estimator is deprecated; GridSearchCV.best_estimator is deprecated and will be removed in version 0.12. Please use GridSearchCV.best_estimator_ instead.
warnings.warn(msg, category=DeprecationWarning)
Out[123]: 0.0021544346900318843

#### Cross-validated estimators

In [125]: from sklearn import linear_model, datasets

In [126]: lasso = linear_model.LassoCV()

In [128]: X_diabetes = diabetes.data

In [129]: y_diabetes = diabetes.target

In [130]: lasso.fit(X_diabetes, y_diabetes)
Out[130]:
LassoCV(alphas=array([ 2.14804,  2.00327, ...,  0.0023 ,  0.00215]),
copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000,
n_alphas=100, normalize=False, precompute='auto', tol=0.0001,
verbose=False)

In [131]: # The estimator automatically chose its lambda:

In [132]: lasso.alpha
Out[132]: 0.013180196198701137
