Learning Machine Learning Algorithms Through scikit-learn Code Examples (Overview)
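For anyone who has just switched careers or is new to machine learning, algorithms are the skeleton of the code: the most essential and most durable part, and without question the first and most urgent thing to study. The usual path is for a teacher to cover data structures first, then the algorithms, and finally illustrate them with code examples. The benefit is obvious: the algorithms are understood more thoroughly. However, for someone who wants practical coding ability in a short time, this path is slow and carries a lot of overhead: the algorithms are mastered, yet the code is still hard to write. This article goes the other way. By studying the code examples of scikit-learn, a machine learning library, and dissecting them a little, we work from the surface inward and from the simple to the deep until we reach the substance behind the code, the algorithms, which deepens understanding and speeds up learning. My ability is limited and errors are inevitable; corrections are welcome.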
Scikit-learn requires:
Python (>= 2.7 or >= 3.3),
NumPy (>= 1.8.2),
SciPy (>= 0.13.3).
It can be installed with pip install -U scikit-learn or conda install scikit-learn.
A quick introduction to machine learning with scikit-learn
http://scikit-learn.org/stable/tutorial/basic/tutorial.html
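1. Machine learning: the problem setting - CSDN blog https://blog.csdn.net/hanyun9988/article/details/78992578
In general, a learning problem considers a dataset of n known samples and asks how to predict properties of unknown data. Learning problems are divided into supervised learning (including classification and regression) and unsupervised learning (including clustering and density estimation); there are also semi-supervised learning and reinforcement learning. In other words, machine learning means finding an algorithm that predicts the unknown from the known. The set of known samples the algorithm learns from is called the training set, and the data on which its predictions are checked is called the test set.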
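To make the training set / test set distinction concrete, here is a minimal sketch that splits the iris data with train_test_split from sklearn.model_selection. The 25% test fraction and random_state=0 are arbitrary choices made for illustration, not something the text above prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the 150 iris samples as plain arrays (features X, labels y).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples; the model never sees them while learning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print("training set:", X_train.shape, "test set:", X_test.shape)
```

The model is fitted on X_train and y_train only; X_test and y_test are kept aside to check how well it predicts data it has not seen.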
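2. Loading the example datasets:
scikit-learn comes with a number of small standard toy datasets built in, so no file has to be downloaded from an external site. The loader functions are: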
load_boston([return_X_y]) Load and return the boston house-prices dataset (regression).
load_iris([return_X_y]) Load and return the iris dataset (classification).
load_diabetes([return_X_y]) Load and return the diabetes dataset (regression).
load_digits([n_class, return_X_y]) Load and return the digits dataset (classification).
load_linnerud([return_X_y]) Load and return the linnerud dataset (multivariate regression).
load_wine([return_X_y]) Load and return the wine dataset (classification).
load_breast_cancer([return_X_y]) Load and return the breast cancer wisconsin dataset (classification).
These datasets are useful for quickly getting a feel for how the many algorithms implemented in scikit-learn behave.
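As a quick way to get an overview of these loaders, the sketch below calls several of them with return_X_y=True and prints the array shapes. load_boston is deliberately left out: as far as I know it was removed in scikit-learn 1.2, so it only works on older versions such as the v0.19.2 used in this article.

```python
from sklearn import datasets

# Loader names are taken from the list above; load_boston is skipped
# because it is not available in recent scikit-learn releases.
loaders = {
    "iris": datasets.load_iris,
    "digits": datasets.load_digits,
    "wine": datasets.load_wine,
    "breast_cancer": datasets.load_breast_cancer,
    "diabetes": datasets.load_diabetes,
}

shapes = {}
for name, loader in loaders.items():
    X, y = loader(return_X_y=True)  # (data, target) tuple
    shapes[name] = X.shape
    print(name, X.shape, y.shape)
```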
Below, the iris flower dataset and the digits dataset of handwritten digits are used as examples for studying classification algorithms.
The session below was run in the python interpreter (Python 3.6.6, scikit-learn v0.19.2, NumPy 1.15.1):
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()
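A dataset is a dictionary-like object that holds all the data together with some metadata about it. The data itself is stored in the .data member, an array of shape n_samples × n_features; the target values are stored in the .target member, an array of length n_samples. When a loader is called with return_X_y=True, it returns the tuple (X, y) instead, where X is the data array and y the target array. The source on GitHub shows that both the iris and the digits datasets are shipped as CSV files: https://github.com/scikit-learn/scikit-learn/tree/master/sklearn/datasets/data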
>>> print(digits.data)
[[ 0. 0. 5. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 10. 0. 0.]
[ 0. 0. 0. ..., 16. 9. 0.]
...,
[ 0. 0. 1. ..., 6. 0. 0.]
[ 0. 0. 2. ..., 12. 0. 0.]
[ 0. 0. 10. ..., 12. 1. 0.]]
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])
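The two ways of getting at the data described above can be compared directly. This is a small sketch, assuming a scikit-learn version where the loaders accept return_X_y (it is in the signatures listed earlier); the variable names same_object and same_values are only for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()               # dictionary-like object
X, y = datasets.load_digits(return_X_y=True)  # plain (data, target) tuple

print(sorted(digits.keys()))  # available fields; the exact set varies by version
print(digits.data.shape)      # (n_samples, n_features)
print(digits.target.shape)    # (n_samples,)

# Attribute access and key access refer to the same array.
same_object = digits.data is digits["data"]
# The tuple form carries the same values as the .data and .target members.
same_values = np.array_equal(X, digits.data) and np.array_equal(y, digits.target)
print(same_object, same_values)
```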
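The iris data is an array of shape n_samples × n_features. In the digits case, each original sample is an 8×8 image, and digits.images keeps the samples in that original shape: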
>>> digits.images[0]
array([[ 0., 0., 5., 13., 9., 1., 0., 0.],
[ 0., 0., 13., 15., 10., 15., 5., 0.],
[ 0., 3., 15., 2., 0., 11., 8., 0.],
[ 0., 4., 12., 0., 0., 8., 8., 0.],
[ 0., 5., 8., 0., 0., 9., 8., 0.],
[ 0., 4., 11., 0., 1., 12., 7., 0.],
[ 0., 2., 14., 5., 10., 12., 0., 0.],
[ 0., 0., 6., 13., 10., 0., 0., 0.]])
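The relation between the 8×8 images and the n_samples × n_features data array can be checked with a reshape. The sketch below flattens every image into a row of 64 values and compares the result with digits.data; the variable name flat is just for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
n_samples = len(digits.images)

# Flatten each 8x8 image into one row of 64 pixel intensities.
flat = digits.images.reshape((n_samples, -1))

print(digits.images.shape, "->", flat.shape)
print(np.array_equal(flat, digits.data))
```

Estimators expect the flattened two-dimensional form, which is why the classifier later in this article is fitted on digits.data and not on digits.images.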
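3. Learning and predicting
In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T). One such estimator is sklearn.svm.SVC, which implements support vector classification. The algorithm behind it is the support vector machine; here it is wrapped behind the estimator interface and can be treated as a black box.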
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)
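To show what "an object with fit(X, y) and predict(T)" means without any SVM machinery, here is a toy sketch. MajorityClassifier is a hypothetical class written for this article, not part of scikit-learn: it only remembers the most frequent label and predicts it for every input.

```python
import numpy as np


class MajorityClassifier:
    """Hypothetical toy estimator: always predicts the most common label."""

    def fit(self, X, y):
        # "Learning" here is just counting the labels in the training set.
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self  # returning self allows fit(...).predict(...) chaining

    def predict(self, T):
        # One prediction per row of T, all equal to the stored label.
        return np.full(len(T), self.majority_)


toy = MajorityClassifier().fit([[0], [1], [2]], [1, 1, 0])
pred = toy.predict([[5], [6]])
print(pred)
```

svm.SVC follows the same calling pattern; only what happens inside fit and predict is far more involved.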
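Choosing the model parameters
Parameters can be set by hand, as with gamma above, or good values can be found automatically with tools such as grid search and cross validation. The estimator instance clf (short for classifier) must first be fitted, that is, it must learn from the training data. This is done with the fit method.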
>>> clf.fit(digits.data[:-1], digits.target[:-1])
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
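Now we can predict new values. The classifier tells us which digit the last image in the digits dataset shows; that image was left out when the classifier was trained.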
>>> clf.predict(digits.data[-1:])
array([8])
![Image returned by the classifier](http://scikit-learn.org/stable/_images/sphx_glr_plot_digits_last_image_001.png)
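The section above mentions that parameters such as gamma can also be found with grid search and cross validation. A minimal sketch of that idea uses GridSearchCV from sklearn.model_selection. The candidate values in param_grid and the choice of 3 folds are my own assumptions for illustration, not recommended settings.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()

# Candidate values to try; every combination is scored by cross validation.
param_grid = {
    "gamma": [1e-4, 1e-3, 1e-2],
    "C": [1.0, 10.0, 100.0],
}

search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(digits.data, digits.target)

print(search.best_params_)           # combination with the best mean score
print(round(search.best_score_, 3))  # mean cross-validated accuracy
```

After fitting, search.best_estimator_ is an SVC refitted on all the data with the selected parameters and can be used like clf above.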
4. Saving the model
A model can be saved with pickle, Python's built-in persistence module.
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0
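In the specific case of scikit-learn, it may be better to replace pickle with joblib.dump and joblib.load. joblib is more efficient on large data, but it can only persist to disk and cannot serialize to a string the way pickle.dumps does. joblib.dump and joblib.load also accept file-like objects in place of file names.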
>>> from sklearn.externals import joblib
>>> joblib.dump(clf, 'filename.pkl')
>>> clf = joblib.load('filename.pkl')
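In newer scikit-learn releases the sklearn.externals.joblib import is no longer available, and joblib is imported as a package of its own. The sketch below assumes such a setup. It saves the model to a temporary file (the name model.joblib is arbitrary) and also round-trips it through an in-memory file-like object.

```python
import io
import os
import tempfile

import joblib
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC().fit(X, y)

# Persist to a file on disk and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(clf, path)
    from_disk = joblib.load(path)

# The same calls also accept a file-like object instead of a file name.
buffer = io.BytesIO()
joblib.dump(clf, buffer)
buffer.seek(0)
from_buffer = joblib.load(buffer)

same_disk = (from_disk.predict(X) == clf.predict(X)).all()
same_buffer = (from_buffer.predict(X) == clf.predict(X)).all()
print(same_disk, same_buffer)
```

As with pickle, a saved model should only be loaded from a trusted source, and preferably with the same scikit-learn version that wrote it.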
5. Conventions
5.1 Type casting
Unless otherwise specified, input is cast to float64:
>>> import numpy as np
>>> from sklearn import random_projection
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(10, 2000)
>>> X = np.array(X, dtype='float32')
>>> X.dtype
dtype('float32')
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.dtype
dtype('float64')
In the example above, X is float32 and is cast to float64 by fit_transform(X).
Regression targets are cast to float64, while classification targets are kept as they are.
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> iris = datasets.load_iris()
>>> clf = SVC()
>>> clf.fit(iris.data, iris.target)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> list(clf.predict(iris.data[:3]))
[0, 0, 0]
>>> clf.fit(iris.data, iris.target_names[iris.target])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> list(clf.predict(iris.data[:3]))
['setosa', 'setosa', 'setosa']
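The two fits above differ only in the type of the target array, and predict() follows whatever was passed to fit(). The following sketch makes this visible by inspecting dtypes. The small LinearRegression example at the end is my own addition, used to illustrate the statement about regression targets.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

iris = datasets.load_iris()
clf = SVC()

# Integer targets in, integer predictions out.
int_pred = clf.fit(iris.data, iris.target).predict(iris.data[:3])

# String targets in, string predictions out.
str_pred = clf.fit(iris.data, iris.target_names[iris.target]).predict(iris.data[:3])

print(int_pred.dtype.kind, str_pred.dtype.kind)  # 'i' = integer, 'U' = unicode string

# Regression: integer targets still give float64 predictions.
X_reg = np.array([[0.0], [1.0], [2.0]])
y_reg = np.array([0, 1, 2])
reg_pred = LinearRegression().fit(X_reg, y_reg).predict(X_reg)
print(reg_pred.dtype)
```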
5.2 Refitting and updating parameters
>>> import numpy as np
>>> from sklearn.svm import SVC
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(100, 10)
>>> y = rng.binomial(1, 0.5, 100)
>>> X_test = rng.rand(5, 10)
>>> clf = SVC()
>>> clf.set_params(kernel='linear').fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> clf.predict(X_test)
array([1, 0, 1, 1, 0])
>>> clf.set_params(kernel='rbf').fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> clf.predict(X_test)
array([0, 0, 0, 1, 0])
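What set_params changes can be inspected with get_params. This sketch repeats the session above on the same random data and records the kernel at each step; the variable names are only for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(100, 10)
y = rng.binomial(1, 0.5, 100)

clf = SVC()
default_kernel = clf.get_params()["kernel"]  # value from the constructor

# Change a hyperparameter after construction, then fit.
clf.set_params(kernel="linear").fit(X, y)
linear_kernel = clf.get_params()["kernel"]
linear_coef_shape = clf.coef_.shape          # coef_ is defined for the linear kernel

# Calling fit again discards what was learned with the linear kernel.
clf.set_params(kernel="rbf").fit(X, y)
rbf_kernel = clf.get_params()["kernel"]

print(default_kernel, linear_kernel, linear_coef_shape, rbf_kernel)
```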
5.3 Multiclass vs. multilabel fitting
>>> from sklearn.svm import SVC
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.preprocessing import LabelBinarizer
>>> X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
>>> y = [0, 0, 1, 1, 2]
>>> classif = OneVsRestClassifier(estimator=SVC(random_state=0))
>>> classif.fit(X, y).predict(X)
array([0, 0, 1, 1, 2])
>>> y = LabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)
array([[1, 0, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 0],
[0, 0, 0]])
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
>>> y = MultiLabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)
array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 1, 0],
[0, 0, 1, 0, 1]])
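The session above depends on what the two binarizers do to the targets before fitting, so it helps to look at them in isolation. This sketch only transforms the labels and fits no classifier; the variable names are my own.

```python
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

# Multiclass: one label per sample becomes a one-hot row.
y_multiclass = [0, 0, 1, 1, 2]
one_hot = LabelBinarizer().fit_transform(y_multiclass)
print(one_hot)

# Multilabel: a set of labels per sample becomes an indicator row
# with one column for every label seen during fit.
y_multilabel = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(y_multilabel)
print(mlb.classes_)
print(indicator)

# inverse_transform recovers the label sets from the indicator matrix.
recovered = mlb.inverse_transform(indicator)
print(recovered)
```

Comparing one_hot with the first binarized prediction above shows that the last two rows predicted by the classifier are all zeros, meaning no class was assigned to those samples.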
GitHub scikit-learn: a look at part of the SVM source code:
https://github.com/scikit-learn/scikit-learn
scikit-learn/sklearn/svm/__init__.py
"""
The :mod:`sklearn.svm` module includes Support Vector Machine algorithms.
"""
# See http://scikit-learn.sourceforge.net/modules/svm.html for complete
# documentation.
# Author: Fabian Pedregosa <fabian.pedregosa@inria.fr> with help from
# the scikit-learn community. LibSVM and LibLinear are copyright
# of their respective owners.
# License: BSD 3 clause (C) INRIA 2010
from .classes import SVC, NuSVC, SVR, NuSVR, OneClassSVM, LinearSVC, \
        LinearSVR
from .bounds import l1_min_c
from . import libsvm, liblinear, libsvm_sparse
__all__ = ['LinearSVC',
           'LinearSVR',
           'NuSVC',
           'NuSVR',
           'OneClassSVM',
           'SVC',
           'SVR',
           'l1_min_c',
           'liblinear',
           'libsvm',
           'libsvm_sparse']
scikit-learn/sklearn/svm/setup.py
import os
from os.path import join
import numpy
from sklearn._build_utils import get_blas_info
def configuration(parent_package='', top_path=None):
    from numpy.distutils.misc_util import Configuration
    config = Configuration('svm', parent_package, top_path)
    config.add_subpackage('tests')
    # Section LibSVM
    # we compile both libsvm and libsvm_sparse
    config.add_library('libsvm-skl',
                       sources=[join('src', 'libsvm', 'libsvm_template.cpp')],
                       depends=[join('src', 'libsvm', 'svm.cpp'),
                                join('src', 'libsvm', 'svm.h')],
                       # Force C++ linking in case gcc is picked up instead
                       # of g++ under windows with some versions of MinGW
                       extra_link_args=['-lstdc++'],
                       )
    libsvm_sources = ['libsvm.pyx']
    libsvm_depends = [join('src', 'libsvm', 'libsvm_helper.c'),
                      join('src', 'libsvm', 'libsvm_template.cpp'),
                      join('src', 'libsvm', 'svm.cpp'),
                      join('src', 'libsvm', 'svm.h')]
    config.add_extension('libsvm',
                         sources=libsvm_sources,
                         include_dirs=[numpy.get_include(),
                                       join('src', 'libsvm')],
                         libraries=['libsvm-skl'],
                         depends=libsvm_depends,
                         )
    # liblinear module
    cblas_libs, blas_info = get_blas_info()
    if os.name == 'posix':
        cblas_libs.append('m')
    liblinear_sources = ['liblinear.pyx',
                         join('src', 'liblinear', '*.cpp')]
    liblinear_depends = [join('src', 'liblinear', '*.h'),
                         join('src', 'liblinear', 'liblinear_helper.c')]
    config.add_extension('liblinear',
                         sources=liblinear_sources,
                         libraries=cblas_libs,
                         include_dirs=[join('..', 'src', 'cblas'),
                                       numpy.get_include(),
                                       blas_info.pop('include_dirs', [])],
                         extra_compile_args=blas_info.pop('extra_compile_args',
                                                          []),
                         depends=liblinear_depends,
                         # extra_compile_args=['-O0 -fno-inline'],
                         ** blas_info)
    # end liblinear module
    # this should go *after* libsvm-skl
    libsvm_sparse_sources = ['libsvm_sparse.pyx']
    config.add_extension('libsvm_sparse', libraries=['libsvm-skl'],
                         sources=libsvm_sparse_sources,
                         include_dirs=[numpy.get_include(),
                                       join("src", "libsvm")],
                         depends=[join("src", "libsvm", "svm.h"),
                                  join("src", "libsvm",
                                       "libsvm_sparse_helper.c")])
    return config
if __name__ == '__main__':
    from numpy.distutils.core import setup
    setup(**configuration(top_path='').todict())
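In the last line, configuration() is called with the top-level path, its result is converted to a dictionary with todict(), and ** unpacks that dictionary into keyword arguments for setup().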
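The ** unpacking used in that call is plain Python and can be tried without numpy.distutils. In the sketch below, fake_setup and the settings dictionary are hypothetical stand-ins invented for illustration; they do not reproduce what the real setup() does.

```python
def fake_setup(name, version="0.0", **options):
    """Hypothetical stand-in for setup(): just reports what it received."""
    return {"name": name, "version": version, "options": options}


settings = {"name": "svm", "version": "0.1", "packages": ["tests"]}

# fake_setup(**settings) is equivalent to
# fake_setup(name="svm", version="0.1", packages=["tests"])
result = fake_setup(**settings)
print(result)
```

Keys that match a named parameter are bound to it, and the remaining keys are collected in options.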
scikit-learn/sklearn/svm/base.py
from __future__ import print_function
import numpy as np
import scipy.sparse as sp
import warnings
from abc import ABCMeta, abstractmethod
from . import libsvm, liblinear
from . import libsvm_sparse
from ..base import BaseEstimator, ClassifierMixin
from ..preprocessing import LabelEncoder
from ..utils.multiclass import _ovr_decision_function
from ..utils import check_array, check_consistent_length, check_random_state
from ..utils import column_or_1d, check_X_y
from ..utils import compute_class_weight
from ..utils.extmath import safe_sparse_dot
from ..utils.validation import check_is_fitted, _check_large_sparse
from ..utils.multiclass import check_classification_targets
from ..externals import six
from ..exceptions import ConvergenceWarning
from ..exceptions import NotFittedError
LIBSVM_IMPL = ['c_svc', 'nu_svc', 'one_class', 'epsilon_svr', 'nu_svr']
def _one_vs_one_coef(dual_coef, n_support, support_vectors):
    """Generate primal coefficients from dual coefficients
    for the one-vs-one multi class LibSVM in the case
    of a linear kernel."""
    # get 1vs1 weights for all n*(n-1) classifiers.
    # this is somewhat messy.
    # shape of dual_coef_ is nSV * (n_classes -1)
    # see docs for details
    n_class = dual_coef.shape[0] + 1
    # XXX we could do preallocation of coef but
    # would have to take care in the sparse case
    coef = []
    sv_locs = np.cumsum(np.hstack([[0], n_support]))
    for class1 in range(n_class):
        # SVs for class1:
        sv1 = support_vectors[sv_locs[class1]:sv_locs[class1 + 1], :]
        for class2 in range(class1 + 1, n_class):
            # SVs for class1:
            sv2 = support_vectors[sv_locs[class2]:sv_locs[class2 + 1], :]
            # dual coef for class1 SVs:
            alpha1 = dual_coef[class2 - 1, sv_locs[class1]:sv_locs[class1 + 1]]
            # dual coef for class2 SVs:
            alpha2 = dual_coef[class1, sv_locs[class2]:sv_locs[class2 + 1]]
            # build weight for class1 vs class2
            coef.append(safe_sparse_dot(alpha1, sv1)
                        + safe_sparse_dot(alpha2, sv2))
    return coef
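The loop in _one_vs_one_coef can be retraced on a fitted model using only public attributes. This is a sketch under my own assumptions, not the library's code path: it fits a linear-kernel SVC on iris (3 classes), rebuilds one weight vector per pair of classes from dual_coef_, n_support_ and support_vectors_, and compares the result with clf.coef_.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
clf = SVC(kernel="linear").fit(X, y)

n_class = len(clf.classes_)
# Support vectors are stored class by class; these are the block boundaries.
sv_locs = np.cumsum(np.hstack([[0], clf.n_support_]))

rows = []
for c1 in range(n_class):
    sv1 = clf.support_vectors_[sv_locs[c1]:sv_locs[c1 + 1]]
    for c2 in range(c1 + 1, n_class):
        sv2 = clf.support_vectors_[sv_locs[c2]:sv_locs[c2 + 1]]
        # Dual coefficients of each class's support vectors for this pair.
        alpha1 = clf.dual_coef_[c2 - 1, sv_locs[c1]:sv_locs[c1 + 1]]
        alpha2 = clf.dual_coef_[c1, sv_locs[c2]:sv_locs[c2 + 1]]
        # Primal weight vector of the "c1 vs c2" classifier.
        rows.append(alpha1 @ sv1 + alpha2 @ sv2)

manual_coef = np.vstack(rows)
print(manual_coef.shape)                    # one row per pair of classes
print(np.allclose(manual_coef, clf.coef_))
```

With 3 classes the loop produces 3 rows, that is n_class * (n_class - 1) / 2 pairwise classifiers, each with one weight per feature.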
class BaseLibSVM(six.with_metaclass(ABCMeta, BaseEstimator)):
    """Base class for estimators that use libsvm as backing library
    This implements support vector machine classification and regression.
    Parameter documentation is in the derived `SVC` class.
    """
    # The order of these must match the integer values in LibSVM.
    # XXX These are actually the same in the dense case. Need to factor
    # this out.
    _sparse_kernels = ["linear", "poly", "rbf", "sigmoid", "precomputed"]
    @abstractmethod