These are my notes from learning sklearn; if anything here is misunderstood, corrections are welcome and much appreciated.
1. First, an introduction to the modules involved in this example
1.1 pipeline.Pipeline
http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline
http://scikit-learn.org/stable/modules/pipeline.html#pipeline
Pipeline can be used to chain multiple estimators together. For workflows with a fixed processing order (such as the feature extraction, normalization, and classification in this article), it is quite handy. The official docs describe two main benefits:
- Convenience: chain multiple estimators together and call fit and predict just once
- Joint parameter selection: grid search over the parameters of all estimators at once (http://scikit-learn.org/stable/modules/grid_search.html#grid-search)
Note that in a Pipeline, all steps except the last must be transformers; the last step can be of any type (transformer, classifier, etc.). Step parameters can be set using '__' as a separator, in the form <estimator>__<parameter>, for example:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('svm', SVC())]
clf = Pipeline(estimators)
clf.set_params(svm__C=10)
As the example shows, Pipeline takes a list as its argument, where each element is a (key, value) tuple.
After that, clf can be used like any other estimator, with fit, predict, and so on.
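A minimal sketch of this: once the steps are wrapped in a Pipeline, one fit call fits every step in order, and one predict call runs transform on each intermediate step before predicting. The iris dataset here is only an illustration, not part of the snippet above.

```python
# A minimal sketch: using a Pipeline as a single estimator.
# The iris dataset is used purely for illustration.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

clf = Pipeline([('reduce_dim', PCA(n_components=2)), ('svm', SVC())])
clf.fit(X, y)              # fits PCA, transforms X, then fits the SVC on the result
pred = clf.predict(X[:5])  # transform + predict in one call
print(pred.shape)          # (5,)
```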
When searching for the best parameters, they are likewise addressed with the <estimator>__<parameter> syntax, as the full example in section 2 shows.
1.2 pipeline.FeatureUnion
FeatureUnion is much like Pipeline, except that it composes only transformers, and it applies them in parallel, concatenating their outputs. In this article, for example, PCA produces 2 features and SelectKBest produces 1, so the final output has 3 features. Note that the concatenated features may overlap: the same original feature can end up represented more than once.
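The feature counts above can be checked directly: a sketch, using the same PCA-plus-SelectKBest union as the full example below, showing that 2 PCA components and 1 selected feature concatenate into 3 output columns.

```python
# Sketch: FeatureUnion applies its transformers in parallel and
# concatenates their outputs column-wise.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

iris = load_iris()
X, y = iris.data, iris.target  # X has shape (150, 4)

union = FeatureUnion([("pca", PCA(n_components=2)),
                      ("univ_select", SelectKBest(k=1))])
X_features = union.fit(X, y).transform(X)
print(X_features.shape)  # (150, 3): 2 PCA components + 1 selected feature
```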
2. The example itself
- use FeatureUnion to combine the feature-processing transformers
- use Pipeline to chain the transformers and the estimator
- use GridSearchCV to select the parameters
# Author: Andreas Mueller <amueller@ais.uni-bonn.de>
# Copied by: IKoala
# License: BSD 3 clause
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
# SelectKBest's default score function is f_classif; others such as chi2 can be used
iris = load_iris()
X, y = iris.data, iris.target
# This dataset is way too high-dimensional. Better do PCA:
# principal component analysis, keeping two components
pca = PCA(n_components=2)
# Maybe some original features were good, too?
# select one feature using the default f_classif
selection = SelectKBest(k=1)
# Build estimator from PCA and Univariate selection:
# combine into a set of three features
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
# Use combined features to transform dataset:
# transform the original features into these three
X_features = combined_features.fit(X, y).transform(X)
svm = SVC(kernel="linear")
# Do grid search over k, n_components and C:
pipeline = Pipeline([("features", combined_features), ("svm", svm)])
# search for the best parameters
param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
The output is:
Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('pca', PCA(copy=True, n_components=2, whiten=False)), ('univ_select', SelectKBest(k=2, score_func=<function f_classif at 0x0000000006BC5E48>))],
       transformer_weights=None)), ('svm', SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
As the output shows, the parameters have been selected: two PCA components, k=2 for SelectKBest, and C=1 for the SVM.
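The full estimator repr above is verbose; the fitted GridSearchCV also exposes the chosen parameters more directly. A small self-contained sketch (assuming a modern sklearn where GridSearchCV lives in sklearn.model_selection rather than the old sklearn.grid_search, and using a reduced grid to keep it quick):

```python
# Sketch: reading the selected parameters off a fitted GridSearchCV.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

pipeline = Pipeline([
    ("features", FeatureUnion([("pca", PCA()), ("univ_select", SelectKBest())])),
    ("svm", SVC(kernel="linear")),
])
# a reduced grid, keyed with the <step>__<substep>__<parameter> syntax
param_grid = {"features__pca__n_components": [1, 2],
              "features__univ_select__k": [1, 2],
              "svm__C": [1, 10]}
grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X, y)
print(grid_search.best_params_)  # dict of the winning parameter combination
print(grid_search.best_score_)   # its mean cross-validated accuracy
```

best_params_ is usually easier to report than the full best_estimator_ repr, since it lists only the parameters that were actually searched over.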