sklearn学习-SVM例程总结2(特征选择——单因素方差分析(方差分析anova ))

SVM with univariate feature selection(单因素方差分析)

本文隶属于机器学习的特征选择部分,是训练前对数据的预处理部分。对于机器学习而言,特征选择是影响结果的极其重要的组成部分。这部分内容涉及到数理统计的内容,尤其是方差分析。当然,这些只是我这几天查到的,其背后的知识还有很多,以后的学习会不断加深这方面的理解,这里只对这个例程以及背后的特征选择做个初步总结。
例程和总结如下:

SVM-Anova

"""
=================================================
SVM-Anova: SVM with univariate feature selection
=================================================

This example shows how to perform univariate feature selection before running a
SVC (support vector classifier) to improve the classification scores.
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets, feature_selection
from sklearn.cross_validation import cross_val_score#交叉验证
from sklearn.pipeline import Pipeline

###############################################################################
# Import some data to play with
digits = datasets.load_digits()
y = digits.target
# Throw away data, to be in the curse of dimension settings
y = y[:200]
X = digits.data[:200]
n_samples = len(y)
X = X.reshape((n_samples, -1))
# add 200 non-informative features
X = np.hstack((X, 2 * np.random.random((n_samples, 200))))

###############################################################################
# Create a feature-selection transform and an instance of SVM that we
# combine together to have an full-blown estimator
#feature_selection.f_classif:计算所提供样本的方差分析F-值anova:方差分析
#feature_selection.SelectPercentile(k):只留下k值最高的一组特征,返回最终的估计器
transform = feature_selection.SelectPercentile(feature_selection.f_classif)
#anova:Analysis of Variance(方差分析)
#http://baike.baidu.com/link?url=8ufVQvD2KZrWbS3VvvuhYDfw3dk8nSD84QRUNB1P864
#rW8XKSw6-P4-xGIHVkAEBHUIjQGFhFsPtQhazMQrUVmcAqLVDBkQKVXSb3MPq92QFhPaPmVyEgsMNF
#ZJ_p1B-QyQ-tHMQKFJB_recu1qG9nDDpfdDbwMAomoktviOFca
clf = Pipeline([('anova', transform), ('svc', svm.SVC(C=1.0))])

###############################################################################
# Plot the cross-validation score as a function of percentile of features
score_means = list()
score_stds = list()
percentiles = (1, 3, 6, 10, 15, 20, 30, 40, 60, 80, 100)

for percentile in percentiles:
    #clf.set_params:设置此估计器的参数。
    #使用网格搜索(grid search)和交叉验证(cross validation)来选择参数.
    #对方差分析中的参数percentile进行调节,实现多重比较检验
    #用于确定控制变量的不同水平对观测变量的影响程度如何
    clf.set_params(anova__percentile=percentile)
    # Compute cross-validation score using 1 CPU
    #http://scikit-learn.org/dev/modules/generated/sklearn.model_selection.
    #cross_val_score.html#sklearn.model_selection.cross_val_score
    #cross_val_score:最简单的交叉验证方法,cv选择折数,默认是3折交叉验证
    this_scores = cross_val_score(clf, X, y, n_jobs=1)
    score_means.append(this_scores.mean())
    score_stds.append(this_scores.std())
#plt.errorbar以折线形式画出均值和方差
plt.errorbar(percentiles, score_means, np.array(score_stds))

plt.title(
    'Performance of the SVM-Anova varying the percentile of features selected')
plt.xlabel('Percentile')
plt.ylabel('Prediction rate')

plt.axis('tight')
plt.show()

运行结果
参考链接:
方差分析
sklearn.cross_val_score
sklearn.Feature selection
sklearn.pipline
CSDN.pipline
numpy.random
sklearn.UserGuide
np.hstack
特征选择
CSDN.feature_selection

  • 0
    点赞
  • 9
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值