Abstract: a first look at feature selection in sklearn. We start from an overview diagram, then write a demo, and finally give several usages for the discrete case. Recorded as an introductory study note.
0. Learning feature selection in sklearn
http://scikit-learn.org/stable/modules/feature_selection.html#removing-features-with-low-variance
1. Feature selection
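sklearn groups its selectors into a few families: variance-based filtering (VarianceThreshold), univariate statistical tests (SelectKBest / SelectPercentile with f_classif, chi2, or mutual_info_classif), recursive feature elimination (RFE), and model-based selection (SelectFromModel). As a warm-up, here is a minimal sketch (my own toy example, not taken from the demo below) of the variance filter described at the link above:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy boolean data: the first column takes the same value in >80% of samples
X = np.array([[0, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0],
              [0, 1, 1]])

# For Bernoulli features, Var[X] = p * (1 - p); dropping features that are
# constant in more than 80% of samples means a threshold of 0.8 * (1 - 0.8)
sel = VarianceThreshold(threshold=.8 * (1 - .8))
X_new = sel.fit_transform(X)
print(sel.variances_)  # per-feature variances
print(X_new.shape)     # the near-constant first column is dropped -> (6, 2)
```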
2. Demo code
# coding=utf-8
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, svm
from sklearn.feature_selection import (SelectPercentile, f_classif, chi2,
                                       mutual_info_classif, VarianceThreshold, RFE)
from sklearn.svm import SVC

# #############################################################################
# Load the iris data, shape: 150 * 4
iris = datasets.load_iris()
# Add some noisy features that are not correlated with the target
# Noise shape: 150 * 6
np.random.seed(0)  # fix the seed so the run is reproducible
E = np.random.uniform(0, 0.1, size=(len(iris.data), 6))
# Stack the noise onto the iris features to build a 150 * 10 matrix
X = np.hstack((iris.data, E))
y = iris.target
# Set up the figure
fig = plt.figure(figsize=(8, 4))
ax = fig.add_subplot(111)
# Custom font for the labels (adjust the path to a font available on your system)
myfont = matplotlib.font_manager.FontProperties(fname="d:\\pyProject\\font\\SIMYOU.TTF")
# Feature indices (number of columns of X)
X_indices = np.arange(X.shape[-1])
# ##########################RFE###################################################
# Recursive feature elimination; any estimator exposing coef_ or
# feature_importances_ can be used here, a linear SVC is chosen
estimator = SVC(kernel="linear")
selector_svc_rfe = RFE(estimator, n_features_to_select=4, step=1)
selector = selector_svc_rfe.fit(X, y)
print(u'RFE-support:')
print(selector.support_)
print(u'RFE-ranking:')
print(selector.ranking_)
# #############################################################################
# Variance-based filtering: drop features whose variance is below the threshold
selector_vt = VarianceThreshold(threshold=0.06)
selector_vt.fit(X)  # VarianceThreshold ignores y
# Normalize the variances for plotting
scores = selector_vt.variances_
scores /= scores.max()
# Plot
ax.bar(X_indices - .55, scores, width=.1,
       label=r'Variance Value', color='y',
       edgecolor='black')
# #############################################################################
# Univariate selection scored by mutual information,
# keeping the top 50% of features
selector_mi = SelectPercentile(mutual_info_classif, percentile=50)
selector_mi.fit(X, y)
# Normalize the MI scores for plotting
scores = selector_mi.scores_
scores /= scores.max()
# Plot
ax.bar(X_indices - .45, scores, width=.1,
       label=r'MI score', color='gray',
       edgecolor='black')
# #############################################################################
# Univariate selection scored by the ANOVA F-test,
# keeping the top 50% of features
selector_ft = SelectPercentile(f_classif, percentile=50)
selector_ft.fit(X, y)
# Turn the p-values into scores and normalize
scores = -np.log10(selector_ft.pvalues_)
scores /= scores.max()
# Plot
ax.bar(X_indices - .35, scores, width=.1,
       label=r'F-test score ($-Log(p_{value})$)', color='darkorange',
       edgecolor='black')
# #############################################################################
# Univariate selection scored by the chi-squared test,
# keeping the top 50% of features
selector_chi = SelectPercentile(chi2, percentile=50)
selector_chi.fit(X, y)
# Turn the p-values into scores and normalize
scores = -np.log10(selector_chi.pvalues_)
scores /= scores.max()
# Plot
ax.bar(X_indices - .25, scores, width=.08,
       label=r'chi2 score ($-Log(p_{value})$)', color='red',
       edgecolor='black')
# #############################################################################
# Compare to the weights of an SVM trained on all features
clf = svm.SVC(kernel='linear')
clf.fit(X, y)
# Sum of squared coefficients per feature, then normalize
# (any classifier exposing coef_ can be used for this kind of weighting)
svm_weights = (clf.coef_ ** 2).sum(axis=0)
svm_weights /= svm_weights.max()
# Plot
ax.bar(X_indices - .15, svm_weights, width=.1, label='SVM weight',
       color='navy', edgecolor='black')
# ################################MI selection#############################################
clf_selected = svm.SVC(kernel='linear')
clf_selected.fit(selector_mi.transform(X), y)
# SVM weights on the MI-selected features
svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
svm_weights_selected /= svm_weights_selected.max()
# Plot
ax.bar(X_indices[selector_mi.get_support()] + .05, svm_weights_selected,
       width=.1, label='SVM weights after MI selection', color='b',
       edgecolor='black')
# ##############################F-test selection###############################################
clf_selected = svm.SVC(kernel='linear')
clf_selected.fit(selector_ft.transform(X), y)
# SVM weights on the F-test-selected features
svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
svm_weights_selected /= svm_weights_selected.max()
# Plot
ax.bar(X_indices[selector_ft.get_support()] - .05, svm_weights_selected,
       width=.1, label='SVM weights after F-test selection', color='c',
       edgecolor='black')
# #############################chi-2 selection################################################
clf_selected = svm.SVC(kernel='linear')
clf_selected.fit(selector_chi.transform(X), y)
# SVM weights on the chi-2-selected features
svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
svm_weights_selected /= svm_weights_selected.max()
# Plot
ax.bar(X_indices[selector_chi.get_support()] + .15, svm_weights_selected,
       width=.1, label='SVM weights after chi-2 selection', color='g',
       edgecolor='black')
# #############################VarianceThreshold################################################
clf_selected = svm.SVC(kernel='linear')
clf_selected.fit(selector_vt.transform(X), y)
# SVM weights on the variance-selected features
svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
svm_weights_selected /= svm_weights_selected.max()
# Plot
ax.bar(X_indices[selector_vt.get_support()] + .25, svm_weights_selected,
       width=.1, label='SVM weights after VarianceThreshold selection', color='deeppink',
       edgecolor='black')
# #############################svc_rfe################################################
clf_selected = svm.SVC(kernel='linear')
clf_selected.fit(selector_svc_rfe.transform(X), y)
# SVM weights on the RFE-selected features
svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
svm_weights_selected /= svm_weights_selected.max()
# Plot
ax.bar(X_indices[selector_svc_rfe.get_support()] + .35, svm_weights_selected,
       width=0.08, label='SVM weights after SVC RFE selection', color='gold',
       edgecolor='black')
# Final plot styling
ax.set_title("Feature selection comparison", fontproperties=myfont)
ax.set_xlabel("Feature index", fontproperties=myfont)
ax.set_xticks(X_indices)
plt.legend(loc='upper right')
plt.show()
3. Results
RFE-support:
[ True True True True False False False False False False]
RFE-ranking:
[1 1 1 1 7 3 5 2 4 6]
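The output shows that RFE keeps exactly the four real iris features (`support_` is True for the first four columns) and ranks all the noise columns behind them. In practice a selector is usually chained with the downstream classifier so that both are fitted and evaluated together; the sketch below (my own addition, reusing SelectPercentile with the F-test from the demo) wires them into a Pipeline and cross-validates the combination:

```python
from sklearn import datasets
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = datasets.load_iris()

# Selection + classification as one estimator: the selector is refitted
# on each training fold, so the test folds never leak into the selection
pipe = Pipeline([
    ('select', SelectPercentile(f_classif, percentile=50)),
    ('svc', SVC(kernel='linear')),
])
scores = cross_val_score(pipe, iris.data, iris.target, cv=5)
print(scores.mean())
```

The same pattern works with any of the selectors used above, including RFE.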
[Author: happyprince, http://blog.csdn.net/ld326/article/details/78874360]