Feature selection
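Feature selection matters, and I won't argue why here. Let's look at several commonly used methods.
Official documentation: https://scikit-learn.org/stable/modules/feature_selection.html
The sections below walk through several common feature selection methods and how to use them, since the official documentation mainly covers the details of each parameter and method.
1. Univariate feature selection
1.1. Using Chi-square (note that features must be numeric and non-negative)
1.2. Using Mutual Information
2. Recursive feature elimination (RFE)
3. Tree-based feature selection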
4. RFECV
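1. Univariate feature selection
Regression and classification call for different scoring functions. Using a regression scoring function on a classification task will not produce meaningful results.
If you are unsure how to tell the two apart, see the article 「机器学习_11」分类和回归 ("Machine Learning 11: Classification and Regression").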
- For regression: f_regression, mutual_info_regression
- For classification: chi2, f_classif, mutual_info_classif
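To make the regression side of that split concrete, here is a minimal sketch that pairs SelectKBest with f_regression. The synthetic dataset from make_regression and the choice of k=3 are assumptions made for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data (an assumption for illustration):
# 10 features, of which only 3 carry signal about the continuous target.
X_reg, y_reg = make_regression(n_samples=200, n_features=10,
                               n_informative=3, random_state=0)

# The target is continuous, so a regression scoring function is used.
selector = SelectKBest(score_func=f_regression, k=3)
X_reg_new = selector.fit_transform(X_reg, y_reg)

print(X_reg.shape)      # (200, 10)
print(X_reg_new.shape)  # (200, 3)
```

For a classification target, the same code would swap in chi2, f_classif, or mutual_info_classif.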
The two most common methods for classification are covered below: Mutual Information and Chi-square.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import mutual_info_classif
X, y = load_iris(return_X_y=True)
print(X.shape)
Output:
(150, 4)
First, a quick look at the class sklearn.feature_selection.SelectKBest:
- Purpose: Select features according to the k highest scores.
- Signature: class sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, *, k=10)
- Parameters: score_func is the scoring function to use; k is the number of features to keep (default 10).
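As a rough illustration of what SelectKBest exposes after fitting, the sketch below prints the per-feature scores and the names of the kept features on the iris data. The use of f_classif (the default score_func) and k=2 is an assumption chosen to mirror the examples in this article.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# scores_ holds one score per input feature.
for name, score in zip(iris.feature_names, selector.scores_):
    print(f"{name}: {score:.1f}")

# get_support() returns a boolean mask marking the k features that were kept.
mask = selector.get_support()
kept = [name for name, keep in zip(iris.feature_names, mask) if keep]
print(kept)
```

Looking at scores_ before settling on k can help judge whether there is a clear gap between strong and weak features.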
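1.1. Using Chi-square (note that features must be numeric and non-negative)
See the official documentation for the parameter details.
k=2 sets the number of features to select; you need to tune it yourself for your own data.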
# feature as count, should be non-negative
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)
Output:
(150, 2)
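After applying chi-square, X goes from 4 columns to 2.
Here is how to use it with separate training and test data:
# X_train: training features
# Y_train: training labels
# X_test: test features
# Fit on X_train and Y_train.
# fit() learns which features to keep; transform() applies that selection.
# fit_transform() fits on X_train and Y_train, then returns the transformed X_train.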
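The non-negativity requirement can be seen directly: chi2 raises a ValueError when any feature value is negative. The sketch below uses a small made-up array (hypothetical data) and shows one possible workaround, rescaling with MinMaxScaler. Whether rescaling is appropriate depends on the data, since chi2 is designed for counts or frequencies.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): the first column contains negative values.
X_neg = np.array([[-1.0, 2.0, 0.0],
                  [0.5, 1.0, 3.0],
                  [-2.0, 4.0, 1.0],
                  [1.5, 0.0, 2.0]])
y_toy = np.array([0, 1, 0, 1])

try:
    SelectKBest(chi2, k=2).fit(X_neg, y_toy)
    raised = False
except ValueError as err:
    raised = True
    print("chi2 rejected the input:", err)

# One possible workaround: rescale every feature into [0, 1] first.
X_scaled = MinMaxScaler().fit_transform(X_neg)
X_sel = SelectKBest(chi2, k=2).fit_transform(X_scaled, y_toy)
print(X_sel.shape)  # (4, 2)
```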
model = SelectKBest(chi2, k=2)
X_train_new = model.fit_transform(X_train, Y_train)
print(X_train_new.shape)
# Apply the same transformation to the test data; otherwise the training and test feature sets will not match
X_test_new = model.transform(X_test)
print(X_test_new.shape)
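The snippet above assumes X_train, Y_train, and X_test already exist. A self-contained version might look like the following sketch, where the iris data, the 70/30 split, and random_state=0 are assumptions added so the code runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the selector on the training data only.
selector = SelectKBest(chi2, k=2)
X_train_new = selector.fit_transform(X_train, Y_train)

# Reuse the fitted selector so the test set keeps the same two columns.
X_test_new = selector.transform(X_test)

print(X_train_new.shape)  # (105, 2)
print(X_test_new.shape)   # (45, 2)
```

Fitting only on the training split keeps information from the test labels out of the selection step.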
1.2. Using Mutual Information
# mutual information is estimated with the natural log (units of nats)
X_new = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)
print(X_new.shape)
Output:
(150, 2)
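Because mutual_info_classif uses a randomized nearest-neighbor estimate for continuous features, the scores can vary slightly between runs. The sketch below fixes random_state for repeatability; wrapping the function with functools.partial is just one way to pass that argument through SelectKBest, chosen here for illustration.

```python
from functools import partial

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Calling the scoring function directly shows one estimate per feature.
mi = mutual_info_classif(X, y, random_state=0)
print(mi)

# partial() pins random_state so SelectKBest gets a repeatable score_func.
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func, k=2).fit(X, y)
print(selector.get_support())
```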
1.3 Example
# Import the necessary libraries first
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# Summarize scores
np.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# Summarize selected features
print(features[0:5,:])
Output:
[ 111.52 1411.887 17.605 53.108 2175.565 127.669 5.393 181.304]
[[148. 0. 33.6 50. ]
[ 85. 0. 26.6 31. ]
[183. 0. 23.3 32. ]
[ 89. 94. 28.1 21. ]
[137. 168. 43.1 33. ]]
Reference: https://www.datacamp.com/community/tutorials/feature-selection-python
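The printed scores are easier to read when paired with the column names. The sketch below copies the scores from the output above and ranks them; it does not refit anything, and the variable names are chosen here for illustration.

```python
import numpy as np

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# Scores copied from the printed output above.
scores = np.array([111.52, 1411.887, 17.605, 53.108,
                   2175.565, 127.669, 5.393, 181.304])

# Sort feature indices from highest to lowest score.
order = np.argsort(scores)[::-1]
top4 = [names[i] for i in order[:4]]
print(top4)  # ['test', 'plas', 'age', 'mass']
```

The transformed array keeps the selected columns in their original order (plas, test, mass, age), which matches the first rows printed above.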
2. Recursive feature elimination (RFE)
This example uses a LogisticRegression model.
# Import your necessary dependencies
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Feature extraction
model = LogisticRegression()
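# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(model, n_features_to_select=3)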
fit = rfe.fit(X, Y)
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
print("Feature Ranking: %s" % (fit.ranking_))
Output:
Num Features: 3
Selected Features: [ True False False False False True True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
Then use fit_transform on the training set and transform on the test set, as before.
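A minimal sketch of that train/test workflow with RFE follows. The synthetic data from make_classification, the default 75/25 split, and max_iter=1000 (added to avoid convergence warnings) are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit RFE on the training split only, then reuse it on the test split.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
X_train_new = rfe.fit_transform(X_train, y_train)
X_test_new = rfe.transform(X_test)

print(X_train_new.shape)  # (225, 3)
print(X_test_new.shape)   # (75, 3)
```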
3. Tree-based feature selection
This approach is built on decision trees.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)
X.shape
Output:
(150, 4)
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
clf.feature_importances_
feature_importances_ gives the importance of each feature, returned as an array.
Output:
array([ 0.04..., 0.05..., 0.4..., 0.4...])
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X_new.shape
Output:
(150, 2)
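Clearly, we should keep the last two features.
This approach actually relies on SelectFromModel, which differs somewhat from RFE and from the RFECV method introduced below. Here is how:
Both effectively try to reach the same result, but each technique goes about it slightly differently. RFE removes the least important features iteratively: it drops some unimportant features and refits, then drops and refits again, repeating until the requested number of features remains.
SelectFromModel is somewhat less robust, because it simply removes the less important features according to a threshold passed as a parameter. No iteration is involved.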
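The difference can be sketched side by side on the iris data. The threshold="median" setting, random_state=0, and n_features_to_select=2 are assumptions picked for illustration; the two methods may or may not agree on other datasets.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectFromModel

X, y = load_iris(return_X_y=True)

# SelectFromModel: fit once, keep features whose importance >= threshold.
sfm = SelectFromModel(ExtraTreesClassifier(n_estimators=50, random_state=0),
                      threshold="median").fit(X, y)
print(sfm.get_support())

# RFE: refit repeatedly, dropping the weakest feature each round (step=1)
# until only n_features_to_select remain.
rfe = RFE(ExtraTreesClassifier(n_estimators=50, random_state=0),
          n_features_to_select=2, step=1).fit(X, y)
print(rfe.support_)
```

With SelectFromModel the number of kept features follows from the threshold, while with RFE it is set explicitly.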
4. RFECV
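Section 2 covered RFE. RFECV is essentially the same, except that it adds cross-validation to choose the number of features.
In the run shown below, test accuracy also improves after selection.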
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification
# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
n_redundant=2, n_repeated=0, n_classes=8,
n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y)
estimator = tree.DecisionTreeClassifier(max_depth=None)
# The "accuracy" scoring is proportional to the number of correct
# classifications
rfecv = RFECV(estimator=estimator, step=1, cv=StratifiedKFold(2),
scoring='accuracy')
rfecv.fit(X_train, y_train)
print("Optimal number of features : %d" % rfecv.n_features_)
# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
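# grid_scores_ was removed in scikit-learn 1.2; cv_results_ replaces it
mean_scores = rfecv.cv_results_["mean_test_score"]
plt.plot(range(1, len(mean_scores) + 1), mean_scores)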
plt.show()
The plot shows the cross-validation score for each number of selected features.
Compare the accuracy before and after selection:
estimator.fit(X_train, y_train)
print(X_train.shape)
print(estimator.score(X_test,y_test))
X_testnew=rfecv.transform(X_test)
X_trainnew=rfecv.transform(X_train)
print(X_trainnew.shape)
estimator.fit(X_trainnew, y_train)
print(estimator.score(X_testnew,y_test))
Output:
(750, 25)
0.668
(750, 3)
0.704
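To see which columns RFECV actually kept, the fitted object exposes support_ and ranking_. The sketch below rebuilds the same synthetic data and fits on all of it for brevity; random_state=0 on the tree is an assumption added for repeatability, so the selected columns may differ from the run above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

rfecv = RFECV(estimator=DecisionTreeClassifier(random_state=0), step=1,
              cv=StratifiedKFold(2), scoring="accuracy")
rfecv.fit(X, y)

# support_ marks the kept columns; ranking_ is 1 for every kept column
# and grows for columns that were eliminated earlier.
kept = [i for i, keep in enumerate(rfecv.support_) if keep]
print("Kept column indices:", kept)
print("Ranking:", rfecv.ranking_)
```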