「sklearn_2」Feature selection


Feature selection matters; I won't dwell on why here. Instead, let's look at several commonly used methods.

Official documentation: https://scikit-learn.org/stable/modules/feature_selection.html

Below I introduce several common feature selection methods and how to use them in practice; the official docs already cover the details of each parameter and method, so this post focuses on usage.

1. Univariate feature selection

   1.1 Chi-square (features must be numeric and non-negative)

   1.2 Mutual Information

2. Recursive feature elimination (RFE)

3. Tree-based feature selection

4. RFECV

 

 

1. Univariate feature selection

Regression and classification use different scoring functions; if you have a classification problem but apply a regression scorer, the selection will be meaningless.

If you are unsure how to tell the two apart, see this post: 「机器学习_11」分类和回归 (classification vs. regression).
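For the regression side, the scorers differ; here is a minimal sketch using f_regression on a regression dataset (this example is my addition, not from the rest of this post):

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

# load_diabetes is a regression dataset: 442 samples, 10 numeric features
X, y = load_diabetes(return_X_y=True)
X_new = SelectKBest(f_regression, k=4).fit_transform(X, y)
print(X_new.shape)  # (442, 4)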

Below are the two most common methods for classification: Mutual Information and Chi-square.

from sklearn.datasets import load_iris

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
print(X.shape)

Output:

(150, 4)

 

First, the function: sklearn.feature_selection.SelectKBest

  • Purpose: Select features according to the k highest scores.
  • Signature: class sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, *, k=10)
  • Parameters: score_func is the scoring function to use; k is the number of features to keep (default 10).

 

1.1 Chi-square (features must be numeric and non-negative)

See the official documentation for the full parameter details.

k=2 sets the number of features to select; the right value depends on your data and needs tuning (see the tuning sketch after the train/test example below).

# chi2 treats features as counts, so they must be non-negative
X_new = SelectKBest(chi2, k=2).fit_transform(X, y) 
print(X_new.shape)

Output:

(150, 2)

After chi-square selection, X goes from 4 columns down to 2.
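To see which columns were kept rather than just how many, you can inspect the fitted selector; a small sketch continuing the iris example above:

# fit the selector separately so it can be inspected afterwards
selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support())  # boolean mask over the 4 original columns
print(selector.scores_)        # the chi-square score of each feature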

 

Next, how to use it with a train/test split:

# X_train: training features
# Y_train: training labels
# X_test:  test features

# Train the selector on X_train and Y_train.
# fit learns the feature scores, transform applies the selection,
# and fit_transform does both, returning the transformed X_train.
model = SelectKBest(chi2, k=2)
X_train_new = model.fit_transform(X_train, Y_train)
print(X_train_new.shape)

# The test data must get the same transformation,
# otherwise train and test features would no longer match.
X_test_new = model.transform(X_test)
print(X_test_new.shape)
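As for tuning k: one common approach (a sketch of my own, not from the original tutorial) is to wrap the selector in a Pipeline and grid-search k with cross-validation:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("select", SelectKBest(chi2)),
                 ("clf", LogisticRegression(max_iter=1000))])
# try every candidate k and keep the one with the best CV accuracy
grid = GridSearchCV(pipe, {"select__k": [1, 2, 3, 4]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)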



 

1.2 Mutual Information

# mutual information here is estimated in nats (natural log)
X_new = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y) 
print(X_new.shape)

Output:

(150, 2)
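mutual_info_classif can also be called directly to inspect the scores themselves; a small sketch:

from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
# the estimate is nearest-neighbour based and stochastic,
# so fix random_state if you need reproducible scores
mi = mutual_info_classif(X, y, random_state=0)
print(mi)  # one score per feature, measured in nats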

 

1.3 A complete example

# Import the necessary libraries first
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)

# Summarize scores
np.set_printoptions(precision=3)
print(fit.scores_)

features = fit.transform(X)
# Summarize selected features
print(features[0:5,:])

Output:

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]
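A small follow-up sketch (my addition) to map the kept columns back to their names via get_support:

# continuing the example above: which of the 8 named columns survived?
mask = fit.get_support()
print([name for name, keep in zip(names[:-1], mask) if keep])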

 

Reference: https://www.datacamp.com/community/tutorials/feature-selection-python

 

2. Recursive feature elimination (RFE)

Here the estimator is a LogisticRegression model (X and Y carry over from the diabetes example in 1.3).

# Import your necessary dependencies
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Feature extraction
# Note: the default solver may warn about convergence on this data;
# raising max_iter or scaling the features silences it.
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)  # newer scikit-learn requires the keyword form
fit = rfe.fit(X, Y)
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
print("Feature Ranking: %s" % (fit.ranking_))

Output:

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]

Then, as before, use fit_transform on the training data and transform on the test data.
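A minimal sketch of that train/test workflow, reusing X and Y from above (the split itself is my addition):

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
X_train_new = rfe.fit_transform(X_train, Y_train)  # fit on the training split only
X_test_new = rfe.transform(X_test)                 # apply the same selection to test
print(X_train_new.shape, X_test_new.shape)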

 

3. Tree-based feature selection

An implementation based on tree ensembles (extremely randomized trees):

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)
X.shape

Output:

(150, 4)

clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
clf.feature_importances_

feature_importances_ shows how important each feature is, returned as an array:

Output:

array([ 0.04...,  0.05...,  0.4...,  0.4...])

model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X_new.shape

Output:

(150, 2)

Clearly, we should keep the last two features; they carry almost all of the importance.

This approach uses SelectFromModel, which differs somewhat from the RFE/RFECV approach covered in sections 2 and 4. Here is the difference:


Both techniques effectively aim for the same result, but each gets there in a slightly different way.

RFE removes the least important features iteratively: it fits the model, drops some of the weakest features, fits again, drops again, and repeats until the requested number of features remains.

SelectFromModel is somewhat less robust: it simply drops the features whose importance falls below a threshold given as a parameter. No iteration is involved.
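To make the threshold point concrete, a sketch with an explicit cutoff (the value 0.1 is purely illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
# keep only the features whose importance exceeds 0.1;
# the default (threshold=None) falls back to the mean importance
model = SelectFromModel(clf, prefit=True, threshold=0.1)
print(model.transform(X).shape)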

Reference: https://datascience.stackexchange.com/questions/23539/difference-between-rfe-and-selectfrommodel-in-scikit-learn

 

4. RFECV

Section 2 introduced RFE; RFECV is essentially the same, except that it adds cross-validation to choose the number of features automatically.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html#sklearn.feature_selection.RFECV

The accuracy comparison at the end also shows an improvement.

import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV

from sklearn.datasets import make_classification

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)



X_train, X_test, y_train, y_test = train_test_split(X, y)

estimator = tree.DecisionTreeClassifier(max_depth=None)

# The "accuracy" scoring is proportional to the number of correct
# classifications
rfecv = RFECV(estimator=estimator, step=1, cv=StratifiedKFold(2),
              scoring='accuracy')
rfecv.fit(X_train, y_train)
print("Optimal number of features : %d" % rfecv.n_features_)
# Plot number of features vs. cross-validation scores.
# grid_scores_ was removed in scikit-learn 1.2; use cv_results_ instead.
scores = rfecv.cv_results_["mean_test_score"]
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (accuracy)")
plt.plot(range(1, len(scores) + 1), scores)
plt.show()

The resulting plot visualizes the selection: cross-validation score as a function of the number of features selected.

Finally, compare the accuracy before and after feature selection:

# Baseline: train on all 25 features
estimator.fit(X_train, y_train)
print(X_train.shape)
print(estimator.score(X_test, y_test))

# Reduce train and test to the features RFECV selected, then retrain
X_trainnew = rfecv.transform(X_train)
X_testnew = rfecv.transform(X_test)
print(X_trainnew.shape)
estimator.fit(X_trainnew, y_train)
print(estimator.score(X_testnew, y_test))

Output:

(750, 25)
0.668
(750, 3)
0.704
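
One more note: rfecv can also be used as a model directly; it reduces the features internally before scoring, so a single call gives the same kind of comparison (a small follow-up sketch):

# score on the raw test set; RFECV applies the selection internally
print(rfecv.score(X_test, y_test))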