「sklearn_2」Feature selection

qq_1144521901

Article link: https://blog.csdn.net/qq_36098284/article/details/106153351

Feature selection

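Feature selection matters; I won't go into the reasons here. Let's look at several commonly used methods.

Official documentation: https://scikit-learn.org/stable/modules/feature_selection.html

The sections below introduce several common feature selection methods and show how to use them. The official documentation mainly covers the details of each parameter and method.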

1. Univariate feature selection

1.1. Using Chi-square (note: features must be numeric and non-negative)

1.2. Using Mutual Information

2. Recursive feature elimination (RFE)

3. Tree-based feature selection

4. RFECV

1. Univariate feature selection

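Regression and classification use different scoring functions. Applying a regression scoring function to a classification task will not give meaningful results.

If you are unsure how to tell the two apart, see this article: 「机器学习_11」分类和回归 (Classification and Regression)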

For regression: f_regression, mutual_info_regression
For classification: chi2, f_classif, mutual_info_classif
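
The classification case is covered in detail below, so here is a minimal sketch of the regression side for comparison. The synthetic dataset from make_regression and its sizes (200 samples, 10 features, 3 informative) are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data: continuous target, 3 of 10 features informative
X_reg, y_reg = make_regression(n_samples=200, n_features=10,
                               n_informative=3, random_state=0)

# Use a regression scoring function because the target is continuous
selector = SelectKBest(score_func=f_regression, k=3)
X_reg_new = selector.fit_transform(X_reg, y_reg)

print(selector.scores_)   # one F-statistic per input feature
print(X_reg_new.shape)
```

Swapping f_regression for mutual_info_regression should work the same way, since both follow the score_func interface that SelectKBest expects.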

The rest of this section covers the two most common methods for classification: Mutual Information and Chi-square.

from sklearn.datasets import load_iris

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
print(X.shape)

Output:

(150, 4)

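First, a quick look at sklearn.feature_selection.SelectKBest:

Purpose: Select features according to the k highest scores.
Signature: class sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, *, k=10)
Parameters: score_func is the scoring function to use (for example chi2 or mutual_info_classif). k is the number of features to keep; the default is 10.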
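
To see what the selector actually decided, it can help to fit it without transforming and then inspect the scores and the selection mask. The sketch below assumes the iris data and the default f_classif scoring function; the variable names are only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Fit only; f_classif (ANOVA F-value) is the default score_func
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# One score per input feature; a higher score means a stronger link to y
print(selector.scores_)

# Boolean mask and integer indices of the k features that were kept
mask = selector.get_support()
kept = selector.get_support(indices=True)
print(mask, kept)

# transform() keeps only the selected columns
X_new = selector.transform(X)
print(X_new.shape)
```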

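1.1. Using Chi-square (note: features must be numeric and non-negative)

See the official documentation for the parameter details.

k=2 sets the number of features to select. The right value depends on your data, so you will need to tune it yourself.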

# feature as count, should be non-negative
X_new = SelectKBest(chi2, k=2).fit_transform(X, y) 
print(X_new.shape)

Output:

(150, 2)

After applying chi-square, X goes from 4 columns to 2.
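
The non-negativity requirement is easy to trip over, so here is a small sketch of what happens when it is violated. Centering the iris data to create negative values is an artificial assumption for illustration, and rescaling with MinMaxScaler is only one possible workaround, not the only valid one.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

# Center the data so it contains negative values (for illustration only)
X_neg = X - X.mean(axis=0)

try:
    SelectKBest(chi2, k=2).fit(X_neg, y)
    raised = False
except ValueError as exc:
    # chi2 refuses input that contains negative values
    raised = True
    print("chi2 rejected the input:", exc)

# One possible workaround: rescale every feature to the range [0, 1] first
X_scaled = MinMaxScaler().fit_transform(X_neg)
X_new = SelectKBest(chi2, k=2).fit_transform(X_scaled, y)
print(X_new.shape)
```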

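Here is how to use it with separate training and test sets:

# X_train: training features
# Y_train: training labels
# X_test: test features

# Fit on X_train and Y_train only.
# fit() learns which features to keep; transform() applies that selection.
# fit_transform() fits on X_train and Y_train, then returns the reduced X_train.
model = SelectKBest(chi2, k=2)
X_train_new = model.fit_transform(X_train, Y_train)
print(X_train_new.shape)

# Apply the same transformation to the test data; otherwise the train and test features will not match.
X_test_new = model.transform(X_test)
print(X_test_new.shape)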
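
The snippet above leaves X_train, Y_train and X_test as placeholders. A self-contained version might look like the sketch below; the iris data, the 70/30 split, and the LogisticRegression classifier are assumptions added here, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the selector on the training split only, then reuse it on the test split
selector = SelectKBest(chi2, k=2)
X_train_new = selector.fit_transform(X_train, Y_train)
X_test_new = selector.transform(X_test)
print(X_train_new.shape, X_test_new.shape)

# Alternative: put the selector and a classifier in one pipeline, so the
# same selection is applied automatically at fit time and at predict time
pipe = make_pipeline(SelectKBest(chi2, k=2), LogisticRegression(max_iter=1000))
pipe.fit(X_train, Y_train)
print(pipe.score(X_test, Y_test))
```

The pipeline form is often convenient because the selection step can never be fitted on test data by accident.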

1.2. Using Mutual Information

# mutual information is computed with the natural logarithm (log_e)
X_new = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y) 
print(X_new.shape)

Output:

(150, 2)
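
One practical detail: mutual_info_classif adds a small amount of random noise to continuous features, so the scores can differ slightly between runs unless random_state is fixed. The sketch below shows one way to pin it, including inside SelectKBest; using functools.partial for this is a choice made here, not something from the original post.

```python
from functools import partial

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Fixing random_state makes the estimated scores reproducible
mi_first = mutual_info_classif(X, y, random_state=0)
mi_second = mutual_info_classif(X, y, random_state=0)
print(mi_first)
print(np.array_equal(mi_first, mi_second))

# SelectKBest calls score_func(X, y), so bind random_state beforehand
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func, k=2).fit(X, y)
print(selector.get_support(indices=True))
```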

1.3 Example

# Import the necessary libraries first
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)

# Summarize scores
np.set_printoptions(precision=3)
print(fit.scores_)

features = fit.transform(X)
# Summarize selected features
print(features[0:5,:])

Output:

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]

Reference: https://www.datacamp.com/community/tutorials/feature-selection-python

2. Recursive feature elimination (RFE)

This example uses a LogisticRegression model as the estimator.

# Import your necessary dependencies
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

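# Feature extraction (X and Y are the Pima diabetes arrays loaded in section 1.3)
model = LogisticRegression()
# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)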
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
print("Feature Ranking: %s" % (fit.ranking_))

Output:

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]

As before, use fit_transform on the training data and then transform on the test data.
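
As a sketch of that last step, the example below fits RFE on a training split and reuses it on the test split. To stay self-contained it uses the iris data rather than the Pima dataset above; the default train/test split, n_features_to_select=2, and max_iter=1000 are likewise assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_iter is raised so the solver has room to converge on unscaled data
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)

# Learn the selection on the training split only
X_train_new = rfe.fit_transform(X_train, y_train)

# Reuse the fitted selector on the test split
X_test_new = rfe.transform(X_test)

print(X_train_new.shape, X_test_new.shape)
print(rfe.ranking_)  # selected features have rank 1
```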

3. Tree-based feature selection

This approach is based on decision trees.

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)
X.shape

Output:

(150, 4)

clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
clf.feature_importances_

feature_importances_ gives the importance of each feature, returned as an array.

Output:

array([ 0.04...,  0.05...,  0.4...,  0.4...])

model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X_new.shape

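Output:

(150, 2)

Clearly, we should keep the last two features.

This approach relies on SelectFromModel, which differs somewhat from RFE (section 2) and from RFECV (introduced below). Here is how they differ.

Both effectively try to reach the same result, but each technique goes about it slightly differently.

RFE removes the least important features iteratively. It drops some of the least important features, refits the model, then drops and refits again, repeating until the requested number of features is reached.

SelectFromModel is somewhat less robust, because it simply removes the less important features based on a threshold given as a parameter. No iteration is involved.

Reference: https://datascience.stackexchange.com/questions/23539/difference-between-rfe-and-selectfrommodel-in-scikit-learn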
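
The difference can be seen by running both selectors with the same kind of estimator. This is a rough sketch on the iris data; the threshold="mean" setting, n_features_to_select=2, and the fixed random_state are assumptions made so the comparison is easy to read.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectFromModel

X, y = load_iris(return_X_y=True)

# SelectFromModel: fit the estimator once and keep every feature whose
# importance is at least the threshold (here, the mean importance)
sfm = SelectFromModel(
    ExtraTreesClassifier(n_estimators=50, random_state=0),
    threshold="mean",
).fit(X, y)

# RFE: refit repeatedly, dropping `step` features per round until
# n_features_to_select remain
rfe = RFE(
    ExtraTreesClassifier(n_estimators=50, random_state=0),
    n_features_to_select=2,
    step=1,
).fit(X, y)

print("SelectFromModel keeps:", sfm.get_support())
print("RFE keeps:            ", rfe.get_support())
print("RFE ranking:          ", rfe.ranking_)
```

With SelectFromModel the number of kept features follows from the threshold, whereas with RFE it is set directly.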

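4. RFECV

Section 2 introduced RFE. RFECV is essentially the same procedure, except that it also uses cross-validation to choose how many features to keep.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html#sklearn.feature_selection.RFECV

The accuracy comparison at the end of this section also shows an improvement.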

import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV

from sklearn.datasets import make_classification

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)



estimator = tree.DecisionTreeClassifier(max_depth=None)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# The "accuracy" scoring is proportional to the number of correct
# classifications
rfecv = RFECV(estimator=estimator, step=1, cv=StratifiedKFold(2),
              scoring='accuracy')
rfecv.fit(X_train, y_train)
print("Optimal number of features : %d" % rfecv.n_features_)
# Plot number of features VS. cross-validation scores
# (grid_scores_ was removed in scikit-learn 1.2; cv_results_ holds the mean CV scores)
mean_scores = rfecv.cv_results_["mean_test_score"]
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (accuracy)")
plt.plot(range(1, len(mean_scores) + 1), mean_scores)
plt.show()

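The resulting plot shows the cross-validation score for each number of selected features.

Now compare the accuracy before and after feature selection: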

estimator.fit(X_train, y_train)
print(X_train.shape)
print(estimator.score(X_test,y_test))

X_testnew=rfecv.transform(X_test)
X_trainnew=rfecv.transform(X_train)
print(X_trainnew.shape)
estimator.fit(X_trainnew, y_train)
print(estimator.score(X_testnew,y_test))

Output: