Recursive Feature Elimination
1.1 Removing features with low variance
For example, suppose we have a dataset of boolean features and we want to remove every feature in which the number of 0s (or 1s) exceeds 80% of the samples.
Boolean features are Bernoulli random variables, whose variance is Var[X] = p(1 - p),
so we set the threshold to 0.8 * (1 - 0.8).
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
1.2 Selecting the k best features
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)
- For regression problems, use f_regression as the scoring function instead.
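A minimal sketch of the regression case, pairing SelectKBest with f_regression on a synthetic dataset (the make_regression parameters here are illustrative, not from the original text):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data: 100 samples, 10 features, only 3 informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       random_state=0)

# Keep the 3 features with the highest univariate F-scores
selector = SelectKBest(f_regression, k=3)
X_new = selector.fit_transform(X, y)
print(X.shape, X_new.shape)  # (100, 10) (100, 3)
```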
2.1 RFE
Wrapper-style feature selection evaluates candidate feature subsets directly by the performance of the learner that will ultimately be used; in other words, it selects a feature subset tailored to that learner. Wrapper methods usually select better subsets than filter methods, but because the learner must be retrained many times during the search, the computational cost is much higher.
LVW (Las Vegas Wrapper) uses a randomized strategy for subset search, taking the error of the final classifier as the evaluation criterion for each subset. It estimates the learner's error by cross-validation, repeatedly draws random feature subsets to update the current best subset, and stops updating once a stopping condition is satisfied.
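LVW is not implemented in scikit-learn; the following is a minimal sketch under the description above, where the stopping condition is "no improvement for max_trials consecutive random subsets" (the function name, its parameters, and the tie-breaking rule are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import load_iris

def lvw(X, y, estimator, max_trials=30, cv=5, seed=None):
    """Las Vegas Wrapper sketch: randomly sample feature subsets and keep
    the one with the best cross-validated score."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    best_feats = np.arange(n_features)
    best_score = cross_val_score(estimator, X, y, cv=cv).mean()
    trials = 0
    while trials < max_trials:
        # Draw a random non-empty subset of features
        size = rng.integers(1, n_features + 1)
        feats = np.sort(rng.choice(n_features, size=size, replace=False))
        score = cross_val_score(estimator, X[:, feats], y, cv=cv).mean()
        # Keep the subset if it scores better, or ties with fewer features
        if score > best_score or (score == best_score
                                  and len(feats) < len(best_feats)):
            best_score, best_feats = score, feats
            trials = 0  # reset the counter after an improvement
        else:
            trials += 1
    return best_feats, best_score

data = load_iris()
feats, score = lvw(data.data, data.target, SVC(kernel="linear"), seed=0)
print(feats, score)
```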
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE

# Load the digits dataset
digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# Create the RFE object and rank each pixel
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=50, step=1)
x = rfe.fit_transform(X, y)
x.shape  # (1797, 50)
2.2 RFECV (cross-validated RFE)
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC
from sklearn.datasets import load_iris

data = load_iris()
x = data.data
y = data.target

svc = SVC(kernel="linear")
# The "accuracy" scoring is proportional to the number of correct
# classifications
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
              scoring='accuracy')
rfecv.fit(x, y)
3. SelectFromModel
3.1 L1-based feature selection
An L1 penalty drives many coefficients to exactly zero; feature_selection.SelectFromModel can then keep only the features whose coefficients are non-zero.
>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
>>> model = SelectFromModel(lsvc, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 3)
3.2 Tree-based feature selection
A tree or a forest of trees can compute feature importances, and feature_selection.SelectFromModel can select the features whose importance is high.
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier()
>>> clf = clf.fit(X, y)
>>> clf.feature_importances_
array([ 0.04...,  0.05...,  0.4...,  0.4...])
>>> model = SelectFromModel(clf, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 2)
References
Machine Learning (Zhou Zhihua)
scikit-learn official documentation: http://scikit-learn.org/0.17/modules/feature_selection.html#feature-selection