Reference: http://scikit-learn.org/stable/modules/feature_selection.html
The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets.
1. Removing features with low variance
VarianceThreshold is a simple baseline approach to feature selection: it removes all features whose variance does not meet a given threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
Suppose we want to remove all features that are either zero or one in more than 80% of the samples (assuming boolean features). Since boolean features are Bernoulli random variables, their variance is Var[X] = p(1 - p), so we can use the threshold 0.8 * (1 - 0.8):
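A minimal sketch of this, following the toy example in the linked scikit-learn documentation (the sample matrix `X` is the one used there, with a first column that is 0 in 5 of 6 samples):

```python
from sklearn.feature_selection import VarianceThreshold

# Toy boolean data: the first column is 0 in 5 of the 6 samples.
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

# Boolean features are Bernoulli, so Var[X] = p(1 - p). To drop features
# that take the same value in more than 80% of samples, use the threshold
# 0.8 * (1 - 0.8) = 0.16.
sel = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))
X_new = sel.fit_transform(X)

print(X_new.shape)  # the first column (variance 5/36 < 0.16) is dropped
```

`fit_transform` returns the data with only the two remaining columns; the other features' variances (2/9 and 1/4) exceed the threshold.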
The first column is removed because p = 5/6 > 0.8.
2. Univariate feature selection (I use this a lot)
Univariate feature selection is based on univariate statistical tests and comes in the following variants: