1.13.1 Removing features with low variance: VarianceThreshold
Method

Boolean features are Bernoulli random variables, whose variance is Var[X] = p(1 - p).

For example, suppose we want to remove all features that take the same value in more than 80% of the samples; the threshold is then .8 * (1 - .8) = 0.16:
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

As can be seen, the first column has been removed: the probability of it being 0 is p = 5/6 > .8, so its variance p(1 - p) falls below the threshold.
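The removal can be verified by computing each column's empirical variance directly; a small sketch using NumPy on the same toy data as above:

```python
import numpy as np

# Same toy data as in the VarianceThreshold example above.
X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])

threshold = .8 * (1 - .8)     # 0.16
variances = X.var(axis=0)     # per-column population variance, i.e. p * (1 - p)
print(variances)              # column 0: (1/6) * (5/6) ~= 0.139, below the threshold
print(variances > threshold)  # [False  True  True] -- only columns 1 and 2 survive
```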
1.13.2 Univariate feature selection
SelectKBest
  removes all but the k highest scoring features.

SelectPercentile
  removes all but a user-specified highest scoring percentage of features.

SelectFpr, SelectFdr, SelectFwe
  use common univariate statistical tests for each feature: false positive rate (SelectFpr), false discovery rate (SelectFdr), or family wise error (SelectFwe).

GenericUnivariateSelect
  allows performing univariate feature selection with a configurable strategy. This makes it possible to pick the best univariate selection strategy with a hyper-parameter search estimator.

Example: a chi-square test, keeping the two best-scoring features:
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)

1.13.3 Recursive feature elimination (RFE)
An estimator assigns weights to the features, and the least important features are pruned one at a time until the desired number of features is reached.
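A minimal sketch of RFE on the iris data; the choice of a linear SVM as the underlying estimator and n_features_to_select=2 are illustrative assumptions, not part of the original text:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# RFE repeatedly refits the estimator, dropping the lowest-weight
# feature each round (step=1) until 2 features remain.
selector = RFE(SVC(kernel="linear"), n_features_to_select=2, step=1)
X_new = selector.fit_transform(X, y)
print(X_new.shape)        # (150, 2)
print(selector.ranking_)  # the selected features are ranked 1
```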
1.13.4 Feature selection using SelectFromModel
与预测器结合,重要性小于阈值的特征会被移除。
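A brief sketch of the threshold parameter; the choice of a random forest and threshold="median" here are illustrative assumptions (by default the threshold depends on the estimator):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# Keep only features whose importance is at least the median importance,
# i.e. roughly the better half of the 4 iris features.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="median")
X_new = selector.fit_transform(X, y)
print(X_new.shape)
```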
1.13.4.1 L1-based feature selection
Linear models penalized with the L1 norm have sparse solutions: most of their coefficients are zero. When the goal is to reduce dimensionality, feature_selection.SelectFromModel can be used to select the non-zero coefficients. Sparse estimators useful for this purpose include linear_model.Lasso (regression), linear_model.LogisticRegression (classification), and svm.LinearSVC (classification).
Example:
>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
>>> model = SelectFromModel(lsvc, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 3)

With SVMs and logistic regression, the parameter C controls the sparsity: the smaller C, the fewer features selected. With Lasso, the higher the alpha parameter, the fewer features selected.
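The effect of C can be sketched by comparing how many features survive at two settings; the specific C values below are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

def n_selected(C):
    # Fit an L1-penalized linear SVM and count the features
    # kept by SelectFromModel (non-zero coefficients).
    lsvc = LinearSVC(C=C, penalty="l1", dual=False).fit(X, y)
    return SelectFromModel(lsvc, prefit=True).transform(X).shape[1]

# Smaller C -> stronger regularization -> sparser coefficients
# -> fewer features selected.
print(n_selected(0.01), n_selected(1.0))
```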
1.13.4.2 Tree-based feature selection
Tree-based estimators can compute feature importances, which can in turn be used to discard irrelevant features.
Example:
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier()
>>> clf = clf.fit(X, y)
>>> clf.feature_importances_
array([ 0.04...,  0.05...,  0.4...,  0.4...])
>>> model = SelectFromModel(clf, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 2)

1.13.5 Feature selection as part of a pipeline
Feature selection is typically used as a preprocessing step before the actual learning:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification', RandomForestClassifier())
])
clf.fit(X, y)
Here sklearn.svm.LinearSVC is combined with sklearn.feature_selection.SelectFromModel to evaluate feature importances and select the most relevant features. Then a sklearn.ensemble.RandomForestClassifier is trained on the transformed output, i.e. using only the relevant features.