1.13.1 Removing features with low variance: VarianceThreshold
Method

Boolean features are Bernoulli random variables, whose variance is Var[X] = p(1 - p).

For example, suppose we want to remove all features that take the same value in more than 80% of the samples; the threshold is then .8 * (1 - .8) = 0.16:
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

As can be seen, the first column has been removed: the probability of it being 0 is p = 5/6 > .8, so its variance p(1 - p) falls below the threshold.
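The removal can be verified by computing each column's empirical variance directly; a small sketch using NumPy on the same toy data as above:

```python
import numpy as np

# Same toy data as in the VarianceThreshold example above.
X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])

threshold = .8 * (1 - .8)     # 0.16
variances = X.var(axis=0)     # per-column population variance, i.e. p * (1 - p)
print(variances)              # column 0: (1/6) * (5/6) ~= 0.139, below the threshold
print(variances > threshold)  # [False  True  True] -- only columns 1 and 2 survive
```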
1.13.2 Univariate feature selection
SelectKBest
  removes all but the k highest scoring features.

SelectPercentile
  removes all but a user-specified highest scoring percentage of features.

SelectFpr, SelectFdr, SelectFwe
  use common univariate statistical tests for each feature: false positive rate (SelectFpr), false discovery rate (SelectFdr), or family wise error (SelectFwe).

GenericUnivariateSelect
  allows performing univariate feature selection with a configurable strategy. This makes it possible to pick the best univariate selection strategy with a hyper-parameter search estimator.

Example: a chi-square test, keeping the two best-scoring features:
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)

1.13.3 Recursive feature elimination (RFE)
An estimator assigns weights to the features, and the least important features are pruned one at a time until the desired number of features is reached.
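A minimal sketch of RFE on the iris data; the choice of a linear SVM as the underlying estimator and n_features_to_select=2 are illustrative assumptions, not part of the original text:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# RFE repeatedly refits the estimator, dropping the lowest-weight
# feature each round (step=1) until 2 features remain.
selector = RFE(SVC(kernel="linear"), n_features_to_select=2, step=1)
X_new = selector.fit_transform(X, y)
print(X_new.shape)        # (150, 2)
print(selector.ranking_)  # the selected features are ranked 1
```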
1.13.4 Feature selection using SelectFromModel
与预测器结合,重要性小于阈值的特征会被移除。
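A brief sketch of the threshold parameter; the choice of a random forest and threshold="median" here are illustrative assumptions (by default the threshold depends on the estimator):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# Keep only features whose importance is at least the median importance,
# i.e. roughly the better half of the 4 iris features.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="median")
X_new = selector.fit_transform(X, y)
print(X_new.shape)
```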
1.13.4.1 L1-based feature selection
Linear models penalized with the L1 norm have sparse solutions: most of their coefficients are zero. When the goal is to reduce dimensionality, feature_selection.SelectFromModel can be used to select the non-zero coefficients. Sparse estimators useful for this purpose include linear_model.Lasso (regression), linear_model.LogisticRegression (classification), and svm.LinearSVC (classification).
Example:
>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
>>> model = SelectFromModel(lsvc, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 3)

With SVMs and logistic regression, the parameter C controls the sparsity: the smaller C, the fewer features selected. With Lasso, the higher the alpha parameter, the fewer features selected.
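The effect of C can be sketched by comparing how many features survive at two settings; the specific C values below are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

def n_selected(C):
    # Fit an L1-penalized linear SVM and count the features
    # kept by SelectFromModel (non-zero coefficients).
    lsvc = LinearSVC(C=C, penalty="l1", dual=False).fit(X, y)
    return SelectFromModel(lsvc, prefit=True).transform(X).shape[1]

# Smaller C -> stronger regularization -> sparser coefficients
# -> fewer features selected.
print(n_selected(0.01), n_selected(1.0))
```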
1.13.4.2 Tree-based feature selection
Tree-based estimators can compute feature importances, which can in turn be used to discard irrelevant features.
Example:
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier()
>>> clf = clf.fit(X, y)
>>> clf.feature_importances_
array([ 0.04...,  0.05...,  0.4...,  0.4...])
>>> model = SelectFromModel(clf, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 2)

1.13.5 Feature selection as part of a pipeline
Feature selection is typically used as a preprocessing step before the actual learning:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification', RandomForestClassifier())
])
clf.fit(X, y)
Here sklearn.svm.LinearSVC is combined with sklearn.feature_selection.SelectFromModel to evaluate feature importances and select the most relevant features. Then a sklearn.ensemble.RandomForestClassifier is trained on the transformed output, i.e. using only the relevant features.