Reference: http://scikit-learn.org/stable/modules/feature_selection.html
The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets.
1. Removing features with low variance
VarianceThreshold is a simple baseline approach to feature selection: it removes all features whose variance does not meet a given threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
Suppose we want to remove all features that are either zero or one in more than 80% of the samples (assuming boolean features). Since boolean features are Bernoulli random variables, their variance is Var[X] = p(1 - p), so we can use the threshold 0.8 * (1 - 0.8):
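A minimal sketch of this, following the toy example in the linked scikit-learn documentation (the sample matrix `X` is the one used there, with a first column that is 0 in 5 of 6 samples):

```python
from sklearn.feature_selection import VarianceThreshold

# Toy boolean data: the first column is 0 in 5 of the 6 samples.
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

# Boolean features are Bernoulli, so Var[X] = p(1 - p). To drop features
# that take the same value in more than 80% of samples, use the threshold
# 0.8 * (1 - 0.8) = 0.16.
sel = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))
X_new = sel.fit_transform(X)

print(X_new.shape)  # the first column (variance 5/36 < 0.16) is dropped
```

`fit_transform` returns the data with only the two remaining columns; the other features' variances (2/9 and 1/4) exceed the threshold.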
The first column is removed because p = 5/6 > 0.8.
2. Univariate feature selection (I use this a lot)
Univariate feature selection is based on univariate statistical tests and comes in the following variants: