Feature Selection
There is a popular saying in the field: the data and features determine the upper bound of what machine learning can achieve, while models and algorithms merely approach that bound. Feature engineering is therefore especially important. My recent work has involved quite a bit of feature work, so here is a small summary.
Pearson Feature Selection
Coming from a statistics background, I had always thought of Pearson correlation as something only good for correlation analysis, i.e. judging how related two variables are. When I started this work I was not convinced that chi-squared feature selection would be very effective, so I settled for Pearson instead, which Spark supports out of the box. Without further ado, here is the code; you all know the formula, so I will not paste it.
import java.util.ArrayList;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// Build one corr(label, feature) aggregation column per candidate feature
ArrayList<Column> cols = new ArrayList<Column>();
for (String featurename : featurenames) {
    cols.add(functions.corr("label", featurename).alias(featurename));
}
Dataset<Row> featurecorr = feature.agg(functions.first("label"),
        scala.collection.JavaConversions.asScalaBuffer(cols));
Row corrRow = featurecorr.first();            // the aggregation yields a single row
String[] tempfeaturenames = featurecorr.columns();

// Keep features whose absolute correlation with the label exceeds the threshold
// (column 0 is first(label), so start from index 1)
ArrayList<String> tempArrayList = new ArrayList<String>();
ArrayList<String> tempcorrList = new ArrayList<String>();
for (int i = 1; i < corrRow.length(); i++) {
    if (corrRow.get(i) instanceof Double
            && Math.abs((Double) corrRow.get(i)) > threshhold) {
        tempArrayList.add(tempfeaturenames[i]);
        tempcorrList.add(String.valueOf(corrRow.get(i)));
    }
}
String[] newfeaturenames = tempArrayList.toArray(new String[tempArrayList.size()]);

System.out.println("newfeaturenames: ");
for (int i = 0; i < newfeaturenames.length; i++) {
    System.out.println(newfeaturenames[i] + " " + tempcorrList.get(i));
}
System.out.println("rawfeaturenames num is " + featurenames.length);
System.out.println("newfeaturenames num is " + newfeaturenames.length);
Drawback: Pearson correlation assumes both variables are continuous, whereas in classification problems the label column is discrete; it also cannot capture non-linear relationships.
Chi-Squared Feature Selection
Chi-squared feature selection is based on the chi-squared test (understanding the chi-squared distribution helps when choosing the parameters), and in practice it works quite well.
import java.util.ArrayList;
import java.util.Arrays;
import org.apache.spark.ml.feature.ChiSqSelector;
import org.apache.spark.ml.feature.ChiSqSelectorModel;
import org.apache.spark.ml.feature.StandardScaler;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public static String[] feautureSelectionByChiSq(Dataset<Row> feature, Double threshholdfpr, Integer featurenum, String[] featurenames) {
    // Assemble all candidate feature columns into a single vector column
    VectorAssembler va = new VectorAssembler()
            .setInputCols(featurenames)
            .setOutputCol("features");
    feature = va.transform(feature);
    Dataset<Row> newfeature = feature.select("userid", "features", "label");
    feature.unpersist();

    // Scale each feature to unit standard deviation; withMean(false) keeps sparse vectors sparse
    StandardScaler scaler = new StandardScaler()
            .setInputCol("features")
            .setOutputCol("scaledFeatures")
            .setWithStd(true)
            .setWithMean(false);
    Dataset<Row> scalerfeature = scaler.fit(newfeature).transform(newfeature);

    // featurenum < 0: select by false positive rate; otherwise select the top featurenum features
    ChiSqSelector selector;
    if (featurenum < 0) {
        selector = new ChiSqSelector()
                .setSelectorType("fpr")          // needed in Spark 2.1+ for setFpr to take effect
                .setFpr(threshholdfpr)
                // .setPercentile(threshholdfpr)
                .setFeaturesCol("scaledFeatures")
                .setLabelCol("label")
                .setOutputCol("selectedFeatures");
    } else {
        selector = new ChiSqSelector()
                .setNumTopFeatures(featurenum)
                .setFeaturesCol("scaledFeatures")
                .setLabelCol("label")
                .setOutputCol("selectedFeatures");
    }
    ChiSqSelectorModel selectorModel = selector.fit(scalerfeature);

    // selectedFeatures() returns the indices of the selected columns in the assembled vector,
    // so map those indices back to the original feature names
    int[] selected = selectorModel.selectedFeatures();
    System.out.println(Arrays.toString(selected));
    ArrayList<String> tempList = new ArrayList<String>();
    for (int i = 0; i < selected.length; i++) {
        tempList.add(featurenames[selected[i]]);
    }
    System.out.println(selectorModel.getFpr());
    return tempList.toArray(new String[tempList.size()]);
}
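A hypothetical call site (the 0.05 threshold and the feature count below are only examples): a negative featurenum takes the FPR branch, a non-negative one selects the top-k features.

// FPR mode: keep features whose chi-squared test p-value is below 0.05
String[] byFpr = feautureSelectionByChiSq(feature, 0.05, -1, featurenames);
// Top-k mode: keep the 50 highest-scoring features (the fpr argument is then ignored)
String[] byTopK = feautureSelectionByChiSq(feature, 0.05, 50, featurenames);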
Random Forest Feature Selection
Long, long ago I read that random forests (RF) can be used for feature selection, but I could never work out how to do it in Spark. I recently stumbled on the approach on an expert's blog and everything clicked. Code first; the theory will come in a later post, since I am feeling lazy today.
import java.util.ArrayList;
import java.util.HashMap;
import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.ml.feature.StandardScaler;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Assemble the candidate feature columns into a single vector column
VectorAssembler va = new VectorAssembler()
        .setInputCols(featurenames)
        .setOutputCol("features");
feature = va.transform(feature);
Dataset<Row> newfeature = feature.select("userid", "features", "label");
feature.unpersist();

// Scale each feature to unit standard deviation; withMean(false) keeps sparse vectors sparse
StandardScaler scaler = new StandardScaler()
        .setInputCol("features")
        .setOutputCol("scaledFeatures")
        .setWithStd(true)
        .setWithMean(false);
Dataset<Row> scalerfeature = scaler.fit(newfeature).transform(newfeature);

// Fit a random forest and read the impurity-based feature importances off the model
RandomForestClassifier rf = new RandomForestClassifier()
        .setLabelCol("label")
        .setFeaturesCol("scaledFeatures");
RandomForestClassificationModel model = rf.fit(scalerfeature);

double[] importances = model.featureImportances().toArray();
HashMap<String, Double> featurenameMap = new HashMap<String, Double>();
ArrayList<String> tempList = new ArrayList<String>();
for (int i = 0; i < featurenames.length; i++) {
    featurenameMap.put(featurenames[i], importances[i]);
    if (importances[i] > threshhold) {   // keep features whose importance exceeds the threshold
        tempList.add(featurenames[i]);
    }
}
String[] newfeaturenames = tempList.toArray(new String[tempList.size()]);
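If you want to inspect the scores rather than just the surviving names, the featurenameMap built above can be dumped in descending order of importance (a small convenience snippet, not part of the original code):

import java.util.Map;

// Print every feature's importance, highest first
featurenameMap.entrySet().stream()
        .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
        .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));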
Compute the correlation between each feature and the response variable: the common engineering approaches are the Pearson coefficient and the mutual information (maximal information) coefficient. Pearson only measures linear correlation, while mutual information can capture many kinds of dependence at the cost of being more complex to compute; fortunately many toolkits ship an implementation (e.g. the MINE estimator in the minepy package, or mutual_info_classif in sklearn). Once the scores are computed, the features can be ranked and selected.
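Since the rest of this post uses Java, here is a minimal, illustrative sketch of the mutual information idea (the class-free method name and signature are my own, and it assumes the feature and the label have already been discretized into small int arrays):

import java.util.HashMap;
import java.util.Map;

// Estimates I(X;Y) from empirical frequencies of two already-discretized variables
public static double mutualInformation(int[] x, int[] y) {
    int n = x.length;
    Map<Integer, Double> px = new HashMap<Integer, Double>();
    Map<Integer, Double> py = new HashMap<Integer, Double>();
    Map<String, Double> pxy = new HashMap<String, Double>();
    for (int i = 0; i < n; i++) {
        px.merge(x[i], 1.0 / n, Double::sum);
        py.merge(y[i], 1.0 / n, Double::sum);
        pxy.merge(x[i] + "_" + y[i], 1.0 / n, Double::sum);
    }
    double mi = 0.0;
    for (Map.Entry<String, Double> e : pxy.entrySet()) {
        String[] parts = e.getKey().split("_");
        double pJoint = e.getValue();
        double pX = px.get(Integer.parseInt(parts[0]));
        double pY = py.get(Integer.parseInt(parts[1]));
        mi += pJoint * Math.log(pJoint / (pX * pY));   // contribution of one (x, y) cell
    }
    return mi;   // in nats; divide by Math.log(2) for bits
}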
Cross-entropy can serve as the loss function in neural networks (and in machine learning generally). With p denoting the distribution of the true labels and q the distribution predicted by the trained model, the cross-entropy loss measures how similar q is to p. A further benefit of cross-entropy with a sigmoid output is that it avoids the slow-learning problem of the mean-squared-error loss during gradient descent, because the learning speed is effectively driven by the output error.
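For reference, these are the standard textbook definitions (not specific to this post): the cross-entropy of a predicted distribution q against the true distribution p, and the binary form usually paired with a sigmoid output \hat{y}:

H(p, q) = -\sum_{x} p(x)\,\log q(x),
\qquad
L = -\left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right]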
Feature Selection via Regularization
L1 regularization yields sparse solutions, so it naturally performs feature selection. Be careful, though: a feature that L1 does not select is not necessarily unimportant, because of two highly correlated features only one may be kept. To determine which feature really matters, cross-validate again with L2 regularization.
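A rough Spark ML sketch of this idea (reusing the scalerfeature Dataset and featurenames array from the examples above; the regParam value is only illustrative): fit an L1-penalized logistic regression and keep the features whose coefficients stay non-zero.

import java.util.ArrayList;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;

// Pure L1 penalty (elasticNetParam = 1.0); regParam controls how aggressively
// coefficients are pushed to zero and is only an example value here
LogisticRegression lr = new LogisticRegression()
        .setLabelCol("label")
        .setFeaturesCol("scaledFeatures")
        .setElasticNetParam(1.0)
        .setRegParam(0.01);
LogisticRegressionModel lrModel = lr.fit(scalerfeature);

// Keep the features whose coefficients survived the L1 penalty
double[] weights = lrModel.coefficients().toArray();
ArrayList<String> l1Selected = new ArrayList<String>();
for (int i = 0; i < weights.length; i++) {
    if (weights[i] != 0.0) {
        l1Selected.add(featurenames[i]);
    }
}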
Feature Selection via Deep Learning
Once the features (activations) of a chosen layer are extracted from a deep learning model, they can be used to train the final target model.
Feature Selection by Variance
General workflow
Step 1: Exploratory Data Analysis
Step 2: Removing features with low variance (a Spark sketch of this step follows the list)
Step 3: Univariate feature selection
Step 4: Recursive feature elimination
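A minimal sketch of the low-variance filter from Step 2 (reusing the feature Dataset and featurenames array from the examples above; varianceThreshold is a hypothetical cutoff): compute each column's sample variance in one aggregation and drop the nearly constant columns.

import java.util.ArrayList;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// One variance(feature) aggregation column per candidate feature
ArrayList<Column> varCols = new ArrayList<Column>();
for (String featurename : featurenames) {
    varCols.add(functions.variance(functions.col(featurename)).alias(featurename));
}
Row varRow = feature.agg(varCols.get(0),
        scala.collection.JavaConversions.asScalaBuffer(varCols.subList(1, varCols.size()))).first();

// Drop (nearly) constant features; column i of varRow corresponds to featurenames[i]
ArrayList<String> keep = new ArrayList<String>();
for (int i = 0; i < featurenames.length; i++) {
    if (varRow.getDouble(i) > varianceThreshold) {
        keep.add(featurenames[i]);
    }
}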
References
1. Zhihu: https://www.zhihu.com/question/28641663
2. Noriko Oshima: https://www.zhihu.com/question/41252833/answer/108777563
3. Kaggle: http://www.jianshu.com/p/32def2294ae6
4. http://blog.jasonding.top/2017/02/12/Feature%20Engineering/%E3%80%90%E7%89%B9%E5%BE%81%E5%B7%A5%E7%A8%8B%E3%80%91%E7%89%B9%E5%BE%81%E9%80%89%E6%8B%A9%E5%8F%8AmRMR%E7%AE%97%E6%B3%95%E8%A7%A3%E6%9E%90/