Spark XGBoost feature importance analysis: gain, cover, freq

Three metrics are commonly used to evaluate feature importance:

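① gain: the relative contribution of a feature to the model, computed by adding up that feature's contribution at every split in every tree. A higher value than the other features means the feature matters more for producing predictions.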

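② cover: the relative number of observations related to a feature. For example, suppose you have 100 observations, 4 features and 3 trees, and feature1 is used to decide the leaf node for 10, 5 and 2 observations in tree1, tree2 and tree3 respectively; the cover of feature1 is then 10 + 5 + 2 = 17 observations. The same count is computed for all 4 features, and each feature's cover is expressed as a percentage of the total cover over all features. In the model dump, the cover of a node is the sum of the second-order gradients (Hessians) of the observations reaching it, which is why the sample values further down are fractional.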

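③ freq: frequency is the percentage that represents how often a feature occurs in the trees of the model, relative to the other features. In the example above, if feature1 occurs in 2 splits, 1 split and 3 splits in tree1, tree2 and tree3 respectively, its weight is 2 + 1 + 3 = 6. The frequency of feature1 is its weight as a percentage of the summed weights of all features.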

Gain is the attribute most relevant for interpreting the relative importance of each feature.
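The arithmetic behind the cover and freq examples can be sketched in a few lines of Java. This is only an illustration: the class name `ImportanceShares` and its methods are made up for this post, and the source gives totals for feature1 only, so any totals for the other features are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ImportanceShares {

    /** Sums the per-tree values of one feature, e.g. 10 + 5 + 2 = 17. */
    static double total(double... perTree) {
        double sum = 0.0;
        for (double v : perTree) {
            sum += v;
        }
        return sum;
    }

    /** Divides each feature's total by the grand total, so the shares sum to 1. */
    static Map<String, Double> normalize(Map<String, Double> totals) {
        double grandTotal = 0.0;
        for (double v : totals.values()) {
            grandTotal += v;
        }
        Map<String, Double> shares = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : totals.entrySet()) {
            // Guard against an empty model where every total is zero.
            shares.put(e.getKey(), grandTotal == 0.0 ? 0.0 : e.getValue() / grandTotal);
        }
        return shares;
    }
}
```

For instance, `total(10, 5, 2)` gives the cover of 17 and `total(2, 1, 3)` gives the weight of 6 from the text. If the other three features had hypothetical cover totals of 40, 25 and 18, `normalize` would give feature1 a share of 0.17.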

So how do you get feature importance from the Spark ml.dmlc.xgboost4j library?

Reading the source shows that model.booster.getFeatureScore(null) only returns ③ freq. To get ① gain and ② cover, you have to call model.booster.getModelDump(null, true) and post-process the dump yourself.

val modelInfos = model.booster.getModelDump(null, true)
println(modelInfos(0))
val modelDump = XGBoostFeatureImportanciesUtil.getFeatureImportancies(modelInfos)
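Each line of the dump describes one node, and only split nodes carry a feature id and a gain. As a rough sketch of the post-processing, the following Java parses a single dump line with a regular expression. It assumes the text dump format shown in the sample further down (`[fid<threshold] ...,gain=...,cover=...`); `DumpLineParser` and `NodeStats` are hypothetical names, and other dump formats or feature names containing `<` are not handled.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DumpLineParser {

    /** Feature id, gain and cover of one split node. */
    static final class NodeStats {
        final String feature;
        final double gain;
        final double cover;

        NodeStats(String feature, double gain, double cover) {
            this.feature = feature;
            this.gain = gain;
            this.cover = cover;
        }
    }

    // Matches e.g. "[f23<0.603893] yes=3,no=4,missing=4,gain=478.741,cover=66524".
    private static final Pattern SPLIT_NODE =
            Pattern.compile("\\[([^<\\]]+)<[^\\]]*\\].*gain=([^,]+),cover=([^,\\s]+)");

    /** Returns the stats of a split node, or empty for a leaf line. */
    static Optional<NodeStats> parse(String line) {
        Matcher m = SPLIT_NODE.matcher(line);
        if (!m.find()) {
            return Optional.empty(); // leaf lines have no "[fid<threshold]" part
        }
        return Optional.of(new NodeStats(
                m.group(1),
                Double.parseDouble(m.group(2)),
                Double.parseDouble(m.group(3))));
    }
}
```

Leaf lines such as `63:leaf=-0.190268,cover=25790.8` yield an empty result, so only split nodes contribute to gain, cover and freq.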

 

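Below is the core utility class I wrote. Calling getFeatureImportancies(modelInfos) returns one line per feature containing the feature id, ③ freq, ① gain (both the total and its share of the gain over all features) and ② cover. The fields are separated by \001. The main method contains a sample modelInfos.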

import ml.dmlc.xgboost4j.java.XGBoostError;

import java.util.HashMap;
import java.util.Map;

public class XGBoostFeatureImportanciesUtil {


    public static String split = "\001";

    /**
     * Parses the text dump of every tree (dumped with stats) and returns one line per
     * feature: fid, split count (freq), total gain, gain share and total cover,
     * separated by {@link #split}.
     */
    public static String getFeatureImportancies(String[] modelInfos) throws XGBoostError {

        Map<String, Integer> featureFreq = new HashMap<>();
        Map<String, Double> featureGain = new HashMap<>();
        Map<String, Double> featureCover = new HashMap<>();

        for (String tree : modelInfos) {
            for (String node : tree.split("\n")) {
                // Split nodes look like "0:[f46<0.02] yes=1,...,gain=2319.78,cover=114480".
                // Leaf nodes contain no '[' and are skipped.
                String[] array = node.split("\\[");
                if (array.length != 1) {
                    String fid = array[1].split("\\]")[0];
                    fid = fid.split("<")[0];
                    String gain = array[1].split("gain=")[1];
                    gain = gain.split(",")[0];
                    String cover = array[1].split("cover=")[1];
                    cover = cover.split(",")[0];
                    // Accumulate per-feature split count, gain and cover.
                    featureFreq.merge(fid, 1, Integer::sum);
                    featureGain.merge(fid, Double.valueOf(gain), Double::sum);
                    featureCover.merge(fid, Double.valueOf(cover), Double::sum);
                }
            }
        }
        // Total gain over all features, used to normalize each feature's gain.
        double gainSum = 0.0d;
        for (double gain : featureGain.values()) {
            gainSum += gain;
        }
        if (gainSum == 0.0d) {
            return "";
        }
        StringBuilder featureImportancies = new StringBuilder();
        for (String fid : featureFreq.keySet()) {
            featureImportancies.append(fid).append(split)
                    .append(featureFreq.get(fid)).append(split)
                    .append(featureGain.get(fid)).append(split)
                    .append(featureGain.get(fid) / gainSum).append(split)
                    .append(featureCover.get(fid))
                    .append("\n");
        }
        return featureImportancies.toString();
    }
public static void main(String[] args) {
    String modelinfo = "0:[f46<0.0204612] yes=1,no=2,missing=1,gain=2319.78,cover=114480\n" +
            "\t1:[f23<0.603893] yes=3,no=4,missing=4,gain=478.741,cover=66524\n" +
            "\t\t3:[f51<0.00335573] yes=7,no=8,missing=8,gain=245.974,cover=62588.2\n" +
            "\t\t\t7:[f20<0.512786] yes=15,no=16,missing=16,gain=60.4821,cover=40874.2\n" +
            "\t\t\t\t15:[f63<0.0108743] yes=31,no=32,missing=31,gain=20.7964,cover=31967.2\n" +
            "\t\t\t\t\t31:[f23<0.537906] yes=63,no=64,missing=64,gain=10.1748,cover=26765\n" +
            "\t\t\t\t\t\t63:leaf=-0.190268,cover=25790.8\n" +
            "\t\t\t\t\t\t64:leaf=-0.178159,cover=974.25\n" +
            "\t\t\t\t\t32:[f49<0.267547] yes=65,no=66,missing=66,gain=15.3241,cover=5202.25\n" +
            "\t\t\t\t\t\t65:leaf=-0.184069,cover=4794.75\n" +
            "\t\t\t\t\t\t66:leaf=-0.161812,cover=407.5\n" +
            "\t\t\t\t16:[f23<0.482164] yes=33,no=34,missing=34,gain=26.911,cover=8907\n" +
            "\t\t\t\t\t33:[f42<1.2857] yes=67,no=68,missing=67,gain=0.971554,cover=2953.75\n" +
            "\t\t\t\t\t\t67:leaf=-0.187414,cover=2946.75\n" +
            "\t\t\t\t\t\t68:leaf=-0.1125,cover=7\n" +
            "\t\t\t\t\t34:[f4<0.999999] yes=69,no=70,missing=69,gain=19.0911,cover=5953.25\n" +
            "\t\t\t\t\t\t69:leaf=-0.183261,cover=2066\n" +
            "\t\t\t\t\t\t70:leaf=-0.170449,cover=3887.25\n" +
            "\t\t\t8:[f49<0.0197215] yes=17,no=18,missing=18,gain=61.3492,cover=21714\n" +
            "\t\t\t\t17:[f4<0.999999] yes=35,no=36,missing=35,gain=21.5379,cover=624.75\n" +
            "\t\t\t\t\t35:[f12<3] yes=71,no=72,missing=72,gain=1.60138,cover=300.75\n" +
            "\t\t\t\t\t\t71:leaf=-0.14714,cover=125.75\n" +
            "\t\t\t\t\t\t72:leaf=-0.171023,cover=175\n" +
            "\t\t\t\t\t36:[f20<0.525825] yes=73,no=74,missing=74,gain=11.2899,cover=324\n" +
            "\t\t\t\t\t\t73:leaf=-0.14304,cover=155.25\n" +
            "\t\t\t\t\t\t74:leaf=-0.103387,cover=168.75\n" +
            "\t\t\t\t18:[f21<0.492682] yes=37,no=38,missing=38,gain=53.6745,cover=21089.2\n" +
            "\t\t\t\t\t37:[f52<1.2884] yes=75,no=76,missing=75,gain=13.3477,cover=5378.25\n" +
            "\t\t\t\t\t\t75:leaf=-0.18332,cover=5364.75\n" +
            "\t\t\t\t\t\t76:leaf=-0.0758621,cover=13.5\n" +
            "\t\t\t\t\t38:[f52<1.2884] yes=77,no=78,missing=77,gain=38.0972,cover=15711\n" +
            "\t\t\t\t\t\t77:leaf=-0.171475,cover=15651.8\n" +
            "\t\t\t\t\t\t78:leaf=-0.0887967,cover=59.25\n" +
            "\t\t4:[f23<0.856171] yes=9,no=10,missing=10,gain=142.686,cover=3935.75\n" +
            "\t\t\t9:[f40<0.0587658] yes=19,no=20,missing=19,gain=36.5799,cover=2969.5\n" +
            "\t\t\t\t19:[f6<0.999999] yes=39,no=40,missing=39,gain=12.9572,cover=2716.75\n" +
            "\t\t\t\t\t39:[f20<0.550559] yes=79,no=80,missing=80,gain=15.747,cover=1731.75\n" +
            "\t\t\t\t\t\t79:leaf=-0.167095,cover=679.75\n" +
            "\t\t\t\t\t\t80:leaf=-0.146154,cover=1052\n" +
            "\t\t\t\t\t40:[f14<2] yes=81,no=82,missing=82,gain=9.42129,cover=985\n" +
            "\t\t\t\t\t\t81:leaf=-0.174622,cover=842.25\n" +
            "\t\t\t\t\t\t82:leaf=-0.142957,cover=142.75\n" +
            "\t\t\t\t20:[f15<0.52803] yes=41,no=42,missing=42,gain=7.56139,cover=252.75\n" +
            "\t\t\t\t\t41:[f27<5e+08] yes=83,no=84,missing=84,gain=6.3671,cover=68.75\n" +
            "\t\t\t\t\t\t83:leaf=-0.0166667,cover=11\n" +
            "\t\t\t\t\t\t84:leaf=-0.101277,cover=57.75\n" +
            "\t\t\t\t\t42:[f43<0.0633769] yes=85,no=86,missing=86,gain=5.73181,cover=184\n" +
            "\t\t\t\t\t\t85:leaf=-0.0285714,cover=6\n" +
            "\t\t\t\t\t\t86:leaf=-0.13352,cover=178\n" +
            "\t\t\t10:[f14<2] yes=21,no=22,missing=22,gain=154.902,cover=966.25\n" +
            "\t\t\t\t21:[f54<0.000459881] yes=43,no=44,missing=43,gain=56.6445,cover=810.5\n" +
            "\t\t\t\t\t43:[f25<1363] yes=87,no=88,missing=88,gain=7.87352,cover=534.75\n" +
            "\t\t\t\t\t\t87:leaf=-0.156702,cover=396.25\n" +
            "\t\t\t\t\t\t88:leaf=-0.125448,cover=138.5\n" +
            "\t\t\t\t\t44:[f21<0.74018] yes=89,no=90,missing=90,gain=55.2292,cover=275.75\n" +
            "\t\t\t\t\t\t89:leaf=-0.13506,cover=143.75\n" +
            "\t\t\t\t\t\t90:leaf=-0.0451128,cover=132\n" +
            "\t\t\t\t22:[f24<872] yes=45,no=46,missing=46,gain=52.8332,cover=155.75\n" +
            "\t\t\t\t\t45:[f33<0.0417348] yes=91,no=92,missing=92,gain=36.2418,cover=125.25\n" +
            "\t\t\t\t\t\t91:leaf=0.0578755,cover=67.25\n" +
            "\t\t\t\t\t\t92:leaf=-0.0491525,cover=58\n" +
            "\t\t\t\t\t46:[f50<1.03741e-06] yes=93,no=94,missing=93,gain=7.27111,cover=30.5\n" +
            "\t\t\t\t\t\t93:leaf=-0.024,cover=5.25\n" +
            "\t\t\t\t\t\t94:leaf=-0.158095,cover=25.25\n" +
            "\t2:[f23<0.482164] yes=5,no=6,missing=6,gain=485.305,cover=47955.8\n" +
            "\t\t5:[f24<871] yes=11,no=12,missing=12,gain=26.8177,cover=9959\n" +
            "\t\t\t11:[f12<2] yes=23,no=24,missing=24,gain=5.38835,cover=3626.25\n" +
            "\t\t\t\t23:[f59<0.961747] yes=47,no=48,missing=48,gain=2.22561,cover=1007.75\n" +
            "\t\t\t\t\t47:[f23<0.317355] yes=95,no=96,missing=96,gain=0.245255,cover=981.25\n" +
            "\t\t\t\t\t\t95:leaf=-0.187817,cover=97.5\n" +
            "\t\t\t\t\t\t96:leaf=-0.168918,cover=883.75\n" +
            "\t\t\t\t\t48:[f25<27260] yes=97,no=98,missing=98,gain=5.89723,cover=26.5\n" +
            "\t\t\t\t\t\t97:leaf=-0.147826,cover=22\n" +
            "\t\t\t\t\t\t98:leaf=-0.0181818,cover=4.5\n" +
            "\t\t\t\t24:[f57<4.02672] yes=49,no=50,missing=49,gain=2.78306,cover=2618.5\n" +
            "\t\t\t\t\t49:[f38<0.483983] yes=99,no=100,missing=100,gain=1.7698,cover=2613.75\n" +
            "\t\t\t\t\t\t99:leaf=-0.174041,cover=761.75\n" +
            "\t\t\t\t\t\t100:leaf=-0.183702,cover=1852\n" +
            "\t\t\t\t\t50:leaf=-0.0782609,cover=4.75\n" +
            "\t\t\t12:[f38<0.623771] yes=25,no=26,missing=26,gain=10.7639,cover=6332.75\n" +
            "\t\t\t\t25:[f25<92] yes=51,no=52,missing=52,gain=8.55349,cover=3259.25\n" +
            "\t\t\t\t\t51:[f47<0.146558] yes=101,no=102,missing=102,gain=7.37567,cover=63.25\n" +
            "\t\t\t\t\t\t101:leaf=0.025,cover=3\n" +
            "\t\t\t\t\t\t102:leaf=-0.128163,cover=60.25\n" +
            "\t\t\t\t\t52:[f21<0.515144] yes=103,no=104,missing=104,gain=3.08302,cover=3196\n" +
            "\t\t\t\t\t\t103:leaf=-0.164141,cover=2863\n" +
            "\t\t\t\t\t\t104:leaf=-0.150299,cover=333\n" +
            "\t\t\t\t26:[f40<0.189202] yes=53,no=54,missing=54,gain=3.86644,cover=3073.5\n" +
            "\t\t\t\t\t53:[f15<0.528013] yes=105,no=106,missing=106,gain=6.87584,cover=115\n" +
            "\t\t\t\t\t\t105:leaf=-0.04,cover=6.5\n" +
            "\t\t\t\t\t\t106:leaf=-0.153425,cover=108.5\n" +
            "\t\t\t\t\t54:[f61<0.550285] yes=107,no=108,missing=108,gain=2.70689,cover=2958.5\n" +
            "\t\t\t\t\t\t107:leaf=-0.172909,cover=2845\n" +
            "\t\t\t\t\t\t108:leaf=-0.150218,cover=113.5\n" +
            "\t\t6:[f12<3] yes=13,no=14,missing=13,gain=353.559,cover=37996.8\n" +
            "\t\t\t13:[f22<0.727618] yes=27,no=28,missing=28,gain=280.015,cover=21181.8\n" +
            "\t\t\t\t27:[f20<0.590193] yes=55,no=56,missing=56,gain=92.3481,cover=19267\n" +
            "\t\t\t\t\t55:[f24<871] yes=109,no=110,missing=110,gain=65.2798,cover=18465\n" +
            "\t\t\t\t\t\t109:leaf=-0.150815,cover=6106.5\n" +
            "\t\t\t\t\t\t110:leaf=-0.137983,cover=12358.5\n" +
            "\t\t\t\t\t56:[f38<0.12501] yes=111,no=112,missing=112,gain=31.4306,cover=802\n" +
            "\t\t\t\t\t\t111:leaf=-0.0633452,cover=139.5\n" +
            "\t\t\t\t\t\t112:leaf=-0.116353,cover=662.5\n" +
            "\t\t\t\t28:[f14<2] yes=57,no=58,missing=58,gain=93.2516,cover=1914.75\n" +
            "\t\t\t\t\t57:[f25<262] yes=113,no=114,missing=114,gain=31.8208,cover=1435.5\n" +
            "\t\t\t\t\t\t113:leaf=-0.0870588,cover=360.25\n" +
            "\t\t\t\t\t\t114:leaf=-0.122044,cover=1075.25\n" +
            "\t\t\t\t\t58:[f29<883] yes=115,no=116,missing=116,gain=50.9932,cover=479.25\n" +
            "\t\t\t\t\t\t115:leaf=-0.110231,cover=150.5\n" +
            "\t\t\t\t\t\t116:leaf=-0.0398787,cover=328.75\n" +
            "\t\t\t14:[f20<0.779492] yes=29,no=30,missing=30,gain=182.847,cover=16815\n" +
            "\t\t\t\t29:[f24<871] yes=59,no=60,missing=60,gain=97.5659,cover=16492.2\n" +
            "\t\t\t\t\t59:[f22<0.46508] yes=117,no=118,missing=118,gain=31.147,cover=5885.5\n" +
            "\t\t\t\t\t\t117:leaf=-0.177988,cover=2302.25\n" +
            "\t\t\t\t\t\t118:leaf=-0.162419,cover=3583.25\n" +
            "\t\t\t\t\t60:[f4<0.999999] yes=119,no=120,missing=120,gain=60.9987,cover=10606.8\n" +
            "\t\t\t\t\t\t119:leaf=-0.160178,cover=5177\n" +
            "\t\t\t\t\t\t120:leaf=-0.144722,cover=5429.75\n" +
            "\t\t\t\t30:[f15<0.528035] yes=61,no=62,missing=62,gain=30.2279,cover=322.75\n" +
            "\t\t\t\t\t61:[f6<-1e-06] yes=121,no=122,missing=122,gain=24.0147,cover=144.25\n" +
            "\t\t\t\t\t\t121:leaf=-0.138144,cover=23.25\n" +
            "\t\t\t\t\t\t122:leaf=-0.0286885,cover=121\n" +
            "\t\t\t\t\t62:[f8<-1e-06] yes=123,no=124,missing=124,gain=11.7343,cover=178.5\n" +
            "\t\t\t\t\t\t123:leaf=-0.00833333,cover=11\n" +
            "\t\t\t\t\t\t124:leaf=-0.115727,cover=167.5";
    String[] modelInfos = new String[1];
    modelInfos[0] = modelinfo;
    try {

        String importance = getFeatureImportancies(modelInfos);
        System.out.println(importance);

    } catch (XGBoostError xgBoostError) {
        xgBoostError.printStackTrace();
    }
}
}
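Because getFeatureImportancies returns one delimited string, a caller usually has to split it back into fields before ranking features. A minimal sketch, assuming the column order emitted by the class above (fid, freq, total gain, gain share, total cover); `ImportanceRows` and `Row` are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

final class ImportanceRows {

    /** One parsed output line. */
    static final class Row {
        final String feature;
        final int freq;
        final double gain;
        final double gainShare;
        final double cover;

        Row(String feature, int freq, double gain, double gainShare, double cover) {
            this.feature = feature;
            this.freq = freq;
            this.gain = gain;
            this.gainShare = gainShare;
            this.cover = cover;
        }
    }

    /** Parses the delimited output and sorts the rows by total gain, highest first. */
    static List<Row> parse(String output, String delimiter) {
        List<Row> rows = new ArrayList<>();
        for (String line : output.split("\n")) {
            if (line.isEmpty()) {
                continue; // empty output or trailing newline
            }
            // Quote the delimiter so it is treated literally, not as a regex.
            String[] f = line.split(Pattern.quote(delimiter));
            rows.add(new Row(f[0], Integer.parseInt(f[1]), Double.parseDouble(f[2]),
                    Double.parseDouble(f[3]), Double.parseDouble(f[4])));
        }
        rows.sort(Comparator.comparingDouble((Row r) -> r.gain).reversed());
        return rows;
    }
}
```

A call such as `ImportanceRows.parse(importance, XGBoostFeatureImportanciesUtil.split)` would then give the features ordered by gain, which the text above names as the most relevant of the three metrics.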

 

I am sharing this in the hope that it is useful to others. Please credit the source when reposting.
