Computing XGBoost feature importance (feature_importance)

1. The xgboost native interface and the sklearn interface expose feature importance differently:
bst = xgb.train(param, d1_train, num_boost_round=100, evals=watch_list)
xgc = xgb.XGBClassifier(objective='binary:logistic', seed=10086, **bst_params)

With the sklearn interface, xgc.feature_importances_ is equivalent (up to the normalization shown below) to xgc.get_booster().get_fscore(), which is the same as xgc.get_booster().get_score(importance_type="weight").

With the native interface, you call bst.get_fscore() or bst.get_score(importance_type="weight") directly on the Booster.
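For concreteness, here is a minimal sketch (not from the original post; the toy dataset and the parameter values are made up purely for illustration, unlike the author's d1_train / bst_params) showing where the importance calls live in each interface:

import xgboost as xgb
from sklearn.datasets import make_classification

# Toy data purely for illustration.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Native interface: train a Booster, then query it directly.
dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1}
bst = xgb.train(params, dtrain, num_boost_round=50)
print(bst.get_fscore())                                   # {feature: split count}
print(bst.get_score(importance_type="weight"))            # the same scores

# sklearn interface: fit the estimator, then read feature_importances_
# or drop down to the underlying Booster.
xgc = xgb.XGBClassifier(objective="binary:logistic", n_estimators=50, max_depth=3)
xgc.fit(X, y)
print(xgc.feature_importances_)                           # one value per feature, normalized
print(xgc.get_booster().get_score(importance_type="weight"))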

2. Taking the sklearn interface as an example, there are two ways to look at xgboost feature importance:
xgc.feature_importances_ and xgb.plot_importance(xgc, max_num_features=10)
However, the two produce different results:
[Figures: outputs of xgc.feature_importances_ and xgb.plot_importance]
Some posts online claim that one of them uses importance_type='gain' and the other 'weight'. That is not actually the case; look at the source code:

    @property
    def feature_importances_(self):
        """
        Returns
        -------
        feature_importances_ : array of shape = [n_features]

        """
        b = self.get_booster()
        fs = b.get_fscore()                                   # split counts, i.e. importance_type='weight'
        all_features = [fs.get(f, 0.) for f in b.feature_names]
        all_features = np.array(all_features, dtype=np.float32)
        return all_features / all_features.sum()              # normalized so the values sum to 1

def plot_importance(booster, ax=None, height=0.2,
                    xlim=None, ylim=None, title='Feature importance',
                    xlabel='F score', ylabel='Features',
                    importance_type='weight', max_num_features=None,
                    grid=True, show_values=True, **kwargs):

    """Plot importance based on fitted trees.

    Parameters
    ----------
    booster : Booster, XGBModel or dict
        Booster or XGBModel instance, or dict taken by Booster.get_fscore()
    ax : matplotlib Axes, default None
        Target axes instance. If None, new figure and axes will be created.
    grid : bool, Turn the axes grids on or off.  Default is True (On).
    importance_type : str, default "weight"
        How the importance is calculated: either "weight", "gain", or "cover"
        "weight" is the number of times a feature appears in a tree
        "gain" is the average gain of splits which use the feature
        "cover" is the average coverage of splits which use the feature
            where coverage is defined as the number of samples affected by the split
    max_num_features : int, default None
        Maximum number of top features displayed on plot. If None, all features will be displayed.
    height : float, default 0.2
        Bar height, passed to ax.barh()
    xlim : tuple, default None
        Tuple passed to axes.xlim()
    ylim : tuple, default None
        Tuple passed to axes.ylim()
    title : str, default "Feature importance"
        Axes title. To disable, pass None.
    xlabel : str, default "F score"
        X axis title label. To disable, pass None.
    ylabel : str, default "Features"
        Y axis title label. To disable, pass None.
    show_values : bool, default True
        Show values on plot. To disable, pass False.
    kwargs :
        Other keywords passed to ax.barh()

    Returns
    -------
    ax : matplotlib Axes
    """
    # TODO: move this to compat.py
    try:
        import matplotlib.pyplot as plt
    except ImportError:
        raise ImportError('You must install matplotlib to plot importance')

    if isinstance(booster, XGBModel):
        importance = booster.get_booster().get_score(importance_type=importance_type)
    elif isinstance(booster, Booster):
        importance = booster.get_score(importance_type=importance_type)
    elif isinstance(booster, dict):
        importance = booster
    else:
        raise ValueError('tree must be Booster, XGBModel or dict instance')

    if len(importance) == 0:
        raise ValueError('Booster.get_score() results in empty')

From the source we can see that both use importance_type='weight', i.e. the number of times a feature is used for splitting; the only difference is that feature_importances_ normalizes the returned values.
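To make the normalization explicit, here is a small sketch (reusing the xgc and X fitted in the earlier sketch) that rebuilds that older feature_importances_ by hand from the weight scores:

import numpy as np

b = xgc.get_booster()
fs = b.get_score(importance_type="weight")                          # {feature name: split count}
names = b.feature_names or [f"f{i}" for i in range(X.shape[1])]     # 'f0', 'f1', ... when no names were given
weights = np.array([fs.get(f, 0.0) for f in names], dtype=np.float32)
manual = weights / weights.sum()                                     # normalize so the values sum to 1
print(manual)
print(xgc.feature_importances_)                                      # matches when importance_type is 'weight'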

class XGBModel(XGBModelBase):
    # pylint: disable=too-many-arguments, too-many-instance-attributes, missing-docstring
    def __init__(self, max_depth=None, learning_rate=None, n_estimators=100,
                 verbosity=None, objective=None, booster=None,
                 tree_method=None, n_jobs=None, gamma=None,
                 min_child_weight=None, max_delta_step=None, subsample=None,
                 colsample_bytree=None, colsample_bylevel=None,
                 colsample_bynode=None, reg_alpha=None, reg_lambda=None,
                 scale_pos_weight=None, base_score=None, random_state=None,
                 missing=np.nan, num_parallel_tree=None,
                 monotone_constraints=None, interaction_constraints=None,
                 importance_type="gain", gpu_id=None,
                 validate_parameters=None, **kwargs):

    @property
    def feature_importances_(self):
        """
        Feature importances property
        .. note:: Feature importance is defined only for tree boosters
            Feature importance is only defined when the decision tree model is chosen as base
            learner (`booster=gbtree`). It is not defined for other base learner types, such
            as linear learners (`booster=gblinear`).
        Returns
        -------
        feature_importances_ : array of shape ``[n_features]``
        """
        if self.get_params()['booster'] not in {'gbtree', 'dart'}:
            raise AttributeError(
                'Feature importance is not defined for Booster type {}'
                .format(self.booster))
        b = self.get_booster()
        score = b.get_score(importance_type=self.importance_type)  # honours the importance_type passed to the constructor
        all_features = [score.get(f, 0.) for f in b.feature_names]
        all_features = np.array(all_features, dtype=np.float32)
        return all_features / all_features.sum()

After the xgboost version update, feature_importances_ uses importance_type='gain' by default.
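In other words, in newer versions importance_type is a constructor argument of the sklearn wrapper and feature_importances_ follows it. A quick sketch (same toy X, y as above; the variable names are mine):

xgc_gain = xgb.XGBClassifier(importance_type="gain").fit(X, y)       # the newer default
xgc_weight = xgb.XGBClassifier(importance_type="weight").fit(X, y)   # reproduces the old behaviour
print(xgc_gain.feature_importances_)
print(xgc_weight.feature_importances_)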

3. Comparing different importance_type values
[Figure: feature importance rankings under different importance_type settings]
The feature importance rankings differ across methods. In fact, there are three measures of feature importance, and in practice each of the three options can rank features very differently (see the sketch after this list):

  1. Weight: the number of times a feature is used to split the data across all trees.

  2. Cover: the number of times a feature is used to split the data across all trees, weighted by how many data points pass through those splits. In practice, cover can be understood as the sum of the second-order gradients (Hessians) of the samples routed to the node, and the feature's score is the average cover over its splits.

  3. Gain: the average reduction in training loss when the feature is used for splitting.
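The sketch below (again reusing the fitted xgc from the earlier sketch) queries all three measures from the same Booster, so the ranking differences are easy to see:

booster = xgc.get_booster()
for imp_type in ("weight", "gain", "cover"):
    scores = booster.get_score(importance_type=imp_type)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    print(imp_type, ranked[:5])                           # top-5 features under each measure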

Finally, SHAP values are recommended for measuring feature importance; see this blog post, as well as the earlier explanation of how feature_importance works.
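A rough sketch of the SHAP route (shap is a separate package, pip install shap; the calls below are illustrative and not from the original post, again reusing xgc and X from above):

import numpy as np
import shap

explainer = shap.TreeExplainer(xgc)            # TreeExplainer supports tree models such as xgboost
shap_values = explainer.shap_values(X)         # per-sample, per-feature contributions
print(np.abs(shap_values).mean(axis=0))        # mean |SHAP| per feature: a common global importance measure
shap.summary_plot(shap_values, X)              # optional visual summary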

The original draft of this article was based on this blog post.
