[Original] How xgboost computes feature importance scores

weixin_39800957


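xgboost is an improved algorithm built on the GBDT principle. It is efficient and supports parallel computation.

During training it can also report a score for each feature, which indicates how important that feature is to the model.

This article does not walk through the calling code in detail. It focuses on how the scores are computed. The source of the function get_fscore is shown below.

The source comes from the installed package: xgboost/python-package/xgboost/core.py

As the source shows, the default feature score ('weight') is the number of times a feature is used to split a node across all trees. This differs somewhat from the formula in section 10.13.1 of "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", so take care when comparing the two.

Note: the two approaches look at importance from different angles, so the calculations differ slightly.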
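For comparison, the relative importance measure in section 10.13.1 of that book can be written roughly as follows. This is a sketch from my reading of the book, so check the notation against the text: T is a tree with J-1 internal nodes, v(t) is the variable split on at node t, \hat{\imath}_t^2 is the estimated improvement in squared error from that split, and M is the number of trees. The last line, weight_\ell, is my own notation for what get_fscore counts.

```latex
\mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^{\,2}\, \mathbf{1}\bigl(v(t) = \ell\bigr),
\qquad
\mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m)

\mathrm{weight}_\ell = \sum_{m=1}^{M} \sum_{t=1}^{J_m - 1} \mathbf{1}\bigl(v_m(t) = \ell\bigr)
```

The book weights each split by how much it improves the fit, while the 'weight' score counts every split equally. The 'gain' option in get_score is closer in spirit to the book's measure, though it averages the gain over the splits that use a feature rather than over the trees.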

```python
def get_fscore(self, fmap=''):
    """Get feature importance of each feature.

    Parameters
    ----------
    fmap: str (optional)
       The name of feature map file
    """
    return self.get_score(fmap, importance_type='weight')

def get_score(self, fmap='', importance_type='weight'):
    """Get feature importance of each feature.

    Importance type can be defined as:
        'weight' - the number of times a feature is used to split the data across all trees.
        'gain' - the average gain of the feature when it is used in trees
        'cover' - the average coverage of the feature when it is used in trees

    Parameters
    ----------
    fmap: str (optional)
       The name of feature map file
    """
    if importance_type not in ['weight', 'gain', 'cover']:
        msg = "importance_type mismatch, got '{}', expected 'weight', 'gain', or 'cover'"
        raise ValueError(msg.format(importance_type))

    # if it's weight, then omap stores the number of missing values
    if importance_type == 'weight':
        # do a simpler tree dump to save time
        trees = self.get_dump(fmap, with_stats=False)

        fmap = {}
        for tree in trees:
            for line in tree.split('\n'):
                # look for the opening square bracket
                arr = line.split('[')
                # if no opening bracket (leaf node), ignore this line
                if len(arr) == 1:
                    continue

                # extract feature name from string between []
                fid = arr[1].split(']')[0].split('<')[0]

                if fid not in fmap:
                    # if the feature hasn't been seen yet
                    fmap[fid] = 1
                else:
                    fmap[fid] += 1

        return fmap

    else:
        trees = self.get_dump(fmap, with_stats=True)

        importance_type += '='
        fmap = {}
        gmap = {}
        for tree in trees:
            for line in tree.split('\n'):
                # look for the opening square bracket
                arr = line.split('[')
                # if no opening bracket (leaf node), ignore this line
                if len(arr) == 1:
                    continue

                # look for the closing bracket, extract only info within that bracket
                fid = arr[1].split(']')

                # extract gain or cover from string after closing bracket
                g = float(fid[1].split(importance_type)[1].split(',')[0])

                # extract feature name from string before closing bracket
                fid = fid[0].split('<')[0]

                if fid not in fmap:
                    # if the feature hasn't been seen yet
                    fmap[fid] = 1
                    gmap[fid] = g
                else:
                    fmap[fid] += 1
                    gmap[fid] += g

        # calculate average value (gain/cover) for each feature
        for fid in gmap:
            gmap[fid] = gmap[fid] / fmap[fid]

        return gmap
```
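To make the parsing logic concrete, here is a minimal standalone sketch that applies the same idea to a hand-written tree dump, without needing xgboost installed. The SAMPLE_TREES text and the function name feature_scores are made up for illustration. The dump lines only imitate the general shape of what get_dump(with_stats=True) returns, so treat the exact format as an assumption.

```python
from collections import defaultdict

# Hypothetical dump text: split nodes look like "id:[feature<threshold] ...stats",
# leaf nodes have no square brackets.
SAMPLE_TREES = [
    "0:[f0<0.5] yes=1,no=2,missing=1,gain=10.0,cover=100\n"
    "\t1:leaf=0.1,cover=40\n"
    "\t2:[f1<2.5] yes=3,no=4,missing=3,gain=4.0,cover=60\n"
    "\t\t3:leaf=-0.2,cover=25\n"
    "\t\t4:leaf=0.3,cover=35\n",
    "0:[f0<1.5] yes=1,no=2,missing=1,gain=6.0,cover=100\n"
    "\t1:leaf=0.05,cover=70\n"
    "\t2:leaf=-0.1,cover=30\n",
]


def feature_scores(trees, importance_type="weight"):
    """Return split counts ('weight') or the per-split average of 'gain'/'cover'."""
    if importance_type not in ("weight", "gain", "cover"):
        raise ValueError("unexpected importance_type: {}".format(importance_type))

    counts = defaultdict(int)    # number of splits per feature
    totals = defaultdict(float)  # summed gain or cover per feature
    for tree in trees:
        for line in tree.split("\n"):
            if "[" not in line:
                continue  # leaf node or empty line
            # text between the brackets, and the stats after the closing bracket
            inside, after = line.split("[", 1)[1].split("]", 1)
            fid = inside.split("<")[0]
            counts[fid] += 1
            if importance_type != "weight":
                value = after.split(importance_type + "=")[1].split(",")[0]
                totals[fid] += float(value)

    if importance_type == "weight":
        return dict(counts)
    return {fid: totals[fid] / counts[fid] for fid in counts}


print(feature_scores(SAMPLE_TREES))          # {'f0': 2, 'f1': 1}
print(feature_scores(SAMPLE_TREES, "gain"))  # {'f0': 8.0, 'f1': 4.0}
```

In this toy dump f0 is split on twice (gain 10.0 and 6.0) and f1 once (gain 4.0), so the 'weight' scores are 2 and 1 and the 'gain' scores are the averages 8.0 and 4.0.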

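An explanation of how GBDT feature scores are computed:

Link: 1. http://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/

For a detailed walkthrough with code, follow the link above through to the one below: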

http://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting
