[Original] How xgboost computes feature scores

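xgboost is an improved algorithm built on the GBDT principle; it is efficient and supports parallel computation.

It can also report a score for each feature during training, which indicates how important each feature is to the trained model.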

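I will not go through how the source is called; this post focuses on how the score is computed. The source of the function get_fscore is shown below.

The source comes from the installation package: xgboost/python-package/xgboost/core.py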

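As the source below shows, the default feature score (importance_type='weight') is the number of times a feature is used to split the data across all trees; for 'gain' and 'cover', the score is the gain or cover averaged over those splits. This differs somewhat from the formula in Section 10.13.1 of The Elements of Statistical Learning: Data Mining, Inference, and Prediction, which is worth keeping in mind.

Note: the two approaches look at importance from different angles, so the calculations differ slightly.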
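To make the difference concrete, here is a sketch of the two quantities side by side. The first two lines follow the relative-importance measure of Section 10.13.1 as I understand it, where a tree T has J - 1 internal nodes, v(t) is the feature split on at node t, and \hat{\imath}_t^2 is the improvement in squared error from that split. The last line is my own notation (not from the book or from xgboost) for what the 'weight' score counts.

```latex
% Importance of feature \ell in a single tree T
\mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^2 \, I\bigl(v(t) = \ell\bigr)

% Averaged over the M trees of the boosted model
\mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m)

% 'weight' (assumed notation): count the splits, ignoring the improvement
\mathrm{weight}_\ell = \sum_{m=1}^{M} \sum_{t=1}^{J_m - 1} I\bigl(v_m(t) = \ell\bigr)
```

In other words, the book weights every split by how much it improves the fit, while 'weight' treats all splits on a feature as equal.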

def get_fscore(self, fmap=''):
    """Get feature importance of each feature.

    Parameters
    ----------
    fmap: str (optional)
       The name of feature map file
    """

    return self.get_score(fmap, importance_type='weight')

def get_score(self, fmap='', importance_type='weight'):
    """Get feature importance of each feature.
    Importance type can be defined as:
        'weight' - the number of times a feature is used to split the data across all trees.
        'gain' - the average gain of the feature when it is used in trees
        'cover' - the average coverage of the feature when it is used in trees

    Parameters
    ----------
    fmap: str (optional)
       The name of feature map file
    """

    if importance_type not in ['weight', 'gain', 'cover']:
        msg = "importance_type mismatch, got '{}', expected 'weight', 'gain', or 'cover'"
        raise ValueError(msg.format(importance_type))

    # if it's weight, then omap stores the number of missing values
    if importance_type == 'weight':
        # do a simpler tree dump to save time
        trees = self.get_dump(fmap, with_stats=False)

        fmap = {}
        for tree in trees:
            for line in tree.split('\n'):
                # look for the opening square bracket
                arr = line.split('[')
                # if no opening bracket (leaf node), ignore this line
                if len(arr) == 1:
                    continue

                # extract feature name from string between []
                fid = arr[1].split(']')[0].split('<')[0]

                if fid not in fmap:
                    # if the feature hasn't been seen yet
                    fmap[fid] = 1
                else:
                    fmap[fid] += 1

        return fmap

    else:
        trees = self.get_dump(fmap, with_stats=True)

        importance_type += '='
        fmap = {}
        gmap = {}
        for tree in trees:
            for line in tree.split('\n'):
                # look for the opening square bracket
                arr = line.split('[')
                # if no opening bracket (leaf node), ignore this line
                if len(arr) == 1:
                    continue

                # look for the closing bracket, extract only info within that bracket
                fid = arr[1].split(']')

                # extract gain or cover from string after closing bracket
                g = float(fid[1].split(importance_type)[1].split(',')[0])

                # extract feature name from string before closing bracket
                fid = fid[0].split('<')[0]

                if fid not in fmap:
                    # if the feature hasn't been seen yet
                    fmap[fid] = 1
                    gmap[fid] = g
                else:
                    fmap[fid] += 1
                    gmap[fid] += g

        # calculate average value (gain/cover) for each feature
        for fid in gmap:
            gmap[fid] = gmap[fid] / fmap[fid]

        return gmap
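The counting and averaging in get_score can be tried without a trained model. The sketch below is a standalone re-implementation of the same idea, not the library's code: score_from_dump and SAMPLE_TREES are names I made up, and the dump strings are hand-written to resemble what get_dump(with_stats=True) returns, with invented numbers.

```python
from collections import defaultdict

# Hypothetical dump strings shaped like Booster.get_dump(with_stats=True)
# output; the feature names and numbers are made up for illustration.
SAMPLE_TREES = [
    "0:[f0<0.5] yes=1,no=2,missing=1,gain=10.0,cover=100\n"
    "\t1:leaf=0.1,cover=40\n"
    "\t2:[f1<1.5] yes=3,no=4,missing=3,gain=4.0,cover=60\n"
    "\t\t3:leaf=-0.2,cover=25\n"
    "\t\t4:leaf=0.3,cover=35\n",
    "0:[f0<0.7] yes=1,no=2,missing=1,gain=6.0,cover=100\n"
    "\t1:leaf=0.05,cover=55\n"
    "\t2:leaf=-0.05,cover=45\n",
]


def score_from_dump(trees, importance_type="weight"):
    """Count splits per feature, or average gain/cover over those splits."""
    if importance_type not in ("weight", "gain", "cover"):
        raise ValueError("unexpected importance_type: %r" % importance_type)
    counts = defaultdict(int)    # number of splits per feature
    totals = defaultdict(float)  # summed gain or cover per feature
    key = importance_type + "="
    for tree in trees:
        for line in tree.split("\n"):
            parts = line.split("[")
            if len(parts) == 1:  # leaf node or empty line: no split here
                continue
            inside, after = parts[1].split("]")
            fid = inside.split("<")[0]  # "f0<0.5" -> "f0"
            counts[fid] += 1
            if importance_type != "weight":
                totals[fid] += float(after.split(key)[1].split(",")[0])
    if importance_type == "weight":
        return dict(counts)
    return {fid: totals[fid] / counts[fid] for fid in totals}


weight = score_from_dump(SAMPLE_TREES, "weight")
gain = score_from_dump(SAMPLE_TREES, "gain")
cover = score_from_dump(SAMPLE_TREES, "cover")
print(weight, gain, cover)
```

On this sample, f0 is split on twice and f1 once, so weight is {'f0': 2, 'f1': 1}; the average gain of f0 is (10.0 + 6.0) / 2 = 8.0. A feature that is split on often but with small gains ranks high by 'weight' and low by 'gain', which is why the two rankings can disagree.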

How GBDT feature scores are computed:

Link 1: http://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/

For a detailed, code-level explanation, follow the link above through to this one:

http://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting
