A Practical Study of Deriving Per-Variable (Per-Dimension) Scores from an xgboost Model

Article link: https://blog.csdn.net/Zeus_daifu/article/details/125137579

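I. Research Background

1. Compared with scorecards, tree models offer higher predictive accuracy and a relatively simple training process, but they lag behind scorecards in the interpretability of individual variables;
2. Tree models can generally report the importance of each variable, but the quantitative (functional) relationship between each variable and the model's final prediction has long been a research focus and has not been resolved directly.

In summary, tree models underpin many mainstream algorithms, and their high accuracy is in strong demand. If variable interpretability can also be addressed by finding an easily understood functional relationship between the variables and the final prediction, the need for model interpretability can be met while largely preserving accuracy.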

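II. Theoretical Basis of the Study

1. Practical basis
- The xgboost.Booster.predict method provides the relevant functionality through its pred_contribs parameter. From the documentation: pred_contribs (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1) with each record indicating the feature contributions (SHAP values) for that prediction. The sum of all feature contributions is equal to the raw untransformed margin value of the prediction. Note the final column is the bias term.
2. Theoretical basis
- The publication of the paper Demystifying Black-Box Models with SHAP Value Analysis in March 2018 provided a theoretical basis and working code for explaining tree models, neural network models, and others.

Building on these two developments and on the basics of score conversion, this article verifies in practice the relationship between per-variable scores and the predicted score. For the underlying theory, please read the paper and related materials in detail.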
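As a compact restatement of the property quoted from the documentation, the relationship can be written as follows. The symbols $\phi_j$ (contribution of feature $j$), $\phi_0$ (the bias in the final column), $M$ (number of features), and $\sigma$ (the logistic function) are notation introduced here for illustration, and the second line assumes a binary logistic objective such as the one used in the experiment below.

```latex
\begin{aligned}
\text{margin}(x) &= \phi_0 + \sum_{j=1}^{M} \phi_j(x), \\
P(y = 1 \mid x) &= \sigma\bigl(\text{margin}(x)\bigr) = \frac{1}{1 + e^{-\text{margin}(x)}}.
\end{aligned}
```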

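III. Hands-on Practice

1. Install the latest xgboost module; the version installed for this experiment was 0.82
2. Train a model on the iris example dataset
3. Verify the functional relationship between the model's predicted probability and the output obtained with pred_contribs=True
4. Using the score-conversion formula, verify the conversion relationship between each variable's per-dimension score and the total score

3.1 Install the latest module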

Collecting xgboost
  Downloading https://files.pythonhosted.org/packages/6a/49/7e10686647f741bd9c8918b0decdb94135b542fe372ca1100739b8529503/xgboost-0.82-py2.py3-none-manylinux1_x86_64.whl (114.0MB)
    100% |████████████████████████████████| 114.0MB 151kB/s
Requirement already satisfied: numpy in /opt/conda/lib/python3.6/site-packages (from xgboost) (1.13.3)
Requirement already satisfied: scipy in /opt/conda/lib/python3.6/site-packages (from xgboost) (1.1.0)
Installing collected packages: xgboost
Successfully installed xgboost-0.82

3.2 Train the model

import pandas as pd
from sklearn.datasets import load_iris

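# load_iris() returns a Bunch with .data and .target (not a DataFrame)
iris_df = load_iris()
# Merge class 2 into class 0 so the task becomes binary: class 1 vs. the rest
iris_df.target[iris_df.target == 2] = 0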
iris_data = pd.DataFrame(iris_df.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
iris_data['target'] = iris_df.target


import xgboost as xgb

dtrain = xgb.DMatrix(iris_data[iris_data.columns.difference(['target'])], label=iris_data.target)

# specify parameters via a dict; definitions are the same as in the C++ version
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic', 'seed': 0}

# specify validations set to watch performance
# watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 20
bst = xgb.train(param, dtrain, num_round)

/opt/conda/lib/python3.6/site-packages/xgboost/core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
/opt/conda/lib/python3.6/site-packages/xgboost/core.py:588: FutureWarning: Series.base is deprecated and will be removed in a future version
  data.base is not None and isinstance(data, np.ndarray) \

bst

<xgboost.core.Booster at 0x7fd0e9867390>

ypred = bst.predict(dtrain)
ypred[0:5]

array([ 0.00490831,  0.00490831,  0.00490831,  0.00490831,  0.00490831], dtype=float32)

ypred_contribs = bst.predict(dtrain, pred_contribs=True)
ypred_contribs[0:5]

array([[-3.0387814 ,  0.65096694, -1.09195888, -0.33184546, -1.50028646],
       [-3.0387814 ,  0.65096694, -1.09195888, -0.33184546, -1.50028646],
       [-3.0387814 ,  0.65096694, -1.09195888, -0.33184546, -1.50028646],
       [-3.0387814 ,  0.65096694, -1.09195888, -0.33184546, -1.50028646],
       [-3.0387814 ,  0.65096694, -1.09195888, -0.33184546, -1.50028646]], dtype=float32)

score_a = sum(ypred_contribs[0])
print(score_a)

-5.31190526485

Use a logis function to express the relationship between the pred_contribs values and the predicted probability

import numpy as np
def logis(x):
    return 1/(1+np.exp(-x))

logis(score_a)

0.0049083096667698247

ypred[0]

0.0049083065

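The experiment above shows that, with pred_contribs=True, the output holds each feature's contribution to the final margin, with the bias in the last column. Summing a row and applying the logistic function gives the predicted probability of class 1; the two results agree to at least six decimal places.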
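The same relation can be read in the other direction: the log-odds of the predicted probability recovers the summed contributions, and log-odds is the quantity that the score conversion later in the article works with. The following is a minimal sketch using only the numbers printed above; the variable names are chosen here for illustration.

```python
import math

# Contribution row printed above for the first sample; the last entry is the bias.
contribs = [-3.0387814, 0.65096694, -1.09195888, -0.33184546, -1.50028646]
ypred_0 = 0.0049083065  # predicted probability printed above for the same sample

margin = sum(contribs)                         # raw, untransformed margin
log_odds = math.log(ypred_0 / (1 - ypred_0))   # inverse of the logistic function

print(margin, log_odds)
print(abs(margin - log_odds) < 1e-5)
```

The small remaining gap comes from the probability being printed with float32 precision.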

3.3 Combine the score-conversion formula to derive the mapping from pred_contribs values to the output score

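def prob2Score(prob, thea=50, basescore=600, PDO=20):
    # B: points per unit of log-odds; the score drops by PDO each time the odds double
    B = PDO / np.log(2)
    # A: offset chosen so that odds of 1/thea map to basescore
    A = basescore + B * np.log(1 / thea)
    # A higher predicted probability of class 1 gives a lower score
    score = A - B * np.log(prob / (1 - prob))
    return score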

prob2Score(logis(score_a))

640.3920638700547
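To make the roles of the parameters concrete, the sketch below rewrites the same arithmetic in terms of odds = p / (1 - p) and checks two reference points. The helper score_from_odds is hypothetical and introduced only for this illustration, and reading thea as the denominator of the reference odds (1:50) is an interpretation of the code, not something the article states.

```python
import math

def score_from_odds(odds, thea=50, basescore=600, PDO=20):
    # Same arithmetic as prob2Score, expressed in odds = p / (1 - p).
    B = PDO / math.log(2)
    A = basescore + B * math.log(1 / thea)
    return A - B * math.log(odds)

at_reference = score_from_odds(1 / 50)  # odds of 1:50
doubled = score_from_odds(2 / 50)       # odds doubled to 2:50

print(at_reference)            # approximately 600 (basescore)
print(at_reference - doubled)  # approximately 20 (PDO)
```

Under this reading, basescore is the score assigned at odds of 1:thea, and PDO is the number of points lost each time the odds double.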

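def pred_contrib2score(ypred_contribs, thea=50, basescore=600, PDO=20):
    # ypred_contribs: one row of the pred_contribs output (features first, bias last)
    B = PDO / np.log(2)
    A = basescore + B * np.log(1 / thea)
    # Score carried by the bias term
    base_score = A - B * ypred_contribs[-1]
    # Score carried by each feature
    x_score = [-B * beta for beta in ypred_contribs[0:-1]]
    return base_score, x_score, sum(x_score) + base_score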

pred_contrib2score(ypred_contribs[0])

(530.41199291737485,
 [87.680697252217627,
  -18.782935589075397,
  31.507273232861802,
  9.575036056675879],
 640.39206387005481)

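The experiment above shows how the pred_contribs values relate to the score conversion:

The bias term converts to a score as A - B * bias.
For a variable x, the conversion is -B * beta, where beta is that variable's pred_contribs value.
The model's final score is the sum of the bias score and the scores converted from each variable's pred_contribs value.
Note: as before, the score converted directly from the predicted probability and the score built from the pred_contribs values agree to at least six decimal places.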
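The decomposition follows from the additivity of the contributions. A short derivation, writing $\beta_j$ for the feature contributions and $\beta_0$ for the bias (notation chosen here for illustration) and assuming the predicted probability $p$ is the logistic transform of their sum:

```latex
\begin{aligned}
\ln\frac{p}{1-p} &= \beta_0 + \sum_{j} \beta_j, \\
\text{score} &= A - B \ln\frac{p}{1-p}
  = \underbrace{(A - B\,\beta_0)}_{\text{bias score}}
  + \sum_{j} \underbrace{(-B\,\beta_j)}_{\text{score of variable } j}.
\end{aligned}
```

Because the score is linear in the log-odds, it splits into the same additive parts as the margin, which is why the two calculations agree up to floating-point error.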

3.4 How to compute the score for test data

iris_data.head()

	sepal_length	sepal_width	petal_length	petal_width
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

iris_data.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'target'], dtype='object')

new_data = pd.DataFrame([{'sepal_length':5.2, 'sepal_width':4.5, 'petal_length':1.2, 'petal_width':0.15}])
new_data

	petal_length	petal_width	sepal_length	sepal_width
0	1.2	0.15	5.2	4.5

dtest = xgb.DMatrix(new_data)
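The DMatrix above is built from new_data as-is, so its column order has to match the order used in training. In the output shown here the columns happen to be sorted alphabetically, which matches iris_data.columns.difference(['target']); whether a DataFrame built from a dict sorts its keys can depend on the pandas version. A minimal plain-Python sketch of aligning a record explicitly, where feature_order is an assumption based on that alphabetical order:

```python
# Training column order assumed from iris_data.columns.difference(['target']),
# which returns the remaining names sorted alphabetically.
feature_order = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']

record = {'sepal_length': 5.2, 'sepal_width': 4.5,
          'petal_length': 1.2, 'petal_width': 0.15}

# Fail early if the record and the training features do not match exactly.
missing = [name for name in feature_order if name not in record]
extra = [name for name in record if name not in feature_order]
if missing or extra:
    raise ValueError(f"feature mismatch: missing={missing}, extra={extra}")

aligned_row = [record[name] for name in feature_order]
print(aligned_row)  # [1.2, 0.15, 5.2, 4.5]
```

With a DataFrame, the same list can be used for column selection, for example new_data[feature_order], before the DMatrix is built.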

new_ypred_contribs = bst.predict(dtest, pred_contribs=True)
new_ypred_contribs

array([[-2.95274711,  0.65096694, -0.49509707, -0.33184546, -1.50028646]], dtype=float32)

new_ypred = bst.predict(dtest)
new_ypred

array([ 0.00967], dtype=float32)

logis(sum(new_ypred_contribs[0]))

0.0096700072210058954

prob2Score(new_ypred)

array([ 620.68786621], dtype=float32)

pred_contrib2score(new_ypred_contribs[0])

(530.41199291737485,
 [85.198272152439699,
  -18.782935589075397,
  14.285481779856159,
  9.575036056675879],
 620.68784731727123)

In summary, new data can be scored in the same way: the formula yields a score, and the two calculations agree to four decimal places.
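One plausible reason the agreement here is limited to four decimal places is that prob2Score(new_ypred) was evaluated on a float32 array, as the dtype in its output indicates, while pred_contrib2score returned Python floats (float64). The sketch below, using only the standard library and the two values printed above, compares their difference with the float32 resolution at this magnitude; to_float32 is a helper introduced here for illustration.

```python
import math
import struct

def to_float32(x):
    # Round a Python float (float64) to the nearest float32 and back.
    return struct.unpack('f', struct.pack('f', x))[0]

score_from_contribs = 620.68784731727123  # float64 result printed above
score_from_prob = 620.68786621            # float32 result printed above

# Spacing between adjacent float32 values near this score:
# frexp gives x = m * 2**exponent with 0.5 <= m < 1, and float32 keeps 24 bits of m.
_, exponent = math.frexp(score_from_contribs)
float32_spacing = 2.0 ** (exponent - 24)

print(float32_spacing)
print(to_float32(score_from_contribs))
print(abs(score_from_prob - score_from_contribs))
```

If the gap is smaller than the float32 spacing, the discrepancy is consistent with rounding alone and not with a difference between the two formulas.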

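IV. Conclusions

1. With a recent version of the xgboost module, combined with SHAP values, the model interpretability problem can be addressed to some extent;
2. With the score-conversion formula, converting directly from the probability and converting from the pred_contribs values give results that agree to four decimal places. Score precision is usually only required at the integer level, so inconsistencies caused by numerical precision have a negligible effect;
3. With this measure we can obtain a concrete score for each dimension, but the score corresponding to a specific value or interval of each dimension remains unsolved and leaves room for further research.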
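As an illustration of the per-dimension scores mentioned in the conclusions, the sketch below attaches feature names to the scores of the new sample from section 3.4. It uses only the contribution values printed there; the name order is an assumption based on the alphabetical column order of the training data, and the constants mirror the defaults of pred_contrib2score.

```python
import math

# Assumed column order (alphabetical, as in the training DMatrix); bias is last.
names = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
contribs = [-2.95274711, 0.65096694, -0.49509707, -0.33184546, -1.50028646]

PDO, basescore, thea = 20, 600, 50
B = PDO / math.log(2)
A = basescore + B * math.log(1 / thea)

bias_score = A - B * contribs[-1]
feature_scores = {name: -B * beta for name, beta in zip(names, contribs[:-1])}
total = bias_score + sum(feature_scores.values())

print(round(bias_score, 2))
for name, value in feature_scores.items():
    print(name, round(value, 2))
print(round(total, 2))
```

A positive feature score raises the total relative to the bias score, and a negative one lowers it.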

V. References

1. Screenshot of the core theory behind the LIME algorithm (image not available)
2. 17SHAP
3. SHAP code
4. Demystifying Black-Box Models with SHAP Value Analysis