Attribution Analysis Notes 6: Using the SHAP Package and Reading Its Source

lagoon_lala

Last modified: 2022-05-20 10:16:54


Categories: Medical Computing, Artificial Intelligence, Data Analysis. Tags: Attribution Analysis, SHAP

First published: 2022-03-10 09:35:48

Article link: https://blog.csdn.net/lagoon_lala/article/details/123393113


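I just discovered that this article was plagiarized by Baidu Wenku. To report the infringement I am required to print a guarantee letter myself, and, most absurd of all, uploading a picture of that letter requires enabling Flash. Their intent is all too obvious.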

Python package:

https://github.com/slundberg/shap

Package documentation:

https://shap.readthedocs.io/en/latest/?badge=latest

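SHAP (SHapley Additive exPlanations) is an attribution method. It builds on Shapley values, a local explanation method that explains individual predictions; combining the Shapley values of many samples gives a global explanation that describes how each feature affects the model's average behavior.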
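To make the notion of a Shapley value concrete, here is a minimal sketch that computes exact Shapley values by enumerating every coalition of features. The toy model, the instance `x`, and the choice of replacing absent features with a baseline of 0 are all assumptions made for illustration; they are not part of the shap package.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value, n_features):
    """Exact Shapley values for a coalition value function value(frozenset)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for j in players:
        others = [p for p in players if p != j]
        for size in range(len(others) + 1):
            # classical Shapley weight |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi


# Hypothetical instance and model, for illustration only.
x = [1.0, 2.0, 3.0]


def toy_model(v):
    return v[0] + 2.0 * v[1] * v[2]


def value(coalition):
    # features outside the coalition are replaced by a baseline of 0 (an assumption)
    masked = [x[i] if i in coalition else 0.0 for i in range(len(x))]
    return toy_model(masked)


phi = exact_shapley(value, len(x))
print(phi, sum(phi))
```

The values sum to the difference between the model output for `x` and the output for the all-baseline input, which is the additive property that SHAP relies on.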

An introduction to the SHAP package:

https://blog.csdn.net/weixin_34355360/article/details/112737643

SHAP for multi-class classification:

https://www.pythonheidong.com/blog/article/535295/a1230a74909938a8f058/

Installation

pip install shap

conda install -c conda-forge shap

activate Liver

pip install shap

Usage example

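This is the Kernel SHAP implementation. Kernel SHAP is a model-agnostic method for estimating SHAP values for any model. Because it makes no assumptions about the model type, KernelExplainer is slower than the algorithms specialized for particular model types.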

This example explains a multi-class SVM on the iris dataset.

The complete notebook code (explaining 6 kinds of scikit-learn models):

https://slundberg.github.io/shap/notebooks/Iris%20classification%20with%20scikit-learn.html

import sklearn.svm

import shap

from sklearn.model_selection import train_test_split

# print the JS visualization code to the notebook

shap.initjs()

# train a SVM classifier

X_train,X_test,Y_train,Y_test = train_test_split(*shap.datasets.iris(), test_size=0.2, random_state=0)

svm = sklearn.svm.SVC(kernel='rbf', probability=True)

svm.fit(X_train, Y_train)

# use Kernel SHAP to explain the test set predictions

explainer = shap.KernelExplainer(svm.predict_proba, X_train, link="logit")

shap_values = explainer.shap_values(X_test, nsamples=100)

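Here, nsamples=100 in explainer.shap_values(X_test, nsamples=100) is the number of times the model is re-evaluated when explaining each prediction (a single test sample); see below.

The resulting force plot breaks the prediction into the contributions of the 4 features, which push the output from the average value 0.3206 to 0.01.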

shap.kmeans

shap.kmeans summarizes the data by clustering to simplify the computation; reference:

https://zhuanlan.zhihu.com/p/484529670

# 0. Cluster the training data to simplify the computation and speed it up
X_train_summary = shap.kmeans(X_train, 100)

# 1. Create the explainer object
explainer = shap.KernelExplainer(svm.predict_proba, data=X_train_summary, link="logit")
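The reason a smaller background set helps can be seen from a rough cost model. The sketch below assumes that the number of model evaluations grows as (samples explained) x (nsamples) x (background rows); this is a simplification for illustration, and the helper name and the dataset sizes are hypothetical.

```python
def approx_model_calls(n_explained, nsamples, n_background):
    """Rough count of model evaluations under the stated assumption."""
    return n_explained * nsamples * n_background


# Hypothetical sizes: 30 samples to explain, 10,000 training rows vs. 100 centroids.
full = approx_model_calls(n_explained=30, nsamples=100, n_background=10_000)
summarized = approx_model_calls(n_explained=30, nsamples=100, n_background=100)
print(full, summarized, full // summarized)
```

Under this assumption, summarizing the background data cuts the work in proportion to the reduction in background rows.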

shap_values()

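A closer look at the shap_values function:

Purpose: estimate the SHAP values for a set of samples.

Input parameters

X : numpy.array or pandas.DataFrame or any scipy.sparse matrix

A matrix of samples on which to explain the model's output.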

nsamples : "auto" or int

Number of times to re-evaluate the model when explaining each prediction (that is, each individual test sample). More samples lead to lower variance estimates of the SHAP values. The "auto" setting uses `nsamples = 2 * X.shape[1] + 2048`.
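The effect of the sample count on variance can be illustrated with a small experiment. The sketch below does not run Kernel SHAP; it swaps in a simpler Monte Carlo estimator that averages marginal contributions over random feature orderings, and it uses a hypothetical toy model with a baseline of 0. It also restates the "auto" rule quoted above as a helper.

```python
import random
import statistics


def auto_nsamples(n_features):
    """The "auto" rule quoted in the docstring: 2 * n_features + 2048."""
    return 2 * n_features + 2048


def toy_model(v):
    # hypothetical model with an interaction between features 1 and 2
    return v[0] + 2.0 * v[1] * v[2]


def sampled_shapley(x, baseline, n_permutations, rng):
    """Monte Carlo Shapley estimate from random feature orderings (not Kernel SHAP)."""
    n = len(x)
    phi = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)
        prev = toy_model(current)
        for j in order:
            current[j] = x[j]          # switch feature j from baseline to its real value
            new = toy_model(current)
            phi[j] += new - prev       # marginal contribution in this ordering
            prev = new
    return [p / n_permutations for p in phi]


def spread_of_estimate(n_permutations, repeats=30, seed=0):
    """Standard deviation of the estimate for feature 1 across repeated runs."""
    rng = random.Random(seed)
    estimates = [
        sampled_shapley([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], n_permutations, rng)[1]
        for _ in range(repeats)
    ]
    return statistics.stdev(estimates)


print(auto_nsamples(4))
print(spread_of_estimate(5), spread_of_estimate(500))
```

With more sampled orderings the repeated estimates cluster more tightly, which is the same trade-off between cost and variance that nsamples controls.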

Return value

array or list

For models with a single output this returns a matrix of SHAP values (# samples x # features). Each row sums to the difference between the model output for that sample and the expected value of the model output (which is stored as expected_value attribute of the explainer). For models with vector outputs this returns a list of such matrices, one for each output.

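(This is easier to follow after looking at the documentation examples.) In a shap_values matrix, each row is one sample:

The whole shap_values matrix is typically used to draw global summaries.

A single row such as shap_values[0] is used to plot a single instance (any other row index works the same way). For a model with vector outputs, shap_values is a list, so the first index selects the output instead.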
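The layout of the return value can be mimicked without calling shap at all. The sketch below assumes a hypothetical linear model with 3 outputs and treats the features as independent, in which case the SHAP value of feature j for output k has the closed form W[j, k] * (x_j - mean_j). All data and weights are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_background = rng.normal(size=(50, 4))   # hypothetical background data
X_explain = rng.normal(size=(5, 4))       # hypothetical samples to explain
W = rng.normal(size=(4, 3))               # hypothetical linear model with 3 outputs


def predict(X):
    return X @ W


mean = X_background.mean(axis=0)
expected_value = predict(X_background).mean(axis=0)   # one base value per output

# One (# samples x # features) matrix per output, like the list described above.
shap_values = [(X_explain - mean) * W[:, k] for k in range(W.shape[1])]

# Row sums of the matrix for output 0 versus f(x) - expected value for output 0.
print(shap_values[0].sum(axis=1))
print(predict(X_explain)[:, 0] - expected_value[0])
```

Here shap_values[k] selects an output, and shap_values[k][i, :] is the row for sample i of that output.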

Using the KernelExplainer return value

Example reference:

https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/model_agnostic/Multioutput%20Regression%20SHAP.html#Get-SHAP-Values-and-Plots

Set up the explainer using the Kernel Explainer (the model-agnostic explainer method from SHAP):

explainer = shap.KernelExplainer(model = model.predict, data = X.head(50), link = "identity")

Get the Shapley values for a single example (that is, pass only one sample to be explained):

# Set the index of the specific example to explain

X_idx = 0

shap_value_single = explainer.shap_values(X = X.iloc[X_idx:X_idx+1,:], nsamples = 100)

Show the details of the single sample (its input values):

X.iloc[X_idx:X_idx+1,:]

Force plot for a single sample and a single label:

# print the JS visualization code to the notebook

shap.initjs()

print(f'Current label Shown: {list_of_labels[current_label.value]}')

shap.force_plot(base_value = explainer.expected_value[current_label.value],
                shap_values = shap_value_single[current_label.value],
                features = X.iloc[X_idx:X_idx+1,:]
                )

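As this shows, the force plot directly accepts the shap_values that the Kernel explainer returns for the given input.

Create a summary plot for a specific output/label/target:

# Note: limited to the first 50 training examples because computing all of them takes too long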

shap_values = explainer.shap_values(X = X.iloc[0:50,:], nsamples = 100)

# print the JS visualization code to the notebook

shap.initjs()

print(f'Current Label Shown: {list_of_labels[current_label.value]}\n')

shap.summary_plot(shap_values = shap_values[current_label.value],
                  features = X.iloc[0:50,:]
                  )

shap.initjs()

shap.force_plot(base_value = explainer.expected_value[current_label.value],
                shap_values = shap_values[current_label.value],
                features = X.iloc[0:50,:]
                )

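As this shows, summary_plot, like force_plot, accepts the shap_values of the Kernel Explainer as an argument.

Based on the summary plot above, features 01, 03 and 07 have no influence on the model and can be dropped.

KernelExplainer source docstring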

"""Uses the Kernel SHAP method to explain the output of any function.

Kernel SHAP is a method that uses a special weighted linear regression to compute the importance of each feature. The computed importance values are Shapley values from game theory and also coefficients from a local linear regression.
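The weighted linear regression idea can be sketched on a tiny problem where every coalition can be enumerated. This is an illustrative reconstruction, not the shap implementation: it assumes a single all-zero baseline row, a hypothetical toy model, and it approximates the infinite kernel weight on the empty and full coalitions with a large finite weight of 1e6.

```python
import itertools
import math

import numpy as np


def shapley_kernel_weight(M, s):
    """Shapley kernel weight for a coalition of size s out of M features."""
    if s == 0 or s == M:
        return 1e6  # stands in for the infinite weight on the two end points
    return (M - 1) / (math.comb(M, s) * s * (M - s))


def kernel_shap_enumerated(f, x, baseline):
    """Weighted least squares over all 2^M coalitions (feasible only for small M)."""
    M = len(x)
    masks = np.array(list(itertools.product([0, 1], repeat=M)), dtype=float)
    # features that are "missing" from a coalition take the baseline value
    X_synth = masks * x + (1 - masks) * baseline
    y = f(X_synth)
    w = np.array([shapley_kernel_weight(M, int(m.sum())) for m in masks])
    A = np.hstack([np.ones((len(masks), 1)), masks])   # intercept + one column per feature
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1:]


def toy_model(X):
    # hypothetical model taking a (# samples x # features) matrix
    return X[:, 0] + 2.0 * X[:, 1] * X[:, 2]


base_value, phi = kernel_shap_enumerated(toy_model, np.array([1.0, 2.0, 3.0]), np.zeros(3))
print(base_value, phi)
```

The regression coefficients play the role of the feature importance values, and the intercept plays the role of the base value; for this toy model they agree with the Shapley values obtained by direct enumeration.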

Parameters

----------

model : function or iml.Model

User supplied function that takes a matrix of samples (# samples x # features) and computes the output of the model for those samples. The output can be a vector (# samples) or a matrix (# samples x # model outputs).

For a way to write a custom wrapper function, see:

https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/neural_networks/Census%20income%20classification%20with%20Keras.html

# Here we take the Keras model trained above and explain why it makes different predictions
# for different individuals. SHAP expects model functions to take a 2D numpy array
# as input, so we define a wrapper function around the original Keras predict function.
def f(X):
    return regression.predict([X[:,i] for i in range(X.shape[1])]).flatten()
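The same wrapper pattern can be tried without Keras. In the sketch below, `predict_from_columns` is a hypothetical stand-in for a model that expects a list of per-feature columns and returns a column vector; the wrapper `f` adapts it to the 2D-array-in, vector-out contract described above.

```python
import numpy as np


def predict_from_columns(columns):
    """Hypothetical model: takes a list of 1D feature columns, returns shape (n_samples, 1)."""
    age, hours = columns
    return (0.1 * age + 0.5 * hours).reshape(-1, 1)


def f(X):
    """Wrapper: 2D array (# samples x # features) in, 1D vector (# samples) out."""
    X = np.asarray(X, dtype=float)
    columns = [X[:, i] for i in range(X.shape[1])]
    return predict_from_columns(columns).flatten()


print(f(np.array([[10.0, 2.0], [20.0, 4.0]])))
```

A function shaped like `f` is what would be passed as the model argument of the explainer.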

data : numpy.array or pandas.DataFrame or shap.common.DenseData or any scipy.sparse matrix

The background dataset to use for integrating out features. (Integrating out is similar to summing a joint distribution over all values of one variable to obtain the marginal distribution of another.)

Reference on integrating out:

https://www.physicsforums.com/threads/integrating-out.504853/

To determine the impact of a feature, that feature is set to "missing" and the change in the model output is observed.

Since most models aren't designed to handle arbitrary missing data at test time, we simulate "missing" by replacing the feature with the values it takes in the background dataset.

So if the background dataset is a simple sample of all zeros, then we would approximate a feature being missing by setting it to zero.

For small problems this background dataset can be the whole training set, but for larger problems consider using a single reference value or using the kmeans function to summarize the dataset.
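The replacement of "missing" features by background values can be sketched in a few lines. The helper name `expected_output_with_missing` and the toy model are hypothetical; the sketch only illustrates the idea of averaging the model output over background rows, not the internals of shap.

```python
import numpy as np


def expected_output_with_missing(f, x, background, present):
    """Average f over copies of x whose 'missing' features take background values."""
    present = np.asarray(present, dtype=bool)
    synthetic = np.array(background, dtype=float, copy=True)
    # features that are present keep the value from x in every synthetic row
    synthetic[:, present] = np.asarray(x, dtype=float)[present]
    return f(synthetic).mean(axis=0)


def toy_model(X):
    return X.sum(axis=1)


x = [1.0, 2.0, 3.0]
zeros_background = np.zeros((1, 3))                       # a single all-zero reference row
two_row_background = np.array([[0.0, 0.0, 0.0], [0.0, 4.0, 0.0]])

print(expected_output_with_missing(toy_model, x, zeros_background, [True, False, True]))
print(expected_output_with_missing(toy_model, x, two_row_background, [True, False, True]))
```

With the all-zero background, hiding the second feature is the same as setting it to zero; with several background rows, the output is averaged over the values that feature takes in the background data.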

Note: for sparse case we accept any sparse matrix but convert to lil format for performance.

link : "identity" or "logit"

A generalized linear model link to connect the feature importance values to the model output.

Since the feature importance values, phi, sum up to the model output, it often makes sense to connect them to the output with a link function where link(output) = sum(phi).

If the model output is a probability then the LogitLink link function makes the feature importance values have log-odds units.
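A short numeric sketch of what log-odds units mean in practice. It reuses the two probabilities from the iris force plot earlier in this article (average output 0.3206, explained output 0.01); the way the total shift is split across 4 features is made up purely for illustration.

```python
import math


def logit(p):
    """Map a probability to log-odds."""
    return math.log(p / (1.0 - p))


def expit(z):
    """Inverse of logit: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


base_probability = 0.3206      # average model output
predicted_probability = 0.01   # output for the explained sample

# With a logit link, the SHAP values are in log-odds units and sum to this shift.
total_shift = logit(predicted_probability) - logit(base_probability)

# Hypothetical split of the shift across 4 features (illustration only).
phi = [0.4 * total_shift, 0.3 * total_shift, 0.2 * total_shift, 0.1 * total_shift]

recovered = expit(logit(base_probability) + sum(phi))
print(total_shift, recovered)
```

The contributions add up in log-odds space, and applying the inverse link to the base value plus their sum recovers the predicted probability.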

Examples

--------

See :ref:`Kernel Explainer Examples <kernel_explainer_examples>`

summary_plot

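Finding the shap.summary_plot documentation and reading the source

summary plot

The summary plot aggregates the Shapley values of all features over all samples. It shows the importance of each feature and whether each feature contributes positively or negatively to a sample's prediction.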

shap.summary_plot(shap_values, data[use_cols])

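Difference between summary plot and bar_plot

bar_plot shows the Shapley values of a single sample: its input is one sample row with all of its feature columns. The summary plot additionally aggregates over samples.

The corresponding function in the source is summary_legacy

Function description

Create a SHAP beeswarm plot, colored by feature values when they are provided.

Parameters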

----------

shap_values : numpy.array

For single output explanations this is a matrix of SHAP values (# samples x # features). For multi-output explanations this is a list of such matrices of SHAP values.

features : numpy.array or pandas.DataFrame or list

Matrix of feature values (# samples x # features) or a feature_names list as shorthand

feature_names : list

Names of the features (length # features)

max_display : int

How many top features to include in the plot (default is 20, or 7 for interaction plots)

plot_type : "dot" (default for single output), "bar" (default for multi-output), "violin", or "compact_dot".

What type of summary plot to produce. Note that "compact_dot" is only used for SHAP interaction values.

plot_size : "auto" (default), float, (float, float), or None

What size to make the plot. By default the size is auto-scaled based on the number of features that are being displayed. Passing a single float will cause each row to be that many inches high. Passing a pair of floats will scale the plot by that number of inches. If None is passed then the size of the current figure will be left unchanged.

def summary_legacy(shap_values, features=None, feature_names=None, max_display=None, plot_type=None,
                   color=None, axis_color="#333333", title=None, alpha=1, show=True, sort=True,
                   color_bar=True, plot_size="auto", layered_violin_max_num_bins=20, class_names=None,
                   class_inds=None,
                   color_bar_label=labels["FEATURE_VALUE"],
                   cmap=colors.red_blue,
                   # depreciated
                   auto_size_plot=None,
                   use_log_scale=False):

Default value assigned to feature_names:

if feature_names is None:
    feature_names = features.columns

Core code

Single-output prediction task

elif not multi_class and plot_type == "bar":
    feature_inds = feature_order[:max_display]  # indices of the top features
    y_pos = np.arange(len(feature_inds))  # evenly spaced y coordinates, one per bar
    global_shap_values = np.abs(shap_values).mean(0)
    pl.barh(y_pos, global_shap_values[feature_inds], 0.7, align='center', color=color)
    pl.yticks(y_pos, fontsize=13)  # axis tick positions
    pl.gca().set_yticklabels([feature_names[i] for i in feature_inds])  # axis tick labels

Binary / multi-class classification

elif multi_class and plot_type == "bar":
    if class_names is None:
        class_names = ["Class "+str(i) for i in range(len(shap_values))]
    feature_inds = feature_order[:max_display]
    y_pos = np.arange(len(feature_inds))
    left_pos = np.zeros(len(feature_inds))  # extra step: start position of each stacked bar segment
    if class_inds is None:
        class_inds = np.argsort([-np.abs(shap_values[i]).mean() for i in range(len(shap_values))])  # by default, order the classes by mean absolute SHAP value
    elif class_inds == "original":
        class_inds = range(len(shap_values))
    # start plotting
    for i, ind in enumerate(class_inds):
        global_shap_values = np.abs(shap_values[ind]).mean(0)  # average separately within each per-class array
        pl.barh(
            y_pos, global_shap_values[feature_inds], 0.7, left=left_pos, align='center',
            color=color(i), label=class_names[ind]
        )
        left_pos += global_shap_values[feature_inds]
    pl.yticks(y_pos, fontsize=13)
    pl.gca().set_yticklabels([feature_names[i] for i in feature_inds])
    pl.legend(frameon=False, fontsize=12)

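Here, barh draws horizontal bars at the given positions along the y axis.

gca (get current axes) returns the current axes, which is used to set the tick labels.

mean(0) means axis=0: it collapses the rows and averages each column (one column per feature), returning a vector with one entry per feature.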
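The stacking logic of the multi-class branch can be followed without drawing anything. The sketch below reproduces only the bookkeeping (class ordering, per-class mean absolute values, and the running left_pos offset) on made-up SHAP values for 2 classes, 3 samples and 2 features; the `segments` list is a hypothetical stand-in for the calls to barh.

```python
import numpy as np

# Hypothetical SHAP values: one (# samples x # features) matrix per class.
shap_values = [
    np.array([[0.1, -0.2], [0.3, 0.0], [-0.2, 0.4]]),    # class 0
    np.array([[0.6, -0.3], [-0.3, 0.3], [0.3, -0.6]]),   # class 1
]

# Classes with the larger mean absolute SHAP value come first.
class_inds = np.argsort([-np.abs(sv).mean() for sv in shap_values])

left_pos = np.zeros(2)   # where the next segment starts, one entry per feature
segments = []            # (class index, segment start, segment width) per class
for ind in class_inds:
    global_shap_values = np.abs(shap_values[ind]).mean(0)
    segments.append((int(ind), left_pos.copy(), global_shap_values))
    left_pos += global_shap_values

for class_index, start, width in segments:
    print(class_index, start, width)
```

Each class contributes one segment per feature, and the total bar length for a feature is the sum of the per-class mean absolute SHAP values.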

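In other words, the global SHAP value global_shap_values is computed as follows, where n is the number of samples and $\phi_j^{(i)}$ is the SHAP value of feature j for sample i:

$$ I_j=\frac{1}{n}\sum_{i=1}^{n}\left|\phi_j^{(i)}\right| $$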
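A small numeric check of this formula, using a made-up matrix of SHAP values with 3 samples (rows) and 3 features (columns). The descending ordering at the end is only for illustration and is not claimed to match the exact ordering code in the library.

```python
import numpy as np

# Hypothetical SHAP values: rows are samples, columns are features.
shap_values = np.array([[-0.1,  0.5,  0.0],
                        [ 0.2, -0.3,  0.1],
                        [-0.3,  0.4, -0.2]])

# I_j: mean of |phi_j^(i)| over the samples i, one value per feature j.
global_shap_values = np.abs(shap_values).mean(0)

# Features sorted from most to least important.
feature_order = np.argsort(-global_shap_values)

print(global_shap_values, feature_order)
```

Positive and negative contributions do not cancel because the absolute value is taken before averaging.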

Visualization

See this example, which covers many kinds of plots, although they are also built from an explainer:

https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html?highlight=Kernel%20Explainer%20Examples

Visualize the explanation of the first prediction. If you do not want to use JS, pass matplotlib=True.

# plot the SHAP values for the Setosa output of the first instance

shap.force_plot(explainer.expected_value[0], shap_values[0][0,:], X_test.iloc[0,:], link="logit")