Towards Robust Interpretability with Self-Explaining Neural Networks (SENN)

exploreandconquer

Last modified: 2023-12-29 23:10:11


First published: 2023-12-26 17:10:26

Article link: https://blog.csdn.net/rad1ant_up/article/details/135226521


This paper was published at NeurIPS 2018.

Paper: https://arxiv.org/abs/1806.07538

Code: GitHub - raj-shah/senn: Towards Robust Interpretability with Self-Explaining Neural Networks, Alvarez-Melis et al. 2018

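I. Overview

In the abstract, the authors point out that most existing interpretability methods for machine learning focus on a posteriori (post-hoc) explanations, while self-explaining models (inherent/intrinsic interpretability) have received little attention. The reason is that most work puts model performance first and only adds explanations afterwards. For a model that has already been trained, post-hoc explanation may also be the only option: interpretability was never designed into or imposed on the model, so one can only explain its outputs after the fact.

The paper proposes three desiderata for explanations:

Explicitness: whether the explanation is clear and unambiguous. In machine learning, explicitness means we can state clearly which features or data a model's decision or prediction is based on.
Faithfulness: whether the explanation is true to the model's internal workings or decision mechanism. An explanation should accurately reflect the model's behavior, including the data and features it actually relies on, without misleading, distorting, or introducing inaccurate information.
Stability: whether the explanation stays consistent under small changes to the input. Similar inputs should receive similar explanations, which is essential for explanations to be reliable.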

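In the introduction, the authors describe trends and challenges in interpretability methods:

The compromise between modeling capacity and interpretability: competitive performance usually requires deep models with high modeling capacity, but such models are internally complex and their decisions are hard to explain directly.
Post-hoc explanation methods: recent work focuses on generating a posteriori explanations for models built with performance in mind (priority: performance > interpretability), and these explanations are local, that is, specific to a single sample. Such methods have only limited access to the model's inner workings, for example through gradients or reverse propagation; others use oracle queries to estimate simpler models that capture the local input-output behavior. (Note: here the oracle is the trained model treated as a black box that can be queried for its output on any input. A simpler surrogate, for example a linear classifier fitted to the oracle's outputs around one sample, then approximates the model's local behavior and provides a degree of local interpretability.)
Challenges: the definition of locality (for example, how to define locality for structured data), identifiability, and computational cost.
Limits of point-wise interpretation: point-wise explanations are usually not compared across neighboring inputs (inputs that are highly similar), which leaves them unstable and often mutually contradictory: a tiny change in the input can produce a completely different explanation.

An excerpt from the paper: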

“In this work, we build highly complex interpretable models bottom up, maintaining the desirable characteristics of simple linear models in terms of features and coefficients, without limiting performance. For example, to ensure stability (and, therefore, interpretability), coefficients in our model vary slowly around each input, keeping it effectively a linear model, albeit locally. In other words, our model operates as a simple interpretable model locally (allowing for point-wise interpretation) but not globally (which would entail sacrificing capacity). We achieve this with a regularization scheme that ensures our model not only looks like a linear model, but (locally) behaves like one.”

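In short, SENN builds complex models up from simple linear models. The resulting model is not linear globally, but locally it retains the desirable (interpretable) properties of a linear model without limiting performance. To obtain stable explanations, the coefficients change only slowly as the input varies.

II. Method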

1. Generalized coefficients

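The method is a generalization of the linear model, which can be written as:

$f(x)=\sum_{i=1}^{n}\theta _{i}x_{i}+\theta _{0}$

Make the coefficients of the input $x$ functions of $x$ itself, and drop the bias term:

$f(x)=\theta (x)^{T}\cdot x$

Here $\theta (x)$ is chosen from a complex model class $\Theta$ and can be learned by a deep neural network. To preserve interpretability, however, two inputs that are close to each other should (at least locally) have coefficients that are close as well. More precisely, for all $x$ in a neighborhood of $x_{0}$ we want: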

$\triangledown _{x}f(x)\approx \theta (x_{0})$

“In other words, the model acts locally, around each $x_{0}$ , as a linear model with a vector of stable coefficients $\theta (x_{0})$ .”
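The local-linearity condition can be checked numerically on a toy model. The sketch below is only an illustration: the function `theta`, the constant `eps`, and the test point `x0` are made-up assumptions, not anything from the paper. It builds $f(x)=\theta (x)^{T}x$ with slowly varying coefficients and compares a finite-difference gradient of $f$ at $x_{0}$ with $\theta (x_{0})$.

```python
import math

def theta(x):
    """Hypothetical input-dependent coefficients that vary slowly with x."""
    eps = 0.01  # assumed: a small eps makes the coefficients change slowly
    return [1.0 + eps * math.tanh(x[1]), -2.0 + eps * math.tanh(x[0])]

def f(x):
    """Generalized linear model f(x) = theta(x)^T x (bias omitted)."""
    return sum(t * xi for t, xi in zip(theta(x), x))

def numerical_gradient(func, x, step=1e-5):
    """Central finite-difference estimate of the gradient of func at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += step
        down = list(x)
        down[i] -= step
        grad.append((func(up) - func(down)) / (2 * step))
    return grad

x0 = [0.5, -1.0]
grad_at_x0 = numerical_gradient(f, x0)
coeffs_at_x0 = theta(x0)

# Largest componentwise gap between the true gradient and theta(x0).
max_gap = max(abs(g - t) for g, t in zip(grad_at_x0, coeffs_at_x0))
print("gradient:", grad_at_x0)
print("theta(x0):", coeffs_at_x0)
print("max gap:", max_gap)
```

Because `theta` barely moves with `x` in this toy setup, the gradient and the coefficients nearly coincide; increasing `eps` widens the gap.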

2. Feature basis

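Traditional interpretable models treat each raw variable (each feature or pixel) as the basic unit of explanation. Humans, however, rarely understand an image pixel by pixel; we rely on higher-level features such as strokes. The authors call these more general higher-level features interpretable basis concepts.

"Formally, we consider functions $h(x): \mathcal{X}\rightarrow\mathcal{Z}\subset \mathbb{R}^{k}$ , where $\mathcal{Z}$ is some space of interpretable atoms."

That is, we construct a mapping $h(x)$ from the raw input (pixel) space $\mathcal{X}$ to an interpretable concept space $\mathcal{Z}$ of dimension $k$, where $k$ should be small so that explanations remain easy to understand.

“Alternatives for $h(\cdot )$ include: (i) subset aggregates of the input (e.g., with $h(x)=Ax$ for a boolean mask matrix A), (ii) predefined, pre-grounded feature extractors designed with expert knowledge (e.g., filters for image processing), (iii) prototype based concepts, e.g. $h_{i}(x)=\left \| x-z_{i} \right \|$ for some $z_{i}\in \mathcal{X}$ , or learnt representations with specific constraints to ensure grounding.” (To elaborate: prototype-based concepts typically take representative pieces of the training set that are useful for prediction as prototypes, for example a patch of a bird image containing the beak. These prototypes are trainable; at prediction time the current input is compared with the learned prototypes, and the comparison serves as the basis for the final decision. Later posts will cover several papers that use prototype-based concepts.)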

The generalized model is now:

$f(x)=\theta(x)^{T}h(x)=\sum_{i=1}^{k}\theta(x)_{i}h(x)_{i}$

The model thus explains its predictions in terms of the concepts $h(x)$ obtained through the mapping $h$, rather than the raw pixels $x$.
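To make the concept-based form concrete, here is a minimal sketch using the prototype-based option $h_{i}(x)=\left \| x-z_{i} \right \|$. The prototypes in `PROTOTYPES`, the relevance function `theta`, and the helper `explain` are all hypothetical choices for illustration; in SENN both $h$ and $\theta$ would be learned.

```python
import math

# Hypothetical prototypes z_i in a 2-D input space (k = 3 concepts).
PROTOTYPES = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]

def h(x):
    """Prototype-based concepts: h_i(x) = ||x - z_i||."""
    return [math.dist(x, z) for z in PROTOTYPES]

def theta(x):
    """Hypothetical relevance scores, one per concept, varying slowly with x."""
    s = math.tanh(sum(x))
    return [0.5 + 0.05 * s, -1.0, 0.25 - 0.05 * s]

def f(x):
    """f(x) = sum_i theta(x)_i * h(x)_i."""
    return sum(t * c for t, c in zip(theta(x), h(x)))

def explain(x):
    """Pairs (concept value, relevance score) behind the prediction f(x)."""
    return list(zip(h(x), theta(x)))

x = [1.0, 1.0]
print("prediction:", f(x))
for i, (concept, relevance) in enumerate(explain(x)):
    print(f"concept {i}: h={concept:.3f}, theta={relevance:.3f}")
```

The prediction is fully determined by the printed pairs, which is what lets the concepts and their scores act as the explanation.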

3. Further generalization

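Replace the sum with a more general aggregation function $g(z_{1},\ldots ,z_{k})$, where $z_{i}=\theta(x)_{i}h(x)_{i}$ is the product of $\theta$ and $h$ for concept $i$. When $g$ is a sum, we recover the generalized linear model introduced above.

To retain the interpretability of a linear model, $g$ should (1) be permutation invariant; (2) keep the effect of each $h(x)_{i}$ on the output isolated, avoiding multiplicative interactions between them; and (3) preserve the sign and relative magnitude of each $\theta(x)_{i}$: the sign tells whether a concept contributes positively or negatively to the output, and the relative magnitude gives the importance ranking of the concepts.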
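A small experiment shows why these requirements favor additive aggregators. The sketch below is illustrative only: `g_sum`, `g_product`, and the two helper checks are hypothetical names, and the product is a deliberately bad aggregator chosen as a counter-example. Both are permutation invariant, but only the sum keeps the effect of one term independent of the others.

```python
from itertools import permutations

def g_sum(z):
    """Additive aggregator: recovers the generalized linear model."""
    return sum(z)

def g_product(z):
    """Counter-example: multiplicative interactions entangle the z_i."""
    out = 1.0
    for zi in z:
        out *= zi
    return out

def is_permutation_invariant(g, z, tol=1e-12):
    """True if g gives the same value for every ordering of z."""
    base = g(z)
    return all(abs(g(list(p)) - base) <= tol for p in permutations(z))

def effect_of_term(g, z, i, delta=1.0):
    """Change in g when z_i increases by delta, all other terms fixed."""
    bumped = list(z)
    bumped[i] += delta
    return g(bumped) - g(z)

z_a = [1.0, 2.0, 3.0]
z_b = [1.0, -2.0, 3.0]  # same as z_a except for the sign of the second term

for name, g in [("sum", g_sum), ("product", g_product)]:
    print(name,
          "| permutation invariant:", is_permutation_invariant(g, z_a),
          "| effect of z_0 in z_a:", effect_of_term(g, z_a, 0),
          "| effect of z_0 in z_b:", effect_of_term(g, z_b, 0))
```

With the sum, raising the first term always raises the output by the same amount. With the product, the same change flips sign depending on the other terms, so the sign of a single relevance score no longer tells us the direction of its influence.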

4. Self-explaining models

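The paper formalizes the interpretable model discussed above in three definitions (Definitions 3.1 to 3.3 of the original paper). Let us go through them one by one.

First, recall Lipschitz continuity:

A function $f:\mathbb{R}^{n}\rightarrow \mathbb{R}^{m}$ is Lipschitz continuous if there exists a constant $L$ such that for all $x,y$:

$\left \| f(x)-f(y) \right \|\leqslant L\left \| x-y \right \|$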

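Definition 3.1 only replaces $\left \| x-y \right \|$ on the right-hand side with $\left \| h(x)-h(y) \right \|$. Recall that $h(\cdot )$ maps the raw pixel space $\mathcal{X}$ to the interpretable concept space $\mathcal{Z}$, so Definition 3.1 requires a Lipschitz-type bound, measured by distances in concept space, to hold globally for every pair of points. Note: real-world data usually lies on an irregular low-dimensional manifold, so imposing one uniform bound everywhere is too strict, which motivates Definition 3.2.
Definition 3.2 builds on 3.1 by restricting $x$ to a $\delta$-neighborhood of $x_{0}$, turning the global condition into a local one: the bound only needs to hold locally. Note: both $L$ and $\delta$ depend on $x_{0}$, which means the “Lipschitz constant can vary throughout the space.” This leads to the definition of the proposed model.
Definition 3.3: a model is a self-explaining prediction model if it satisfies the following conditions:
- P1) $g$ is monotone and completely additively separable;
- P2) the partial derivative of $g$ with respect to each $z_{i}$ is non-negative;
- P3) $\theta$ is locally difference bounded by $h$. Note: if $h$ is the identity $h(x)=x$, this reduces to the ordinary (local) Lipschitz condition. Recall that $h(\cdot )$ here is the mapping from pixel space to the interpretable concept space;
- P4) $h(\cdot )$ really maps the raw pixel space to an interpretable space, i.e. every dimension $h_{i}(x)$ is interpretable;
- P5) $k$ is small, i.e. the interpretable space is low-dimensional.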
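The local condition in P3 can be probed empirically by sampling points in a $\delta$-ball around $x_{0}$ and recording the largest ratio $\left \| \theta (x)-\theta (x_{0}) \right \| / \left \| h(x)-h(x_{0}) \right \|$. The sketch below rests on several assumptions: `h` is taken to be the identity, `theta_smooth` and `theta_jumpy` are made-up coefficient functions, and random sampling only gives a lower bound on the true local constant.

```python
import math
import random

def h(x):
    """Concept encoder; the identity is assumed here to keep the sketch small."""
    return list(x)

def theta_smooth(x):
    """Hypothetical coefficients that change slowly with x."""
    return [1.0 + 0.1 * math.sin(x[0]), -1.0 + 0.1 * math.cos(x[1])]

def theta_jumpy(x):
    """Hypothetical coefficients that oscillate quickly with x."""
    return [1.0 + 0.1 * math.sin(50 * x[0]), -1.0 + 0.1 * math.cos(50 * x[1])]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def diff(a, b):
    return [ai - bi for ai, bi in zip(a, b)]

def local_difference_ratio(theta, x0, delta=0.05, n_samples=500, seed=0):
    """Largest sampled ||theta(x)-theta(x0)|| / ||h(x)-h(x0)|| with ||x-x0|| <= delta."""
    rng = random.Random(seed)
    radius = delta / math.sqrt(len(x0))  # per-coordinate bound keeps x inside the ball
    theta0, h0 = theta(x0), h(x0)
    worst = 0.0
    for _ in range(n_samples):
        x = [xi + rng.uniform(-radius, radius) for xi in x0]
        denom = norm(diff(h(x), h0))
        if denom < 1e-12:  # skip samples that coincide with x0 in concept space
            continue
        worst = max(worst, norm(diff(theta(x), theta0)) / denom)
    return worst

x0 = [0.0, 0.0]
smooth_ratio = local_difference_ratio(theta_smooth, x0)
jumpy_ratio = local_difference_ratio(theta_jumpy, x0)
print("smooth theta, estimated local constant:", smooth_ratio)
print("jumpy theta, estimated local constant:", jumpy_ratio)
```

A small estimate means nearby inputs receive similar relevance scores; the rapidly oscillating `theta_jumpy` yields a much larger value and would give unstable explanations.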

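Finally, for an input $x$, the explanation of $f(x)$ is the set $\varepsilon _{f}(x)\equiv \left \{ (h_{i}(x),\theta _{i}(x))\right \}_{i=1}^{k}$, whose elements pair each basis concept with its influence score.

When $\theta$ (and possibly $h$) is realized by a neural network, the model is called a self-explaining neural network (SENN).

As discussed above, $\theta$ should be stable with respect to $h$: a small change in the input $x$ should not cause a large change in the explanation $\varepsilon _{f}(x)$.

Write $f$ as a function of $h(x)$:

$f(x)=g(h(x))$

Let $z=h(x)$. By the chain rule: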

$\triangledown _{x}f=\triangledown _{z}f\cdot J_{x}^{h}$

where $J_{x}^{h}$ is the Jacobian of $h$ with respect to $x$.

Given a point $x_{0}$ with concepts $h(x_{0})$, the corresponding influence scores $\theta(x_{0})$ should satisfy:

$\theta (x_{0})\approx \triangledown _{z}f$

In other words, we want $\theta(x_{0})$ to behave as the derivative of $f$ with respect to the concept vector $h(x)$ around $x_{0}$, that is, the gradient of $f$ with respect to the concepts, evaluated at $x_{0}$. This gradient measures how strongly each concept in $h(x_{0})$ influences the output of $f$. With this property, a small change in the input does not change the explanation much.
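As a worked special case (a sketch under the assumption that $g$ is a plain sum, so $f(x)=\theta (x)^{T}h(x)$; it is not a derivation quoted from the paper), the product rule shows exactly what stands between $\theta$ and the derivative with respect to the concepts. The symbol $J_{x}^{\theta }$, the Jacobian of $\theta$ with respect to $x$, is introduced only for this illustration.

```latex
% Assumption: g is a plain sum, so f(x) = \theta(x)^T h(x).
% Gradients are written as row vectors.
\begin{aligned}
\nabla_x f(x)
  &= \theta(x)^{T} J_x^{h}(x) + h(x)^{T} J_x^{\theta}(x) \\
\nabla_x f(x) - \theta(x)^{T} J_x^{h}(x)
  &= h(x)^{T} J_x^{\theta}(x)
\end{aligned}
```

So in this case $\theta$ plays the role of $\triangledown _{z}f$ in the chain rule precisely when the term $h(x)^{T}J_{x}^{\theta }(x)$, which comes from $\theta$ itself changing with $x$, is negligible. Slowly varying coefficients are what make that term small.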

This condition is hard to enforce directly in practice, so we substitute $\theta (x_{0})\approx \triangledown _{z}f$ into $\triangledown _{x}f=\triangledown _{z}f\cdot J_{x}^{h}$, which gives the following proxy condition:

$\mathcal{L}_{\theta }(f(x)):=\left \| \triangledown _{x}f(x)-\theta (x)^{\mathrm{T}}\cdot J_{x}^{h} \right \|\approx 0$

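All three terms in $\mathcal{L}_{\theta }(f(x))$ can be computed, and when $h$ and $\theta$ are differentiable architectures we can also obtain the gradient of $\mathcal{L}_{\theta }(f(x))$, so it can be added to the training objective as a regularization term. The regularizer trades off prediction performance against the local stability of the coefficients (and hence interpretability). The final loss is $\mathcal{L}_{y}(f(x),y)+\lambda \mathcal{L}_\theta(f(x))$.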
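The regularizer can be sketched without any deep learning framework by replacing automatic differentiation with finite differences. Everything below is an assumption made for illustration: the concept encoder `h`, the two coefficient functions, the squared-error choice for $\mathcal{L}_{y}$, the sum aggregator, and the value of `lam`. A real implementation would use the autodiff facilities of the training framework instead.

```python
import math

def h(x):
    """Hypothetical differentiable concept encoder with k = 2 concepts."""
    return [x[0] + x[1], x[0] * x[1]]

def theta_const(x):
    """Constant coefficients: the model is exactly linear in the concepts."""
    return [2.0, -1.0]

def theta_varying(x):
    """Coefficients that change quickly with x (expected to be penalized)."""
    return [2.0 + 3.0 * math.sin(5 * x[0]), -1.0 + 3.0 * math.cos(5 * x[1])]

def make_model(theta):
    """f(x) = theta(x)^T h(x), i.e. g is assumed to be a plain sum."""
    def f(x):
        return sum(t * c for t, c in zip(theta(x), h(x)))
    return f

def jacobian(vec_func, x, step=1e-5):
    """Central-difference Jacobian; rows index outputs, columns index inputs."""
    columns = []
    for j in range(len(x)):
        up = list(x)
        up[j] += step
        down = list(x)
        down[j] -= step
        f_up, f_down = vec_func(up), vec_func(down)
        columns.append([(a - b) / (2 * step) for a, b in zip(f_up, f_down)])
    return [list(row) for row in zip(*columns)]

def robustness_penalty(theta, x):
    """|| grad_x f(x) - theta(x)^T J_x^h(x) ||, the proxy condition as a penalty."""
    f = make_model(theta)
    grad_f = jacobian(lambda v: [f(v)], x)[0]
    J = jacobian(h, x)  # shape k x n
    t = theta(x)
    proxy = [sum(t[i] * J[i][j] for i in range(len(t))) for j in range(len(x))]
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(grad_f, proxy)))

def total_loss(theta, x, y, lam=0.1):
    """Prediction loss (squared error, an assumed choice) plus lam * penalty."""
    f = make_model(theta)
    return (f(x) - y) ** 2 + lam * robustness_penalty(theta, x)

x = [0.3, -0.7]
penalty_const = robustness_penalty(theta_const, x)
penalty_varying = robustness_penalty(theta_varying, x)
print("penalty, constant theta:", penalty_const)
print("penalty, fast-varying theta:", penalty_varying)
print("total loss, fast-varying theta:", total_loss(theta_varying, x, y=0.0))
```

With constant coefficients the penalty vanishes up to numerical error, while rapidly varying coefficients are penalized. Increasing `lam` pushes training toward the first regime at some cost in flexibility.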