Faithful Vision-Language Interpretation via Concept Bottleneck Models (FVLC)

Link to this post: https://blog.csdn.net/Rad1ant_up/article/details/136141288

This paper was published at ICLR 2024.

Paper link: https://openreview.net/attachment?id=rp0EdI8X4e&name=pdf

I. Overview

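Because the ICLR 2024 decisions were released only recently, the official site has not yet been updated with the author list and still shows the paper as under review, although it has in fact been accepted.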

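Interestingly, right after the abstract the authors quote the American historian Daniel J. Boorstin, roughly: "What keeps us from discovering new knowledge is not ignorance but the illusion of knowledge." To some extent this quote also points to why and how interpretable deep learning should be developed. Who is really under the illusion: the uninterpretable black-box model, we humans, or both? Is the knowledge we humans rely on necessarily correct and complete, and could it limit our discovery of new knowledge? These questions are worth thinking about.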

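As the title and abstract make clear, this paper is another member of the CBM family. Concept bottleneck models abandon the traditional end-to-end approach and insert a Concept Bottleneck Layer (CBL) before the final prediction. The CBL predicts human-understandable concepts, and the final prediction is then derived from those concepts, which is what makes the model interpretable. This design has two well-known drawbacks. First, because the input information is compressed, the model faces an accuracy-interpretability trade-off. Second, the concepts in the bottleneck layer must be defined by humans and require extensive manual annotation, which severely limits the use of CBMs in practice.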

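With the development of multimodal large language models in recent years, the annotation problem has been alleviated to some extent. Label-Free CBM uses pre-trained GPT-3 to generate concepts automatically and uses CLIP-Dissect to align the image features extracted by the network with the generated concepts, which removes the manual annotation step. However, input images and text are easily corrupted by noise, and relying on pre-trained models makes the pipeline unstable. The authors therefore build on Label-Free CBM and propose a more stable model: Faithful Vision-Language Concept (FVLC) models.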

The authors state that a faithful concept should have four properties:

A faithful concept should agree with the original concept as much as possible: Significant overlap between the top-k indices of the “faithful concept” and the original concept, ensuring interpretability.
It should resist noise and perturbations during concept generation: Inherent stability, with the concept vector remaining robust against random noise and perturbations during LLM concept set generation.
Its predictions should stay consistent with those of vanilla CBMs: A prediction distribution close to that of the vanilla CBMs, preserving its outstanding performance.
Its output distribution should be stable: Stable output distribution, remaining robust during self-supervised learning and LLM concept set generation, even in the presence of perturbations.

II. Method

Before describing the proposed method in detail, let us review some background.

1. Concept Bottleneck Models (CBMs)

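First, concept bottleneck models (CBMs), which earlier posts on this blog have covered many times. If you are familiar with CBMs, you will know their two main drawbacks: 1. performance loss caused by incomplete extraction of the original data features; 2. the need for extensive manual annotation. Many papers have proposed potential remedies for these two problems, such as SENN, PCBM, and Label-Free CBM.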

To recap the CBM notation: We consider a classification task with a concept set denoted as $\mathcal{C}=\left \{ p_1,...,p_k \right \}$ and a training dataset represented as $\left \{ (x_i,y_i,c_i) \right \}_{i=1}^N$, where for $i \in [N]$, $x_i \in \mathbb{R}^d$ is the feature vector, $y_i \in \mathbb{R}^{d_z}$ denotes the label, where $d_z$ corresponds to the number of classes, and $c_i \in \mathbb{R}^k$ denotes the concept vector whose $j$-th entry represents the weight of the concept $p_j$. In CBMs, we aim to learn two representations, one transforms from the input space to the concept space, which is represented by $g:\mathbb{R}^d\rightarrow \mathbb{R}^{k}$. The other one maps from the concept space to the prediction space, which can be denoted by $f:\mathbb{R}^k\rightarrow \mathbb{R}^{d_z}$. For any input $x$, we aim to make its predicted concept vector $\hat{c}=g(x)$ and prediction $\hat{y}=f(g(x))$ to be close to its underlying ones.
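The composition $\hat{y}=f(g(x))$ can be illustrated with a minimal sketch. It assumes both $g$ and $f$ are plain linear maps; the weights, the sizes, and the helper `matvec` are made up for illustration and are not taken from the paper.

```python
def matvec(W, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def g(x, W_c):
    """Concept predictor g: R^d -> R^k (assumed linear here)."""
    return matvec(W_c, x)

def f(c, W_F):
    """Label predictor f: R^k -> R^{d_z} (assumed linear here)."""
    return matvec(W_F, c)

# Toy sizes: d = 3 input features, k = 2 concepts, d_z = 2 classes.
W_c = [[1.0, 0.0, 1.0],
       [0.0, 2.0, 0.0]]
W_F = [[1.0, -1.0],
       [0.5, 0.5]]

x = [1.0, 2.0, 3.0]
c_hat = g(x, W_c)      # predicted concept vector
y_hat = f(c_hat, W_F)  # class scores computed only from the concepts
print(c_hat, y_hat)
```

The point of the bottleneck is visible in the last two lines: `y_hat` depends on `x` only through `c_hat`, so every prediction can be traced back to concept activations.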

2. Label-free CBMs

Label-free CBMs consist of four steps:

Step 1: Concept set creation and filtering.

GPT-3 is asked a series of questions and the answers are filtered to produce the concept set $\mathcal{C}$.

Step 2 and 3: Learning the Concept Bottleneck Layer (CBL).

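This step learns the projection weights $W_c$ from the feature space to the concept space. First, CLIP is used to build the concept activation matrix $M_{i,j}=E_I(x_i)\cdot E_T(p_j)$, where $E_I$ and $E_T$ are CLIP's image encoder and text encoder. The rows of $M$ correspond to images and the columns to concepts, and each entry (an inner product) indicates how strongly concept $j$ is present in image $i$. $W_c$ is a $k \times d$ matrix that maps the feature space to the concept space and so parameterizes $g$; the final prediction is $y(x,\boldsymbol{c})=W_{F}g(x)$. Let $l \in [k]$ index a neuron of interest in the concept layer. Its activation pattern over all images is $q_l=\left[g_l(x_1),\cdots,g_l(x_N)\right]$. The objective is to align the $i$-th neuron with the $i$-th concept as closely as possible: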
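As a rough illustration of how the concept activation matrix is assembled, the sketch below uses small hand-made vectors as stand-ins for the CLIP image and text embeddings. The embeddings and the function `concept_activation_matrix` are hypothetical; a real pipeline would obtain the vectors from CLIP's encoders.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def concept_activation_matrix(image_embs, text_embs):
    """M[i][j] = E_I(x_i) . E_T(p_j): rows are images, columns are concepts."""
    return [[dot(img, txt) for txt in text_embs] for img in image_embs]

# Hypothetical embeddings standing in for CLIP outputs.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 images
text_embs = [[1.0, 0.0], [0.0, 2.0]]               # k = 2 concepts

M = concept_activation_matrix(image_embs, text_embs)
for row in M:
    print(row)
```

Column `j` of `M` is the target pattern that the `j`-th neuron of the concept layer is trained to imitate.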

$\mathcal{L}(W_c)=\sum_{i=1}^k-\operatorname{sim}(p_i,q_i)=\sum_{i=1}^k-\frac{\bar{q_i}^3\cdot\bar{M_{:,i}}^3}{||\bar{q_i}^3||_2||\bar{M_{:,i}}^3||_2}$
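The similarity in this loss can be sketched as follows. The sketch assumes, following the Label-free CBM paper, that the bar denotes standardization to mean 0 and standard deviation 1 and that the cube is taken element-wise; the function names are illustrative only.

```python
import math

def standardize(v):
    """Shift to mean 0 and scale to standard deviation 1 (v must not be constant)."""
    mean = sum(v) / len(v)
    std = math.sqrt(sum((a - mean) ** 2 for a in v) / len(v))
    return [(a - mean) / std for a in v]

def cos_cubed(q, m):
    """Cosine similarity between the element-wise cubes of the standardized vectors."""
    q3 = [a ** 3 for a in standardize(q)]
    m3 = [a ** 3 for a in standardize(m)]
    num = sum(a * b for a, b in zip(q3, m3))
    den = math.sqrt(sum(a * a for a in q3)) * math.sqrt(sum(b * b for b in m3))
    return num / den

def cbl_loss(Q, M_cols):
    """Sum over concepts of -sim(p_i, q_i); Q[i] and M_cols[i] both have length N."""
    return sum(-cos_cubed(q, m) for q, m in zip(Q, M_cols))

q = [1.0, 2.0, 3.0, 4.0]        # activations of one neuron over N = 4 images
m_same = [2.0, 4.0, 6.0, 8.0]   # CLIP scores with the same pattern
m_flip = [4.0, 3.0, 2.0, 1.0]   # CLIP scores with the opposite pattern
print(cos_cubed(q, m_same), cos_cubed(q, m_flip))
```

Because of the standardization, the similarity ignores the scale and offset of the activations, and the cube puts more weight on the images where a concept is strongly present or absent.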

Step 4: After successfully learning the Concept Bottleneck Layer, the next step involves training the final predictor using the fully connected layer.

This step learns the mapping from concepts to classes, $W_F\in\mathbb{R}^{d_z\times k}$, with $y(x,\boldsymbol{c})=W_{F}g(x)$.

Next comes FVLC, the method proposed in the paper.

3. Faithful Vision-Language Concept

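Because the concept set in Label-free CBMs is generated by GPT-3, it can introduce instability and perturbations. Moreover, not only the concepts but also the input images are inevitably at risk of being perturbed, so in these situations it matters even more that the concepts stay stable. This is what the paper calls a "faithful concept".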

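So what is a faithful concept? From the above, a faithful concept must keep the concept vector stable when the input or the concept set itself is perturbed. This needs a precise definition. (The definition images are taken from the original paper.)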

Definition 1:

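The top-k overlap $V_k(x,x^{\prime})$ measures how much two concept vectors agree on their top $k$ concepts once the entries are sorted by activation value in descending order.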

This definition is introduced so that faithful concepts can later be compared with the original concepts.

(Note: $T_k$ is a set of concept indices, not the concepts themselves. So after the concepts are perturbed, for a stable and faithful concept the index set $T_k$ should not change much, even if the concept values themselves do.)
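A small sketch of the top-k overlap follows. It assumes the overlap is the size of the intersection of the two index sets divided by $k$, which matches the later remark that a value of 1 means the two top-k sets are identical; the exact normalization in the paper may differ.

```python
def top_k_indices(v, k):
    """T_k(v): the set of indices of the k largest entries of v."""
    order = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
    return set(order[:k])

def top_k_overlap(v, v_prime, k):
    """V_k(v, v') = |T_k(v) ∩ T_k(v')| / k (assumed normalization), in [0, 1]."""
    return len(top_k_indices(v, k) & top_k_indices(v_prime, k)) / k

c = [0.9, 0.1, 0.7, 0.3, 0.5]
c_perturbed = [0.7, 0.1, 0.9, 0.3, 0.2]  # values change, but the top-2 set does not

print(top_k_overlap(c, c_perturbed, 2))
print(top_k_overlap(c, c_perturbed, 3))
```

With $k=2$ both vectors select indices 0 and 2, so the overlap is complete even though the activation values differ. With $k=3$ the third index differs (4 versus 3), so only two of the three indices are shared.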

Definition 2:

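Similarity of Explanation: the top-$k_1$ overlap between the faithful concept $\tilde{g}(x)$ and the original concept $g(x)$ is at least $\beta_1$. Clearly $\beta_1=1$ means the two have exactly the same top-$k_1$ concepts. This condition keeps the faithful concept as consistent as possible with the original concept on the top $k_1$ concepts;
Stability of Explanation: the top-$k_2$ overlap between the perturbed concept $\tilde{g}(x)+\rho$ and the unperturbed concept $\tilde{g}(x)$ is at least $\beta_2$. Clearly $\beta_2=1$ means the two top-$k_2$ sets are identical. This condition ensures that the concept vector does not change much after the perturbation (more precisely, that the ranking of the concepts stays as close as possible to the original one);
Closeness of Prediction: the predictions produced from the faithful concept and from the original concept should agree as much as possible. $D$ is some distance measure such as the KL divergence, and $\alpha_1=0$ means the two predictions are identical;
Stability of Prediction: the prediction should not change much after the faithful concept is perturbed by $\delta$, and $\alpha_2=0$ means the two predictions are identical.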

Overall, we can say:

For any given $x$, $\tilde{c}=\tilde{\boldsymbol{g}}(x)$ is a $(D,R,\alpha,\beta,k_1,k_2)$-Faithful Vision-Language Concept.
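Written out, the four conditions of Definition 2 read roughly as follows. This is a reconstruction from the descriptions above and from the terms of the paper's objective, not a verbatim copy of the paper. In particular, the quantifiers over $\rho$ and $\delta$, and the reading of $R$, $\alpha$, $\beta$ as the pairs $(R_1,R_2)$, $(\alpha_1,\alpha_2)$, $(\beta_1,\beta_2)$, are assumptions.

$\begin{aligned}&\text{(i) Similarity of Explanation:} && V_{k_1}(\tilde{\boldsymbol{g}}(x),\boldsymbol{g}(x))\geq\beta_1,\\&\text{(ii) Stability of Explanation:} && V_{k_2}(\tilde{\boldsymbol{g}}(x),\tilde{\boldsymbol{g}}(x)+\rho)\geq\beta_2\quad\text{for all }||\rho||\leq R_1,\\&\text{(iii) Closeness of Prediction:} && D(y(x,\tilde{\boldsymbol{c}}),y(x,\boldsymbol{c}))\leq\alpha_1,\\&\text{(iv) Stability of Prediction:} && D(y(x,\tilde{\boldsymbol{c}}),y(x,\tilde{\boldsymbol{c}}+\delta))\leq\alpha_2\quad\text{for all }||\delta||\leq R_2.\end{aligned}$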

4. FVLC Framework

The writing in this section of the paper is a bit disorganized, so just try to get the gist.

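Sensitivity: in addition to the similarity and stability discussed above, sensitivity means that the prediction should change noticeably when a key concept is excluded, while remaining stable when that concept is only slightly perturbed.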

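Returning to Definition 2, the ideal values of the parameters are:

Top-k approach: $\beta_1$ should be as close to 1 as possible;

Stability: $R_1$ should be as large as possible, and $\beta_2$ as close to 1 as possible;

Prediction: $R_2$ should be as large as possible, and $\alpha_1,\alpha_2$ as close to 0 as possible.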

Overall schematic of the network:

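The overall procedure is essentially the same as Label-free CBM, except that $\mathcal{L}_1,\mathcal{L}_2,\mathcal{L}_3,\mathcal{L}_4$ are used to constrain the network so that it produces faithful concepts. The overall objective is: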

$\begin{aligned}&\min_{\tilde{W}_c}\mathbb{E}_x[\lambda_1D(y(x,\tilde{\boldsymbol{c}}),y(x,\boldsymbol{c}))-\lambda_2V_{k_1}(\tilde{\boldsymbol{g}}(x),\boldsymbol{g}(x))+\lambda_3\max_{||\delta||\leq R_2}D(y(x,\tilde{\boldsymbol{c}}),y(x,\tilde{\boldsymbol{c}}+\delta))\\&{-\lambda_4\max_{||\rho||\leq R_1}V_{k_2}(\tilde{\boldsymbol{g}}(x),\tilde{\boldsymbol{g}}(x)+\rho)}],\end{aligned}$

The four terms $\mathcal{L}_1,\mathcal{L}_2,\mathcal{L}_3,\mathcal{L}_4$ correspond to prediction closeness, concept similarity, prediction stability, and concept stability, respectively.
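To make the inner maximization in the prediction stability term more concrete, here is a rough sketch that estimates $\max_{||\delta||\leq R_2}D(\cdot,\cdot)$ by random sampling inside the $\ell_2$ ball. Random search is a stand-in chosen for brevity and is not the optimizer used in the paper. The choice of KL divergence for $D$, the softmax linear predictor, and all names and values below are assumptions.

```python
import math
import random

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(a - m) for a in z]
    s = sum(e)
    return [a / s for a in e]

def kl(p, q):
    """KL divergence KL(p || q) for strictly positive q."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def predict(c, W_F):
    """Class distribution from a concept vector (assumed softmax of W_F c)."""
    return softmax([sum(w * a for w, a in zip(row, c)) for row in W_F])

def random_in_ball(dim, radius, rng):
    """Random vector whose l2 norm is at most `radius`."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in v))
    scale = radius * rng.random() / norm
    return [a * scale for a in v]

def worst_case_divergence(c, W_F, radius, n_samples=200, seed=0):
    """Estimate max over ||delta|| <= radius of KL(y(c) || y(c + delta)) by sampling."""
    rng = random.Random(seed)
    base = predict(c, W_F)
    worst = 0.0
    for _ in range(n_samples):
        delta = random_in_ball(len(c), radius, rng)
        shifted = predict([a + d for a, d in zip(c, delta)], W_F)
        worst = max(worst, kl(base, shifted))
    return worst

W_F = [[1.0, 0.0], [0.0, 1.0]]  # toy concept-to-class weights
c = [1.0, 0.5]                  # toy concept vector
print(worst_case_divergence(c, W_F, radius=0.5))
```

Sampling only gives a lower bound on the true maximum, so a gradient-based inner loop would normally find a stronger perturbation. The sketch is only meant to show what quantity the term penalizes.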

This optimization problem can be solved with PSGD, but because the top-k overlap function $V_k$ is non-differentiable, it has to be replaced by a surrogate loss.

Specifically, only the top $k$ entries are optimized, and an $\ell_{1}\operatorname{-norm}$ is simply used to pull them as close together as possible.

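(However, moving from set intersection to the "pointwise matching" of the $\ell_{1}\operatorname{-norm}$ makes the loss differentiable but also constrains the ranking of the concepts. With the original intersection, it is enough for the concepts to appear in the top-k, regardless of order. For example, if the top-k concept indices are {1,3,5,7} before the perturbation and {3,1,7,5} after it, the intersection says the two "coincide completely", whereas the $\ell_{1}\operatorname{-norm}$ does not.)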
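The difference between the set-based overlap and an $\ell_1$ surrogate can be seen in a small sketch. The surrogate below, a mean absolute difference over the top-k entries of the reference vector, is an assumed form chosen for illustration, and the paper's exact surrogate may be defined differently.

```python
def top_k_order(v, k):
    """Indices of the k largest entries of v, in descending order of value."""
    order = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
    return order[:k]

def overlap(v, v_ref, k):
    """Set-based top-k overlap: ignores the order inside the top-k."""
    return len(set(top_k_order(v, k)) & set(top_k_order(v_ref, k))) / k

def l1_top_k(v, v_ref, k):
    """Assumed surrogate: mean |v_i - v_ref_i| over the top-k indices of v_ref."""
    idx = top_k_order(v_ref, k)
    return sum(abs(v[i] - v_ref[i]) for i in idx) / k

g_orig = [4.0, 1.0, 3.0, 0.0]
g_swapped = [3.0, 1.0, 4.0, 0.0]  # same top-2 set, but the ranking is swapped

print(overlap(g_swapped, g_orig, 2))   # the sets coincide
print(l1_top_k(g_swapped, g_orig, 2))  # the values do not
```

Here the top-2 index set is {0, 2} for both vectors, so the overlap is 1, yet the $\ell_1$ surrogate is nonzero because the two activations traded places. Minimizing the surrogate therefore also pushes the values, and with them the ranking, toward the reference.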

The relaxed objective therefore becomes:

$\begin{aligned}&\min_{\tilde{W}_c}\mathbb{E}_x\Big[D(y(x,\tilde{\boldsymbol{c}}),y(x,\boldsymbol{c}))+\lambda_1\underbrace{\mathcal{L}_{k_1}(\tilde{\boldsymbol{g}}(x),\boldsymbol{g}(x))}_{\mathcal{L}_2}+\lambda_2\underbrace{\max_{||\delta||\leq R_2}D(y(x,\tilde{\boldsymbol{c}}),y(x,\tilde{\boldsymbol{c}}+\boldsymbol{\delta}))}_{\mathcal{L}_3}\\&+\lambda_3\underbrace{\max_{||\rho||\leq R_1}\mathcal{L}_{k_2}(\tilde{\boldsymbol{g}}(x),\tilde{\boldsymbol{g}}(x)+\boldsymbol{\rho})}_{\mathcal{L}_4}\Big].\end{aligned}$