Explaining Big Models

Latest recommended article published 2024-07-28 20:53:06

YingJingh

Category: Paper Notes. Tags: artificial intelligence, natural language processing, language models

Article link: https://blog.csdn.net/Hekena/article/details/130789548

Original paper: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

Explaining large language models

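Large language models typically have hundreds of millions of parameters or more, yet what all of those parameters mean is still unclear. Interpretability is also a hurdle that big models have to clear before they can be applied widely.
OpenAI: explaining GPT-2 with GPT-4; language models can explain neurons in language models: https://hub.baai.ac.cn/view/26749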

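1. Automated interpretation of large models (OpenAI's work)

1.1 Three steps:

Use GPT-4 to explain the neuron's activations (that is, which tokens actually activate the neuron on real inputs). (GPT-4's explanations show which domains a neuron mainly responds to.)

Explain: Generate an explanation of the neuron’s behavior by showing the explainer model (token, activation) pairs from the neuron’s responses to text excerpts

Conditioned on the explanation from the previous step, use GPT-4 to simulate the activations.

Simulate: Use the simulator model to simulate the neuron’s activations based on the explanation

Compare the real and the simulated activations to score the explanation GPT-4 produced.

Score: Automatically score the explanation based on how well the simulated activations match the real activations

This technique lets us use GPT-4 to define and automatically measure a quantitative notion of interpretability, which we call the "explanation score": a measure of a language model's ability to compress and reconstruct neuron activations using natural language.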

1.2 Models involved:

The subject model is the model that we are attempting to interpret. (The model being explained.)
The explainer model comes up with hypotheses about subject model behavior. (The model used to explain the subject model.)
The simulator model makes predictions based on the hypothesis. Based on how well the predictions match reality, we can judge the quality of the hypothesis. The simulator model should interpret hypotheses the same way an idealized human would. (It checks how well the explainer model's explanation holds up when read the way a human would read it.)
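
To make the division of labor between the three models concrete, here is a minimal sketch of the explain / simulate / score loop. It is only an illustration: `explain_simulate_score`, `Record`, and the three callables are hypothetical names, and the callables stand in for the explainer model, the simulator model, and a scoring function. Passing separate excerpt sets for explaining and for scoring is an assumption of this sketch.

```python
from typing import Callable, List, Sequence, Tuple

# One text excerpt: its tokens plus the subject model's real activations
# for the neuron under study (hypothetical record layout).
Record = Tuple[List[str], List[float]]


def explain_simulate_score(
    explain_records: Sequence[Record],
    score_records: Sequence[Record],
    explain: Callable[[Sequence[Record]], str],
    simulate: Callable[[str, List[str]], List[float]],
    score: Callable[[List[float], List[float]], float],
) -> Tuple[str, float]:
    """Run the three steps once for a single neuron."""
    # Step 1: the explainer sees (token, activation) pairs and returns text.
    explanation = explain(explain_records)

    # Step 2: the simulator predicts activations from the explanation alone.
    real: List[float] = []
    simulated: List[float] = []
    for tokens, activations in score_records:
        real.extend(activations)
        simulated.extend(simulate(explanation, tokens))

    # Step 3: compare simulated and real activations.
    return explanation, score(real, simulated)
```

In practice `explain` and `simulate` would wrap calls to a language model; any plain Python function with the same shape can be plugged in for experimentation.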

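1.3 Experimental procedure in detail

Step 1: Generate an explanation of the neuron's behavior
Use the explainer model to generate an explanation of the neuron's behavior.
In this step, we create a prompt that is sent to the explainer model to generate one or more explanations of a neuron's behavior. The prompt consists of few-shot examples built from other real neurons: tab-separated (token, activation) pairs from text excerpts, together with researcher-written explanations. The prompt ends with tab-separated (token, activation) pairs from text excerpts for the neuron being explained.

An example of the tab-separated (token, activation) pairs: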

Neuron 1
Activations:
<start>
the		0
 sense		0
 of		0
 together	3
ness		7
 in		0
 our		0
 town		1
 is		0
 strong		0
.		0
<end>
<start>
[prompt truncated …]
<end>

Same activations, but with all zeros filtered out:
<start>
 together	3
ness		7
 town		1
<end>
<start>
[prompt truncated …]
<end>

Explanation of neuron 1 behavior: the main thing this neuron does is find phrases related to community

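Activations are normalized to a 0-10 scale and discretized to integer values, with negative activations mapped to 0 and the maximum activation ever observed for the neuron mapped to 10. For sequences where the neuron's activations are sparse (<20% non-zero), we found it helpful to additionally repeat the token/activation pairs with non-zero activations after the full list of tokens, which helps the model focus on the relevant tokens.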
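
The normalization and the sparse-sequence repeat can be sketched as follows. This is a minimal illustration, not the authors' code: `discretize` and `format_excerpt` are hypothetical helpers, rounding to the nearest integer is an assumption (the text only says the values are discretized), and appending the filtered list directly after each excerpt is a simplification.

```python
from typing import List


def discretize(activations: List[float], max_activation: float) -> List[int]:
    """Map raw activations to integers in 0..10.

    Negative values become 0 and `max_activation` (the largest value ever
    observed for the neuron) becomes 10. Rounding is an assumption.
    """
    if max_activation <= 0:
        return [0 for _ in activations]
    levels = []
    for a in activations:
        scaled = round(10 * max(a, 0.0) / max_activation)
        levels.append(min(10, scaled))  # clip anything above the observed max
    return levels


def format_excerpt(tokens: List[str], levels: List[int],
                   sparse_threshold: float = 0.2) -> str:
    """Render tab-separated (token, activation) pairs for one excerpt."""
    lines = ["<start>"]
    lines += [f"{tok}\t{lvl}" for tok, lvl in zip(tokens, levels)]
    lines.append("<end>")

    nonzero = [(tok, lvl) for tok, lvl in zip(tokens, levels) if lvl != 0]
    # Sparse sequence: repeat only the non-zero pairs after the full list.
    if levels and len(nonzero) / len(levels) < sparse_threshold:
        lines.append("")
        lines.append("Same activations, but with all zeros filtered out:")
        lines.append("<start>")
        lines += [f"{tok}\t{lvl}" for tok, lvl in nonzero]
        lines.append("<end>")
    return "\n".join(lines)
```

For example, `format_excerpt([" a", " b", " c", " d", " e", " f"], [0, 0, 0, 0, 0, 7])` emits the full listing followed by the filtered one, because only one of six activations is non-zero.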
Step 2: Simulate the neuron’s behavior using the explanations
With this method, we aim to answer the question: supposing a proposed explanation accurately and comprehensively explains a neuron’s behavior, how would that neuron activate for each token in a particular sequence? To do this, we use the simulator model to simulate neuron activations for each subject model token, conditional on the proposed explanation.

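We prompt the simulator model to output an integer from 0 to 10 for each subject model token. For each predicted activation position, we examine the probability assigned to each number ("0", "1", …, "10") and use those probabilities to compute the expected value of the output. The resulting simulated neuron value is therefore on a [0, 10] scale.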
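
The expected-value step is simple arithmetic and can be sketched as below. `expected_activation` is a hypothetical helper; it assumes the simulator's probabilities for the strings "0" to "10" are available as a dictionary, and it renormalizes over those eleven options, which is an assumption the text does not spell out.

```python
from typing import Dict


def expected_activation(probs: Dict[str, float]) -> float:
    """Expected simulated activation from probabilities over "0".."10".

    Keys other than "0".."10" are ignored; the remaining mass is
    renormalized so the result always lies in [0, 10].
    """
    mass = {k: probs.get(str(k), 0.0) for k in range(11)}
    total = sum(mass.values())
    if total <= 0:
        raise ValueError("no probability mass on the values 0..10")
    return sum(k * p for k, p in mass.items()) / total
```

For instance, a simulator that splits its probability evenly between "0" and "10" yields a simulated activation of 5.0 for that token.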

Our simplest method is what we call the "one at a time" method. The prompt consists of some few-shot examples and a single-shot example.
Because the "one at a time" approach is slow, the authors propose a trick to speed it up: Unfortunately, the “one at a time” method is quite slow, as it requires one forward pass per simulated token. We use a trick to parallelize the probability predictions across all tokens by having few-shot examples where activation values switch from being “unknown” to being actual values at a random location in the sequence.

# Example:
Neuron 4
Explanation of neuron 4 behavior: the main thing this neuron does is find present tense verbs ending in 'ing'
Activations:
<start>
Star		unknown
 ting		unknown
 from		unknown
 a		unknown
 position	unknown
 of		unknown
 strength	unknown
<end>
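
A simulation prompt in the format shown above can be assembled with a few lines of string handling. This sketch covers only the final block for the neuron being simulated; the few-shot examples that precede it are omitted, and `build_simulation_prompt` is a hypothetical helper, not part of any released code.

```python
from typing import List


def build_simulation_prompt(neuron_id: int, explanation: str,
                            tokens: List[str]) -> str:
    """Build the block for one neuron with every activation left "unknown"."""
    lines = [
        f"Neuron {neuron_id}",
        f"Explanation of neuron {neuron_id} behavior: {explanation}",
        "Activations:",
        "<start>",
    ]
    # Each token is followed by a tab and the placeholder the simulator fills in.
    lines += [f"{tok}\tunknown" for tok in tokens]
    lines.append("<end>")
    return "\n".join(lines)
```

The simulator's probabilities at each "unknown" position are then what the expected-value computation from the previous step consumes.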

Step 3: Scoring: score the explanation by comparing simulated and actual neuron behavior

Conceptually, given an explanation and simulation strategy, we now have a simulated neuron, a “neuron” for which we can predict activation values for any given text excerpt. To score an explanation, we want to compare this simulated neuron against the real neuron for which the explanation was generated. That is, we want to compare two lists of values: the simulated activation values for the explanation over multiple text excerpts, and the actual activation values of the real neuron on the same text excerpts.

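However, our simulated activations are on a [0, 10] scale, while the real activations follow some arbitrary distribution. We therefore assume the ability to calibrate the simulated neuron's activation distribution to that of the actual neuron.
Scoring method 1: we chose to simply calibrate linearly on the text excerpts being scored. If ρ is the correlation coefficient between the real and simulated activations, then we scale the simulations so that their mean matches the mean of the real activations and their standard deviation is ρ times the standard deviation of the real activations. This maximizes the explained variance.
This motivates our main scoring method, correlation scoring, which simply reports ρ. Note that if the simulated neuron behaves identically to the real neuron, the score is 1. If the simulated neuron behaves randomly, for example because the explanation has nothing to do with the neuron's behavior, the score will tend to be around 0.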
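
Correlation scoring and the linear calibration can be sketched with the standard library alone. The function names are hypothetical, and returning 0.0 when either series is constant is an assumption of this sketch (the correlation is undefined in that case).

```python
import math
from typing import List


def _mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


def _std(xs: List[float]) -> float:
    m = _mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def correlation_score(real: List[float], simulated: List[float]) -> float:
    """Pearson correlation between real and simulated activations."""
    sr, ss = _std(real), _std(simulated)
    if sr == 0 or ss == 0:
        return 0.0  # assumption: treat an undefined correlation as no signal
    mr, ms = _mean(real), _mean(simulated)
    cov = sum((r - mr) * (s - ms) for r, s in zip(real, simulated)) / len(real)
    return cov / (sr * ss)


def calibrate(real: List[float], simulated: List[float]) -> List[float]:
    """Linearly rescale simulations: mean matches the real mean and the
    standard deviation becomes |rho| times the real standard deviation."""
    mr, ms = _mean(real), _mean(simulated)
    ss = _std(simulated)
    if ss == 0:
        return [mr for _ in simulated]
    scale = correlation_score(real, simulated) * _std(real) / ss
    return [mr + scale * (s - ms) for s in simulated]
```

This calibration is the ordinary least-squares fit of the real activations on the simulated ones, so the fraction of variance it explains equals ρ squared.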

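Scoring method 2: validation against ablation scoring

Another way to understand a network is to perturb its internal values during a forward pass and observe the effect. This suggests a more expensive approach to scoring, in which we replace the real neuron with the simulated neuron (that is, we ablate its activations to the simulated activation values) and check whether the network's behavior is preserved.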

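Scoring method 3: validation against human scoring

One potential worry is that simulation-based scoring does not actually reflect human evaluation of explanations (the original paper discusses this further). We gathered human evaluations of explanation quality to see whether they agree with the score-based assessment.
We gave human labelers a task in which they see the same text excerpts and activations as the simulator model (shown with color highlighting, including both top-activating and random excerpts), and asked them to rate and rank 5 proposed explanations by how well those explanations capture the activation patterns. We found that the explainer model's explanations were not diverse, so we increased explanation diversity by varying the few-shot examples used in the explanation generation prompt, or by using a modified prompt that asks the explainer model for a numbered list of possible explanations in a single completion.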

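Comparison of scoring methods
We find that, on average, there is a clear relationship between correlation scoring and ablation scoring. The rest of the paper therefore uses correlation scoring, since it is much simpler to compute. However, correlation scoring does not appear to capture all of the deficiencies in simulated explanations that ablation scoring reveals. In particular, a correlation score of 0.9 still leads to relatively low ablation scores on average (0.3 when scoring on random-only text excerpts and 0.6 on top-and-random; see below for how these text excerpts are chosen).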