[Language Model Interpretability] Interpretability in the Wild / the transformer-debugger (TDB) tool

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

2022.11

Paper link
Code link

Abstract

Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models, or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior “in the wild” in a language model. We evaluate the reliability of our explanation using three quantitative criteria–faithfulness, completeness and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, opening opportunities to scale our understanding to both larger models and more complex tasks.

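The abstract's "causal interventions" refer to techniques like activation patching: overwrite one component's activation during a forward pass and measure how the output changes. The toy two-layer model below is a made-up stand-in (not GPT-2 components), but it shows the mechanics of caching an activation from a corrupted run and patching it into a clean run:

```python
# Minimal sketch of activation patching, assuming a toy model where the
# intermediate activation can be overwritten directly. The layers and
# inputs are illustrative stand-ins, not actual GPT-2 components.

def layer1(x):
    # stands in for an attention head's contribution
    return 2 * x

def layer2(h):
    # stands in for the rest of the network
    return h + 1

def run(x, patch_h=None):
    """Forward pass; optionally replace the intermediate activation
    with a patched value (the causal intervention)."""
    h = layer1(x)
    if patch_h is not None:
        h = patch_h  # intervene on the forward pass
    return layer2(h)

clean_x, corrupt_x = 3, 5

# Cache the intermediate activation from the corrupted run ...
corrupt_h = layer1(corrupt_x)

# ... then patch it into the clean run. The output shift measures how
# much this component causally mediates the behavior.
clean_out = run(clean_x)                      # 7
patched_out = run(clean_x, patch_h=corrupt_h)  # 11
effect = patched_out - clean_out               # 4
print(effect)
```

Repeating this intervention per attention head is how one localizes which of a model's components matter for a task like IOI.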

OpenAI's transformer-debugger (TDB) tool

2024.3

transformer-debugger repository (code link)

Transformer Debugger (TDB) is a tool developed by OpenAI’s Superalignment team with the goal of supporting investigations into specific behaviors of small language models. The tool combines automated interpretability techniques with sparse autoencoders.

TDB enables rapid exploration before needing to write code, with the ability to intervene in the forward pass and see how it affects a particular behavior. It can be used to answer questions like, “Why does the model output token A instead of token B for this prompt?” or “Why does attention head H attend to token T for this prompt?” It does so by identifying specific components (neurons, attention heads, autoencoder latents) that contribute to the behavior, showing automatically generated explanations of what causes those components to activate most strongly, and tracing connections between components to help discover circuits.

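A question like "why token A instead of token B?" is typically framed as a logit difference, which (because the unembedding is linear) splits into one additive term per component writing into the residual stream. The sketch below illustrates that decomposition; the component names, vectors, and unembedding rows are all made up for illustration and are not TDB's API:

```python
# Hedged sketch of the kind of attribution TDB automates: decomposing
# logit(A) - logit(B) into per-component contributions. All names and
# numbers are fabricated for illustration.

# Each component (attention head, MLP, ...) writes a vector into the
# residual stream; the final residual is their sum.
components = {
    "head_9.6":  [0.5, 0.2],
    "head_10.0": [0.3, -0.1],
    "mlp_11":    [-0.2, 0.4],
}

# Hypothetical unembedding rows for candidate tokens A and B.
w_A = [1.0, 0.0]
w_B = [0.0, 1.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Linearity lets the logit difference split exactly into one term per
# component: contrib = component_output . (w_A - w_B).
direction = [a - b for a, b in zip(w_A, w_B)]
contribs = {name: dot(vec, direction) for name, vec in components.items()}

for name, c in sorted(contribs.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {c:+.2f}")

# Sanity check: contributions sum to the full logit difference.
resid = [sum(vs) for vs in zip(*components.values())]
total = dot(resid, w_A) - dot(resid, w_B)
assert abs(sum(contribs.values()) - total) < 1e-9
```

Ranking components by their contribution is one way to surface candidate circuit members before inspecting them individually.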