AGI Safety and Alignment: Recent Work from DeepMind

by Rohin Shah, Seb Farquhar, Anca Dragan

21st Aug 2024

AI Alignment Forum


We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours.

Who are we?


We’re the main team at Google DeepMind working on technical approaches to existential risk from AI systems. Since our last post[3], we’ve evolved into the AGI Safety & Alignment team, which we think of as AGI Alignment (with subteams like mechanistic interpretability, scalable oversight, etc.), and Frontier Safety (working on the Frontier Safety Framework[4], including developing and running dangerous capability evaluations). We’ve also been growing since our last post: by 39% last year, and by 37% so far this year. The leadership team is Anca Dragan, Rohin Shah, Allan Dafoe, and Dave Orr, with Shane Legg as executive sponsor. We’re part of the overall AI Safety and Alignment org led by Anca, which also includes Gemini Safety (focusing on safety training for the current Gemini models), and Voices of All in Alignment, which focuses on alignment techniques for value and viewpoint pluralism.

What have we been up to?


It’s been a while since our last update, so below we list out some key work published in 2023 and the first part of 2024, grouped by topic / sub-team.

Our big bets for the past 1.5 years have been 1) amplified oversight, to enable the right learning signal for aligning models so that they don’t pose catastrophic risks, 2) frontier safety, to analyze whether models are capable of posing catastrophic risks in the first place, and 3) (mechanistic) interpretability, as a potential enabler for both frontier safety and alignment goals. Beyond these bets, we experimented with promising areas and ideas that help us identify new bets we should make.

Frontier Safety


The mission of the Frontier Safety team is to ensure safety from extreme harms by anticipating, evaluating, and helping Google prepare for powerful capabilities in frontier models. While the focus so far has been primarily around misuse threat models, we are also working on misalignment threat models.

FSF

We recently published our Frontier Safety Framework[8], which, in broad strokes, follows the approach of responsible capability scaling[9], similar to Anthropic’s Responsible Scaling Policy[10] and OpenAI’s Preparedness Framework[11]. The key difference is that the FSF applies to Google: there are many different frontier LLM deployments across Google, rather than just a single chatbot and API (this in turn affects stakeholder engagement, policy implementation, mitigation plans, etc).

We’re excited that our small team led the Google-wide strategy in this space, and demonstrated that responsible capability scaling can work for large tech companies in addition to small startups.

A key area of the FSF we’re focusing on as we pilot the Framework is how to map between the critical capability levels (CCLs) and the mitigations we would take. This is high on our list of priorities as we iterate on future versions.

Some commentary (e.g. here[13]) also highlighted (accurately) that the FSF doesn’t include commitments. This is because the science is in early stages and best practices will need to evolve. But ultimately, what we care about is whether the work is actually done. In practice, we did run and report dangerous capability evaluations for Gemini 1.5 that we think are sufficient to rule out extreme risk with high confidence.

Dangerous Capability Evaluations

Our paper on Evaluating Frontier Models for Dangerous Capabilities[14] is the broadest suite of dangerous capability evaluations published so far, and to the best of our knowledge has informed the design of evaluations at other organizations. We regularly run and report these evaluations on our frontier models, including Gemini 1.0 (original paper), Gemini 1.5[15] (see Section 9.5.2), and Gemma 2[16] (see Section 7.4). We’re especially happy to have helped develop open sourcing norms through our Gemma 2 evals. We take pride in currently setting the bar on transparency around evaluations and implementation of the FSF, and we hope to see other labs adopt a similar approach.

Prior to that we set the stage with Model evaluation for extreme risks[19], which set out the basic principles behind dangerous capability evaluation, and also talked more holistically about designing evaluations across present day harms to extreme risks in Holistic Safety and Responsibility Evaluations of Advanced AI Models[20].

Mechanistic Interpretability


Mechanistic interpretability is an important part of our safety strategy, and lately we’ve focused deeply on Sparse AutoEncoders (SAEs). We released Gated SAEs[23] and JumpReLU SAEs[24], new architectures for SAEs that substantially improved the Pareto frontier of reconstruction loss vs sparsity. Both papers rigorously evaluate the architecture change by running a blinded study evaluating how interpretable the resulting features are, showing no degradation. Incidentally, Gated SAEs was the first public work that we know of to scale and rigorously evaluate SAEs on LLMs with over a billion parameters (Gemma-7B).
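For readers newer to SAEs, the sketch below shows the basic shape of a JumpReLU-style sparse autoencoder: encode, threshold, decode, and track the reconstruction-vs-sparsity trade-off. It is a minimal PyTorch illustration under simplified assumptions, not the exact architecture or training recipe from the Gated SAE or JumpReLU papers (which use additional tricks, such as straight-through gradient estimators, omitted here).

```python
# Minimal JumpReLU-style SAE sketch (illustrative; not the papers' exact recipe).
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # Learned per-feature threshold; activations below it are zeroed out.
        self.log_threshold = nn.Parameter(torch.zeros(d_sae))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre_acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # JumpReLU: keep a feature only if it clears its threshold.
        return pre_acts * (pre_acts > self.log_threshold.exp())

    def forward(self, x: torch.Tensor):
        feats = self.encode(x)
        recon = feats @ self.W_dec + self.b_dec
        recon_loss = ((recon - x) ** 2).sum(-1).mean()   # reconstruction quality
        l0 = (feats > 0).float().sum(-1).mean()          # avg. features active per input
        return recon, recon_loss, l0

# Sweeping training hyperparameters and plotting (recon_loss, l0) pairs traces out
# the Pareto frontier of reconstruction loss vs sparsity discussed above.
```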

We’ve also been really excited to train and release Gemma Scope[29], an open, comprehensive suite of SAEs for Gemma 2 2B and 9B (every layer and every sublayer). We believe Gemma 2 sits at the sweet spot of “small enough that academics can work with them relatively easily” and “large enough that they show interesting high-level behaviors to investigate with interpretability techniques”. We hope this will make Gemma 2 the go-to models of choice for academic/external mech interp research, and enable more ambitious interpretability research outside of industry labs. You can access Gemma Scope here[30], and there’s an interactive demo of Gemma Scope[31], courtesy of Neuronpedia[32].
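As a rough illustration of how one might pull a Gemma Scope SAE's parameters from the Hugging Face Hub, here is a hedged sketch. The repository id comes from the link above, but the file layout and parameter names inside it are assumptions to verify on the model page, not a documented API.

```python
# Illustrative only: check the Gemma Scope repo for the actual file layout;
# the filename below is a hypothetical placeholder.
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="google/gemma-scope",      # from the link above
    filename="layer_12/params.npz",    # hypothetical path to one SAE's weights
)
params = np.load(path)
print(list(params.keys()))  # expect encoder/decoder weights, biases, thresholds
```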

You can also see a series of short blog posts on smaller bits of research in the team’s progress update[34] in April.

Prior to SAEs, we worked on:

  • Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla[36]: The key contribution here was to show that the circuit analysis techniques used in smaller models scaled: we gained significant understanding about how, after Chinchilla (70B) “knows” the answer to a multiple choice question, it maps that to the letter corresponding to that answer.

  • Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level[38]: While this work didn’t reach its ambitious goal of mechanistically understanding how facts are computed in superposition in early MLP layers, it did provide further evidence that superposition is happening, and falsified some simple hypotheses about how factual recall might work. It also provided guidelines for future work in the area, such as viewing the early layers as producing a “multi-token embedding” that is relatively independent of prior context.

  • AtP∗: An efficient and scalable method for localizing LLM behaviour to components[40]: A crucial aspect of circuit discovery is finding which components of the model are important for the behavior under investigation. Activation patching is the principled approach, but requires a separate pass for each component (comparable to training a model), whereas attribution patching is an approximation, but can be done for every component simultaneously with two forward & one backward pass. This paper investigated attribution patching, diagnosed two problems and fixed them, and showed that the resulting AtP* algorithm is an impressively good approximation to full activation patching. (A minimal sketch of the basic attribution-patching estimate follows this list.)

  • ["Tracr:编译变换器作为可解释性实验室"](#tracr%E3%80%90%E7%BC%96%E5%8C%85%E6%9B%B1%E5%8F%98%E6%9C%AC%E4%B8%AD%E7%AB%B6%E7%BA%A2%E5%AE%9A%E5%9B%BE%E6%8D%9F" ""Tracr:编译变换器作为可解释性实验室"")(链接[41]):“让我们能够创建变换器权重,我们知道了模型正在做什么的确切答案,这允许我们将它作为可解释性工具的测试案例。我们已经看到了一些使用Tracr的例子,但其使用的范围并没有如我们所希望的那样广泛,因为由Tracr生成的模型与在野外训练的模型有很大的不同。(这是工作完成时已知的风险之一,但我们曾期望这不会成为太大的缺点。)

  • Tracr: Compiled Transformers as a Laboratory for Interpretability[42]: Enabled us to create Transformer weights where we know the ground truth answer about what the model is doing, allowing it to serve as a test case for our interpretability tools. We’ve seen a few cases where people used Tracr, but it hasn’t had as much use as we’d hoped for, because Tracr-produced models are quite different from models trained in the wild. (This was a known risk at the time the work was done, but we hoped it wouldn’t be too large a downside.)
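To make the activation-vs-attribution patching distinction from the AtP* item above concrete, here is a minimal sketch of the basic first-order attribution-patching estimate for a single activation site. The `model(tokens, cache=...)` interface is an assumed stand-in for whatever hooking library is used, and the sketch omits the AtP* corrections described in the paper.

```python
# Schematic attribution patching for one activation site (no AtP* corrections).
# `model(tokens, cache=...)` is an assumed interface that records activations.
import torch

def attribution_patching_estimate(model, clean_tokens, corrupt_tokens, metric, site):
    """First-order estimate of how much patching `site` from the clean run into
    the corrupt run would change `metric(logits)`."""
    # Clean forward pass: record the activation we would patch in.
    clean_cache = {}
    model(clean_tokens, cache=clean_cache)
    a_clean = clean_cache[site].detach()

    # Corrupt forward pass, keeping the activation in the autograd graph.
    corrupt_cache = {}
    logits = model(corrupt_tokens, cache=corrupt_cache)
    a_corrupt = corrupt_cache[site]
    (grad,) = torch.autograd.grad(metric(logits), a_corrupt)

    # Linear approximation: grad . (a_clean - a_corrupt). In practice one backward
    # pass provides gradients for all cached sites at once, which is what makes
    # the method cheap; a single site is shown here for clarity.
    return (grad * (a_clean - a_corrupt)).sum().item()
```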

Amplified Oversight


Our amplified oversight work aims to provide supervision on any single situation that is as close as possible to that of a human with complete understanding of all of the reasons that the AI system produced its output - including when the AI has a very broad range of superhuman capabilities. (The community often calls this “scalable oversight”, but we want to be clear that this does not necessarily include scaling to large numbers of situations, as in monitoring.)

Theoretical Work on Debate

On the theoretical side, the original debate protocol[45] enables a polynomial-time verifier to decide any problem in PSPACE given debates between optimal debaters. But our AI systems are not optimal, and we should not assume they are! It doesn't matter if an optimal AI could refute lies, if the AI systems we train in practice cannot do so. The problem of obfuscated arguments[46] is exactly when a dishonest debater lies by breaking an easy problem down into hard subproblems that an optimal honest AI could answer but a bounded one could not.

Doubly-efficient debate[48] provides a new debate protocol that enables a polynomial-time honest strategy to prove facts to an even more limited judge, even against an unbounded dishonest strategy. This is not quite yet what we want: the honest strategy is only polynomial-time in the length of the human-judgeable argument, whereas we would like it to be efficient in terms of the length of the AI-judgeable argument. We have some work in progress that we hope will cross this gap, and we expect that if we do cross the gap this will influence which protocols we study in our empirical work.
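Stated informally in notation (a paraphrase of the guarantees discussed above, not the papers' precise theorem statements):

```latex
\begin{itemize}
  \item \textbf{Original debate:} any $L \in \mathrm{PSPACE}$ can be decided by a
        $\mathrm{poly}(n)$-time verifier, \emph{assuming optimal debaters}.
  \item \textbf{Doubly-efficient debate:} an honest strategy running in
        $\mathrm{poly}(\ell_{\mathrm{human}})$ time can win even against an
        \emph{unbounded} dishonest strategy, where $\ell_{\mathrm{human}}$ is the
        length of the human-judgeable argument; the open gap is efficiency in
        $\ell_{\mathrm{AI}}$, the length of the AI-judgeable argument.
\end{itemize}
```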

Empirical Work on Debate

On the empirical side, we ran inference-only experiments with debate[49] that help challenge what the community expects. First, on tasks with information asymmetry, theory suggests that debate should be close to as good as (or even better than) giving the judge access to the full information, whereas in these inference-only experiments debate performs significantly worse. Second, on tasks without information asymmetry, weak judge models with access to debates don’t outperform weak judge models without debate. Third, we find only limited evidence that stronger debaters lead to much higher judge accuracy – and we really need this to be the case for debate to succeed in the long run.
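For concreteness, here is a hedged sketch of what an inference-only debate looks like: frozen debater and judge models, no finetuning. `query_model` is a placeholder for an LLM call, and the prompts and two-turn structure are simplifications rather than the paper's exact protocol.

```python
# Schematic inference-only debate: frozen debaters argue, a frozen judge decides.
# `query_model` is a placeholder for an LLM API call; prompts are simplified.
from typing import Callable

def run_debate(query_model: Callable[[str], str], question: str,
               answer_a: str, answer_b: str, n_turns: int = 2) -> str:
    transcript = (f"Question: {question}\n"
                  f"A argues for: {answer_a}\nB argues for: {answer_b}\n")
    for _ in range(n_turns):
        # Each debater sees the transcript so far and adds an argument
        # (no training anywhere, hence "inference-only").
        for name, answer in (("A", answer_a), ("B", answer_b)):
            argument = query_model(
                f"{transcript}\nAs debater {name}, argue that '{answer}' is correct."
            )
            transcript += f"\nDebater {name}: {argument}"
    # A (typically weaker) judge model picks a winner from the transcript alone.
    verdict = query_model(
        f"{transcript}\n\nAs the judge, answer 'A' or 'B': which answer is better supported?"
    )
    return verdict.strip()
```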

Qualitatively, our sense is that these issues occur because the models are not very good at judging debates: the actual debater arguments seem quite good. Our current work is looking into training our LLM judges to be better proxies of human judges, after which we plan to try finetuning the debaters using the debate protocol, and checking that this closes the gaps we’ve observed.

Causal Alignment



A long-running stream of research in our team explores how understanding causal incentives can contribute to designing safe AI systems. Causality gives us pretty general tools for understanding what agents that are ‘trying’ to achieve goals will do, and provides explanations for how they act. We developed algorithms for discovering agents[51], which can help us identify which parts of systems can be understood through an agent-lens. In principle, this could allow us to empirically discover goal-directed agents, and determine what they are optimizing for.

We have also shown that causal world models are a key aspect of agent robustness[52], suggesting that some causal tools are likely to apply to any sufficiently powerful agent. The paper got an Honorable Mention for Best Paper at ICLR 2024. This work continues to inform the development of safety mitigations that work by managing an agent’s incentives, such as methods based on process supervision. It can also be used to design consistency checks that look at long-run behavior of agents in environments, extending the more short-horizon consistency checks we have today.

Emerging Topics


We also do research that isn’t necessarily part of a years-long agenda, but is instead tackling one particular question, or investigating an area to see whether it should become one of our longer-term agendas. This has led to a few different papers:

One alignment hope[53] that people have (or at least had in late 2022) is that there are only a few “truth-like” features in LLMs, and that we can enumerate them all and find the one that corresponds to the “model’s beliefs”, and use that to create an honest AI system. In Challenges with unsupervised LLM knowledge discovery[54], we aimed to convincingly rebut this intuition by demonstrating a large variety of “truth-like” features (particularly features that model the beliefs of other agents). We didn’t quite hit that goal, likely because our LLM wasn’t strong enough to show such features, but we did show the existence of many salient features that had at least the negation consistency and confidence properties of truth-like features, which “tricked” several unsupervised knowledge discovery approaches.
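For context on what these unsupervised methods look for, below is a minimal sketch of a CCS-style probe in the spirit of the approach behind [53]: it searches for a direction whose probabilities on a statement and its negation are consistent and confident. Details (normalization, probe architecture, data handling) are simplified relative to the published method.

```python
# Minimal CCS-style probe: find a direction whose probabilities on a statement
# and its negation are consistent (p_pos ~= 1 - p_neg) and confident.
import torch
import torch.nn as nn

def ccs_loss(probe: nn.Module, h_pos: torch.Tensor, h_neg: torch.Tensor) -> torch.Tensor:
    """h_pos / h_neg: activations for a statement and its negation, shape (batch, d)."""
    p_pos = torch.sigmoid(probe(h_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(h_neg)).squeeze(-1)
    consistency = (p_pos - (1.0 - p_neg)) ** 2       # the two should agree
    confidence = torch.minimum(p_pos, p_neg) ** 2    # discourage p ~= 0.5 everywhere
    return (consistency + confidence).mean()

# Usage sketch: a linear probe trained on (stand-in) paired activations.
d_model = 2048  # illustrative hidden size
probe = nn.Linear(d_model, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
h_pos, h_neg = torch.randn(64, d_model), torch.randn(64, d_model)  # stand-in activations
for _ in range(100):
    optimizer.zero_grad()
    loss = ccs_loss(probe, h_pos, h_neg)
    loss.backward()
    optimizer.step()
```

Any feature satisfying these consistency and confidence properties is a candidate "truth-like" direction, which is why the paper's many such features can "trick" this family of methods.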

通过"解析grokking的电路效率"(arxiv.org/abs/2309.02390[55]),我们深入探讨了"深度学习科学"(["深度学习的影响理论",alignmentforum.org/posts/tKYGvA9dKHa3GWBBk/theories-of-impact-for-science-of-deep-learning](""深度学习的影响理论",alignmentforum.org/posts/tKYGvA9dKHa3GWBBk/theories-of-impact-for-science-of-deep-learning" ""深度学习的影响理论",alignmentforum.org/posts/tKYGvA9dKHa3GWBBk/theories-of-impact-for-science-of-deep-learning" ""深度学习的影响理论",alignmentforum.org/posts/tKYGvA9dKHa3GWBBk/theories-of-impact-for-science-of-deep-learning" ""深度学习的影响理论",alignmentforum.org/posts/tKYGvA9dKHa3GWBBk/theories-of-impact-for-science-of-deep-learning"))。本文试图解答以下问题:在grokking现象中,为何网络的测试性能在持续训练后急剧提高?尽管网络已经在训练阶段获得了几乎完美的表现水平。

文中提出了一个令人信服的答案,并通过预测类似环境中的多个新颖现象验证了这一答案。我们本希望通过更深入理解训练动态来提升安全性,但遗憾的是,这个希望并没有得到实现(不过仍有潜力通过这些见解检测到新能力)。因此,我们决定不再在“深度学习科学”领域投入更多资源,因为还有其他更加有前景的研究方向。尽管如此,我们对这一领域的研究仍然充满热情,并期待看到更多的研究工作。

请注意:这里的翻译保留了原文中的链接和格式化标识符。
Explaining grokking through circuit efficiency[56] was a foray into “science of deep learning[57]”. It tackles the question: in grokking, why does the network’s test performance improve dramatically upon continued training, despite having already achieved nearly perfect training performance? It gives a compelling answer to this question, and validates this answer by correctly predicting multiple novel phenomena in a similar setting. We hoped that better understanding of training dynamics would enable improved safety, but unfortunately that hope has mostly not panned out (though it is still possible that the insights would help with detection of new capabilities). We’ve decided not to invest more in “science of deep learning”, because there are other more promising things to do, but we remain excited about it and would love to see more research on it.

Power-seeking can be probable and predictive for trained agents[58] is a short paper building on the power-seeking framework[59] that shows how the risk argument would be made from the perspective of goal misgeneralization of a learned agent. It still assumes that the AI system is pursuing a goal, but specifies that the goal comes from a set of goals that are consistent with the behavior learned during training.

What are we planning next?


Perhaps the most exciting and important project we are working on right now is revising our own high level approach to technical AGI safety. While our bets on frontier safety, interpretability, and amplified oversight are key aspects of this agenda, they do not necessarily add up to a systematic way of addressing risk. We’re mapping out a logical structure for technical misalignment risk, and using it to prioritize our research so that we better cover the set of challenges we need to overcome.

As part of that, we’re drawing attention to important areas that require addressing. Even if amplified oversight worked perfectly, that is not clearly sufficient to ensure alignment. Under distribution shift, the AI system could behave in ways that amplified oversight wouldn’t endorse, as we have previously studied in goal misgeneralization[62]. Addressing this will require investments in adversarial training, uncertainty estimation, monitoring, and more; we hope to evaluate these mitigations in part through the control framework[63].

We’re looking forward to sharing more of our thoughts with you when they are ready for feedback and discussion. Thank you for engaging and for holding us to high standards for our work, epistemics, and actions.

References
[3]

last post: https://www.alignmentforum.org/posts/nzmCvRvPm4xJuqztv/deepmind-is-hiring-for-the-scalable-alignment-and-alignment

[4]

Frontier Safety Framework: https://deepmind.google/discover/blog/introducing-the-frontier-safety-framework/

[8]

Frontier Safety Framework: https://deepmind.google/discover/blog/introducing-the-frontier-safety-framework/

[9]

responsible capability scaling: https://www.gov.uk/government/publications/emerging-processes-for-frontier-ai-safety/emerging-processes-for-frontier-ai-safety#responsible-capability-scaling

[10]

Responsible Scaling Policy: https://www.anthropic.com/news/anthropics-responsible-scaling-policy

[11]

Preparedness Framework: https://openai.com/preparedness/

[13]

here: https://www.lesswrong.com/posts/y8eQjQaCamqdc842k/deepmind-s-frontier-safety-framework-is-weak-and-unambitious

[14]

Evaluating Frontier Models for Dangerous Capabilities: https://arxiv.org/pdf/2403.13793

[15]

Gemini 1.5: https://arxiv.org/abs/2403.05530

[16]

Gemma 2: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf

[19]

Model evaluation for extreme risks: https://arxiv.org/pdf/2305.15324

[20]

Holistic Safety and Responsibility Evaluations of Advanced AI Models: https://arxiv.org/pdf/2404.14068

[23]

Gated SAEs: https://arxiv.org/abs/2404.16014

[24]

JumpReLU SAEs: https://arxiv.org/abs/2407.14435

[29]

Gemma Scope: https://deepmind.google/discover/blog/gemma-scope-helping-the-safety-community-shed-light-on-the-inner-workings-of-language-models/

[30]

here: https://huggingface.co/google/gemma-scope

[31]

interactive demo of Gemma Scope: https://www.neuronpedia.org/gemma-scope

[32]

Neuronpedia: https://www.neuronpedia.org/

[34]

progress update: https://www.alignmentforum.org/posts/HpAr8k74mW4ivCvCu/progress-update-from-the-gdm-mech-interp-team-summary

[36]

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla: https://arxiv.org/pdf/2307.09458

[38]

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level: https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall

[40]

AtP∗: An efficient and scalable method for localizing LLM behaviour to components: https://arxiv.org/pdf/2403.00745

[42]

Tracr: Compiled Transformers as a Laboratory for Interpretability: https://proceedings.neurips.cc/paper_files/paper/2023/file/771155abaae744e08576f1f3b4b7ac0d-Paper-Conference.pdf

[45]

original debate protocol: https://arxiv.org/abs/1805.00899

[46]

obfuscated arguments: https://www.alignmentforum.org/posts/PJLABqQ962hZEqhdB/debate-update-obfuscated-arguments-problem

[48]

Doubly-efficient debate: https://arxiv.org/pdf/2311.14125

[49]

inference-only experiments with debate: https://arxiv.org/pdf/2407.04622

[51]

discovering agents: https://arxiv.org/abs/2208.08345

[52]

causal world models are a key aspect of agent robustness: https://arxiv.org/abs/2402.10877

[53]

alignment hope: https://www.alignmentforum.org/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without

[54]

Challenges with unsupervised LLM knowledge discovery: https://arxiv.org/pdf/2312.10029

[56]

Explaining grokking through circuit efficiency: https://arxiv.org/abs/2309.02390

[57]

science of deep learning: https://www.alignmentforum.org/posts/tKYGvA9dKHa3GWBBk/theories-of-impact-for-science-of-deep-learning

[58]

Power-seeking can be probable and predictive for trained agents: https://arxiv.org/pdf/2304.06528

[59]

power-seeking framework: https://proceedings.neurips.cc/paper_files/paper/2022/file/cb3658b9983f677670a246c46ece553d-Paper-Conference.pdf

[62]

goal misgeneralization: https://arxiv.org/abs/2210.01790

[63]

control framework: https://www.alignmentforum.org/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled
