Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning
URL: https://arxiv.org/abs/2409.12618
Santosh Kumar Radha 1, Yasamin Nouri Jelyani, Ara Ghukasyan 1, and Oktay Goktas 1
1 Agnostiq Inc., 325 Front St W, Toronto, ON M5V 2Y1
2 University of Toronto, 60 St George St, Toronto, Ontario, M5S 1A7, Canada
Abstract
Iterative human engagement is a common and effective means of leveraging the advanced language processing power of large language models (LLMs). Using well-structured prompts in a conversational manner, human users can effectively influence an LLM to develop more thoughtful and accurate responses. Motivated by this insight, we propose the Iteration of Thought (IoT) framework for enhancing LLM responses by generating “thought”-provoking prompts vis-à-vis an input query and the current iteration of an LLM’s response. Unlike static or semi-static approaches, e.g. Chain of Thought (CoT) or Tree of Thoughts (ToT), IoT adapts its reasoning path dynamically, based on evolving context, and without generating alternate explorative thoughts which are ultimately discarded. The three components of the IoT framework are (1) an Inner Dialogue Agent (IDA) responsible for generating instructive, context-specific prompts; (2) an LLM Agent (LLMA) that processes these prompts to refine its responses; and (3) an iterative prompting loop that implements a conversation between the former two components. We introduce two variants of our framework: Autonomous Iteration of Thought (AIoT), where an LLM decides when to stop iterating, and Guided Iteration of Thought (GIoT), which always forces a fixed number of iterations. We investigate the performance of IoT across various datasets, spanning complex reasoning tasks from the GPQA dataset, explorative problem-solving in Game of 24, puzzle solving in Mini Crosswords, and multi-hop question answering from the HotpotQA dataset. Our results show that IoT represents a viable paradigm for autonomous response refinement in LLMs, showcasing significant improvements over CoT and thereby enabling more adaptive and efficient reasoning systems that minimize human intervention.
1 Introduction
The development of Large Language Models (LLMs) like GPT-3, PaLM (Anil et al., 2023), and their successors, including GPT-4 (OpenAI, 2023), Gemini (Team et al., 2023), LLaMA (Dubey et al., 2024), and Claude, has revolutionized natural language processing. LLMs have empowered AI systems to perform a wide range of tasks with remarkable proficiency. In the context of human-LLM interaction, a critical observation from practical experience is that the quality of LLM responses tends to improve with repeated prompting and user feedback. Recent research has demonstrated that naïve prompting can lead to calibration errors, while more sophisticated, iterative prompting strategies significantly improve both accuracy and reliability (Krishna et al., 2024). These results suggest that, given context-appropriate sequences of inputs, LLMs can much more effectively leverage their internal knowledge base (Jiang et al., 2020; Petroni et al., 2019; Talmor et al., 2020; Roberts et al., 2020) to provide richer, more nuanced answers (Sloman, 1996).
Figure 1: Illustration of different prompting strategies for enhancing LLM comprehension. The Input-Output (IO) method uses a direct input-output approach with no intermediate reasoning. Chain-of-Thought (CoT) (Wei et al., 2022) prompting introduces a single, linear reasoning path, while the Tree-of-Thought (ToT) (Yao et al., 2024) approach extends this by exploring multiple reasoning paths in parallel. The proposed Iteration of Thought (IoT) framework (this work) introduces an Inner Dialogue Agent (IDA) that dynamically adjusts the reasoning path, enabling adaptive exploration across paths to improve response accuracy.
A human user’s interaction with an LLM often proceeds as follows: the user poses a question to the LLM, receives an initial response, and, if the answer is incomplete or suboptimal, provides additional guidance to the LLM by reiterating contextual clues (e.g. by reminding the LLM of its role, suggesting additional information to consider, or highlighting specific parts of the response that need refinement). This back-and-forth process helps narrow the focus of the LLM while reducing the research effort required from the user, since the LLM is responsible for the bulk of the reasoning and information retrieval.
We identify two predominant forms of human-LLM interaction. In the first form of interaction, the user simply guides an LLM through its own internal knowledge base. For example, consider a scenario where an LLM generates code that is syntactically incorrect due to a missing bracket. The user might prompt it to “verify the syntax,” leading the LLM to correct the error in a subsequent response. In the second form of interaction, the user introduces new information to improve the LLM’s response. For example, an LLM may be asked to provide up-to-date weather information for a specific city, but lacks access to real-time data. In this case, the user can supply this information (using a tool or API), then prompt the LLM to e.g. recommend weather-appropriate clothing or destinations to visit in that locale. Altogether, the first form of interaction leads the LLM to better utilize its internal knowledge, whereas the second form of interaction involves augmenting the LLM’s knowledge with new information.
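To make the distinction concrete, the following is a minimal, hypothetical Python sketch contrasting the two interaction forms as chat-message sequences. The message format and the `get_weather` helper are illustrative assumptions, not part of the paper.

```python
# Illustrative only: hypothetical message sequences for the two interaction forms.

def get_weather(city: str) -> str:
    """Stand-in for a real weather tool or API call (hypothetical)."""
    return "5 degrees Celsius and raining"

# Form 1: guiding the LLM through its own internal knowledge (no new facts added).
form_one = [
    {"role": "user", "content": "Write a parser for this config format."},
    {"role": "assistant", "content": "def parse(cfg: ...  # missing bracket"},
    {"role": "user", "content": "Verify the syntax."},  # pure steering
]

# Form 2: augmenting the LLM's knowledge with new external information.
form_two = [
    {"role": "user", "content": (
        f"It is currently {get_weather('Toronto')} in Toronto. "
        "Recommend weather-appropriate clothing and a destination to visit."
    )},
]
```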
The potential of iterative prompting to improve LLM responses is supported by research showing that prompt phrasing can significantly influence a model’s performance in various settings (Brown, 2020; Opsahl-Ong et al., 2024). Figure 1 illustrates the progression from simple Input-Output (IO) approaches to more advanced methods like Chain-of-Thought (CoT) (Wei et al., 2022) and Tree-of-Thought (ToT) (Yao et al., 2024). CoT introduces sequential reasoning steps along a single linear path, while ToT explores multiple reasoning pathways in parallel, forming a branching structure to optimize the output.
These methods represent “reasoning frameworks” that rely on static or semi-static prompts, which may struggle to adapt to the evolving context of each query and response, potentially limiting the quality of LLM responses. CoT prompting encourages LLMs to articulate their intermediate reasoning steps, which leads to better performance on complex tasks. Similarly, the related ToT approach (among other methods (Sahoo et al., 2024)) reasons along multiple paths to consider a wider breadth of potential responses, most of which are generated then discarded, leading to better performance on more explorative tasks like solving puzzles or crosswords. Other frameworks like Self-Refine (Madaan et al., 2024) and Self-Verification (Weng et al., 2022) enable LLMs to iteratively critique and refine their outputs, but still rely on static or semi-static prompts. In a broader context, the value of pursuing improved reasoning with inference techniques, as opposed to extensive training, is underscored by more recent advancements such as OpenAI’s new series of o1 models (OpenAI, 2024). These proprietary models are specifically designed to spend more time “thinking” through problems before responding, focusing on inference to solve complex tasks in science, coding, and math. Such developments highlight a broader shift in the AI community toward post-training enhancement of reasoning capabilities as a more scalable approach.
In this work, noting the lack of reasoning frameworks that strive to replicate the dynamic nature of human-LLM interactions, we propose IoT as an autonomous, iterative, and adaptive approach to LLM reasoning without human feedback.
1.1 Iteration of Thought (IoT)
Unlike the aforementioned static and semi-static frameworks, IoT utilizes an Inner Dialogue Agent (IDA) to adjust and refine its reasoning path during each iteration. This enables adaptive exploration across different reasoning trees, fostering a more flexible and context-aware response generation process. A comparison to existing methods is shown schematically in Figure 1 .
The core IoT framework is composed of three main components. Further details are also provided in Section 2 .
• Inner Dialogue Agent (IDA): The IDA functions as a “guide” that dynamically generates context-sensitive prompts based on the original user query and the LLM’s previous response. The adjusted prompts serve to iteratively lead the LLM toward more refined and accurate answers. Mathematically, the IDA can be represented as a function $C: \mathcal{Q} \times \mathcal{R} \times \mathcal{K}' \rightarrow \mathcal{P}$, where $\mathcal{Q}$ is the space of possible queries, $\mathcal{R}$ is the space of potential LLM responses, and $\mathcal{P}$ is the space of generated prompts. At each step, it takes the current query $q \in \mathcal{Q}$ and the previous response $r \in \mathcal{R}$ to generate a new prompt $p \in \mathcal{P}$. This process makes prompt generation dynamic, differentiating IoT from more rigid approaches like CoT and allowing it to adapt to an evolving context.
• LLM Agent (LLMA): The LLMA embodies the core reasoning capabilities of an LLM and processes the IDA’s dynamically generated prompts. It uses the LLM’s internal knowledge $K$ to refine its responses. Formally, we model the LLMA as a function $L: \mathcal{Q} \times \mathcal{P} \times K \rightarrow \mathcal{R}$. The LLMA takes as input a query $q$, a prompt $p$, and a knowledge base $K$, then generates a refined response $r$. The LLMA also identifies areas of uncertainty or gaps in its own reasoning, providing feedback for the IDA to adjust prompts accordingly. This interaction creates a closed-loop system that continuously improves the quality of answers without external inputs.
• Iterative prompting loop: The iterative process in IoT involves back-and-forth communication between the IDA and the LLMA. At each iteration $i$, the IDA generates a new prompt $p_i = C(q, r_{i-1})$ based on the original query $q$ and the LLM’s previous response $r_{i-1}$. The LLMA then responds to $p_i$ with $r_i = L(q, p_i, K)$.
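The three components above suggest a simple control loop. Below is a minimal Python sketch of that loop under stated assumptions: the `call_llm` helper, the prompt wording, and the `'STOP'` convention for AIoT-style termination are illustrative stand-ins, not the authors’ implementation.

```python
# Minimal sketch of the IoT loop (illustrative; not the paper's implementation).
# Assumes a generic call_llm(text) -> str helper wrapping any chat-style LLM API.

from typing import Callable

def iteration_of_thought(
    query: str,
    call_llm: Callable[[str], str],
    max_iterations: int = 5,
    autonomous: bool = True,  # True ~ AIoT (may stop early); False ~ GIoT (fixed count)
) -> str:
    response = call_llm(query)  # initial answer from the LLM Agent (LLMA)
    for _ in range(max_iterations):
        # Inner Dialogue Agent (IDA): p_i = C(q, r_{i-1})
        # Generates a context-sensitive guiding prompt from the query and last response.
        ida_prompt = call_llm(
            "You are an inner dialogue agent. Given the query and the current "
            "answer, write one instructive prompt that would improve the answer. "
            "If the answer is already complete, reply only with 'STOP'.\n"
            f"Query: {query}\nCurrent answer: {response}"
        )
        if autonomous and ida_prompt.strip() == "STOP":
            break  # AIoT: the agent judges the answer complete and ends the loop
        # LLM Agent (LLMA): r_i = L(q, p_i, K), with K implicit in the model's weights
        response = call_llm(
            f"Query: {query}\nGuidance: {ida_prompt}\n"
            f"Previous answer: {response}\nProvide an improved answer."
        )
    return response
```

In this sketch a single underlying model plays both roles under different framings; setting `autonomous=False` recovers GIoT-style behavior, which always runs the full number of refinement iterations.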