Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models

We introduce Buffer of Thoughts (BoT), a novel and versatile thought-augmented reasoning approach for enhancing the accuracy, efficiency and robustness of large language models (LLMs). Specifically, we propose meta-buffer to store a series of informative high-level thoughts, namely thought-templates, distilled from the problem-solving processes across various tasks. Then, for each problem, we retrieve a relevant thought-template and adaptively instantiate it with specific reasoning structures to conduct efficient reasoning. To guarantee scalability and stability, we further propose buffer-manager to dynamically update the meta-buffer, thus enhancing its capacity as more tasks are solved. We conduct extensive experiments on 10 challenging reasoning-intensive tasks, and achieve significant performance improvements over previous SOTA methods: 11% on Game of 24, 20% on Geometric Shapes and 51% on Checkmate-in-One. Further analysis demonstrates the superior generalization ability and model robustness of our BoT, while requiring only 12% of the cost of multi-query prompting methods (e.g., tree/graph of thoughts) on average. Notably, we find that our Llama3-8B + BoT has the potential to surpass the Llama3-70B model. Our project is available at https://github.com/YangLing0818/buffer-of-thought-llm


1 Introduction

A series of Large Language Models (LLMs) [1–5] like GPT-4 [3], PaLM [2] and LLaMA [6, 7] have showcased impressive performance in various reasoning tasks. In addition to scaling up model size to improve reasoning performance, there are more effective prompting methods that further enhance the functionality and performance of LLMs. We divide these methods into two categories: (i) single-query reasoning: these methods [8–10] usually focus on prompt engineering, and their reasoning process can be finished within a single query, such as CoT [8], which appends the input query with 'Let's think step by step' to produce rationales that increase reasoning accuracy, and Few-shot Prompting [11, 12, 9, 13], which provides task-relevant exemplars to assist answer generation; (ii) multi-query reasoning: these methods [14, 15] focus on leveraging multiple LLM queries to elicit different plausible reasoning paths, thus decomposing a complex problem into a series of simpler sub-problems, such as Least-to-Most [16], ToT [14] and GoT [17].

However, both kinds of methods face some limitations: (1) single-query reasoning usually requires prior assumptions or relevant exemplars of the reasoning process, which makes it impractical to design them manually task by task, and thus lacks universality and generalization; (2) due to the recursive expansion of reasoning paths, multi-query reasoning is usually computationally intensive when finding the unique intrinsic structure underlying the reasoning process of each specific task;

(3) both single-query and multi-query reasoning processes are limited by their designed exemplars and reasoning structures, and they neglect to derive general, high-level guidelines or thoughts from previously-completed tasks, which are informative for improving efficiency and accuracy when solving similar problems.


To address these limitations, we propose Buffer of Thoughts (BoT), a novel and versatile thought-augmented reasoning framework aimed at enhancing the reasoning accuracy, efficiency and robustness of LLMs across various tasks. Specifically, we design meta-buffer, a lightweight library housing a series of universal high-level thoughts (thought-templates), which are distilled from different problem-solving processes and can be shared across tasks. Then, for each problem, we retrieve a relevant thought-template and instantiate it with a specific reasoning structure for efficient thought-augmented reasoning.

In order to guarantee the scalability and stability of our BoT, we further propose buffer-manager to dynamically update the meta-buffer, which effectively enhances the capacity of meta-buffer as more tasks are solved.

Our method has three critical advantages: (i) Accuracy Improvement: with the shared thought-templates, we can adaptively instantiate high-level thoughts to address different tasks, eliminating the need to build reasoning structures from scratch and thereby improving reasoning accuracy. (ii) Reasoning Efficiency: our thought-augmented reasoning can directly leverage informative historical reasoning structures without complex multi-query processes, thus improving reasoning efficiency. (iii) Model Robustness: the procedure from thought retrieval to thought instantiation mirrors the human thought process, enabling LLMs to address similar problems in a consistent way and thus significantly enhancing the model robustness of our method. Our empirical studies demonstrate that Buffer of Thoughts significantly improves precision, efficiency, and robustness across a diverse array of tasks. Here, we summarize our contributions as follows:

1. We propose a novel thought-augmented reasoning framework Buffer of Thoughts (BoT) for improving the accuracy, efficiency and robustness of LLM-based reasoning.

2. We propose meta-buffer to store informative high-level thoughts distilled from different problems, and adaptively instantiate each thought-template to address each specific task.

3. We design buffer-manager to distill thought-templates from various solutions, which continually improves the capacity of meta-buffer as more tasks are solved.

4. We conduct extensive experiments on 10 challenging reasoning-intensive tasks. Our BoT achieves significant performance improvements over previous SOTA methods: 11% on Game of 24, 20% on Geometric Shapes and 51% on Checkmate-in-One, while requiring only 12% of the cost of multi-query prompting methods on average.


2 Related Work and Discussions

Retrieval-Augmented Language Models The retrieval-augmented (large) language model is introduced as a solution to mitigate hallucination and enhance the output quality of language models [18–22]. When presented with an input question, the retrieval-augmented LLM first queries an external database with billions of tokens [23] to retrieve a subset of the text corpus that helps generate the final answer. Notably, the retrieval-augmented LLM achieves superior question-answering performance using fewer parameters than conventional LLMs [19], and it has found application across various downstream tasks [24–26], including multi-modal generation [24, 22, 23, 25] and biomedical applications [26, 27]. In this paper, we construct a novel category of retrieval database, termed meta-buffer, which contains a series of high-level thoughts rather than specific instances, aiming to universally address various tasks for LLM-based reasoning.


Prompt-based Reasoning with Large Language Models Prompting techniques have significantly enhanced the arithmetic and commonsense reasoning capabilities of LLMs. Chain-of-Thought (CoT) prompting [8] and its variants [28–30], such as Least-to-Most [16], Decomposed Prompting [31], and Auto-CoT [13], prompt LLMs to break down complex questions into simpler subtasks and systematically solve them before summarizing a final answer. Numerous studies [32–37] have demonstrated the effectiveness of these prompting methods across a wide range of tasks and benchmarks.


Innovations like Tree-of-Thought [14] and Graph-of-Thought [17] have further advanced this field by exploring dynamic, non-linear reasoning pathways to expand the heuristic capabilities of LLMs [38, 39].


However, they suffer from increased resource demands and greater time complexity, depend on manual prompt crafting, and are often tailored to specific task types. Recent meta-prompting methods [15, 40] utilize the same task-agnostic form of prompting for various tasks and recursively guide a single LLM to adaptively address different input queries. Nevertheless, such a long meta prompt may require a considerable context window, and these methods fail to leverage historical informative guidelines or thoughts for potentially similar tasks.


Analogical Reasoning Analogical reasoning is a useful technique for natural language reasoning [41–45]. Recent works demonstrate that LLMs can perform analogical reasoning just like humans [46, 47, 12, 48, 49]. For example, Analogical Prompting [12] and Thought Propagation [48] prompt LLMs to self-generate a set of analogous problems, and then utilize the results of these analogous problems to produce a solution for the input problem. However, the specific solutions to self-explored problems may introduce additional noise and cause error accumulation. The recent Thought-Retriever [49] uses the intermediate thoughts generated when solving past user queries to address analogous queries, but it focuses only on textual comprehension/generation rather than general reasoning problems. Thus, a more high-level and general analogical approach for LLM complex reasoning is still lacking.



3 Buffer of Thoughts

Overview of Buffer of Thoughts In this section, we introduce our Buffer of Thoughts in detail and illustrate our core thought-augmented reasoning process in Figure 2. Given a specific task, we utilize our problem-distiller (Section 3.1) to extract critical task-specific information along with relevant constraints. Based on the distilled information, we search the meta-buffer (Section 3.2), which contains a series of high-level thoughts (thought-templates), and retrieve the most relevant thought-template for the task. Subsequently, we instantiate the retrieved thought-template with more task-specific reasoning structures and conduct the reasoning process. Finally, we employ a buffer-manager (Section 3.3) to summarize the whole problem-solving process and distill high-level thoughts for increasing the capacity of the meta-buffer.


Figure 2: Illustration of different reasoning processes. Buffer of Thoughts enables large language models to tackle complex reasoning tasks through our thought-augmented reasoning process. The thought-template is marked in orange and the instantiated thought is marked in blue.


[Figure 2 content] The example problem: a shopping mall sells a batch of branded shirts, with average daily sales of 20 pieces and a profit of 40 yuan per piece. To expand sales, increase profits, and reduce inventory as soon as possible, the mall decides to take appropriate price-reduction measures. An investigation finds that for every 1-yuan decrease in price, an average of 2 more shirts are sold per day. If the mall wants an average daily profit of 1200 yuan, by how much should the price of each shirt be reduced?

Chain-of-Thought: computes the current daily profit (20 × 40 = 800 yuan) and the additional daily profit from extra sales (2 × 40 = 80 yuan), then concludes "price reduction needed = 320 / 2 = 160 yuan", which is incorrect.

Plan-and-Solve: breaks the problem into steps, sets up an equation for the desired daily profit, simplifies it to p² − 30p − 200 = 0, and obtains p ≈ 35.6155, which is also incorrect.

Thought Template T1 (retrieved by BoT): a structured template for solving any quadratic equation ax² + bx + c = 0. Step 1: calculate the discriminant D = b² − 4ac. Step 2: determine the nature of the roots: if D > 0, the equation has two distinct real roots; if D = 0, it has exactly one real root (a repeated or double root); if D < 0, it has two complex roots. Step 3: compute the roots: for D ≥ 0, x = (−b ± √D) / (2a); for D < 0, obtain the real and imaginary parts from x = (−b ± √(−D)·i) / (2a), where i is the imaginary unit.

Thought Template Tn: a generic programming template defining process_element(element) (how to process each individual element, e.g., applying a filter or transformation), combine_elements(element1, element2) (how to combine elements, e.g., summing numbers or concatenating strings), check_condition(accumulated_result) (the condition the accumulated result must meet), and solve_problem(input_list).

Instantiated solution (BoT): let x be the price reduction per shirt; maintaining an average daily profit of 1200 yuan gives (40 − x)(20 + 2x) = 1200, which after simplification becomes x² − 30x + 200 = 0. Following the template: Step 1: D = (−30)² − 4 · 1 · 200 = 100. Step 2: since D > 0, the equation has two distinct real roots. Step 3: x = (30 ± 10) / 2, so the two possible solutions are x = 20 or x = 10. To reduce inventory as soon as possible, x = 20 is taken.
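Thought Template Tn above is only a skeleton. A runnable sketch of the same pattern is shown below; the function names come from the template, while the concrete bodies (doubling, summation, a threshold of 24) are assumptions for illustration only:

```python
# Runnable sketch of Thought Template Tn's process/combine/check skeleton.
# Only the function names are from the template; the bodies are assumed.

def process_element(element):
    # How to process each individual element, e.g., a transformation.
    return element * 2

def combine_elements(element1, element2):
    # How to combine elements, e.g., summing numbers.
    return element1 + element2

def check_condition(accumulated_result):
    # The condition the accumulated result must meet (target assumed).
    return accumulated_result >= 24

def solve_problem(input_list):
    # Fold the processed elements together, then test the condition.
    accumulated = 0
    for element in input_list:
        accumulated = combine_elements(accumulated, process_element(element))
    return check_condition(accumulated)
```

An instantiated template would replace these placeholder bodies with task-specific logic while keeping the same four-function structure.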

3.1 Problem Distiller

Most complex tasks contain implicit constraints, complex object relationships, and intricate variables and parameters within their contexts. Consequently, during the reasoning stage, LLMs need to overcome three main challenges: extracting vital information, recognizing potential constraints, and performing accurate reasoning. These challenges impose a significant burden on a single LLM. Therefore, we separate the extraction and comprehension of task information from the final reasoning stage by prepending a problem-distiller to the reasoning process.


More concretely, we design a meta prompt ϕ to first distill and formalize the task information. The distilled task information can be denoted as:

xd = LLM(ϕ(x)),    (1)

where x is the task statement. Due to the page limit, we put the detailed meta prompt for the problem-distiller in Appendix A.2.


Problem Condensation and Translation

We use the problem-distiller to extract key elements from input tasks, focusing on: (1) the essential parameters and variables for problem-solving; (2) the objectives of the input tasks and their corresponding constraints. We then re-organize this distilled information into a clear, comprehensible format for the subsequent reasoning stage, and translate the specific problems into high-level concepts and structures. This translation procedure decomposes complex real-world problems, like intricate mathematical application scenarios, into simpler, multi-step calculations, making it easier to later retrieve high-level thoughts.
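The distillation step of Eq. (1) can be sketched as a single LLM call. In this sketch, META_PROMPT is an abbreviated stand-in for the actual meta prompt ϕ given in Appendix A.2, and `llm` is a generic placeholder callable rather than any specific API:

```python
# Sketch of the problem-distiller, Eq. (1): x_d = LLM(phi(x)).
# META_PROMPT abbreviates the paper's meta prompt phi (Appendix A.2);
# `llm` is a placeholder callable mapping a prompt string to a response.
META_PROMPT = (
    "Distill the following task: extract the essential parameters and "
    "variables, the objective, and any constraints, then re-state them "
    "in a clear, structured form.\n\nTask: {task}"
)

def distill_problem(llm, task):
    """Apply the meta prompt to the task statement x and query the LLM."""
    return llm(META_PROMPT.format(task=task))
```

The distilled output xd then serves as the query for template retrieval in the next stage.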

3.2 Thought-Augmented Reasoning with Meta Buffer 

Motivation Humans often summarize and induce higher-level guidelines when solving problems, and then apply them to relevant problems. Motivated by this, we propose meta-buffer, a lightweight library that contains a series of high-level thoughts (thought-templates) for addressing various types of problems. Unlike traditional methods [11, 46, 12, 36, 9] that require specific instructions or exemplars, our high-level thought-templates can be adaptively instantiated when solving different problems, thereby enhancing LLMs with superior precision and flexibility.


Thought Template As a kind of high-level guideline, our thought-templates are stored in the meta-buffer and are obtained from various problem-solving processes by our buffer-manager. The details of acquiring thought-templates are introduced in Section 3.3. Since our BoT aims to provide a general reasoning approach for various tasks, we correspondingly classify the thought-templates into six categories: Text Comprehension, Creative Language Generation, Common Sense Reasoning, Mathematical Reasoning, Code Programming and Application Scheduling. We provide some example thought-templates in Appendix A.1. Such a classification of thought-templates facilitates template retrieval for finding the most suitable solutions to different problems. We denote a thought-template, its description and its corresponding category as (Ti, DTi, Ck), where i denotes the index of the thought-template, k ∈ Z+ with 1 ≤ k ≤ 6 indicates that Ck is one of the six categories, and DTi is the description of the thought-template.

Template Retrieval For each task, our BoT retrieves the thought-template that is most similar to the distilled problem xd by calculating the embedding similarity between xd and each template description DTi. The retrieval process can be formulated as:

j = argmax_{1≤i≤N} Sim(f(xd), f(DTi)), subject to Sim(f(xd), f(DTj)) ≥ δ,    (2)

where N is the size of the meta-buffer, f(·) is a text embedding model, and Tj denotes the retrieved thought-template. We set a threshold δ (0.5–0.7 is recommended) to determine whether the current task is new: if max_{1≤i≤N} Sim(f(xd), f(DTi)) < δ, we identify the task x as a new task.

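The retrieval rule of Eq. (2) can be sketched as an argmax over embedding similarities with the threshold δ deciding the new-task case. Here cosine similarity and raw vectors stand in for the embedding model f; the paper's actual embedding model is not specified in this excerpt:

```python
import math

# Sketch of thought-template retrieval, Eq. (2): pick the template whose
# description embedding is most similar to the distilled problem x_d, or
# flag the task as new when no similarity reaches the threshold delta.
# Cosine similarity over plain vectors is a stand-in for Sim(f(.), f(.)).

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_template(xd_emb, template_embs, delta=0.6):
    """Return the index j of the best-matching template, or None for a new task."""
    sims = [cosine(xd_emb, emb) for emb in template_embs]
    j = max(range(len(sims)), key=sims.__getitem__)
    return j if sims[j] >= delta else None
```

The default δ = 0.6 sits in the paper's recommended 0.5–0.7 range; a None result routes the task to the new-task branch described next.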

Instantiated Reasoning 

For each specific task, we discuss two situations for instantiated reasoning, depending on whether the current task is new. In the first situation, we successfully retrieve a thought-template Tj for the task. In this case, as presented in Figure 2, our thought-augmented reasoning is adaptively instantiated into a suitable reasoning structure with our designed instantiation prompt (in Appendix A.3). For example, in a Checkmate-in-One problem, we instantiate a template that updates the chess-board state to solve the problem step by step. We thus conduct instantiated reasoning for task x using the distilled information xd and the retrieved template Tj, producing its solution Sx as:

Sx = LLM_instantiation(xd, Tj),    (3)

where LLM_instantiation denotes the instantiated reasoner with an LLM.

In the second situation, the task is identified as a new task. To enable proper instantiated reasoning, we prepare three general coarse-grained thought-templates for utilization. Based on the distilled task information xd, our BoT automatically assigns a suitable thought-template to the reasoning process. The detailed pre-defined thought-templates are included in Appendix A.3.

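The two instantiation branches can be sketched as a simple dispatch. In this sketch, `llm_instantiate` is a placeholder for the LLM call of Eq. (3), GENERAL_TEMPLATES abbreviates the three coarse-grained fallback templates of Appendix A.3, and the fallback-selection logic is deliberately elided:

```python
# Sketch of the two instantiated-reasoning branches: use the retrieved
# template T_j when one exists, otherwise fall back to a general
# coarse-grained template. All names here are illustrative placeholders.

GENERAL_TEMPLATES = {"text": "...", "math": "...", "code": "..."}  # placeholders

def thought_augmented_reasoning(llm_instantiate, xd, retrieved_template):
    if retrieved_template is not None:
        # Eq. (3): S_x = LLM_instantiation(x_d, T_j)
        return llm_instantiate(xd, retrieved_template)
    # New task: assign one of the general templates based on x_d
    # (the selection logic is elided in this sketch).
    return llm_instantiate(xd, GENERAL_TEMPLATES["math"])
```

Either branch yields a solution Sx, which the buffer-manager then consumes for template distillation.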

3.3 Buffer Manager 

We propose buffer-manager to summarize the high-level guidelines and thoughts gained from each problem-solving process. It generalizes each specific solution to more problems by storing the critical distilled knowledge in the form of thought-templates within the meta-buffer.

In contrast to methods that temporarily generate exemplars or instructions for each problem, our buffer-manager ensures lasting improvements in accuracy, efficiency, and robustness for LLM-based reasoning.

Template Distillation 

To extract a general thought-template, we propose a three-step approach:

(1) Core task summarization: identify and describe the basic type and core challenges of the problem;

(2) Solution steps description: summarize the general steps for solving the problem;

(3) General answering template: based on the above analysis, propose a solution template or approach that can be widely applied to similar problems. Additionally, to boost the generalization ability and stability of template distillation, we carefully design two types of in-context examples of how to generate thought-templates: in-task and cross-task examples. Cross-task means we choose a template distilled from one task to tackle problems of other tasks, such as addressing a mathematical problem with a code-related thought-template. The new template distilled from input task x can be denoted as:

Tnew = LLMdistill(xd, Sx), (4) 

where LLMdistill is the LLM-based template distiller initialized with the following prompt:


Prompt for Template Distillation: 

User: [Problem Description] + [Solution Steps or Code] 

To extract and summarize the high-level paradigms and general approaches for solving such problems, please follow these steps in your response: 

1. Core task summarization: Identify and describe the basic type and core challenges of the problem, such as classifying it as a mathematical problem (e.g., solving a quadratic equation), a data structure problem (e.g., array sorting), an algorithm problem (e.g., search algorithms), etc. And analyze the most efficient way to solve the problem.

2. Solution Steps Description: Outline the general solution steps, including how to define the problem, determine variables, list key equations or constraints, choose appropriate solving strategies and methods, and how to verify the correctness of the results.

3. General Answer Template: Based on the above analysis, propose a template or approach that can be widely applied to this type of problem, including possible variables, functions, class definitions, etc. If it is a programming problem, provide a set of base classes and interfaces that can be used to construct solutions to specific problems.

Please ensure that your response is highly concise and structured, so that specific solutions can be transformed into generalizable methods.

[Optional] Here are some exemplars of the thought-template: (Choose cross-task or in-task exemplars based on the analysis of the Core task summarization.)
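Putting Eq. (4) together with the prompt above, the distillation call can be sketched as follows. DISTILL_PROMPT abbreviates the full prompt shown above, and `llm` is again a generic placeholder callable rather than a specific API:

```python
# Sketch of template distillation, Eq. (4): T_new = LLM_distill(x_d, S_x).
# DISTILL_PROMPT abbreviates the full distillation prompt shown above.
DISTILL_PROMPT = (
    "{problem}\n{solution}\n\n"
    "1. Core task summarization: identify the basic type and core "
    "challenges of the problem.\n"
    "2. Solution steps description: outline the general solution steps.\n"
    "3. General answer template: propose a widely applicable template."
)

def distill_template(llm, xd, sx):
    """Distill a new thought-template from a distilled problem and its solution."""
    return llm(DISTILL_PROMPT.format(problem=xd, solution=sx))
```

The resulting template description is then checked against the meta-buffer's update rule before being stored.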


Dynamic Update of Meta-Buffer 

After template distillation, we need to consider whether the distilled template should be added to the meta-buffer. If we initialize an empty meta-buffer or encounter a problem without a proper thought-template, the distilled thought-template is stored in the meta-buffer directly. If we solve a problem with a retrieved thought-template, new insights may still arise during its instantiation. Therefore, to avoid redundancy in the meta-buffer while retaining newly-generated informative thoughts, we calculate the similarity between the embedding of the new description DTnew and those of the stored descriptions {DTi}, and update the meta-buffer only when the following condition holds:

max_{1≤i≤N} Sim(f(DTnew), f(DTi)) < δ.    (5)

Otherwise, the meta-buffer already possesses the knowledge needed to solve this task and no update is performed. Our dynamic update strategy effectively reduces the computational burden of template retrieval while keeping the meta-buffer lightweight.

We further conduct an ablation study to analyze this in Section 6.
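The update rule of Eq. (5) can be sketched as a redundancy check before appending. As in the retrieval sketch, cosine similarity over plain vectors stands in for the embedding similarity, and δ = 0.6 is an assumed value in the recommended range:

```python
import math

# Sketch of the meta-buffer update rule, Eq. (5): store the newly
# distilled template only if its description embedding is dissimilar
# (below delta) to every stored description embedding.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def maybe_update_meta_buffer(buffer_embs, new_emb, delta=0.6):
    """Append new_emb to buffer_embs if Eq. (5) holds; report the decision."""
    if not buffer_embs or max(cosine(new_emb, e) for e in buffer_embs) < delta:
        buffer_embs.append(new_emb)
        return True   # stored: genuinely new knowledge
    return False      # skipped: redundant with an existing template
```

An empty meta-buffer accepts any template unconditionally, matching the initialization case described above.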


4 Experiments 

Datasets and Tasks

To evaluate the efficacy of our proposed Buffer of Thoughts and compare with previous methods, we consider a diverse set of tasks and datasets that require varying degrees of mathematical and algorithmic reasoning, domain-specific knowledge, and literary creativity: (a) the Game of 24 from ToT [14], where the objective is to form an arithmetic expression that equals 24 using each of four given numbers exactly once; (b) three BIG-Bench Hard (BBH) [35] tasks: Geometric Shapes, Multi-Step Arithmetic Two, and Word Sorting; (c) three reasoning tasks directly obtained from the BIG-Bench suite [50]: Checkmate-in-One; Penguins, where the task is to answer questions about penguins' attributes based on a given table and additional natural language information; and Date Understanding, a task that involves inferring dates from natural language descriptions, performing arithmetic operations on dates, and utilizing global knowledge such as the number of days in February; (d) Python Programming Puzzles (P3) [51, 52], a collection of challenging programming puzzles written in Python with varying difficulty levels; (e) Multilingual Grade School Math (MGSM) [33], a multilingual version of the GSM8K dataset [53] featuring translations of a subset of examples into ten typologically diverse languages, including Bengali, Japanese, and Swahili; (f) Shakespearean Sonnet Writing from meta-prompting [15], a novel task where the goal is to write a sonnet following the strict rhyme scheme "ABAB CDCD EFEF GG" and incorporating three provided words verbatim.
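To make the Game of 24 task concrete, a brute-force solver can check whether any arithmetic expression over the four numbers reaches 24. This is purely illustrative of the task definition, not part of the BoT method; exact rational arithmetic avoids floating-point misses on cases like 3 3 8 8:

```python
import itertools
from fractions import Fraction

# Illustrative brute-force solver for the Game of 24: use each of the four
# given numbers exactly once with +, -, *, / to reach 24.

def solve_24(nums):
    """Return True if some expression over all the given numbers equals 24."""
    values = [Fraction(n) for n in nums]

    def search(xs):
        if len(xs) == 1:
            return xs[0] == 24
        # Pick an ordered pair, combine it, and recurse on the reduced list.
        for i, j in itertools.permutations(range(len(xs)), 2):
            rest = [xs[k] for k in range(len(xs)) if k not in (i, j)]
            a, b = xs[i], xs[j]
            candidates = [a + b, a - b, a * b]
            if b != 0:
                candidates.append(a / b)
            if any(search(rest + [c]) for c in candidates):
                return True
        return False

    return search(values)
```

Ordered index pairs cover both a − b and b − a (likewise for division), so only four operations per pair are needed.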


Implementation and Baselines 

For fair comparison with previous methods, we use GPT-4 as the base model of our BoT in both the main experiments and the ablation study (Section 6). We also use Llama3-8B and Llama3-70B in our analysis, running on an NVIDIA A100-PCIE-40GB GPU. We compare our Buffer of Thoughts with the following prompting methods: 1. Standard Prompting: This is our most basic baseline, where an LLM is asked to generate a response directly from the input query, without any specific guiding input-output examples or additional instructions beyond the task description included in the query.


Table 1: Comparing BoT with previous methods across various tasks. We denote the best score in blue, and the second-best score in green. Our BoT significantly outperforms other methods on all tasks, especially on general reasoning problems.


2. Single-query Method: This includes Zero-shot CoT [8] and PAL [10], which use the LLM to analyze natural language problems and generate intermediate reasoning steps. We also include Expert Prompting [9], which creates an expert identity tailored to the specific context of the input query, and then integrates this expert profile into the input to generate a well-informed response.


3. Multi-query Method: This includes ToT [14] and GoT [17], which enable LLMs to make deliberate decisions by considering multiple reasoning paths and self-evaluating choices to determine the next course of action. These methods also allow for looking ahead or backtracking when necessary to make global decisions. Additionally, we include Meta Prompting [15], which employs an effective scaffolding technique designed to enhance the functionality of LLMs.


4.1 BoT Achieves Better Accuracy, Efficiency and Robustness 

Reasoning Accuracy 

As shown in Table 1, our BoT consistently outperforms all previous prompting methods across multiple kinds of challenging benchmarks, particularly on complicated reasoning tasks such as Game of 24 and Checkmate-in-One. Taking GPT-4 as a baseline, our method achieves an astonishing 79.4% accuracy improvement on Game of 24, and compared to ToT, which performs well on this task, we still achieve an 8.4% accuracy improvement. Moreover, compared to the recent meta-prompting method [15], we see significant accuracy improvements: 23% on Game of 24, 20% on Geometric Shapes and 51% on Checkmate-in-One. Existing methods need complex, iterative, and heuristic search strategies to address these problems on a case-by-case basis. Conversely, our BoT leverages the historical insights and informative guidelines from thought-templates, and further adaptively instantiates a more optimal reasoning structure for addressing these complex problems.


Reasoning Efficiency 

In addition to significant improvements in accuracy, our BoT, though a multi-query method, achieves reasoning time comparable to single-query methods across various tasks, while being considerably faster than conventional multi-query methods like ToT [14], as shown in Figure 3.

For example, in Game of 24, both single-query and multi-query methods necessitate iterative and heuristic searches to identify feasible solutions. This process is particularly time-consuming and inefficient, especially for multi-query methods, which involve multi-query search and backtracking phases. In contrast, our BoT directly retrieves a thought-template in code format, so a program is instantiated to traverse combinations of numbers and symbols, eliminating the need to build the reasoning structure from scratch. This allows the problem to be solved with just one query after invoking the problem-distiller, significantly reducing the time required for complex reasoning. Notably, our BoT requires only 12% of the cost of multi-query methods (e.g., tree of thoughts and meta-prompting) on average.


Reasoning Robustness 

To better evaluate our BoT, we devise a new evaluation metric: success rate, which is used to assess the reasoning robustness. We randomly sample 1000 examples from various benchmarks as a test subset and evaluate different methods on this subset. As shown in Figure 4, we repeat this evaluation process 10 times and take the average accuracy as the success rate of different methods on each benchmark. Compared with other methods, our BoT consistently maintains a higher success rate across various tasks, surpassing the second-best by 10% in average success rate. We attribute our outstanding robustness to the great generalization ability of our distilled thought-templates during reasoning across different tasks. By offering high-level thought from the suitable thought-templates, the stability of our method across different tasks is greatly enhanced.
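The success-rate metric above can be sketched as follows; `evaluate`, a hypothetical per-example accuracy check, stands in for running the actual prompting method and is not part of the original implementation:

```python
import random

def success_rate(method, benchmark, evaluate, n_samples=1000, n_rounds=10, seed=0):
    # Sample a fixed test subset, repeat the evaluation n_rounds times,
    # and report the mean accuracy as the success rate.
    subset = random.Random(seed).sample(benchmark, n_samples)
    accuracies = []
    for _ in range(n_rounds):
        correct = sum(1 for example in subset if evaluate(method, example))
        accuracies.append(correct / n_samples)
    return sum(accuracies) / n_rounds
```

Averaging over repeated runs on the same subset smooths out the sampling noise of a single evaluation pass.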


5 Model Analysis 

Distribution Analysis of Thought-Templates 

As depicted in the left panel of Figure 5, we choose six different benchmarks, each sampled with 100 distinct tasks. We update the meta-buffer from scratch and, after completing all sampled tasks, report the number of derived thought-templates.

We can observe that our BoT generates a greater number of thought-templates on the MGSM tasks, which contain more diverse scenarios. In tasks with relatively simple requirements, such as Checkmate-in-One and Penguins, BoT produces more fixed thought-templates tailored for those specific issues.

The distribution of templates indicates that our BoT can effectively discover appropriate thought templates for different benchmarks.


Distribution Analysis of Time Cost 

As illustrated in Figure 5, we measured the average time cost for each component of BoT’s reasoning framework across different tasks. The time required for distilling task information and template retrieval is relatively short, whereas instantiated reasoning takes longer. Overall, considering the complexity of different components, our BoT achieves a relatively balanced distribution of time cost, demonstrating the efficiency of our BoT framework.

Better Trade-off between Model Size and Performance 

As depicted in Figure 6, on Game of 24, word list sorting and Checkmate-in-One, the Llama3-8B and Llama3-70B models [6] may produce poor results. However, equipped with our BoT, both models demonstrate a substantial accuracy improvement. Notably, BoT+Llama3-8B has the potential to surpass the single Llama3-70B model. Our BoT enables smaller models to exhibit capabilities that approximate or even surpass those of larger models, significantly bridging the gap between their reasoning abilities. Furthermore, it greatly diminishes the inference cost required by large language models when tackling complex problems.


Figure 6: We evaluate the trade-off between model size and performance with Llama3-8B and Llama3-70B models on three challenging benchmarks.


6 Ablation Study 

Impact of Problem-Distiller 

As illustrated in Figure 7, when the problem-distiller is disabled, both Llama3-70B and GPT-4 experience a certain degree of accuracy decline. More complex problems, such as Game of 24 and Checkmate-in-One, show a more significant accuracy reduction, whereas relatively simpler problems like word list sorting and MGSM exhibit smaller decreases. This is because LLMs can more easily extract key information in simpler tasks, making the impact of the problem-distiller less noticeable. In contrast, extracting key information and potential constraints in complex problems is more challenging, making the role of our problem-distiller more prominent, thereby explaining the differences depicted in the figure.


Figure 7: We conduct ablation study on problem-distiller across four benchmarks, employing Llama3-70B and GPT-4 as the base models.


Impact of Meta-Buffer 

As illustrated in Figure 8, when the meta-buffer is disabled, both the Llama3-70B and GPT-4 models exhibit a noticeable decline in performance, particularly on benchmarks requiring complex reasoning, such as Game of 24 and Checkmate-in-One. This further underscores the superiority of our meta-buffer in addressing complex problems.


Figure 8: We conduct ablation study on meta-buffer across four benchmarks, employing Llama3-70B and GPT-4 as the base models.


Figure 9: We conduct ablation study on buffer-manager regarding reasoning accuracy across four tasks, employing Llama3-70B and GPT-4 as the base models.


Impact of Buffer-Manager 

In this ablation study, we divide the entire process into four rounds.

In each round, we randomly sample 50 questions from each benchmark and conduct reasoning. In the subsequent round, we continue to randomly sample another 50 questions from each benchmark.

As depicted in Figure 9, as the number of rounds increases, the model with the buffer-manager continually expands the meta-buffer while also utilizing the thought-templates obtained from previously solved problems to help address subsequent similar problems. Therefore, we can observe that the accuracy of BoT steadily improves with each round. In contrast, the model without the buffer-manager fails to exhibit an upward trend. Additionally, we have also measured the reasoning time, as depicted in Figure 10. As the number of rounds increases, the model with the buffer-manager experiences a continual improvement in reasoning efficiency. This is because, with the continual expansion of the meta-buffer, the likelihood of retrieving suitable thought-templates also increases. Consequently, models can avoid constructing reasoning structures from scratch, thereby enhancing the inference efficiency accordingly.


Figure 10: We conduct ablation study on buffer-manager regarding reasoning efficiency across four tasks, employing Llama3-70B and GPT-4 as the base models.


7 Discussion 

Limitations and Future Directions 

Despite our method's significant improvement in accuracy while maintaining reasoning efficiency and robustness, its gains are limited when addressing problems that require human-like creativity, since such problems often do not rely on a specific thought-template. Besides, if our BoT initializes the meta-buffer with a weaker model, the quality of the derived thought-templates may be suboptimal due to the weaker model's limited reasoning ability and instruction-following capability. Overall, our BoT points to a set of future directions: 

1. integrating external resources with BoT to build an open-domain system like agent models [54, 55].

2. making the distillation of thought-templates optimizable, which may significantly enhance their quality for more complex tasks.

Conclusion 

In this work, we introduce Buffer of Thoughts, a novel buffer-based reasoning framework in which LLMs utilize pre-accumulated experiences and methodologies from prior tasks as thought-templates stored within a meta-buffer. We further design a buffer-manager to continuously refine the problem-solving processes and dynamically distill thought-templates, thereby progressively raising the LLM's reasoning capacity. Our BoT demonstrates SOTA performance on 10 challenging tasks, and offers promising prospects for future research and application.


Appendix

A  Additional Method Details 

A.1 Detailed Thought-Templates 

Here we show six example thought-templates in six different categories:

A.1.1 Text Comprehension


Task Description: 

The task involves analyzing a table with various attributes of penguins, such as name, age, height, and weight, and answering questions about these attributes. The table may be updated with new entries, and additional context or comparisons may be provided in natural language.

Solution Description: 

To accurately answer questions about the penguins’ attributes, one must be able to interpret the data presented in tabular form, understand any additional information provided in natural language, and apply logical reasoning to identify the correct attribute based on the question asked.

Thought Template: 

Step 1: Parse the initial table, extracting the header information and each penguin’s attributes into a structured format (e.g., a list of dictionaries).

Step 2: Read and integrate any additional natural language information that updates or adds to the table, ensuring the data remains consistent.

Step 3: Identify the attribute in question (e.g., oldest penguin, heaviest penguin) and the corresponding column in the table.

Step 4: Apply logical reasoning to compare the relevant attribute across all entries to find the correct answer (e.g., the highest age for the oldest penguin).

Step 5: Select the answer from the provided options that matches the result of the logical comparison.
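A minimal sketch of Steps 1-5, assuming a whitespace-separated table; the column names and rows below are illustrative, not taken from the actual benchmark:

```python
def parse_table(text):
    # Step 1: extract the header and each penguin's attributes into dicts.
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]

def answer_superlative(penguins, attribute, largest=True):
    # Steps 3-5: compare the relevant attribute across all entries.
    key = lambda p: float(p[attribute])
    chosen = max(penguins, key=key) if largest else min(penguins, key=key)
    return chosen["name"]

table = """name age height weight
Louis 7 50 11
Bernard 5 80 13
Vincent 9 60 11"""
penguins = parse_table(table)
# Step 2: integrate an additional entry stated in natural language.
penguins.append({"name": "James", "age": "12", "height": "90", "weight": "12"})
print(answer_superlative(penguins, "age"))  # oldest penguin -> James
```

Casting attribute values with `float` keeps the comparison numeric even though the parsed table stores strings.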


A.1.2 Creative Language Generation


Task Description: 

The task is to generate a sonnet that adheres to the traditional English sonnet rhyme scheme of "ABAB CDCD EFEF GG" and includes three specific words verbatim in the text.

Solution Description: 

Writing a sonnet involves crafting 14 lines of poetry that follow a specific rhyme pattern.

The lines are typically in iambic pentameter, though flexibility in rhythm can be allowed for creative reasons. The given rhyme scheme dictates the end sounds of each line, ensuring a structured poetic form. Incorporating the three provided words verbatim requires strategic placement within the lines to maintain the poem’s coherence and thematic unity.

Thought Template: 

Step 1: Identify the three words that must be included in the sonnet.

Step 2: Understand the rhyme scheme "ABAB CDCD EFEF GG" and prepare a list of rhyming words that could be used.

Step 3: Develop a theme or story for the sonnet that can naturally incorporate the three provided words.

Step 4: Begin drafting the sonnet by writing the first quatrain (four lines) following the "ABAB" rhyme scheme, ensuring one or more of the provided words are included.

Step 5: Continue with the second quatrain "CDCD," the third quatrain "EFEF," and finally the closing couplet "GG," each time incorporating the provided words as needed.

Step 6: Review the sonnet for coherence, flow, and adherence to the rhyme scheme, making adjustments as necessary.
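The structural constraints in the steps above (14 lines, and the three provided words verbatim) can be checked mechanically; true rhyme verification would need phonetic data, so this hypothetical `check_sonnet` helper validates only the mechanical constraints:

```python
def check_sonnet(text, required_words):
    # Step 6 (mechanical part): a sonnet has exactly 14 non-empty lines,
    # and the three provided words must appear verbatim in the poem.
    lines = [line for line in text.strip().splitlines() if line.strip()]
    has_form = len(lines) == 14
    has_words = all(word in text for word in required_words)
    return has_form and has_words
```

A checker like this can serve as a cheap validation pass after drafting, flagging drafts that need another revision round.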


A.1.3 Common Sense Reasoning


Task Description: 

Given a specific date and an event, such as a holiday or historical event, determine the date that follows.

Solution Description: 

To determine the next date, we need to consider the structure of the calendar, the number of days in each month, and whether it is a leap year. Typically, the number of days in a month is fixed, except that February varies with leap years. The next day is usually the date increased by one day, unless it is the end of the month, in which case the next day is the first day of the following month. At the end of the year, the next day is January 1st of the following year.

Thought Template: 

Step 1: Identify the given date’s month and day number.

Step 2: Check if it’s the end of the month; if so, confirm the start date of the next month.

Step 3: If it’s not the end of the month, simply add one to the day number.

Step 4: Pay special attention to the end of the year, ensuring the year increments.
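The steps above can be sketched with Python's standard date arithmetic, which already encodes month lengths and leap years, so the month-end and year-end cases reduce to adding one day:

```python
from datetime import date, timedelta

def next_date(year, month, day):
    # Steps 1-4: adding one day handles month-end and year-end rollover,
    # including February 29 in leap years.
    return (date(year, month, day) + timedelta(days=1)).isoformat()

print(next_date(2016, 2, 28))   # 2016 is a leap year -> 2016-02-29
print(next_date(2023, 12, 31))  # year-end: the year increments -> 2024-01-01
```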


A.1.4 Mathematical Reasoning


Task Description: 

Solve a quadratic equation of the form ax^2 + bx + c = 0, covering all possible cases.

Solution Description: 

To solve any quadratic equation of the form ax^2 + bx + c = 0, we can follow a general approach based on the method described. Here is the structured template for solving such equations: Thought Template: 

Step 1: Calculate the Discriminant - Compute the discriminant D using the formula D = b^2 − 4ac.

Step 2: Determine the Nature of the Roots - If D > 0, the equation has two distinct real roots.

- If D = 0, the equation has exactly one real root (also known as a repeated or double root).

- If D < 0, the equation has two complex roots.

Step 3: Compute the Roots - For D ≥ 0, calculate the roots using the formula x = (−b ± √D) / (2a).

- For D < 0, calculate the real and imaginary parts of the complex roots using the formula x = −b/(2a) ± (√(−D)/(2a)) i, where i is the imaginary unit.
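The template above translates directly into code; this sketch assumes real coefficients with a ≠ 0:

```python
import math

def solve_quadratic(a, b, c):
    # Step 1: compute the discriminant D = b^2 - 4ac.
    D = b * b - 4 * a * c
    # Steps 2-3: branch on the sign of D and compute the roots.
    if D > 0:
        root = math.sqrt(D)
        return ((-b + root) / (2 * a), (-b - root) / (2 * a))
    if D == 0:
        return (-b / (2 * a),)  # one repeated real root
    # D < 0: a conjugate pair of complex roots.
    real, imag = -b / (2 * a), math.sqrt(-D) / (2 * a)
    return (complex(real, imag), complex(real, -imag))

print(solve_quadratic(1, -3, 2))  # x^2 - 3x + 2 = 0 -> (2.0, 1.0)
print(solve_quadratic(1, 0, 1))   # x^2 + 1 = 0 -> (1j, -1j)
```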


A.1.5 Code Programming


Task Description: 

When given a list of numbers, use the four basic arithmetic operations (+, −, *, /) to reach a target number.

Thought Template:


from itertools import permutations, product

OPERATIONS = ['+', '-', '*', '/']

def perform_operation(a, b, operation):
    # Define the operation logic (e.g., addition, subtraction, etc.).
    if operation == '+':
        return a + b
    if operation == '-':
        return a - b
    if operation == '*':
        return a * b
    if operation == '/':
        return a / b

def evaluate_sequence(sequence, operations):
    # Apply operations to the sequence (left to right) and return the result.
    result = sequence[0]
    for value, operation in zip(sequence[1:], operations):
        result = perform_operation(result, value, operation)
    return result

def generate_combinations(sequence, operations):
    # Generate all possible assignments of operations between the elements.
    return product(operations, repeat=len(sequence) - 1)

def format_solution(sequence, operations):
    # Format the sequence and operations into a human-readable string.
    expression = str(sequence[0])
    for value, operation in zip(sequence[1:], operations):
        expression = f"({expression} {operation} {value})"
    return expression

def find_solution(input_elements, target_result):
    # Data input handling: validate and preprocess input data if necessary.
    # Core algorithm logic
    for sequence in permutations(input_elements):
        for operation_combination in generate_combinations(sequence, OPERATIONS):
            try:
                if evaluate_sequence(sequence, operation_combination) == target_result:
                    # Data output formatting
                    return format_solution(sequence, operation_combination)
            except ZeroDivisionError:
                # Error handling: skip sequences with invalid divisions.
                continue
    # If no solution is found after all iterations, return a default message.
    return "No solution found"

# Example usage:
input_elements = [1, 7, 10, 3]
target_result = 24
print(find_solution(input_elements, target_result))

A.2 Prompt for Problem Distiller


[Problem Distiller]: 

As a highly professional and intelligent expert in information distillation, you excel at extracting essential information to solve problems from user input queries. You adeptly transform this extracted information into a suitable format based on the respective type of the issue.

Please categorize and extract the crucial information required to solve the problem from the user's input query. The distilled information should include:

1. Key information: Values and information of key variables extracted from user input, which will be handed over to the respective expert for task resolution, ensuring all essential information required to solve the problem is provided.

2. Restrictions: The objective of the problem and corresponding constraints.

3. Distilled task: Extend the problem based on 1 and 2, and summarize a meta-problem that can address the user query while handling more input and output variations. Incorporate the real-world scenario of the extended problem along with the types of key variables and the information constraints from the original problem to restrict the key variables in the extended problem. After that, as an example, use the key information from the user's query as input to solve the problem.


A.3 Prompt for Instantiated Reasoning


[Meta Reasoner] 

You are a Meta Reasoner who is extremely knowledgeable in all kinds of fields including Computer Science, Math, Physics, Literature, History, Chemistry, Logical Reasoning, Culture, Language... You are also able to find different high-level thoughts for different tasks. Here are three reasoning structures: 

i) Prompt-based structure: It performs well when dealing with problems like Common Sense Reasoning and Application Scheduling. 

ii) Procedure-based structure: It performs well when dealing with creative tasks like Creative Language Generation and Text Comprehension. 

iii) Programming-based structure: It performs well when dealing with Mathematical Reasoning and Code Programming; it can also transform real-world problems into programming problems that can be solved efficiently.

(Reasoning instantiation) 

Your task is: 

1. Deliberately consider the context and the problem within the distilled response from the problem distiller, and use your understanding of the question within the distilled response to find a domain expert who is suitable to solve the problem.

2. Considering the distilled information, choose one of the reasoning structures for the problem.

3. If a thought-template is provided, directly follow the thought-template to instantiate it for the given problem.

