[LLM Prompt Optimization] Intent-based Prompt Calibration

Intent-based Prompt Calibration: Enhancing prompt optimization with synthetic boundary cases

Paper link
Code link
[arXiv 2402] Intent-based Prompt Calibration: enhancing prompt optimization with synthetic boundary cases (AutoPrompt)
Auto-Prompt | IPC, a new LLM prompt optimization method that calibrates prompts toward the user's intent

Abstract

Prompt engineering is a challenging and important task due to the high sensitivity of Large Language Models (LLMs) to the given prompt and the inherent ambiguity of a textual task instruction. Automatic prompt engineering is essential to achieve optimized performance from LLMs. Recent studies have demonstrated the capabilities of LLMs to automatically conduct prompt engineering by employing a meta-prompt that incorporates the outcomes of the last trials and proposes an improved prompt. However, this requires a high-quality benchmark to compare different prompts, which is difficult and expensive to acquire in many real-world use cases. In this work, we introduce a new method for automatic prompt engineering, using a calibration process that iteratively refines the prompt to the user intent. During the optimization process, the system jointly generates synthetic data of boundary use cases and optimizes the prompt according to the generated dataset. We demonstrate the effectiveness of our method with respect to strong proprietary models on real-world tasks such as moderation and generation. Our method outperforms state-of-the-art methods with a limited number of annotated samples. Furthermore, we validate the advantages of each one of the system’s key components. Our system is built in a modular way, facilitating easy adaptation to other tasks.


Demo




Full Translation

1 Introduction

In recent years, there has been significant enhancements in the capabilities of Large Language Models (LLMs), demonstrating impressive generative performance across a variety of tasks [11, 4]. Nevertheless, despite these advancements, the quality of the models’ outputs is highly sensitive to the conditioned prompt [18, 40]. Even a slight modification in the prompt format can significantly impact the model’s performance [27]. This issue is even more evident in popular proprietary models, where a change in model version results in drastic changes in model behaviour on a wide range of tasks [6].


In order to tackle the prompt sensitivity issue, several methods [15, 16] proposed to use soft prompts which require access to the LLM itself in order to perform the optimization. Recently, [35, 37, 24] demonstrated the effectiveness of using LLMs themselves to optimize the prompt. To this end, each prompt is assigned a score based on a given benchmark and an appropriate metric. The optimization process is performed iteratively by providing a meta-prompt that incorporates the history of the last few prompt scores and guiding the model to suggest a better prompt with a higher score. However, the high-quality, large benchmarks required by this approach to evaluate the performance of the different prompts often do not exist in many real-world use cases. Moreover, iterating on such large datasets can be costly.


LLMs have proven to be highly effective in generating high-quality and rich datasets that boost model performance on a diverse set of tasks [25, 36, 19, 33]. Recent works demonstrate the capabilities of LLMs to refine the prompt provided by the user, resolving the initial prompt ambiguity [10].


However, without additional information, the model has to guess the true intention of the user, which in many cases can lead to inaccurate results.


In this work, we introduce Intent-based Prompt Calibration (IPC), a system which aims to calibrate the prompt according to the intention of the user, by using synthetic examples. The calibration process is performed by iteratively building a dataset of challenging boundary cases and optimising the prompt according to the generated benchmark. This novel aspect of our method, producing a small benchmark tailored to the boundary cases of the user’s task as part of the optimization process, is highly valuable for explainability, LLM distillation, and other use cases. In contrast to previous works, the system is optimized for real-world use cases such as moderation which usually suffers from imbalanced data distribution. We also extend the prompt optimization to a new family of generative tasks, by first fitting a ranking prompt and then performing the prompt optimization with the learned ranker. Learning a prompt ranker allows us to optimize generative tasks with minimal annotation effort. As demonstrated in our experimentation section, using such an approach without synthetic data of boundary cases, e.g., as done in previous methods, would not be efficient due to the natural imbalance of the ranking distribution.


Lastly, our system is built in a modular way, such that each of its components can be used on its own in other tasks like synthetic data generation or prompt distillation between two LLMs. We describe the system components in detail and demonstrate the effectiveness of our proposed method with respect to strong proprietary models like GPT-3.5/4-Turbo. We show that our method outperforms previous methods using a very small amount of data and iteration steps. This significantly reduces the total optimization efforts and costs, and makes the system applicable to various production use cases.


2 Method

Figure 1: System diagram.

  1. An initial prompt is provided by the user
  2. Synthetic challenging cases are generated
  3. A user or an LLM annotates the examples
  4. After evaluating the prompt's performance, an LLM suggests a new prompt given the last prompt's results.
  5. This process is repeated iteratively until a stop criterion is met
  6. The system outputs a calibrated prompt.


Our system is illustrated in Figure 1. We start with the initial prompt suggestion and a task description. The user can also provide a few examples in a few-shot setting. Then, during the calibration optimization process, the system iteratively: 1. Suggests a few samples of challenging and diverse boundary cases for the task and the current prompt. 2. Evaluates the current prompt on the generated dataset, and provides an analysis. 3. Given the history of the last few prompts, suggests a new prompt with a higher score. The optimization process is terminated when either there is no improvement in the last few steps, or when the maximum number of iterations has been reached.
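The loop below is a minimal Python sketch of this calibration process. The helpers `generate_boundary_cases`, `annotate`, `evaluate_prompt`, and `suggest_prompt` are hypothetical wrappers around the meta-prompted LLM calls (or the human annotator) and are not taken from the authors' codebase; the sketch only illustrates the control flow described above.

```python
def calibrate_prompt(task_desc, initial_prompt, max_iters=10, patience=3):
    """Iteratively calibrate a prompt on synthetic boundary cases (sketch).

    generate_boundary_cases / annotate / evaluate_prompt / suggest_prompt are
    assumed wrappers around meta-prompted LLM calls or a human annotator.
    """
    history = []          # (prompt, score, analysis) per iteration
    dataset = []          # accumulated synthetic benchmark
    prompt, best_score, stall = initial_prompt, float("-inf"), 0

    for _ in range(max_iters):
        # 1. Propose challenging, diverse boundary cases for the current prompt.
        samples = generate_boundary_cases(task_desc, prompt, history)
        dataset += annotate(samples)                  # user or LLM labels

        # 2. Evaluate the current prompt on the generated benchmark.
        score, analysis = evaluate_prompt(prompt, dataset)
        history.append((prompt, score, analysis))

        # Stop when there is no improvement over the last `patience` steps.
        if score > best_score:
            best_score, stall = score, 0
        else:
            stall += 1
            if stall >= patience:
                break

        # 3. Ask the meta-prompted LLM for a new prompt with a higher score.
        prompt = suggest_prompt(task_desc, history)

    # The calibrated prompt is the best-scoring prompt seen so far.
    return max(history, key=lambda h: h[1])[0]
```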


The base configuration of our system is optimized for classification tasks, with accuracy set as the score function, and the error analysis determined by a confusion matrix and the prompt misclassifications. An example of the system flow can be seen in Figure 2. In each iteration, new challenging samples are generated (according to the current prompt), and the misclassifications are used to refine the prompt until it is calibrated to the user intent.
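As a concrete illustration of this base configuration, the snippet below computes the accuracy score, the confusion matrix, and the list of misclassifications from a set of annotated predictions; it is a minimal sketch under our own record format, not the authors' evaluator code.

```python
from collections import Counter

def classification_report(records):
    """Score a prompt's predictions against annotations (illustrative).

    `records` is assumed to be a list of dicts with 'text', 'annotation'
    and 'prediction' keys. Returns the accuracy score, a confusion matrix
    keyed by (annotation, prediction), and the misclassified records that
    the analyzer meta-prompt later summarizes.
    """
    confusion = Counter((r["annotation"], r["prediction"]) for r in records)
    errors = [r for r in records if r["annotation"] != r["prediction"]]
    accuracy = 1.0 - len(errors) / max(len(records), 1)
    return accuracy, dict(confusion), errors
```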

Figure 2: Example of a real system flow. The user provides only the task description and initial prompt. The model iteratively generates challenging samples and refines the prompt according to the generated benchmark.


2.1 Generative tasks

To extend the prompt calibration process from classification tasks to generative tasks, we split the optimization process into two parts.

In the first part, we use an LLM to rephrase the initial prompt and task description in order to define a ranking task based on the modified initial prompt and task description. We then calibrate a prompt for the ranking task, treating it as a classification task using the classification pipeline. Naturally, the ranker distribution tends to be a normal distribution with its mean at the mean score. This distribution is imbalanced, especially in the interesting range of the top scores. Therefore, in the ranking case, the sample generator meta-prompt is instructed to generate challenging boundary samples from the top two scores.

In the second part, we leverage the same underlying process to optimize the original generative prompt. This step is done by iteratively applying steps 2 and 3, described in the system overview, using the calibrated ranking prompt as the score function. It’s important to note that human annotations are required only in the ranking calibration process. Furthermore, by treating the intent as a classification task, the prompt can be calibrated using a small amount of annotation effort.
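A compact sketch of this two-part pipeline is shown below. `rephrase_as_ranking_task`, `optimize_with_score`, and `llm` are hypothetical helpers, and `calibrate_prompt` stands for the classification pipeline sketched earlier; the actual meta-prompts and scoring details differ in the real system.

```python
def calibrate_generative_prompt(task_desc, initial_prompt):
    """Two-stage prompt calibration for generative tasks (illustrative).

    Stage 1: derive a 1-5 ranking task from the original prompt and calibrate
             a ranking prompt with the classification pipeline (this is the
             only stage that needs human annotations).
    Stage 2: optimize the original generative prompt, using the calibrated
             ranker as the score function instead of ground-truth labels.
    """
    # Stage 1: rephrase the task as a ranking task and calibrate its prompt.
    rank_task, rank_prompt = rephrase_as_ranking_task(task_desc, initial_prompt)
    ranker = calibrate_prompt(rank_task, rank_prompt)

    # Stage 2: the learned ranker scores generated outputs (mean 1-5 rank).
    def score_fn(generated_outputs):
        scores = [llm(ranker, output) for output in generated_outputs]
        return sum(scores) / len(scores)

    return optimize_with_score(task_desc, initial_prompt, score_fn)
```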


2.2 Meta-Prompts

The meta-prompts consist of three separate prompts, as can be seen in Appendix A.


Sample generator. The sample generation meta-prompt is determined according to the system state: In the first iteration, if the user doesn’t provide any samples (zero-shot setting), the meta-prompt instructs the model to generate diverse adversarial samples with even class distribution. In the next iterations, the prompt is extended with the following additional context: (1) A history with prompts and good adversarial samples that confused the prompts; and (2) A set of realistic samples from the dataset, where the model is instructed to preserve the dataset style. The context-realistic samples are chosen to be semantically close according to a given sentence embedding.
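The construction below is a rough Python sketch of how such a state-dependent sample-generator meta-prompt could be assembled. The wording of the instructions, the choice of three nearest neighbours, and the use of the current prompt as the embedding query are all illustrative assumptions (the paper only states that realistic samples are chosen to be semantically close under a sentence embedding); `embed` is an assumed text-embedding function returning unit-norm vectors.

```python
import numpy as np

def build_sample_generator_prompt(task_desc, prompt, history, real_samples, embed):
    """Assemble the sample-generator meta-prompt (illustrative paraphrase)."""
    if not history:  # first iteration, zero-shot setting
        return (f"Task: {task_desc}\n"
                "Generate diverse, challenging samples with an even class "
                f"distribution for the following prompt:\n{prompt}")

    # Realistic dataset samples are added so the generator preserves the
    # dataset style; here we pick the ones closest to the current prompt.
    vectors = embed(real_samples)                 # shape (n, d), unit-norm
    query = embed([prompt])[0]
    closest = [real_samples[i] for i in np.argsort(vectors @ query)[-3:]]

    past = "\n".join(f"Prompt: {p}\nConfusing samples: {s}" for p, s in history)
    return (f"Task: {task_desc}\n"
            f"Previous prompts and the adversarial samples that confused them:\n{past}\n"
            "Match the style of these real examples:\n" + "\n".join(closest) + "\n"
            f"Generate new challenging boundary samples for:\n{prompt}")
```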


Analyzer. The analyzer meta-prompt receives the prompt score, a confusion matrix in the classification case, and a set of errors in all the classes. It is then instructed to produce an analysis summary of the prompt performances and the major failure cases.


Prompt generator. The input for the prompt generator meta-prompt is (1) A list of the last suggested prompts and their scores (2) The performance analysis of the last prompt that is produced by the Analyzer prompt. The model is instructed to produce a prompt with a higher score according to the history and the analysis.
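To make the interplay of these last two meta-prompts concrete, here is a paraphrased sketch of the analyzer input and the prompt-generator call; the exact templates are in Appendix A, and these strings are simplified stand-ins, not the originals.

```python
ANALYZER_TEMPLATE = """You are analyzing the performance of a prompt.
Prompt score: {score}
Confusion matrix: {confusion}
Misclassified examples (all classes): {errors}
Summarize the prompt's main failure cases."""

GENERATOR_TEMPLATE = """You are improving a prompt for this task: {task}
Previous prompts and their scores:
{history}
Analysis of the last prompt's failures:
{analysis}
Propose a new prompt that should achieve a higher score."""

def next_prompt(llm, task, history, analysis):
    """history: list of (prompt, score); analysis: the analyzer's summary."""
    lines = "\n".join(f"score={s:.2f}: {p}" for p, s in history)
    return llm(GENERATOR_TEMPLATE.format(task=task, history=lines,
                                         analysis=analysis))
```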


2.3 System pipeline

An overview of the system architecture can be seen in Figure 8. The system consists of four primary components.


Dataset. This component manages the dataset and performs operations such as insertion, modification, deletion, and applying functions on the dataset rows. The component also handles data cleaning by removing semantic duplications and performing semantic sampling. Since the system is optimized for small datasets, the current implementation is based on a local database using pandas.
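A minimal version of such a pandas-backed dataset component, with embedding-based removal of semantic duplicates, might look as follows; the similarity threshold and the `embed` function are illustrative assumptions.

```python
import numpy as np
import pandas as pd

class Dataset:
    """Local dataset backed by a pandas DataFrame (illustrative sketch)."""

    def __init__(self, embed):
        self.df = pd.DataFrame(columns=["text", "annotation", "prediction"])
        self.embed = embed  # texts -> array of unit-norm vectors

    def insert(self, rows):
        """Append a list of row dicts to the dataset."""
        self.df = pd.concat([self.df, pd.DataFrame(rows)], ignore_index=True)

    def drop_semantic_duplicates(self, threshold=0.95):
        """Keep a row only if it is not too similar to an already-kept row."""
        vectors = self.embed(self.df["text"].tolist())
        keep = []
        for i, v in enumerate(vectors):
            if all(np.dot(v, vectors[j]) < threshold for j in keep):
                keep.append(i)
        self.df = self.df.iloc[keep].reset_index(drop=True)
```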


Estimator. The estimator is responsible for estimating a batch of samples. We implement this component twice, once for the predictions and once for the annotations. This generic implementation for both types of use cases allows us to easily adapt the system to diverse use cases such as prompt calibration, prompt distillation and prompt squashing. The currently supported types of estimators are: (1) Human annotation, using the Argilla UI [32]; the system connects to the Argilla server and waits until the annotation task is completed. (2) LLM estimator, which uses an LLM to estimate the sample given a prompt. We support various types of LLMs, using the Langchain integration [5]. For efficiency, the system supports parallelism using both workers and async calls. The system also supports sending a few samples in one prompt (prompt batching), which can reduce the cost significantly. (3) Batch estimator, which runs multiple LLM estimators and integrates their outputs through an aggregation layer. It is mainly used for prompt squashing, enabling users to optimize a single prompt that performs as well as running several prompts multiple times, for example, when a user wants to apply several moderation rules simultaneously.
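The sketch below illustrates an LLM estimator with async calls and prompt batching, plus a batch estimator that aggregates several prompts for prompt squashing. `llm_call` is an assumed async wrapper around the model API (the real system goes through the Langchain integration), and the `any` aggregation is only one possible rule.

```python
import asyncio

async def llm_estimate(prompt, samples, llm_call, batch_size=5, workers=4):
    """Estimate a batch of samples with an LLM (illustrative sketch).

    `llm_call(prompt, batch)` is an assumed async function returning one
    label per sample; several samples share one request (prompt batching).
    """
    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    semaphore = asyncio.Semaphore(workers)        # bound concurrent requests

    async def run(batch):
        async with semaphore:
            return await llm_call(prompt, batch)

    results = await asyncio.gather(*(run(b) for b in batches))
    return [label for batch_labels in results for label in batch_labels]

def batch_estimate(prompts, samples, estimate, aggregate=any):
    """Prompt squashing: run several (e.g. moderation) prompts and merge
    their per-sample outputs through an aggregation layer."""
    outputs = [estimate(p, samples) for p in prompts]
    return [aggregate(labels) for labels in zip(*outputs)]
```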


Evaluator. The evaluator is responsible for evaluating the records after the prediction and annotation stage. The evaluator accepts a function and applies it to each row. It’s important to note that the function is generic. For example, in the generation pipeline, the function is performed by invoking an LLM. The evaluator is also responsible for defining the errors and handling the error analysis using the Analyzer described in the meta-prompts section.
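A generic evaluator along these lines can be sketched as below, where `score_fn` may be an exact-match check for classification or an LLM-ranking call for the generation pipeline; both functions are placeholders supplied by the caller.

```python
class Evaluator:
    """Apply a generic per-row score function and collect errors (sketch)."""

    def __init__(self, score_fn, is_error):
        self.score_fn = score_fn  # row -> numeric score (may invoke an LLM)
        self.is_error = is_error  # row -> bool, defines what counts as an error

    def evaluate(self, df):
        """Return the mean score and the error rows for later analysis."""
        df["score"] = df.apply(self.score_fn, axis=1)
        errors = df[df.apply(self.is_error, axis=1)]
        return df["score"].mean(), errors
```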


Optimizer (Optimization Pipeline). The optimizer manager handles the whole optimization process flow, it performs the iteration steps described in the previous section and is responsible for stopping and returning the final calibrated prompt. The currently supported criteria are either convergence (determined by a patience hyper-parameter), or usage limit (determined by maximal cost if relevant, or by the number of generated tokens).
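The stopping logic can be summarized with a small helper like the one below; the patience-based convergence test and the token budget are the two criteria named above, while the function itself is only an illustrative sketch.

```python
def should_stop(history, patience=3, max_tokens=None, tokens_used=0):
    """Decide whether to end the optimization loop (illustrative).

    history: list of (prompt, score) pairs, oldest first.
    Convergence: no score improvement over the last `patience` iterations.
    Usage limit: optional budget on generated tokens (or, similarly, cost).
    """
    if max_tokens is not None and tokens_used >= max_tokens:
        return True
    if len(history) <= patience:
        return False
    best_recent = max(score for _, score in history[-patience:])
    best_before = max(score for _, score in history[:-patience])
    return best_recent <= best_before
```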


3 Experiments

We test our system on scenarios that reflect real-world moderation and generation use cases on strong proprietary models (GPT-3.5/4-Turbo). We used the IMDB review dataset [20] as the base data for all our experiments. We compare our proposed IPC method to two SOTA prompt optimization methods that are based on meta-prompts: 1. OPRO [35] and 2. the meta-prompt provided by PE [37].


3.1 Classification


We evaluate the prompt calibration process on three binary classification tasks: (1) Spoiler detection, (2) Sentiment analysis, and (3) Parental Guidance (PG) detection. In each experiment, we start with some initial samples and a prompt, in addition to a short task description. To generate the ground truth (GT) we composed a highly detailed prompt with specific preferences. For the GT generation, we used a strong model (GPT-4 Turbo), as this process of generating the GT simulates a user’s particular preference for the given task. The baseline methods were trained on samples from the IMDB dataset [20], whereas our proposed method was trained on the adversarial synthetic data which was provided with 10 initial samples from the original IMDB training data. All methods were trained for 50 iterations. The test dataset was taken from the IMDB reviews test split, with the generated annotations provided by the GT prompt. We then collected 250 samples for each class in each one of the tested scenarios, such that the final dataset has equal class distribution. It’s important to note that the IMDB dataset includes only highly polarizing reviews (no reviews with ratings in the range of 4-7). To evaluate the method’s performance on more challenging cases, we also generate a synthetic test dataset with 300 samples (using the initial prompt) for the sentiment classification task.


We present our results in Figures 3 and 7. As seen in the figures, IPC outperforms all other tested methods. In particular, it's important to note the high variance of the other methods, especially in the case of a small number of training samples. The gap in performance between the methods becomes even more evident in the synthetic data case, where there are more boundary cases, as can be seen in Figure 6. A qualitative comparison between the methods can be seen in Table 1. While OPRO [35] mainly rephrases the initial prompt, and the PE [37] prompt only partly fits the GT prompt, the IPC prompt successfully captures the subtle details and nuances of the GT prompt. The significant differences in data distributions between the original data and the synthetic generated data can be seen in Figure 5. Both the spoiler and the PG classification tasks exhibit significant bias towards the 'No' labels, whereas the synthetic data is almost balanced.


3.2 Generation

The generation setting is composed of two parts: generating the ranker prompt and then using the ranker to optimize the generation task prompt. We tested our generation pipeline on challenging ambiguous tasks: (1) Generate a movie review that is enthusiastic, reliable and adheres to a given movie description. (2) Generate a movie review that is sarcastic but has a positive sentiment. As in the classification case, we chose highly detailed prompts to generate the ranker GT with a scale of 1 to 5, which simulates the human preferences for the given task. For each tested method we fit a ranker using 50 labeled samples, and then optimized the generative task prompt according to the learned ranking model. The reported evaluation score is calculated by running the learned generative prompt on a test set of size 50 and evaluating the result using the target ranking prompt GT. For the baseline methods, we took samples from the IMDB review dataset [20] and generated a movie description for each review. We then fed this data to the ranker optimization process. We ran both the ranker training and the generator training for 30 iteration steps.


Results for GPT-4 Turbo LLM, including both ranker training and generation prompt training, are presented in Table 3. A qualitative comparison is provided in Table 4. We see that using IPC improves the average ranking score of the generated reviews compared to the other tested methods in all tested scenarios. It’s important to note that all the tested methods, except for IPC, performed worse than the initial prompt in some experiments. This can be explained by the distribution of the ranking scores in the real data, which is shown in Figure 4, where there are almost no samples with the top score. In contrast, the distribution of the generated synthetic samples is biased towards the top two scores.


3.3 Ablation study

We examine the impact of each key component of the system on the spoiler classification task. Specifically, we look at the 50 training samples case. The effect of each one of the components can be seen in Table 2. Using synthetic data boosts model performance. It is also important to note that the analyzer component substantially improves the model’s performance. This stands in contradiction to [35]’s findings that adding errors to the meta-prompts doesn’t improve the model performance, and can also emphasise the gap between the standard general benchmarks and use cases such as moderation.


4 Related Work


Prompt Optimization. Several methods have been suggested to address the challenge of automating the prompt engineering process. A commonly used approach is to optimize a task-specific embedding, in either a continuous [15, 16, 17] or discrete [34, 29] manner. This approach requires access to the LLM itself in order to perform the optimization. An alternative approach is to use reinforcement learning [9, 39, 1]. This approach either requires access to the generated tokens' probabilities or requires a large dataset for training a model. Recent works used the LLMs themselves for prompt optimization [41, 24, 35, 37]. These methods can also be applied to proprietary LLMs, where access is limited to the final generated sentences. However, these methods still require a good, valid benchmark in order to evaluate and compare the different generated prompts, which is not always available in real-world cases.


Synthetic data. The utilization of synthetic data produced by LLMs has demonstrated remarkable effectiveness across a wide range of tasks, including code generation [25, 36], mathematical reasoning [38, 19], text embedding [33] and text2image [3]. The advantage of using synthetic data is not only in cost savings; it can also be beneficial for low-resource tasks or imbalanced data distributions [21]. Following these works, our system generates high-quality evenly distributed synthetic boundary samples, that result in a more efficient optimization process and higher-quality results.


Synthetic data was also proven to be an effective method to distil knowledge from black-box LLMs, by training on synthetic data that was generated by those models [31, 22, 12]. However, in these works the generated data was used to fully train the student model. In contrast, our work demonstrates the effectiveness of synthetic data to distil knowledge between two black-box models via automatic prompt engineering.


Curriculum Learning. Arranging the data samples for training machine learning models in a meaningful way, starting from easier samples and progressing to more challenging ones, can yield performance enhancements compared to the conventional method of training based on random data shuffling. This approach is known as curriculum learning [30, 2]. Curriculum Learning has been proven to be effective in various fields such as object localization [13, 28], object detection [7, 26] and NLP [14, 23]. Inspired by these ideas, in [8] they propose to fine-tune LLMs by iteratively generating synthetic data and refining the policy to distinguish between the synthetic data and the human-annotated data. In our work, we use a similar approach, where the system iteratively generates more challenging cases that resolve the previous prompt ambiguity in order to more efficiently tune to the user intent.


5 Conclusions

In this work, we introduced IPC, a system for automatic prompt engineering. The system combines a synthetic data generation module that generates challenging and diverse samples, and a prompt optimization module that suggests new prompts. Both of them are implemented by prompting LLMs, and they iteratively refine each other until the prompt converges. We further propose a new method to extend the meta-prompt based prompt optimization process to generative tasks. We demonstrate the effectiveness of our system on real-world use cases such as moderation and generation with respect to strong proprietary models (GPT-3.5/4-Turbo). Our method significantly enhances the resulting performance of prompts in all tested scenarios.


Our system is built in a modular and flexible way that allows for easy modification and addition of new components. In future work, we intend to extend our system to new use cases such as multi-modality and in-context learning. We also intend to explore further possibilities to optimize the meta-prompts themselves.


Appendix

A Implementation details

In this section, we provide additional information on the implementation details. An architecture overview of our system is provided in Figure 8. We also provide a list of the meta-prompts utilized within the pipeline of our system.



B Experiments: Additional details

In this section, we provide additional material on the experiments provided in the paper.

