Advanced RAG 09: Prompt Compression (Part 2)

Continued from the previous post: Advanced RAG 09: Prompt Compression (Part 1)

LongLLMLingua

The problem with LLMLingua is that it does not consider the user's question during compression, so it may retain irrelevant information.

LongLLMLingua addresses this by incorporating the user's question into the compression process.

[Figure 9 omitted: overview of LongLLMLingua and its four new components]

As shown in Figure 9, LongLLMLingua introduces four new components to enhance the LLM's perception of key information:

  • Question-aware coarse-grained and fine-grained compression
  • A document reordering mechanism
  • Dynamic compression ratios
  • A subsequence recovery algorithm

Question-aware coarse-grained compression

LongLLMLingua proposes using the perplexity of the question x^que conditioned on different contexts x^doc_k to represent the association between them. A restrictive statement, x^restrict = "We can get the answer to this question in the given documents", is appended after x^que. It strengthens the connection between x^que and x^doc_k and, acting as a regularization term, reduces hallucination. This can be formulated as:

r_k = -(1/N_c) · Σ_{i=1..N_c} log p(x^{que,restrict}_i | x^doc_k, x^{que,restrict}_{<i}),   k ∈ {1, …, K}

where x^{que,restrict} is the concatenation of x^que and x^restrict, N_c is its number of tokens, and K is the number of candidate documents; documents are ranked by this score for coarse-grained selection.

Why not compute the document-level perplexity conditioned on the question x^que instead?

Because documents often contain a large amount of irrelevant information, a perplexity score computed over an entire document, even conditioned on x^que, may not be distinct enough to serve as a measure for document-level compression.

The relevant code can be found in the function get_distance_longllmlingua.
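To make the idea concrete, here is a minimal sketch of question-aware coarse-grained scoring, assuming a generic HuggingFace causal LM (GPT-2 here). It illustrates the scoring idea only and is not the library's actual get_distance_longllmlingua:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical sketch, not LongLLMLingua's implementation.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

RESTRICT = "We can get the answer to this question in the given documents."

def coarse_score(document: str, question: str) -> float:
    """Negative mean log-prob of (question + restrictive statement) given the document.
    Lower scores mean the document makes the question more expected, i.e. more relevant."""
    doc_ids = tokenizer(document, return_tensors="pt").input_ids
    que_ids = tokenizer(" " + question + " " + RESTRICT, return_tensors="pt").input_ids
    input_ids = torch.cat([doc_ids, que_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs of the question tokens, each conditioned on all preceding tokens.
    que_logits = logits[0, doc_ids.size(1) - 1 : -1]
    log_probs = torch.log_softmax(que_logits, dim=-1)
    token_lp = log_probs.gather(1, que_ids[0].unsqueeze(1)).squeeze(1)
    return -token_lp.mean().item()

docs = ["Paris is the capital of France.", "Bananas are rich in potassium."]
question = "What is the capital of France?"
scores = {d: coarse_score(d, question) for d in docs}
for doc in sorted(docs, key=scores.get):
    print(f"{scores[doc]:.3f}  {doc}")  # most relevant (lowest score) first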

Question-aware fine-grained compression

LongLLMLingua introduces the concept of contrastive perplexity.

s_i = perplexity(x_i | x<i) - perplexity(x_i | x^que, x<i)

First, we compute the perplexity of a token without the question, denoted perplexity(x_i | x<i). Then we measure the perplexity again, this time including the question, denoted perplexity(x_i | x^que, x<i), which captures how surprising token x_i is, given all preceding tokens, once the question x^que is provided.

The goal is to determine how much each token's surprisal changes with the question. If a word becomes much less surprising when the question is included, it is probably highly relevant to the question.
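The sketch below illustrates contrastive perplexity with a generic causal LM. It is a simplification of the paper's fine-grained scoring and assumes the context tokenizes identically with and without the question prepended:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_logprobs(ids: torch.Tensor) -> torch.Tensor:
    """Log-prob of each token (from position 1 on) given everything before it."""
    with torch.no_grad():
        logits = model(ids).logits
    lp = torch.log_softmax(logits[0, :-1], dim=-1)
    return lp.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)

question = "Who discovered X-rays?"
# Leading space so the context tokenizes the same with and without the question.
context = " Wilhelm Conrad Rontgen discovered X-rays in 1895."
ctx_ids = tokenizer(context, return_tensors="pt").input_ids
que_ctx_ids = tokenizer(question + context, return_tensors="pt").input_ids
n_que = que_ctx_ids.size(1) - ctx_ids.size(1)

lp_plain = token_logprobs(ctx_ids)             # log p(x_i | x<i)
lp_cond = token_logprobs(que_ctx_ids)[n_que:]  # log p(x_i | x^que, x<i)

# Contrastive score: positive = the token gets less surprising given the question.
s = lp_cond - lp_plain
tokens = tokenizer.convert_ids_to_tokens(ctx_ids[0, 1:].tolist())
for tok, score in zip(tokens, s.tolist()):
    print(f"{tok:>12s}  {score:+.3f}")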

Document reordering mechanism

As shown in Figure 10, during inference LLMs tend to use content at the beginning and end of a prompt and neglect the middle. This is known as the "lost in the middle" problem.

[Figure 10 omitted: LLM performance versus the position of relevant information in the prompt]

Figure 10 also shows that LLMs perform best when the relevant information is placed at the beginning. LongLLMLingua therefore organizes passages according to the coarse-grained compression results, arranging them front to back in order of decreasing score.

[Image omitted: illustration of passages reordered by coarse-grained score]
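A toy illustration of the reordering step, with hypothetical relevance scores standing in for the coarse-grained results:

docs = ["doc A", "doc B", "doc C"]
scores = [0.2, 0.9, 0.5]  # hypothetical relevance scores: higher = more relevant
# Counteract "lost in the middle": place the most relevant documents first.
reordered = [d for d, _ in sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)]
print(reordered)  # ['doc B', 'doc C', 'doc A']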

Dynamic compression ratio

Because key information density differs across documents, we should allocate a larger budget (i.e., a lower compression ratio) to documents that are more relevant to the question.

LongLLMLingua uses the importance scores from coarse-grained compression to guide budget allocation during fine-grained compression.

Specifically, LLMLingua's budget controller first sets an initial budget for the retained documents. Then, during fine-grained compression, each document is dynamically assigned a compression budget based on the rank index of its importance score, which was determined in the coarse-grained stage.

LongLLMLingua adopts a linear scheduler for this adaptive allocation; the budget of each token x_i can be expressed as:

τ_i = τ^doc_{k(i)},   τ^doc_k = max( min( (1 + δτ · (N_d - 2·I(r_k)) / N_d) · τ^doc, 1 ), 0 )

Here τ^doc is the initial document-level budget, k(i) is the index of the document containing token x_i, and I(r_k) is the rank index of document k from the coarse-grained stage.

where N_d is the number of documents and δτ is a hyperparameter that controls the overall budget for dynamic allocation.

The corresponding code can be found in the function get_dynamic_compression_ratio.
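Below is a sketch of a linear scheduler under the definitions above; it mirrors the idea behind get_dynamic_compression_ratio but is not the library's code:

def dynamic_budgets(base_ratio: float, n_docs: int, delta_tau: float) -> list[float]:
    """Per-document keep ratios; rank 0 is the most relevant document."""
    budgets = []
    for rank in range(n_docs):
        # Budget decreases linearly with the coarse-grained rank index.
        scale = 1.0 + delta_tau * (n_docs - 2 * rank) / n_docs
        budgets.append(round(min(max(base_ratio * scale, 0.0), 1.0), 4))
    return budgets

print(dynamic_budgets(base_ratio=0.4, n_docs=4, delta_tau=0.3))
# [0.52, 0.46, 0.4, 0.34] -> higher-ranked documents keep more tokens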

Subsequence recovery algorithm

As shown in Figure 11, tokens of key entities may be discarded during fine-grained token-level compression. For example, "2009" in the original prompt may be compressed to "209", and "Wilhelm Conrad Rontgen" may be compressed to "Wilhelmgen".

[Figure 11 omitted: examples of key-entity tokens being dropped during compression]

LongLLMLingua proposes a subsequence recovery algorithm that restores the original content from the LLM's response, as shown in Figure 12.

[Figure 12 omitted: the subsequence recovery procedure]

The main process consists of the following steps:

  • Iterate over the tokens y_l in the LLM's response and select the longest substring ỹ_{key,l} that appears in the compressed prompt x̃.
  • Find the maximum common shortest subsequence x_{i,j} in the original prompt x that corresponds to ỹ_{key,l}.
  • Replace the corresponding tokens ỹ_{key,l} in the LLM's response with x_{i,j}.

The corresponding code can be found in the function recover.
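As a simplified, character-level sketch of the recovery idea (the actual algorithm operates on tokens; see the library's recover function), the second step can be approximated like this:

import re

def recover_entity(response_span: str, original: str) -> str:
    """Return the shortest window of `original` that contains the characters
    of `response_span` in order, e.g. '209' -> '2009'."""
    pattern = ".*?".join(re.escape(ch) for ch in response_span)
    # Lookahead so overlapping candidate windows are all considered.
    matches = [m.group(1) for m in re.finditer(f"(?=({pattern}))", original)]
    return min(matches, key=len) if matches else response_span

print(recover_entity("209", "The paper was published in 2009."))
# 2009
print(recover_entity("Wilhelmgen", "Wilhelm Conrad Rontgen discovered X-rays."))
# Wilhelm Conrad Rontgen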

Code demonstration

The environment is set up the same way as for LLMLingua. Here is the test code:

from llmlingua import PromptCompressor

GSM8K_PROMPT = "Question: Angelo and Melanie want to plan how many hours over the next week they should study together for their test next week. They have 2 chapters of their textbook to study and 4 worksheets to memorize. They figure out that they should dedicate 3 hours to each chapter of their textbook and 1.5 hours for each worksheet. If they plan to study no more than 4 hours each day, how many days should they plan to study total over the next week if they take a 10-minute break every hour, include 3 10-minute snack breaks each day, and 30 minutes for lunch each day?\nLet's think step by step\nAngelo and Melanie think they should dedicate 3 hours to each of the 2 chapters, 3 hours x 2 chapters = 6 hours total.\nFor the worksheets they plan to dedicate 1.5 hours for each worksheet, 1.5 hours x 4 worksheets = 6 hours total.\nAngelo and Melanie need to start with planning 12 hours to study, at 4 hours a day, 12 / 4 = 3 days.\nHowever, they need to include time for breaks and lunch. Every hour they want to include a 10-minute break, so 12 total hours x 10 minutes = 120 extra minutes for breaks.\nThey also want to include 3 10-minute snack breaks, 3 x 10 minutes = 30 minutes.\nAnd they want to include 30 minutes for lunch each day, so 120 minutes for breaks + 30 minutes for snack breaks + 30 minutes for lunch = 180 minutes, or 180 / 60 minutes per hour = 3 extra hours.\nSo Angelo and Melanie want to plan 12 hours to study + 3 hours of breaks = 15 hours total.\nThey want to study no more than 4 hours each day, 15 hours / 4 hours each day = 3.75\nThey will need to plan to study 4 days to allow for all the time they need.\nThe answer is 4\n\nQuestion: You can buy 4 apples or 1 watermelon for the same price. You bought 36 fruits evenly split between oranges, apples and watermelons, and the price of 1 orange is $0.50. How much does 1 apple cost if your total bill was $66?\nLet's think step by step\nIf 36 fruits were evenly split between 3 types of fruits, then I bought 36/3 = 12 units of each fruit\nIf 1 orange costs $0.50 then 12 oranges will cost $0.50 * 12 = $6\nIf my total bill was $66 and I spent $6 on oranges then I spent $66 - $6 = $60 on the other 2 fruit types.\nAssuming the price of watermelon is W, and knowing that you can buy 4 apples for the same price and that the price of one apple is A, then 1W=4A\nIf we know we bought 12 watermelons and 12 apples for $60, then we know that $60 = 12W + 12A\nKnowing that 1W=4A, then we can convert the above to $60 = 12(4A) + 12A\n$60 = 48A + 12A\n$60 = 60A\nThen we know the price of one apple (A) is $60/60= $1\nThe answer is 1\n\nQuestion: Susy goes to a large school with 800 students, while Sarah goes to a smaller school with only 300 students.  At the start of the school year, Susy had 100 social media followers.  She gained 40 new followers in the first week of the school year, half that in the second week, and half of that in the third week.  Sarah only had 50 social media followers at the start of the year, but she gained 90 new followers the first week, a third of that in the second week, and a third of that in the third week.  
After three weeks, how many social media followers did the girl with the most total followers have?\nLet's think step by step\nAfter one week, Susy has 100+40 = 140 followers.\nIn the second week, Susy gains 40/2 = 20 new followers.\nIn the third week, Susy gains 20/2 = 10 new followers.\nIn total, Susy finishes the three weeks with 140+20+10 = 170 total followers.\nAfter one week, Sarah has 50+90 = 140 followers.\nAfter the second week, Sarah gains 90/3 = 30 followers.\nAfter the third week, Sarah gains 30/3 = 10 followers.\nSo, Sarah finishes the three weeks with 140+30+10 = 180 total followers.\nThus, Sarah is the girl with the most total followers with a total of 180.\nThe answer is 180"
QUESTION = "Question: Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?"



llm_lingua = PromptCompressor()

compressed_prompt = llm_lingua.compress_prompt(
    GSM8K_PROMPT.split("\n\n")[0],
    question=QUESTION,
    # ratio=0.55,  # set the special parameter for LongLLMLingua
    condition_in_question="after_condition",
    reorder_context="sort",
    dynamic_context_compression_ratio=0.3,  # or 0.4
    condition_compare=True,
    context_budget="+100",
    rank_method="longllmlingua",
)

print('-' * 100)
print("original:")
print(GSM8K_PROMPT.split("\n\n")[0])


print('-' * 100)
print("compressed_prompt:")
print(compressed_prompt)

The results are shown in Figure 13:

[Figure 13 omitted: output of the test code]

AutoCompressor

Unlike the methods discussed above, AutoCompressor is a soft prompt-based approach.

It cleverly fine-tunes an existing model, expanding the vocabulary and using "summary tokens" and "summary vectors" to condense contextual information.

[Figure 14 omitted: the AutoCompressor architecture]

Figure 14 shows the architecture of AutoCompressor, which operates as follows:

  1. Expand vocabulary: add "summary tokens" to the model's existing vocabulary. These tokens enable the model to condense large amounts of information into a smaller set of vectors.
  2. Split document: divide the document into segments, each with summary tokens appended. These tokens also carry the summary information of previous segments, forming cumulative summaries.
  3. Fine-tuning: fine-tune the model with an unsupervised "next word prediction" task, predicting the next word given the preceding tokens of the current segment and the summary vectors of the preceding segments.
  4. Backpropagation: use backpropagation through time (BPTT) and gradient checkpointing for each segment to minimize the size of the computational graph. Backpropagation is performed over the entire document, allowing the model to learn associations across the full context.

AutoCompressor provides code that interested readers can try:

import torch
from transformers import AutoTokenizer
from auto_compressor import LlamaAutoCompressorModel, AutoCompressorModel

# Load AutoCompressor trained by compressing 6k tokens in 4 compression steps
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/AutoCompressor-Llama-2-7b-6k")
# Need bfloat16 + cuda to run Llama model with flash attention
model = LlamaAutoCompressorModel.from_pretrained("princeton-nlp/AutoCompressor-Llama-2-7b-6k", torch_dtype=torch.bfloat16).eval().cuda()

prompt = 'The first name of the current US president is "'
prompt_tokens = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids.cuda()

context = """Joe Biden, born in Scranton, Pennsylvania, on November 20, 1942, had a modest upbringing in a middle-class family. He attended the University of Delaware, where he double-majored in history and political science, graduating in 1965. Afterward, he earned his law degree from Syracuse University College of Law in 1968.\nBiden's early political career began in 1970 when he was elected to the New Castle County Council in Delaware. In 1972, tragedy struck when his wife Neilia and 1-year-old daughter Naomi were killed in a car accident, and his two sons, Beau and Hunter, were injured. Despite this devastating loss, Biden chose to honor his commitment and was sworn in as a senator by his sons' hospital bedsides.\nHe went on to serve as the United States Senator from Delaware for six terms, from 1973 to 2009. During his time in the Senate, Biden was involved in various committees and was particularly known for his expertise in foreign affairs, serving as the chairman of the Senate Foreign Relations Committee on multiple occasions.\nIn 2008, Joe Biden was selected as the running mate for Barack Obama, who went on to win the presidential election. As Vice President, Biden played an integral role in the Obama administration, helping to shape policies and handling issues such as economic recovery, foreign relations, and the implementation of the Affordable Care Act (ACA), commonly known as Obamacare.\nAfter completing two terms as Vice President, Joe Biden decided to run for the presidency in 2020. He secured the Democratic nomination and faced the incumbent President Donald Trump in the general election. Biden campaigned on a platform of unity, promising to heal the divisions in the country and tackle pressing issues, including the COVID-19 pandemic, climate change, racial justice, and economic inequality.\nIn the November 2020 election, Biden emerged victorious, and on January 20, 2021, he was inaugurated as the 46th President of the United States. At the age of 78, Biden became the oldest person to assume the presidency in American history.\nAs President, Joe Biden has worked to implement his agenda, focusing on various initiatives, such as infrastructure investment, climate action, immigration reform, and expanding access to healthcare. He has emphasized the importance of diplomacy in international relations and has sought to rebuild alliances with global partners.\nThroughout his long career in public service, Joe Biden has been recognized for his commitment to bipartisanship, empathy, and his dedication to working-class issues. He continues to navigate the challenges facing the nation, striving to bring the country together and create positive change for all Americans."""
context_tokens = tokenizer(context, add_special_tokens=False, return_tensors="pt").input_ids.cuda()

summary_vectors = model(context_tokens, output_softprompt=True).softprompt
print(f"Compressing {context_tokens.size(1)} tokens to {summary_vectors.size(1)} summary vectors")
# >>> Compressing 660 tokens to 50 summary vectors

generation_with_summary_vecs = model.generate(prompt_tokens, do_sample=False, softprompt=summary_vectors, max_new_tokens=12)[0]
print("Generation w/ summary vectors:\n" + tokenizer.decode(generation_with_summary_vecs))
# >>> The first name of the current US president is "Joe" and the last name is "Biden".

next_tokens_without_context = model.generate(prompt_tokens, do_sample=False, max_new_tokens=11)[0]
print("Generation w/o context:\n" + tokenizer.decode(next_tokens_without_context))
# >>> The first name of the current US president is "Donald" and the last name is "Trump".

LLMLingua-2

LLMLingua-2 identifies two problems with prompt compression that deletes tokens or lexical units based on the information entropy of a causal language model (such as LLaMA-7B):

(1) The small language model used to estimate information entropy is not aligned with the prompt-compression objective.

(2) It uses only unidirectional context, which may fail to capture all the information needed for prompt compression.

At the heart of these issues is that information entropy may be a suboptimal measure for compression.

The overall architecture of LLMLingua-2 is shown in Figure 15:

[Figure 15 omitted: overall architecture of LLMLingua-2]

To address problem (1), LLMLingua-2 introduces a data distillation process. It extracts knowledge from an LLM to compress prompts without losing key information, and in doing so builds an extractive text-compression dataset. Training on this dataset helps align the small language model effectively with prompt compression.

To address problem (2), LLMLingua-2 treats prompt compression as a token classification problem. This ensures the fidelity of the compressed prompt to the original. It uses a Transformer encoder as the underlying architecture, capturing all the information needed for prompt compression from the full bidirectional context.

How to build an effective prompt-compression dataset

Data distillation

Data distillation extracts knowledge from a large language model such as GPT-4 to compress prompts effectively without losing essential information.

In LLMLingua-2, the instructions are carefully designed, as shown in Figure 16. They require GPT-4 to compress the text by discarding non-essential words from the original without adding any new words.

The instructions impose no compression-ratio limit; instead, GPT-4 is asked to compress the original text as much as possible while retaining the maximum amount of information.

[Figure 16 omitted: the instructions used for data distillation with GPT-4]

As shown in Figure 17, GPT-4 tends to apply a very high compression ratio when handling extremely long contexts, perhaps because of its limited ability to process them. Such aggressive compression loses a great deal of information and severely hurts the performance of downstream tasks.

[Figure 17 omitted: GPT-4's compression ratio as a function of context length]

To mitigate this, LLMLingua-2 adopts chunk-wise compression: the long text is split into chunks of at most 512 tokens, and GPT-4 is prompted to compress each chunk separately.
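A minimal chunking helper along these lines, assuming a generic HuggingFace tokenizer (LLMLingua-2's own chunking logic may differ):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_text(text: str, max_tokens: int = 512) -> list[str]:
    """Split text into pieces of at most max_tokens tokens, to be compressed independently."""
    ids = tokenizer(text, add_special_tokens=False).input_ids
    return [
        tokenizer.decode(ids[i : i + max_tokens])
        for i in range(0, len(ids), max_tokens)
    ]

print(len(chunk_text("word " * 2000)))  # 4 chunks of <=512 tokens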

Data annotation

At this point, data distillation has produced pairs of original texts and their compressed versions. The goal of data annotation is to assign a binary label to every token in the original text, indicating whether it should be preserved after compression.

Because GPT-4 may not follow the instructions precisely, LLMLingua-2 uses a sliding window to limit the search range and fuzzy matching to handle the word variations GPT-4 may introduce during compression.
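Here is a hypothetical sketch of that annotation step; the window size, threshold, and matching rule are illustrative assumptions rather than the paper's exact procedure:

from difflib import SequenceMatcher

def annotate(original_tokens, compressed_tokens, window=8, threshold=0.85):
    """Label each original word True if it fuzzily matches a word in the compressed
    text, scanning with a sliding window so matches stay roughly in order."""
    labels, cursor = [], 0
    for tok in original_tokens:
        window_toks = compressed_tokens[cursor : cursor + window]
        ratios = [SequenceMatcher(None, tok.lower(), c.lower()).ratio() for c in window_toks]
        keep = bool(ratios) and max(ratios) >= threshold
        if keep:
            cursor += ratios.index(max(ratios)) + 1  # advance past the match
        labels.append(keep)
    return labels

orig = "So , um , I think we should revise the timeline".split()
comp = "I think we should revise timeline".split()
print(list(zip(orig, annotate(orig, comp))))
# 'So', ',', 'um', ',', 'the' -> False; 'I think we should revise' and 'timeline' -> True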

Quality control

LLMLingua-2 uses two quality-control metrics to assess the compressed texts produced by GPT-4 distillation and the automatically annotated labels: Variation Rate (VR) and Alignment Gap (AG).

Variation Rate measures the percentage of words in the compressed text that do not appear in the original text, while Alignment Gap assesses the quality of the automatically annotated labels.

With these metrics, LLMLingua-2 can exclude low-quality samples and ensure the quality of the dataset.
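A sketch of a Variation Rate check along these lines (the paper's exact formulation may differ; Alignment Gap additionally requires the annotated labels and is omitted here):

def variation_rate(original: str, compressed: str) -> float:
    """Fraction of compressed-text words that never occur in the original."""
    original_words = set(original.lower().split())
    compressed_words = compressed.lower().split()
    if not compressed_words:
        return 0.0
    return sum(w not in original_words for w in compressed_words) / len(compressed_words)

# A sample with a high VR was likely hallucinated during distillation and can be excluded.
print(variation_rate("I think we should revise the timeline", "we should revise timeline"))  # 0.0
print(variation_rate("I think we should revise the timeline", "we must change plans"))       # 0.75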

Compressor

Framing it as a binary classification problem

Prompt compression can be cast as a binary classification problem. The basic idea is to treat each lexical unit as an independent entity and assign it one of two labels, "preserve" or "discard". This preserves the integrity of the compressed prompt's content while simplifying the model design.

Model architecture

A feature encoder based on a Transformer encoder is used, with a linear classification layer on top.

This structure captures bidirectional contextual information for each lexical unit, providing the information essential to the compression task.

Compression strategy

The strategy for compressing the original prompt x has three steps, illustrated by the sketch after this list. The target compression ratio is 1/τ, where τ is defined as the ratio of the number of words in the compressed prompt to the number of words in the original prompt x.

  • First, determine the target number of tokens to preserve in the compressed prompt x̃: Ñ = τN.
  • Next, use the token classification model to predict the probability p_i that each word x_i is labeled "preserve".
  • Finally, keep the Ñ words of the original prompt x with the highest p_i, preserving their original order, to form the compressed prompt x̃.
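A minimal sketch of these three steps, where keep_probs stands in for the classifier's predictions (the classifier itself is omitted):

def compress(words: list[str], keep_probs: list[float], tau: float) -> str:
    n_keep = int(tau * len(words))  # step 1: target token count
    # Step 2 is assumed done: keep_probs[i] = P(word i is labeled "preserve").
    top = sorted(range(len(words)), key=lambda i: keep_probs[i], reverse=True)[:n_keep]
    return " ".join(words[i] for i in sorted(top))  # step 3: keep original order

words = ["So", ",", "um", ",", "I", "think", "we", "should", "revise", "the", "timeline"]
probs = [0.10, 0.05, 0.08, 0.05, 0.70, 0.90, 0.80, 0.85, 0.95, 0.30, 0.97]
print(compress(words, probs, tau=0.5))  # think we should revise timeline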

Code

As described above, the main work of LLMLingua-2 is building the compressor. Once we have a trained compressor, how do we use it?

See the code below (the environment is set up the same way as for LLMLingua). The main internal flow is in the function compress_prompt_llmlingua2.

from llmlingua import PromptCompressor

PROMPT = "John: So, um, I've been thinking about the project, you know, and I believe we need to, uh, make some changes. I mean, we want the project to succeed, right? So, like, I think we should consider maybe revising the timeline.\n\nSarah: I totally agree, John. I mean, we have to be realistic, you know. The timeline is, like, too tight. You know what I mean? We should definitely extend it."

llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)
compressed_prompt = llm_lingua.compress_prompt(PROMPT, rate=0.33, force_tokens=['\n', '?'])

## Or use the LLMLingua-2 small model:
# llm_lingua = PromptCompressor(
#     model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
#     use_llmlingua2=True,
# )

print('-' * 100)
print("original:")
print(PROMPT)

print('-' * 100)
print("compressed_prompt:")
print(compressed_prompt)

The results are shown in Figure 18:

[Figure 18 omitted: output of the test code]

RECOMP

RECOMP introduces two types of trained compressors: extractive and abstractive. The extractive compressor selects useful sentences from the retrieved documents, while the abstractive compressor combines information from multiple documents to generate a summary.

Figure 19 shows where the compressor sits in RECOMP.

[Figure 19 omitted: the position of the compressor in the RECOMP pipeline]

Extractive compressor

Given n sentences [s1, s2, …, sn] from the input document set, we train a dual-encoder model that embeds each sentence s_i and the input sequence x into fixed-dimensional embeddings. The inner product of these embeddings represents the benefit, to the LLM, of adding s_i to the input x when generating the target output sequence.

The compressor's final summary s consists of the top-N sentences, ranked by their inner product with the input.
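To illustrate the ranking, here is a stand-in sketch that uses an off-the-shelf sentence-transformers encoder in place of RECOMP's trained dual encoder; the model name and example sentences are assumptions:

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

x = "How much profit did Josh make flipping the house?"
sentences = [
    "Josh bought a house for $80,000.",
    "The weather in Scranton is mild in spring.",
    "He then put $50,000 into repairs, increasing the value by 150%.",
]

x_emb = encoder.encode([x])            # (1, d) embedding of the input sequence
s_emb = encoder.encode(sentences)      # (n, d) embeddings of candidate sentences
scores = (s_emb @ x_emb.T).squeeze(1)  # inner product of each s_i with x

top_n = 2
summary = " ".join(sentences[i] for i in np.argsort(-scores)[:top_n])
print(summary)  # the two sentences most useful for answering x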

Abstractive compressor

The abstractive compressor is an encoder-decoder model. It takes the concatenation of the input sequence x and the retrieved document set and outputs a summary s.

The approach uses an LLM (such as GPT-3) to generate a training dataset, filters that data, and then trains the encoder-decoder model on the filtered dataset.
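As a stand-in illustration only, the following runs a generic pretrained summarizer over the concatenated query and documents; RECOMP trains its own encoder-decoder, so the model chosen here is an assumption:

from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

x = "How much profit did Josh make flipping the house?"
docs = (
    "Josh bought a house for $80,000 and spent $50,000 on repairs. "
    "The repairs increased the value of the house by 150%. "
    "He later listed the renovated property for sale."
)
# Concatenate the input sequence x with the retrieved documents; output summary s.
summary = summarizer(x + " " + docs, max_length=48, min_length=8, do_sample=False)
print(summary[0]["summary_text"])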

Code

Since RECOMP's code is still at an early stage, no demonstration is given here. Interested readers can try it themselves.

Conclusion

This article introduced methods of prompt compression, covering a taxonomy of methods, algorithmic principles, and code walkthroughs.

Of the methods discussed, LongLLMLingua may be the better choice; we have used it in our research projects. If I find shortcomings in LongLLMLingua or a better approach, I will update this article. LLMLingua-2 is also worth trying, as it is faster and uses less memory.

This article is a translation; the original is available at: https://ai.gopubby.com/advanced-rag-09-prompt-compression-95a589f7b554
