Advanced RAG Techniques: Prompt Compression

Latest recommended article published on 2025-03-31 16:26:31

程序猿李巡天


Article link: https://blog.csdn.net/m0_59235945/article/details/140085415


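The RAG process can run into two problems:

Large language models (LLMs) usually have a context length limit, and the longer the input text, the more time-consuming and costly the process becomes.
The retrieved contexts are not always useful. Only a small portion of a larger chunk may be relevant to the answer, and in some cases several chunks must be combined to answer a specific question. This problem persists even with re-ranking.

Prompt compression for LLMs is one way to address these problems. In essence, the goal is to retain the key information in the prompt so that every input token carries more value. This improves model performance and reduces cost, as shown in the lower right of Figure 1.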

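Figure 1: Prompt compression in RAG (lower right). As the purple dashed line indicates, some compressors can also be applied directly to the retrieved contexts. Image by the author.

It is worth noting that, as the purple dashed line in Figure 1 shows, some compressors can also be applied directly to the retrieved contexts.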

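Overall, prompt compression methods fall into four main categories:

Methods based on information entropy, such as Selective Context, LLMLingua, and LongLLMLingua. These methods use a small language model to compute the self-information or perplexity of each token in the original prompt, then remove the tokens with lower perplexity.
Methods based on soft prompt tuning, such as AutoCompressor and GIST. These methods require fine-tuning the LLM's parameters, which suits them to specific domains, but they cannot be applied directly to black-box LLMs.
Methods based on data distillation: knowledge is first distilled from an LLM, and a model is then trained to produce more interpretable text summaries. These transfer across different language models and can be applied to black-box LLMs without gradient updates. Representative methods are LLMLingua-2 and RECOMP.
Methods based on token merging or token pruning, such as ToMe and AdapLeR. These methods usually require fine-tuning the model or generating intermediate results during inference.

Since the fourth category was originally proposed for smaller models such as ViT or BERT, this article covers the principles of representative algorithms from the first three categories.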

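Selective Context

Insight

Figure 2 shows that an LLM can respond to user queries without the full context or the complete conversation history. Even when relevant information is omitted, the LLM can still produce the expected response. This is probably because the LLM can infer the missing information from contextual cues and from prior knowledge acquired during pre-training.

Figure 2: LLMs can answer correctly with less informative content removed. Source: Selective Context.

The context length can therefore be reduced by filtering out less informative content without hurting performance. This is the key insight behind Selective Context.

Selective Context uses a small language model (SLM) to compute the self-information of lexical units in a given context, such as sentences, phrases, or tokens, and uses that self-information to judge how informative each unit is. By selectively retaining the content with higher self-information, Selective Context gives the LLM a more concise and efficient context representation without compromising performance across different tasks.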

Self-Information

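Selective Context uses self-information to evaluate the quality of content.

Self-information, also known as surprisal or information content, is a key concept in information theory. It quantifies the amount of information an event conveys and is defined as the negative log-likelihood of a token:

$$I(x) = -\log P(x)$$

where I(x) is the self-information of token x and P(x) is its output probability.

In information theory, self-information quantifies the level of surprise or uncertainty associated with an event. Rare events convey more information and therefore have higher self-information, while common events convey less information and have lower self-information.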
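
As a quick numeric illustration of the definition, here is a minimal sketch that computes self-information for two made-up probabilities. The function name `self_information` and the probability values are illustrative assumptions, and the natural logarithm is used, so the results are in nats.

```python
import math


def self_information(p: float) -> float:
    """Return I(x) = -log P(x) in nats for a probability 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(p)


# Hypothetical next-token probabilities, chosen only for illustration.
common = self_information(0.9)    # a highly predictable token
rare = self_information(0.001)    # a surprising token

print(f"common: {common:.3f} nats, rare: {rare:.3f} nats")
```

The rarer token gets the larger value, which is why a filter based on self-information tends to keep it.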

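Algorithm

To explain the principle more conveniently, let's dive into the source code.

First, set up the environment by installing the required Python library and downloading the spaCy model.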

(base) Florian:~ Florian$ conda create -n "selective_context" python=3.10 
(base) Florian:~ Florian$ conda activate selective_context
(selective_context) Florian:~ Florian$ pip install selective-context
(selective_context) Florian:~ Florian$ python -m spacy download en_core_web_sm

After installation, the version is:

(selective_context) Florian:~ Florian$ pip list | grep selective
selective-context   0.1.4

The test code is as follows:

from selective_context import SelectiveContext

sc = SelectiveContext(model_type='gpt2', lang='en')
text = "INTRODUCTION Continual Learning ( CL ) , also known as Lifelong Learning , is a promising learning paradigm to design models that have to learn how to perform multiple tasks across different environments over their lifetime [To uniform the language and enhance the readability of the paper we adopt the unique term continual learning ( CL ) .]. Ideal CL models in the real world should be deal with domain shifts , researchers have recently started to sample tasks from two different datasets . For instance , proposed to train and evaluate a model on Imagenet first and then challenge its performance on the Places365 dataset . considers more scenarios , starting with Imagenet or Places365 , and then moving on to the VOC/CUB/Scenes datasets. Few works propose more advanced scenarios built on top of more than two datasets."
context, reduced_content = sc(text)

# We can also adjust the reduce ratio
# context_ratio, reduced_content_ratio = sc(text, reduce_ratio = 0.5)

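The first run downloads the GPT-2 model, which is about 500MB. The result of the test code is shown in Figure 3.

Figure 3: Result of the Selective Context test code. Screenshot by the author.

Next, let's look inside the function sc(text). The internal source code is as follows: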

class SelectiveContext:
    ...
    ...
    def __call__(self, text: str, reduce_ratio: float = 0.35, reduce_level :str = 'phrase') -> List[str]:
        context = self.beautify_context(text)

        self.mask_ratio = reduce_ratio

        sents = [sent.strip() for sent in re.split(self.sent_tokenize_pattern, context) if sent.strip()]

        # You want the reduce happen at sentence level, phrase level, or token level?
        assert reduce_level in ['sent', 'phrase', 'token'], f"reduce_level should be one of ['sent', 'phrase', 'token'], got {reduce_level}"
        sent_lus, phrase_lus, token_lus = self._lexical_unit(sents)
        lexical_level = {
            'sent': sent_lus,
            'phrase': phrase_lus,
            'token': token_lus
        }

        # context is the reduced context, masked_sents denotes what context has been filtered out
        context, masked_sents = self.self_info_mask(lexical_level[reduce_level].text, lexical_level[reduce_level].self_info, reduce_level)
        return context, masked_sents

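The code above involves three main steps:

Compute the self-information of each token in the context.
Merge tokens and their self-information into lexical units, such as phrases or sentences.
Selectively retain the informative context.

Step 1: Computing Self-Information

Given a context C = x0, x1, …, xn, where each xi is a token, we use a causal language model (such as GPT-2, OPT, or LLaMA) to compute the self-information of each token xi:

$$I(x_i) = -\log P(x_i \mid x_0, x_1, \dots, x_{i-1})$$

If you are using GPT-2, the corresponding code is: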

class SelectiveContext:
    ...
    ...    
    def _get_self_info_via_gpt2(self, text: str) -> Tuple[List[str], List[float]]:
        if self.lang == 'en':
            text = f"<|endoftext|>{text}"
        elif self.lang == 'zh':
            text = f"[CLS]{text}"
        with torch.no_grad():
            encoding = self.tokenizer(text, add_special_tokens=False, return_tensors='pt')
            encoding = encoding.to(self.device)
            outputs = self.model(**encoding)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=-1)
            self_info = -torch.log(probs)
        
        input_ids = encoding['input_ids']
        input_ids_expaned = input_ids[:, 1:].unsqueeze(-1)
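
The excerpt above stops right after shifting the input IDs. The remaining idea is to pick out, for each position, the self-information of the token that actually came next. A minimal sketch of that step without PyTorch follows; the toy tokens and probability tables are invented for illustration, and this is not the library's code.

```python
import math
from typing import Dict, List


def token_self_info(tokens: List[str], next_token_probs: List[Dict[str, float]]) -> List[float]:
    """Self-information of each observed token.

    next_token_probs[i] is the (toy) distribution a causal LM predicts for
    position i given tokens[:i]; tokens[i] is the token actually observed.
    """
    if len(tokens) != len(next_token_probs):
        raise ValueError("need one distribution per token")
    return [-math.log(dist[tok]) for tok, dist in zip(tokens, next_token_probs)]


# Hypothetical distributions for the text "the cat sat".
tokens = ["the", "cat", "sat"]
dists = [
    {"the": 0.5, "a": 0.5},      # P(x0)
    {"cat": 0.1, "dog": 0.9},    # P(x1 | x0)
    {"sat": 0.8, "ran": 0.2},    # P(x2 | x0, x1)
]
info = token_self_info(tokens, dists)
print([round(v, 3) for v in info])
```

In the real model the lookup runs over the full vocabulary at every position, but the principle is the same: only the probability assigned to the observed token matters.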

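Step 2: Merging into Lexical Units

Filtering directly at the token level can make the context incoherent. For example, "2009" in the original prompt could be compressed to "209".

So besides token-level filtering, it is important to filter at the phrase and sentence levels as well. The basic unit of filtering, called a lexical unit, can be a token, a phrase, or a sentence.

How do we compute the self-information of a lexical unit u = (xt, …, xt+α)? Following the additivity of self-information, we can sum the self-information of the tokens that make up u:

$$I(u) = \sum_{i=t}^{t+\alpha} I(x_i)$$

The corresponding code is shown below, with debugging output added for some variables: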

class SelectiveContext:
    ...
    ...
    def _lexical_unit(self, sents):

        if self.sent_level_self_info:
            sent_self_info = []
            all_noun_phrases = []
            all_noun_phrases_info = []
            all_tokens = []
            all_token_self_info = []

            for sent in sents:
                # print(sent)
                tokens, self_info = self.get_self_information(sent)
                '''
                ipdb> sent
                'INTRODUCTION Continual Learning ( CL ) , also known as Lifelong Learning , is a promising learning paradigm to design models that have to learn how to perform multiple tasks across different environments over their lifetime [To uniform the language and enhance the readability of the paper we adopt the unique term continual learning ( CL ) .].'

                ipdb> tokens
                ['IN', 'TR', 'ODUCT', 'ION', ' Contin', 'ual', ' Learning', ' (', ' CL', ' )', ',', ' also', ' known', ' as', ' Lif', 'elong', ' Learning', ',', ' is', ' a', ' promising', ' learning', ' paradigm', ' to', ' design', ' models', ' that', ' have', ' to', ' learn', ' how', ' to', ' perform', ' multiple', ' tasks', ' across', ' different', ' environments', ' over', ' their', ' lifetime', ' [', 'To', ' uniform', ' the', ' language', ' and', ' enhance', ' the', ' read', 'ability', ' of', ' the', ' paper', ' we', ' adopt', ' the', ' unique', ' term', ' continual', ' learning', ' (', ' CL', ' )', '.', '].']

                ipdb> self_info
                [7.514791011810303, 1.632637619972229, 0.024813441559672356, 0.006853647995740175, 12.09920597076416, 2.1144468784332275, 9.457701683044434, 2.4503376483917236, 10.236454963684082, 0.8689146041870117, 5.269547939300537, 4.641763210296631, 0.22138957679271698, 0.010370315983891487, 10.071824073791504, 0.6905602216720581, 0.01698811538517475, 1.5882389545440674, 0.4495090842247009, 0.45371606945991516, 6.932497978210449, 6.087430477142334, 3.66465425491333, 3.3969509601593018, 7.337691307067871, 5.881226539611816, 1.7340556383132935, 4.599822521209717, 6.482723236083984, 4.045308589935303, 4.762691497802734, 0.21346867084503174, 3.7985599040985107, 4.6389899253845215, 0.33642446994781494, 4.918881416320801, 2.076707601547241, 3.3553669452667236, 5.5081071853637695, 5.625778675079346, 0.7966060638427734, 6.347291946411133, 12.772034645080566, 13.792041778564453, 4.11267614364624, 6.583715915679932, 3.3618998527526855, 8.434362411499023, 1.2423189878463745, 5.8330583572387695, 0.0013973338063806295, 0.3090735077857971, 1.1139129400253296, 4.160390853881836, 3.744772434234619, 7.2841596603393555, 1.4088190793991089, 7.86871337890625, 4.305004596710205, 9.69282341003418, 0.08665203303098679, 1.6127821207046509, 1.6296097040176392, 0.46206924319267273, 3.0398476123809814, 6.892032623291016]
                '''
                sent_self_info.append(np.mean(self_info))

                all_tokens.extend(tokens)
                all_token_self_info.extend(self_info)

                noun_phrases, noun_phrases_info = self._calculate_lexical_unit(tokens, self_info)
                '''
                ipdb> noun_phrases
                ['INTRODUCTION Continual Learning', ' (', ' CL', ' )', ',', ' also', ' known', ' as', ' Lifelong Learning', ',', ' is', ' a promising learning paradigm', ' to', ' design', ' models', ' that', ' have', ' to', ' learn', ' how', ' to', ' perform', ' multiple tasks', ' across', ' different environments', ' over', ' their lifetime', ' [', 'To', ' uniform', ' the language', ' and', ' enhance', ' the readability', ' of', ' the paper', ' we', ' adopt', ' the unique term continual learning', ' (', ' CL', ' )', '.', ']', '.']
                
                ipdb> noun_phrases_info
                [4.692921464797109, 2.4503376483917236, 10.236454963684082, 0.8689146041870117, 5.269547939300537, 4.641763210296631, 0.22138957679271698, 0.010370315983891487, 3.5931241369495788, 1.5882389545440674, 0.4495090842247009, 4.284574694931507, 3.3969509601593018, 7.337691307067871, 5.881226539611816, 1.7340556383132935, 4.599822521209717, 6.482723236083984, 4.045308589935303, 4.762691497802734, 0.21346867084503174, 3.7985599040985107, 2.487707197666168, 4.918881416320801, 2.7160372734069824, 5.5081071853637695, 3.2111923694610596, 6.347291946411133, 12.772034645080566, 13.792041778564453, 5.348196029663086, 3.3618998527526855, 8.434362411499023, 2.3589248929638416, 0.3090735077857971, 2.6371518969535828, 3.744772434234619, 7.2841596603393555, 4.672402499616146, 1.6127821207046509, 1.6296097040176392, 0.46206924319267273, 3.0398476123809814, 3.446016311645508, 3.446016311645508]
                '''

                # We need to add a space before the first noun phrase for every sentence except the first one
                if all_noun_phrases:
                    noun_phrases[0] = f" {noun_phrases[0]}"
                all_noun_phrases.extend(noun_phrases)
                all_noun_phrases_info.extend(noun_phrases_info)
            
            return [
                LexicalUnits('sent', text=sents, self_info=sent_self_info),
                LexicalUnits('phrase', text=all_noun_phrases, self_info=all_noun_phrases_info),
                LexicalUnits('token', text=all_tokens, self_info=all_token_self_info)
            ]
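
One detail is worth checking against the debug output: the text describes summing token self-information, but the first phrase value printed above (about 4.6929) matches the mean of its seven token values, and the sentence-level score uses `np.mean`. The sketch below shows both aggregations side by side. The helper `merge_units` and the span indices are assumptions made for illustration, and the token values are rounded copies of the first nine numbers in the debug output.

```python
from statistics import fmean
from typing import List, Tuple


def merge_units(token_info: List[float], spans: List[Tuple[int, int]], how: str = "mean") -> List[float]:
    """Aggregate token self-information over half-open spans [start, end)."""
    if how not in ("mean", "sum"):
        raise ValueError("how must be 'mean' or 'sum'")
    agg = fmean if how == "mean" else sum
    return [agg(token_info[start:end]) for start, end in spans]


# First nine token values from the debug output (rounded to 4 decimals).
token_info = [7.5148, 1.6326, 0.0248, 0.0069, 12.0992, 2.1144, 9.4577, 2.4503, 10.2365]
# 'INTRODUCTION Continual Learning' covers tokens 0..6, followed by ' (' and ' CL'.
spans = [(0, 7), (7, 8), (8, 9)]

print(merge_units(token_info, spans, how="mean"))
print(merge_units(token_info, spans, how="sum"))
```

Averaging keeps long phrases from being favored just because they contain more tokens, which may be why the implementation appears to use it.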

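Step 3: Selectively Retaining Informative Context

Once the self-information of each lexical unit is computed, the question is how to judge its informativeness. The Selective Context paper proposes an adaptive, percentile-based filtering method to select the most informative content, which works better than a fixed threshold or keeping a fixed number of top-k lexical units.

First, the lexical units are ranked in descending order of self-information. Then the p-th percentile of these values is computed over all lexical units. Finally, the lexical units whose self-information is greater than or equal to the p-th percentile are retained.

The corresponding code is as follows.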

class SelectiveContext:
    ...
    ...

    def self_info_mask(self, sents: List[str], self_info: List[float], mask_level):
        # mask_level: mask sentences, phrases, or tokens
        sents_after_mask = []
        masked_sents = []
                
        self.ppl_threshold = np.nanpercentile(self_info, self.mask_ratio * 100)

        # if title is not None:
        #     with open(os.path.join(self.path, title+'_prob_token.tsv'), 'w', encoding='utf-8') as f:
        #         for token, info in zip(tokens, self_info):
        #             f.write(f"{token}\t{info}\n")
        #     with open(os.path.join(self.path, title+'_prob_sent.tsv'), 'w', encoding='utf-8') as f:
        #         for sent, info in zip(sents, sent_self_info):
        #             f.write(f"{sent}\n{info}\n\n")

        for sent, info in zip(sents, self_info):
            if info < self.ppl_threshold:
                masked_sents.append(sent)
                sents_after_mask.append(self.mask_a_sent(sent, mask_level))
            else:
                sents_after_mask.append(sent)
        masked_context = " ".join(sents_after_mask) if mask_level == 'sent' else "".join(sents_after_mask)
        
        return masked_context, masked_sents
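
To see the percentile rule on concrete numbers, here is a small standalone sketch in plain Python. The `percentile` helper uses linear interpolation between ranks, which is also NumPy's default method, but unlike `np.nanpercentile` it does not skip NaN values. The units and self-information values are made up for illustration.

```python
from typing import List, Tuple


def percentile(values: List[float], p: float) -> float:
    """p-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    k = (len(ordered) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (k - lo)


def filter_units(units: List[str], infos: List[float], reduce_ratio: float) -> Tuple[List[str], List[str]]:
    """Split units into (kept, dropped) using the reduce_ratio percentile as threshold."""
    threshold = percentile(infos, reduce_ratio * 100)
    kept = [u for u, i in zip(units, infos) if i >= threshold]
    dropped = [u for u, i in zip(units, infos) if i < threshold]
    return kept, dropped


units = ["Continual Learning", " is", " a promising paradigm", " to", " design models"]
infos = [9.1, 0.4, 4.3, 3.4, 6.6]  # made-up self-information values
kept, dropped = filter_units(units, infos, reduce_ratio=0.5)
print(kept)
print(dropped)
```

Because the threshold is derived from the data, the same `reduce_ratio` adapts to contexts whose self-information values sit on very different scales.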

LLMLingua

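Overview

LLMLingua argues that Selective Context often ignores the interconnection between compressed contents and the correlation between the LLM and the small language model used for prompt compression. LLMLingua addresses exactly these issues.

Specifically, as shown in Figure 4, LLMLingua uses a budget controller to dynamically allocate different compression ratios to the components of the original prompt, such as the instruction, demonstrations, and question. It also performs coarse-grained, demonstration-level compression to maintain semantic integrity even at high compression ratios. In addition, LLMLingua introduces a token-level iterative algorithm for fine-grained prompt compression.

Figure 4: Framework of the proposed approach, LLMLingua. Source: LLMLingua.

Compared with Selective Context, LLMLingua preserves the key information in the prompt more effectively while taking the conditional dependencies between tokens into account. It can compress prompts by up to 20x.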

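Budget Controller

The budget controller is a key component of LLMLingua. It dynamically allocates different compression ratios to the different parts of the original prompt.

Different parts of a prompt differ in their sensitivity to compression. The instruction and question are more sensitive, while demonstrations are less so. The budget controller therefore assigns a lower compression ratio to the instruction and question to preserve the essential information, and a higher compression ratio to the demonstrations to remove redundancy.

The budget controller algorithm is shown in Figure 5:

Figure 5: Budget controller algorithm. Source: LLMLingua.

The main variables are:

M: a small language model, such as GPT-2 or LLaMA.
x = (x^ins, x^dems, x^que): the original prompt, consisting of the instruction, demonstrations, and question.
L, L_ins, L_dems, and L_que: the number of tokens in x, x^ins, x^dems, and x^que, respectively.
τ_dems: the compression rate for the demonstrations, derived from the target overall compression rate τ and the predefined compression rates τ_ins and τ_que for the instruction and question.
D: the set that will contain the compressed demonstrations.

The main process is as follows:

1. Compute the compression rate for the demonstrations.
2. Use the small language model (such as GPT-2 or LLaMA) to compute the perplexity of each demonstration in the original demonstration set.
3. Sort all demonstrations by perplexity in descending order.
4. Iteratively select demonstrations and add them to the set D.
5. After compressing the demonstrations, allocate the remaining budget to the instruction and question.
6. Output the set D after coarse-grained compression.

Through this demonstration-level process, the budget controller retains key information during compression and effectively reduces the size of the original prompt. It is particularly suitable for prompts that contain multiple demonstrations.

The relevant code is in the function control_context_budget.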
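
The six steps can be sketched roughly as follows. This is a simplified illustration, not the code of control_context_budget. It assumes that each rate τ is the fraction of tokens kept, that the demonstration rate is τ_dems = (τ·L − (τ_ins·L_ins + τ_que·L_que)) / L_dems, and that selection stops before the budget would be exceeded. The function names, the coefficient `k`, and all numbers are hypothetical.

```python
from typing import List, Tuple


def demo_compression_rate(tau: float, L: int, L_ins: int, L_que: int, L_dems: int,
                          tau_ins: float, tau_que: float) -> float:
    """Rate left for demonstrations once instruction and question are budgeted (assumed formula)."""
    return (tau * L - (tau_ins * L_ins + tau_que * L_que)) / L_dems


def select_demos(demos: List[Tuple[str, int, float]], tau_dems: float, k: float = 1.0):
    """demos holds (text, token_count, perplexity); returns (selected texts, unused token budget)."""
    L_dems = sum(n for _, n, _ in demos)
    budget = k * tau_dems * L_dems
    selected, used = [], 0
    # Visit demonstrations from highest to lowest perplexity.
    for text, n_tokens, _ppl in sorted(demos, key=lambda d: d[2], reverse=True):
        if used + n_tokens > budget:
            break
        selected.append(text)
        used += n_tokens
    return selected, budget - used


# Hypothetical prompt: 20 instruction tokens, 30 question tokens, three demonstrations.
tau, tau_ins, tau_que = 0.5, 0.85, 0.9
L_ins, L_que = 20, 30
demos = [("demo A", 100, 12.0), ("demo B", 120, 30.0), ("demo C", 80, 8.0)]
L_dems = sum(n for _, n, _ in demos)
L = L_ins + L_dems + L_que

tau_dems = demo_compression_rate(tau, L, L_ins, L_que, L_dems, tau_ins, tau_que)
selected, remaining = select_demos(demos, tau_dems)
print(round(tau_dems, 3), selected, round(remaining, 1))
```

The unused budget returned at the end corresponds to step 5: it can be handed to the instruction and question so that they are compressed less aggressively.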

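Iterative Token-level Prompt Compression (ITPC)

Using perplexity for prompt compression has an inherent limitation: the independence assumption. It treats the tokens in the prompt as independent of one another. In other words, the probability of a token is taken to depend only on the tokens before it, regardless of which other tokens are later removed.

The problem with this assumption is that it ignores the complex dependencies that often exist between tokens in natural language, which are essential for understanding context and preserving semantic integrity.

This oversight can cause important information to be lost during compression. For example, at high compression ratios, if a token provides a key reasoning step or logical connection in the context, deciding whether to keep it based only on its perplexity may leave the reasoning process incomplete.

To address this, LLMLingua introduces the Iterative Token-level Prompt Compression (ITPC) algorithm. Rather than relying only on each token's independent probability, ITPC assesses the importance of each token more precisely during prompt compression. It does so by processing the prompt segment by segment and considering the conditional probability of each token within the current context, which helps preserve the dependencies between tokens.

Figure 6 shows the detailed steps of ITPC:

Figure 6: Detailed steps of the ITPC algorithm. Image by the author.

Through this process, ITPC effectively shortens the prompt while maintaining the integrity of its semantics, thereby reducing the LLM's inference cost.

The relevant code is in the function iterative_compress_prompt.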
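
The segment-by-segment idea can be sketched as below. This is a simplified illustration rather than the code of iterative_compress_prompt: the `score` callback stands in for the small language model's surprisal, and `toy_surprisal`, the segment size, and the threshold are hypothetical choices made so the example runs without a model.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[Sequence[str], str], float]


def iterative_compress(tokens: List[str], segment_size: int, threshold: float, score: Scorer) -> List[str]:
    """Compress segment by segment, conditioning on what has been kept so far."""
    kept: List[str] = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        # Each token is scored given the compressed prefix plus the
        # uncompressed tokens that precede it inside the current segment.
        context = list(kept)
        survivors = []
        for tok in segment:
            if score(context, tok) > threshold:
                survivors.append(tok)
            context.append(tok)
        # Only the survivors are carried over to the next segment.
        kept.extend(survivors)
    return kept


def toy_surprisal(context: Sequence[str], token: str) -> float:
    """Stand-in for a small LM: a token already seen in the context is unsurprising."""
    return 0.1 if token in context else 5.0


tokens = ["the", "cat", "the", "cat", "sat", "the", "mat"]
compressed = iterative_compress(tokens, segment_size=3, threshold=1.0, score=toy_surprisal)
print(compressed)
```

Because later segments are scored against the already compressed prefix, a decision to drop a token changes the scores of the tokens that follow it, which is exactly the dependency that a single-pass filter ignores.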

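Instruction Tuning

Figure 4 shows that instruction tuning is also a key step in LLMLingua. Its purpose is to minimize the distribution gap between the LLM and the small language model used to compress prompts.

Figure 7 shows the steps of instruction tuning:

Figure 7: Steps of instruction tuning. Image by the author.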

Code Demonstration

Now let's start the code demonstration. First, set up the environment:

(base) Florian:~ Florian$ conda create -n "llmlingua" python=3.11
(base) Florian:~ Florian$ conda activate llmlingua
(llmlingua) Florian:~ Florian$ pip install llmlingua

The installed version is:

llmlingua          0.2.1

The test code is as follows:

from llmlingua import PromptCompressor

GSM8K_PROMPT = "Question: Angelo and Melanie want to plan how many hours over the next week they should study together for their test next week. They have 2 chapters of their textbook to study and 4 worksheets to memorize. They figure out that they should dedicate 3 hours to each chapter of their textbook and 1.5 hours for each worksheet. If they plan to study no more than 4 hours each day, how many days should they plan to study total over the next week if they take a 10-minute break every hour, include 3 10-minute snack breaks each day, and 30 minutes for lunch each day?\nLet's think step by step\nAngelo and Melanie think they should dedicate 3 hours to each of the 2 chapters, 3 hours x 2 chapters = 6 hours total.\nFor the worksheets they plan to dedicate 1.5 hours for each worksheet, 1.5 hours x 4 worksheets = 6 hours total.\nAngelo and Melanie need to start with planning 12 hours to study, at 4 hours a day, 12 / 4 = 3 days.\nHowever, they need to include time for breaks and lunch. Every hour they want to include a 10-minute break, so 12 total hours x 10 minutes = 120 extra minutes for breaks.\nThey also want to include 3 10-minute snack breaks, 3 x 10 minutes = 30 minutes.\nAnd they want to include 30 minutes for lunch each day, so 120 minutes for breaks + 30 minutes for snack breaks + 30 minutes for lunch = 180 minutes, or 180 / 60 minutes per hour = 3 extra hours.\nSo Angelo and Melanie want to plan 12 hours to study + 3 hours of breaks = 15 hours total.\nThey want to study no more than 4 hours each day, 15 hours / 4 hours each day = 3.75\nThey will need to plan to study 4 days to allow for all the time they need.\nThe answer is 4\n\nQuestion: You can buy 4 apples or 1 watermelon for the same price. You bought 36 fruits evenly split between oranges, apples and watermelons, and the price of 1 orange is $0.50. How much does 1 apple cost if your total bill was $66?\nLet's think step by step\nIf 36 fruits were evenly split between 3 types of fruits, then I bought 36/3 = 12 units of each fruit\nIf 1 orange costs $0.50 then 12 oranges will cost $0.50 * 12 = $6\nIf my total bill was $66 and I spent $6 on oranges then I spent $66 - $6 = $60 on the other 2 fruit types.\nAssuming the price of watermelon is W, and knowing that you can buy 4 apples for the same price and that the price of one apple is A, then 1W=4A\nIf we know we bought 12 watermelons and 12 apples for $60, then we know that $60 = 12W + 12A\nKnowing that 1W=4A, then we can convert the above to $60 = 12(4A) + 12A\n$60 = 48A + 12A\n$60 = 60A\nThen we know the price of one apple (A) is $60/60= $1\nThe answer is 1\n\nQuestion: Susy goes to a large school with 800 students, while Sarah goes to a smaller school with only 300 students.  At the start of the school year, Susy had 100 social media followers.  She gained 40 new followers in the first week of the school year, half that in the second week, and half of that in the third week.  Sarah only had 50 social media followers at the start of the year, but she gained 90 new followers the first week, a third of that in the second week, and a third of that in the third week.  After three weeks, how many social media followers did the girl with the most total followers have?\nLet's think step by step\nAfter one week, Susy has 100+40 = 140 followers.\nIn the second week, Susy gains 40/2 = 20 new followers.\nIn the third week, Susy gains 20/2 = 10 new followers.\nIn total, Susy finishes the three weeks with 140+20+10 = 170 total followers.\nAfter one week, Sarah has 50+90 = 140 followers.\nAfter the second week, Sarah gains 90/3 = 30 followers.\nAfter the third week, Sarah gains 30/3 = 10 followers.\nSo, Sarah finishes the three weeks with 140+30+10 = 180 total followers.\nThus, Sarah is the girl with the most total followers with a total of 180.\nThe answer is 180"

llm_lingua = PromptCompressor()

## Or use the phi-2 model,
# llm_lingua = PromptCompressor("microsoft/phi-2")

## Or use the quantation model, like TheBloke/Llama-2-7b-Chat-GPTQ, only need <8GB GPU memory.
## Before that, you need to pip install optimum auto-gptq
# llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", model_config={"revision": "main"})

compressed_prompt = llm_lingua.compress_prompt(GSM8K_PROMPT.split("\n\n")[0], instruction="", question="", target_token=200)

print('-' * 100)
print("original:")
print(GSM8K_PROMPT.split("\n\n")[0])

print('-' * 100)
print("compressed_prompt:")
print(compressed_prompt)

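The default model is downloaded on the first run. Alternatively, we can use a quantized model. The result is shown in Figure 8:

Figure 8: Result of the LLMLingua test code. Screenshot by the author.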

LongLLMLingua

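The problem with LLMLingua is that it does not consider the user's question during compression, so it may retain irrelevant information.

LongLLMLingua aims to solve this by incorporating the user's question into the compression process.

Figure 9: Framework of LongLLMLingua. Gray italic content: same as in LLMLingua. Source: LongLLMLingua.

As shown in Figure 9, LongLLMLingua proposes four new components to enhance the LLM's perception of key information:

Question-aware coarse-grained and fine-grained compression
Document reordering mechanism
Dynamic compression ratio
Subsequence recovery algorithm

Question-Aware Coarse-Grained Compression

LongLLMLingua proposes using the perplexity of the question x^que conditioned on each context x^doc_k to represent their association. A restrictive statement, x^restrict = "We can get the answer to this question in the given documents", can be appended after x^que. This statement strengthens the connection between x^que and x^doc_k and acts as a regularization term that reduces hallucination. This can be expressed as:

$$r_k = \mathrm{perplexity}\left(x^{que}, x^{restrict} \mid x^{doc}_k\right)$$

Why not compute the document-level perplexity conditioned on the question x^que instead? Because documents often contain a large amount of irrelevant information. Even when conditioned on x^que, the perplexity score computed over an entire document may not be distinctive enough, making it an inadequate metric for document-level compression.

The relevant code is in the function get_distance_longllmlingua.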

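Question-Aware Fine-Grained Compression

LongLLMLingua introduces the concept of contrastive perplexity:

$$s_i = \mathrm{perplexity}(x_i \mid x_{<i}) - \mathrm{perplexity}(x_i \mid x^{que}, x_{<i})$$

First, we compute the perplexity of a token without considering the question, denoted perplexity(x_i | x_{<i}). Then we measure the perplexity again, this time including the question, denoted perplexity(x_i | x^que, x_{<i}). The latter measures how surprising token x_i is given the question x^que and all the tokens before it.

The goal is to determine how much each token's surprise changes once the question is taken into account. If a token becomes less surprising after the question is included, it is probably highly relevant to the question.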
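
A small numeric sketch makes the contrast concrete. It uses the token-level negative log-likelihood as a proxy for perplexity, which is a simplifying assumption, and the probabilities are invented. The function name `contrastive_scores` is hypothetical.

```python
import math
from typing import List


def contrastive_scores(p_without_q: List[float], p_with_q: List[float]) -> List[float]:
    """Drop in surprisal for each token once the question is added to the context.

    Inputs are the probabilities of the same tokens under the two conditions.
    A positive score means the question made the token easier to predict.
    """
    return [math.log(pw) - math.log(po) for po, pw in zip(p_without_q, p_with_q)]


tokens = ["Paris", "is", "the", "capital"]
p_without_q = [0.01, 0.5, 0.6, 0.05]   # made-up P(x_i | x_<i)
p_with_q = [0.2, 0.5, 0.6, 0.4]        # made-up P(x_i | x_que, x_<i)

scores = contrastive_scores(p_without_q, p_with_q)
best = max(range(len(tokens)), key=lambda i: scores[i])
print([round(s, 3) for s in scores], tokens[best])
```

Function words score near zero because the question does not change how predictable they are, so they are the first candidates for removal.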

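Document Reordering Mechanism

As shown in Figure 10, during inference LLMs tend to use content at the beginning and end of the prompt and ignore content in the middle. This is known as the "lost in the middle" problem.

Figure 10: The ability of LLMs to capture relevant information depends on its position in the prompt. To reduce information loss in the middle, a document reordering mechanism is introduced. Source: LongLLMLingua.

Figure 10 also shows that LLMs perform best when the relevant information is placed at the beginning. LongLLMLingua therefore organizes the paragraphs according to the coarse-grained compression results, arranging them from front to back in descending order of score.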

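Dynamic Compression Ratio

Because the density of key information differs across documents, more budget (that is, a lower compression ratio) should be allocated to documents that are more relevant to the question.

LongLLMLingua uses the importance scores from coarse-grained compression to guide budget allocation during fine-grained compression.

Specifically, the budget controller of LLMLingua first sets the initial budget τ^doc for the retained documents. Then, during the fine-grained compression stage, the compression budget is dynamically allocated to each document based on the rank index I(r_k) of the document's importance score, which was determined during coarse-grained compression.

LongLLMLingua uses a linear scheduler for the adaptive allocation. The budget τ_i of each token x_i can be expressed as:

$$\tau_i = \tau^{doc}_k \quad \forall x_i \in x^{doc}_k, \qquad \tau^{doc}_k = \max\left(\min\left(\left(1 - \frac{2\,I(r_k)}{N_d}\right)\delta\tau + \tau^{doc},\ 1\right),\ 0\right)$$

where N_d is the number of documents and δτ is a hyperparameter that controls the overall budget for dynamic allocation.

The corresponding code is in the function get_dynamic_compression_ratio.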
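
A linear scheduler of this kind can be sketched as follows. The exact form is an assumption: each document's budget is taken to be the initial budget plus (1 − 2·rank/N_d)·δτ, clamped to [0, 1], with rank 0 for the most relevant document. The function name `dynamic_budgets` and the example scores are hypothetical, and this is not the code of get_dynamic_compression_ratio.

```python
from typing import List


def dynamic_budgets(scores: List[float], base_budget: float, delta_tau: float) -> List[float]:
    """Give each document a keep-budget in [0, 1] that falls linearly with its rank."""
    n_docs = len(scores)
    # Rank 0 is the document with the highest importance score (assumed convention).
    order = sorted(range(n_docs), key=lambda k: scores[k], reverse=True)
    rank = {doc: r for r, doc in enumerate(order)}
    return [
        max(min((1 - 2 * rank[k] / n_docs) * delta_tau + base_budget, 1.0), 0.0)
        for k in range(n_docs)
    ]


# Made-up importance scores for four retained documents.
scores = [0.2, 0.9, 0.5, 0.1]
budgets = dynamic_budgets(scores, base_budget=0.5, delta_tau=0.3)
print([round(b, 2) for b in budgets])
```

The most relevant document keeps the largest share of its tokens and the least relevant keeps the smallest, while the average stays close to the initial budget.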

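Subsequence Recovery Algorithm

As shown in Figure 11, during fine-grained token-wise compression, some tokens of key entities may be discarded. For example, "2009" in the original prompt may be compressed to "209", and "Wilhelm Conrad Rontgen" may be compressed to "Wilhelmgen".

Figure 11: A subsequence recovery example. Red text is the original text and blue text is the result. Source: LongLLMLingua.

LongLLMLingua proposes a subsequence recovery algorithm that restores the original content in the LLM's response, as shown in Figure 12.

Figure 12: Subsequence recovery algorithm. Source: LongLLMLingua.

The main process consists of the following steps:

Iterate through the tokens y_l in the LLM's response and select the longest substring ỹ_{key,l} that appears in the compressed prompt x̃.
Find the maximum common shortest subsequence x_{i,j} in the original prompt x that corresponds to ỹ_{key,l}.
Replace the matched tokens ỹ_{key,l} in the LLM's response with x_{i,j}.

The corresponding code is in the function recover.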
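
The three steps can be approximated at the token level as shown below. This is a simplified sketch under stated assumptions, not the library's recover function: tokens are plain strings, only runs of at least two tokens are recovered, and the helper names `shortest_window`, `contains`, and `recover_tokens` are hypothetical.

```python
from typing import List, Optional, Tuple


def shortest_window(original: List[str], key: List[str]) -> Optional[Tuple[int, int]]:
    """Shortest span [i, j) of original that contains key as a subsequence."""
    best = None
    for i, tok in enumerate(original):
        if tok != key[0]:
            continue
        pos, j = 1, i + 1
        while pos < len(key) and j < len(original):
            if original[j] == key[pos]:
                pos += 1
            j += 1
        if pos == len(key) and (best is None or j - i < best[1] - best[0]):
            best = (i, j)
    return best


def contains(seq: List[str], sub: List[str]) -> bool:
    """True if sub occurs as a contiguous run inside seq."""
    n = len(sub)
    return any(seq[s:s + n] == sub for s in range(len(seq) - n + 1))


def recover_tokens(response: List[str], compressed: List[str], original: List[str]) -> List[str]:
    out, l = [], 0
    while l < len(response):
        # Step 1: longest run of response tokens found contiguously in the compressed prompt.
        m = 0
        while l + m < len(response) and contains(compressed, response[l:l + m + 1]):
            m += 1
        # Step 2: shortest span of the original prompt that covers that run.
        window = shortest_window(original, response[l:l + m]) if m >= 2 else None
        if window:
            # Step 3: replace the run with the original span.
            out.extend(original[window[0]:window[1]])
            l += m
        else:
            out.append(response[l])
            l += 1
    return out


original = ["Wilhelm", "Conrad", "Ront", "gen", "won", "in", "1901"]
compressed = ["Wilhelm", "gen", "won", "1901"]
response = ["It", "was", "Wilhelm", "gen"]
recovered = recover_tokens(response, compressed, original)
print(recovered)
```

The response fragment "Wilhelm gen" is mapped back to the full name because the shortest span of the original prompt that contains both tokens in order is the complete entity.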

Code Demonstration

The environment setup is the same as for LLMLingua. Here is the test code:

from llmlingua import PromptCompressor

GSM8K_PROMPT = "Question: Angelo and Melanie want to plan how many hours over the next week they should study together for their test next week. They have 2 chapters of their textbook to study and 4 worksheets to memorize. They figure out that they should dedicate 3 hours to each chapter of their textbook and 1.5 hours for each worksheet. If they plan to study no more than 4 hours each day, how many days should they plan to study total over the next week if they take a 10-minute break every hour, include 3 10-minute snack breaks each day, and 30 minutes for lunch each day?\nLet's think step by step\nAngelo and Melanie think they should dedicate 3 hours to each of the 2 chapters, 3 hours x 2 chapters = 6 hours total.\nFor the worksheets they plan to dedicate 1.5 hours for each worksheet, 1.5 hours x 4 worksheets = 6 hours total.\nAngelo and Melanie need to start with planning 12 hours to study, at 4 hours a day, 12 / 4 = 3 days.\nHowever, they need to include time for breaks and lunch. Every hour they want to include a 10-minute break, so 12 total hours x 10 minutes = 120 extra minutes for breaks.\nThey also want to include 3 10-minute snack breaks, 3 x 10 minutes = 30 minutes.\nAnd they want to include 30 minutes for lunch each day, so 120 minutes for breaks + 30 minutes for snack breaks + 30 minutes for lunch = 180 minutes, or 180 / 60 minutes per hour = 3 extra hours.\nSo Angelo and Melanie want to plan 12 hours to study + 3 hours of breaks = 15 hours total.\nThey want to study no more than 4 hours each day, 15 hours / 4 hours each day = 3.75\nThey will need to plan to study 4 days to allow for all the time they need.\nThe answer is 4\n\nQuestion: You can buy 4 apples or 1 watermelon for the same price. You bought 36 fruits evenly split between oranges, apples and watermelons, and the price of 1 orange is $0.50. How much does 1 apple cost if your total bill was $66?\nLet's think step by step\nIf 36 fruits were evenly split between 3 types of fruits, then I bought 36/3 = 12 units of each fruit\nIf 1 orange costs $0.50 then 12 oranges will cost $0.50 * 12 = $6\nIf my total bill was $66 and I spent $6 on oranges then I spent $66 - $6 = $60 on the other 2 fruit types.\nAssuming the price of watermelon is W, and knowing that you can buy 4 apples for the same price and that the price of one apple is A, then 1W=4A\nIf we know we bought 12 watermelons and 12 apples for $60, then we know that $60 = 12W + 12A\nKnowing that 1W=4A, then we can convert the above to $60 = 12(4A) + 12A\n$60 = 48A + 12A\n$60 = 60A\nThen we know the price of one apple (A) is $60/60= $1\nThe answer is 1\n\nQuestion: Susy goes to a large school with 800 students, while Sarah goes to a smaller school with only 300 students.  At the start of the school year, Susy had 100 social media followers.  She gained 40 new followers in the first week of the school year, half that in the second week, and half of that in the third week.  Sarah only had 50 social media followers at the start of the year, but she gained 90 new followers the first week, a third of that in the second week, and a third of that in the third week.  After three weeks, how many social media followers did the girl with the most total followers have?\nLet's think step by step\nAfter one week, Susy has 100+40 = 140 followers.\nIn the second week, Susy gains 40/2 = 20 new followers.\nIn the third week, Susy gains 20/2 = 10 new followers.\nIn total, Susy finishes the three weeks with 140+20+10 = 170 total followers.\nAfter one week, Sarah has 50+90 = 140 followers.\nAfter the second week, Sarah gains 90/3 = 30 followers.\nAfter the third week, Sarah gains 30/3 = 10 followers.\nSo, Sarah finishes the three weeks with 140+30+10 = 180 total followers.\nThus, Sarah is the girl with the most total followers with a total of 180.\nThe answer is 180"
QUESTION = "Question: Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?"



llm_lingua = PromptCompressor()

compressed_prompt = llm_lingua.compress_prompt(
    GSM8K_PROMPT.split("\n\n")[0],
    question = QUESTION,
    # ratio=0.55
    # Set the special parameter for LongLLMLingua
    condition_in_question = "after_condition",
    reorder_context = "sort",
    dynamic_context_compression_ratio = 0.3, # or 0.4
    condition_compare = True,
    context_budget = "+100",
    rank_method = "longllmlingua",
)

print('-' * 100)
print("original:")
print(GSM8K_PROMPT.split("\n\n")[0])


print('-' * 100)
print("compressed_prompt:")
print(compressed_prompt)

The result is shown in Figure 13:

Figure 13: Result of the LongLLMLingua test code. Screenshot by the author.

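AutoCompressor

Unlike the methods above, AutoCompressor is a soft-prompt-based approach.

It fine-tunes an existing model by extending the vocabulary and using "summary tokens" and "summary vectors" to compress contextual information.

Figure 14: AutoCompressor processes long documents by recursively generating summary vectors, which are passed as soft prompts to all subsequent segments. Source: AutoCompressor.

Figure 14 shows the architecture of AutoCompressor, which works in the following steps:

1. Extend the vocabulary: "summary tokens" are added to the model's existing vocabulary. These tokens allow the model to condense large amounts of information into a smaller set of vectors.
2. Split the document: the document to be processed is divided into small segments, and summary tokens are appended to each segment. These tokens also carry the summary information of the preceding segments, so summaries accumulate.
3. Fine-tune: the model is fine-tuned with an unsupervised objective, next-word prediction. The goal is to predict the next word from the tokens before the current token and the summary vectors of the segments before the current segment.
4. Backpropagation: AutoCompressor uses backpropagation through time (BPTT) with gradient checkpointing for each segment to minimize the size of the computational graph. Backpropagation is performed over the entire document, which lets the model learn the associations across the full context.

Code

AutoCompressor provides code that interested readers can try.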

import torch
from transformers import AutoTokenizer
from auto_compressor import LlamaAutoCompressorModel, AutoCompressorModel

# Load AutoCompressor trained by compressing 6k tokens in 4 compression steps
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/AutoCompressor-Llama-2-7b-6k")
# Need bfloat16 + cuda to run Llama model with flash attention
model = LlamaAutoCompressorModel.from_pretrained("princeton-nlp/AutoCompressor-Llama-2-7b-6k", torch_dtype=torch.bfloat16).eval().cuda()

prompt = 'The first name of the current US president is "'
prompt_tokens = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids.cuda()

context = """Joe Biden, born in Scranton, Pennsylvania, on November 20, 1942, had a modest upbringing in a middle-class family. He attended the University of Delaware, where he double-majored in history and political science, graduating in 1965. Afterward, he earned his law degree from Syracuse University College of Law in 1968.\nBiden's early political career began in 1970 when he was elected to the New Castle County Council in Delaware. In 1972, tragedy struck when his wife Neilia and 1-year-old daughter Naomi were killed in a car accident, and his two sons, Beau and Hunter, were injured. Despite this devastating loss, Biden chose to honor his commitment and was sworn in as a senator by his sons' hospital bedsides.\nHe went on to serve as the United States Senator from Delaware for six terms, from 1973 to 2009. During his time in the Senate, Biden was involved in various committees and was particularly known for his expertise in foreign affairs, serving as the chairman of the Senate Foreign Relations Committee on multiple occasions.\nIn 2008, Joe Biden was selected as the running mate for Barack Obama, who went on to win the presidential election. As Vice President, Biden played an integral role in the Obama administration, helping to shape policies and handling issues such as economic recovery, foreign relations, and the implementation of the Affordable Care Act (ACA), commonly known as Obamacare.\nAfter completing two terms as Vice President, Joe Biden decided to run for the presidency in 2020. He secured the Democratic nomination and faced the incumbent President Donald Trump in the general election. Biden campaigned on a platform of unity, promising to heal the divisions in the country and tackle pressing issues, including the COVID-19 pandemic, climate change, racial justice, and economic inequality.\nIn the November 2020 election, Biden emerged victorious, and on January 20, 2021, he was inaugurated as the 46th President of the United States. 
At the age of 78, Biden became the oldest person to assume the presidency in American history.\nAs President, Joe Biden has worked to implement his agenda, focusing on various initiatives, such as infrastructure investment, climate action, immigration reform, and expanding access to healthcare. He has emphasized the importance of diplomacy in international relations and has sought to rebuild alliances with global partners.\nThroughout his long career in public service, Joe Biden has been recognized for his commitment to bipartisanship, empathy, and his dedication to working-class issues. He continues to navigate the challenges facing the nation, striving to bring the country together and create positive change for all Americans."""
context_tokens = tokenizer(context, add_special_tokens=False, return_tensors="pt").input_ids.cuda()

summary_vectors = model(context_tokens, output_softprompt=True).softprompt
print(f"Compressing {context_tokens.size(1)} tokens to {summary_vectors.size(1)} summary vectors")
# >>> Compressing 660 tokens to 50 summary vectors

generation_with_summary_vecs = model.generate(prompt_tokens, do_sample=False, softprompt=summary_vectors, max_new_tokens=12)[0]
print("Generation w/ summary vectors:\n" + tokenizer.decode(generation_with_summary_vecs))
# >>> The first name of the current US president is "Joe" and the last name is "Biden".

next_tokens_without_context = model.generate(prompt_tokens, do_sample=False, max_new_tokens=11)[0]
print("Generation w/o context:\n" + tokenizer.decode(next_tokens_without_context))
# >>> The first name of the current US president is "Donald" and the last name is "Trump".

LLMLingua-2

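LLMLingua-2 identifies two problems with compressing prompts by removing tokens or lexical units based on the information entropy from a causal language model such as LLaMA-7B:

(1) The small language model used to determine information entropy is not aligned with the prompt compression objective.

(2) It uses only unidirectional context, which may not include all the information needed for prompt compression.

At the core of both problems is that information entropy may be a suboptimal metric for compression.

The overall architecture of LLMLingua-2 is shown in Figure 15:

Figure 15: Overview of LLMLingua-2. Source: LLMLingua-2.

To address problem 1, LLMLingua-2 introduces a data distillation procedure. It extracts knowledge from an LLM to compress prompts without losing key information, and at the same time builds an extractive text compression dataset. Training on this dataset aligns the small language model with the compression objective.

To address problem 2, LLMLingua-2 treats prompt compression as a token classification problem, which guarantees that the compressed prompt stays faithful to the original. It uses a Transformer encoder as the underlying architecture so that all the information needed for prompt compression is captured from the full bidirectional context.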

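How to Build an Effective Prompt Compression Dataset

Data Distillation

Data distillation extracts knowledge from a large language model (such as GPT-4) to compress prompts effectively without losing essential information.

In LLMLingua-2, the instructions are carefully designed, as shown in Figure 16. They require GPT-4 to compress the text by omitting non-essential words from the original text, without adding any new words during generation.

The instructions also impose no compression ratio restriction. Instead, GPT-4 is prompted to compress the original text as much as possible while retaining as much information as possible.

Figure 16: Instructions used for data distillation. Source: LLMLingua-2.

As shown in Figure 17, GPT-4 tends to apply a high compression ratio to very long contexts, possibly because of its limited ability to handle long contexts. Such aggressive compression causes substantial information loss, which significantly hurts the performance of downstream tasks.

Figure 17: Compression ratio versus original context length on MeetingBank. GPT-4-32k is used with the output token limit set to 4096. Source: LLMLingua-2.

To mitigate this, LLMLingua-2 adopts chunk-wise compression: long texts are split into chunks of no more than 512 tokens, and GPT-4 is instructed to compress each chunk separately.

Data Annotation

At this point, data distillation has produced pairs of original texts and their compressed versions. The goal of data annotation is to assign a binary label to each token in the original text, indicating whether the token should be preserved after compression.

Because GPT-4 may not follow the instructions precisely, LLMLingua-2 uses a sliding-window technique to limit the search range. It also uses fuzzy matching to handle changes GPT-4 may make to the original words during compression.

Quality Control

LLMLingua-2 uses two quality control metrics to assess the compressed texts produced by GPT-4 distillation and the automatically annotated labels: Variation Rate (VR) and Alignment Gap (AG).

Variation Rate measures the percentage of words in the compressed text that differ from the original text. Alignment Gap assesses the quality of the automatically annotated labels.

Using these metrics, LLMLingua-2 can exclude low-quality samples and ensure the quality of the dataset.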
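
As a rough illustration of the first metric, the following sketch computes a variation rate for one sample. Whitespace tokenization and lowercasing are simplifying assumptions, and the function name `variation_rate` and the example sentences are hypothetical.

```python
def variation_rate(original: str, compressed: str) -> float:
    """Fraction of words in the compressed text that do not occur in the original."""
    original_words = set(original.lower().split())
    compressed_words = compressed.lower().split()
    if not compressed_words:
        return 0.0
    changed = sum(1 for w in compressed_words if w not in original_words)
    return changed / len(compressed_words)


original = "we should consider maybe revising the timeline"
compressed = "we should revise timeline"

vr = variation_rate(original, compressed)
print(vr)
```

Here "revise" does not occur in the original, so one of the four compressed words counts as a variation. A sample with a high rate suggests the model rewrote the text instead of only deleting words, which makes it a candidate for exclusion.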

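Compressor

Treated as a Binary Classification Problem

First, the prompt compression problem can be cast as a binary classification problem. The basic idea is to treat each lexical unit as an independent entity and assign it a label, "preserve" or "discard". This keeps the content of the compressed prompt intact while simplifying the model design.

Model Architecture

A Transformer-encoder-based feature encoder is used, with a linear classification layer on top.

This architecture captures bidirectional context for each lexical unit, providing the information needed for the compression task.

Compression Strategy

The compression strategy for the original prompt x has three steps. The target compression ratio is 1/τ, where τ is defined as the number of words in the compressed prompt divided by the number of words in the original prompt x.

First, determine the target number of tokens to preserve in the compressed prompt x̃: Ñ = τN.
Next, use the token classification model to predict the probability p_i that each word x_i is labeled "preserve".
Finally, keep the top Ñ words in the original prompt x with the highest p_i values, preserving their original order, to form the compressed prompt x̃.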
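
The three steps translate almost directly into code. In this minimal sketch the "preserve" probabilities are made up rather than predicted by a classifier, the target count is truncated with `int`, and the function name `compress_by_probability` is hypothetical.

```python
from typing import List


def compress_by_probability(words: List[str], keep_probs: List[float], tau: float) -> List[str]:
    """Keep the int(tau * N) words with the highest 'preserve' probability, in original order."""
    n_keep = int(tau * len(words))
    # Indices of the n_keep most probable words.
    top = sorted(range(len(words)), key=lambda i: keep_probs[i], reverse=True)[:n_keep]
    # Re-sort the indices so the kept words stay in their original order.
    return [words[i] for i in sorted(top)]


words = ["So", "um", "we", "should", "extend", "the", "timeline", "okay"]
keep_probs = [0.1, 0.05, 0.7, 0.8, 0.95, 0.3, 0.9, 0.2]  # made-up classifier outputs

result = compress_by_probability(words, keep_probs, tau=0.5)
print(result)
```

Since words are only selected and never rewritten, the compressed prompt is always a subsequence of the original.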

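Code

As shown above, the main work of LLMLingua-2 is building the compressor. So how do we use the compressor once we have it?

See the code below (the environment setup is the same as for LLMLingua). The main internal flow is in the function compress_prompt_llmlingua2.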

from llmlingua import PromptCompressor

PROMPT = "John: So, um, I've been thinking about the project, you know, and I believe we need to, uh, make some changes. I mean, we want the project to succeed, right? So, like, I think we should consider maybe revising the timeline.\n\nSarah: I totally agree, John. I mean, we have to be realistic, you know. The timeline is, like, too tight. You know what I mean? We should definitely extend it."

llm_lingua = PromptCompressor(
    model_name = "microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2 = True,
)
compressed_prompt = llm_lingua.compress_prompt(PROMPT, rate=0.33, force_tokens = ['\n', '?'])

## Or use LLMLingua-2-small model
# llm_lingua = PromptCompressor(
#     model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
#     use_llmlingua2=True,
# )

print('-' * 100)
print("original:")
print(PROMPT)

print('-' * 100)
print("compressed_prompt:")
print(compressed_prompt)

The result is shown in Figure 18:

Figure 18: Result of the LLMLingua-2 test code. Screenshot by the author.

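RECOMP

RECOMP introduces two types of trained compressors: extractive and abstractive. The extractive compressor selects useful sentences from the retrieved documents, while the abstractive compressor combines information from multiple documents to generate a summary.

Figure 19 shows where the compressor sits in RECOMP.

Figure 19: Architecture of RECOMP. Source: RECOMP.

Extractive Compressor

Given n sentences [s1, s2, …, sn] from the input document set, a dual-encoder model is trained. The model embeds each sentence si and the input sequence x into fixed-dimensional embeddings. The inner product of these embeddings indicates how much the LLM benefits from having si added to the input x when generating the target output sequence.

The compressor's final summary s consists of the top N sentences, ranked by their inner product with the input.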
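
The ranking step can be illustrated with toy vectors. In this sketch the two-dimensional embeddings are invented stand-ins for the output of a trained dual encoder, and the function names `dot` and `extractive_summary` are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def extractive_summary(sentences: List[str], sentence_embs: List[Sequence[float]],
                       query_emb: Sequence[float], top_n: int) -> List[str]:
    """Return the top_n sentences ranked by inner product with the input embedding."""
    scores = [dot(e, query_emb) for e in sentence_embs]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in top]


sentences = [
    "Paris is the capital of France.",
    "The Seine flows through Paris.",
    "Bananas are yellow.",
]
sentence_embs = [[0.9, 0.1], [0.6, 0.3], [0.0, 1.0]]  # made-up embeddings
query_emb = [1.0, 0.0]                                # made-up embedding of the input x

summary = extractive_summary(sentences, sentence_embs, query_emb, top_n=2)
print(summary)
```

Only the selected sentences are passed to the LLM, so the unrelated third sentence never reaches the prompt.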

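Abstractive Compressor

The abstractive compressor is an encoder-decoder model. It takes the concatenation of the input sequence x and the retrieved document set and outputs a summary s.

The approach uses an LLM (such as GPT-3) to generate a training dataset, filters that data, and then trains the encoder-decoder model on the filtered dataset.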

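Conclusion

This article covered prompt compression methods, including their classification, algorithmic principles, and code walkthroughs.

Among the methods discussed, LongLLMLingua is probably the better choice. We have implemented it in our research project.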
