Advanced RAG 09: Prompt Compression (Part 1)

Latest recommended article published on 2024-09-15 11:28:58

静愚 AGI

Article link: https://blog.csdn.net/JingYu_365/article/details/141116779


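Summary

This article introduces prompt compression techniques for making large language models more efficient, covering the taxonomy of methods, how the algorithms work, and their code implementations. It describes four main families of prompt compression methods: those based on information entropy, on soft prompt tuning, on data distillation, and on token merging or pruning.

Abstract

The article discusses prompt compression in advanced RAG and points out that large language models run into performance and cost problems when processing long texts. It first explains why prompt compression is needed: to reduce the number of input tokens and the processing time without sacrificing performance.

It then introduces four kinds of prompt compression methods:

Methods based on information entropy (such as Selective Context) use a small language model to compute the self-information or perplexity of each token in the original prompt and remove the tokens with lower perplexity;
Methods based on soft prompt tuning (such as AutoCompressor) fine-tune LLM parameters to adapt to a specific domain, but cannot be applied to black-box LLMs;
Methods based on data distillation (such as LLMLingua-2) distill data from an LLM to train a model that produces more interpretable text summaries, and can be applied to black-box LLMs without gradient updates;
Methods based on token merging or pruning (such as ToMe) usually require fine-tuning the model or generating intermediate results during inference.

The article also covers the principles and code implementation of LLMLingua, as well as the features and usage of LongLLMLingua, AutoCompressor, and LLMLingua-2.

Finally, it mentions RECOMP, which introduces two trained compressors: an extractive compressor and an abstractive compressor.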

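A RAG pipeline may run into two problems:

Large language models (LLMs) usually have a context length limit, and the longer the input text, the more time-consuming and costly the processing.
The retrieved contexts are not always useful. Within a larger chunk, only a small part may be relevant to the answer. In some cases, answering a particular question requires combining several chunks. This problem persists even after reranking.

Prompt compression for LLMs is one way to address these problems. Fundamentally, its goal is to retain the key information in the prompt so that each input token carries more value. This both improves model performance and lowers cost, as shown in the lower right of Figure 1.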

[Figure 1]

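Note that, as the purple dashed line in Figure 1 indicates, some compressors can also be applied directly to the retrieved contexts.

Overall, prompt compression methods fall into four main categories:

Methods based on information entropy, such as Selective Context, LLMLingua, and LongLLMLingua. These methods use a small language model to compute the self-information or perplexity of each token in the original prompt, then remove the tokens with lower perplexity.
Methods based on soft prompt tuning, such as AutoCompressor and GIST. These methods require fine-tuning the LLM's parameters for a specific domain and cannot be applied directly to black-box LLMs.
Methods based on data distillation, such as LLMLingua-2 and RECOMP. These methods first distill data from an LLM and then train a model to generate more interpretable text summaries. The resulting models transfer across different language models and can be applied to black-box LLMs without gradient updates.
Methods based on token merging or token pruning, such as ToMe and AdapLeR. These methods usually require fine-tuning the model or generating intermediate results during inference.

Since the fourth category was originally proposed for smaller models such as ViT or BERT, this article covers the principles of representative algorithms from the first three categories.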

Selective Context

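Insight

Figure 2 shows that an LLM does not need the full context or the complete conversation history to respond to a user's query. Even when relevant information is left out, LLMs can still produce the expected response. This is probably because LLMs can infer the missing information from contextual cues and from prior knowledge acquired during pretraining.

[Figure 2]

Therefore, we can optimize the context length by filtering out less informative content without hurting performance. This is the key idea behind Selective Context.

Selective Context uses a small language model (SLM) to determine the self-information of lexical units (such as sentences, phrases, or tokens) in a given context, and then uses that self-information to assess how informative they are. By selectively keeping content with higher self-information, Selective Context gives the LLM a more concise and efficient representation of the context, without affecting its performance across different tasks.

Self-Information

Selective Context uses self-information to assess the quality of content.

Self-information, also known as surprisal or information content, is a basic concept in information theory. It quantifies the amount of information conveyed by an event and is defined as the negative log-likelihood of a token: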

$$I(x) = -\log P(x)$$

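where I(x) is the self-information of token x and P(x) is its output probability.

In information theory, self-information quantifies the degree of uncertainty associated with an event. The rarer an event, the more information it conveys and the higher its self-information; conversely, common events convey less information and have lower self-information.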
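
As a quick illustration of the definition, here is a minimal sketch that evaluates the negative log-probability for a common and a rare token. The probabilities are made up for the example, and the natural logarithm is an assumption (it matches the torch.log call in the library code shown later).

```python
import math


def self_information(p: float) -> float:
    """Self-information I(x) = -log P(x), measured in nats (natural log)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(p)


# Hypothetical next-token probabilities, for illustration only.
probs = {"the": 0.5, "paradigm": 0.001}
info = {token: self_information(p) for token, p in probs.items()}

for token, value in info.items():
    print(f"{token!r}: {value:.3f} nats")
```

The rare token ends up with the larger value, which is why a filter that drops low self-information content tends to remove function words first.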

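Algorithm

To explain the principle more concretely, let's dig into the source code.

First, set up the environment by installing the required Python library and downloading the spaCy model.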

(base) Florian:~ Florian$ conda create -n "selective_context" python=3.10 
(base) Florian:~ Florian$ conda activate selective_context
(selective_context) Florian:~ Florian$ pip install selective-context
(selective_context) Florian:~ Florian$ python -m spacy download en_core_web_sm

After installation, the version is as follows:

(selective_context) Florian:~ Florian$ pip list | grep selective
selective-context   0.1.4

The test code is as follows:

from selective_context import SelectiveContext

sc = SelectiveContext(model_type='gpt2', lang='en')
text = "INTRODUCTION Continual Learning ( CL ) , also known as Lifelong Learning , is a promising learning paradigm to design models that have to learn how to perform multiple tasks across different environments over their lifetime [To uniform the language and enhance the readability of the paper we adopt the unique term continual learning ( CL ) .]. Ideal CL models in the real world should be deal with domain shifts , researchers have recently started to sample tasks from two different datasets . For instance , proposed to train and evaluate a model on Imagenet first and then challenge its performance on the Places365 dataset . considers more scenarios , starting with Imagenet or Places365 , and then moving on to the VOC/CUB/Scenes datasets. Few works propose more advanced scenarios built on top of more than two datasets."
context, reduced_content = sc(text)

# We can also adjust the reduce ratio
# context_ratio, reduced_content_ratio = sc(text, reduce_ratio = 0.5)

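The first run downloads the GPT-2 model, which is about 500 MB. The result of the test code is shown in Figure 3.

[Figure 3]

Next, let's look inside the sc(text) call. The internal source code is as follows: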

class SelectiveContext:
    ...
    ...
    def __call__(self, text: str, reduce_ratio: float = 0.35, reduce_level :str = 'phrase') -> List[str]:
        context = self.beautify_context(text)

        self.mask_ratio = reduce_ratio

        sents = [sent.strip() for sent in re.split(self.sent_tokenize_pattern, context) if sent.strip()]

        # You want the reduce happen at sentence level, phrase level, or token level?
        assert reduce_level in ['sent', 'phrase', 'token'], f"reduce_level should be one of ['sent', 'phrase', 'token'], got {reduce_level}"
        sent_lus, phrase_lus, token_lus = self._lexical_unit(sents)
        lexical_level = {
            'sent': sent_lus,
            'phrase': phrase_lus,
            'token': token_lus
        }

        # context is the reduced context, masked_sents denotes what context has been filtered out
        context, masked_sents = self.self_info_mask(lexical_level[reduce_level].text, lexical_level[reduce_level].self_info, reduce_level)
        return context, masked_sents

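The code above consists of three main steps:

Compute the self-information of each token in the context.
Merge tokens and their self-information into lexical units such as phrases or sentences.
Selectively keep the informative context.

Step 1: Compute self-information

Given a context C = x0, x1, …, xn, where each xi is a token, we use a causal language model (such as GPT-2, OPT, or LLaMA) to compute the self-information of each token xi: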

$$I(x_i) = -\log P(x_i \mid x_0, x_1, \ldots, x_{i-1})$$

If you are using GPT-2, the corresponding code is:

class SelectiveContext:
    ...
    ...    
    def _get_self_info_via_gpt2(self, text: str) -> Tuple[List[str], List[float]]:
        if self.lang == 'en':
            text = f"<|endoftext|>{text}"
        elif self.lang == 'zh':
            text = f"[CLS]{text}"
        with torch.no_grad():
            encoding = self.tokenizer(text, add_special_tokens=False, return_tensors='pt')
            encoding = encoding.to(self.device)
            outputs = self.model(**encoding)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=-1)
            self_info = -torch.log(probs)
        
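        input_ids = encoding['input_ids']
        # Skip the first position (the prepended start marker): nothing precedes it, so it has no score.
        input_ids_expaned = input_ids[:, 1:].unsqueeze(-1)

        # NOTE: the excerpt stops above. The two lines below are an assumed completion,
        # not confirmed against the library source.
        # The distribution at position i scores the token at position i + 1, so gather along the shifted ids.
        tokens = [self.tokenizer.decode(token_) for token_ in input_ids.squeeze().tolist()[1:]]
        return tokens, self_info[:, :-1].gather(-1, input_ids_expaned).squeeze(-1).squeeze(0).tolist()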
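
The snippet above takes the negative log of the softmax over the whole vocabulary at every position. To get one value per token, the distribution at each position has to be read at the id of the token that actually follows it. The sketch below shows that shift-and-gather step in plain Python with a toy three-word vocabulary. The logits and ids are invented, and the helper names are hypothetical, not part of selective-context.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def token_self_info(logits: List[List[float]], input_ids: List[int]) -> List[float]:
    """Return -log P(x_i | x_<i) for every token except the first.

    logits[i] is the model output after reading tokens 0..i, so it scores
    the token at position i + 1. The first token has no score.
    """
    infos = []
    for pos in range(len(input_ids) - 1):
        next_id = input_ids[pos + 1]
        infos.append(-log_softmax(logits[pos])[next_id])
    return infos


# Toy vocabulary of 3 ids and made-up logits for a 3-token input.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
toy_ids = [0, 0, 1]
toy_info = token_self_info(toy_logits, toy_ids)
print(toy_info)
```

The second score is larger than the first because the model (in this toy setup) put most of its probability on id 2, while the token that actually followed was id 1.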

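Step 2: Merge into lexical units

Filtering directly at the token level can make the context incoherent. For example, "2009" in the original prompt might be compressed to "209".

Therefore, in addition to token-level filtering, it is also important to filter at the phrase and sentence levels. The basic unit of filtering is called a lexical unit, which can be a token, a phrase, or a sentence.

How do we compute the self-information of each lexical unit u = (xt, …, xt+α)? By the additivity of self-information, we can sum the self-information of the tokens that make up u: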

$$I(u) = \sum_{i=t}^{t+\alpha} I(x_i)$$

The corresponding code is shown below, with debug output for some variables added:

class SelectiveContext:
    ...
    ...
    def _lexical_unit(self, sents):

        if self.sent_level_self_info:
            sent_self_info = []
            all_noun_phrases = []
            all_noun_phrases_info = []
            all_tokens = []
            all_token_self_info = []

            for sent in sents:
                # print(sent)
                tokens, self_info = self.get_self_information(sent)
                '''
                ipdb> sent
                'INTRODUCTION Continual Learning ( CL ) , also known as Lifelong Learning , is a promising learning paradigm to design models that have to learn how to perform multiple tasks across different environments over their lifetime [To uniform the language and enhance the readability of the paper we adopt the unique term continual learning ( CL ) .].'

                ipdb> tokens
                ['IN', 'TR', 'ODUCT', 'ION', ' Contin', 'ual', ' Learning', ' (', ' CL', ' )', ',', ' also', ' known', ' as', ' Lif', 'elong', ' Learning', ',', ' is', ' a', ' promising', ' learning', ' paradigm', ' to', ' design', ' models', ' that', ' have', ' to', ' learn', ' how', ' to', ' perform', ' multiple', ' tasks', ' across', ' different', ' environments', ' over', ' their', ' lifetime', ' [', 'To', ' uniform', ' the', ' language', ' and', ' enhance', ' the', ' read', 'ability', ' of', ' the', ' paper', ' we', ' adopt', ' the', ' unique', ' term', ' continual', ' learning', ' (', ' CL', ' )', '.', '].']

                ipdb> self_info
                [7.514791011810303, 1.632637619972229, 0.024813441559672356, 0.006853647995740175, 12.09920597076416, 2.1144468784332275, 9.457701683044434, 2.4503376483917236, 10.236454963684082, 0.8689146041870117, 5.269547939300537, 4.641763210296631, 0.22138957679271698, 0.010370315983891487, 10.071824073791504, 0.6905602216720581, 0.01698811538517475, 1.5882389545440674, 0.4495090842247009, 0.45371606945991516, 6.932497978210449, 6.087430477142334, 3.66465425491333, 3.3969509601593018, 7.337691307067871, 5.881226539611816, 1.7340556383132935, 4.599822521209717, 6.482723236083984, 4.045308589935303, 4.762691497802734, 0.21346867084503174, 3.7985599040985107, 4.6389899253845215, 0.33642446994781494, 4.918881416320801, 2.076707601547241, 3.3553669452667236, 5.5081071853637695, 5.625778675079346, 0.7966060638427734, 6.347291946411133, 12.772034645080566, 13.792041778564453, 4.11267614364624, 6.583715915679932, 3.3618998527526855, 8.434362411499023, 1.2423189878463745, 5.8330583572387695, 0.0013973338063806295, 0.3090735077857971, 1.1139129400253296, 4.160390853881836, 3.744772434234619, 7.2841596603393555, 1.4088190793991089, 7.86871337890625, 4.305004596710205, 9.69282341003418, 0.08665203303098679, 1.6127821207046509, 1.6296097040176392, 0.46206924319267273, 3.0398476123809814, 6.892032623291016]
                '''
                sent_self_info.append(np.mean(self_info))

                all_tokens.extend(tokens)
                all_token_self_info.extend(self_info)

                noun_phrases, noun_phrases_info = self._calculate_lexical_unit(tokens, self_info)
                '''
                ipdb> noun_phrases
                ['INTRODUCTION Continual Learning', ' (', ' CL', ' )', ',', ' also', ' known', ' as', ' Lifelong Learning', ',', ' is', ' a promising learning paradigm', ' to', ' design', ' models', ' that', ' have', ' to', ' learn', ' how', ' to', ' perform', ' multiple tasks', ' across', ' different environments', ' over', ' their lifetime', ' [', 'To', ' uniform', ' the language', ' and', ' enhance', ' the readability', ' of', ' the paper', ' we', ' adopt', ' the unique term continual learning', ' (', ' CL', ' )', '.', ']', '.']
                
                ipdb> noun_phrases_info
                [4.692921464797109, 2.4503376483917236, 10.236454963684082, 0.8689146041870117, 5.269547939300537, 4.641763210296631, 0.22138957679271698, 0.010370315983891487, 3.5931241369495788, 1.5882389545440674, 0.4495090842247009, 4.284574694931507, 3.3969509601593018, 7.337691307067871, 5.881226539611816, 1.7340556383132935, 4.599822521209717, 6.482723236083984, 4.045308589935303, 4.762691497802734, 0.21346867084503174, 3.7985599040985107, 2.487707197666168, 4.918881416320801, 2.7160372734069824, 5.5081071853637695, 3.2111923694610596, 6.347291946411133, 12.772034645080566, 13.792041778564453, 5.348196029663086, 3.3618998527526855, 8.434362411499023, 2.3589248929638416, 0.3090735077857971, 2.6371518969535828, 3.744772434234619, 7.2841596603393555, 4.672402499616146, 1.6127821207046509, 1.6296097040176392, 0.46206924319267273, 3.0398476123809814, 3.446016311645508, 3.446016311645508]
                '''

                # We need to add a space before the first noun phrase for every sentence except the first one
                if all_noun_phrases:
                    noun_phrases[0] = f" {noun_phrases[0]}"
                all_noun_phrases.extend(noun_phrases)
                all_noun_phrases_info.extend(noun_phrases_info)
            
            return [
                LexicalUnits('sent', text=sents, self_info=sent_self_info),
                LexicalUnits('phrase', text=all_noun_phrases, self_info=all_noun_phrases_info),
                LexicalUnits('token', text=all_tokens, self_info=all_token_self_info)
            ]
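
One detail is worth checking against the debug values printed in this listing. The text describes summing token self-information, but the phrase-level numbers appear to be averages: the sentence score uses np.mean, and the first noun phrase, 'INTRODUCTION Continual Learning', has a score equal to the mean of its seven token scores. The sketch below reproduces that value. The merge_units helper and its span format are hypothetical and only illustrate the aggregation, not the library's _calculate_lexical_unit.

```python
from statistics import fmean
from typing import List, Tuple


def merge_units(token_info: List[float], spans: List[Tuple[int, int]]) -> List[float]:
    """Average token self-information over each half-open span [start, end)."""
    return [fmean(token_info[start:end]) for start, end in spans]


# First seven token scores from the debug output above:
# 'IN', 'TR', 'ODUCT', 'ION', ' Contin', 'ual', ' Learning'
token_info = [
    7.514791011810303, 1.632637619972229, 0.024813441559672356,
    0.006853647995740175, 12.09920597076416, 2.1144468784332275,
    9.457701683044434,
]

phrase_info = merge_units(token_info, [(0, 7)])
print(phrase_info[0])      # compare with the first entry of noun_phrases_info
print(sum(token_info))     # the plain sum is much larger
```

Averaging keeps long phrases from scoring high simply because they contain more tokens, which matters once units of different lengths are compared against a single threshold.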

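Step 3: Selectively keep the informative context

Once the self-information of each lexical unit has been computed, the question is how to judge its informativeness. The paper proposes an adaptive, percentile-based filtering method to select the most informative content, which is preferable to using a fixed threshold or keeping a fixed number of top-k lexical units.

First, we rank the lexical units by self-information in descending order. Then we compute the p-th percentile of the self-information values across all lexical units. Finally, we selectively keep the lexical units whose self-information is greater than or equal to the p-th percentile.

The corresponding code is as follows: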

class SelectiveContext:
    ...
    ...

    def self_info_mask(self, sents: List[str], self_info: List[float], mask_level):
        # mask_level: mask sentences, phrases, or tokens
        sents_after_mask = []
        masked_sents = []
                
        self.ppl_threshold = np.nanpercentile(self_info, self.mask_ratio * 100)

        # if title is not None:
        #     with open(os.path.join(self.path, title+'_prob_token.tsv'), 'w', encoding='utf-8') as f:
        #         for token, info in zip(tokens, self_info):
        #             f.write(f"{token}\t{info}\n")
        #     with open(os.path.join(self.path, title+'_prob_sent.tsv'), 'w', encoding='utf-8') as f:
        #         for sent, info in zip(sents, sent_self_info):
        #             f.write(f"{sent}\n{info}\n\n")

        for sent, info in zip(sents, self_info):
            if info < self.ppl_threshold:
                masked_sents.append(sent)
                sents_after_mask.append(self.mask_a_sent(sent, mask_level))
            else:
                sents_after_mask.append(sent)
        masked_context = " ".join(sents_after_mask) if mask_level == 'sent' else "".join(sents_after_mask)
        
        return masked_context, masked_sents
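
To make the percentile rule concrete, here is a small standalone sketch that does not depend on NumPy. The percentile helper uses linear interpolation between ranks, which is the default behaviour of np.nanpercentile, but it does not skip NaN values. The units and scores are rounded from the debug output earlier in this section, and filter_units is a hypothetical name. Unlike the library code, it simply drops filtered units instead of passing them through mask_a_sent.

```python
from typing import List, Tuple


def percentile(values: List[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def filter_units(units: List[str], infos: List[float],
                 reduce_ratio: float = 0.35) -> Tuple[List[str], List[str]]:
    """Keep units whose self-information is at or above the percentile threshold."""
    threshold = percentile(infos, reduce_ratio * 100)
    kept = [u for u, s in zip(units, infos) if s >= threshold]
    dropped = [u for u, s in zip(units, infos) if s < threshold]
    return kept, dropped


units = [" a promising learning paradigm", " to", " design", " models", " that"]
infos = [4.28, 3.40, 7.34, 5.88, 1.73]
kept, dropped = filter_units(units, infos, reduce_ratio=0.4)
print("kept:   ", kept)
print("dropped:", dropped)
```

Because the threshold is derived from the distribution of scores in the current context, the same reduce_ratio removes roughly the same share of units whether the text is dense or repetitive.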

LLMLingua

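Overview

LLMLingua argues that Selective Context often ignores the interconnections between compressed contents, as well as the correlation between the LLM and the small language model used for prompt compression. LLMLingua addresses exactly these issues.

Specifically, as shown in Figure 4, **LLMLingua uses a budget controller to dynamically allocate different compression ratios to the components of the original prompt (such as the instruction, demonstrations, and question).** It also performs coarse-grained, demonstration-level compression to preserve semantic integrity even at high compression ratios. In addition, LLMLingua introduces a token-level iterative algorithm for fine-grained prompt compression.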

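[Figure 4]

Compared with Selective Context, LLMLingua retains the key information in the prompt more effectively while taking the conditional dependencies between tokens into account. It can compress prompts by up to 20x.

Budget controller

The budget controller is a key component of LLMLingua that dynamically allocates different compression ratios to different parts of the original prompt.

Different parts of the prompt differ in how sensitive they are to compression. For example, instructions and questions are more sensitive, while demonstrations are less sensitive. The budget controller therefore compresses the instruction and question lightly, preserving essential information, and compresses the demonstrations more aggressively to remove redundant information.

The budget controller's algorithm is shown in Figure 5: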

[Figure 5]

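The main variables are:

M𝑠: a small language model, such as GPT-2 or LLaMA.
x = (x^ins, x^dems, x^que): the original prompt, consisting of the instruction, demonstrations, and question.
𝐿, 𝐿_ins, 𝐿_dems, and 𝐿_que: the number of tokens in x, x^ins, x^dems, and x^que.
𝜏_dems: the compression rate for the demonstrations, computed from the overall target compression rate 𝜏 and the preset compression rates for the instruction and question (𝜏_ins and 𝜏_que).
D: the set that will contain the compressed demonstrations.

The main procedure is as follows:

Compute the compression rate for the demonstrations.
Use a small language model (such as GPT-2 or LLaMA) to compute the perplexity of each demonstration in the original demonstration set.
Sort all demonstrations by perplexity in descending order.
Iteratively select demonstrations and add them to the set D.
After compressing the demonstrations, allocate the remaining budget to the instruction and question.
Output the set D after coarse-grained compression.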
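
The procedure above can be sketched in a few lines. This is a simplified reading, not the code in control_context_budget. It assumes each rate 𝜏 is the fraction of tokens kept, so the overall budget satisfies 𝜏·𝐿 = 𝜏_ins·𝐿_ins + 𝜏_dems·𝐿_dems + 𝜏_que·𝐿_que. The slack factor k, the stopping rule, and all function names are assumptions made for the example, and the perplexities are supplied by hand instead of coming from a small language model.

```python
import math
from typing import List, Tuple

Demo = Tuple[str, int, float]  # (text, token count, perplexity)


def demo_rate(tau: float, tau_ins: float, tau_que: float,
              l_ins: int, l_dems: int, l_que: int) -> float:
    """Solve tau*L = tau_ins*L_ins + tau_dems*L_dems + tau_que*L_que for tau_dems."""
    total = l_ins + l_dems + l_que
    return (tau * total - (tau_ins * l_ins + tau_que * l_que)) / l_dems


def perplexity(token_log_probs: List[float]) -> float:
    """exp of the mean negative log-probability of the tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def select_demos(demos: List[Demo], tau_dems: float, k: float = 1.0) -> List[str]:
    """Add demonstrations in descending perplexity order until the budget is passed."""
    budget = k * tau_dems * sum(n for _, n, _ in demos)
    selected, used = [], 0
    for text, n_tokens, _ppl in sorted(demos, key=lambda d: d[2], reverse=True):
        if used > budget:  # assumed stopping rule; may overshoot by one demo
            break
        selected.append(text)
        used += n_tokens
    return selected


# Keep 40% of a 500-token prompt while leaving instruction and question intact.
tau_dems = demo_rate(tau=0.4, tau_ins=1.0, tau_que=1.0, l_ins=50, l_dems=400, l_que=50)
demos = [("demo A", 60, 12.0), ("demo B", 90, 30.0), ("demo C", 250, 8.0)]
selected = select_demos(demos, tau_dems)
print(tau_dems, selected)
```

In this example the demonstrations must shrink to a quarter of their size so that the instruction and question can be kept whole, and the two highest-perplexity demonstrations are selected before the loop stops. Any overshoot is left for the later token-level stage to trim.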

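Through this demonstration-level procedure, the budget controller retains key information during compression while effectively reducing the size of the original prompt. The approach is particularly suitable for prompts that contain multiple demonstrations.

The relevant code is in the control_context_budget function.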

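Iterative Token-level Prompt Compression (ITPC)

In natural language processing, perplexity measures how well a language model predicts unseen data; the lower the value, the more accurate the predictions. When perplexity is used for prompt compression, an independence assumption simplifies how the sequence is handled, but it can also cause the method to ignore more complex dependencies between tokens, which reduces accuracy.

In practice, the interactions and dependencies between tokens are often essential for language understanding and generation.

This oversight can cause key information to be lost during compression.

For example, under high-ratio compression, if a token supplies a key reasoning step or logical link in the context, deciding whether to keep it based on its perplexity alone may leave the reasoning chain incomplete.

To solve this problem, LLMLingua introduces the Iterative Token-level Prompt Compression (ITPC) algorithm.

Rather than relying solely on independent per-token probabilities, this method evaluates the importance of each token more precisely during compression. It does so by processing the prompt segment by segment and considering the conditional probability of each token in its current context, which helps preserve the dependencies between tokens.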
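
The segment-by-segment idea can be sketched as follows. This is an illustrative outline, not the library's iterative_compress_prompt. The score_fn callback is a hypothetical stand-in for the small language model: it returns one conditional score per token of the current segment, given the tokens already kept. The fixed threshold is also a simplification. The toy scorer just treats tokens already present in the context as predictable.

```python
from typing import Callable, List, Sequence

# (already-compressed context, current segment) -> one score per segment token
ScoreFn = Callable[[Sequence[str], Sequence[str]], List[float]]


def iterative_compress(tokens: List[str], segment_size: int,
                       threshold: float, score_fn: ScoreFn) -> List[str]:
    """Compress tokens one segment at a time.

    Each segment is scored conditioned on the compressed output of all
    earlier segments, so a decision made early changes later scores.
    """
    kept: List[str] = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        scores = score_fn(kept, segment)
        kept.extend(t for t, s in zip(segment, scores) if s > threshold)
    return kept


def toy_score(context: Sequence[str], segment: Sequence[str]) -> List[float]:
    """Toy scorer: a token already seen in the context carries no new information."""
    seen = set(context)
    return [0.0 if tok in seen else 1.0 for tok in segment]


tokens = ["study", "3", "hours", "study", "3", "hours", "each", "day"]
compressed = iterative_compress(tokens, segment_size=3, threshold=0.5, score_fn=toy_score)
print(compressed)
```

The second segment is dropped only because the first one was kept. Scoring every token once against the uncompressed prompt would not capture that dependency.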

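Figure 6 shows the detailed steps of ITPC:

[Figure 6]

Through this process, the ITPC algorithm can effectively shorten the prompt while preserving its semantic integrity, thereby reducing the LLM's inference cost.

The relevant code is in the iterative_compress_prompt function.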

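Instruction Tuning

Figure 4 shows that instruction tuning is also a key step in LLMLingua. Its purpose is to minimize the distribution gap between the small language model used for prompt compression and the LLM.

Figure 7 shows the steps of instruction tuning:

[Figure 7]

Code demonstration

First, set up the environment: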

(base) Florian:~ Florian$ conda create -n "llmlingua" python=3.11

(base) Florian:~ Florian$ conda activate llmlingua

(llmlingua) Florian:~ Florian$ pip install llmlingua

The installed version is as follows:

llmlingua          0.2.1

The test code is as follows:

from llmlingua import PromptCompressor

GSM8K_PROMPT = "Question: Angelo and Melanie want to plan how many hours over the next week they should study together for their test next week. They have 2 chapters of their textbook to study and 4 worksheets to memorize. They figure out that they should dedicate 3 hours to each chapter of their textbook and 1.5 hours for each worksheet. If they plan to study no more than 4 hours each day, how many days should they plan to study total over the next week if they take a 10-minute break every hour, include 3 10-minute snack breaks each day, and 30 minutes for lunch each day?\nLet's think step by step\nAngelo and Melanie think they should dedicate 3 hours to each of the 2 chapters, 3 hours x 2 chapters = 6 hours total.\nFor the worksheets they plan to dedicate 1.5 hours for each worksheet, 1.5 hours x 4 worksheets = 6 hours total.\nAngelo and Melanie need to start with planning 12 hours to study, at 4 hours a day, 12 / 4 = 3 days.\nHowever, they need to include time for breaks and lunch. Every hour they want to include a 10-minute break, so 12 total hours x 10 minutes = 120 extra minutes for breaks.\nThey also want to include 3 10-minute snack breaks, 3 x 10 minutes = 30 minutes.\nAnd they want to include 30 minutes for lunch each day, so 120 minutes for breaks + 30 minutes for snack breaks + 30 minutes for lunch = 180 minutes, or 180 / 60 minutes per hour = 3 extra hours.\nSo Angelo and Melanie want to plan 12 hours to study + 3 hours of breaks = 15 hours total.\nThey want to study no more than 4 hours each day, 15 hours / 4 hours each day = 3.75\nThey will need to plan to study 4 days to allow for all the time they need.\nThe answer is 4\n\nQuestion: You can buy 4 apples or 1 watermelon for the same price. You bought 36 fruits evenly split between oranges, apples and watermelons, and the price of 1 orange is $0.50. How much does 1 apple cost if your total bill was $66?\nLet's think step by step\nIf 36 fruits were evenly split between 3 types of fruits, then I bought 36/3 = 12 units of each fruit\nIf 1 orange costs $0.50 then 12 oranges will cost $0.50 * 12 = $6\nIf my total bill was $66 and I spent $6 on oranges then I spent $66 - $6 = $60 on the other 2 fruit types.\nAssuming the price of watermelon is W, and knowing that you can buy 4 apples for the same price and that the price of one apple is A, then 1W=4A\nIf we know we bought 12 watermelons and 12 apples for $60, then we know that $60 = 12W + 12A\nKnowing that 1W=4A, then we can convert the above to $60 = 12(4A) + 12A\n$60 = 48A + 12A\n$60 = 60A\nThen we know the price of one apple (A) is $60/60= $1\nThe answer is 1\n\nQuestion: Susy goes to a large school with 800 students, while Sarah goes to a smaller school with only 300 students.  At the start of the school year, Susy had 100 social media followers.  She gained 40 new followers in the first week of the school year, half that in the second week, and half of that in the third week.  Sarah only had 50 social media followers at the start of the year, but she gained 90 new followers the first week, a third of that in the second week, and a third of that in the third week.  After three weeks, how many social media followers did the girl with the most total followers have?\nLet's think step by step\nAfter one week, Susy has 100+40 = 140 followers.\nIn the second week, Susy gains 40/2 = 20 new followers.\nIn the third week, Susy gains 20/2 = 10 new followers.\nIn total, Susy finishes the three weeks with 140+20+10 = 170 total followers.\nAfter one week, Sarah has 50+90 = 140 followers.\nAfter the second week, Sarah gains 90/3 = 30 followers.\nAfter the third week, Sarah gains 30/3 = 10 followers.\nSo, Sarah finishes the three weeks with 140+30+10 = 180 total followers.\nThus, Sarah is the girl with the most total followers with a total of 180.\nThe answer is 180"

llm_lingua = PromptCompressor()

## Or use the phi-2 model:
# llm_lingua = PromptCompressor("microsoft/phi-2")

## Or use a quantized model such as TheBloke/Llama-2-7b-Chat-GPTQ, which needs less than 8 GB of GPU memory.
## Before that, run: pip install optimum auto-gptq
# llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", model_config={"revision": "main"})

compressed_prompt = llm_lingua.compress_prompt(GSM8K_PROMPT.split("\n\n")[0], instruction="", question="", target_token=200)

print('-' * 100)
print("original:")
print(GSM8K_PROMPT.split("\n\n")[0])

print('-' * 100)
print("compressed_prompt:")
print(compressed_prompt)