
LLMs:《Optimizing your LLM in production在生产环境中优化您的LLM》翻译与解读—LLM在实际应用中面临的两大挑战(内存需求+对更长上下文输入需求)+提升LLM部署效率的三大技术(低精度量化+更高效的自注意力算法Flash Attention+优化模型结构【位置嵌入/键-值缓存】)

导读:总结了LLM在实际应用中面临的两大挑战,以及提升LLM部署效率的三大技术方法。这篇文章从理论和实践两个层面全面介绍了采用低精度量化、更高效的自注意力算法、优化后的模型结构可以很好地解决LLM在生产环境推理过程中的计算和内存瓶颈问题。最后,文中还提到了推测解码作为前景方向之一。

1、LLMs的两大挑战:LLM在产业应用中需要面对越来越大的内存需求和对长上下文输入的支持需求。

2、三种有效技术来高效部署LLMs

>> 低精度技术—使用更低精度来表示权重:利用低精度来降低内存消耗,将权重量化到较低的精度来减小模型大小。如使用半精度(float16)替代单精度(float32)可以减半内存需求,进一步使用8位或4位量化也可以进一步降低内存需求。

>> 使用更高效的自注意力算法—Flash Attention:Flash Attention技术实现了内存开销线性增长的注意力计算,采用Flash Attention算法替代传统自注意力层,使注意力的内存开销从平方级下降到线性级,大幅提高长序列处理能力

>> 优化模型结构架构创新:优化LLM的架构设计,包括改进位置嵌入和键-值缓存等方法以提高长文本输入和多轮对话任务的效率。

>>>>使用相对位置嵌入替代绝对位置嵌入更适用于长输入,如RoPE和ALiBi,可以让模型在更长序列上也表现出色。

>>>>使用KV缓存(键-值缓存),缓存先前time step的键-值向量,可以有效减少重复计算和内存开销,提高decoding效率。MQA和GQA算法还可以进一步降低内存消耗。

通过这三大技术,可以有效提升LLM在实际应用场景中的部署效率,实现在削减精度和内存的同时保持或提升模型效果,适应日益增长的计算需求。

目录

相关文章

LLMs:预训练大模型实现全流程详解之模型训练各种技巧原理细讲—3.2、模型预训练及优化:参数优化(前置参数/超参)+结构优化(优化器/激活函数/位置嵌入/注意力机制/归一化的方法和位置)+降内存优化(分布式5大并行策略)+提速优化(词表裁剪/梯度累积GA/梯度检查点GC/AMP训练/4-bit量化/ZeRO)之详细攻略

《Optimizing your LLM in production》翻译与解读

LLMs已成为现代知识型产业中不可或缺的工具:面临两大挑战—内存需求+更长的上下文需求

解决高效部署LLM的三大有效技术:低精度、Flash Attention、架构创新

1. Harnessing the Power of Lower Precision发挥低精度的威力

单精度→半精度

单精度(float32)下所需的VRAM(GB为单位)是参数(B为单位)的4倍

半精度(float16或bfloat16)下所需的VRAM(GB为单位)是参数(B为单位)的2倍

大多数模型超过80GB,而目前单卡最大为A100-80GB,故需使用并行技术(TP和PP):Transformers不支持开箱即用的张量并行但支持简单的流水线并行

代码实现:利用8 x 80GB的A100节点加载BLOOM

代码实现:利用1 x A100-80GB加载bigcode/octocoder→推理的内存需求约为31 GB

目前大多数模型支持bfloat16(因为大多数GPU支持)

模型量化技术—进一步权重量化8位或4位(几乎不会显著降低性能)

若硬件GPU不够32GB可以采用量化技术

量化技术的宗旨:降低权重的精度同时尽量保持模型推理准确性

量化技术适合文本生成场景的原因:只是选择最可能的下一个标记的集合,而不是真正关心下一个token logit分布的确切值(因为是下一个token logit分布保持大致相同)

量化技术的原理—保证计算高精度(BF16):将模型权重量化到低精度(目标精度),在运算时动态还原精度,保证计算精度,计算后再将权重量化,从而降低模型大小但不影响效果

代码实现:模型8位量化,15B模型需同等level的15.2GB,验证了8位量化可在消费级GPU(如4090)上运行

代码实现:模型4位量化,15B模型仅需9.5GB,可在消费级GPU(如3090)上运行

对比——4位量化、8位量化、bfloat16推理相比:模型准确性几乎未降,但4位量化的推理速度略慢

模型4位量化:可在消费级GPU(如RTX3090/V100/T4)上运行

降显存更多的内容可查看AutoGPTQ或Transformers量化文档

模型量化的本质(权衡内存效率和准确性):如果有大把GPU资源可以不用考虑量化技术

2. Flash Attention: A Leap Forward—闪光注意力:一个重大飞跃

LLMs之FlashAttention-2:《FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning更快的注意力与更好的并行性和工作分区》翻译与解读

目前优秀的主流大模型核心架构基本一致:前馈层、激活层、层归一化层、自注意力层

传统自注意力层【平方级增长】:传统自注意层的计算和内存复杂度随输入序列长度呈【平方级增长】,给长序列模型的应用带来严重挑战

Tri Dao提出Flash Attention【线性增长】

Flash Attention的原理:跟踪softmax归一化统计信息和优化计算,实现与默认注意力层数值相同但内存开销【线性增长】的效果,更快速且更节省内存

观察是否使用Flash Attention下模型性能的变化:通过将系统提示重复10次,使输入长度足够长,以观察内存节省效果

3. 架构背后的科学:为长文本输入和聊天选择策略

提高计算效率和内存效率的方式:低精度转换、更高效的自注意力算法

改变LLM的架构使其对需要长文本输入的任务最有效和高效

模型架构的两个重要组成部分:位置嵌入、键-值缓存

3.1 Improving positional embeddings of LLMs改进LLM的位置嵌入

自注意力的意义:建立关联

位置嵌入让LLMs理解句子顺序:因由QK^T计算的概率分数+O(1)次计算中完成→没有位置嵌入的LLM导致token间的距离均相同—即无法区分Hello I love you和You love I hello

绝对位置嵌入(按固定长度训练+给出每个词位置的唯一向量表示【为每个位置的id编码了一个唯一的嵌入】+与词本身语义无关+对长文本效果不佳→会导致LLM性能差):【Transformer的原始论文提出】基于正弦函数的正弦位置嵌入→基于可学习的位置嵌入(非固定嵌入+在训练期间学习而得+必须在固定的输入长度N上训练+但是固定输入长度导致难以外推)→提出如果模型学习输入标记之间的相对位置距离(而非绝对位置)会更有优势

相对位置嵌入(关注词间相对距离【而非绝对位置】+符合语言本质→弥补了绝对位置编码在长文本模型和外推上的不足)—在自注意力算法中直接提示LLM句子顺序最佳(在注意力计算中加入相对位置依赖项):RoPE、ALiBi

RoPE(通过旋转矩阵表达相对距离关系):如PaLM、Llama、Falcon

ALiBi(通过添加位置偏差值调整注意力分布):如MPT、BLOOM

RoPE和ALiBi(具有更好的扩展性)相对位置编码可以对训练时未见过的输入长度进行推理:两种均是基于相对位置的启发,可以降低远距离token间的关联性

3.2 The key-value cache键-值缓存/KV缓存

LLMs:预训练大模型实现全流程详解之模型训练各种技巧原理细讲—3.2、模型预训练及优化:参数优化+结构优化+降内存优化(……/KV缓存技术)之详细攻略

自回归生成的工作原理:因果语言建模本质会掩盖注意力得分的上三角矩阵

token从不依赖于其后的token→为减少不必要的计算使用键-值缓存

键-值缓存的两个优点:计算高效率+内存仅线性增长

键-值缓存在Chat场景案例(需要多次自回归解码的应用)表现更优:利用存储的上下文信息加快后续解码

显著降低存储键-值缓存的内存成本的两大技巧:都是在保留自回归解码需要的key-value缓存基础的同时+都通过减少key-value投影权重数来提升效率​​​​​​​

Multi-Query-Attention (MQA)多查询注意力:使用单个key-value投影权重替代原始多头attentions结构,两个优点(节省内存+提高计算效率),比如PaLM、MPT、BLOOM、Falcon

Grouped-Query-Attention (GQA)分组查询注意力:使用多个(如2-8个)而不是单个key-value投影权重+保留更多模型容量的同时还可以获得MQA的内存和计算效率提升,如LLaMA2

对比:MQA多查询注意力、GQA分组查询注意力

LLMs:预训练大模型实现全流程详解之模型训练各种技巧原理细讲—3.2、模型预训练及优化:参数优化(前置参数/超参)+结构优化(优化器/激活函数/位置嵌入/注意力机制/归一化的方法和位置)+降内存优化(分布式5大并行策略)+提速优化(词表裁剪/梯度累积GA/梯度检查点GC/AMP训练/4-bit量化/ZeRO)之详细攻略

Conclusion结论

有前途的方向之一推测解码Speculative Decoding


相关文章

LLMs:预训练大模型实现全流程详解之模型训练各种技巧原理细讲—3.2、模型预训练及优化:参数优化(前置参数/超参)+结构优化(优化器/激活函数/位置嵌入/注意力机制/归一化的方法和位置)+降内存优化(分布式5大并行策略)+提速优化(词表裁剪/梯度累积GA/梯度检查点GC/AMP训练/4-bit量化/ZeRO)之详细攻略

https://blog.csdn.net/qq_41185868/article/details/131606747

《Optimizing your LLM in production》翻译与解读​​​​​​​

地址

博客地址:https://huggingface.co/blog/optimize-llm

GitHub地址:https://github.com/huggingface/blog/blob/main/optimize-llm.md

时间

2023年9月15日

作者

HuggingFace

LLMs已成为现代知识型产业中不可或缺的工具:面临两大挑战—内存需求+更长的上下文需求

Large Language Models (LLMs) such as GPT3/4, Falcon, and LLama are rapidly advancing in their ability to tackle human-centric tasks, establishing themselves as essential tools in modern knowledge-based industries. Deploying these models in real-world tasks remains challenging, however:

>> To exhibit near-human text understanding and generation capabilities, LLMs currently require to be composed of billions of parameters (see Kaplan et al, Wei et. al). This consequently amplifies the memory demands for inference.

>> In many real-world tasks, LLMs need to be given extensive contextual information. This necessitates the model's capability to manage very long input sequences during inference.

The crux of these challenges lies in augmenting the computational and memory capabilities of LLMs, especially when handling expansive input sequences.

大型语言模型(LLM)如GPT3/4、Falcon和LLama正在迅速提高其处理人类中心任务的能力,它们已经成为现代知识型产业中不可或缺的工具。然而,在实际任务中部署这些模型仍然具有挑战性

>>为了展示接近人类文本理解和生成能力,LLM目前需要由数十亿参数组成(请参见Kaplan等人,Wei等人的研究)。这导致了推理时的内存需求增加。

>>在许多实际任务中,LLM需要提供大量的上下文信息。这要求模型在推理过程中能够处理非常长的输入序列。

这些挑战的核心在于增强LLM的计算和内存能力,特别是在处理大规模输入序列时。

解决高效部署LLM的三大有效技术:低精度、Flash Attention、架构创新

In this blog post, we will go over the most effective techniques at the time of writing this blog post to tackle these challenges for efficient LLM deployment:

>> Lower Precision: Research has shown that operating at reduced numerical precision, namely 8-bit and 4-bit, can achieve computational advantages without a considerable decline in model performance.

>> Flash Attention: Flash Attention is a variation of the attention algorithm that not only provides a more memory-efficient approach but also realizes increased efficiency due to optimized GPU memory utilization.

>> Architectural Innovations: Considering that LLMs are always deployed in the same way during inference, namely autoregressive text generation with a long input context, specialized model architectures have been proposed that allow for more efficient inference. The most important advancement in model architectures hereby are Alibi, Rotary embeddings, Multi-Query Attention (MQA) and Grouped-Query-Attention (GQA).

在本博文中,我们将介绍编写本博文时最有效的技术,以解决高效部署LLM所面临的这些挑战:

>> 低精度:研究表明,使用较低的数值精度,即8位和4位,可以在不显著降低模型性能的情况下实现计算优势。

>> 快闪注意力(Flash Attention):快闪注意力是一种注意力算法的变体,它不仅提供了更节省内存的方法,还由于优化了GPU内存利用效率而实现了更高的效率。

>> 架构创新:考虑到LLM在推理过程中始终以相同的方式部署,即自回归文本生成与长输入上下文,已经提出了专用的模型架构,以实现更高效的推理。这方面的最重要的进展是Alibi、Rotary embeddings、多查询注意力(MQA)和分组查询注意力(GQA)。

Throughout this notebook, we will offer an analysis of auto-regressive generation from a tensor's perspective. We delve into the pros and cons of adopting lower precision, provide a comprehensive exploration of the latest attention algorithms, and discuss improved LLM architectures. While doing so, we run practical examples showcasing each of the feature improvements.

在本文中,我们将从张量的角度分析自回归生成。我们将深入探讨采用低精度的利弊全面探讨最新的注意力算法,并讨论改进的LLM架构。在这个过程中,我们将运行展示每个功能改进的实际示例。

1. Harnessing the Power of Lower Precision发挥低精度的威力

单精度→半精度

单精度(float32)下所需的VRAM(GB为单位)是参数(B为单位)的4倍

Memory requirements of LLMs can be best understood by seeing the LLM as a set of weight matrices and vectors and the text inputs as a sequence of vectors. In the following, the definition weights will be used to signify all model weight matrices and vectors.

通过将LLM视为一组权重矩阵和向量,并将文本输入视为一系列向量,可以最好地理解LLM的内存需求。在下文中,定义权重将用于表示所有模型权重矩阵和向量。

At the time of writing this post, LLMs consist of at least a couple billion parameters. Each parameter thereby is made of a decimal number, e.g. 4.5689 which is usually stored in either float32, bfloat16, or float16 format. This allows us to easily compute the memory requirement to load the LLM into memory:

Loading the weights of a model having X billion parameters requires roughly 4 * X GB of VRAM in float32 precision

在撰写本文时,LLM至少包含数十亿个参数。每个参数都由一个十进制数表示,例如4.5689,通常以float32、bfloat16或float16格式存储。这使我们能够轻松地计算将LLM加载到内存所需的内存:

加载具有n十亿(B)参数的模型的权重通常需要大约4*n GB的float32精度的VRAM

半精度(float16或bfloat16)下所需的VRAM(GB为单位)是参数(B为单位)的2倍

Nowadays, models are however rarely trained in full float32 precision, but usually in bfloat16 precision or less frequently in float16 precision. Therefore the rule of thumb becomes:

Loading the weights of a model having X billion parameters requires roughly 2 * X GB of VRAM in bfloat16/float16 precision

现在,模型很少以完整的float32精度进行训练,而通常以bfloat16精度或较少的float16精度进行训练。因此,一个经验法则是:

加载具有X十亿参数的模型的权重通常需要大约2 * X GB的bfloat16/float16精度的VRAM

For shorter text inputs (less than 1024 tokens), the memory requirement for inference is very much dominated by the memory requirement to load the weights. Therefore, for now, let's assume that the memory requirement for inference is equal to the memory requirement to load the model into the GPU VRAM.

To give some examples of how much VRAM it roughly takes to load a model in bfloat16:

GPT3 requires 2 * 175 GB = 350 GB VRAM

Bloom requires 2 * 176 GB = 352 GB VRAM

Llama-2-70b requires 2 * 70 GB = 140 GB VRAM

Falcon-40b requires 2 * 40 GB = 80 GB VRAM

MPT-30b requires 2 * 30 GB = 60 GB VRAM

bigcode/starcoder requires 2 * 15.5 = 31 GB VRAM

对于较短的文本输入(少于1024个标记),推理的内存需求很大程度上受到加载权重的内存需求。因此,目前,让我们假设推理的内存需求等于将模型加载到GPU VRAM中所需的内存。

bfloat16中加载一个模型大概需要多少VRAM的一些例子:

GPT3需要2*175 GB = 350GB

Bloom需要2*176 GB = 352GB

Llama-2-70b需要2*70 GB = 140GB

Falcon-40b需要2*40 GB = 80GB

MPT-30b需要2*30 GB = 60GB

bigcode/starcoder需要2*15.5 = 31GB
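These rules of thumb are easy to turn into a small helper. Below is a minimal sketch (our own illustration, not code from the original post; the function name is made up) that estimates the VRAM needed just to hold the weights:

def estimate_weight_vram_gb(num_params_billions: float, bytes_per_param: float = 2.0) -> float:
    # Rough VRAM needed just to hold the weights (ignores activations and the key-value cache).
    # bytes_per_param: 4 for float32, 2 for bfloat16/float16, 1 for 8-bit, 0.5 for 4-bit weights.
    return num_params_billions * bytes_per_param

for name, params in [("GPT3", 175), ("Bloom", 176), ("Llama-2-70b", 70), ("Falcon-40b", 40), ("MPT-30b", 30), ("bigcode/starcoder", 15.5)]:
    print(f"{name}: ~{estimate_weight_vram_gb(params):.0f} GB in bfloat16")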

大多数超过80GB+又因硬件最大为A00-80G=故需使用并行技术(TP和PP):Transformers不支持开箱即用的张量并行但支持简单的流水线并行

As of writing this document, the largest GPU chip on the market is the A100 offering 80GB of VRAM. Most of the models listed before require more than 80GB just to be loaded and therefore necessarily require tensor parallelism and/or pipeline parallelism.

撰写本文时,市场上最大的GPU芯片是A100,提供80GB的VRAM。在此之前列出的大多数模型都需要超过80GB的VRAM才能加载,因此必须使用张量并行性和/或流水线并行性

🤗 Transformers does not support tensor parallelism out of the box as it requires the model architecture to be written in a specific way. If you're interested in writing models in a tensor-parallelism-friendly way, feel free to have a look at the text-generation-inference library.

Naive pipeline parallelism is supported out of the box. For this, simply load the model with device_map="auto" which will automatically place the different layers on the available GPUs as explained here. Note, however that while very effective, this naive pipeline parallelism does not tackle the issues of GPU idling. For this more advanced pipeline parallelism is required as explained here.

Transformers不支持开箱即用的张量并行,因为它需要以特定的方式编写模型架构。如果您对以张量并行友好的方式编写模型感兴趣,请随时查看text-generation-inference库。

简单的流水线并行是开箱即用的。为此,只需使用device_map="auto"加载模型,它将自动将不同的层放置在可用的GPU上,如此处所解释的。请注意,尽管非常有效,但这种简单的流水线并行无法解决GPU空闲问题。要实现更高级的流水线并行,请参考此处的解释。

代码实现:利用8 x 80GB的A100节点加载BLOOM

If you have access to an 8 x 80GB A100 node, you could load BLOOM as follows

!pip install transformers accelerate bitsandbytes optimum

# from transformers import AutoModelForCausalLM

# model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", pad_token_id=0)

By using device_map="auto" the attention layers would be equally distributed over all available GPUs.

如果您可以访问一个8 x 80GB的A100节点,您可以按以下方式加载BLOOM:

通过使用device_map="auto",注意力层将平均分布在所有可用的GPU上

代码实现:利用1 x A100-80GB加载bigcode/octocoder→推理的内存需求约为31 GB

In this notebook, we will use bigcode/octocoder as it can be run on a single 40 GB A100 GPU device chip. Note that all memory and speed optimizations that we will apply going forward, are equally applicable to models that require model or tensor parallelism.

Since the model is loaded in bfloat16 precision, using our rule of thumb above, we would expect the memory requirement to run inference with bigcode/octocoder to be around 31 GB VRAM. Let's give it a try.

在这个笔记本中,我们将使用bigcode/octocoder,因为它可以在单个40 GB的A100 GPU设备芯片上运行。请注意,我们将在之后应用的所有内存和速度优化对需要模型并行或张量并行的模型同样适用。

由于模型以bfloat16精度加载,根据我们上面的经验法则,我们预计使用bigcode/octocoder进行推理的内存需求将约为31 GB VRAM。让我们试一下。

We first load the model and tokenizer and then pass both to Transformers' pipeline object.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

import torch

model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto", pad_token_id=0)

tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer:"

result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]

Result

Output:

Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\nreturn bytes / 1024 / 1024 / 1024\n```\n\n

This function takes a single

Nice, we can now directly use the result to convert bytes into Gigabytes.

def bytes_to_giga_bytes(bytes):

  return bytes / 1024 / 1024 / 1024

Let's call torch.cuda.max_memory_allocated to measure the peak GPU memory allocation.

bytes_to_giga_bytes(torch.cuda.max_memory_allocated())

Output:

29.0260648727417

我们首先加载模型和分词器,然后将它们传递给Transformers的pipeline对象。

这是一个将字节转换为千兆字节的Python函数

这个函数接受一个参数

好的,现在我们可以直接使用结果将字节转换为千兆字节。

让我们调用torch.cuda.max_memory_allocated来测量GPU内存的峰值分配。

Close enough to our back-of-the-envelope computation! We can see the number is not exactly correct as going from bytes to kilobytes requires a multiplication of 1024 instead of 1000. Therefore the back-of-the-envelope formula can also be understood as an "at most X GB" computation. Note that if we had tried to run the model in full float32 precision, a whopping 64 GB of VRAM would have been required.

非常接近我们估算的结果!我们可以看到这个数字并不完全准确,因为从字节到千字节的转换需要乘以1024而不是1000。因此,这个估算可以理解为"最多X GB"的估算。请注意,如果我们尝试以完全的float32精度运行模型,将需要64GB的VRAM。

目前大多数模型支持bfloat16(因为大多数GPU支持)

Almost all models are trained in bfloat16 nowadays, there is no reason to run the model in full float32 precision if your GPU supports bfloat16. Float32 won't give better inference results than the precision that was used to train the model.

If you are unsure in which format the model weights are stored on the Hub, you can always look into the checkpoint's config under "torch_dtype", e.g. here. It is recommended to set the model to the same precision type as written in the config when loading with from_pretrained(..., torch_dtype=...), except when the original type is float32, in which case one can use either float16 or bfloat16 for inference.

如今,大多数模型都以bfloat16进行训练,如果您的GPU支持bfloat16,就没有理由以完全的float32精度运行模型。float32不会比用于训练模型的精度提供更好的推理结果

如果你不确定模型权重以哪种格式存储在Hub上,你可以在“torch_dtype”下查看检查点的配置,例如这里。建议在使用from_pretrained(..., torch_dtype=...)时将模型设置为与配置中写入的相同的精度类型。除非原始类型为float32,在这种情况下可以在推理中使用float16或bfloat16

Let's define a flush(...) function to free all allocated memory so that we can accurately measure the peak allocated GPU memory.

del pipe

del model

import gc

import torch

def flush():

  gc.collect()

  torch.cuda.empty_cache()

  torch.cuda.reset_peak_memory_stats()

Let's call it now for the next experiment.

flush()

In the recent version of the accelerate library, you can also use an utility method called release_memory()

from accelerate.utils import release_memory

# ...

release_memory(model)

让我们定义一个flush(...)函数,以释放所有分配的内存,这样我们就可以准确地测量GPU内存分配的峰值

现在,让我们为下一次实验调用它。

在最近的加速库版本中,您还可以使用一个叫做release_memory()的实用工具方法来释放内存

模型量化技术—进一步权重量化8位或4位(几乎不会显著降低性能)

若硬件GPU不够32GB可以采用量化技术

Now what if your GPU does not have 32 GB of VRAM? It has been found that model weights can be quantized to 8-bit or 4-bit without a significant loss in performance (see Dettmers et al.). Models can even be quantized to 3 or 2 bits with an acceptable loss in performance as shown in the recent GPTQ paper.

那么,如果您的GPU没有32GB的VRAM怎么办?已经发现,可以将模型权重量化为8位或4位,而不会显著降低性能(请参见Dettmers等人的研究)。模型可以量化为3位或2位,性能损失可接受,如最近的GPTQ论文所示。

量化技术的宗旨:降低权重的精度同时尽量保持模型推理准确性
量化技术适合文本生成场景的原因:只是选择最可能的下一个标记的集合,而不是真正关心下一个token logit分布的确切值(因为是下一个token logit分布保持大致相同)

Without going into too many details, quantization schemes aim at reducing the precision of weights while trying to keep the model's inference results as accurate as possible (a.k.a as close as possible to bfloat16). Note that quantization works especially well for text generation since all we care about is choosing the set of most likely next tokens and don't really care about the exact values of the next token logit distribution. All that matters is that the next token logit distribution stays roughly the same so that an argmax or topk operation gives the same results.

在不涉及太多细节的情况下,量化方案旨在降低权重的精度,同时尽量保持模型的推理结果尽可能准确(也就是尽可能接近bfloat16)。请注意,量化在文本生成方面特别有效,因为我们关心的只是选择最可能的下一个标记的集合,而不是真正关心下一个token logit分布的确切值。重要的是下一个token logit分布保持大致相同,以便argmax或topk操作给出相同的结果。

量化技术的原理—保证计算高精度(BF16):将模型权重量化到低精度(目标精度),在运算时动态还原精度,保证计算精度,计算后再将权重量化,从而降低模型大小但不影响效果

There are various quantization techniques, which we won't discuss in detail here, but in general, all quantization techniques work as follows:

>> Quantize all weights to the target precision

>> Load the quantized weights, and pass the input sequence of vectors in bfloat16 precision

>> Dynamically dequantize weights to bfloat16 to perform the computation with their input vectors in bfloat16 precision

>> Quantize the weights again to the target precision after computation with their inputs.

有各种各样的量化技术,我们不会在这里详细讨论,但总的来说,所有量化技术的工作原理如下:

>> 量化所有权重到目标精度。

>> 加载量化的权重,并将向量输入序列以bfloat16精度传递

>> 在计算中使用bfloat16的输入向量时,动态将权重去量化为bfloat16

>> 在与输入计算后,再次将权重量化为目标精度

In a nutshell, this means that inputs-weight matrix multiplications, with X being the inputs, W being a weight matrix and Y being the output:

Y = X * W

are changed to

Y = X * dequantize(W), with the weights re-quantized to the target precision after the computation,

for every matrix multiplication. Dequantization and re-quantization is performed sequentially for all weight matrices as the inputs run through the network graph.

简而言之,这意味着输入-权重矩阵乘法(其中X为输入,W为权重矩阵,Y为输出):

Y = X * W

被改写为

Y = X * dequantize(W)(计算完成后再将权重重新量化回目标精度),

并对每个矩阵乘法都如此处理。当输入流经网络计算图时,会对所有权重矩阵依次执行反量化和重新量化。

Therefore, inference time is often not reduced when using quantized weights, but rather increases. Enough theory, let's give it a try! To quantize the weights with Transformers, you need to make sure that the bitsandbytes library is installed.

因此,当使用量化权重时,推理时间通常不会减少,而是增加。足够的理论,让我们试试吧!要使用Transformers进行权重量化,您需要确保已安装bitsandbytes库。
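Before moving on to the bitsandbytes-based code below, here is a minimal absmax-style 8-bit quantize/dequantize sketch in plain PyTorch that makes the four steps above concrete. It only illustrates the idea and is not the scheme load_in_8bit=True actually uses (real schemes work with finer-grained scales and outlier handling):

import torch

def quantize_absmax_int8(w: torch.Tensor):
    # One scale for the whole tensor; real schemes use per-row or per-block scales.
    scale = w.abs().max() / 127.0
    w_q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return w_q, scale

def dequantize_int8(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_q.to(torch.bfloat16) * scale

# Steps 2-4 from above: weights live in int8, the matmul itself runs in bfloat16 on dequantized weights.
w = torch.randn(4096, 4096, dtype=torch.bfloat16)
x = torch.randn(1, 4096, dtype=torch.bfloat16)
w_q, scale = quantize_absmax_int8(w)
y = x @ dequantize_int8(w_q, scale).T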

代码实现:模型8位量化,15B模型需同等level的15.2GB,验证了8位量化可在消费级GPU(如4090)上运行

# !pip install bitsandbytes

We can then load models in 8-bit quantization by simply adding a load_in_8bit=True flag to from_pretrained.

model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", load_in_8bit=True, pad_token_id=0)

然后,我们可以通过简单地在from_pretrained中添加load_in_8bit=True标志来加载8位量化的模型。

Now, let's run our example again and measure the memory usage.

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]

result

Output:

Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\nreturn bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single

Nice, we're getting the same result as before, so no loss in accuracy! Let's look at how much memory was used this time.

bytes_to_giga_bytes(torch.cuda.max_memory_allocated())

Output:

15.219234466552734

Significantly less! We're down to just a bit over 15 GBs and could therefore run this model on consumer GPUs like the 4090. We're seeing a very nice gain in memory efficiency and more or less no degradation to the model's output. However, we can also notice a slight slow-down during inference.

We delete the models and flush the memory again.

del model

del pipe

flush()

现在,让我们再次运行我们的示例,并测量内存使用情况。

这是一个将字节转换为千兆字节的Python函数:

很好,我们得到了与之前相同的结果,因此没有准确性损失!让我们看看这次使用了多少内存。

显著减少!我们的内存使用量仅略高于15GB,因此可以在像4090这样的消费级GPU上运行这个模型。我们在内存效率方面取得了非常好的进展,而模型输出几乎没有降级。然而,在推理过程中,我们也可以注意到略微的减速

我们删除模型并再次清空内存。

代码实现:模型4位量化,15B模型仅需9.5GB,可在消费级GPU(如3090)上运行

Let's see what peak GPU memory consumption 4-bit quantization gives. Quantizing the model to 4-bit can be done with the same API as before - this time by passing load_in_4bit=True instead of load_in_8bit=True.

model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", load_in_4bit=True, low_cpu_mem_usage=True, pad_token_id=0)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]

result

Output:

Here is a Python function that transforms bytes to Giga bytes:\n\n```\ndef bytes_to_gigabytes(bytes):\nreturn bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single argument

让我们看看4位量化会产生什么GPU内存消耗峰值。将模型量化为4位可以使用与以前相同的API来完成-这次通过传递load_in_4bit=True而不是load_in_8bit=True。

We're almost seeing the same output text as before - just the python is missing just before the code snippet. Let's see how much memory was required.

bytes_to_giga_bytes(torch.cuda.max_memory_allocated())

Output:

9.543574333190918

Just 9.5GB! That's really not a lot for a >15 billion parameter model.

我们几乎看到了与之前相同的输出文本 - 只是在代码片段之前缺少了“Python”。让我们看看需要多少内存。

仅9.5GB!这对于一个拥有超过15B参数的模型来说真的不多。

对比——4位量化、8位量化、bfloat16推理相比:模型准确性几乎未降,但4位量化的推理速度略慢

While we see very little degradation in accuracy for our model here, 4-bit quantization can in practice often lead to different results compared to 8-bit quantization or full bfloat16 inference. It is up to the user to try it out.

Also note that inference here was again a bit slower compared to 8-bit quantization which is due to the more aggressive quantization method used for 4-bit quantization, leading to quantize and dequantize taking longer during inference.

del model

del pipe

flush()

Overall, we saw that running OctoCoder in 8-bit precision reduced the required GPU VRAM from 32G GPU VRAM to only 15GB and running the model in 4-bit precision further reduces the required GPU VRAM to just a bit over 9GB.

尽管我们在这里的模型准确性几乎没有降低,但实际上,4位量化与8位量化或完整的bfloat16推理相比,往往会导致不同的结果。用户可以自行尝试。

还要注意,与8位量化相比,4位量化的推理速度略慢,这是因为4位量化采用了更激进的量化方法,导致在推理过程中quantize(量化)和dequantize(反量化)需要更长的时间。

总的来说,我们看到将OctoCoder在8位精度下运行可以将所需的GPU VRAM从32GB减少到仅15GB,并且在4位精度下运行模型可以将所需的GPU VRAM进一步减少到略高于9GB

模型4位量化:可在消费级GPU(如RTX3090/V100/T4)上运行

4-bit quantization allows the model to be run on GPUs such as RTX3090, V100, and T4 which are quite accessible for most people.

4位量化使模型可以在RTX3090、V100和T4等GPU上运行,这对大多数人来说相当容易获得。

降显存更多的内容可查看AutoGPTQ或Transformers量化文档

For more information on quantization and to see how one can quantize models to require even less GPU VRAM memory than 4-bit, we recommend looking into the AutoGPTQ implementation.

有关更多关于量化以及如何使模型需要更少的GPU VRAM内存的信息,我们建议查看AutoGPTQ的实现。

AutoGPTQ:https://huggingface.co/docs/transformers/main/en/main_classes/quantization#autogptq-integration

模型量化的本质(权衡内存效率和准确性):如果有大把GPU资源可以不用考虑量化技术

As a conclusion, it is important to remember that model quantization trades improved memory efficiency against accuracy and in some cases inference time.

If GPU memory is not a constraint for your use case, there is often no need to look into quantization. However many GPUs simply can't run LLMs without quantization methods and in this case, 4-bit and 8-bit quantization schemes are extremely useful tools.

For more in-detail usage information, we strongly recommend taking a look at the Transformers Quantization Docs. Next, let's look into how we can improve computational and memory efficiency by using better algorithms and an improved model architecture.

总之,重要的是要记住,模型量化是在内存效率和准确性之间进行权衡,在某些情况下也会影响推理时间。

如果GPU内存不是您的用例的限制因素,通常不需要考虑量化。然而,许多GPU无法运行没有量化方法的LLMs,在这种情况下,4位和8位量化方案是极其有用的工具

有关更详细的使用信息,强烈建议查看Transformers量化文档。接下来,让我们看看如何通过使用更好的算法和改进的模型架构来提高计算和内存效率。

2. Flash Attention: A Leap Forward闪光注意力:一个重大飞跃

LLMs之FlashAttention-2:《FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning更快的注意力与更好的并行性和工作分区》翻译与解读

https://yunyaniu.blog.csdn.net/article/details/133108384

目前优秀的主流大模型核心架构基本一致:前馈层、激活层、层归一化层、自注意力层

Today's top-performing LLMs share more or less the same fundamental architecture that consists of feed-forward layers, activation layers, layer normalization layers, and most crucially, self-attention layers.

今天表现最佳的大型语言模型(LLMs)在根本架构上几乎相同,包括前馈层、激活层、层归一化层,最关键的是自注意力层。

传统自注意力层平方级增长:传统自注意层的计算和内存复杂度随输入序列长度呈平方级增长,给长序列模型的应用带来严重挑战

Self-attention layers are central to Large Language Models (LLMs) in that they enable the model to understand the contextual relationships between input tokens. However, the peak GPU memory consumption for self-attention layers grows quadratically both in compute and memory complexity with the number of input tokens (also called sequence length) that we denote in the following by N. While this is not really noticeable for shorter input sequences (of up to 1000 input tokens), it becomes a serious problem for longer input sequences (at around 16000 input tokens).

自注意力层对于大型语言模型(LLMs)至关重要,因为它们使模型能够理解输入标记之间的上下文关系。然而,自注意力层的峰值GPU内存消耗与输入标记数量(也称为序列长度)呈二次增长,我们在下文中用 N 表示。尽管对于较短的输入序列(最多1000个输入标记),这并不明显,但对于较长的输入序列(大约16000个输入标记),这成为一个严重的问题。

Let's take a closer look. The formula to compute the output O of a self-attention layer for an input X of length N is:

O = Attn(X) = V × Softmax(QK^T), with Q = W_q X, K = W_k X, V = W_v X

X = (x_1, ..., x_N) is thereby the input sequence to the attention layer. The projections Q and K will each consist of N vectors resulting in QK^T being of size N^2.

让我们仔细看一下。计算长度为 N 的输入 X 的自注意力层的输出 O 的公式如下:

O = Attn(X) = V × Softmax(QK^T),其中 Q = W_q X,K = W_k X,V = W_v X

X = (x_1, ..., x_N) 是自注意力层的输入序列。投影 Q 和 K 各自包含 N 个向量,因此 QK^T 的大小为 N^2。

LLMs usually have multiple attention heads, thus doing multiple self-attention computations in parallel. Assuming, the LLM has 40 attention heads and runs in bfloat16 precision, we can calculate the memory requirement to store the QK^T matrices to be 40∗2∗N^2 bytes. For N=1000 only around 50 MB of VRAM are needed, however, for N=16000 we would need 19 GB of VRAM, and for N=100,000 we would need almost 1TB just to store the QK^T matrices.

Long story short, the default self-attention algorithm quickly becomes prohibitively memory-expensive for large input contexts.

As LLMs improve in text comprehension and generation, they are applied to increasingly complex tasks. While models once handled the translation or summarization of a few sentences, they now manage entire pages, demanding the capability to process extensive input lengths.

LLMs通常具有多个注意力头,因此可以并行进行多次自注意力计算。假设LLM具有40个注意力头并以bfloat16精度运行,我们可以计算存储QK^T矩阵所需的内存量为40∗2∗N^2字节。对于 N=1000,仅需要约50 MB的VRAM,但对于 N=16000,我们需要19 GB的VRAM,而对于 N=100,000,我们需要将近1TB的VRAM来存储QK^T矩阵。

简而言之,对于大型输入上下文来说,默认的自注意力算法迅速变得非常昂贵,占用大量内存

随着LLMs在文本理解和生成方面的改进,它们被应用于越来越复杂的任务。虽然模型过去只处理几句话的翻译或摘要,但现在它们要处理整页的内容,这就要求模型能够处理很长的输入长度。
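A quick back-of-the-envelope helper for the 40 * 2 * N^2 bytes formula above (our own sketch; 40 heads and 2 bytes per bfloat16 value assumed):

def qk_matrix_memory_gb(seq_len: int, n_heads: int = 40, bytes_per_value: int = 2) -> float:
    # Peak memory just for the QK^T attention score matrices: n_heads * bytes * N^2
    return n_heads * bytes_per_value * seq_len ** 2 / 1024 ** 3

for n in (1_000, 16_000, 100_000):
    print(f"N={n}: {qk_matrix_memory_gb(n):.2f} GB")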

Tri Dao提出Flash Attention线性增长

How can we get rid of the exorbitant memory requirements for large input lengths? We need a new way to compute the self-attention mechanism that gets rid of the QK^T matrix. Tri Dao et al. developed exactly such a new algorithm and called it Flash Attention.

那么,如何摆脱大型输入长度所需的巨大内存要求呢?我们需要一种新的计算自注意力机制的方法,可以摆脱QK^T矩阵。Tri Dao等人开发了一种全新的算法,称之为Flash Attention

In a nutshell, Flash Attention breaks the V × Softmax(QK^T) computation apart and instead computes smaller chunks of the output by iterating over multiple softmax computation steps:

O_i ← s^a_ij * O_i + s^b_ij * V_j × Softmax(QK^T_{i,j}) for multiple i, j iterations

with s^a_ij and s^b_ij being some softmax normalization statistics that need to be recomputed for every i and j.

Please note that the whole Flash Attention is a bit more complex and is greatly simplified here as going in too much depth is out of scope for this notebook. The reader is invited to take a look at the well-written Flash Attention paper for more details.

简而言之,Flash Attention将 V × Softmax(QK^T) 的计算拆开,通过迭代多个softmax计算步骤来分块计算输出:

O_i ← s^a_ij * O_i + s^b_ij * V_j × Softmax(QK^T_{i,j})(对多个 i、j 迭代)

其中 s^a_ij 和 s^b_ij 是需要为每个 i 和 j 重新计算的softmax归一化统计量。

请注意,整个Flash Attention稍微复杂一些,这里进行了大幅简化,因为深入讨论超出了本文档的范围。读者可以查看详细的Flash Attention论文以获取更多详细信息。

Flash Attention的原理:跟踪softmax归一化统计信息和优化计算,实现与默认注意力层数值相同但内存开销线性增长的效果更快速且更节省内存

Flash Attention通过跟踪softmax归一化统计信息和优化计算,实现与默认注意力层数值相同但内存开销线性增长的效果。它之所以在推理速度上优于默认注意力,是由于能大大减轻对GPU慢速VRAM的依赖,将中间运算优先在快速SRAM内部完成,从而提升效率。

The main takeaway here is:

By keeping track of softmax normalization statistics and by using some smart mathematics, Flash Attention gives numerically identical outputs compared to the default self-attention layer at a memory cost that only increases linearly with N.

Looking at the formula, one would intuitively say that Flash Attention must be much slower compared to the default self-attention formula as more computation needs to be done. Indeed Flash Attention requires more FLOPs compared to normal attention as the softmax normalization statistics have to constantly be recomputed (see the paper for more details if interested).

However Flash Attention is much faster in inference compared to default attention which comes from its ability to significantly reduce the demands on the slower, high-bandwidth memory of the GPU (VRAM), focusing instead on the faster on-chip memory (SRAM).

这里的主要观点是:

通过跟踪softmax归一化统计信息,并使用一些巧妙的数学方法,Flash Attention以仅与 N 呈线性增长的内存成本提供数值上相同的输出。

从公式上看,人们会直觉地认为Flash Attention必须比默认的自注意力公式慢得多,因为需要进行更多的计算。实际上,Flash Attention与普通注意力相比需要更多的FLOP,因为softmax归一化统计数据必须不断重新计算(如果感兴趣,请参阅论文以获取更多详细信息)。

然而,Flash Attention在推理方面比默认注意力要快得多,这是因为它能够显着减少GPU(VRAM)较慢、高带宽内存的需求,而是专注于更快的片上内存(SRAM)。

Essentially, Flash Attention makes sure that all intermediate write and read operations can be done using the fast on-chip SRAM memory instead of having to access the slower VRAM memory to compute the output vector O .

In practice, there is currently absolutely no reason to not use Flash Attention if available. The algorithm gives mathematically the same outputs, and is both faster and more memory-efficient.

简而言之,Flash Attention确保所有中间写入和读取操作都可以使用快速的片上SRAM内存完成,而不必访问较慢的VRAM内存来计算输出向量 O 。

实际上,如果可用,目前没有不使用Flash Attention的理由。该算法在数学上提供相同的输出,而且更快速且更节省内存

观察是否使用Flash Attention下模型性能的变化:通过将系统提示重复10次,使输入长度足够长,以观察内存节省效果

Let's look at a practical example.

Our OctoCoder model now gets a significantly longer input prompt which includes a so-called system prompt. System prompts are used to steer the LLM into a better assistant that is tailored to the users' task. In the following, we use a system prompt that will make OctoCoder a better coding assistant.

让我们看一个实际的例子。

现在,我们的OctoCoder模型获得了一个包括所谓的系统提示的显著更长的输入提示。系统提示用于引导LLM成为更适合用户任务的助手。在下文中,我们使用一个系统提示,将使OctoCoder成为更好的编程助手。

For demonstration purposes, we duplicate the system by ten so that the input length is long enough to observe Flash Attention's memory savings. We append the original text prompt "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer: Here"
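# Note: system_prompt below is the long coding-assistant system prompt defined in the original blog post; it is not reproduced in this excerpt.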

long_prompt = 10 * system_prompt + prompt

We instantiate our model again in bfloat16 precision.

model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

出于演示目的,我们将系统提示复制了十次,以便输入长度足够长,以观察Flash Attention的内存节省情况。我们将原始文本提示 "问题:请编写一个将字节转换为千兆字节的Python函数。\n\n答案:Here" 追加到长提示中。

我们再次使用bfloat16精度实例化我们的模型。

Let's now run the model just like before without Flash Attention and measure the peak GPU memory requirement and inference time.

import time

start_time = time.time()

result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]

print(f"Generated in {time.time() - start_time} seconds.")

result

Output:

Generated in 10.96854019165039 seconds.

Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n   return bytes / 1024 / 1024 / 1024\n\nAnswer

现在,让我们像之前一样运行模型,不使用Flash Attention,并测量峰值GPU内存需求和推理时间。

We're getting the same output as before, however this time, the model repeats the answer multiple times until it hits the 60-token cut-off. This is not surprising as we've repeated the system prompt ten times for demonstration purposes and thus cued the model to repeat itself.

Note that the system prompt should not be repeated ten times in real-world applications - one time is enough!

Let's measure the peak GPU memory requirement.

bytes_to_giga_bytes(torch.cuda.max_memory_allocated())

Output:

37.668193340301514

As we can see the peak GPU memory requirement is now significantly higher than in the beginning, which is largely due to the longer input sequence. Also the generation takes a little over a minute now.

We call flush() to free GPU memory for our next experiment.

flush()

我们得到了与之前相同的输出,但是这一次模型重复答案多次,直到达到60个标记的截止点。这并不令人意外,因为出于演示目的,我们重复了系统提示十次,从而提示模型重复自己。

请注意,在实际应用程序中,系统提示不应重复十次 - 一次足够!

让我们测量峰值GPU内存需求。

如我们所见,峰值GPU内存需求现在显著高于开始时,这主要是由于更长的输入序列导致的。此外,生成现在需要稍多于一分钟

我们调用flush()以释放GPU内存,以进行下一个实验。

For comparison, let's run the same function, but enable Flash Attention instead. To do so, we convert the model to BetterTransformers and by doing so enabling PyTorch's SDPA self-attention which in turn is based on Flash Attention.

model.to_bettertransformer()

Now we run the exact same code snippet as before and under the hood Transformers will make use of Flash Attention.

start_time = time.time()

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):

    result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]

print(f"Generated in {time.time() - start_time} seconds.")

result

Output:

Generated in 3.0211617946624756 seconds.

 Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n   return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef

We're getting the exact same result as before, but can observe a very significant speed-up thanks to Flash Attention.

Let's measure the memory consumption one last time.

bytes_to_giga_bytes(torch.cuda.max_memory_allocated())

Output:

32.617331981658936

And we're almost back to our original 29GB peak GPU memory from the beginning.

We can observe that we only use roughly 100MB more GPU memory when passing a very long input sequence with Flash Attention compared to passing a short input sequences as done in the beginning.

flush()

为了进行比较,让我们运行相同的函数,但启用Flash Attention。为此,我们将模型转换为BetterTransformers,从而启用PyTorch的SDPA自注意力机制,后者又基于Flash Attention。

现在,我们运行与之前完全相同的代码片段,Transformers底层将使用Flash Attention

我们得到了与之前完全相同的结果,但由于Flash Attention,观察到了非常显著的加速

让我们最后一次测量内存消耗。

我们几乎回到了开始时的原始29GB峰值GPU内存

我们可以观察到,与在开始时传递短输入序列相比,使用Flash Attention传递非常长的输入序列时,我们仅使用了大约100MB更多的GPU内存

最后,我们调用flush()以释放GPU内存。

3. 架构背后的科学:为长文本输入和聊天选择策略

The Science Behind LLM Architectures: Strategic Selection for Long Text Inputs and Chat

提高计算效率和内存效率的方式:低精度转换、更高效的自注意力算法

So far we have looked into improving computational and memory efficiency by:

>> Casting the weights to a lower precision format

>> Replacing the self-attention algorithm with a more memory- and compute efficient version

到目前为止,我们已经研究了通过以下方式提高计算效率和内存效率:

>> 将权重转换为较低精度的格式

>> 用更节省内存和计算资源的版本替换自注意力算法

改变LLM的架构使其对需要长文本输入的任务最有效和高效

Let's now look into how we can change the architecture of an LLM so that it is most effective and efficient for task that require long text inputs, e.g.:

>> Retrieval augmented Questions Answering,

>> Summarization,

>> Chat

Note that chat not only requires the LLM to handle long text inputs, but it also necessitates that the LLM is able to efficiently handle the back-and-forth dialogue between user and assistant (such as ChatGPT).

现在让我们看看如何改变LLM的架构,使其对需要长文本输入的任务最有效和高效,例如:

>> 带检索的问题回答(Retrieval augmented Questions Answering)

>> 摘要生成(Summarization)

>> 聊天(Chat)

请注意,聊天不仅要求LLM处理长文本输入,还要求LLM能够有效地处理用户和助手之间的来回对话(例如ChatGPT)。

模型架构的两个重要组成部分:位置嵌入、键-值缓存

Once trained, the fundamental LLM architecture is difficult to change, so it is important to make considerations about the LLM's tasks beforehand and accordingly optimize the model's architecture. There are two important components of the model architecture that quickly become memory and/or performance bottlenecks for large input sequences.

>> The positional embeddings

>> The key-value cache

Let's go over each component in more detail

一旦训练完成,基本的LLM架构很难更改,因此在事先考虑LLM的任务并相应优化模型架构非常重要。

模型架构的两个重要组成部分很快就会成为大型输入序列的内存和/或性能瓶颈

>> 位置嵌入(positional embeddings)

>> 键-值缓存(key-value cache)

让我们更详细地讨论每个组件。

3.1 Improving positional embeddings of LLMs改进LLM的位置嵌入

自注意力的意义:建立关联

Self-attention puts each token in relation to each other's tokens. As an example, the Softmax(QK^T) matrix of the text input sequence "Hello", "I", "love", "you" could look as follows (the original post illustrates this as a 4×4 matrix of attention weights):

Each word token is given a probability mass at which it attends to all other word tokens and is therefore put into relation with all other word tokens. E.g. the word "love" attends to the word "Hello" with 5%, to "I" with 30%, and to itself with 65%.

自注意力使每个token与其他token相互关联。例如,文本输入序列 "Hello", "I", "love", "you" 的 Softmax(QK^T) 矩阵可能如下所示(原文以一个4×4的注意力权重矩阵作了图示):

每个词token被分配一个概率质量,用来关注所有其他词token,因此与所有其他词token建立了联系。例如,单词 "love" 以5%的权重关注单词 "Hello",以30%的权重关注 "I",以65%的权重关注它自己。

位置嵌入让LLMs理解句子顺序:因由QK^T计算的概率分数+O(1)次计算中完成→没有位置嵌入的LLM导致token间的距离均相同—即无法区分Hello I love you和You love I hello

A LLM based on self-attention, but without position embeddings would have great difficulties in understanding the positions of the text inputs relative to each other. This is because the probability score computed by QK^T relates each word token to each other word token in O(1) computations regardless of their relative positional distance to each other. Therefore, for the LLM without position embeddings each token appears to have the same distance to all other tokens, e.g. differentiating between "Hello I love you" and "You love I hello" would be very challenging.

For the LLM to understand sentence order, an additional cue is needed and is usually applied in the form of positional encodings (or also called positional embeddings). Positional encodings, encode the position of each token into a numerical presentation that the LLM can leverage to better understand sentence order.

基于自注意力的LLM,如果没有位置嵌入,将难以理解文本输入相对于彼此的位置。这是因为由QK^T计算的概率分数将每个词标记与每个其他词标记相关联无论它们相对位置的距离如何,都可以在O(1)次计算中完成。因此,对于没有位置嵌入的LLM,每个token 似乎与所有其他token 具有相同的距离,例如,区分 "Hello I love you" 和 "You love I hello" 将非常具有挑战性

为了让LLM理解句子顺序,需要额外的提示,通常以位置编码(也称为位置嵌入)的形式应用。位置编码将每个标记的位置编码为数字表示,LLM可以利用它更好地理解句子顺序。

绝对位置嵌入(按固定长度训练+给出每个词位置的唯一向量表示【为每个位置的id编码了一个唯一的嵌入】+与词本身语义无关+对长文本效果不佳→会导致LLM性能差):【Transformer的原始论文提出】基于正弦函数的正弦位置嵌入→基于学习的位置嵌入(非固定嵌入+在训练期间学习而得+必须在固定的输入长度N上训练+但是固定输入长度导致难以外推)→提出如果模型学习输入标记之间相对位置距离(而非绝对位置)会更有优势

The authors of the Attention Is All You Need paper introduced sinusoidal positional embeddings P = p_1, …, p_N, where each vector p_i is computed as a sinusoidal function of its position i. The positional encodings are then simply added to the input sequence vectors, thereby cueing the model to better learn sentence order.

《Attention Is All You Need》论文的作者引入了正弦位置嵌入 P=p1,…,pN,其中每个向量 pi 都计算为其位置 i 的正弦函数。然后,将位置编码简单地添加到输入序列向量中,从而提示模型更好地学习句子顺序
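As an illustration (our own sketch, not code from the original post), the sinusoidal encoding from "Attention Is All You Need" can be written in a few lines of PyTorch:

import math
import torch

def sinusoidal_positional_embeddings(seq_len: int, dim: int) -> torch.Tensor:
    # p_i[2k] = sin(i / 10000^(2k/dim)),  p_i[2k+1] = cos(i / 10000^(2k/dim))
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (N, 1)
    freqs = torch.exp(-torch.arange(0, dim, 2, dtype=torch.float32) * (math.log(10000.0) / dim))   # (dim/2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(positions * freqs)
    pe[:, 1::2] = torch.cos(positions * freqs)
    return pe   # added to the token embeddings: x_i + p_i

pe = sinusoidal_positional_embeddings(seq_len=8, dim=16)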

Instead of using fixed position embeddings, others (such as Devlin et al.) used learned positional encodings for which the positional embeddings P are learned during training.

Sinusoidal and learned position embeddings used to be the predominant methods to encode sentence order into LLMs, but a couple of problems related to these positional encodings were found:

1.) Sinusoidal and learned position embeddings are both absolute positional embeddings, i.e. encoding a unique embedding for each position id: 0, …, N. As shown by Huang et al. and Su et al., absolute positional embeddings lead to poor LLM performance for long text inputs. For long text inputs, it is advantageous if the model learns the relative positional distance input tokens have to each other instead of their absolute position.

2.) When using learned position embeddings, the LLM has to be trained on a fixed input length N, which makes it difficult to extrapolate to an input length longer than what it was trained on.

与固定位置嵌入不同,其他一些研究(如Devlin等人)使用了在训练期间学习得到的位置嵌入P。正弦位置嵌入和可学习位置嵌入曾经是将句子顺序编码进LLM的主要方法,但人们发现这些位置编码存在一些问题:

1.) 正弦和学习位置嵌入都是绝对位置嵌入,即为每个位置ID(0,…,N)编码一个唯一的嵌入。如Huang等人和Su等人所示,对于长文本输入,绝对位置嵌入会导致LLM性能差。对于长文本输入,如果模型能够学习输入标记彼此之间的相对位置距离而不是它们的绝对位置,则会更有优势。

2.) 使用学习位置嵌入时,LLM必须在固定的输入长度N上进行训练。这使得难以对比训练长度之外更长的输入长度进行外推

相对位置嵌入(关注词间相对距离【而非绝对位置】+符合语言本质弥补了绝对位置编码在长文本模型和外推上的不足)—在自注意力算法中直接提示LLM句子顺序最佳(在注意力计算中加入相对位置依赖项)RoPE、ALiBi

相对位置编码可以随距离增大而逐步衰减,符合语言本质。相对位置编码弥补了绝对位置编码在长文本模型和外推上的不足,影响了后续研究的发展方向。

这两种方法都有助于提高语言模型的性能和处理长文本输入的能力。

Recently, relative positional embeddings that can tackle the above mentioned problems have become more popular, most notably:

>>Rotary Position Embedding (RoPE)

>>ALiBi

Both RoPE and ALiBi argue that it's best to cue the LLM about sentence order directly in the self-attention algorithm as it's there that word tokens are put into relation with each other. More specifically, sentence order should be cued by modifying the QK^T computation.

最近,相对位置嵌入变得更加流行,尤其是可以解决上述问题的相对位置嵌入,最著名的有:

>>旋转位置嵌入(Rotary Position Embedding,RoPE)

>>ALiBi

RoPE和ALiBi都认为,在自注意力算法中直接提示LLM句子顺序最佳,因为在这里单词标记与其他单词标记建立关系。更具体地说,应通过修改QK^T计算来提示句子顺序

RoPE(通过旋转矩阵表达相对距离关系):如PaLM、Llama、Falcon​​​​​​​

Without going into too many details, RoPE notes that positional information can be encoded into query-key pairs, e.g. q_i and x_j, by rotating each vector by an angle θ*i and θ*j respectively, with i, j describing each vector's sentence position:

q̂_i^T x̂_j = q_i^T R_{θ, i−j} x_j

R_{θ, i−j} thereby represents a rotational matrix. θ is not learned during training, but instead set to a pre-defined value that depends on the maximum input sequence length during training.

By doing so, the probability score between q_i and x_j is only affected if i ≠ j and solely depends on the relative distance i − j regardless of each vector's specific positions i and j.

RoPE is used in multiple of today's most important LLMs, such as:

>>Falcon

>>Llama

>>PaLM

简单来说,RoPE指出可以把位置信息编码进查询-键对(query-key pairs)中:将 q_i 和 x_j 分别旋转角度 θ*i 和 θ*j(其中 i、j 表示各向量在句子中的位置):

q̂_i^T x̂_j = q_i^T R_{θ, i−j} x_j

R_{θ, i−j} 表示一个旋转矩阵。θ 在训练期间不被学习,而是设置为依赖于训练时最大输入序列长度的预定义值。

通过这样做,q_i 和 x_j 之间的概率分数仅在 i≠j 时受到影响,并且仅取决于它们的相对距离 i−j,而与各向量的具体位置 i 和 j 无关。

RoPE用于当今最重要的LLM之一,例如:

>>Falcon

>>Llama

>>PaLM
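A minimal sketch of the rotation idea described above (our simplified illustration; real RoPE implementations, e.g. in Transformers, operate on full query/key tensors with batch and head dimensions):

import torch

def rope_rotate(x: torch.Tensor, position: int, theta_base: float = 10000.0) -> torch.Tensor:
    # x: (dim,) with even dim; rotate each 2D pair (x[2k], x[2k+1]) by the angle position * theta_k
    dim = x.shape[-1]
    theta = theta_base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)   # (dim/2,)
    angle = position * theta
    cos, sin = torch.cos(angle), torch.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = torch.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# The attention score of the rotated vectors only depends on the relative distance i - j:
q, k = torch.randn(64), torch.randn(64)
score_a = rope_rotate(q, position=5) @ rope_rotate(k, position=3)       # i - j = 2
score_b = rope_rotate(q, position=105) @ rope_rotate(k, position=103)   # also i - j = 2
print(torch.allclose(score_a, score_b, atol=1e-3))   # True up to numerical precision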

ALiBi(通过添加位置偏差值调整注意力分布):如MPT、BLOOM

As an alternative, ALiBi proposes a much simpler relative position encoding scheme. The relative distance that input tokens have to each other is added as a negative integer scaled by a pre-defined value m to each query-key entry of the QK^T matrix right before the softmax computation.

As shown in the ALiBi paper, this simple relative positional encoding allows the model to retain a high performance even at very long text input sequences.

ALiBi is used in multiple of today's most important LLMs, such as:

>>MPT

>>BLOOM

作为替代方案,ALiBi提出了一个简化的相对位置编码方案。将输入标记之间的相对距离作为负整数,乘以预定义值m,添加到softmax计算之前的QK^T矩阵的每个查询-键条目中。

如ALiBi论文所示,这种简单的相对位置编码允许模型在非常长的文本输入序列上保持高性能

ALiBi用于当今最重要的LLM之一,例如:

>>MPT

>>BLOOM
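A sketch of the ALiBi idea (our own illustration; in practice each attention head gets its own pre-defined slope m):

import torch

def alibi_bias(seq_len: int, m: float) -> torch.Tensor:
    # Lower-triangular relative-distance penalty added to QK^T before the softmax:
    # bias[i, j] = -m * (i - j) for j <= i; future positions are handled by the usual causal mask.
    positions = torch.arange(seq_len)
    return -m * (positions.unsqueeze(1) - positions.unsqueeze(0)).clamp(min=0).float()

scores = torch.randn(8, 8)   # QK^T / sqrt(d) for one head
causal_mask = torch.triu(torch.full((8, 8), float("-inf")), diagonal=1)
probs = torch.softmax(scores + alibi_bias(8, m=0.25) + causal_mask, dim=-1)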

RoPE和ALiBi(具有更好的扩展性)相对位置编码可以对训练时未见过的输入长度进行推理:两种均是基于相对位置的启发,可以降低远距离token间的关联性

RoPE需要额外技巧才能在长输入下效果好,而ALiBi直接扩展下三角矩阵即可。总的来说,相对位置编码更适用于需处理长输入的下游任务。

Both RoPE and ALiBi position encodings can extrapolate to input lengths not seen during training whereas it has been shown that extrapolation works much better out-of-the-box for ALiBi as compared to RoPE. For ALiBi, one simply increases the values of the lower triangular position matrix to match the length of the input sequence. For RoPE, keeping the same θ that was used during training leads to poor results when passing text inputs much longer than those seen during training, cf. Press et al. However, the community has found a couple of effective tricks that adapt θ, thereby allowing RoPE position embeddings to work well for extrapolated text input sequences (see here).

RoPE和ALiBi的位置编码都可以外推到训练期间没有看到的输入长度,而已经证明,与RoPE相比,在ALiBi中外推的开箱即用效果要好得多

对于ALiBi,只需增加下三角形位置矩阵的值以匹配输入序列的长度。

对于RoPE,当传递的文本输入比训练期间看到的长得多时,保持训练期间使用的相同θ会导致较差的结果,c.f. Press等。然而,社区已经发现了一些有效的技巧来适应θ。从而使RoPE位置嵌入能够很好地用于外推的文本输入序列(见这里)。

Both RoPE and ALiBi are relative positional embeddings that are not learned during training, but instead are based on the following intuitions:

>>Positional cues about the text inputs should be given directly to the QK^T matrix of the self-attention layer

>>The LLM should be incentivized to learn a constant relative distance positional encodings have to each other

>>The further text input tokens are from each other, the lower their query-key attention probability. Both RoPE and ALiBi lower the query-key probability of tokens far away from each other: RoPE by decreasing their vector product through increasing the angle between the query-key vectors, ALiBi by adding large negative numbers to the vector product.

In conclusion, LLMs that are intended to be deployed in tasks that require handling large text inputs are better trained with relative positional embeddings, such as RoPE and ALiBi. Also note that even if an LLM with RoPE or ALiBi has been trained only on a fixed length of say N1 = 2048, it can still be used in practice with text inputs much larger than N1, like N2 = 8192 > N1, by extrapolating the positional embeddings.

RoPE和ALiBi都是相对位置嵌入,不是在训练过程中学习到的,而是基于以下直觉:

>>关于文本输入的位置提示应该直接给出自注意层的QK^T矩阵

>>应该激励LLM学习位置编码之间的恒定相对距离

>>文本输入token彼此之间的距离越远,它们之间的查询-键注意力概率就越低。RoPE和ALiBi都会降低相距较远的token之间的查询-键概率:RoPE通过增大查询-键向量之间的夹角来减小它们的向量积;ALiBi则在向量积上加上较大的负数。

综上所述,打算部署在需要处理大量文本输入任务中的LLM,最好使用相对位置嵌入(如RoPE和ALiBi)进行训练。还要注意,即使带有RoPE或ALiBi的LLM只在固定长度(例如N1=2048)上进行了训练,在实践中仍可以通过外推位置嵌入来处理比N1大得多的文本输入,例如N2=8192 > N1。

3.2 The key-value cache键-值缓存/KV缓存

LLMs:预训练大模型实现全流程详解之模型训练各种技巧原理细讲—3.2、模型预训练及优化:参数优化+结构优化+降内存优化(……/KV缓存技术)之详细攻略

https://yunyaniu.blog.csdn.net/article/details/131606747

自回归生成的工作原理:因果语言建模本质会掩盖注意力得分的上三角矩阵

Auto-regressive text generation with LLMs works by iteratively putting in an input sequence, sampling the next token, appending the next token to the input sequence, and continuing to do so until the LLM produces a token that signifies that the generation has finished.

Please have a look at Transformer's Generate Text Tutorial to get a more visual explanation of how auto-regressive generation works.

LLM的自动回归文本生成的工作方式是,迭代地放入一个输入序列,采样下一个标记,将下一个标记附加到输入序列,并继续这样做,直到LLM产生一个标记,表示生成已经完成。

请查看Transformer的“生成文本教程”,以更直观地了解自回归生成的工作原理。

Let's run a quick code snippet to show how auto-regressive works in practice. We will simply take the most likely next token via torch.argmax.
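The actual snippet is not reproduced in this excerpt; a minimal sketch along the same lines (greedy decoding via torch.argmax, reusing the model, tokenizer and prompt from the earlier sections) could look like this:

input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(model.device)

for _ in range(5):
    next_logits = model(input_ids)["logits"][:, -1:]            # logits of the last position only
    next_token_id = torch.argmax(next_logits, dim=-1)           # greedy: pick the most likely next token
    input_ids = torch.cat([input_ids, next_token_id], dim=-1)   # append it and feed everything back in
    print("shape of input_ids", input_ids.shape)

generated_text = tokenizer.batch_decode(input_ids[:, -5:])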

As we can see, at every step we extend the text input by the token that was just sampled.

With very few exceptions, LLMs are trained using the causal language modeling objective and therefore mask the upper triangle matrix of the attention score - this is why in the two diagrams above the attention scores are left blank (a.k.a have 0 probability). For a quick recap on causal language modeling you can refer to the Illustrated Self Attention blog.

让我们运行一个快速的代码片段,以展示自回归在实践中是如何工作的。我们将简单地通过torch.argmax获取最有可能的下一个token。

正如我们所看到的,每一步我们都把刚刚采样到的token追加到文本输入的末尾。

除了极少数例外,LLMs通常使用因果语言建模目标进行训练,因此掩盖了注意力得分的上方三角形矩阵——这就是为什么在上面的两个图中,注意力得分是空白的(也就是说,概率为0)。要快速回顾因果语言建模,您可以参考Illustrated Self Attention博客。

token从不依赖于其后的token→为减少不必要的计算使用键-值缓存

As a consequence, tokens never depend on later tokens, more specifically the q_i vector is never put in relation with any key and value vectors k_j, v_j if j > i. Instead q_i only attends to previous key-value vectors k_{m<i}, v_{m<i}, for m ∈ {0, …, i−1}. In order to reduce unnecessary computation, one can therefore cache each layer's key-value vectors for all previous timesteps.

因此,token从不依赖于其后的token;更具体地说,当j>i时,向量q_i从不与键、值向量k_j、v_j发生关联。相反,q_i只关注先前的键-值向量k_{m<i}、v_{m<i},其中m∈{0,…,i−1}。为了减少不必要的计算,可以为每一层缓存所有先前时间步的键-值向量。

In the following, we will tell the LLM to make use of the key-value cache by retrieving and forwarding it for each forward pass. In Transformers, we can retrieve the key-value cache by passing the use_cache flag to the forward call and can then pass it with the current token.

在下面的代码中,我们将告诉LLM通过在每次转发传递时检索并转发键-值缓存来使用键-值缓存。在Transformers中,我们可以通过将use_cache标志传递给forward调用来检索键-值缓存,然后将其与当前token一起传递。
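The corresponding snippet is also omitted in this excerpt; a minimal sketch of the idea (passing use_cache=True and feeding only the newly sampled token together with the returned past_key_values) could look like this:

past_key_values = None   # this is the key-value cache
next_token_id = tokenizer(prompt, return_tensors="pt")["input_ids"].to(model.device)
generated_tokens = []

for _ in range(5):
    outputs = model(next_token_id, past_key_values=past_key_values, use_cache=True)
    past_key_values = outputs.past_key_values                    # the cache grows by one key/value entry per step
    next_token_id = torch.argmax(outputs.logits[:, -1:], dim=-1)
    generated_tokens.append(next_token_id.item())
    print("shape of model input", next_token_id.shape)           # stays (1, 1) after the first step

generated_text = tokenizer.decode(generated_tokens)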

As one can see, when using the key-value cache the text input tokens are not increased in length, but remain a single input vector. The length of the key-value cache on the other hand is increased by one at every decoding step.

Making use of the key-value cache means that the QK^T is essentially reduced to q_c K^T with q_c being the query projection of the currently passed input token, which is always just a single vector.

正如大家所看到的,使用键-值缓存时,文本输入token的长度不会增加,而保持为单个输入向量。另一方面,键-值缓存的长度在每个解码步骤都会增加1。

利用键-值缓存意味着QK^T实质上被减少为q_c K^T,其中q_c是当前传递的输入token的查询投影,它总是一个单独的向量。

键-值缓存的两个优点:计算高效率+内存仅线性增长

Using the key-value cache has two advantages:

>>Significant increase in computational efficiency as less computations are performed compared to computing the full QK^T matrix. This leads to an increase in inference speed

>>The maximum required memory is not increased quadratically with the number of generated tokens, but only increases linearly.

One should always make use of the key-value cache as it leads to identical results and a significant speed-up for longer input sequences. Transformers has the key-value cache enabled by default when making use of the text pipeline or the generate method.

使用键-值缓存有两个优点:

>>与计算完整的QK^T矩阵相比,计算效率显著提高,从而提高了推理速度。

>>所需的最大内存仅线性增长,而不是随生成标记数量的增长而二次增长。

应该始终使用键-值缓存,因为它会产生相同的结果,并在处理较长的输入序列时显著加快速度。在使用文本流水线或生成方法时,Transformers默认启用键-值缓存

键-值缓存在Chat场景案例(需要多次自回归解码的应用)表现更优:利用存储的上下文信息加快后续解码

>>键-值缓存对于需要多次自回归解码的应用(如聊天)非常有用,例如回答用户询问时需要理解上下文

>>例子描述了键-值缓存在两次自回归解码中的应用:第一次解码需要编码全部上下文;第二次由于缓存了第一轮的结果,只需解码后续问题

>>保留所有上下文对语言模型在聊天中的理解非常重要,例如第二轮问题需要基于第一轮解码结果理解用户意图

>>键-值缓存可以让聊天历史在解码时持续累积,而不是每次从头重新编码,这对聊天应用尤其有效,能够利用先前的计算结果加快后续处理

总之,该内容系统性介绍了键-值缓存在聊天场景下的应用价值:利用存储的上下文信息加快后续解码,保持对整个聊天(多轮对话)过程的理解,这对实现流畅的人机对话很重要。

>>键-值缓存是聊天应用程序中的有价值工具,它允许LLM保持上下文,提高响应的连贯性,并通过重复使用先前计算的关键-值向量来高效处理多轮对话,从而实现聊天历史的持续增长,而无需重新编码整个历史。

Note that the key-value cache is especially useful for applications such as chat where multiple passes of auto-regressive decoding are required. Let's look at an example.

请注意,键-值缓存在诸如聊天之类需要多次自回归解码的应用程序中特别有用。让我们看一个例子。

In this chat, the LLM runs auto-regressive decoding twice:

>>The first time, the key-value cache is empty and the input prompt is "User: How many people live in France?" and the model auto-regressively generates the text "Roughly 75 million people live in France" while increasing the key-value cache at every decoding step.

>>The second time the input prompt is "User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many in Germany?". Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the input prompt only consists of "User: And how many in Germany?". While processing the shortened input prompt, it's computed key-value vectors are concatenated to the key-value cache of the first decoding. The second Assistant's answer "Germany has ca. 81 million inhabitants" is then auto-regressively generated with the key-value cache consisting of encoded key-value vectors of "User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?".

在这个聊天中,LLM运行了两次自回归解码

>>第一次,键-值缓存为空,输入提示为“User: 法国有多少人?”并且模型自回归生成文本“大约有7500万人住在法国”,同时在每个解码步骤中增加键-值缓存

>>第二次,输入提示为“User: 法国有多少人?\n Assistant: 法国有大约7500万人\n User: 德国有多少人?”由于缓存,前两个句子的所有键-值向量已经计算出来。因此,输入提示仅包含“User: 德国有多少人?”在处理缩短的输入提示时,其计算的键值向量被连接到第一个解码的键-值缓存中。然后,第二个助手的回答“德国大约有8100万居民”是由编码的键值向量组成的键-值缓存自动回归生成的,其中包含“User:有多少人住在法国?”\n Assistant:大约有7500万人住在法国\n User:德国有多少人呢?

Two things should be noted here:

>>Keeping all the context is crucial for LLMs deployed in chat so that the LLM understands all the previous context of the conversation. E.g. for the example above the LLM needs to understand that the user refers to the population when asking "And how many are in Germany".

>>The key-value cache is extremely useful for chat as it allows us to continuously grow the encoded chat history instead of having to re-encode the chat history again from scratch (as e.g. would be the case when using an encoder-decoder architecture).

这里需要注意两点:

>>保持所有上下文对于部署在聊天中的LLM至关重要,以便LLM理解对话的所有先前上下文。例如,对于上面的例子,LLM需要理解用户在问“德国有多少人”时指的是人口。

>>键-值缓存对聊天非常有用,因为它允许我们不断增长编码的聊天历史,而不是从头开始重新编码聊天历史(例如,当使用编码器-解码器架构时)。

显著降低存储键-值缓存的内存成本的两大技巧:都是在保留自回归解码需要的key-value缓存基础的同时+都通过减少key-value投影权重数来提升效率​​​​​​​

Let's compute the number of float values that need to be stored in the key-value cache for the LLM bigcode/octocoder that we used before. The number of float values amounts to two times the sequence length times the number of attention heads times the attention head dimension and times the number of layers. Computing this for our LLM at a hypothetical input sequence length of 16000 gives:

让我们计算一下之前使用的LLM bigcode/octocoder 需要在键-值缓存中存储的float值数量。float值的数量等于:2 × 序列长度 × 注意力头数 × 注意力头维度 × 层数。假设输入序列长度为16000,为我们的LLM计算可得:
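A sketch of that calculation (attribute names follow the GPT-2-style config of this checkpoint; the result comes out to roughly 7.9 billion values):

config = model.config
seq_len = 16_000
head_dim = config.n_embd // config.n_head

# 2 (keys and values) x sequence length x number of layers x number of heads x head dimension
num_floats = 2 * seq_len * config.n_layer * config.n_head * head_dim
print(num_floats)   # ~7.9e9 values, i.e. ~15 GB in float16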

There is however one catch. While the required peak memory for the QK^T matrix is significantly reduced, holding the key-value cache in memory can become very memory expensive for long input sequence or multi-turn chat. Remember that the key-value cache needs to store the key-value vectors for all previous input vectors x_i ​  , for i∈{1,…,c−1} for all self-attention layers and for all attention heads.

Roughly 8 billion float values! Storing 8 billion float values in float16 precision requires around 15 GB of RAM which is circa half as much as the model weights themselves! Researchers have proposed two methods that allow to significantly reduce the memory cost of storing the key-value cache:

然而,有一个注意事项。虽然QK^T矩阵的所需峰值内存显著减少,但将键-值缓存保存在内存中对于长输入序列或多轮聊天可能会变得非常昂贵。请记住,键-值缓存需要存储所有先前输入向量x_i的键值向量,对于i∈{1,…,c−1},对于所有自关注层和所有关注头。

大约80亿个浮点值!以float16精度存储80亿个浮点值需要大约15GB的RAM,这大约是模型本身权重的一半!研究人员提出了两种方法,可以显著降低存储键-值缓存的内存成本:

Multi-Query-Attention (MQA)多查询注意力:使用单个key-value投影权重替代原始多头attentions结构,两个优点(节省内存+提高计算效率),比如PaLM、MPT、BLOOM、Falcon

Multi-Query-Attention was proposed in Noam Shazeer's Fast Transformer Decoding: One Write-Head is All You Need paper. As the title says, Noam found out that instead of using n_head key-value projection weights, one can use a single key-value projection weight pair that is shared across all attention heads without the model's performance significantly degrading.

By using a single key-value projection weight pair, the key-value vectors k_i, v_i have to be identical across all attention heads which in turn means that we only need to store 1 key-value projection pair in the cache instead of n_head ones.

As most LLMs use between 20 and 100 attention heads, MQA significantly reduces the memory consumption of the key-value cache. For the LLM used in this notebook we could therefore reduce the required memory consumption from 15 GB to less than 400 MB at an input sequence length of 16000.

多查询注意力(MQA)由Noam Shazeer在《Fast Transformer Decoding: One Write-Head is All You Need》论文中提出。正如标题所说,Noam发现,与其使用n_head个键-值投影权重,不如让所有注意力头共享同一对键-值投影权重,而模型的性能不会显著下降。

通过共享同一对键-值投影权重,键值向量k_i、v_i必须在所有注意力头之间相同,这意味着我们只需要在缓存中存储1对键-值投影,而不是n_head对。

由于大多数LLMs使用20到100个注意头,因此MQA显著减少了键-值缓存的内存消耗。对于本笔记本中使用的LLM,在输入序列长度为16000的情况下,我们可以将所需的内存消耗从15 GB减少到不到400 MB

In addition to memory savings, MQA also leads to improved computational efficiency as explained in the following. In auto-regressive decoding, large key-value vectors need to be reloaded, concatenated with the current key-value vector pair to be then fed into the q_c*K^T computation at every step. For auto-regressive decoding, the required memory bandwidth for the constant reloading can become a serious time bottleneck. By reducing the size of the key-value vectors less memory needs to be accessed, thus reducing the memory bandwidth bottleneck. For more detail, please have a look into Noam's paper.

除了节省内存外,MQA还提高计算效率,如下所述。在自回归解码中,需要重新加载大的键值向量,并将其与当前的键值向量对连接起来,然后在每一步将其输入q_c*K^T计算。

对于自回归解码,不断重新加载所需的内存带宽可能成为严重的时间瓶颈。通过减小键值向量的大小,可以减少访问的内存带宽,从而减少内存带宽瓶颈。有关更多详细信息,请参阅Noam的论文。

The important part to understand here is that reducing the number of key-value attention heads to 1 only makes sense if a key-value cache is used. The peak memory consumption of the model for a single forward pass without key-value cache stays unchanged as every attention head still has a unique query vector so that each attention head still has a different  QK^T matrix.

MQA has seen wide adoption by the community and is now used by many of the most popular LLMs:

>>Falcon

>>PaLM

>>MPT

>>BLOOM

Also, the checkpoint used in this notebook - bigcode/octocoder - makes use of MQA.

这里需要理解的重要部分是,仅当使用键-值缓存时,将键值关注头的数量减少到1才有意义。对于没有键值缓存的单个转发传递,模型的峰值内存消耗保持不变,因为每个注意头仍然具有唯一的查询向量,因此每个注意头仍然具有不同的QK^T矩阵。

MQA已经在社区中广泛采用,现在许多最受欢迎的LLMs都在使用:

>>Falcon

>>PaLM

>>MPT

>>BLOOM

此笔记本中使用的检查点 - bigcode/octocoder - 也使用了MQA。

Grouped-Query-Attention (GQA)分组查询注意力:使用多个(如2-8个)而不是单个key-value投影权重+保留更多模型容量的同时还可以获得MQA的内存和计算效率提升,如LLaMA2

Grouped-Query-Attention, as proposed by Ainslie et al. from Google, found that using MQA can often lead to quality degradation compared to using vanilla multi-key-value head projections. The paper argues that more model performance can be kept by reducing the number of key-value head projection weights less drastically. Instead of using just a single key-value projection weight, n < n_head key-value projection weights should be used. By choosing n to be a significantly smaller value than n_head, such as 2, 4 or 8, almost all of the memory and speed gains from MQA can be kept while sacrificing less model capacity and thus arguably less performance.

谷歌的Ainslie等人提出的GQA发现,与使用原始的多键-值头投影相比,MQA往往会导致模型质量下降。该论文认为,通过不那么激进地减少键-值投影权重的数量,可以保留更多的模型性能:不是只使用1个键-值投影权重,而是使用n<n_head个键-值投影权重。通过选择远小于n_head的n值(例如2、4或8),几乎可以保留MQA带来的全部内存和速度收益,同时牺牲更少的模型容量,因此性能下降也更小。

Moreover, the authors of GQA found out that existing model checkpoints can be uptrained to have a GQA architecture with as little as 5% of the original pre-training compute. While 5% of the original pre-training compute can still be a massive amount, GQA uptraining allows existing checkpoints to be useful for longer input sequences.

GQA was only recently proposed which is why there is less adoption at the time of writing this notebook. The most notable application of GQA is Llama-v2.

此外,GQA的作者发现,现有的模型检查点可以被训练成一个GQA架构,只需要原始预训练计算的5%。虽然5%的原始预训练计算仍然是大量的,但GQA上训练允许现有的检查点对更长的输入序列有用。

由于GQA只是最近提出的,因此在撰写本笔记时,采用情况较少。GQA最值得关注的应用是Llama-v2

As a conclusion, it is strongly recommended to make use of either GQA or MQA if the LLM is deployed with auto-regressive decoding and is required to handle large input sequences as is the case for example for chat.

总之,如果LLM使用自回归解码并需要处理大型输入序列(例如聊天),强烈建议使用GQA或MQA
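To see what this means for memory, here is a small comparison sketch (hypothetical numbers: 40 layers, 48 attention heads of dimension 128, a float16 cache and 16k tokens; the GQA group count of 8 is just an example):

def kv_cache_gb(seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> float:
    # 2 tensors (keys and values) per layer, one per key-value head
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value / 1024 ** 3

for name, n_kv in [("vanilla multi-head (48 kv heads)", 48), ("GQA (8 kv heads)", 8), ("MQA (1 kv head)", 1)]:
    print(f"{name}: {kv_cache_gb(16_000, 40, n_kv, 128):.2f} GB")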

对比:MQA多查询注意力、GQA分组查询注意力
LLMs:预训练大模型实现全流程详解之模型训练各种技巧原理细讲—3.2、模型预训练及优化:参数优化(前置参数/超参)+结构优化(优化器/激活函数/位置嵌入/注意力机制/归一化的方法和位置)+降内存优化(分布式5大并行策略)+提速优化(词表裁剪/梯度累积GA/梯度检查点GC/AMP训练/4-bit量化/ZeRO)之详细攻略

https://blog.csdn.net/qq_41185868/article/details/131606747

Conclusion结论

有前途的方向之一推测解码Speculative Decoding

研究社区不断提出新颖高效的方式来加速庞大LLM的推理,例如推测解码:由更小、更快的模型生成"易预测的token",只有"难预测的token"才由LLM自身生成,从而提高效率。

The research community is constantly coming up with new, nifty ways to speed up inference time for ever-larger LLMs. As an example, one such promising research direction is speculative decoding where "easy tokens" are generated by smaller, faster language models and only "hard tokens" are generated by the LLM itself. Going into more detail is out of the scope of this notebook, but can be read upon in this nice blog post.

The reason massive LLMs such as GPT3/4, Llama-2-70b, Claude, PaLM can run so quickly in chat-interfaces such as Hugging Face Chat or ChatGPT is to a big part thanks to the above-mentioned improvements in precision, algorithms, and architecture. Going forward, accelerators such as GPUs, TPUs, etc... will only get faster and allow for more memory, but one should nevertheless always make sure to use the best available algorithms and architectures to get the most bang for your buck.

研究界不断提出新的方法,以加快越来越大的LLM的推理速度。例如,一个有前途的研究方向是推测解码:其中"简单的token"由更小、更快的语言模型生成,只有"困难的token"才由LLM本身生成。更多细节超出了本文的范围,可以阅读这篇不错的博客文章。

GPT3/4、Llama-2-70b、Claude、PaLM等大型LLM在Hugging Face Chat或ChatGPT等聊天界面中运行如此之快,部分归功于上述精度、算法和架构方面的改进。展望未来,加速器如GPU、TPU等将变得更快,并允许更多内存,但仍然应始终确保使用最佳可用的算法和架构,以获得最大的性价比。
