SnapKV: LLM Knows What You are Looking for Before Generation (a training-free compression method for ultra-long contexts)

Article link: https://blog.csdn.net/weixin_32759777/article/details/138216043

Paper

https://arxiv.org/pdf/2404.14469

Core idea

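This paper introduces SnapKV, a new method for improving the efficiency and memory usage of large language models when they process long contexts. Its main contributions are:
1. Experiments on the patterns of attention features during output generation, which find that attention allocation is consistent and can therefore be used to pick out the important information.
2. The SnapKV algorithm, which uses an observation window and a voting mechanism to select the important key-value pairs for each attention head, and applies pooling for fine-grained clustering.
3. An evaluation of SnapKV on multiple models and datasets, showing that it can substantially compress the key-value cache and speed up decoding while preserving model performance.
In short, SnapKV offers an efficient way to compress the key-value cache for long inputs, lowering memory and compute cost while maintaining generation quality.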
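The selection step in contribution 2 can be illustrated with a small, dependency-free toy for a single attention head. This is only a sketch under stated assumptions, not the paper's implementation: the function name `snapkv_select` and its parameters are hypothetical, the attention weights are passed in as plain lists, and the pooling is a stride-1 average pool with zero padding. The torch listing that follows works on batched tensors instead.

```python
def snapkv_select(window_attn, prefix_len, capacity, kernel_size=5):
    """Toy SnapKV-style selection for ONE head (hypothetical helper).

    window_attn: one row per observation-window query; each row holds the
                 attention weights that query puts on the prefix tokens.
    prefix_len:  number of prompt tokens that precede the observation window.
    capacity:    number of prefix positions to keep.
    Returns the kept positions: selected prefix positions plus the window.
    """
    window_size = len(window_attn)

    # 1) Voting: total attention each prefix position gets from the window.
    votes = [sum(row[j] for row in window_attn) for j in range(prefix_len)]

    # 2) Pooling: stride-1 average pool with zero padding, so neighbours of
    #    a strongly attended position are kept together with it.
    half = kernel_size // 2
    pooled = []
    for j in range(prefix_len):
        lo, hi = max(0, j - half), min(prefix_len, j + half + 1)
        pooled.append(sum(votes[lo:hi]) / kernel_size)

    # 3) Top-k: keep the `capacity` highest-scoring prefix positions.
    ranked = sorted(range(prefix_len), key=lambda j: pooled[j], reverse=True)
    kept_prefix = sorted(ranked[:capacity])

    # The observation window itself is always kept.
    window = list(range(prefix_len, prefix_len + window_size))
    return kept_prefix + window


# Usage: 8 prefix tokens, a 2-token observation window, keep 3 prefix slots.
attn = [
    [0.05, 0.05, 0.05, 0.05, 0.10, 0.50, 0.10, 0.10],
    [0.10, 0.05, 0.05, 0.05, 0.10, 0.45, 0.10, 0.10],
]
kept = snapkv_select(attn, prefix_len=8, capacity=3, kernel_size=3)
print(kept)
```

Both window queries attend mostly to position 5, and pooling pulls in its neighbours, so the kept prefix slots cluster around that position rather than being isolated tokens.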

import torch
import time
import torch.nn.functional as F
import torch.nn as nn
import math

# perform qk calculation and get indices
# this version will not update in inference mode

# Copied from transformers.models.llama.modeling_llama.repeat_kv
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)

class KVCluster():
    def __init__(self, window_size = 64, max_capacity_prompt = 256 + 64, kernel_size = 5, pooling = 'avgpool'):
        self.window_size