Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

1 Introduction

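The memory consumption of the KV cache grows rapidly with model size and generation length, which sharply increases the pressure on device memory.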
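To make the growth concrete, the following is a minimal sketch of the usual back-of-the-envelope KV cache size calculation. The function name `kv_cache_bytes` and the 7B-class configuration (32 layers, 32 heads, head dimension 128, fp16) are illustrative assumptions, not values taken from these notes.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer, head and position."""
    # The leading factor 2 accounts for one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical 7B-class model stored in fp16 (2 bytes per element).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
total = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)

print(per_token / 2**20, "MiB per token")         # 0.5
print(total / 2**30, "GiB for 8 x 4096 tokens")   # 16.0
```

Under these assumed numbers the cache grows linearly in both sequence length and batch size, which is why long generations can exceed GPU memory even when the weights fit.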

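When memory usage exceeds GPU capacity, offloading is commonly used. This relieves memory pressure, but the limited PCIe bandwidth introduces a performance penalty. The goal is therefore to reduce the KV cache memory footprint while avoiding costly retraining or fine-tuning.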

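The figure above shows that not every attention module needs to attend to all tokens. If such structures can be identified and the cached vectors compressed accordingly, memory consumption can be reduced substantially and text generation accelerated.

The approach is to diagnose before compressing: a profiling algorithm first identifies the structural patterns, and the KV cache of each module is then constructed adaptively.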

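The study identifies four basic attention structures:

Some modules attend mainly to local context. For these, build a KV cache that evicts long-range context.

Some attend mainly to specific tokens or punctuation. For these, build a KV cache that retains only special tokens or punctuation.

Some have attention maps that are column-wise sparse. For these, discard the least frequently attended tokens.

Some attend broadly to all tokens. For these, use the standard KV cache.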
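The four structures above can be sketched as functions that return the set of cache positions a head would keep. This is an illustrative sketch only: the helper names (`keep_local`, `keep_special`, `keep_frequent`, `keep_all`), the window size, and the use of accumulated attention as the frequency measure are assumptions made for the example.

```python
def keep_local(n_tokens, window):
    """Local-context head: keep only the most recent `window` positions."""
    return set(range(max(0, n_tokens - window), n_tokens))


def keep_special(tokens, special):
    """Special-token / punctuation head: keep positions whose token is in `special`."""
    return {i for i, tok in enumerate(tokens) if tok in special}


def keep_frequent(attn_rows, budget):
    """Column-sparse head: keep the `budget` positions with the largest
    attention accumulated over all query rows (assumed frequency measure)."""
    n = max(len(row) for row in attn_rows)
    col_sum = [0.0] * n
    for row in attn_rows:
        for j, a in enumerate(row):
            col_sum[j] += a
    ranked = sorted(range(n), key=lambda j: col_sum[j], reverse=True)
    return set(ranked[:budget])


def keep_all(n_tokens):
    """Dense head: standard KV cache, nothing is evicted."""
    return set(range(n_tokens))


tokens = ["<s>", "The", "cat", ",", "sat", "."]
# Toy causal attention map for a 4-token prompt (row i attends to positions 0..i).
attn = [[1.0], [0.7, 0.3], [0.6, 0.1, 0.3], [0.5, 0.05, 0.05, 0.4]]

print(keep_local(len(tokens), window=2))
print(keep_special(tokens, {"<s>", ",", "."}))
print(keep_frequent(attn, budget=2))
print(keep_all(3))
```

Representing each policy as a set of retained positions keeps the memory cost explicit: it is simply the size of the returned set.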

2 Related Work

2.1 token dropping and kv cache compression

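For recurrent neural networks, skipping multiple tokens at a given time step.

Eliminating redundant words in BERT based on attention scores.

Adding pooling layers to the encoder modules to compress the input sequence.

Adding a token-selection task to BERT so that it learns to select the tokens that are critical to performance.

Designing a learnable threshold to detect unimportant tokens to prune.

Compressing prompts into a few special tokens to reduce memory pressure.

Using the accumulated attention score as the criterion for identifying important tokens.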

2.2 underlying structure of attention

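The self-attention heads of BERT have been analyzed with LRF and characterized into interpretable roles, one of which is to always attend to adjacent tokens.

Heads in the same layer can affect performance differently, and the importance of each head varies across tasks.

Some patterns have been identified in which heads focus their attention primarily on particular tokens.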

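3 Adaptive KV Cache Compression

FastGen: a two-stage algorithm for constructing an adaptive KV cache.

In the prompt-encoding stage, model profiling is performed to discern the behavior of the individual attention heads.

In the token-generation stage, the KV cache for each token is managed according to the selected compression policy, rather than simply appending new KV vectors for every new token.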
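The token-generation stage can be sketched as a per-head cache that is pruned after every decoding step. This is a simplified illustration, not FastGen's implementation: the class `HeadCache`, the three policy names, and the `kv=None` placeholder (standing in for the real key/value tensors) are all assumptions made for the example.

```python
class HeadCache:
    """KV cache for one attention head, pruned after every decoding step."""

    def __init__(self, policy, window=2, special=frozenset()):
        self.policy = policy      # "local", "special" or "full" (hypothetical names)
        self.window = window      # size of the local window, used by "local"
        self.special = special    # tokens that "special" always retains
        self.entries = []         # list of (position, token, kv) tuples

    def step(self, position, token, kv):
        """Add the new token's KV entry, then evict according to the policy."""
        self.entries.append((position, token, kv))
        if self.policy == "local":
            # Keep only the most recent `window` entries.
            self.entries = self.entries[-self.window:]
        elif self.policy == "special":
            # Keep only entries belonging to special tokens / punctuation.
            self.entries = [e for e in self.entries if e[1] in self.special]
        # "full": standard KV cache, nothing is evicted.
        return [e[0] for e in self.entries]


heads = {
    "local": HeadCache("local", window=2),
    "special": HeadCache("special", special=frozenset({"<s>", "."})),
    "full": HeadCache("full"),
}

for pos, tok in enumerate(["<s>", "Hello", "world", "."]):
    for cache in heads.values():
        cache.step(pos, tok, kv=None)  # kv would be the real key/value tensors

positions = {name: [e[0] for e in cache.entries] for name, cache in heads.items()}
print(positions)
```

In this sketch each head ends with a different number of cached positions, which is the source of the memory saving: only heads that need the full context pay for it.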

3.1 Model Profiling

Model profiling is carried out on the result of prompt encoding.

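Specifically, for a compression policy $C$, the corresponding KV cache compression is denoted $K_C, V_C = f(K, V, C)$. For an attention map $A = \mathrm{softmax}(QK^\top)$, we choose the optimal policy that recovers $A$ at a recovery ratio of $T$ with the minimum memory cost:

$$C^* = \arg\min_{C \in \mathcal{C}} \mathrm{CacheMemoryCost}(C) \quad \text{s.t.} \quad \left| A - \mathrm{softmax}(Q K_C^\top) \right| \le 1 - T \tag{1}$$

where $\mathcal{C}$ is the set of all feasible compression policies, $\mathrm{CacheMemoryCost}(C)$ is the target KV cache budget of the compression policy $C$, and $T$ is a predefined hyperparameter representing how closely the policy is expected to recover $A$.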

Model profiling algorithm:

Intrinsically, our method assumes that the structure of the attention map is stable across different attention heads at different positions. It is therefore sufficient to use only the encoded prompt to select a proper compression policy. It is worth noting that existing literature has provided theoretical justification for using solely encoded prompts to capture attention structures for the full contexts. In our study, we also verified this empirically, as elaborated in Section 4.
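The profiling step of Section 3.1 can be sketched as a search over candidate policies in order of increasing memory cost, stopping at the first one that recovers enough of the prompt's attention. This is a simplified illustration under stated assumptions: the recovery constraint is approximated by the mean fraction of attention mass that falls on retained positions, and the names `recovery_ratio`, `select_policy`, and the candidate costs are hypothetical.

```python
def recovery_ratio(attn_rows, keep_fn):
    """Mean fraction of each query's attention mass that lands on kept positions.

    Assumed proxy for the recovery constraint; keep_fn(i) returns the set of
    positions the policy retains when the query sits at position i.
    """
    ratios = []
    for i, row in enumerate(attn_rows):
        keep = keep_fn(i)
        kept_mass = sum(a for j, a in enumerate(row) if j in keep)
        ratios.append(kept_mass / sum(row))
    return sum(ratios) / len(ratios)


def select_policy(attn_rows, candidates, threshold):
    """Return the cheapest policy whose recovery ratio reaches `threshold`.

    candidates: list of (name, memory_cost, keep_fn) tuples.
    """
    for name, _cost, keep_fn in sorted(candidates, key=lambda c: c[1]):
        if recovery_ratio(attn_rows, keep_fn) >= threshold:
            return name
    return "full"  # fall back to the uncompressed cache


candidates = [
    ("special", 1, lambda i: {0}),                       # keep only the first token
    ("local", 2, lambda i: {max(0, i - 1), i}),          # window of two positions
    ("full", 4, lambda i: set(range(i + 1))),            # keep everything
]

# Toy causal attention maps from an encoded 4-token prompt.
local_head = [[1.0], [0.2, 0.8], [0.05, 0.15, 0.8], [0.0, 0.05, 0.15, 0.8]]
dense_head = [[1.0], [0.5, 0.5], [1 / 3, 1 / 3, 1 / 3], [0.25, 0.25, 0.25, 0.25]]

print(select_policy(local_head, candidates, threshold=0.95))
print(select_policy(dense_head, candidates, threshold=0.95))
```

Because only the prompt's attention map is inspected, the choice is made once per head before generation starts, which relies on the stability assumption stated above.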