[Efficient LLM Training] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection


Arachis_X


Article link: https://blog.csdn.net/Arachis_X/article/details/136590993


GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Paper link
 Code link
 New work from Yuandong Tian et al.: breaking the memory bottleneck to pre-train a 7B model on a single 4090

Abstract

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.

Training large language models (LLMs) poses a major memory challenge, mainly because the weights and optimizer states keep growing in size.

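Common memory-reduction methods such as low-rank adaptation (LoRA) add a trainable low-rank matrix to the frozen pre-trained weight in each layer, which reduces both the number of trainable parameters and the size of the optimizer states.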
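To make the LoRA description concrete, here is a minimal NumPy sketch of a frozen weight plus a trainable low-rank product. The shapes, the variable names, and the helper `effective_weight` are illustrative assumptions, and LoRA's scaling factor is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 128, 4                          # hypothetical layer shape and rank

w0 = rng.standard_normal((m, n))              # frozen pre-trained weight
lora_b = np.zeros((m, r))                     # trainable, zero init so training starts at w0
lora_a = rng.standard_normal((r, n)) * 0.01   # trainable


def effective_weight(w0, lora_b, lora_a):
    """Weight actually used in the forward pass: frozen part plus low-rank update."""
    return w0 + lora_b @ lora_a


trainable = lora_b.size + lora_a.size         # r * (m + n) = 768
full = w0.size                                # m * n = 8192
print(trainable, full)
```

Only `lora_b` and `lora_a` receive gradients and optimizer states, which is where the memory saving comes from. The update `lora_b @ lora_a` can never exceed rank `r`, which is the restriction to a low-rank subspace that the abstract criticizes.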

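However, these methods typically underperform full-rank training in both pre-training and fine-tuning, because they restrict the parameter search to a low-rank subspace and alter the training dynamics, and they may also require a full-rank warm start.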

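In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning while being more memory-efficient than common low-rank adaptation methods such as LoRA.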
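The abstract does not spell out the mechanics, so the following is only a rough sketch of the general idea as described in the GaLore paper: project the gradient into a rank-r subspace obtained from its SVD, keep the Adam moments in that small space, then project the update back onto the full weight. The toy loss, the shapes, and all function names are assumptions made for illustration. The projector is computed once here for simplicity, whereas the published method refreshes it periodically.

```python
import numpy as np


def compute_projector(grad, rank):
    """Top-`rank` left singular vectors of the gradient, shape (m, rank)."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]


def init_state(rank, n):
    """Adam moments live in the projected (rank, n) space, not (m, n)."""
    return {"step": 0, "m": np.zeros((rank, n)), "v": np.zeros((rank, n))}


def projected_adam_step(weight, grad, proj, state, lr=1e-2,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update whose moments are kept in the low-rank space."""
    low = proj.T @ grad                       # project down: (rank, n)
    state["step"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * low
    state["v"] = beta2 * state["v"] + (1 - beta2) * low ** 2
    m_hat = state["m"] / (1 - beta1 ** state["step"])
    v_hat = state["v"] / (1 - beta2 ** state["step"])
    update = m_hat / (np.sqrt(v_hat) + eps)   # still (rank, n)
    return weight - lr * (proj @ update)      # project back: (m, n)


# Toy problem (hypothetical): pull the weight toward a fixed target matrix.
rng = np.random.default_rng(0)
m, n, rank = 64, 128, 4
target = rng.standard_normal((m, n))
weight = np.zeros((m, n))


def loss(w):
    return 0.5 * np.sum((w - target) ** 2)


loss_before = loss(weight)
proj = compute_projector(weight - target, rank)   # computed once in this sketch
state = init_state(rank, n)
for _ in range(10):
    grad = weight - target                        # gradient of the toy loss
    weight = projected_adam_step(weight, grad, proj, state)
loss_after = loss(weight)
print(loss_before, loss_after)
```

The full `(m, n)` weight is still the thing being trained, so no low-rank reparameterization of the model is introduced. Only the optimizer state shrinks, from two `(m, n)` moment buffers to two `(rank, n)` buffers plus the `(m, rank)` projector.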

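Our method reduces optimizer-state memory by up to 65.5% while maintaining efficiency and performance, both when pre-training LLaMA 1B and 7B architectures on the C4 dataset with up to 19.7B tokens and when fine-tuning RoBERTa on GLUE tasks.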

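Compared with a BF16 baseline, our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%. Notably, we demonstrate for the first time that a 7B model can be pre-trained on a consumer GPU with 24GB of memory (e.g., NVIDIA RTX 4090) without model parallelism, checkpointing, or offloading strategies.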
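As a back-of-the-envelope illustration of where the optimizer-state savings come from, the sketch below counts Adam state elements for a single m x n weight matrix (m <= n) under full-rank training, LoRA, and GaLore, using the per-matrix formulas given in the GaLore paper. The 4096 x 4096 shape and rank 128 are hypothetical choices. The result is a per-matrix count and is not meant to reproduce the whole-model percentages quoted in the abstract, which also depend on the layers left unprojected and on the rank used.

```python
def adam_state_elems(m, n):
    """Full-rank Adam: first and second moments, each of size m * n."""
    return 2 * m * n


def lora_state_elems(m, n, r):
    """LoRA: Adam moments for the two adapters, (m, r) and (r, n)."""
    return 2 * m * r + 2 * n * r


def galore_state_elems(m, n, r):
    """GaLore (m <= n): one (m, r) projector plus two (r, n) moments."""
    assert m <= n, "formula assumes m <= n"
    return m * r + 2 * n * r


m, n, r = 4096, 4096, 128                     # hypothetical layer shape and rank
full = adam_state_elems(m, n)                 # 33554432
lora = lora_state_elems(m, n, r)              # 2097152
galore = galore_state_elems(m, n, r)          # 1572864
print(full, lora, galore, galore / full)      # last value: 0.046875
```

For this one matrix GaLore keeps fewer state elements than LoRA at the same rank, because it stores a single projector instead of moments for two adapter matrices.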