Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
2023.07, ACL 2023 Findings
Paper link
Code link
Paper notes (LLM + distillation): Distilling step-by-step + code analysis
Abstract
Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) achieves so by leveraging less training data needed by finetuning or distillation. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with much fewer labeled/unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of available data on a benchmark, whereas standard finetuning the same T5 model struggles to match even by using 100% of the dataset. We release the code at: https://github.com/google-research/distilling-step-by-step
Deploying large language models (LLMs) is challenging because they are memory-inefficient and compute-intensive in practical applications.
In response, researchers train smaller task-specific models, either by finetuning with human labels or by distilling with LLM-generated labels.
However, both finetuning and distillation require large amounts of training data to reach performance comparable to LLMs.
We introduce Distilling step-by-step, a new mechanism that
- (a) trains smaller models that outperform LLMs, and
- (b) does so using less training data than finetuning or distillation requires.
Our method extracts LLM-generated rationales and uses them as additional supervision for training small models within a multi-task framework.
We report three findings across 4 NLP benchmarks:
- First, compared with both finetuning and distillation, our mechanism achieves better performance with far fewer labeled/unlabeled training examples.
- Second, compared with few-shot prompted LLMs, we achieve better performance with substantially smaller model sizes.
- Third, we reduce both the model size and the amount of data needed to outperform LLMs: our finetuned 770M T5 model surpasses the few-shot prompted 540B PaLM model using only 80% of the available data on a benchmark, whereas the same T5 model with standard finetuning struggles to match it even with 100% of the dataset.
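The multi-task framework described above trains one small model on two tasks that share weights: predicting the task label and generating the teacher LLM's rationale. A minimal sketch of that objective, assuming a simple weighted sum of two cross-entropy terms (the function names, toy probabilities, and `rationale_weight` parameter are illustrative assumptions, not the repo's actual API):

```python
import math

def cross_entropy(predicted_probs, target_index):
    """Negative log-likelihood of the target token/class."""
    return -math.log(predicted_probs[target_index])

def multitask_loss(label_probs, label_target,
                   rationale_probs, rationale_targets,
                   rationale_weight=1.0):
    """Sketch of L = L_label + lambda * L_rationale.

    label_probs: model's probability distribution over labels.
    rationale_probs: per-token distributions for the rationale
    sequence produced by the teacher LLM (rationale_targets are
    the teacher's token indices).
    """
    label_loss = cross_entropy(label_probs, label_target)
    # Average token-level cross-entropy over the rationale sequence.
    rationale_loss = sum(
        cross_entropy(p, t)
        for p, t in zip(rationale_probs, rationale_targets)
    ) / len(rationale_targets)
    return label_loss + rationale_weight * rationale_loss
```

At inference time only the label-prediction task is used, so the rationale branch adds supervision during training without any extra cost at deployment; setting `rationale_weight=0` recovers plain finetuning/distillation on labels alone.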
Results