Computational Bottlenecks of Training Small-scale Large Language Models (Translation)

Original paper: https://arxiv.org/pdf/2410.19456

Computational Bottlenecks of Training Small-scale Large Language Models

Saleh Ashkboos*, Iman Mirzadeh, Keivan Alizadeh, Moin Nabi

Apple

saleh.ashkboos@inf.ethz.ch, fartash@apple.com

Abstract

While large language models (LLMs) dominate the AI landscape, Small-scale large Language Models (SLMs) are gaining attention due to cost and efficiency demands from consumers. However, there is limited research on the training behavior and computational requirements of SLMs. In this study, we explore the computational bottlenecks of training SLMs (up to 2B parameters) by examining the effects of various hyperparameters and configurations, including GPU type, batch size, model size, communication protocol, attention type, and the number of GPUs. We assess these factors on popular cloud services using metrics such as loss per dollar and tokens per second. Our findings aim to support the broader adoption and optimization of language model training for low-resource AI research institutes.

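As a rough, illustrative sketch (not code from the paper), the two reported metrics can be derived from a run's token count, wall-clock time, loss change, and the provider's GPU price. One possible reading of "loss per dollar" is the loss improvement per dollar spent; that interpretation, the per-GPU-hour billing model, and every number below are assumptions made for the example.

```python
# Illustrative sketch: deriving tokens-per-second and loss-per-dollar
# from a training run. Billing model and all numbers are assumptions.

def tokens_per_second(tokens_processed: int, wall_clock_seconds: float) -> float:
    """Aggregate training throughput across all GPUs."""
    return tokens_processed / wall_clock_seconds

def run_cost_dollars(wall_clock_seconds: float, num_gpus: int,
                     price_per_gpu_hour: float) -> float:
    """Cloud cost of the run, assuming per-GPU hourly billing."""
    return (wall_clock_seconds / 3600.0) * num_gpus * price_per_gpu_hour

def loss_per_dollar(loss_reduction: float, cost_dollars: float) -> float:
    """Loss improvement obtained for each dollar spent."""
    return loss_reduction / cost_dollars

# Example: 8 GPUs process 1e9 tokens in 2 hours at $2 per GPU-hour,
# while the training loss drops from 4.0 to 3.2 over that window.
seconds = 2 * 3600
tps = tokens_per_second(1_000_000_000, seconds)
cost = run_cost_dollars(seconds, num_gpus=8, price_per_gpu_hour=2.0)
print(f"{tps:,.0f} tokens/s, ${cost:.2f} total, "
      f"{loss_per_dollar(4.0 - 3.2, cost):.4f} loss reduction per dollar")
```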

1 Introduction

Large Language Models (LLMs) are becoming increasingly popular in various fields due to their performance on a variety of tasks [6, 18, 8, 20, 5]. However, deploying large models widely, such as on mobile hardware and edge devices, is challenging due to their large memory and compute requirements [11, 12, 10]. These constraints have driven a growing interest in smaller language models (e.g., ≤ 2B parameters) as a viable alternative [24, 16, 23]. Recent work refers to these models as Small-scale large Language Models (SLMs); they work well in environments where cost-efficiency and resource limitations are significant concerns, as well as on servers where the reduced cost of inference is a dominant factor in attracting and retaining customers.

SLMs have demonstrated substantial potential in achieving competitive results despite their smaller size. Techniques such as pruning, distillation, and quantization have been employed to enhance their performance [2, 3, 17].
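
As a loose illustration of one of these techniques (not the specific methods used in the cited works), the sketch below applies per-tensor symmetric int8 post-training quantization to a weight matrix with NumPy; the matrix size and bit-width are assumptions for the example.

```python
# Illustrative sketch of symmetric int8 weight quantization, one of the
# compression techniques mentioned above (not the cited works' methods).
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float weights to int8 using a single per-tensor scale."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs reconstruction error:", float(np.abs(w - dequantize(q, scale)).max()))
```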
