LLMs之LLaMA:《LLaMA: Open and Efficient Foundation Language Models》翻译与解读



导读:该论文提出了开放发布的大规模语言模型LLaMA,使用2048块A100-80GB GPU训练约21天。该模型有以下几个核心技术点:
>> 模型架构=Transformer+集合多项成熟技术(RMSNorm+SwiGLU+RoPE+AdamW+xformers库+余弦学习率调度):LLaMA采用类似GPT的解码器式Transformer架构,并叠加了多项优化,特别是对每个子层的输入做RMSNorm预归一化以提升训练稳定性;不同规模的模型深度从32层(7B)到80层(65B)不等,能够学习更复杂的语言表示。

(1)、集合多个算法的优秀技术:预归一化函数RMSNorm、激活函数SwiGLU、旋转位置嵌入RoPE、AdamW优化器、高效的因果多头注意力xformers库加速。

(2)、余弦学习率调度:LLaMA使用带预热的余弦学习率调度,即先用2000步线性预热把学习率升到峰值,随后按余弦曲线衰减到峰值的10%。这有助于训练初期的稳定和后期的收敛。

>> 训练数据约4.7TB+BPE分词(约1.4万亿个tokens)—更多tokens+较小模型=可获得较好性能:LLaMA只使用公开数据集进行训练,包括英语CommonCrawl、C4、GitHub、Wikipedia、Gutenberg+Books3、ArXiv、Stack Exchange,各子集磁盘大小合计约4.7TB。使用SentencePiece实现的字节对编码(BPE)算法对数据进行分词,分词后约有1.4万亿个tokens(1.4T)。Chinchilla 论文推荐用约 200B(0.2T)tokens 训练 10B 规模的模型,而 LLaMA 用 1T tokens 训练 7B 模型(33B/65B 用 1.4T tokens),并观察到增大 tokens 规模后模型性能仍在持续上升。

>> LLaMA包含7B/13B/33B/65B参数的基础语言模型集合—LLaMA-13B 以约1/10的参数量在多数基准上优于 GPT-3(175B):这是一组参数量从7B到65B的基础语言模型。我们使用数万亿个标记训练这些模型,并展示了仅用公开可用的数据集、不依赖专有和不可获取的数据,也能训练出最先进的模型。在多项语言建模和下游任务的 benchmark 上,LLaMA-13B 优于 GPT-3(175B),且推理成本明显更低;LLaMA-65B 与最好的模型 Chinchilla-70B 和 PaLM-540B 具有相当的竞争力。

目录

相关论文

LLMs之LLaMA:《LLaMA: Open and Efficient Foundation Language Models》翻译与解读

LLMs之Alpaca:《Alpaca: A Strong, Replicable Instruction-Following Model》翻译与解读

LLMs:《Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca-4月17日版》翻译与解读

LLMs:《Efficient And Effective Text Encoding For Chinese Llama And Alpaca—6月15日版本》翻译与解读

《LLaMA: Open and Efficient Foundation Language Models》翻译与解读

介绍LLaMA:一款基础性的、拥有650亿参数的大型语言模型

Abstract

1、Introduction

LLMs的能力和发展趋势、模型规模与性能的关系

缩放法则与推理预算的忽视

LLaMA模型的提出

内容概述:模型改进、性能比较、模型的偏见和毒性问题

2、Approach

训练方法概述:

2.1、Pre-training Data

Table 1: Pre-training data. Data mixtures used for pre-training, for each subset we list the sampling proportion, number of epochs performed on the subset when training on 1.4T tokens, and disk size. The pre-training runs on 1T tokens have the same sampling proportion.预训练数据。用于预训练的数据混合,对于每个子集,我们列出了采样比例、在1.4T标记上训练时对该子集执行的epoch数以及磁盘大小。在1T标记上的预训练运行使用相同的采样比例。

预训练数据:

分词器:采用字节对编码(BPE)算法并借助SentencePiece工具库实现

整体训练数据集及其处理方式:分词后包含大约1.4万亿个标记(1.4T),仅对维基百科和图书领域的数据进行了大约2个epoch的迭代训练

2.2、Architecture

Table 2: Model sizes, architectures, and optimization hyper-parameters.模型大小、架构和优化超参数。

模型架构:

2.3    Optimizer

Figure 1: Training loss over train tokens for the 7B, 13B, 33B, and 65B models. LLaMA-33B and LLaMA-65B were trained on 1.4T tokens. The smaller models were trained on 1.0T tokens. All models are trained with a batch size of 4M tokens.——7B、13B、33B和65B模型在训练标记上的训练损失。LLaMA-33B和LLaMA-65B在1.4T标记上进行训练。较小的模型在1.0T标记上进行训练。所有模型的批次大小均为4M标记。

优化器:

2.4   Efficient implementation高效实现

Table 3: Zero-shot performance on Common Sense Reasoning tasks.常识推理任务上的零样本性能。

高效实现:

性能和训练速度:采用了2048个A100-80GB GPU,训练1.4T标记的数据集,耗时21天

3    Main results主要结果

任务类型和基准测试:

与其他模型的比较:

总结:常识推理、封闭书籍问答、阅读理解、数学推理、代码生成等任务的性能评估:

Table 4: NaturalQuestions. Exact match performance.

3.1 Common Sense Reasoning常识推理

3.2 Closed-book Question Answering闭书式问答

Table 5: TriviaQA. Zero-shot and few-shot exact match performance on the filtered dev set.

3.3 Reading Comprehension阅读理解

Table 6: Reading Comprehension. Zero-shot accuracy.

3.4   Mathematical reasoning数学推理

Table 7: Model performance on quantitative reasoning datasets. For majority voting, we use the same setup as Minerva, with k = 256 samples for MATH and k = 100 for GSM8k (Minerva 540B uses k = 64 for MATH and k = 40 for GSM8k). LLaMA-65B outperforms Minerva 62B on GSM8k, although it has not been fine-tuned on mathematical data.

3.5 Code generation代码生成

Table 8: Model performance for code generation. We report the pass@ score on HumanEval and MBPP. HumanEval generations are done in zero-shot and MBPP with 3-shot prompts similar to Austin et al. (2021). The values marked with ∗ are read from figures in Chowdhery et al. (2022).代码生成的模型性能。我们报告了在HumanEval和MBPP上的pass@分数。HumanEval的生成在零样本设置下进行,MBPP使用与Austin et al. (2021)类似的3-shot提示。标有∗的值是从Chowdhery et al. (2022)的图中读取的。

3.6 Massive Multitask Language Understanding大规模多任务语言理解

3.7 Evolution of performance during training训练期间性能的演变

表现的追踪情况

Table 9: Massive Multitask Language Understanding (MMLU). Five-shot accuracy.大规模多任务语言理解(MMLU)。5-shot准确率。

4 Instruction Finetuning指令微调

指导数据微调:

微调实验和结果:

Table 10: Instruction finetuning – MMLU (5-shot). Comparison of models of moderate size with and with-out instruction finetuning on MMLU.

Figure 2: Evolution of performance on question answering and common sense reasoning during training.在训练过程中的问答和常识推理性能演变。

5 Bias, Toxicity and Misinformation偏见、有害内容和虚假信息

大型语言模型可能面临的偏见、有毒性和虚假信息生成的问题,通过多个基准测试展示了LLaMA-65B在这些方面的表现。

5.1 Real Toxicity Prompts

Table 11: RealToxicityPrompts. We run a greedy decoder on the 100k prompts from this benchmark. The “respectful” versions are prompts starting with “Complete the following sentence in a polite, respectful, and unbiased manner:”, and “Basic” is without it. Scores were obtained using the PerspectiveAPI, with higher score indicating more toxic generations.RealToxicityPrompts。我们对该基准测试的10万个提示运行贪婪解码器。"尊重"版本是以"以礼貌、尊重和公正的方式完成以下句子:"开头的提示,"基本"版本则没有该前缀。得分通过PerspectiveAPI获得,得分越高表示生成内容越有毒。

5.2 CrowS-Pairs

Table 12: CrowS-Pairs. We compare the level of bi-ases contained in LLaMA-65B with OPT-175B and GPT3-175B. Higher score indicates higher bias.

5.3 WinoGender

5.4 TruthfulQA

6 Carbon footprint碳足迹

模型训练对环境的能源和碳足迹的影响

7 Related work相关工作

Table 15: Carbon footprint of training different models in the same data center. We follow Wu et al. (2022) to compute carbon emission of training OPT, BLOOM and our models in the same data center. For the power consumption of a A100-80GB, we take the thermal design power for NVLink systems, that is 400W. We take a PUE of 1.1 and a carbon intensity factor set at the national US average of 0.385 kg CO2e per KWh.在同一数据中心训练不同模型的碳足迹。我们遵循Wu等人(2022)的方法,在同一数据中心计算OPT、BLOOM和我们模型的碳排放量。对于A100-80GB的功耗,我们采用NVLink系统的热设计功耗,即400W。我们采用PUE值为1.1,碳强度因子设定为美国国家平均值,即每千瓦时0.385千克CO2e。

语言模型定义、语言模型历史、规模扩展、规模对性能的影响

8 Conclusion结论

概括了论文的主要贡献和观察结果

Acknowledgements致谢


相关论文

LLMs之LLaMA:《LLaMA: Open and Efficient Foundation Language Models》翻译与解读

AIGC之LLaMA:《LLaMA: Open and Efficient Foundation Language Models》翻译与解读_ai自然语言处理_一个处女座的程序猿的博客-CSDN博客

LLMs之Alpaca:《Alpaca: A Strong, Replicable Instruction-Following Model》翻译与解读

https://yunyaniu.blog.csdn.net/article/details/129775107

LLMs:《Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca-4月17日版》翻译与解读

https://yunyaniu.blog.csdn.net/article/details/130998087

LLMs:《Efficient And Effective Text Encoding For Chinese Llama And Alpaca—6月15日版本》翻译与解读

https://yunyaniu.blog.csdn.net/article/details/131318974

《LLaMA: Open and Efficient Foundation Language Models》翻译与解读

地址

论文:https://arxiv.org/abs/2302.13971

作者

Hugo Touvron, Thibaut Lavril∗, Gautier Izacard∗, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal

Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave∗, Guillaume Lample∗

Meta AI

时间

2023年2月25日

介绍LLaMA:一款基础性的、拥有650亿参数的大型语言模型

时间:2023年2月24日
原文地址:https://ai.meta.com/blog/large-language-model-llama-meta-ai/

作为Meta对开放科学的承诺的一部分,今天我们公开发布LLaMA(Large Language Model Meta AI),这是一款先进的基础性大型语言模型,旨在帮助研究人员推动人工智能的这一子领域的工作。像LLaMA这样的更小、更高性能的模型使研究社区中那些无法访问大量基础设施的人能够研究这些模型,进一步使这一重要而快速变化的领域的访问更加民主化

在大型语言模型领域,训练像LLaMA这样较小的基础模型是可取的,因为测试新方法、验证他人的工作以及探索新的用例所需的算力和资源要少得多。基础模型在大量未标注数据上进行训练,这使它们非常适合针对各种任务进行微调。我们提供了多个规模的LLaMA(7B、13B、33B和65B参数),还分享了LLaMA模型卡片,详细说明了我们构建模型的方法,符合我们负责任人工智能实践的理念。

在过去的一年中,拥有数十亿参数的大型语言模型,即自然语言处理(NLP)系统,展示了生成创造性文本、解决数学定理、预测蛋白质结构、回答阅读理解问题等方面的新能力。它们是AI能够在全球数十亿人中规模提供实质性潜在利益的最清晰案例之一。

尽管大型语言模型近期取得了许多进展,但由于训练和运行这样大型模型所需的资源,对它们的全面研究访问仍然有限。受限的访问限制了研究人员理解这些大型语言模型是如何工作的,阻碍了改善其稳健性并缓解已知问题(如偏见、毒性和生成误导信息的潜在风险)的努力。

在更多标记上训练的较小模型更容易针对特定的潜在产品用例进行重新训练和微调。我们在1.4万亿个标记上训练了LLaMA 65B和LLaMA 33B,最小的LLaMA 7B则在1万亿个标记上训练。

像其他大型语言模型一样,LLaMA以一串单词作为输入,通过反复预测下一个单词来生成文本。为了训练模型,我们选择了使用人数最多的20种语言的文本,重点关注使用拉丁字母和西里尔字母的语言。

在解决大型语言模型中的偏见、有毒评论和幻觉风险方面,仍需要进行更多的研究。像其他模型一样,LLaMA也面临这些挑战。作为基础模型,LLaMA设计为灵活多用途,可以应用于许多不同的用例,而不是为特定任务设计的微调模型。通过分享LLaMA的代码,其他研究人员可以更容易地测试在大型语言模型中限制或消除这些问题的新方法。我们还在论文中提供了一系列评估,评估模型的偏见和毒性,以展示模型的局限性,并支持在这一关键领域进行进一步研究。

为了保持完整性并防止滥用,我们将以专注于研究用途的非商业许可发布我们的模型。对该模型的访问将根据个案授予学术研究人员、与政府、公民社会和学术界组织有关的人员以及全球工业研究实验室。有兴趣申请访问的人可以在我们的研究论文中找到申请链接。

我们认为整个人工智能社区 — 学术研究人员、公民社会、决策者和行业 — 必须共同努力制定关于负责任AI的明确指南,特别是关于负责任的大型语言模型。我们期待看到社区能够通过使用LLaMA学到什么 — 最终建立什么。

Abstract

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

我们介绍了LLaMA,这是一组参数范围从7B到65B的基础语言模型。我们使用数万亿个标记来训练我们的模型,并展示了可以仅使用公开可用的数据集进行训练,而不需要专有和不可访问的数据集来训练最先进的模型。特别是,LLaMA-13B在大多数基准测试中表现优于GPT-3(175B),LLaMA-65B与最好的模型Chinchilla-70B和PaLM-540B具有竞争力。我们将所有模型发布给研究社区。

1、Introduction

LLMs的能力和发展趋势、模型规模与性能的关系

  • LLMs,如GPT-3,通过在庞大的文本语料库上进行训练,展示了在从文本指令或Few-shot示例中执行新任务的能力
  • Few-shot特性是通过将模型扩大到足够大的规模后首次出现的,这导致了进一步扩大模型规模的研究方向
  • 之前的研究假设更多的参数将导致更好的性能,因此致力于进一步扩大模型规模。
  • 最佳性能并非由最大的模型实现,而是由在更多数据上训练的较小模型实现

Large Language Models (LLMs) trained on massive corpora of texts have shown their ability to perform new tasks from textual instructions or from a few examples (Brown et al., 2020). These few-shot properties first appeared when scaling models to a sufficient size (Kaplan et al., 2020), resulting in a line of work that focuses on further scaling these models (Chowdhery et al., 2022; Rae et al., 2021). These efforts are based on the assumption that more parameters will lead to better performance. However, recent work from Hoffmann et al. (2022) shows that, for a given compute budget, the best performances are not achieved by the largest models, but by smaller models trained on more data.

大型语言模型(LLMs)在大规模文本语料库上训练后展现了它们根据文本指令或少量示例执行新任务的能力(Brown等,2020年)。这种少样本特性首次出现在将模型扩展到足够大的规模时(Kaplan等,2020年),随后有了一系列进一步扩展这些模型的工作(Chowdhery等,2022年;Rae等,2021年)。这些努力是基于一个假设,即更多的参数将导致更好的性能。然而,Hoffmann等人(2022年)的最新研究表明,在给定的计算预算下,最佳性能不是由最大的模型实现的,而是由更小的模型在更多数据上进行训练的模型实现的。

缩放法则与推理预算的忽视

  • Hoffmann等人的缩放法则旨在确定如何在特定训练计算预算下最佳缩放数据集和模型大小。
  • 但这一目标忽略了推理预算,而在大规模提供语言模型时,推理预算变得至关重要

The objective of the scaling laws from Hoffmann et al. (2022) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.

Hoffmann等人(2022年)的扩展定律的目标是确定如何在特定的训练计算预算下最佳地扩展数据集和模型大小。然而,这个目标忽视了推理预算,在大规模使用语言模型时变得至关重要。在这种情况下,给定目标性能水平,首选的模型不是训练最快的模型,而是推理最快的模型,尽管训练一个大型模型以达到一定的性能水平可能更便宜,但训练时间更长的较小模型在推理阶段最终更经济。例如,尽管Hoffmann等人(2022年)建议在200B个标记上训练一个10B模型,但我们发现7B模型的性能在训练1T个标记后仍在改善。

LLaMA模型的提出

  • 为了在各种推理预算下实现最佳性能,作者提出了一系列名为LLaMA的语言模型,其参数范围从7B到65B。
  • LLaMA模型在性能上与最佳的LLMs相媲美,例如LLaMA-13B在大多数基准测试上优于GPT-3,尽管规模小了10倍
  • 数据来源公开性:与Chinchilla、PaLM或GPT-3不同,LLaMA仅使用公开可用的数据,使其与开源兼容。
  • 模型开源性:大多数现有模型依赖于不公开或未记录的数据,而LLaMA的公开数据使用更具开源性。

The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used. The resulting models, called LLaMA, range from 7B to 65B parameters with competitive performance compared to the best existing LLMs. For instance, LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10× smaller. We believe that this model will help democratize the access and study of LLMs, since it can be run on a single GPU. At the higher-end of the scale, our 65B-parameter model is also competitive with the best large language models such as Chinchilla or PaLM-540B.

Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g. “Books – 2TB” or “Social media conversations”). There exist some exceptions, notably OPT (Zhang et al., 2022), GPT-NeoX (Black et al., 2022), BLOOM (Scao et al., 2022) and GLM (Zeng et al., 2022), but none that are competitive with PaLM-62B or Chinchilla.

本文的重点是训练一系列在不同推理预算下都能达到最佳性能的语言模型,方法是使用比通常更多的标记进行训练。得到的模型称为LLaMA,参数量从7B到65B,与现有最好的LLM相比具有竞争力。例如,LLaMA-13B在大多数基准测试中优于GPT-3,尽管体积只有其约1/10。我们相信这个模型将有助于让LLM的获取和研究更加大众化,因为它可以在单个GPU上运行。在更大规模的一端,我们的65B参数模型也能与最好的大型语言模型(如Chinchilla或PaLM-540B)相竞争。

与Chinchilla、PaLM或GPT-3不同,我们只使用公开可用的数据,使我们的工作与开源兼容,而大多数现有模型依赖于不公开可用或未经记录的数据(例如,“Books – 2TB”或“Social media conversations”)。也存在一些例外,例如OPT(Zhang等,2022年),GPT-NeoX(Black等,2022年),BLOOM(Scao等,2022年)和GLM(Zeng等,2022年),但没有一个能与PaLM-62B或Chinchilla相竞争

内容概述:模型改进、性能比较、模型的偏见和毒性问题

  • 文章介绍了对Transformer架构(Vaswani等人,2017)的修改以及训练方法。

In the rest of this paper, we present an overview of the modifications we made to the transformer architecture (Vaswani et al., 2017), as well as our training method. We then report the performance of our models and compare with other LLMs on a set of standard benchmarks. Finally, we expose some of the biases and toxicity encoded in our models, using some of the most recent benchmarks from the responsible AI community.

在本文的其余部分,我们将概述我们对Transformer架构(Vaswani等,2017年)所做的修改以及我们的训练方法。然后,我们将报告我们模型的性能,并与其他LLM在一系列标准基准测试中进行比较。最后,我们使用最近的一些负责任的AI社区的基准测试揭示了我们的模型中编码的一些偏见和有害信息。

2、Approach

训练方法概述:

  • 作者的训练方法与之前的工作相似,受到Chinchilla缩放法则的启发。
  • 使用大型transformers在大量文本数据上进行训练,采用标准优化器。

Our training approach is similar to the methods described in previous work (Brown et al., 2020; Chowdhery et al., 2022), and is inspired by the Chinchilla scaling laws (Hoffmann et al., 2022). We train large transformers on a large quantity of textual data using a standard optimizer.

我们的训练方法类似于先前的工作(Brown等,2020年;Chowdhery等,2022年),并受到了Chinchilla扩展定律的启发(Hoffmann等,2022年)。我们使用标准优化器在大量文本数据上训练大型Transformer模型

2.1、Pre-training Data

Table 1: Pre-training data. Data mixtures used for pre-training, for each subset we list the sampling proportion, number of epochs performed on the subset when training on 1.4T tokens, and disk size. The pre-training runs on 1T tokens have the same sampling proportion.预训练数据。用于预训练的数据混合,对于每个子集,我们列出了采样比例、在1.4T标记上训练时对该子集执行的epoch数以及磁盘大小。在1T标记上的预训练运行使用相同的采样比例。

预训练数据:

  • 数据集由多个来源组成,包括CommonCrawl、C4、GitHub、Wikipedia、Gutenberg、Books3、ArXiv、Stack Exchange等,涵盖多个领域。
  • 数据经过预处理,包括去重、语言识别、质量过滤等步骤,确保只使用公开可用、与开源兼容的数据。
  • 整个训练数据集包含大约1.4T个标记。

Our training dataset is a mixture of several sources, reported in Table 1, that cover a diverse set of domains. For the most part, we reuse data sources that have been leveraged to train other LLMs, with the restriction of only using data that is publicly available, and compatible with open sourcing. This leads to the following mixture of data and the percentage they represent in the training set:

我们的训练数据集是多个来源的混合物,详见表格1,涵盖了各种领域。在很大程度上,我们重新使用了用于训练其他LLM的数据源,但限制是只使用公开可用的数据,并且与开源兼容。这导致了以下混合数据及其在训练集中所代表的百分比:

English CommonCrawl [67%]. We preprocess five CommonCrawl  dumps,  ranging  from  2017 to 2020, with the CCNet pipeline (Wenzek et al., 2020). This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content with an n- gram language model. In addition, we trained a linear model to classify pages used as references in Wikipedia v.s. randomly sampled pages, and discarded pages not classified as references.

C4 [15%]. During exploratory experiments, we observed that using diverse pre-processed Com- monCrawl datasets improves performance. We thus included the publicly available C4 dataset (Raffel et al., 2020) in our data. The preprocessing of C4 also contains deduplication and language identifi- cation steps: the main difference with CCNet is the quality filtering, which mostly relies on heuris- tics such as presence of punctuation marks or the number of words and sentences in a webpage.

英语CommonCrawl [67%]。我们对五个CommonCrawl数据转储进行预处理,时间跨度从2017年到2020年,使用CCNet流程(Wenzek等,2020年)。该过程在行级别进行数据去重,使用fastText线性分类器进行语言识别以去除非英语页面,并使用n-gram语言模型过滤低质量内容。此外,我们训练了一个线性模型,用于对维基百科中用作参考的页面与随机抽样页面进行分类,并丢弃未被分类为参考文献的页面

C4 [15%]。在探索性实验中,我们观察到使用多样的预处理CommonCrawl数据集可以提高性能。因此,我们在我们的数据中包括了公开可用的C4数据集(Raffel等,2020年)。C4的预处理也包括去重和语言识别步骤:与CCNet的主要区别在于质量过滤,主要依靠标点符号的存在或网页中的单词和句子数量。

Github [4.5%]. We use the public GitHub dataset available on Google BigQuery. We only kept projects that are distributed under the Apache, BSD and MIT licenses. Additionally, we filtered low quality files with heuristics based on the line length or proportion of alphanumeric characters, and removed boilerplate, such as headers, with reg- ular expressions. Finally, we deduplicate the result- ing dataset at the file level, with exact matches.

Wikipedia [4.5%]. We add Wikipedia  dumps from the June-August 2022 period, covering 20 languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. We process the data to remove hyperlinks, comments and other formatting boilerplate.

Gutenberg and Books3  [4.5%].  We  include two book corpora in our training dataset: the Guten- berg Project, which contains books that are in the public domain, and the Books3 section of TheP- ile (Gao et al., 2020), a publicly available dataset for training large language models. We perform deduplication at the book level, removing books with more than 90% content overlap.

GitHub [4.5%]。我们使用Google BigQuery上公开可用的GitHub数据集。我们只保留按Apache、BSD和MIT许可证分发的项目。此外,我们使用基于行长度或包含字母数字字符比例的启发式方法过滤低质量文件,并使用正则表达式删除诸如标题之类的样板文件。最后,我们使用完全匹配在文件级别进行数据去重。

Wikipedia [4.5%]。我们添加了2022年6月至8月期间的维基百科转储,涵盖20种使用拉丁字母或西里尔字母的语言:bg、ca、cs、da、de、en、es、fr、hr、hu、it、nl、pl、pt、ro、ru、sl、sr、sv、uk。我们对数据进行处理,删除超链接、注释和其他格式样板

Gutenberg和Books3 [4.5%]。我们的训练数据集中包括两个图书语料库:Guten- berg计划中包含的公共领域图书,以及ThePile(Gao等,2020年)的Books3部分,这是一个用于训练大型语言模型的公开可用数据集。我们对书籍进行了去重处理,删除了内容重叠超过90%的书籍

ArXiv  [2.5%].   We  process  arXiv  Latex  files to add scientific data to our dataset. Following Lewkowycz et al. (2022), we removed everything before the first section, as well as the bibliography. We also removed the comments from the .tex files, and inline-expanded definitions and macros written by users to increase consistency across papers.

Stack Exchange [2%]. We include a dump of Stack Exchange, a website of high quality ques- tions and answers that covers a diverse set of do- mains, ranging from computer science to chemistry. We kept the data from the 28 largest websites, re- moved the HTML tags from text and sorted the answers by score (from highest to lowest).

ArXiv [2.5%]。我们处理arXiv的LaTeX文件,以向我们的数据集添加科学数据。根据Lewkowycz等(2022年)的方法,我们删除了第一节之前的所有内容以及参考文献部分。我们还从.tex文件中删除了注释,并对用户编写的内联扩展定义和宏进行了展开,以增加论文之间的一致性。

Stack Exchange [2%]。我们包括Stack Exchange的转储,这是一个高质量问题和回答的网站,涵盖了从计算机科学到化学的各种领域。我们保留了最大的28个网站的数据,从文本中删除了HTML标记,并按得分(从高到低)对答案进行排序。
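下面给出一个极简的示意代码(并非论文的官方实现),演示如何按上文列出的采样比例(67%/15%/4.5%/4.5%/4.5%/2.5%/2%)从各数据子集中抽取下一条训练样本;其中的子集名称只是演示用的占位符。

```python
# 假设性示意:按表1中的采样比例从多个数据子集中抽样的最小写法。
# 比例数字来自正文,子集名称为占位假设。
import random

mixture = {
    "common_crawl": 0.67,
    "c4": 0.15,
    "github": 0.045,
    "wikipedia": 0.045,
    "books": 0.045,
    "arxiv": 0.025,
    "stackexchange": 0.02,
}

def sample_source() -> str:
    """按采样比例随机决定下一条样本来自哪个子集。"""
    names, weights = zip(*mixture.items())
    return random.choices(names, weights=weights, k=1)[0]

print([sample_source() for _ in range(5)])
```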

分词器:采用字节对编码(BPE)算法并借助SentencePiece工具库实现

  • 分词算法: 使用了字节对编码(BPE)算法,该算法由Sennrich等人于2015年提出,并借助了SentencePiece工具库的实现(Kudo和Richardson,2018年)。

  • 数字处理: 在分词过程中,将所有数字拆分为独立的数字,并对未知的UTF-8字符进行字节级的分解。

  • 技术细节: 采用BPE算法有助于处理多样性的语言数据,而将数字拆分和字节级分解则有助于更好地捕捉语言中的细微差异和结构。

Tokenizer. We tokenize the data with the byte-pair encoding (BPE) algorithm (Sennrich et al., 2015), using the implementation from SentencePiece (Kudo and Richardson, 2018). Notably, we split all numbers into individual digits, and fallback to bytes to decompose unknown UTF-8 characters.

分词器。我们使用字节对编码(BPE)算法(Sennrich等,2015年)对数据进行分词,采用SentencePiece(Kudo和Richardson,2018年)的实现。值得注意的是,我们将所有数字拆分为单独的数字,并将未知的UTF-8字符回退为字节进行分解。
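下面是一个假设性的示意(并非LLaMA官方的分词器配置),演示如何用 sentencepiece 训练一个开启数字拆分(split_digits)和字节回退(byte_fallback)的 BPE 分词器;其中 corpus.txt 为占位的语料文件名,32000 的词表大小仅作示意取值。

```python
# 假设性示意:用 sentencepiece 训练一个近似论文描述的 BPE 分词器。
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",            # 占位的纯文本语料文件
    model_prefix="llama_like_bpe",
    vocab_size=32000,              # 示意取值
    model_type="bpe",
    split_digits=True,             # 将数字拆分为单个数字
    byte_fallback=True,            # 未知 UTF-8 字符回退为字节
)

sp = spm.SentencePieceProcessor(model_file="llama_like_bpe.model")
print(sp.encode("LLaMA was trained on 1.4T tokens in 2023.", out_type=str))
```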

整体训练数据集及其处理方式:分词后包含大约1.4万亿个标记(1.4T),仅对维基百科和图书领域的数据进行了大约2个epoch的迭代训练

  • 训练数据规模: 整个训练数据集在分词后包含大约1.4万亿个标记(tokens)。

  • 标记使用频率: 在大多数训练数据中,每个标记在训练期间仅使用一次。这意味着每个文本标记都被模型考虑了一次。

  • 特殊情况处理: 例外情况是对维基百科和图书领域的数据,针对这两个领域,进行了大约2个epoch的迭代训练。这意味着模型对于这两个领域的数据进行了更深入的学习,以更好地理解其中的语义和结构。

Overall, our entire training dataset contains roughly 1.4T tokens after tokenization. For most of our training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which we perform approximately two epochs.

总体而言,我们整个训练数据集在分词后包含大约1.4万亿个标记。对于大多数训练数据,每个标记在训练过程中仅使用一次;维基百科和图书领域是例外,我们对其进行了大约两个epoch的训练。

2.2、Architecture

Table 2: Model sizes, architectures, and optimization hyper-parameters.模型大小、架构和优化超参数。

模型架构:

  • 基于transformer架构,但引入了一些改进:
    • 采用预正则化(Pre-normalization)来提高训练稳定性。
    • 使用SwiGLU激活函数替代ReLU,以提高性能。
    • 移除绝对位置嵌入,使用Rotary Positional Embeddings。

Following recent work on large language models, our network is based on the transformer architecture (Vaswani et al., 2017). We leverage various improvements that were subsequently proposed, and used in different models such as PaLM. Here are the main differences with the original architecture, and where we found the inspiration for this change (in bracket):

在最近关于大型语言模型的研究中,我们的网络基于Transformer架构(Vaswani等,2017年)。我们利用了后来提出的各种改进方法,这些方法在不同模型中被使用,如PaLM。以下是与原始架构的主要不同之处以及我们对此变化的启示(方括号内):

Pre-normalization [GPT3]. To improve the training stability, we normalize the input of each transformer sub-layer, instead of normalizing the output. We use the RMSNorm normalizing function, introduced by Zhang and Sennrich (2019).

SwiGLU activation function [PaLM]. We replace the ReLU non-linearity by the SwiGLU activation function, introduced by Shazeer (2020) to improve the performance. We use a dimension of 2/3·4d instead of 4d as in PaLM.

Rotary Embeddings [GPTNeo]. We remove the absolute positional embeddings, and instead, add rotary positional embeddings (RoPE), introduced by Su et al. (2021), at each layer of the network.

[GPT3]预归一化 的RMSNorm归一化函数。为了改善训练稳定性,我们对每个Transformer子层的输入进行归一化,而不是对输出进行归一化。我们使用RMSNorm归一化函数,由Zhang和Sennrich(2019年)引入。

[PaLM]激活函数 SwiGLU。我们将ReLU非线性激活函数替换为SwiGLU激活函数,该函数由Shazeer(2020年)引入以提高性能。与PaLM不同的是,我们使用 2/3·4d 的隐藏维度,而不是PaLM中的4d。

[GPTNeo]旋转位置嵌入 RoPE。我们移除了绝对位置嵌入,并在网络的每个层添加了旋转位置嵌入(RoPE),这是由Su等(2021年)引入的。

The details of the hyper-parameters for our dif-ferent models are given in Table 2.

有关我们不同模型的超参数详细信息,请参见表2。
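为帮助理解上述三项改动,下面给出一个假设性的最小 PyTorch 示意实现(并非论文官方代码):RMSNorm 预归一化、隐藏维约为 2/3·4d 的 SwiGLU 前馈层,以及旋转位置嵌入 RoPE;张量维度均为演示用的示意值,与表2中的真实超参数无关。

```python
# 假设性示意:按论文描述拼出三个架构改动的最小 PyTorch 版本。
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """RMSNorm:只用均方根做缩放,不减均值(Zhang & Sennrich, 2019)。"""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms


class SwiGLU(nn.Module):
    """SwiGLU 前馈层:SiLU 门控,隐藏维取 2/3 * 4d(Shazeer, 2020)。"""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * 4 * dim / 3)
        self.w1 = nn.Linear(dim, hidden, bias=False)  # 门控分支
        self.w2 = nn.Linear(dim, hidden, bias=False)  # 线性分支
        self.w3 = nn.Linear(hidden, dim, bias=False)  # 输出投影

    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))


def rotary_embedding(x, base: float = 10000.0):
    """RoPE:把 head 维度分成两半配对,按位置做旋转(Su et al., 2021)。
    x 形状为 (batch, seq, heads, head_dim),head_dim 需为偶数。"""
    b, t, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (t, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


if __name__ == "__main__":
    x = torch.randn(2, 16, 512)        # (batch, seq, dim),示意值
    x = RMSNorm(512)(x)                # 预归一化作用在子层输入上
    q = torch.randn(2, 16, 8, 64)      # (batch, seq, heads, head_dim)
    q = rotary_embedding(q)            # 对 query/key 施加旋转位置嵌入
    y = SwiGLU(512)(x)
    print(x.shape, q.shape, y.shape)
```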

2.3    Optimizer

Figure 1: Training loss over train tokens for the 7B, 13B, 33B, and 65B models. LLaMA-33B and LLaMA-65B were trained on 1.4T tokens. The smaller models were trained on 1.0T tokens. All models are trained with a batch size of 4M tokens.——7B、13B、33B和65B模型在训练标记上的训练损失。LLaMA-33B和LLaMA-65B在1.4T标记上进行训练。较小的模型在1.0T标记上进行训练。所有模型的批次大小均为4M标记。

优化器:

  • 使用AdamW优化器,设定一系列超参数,包括学习率、权重衰减、梯度剪裁等。
  • 采用余弦学习率调度,具有渐变的学习率和批次大小随模型大小变化而变化的特性。

Our models are trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with the following hyper-parameters: β1 = 0.9, β2 = 0.95. We use a cosine learning rate schedule, such that the final learning rate is equal to 10% of the maximal learning rate. We use a weight decay of 0.1 and gradient clipping of 1.0. We use 2,000 warmup steps, and vary the learning rate and batch size with the size of the model (see Table 2 for details).

我们使用AdamW优化器(Loshchilov和Hutter,2017年)进行模型训练,使用以下超参数:β1 = 0.9,β2 = 0.95。我们使用余弦学习率调度,使最终学习率等于最大学习率的10%。我们使用权重衰减0.1和梯度裁剪1.0。我们使用2,000个预热步骤,并根据模型的大小调整学习率和批次大小(详见表2)。
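按照上述超参数,下面给出一个假设性的最小训练循环示意(非官方实现):AdamW(β1=0.9、β2=0.95、weight decay 0.1)、2000 步线性预热加余弦退火到峰值学习率 10% 的调度,以及 1.0 的梯度裁剪;其中的模型、峰值学习率和总步数均为占位假设。

```python
# 假设性示意:AdamW + 带预热的余弦学习率调度 + 梯度裁剪。
import math
import torch

def cosine_with_warmup(step, warmup_steps=2000, total_steps=100_000, min_ratio=0.1):
    """返回当前步的学习率倍率:线性预热,然后余弦退火到峰值的 10%。"""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress))

model = torch.nn.Linear(512, 512)      # 占位模型
max_lr = 3e-4                          # 示意取值;各规模的真实峰值学习率见表2
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=cosine_with_warmup)

for step in range(10):                 # 真实训练中这里遍历数据批次
    loss = model(torch.randn(8, 512)).pow(2).mean()          # 占位损失
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # 梯度裁剪到 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```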

2.4   Efficient implementation高效实现

Table 3: Zero-shot performance on Common Sense Reasoning tasks.常识推理任务上的零样本性能。

高效实现:

  • 通过实现高效的因果多头注意力机制来降低内存使用和运行时。
  • 使用checkpoint检查点技术减少在反向传播期间需要重新计算的激活数量,以提高训练效率。
  • 实现模型和序列的并行计算,通过减少模型内存使用,利用多GPU并行计算,最大程度上减少GPU间的通信。

性能和训练速度:采用了2048个A100-80GB GPU,训练1.4T标记的数据集,耗时21天

  • 在训练65B参数模型时,每秒每GPU处理约380个标记,使用2048个A100 GPU,每个GPU有80GB RAM。
  • 完成包含1.4T标记的数据集的训练大约需要21天。

We make several optimizations to improve the training speed of our models. First, we use an efficient implementation of the causal multi-head attention to reduce memory usage and runtime. This implementation, available in the xformers library, is inspired by Rabe and Staats (2021) and uses the backward from Dao et al. (2022). This is achieved by not storing the attention weights and not computing the key/query scores that are masked due to the causal nature of the language modeling task.

我们进行了几项优化来提高模型的训练速度。首先,我们使用了一种高效的因果多头注意力实现,以减少内存使用和运行时间。这个实现在xformers库中可用,受到了Rabe和Staats(2021年)的启发,并使用了Dao等人(2022年)的反向传播。通过不存储注意力权重和不计算由于语言建模任务的因果性质而被屏蔽的键/查询得分,实现了这一点。

To further improve training efficiency, we reduced the amount of activations that are recomputed during the backward pass with checkpointing. More precisely, we save the activations that are expensive to compute, such as the outputs of linear layers. This is achieved by manually implementing the backward function for the transformer layers, instead of relying on the PyTorch autograd. To fully benefit from this optimization, we need to reduce the memory usage of the model by using model and sequence parallelism, as described by Korthikanti et al. (2022). Moreover, we also overlap the computation of activations and the communication between GPUs over the network (due to all_reduce operations) as much as possible.

When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.

为了进一步提高训练效率,我们通过检查点技术减少了在反向传播过程中重新计算的激活数量。具体而言,我们保存了昂贵计算的激活,如线性层的输出。这是通过手动实现Transformer层的反向传播函数来实现的,而不是依赖于PyTorch的autograd。为了充分利用这种优化,我们需要使用模型和序列并行来减少模型的内存使用,正如Korthikanti等人(2022年)所描述的。此外,我们还尽可能地重叠计算激活和在网络上进行的GPU之间的通信(由于all_reduce操作)。

当训练一个拥有650亿参数的模型时,我们的代码在拥有80GB内存的2048个A100 GPU上每秒处理约380个标记。这意味着在包含1.4万亿标记的数据集上进行训练大约需要21天。
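下面是一个假设性的调用示意(需要 GPU 环境,且并非论文的完整实现):用 xformers 的 memory_efficient_attention 配合因果掩码计算注意力,并用 torch.utils.checkpoint 对一个前馈子层做激活重计算。张量形状与前馈层结构均为演示用的示意值,不包含论文中手写反向传播和模型/序列并行的部分。

```python
# 假设性示意:高效因果注意力 + 激活重计算(checkpoint)。
import torch
import xformers.ops as xops
from torch.utils.checkpoint import checkpoint

B, T, H, D = 2, 128, 8, 64                       # batch、序列长、头数、每头维度(示意值)
q = torch.randn(B, T, H, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, T, H, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, T, H, D, device="cuda", dtype=torch.float16)

# 因果(自回归)注意力:不显式存储注意力权重,也不计算被掩码的得分
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())

# 对开销较大的子层做 checkpoint:前向不保存中间激活,反向时重算
ffn = torch.nn.Sequential(
    torch.nn.Linear(H * D, 4 * H * D), torch.nn.GELU(), torch.nn.Linear(4 * H * D, H * D)
).cuda().half()
x = out.reshape(B, T, H * D).clone().requires_grad_(True)
y = checkpoint(ffn, x, use_reentrant=False)
y.sum().backward()
```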

3    Main results主要结果

任务类型和基准测试:

  • 作者考虑了零样本和少样本任务,并在总共20个基准测试上报告了结果。
  • 零样本任务中,模型通过提供开放性生成的答案或对提议答案进行排名来回答任务。
  • 少样本任务中,模型通过提供任务的少量示例(1到64个)和一个测试示例来生成答案或对不同选项进行排名。

Following previous work (Brown et al., 2020), we consider zero-shot and few-shot tasks, and report results on a total of 20 benchmarks:

  1. Zero-shot. We provide a textual description of the task and a test example. The model either provides an answer using open-ended generation, or ranks the proposed answers.
  2. Few-shot. We provide a few examples of the task (between 1 and 64) and a test example. The model takes this text as input and gener-ates the answer or ranks different options.

我们遵循以前的工作(Brown等,2020年),考虑了零样本和少样本任务,并在总共20个基准测试中报告结果:

  1. 零样本。我们提供任务的文本描述和一个测试示例。模型通过开放式生成回答或对提议的答案进行排序来回答。
  2. 少样本。我们提供任务的几个示例(1到64个)和一个测试示例。模型以这些文本作为输入,生成答案或对不同选项进行排序。

与其他模型的比较:

  • 与其他基准模型进行比较,包括GPT-3、Gopher、Chinchilla、PaLM、OPT、GPT-J、GPT-Neo等。
  • 在不同任务和基准测试中对LLaMA进行了性能评估。

We compare LLaMA with other foundation mod-els, namely the non-publicly available language models GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022) and PaLM (Chowdhery et al., 2022), as well as the open-sourced OPT models (Zhang et al., 2022), GPT-J (Wang and Komatsuzaki, 2021), and GPT-Neo (Black et al., 2022). In Section 4, we also briefly compare LLaMA with instruction-tuned models such as OPT-IML (Iyer et al., 2022) and Flan-PaLM (Chung et al., 2022).

We evaluate LLaMA on free-form generation tasks and multiple choice tasks. In the multiple choice tasks, the objective is to select the most appropriate completion among a set of given op-tions, based on a provided context. We select the completion with the highest likelihood given the provided context. We follow Gao et al. (2021) and use the likelihood normalized by the number of characters in the completion, except for certain datasets (OpenBookQA, BoolQ), for which we fol-low Brown et al. (2020), and select a completion based on the likelihood normalized by the likeli-hood of the completion given “Answer:” as context: P (completion|context)/P (completion|“Answer:”).

我们将LLaMA与其他基础模型进行比较,包括非公开可用的语言模型GPT-3、Gopher、Chinchilla和PaLM,以及开源的OPT模型、GPT-J和GPT-Neo。在第4节中,我们还简要比较了LLaMA与OPT-IML和Flan-PaLM等针对指令进行调整的模型。

我们在自由形式生成任务和多项选择任务上评估LLaMA。在多项选择任务中,目标是根据提供的上下文从给定选项中选择最合适的补全。我们选择在给定上下文下似然最高的补全。我们遵循Gao等人(2021年)的方法,使用按补全字符数归一化的似然作为评分;但对于某些数据集(OpenBookQA、BoolQ),我们按照Brown等人(2020年)的方法,用补全在给定上下文下的似然除以其在"Answer:"作为上下文时的似然来选择,即 P(completion|context)/P(completion|"Answer:")。
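下面用一个假设性的最小示意(借助 Hugging Face transformers,模型名 gpt2 仅作演示、并非 LLaMA 权重)说明"按补全字符数归一化的对数似然"这类多项选择打分方式;其中的 completion_logprob、pick_answer 和示例问题都是为演示而虚构的,实际评测框架会更严谨地处理分词边界。

```python
# 假设性示意:多项选择题的"归一化对数似然"打分。
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def completion_logprob(context: str, completion: str) -> float:
    """返回 log P(completion | context),只累加补全部分的 token。"""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    target = full_ids[0, 1:]
    start = ctx_ids.shape[1] - 1                   # 从补全的第一个 token 开始统计
    return logprobs[start:, :].gather(1, target[start:, None]).sum().item()

def pick_answer(context: str, options: list[str]) -> str:
    # 按字符数归一化;个别数据集(如 BoolQ、OpenBookQA)改用
    # log P(completion|context) - log P(completion|"Answer:") 作为分数
    scores = [completion_logprob(context, o) / len(o) for o in options]
    return options[max(range(len(options)), key=scores.__getitem__)]

print(pick_answer("Q: Is the sky blue on a clear day?\nA:", [" yes", " no"]))
```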

总结:常识推理、封闭书籍问答、阅读理解、数学推理、代码生成等任务的性能评估:

  • 在常识推理任务中,LLaMA-65B在多个基准测试上超越Chinchilla-70B和PaLM-540B,LLaMA-13B也在大多数基准测试中胜过GPT-3。
  • 在封闭书籍问答任务中,LLaMA在Natural Questions和TriviaQA基准测试中在零样本和少样本设置下均取得了领先的性能
  • 在阅读理解任务中,LLaMA-65B与PaLM-540B竞争激烈,LLaMA-13B在某些基准测试中超过GPT-3。
  • 在数学推理任务中,LLaMA在MATH和GSM8k基准测试中与PaLM和Minerva相比表现出色
  • 在代码生成任务中,LLaMA在HumanEval和MBPP基准测试中的性能超过了LaMDA和PaLM。
  • 在大规模多任务语言理解基准测试中,LLaMA-65B在一些领域上略逊于Chinchilla-70B和PaLM-540B。

Table 4: NaturalQuestions. Exact match performance.

3.1 Common Sense Reasoning常识推理

We consider eight standard common sense rea-soning benchmarks: BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019),HellaSwag (Zellers et al., 2019), WinoGrande (Sak-aguchi et al., 2021), ARC easy and challenge (Clark et al., 2018) and OpenBookQA (Mihaylov et al., 2018). These datasets include Cloze and Winograd style tasks, as well as multiple choice question an-swering. We evaluate in the zero-shot setting as done in the language modeling community.

In Table 3, we compare with existing models of various sizes and report numbers from the cor-responding papers. First, LLaMA-65B outper-forms Chinchilla-70B on all reported benchmarks but BoolQ. Similarly, this model surpasses PaLM- 540B everywhere but on BoolQ and WinoGrande. LLaMA-13B model also outperforms GPT-3 on most benchmarks despite being 10× smaller.

我们考虑了八个常识推理基准测试:BoolQ、PIQA、SIQA、HellaSwag、WinoGrande、ARC easy和challenge以及OpenBookQA。这些数据集包括Cloze和Winograd风格的任务,以及多项选择题。我们按照语言建模社区的做法,在零样本设置下进行评估。

在表3中,我们与各种规模的现有模型进行比较,并报告了相应论文中的数据。首先,LLaMA-65B在除BoolQ以外的所有基准测试中都优于Chinchilla-70B。同样,除BoolQ和WinoGrande外,该模型也全面超过了PaLM-540B。LLaMA-13B尽管比GPT-3小10倍,但在大多数基准测试中仍优于GPT-3。

3.2 Closed-book Question Answering闭书式问答

We compare LLaMA to existing large language models on two closed-book question answering benchmarks: Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017). For both benchmarks, we report exact match performance in a closed book setting, i.e., where the models do not have access to documents that contain evidence to answer the question. In Table 4, we report performance on NaturalQuestions, and in Table 5, we report on TriviaQA. On both benchmarks, LLaMA-65B achieves state-of-the-art performance in the zero-shot and few-shot settings. More importantly, the LLaMA-13B is also competitive on these benchmarks with GPT-3 and Chinchilla, despite being 5-10× smaller. This model runs on a single V100 GPU during inference.

我们将LLaMA与现有的大型语言模型在两个闭卷问答基准测试上进行比较:Natural Questions和TriviaQA。对于这两个基准,我们报告闭卷设置下的精确匹配成绩,即模型无法访问包含回答问题所需证据的文档。在表4中,我们报告了在NaturalQuestions上的性能,在表5中报告了TriviaQA上的性能。在这两个基准上,LLaMA-65B在零样本和少样本设置下都达到了最先进的性能。更重要的是,LLaMA-13B尽管比GPT-3和Chinchilla小5-10倍,在这些基准上也与它们具有竞争力,并且推理时可以在单个V100 GPU上运行。

Table 5: TriviaQA. Zero-shot and few-shot exact match performance on the filtered dev set.

3.3 Reading Comprehension阅读理解

We evaluate our models on the RACE reading com-prehension benchmark (Lai et al., 2017). This dataset was collected from English reading com-prehension exams designed for middle and high school Chinese students. We follow the evaluation setup from Brown et al. (2020) and report results in Table 6. On these benchmarks, LLaMA-65B is competitive with PaLM-540B, and, LLaMA-13B outperforms GPT-3 by a few percents.

我们在RACE阅读理解基准测试上评估了我们的模型。该数据集收集自为中国初高中学生设计的英语阅读理解考试。我们按照Brown等人(2020年)的评估设置进行评估,并在表6中报告结果。在这些基准上,LLaMA-65B与PaLM-540B具有竞争力,LLaMA-13B比GPT-3高出几个百分点。

Table 6: Reading Comprehension. Zero-shot accuracy.

3.4   Mathematical reasoning数学推理

We evaluate our models on two mathematical rea-soning benchmarks: MATH (Hendrycks et al., 2021) and GSM8k (Cobbe et al., 2021). MATH is a dataset of 12K middle school and high school mathematics problems written in LaTeX. GSM8k is a set of middle school mathematical problems. In Table 7, we compare with PaLM and Min-erva (Lewkowycz et al., 2022). Minerva is a series of PaLM models finetuned on 38.5B tokens ex-tracted from ArXiv and Math Web Pages, while neither PaLM or LLaMA are finetuned on mathe-matical data. The numbers for PaLM and Minerva are taken from Lewkowycz et al. (2022), and we compare with and without maj1@k. maj1@k de-notes evaluations where we generate k samples for each problem and perform a majority voting (Wang et al., 2022). On GSM8k, we observe that LLaMA- 65B outperforms Minerva-62B, although it has not been fine-tuned on mathematical data.

我们在两个数学推理基准测试上评估了我们的模型:MATH(Hendrycks等人,2021年)和GSM8k(Cobbe等人,2021年)。MATH是一个包含1.2万道初高中数学题的数据集,使用LaTeX编写。GSM8k是一组初中数学题。在表7中,我们与PaLM和Minerva(Lewkowycz等人,2022年)进行了比较。Minerva是一系列在从ArXiv和数学网页中提取的385亿(38.5B)个标记上微调的PaLM模型,而PaLM和LLaMA都没有在数学数据上微调。PaLM和Minerva的数据来自Lewkowycz等人(2022年),我们比较了有无maj1@k的结果。maj1@k表示为每个问题生成k个样本并进行多数投票(Wang等人,2022年)的评估方式。在GSM8k上,我们观察到LLaMA-65B优于Minerva-62B,尽管它没有在数学数据上进行微调。
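下面是一个假设性的最小示意,说明 maj1@k 的多数投票流程:对每道题采样 k 个解答、抽取最终答案后取众数;其中的答案列表为虚构示例,答案抽取环节在真实评测中还需要解析模型输出。

```python
# 假设性示意:maj1@k 的多数投票。
from collections import Counter

def majority_vote(samples: list[str]) -> str:
    """返回 k 个抽取出的最终答案中出现次数最多的一个。"""
    return Counter(samples).most_common(1)[0][0]

# 虚构示例:某道 GSM8k 题目的 5 个采样解答抽取出的最终答案
extracted = ["42", "41", "42", "42", "17"]
print(majority_vote(extracted))   # -> "42"
```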

Table 7: Model performance on quantitative reasoning datasets. For majority voting, we use the same setup as Minerva, with k = 256 samples for MATH and k = 100 for GSM8k (Minerva 540B uses k = 64 for MATH and k = 40 for GSM8k). LLaMA-65B outperforms Minerva 62B on GSM8k, although it has not been fine-tuned on mathematical data.

3.5 Code generation代码生成

We evaluate the ability of our models to write code from a natural language description on two benchmarks: HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021). For both tasks, the model receives a description of the program in a few sentences, as well as a few input-output ex-amples. In HumanEval, it also receives a function signature, and the prompt is formatted as natural code with the textual description and tests in a docstring. The model needs to generate a Python program that fits the description and satisfies the test cases. In Table 8, we compare the pass@1 scores of our models with existing language mod-els that have not been finetuned on code, namely PaLM and LaMDA (Thoppilan et al., 2022). PaLM and LLaMA were trained on datasets that contain a similar number of code tokens.

我们在两个代码生成基准测试上评估我们的模型对于从自然语言描述中生成代码的能力:HumanEval(Chen等人,2021年)和MBPP(Austin等人,2021年)。对于这两个任务,模型接收到一段程序的描述,包括几个输入-输出示例。在HumanEval中,它还会接收到一个函数签名,而提示文本的格式是自然代码,其中包含了文本描述和测试用例。模型需要生成一个符合描述并满足测试用例的Python程序。在表8中,我们将我们的模型的pass@1得分与未在代码上进行微调的现有语言模型进行了比较,包括PaLM和LaMDA(Thoppilan等人,2022年)。PaLM和LLaMA都是在包含相似数量的代码标记的数据集上进行训练的。

As shown in Table 8, for a similar number of parameters, LLaMA outperforms other general models such as LaMDA and PaLM, which are not trained or finetuned specifically for code. LLaMA with 13B parameters and more outperforms LaMDA 137B on both HumanEval and MBPP. LLaMA 65B also outperforms PaLM 62B, even when it is trained longer. The pass@1 results reported in this table were obtained by sampling with temperature 0.1. The pass@100 and pass@80 metrics were obtained with temperature 0.8. We use the same method as Chen et al. (2021) to obtain unbiased estimates of the pass@k.

如表8所示,对于相似数量的参数,LLaMA优于其他通用模型,如LaMDA和PaLM,这些模型没有专门针对代码进行训练或微调。LLaMA拥有13B参数及以上,在HumanEval和MBPP上的表现优于LaMDA 137B。即使在训练时间更长的情况下,LLaMA 65B也优于PaLM 62B。表中报告的pass@1结果是在温度为0.1的情况下采样得到的。pass@100和pass@80指标是在温度为0.8的情况下获得的。我们使用与Chen等人(2021年)相同的方法来获得pass@k的无偏估计。
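正文提到按 Chen et al. (2021) 的方法对 pass@k 做无偏估计,下面给出该估计式 pass@k = 1 − C(n−c, k)/C(n, k) 的一个数值稳定的示意实现;示例中的 n、c 取值为虚构数字,仅用于说明用法。

```python
# 假设性示意:pass@k 的无偏估计(n 为每题采样数,c 为通过测试的样本数)。
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """计算 1 - C(n-c, k) / C(n, k) 的数值稳定写法。"""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# 虚构示例:某题采样 100 次,10 次通过,估计 pass@1 与 pass@10
print(pass_at_k(100, 10, 1))    # ≈ 0.10
print(pass_at_k(100, 10, 10))   # ≈ 0.67
```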

It is possible to improve the performance on code by finetuning on code-specific tokens. For instance, PaLM-Coder (Chowdhery et al., 2022) increases the pass@1 score of PaLM on HumanEval from 26.2% for PaLM to 36%. Other models trained specifically for code also perform better than gen-eral models on these tasks (Chen et al., 2021; Ni-jkamp et al., 2022; Fried et al., 2022). Finetuning on code tokens is beyond the scope of this paper.

通过在代码专用标记上微调,可以进一步提高代码生成的性能。例如,PaLM-Coder(Chowdhery等人,2022年)将PaLM在HumanEval上的pass@1得分从26.2%提高到36%。专门针对代码训练的其他模型在这些任务上也优于通用模型(Chen等人,2021年;Nijkamp等人,2022年;Fried等人,2022年)。不过,在代码标记上进行微调超出了本文的讨论范围。

Table 8: Model performance for code generation. We report the pass@ score on HumanEval and MBPP. HumanEval generations are done in zero-shot and MBPP with 3-shot prompts similar to Austin et al. (2021). The values marked with ∗ are read from figures in Chowdhery et al. (2022).代码生成的模型性能。我们报告了在HumanEval和MBPP上的pass@分数。HumanEval的生成在零样本设置下进行,MBPP使用与Austin et al. (2021)类似的3-shot提示。标有∗的值是从Chowdhery et al. (2022)的图中读取的。

3.6 Massive Multitask Language Understanding大规模多任务语言理解

The massive multitask language understanding benchmark, or MMLU, introduced by Hendrycks et al. (2020) consists of multiple choice questions covering various domains of knowledge, includ-ing humanities, STEM and social sciences. We evaluate our models in the 5-shot setting, using the examples provided by the benchmark, and report results in Table 9. On this benchmark, we observe that the LLaMA-65B is behind both Chinchilla- 70B and PaLM-540B by a few percent in average, and across most domains. A potential explanation is that we have used a limited amount of books and academic papers in our pre-training data, i.e., ArXiv, Gutenberg and Books3, that sums up to only 177GB, while these models were trained on up to 2TB of books. This large quantity of books used by Gopher, Chinchilla and PaLM may also explain why Gopher outperforms GPT-3 on this benchmark, while it is comparable on other benchmarks.

大规模多任务语言理解基准测试(MMLU),由Hendrycks等人(2020年)引入,包含涵盖人文、STEM和社会科学等各个领域的多项选择题。我们在5-shot设置下使用该基准测试提供的示例评估我们的模型,并在表9中报告结果。在这个基准测试中,LLaMA-65B在平均值上略逊于Chinchilla-70B和PaLM-540B,并且在大多数领域中也是如此。一个可能的解释是,我们在预训练数据中使用了有限数量的图书和学术论文,即ArXiv、Gutenberg和Books3,总共只有177GB,而这些模型是在高达2TB的图书上进行训练的。这些由Gopher、Chinchilla和PaLM使用的大量图书可能也解释了为什么Gopher在这个基准测试中优于GPT-3,而在其他基准测试中相当。

3.7 Evolution of performance during training训练期间性能的演变

表现的追踪情况

  • 性能追踪: 在训练期间,对模型在一些问题回答和常识推理基准测试中的表现进行了追踪。

  • 图表展示: 结果以图表形式呈现在图2中,展示了模型在不同基准测试中的性能。

  • 性能趋势: 大多数基准测试中,性能呈现稳步提升,并与模型的训练困惑度(见图1)呈正相关。这表明随着训练的进行,模型对任务的理解和表现逐渐改善

  • 特殊情况: 有两个基准测试(SIQA和WinoGrande)出现了特殊情况。在SIQA中,性能出现较大的变化,可能表明该基准测试不太可靠。而在WinoGrande中,性能与训练困惑度的相关性不如其他测试明显,LLaMA-33B和LLaMA-65B在训练过程中表现相似。

During training, we tracked the performance of our models on a few question answering and common sense benchmarks, and report them in Figure 2. On most benchmarks, the performance improves steadily, and correlates with the training perplexity of the model (see Figure 1). The exceptions are SIQA and WinoGrande. Most notably, on SIQA, we observe a lot of variance in performance,that may indicate that this benchmark is not reliable. On WinoGrande, the performance does not correlate as well with training perplexity: the LLaMA-33B and LLaMA-65B have similar performance during the training.

在训练过程中,我们跟踪了我们的模型在一些问答和常识基准测试上的性能,并在图2中进行了报告。在大多数基准测试中,性能稳步提升,并与模型的训练困惑度相关(参见图1)。SIQA和WinoGrande是例外。特别是在SIQA上,我们观察到性能有很大的变化,这可能表明该基准测试不太可靠。在WinoGrande上,性能与训练困惑度的相关性不太明显:LLaMA-33B和LLaMA-65B在训练期间的性能相似。

Table 9: Massive Multitask Language Understanding (MMLU). Five-shot accuracy.大规模多任务语言理解(MMLU)。5-shot准确率。

4 Instruction Finetuning指令微调

指导数据微调:

  • 通过在指导数据上进行微调,LLaMA-65B在MMLU任务上迅速取得了性能提升
  • 即使LLaMA-65B的非微调版本已经能够遵循基本指导,微调仍然能够显著提高MMLU的性能,并增强模型遵循指导的能力

微调实验和结果:

  • 在本文中,作者进行了一项微调实验,命名为LLaMA-I,遵循了与Chung等人(2022年)相同的协议。
  • 在表格10中,报告了LLaMA-I在MMLU上的结果,并与已有的中等规模的指导微调模型(OPT-IML和Flan-PaLM系列)进行了比较。
  • 尽管这里使用的指导微调方法相对简单,但在MMLU上达到了68.9%的性能。LLaMA-I(65B)在MMLU上的性能超过了已有的中等规模的指导微调模型,但仍远未达到最先进水平。GPT code-davinci-002在MMLU上的最先进水平为77.4%(数据来自Iyer等人(2022年))。

性能详细信息:

  • 作者在附录的表格16中提供了LLaMA-I在57个任务上的详细性能信息。

In this section, we show that briefly finetuning on instructions data rapidly leads to improvements on MMLU. Although the non-finetuned version of LLaMA-65B is already able to follow basic in-structions, we observe that a very small amount of finetuning improves the performance on MMLU, and further improves the ability of the model to follow instructions. Since this is not the focus of this paper, we only conducted a single experiment following the same protocol as Chung et al. (2022) to train an instruct model, LLaMA-I.

在本节中,我们展示了在指令数据上进行简短微调会迅速改善在MMLU上的性能。尽管LLaMA-65B的非微调版本已经能够遵循基本的指令,但我们观察到微调很少的量可以提高在MMLU上的性能,并进一步提高模型遵循指令的能力。由于这不是本文的重点,我们只进行了一个实验,按照Chung等人(2022年)的协议训练了一个指令模型LLaMA-I。

In Table 10, we report the results of our instruct model LLaMA-I on MMLU and compare with ex-isting instruction finetuned models of moderate sizes, namely, OPT-IML (Iyer et al., 2022) and the Flan-PaLM series (Chung et al., 2022). All the re-ported numbers are from the corresponding papers. Despite the simplicity of the instruction finetuning approach used here, we reach 68.9% on MMLU. LLaMA-I (65B) outperforms on MMLU existing instruction finetuned models of moderate sizes, but are still far from the state-of-the-art, that is 77.4 for GPT code-davinci-002 on MMLU (numbers taken from Iyer et al. (2022)). The details of the performance on MMLU on the 57 tasks can be found in Table 16 of the appendix.

在表10中,我们报告了我们的指令模型LLaMA-I在MMLU上的结果,并与具有中等规模的现有指令微调模型OPT-IML(Iyer等人,2022年)和Flan-PaLM系列(Chung等人,2022年)进行了比较。所有报告的数据都来自相应的论文。尽管这里使用的指令微调方法相对简单,我们在MMLU上达到了68.9%的准确率。LLaMA-I(65B)在MMLU上的表现优于具有中等规模的现有指令微调模型,但仍远远落后于当前的最先进水平,即GPT code-davinci-002在MMLU上的准确率为77.4%(数据取自Iyer等人(2022年))。有关在57个任务上的MMLU性能细节,请参见附录的表16。

Table 10: Instruction finetuning – MMLU (5-shot). Comparison of models of moderate size with and with-out instruction finetuning on MMLU.

Figure 2: Evolution of performance on question answering and common sense reasoning during training.在训练过程中的问答和常识推理性能演变。

5 Bias, Toxicity and Misinformation偏见、有害内容和虚假信息

大型语言模型可能面临的偏见、有毒性和虚假信息生成的问题,通过多个基准测试展示了LLaMA-65B在这些方面的表现。

  • 偏见、有毒性和虚假信息: 大型语言模型存在重现和放大训练数据中的偏见,生成有毒或冒犯性内容的问题。

  • 评估模型有毒性: 通过在不同基准测试上评估LLaMA-65B模型的毒性内容生成和刻板印象检测能力。使用RealToxicityPrompts基准测试,评估了模型在基于真实有毒提示的情况下的表现,发现毒性随模型规模增加而增加

  • CrowS-Pairs评估模型偏见: 使用CrowS-Pairs基准测试,测量了模型在9个类别中的偏见,包括性别、宗教、种族/肤色、性取向、年龄、国籍、残疾、外貌和社会经济地位。LLaMA相对于GPT-3和OPT-175B在平均偏见上稍微有优势

  • WinoGender基准测试: 通过WinoGender基准测试进一步研究模型在性别方面的偏见。发现模型在“their/them/someone”代词上的共识分辨性能明显优于“her/her/she”和“his/him/he”代词,这可能表明性别偏见存在

  • TruthfulQA测量真实性: 使用TruthfulQA基准测试,评估模型辨别声明真实性的能力。结果显示LLaMA-65B在真实和信息量充足两个类别上得分较高,但正确回答率仍然较低,表明该模型可能产生不准确的答案

Large language models have been showed to re-produce and amplify biases that are existing in the training data (Sheng et al., 2019; Kurita et al., 2019), and to generate toxic or offensive con-tent (Gehman et al., 2020). As our training dataset contains a large proportion of data from the Web, we believe that it is crucial to determine the po-tential for our models to generate such content. To understand the potential harm of LLaMA-65B, we evaluate on different benchmarks that measure toxic content production and stereotypes detection. While we have selected some of the standard bench-marks that are used by the language model com-munity to indicate some of the issues with these models, these evaluations are not sufficient to fully understand the risks associated with these models.

已经有研究表明,大型语言模型能够复制和放大训练数据中存在的偏见(Sheng等人,2019年;Kurita等人,2019年),并生成有害或冒犯性的内容(Gehman等人,2020年)。由于我们的训练数据集包含大量来自互联网的数据,我们认为确定我们的模型生成此类内容的潜力是至关重要的。为了了解LLaMA-65B的潜在危害,我们在衡量有害内容生成和刻板印象检测的不同基准测试上进行评估。虽然我们选择了一些标准的基准测试,这些测试被语言模型社区用来指示这些模型存在的一些问题,但这些评估并不足以完全了解与这些模型相关的风险。

5.1 Real Toxicity Prompts

Language models can generate toxic language, e.g., insults, hate speech or threats. There is a very large range of toxic content that a model can generate, making a thorough evaluation challenging. Several recent work (Zhang et al., 2022; Hoffmann et al., 2022) have considered the RealToxicityPrompts benchmark (Gehman et al., 2020) as an indicator of how toxic is their model. RealToxicityPrompts consists of about 100k prompts that the model must complete; then a toxicity score is automatically evaluated by making a request to PerspectiveAPI 3. We do not have control over the pipeline used by the third-party PerspectiveAPI, making comparison with previous models difficult.

语言模型可以生成有害语言,例如侮辱、仇恨言论或威胁。模型可以生成的有害内容范围非常广泛,这使得全面评估变得具有挑战性。最近的一些研究(Zhang等人,2022年;Hoffmann等人,2022年)已将RealToxicityPrompts基准测试(Gehman等人,2020年)视为评估其模型有害性的指标。RealToxicityPrompts包含约10万个模型必须完成的提示,然后通过向PerspectiveAPI发出请求自动评估其有害性分数。我们无法控制第三方PerspectiveAPI使用的流程,这使得与先前模型的比较变得困难。

For each of the 100k prompts, we greedily gen-erate with our models, and measure their toxic-ity score. The score per prompt ranges from 0 (non-toxic) to 1 (toxic). In Table 11, we report our averaged score on basic and respectful prompt cat-egories of RealToxicityPrompts. These scores are “comparable” with what we observe in the litera-ture (e.g., 0.087 for Chinchilla) but the method-ologies differ between these work and ours (in terms of sampling strategy, number of prompts and time of API). We observe that toxicity increases with the size of the model, especially for Respect-ful prompts. This was also observed in previous work (Zhang et al., 2022), with the notable excep-tion of Hoffmann et al. (2022) where they do not see a difference between Chinchilla and Gopher, despite different sizes. This could be explained by the fact that the larger model, Gopher, has worse performance than Chinchilla, suggesting that the relation between toxicity and model size may only apply within a model family.

对于这10万个提示中的每个提示,我们使用我们的模型进行贪婪生成,并测量其有害性分数。每个提示的分数范围从0(非有害)到1(有害)。在表11中,我们报告了我们在RealToxicityPrompts的基本和尊重提示类别上的平均分数。这些分数与我们在文献中观察到的结果“可比”(例如,Chinchilla的分数为0.087),但这些工作与我们的工作方法不同(在采样策略、提示数量和API时间方面)。我们观察到有害性随着模型的大小增加而增加,特别是对于尊重提示。这也是以前的研究观察到的现象(Zhang等人,2022年),但Hoffmann等人(2022年)的研究是一个值得注意的例外,他们没有观察到Chinchilla和Gopher之间的差异,尽管它们的大小不同。这可能可以解释为较大的模型Gopher的性能比Chinchilla差,表明有害性和模型大小之间的关系可能仅适用于模型系列内部。

Table 11: RealToxicityPrompts. We run a greedy decoder on the 100k prompts from this benchmark. The “respectful” versions are prompts starting with “Complete the following sentence in a polite, respectful, and unbiased manner:”, and “Basic” is without it. Scores were obtained using the PerspectiveAPI, with higher score indicating more toxic generations.RealToxicityPrompts。我们对该基准测试的10万个提示运行贪婪解码器。"尊重"版本是以"以礼貌、尊重和公正的方式完成以下句子:"开头的提示,"基本"版本则没有该前缀。得分通过PerspectiveAPI获得,得分越高表示生成内容越有毒。

5.2 CrowS-Pairs

We evaluate the biases in our model on the CrowS-Pairs (Nangia et al., 2020). This dataset allows to measure biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, dis-ability, physical appearance and socioeconomic sta-tus. Each example is composed of a stereotype and an anti-stereotype, we measure the model prefer-ence for the stereotypical sentence using the per-plexity of both sentences in a zero-shot setting. Higher scores thus indicate higher bias. We com-pare with GPT-3 and OPT-175B in Table 12.

LLaMA compares slightly favorably to both models on average. Our model is particularly bi-ased in the religion category (+10% compared to OPT-175B), followed by age and gender. We ex-pect these biases to come from CommonCrawl de-spite multiple filtering steps.

我们在CrowS-Pairs(Nangia等人,2020年)上评估了我们模型的偏见。该数据集可用于衡量9个类别的偏见:性别、宗教、种族/肤色、性取向、年龄、国籍、残疾、外貌和社会经济地位。每个示例由一个刻板印象和一个反刻板印象组成,我们通过在零样本设置中比较两个句子的困惑度来衡量模型对刻板印象句子的偏好。较高的分数表示较高的偏见。我们在表12中与GPT-3和OPT-175B进行了比较。

总体而言,LLaMA在平均水平上略微优于这两个模型。我们的模型在宗教类别上表现出较大的偏见(相比于OPT-175B,增加了10%),其次是年龄和性别。我们认为这些偏见可能来自于CommonCrawl,尽管经过了多次过滤步骤。
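下面用一个假设性的示意说明这种"比较刻板印象句与反刻板印象句困惑度"的测量思路;模型名 gpt2 与示例句子均为演示用的虚构内容,并非 CrowS-Pairs 的原始数据或 LLaMA 权重。

```python
# 假设性示意:零样本比较两句话的困惑度,困惑度更低的一侧被视为模型"偏好"的一侧。
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss        # 平均每 token 的交叉熵
    return torch.exp(loss).item()

# 虚构的句对,仅示意"刻板印象 / 反刻板印象"的对照形式
stereo = "The engineer fixed the bug because he was careful."
anti = "The engineer fixed the bug because she was careful."
prefers_stereotype = perplexity(stereo) < perplexity(anti)
print(prefers_stereotype)
```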

Table 12: CrowS-Pairs. We compare the level of bi-ases contained in LLaMA-65B with OPT-175B and GPT3-175B. Higher score indicates higher bias.

5.3 WinoGender

To further investigate the biases of our model on the gender category, we look at the WinoGender benchmark (Rudinger et al., 2018), a co-reference resolution dataset. WinoGender is made of Wino-grad schema, and biases are evaluated by determin-ing if a model co-reference resolution performance is impacted by the gender of the pronoun.

More precisely, each sentence has three mentions: an “occupation”, a “participant”, and a “pronoun” where the pronoun is co-referencing either the occupation or participant. We prompt the model to determine the co-reference relation and measure if it does so correctly according to the context of the sentence. The goal is to reveal if societal biases associated with occupations have been captured by the model. For example, a sentence in the WinoGender dataset is “The nurse notified the patient that his shift would be ending in an hour.”, which is followed by ‘His’ refers to. We then compare the perplexity of the continuations the nurse and the patient to perform co-reference resolution with the model. We evaluate the performance when using 3 pronouns: “her/her/she”, “his/him/he” and “their/them/someone” (the different choices corresponding to the grammatical function of the pronoun).

为了进一步研究我们的模型在性别类别上的偏见,我们查看了WinoGender基准测试(Rudinger等人,2018年),这是一个共参考消解数据集。WinoGender由Wino-grad schema构成,通过确定模型对代词的性别是否会影响共参考消解的性能来评估偏见。

更具体地说,每个句子有三个提及:一个"职业"、一个"参与者"和一个"代词",其中代词与职业或参与者之一共指。我们提示模型判断共指关系,并根据句子的上下文衡量它是否判断正确,目的是揭示模型是否学到了与职业相关的社会偏见。例如,WinoGender数据集中的一个句子是"The nurse notified the patient that his shift would be ending in an hour.",后面跟着"'His' refers to"。然后我们比较模型对接续 the nurse 与 the patient 两种补全的困惑度,以此完成共指消解。我们评估使用三种代词时的性能:"her/her/she"、"his/him/he"和"their/them/someone"(不同形式对应代词的不同语法功能)。

In Table 13, we report the co-reference scores for the three different pronouns contained in the dataset. We observe that our model is significantly better at performing co-reference resolution for the “their/them/someone” pronouns than for the “her/her/she” and “his/him/he” pronouns. A simi-lar observation was made in previous work (Rae et al., 2021; Hoffmann et al., 2022), and is likely indicative of gender bias. Indeed, in the case of the “her/her/she” and “his/him/he” pronouns, the model is probably using the majority gender of the occu-pation to perform co-reference resolution, instead of using the evidence of the sentence.

To further investigate this hypothesis, we look at the set of “gotcha” cases for the “her/her/she” and “his/him/he” pronouns in the WinoGender dataset. Theses cases correspond to sentences in which the pronoun does not match the majority gender of the occupation, and the occupation is the correct answer. In Table 13, we observe that our model, LLaMA-65B, makes more errors on the gotcha examples, clearly showing that it capture societal biases related to gender and occupation. The drop of performance exists for “her/her/she” and “his/him/he” pronouns, which is indicative of biases regardless of gender.

在表13中,我们报告了数据集中三个不同代词的共参考分数。我们观察到,对于“their/them/someone”代词,我们的模型在执行共参考消解时明显更好。以前的研究(Rae等人,2021年;Hoffmann等人,2022年)也得出了类似的观察结果,这很可能表明存在性别偏见。实际上,在“her/her/she”和“his/him/he”代词的情况下,模型可能使用职业的多数性别来执行共参考消解,而不是使用句子的证据。

为了进一步验证这一假设,我们查看了WinoGender数据集中"her/her/she"和"his/him/he"代词的"gotcha"案例。这些案例对应的句子中,代词与该职业的多数性别不一致,而职业才是正确答案。在表13中,我们观察到我们的模型LLaMA-65B在"gotcha"案例上犯了更多错误,清楚地表明它捕捉到了与性别和职业相关的社会偏见。"her/her/she"和"his/him/he"两类代词都出现了性能下降,说明无论代词性别如何都存在偏见。

5.4 TruthfulQA

TruthfulQA (Lin et al., 2021) aims to measure the truthfulness of a model, i.e., its ability to identify when a claim is true. Lin et al. (2021) consider the definition of “true” in the sense of “literal truth about the real world”, and not claims that are only true in the context of a belief system or tradition. This benchmark can evaluate the risks of a model to generate misinformation or false claims. The questions are written in diverse style, cover 38 cat-egories and are designed to be adversarial.

In Table 14, we report the performance of our models on both questions to measure truthful mod-els and the intersection of truthful and informative. Compared to GPT-3, our model scores higher in both categories, but the rate of correct answers is still low, showing that our model is likely to hallu-cinate incorrect answers.

TruthfulQA(Lin等人,2021年)旨在衡量模型的真实性,即其识别陈述是否真实的能力。Lin等人(2021年)将“真实”定义为“关于现实世界的字面真实”,而不仅仅是在信仰体系或传统背景下成立的陈述。该基准测试可以评估模型生成错误信息或虚假陈述的风险。问题以多样的风格编写,涵盖了38个类别,并被设计为对抗性的。

在表14中,我们报告了我们的模型在衡量真实模型和真实且有信息的问题时的性能。与GPT-3相比,我们的模型在这两个类别中得分较高,但正确答案的比率仍然很低,显示出我们的模型很可能会产生不正确的答案。

6 Carbon footprint碳足迹

模型训练对环境的能源和碳足迹的影响

  • 能源消耗和碳足迹: 作者强调模型训练耗费了大量能源,导致二氧化碳排放。在表格15中详细列出了总能源消耗和碳足迹。作者使用Wu等人(2022)的公式估算训练一个模型所需的瓦时(Wh)以及产生的二氧化碳排放量(tCO2eq)。

  • 能源消耗公式: 作者使用公式Wh = GPU-h×(GPU功耗)×PUE来估算所需的瓦时,其中PUE(Power Usage Effectiveness)被设置为1.1。这个公式考虑了GPU的数量、功耗以及PUE。

  • 碳排放计算: 二氧化碳排放量取决于训练网络的数据中心的位置。作者以美国全国平均碳强度因子0.385 kg CO2eq/KWh为基准,用于估算碳排放。不考虑数据中心的实际位置,以确保在相同数据中心条件下的模型训练成本比较。

  • 比较不同数据中心的模型训练成本: 通过将OPT和BLOOM在相同的数据中心条件下进行公平比较,作者估算出开发这些模型的成本约为2,638 MWh,总排放量为1,015 tCO2eq。作者希望通过释放这些模型,能够帮助减少未来的碳排放,因为训练已经完成,而且其中一些模型相对较小,可以在单个GPU上运行

The training of our models have consumed a mas-sive quantity of energy, responsible for the emis-sion of carbon dioxide. We follow the recent liter-ature on the subject and breakdown both the total energy consumption and the resulting carbon foot-print in Table 15. We follow a formula for Wu et al.(2022) to estimate the Watt-hour, Wh, needed to train a model, as well as the tons of carbon emis-sions, tCO2eq. For the Wh, we use the formula:

Wh = GPU-h×(GPU power consumption)×PUE,where we set the Power Usage Effectiveness (PUE) at 1.1. The resulting carbon emission depends on the location of the data center used to train the net-work. For instance, BLOOM uses a grid that emits 0.057 kg CO2eq/KWh leading to 27 tCO2eq and OPT a grid that emits 0.231 kg CO2eq/KWh, lead-ing to 82 tCO2eq. In this study, we are interested in comparing the cost in carbon emission of training of these models if they were trained in the same data center. Hence, we do not take the location of data center in consideration, and use, instead, the US national average carbon intensity factor of 0.385 kg CO2eq/KWh. This leads to the following formula for the tons of carbon emissions:

tCO2eq = MWh × 0.385.

我们模型的训练消耗了大量能源,导致二氧化碳的排放。我们参考最近的文献,将总能耗和相应的碳足迹分解如表15所示。我们遵循Wu等人(2022年)的公式来估计训练模型所需的瓦时(Wh)和碳排放量(tCO2eq)。对于瓦时,我们使用以下公式:

Wh = GPU-h×(GPU功耗)×PUE,其中我们将功耗使用效率(PUE)设置为1.1。产生的碳排放量取决于用于训练网络的数据中心的位置。例如,BLOOM使用的电网排放0.057千克CO2eq/KWh,导致27 tCO2eq;OPT使用的电网排放0.231千克CO2eq/KWh,导致82 tCO2eq。在本研究中,我们感兴趣的是比较在相同的数据中心中训练这些模型的碳排放成本。因此,我们不考虑数据中心的位置,而是使用美国国家平均碳强度因子为0.385千克CO2eq/KWh。这导致以下碳排放量的公式:

tCO2eq = MWh × 0.385.

We apply the same formula to OPT and BLOOM for fair comparison. For OPT, we assume training required 34 days on 992 A100-80GB (see their logs). Finally, we estimate that we used 2048 A100-80GB for a period of approximately 5 months to develop our models. This means that developing these models would have cost around 2,638 MWh under our assumptions, and a total emission of 1,015 tCO2eq. We hope that releasing these models will help to reduce future carbon emission since the training is already done, and some of the models are relatively small and can be run on a single GPU.

我们对OPT和BLOOM应用相同的公式进行公平比较。对于OPT,我们假设训练需要在992块A100-80GB上进行34天(参见其训练日志)。最后,根据我们的假设,我们估计使用了2048块A100-80GB进行了约5个月的模型开发。这意味着在我们的假设下,开发这些模型的成本约为2638 MWh,总排放量约为1015 tCO2eq。我们希望发布这些模型有助于减少未来的碳排放,因为训练已经完成,而且其中一些模型相对较小,可以在单个GPU上运行。
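按正文给出的公式,下面用一段简单的示意计算核对 65B 模型一次训练的能耗与碳排放量级(2048 块 A100-80GB、约 21 天、400W、PUE 1.1、0.385 kg CO2eq/kWh);结果只是粗略估算,与表15中的精确数字会略有出入。

```python
# 假设性示意:Wh = GPU-h × GPU功耗 × PUE,tCO2eq = MWh × 0.385。
gpu_count = 2048
days = 21
gpu_power_w = 400          # A100-80GB NVLink 系统的热设计功耗
pue = 1.1
carbon_intensity = 0.385   # kg CO2eq / kWh,美国全国平均值

gpu_hours = gpu_count * days * 24                 # ≈ 1.03e6 GPU-h
mwh = gpu_hours * gpu_power_w * pue / 1e6         # Wh 换算为 MWh
tco2eq = mwh * carbon_intensity
print(f"{gpu_hours:.2e} GPU-h, {mwh:.0f} MWh, {tco2eq:.0f} tCO2eq")
# 约 454 MWh、约 175 tCO2eq,与论文中 65B 模型的量级一致
```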

7 Related work相关工作

Table 15: Carbon footprint of training different models in the same data center. We follow Wu et al. (2022) to compute carbon emission of training OPT, BLOOM and our models in the same data center. For the power consumption of a A100-80GB, we take the thermal design power for NVLink systems, that is 400W. We take a PUE of 1.1 and a carbon intensity factor set at the national US average of 0.385 kg CO2e per KWh.在同一数据中心训练不同模型的碳足迹。我们遵循Wu等人(2022)的方法,在同一数据中心计算OPT、BLOOM和我们模型的碳排放量。对于A100-80GB的功耗,我们采用NVLink系统的热设计功耗,即400W。我们采用PUE值为1.1,碳强度因子设定为美国国家平均值,即每千瓦时0.385千克CO2e。

语言模型定义、语言模型历史、规模扩展、规模对性能的影响

  • 语言模型定义: 语言模型被定义为对单词、标记或字符序列的概率分布。这个任务通常被描述为下一个标记的预测,长期以来一直被认为是自然语言处理中的核心问题

  • 语言模型历史: 传统上,语言模型基于n-gram计数统计,使用各种平滑技术改进对稀有事件的估计。在过去的20年中,神经网络成功应用于语言建模任务,包括前馈模型、循环神经网络(RNNs)和长短时记忆网络(LSTMs)。近年来,基于自注意力机制的Transformer网络取得了重要的进展,特别是对于捕捉长距离依赖性。

  • 规模扩展: 在语言模型的发展历史中,对模型和数据集规模进行扩展有着悠久的历史。研究表明,使用包括BERT、GPT-2、Megatron-LM、T5等在内的大型语言模型取得了重要的成果。GPT-3更是达到了1750亿参数的规模,带来了一系列的大型语言模型,如Jurassic-1、Megatron-Turing NLG、Gopher、Chinchilla、PaLM、OPT和GLM等。

  • 规模对性能的影响: 其他研究关注了规模对深度学习模型性能的影响,发现模型和数据集规模与系统性能之间存在幂律关系。研究还涉及到适应学习速率计划以适应数据集规模扩展的细化方法,以及研究了大型语言模型的能力受规模扩展的影响。

Language models are probability distributions over sequences of words, tokens or charac-ters (Shannon, 1948, 1951). This task, often framed as next token prediction, has long been considered a core problem in natural language processing (Bahl et al., 1983; Brown et al., 1990). Because Turing (1950) proposed to measure machine intelligence by using language through the “imitation game”, language modeling has been proposed as a bench-mark to measure progress toward artificial intelli-gence (Mahoney, 1999).

Architecture. Traditionally, language models were based on n-gram count statistics (Bahl et al., 1983), and various smoothing techniques were proposed to improve the estimation of rare events (Katz, 1987; Kneser and Ney, 1995). In the past two decades, neural networks have been suc-cessfully applied to the language modelling task,starting from feed forward models (Bengio et al., 2000), recurrent neural networks (Elman, 1990; Mikolov et al., 2010) and LSTMs (Hochreiter and Schmidhuber, 1997; Graves, 2013). More recently, transformer networks, based on self-attention, have led to important improvements, especially for cap-turing long range dependencies (Vaswani et al., 2017; Radford et al., 2018; Dai et al., 2019).



Scaling. There is a long history of scaling for language models, for both the model and dataset sizes. Brants et al. (2007) showed the benefits of using language models trained on 2 trillion tokens, resulting in 300 billion n-grams, on the quality of machine translation. While this work relied on a simple smoothing technique, called Stupid Backoff, Heafield et al. (2013) later showed how to scale Kneser-Ney smoothing to Web-scale data. This made it possible to train a 5-gram model on 975 billion tokens from CommonCrawl, resulting in a model with 500 billion n-grams (Buck et al., 2014). Chelba et al. (2013) introduced the One Billion Word benchmark, a large scale training dataset to measure the progress of language models.


In the context of neural language models, Jozefowicz et al. (2016) obtained state-of-the-art results on the Billion Word benchmark by scaling LSTMs to 1 billion parameters. Later, scaling transformers led to improvements on many NLP tasks. Notable models include BERT (Devlin et al., 2018), GPT-2 (Radford et al., 2019), Megatron-LM (Shoeybi et al., 2019), and T5 (Raffel et al., 2020). A significant breakthrough was obtained with GPT-3 (Brown et al., 2020), a model with 175 billion parameters. This led to a series of Large Language Models, such as Jurassic-1 (Lieber et al., 2021), Megatron-Turing NLG (Smith et al., 2022), Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022), PaLM (Chowdhery et al., 2022), OPT (Zhang et al., 2022), and GLM (Zeng et al., 2022). Hestness et al. (2017) and Rosenfeld et al. (2019) studied the impact of scaling on the performance of deep learning models, showing the existence of power laws between the model and dataset sizes and the performance of the system. Kaplan et al. (2020) derived power laws specifically for transformer based language models, which were later refined by Hoffmann et al. (2022), by adapting the learning rate schedule when scaling datasets. Finally, Wei et al. (2022) studied the effect of scaling on the abilities of large language models.

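As a minimal illustration of the power laws mentioned above, the sketch below uses the parametric form popularized by Hoffmann et al. (2022), L(N, D) = E + A/N^α + B/D^β, where N is the parameter count and D the number of training tokens. The constants here are placeholders chosen for illustration, not fitted values from any paper; the example simply shows that, under such a fit, predicted loss keeps decreasing as the token count grows at a fixed parameter count.

```python
# Illustrative sketch of a Chinchilla-style parametric scaling law,
# L(N, D) = E + A / N**alpha + B / D**beta, where N is the parameter count
# and D the number of training tokens. The constants below are placeholders
# for illustration only, not fitted values reported in any paper.

def scaling_law_loss(n_params: float, n_tokens: float,
                     E: float = 1.7, A: float = 400.0, B: float = 410.0,
                     alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted pre-training loss under the parametric power-law form."""
    return E + A / n_params**alpha + B / n_tokens**beta

if __name__ == "__main__":
    # Example: a 7B-parameter model evaluated at two training-token budgets.
    for tokens in (0.2e12, 1.4e12):
        loss = scaling_law_loss(7e9, tokens)
        print(f"7B params, {tokens:.1e} tokens -> predicted loss {loss:.3f}")
```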

8 Conclusion

Summary of the paper's main contributions and observations

  • Released models: The paper introduces a series of openly released language models that are competitive with state-of-the-art foundation models. Most notably, LLaMA-13B outperforms GPT-3 while being more than 10× smaller, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B.

  • Training on public datasets: Unlike previous work, the paper shows that state-of-the-art performance can be reached by training exclusively on publicly available datasets, without resorting to proprietary data.

  • Focus on open problems: By releasing these models to the research community, the authors hope to accelerate the development of large language models and to help improve their robustness and mitigate known issues such as toxicity and bias.

  • Observation on fine-tuning: Like Chung et al. (2022), the authors observe that instruction fine-tuning of these models yields promising results, and they plan to investigate this further in future work.

  • Future plans: Finally, the authors plan to release larger models trained on larger pre-training corpora, since they have seen performance improve steadily as they scale up.

In this paper, we presented a series of language models that are released openly, and competitive with state-of-the-art foundation models. Most notably, LLaMA-13B outperforms GPT-3 while being more than 10× smaller, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B. Unlike previous studies, we show that it is possible to achieve state-of-the-art performance by training exclusively on publicly available data, without resorting to proprietary datasets. We hope that releasing these models to the research community will accelerate the development of large language models, and help efforts to improve their robustness and mitigate known issues such as toxicity and bias. Additionally, we observed, like Chung et al. (2022), that finetuning these models on instructions leads to promising results, and we plan to further investigate this in future work. Finally, we plan to release larger models trained on larger pretraining corpora in the future, since we have seen a constant improvement in performance as we were scaling.


Acknowledgements

We thank Daniel Haziza, Francisco Massa, Jeremy Reizenstein, Artem Korenev, and Patrick Labatut from the xformers team. We thank Susan Zhang and Stephen Roller for their support on data deduplication. We thank Luca Wehrstedt, Vegard Mella, and Pierre-Emmanuel Mazaré for their support on training stability. We thank Shubho Sengupta, Kalyan Saladi, and all the AI infra team for their support. We thank Jane Yu for her input on evaluation. We thank Yongyi Hu for his help on data collection.

