LLMs:《BLOOM: A 176B-Parameter Open-Access Multilingual Language Model》翻译与解读


导读:BLOOM(BigScience Large Open-science Open-access Multilingual Language Model),一个由数百名研究人员合作设计和构建的176B参数的开放获取语言模型。BLOOM是一个仅解码器的Transformer语言模型,通过在ROOTS语料库上进行训练,在多种基准测试中表现出竞争性的性能。作者详细记录了BLOOM的开发过程,包括训练数据集ROOTS的创建、架构和分词器的设计,并探讨了BLOOM与其他大型语言模型的评估结果。

>> BLOOM 系列模型开源以促进LLM的研究和应用:作者希望该模型的发布能够推动大型语言模型的应用和研究,并记录他们的经验,以促进类似BigScience的大规模协作项目,使更多具有不同背景的人能够参与并推动该领域的发展。其中,BLOOMZ系列模型基于 xP3 数据集微调,推荐用于英语提示(prompting);BLOOMZ-mt系列模型基于 xP3mt 数据集微调,推荐用于非英语提示(prompting)。

>> 基于ROOTS数据集(46种自然语言+13种编程语言):BLOOM在ROOTS语料库上训练,该语料库包含来自数百个来源的46种自然语言和13种编程语言的文本(共59种语言)。

>> 模型架构=仅解码器的Transformer+ALiBi位置嵌入+嵌入层归一化+多语言的tokenizer(字节级BPE)

>> 分布式训练工程策略=Megatron-DeepSpeed框架+3D并行【数据并行+张量并行+流水线并行】+混合精度【bfloat16】

>> 在法国超级计算机上训练了3.5个月并详细记录了BLOOM开发过程:BLOOM是一个176B参数的开放获取多语言语言模型,由数百名研究人员组成的BigScience团队创建,并在法国政府资助的Jean Zay超级计算机上进行了为期3.5个月的训练。在本文中,我们详细记录了BLOOM的开发过程,包括其训练数据集ROOTS的创建、架构和分词器的设计。

目录

《BLOOM: A 176B-Parameter Open-Access Multilingual Language Model》翻译与解读

Abstract

1、Introduction介绍

2、Background背景

2.1、Language Modeling语言建模

早期语言模型Early Language Models

神经语言模型Neural Language Models

Transfer Learning 迁移学习

少样本和零样本学习Few- and Zero-Shot:2020提出了prompts概念

LLM开发的社会限制Social Limitations of LLM Development

2.2、BigScience

参与者Participants

组织Organization

伦理考虑Ethical Considerations within BigScience

3、BLOOM

3.1、Training Dataset训练数据集

动机Motivation

3.1.1、Data Governance数据治理

3.1.2、Data Sources数据来源

Language Choices 语言选择

源选择Source Selection

GitHub Code

OSCAR

3.1.3、Data Preprocessing数据预处理

获取源数据Obtaining the Source Data

"质量"过滤:人类撰写的自然语言文本“Quality” filtering: Text Produced by Humans for Humans

Deduplication and Privacy Redaction去重和隐私删除

3.1.4、Prompted Datasets提示数据集

3.2、Model Architecture模型架构

3.2.1、Design Methodology设计方法

消融实验的实验设计Experimental Design for Ablations

不在范围内的架构Out-of-scope Architectures

3.2.2、Architecture and Pretraining Objective架构和预训练目标

3.2.3、Modeling Details建模细节

ALiBi位置嵌入ALiBi Positional Embeddings

嵌入层归一化Embedding LayerNorm

3.3、Tokenization标记化

验证Validation

标记器训练数据Tokenizer Training Data

词汇表大小Vocabulary Size

字节级BPE Byte-level BPE

归一化Normalization

预标记器Pre-tokenizer

3.4、Engineering工程

3.4.1、Hardware硬件

3.4.2、Framework框架

3.4.3、Floating Point Format浮点数格式

3.4.4、Fused CUDA Kernels融合的CUDA内核

3.4.5、Additional Challenges额外挑战

3.5、Training训练

预训练模型Pretrained Models

多任务微调Multitask Finetuning

对比微调Contrastive Finetuning

3.5.1、Carbon Footprint碳足迹

3.6、Release发布

模型卡Model Card

许可证Licensing

4、Evaluation评估

4.1、Experimental Design实验设计

4.1.1、Prompts提示

4.1.2、Infrastructure基础设施

4.1.3、Datasets数据集

SuperGLUE

Machine Translation (MT) 机器翻译(MT)

Summarization 摘要

4.1.4、Baseline Models基准模型

4.2、SuperGLUE

4.3、Machine Translation机器翻译

4.3.1、WMT

4.3.2、DiaBLA

4.3.3、Flores

4.4、Summarization摘要

4.5、Code Generation代码生成

4.6、HELM benchmark基准

4.7、Multitask Finetuning多任务微调

4.8、Embeddings嵌入

4.9、Multilingual Probing多语言探测

4.9.1、Method方法

Baselines 基准

Correlation相关性

4.9.2、Results结果

Probing探测

Correlation相关性

Discussion讨论

4.10、Bias偏见

Limitations 限制

5、Conclusion结论

6、Contributions贡献

Acknowledgments致谢


《BLOOM: A 176B-Parameter Open-Access Multilingual Language Model》翻译与解读

时间:2022年11月9日

地址:https://arxiv.org/abs/2211.05100

作者:BigScience Workshop∗

BigScience Workshop是一个由数百名研究人员组成的团队,旨在推动大规模协作研究项目。该团队致力于开发和推广大型语言模型(LLMs),以及通过协作研究实现这些模型的开放获取。他们的目标是将这一强大技术民主化,使更多的人能够使用和研究LLMs。

Abstract

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

Keywords: Language models, collaborative research

大型语言模型(LLM)已被证明能够基于少量示例或自然语言指令执行新任务。尽管这些能力已带来广泛应用,但大多数LLM由资源丰富的组织开发,且常常不向公众公开。为了推动这一强大技术的民主化,我们推出了BLOOM,这是一个由数百名研究人员合作设计和构建的176B参数开放获取语言模型。BLOOM是一个仅解码器的Transformer语言模型,在ROOTS语料库上训练,该语料库包含46种自然语言和13种编程语言的数百个来源(总共59种语言)。我们发现BLOOM在各种基准测试中取得了有竞争力的性能,在进行多任务提示微调后表现更强。为了促进未来使用LLM的研究和应用,我们在负责任AI许可证(Responsible AI License)下公开发布我们的模型和代码。

关键词:语言模型,协作研究

1、Introduction介绍

Pretrained language models have become a cornerstone of modern natural language processing (NLP) pipelines because they often produce better performance from smaller quantities of labeled data. The development of ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), GPT (Radford et al., 2018), and BERT (Devlin et al., 2019) led to the widespread use of pretrained models as an initialization for finetuning on downstream tasks. The subsequent finding that pretrained language models can perform useful tasks without any additional training (Radford et al., 2019; Brown et al., 2020) further demonstrated their utility. In addition, the empirical observation that a language model’s performance tends to increase as the model is made larger—sometimes predictably (Hestness et al., 2017; Kaplan et al., 2020; Hoffmann et al., 2022) and sometimes suddenly (Wei et al., 2022)—has led to a trend of increasing scale (Zeng et al., 2021; Rae et al., 2021; Smith et al., 2022; Chowdhery et al., 2022). Apart from environmental concerns (Strubell et al., 2019; Lacoste et al., 2019; Schwartz et al., 2020), the costs of training large language models (LLMs) are only affordable for well-resourced organizations. Furthermore, until recently, most LLMs were not publicly released. As a result, the majority of the research community has been excluded from the development of LLMs. This exclusion has had concrete consequences; for example, most LLMs are primarily trained on English-language text (with notable exceptions in Chinese and Korean, e.g. Wang et al., 2021; Zeng et al., 2021; Kim et al., 2021).

预训练语言模型已成为现代自然语言处理(NLP)流程的重要组成部分,因为它们通常能在较少标注数据的情况下获得更好的性能。ELMo(Peters等,2018)、ULMFiT(Howard和Ruder,2018)、GPT(Radford等,2018)和BERT(Devlin等,2019)的发展,导致了预训练模型被广泛用作初始化,用于下游任务的微调。随后的发现表明,预训练语言模型可以在没有任何额外训练的情况下执行有用的任务(Radford等,2019;Brown等,2020),进一步证明了它们的实用性。此外,语言模型的性能往往随着模型变得更大而增加,有时可以预测(Hestness等,2017;Kaplan等,2020;Hoffmann等,2022),有时突然增加(Wei等,2022),这导致了增大规模的趋势(Zeng等,2021;Rae等,2021;Smith等,2022;Chowdhery等,2022)。除了环境问题(Strubell等,2019;Lacoste等,2019;Schwartz等,2020)之外,训练大型语言模型(LLM)的成本只有那些资源充足的组织才能承担得起。此外,直到最近,大多数LLM并没有公开发布。因此,大部分研究社区在LLM的发展中被排除在外。这种排除产生了实际后果;例如,大多数LLM主要是在英语文本上进行训练的(值得注意的例外是中文和韩语,例如Wang等,2021;Zeng等,2021;Kim等,2021)。

To address these issues, we present the BigScience Large Open-science Open-access Multilingual Language Model (BLOOM, BigScience Workshop, 2022). BLOOM is a 176 billion parameter language model trained on 46 natural languages and 13 programming languages that was developed and released by a collaboration of hundreds of researchers. The compute for training BLOOM was provided through a French public grant from GENCI and IDRIS, leveraging IDRIS’ Jean Zay supercomputer. To build BLOOM, we undertook a thorough design process for each of its components, including the training dataset (Section 3.1), model architecture and training objective (Section 3.2), and engineering strategy for distributed learning (Section 3.4). We also performed an analysis of the model’s capabilities (Section 4). Our overall aim is not only to publicly release a large-scale multilingual language model with performance comparable to recently developed systems, but also to document the coordinated process that went into its development (Section 2.2). The purpose of this paper is to provide a high-level overview of these design steps while referencing the individual reports we produced over the course of developing BLOOM.

为了解决这些问题,我们推出了BigScience Large Open-science Open-access Multilingual Language Model(BLOOM,BigScience Workshop,2022)。BLOOM是由数百名研究人员合作开发和发布的一个拥有1760亿参数的语言模型,在46种自然语言和13种编程语言上训练。训练BLOOM的计算资源由GENCI和IDRIS提供的法国公共资助支持,使用了IDRIS的Jean Zay超级计算机。为了构建BLOOM,我们对其各个组成部分进行了全面的设计,包括训练数据集(第3.1节)、模型架构和训练目标(第3.2节)以及分布式学习的工程策略(第3.4节)。我们还对模型的能力进行了分析(第4节)。我们的总体目标不仅是公开发布一个性能可与近期开发的系统相媲美的大规模多语言语言模型,还要记录其开发过程中的协调过程(第2.2节)。本文的目的是在引用我们在开发BLOOM过程中产生的各个报告的同时,提供对这些设计步骤的高层次概述。

2、Background背景

Before describing the BLOOM model itself, in this section we provide necessary background on LLMs as well as an organizational overview of the BigScience effort.

在描述BLOOM模型本身之前,本节将提供关于LLM以及BigScience项目的组织概述的必要背景信息。

2.1、Language Modeling语言建模

Language modeling refers to the task of modeling the probability of a sequence of tokens in a text (Shannon, 1948), where a token is a unit of text (e.g. word, subword, character or byte, etc., as discussed by Mielke et al., 2021). In this work (and in most current applications of language modeling) we model the joint probability of tokens in a text as:

p(x) = p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})

where x is a sequence of tokens, x_t is the t-th token, and x_{<t} is the sequence of tokens preceding x_t. This approach is referred to as autoregressive language modeling and can be seen as iteratively predicting the probability of the next token.

语言建模是指对文本中一系列标记的概率进行建模的任务(Shannon,1948),其中标记是文本的一个单位(例如单词、子词、字符或字节等,如Mielke等人,2021所讨论的)。在这项工作中(以及大多数当前的语言建模应用中),我们将文本中标记的联合概率建模为:

p(x) = p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})

其中,x是一个标记序列,x_t是第t个标记,x_{<t}是x_t之前的标记序列。这种方法被称为自回归语言建模,可以看作是迭代地预测下一个标记的概率。
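
下面给出一个简要的示意代码(假设安装了 PyTorch 和 Hugging Face transformers,并以公开的 bigscience/bloom-560m 检查点为例,模型名仅为举例):按上式把句子的对数概率分解为逐个“下一个标记”的对数概率之和。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # 假设使用的较小检查点,仅作示意
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "BigScience released an open multilingual language model."
ids = tok(text, return_tensors="pt").input_ids  # 形状 (1, T)

with torch.no_grad():
    logits = model(ids).logits  # 形状 (1, T, vocab_size)

# 位置 t 的标记概率 p(x_t | x_<t) 由位置 t-1 的 logits 给出
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_logp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
print("log p(x) =", token_logp.sum().item())
```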

早期语言模型Early Language Models

Language models have a long history of application in NLP. Early language models (such as those developed by Shannon, 1948) were primarily n-gram models that estimate the probability of a length-n sequence of tokens in accordance with the number of times it appears in a training corpus. In practice, n-gram models face two major issues: first, they grow exponentially in size as n is increased; and second, they have no direct way of producing a probability for a sequence of tokens that does not appear in their training data. Advances on these problems enabled n-gram models to see widespread use across most areas of NLP (Goodman, 2001).

语言模型在自然语言处理中有着悠久的应用历史。早期的语言模型(例如Shannon,1948提出的模型)主要是n-gram模型,根据在训练语料库中出现的次数估计长度为n的标记序列的概率。在实践中,n-gram模型面临两个主要问题:首先,随着n的增加,它们的大小呈指数级增长;其次,它们没有直接的方法来为在训练数据中未出现的标记序列生成概率。解决这些问题的进展使得n-gram模型在NLP的大多数领域得到广泛应用(Goodman,2001)。

神经语言模型Neural Language Models

An alternative to n-gram models, first proposed by Miikkulainen and Dyer (1991) and Schmidhuber and Heil (1996) and later popularized by Bengio et al. (2000), is to use a neural network to estimate the probability of the next token given prior tokens. While early work used feed-forward networks with a fixed-length history window, Mikolov et al. (2010); Sutskever et al. (2011); Graves (2013) proposed to use recurrent neural networks instead and found that this significantly improved performance. More recently, language models based on the Transformer architecture (Vaswani et al., 2017) were shown to be more effective than recurrent neural networks (Radford et al., 2018; Al-Rfou et al., 2019; Kaplan et al., 2020). Consequently, the Transformer has become the de facto choice for language models.

Miikkulainen和Dyer(1991)以及Schmidhuber和Heil(1996)首次提出了一种替代n-gram模型的方法,并在之后由Bengio等人(2000)推广普及,即使用神经网络来估计给定先前标记的下一个标记的概率。早期的工作使用带有固定历史窗口的前馈网络,但Mikolov等人(2010);Sutskever等人(2011);Graves(2013)提出改用循环神经网络RNN,并发现这显著提高了性能。更近期,基于Transformer架构(Vaswani等人,2017)的语言模型被证明比循环神经网络更有效(Radford等人,2018;Al-Rfou等人,2019;Kaplan等人,2020)。因此,Transformer已成为语言模型的事实标准选择。

Transfer Learning 迁移学习

In tandem with advances in language modeling using neural networks, NLP pipelines have increasingly adopted the framework of transfer learning. In transfer learning, the parameters of a model are first pretrained on a data-rich task before being finetuned on a downstream task. A historically common approach to obtaining pretrained parameters were word vectors (Mikolov et al., 2013) trained so that the dot product of co-occurring word vectors is large. However, subsequent work by Peters et al. (2018); Howard and Ruder (2018); Radford et al. (2018); Devlin et al. (2019) showed that the framework of Collobert et al. (2011), where the entire model is pretrained before being finetuned, can attain stronger performance. In particular, Radford et al. (2018); Devlin et al. (2019) demonstrated strong results using pretrained Transformer language models, prompting work on progressively better models (Liu et al., 2019; Yang et al., 2019; Lewis et al., 2020; Raffel et al., 2020; Zhang et al., 2019, etc.).

随着使用神经网络进行语言建模的进展,NLP流水线越来越多地采用了迁移学习的框架。在迁移学习中,模型的参数首先在一个数据丰富的任务上进行预训练,然后在下游任务上进行微调以前常用的获得预训练参数的方法是使用词向量(Mikolov等人,2013),使得共现单词向量的点积较大。然而,Peters等人(2018);Howard和Ruder(2018);Radford等人(2018);Devlin等人(2019)的后续工作表明,在整个模型进行预训练后再进行微调的Collobert等人(2011)的框架可以获得更强的性能。特别是Radford等人(2018);Devlin等人(2019)展示了使用预训练的Transformer语言模型取得了强大的结果,引发了进一步改进模型的工作(Liu等人,2019;Yang等人,2019;Lewis等人,2020;Raffel等人,2020;Zhang等人,2019等)。

少样本和零样本学习Few- and Zero-Shot Learning:2020年提出了prompts概念

While finetuning a pretrained model remains an effective way of attaining high performance with limited labeled data, a parallel line of work has demonstrated that pretrained language models can be induced to perform tasks without any subsequent training. After Vinyals and Le (2015) observed limited task-performing behavior in a neural dialog model, Radford et al. (2019) later demonstrated that Transformer-based language models trained on text scraped from the web could perform various tasks to varying degrees. Notably, Radford et al. (2019) found that performance improved with model scale, inspiring work to characterize (Kaplan et al., 2020; Hoffmann et al., 2022) and exploit (Shoeybi et al., 2019; Brown et al., 2020; Smith et al., 2022; Chowdhery et al., 2022; Rae et al., 2021; Wang et al., 2021; Zeng et al., 2021; Zhang et al., 2022) the benefits of scale. A major factor in the success of this approach is the way that task-specific examples are formatted when fed into the model. Brown et al. (2020) popularized the idea of designing “prompts” that provide natural-language descriptions of the task and also allow inputting a few demonstrations of input-output behavior.

虽然微调预训练模型仍然是使用有限标注数据实现高性能的有效方法,但并行的研究已经表明,预训练语言模型可以在没有后续训练的情况下执行任务。Vinyals和Le(2015)观察到神经对话模型中的有限任务执行行为后,Radford等人(2019)随后证明,基于Transformer的语言模型在从网络上抓取的文本上进行训练后,可以以不同程度执行各种任务。值得注意的是,Radford等人(2019)发现,模型规模的增大会改善性能,激发了对规模优势的表征(Kaplan等人,2020;Hoffmann等人,2022)和利用(Shoeybi等人,2019;Brown等人,2020;Smith等人,2022;Chowdhery等人,2022;Rae等人,2021;Wang等人,2021;Zeng等人,2021;Zhang等人,2022)的研究。这种方法成功的一个重要因素是将特定任务的示例在输入模型时的格式化方式。Brown等人(2020)提出了设计“提示”(prompts)的概念,提供任务的自然语言描述,并允许输入一些输入-输出行为的演示。

LLM开发的社会限制Social Limitations of LLM Development

While the continued increase in the size of large language models has resulted in improvements across a wide range of tasks, it has also exacerbated issues with their development and use (Bender et al., 2021). The computational expense of large models also prohibits the majority of the research community from participating in their development, evaluation and routine use. Moreover, the computational costs have also led to concerns about the carbon footprint stemming from the training and use of large language models (Strubell et al., 2019; Lacoste et al., 2019; Schwartz et al., 2020; Bannour et al., 2021), and existing carbon footprint studies have likely under-estimated emissions (Bannour et al., 2021). Contributing to an increase in the global carbon footprint exacerbates climate change which most severely affects already-marginalized communities (Westra and Lawson, 2001). Furthermore, the concentration of resources within a handful of (typically industrial) institutions with primarily technical expertise hinders prospects for an inclusive, collaborative, and reliable governance of the technology. First, public narratives about the technology that are driven by industry actors can lead to inflated expectations about its suitability for use (Brennen, 2018; Brennen et al., 2022), leading to misaligned research and policy priorities (Raji et al., 2022) and potentially dire consequences in e.g. medical applications (Wong et al., 2021). Second, in a world mediated by technology, choices at all stages of its development end up shaping people’s lives in a way that can be most closely compared to regulations (Winner, 1977, 2017), albeit without the same explicit consultation of stakeholders in the process. When the development efforts are guided by prioritizing internal definitions of performance over their impact on society, the values of the developers come to be emphasized over those of the direct and indirect users (Birhane et al., 2022). Despite the substantial social dangers in allowing this technology to be developed unilaterally by corporations, EleutherAI (Phang et al., 2022) was the only non-corporate entity outside of China that was developing large language models before the BigScience Workshop was convened.

虽然大型语言模型规模的持续增长在各种任务上取得了改进,但也加剧了其开发和使用过程中的问题(Bender等人,2021)。大型模型的计算开销也限制了大部分研究社区参与其开发、评估和常规使用。此外,计算成本还引发了关于训练和使用大型语言模型的碳足迹的担忧(Strubell等人,2019;Lacoste等人,2019;Schwartz等人,2020;Bannour等人,2021),现有的碳足迹研究可能低估了排放量(Bannour等人,2021)。增加全球碳足迹会加剧气候变化,最严重影响已经处于边缘化的社区(Westra和Lawson,2001)。此外,资源集中在少数(通常是工业)机构中,这些机构主要具有技术专长,阻碍了对该技术进行包容、合作和可靠治理的前景。首先,由行业参与者推动的有关该技术的公众叙事可能会导致对其适用性的夸大期望(Brennen,2018;Brennen等人,2022),从而导致研究和政策优先事项不一致(Raji等人,2022),在医疗应用等方面可能产生严重后果(Wong等人,2021)。其次,在一个由技术介导的世界中,开发过程中的选择最终塑造了人们的生活方式,这可以与规章制度(Winner,1977,2017)进行最为密切的比较,尽管在过程中没有同样明确地咨询利益相关者。当开发工作的指导原则是将内部对性能的定义置于对社会影响的重要性之上时,开发人员的价值观就会凸显出来,而用户的价值观则被弱化(Birhane等人,2022)。尽管允许企业单方面开发这项技术存在重大的社会危险,但在召开BigScience研讨会之前,EleutherAI(Phang等人,2022)是除中国外唯一一个开发大型语言模型的非企业实体。

2.2、BigScience

参与者Participants

BLOOM’s development was coordinated by BigScience, an open research collaboration whose goal was the public release of an LLM. The project started after being awarded by GENCI a compute grant on its Jean Zay supercomputer at IDRIS/CNRS. It was initially built around a concerted effort from Hugging Face and the French NLP community (the “founding members”), and quickly opened up to grow into a broader international collaboration to support its aims of linguistic, geographical, and scientific diversity. In the end, over 1200 people registered as participants in BigScience and were given access to its communication channels. They had background not only in machine learning and computer science, but also linguistics, statistics, socio-cultural anthropology, philosophy, law, and other fields. Of those, hundreds of individuals have directly contributed to one of the project’s released artifacts. While the largest number of participants ultimately originated from the US, 38 countries were represented.

BLOOM的开发由BigScience协调进行,BigScience是一个开放的研究合作项目,旨在公开发布一个大型语言模型(LLM)。该项目在获得GENCI在其位于IDRIS/CNRS的Jean Zay超级计算机上的计算资助后启动。最初,该项目是由Hugging Face和法国自然语言处理社区(“创始成员”)共同努力构建的,并迅速扩大成为一个更广泛的国际合作项目,以支持其在语言、地理和科学多样性方面的目标。最终,有超过1200人注册成为BigScience的参与者,并获得了其沟通渠道的访问权限。他们不仅在机器学习和计算机科学方面具有背景,还包括语言学、统计学、社会文化人类学、哲学、法律和其他领域的知识。其中,数百人直接为项目的发布物之一做出了贡献。虽然最终来自美国的参与者人数最多,但参与者总共来自38个国家。

组织Organization

The set of related research questions tackled by the BigScience effort was reflected in the project’s organization into working groups. Each working group comprised several participants with various levels of involvement, including chairs whose role was to self-organize around a specific aspect of the overall project. Importantly, participants were encouraged to join more than one working group in order to share experiences and information, which resulted in the set of 30 working groups presented in Figure 1. Most of  the working groups focused on tasks directly linked to the development of BLOOM. In addition, a few groups focused on the evaluation of LLMs and dataset development in specific domains, such as biomedical texts (Fries et al., 2022b) and historical texts (De Toni et al., 2022). A larger overview of the motivations behind this initiative, its history and some of the lessons learned can be found in Akiki et al. (2022).

BigScience所要解决的一系列相关研究问题体现在项目按工作组划分的组织结构中。每个工作组由参与程度不同的多名参与者组成,其中包括负责围绕整个项目某一特定方面进行自组织的主席(chairs)。重要的是,鼓励参与者加入多个工作组,以分享经验和信息,这形成了图1中呈现的30个工作组。大多数工作组专注于与BLOOM开发直接相关的任务。此外,一些小组专注于特定领域中LLM的评估和数据集开发,例如生物医学文本(Fries等,2022b)和历史文本(De Toni等,2022)。关于此倡议背后的动机、其历史以及一些经验教训的更详细概述可参见Akiki等(2022)。

伦理考虑Ethical Considerations within BigScience

In order to acknowledge and start addressing social limitations of LLM development within BigScience, the workshop relied on a collaboratively designed Ethical Charter and original research on applicable regulations in jurisdictions outside of the US to guide its choices throughout the project. In particular, the charter emphasizes values of inclusivity and diversity, openness and reproducibility, and responsibility in various aspects of the organization (Akiki et al., 2022). Each of these values are showcased in different ways in the dataset curation (Section 3.1), modeling (Section 3.2), engineering (Section 3.4), evaluation (Section 4), and other social impact (throughout) aspects of the project.

为了承认并开始解决BigScience内部LLM开发的社会限制,该研讨会依赖于合作设计的伦理宪章和关于美国以外司法管辖区适用法规的原始研究,以指导项目的选择。特别是,该宪章强调包容性和多样性、开放性和可重复性以及组织的各个方面的责任等价值观(Akiki等,2022)。这些价值观在项目的数据集策划(第3.1节)、建模(第3.2节)、工程(第3.4节)、评估(第4节)以及其他社会影响方面都以不同的方式展示出来。

3、BLOOM

In this section, we document the design of BLOOM, including its training dataset (Sec- tion 3.1), architecture (Section 3.2), tokenizer (Section 3.3), computing infrastructure (Sec- tion 3.4), and training hyperparameters (Section 3.5).

在本节中,我们将介绍BLOOM的设计,包括其训练数据集(第3.1节)、架构(第3.2节)、分词器(第3.3节)、计算基础设施(第3.4节)和训练超参数(第3.5节)。

3.1、Training Dataset训练数据集

BLOOM was trained on the ROOTS corpus (Laurençon et al., 2022), a composite collection of 498 Hugging Face datasets (Lhoest et al., 2021) amounting to 1.61 terabytes of text that span 46 natural languages and 13 programming languages. A high-level overview of this dataset can be seen in Figure 3, while a detailed itemized list of every language along with its linguistic genus, family and macroarea is presented in Table 1. Beyond the corpus itself, the process resulted in the development and release of a number of organizational and technical tools, including those illustrated in Figure 2. The rest of this section will contextualize these efforts by providing a brief summary of the steps taken to compile the corpus. For more detailed documentation of the overall dataset curation process and its outcomes, we refer the reader to Laurençon et al. (2022).

BLOOM是在ROOTS语料库(Laurençon等,2022)上进行训练的,该语料库是由498个Hugging Face数据集(Lhoest等,2021)组成的复合集合,总共有1.61 TB的文本,涵盖了46种自然语言和13种编程语言。图3展示了该数据集的高级概述,表1列出了每种语言及其语言属、家族和宏地区的详细列表。除了语料库本身之外,该过程还导致了一些组织和技术工具的开发和发布,包括图2中所示的工具。本节的其余部分将通过提供对编制语料库所采取的步骤的简要概述来为这些努力提供背景。关于整体数据集策划过程及其结果的更详细文档,请参阅Laurençon等(2022)。

动机Motivation

The disconnect between developers and (in)voluntary users of the technology mentioned in Section 2 is particularly apparent in the curation of the datasets that have supported recent large-scale machine learning projects, where intentional “Data work” is generally under-valued (Sambasivan et al., 2021). In the context of LLMs, this tendency is exemplified by a range of heuristics-based filtering approaches that prioritize getting as much “high-quality” data for as little cost as possible over engaging with the needs—and rights—of data subjects, where quality is commonly defined as maximizing performance on downstream tasks while occasionally removing content deemed offensive by the developers.

第2节提到的开发者与(非)自愿使用技术的用户之间的脱节在支持最近大规模机器学习项目的数据集策划中尤为明显,在这种情况下,故意的“数据工作”通常被低估(Sambasivan等,2021)。在LLM的背景下,这种倾向通过一系列基于启发式过滤方法来体现,这些方法优先考虑以最小成本获取尽可能多的“高质量”数据,而不是与数据主体的需求和权利进行交流,其中高质量通常被定义为在下游任务中最大化性能,同时偶尔删除开发者认为冒犯性的内容。

While these approaches do yield terabytes of data with comparatively little human effort, compounding biases of the source material (such as CommonCrawl dumps) with those of the filtering method often leads to negative outcomes for marginalized populations. In one case, the use of a block list to remove “pornographic” text was shown to also suppress LGBTQ+ and African American English (AAE) text from a corpus (Dodge et al., 2021). In another, using Reddit outgoing links as an indicator of quality for a seed corpus (Radford et al., 2019) leads to trained models that implicitly prioritize US-centric  views  in  their outputs (Johnson et al., 2022). In yet another project, a filtering approach that relied on a machine learning image-text alignment model was shown to exacerbate its biases in the created multimodal dataset (Birhane et al., 2021). In addition, this abstractive approach to data curation leads to corpora that are difficult to meaningfully document and govern after the fact, as the provenance and authorship of individual items is usually lost in the process (although works such as Gao et al. (2020) that prioritize compilations of previously documented individual sources over crawled data provide a step towards addressing these issues (Biderman et al., 2022)).

尽管这些方法确实可以用较少的人力成本获得数TB的数据,但将源材料的偏见(如CommonCrawl转储)与过滤方法的偏见相结合,往往会对边缘化人群产生负面影响。例如,有一种使用屏蔽列表来删除“色情”文本的方法被证明也会从语料库中排除LGBTQ+和非洲裔美国英语(AAE)文本(Dodge等,2021)。另一个例子是使用Reddit的外部链接作为种子语料库质量的指标(Radford等,2019),这导致训练出的模型在输出中隐含地优先考虑美国中心的观点(Johnson等,2022)。在另一个项目中,一种依赖于机器学习图像-文本对齐模型的过滤方法被证明会加剧其在创建的多模态数据集中的偏见(Birhane等,2021)。此外,这种抽象的数据策划方法导致难以在事后进行有意义的文档化和治理,因为个别项的来源和作者通常在过程中丢失(尽管像Gao等(2020)这样优先考虑已有文档化个别来源的作品是解决这些问题的一步(Biderman等,2022))。

In the context of the BigScience workshop, and in accordance with its Ethical Charter, we aimed to prioritize human involvement, local expertise, and language expertise in our data curation and documentation process, as outlined in the following sections.

在BigScience研讨会的背景下,并根据其伦理宪章,我们致力于在数据策划和文档化过程中优先考虑人的参与、本地专业知识和语言专业知识,具体内容将在以下章节中概述。

3.1.1、Data Governance数据治理

Large text corpora comprise text about and created by people: the data subjects. Different people and institutions might legally “own” that data, making them data rights-holders. As machine learning developers gather and collate that data into ever-larger datasets to support training larger models, it becomes increasingly important to develop new ways of accounting for the interests of all parties involved – developers, data subjects, and rights-holders alike.

大型文本语料库包括关于人的文本和人们创建的文本:数据主体。不同的人和机构可能在法律上“拥有”这些数据,使他们成为数据权利持有人。随着机器学习开发者将这些数据收集和整合到规模越来越大的数据集中,以支持训练更大的模型,开发新的方式来考虑所有相关方的利益(包括开发者、数据主体和权利持有人)变得越来越重要。

The BigScience effort aimed to address these needs through a multidisciplinary lens combining technical, legal, and sociological expertise. The group focused on two main interrelated goals at two different time scales: the design of a structure for long-term international data governance that prioritizes the agency of the data rights-holders, and concrete recommendations for handling the data used directly in the BigScience project. Progress on the first goal is presented in the work of Jernite et al. (2022), which further motivates the needs and requirements of data governance, and outlines the structure needed for a network of data custodians, rights-holders, and other parties to appropriately govern shared data. The interactions between these actors are designed to account for the privacy, intellectual property, and user rights of the data and algorithm subjects in a way that aims to prioritize local knowledge and expression of guiding values. In particular, this approach relies on structured agreements between data providers and data hosts that specify what the data may be used for.

BigScience努力通过结合技术、法律和社会学专业知识的多学科视角来解决这些需求。该团队专注于两个时间尺度不同但相互关联的主要目标:设计一个优先考虑数据权利持有人自主权(agency)的长期国际数据治理结构,以及为处理BigScience项目中直接使用的数据提出具体建议。第一个目标的进展在Jernite等人(2022)的工作中得到了展示,该工作进一步阐述了数据治理的需求和要求,并概述了由数据保管方、权利持有方和其他相关方组成的网络适当治理共享数据所需的结构。这些参与者之间的互动被设计为考虑数据和算法主体的隐私、知识产权和用户权利,以优先考虑本地知识和价值导向的表达。具体而言,这种方法依赖于数据提供者和数据托管方之间的结构化协议,明确规定数据的使用方式。

While we were not able to fully establish an international organization in the comparatively short time between the project start and model training, we worked on integrating lessons from this effort (and conversely adapting it to the practical concerns we were experiencing) in the following main ways: (i) we sought explicit permission to use the data from specific providers within the context of BigScience whenever possible (such as for the AI2-managed S2ORC corpus of Lo et al. (2020) or articles from the French newspaper Le Monde); (ii) we kept individual sources separate until the final stages of preprocessing to maintain traceability and handle each according to the needs of its specific context; and (iii) we adopted a composite release approach for the various data sources that make up the overall corpus to foster reproducibility and follow-up research while respecting these source-dependent needs. Resources to visualize and access the ROOTS corpus can be found on the Hugging Face Hub organization “BigScience Data”. The organization hosts several demos (or “Spaces”) that can be used to gain insights into the full corpus, as well as direct access to the 223 (out of 498) components that we are able to distribute taking into account their licensing status, privacy risks, and agreements with their original custodians. Finally, since we understand that future investigation into the BLOOM models may require full access to the entire corpus, we are also inviting researchers with a relevant research project in mind to join ongoing efforts to analyze the data through a sign-up form.

虽然在项目启动和模型训练之间相对短暂的时间内,我们无法完全建立一个国际组织,但我们通过以下主要方式努力纳入这一工作的经验教训(并相应地调整以适应我们所面临的实际问题):(i)在可能的情况下,我们在BigScience范围内向特定数据提供方寻求明确的数据使用许可(例如由AI2管理的S2ORC语料库(Lo等人,2020)或法国报纸Le Monde的文章);(ii)我们在预处理的最后阶段之前将各个来源保持分离,以保持可追溯性,并根据其具体上下文的需求进行处理;(iii)我们对构成整体语料库的各种数据源采用了组合式发布方法,以促进可重复性和后续研究,并尊重这些数据源的特定需求。可在Hugging Face Hub组织“BigScience Data”上找到可视化和访问ROOTS语料库的资源。该组织托管了几个演示(或“空间”),可用于深入了解完整语料库,以及直接访问我们在考虑其许可状态、隐私风险和与原始保管方的协议后能够分发的223个(共498个)组成部分。最后,由于我们了解到未来对BLOOM模型的研究可能需要对整个语料库进行完整访问,我们还邀请有相关研究项目的研究人员通过注册表格参与到分析数据的持续努力中。

3.1.2、Data Sources数据来源

Given a strategy for data governance, the next step was to determine the composition of the training corpus. This stage was driven by several goals, which sometimes had inherent tensions. Some of those tensions included building a language model that was accessible to as many people as possible around the world while only including languages for which we had enough expertise to curate a dataset of comparable scale (and to a lesser extent composition) to previous efforts while improving the standards of documentation and respect for data and algorithm subject rights.

在确定数据治理策略之后,下一步是确定训练语料库的组成。这个阶段受到几个目标的推动,这些目标之间有时存在固有的紧张关系。其中一个紧张关系是:既要构建一个让世界各地尽可能多的人都能使用的语言模型,又只能纳入我们有足够专业知识进行策划的语言,使数据集在规模上(以及在较小程度上在构成上)可与先前工作相当,同时还要提高文档标准并尊重数据主体和算法主体的权利。

Language Choices 语言选择

These considerations led us to an incremental process for choosing which languages were to be included in the corpus. We started with a list of eight of the world’s largest languages by number of speakers for which we did active outreach in the early stages of the project to invite fluent speakers to join the data efforts. Then, on the recommendation of language communities (Nekoto et al., 2020) we expanded Swahili in the original selection to the category of Niger-Congo languages, and Hindi and Urdu to Indic languages (Kunchukuttan et al., 2020). Finally, we proposed that any group of 3 or more participants fluent in an additional language could add it to the supported list if they would commit to selecting sources and guiding processing choices in the language in order to avoid common issues with corpora selected through automatic language identification without specific language expertise (Caswell et al., 2022).

这些考虑导致我们采取渐进的过程来选择语料库中应包含的语言。我们从世界上使用人数最多的八种语言列表开始,在项目的早期阶段积极与流利的使用者联系,邀请他们加入数据工作。然后,根据语言社区的建议(Nekoto等人,2020),我们将最初选择中的斯瓦希里语扩展为尼日尔-刚果语系语言,将印地语和乌尔都语扩展为印度语系语言(Kunchukuttan等人,2020)。最后,我们提议,任何由3名或以上精通某种额外语言的参与者组成的小组,只要承诺为该语言挑选数据源并指导处理决策,以避免在缺乏特定语言专业知识的情况下通过自动语言识别选取语料所带来的常见问题(Caswell等人,2022),就可以将该语言加入支持列表。

源选择Source Selection

The biggest part of the corpus was curated by workshop participants and research collectives who collectively compiled the “BigScience Catalogue”:  a large list of processed and non-processed sources covering a wide range of languages. This took the form of hackathons that were co-organized by communities such as Machine Learning Tokyo, Masakhane, and LatinX in AI (McMillan-Major  et  al.,  2022).  Complementary  to those efforts, other working group participants compiled language-specific resources such as the Arabic-focused Masader repository (Alyafeai et al., 2021; Altaher et al., 2022). A total of 252 sources were identified through this bottom-up approach, with at least 21 sources per language category. Additionally, in order to increase the geographic coverage of some of our Spanish, Chinese, French, and English sources, participants identified locally relevant websites in their language to be added to the corpus via pseudocrawl, a method to obtain those websites from a Common Crawl snapshot.

语料库中最大的部分由研讨会参与者和研究集体策划,他们共同编制了“BigScience Catalogue”:一个涵盖多种语言、包含已处理和未处理数据源的大型列表。这项工作采取了由Machine Learning Tokyo、Masakhane和LatinX in AI等社区共同组织的黑客马拉松的形式(McMillan-Major等人,2022)。作为这些努力的补充,其他工作组参与者还编制了特定语言的资源,例如以阿拉伯语为重点的Masader存储库(Alyafeai等人,2021;Altaher等人,2022)。通过这种自下而上的方法,共识别出了252个数据源,每个语言类别至少有21个源。此外,为了增加我们西班牙语、中文、法语和英语来源的地理覆盖范围,参与者们还确定了在他们的语言中具有地方相关性的网站,通过伪爬取(pseudocrawl,即从Common Crawl快照中获取这些网站的方法)将其添加到语料库中。

GitHub Code

The catalogue was further complemented with a dataset of programming languages collected from the GitHub data collection on Google’s BigQuery, which was then deduplicated of exact matches. The choice of languages to include mirrored the design choices introduced by Li et al. (2022) to train the AlphaCode model.

目录还补充了从Google的BigQuery的GitHub数据收集中收集的编程语言数据集,然后对完全匹配的内容进行了去重。包含的语言选择反映了Li等人(2022)在训练AlphaCode模型中引入的设计选择。
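
下面是一个按内容哈希做精确去重的极简示意(纯说明性质,并非论文实际使用的去重流程;假设文档内容已读入内存):

```python
import hashlib

def exact_dedup(documents):
    """按 SHA-256 内容哈希去除完全重复的文档(仅作示意)。"""
    seen, unique_docs = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

files = ["def f(x): return x", "print('hi')", "def f(x): return x"]
print(len(exact_dedup(files)))  # 输出: 2
```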

OSCAR

Both in an effort not to diverge from the standard research practice of using the Web as a source of pretraining data (Radford et al., 2018; Raffel et al., 2020), and also to satisfy the data volume needs of our compute budget given the size of BLOOM, we further sourced data from OSCAR version 21.09, corresponding to the February 2021 snapshot of the Common Crawl (Ortiz Suárez et al., 2019; Abadji et al., 2021), which ended up constituting 38% of the corpus.

为了不偏离使用Web作为预训练数据源的标准研究实践(Radford等人,2018;Raffel等人,2020),同时考虑到BLOOM的规模,为满足我们计算预算对数据量的需求,我们还从OSCAR 21.09版本(对应于Common Crawl的2021年2月快照;Ortiz Suárez等人,2019;Abadji等人,2021)中获取了数据,这些数据最终占语料库的38%。

3.1.3、Data Preprocessing数据预处理

After the sources had been identified, data processing involved several steps to handle multiple aspects of data curation. An overarching view of the processing pipeline to build ROOTS can be seen in Figure 2. All tools developed in the process are available on GitHub.

在确定了数据来源之后,数据处理涉及多个步骤来处理数据策划的多个方面。ROOTS构建的整体视图和处理流程如图2所示。整个过程中开发的所有工具都可在GitHub上获得。

获取源数据Obtaining the Source Data

The first step involved obtaining the data for all of the text data sources identified in Section 3.1.2, which consisted of a combination of downloading and extracting the text field from a variety of NLP datasets in various formats (including e.g. question answering, summarization, or dialogue datasets), scraping and processing large amounts of PDF files from archives (e.g. the French repository of scientific articles), and extracting and preprocessing text from 192 website entries from the catalogue and another geographically diverse set of 456 websites selected by data working group members. The latter required the development of new tools to extract text from the HTML in the Common Crawl WARC files, which we made available on the main data preparation repository. We were able to find and extract usable text data from all URLs present in 539 of the websites.

第一步是获取在3.1.2节中确定的所有文本数据来源的数据,其中包括从各种格式的自然语言处理数据集中下载和提取文本字段(包括问答、摘要或对话数据集),从档案中爬取和处理大量PDF文件(例如法国科学文章存储库),以及从目录中的192个网站条目和数据工作组成员选择的另一个地理分布广泛的456个网站中提取和预处理文本。后者需要开发新的工具,以从Common Crawl的WARC文件中提取HTML中的文本,并在主要的数据准备存储库中提供这些工具。我们能够从539个网站中的所有URL中找到并提取可用的文本数据。
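
下面是一个从 Common Crawl WARC 文件中读取网页记录的极简示意(假设使用 warcio 库,文件名为占位;正文抽取逻辑这里仅用粗糙的正则代替,并非BigScience实际使用的工具):

```python
import re
from warcio.archiveiterator import ArchiveIterator

def iter_page_texts(warc_path):
    """遍历 WARC 响应记录,粗略剥离HTML标签后产出 (URL, 文本)。"""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            text = re.sub(r"<[^>]+>", " ", html)  # 仅作示意的粗糙去标签
            yield url, re.sub(r"\s+", " ", text).strip()

for url, text in iter_page_texts("example.warc.gz"):  # 文件名为占位
    print(url, text[:80])
```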

Figure 2: Creation Pipeline of the ROOTS Corpus. The purple-colored sourcing stage of the pipeline and the yellow-colored processing stage are described respectively in Section 3.1.2 and Section 3.1.3.

图2:ROOTS语料库的创建流程。管道的紫色采集阶段和黄色处理阶段分别在第3.1.2节和第3.1.3节中进行了描述。

"质量"过滤:人类撰写的自然语言文本“Quality” filtering: Text Produced by Humans for Humans

After obtaining the text, we found that most of the sources contained some amount of text that was not natural language, for example preprocessing errors, SEO pages, or spam (including pornographic spam). In order to filter non-natural language, we defined a set of quality indicators, where high-quality text is defined as “written by humans for humans”, without distinction of content (as we wanted content selection to exclusively be the domain of the more accountable human source selection) or a priori judgments of grammaticality. The full list of indicators are described in (Laurençon et al., 2022). Importantly, the indicators were adapted to the needs of each of the sources in two main ways. First, their parameters such as the thresholds and supporting term lists were selected individually for each language by fluent speakers. Second, we manually went through each individual source to identify which indicators were most likely to identify non-natural language. Both processes were supported by tools to visualize their impact.

在获取了文本之后,我们发现大多数来源中都包含一些非自然语言的文本,例如预处理错误、SEO页面或垃圾邮件(包括色情垃圾邮件)。为了过滤掉非自然语言,我们定义了一组质量指标,其中高质量文本被定义为“由人类为人类撰写”,不区分内容(因为我们希望内容选择完全由更可靠的人类源选择进行)或先验语法判断。完整的指标列表在(Laurençon等人,2022)中有描述。重要的是,这些指标根据每个来源的需求以两种主要方式进行了适应。首先,它们的参数(如阈值和支持术语列表)由流利的使用者为每种语言分别选择。其次,我们手动审查了每个单独的来源,以确定最有可能识别非自然语言的指标。这两个过程都使用工具来可视化其影响。
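
下面给出两个说明性的“质量”指标示意(阈值、停用词表均为假设,并非论文中实际使用的指标集合):字符片段重复率过高或停用词比例过低的文档,更可能不是“由人类为人类撰写”的自然语言。

```python
def repetition_ratio(text, n=10):
    """重复的长度为 n 的字符片段所占比例(示意性指标)。"""
    if len(text) <= n:
        return 0.0
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return 1.0 - len(set(grams)) / len(grams)

def looks_human_written(text, stopwords, max_rep=0.3, min_stop=0.05):
    """字符重复率不太高、且包含一定比例常见停用词的文本,更可能是自然语言。"""
    words = text.lower().split()
    stop_ratio = sum(w in stopwords for w in words) / max(len(words), 1)
    return repetition_ratio(text) <= max_rep and stop_ratio >= min_stop

en_stopwords = {"the", "a", "an", "and", "of", "to", "in", "is"}  # 假设的简化停用词表
print(looks_human_written("the cat sat on the mat and looked at the dog", en_stopwords))  # True
print(looks_human_written("buy buy buy buy buy buy buy buy buy buy", en_stopwords))       # False
```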

Figure 3: Graphical overview of the ROOTS corpus. Left: A treemap plot of the language families of all 46 natural languages where surface is proportional to the number of bytes. Indo-European and Sino-Tibetan families overwhelm the plot with a combined total of 1321.89 GB. The thin orange surface represents 18GB of Indonesian data and the green rectangle 0.4GB constituting the Niger-Congo language family subset. Right: A waffle plot of the distribution of the 13 programming languages by size, where one square represents approximately 200MB.

图3:ROOTS语料库的图形概览。左侧:以树状图绘制的所有46种自然语言的语系,表面积与字节数成比例。印欧语系和汉藏语系占据了主导地位,总共1321.89 GB。橙色表面表示18GB的印尼数据,绿色矩形表示尼日尔-刚果语系子集的0.4GB。右侧:以格子图绘制的13种编程语言的大小分布,其中一个方格表示约200MB。

Deduplication and Privacy Redaction去重和隐私删除

Finally, we removed near-duplicate documents with two deduplication steps and redacted Personal Identifiable Information (such as social security numbers) that we could identify from the OSCAR version of the corpus—as it was deemed to be the source that presented the highest privacy risks, prompting us to apply regex-based redaction even in cases where the expressions had some false positives.

最后,我们通过两个去重步骤删除了近似重复的文档,并从OSCAR版本的语料库中删除了个人可识别信息(例如社会保障号码),因为它被认为是存在最高隐私风险的来源,所以我们甚至在表达式存在一些误报的情况下,也应用基于正则表达式的删除。
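
下面是一个基于正则表达式遮蔽个人可识别信息的极简示意(以美国社会保障号码格式为例,模式与占位符均为假设,并非项目实际使用的规则):

```python
import re

# 示意性的正则规则(匹配 XXX-XX-XXXX 格式),并非项目实际使用的规则集
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_pii(text, placeholder="[REDACTED]"):
    """用占位符替换疑似个人可识别信息;宁可多遮蔽(允许一定误报)也不遗漏。"""
    return SSN_PATTERN.sub(placeholder, text)

print(redact_pii("Contact: 123-45-6789, office 42."))
# 输出: Contact: [REDACTED], office 42.
```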

3.1.4、Prompted Datasets提示数据集

Multitask prompted finetuning (also referred to as instruction tuning) involves finetuning a pretrained language model on a training mixture composed of a large set of different tasks specified through natural language prompts. T0 (Sanh et al., 2022) (developed as part of BigScience) demonstrated that language models finetuned on a multitask mixture of prompted datasets have strong zero-shot task generalization abilities. Moreover, T0 was shown to outperform language models that are an order of magnitude larger but did not undergo such finetuning. Motivated by these results, we explored using existing natural language datasets to carry out multitask prompted finetuning.

T0 was trained on a subset of the Public Pool of Prompts (P3), a collection of prompts for various existing and open-source English natural language datasets. This collection of prompts was created through a series of hackathons involving BigScience collaborators and where hackathon participants wrote a total of 2000+ prompts for 170+ datasets. Datasets in P3 cover a variety of natural language tasks including sentiment analysis, question answering, and natural language inference and exclude harmful content or non-natural language such as programming languages. PromptSource (Bach et al., 2022), an open-source toolkit (also developed as part of BigScience) facilitated creating, sharing and using natural language prompts. Full details of the collection process are given in (Sanh et al., 2022; Bach et al., 2022).

多任务提示微调(也称为指令微调)是指在由大量不同任务组成、并通过自然语言提示加以指定的训练混合集上对预训练语言模型进行微调。T0(Sanh等人,2022)(作为BigScience的一部分开发)证明了在多任务提示数据集混合上微调的语言模型具有强大的零样本任务泛化能力。此外,T0的表现优于规模大一个数量级但未经过这种微调的语言模型。受这些结果的启发,我们探索了使用现有的自然语言数据集进行多任务提示微调。

T0是在公共提示池(P3)的一个子集上训练的,P3是针对各种现有开源英语自然语言数据集的提示集合。这些提示是通过一系列有BigScience合作者参与的黑客马拉松活动创建的,黑客马拉松参与者共为170多个数据集撰写了2000多个提示。P3中的数据集涵盖了各种自然语言任务,包括情感分析、问答和自然语言推理,并排除了有害内容或非自然语言(如编程语言)。PromptSource(Bach等人,2022)是一个开源工具包(同样作为BigScience的一部分开发),用于创建、共享和使用自然语言提示。有关收集过程的详细信息可参见(Sanh等人,2022;Bach等人,2022)。

After pretraining BLOOM, we applied the same massively multitask finetuning recipe to equip BLOOM with multilingual zero-shot task generalization abilities. We refer to the resulting models as BLOOMZ. To train BLOOMZ, we extended P3 to include new datasets in languages other than English and new tasks, such as translation. This resulted in xP3, a collection of prompts for 83 datasets covering 46 languages and 16 tasks. As highlighted in Figure 4, xP3 mirrors the language distribution of ROOTS. Tasks in xP3 are both cross-lingual (e.g. translation) and monolingual (e.g. summarization, question answering). We used PromptSource to collect these prompts, adding additional metadata to the prompts, such as input and target languages. To study the importance of multilingual prompts, we also machine-translated English prompts in xP3 to the respective dataset languages to produce a collection called xP3mt. Further details on the prompt collection for xP3 and xP3mt are given in Muennighoff et al. (2022b).

在对BLOOM进行预训练之后,我们使用相同的大规模多任务微调方法,使BLOOM具备多语言零样本任务泛化能力。我们将结果模型称为BLOOMZ。为了训练BLOOMZ,我们扩展了P3,包括了英语以外的新语言和新任务(如翻译)。这导致了xP3,一个包含83个数据集的提示集合,涵盖了46种语言和16个任务。如图4所示,xP3反映了ROOTS的语言分布。xP3中的任务既包括跨语言任务(如翻译),也包括单语言任务(如摘要、问答)。我们使用PromptSource收集这些提示,并为提示添加了额外的元数据,如输入和目标语言。为了研究多语言提示的重要性,我们还将xP3中的英语提示机器翻译成各自的数据集语言,生成了一个名为xP3mt的集合。有关xP3和xP3mt的提示收集的进一步详细信息,请参见Muennighoff等人(2022b)。
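
下面用一个极简示意说明此类提示数据集的构造思路:把一个翻译样本套入自然语言提示模板,得到(输入,目标)文本对。模板与字段名均为假设,并非 PromptSource/xP3 的实际格式。

```python
# 示意:把一个翻译样本套入自然语言提示模板,得到 (输入, 目标) 文本对
def apply_prompt(example, input_template, target_template):
    return input_template.format(**example), target_template.format(**example)

example = {
    "source": "Je bois du café.",
    "target": "I drink coffee.",
    "src_lang": "French",
    "tgt_lang": "English",
}
prompt_input, prompt_target = apply_prompt(
    example,
    "Translate the following {src_lang} sentence to {tgt_lang}: {source}",
    "{target}",
)
print(prompt_input)   # 模型输入
print(prompt_target)  # 期望输出
```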

3.2、Model Architecture模型架构

This section discusses our design methodology and the architecture of the BLOOM model. In-depth studies and experiments can be found in Le Scao et al. (2022) and Wang et al. (2022a). We first review our design methodology, then motivate our choice of training a causal decoder-only model. Finally, we justify the ways that our model architecture deviates from standard practice.

本节讨论了我们的设计方法和BLOOM模型的架构。关于深入的研究和实验可以参考Le Scao等人(2022)和Wang等人(2022a)。我们首先回顾了我们的设计方法,然后解释了选择训练因果解码器模型的动机。最后,我们对我们的模型架构与标准实践的偏离进行了合理的解释。

3.2.1、Design Methodology设计方法

The design space of possible architectures is immense, making exhaustive exploration impossible. One option would be to exactly replicate the architecture of an existing large language model. On the other hand, a great deal of work on improving existing architectures has seen relatively little adoption (Narang et al., 2021); adopting some of these recommended practices could yield a significantly better model. We take a middle ground and focus on model families that have been shown to scale well, and that have reasonable support in publicly available tools and codebases. We ablate components and hyperparameters of the models, seeking to make the best use of our final compute budget.

可能的架构设计空间非常庞大,使得详尽地进行探索变得不可能。一种选择是完全复制现有大型语言模型的架构。另一方面,对改进现有架构的工作虽然做了大量研究,但在实际应用中采用的相对较少(Narang等人,2021)。采用其中一些推荐的实践方法可能会产生更好的模型。我们采取了中间的立场,专注于已经证明在规模上可扩展且在公开可用的工具和代码库中得到合理支持的模型系列。我们对模型的组成部分和超参数进行了消融实验,力求充分利用最终的计算预算。

消融实验的实验设计Experimental Design for Ablations

One of the main draws of LLMs has been their ability to perform tasks in a “zero/few-shot” way: large enough models can perform novel tasks simply from in-context instructions and examples (Radford et al., 2019), without dedicated training on supervised samples. Accordingly, and because finetuning a 100B+ model is unwieldy, we focused our evaluation of architectural decisions on zero-shot generalization, and do not consider transfer learning. Specifically, we measured zero-shot performance on diverse aggregates of tasks: 29 tasks from the EleutherAI Language Model Evaluation Harness (EAI-Eval, Gao et al. (2021)), and 9 tasks from the evaluation set of T0 (T0-Eval, Sanh et al. (2022)). There is significant overlap between the two: only one task from T0-Eval (StoryCloze) is not in EAI-Eval, although all prompts between the two are different. See Le Scao et al. (2022) for a detailed list of tasks and baselines. We also note that our tasks aggregates share 17 of the 31 tasks of the evaluation of GPT-3 (Brown et al., 2020).

We conducted our ablation experiments using smaller models. We used the 6.7B parameter scale for the pretraining objective ablations (Wang et al., 2022a) and the 1.3B scale for the rest including position embeddings, activations, and layer normalization (Le Scao et al., 2022). Recently, Dettmers et al. (2022) identified a phase transition for models larger than 6.7B, in which the emergence of “outlier features” is observed. This questions whether results obtained at the 1.3B scale should be assumed to extrapolate to our final model size.

LLM的主要吸引力之一是其以“零/少样本”方式执行任务的能力:足够大的模型可以仅通过上下文指令和示例执行新任务(Radford等人,2019),而无需在监督样本上进行专门训练。因此,由于微调一个超过100B的模型难以操作,我们将架构决策的评估重点放在了零样本泛化上,并不考虑迁移学习。具体来说,我们对各种任务的零样本性能进行了测量:来自EleutherAI Language Model Evaluation Harness(EAI-Eval,Gao等人,2021)的29个任务以及来自T0评估集(T0-Eval,Sanh等人,2022)的9个任务。这两个集合之间存在重叠:只有T0-Eval中的一个任务(StoryCloze)不在EAI-Eval中,尽管两者之间的提示都不同。有关任务和基线的详细列表,请参阅Le Scao等人(2022)。我们还注意到,我们的任务集合与GPT-3(Brown等人,2020)的评估中的31个任务中的17个任务重叠。

我们使用较小的模型进行消融实验。我们在预训练目标的消融实验中使用了6.7B参数规模(Wang等人,2022a),其余部分(包括位置嵌入、激活函数和层归一化)则使用1.3B规模(Le Scao等人,2022)。最近,Dettmers等人(2022)发现大于6.7B的模型存在一个相变,会出现“离群特征”(outlier features)。这引发了一个问题:在1.3B规模下得到的结果是否可以外推到我们最终的模型规模。

不在范围内的架构Out-of-scope Architectures

We did not consider mixture-of-experts (MoE) (Shazeer et al., 2017), due to a lack of widely used GPU-based codebases suitable for training them at scale. Similarly, we also did not consider state-space models (Gu et al., 2020). At the time of the design of BLOOM, they consistently underperformed in natural language tasks (Gu et al., 2021). Both of these approaches are promising, and have now demonstrated competitive results–at large scales for MoE (Fedus et al., 2022; Srivastava et al., 2022), and at smaller scale for state-space models with H3 (Fu et al., 2023).

由于缺乏适用于大规模训练的广泛使用的基于GPU的代码库,我们没有考虑专家混合模型(MoE)(Shazeer等人,2017)。同样,我们也没有考虑状态空间模型(Gu等人,2020)。在设计BLOOM时,它们在自然语言任务中一直表现不佳(Gu等人,2021)。这两种方法都很有前景,目前已经展示了具有竞争力的结果:MoE在大规模上(Fedus等人,2022;Srivastava等人,2022),状态空间模型在较小规模上,如H3(Fu等人,2023)。

3.2.2、Architecture and Pretraining Objective架构和预训练目标

Although most modern language models are based on the Transformer architecture, there are significant deviations between architectural implementations. Notably, while the original Transformer is based on an encoder-decoder architecture, many popular models have opted for encoder-only (e.g. BERT, (Devlin et al., 2019)) or decoder-only (e.g. GPT, (Radford et al., 2018)) approaches. Currently, all state-of-the-art language models over 100 billion parameters are causal decoder-only models (Brown et al., 2020; Rae et al., 2021; Chowdhery et al., 2022). This is in opposition to the findings of Raffel et al. (2020), in which encoder-decoder models significantly outperform decoder-only models for transfer learning.

Prior to our work, the literature was lacking a systematic evaluation of the zero-shot generalization capabilities of different architectures and pretraining objectives. We explored this question in Wang et al. (2022a) where we evaluated encoder-decoder and decoder-only architectures and their interactions with causal, prefix, and masked language modeling pretraining objectives. Our results show that immediately after pretraining, causal decoder-only models performed best – validating the choice of state-of-the-art LLMs. Furthermore, they can be more efficiently adapted after pretraining to a non-causal architecture and objective – an approach which has been further explored and confirmed by Tay et al. (2022).

尽管大多数现代语言模型都基于Transformer架构,但不同的实现之间存在显著差异。值得注意的是,虽然最初的Transformer基于编码器-解码器架构,但许多流行模型选择了仅编码器(如BERT,Devlin等人,2019)或仅解码器(如GPT,Radford等人,2018)的方法。目前,超过1000亿参数的最先进语言模型都是因果解码器模型(Brown等人,2020;Rae等人,2021;Chowdhery等人,2022)。这与Raffel等人(2020)的发现相反,在该发现中,编码器-解码器模型在迁移学习中明显优于仅解码器模型。

在我们的工作之前,文献中缺乏对不同架构和预训练目标的零样本泛化能力的系统评估。我们在Wang等人(2022a)中探索了这个问题,评估了编码器-解码器和仅解码器架构,以及它们与因果、前缀和掩码语言建模预训练目标之间的相互作用。我们的结果显示,预训练结束后,因果仅解码器模型立刻表现最佳,印证了最先进LLM的架构选择。此外,它们在预训练后可以更高效地适配为非因果架构和目标,这一方法已经得到了Tay等人(2022)的进一步探索和证实。

3.2.3、Modeling Details建模细节

Beyond choosing an architecture and pretraining objective, a number of changes to the original Transformer architecture have been proposed. For example, alternative positional embedding schemes (Su et al., 2021; Press et al., 2021) or novel activation functions (Shazeer, 2020). We thus performed a series of experiments to evaluate the benefit of each of these modifications for a causal decoder-only model in Le Scao et al. (2022). We adopted two architectural  deviations  in  BLOOM:

除了选择架构和预训练目标之外,还有许多针对原始Transformer架构的改动被提出,例如替代的位置嵌入方案(Su等人,2021;Press等人,2021)或新颖的激活函数(Shazeer,2020)。因此,我们在Le Scao等人(2022)中进行了一系列实验,评估每项修改对因果仅解码器模型的益处。最终我们在BLOOM中采用了两处架构上的偏离。

ALiBi位置嵌入ALiBi Positional Embeddings

Instead of adding positional information to the embedding layer, ALiBi directly attenuates the attention scores based on how far away the keys and queries are (Press et al., 2021). Although ALiBi was initially motivated by its ability to extrapolate to longer sequences, we found it also led to smoother training and better downstream performance even at the original sequence length – outperforming both learned (Vaswani et al., 2017) and rotary (Su et al., 2021) embeddings.

ALiBi直接通过衰减注意力分数来考虑键和查询之间的距离(Press等人,2021),而不是将位置信息添加到嵌入层。尽管ALiBi最初是基于其在更长序列上的推广能力而提出的,但我们发现它在原始序列长度下也导致了更平滑的训练和更好的下游性能,优于学习的(Vaswani等人,2017)和旋转的(Su等人,2021)嵌入。
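
下面是ALiBi偏置的一个简化示意实现(假设使用 PyTorch;斜率取 2^(-8/n_heads) 的几何序列,这是头数为2的幂时的常见设定,并非BLOOM官方代码):

```python
import torch

def alibi_slopes(n_heads):
    """每个注意力头的斜率:以 2^(-8/n_heads) 为首项和公比的几何序列。"""
    start = 2 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    """返回直接加到注意力分数上的偏置,随 query-key 距离线性衰减。"""
    slopes = alibi_slopes(n_heads).view(n_heads, 1, 1)
    pos = torch.arange(seq_len)
    # 元素 [i, j] = j - i;未来位置(j > i)截断为0,这些位置本来就会被因果掩码屏蔽
    distance = (pos[None, :] - pos[:, None]).clamp(max=0)
    return slopes * distance  # 形状 (n_heads, seq_len, seq_len)

# 用法示意:attention_scores = q @ k.transpose(-1, -2) / d_head**0.5 + alibi_bias(n_heads, T)
print(alibi_bias(n_heads=4, seq_len=5)[0])
```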

嵌入层归一化Embedding LayerNorm

In preliminary experiments training a 104B parameters model, we experimented with an additional layer normalization immediately after the embedding layer – as recommended by the bitsandbytes library (Dettmers et al., 2022) with its StableEmbedding layer. We found this significantly improved training stability. Even though we also found it penalizes zero-shot generalization in Le Scao et al. (2022), we train BLOOM with an additional layer normalization after the first embedding layer to avoid training instabilities. Note the preliminary 104B experiments were conducted in float16, while the final training was in bfloat16. Since then, float16 has been attributed as being responsible for many of the observed instabilities in training LLMs (Zhang et al., 2022; Zeng et al., 2022). It is possible that bfloat16 alleviates the need for the embedding LayerNorm.

在训练104B参数模型的初步实验中,我们尝试在嵌入层之后立即添加额外的层归一化,正如bitsandbytes库(Dettmers等人,2022)的StableEmbedding层所推荐的那样。我们发现这显著改善了训练的稳定性。尽管我们在Le Scao等人(2022)中也发现它会损害零样本泛化能力,但为了避免训练不稳定,我们仍在第一个嵌入层之后使用额外的层归一化来训练BLOOM。请注意,初步的104B实验是在float16精度下进行的,而最终训练使用的是bfloat16。此后,float16被认为是造成LLM训练中许多已观察到的不稳定现象的原因(Zhang等人,2022;Zeng等人,2022)。因此,bfloat16可能减轻了对嵌入层归一化的需求。
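
下面是“嵌入层后紧跟LayerNorm”这一改动的最小PyTorch示意(词表大小取论文中的250,680,隐藏维度仅为举例,并非BLOOM官方实现):

```python
import torch
import torch.nn as nn

class EmbeddingWithLayerNorm(nn.Module):
    """词嵌入后紧跟 LayerNorm 的简化示意(思路类似 StableEmbedding,非官方实现)。"""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, input_ids):
        return self.norm(self.embed(input_ids))

layer = EmbeddingWithLayerNorm(vocab_size=250_680, hidden_size=1024)  # 隐藏维度仅为举例
print(layer(torch.tensor([[1, 2, 3]])).shape)  # torch.Size([1, 3, 1024])
```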

We represent the full architecture of BLOOM in figure 5 for reference.

我们在图5中展示了BLOOM的完整架构以供参考。

3.3、Tokenization标记化

The design decisions when training a tokenizer are often neglected in favour of “default” settings (Mielke et al., 2021). For instance, OPT (Zhang et al., 2022) and GPT-3 (Brown et al., 2020) both use GPT-2’s tokenizer, trained for English. This can be justified by the fact that evaluating the impact of a particular choice on the downstream performance of the model is constrained by the large computational costs of training. However, the diverse nature of BLOOM’s training data requires careful design choices to ensure that the tokenizer encodes sentences in a lossless manner.

在训练标记器时,设计决策常常被忽视,而倾向于使用“默认”设置(Mielke等人,2021)。例如,OPT(Zhang等人,2022)和GPT-3(Brown等人,2020)都使用了针对英语训练的GPT-2标记器。这种做法的理由在于:评估某个特定选择对模型下游性能的影响会受到训练所需巨大计算成本的制约。然而,BLOOM训练数据的多样性要求我们仔细进行设计选择,以确保标记器以无损方式对句子进行编码。

验证Validation

We use the fertility (Ács, 2019) of our tokenizer compared to existing monolingual tokenizers as a metric for sanity checks. Fertility is defined as the number of subwords created per word or per dataset by the tokenizer, which we measured using subsets of Universal Dependencies 2.9 (Nivre et al., 2017) and OSCAR (Ortiz Suárez et al., 2019) in the languages of interest. A very high fertility on a language compared to a monolingual tokenizer may indicate a degradation on the downstream multilingual performance of the model (Rust et al., 2021). Our goal was to not degrade the fertility on each language by more than 10 percentage points when comparing our multilingual tokenizer with monolingual tokenizers in corresponding languages. For all experiments, the Hugging Face Tokenizers library (Moi et al., 2019) was used to design and train the tested tokenizers.

我们将我们的标记器相对于现有单语标记器的丰度(fertility)(Ács,2019)作为合理性检查的指标。丰度定义为标记器对每个单词或每个数据集产生的子词数量,我们使用Universal Dependencies 2.9(Nivre等人,2017)和OSCAR(Ortiz Suárez等人,2019)在相关语言上的子集进行测量。与单语标记器相比,某种语言上非常高的丰度可能预示着模型下游多语言性能的下降(Rust等人,2021)。我们的目标是,在将我们的多语言标记器与相应语言的单语标记器进行比较时,每种语言的丰度下降不超过10个百分点。所有实验均使用Hugging Face Tokenizers库(Moi等人,2019)来设计和训练所测试的标记器。
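
下面是计算丰度的一个简化示意(假设使用 Hugging Face transformers,检查点名称仅为举例;这里用空格切词来近似“单词”数,对中文等不以空格分词的语言并不适用,仅作说明):

```python
from transformers import AutoTokenizer

def fertility(tokenizer, sentences):
    """丰度 = 子词总数 / 单词总数(用空格切词近似,仅适用于以空格分词的语言)。"""
    n_words = sum(len(s.split()) for s in sentences)
    n_subwords = sum(len(tokenizer.tokenize(s)) for s in sentences)
    return n_subwords / n_words

sample = ["The quick brown fox jumps over the lazy dog."]
bloom_tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # 较小的BLOOM检查点,标记器相同
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")                    # 作为英语单语基线
print("BLOOM:", fertility(bloom_tok, sample), "GPT-2:", fertility(gpt2_tok, sample))
```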

Table 2: Fertilities obtained on Universal Dependencies treebanks on languages with ex- isting monolingual tokenizers. The monolingual tokenizers we used were the ones from CamemBERT (Martin et al., 2020), GPT-2 (Radford et al., 2019), DeepESP/gpt2-spanish, bert-base-chinese, monsoon-nlp/hindi-bert and Arabic BERT (Safaya et al., 2020), all available on the HuggingFace Hub.

表2:在具有现有单语标记器的Universal Dependencies树库上获得的丰度。我们使用的单语标记器来自CamemBERT(Martin等人,2020)、GPT-2(Radford等人,2019)、DeepESP/gpt2-spanish、bert-base-chinese、monsoon-nlp/hindi-bert和Arabic BERT(Safaya等人,2020),所有这些都可以在HuggingFace Hub上获得。

标记器训练数据Tokenizer Training Data

We initially used a non-deduplicated subset of ROOTS. However, a qualitative study on the vocabulary of the tokenizer revealed issues in its training data. For instance, in earlier versions of the tokenizer, we found entire URLs stored as tokens caused by several documents containing a high number of duplicates. These issues motivated us to remove duplicated lines in the tokenizer training data. We then applied the same sampling ratios per language as for the training data.

我们最初使用了ROOTS的非重复子集。然而,对标记器词汇表的定性研究揭示了训练数据中的问题。例如,在标记器的早期版本中,我们发现整个URL被存储为标记,这是由于多个文档包含大量重复内容导致的。这些问题促使我们删除了标记器训练数据中的重复行。然后,我们对每种语言应用了与训练数据相同的采样比例。

词汇表大小Vocabulary Size

A large vocabulary size reduces the risk of over-segmenting some sentences, especially for low-resource languages. We conducted validation experiments using 150k and 250k vocabulary sizes to make comparisons with existing multilingual modeling literature easier (Conneau et al., 2020; Xue et al., 2021). We ultimately settled for a vocabulary of 250k tokens to reach our initial fertility objective compared to monolingual tokenizers. Since the vocabulary size determines the embedding matrix size, it also had to be divisible by 128 for GPU efficiency reasons and by 4 to be able to use Tensor Parallelism. We used a final size of 250,680 vocabulary items with 200 tokens reserved for possible future applications such as removing private information using placeholder tokens.

较大的词汇表大小降低了对某些句子过度分割的风险,特别是对于资源稀缺的语言。我们进行了验证实验,使用了150k和250k的词汇表大小,以便与现有的多语言建模文献进行比较(Conneau等人,2020;Xue等人,2021)。我们最终选择了一个包含250k个标记的词汇表,以达到与单语标记器相比的初始丰度目标。由于词汇表大小确定了嵌入矩阵的大小,为了提高GPU的效率,它还必须能被128整除,并且必须能被4整除以使用张量并行性。我们使用了最终的词汇表大小为250,680个词汇项,其中有200个标记保留用于可能的未来应用,比如使用占位符标记删除私人信息。

字节级BPE Byte-level BPE

The tokenizer is a learned subword tokenizer trained using the Byte Pair Encoding (BPE) algorithm introduced by Gage (1994). In order not to lose information during tokenization, the tokenizer creates merges starting from bytes as the smallest units instead of characters (Radford et al., 2019). This way, tokenization never results in unknown tokens because all 256 bytes can be contained in the vocabulary of the tokenizer. In addition, Byte-level BPE maximizes vocabulary sharing between languages (Wang et al., 2020).

该标记器是使用由Gage(1994)引入的字节对编码(BPE)算法训练的学习子词标记器。为了在标记化过程中不丢失信息,标记器从字节开始创建合并作为最小单位,而不是字符(Radford等人,2019)。这样,标记化永远不会导致未知标记,因为标记器的词汇表中可以包含所有256个字节。此外,字节级BPE最大化了语言之间的词汇共享(Wang等人,2020)。
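A minimal sketch of such a byte-level BPE setup with the Hugging Face Tokenizers library mentioned earlier (file paths, special tokens and the exact trainer options are illustrative, not BLOOM's actual training script):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE: merges start from the 256 byte values, so any text can be
# encoded without <unk> tokens.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=250_680,                                    # target vocabulary size
    special_tokens=["<s>", "</s>", "<pad>"],               # illustrative special tokens
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # seed with all byte symbols
)
tokenizer.train(files=["tokenizer_corpus_sample.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```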

归一化Normalization

Upstream of the BPE tokenization algorithm, no normalization of the text was performed in order to have the most general model possible. In all cases, we observed that adding unicode normalization such as NFKC did not reduce the fertility by more than 0.8% on all the languages considered, but came at the cost of making the model less general; for example, causing 2² and 22 to be encoded in the same way.

在BPE标记化算法之前，没有对文本进行任何归一化处理，以便获得尽可能通用的模型。在所有情况下，我们观察到添加unicode归一化（如NFKC）在所有考虑的语言上使丰度的降低都不超过0.8%，但代价是降低模型的通用性；例如，导致2²和22以相同的方式进行编码。
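The effect can be reproduced in a couple of lines with Python's standard unicodedata module (illustrative example):

```python
import unicodedata

# NFKC folds compatibility characters: the superscript two becomes a plain
# digit, so "2²" and "22" end up encoded identically after normalization.
print(unicodedata.normalize("NFKC", "2²"))         # -> "22"
print(unicodedata.normalize("NFKC", "ｂｌｏｏｍ"))  # fullwidth letters -> "bloom"
```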

预标记器Pre-tokenizer

Our pre-tokenization has two goals: producing a first division of the text (usually using whitespaces and punctuation) and restricting the maximum length of sequences of tokens produced by the BPE algorithm. The pre-tokenization rule used was the following regex: “”18, which splits words apart while preserving all the characters, and in particular the sequences of spaces and line breaks that are crucial for programming languages. We do not use English-centric splits common in other tokenizers (e.g. splitting around ’nt or ’ll). We also did not use splits on numbers and digits, which caused issues in Arabic and code.

我们的预标记化有两个目标：生成文本的初步切分（通常使用空格和标点符号），并限制BPE算法生成的标记序列的最大长度。我们使用的预标记化规则是以下正则表达式：“”18，它在保留所有字符的同时将单词分开，特别是保留了对编程语言至关重要的空格和换行符序列。我们不使用其他标记器中常见的以英语为中心的分割方式（例如，围绕“’nt”或“’ll”进行分割）。我们也没有按数字和数位进行分割，因为这会在阿拉伯语和代码中引起问题。

3.4、Engineering工程

3.4.1、Hardware硬件

The model was trained on Jean Zay,19 a French government-funded supercomputer owned by GENCI and operated at IDRIS, the national computing center for the French National Center for Scientific Research (CNRS). Training BLOOM took about 3.5 months to complete and consumed 1,082,990 compute hours. Training was conducted on 48 nodes, each having 8 NVIDIA A100 80GB GPUs (a total of 384 GPUs); due to possible hardware failures during training, we also maintained a reserve of 4 spare nodes. The nodes were equipped with 2x AMD EPYC 7543 32-Core CPUs and 512 GB of RAM, while the storage was handled by a mix of full flash and hard disk drives using a SpectrumScale (GPFS) parallel file system shared between all nodes and users of the supercomputer. 4 NVLink GPU-to-GPU interconnects per node enabled intra-node communications while 4 Omni-Path 100 Gbps links per node, arranged in an enhanced hypercube 8D global topology, were used for inter-node communications.

该模型是在Jean Zay上训练的，这是一台由法国政府资助、GENCI所有的超级计算机，由法国国家科学研究中心（CNRS）的国家计算中心IDRIS运营。训练BLOOM花费了约3.5个月的时间，消耗了1,082,990个计算小时。训练是在48个节点上进行的，每个节点配备有8个NVIDIA A100 80GB GPU（总共384个GPU）；由于训练过程中可能发生硬件故障，我们还保留了4个备用节点。节点配备有2个AMD EPYC 7543 32核CPU和512 GB的RAM，存储由使用SpectrumScale（GPFS）并行文件系统的全闪存和硬盘驱动器混合处理，该文件系统在所有节点和超级计算机的用户之间共享。每个节点配备4个NVLink GPU到GPU的互连，用于节点内通信；每个节点还配备4条Omni-Path 100 Gbps链路，按照增强的8D超立方体全局拓扑结构排列，用于节点间通信。

3.4.2、Framework框架

BLOOM was trained using Megatron-DeepSpeed20 (Smith et al., 2022), a framework for large-scale distributed training. It consists of two parts: Megatron-LM21 (Shoeybi et al., 2019) provides the Transformer implementation, tensor parallelism, and data loading primitives, whereas DeepSpeed22 (Rasley et al., 2020) provides the ZeRO optimizer, model pipelining, and general distributed training components. This framework allows us to train efficiently with 3D parallelism (Narayanan et al., 2021, shown in Figure 6), a fusion of three complementary approaches to distributed training. These approaches are described below:

Data parallelism (DP) replicates the model multiple times, with each replica placed on a different device and fed a slice of the data. The processing is done in parallel and all model replicas are synchronized at the end of each training step.

Tensor parallelism (TP) partitions individual layers of the model across multiple devices. This way, instead of having the whole activation or gradient tensor reside on a single GPU, we place shards of this tensor on separate GPUs. This technique is sometimes called horizontal parallelism or intra-layer model parallelism.

Pipeline parallelism (PP) splits up the model’s layers across multiple GPUs, so that only a fraction of the layers of the model are placed on each GPU. This is sometimes called vertical parallelism.

Finally, the Zero Redundancy Optimizer (ZeRO; Rajbhandari et al., 2020) allows different processes to only hold a fraction of the data (parameters, gradients, and optimizer states) required for a training step. We used ZeRO stage 1, meaning that only the optimizer states are sharded in this manner.

BLOOM使用Megatron-DeepSpeed(Smith等人,2022)进行训练,这是一个用于大规模分布式训练的框架。它由两部分组成:Megatron-LM(Shoeybi等人,2019)提供了Transformer实现、张量并行性和数据加载原语,而DeepSpeed(Rasley等人,2020)提供了ZeRO优化器、模型流水线和通用分布式训练组件。这个框架使我们能够高效地进行3D并行训练(Narayanan等人,2021,如图6所示),这是分布式训练的三种互补方法的融合。下面介绍了这些方法:

数据并行(DP)将模型复制多次,每个副本放置在不同的设备上并提供一部分数据。处理过程并行进行,所有模型副本在每个训练步骤结束时进行同步。

张量并行(TP)将模型的各个层分割到多个设备上。这样,不需要将整个激活或梯度张量放置在单个GPU上,而是将该张量的分片放置在不同的GPU上。这种技术有时被称为水平并行或层内模型并行。

流水线并行（PP）将模型的层分割到多个GPU上，因此每个GPU上只放置模型的一部分层。这有时被称为垂直并行。

最后,零冗余优化器(Zero Redundancy Optimizer,ZeRO;Rajbhandari等,2020)使不同的进程仅保存了训练步骤所需的数据(参数、梯度和优化器状态)的一部分。我们使用了ZeRO阶段1,这意味着只有优化器状态以这种方式进行了分片。

Figure 6: DP+PP+TP combination leads to 3D parallelism.

图6:DP+PP+TP组合实现了3D并行。

The four components described above are combined together to allow scaling to hundreds of GPUs with extremely high GPU utilization. We were able to achieve 156 TFLOPs in our fastest configuration with A100 GPUs, attaining our objective of half of the theoretical peak performance of 312 TFLOPs (in float32 or bfloat16).

上述四个组件结合在一起，使训练可以扩展到数百个GPU，并保持极高的GPU利用率。在使用A100 GPU的最快配置下，我们实现了156 TFLOPs的性能，达到了理论峰值性能312 TFLOPs（float32或bfloat16）的一半这一目标。
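To make the arithmetic of the 3D layout concrete, here is a toy sketch of how one decomposition covers the 384 training GPUs (the parallel degrees shown are illustrative values consistent with 4 × 12 × 8 = 384, not necessarily the exact production configuration):

```python
# Toy sketch: a 3D-parallel layout over 48 nodes x 8 GPUs = 384 GPUs.
NODES, GPUS_PER_NODE = 48, 8
TENSOR_PARALLEL = 4      # TP: shards each layer's weight matrices
PIPELINE_PARALLEL = 12   # PP: splits the layer stack into pipeline stages
DATA_PARALLEL = 8        # DP: replicates the resulting model shards

world_size = NODES * GPUS_PER_NODE
assert TENSOR_PARALLEL * PIPELINE_PARALLEL * DATA_PARALLEL == world_size  # 384

def rank_to_coords(rank: int):
    """Map a flat GPU rank to its (data, pipeline, tensor) parallel coordinates."""
    tp = rank % TENSOR_PARALLEL
    pp = (rank // TENSOR_PARALLEL) % PIPELINE_PARALLEL
    dp = rank // (TENSOR_PARALLEL * PIPELINE_PARALLEL)
    return dp, pp, tp

print(rank_to_coords(0), rank_to_coords(383))  # (0, 0, 0) (7, 11, 3)
```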

3.4.3、Floating Point Format浮点数格式

In earlier experiments with 104B-parameter models on NVIDIA V100 GPUs, we observed numerical instabilities that caused irreversible training divergences. We hypothesize that these instabilities stem from our initial use of IEEE float16 — a 16-bit floating point format with a very limited dynamic range that can cause overflows. The NVIDIA A100 GPUs that we ultimately had access to support the bfloat16 format (Wang and Kanwar, 2019; Kalamkar et al., 2019), which has the same dynamic range as float32. On the other hand, bfloat16 still has much lower precision, which motivated our use of mixed-precision training (Micikevicius et al., 2018). This technique performs certain precision-sensitive operations such as gradient accumulation and softmax in float32 precision and the rest of operations in lower precision, allowing us to achieve a balance of high performance and training stability. Ultimately, we performed final training in bfloat16 mixed precision, which proved to solve the instability problem (in line with previous observation by Smith et al., 2022).

在之前使用NVIDIA V100 GPU进行的104B参数模型的实验中,我们观察到了导致不可逆训练发散的数值不稳定性。我们假设这些不稳定性源自我们最初使用的IEEE float16,这是一种具有非常有限动态范围的16位浮点数格式,可能会导致溢出。而我们最终使用的NVIDIA A100 GPU支持bfloat16格式(Wang和Kanwar,2019;Kalamkar等,2019),其具有与float32相同的动态范围。另一方面,bfloat16的精度仍然较低,这促使我们使用混合精度训练(Micikevicius等,2018)。这种技术在float32精度下执行某些对精度敏感的操作,如梯度累积和softmax,而其他操作则使用较低精度,从而在高性能和训练稳定性之间取得平衡。最终,我们在bfloat16混合精度下进行最终训练,这证明解决了不稳定性问题(与Smith等人的先前观察一致,2022年)。
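A minimal PyTorch sketch of this bfloat16 mixed-precision recipe (illustrative; BLOOM's actual loop lives in Megatron-DeepSpeed): matrix multiplies run in bfloat16 under autocast, while the float32 master weights, optimizer state, and loss reduction keep full range and precision.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()                 # float32 master weights
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer state in float32

x = torch.randn(16, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)                                  # matmul runs in bfloat16
loss = y.float().pow(2).mean()                    # precision-sensitive reduction in float32
loss.backward()                                   # gradients stored in float32
optimizer.step()
```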

3.4.4、Fused CUDA Kernels融合的CUDA内核

In general, GPUs cannot retrieve data to perform computations on and perform these computations at the same time. Moreover, the compute performance of modern GPUs is much higher than the speed of memory transfer required for every operation (often called a kernel in GPU programming). Kernel fusion (Wu et al., 2012) is an approach for optimizing GPU-based computations by performing several consecutive operations in only one kernel call. This approach offers a way to minimize data transfers: intermediary results stay in the GPU register instead of being copied into VRAM, saving overhead.

We used several custom fused CUDA kernels provided by Megatron-LM. First, we used an optimized kernel to perform LayerNorm, as well as kernels to fuse various combinations of the scaling, masking, and softmax operations. The addition of a bias term is also fused with the GeLU activation using the JIT functionality of PyTorch. As an example consequence of the use of fused kernels, adding the bias term in the GeLU operation adds no additional time, as the operation is memory-bound: the additional computation is negligible compared to data transfers between GPU VRAM and registers, so fusing both operations essentially halves their runtime.

一般来说,GPU无法同时获取数据进行计算并执行这些计算。此外,现代GPU的计算性能远高于每个操作所需的内存传输速度(通常称为GPU编程中的内核)。内核融合(Wu等,2012年)是一种通过在一个内核调用中执行多个连续操作来优化基于GPU的计算的方法。这种方法可以最小化数据传输:中间结果保存在GPU寄存器中,而不是复制到VRAM中,从而节省开销。

我们使用了Megatron-LM提供的几个自定义融合的CUDA内核。首先,我们使用了一个优化的内核来执行LayerNorm,以及将缩放、掩码和softmax操作融合在一起的内核。还使用了PyTorch的JIT功能将偏置项与GeLU激活融合在一起。作为使用融合内核的一个示例后果,将偏置项添加到GeLU操作中不会增加额外的时间,因为该操作受限于内存:与GPU VRAM和寄存器之间的数据传输相比,额外的计算可以忽略不计,因此融合这两个操作可以将其运行时间减半。
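The bias + GeLU case can be sketched with PyTorch's JIT (illustrative; the production kernels are Megatron-LM's hand-written fused CUDA code): scripting lets the elementwise bias add and the GeLU be fused, so the intermediate tensor never takes an extra round trip through GPU memory.

```python
import torch

@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Exact (erf-based) GeLU applied right after the bias add; TorchScript can
    # fuse these elementwise ops into fewer kernel launches.
    y = x + bias
    return 0.5 * y * (1.0 + torch.erf(y / 1.4142135623730951))  # divide by sqrt(2)

out = fused_bias_gelu(torch.randn(8, 4096), torch.zeros(4096))
```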

3.4.5、Additional Challenges额外挑战

Scaling to 384 GPUs required two final changes: disabling asynchronous CUDA kernel launches (for ease of debugging and to prevent deadlocks) and splitting parameter groups into smaller subgroups (to avoid excessive CPU memory allocations).

During training, we faced issues with hardware failures: on average, 1–2 GPU failures occurred each week. As backup nodes were available and automatically used, and checkpoints were saved every three hours, this did not affect training throughput significantly. A PyTorch deadlock bug in the data loader and disk space issues led to 5–10h downtimes. Given the relative sparsity of engineering issues, and since there was only one loss spike, which the model swiftly recovered from, human intervention was less necessary than in comparable projects (Zhang et al., 2022). Full details of our experience with training BLOOM and a detailed report of all issues we faced are publicly available.23

扩展到384个GPU需要进行两个最终的更改:禁用异步CUDA内核启动(以便于调试和防止死锁),并将参数组分割成较小的子组(以避免过多的CPU内存分配)。

在训练过程中,我们面临硬件故障的问题:平均每周发生1-2次GPU故障。由于备用节点可用并自动使用,并且每三小时保存检查点,这并没有显著影响训练吞吐量。PyTorch中的数据加载器死锁错误和磁盘空间问题导致了5-10小时的停机时间。鉴于工程问题相对较少,而且只有一个损失峰值,模型迅速从中恢复,相比可比项目(Zhang等,2022年),人为干预的必要性较小。我们的BLOOM训练经验的详细信息和我们所面临的所有问题的详细报告都可以公开获取。

3.5、Training训练

预训练模型Pretrained Models

We train six size variants of BLOOM with respective hyperparameters detailed in Table 3. Architecture and training hyperparameters come from our experimental results (Le Scao et al., 2022) and prior work on training large language models (Brown et al., 2020; Kaplan et al., 2020). Model depth and width for the non-176B models roughly follow previous literature (Brown et al., 2020; Zhang et al., 2022), deviating for 3B and 7.1B only in order to fit the models more easily on our training setup. Embedding parameter sizes are larger for BLOOM owing to the larger multilingual vocabulary, but scaling literature discounts embedding operations (Kaplan et al., 2020). During the development process at the 104B parameters scale, we experimented with different values of Adam β parameters, weight decay and gradient clipping to target stability, but did not find it helpful. For all models, we use a cosine learning rate decay schedule (Loshchilov and Hutter, 2016) over 410B tokens, taken as an upper bound for the length of training if compute permitted, and warmup for 375M tokens. We use weight decay, gradient clipping, and no dropout. The ROOTS dataset contains around 341 billion tokens of text, so we aimed to train all models for the equivalent amount of tokens. However, in light of revised scaling laws published during training (Hoffmann et al., 2022), we decided to train the large models for an additional 25 billion tokens on repeated data. As warmup tokens + decay tokens were larger than the total number of tokens, the end of learning rate decay was never reached.

我们训练了六种不同规模的BLOOM变体，其相应的超参数详见表3。模型架构和训练超参数来自我们的实验结果（Le Scao等，2022年）以及关于训练大型语言模型的先前工作（Brown等，2020年；Kaplan等，2020年）。对于非176B模型，模型的深度和宽度大致遵循以前的文献（Brown等，2020年；Zhang等，2022年），3B和7.1B模型有所偏离，仅仅是为了更容易将模型放入我们的训练设置中。由于BLOOM具有更大的多语言词汇表，其嵌入参数的规模更大，但缩放文献不计入嵌入操作（Kaplan等，2020年）。在104B参数规模的开发过程中，我们尝试了不同的Adam β参数、权重衰减和梯度裁剪值以提高稳定性，但并没有发现其有帮助。对于所有模型，我们使用余弦学习率衰减计划（Loshchilov和Hutter，2016年），衰减跨度为410B个标记（如果计算资源允许，这被视为训练长度的上限），并进行375M个标记的预热。我们使用权重衰减、梯度裁剪，不使用dropout。ROOTS数据集包含约3410亿个文本标记，因此我们的目标是让所有模型都训练相当数量的标记。然而，鉴于训练期间发布的修订后的缩放定律（Hoffmann等，2022年），我们决定在重复数据上对大型模型额外训练250亿个标记。由于预热标记+衰减标记大于总标记数，学习率衰减的终点从未达到。
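A sketch of this token-based schedule (the peak and minimum learning rates below are illustrative placeholders; the per-model values are listed in Table 3):

```python
import math

def lr_at(tokens_seen: float,
          max_lr: float = 6e-5,           # illustrative peak LR (see Table 3 for real values)
          min_lr: float = 6e-6,           # illustrative floor
          warmup_tokens: float = 375e6,
          decay_tokens: float = 410e9) -> float:
    """Linear warmup over 375M tokens, then cosine decay over a 410B-token horizon.
    Because warmup + decay exceed the tokens actually trained on, min_lr is never reached."""
    if tokens_seen < warmup_tokens:
        return max_lr * tokens_seen / warmup_tokens
    progress = min((tokens_seen - warmup_tokens) / decay_tokens, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```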

多任务微调Multitask Finetuning

Finetuned BLOOMZ models (Muennighoff et al., 2022b) maintain the same architecture hyperparameters as BLOOM models. The finetuning hyperparameters are loosely based on T0 (Sanh et al., 2022) and FLAN (Wei et al., 2021). Learning rates are determined by doubling the minimum learning rate of the respective pretrained model and then rounding. Global batch sizes are multiplied by four for small variants to increase throughput. While the models are finetuned for 13 billion tokens, the best checkpoint is chosen according to a separate validation set. We found performance to plateau after 1–6 billion tokens of finetuning.

微调的BLOOMZ模型(Muennighoff等,2022b)具有与BLOOM模型相同的架构超参数。微调的超参数基本上是基于T0(Sanh等,2022年)和FLAN(Wei等,2021年)的。学习率是通过将相应预训练模型的最小学习率加倍然后四舍五入确定的。全局批量大小对于小变体而言乘以四以增加吞吐量。虽然模型被微调了130亿个标记,但根据单独的验证集选择了最佳检查点。我们发现在微调10-60亿个标记后性能趋于平稳。

对比微调Contrastive Finetuning

We also perform contrastive finetuning of the 1.3 and 7.1 billion parameter BLOOM models using the SGPT Bi-Encoder recipe (Muennighoff, 2022) to train models that produce high-quality text embeddings. We created SGPT-BLOOM-7.1B-msmarco24 geared towards multilingual information retrieval and SGPT-BLOOM-1.7B-nli25 for multilingual semantic textual similarity (STS). However, recent benchmarking has found these models to also generalize to various other embedding tasks, such as bitext mining, reranking or feature extraction for downstream classification (Muennighoff et al., 2022a).

我们还使用SGPT Bi-Encoder配方（Muennighoff，2022年）对13亿和71亿参数的BLOOM模型进行对比微调，以训练能产生高质量文本嵌入的模型。我们创建了面向多语言信息检索的SGPT-BLOOM-7.1B-msmarco24模型，以及面向多语言语义文本相似性（STS）的SGPT-BLOOM-1.7B-nli25模型。然而，最近的基准测试发现这些模型也适用于各种其他嵌入任务，例如双语文本挖掘、重排序或下游分类的特征提取（Muennighoff等，2022a年）。

3.5.1、Carbon Footprint碳足迹

While most attempts to estimate the carbon footprint of language models have shed light on the emissions produced due to energy consumed during model training (e.g. Patterson et al., 2021; Strubell et al., 2019), other sources of emissions are also important to consider. In our efforts to estimate the carbon emissions of BLOOM, we were inspired by the Life Cycle Assessment (LCA) approach (Klopffer, 1997) and aimed to consider aspects such as the emissions of equipment manufacturing, intermediate model training, and deployment. According to our estimates, the carbon emissions from BLOOM training add up to approximately 81 tons of CO2eq, of which 14% were generated by the equipment manufacturing process (11 tons), 30% by the energy consumed during training (25 tons) and 55% by idle consumption of the equipment and computing cluster used for training (45 tons).

尽管大多数估算语言模型碳足迹的尝试都揭示了模型训练过程中能源消耗产生的排放量(例如Patterson等,2021年;Strubell等,2019年),但还有其他排放来源也很重要。在估计BLOOM的碳排放量时,我们受到了生命周期评估(LCA)方法(Klopffer,1997年)的启发,并致力于考虑设备制造、中间模型训练和部署等方面的排放。根据我们的估计,BLOOM训练所产生的碳排放量总计约为81吨CO2eq,其中设备制造过程产生了14%(11吨)的排放量,训练过程中消耗的能源产生了30%的排放量(25吨),而用于训练的设备和计算集群的空闲消耗产生了55%的排放量(45吨)。

Table 4: Comparison of carbon emissions between BLOOM and similar LLMs. Numbers in italics have been inferred based on data provided in the papers describing the models.

表4:BLOOM和类似LLM之间的碳排放比较。斜体数字根据描述模型的论文中提供的数据推断得出。

Comparing the carbon emissions of BLOOM training to other similar models (see Table 4) reveals that while the energy consumption of BLOOM is slightly higher than OPT (Zhang et al., 2022) (433 MWh compared to OPT's 324 MWh), its emissions are approximately 2/3 less (25 tons versus 70 tons). This is thanks to the low carbon intensity of the energy grid used for training BLOOM, which emits 57 gCO2eq/kWh, compared to 231 gCO2eq/kWh for the grid used for OPT training. Specifically, France's national energy grid (which is used by Jean Zay) is largely powered by nuclear energy, which is low-carbon compared to grids powered by energy sources such as coal and natural gas. While the sustainability of nuclear energy is debated, it is one of the least carbon-intensive sources of energy that is currently available. Both BLOOM and OPT incurred significantly less carbon emissions than GPT-3 (as reported by Patterson et al., 2021), which can be attributed to several factors including more efficient hardware as well as less carbon-intensive energy sources.

将BLOOM训练的碳排放与其他类似模型进行比较（见表4），可以发现，尽管BLOOM的能源消耗略高于OPT（Zhang等，2022年）（433 MWh与OPT的324 MWh相比），但其排放量比OPT少约三分之二（25吨对比70吨）。这要归功于用于训练BLOOM的电网的低碳强度，该电网每千瓦时排放57克CO2eq，而OPT训练所使用的电网每千瓦时排放231克CO2eq。具体而言，法国的国家电网（Jean Zay使用的电网）主要由核能驱动，与煤炭和天然气等能源驱动的电网相比，核能具有较低的碳排放。尽管人们对核能的可持续性存在争议，但它是目前可用的碳强度最低的能源之一。无论是BLOOM还是OPT的碳排放量都明显低于GPT-3（据Patterson等（2021年）报道），这可以归因于更高效的硬件和碳强度较低的能源来源等多种因素。

We also pursued further exploration of the carbon footprint of (1) the computation carried out on Jean Zay within the scope of the BigScience workshop, and (2) running the BLOOM model API in real time. In terms of the footprint of the totality of the computation, we estimate that the final BLOOM training represents approximately 37% of the overall emissions, with other processes such as intermediate training runs and model evaluation adding up to the other 63%. This is slightly less than the estimate made by the authors of the OPT paper, who stated that the total carbon footprint of their model is roughly 2 times higher due to experimentation, baselines and ablation (Zhang et al., 2022). Our ongoing exploration of the carbon emissions of the BLOOM API has estimated that the real-time deployment of the model on a GCP instance with 16 GPUs running in the us-central1 region results in approximately 20 kg of CO2eq emitted per day of deployment (or 0.83 kg per hour). This figure is not representative of all deployment use-cases, and will vary depending on the hardware used as well as the specifics of model implementation (e.g. whether batching is used) and the number of requests the model receives. Further information regarding BLOOM's carbon footprint can be found in Luccioni et al. (2022).

我们还进一步探究了（1）在BigScience研讨会范围内在Jean Zay上进行的计算，以及（2）实时运行BLOOM模型API的碳足迹。就计算的总体碳足迹而言，我们估计最终的BLOOM训练约占总排放量的37%，其他过程（如中间训练和模型评估）占其余的63%。这比OPT论文作者的估计略低，他们表示由于实验、基线和消融实验，其模型的总碳足迹大约是最终训练的两倍（Zhang等，2022年）。我们正在持续探索BLOOM API的碳排放量，估计在us-central1区域运行16个GPU的GCP实例上实时部署模型，每天会排放约20公斤CO2eq（每小时0.83公斤）。这个数字不能代表所有部署用例，它会根据使用的硬件、模型实现的具体细节（例如是否使用批处理）以及模型收到的请求数量而变化。有关BLOOM碳足迹的更多信息可以在Luccioni等人（2022年）的文章中找到。

3.6、Release发布

Openness has been central to the development of BLOOM and we wanted to ensure it is easily available for the community to use. As such, we worked on producing documentation as a Model Card (Mitchell et al., 2019) and a new license addressing specific goals of the project.

开放性一直是BLOOM开发的核心,我们希望确保它方便地供社区使用。因此,我们致力于制作文档,以模型卡(Mitchell等,2019年)的形式提供,并提供了针对项目特定目标的新许可证。

模型卡Model Card

Following best practices for releasing machine learning models, the BLOOM model has been released along with a detailed Model Card26 (Mitchell et al., 2019) describing its technical specifications, details on training, intended-use, out-of-scope uses as well as the model’s limitations. Participants across working groups worked together to produce the final Model Card and similar cards for each checkpoint. The work was collaborative, primarily composed “live” by thinking through and discussing each section, then further dividing into subsections based on the categorizations and distinctions participants naturally ended up creating throughout discussions.

遵循发布机器学习模型的最佳实践,BLOOM模型发布时附带了一份详细的模型卡(Mitchell等,2019年),其中描述了其技术规格、训练细节、预期用途、不适用的用途以及模型的局限性。各个工作组的参与者共同合作制作了最终的模型卡,以及每个检查点的类似卡片。这项工作是协作完成的,首先通过思考和讨论每个部分,然后根据参与者在讨论中自然产生的分类和区分进一步划分为子部分。

许可证Licensing

Being mindful of the potentially harmful use-cases that BLOOM could enable, we chose to strike a balance between unrestricted open-access and responsible-use by including behavioral-use clauses (Contractor et al., 2022) to limit the application of the model towards potentially harmful use-cases. Such clauses are routinely being included in a growing class of “Responsible AI Licenses (RAIL)”27 that the community has been adopting when releasing their models.28 A distinguishing aspect of the RAIL license developed for BLOOM is that it separates licensing of the “source code” and “model”, as referenced by its trained parameters. It further includes detailed definitions of “use” and “derived works” of the model to ensure that anticipated downstream use by prompting, finetuning, distillation, use of logits and probability distributions are explicitly identified. The license contains 13 behavioral-use restrictions that have been identified based on the intended uses and limitations described in the BLOOM Model Card, as well as the BigScience ethical charter. The license offers the model at no charge and users are free to use the model as long as they comply with the terms (including usage restrictions). The source code for BLOOM has been made available under an Apache 2.0 open source license.

考虑到BLOOM可能带来的潜在有害用途，我们选择在无限制开放访问和负责任使用之间取得平衡，通过包含行为使用条款（Contractor等，2022年）来限制模型被应用于潜在有害用途。这样的条款通常被包含在社区发布模型时日益广泛采用的一类“负责任人工智能许可证（RAIL）”中。为BLOOM开发的RAIL许可证的一个区别性特点是将“源代码”和“模型”（即其训练参数）的许可分开。它还包含了有关模型“使用”和“派生作品”的详细定义，以确保明确识别通过提示、微调、蒸馏、使用logits和概率分布等方式预期的下游使用。该许可证包含13条行为使用限制，这些限制是基于BLOOM模型卡中描述的预期用途和限制以及BigScience的伦理宪章确定的。该许可证免费提供模型，只要用户遵守条款（包括使用限制），即可自由使用该模型。BLOOM的源代码已根据Apache 2.0开源许可证公开。

4、Evaluation评估

Our evaluations focus on zero-shot and few-shot settings. Our goal is to present an accurate picture of how BLOOM compares to existing LLMs in settings that most realistically reflect the way the models are likely to be used in practice. Because of the scale of these models, prompt-based adaptation and few-shot “in-context learning” are currently more common than finetuning. Thus, we report results on a range of tasks (SuperGLUE, Section 4.2; machine translation, Section 4.3; summarization, Section 4.4) and languages in zero-shot and one-shot prompt-based settings, as well as after multitask finetuning (Section 4.7). We also perform code generation (Section 4.5), use BLOOM-derived text embeddings for representation tasks (Section 4.8), and interpret BLOOM's generalization abilities from the perspective of multilingual probing (Section 4.9).

我们的评估重点是零样本和少样本设置。我们的目标是以最真实地反映模型在实践中可能使用的方式的设置中,呈现BLOOM与现有LLM相比的准确画面。由于这些模型的规模,基于提示的自适应和少样本的“上下文学习”目前比微调更常见。因此,我们报告了在一系列任务(SuperGLUE 4.2,机器翻译 4.3,摘要 4.4)和语言的零样本和一次提示设置中的结果,以及多任务微调后的结果(第4.7节)。我们还进行了代码生成(第4.5节),使用BLOOM生成的文本嵌入进行表示任务(第4.8节),并从多语言探测的角度解释BLOOM的泛化能力(第4.9节)。

4.1、Experimental Design实验设计

4.1.1、Prompts提示

Based on recent research on the impact of prompting on language model performance, we decided to build a language model evaluation suite that allowed us to vary both the basic task data as well as the prompting that is used to contextualize the task. Our prompts were developed prior to BLOOM’s release, and did not undergo any a priori refinement using models. That is, the prompts we use in our evaluation are ones that humans believed were a reasonable way to solicit the desired task behavior from a language model. Our goal for designing prompts in this way is to simulate realistic zero-shot or one-shot results that a new user could expect from BLOOM. This is in contrast to presenting best-case performances that might result from multiple rounds of trial-and-error on prompt design. We choose to report the former because the latter is harder to reproduce systematically, is arguably a less representative picture of how the model works in the average setting, and is not representative of true zero-shot learning where no labeled data is available.

基于最近关于提示对语言模型性能影响的研究,我们决定构建一个语言模型评估套件,使我们能够在基本任务数据和用于上下文化任务的提示方面进行变化。我们的提示是在BLOOM发布之前开发的,并没有经过任何预先的模型细化。也就是说,我们在评估中使用的提示是人们认为从语言模型中获取所需任务行为的合理方式。我们设计提示的目标是模拟新用户可以从BLOOM中期望到的真实零样本或一样本结果。这与呈现可能是在提示设计的多轮试错过程中产生的最佳结果不同。我们选择报告前者,因为后者在系统地重复时更难以重现,可能在普通设置中呈现的是一个不太代表性的模型工作图景,并且不代表真正的零样本学习,其中没有可用的标记数据。

We generate multiple prompts per task using promptsource (Bach et al., 2022). We follow the procedure used by Sanh et al. (2022), in which prompt generation is crowd- sourced, and thus we see substantial variety in length and style across prompts. To improve quality and clarity, multiple peer reviews were performed on each prompt for artifacts and consistency.

我们使用promptsource(Bach等人,2022)生成每个任务的多个提示。我们采用了Sanh等人(2022)使用的程序,即通过众包方式生成提示,因此提示的长度和风格在不同提示之间存在相当大的差异。为了提高质量和清晰度,对每个提示进行了多次同行评审,以查找存在的问题并保持一致性。

Table 5 shows examples of the resulting prompts used for the WMT'14 task. We also generate prompts for many tasks that are not included in this paper due to resource constraints. All of our prompts for all tasks (both those analyzed in the paper and those not yet analyzed) are publicly available.29

表5显示了用于WMT'14任务的提示示例。我们还为许多未包含在本文中的任务生成了提示,这是由于资源限制。我们所有任务的所有提示(包括论文中分析的和尚未分析的)都是公开可用的。

4.1.2、Infrastructure基础设施

Our framework extends EleutherAI’s Language Model Evaluation Harness (Gao  et  al., 2021) by integrating it with the promptsource (Bach et al., 2022) library described  in Section 3.1.4. We release our Prompted Language Model Evaluation Harness as an open source library for people to use. We use this framework in order to run the experiments and aggregate results.

我们的框架扩展了EleutherAI的语言模型评估工具(Gao等人,2021),并将其与promptsource(Bach等人,2022)库整合在一起,该库在第3.1.4节中进行了描述。我们将我们的提示语言模型评估工具发布为开源库供人们使用。我们使用这个框架来运行实验和汇总结果。

4.1.3、Datasets数据集

SuperGLUE

We use a subset of the SuperGLUE (Wang et al., 2019) evaluation suite of classification tasks, specifically: Ax-b, Ax-g, BoolQ, CB, WiC, WSC, and RTE tasks. We excluded the remaining tasks because they require an order of magnitude more compute to run than all of the tasks we consider combined. These tasks are English-only, and are thus included to facilitate comparison with prior work, which has primarily focused on English-only models. We also note that performance on these tasks has not yet been widely reported using the zero- and one-shot prompt-based setting. T0 (Sanh et al., 2022) is the first exception, but that model is instruction-tuned and thus not directly comparable to models like BLOOM and OPT. For each task, we select a random sample of five prompts from promptsource and evaluate all models on that set of prompts. As with other prompting tasks in Evaluation Harness (Gao et al., 2021), the prediction of a model for a given prompt is measured using the maximum log likelihood among a set of specified candidate label strings associated with the prompt.

我们使用SuperGLUE(Wang等人,2019)分类任务评估套件的一个子集,具体包括:Ax-b、Ax-g、BoolQ、CB、WiC、WSC和RTE任务。我们排除了其余的任务,因为与我们考虑的所有这些任务相比,运行它们所需的计算量要大一个数量级。这些任务仅限于英文,因此包括它们是为了便于与之前的工作进行比较,这些工作主要集中在仅支持英文的模型上。我们还注意到,使用零样本和一样本提示为基础的设置尚未广泛报告这些任务的性能。T0(Sanh等人,2022)是第一个例外,但该模型是经过指令微调的,因此不能直接与BLOOM和OPT等模型进行比较。对于每个任务,我们从promptsource中选择五个随机提示,并在该组提示上评估所有模型。与评估工具(Gao等人,2021)中的其他提示任务一样,对于给定提示的模型预测是通过在与提示相关联的一组指定候选标签字符串中选择最大对数似然来衡量的。
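A minimal sketch of this scoring rule (the model name, prompt and candidate strings below are illustrative; the actual runs go through the Prompted Language Model Evaluation Harness): each candidate label string is scored by the sum of log-probabilities the model assigns to its tokens given the prompt, and the highest-scoring candidate becomes the prediction.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

def loglik(prompt: str, candidate: str) -> float:
    """Sum of log-probabilities of the candidate tokens given the prompt
    (simplified: assumes the prompt tokenization is a prefix of prompt+candidate)."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    total = 0.0
    for pos, tok_id in zip(range(prompt_len - 1, full_ids.shape[1] - 1),
                           full_ids[0, prompt_len:]):
        total += logprobs[pos, tok_id].item()
    return total

prompt = "Passage: The sky is blue.\nQuestion: Is the sky blue?\nAnswer:"
prediction = max([" yes", " no"], key=lambda c: loglik(prompt, c))
```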

Machine Translation (MT) 机器翻译(MT)

We evaluate BLOOM on three datasets (using ISO-639-1 codes to refer to languages): WMT14 en↔fr and en↔hi (Bojar et al., 2014), Flores-101 (Goyal et al., 2022) and DiaBLa (Bawden et al., 2020). We evaluate using the sacrebleu (Post, 2018) implementation of BLEU (Papineni et al., 2002), using default tokenisation for WMT and DiaBLa and spm-flores-101 for Flores.30 We use greedy decoding with generation proceeding until the EOS token, or additionally \n###\n for the 1-shot case. The maximum generation length was set per dataset to be in line with what is typically used in the literature; specifically, 64 tokens for WMT14 and 512 tokens for Flores-101 and DiaBla. Task-specific experimental design details are below.

我们在三个数据集上评估BLOOM（使用ISO-639-1代码指代语言）：WMT14 en↔fr和en↔hi（Bojar等人，2014）、Flores-101（Goyal等人，2022）和DiaBLa（Bawden等人，2020）。我们使用sacrebleu（Post，2018）实现的BLEU（Papineni等人，2002）进行评估，WMT和DiaBLa使用默认标记化，Flores使用spm-flores-101。我们使用贪婪解码，生成持续到出现EOS标记为止，在1-shot情况下还会在\n###\n处停止。每个数据集的最大生成长度根据文献中通常使用的设定值确定，具体为WMT14为64个标记，Flores-101和DiaBLa为512个标记。任务特定的实验设计细节如下。
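A minimal sketch of the scoring call with sacrebleu (the hypothesis/reference strings are illustrative; tokenize="spm" is sacrebleu's Flores SentencePiece tokenizer, available in recent versions):

```python
import sacrebleu

hypotheses = ["Le chat est assis sur le tapis."]
references = [["Le chat est assis sur le tapis rouge."]]

# Default "13a" tokenization, as used here for WMT and DiaBLa.
bleu_default = sacrebleu.corpus_bleu(hypotheses, references)
# SentencePiece-based tokenization, as used for Flores-101.
bleu_spm = sacrebleu.corpus_bleu(hypotheses, references, tokenize="spm")
print(round(bleu_default.score, 1), round(bleu_spm.score, 1))
```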

Summarization 摘要

We evaluate summarization on the WikiLingua (Ladhak et al., 2020) dataset. WikiLingua is a multilingual summarization dataset comprising WikiHow article and step-by-step summary pairs. Pairs are aligned across multiple languages, with translation of source and summary further reviewed by an international translation team. One-shot conditional natural language generation has typically not been reported by models with size comparable to BLOOM. PaLM (Chowdhery et al., 2022) is the first exception, and reports scores on WikiLingua; however, only the model's ability to summarize in English was examined (-> en). By contrast, we opted to test BLOOM's inherent multilingual ability by assessing the abstractive summarization in the source language (e.g. vi -> vi). We focus on the nine languages (Arabic, English, Spanish, French, Hindi, Indonesian, Portuguese, Vietnamese and Chinese) which were amongst those targeted as part of the BigScience effort.

我们在WikiLingua(Ladhak等人,2020)数据集上评估摘要生成。WikiLingua是一个多语言摘要数据集,包括WikiHow文章和逐步摘要对。这些对应关系跨多种语言进行对齐,源语言和摘要的翻译进一步由国际翻译团队进行审核。具有与BLOOM相当规模的模型通常没有报告过单样本条件自然语言生成的能力。PaLM(Chowdhery等人,2022)是第一个例外,它在WikiLingua上报告了得分;然而,仅检查了模型在英语中进行摘要的能力(-> en)。相比之下,我们选择通过评估源语言(例如,vi -> vi)中的抽象摘要能力来测试BLOOM的多语言能力。我们专注于以下九种语言(阿拉伯语、英语、西班牙语、法语、印地语、印尼语、葡萄牙语、越南语和中文),它们是BigScience计划的目标之一。

Natural language generation is notoriously challenging to evaluate, with multilingual generation compounding this challenge due to a lack of metric support. Following the suggestions by Gehrmann et al. (2022b), we report ROUGE-2, ROUGE-L (Lin, 2004),31 and Levenshtein distance. One important modification to ROUGE is using the SentencePiece tokenizer (Kudo and Richardson, 2018) built from the Flores-101 dataset (Goyal et al., 2022). A naive approach would use a tokenizer based on English, but using a multilingual tokenizer improves the capacity to measure the fidelity of multilingual generations. To minimize inference time of the model, we use the subsamples from the updated GEM benchmark (Gehrmann et al., 2022a) (3000 uniformly sampled test examples). The authors note that there is minimal difference when comparing model performance between the subsamples and the full test sets. For decoding and generation, we use the same procedure as described above for MT.

自然语言生成的评估一直是非常具有挑战性的,而多语言生成由于缺乏度量支持而使得这一挑战更加复杂。根据Gehrmann等人(2022b)的建议,我们报告了ROUGE-2、ROUGE-L(Lin,2004)和Levenshtein距离。对ROUGE的一个重要修改是使用基于Flores-101数据集(Goyal等人,2022)构建的SentencePiece标记器(Kudo和Richardson,2018)。一个朴素的方法是使用基于英语的标记器,但是使用多语言标记器可以提高测量多语言生成的保真度的能力。为了减少模型的推理时间,我们使用更新的GEM基准(Gehrmann等人,2022a)的子样本(3000个均匀采样的测试样例)。作者指出,在子样本和完整测试集之间比较模型性能时几乎没有差异。对于解码和生成,我们使用与机器翻译相同的过程(如上所述)。

4.1.4、Baseline Models基准模型

We use the following baseline models where appropriate (e.g. in settings where they support the language of the evaluation dataset):

我们在适当的情况下使用以下基准模型(例如,在它们支持评估数据集的语言方面):

•mGPT (Shliazhko et al., 2022), GPT-style models trained on 60 languages from Wikipedia and Common Crawl

•GPT-Neo (Black et al.), GPT-J-6B (Wang and Komatsuzaki, 2021), and GPT-NeoX (Black et al., 2022), a family of GPT-style models trained on The Pile (Gao et al., 2020)

•T0 (Sanh et al., 2022), a variant of T5 (Raffel et al., 2020) that underwent multitask prompted finetuning on datasets from P3 (Bach et al., 2022)

•OPT (Zhang et al., 2022), a family of GPT-style models trained on a mixture of datasets including those from RoBERTa (Liu et al., 2019) and The Pile (Gao et al., 2020)

•XGLM (Lin et al., 2021), a GPT-style multilingual model trained on a variant of CC100 (Conneau et al., 2020)

•M2M (Fan et al., 2021), a massively multilingual model trained to translate between 100 languages

•AlexaTM (Soltan et al., 2022), an encoder-decoder model trained on a mixture of masked and causal language modeling on data from Wikipedia and mC4 (Xue et al., 2021)

•mTk-Instruct (Wang et al., 2022b), a variant of T5 that underwent multitask prompted finetuning on datasets from Super-NaturalInstructions

•Codex (Chen et al., 2021), a family of GPT models finetuned on code from GitHub

•GPT-fr (Simoulin and Crabbé, 2021), a GPT-style model trained on French text

•mGPT（Shliazhko等人，2022），在Wikipedia和Common Crawl的60种语言上训练的GPT风格模型

•GPT-Neo（Black等人）、GPT-J-6B（Wang和Komatsuzaki，2021）和GPT-NeoX（Black等人，2022），在The Pile（Gao等人，2020）上训练的一系列GPT风格模型

•T0（Sanh等人，2022），T5（Raffel等人，2020）的一种变体，在来自P3（Bach等人，2022）的数据集上进行了多任务提示微调

•OPT（Zhang等人，2022），在包括RoBERTa（Liu等人，2019）和The Pile（Gao等人，2020）在内的混合数据集上训练的一系列GPT风格模型

•XGLM（Lin等人，2021），在CC100的变体（Conneau等人，2020）上训练的GPT风格多语言模型

•M2M（Fan等人，2021），一种训练用于在100种语言之间互译的大规模多语言模型

•AlexaTM（Soltan等人，2022），在Wikipedia和mC4（Xue等人，2021）的数据上以掩码和因果语言建模的混合目标训练的编码器-解码器模型

•mTk-Instruct（Wang等人，2022b），T5的一种变体，在Super-NaturalInstructions的数据集上进行了多任务提示微调

•Codex（Chen等人，2021），在GitHub代码上微调的一系列GPT模型

•GPT-fr（Simoulin和Crabbé，2021），在法语文本上训练的GPT风格模型

4.2、SuperGLUE

Figure 7 shows zero- and one-shot performance on SuperGLUE. In both settings, on entailment tasks (BoolQ and CB), performance is well above random chance for BLOOM, T0, OPT, and GPT-J. On other tasks, while the best prompts do better, the average performance across prompts hovers around chance, suggesting that the success of individual prompts is primarily statistical variation. There is some signal for BLOOM in the diagnostic (Ax-b and Ax-g) datasets. The exception is the T0 model, which shows strong performance. However, this model is finetuned in the multitask setting (similar to BLOOMZ, see Section 4.7) in order to improve performance in zero-shot prompting settings, and thus is not directly comparable to the other models shown here.

As models go from zero-shot to one-shot, variability is reduced across all prompts and models and performance slightly and inconsistently increases. Notably, BLOOM sees more of an increase in performance than comparable models when going from zero-shot to one-shot, as it is generally behind OPT in the zero-shot setting but matches or improves on it in the one-shot setting, even though it has only partly been trained on English. This may be because a multilingual language model gains more certainty in the language of input and output with a longer context.

图7展示了SuperGLUE任务的零-shot和一-shot表现。在这两种设定下，对于蕴涵任务（BoolQ和CB），BLOOM、T0、OPT和GPT-J的性能远高于随机猜测。在其他任务中，尽管最佳提示表现更好，但平均性能在各种提示中都接近随机猜测，这表明个别提示的成功主要是统计变异。在诊断数据集（Ax-b和Ax-g）中，BLOOM表现出一些信号。例外情况是T0模型，它显示出较强的性能。然而，该模型在多任务设置下进行了微调（类似于BLOOMZ，请参见第4.7节），以改善零-shot提示设置中的性能，因此不能与此处显示的其他模型直接进行比较。

当模型从零-shot转变为一-shot时,所有提示和模型的变异性都减少了,性能略微且不一致地提高。值得注意的是,与OPT相比,BLOOM在从零-shot到一-shot的转变中的性能提高更多,尽管在零-shot设置中BLOOM通常落后于OPT,但在一-shot设置中与OPT相匹配或超过它,即使它只在英语上部分进行了训练。这可能是因为多语言语言模型在具有更长上下文的情况下在输入和输出的语言中获得更多的确定性。

Figure 7: Performance of various LLMs on subset of tasks from SuperGLUE benchmark in zero- and one-shot prompt-based setting.

图7:各种LLM在SuperGLUE基准测试的部分任务上的零-shot和一-shot基于提示的性能。

We perform an additional analysis comparing BLOOM models across model sizes. As a baseline, we also measure the average one-shot accuracy of OPT models of similar sizes (350M parameters to 175B parameters).32 Figure 8 shows the accuracy of each prompt on each task across model scales. Both OPT and BLOOM model families improve very slightly with scale, with only models over 2 billion parameters showing signal, and there is no consistent difference between families across all tasks. In the 1-shot setting, BLOOM-176B is ahead of OPT-175B on Ax-b, CB, WSC and WiC, and matches it on the other tasks, suggesting that multilinguality does not limit the performance of BLOOM on English-only tasks in the zero-shot setting.

我们进行了进一步的分析，比较了BLOOM模型在不同模型规模上的表现。作为基准，我们还测量了规模相似的OPT模型（从3.5亿参数到1750亿参数）的平均一-shot准确率。图8显示了每个任务上每个提示的准确率随模型规模的变化情况。OPT和BLOOM模型家族的性能都随规模略微提升，只有超过20亿参数的模型显示出一定的信号，并且在所有任务上两个家族之间没有一致的差异。在一-shot设置中，BLOOM-176B在Ax-b、CB、WSC和WiC上优于OPT-175B，在其他任务上与其相匹配，这表明在零-shot设置中，多语言性不会限制BLOOM在仅限英语的任务上的性能。

Figure 8: Comparison of the scaling of BLOOM versus OPT on each SuperGLUE one-shot task. Each point represents the average accuracy of a model within the BLOOM or OPT family of models on one of the five task prompts. The number of parameters on the x-axis is presented in log-scale.

图8：BLOOM和OPT在SuperGLUE一-shot任务上的规模比较。每个点表示BLOOM或OPT模型家族中一个模型在五个任务提示之一上的平均准确率。x轴上的参数数量以对数刻度显示。

4.3、Machine Translation机器翻译

In addition to the results presented here, a more detailed analysis of BLOOM’s MT quality can be found in (Bawden and Yvon, 2023).

除了在这里提供的结果之外,有关BLOOM的机器翻译质量的更详细分析可以在(Bawden和Yvon,2023)中找到。

4.3.1、WMT

WMT results for BLOOM-176B in the zero-shot and 1-shot setting are given in Table 6. The best prompts tend to be the more verbose ones; the “version-target” prompt is consistently better and the “gpt3-target” and “xglm-source+target” prompts have very poor performance, especially for zero-shot. In the one-shot setting, BLOOM can, with the right prompt, perform competent translation, although it is behind dedicated (supervised) models such as M2M-100 (43.8 BLEU for English→French and 40.4 for French→English, compared to 34.2 and 35.4 BLEU for BLOOM). The two major problems observed, particularly in the zero-shot setting, are (i) over-generation and (ii) not producing the correct language (an obvious prerequisite for a good translation). Both of these aspects are greatly improved as the number of few-shot examples is increased.

表6显示了BLOOM-176B在零样本和单样本设置下的WMT结果。最好的提示往往是较冗长的提示；“version-target”提示始终更好，而“gpt3-target”和“xglm-source+target”提示表现非常糟糕，尤其是在零样本设置下。在单样本设置中，通过选择合适的提示，BLOOM可以进行称职的翻译，尽管它落后于专用（监督）模型，如M2M-100（英语→法语的BLEU为43.8，法语→英语为40.4，而BLOOM分别为34.2和35.4）。观察到的两个主要问题（尤其是在零样本设置中）是：（i）过度生成；（ii）未产生正确的语言（这是良好翻译的明显先决条件）。随着少样本示例数量的增加，这两个方面都得到了极大的改善。

Table 6: WMT’14 zero- and one-shot results (BLEU) for BLOOM-176B. The prompts used are described in Table 5.

表6：BLOOM-176B的WMT'14零样本和单样本结果（BLEU）。所使用的提示在表5中有描述。

4.3.2、DiaBLA

Table 7: DiaBLa 1-shot results (BLEU) for the “xglm-source+target” prompt when using the previous or a random sentence as the 1-shot example (with and without truncation of outputs). In bold the best results for each direction.

表7：使用“xglm-source+target”提示、以前一句话或随机句子作为单样本示例（输出截断与否）时的DiaBLa单样本结果（BLEU）。每个方向上的最佳结果以粗体显示。

Table 7 shows results testing the use of linguistic context with DiaBLa, a parallel dataset of informal bilingual dialogues. In a 1-shot context and using the “xglm-source+target” prompt, we compare the effect of using a random test set example as the 1-shot example versus using the previous dialogue utterance.  In light of the overgeneration issues seen and in order to evaluate the quality of the prediction independently of overgeneration, we report results for both original outputs and after applying a custom truncation function.33 The automatic results are inconclusive, with little difference between scores (BLEU scores are higher for previous context but COMET scores are lower). Despite these results, there is evidence in the predictions themselves that the model is able to use the context of the 1-shot example to make translation choices. See (Bawden and Yvon, 2023) for examples and further analysis.

表7显示了在DiaBLa（一个非正式双语对话的平行数据集）上测试语言上下文作用的结果。在单样本上下文中，使用“xglm-source+target”提示，我们比较了使用随机测试集示例作为单样本示例与使用前一句对话话语作为单样本示例的效果。鉴于观察到的过度生成问题，并为了独立于过度生成来评估预测的质量，我们报告了原始输出和应用自定义截断函数后的结果。自动评估的结果并不确定，得分之间几乎没有差异（前一句上下文的BLEU得分更高，但COMET得分较低）。尽管如此，预测结果本身有证据表明模型能够利用单样本示例的上下文进行翻译选择。有关示例和进一步分析，请参阅（Bawden和Yvon，2023）。

4.3.3、Flores

In the 1-shot setting, we test several language directions in the Flores-101 (Goyal et al., 2022) devtest set using the “xglm-source+target” prompt (Lin et al., 2021). The 1-shot example is randomly taken from the dev set. We separate out results for low-resource language pairs (Table 8a), between related languages of the Romance language family (Table 8b), high-resource language pairs (Table 8c) and high-to-mid-resource language pairs (Table 8d). Languages are classified as low-, mid- and high-resource depending on their representation in ROOTS. We compare to supervised results from the M2M-100 model (Fan et al., 2021) with 615M parameters, for which scores are computed by Goyal et al. (2022). Additionally, we compare to 32-shot AlexaTM results for high-resource language pairs (Soltan et al., 2022). Results are good across the board for both translation between high-resource languages and from high- to mid-resource languages, suggesting BLOOM's good multilingual capacity, even across scripts (here between Latin (or extended Latin), Chinese, Arabic and Devanagari scripts). Compared to the supervised M2M-100 model, results are often comparable and sometimes better in this 1-shot setting, and results are comparable in many cases to those of AlexaTM (even though AlexaTM results are for 32-shot).

在1-shot设置中,我们使用Flores-101(Goyal等人,2022年)devtest数据集以“xglm-source+target”提示(Lin等人,2021年)测试了几种语言方向。1-shot示例是从dev集中随机选取的。我们将低资源语言对(表8a)、属于罗曼斯语系的相关语言对(表8b)、高资源语言对(表8c)和高到中等资源语言对(表8d)的结果分开。根据它们在ROOTS中的表示,语言被分类为低资源、中等资源和高资源。我们与M2M-100模型(Fan等人,2021年)的有监督结果进行比较,该模型具有615M个参数,其分数由Goyal等人(2022年)计算。此外,我们还将其与高资源语言对的32-shot AlexaTM结果进行了比较(Soltan等人,2022年)。无论是高资源语言之间的翻译还是从高资源语言到中等资源语言的翻译,结果在各个方面都很好,这表明BLOOM具有良好的多语言能力,甚至跨脚本(在此处为拉丁(或扩展拉丁)、汉字、阿拉伯字母和天城体之间)。与有监督的M2M-100模型相比,这个1-shot设置中的结果通常是可比较的,有时甚至更好,并且在许多情况下与AlexaTM的结果相当(尽管AlexTM的结果是32-shot)。

The translation quality for many of the low-resource languages is good, comparable to or even slightly better than the supervised M2M model. However, results are very poor between Swahili and Yoruba, languages that are present but under-represented in BLOOM's training data (<50k tokens each). This contrasts with the results for translation between Romance (and therefore related) languages, where results are good across the board, including for translation from Galician (glg), a language not included in the training data, but which shares many similarities with the other Romance languages, in particular with Portuguese (por). This however does question BLOOM's quality on those under-represented low-resource languages included in training.

对于许多低资源语言,翻译质量很好,与或甚至略好于有监督的M2M模型。然而,在斯瓦希里语和约鲁巴语之间的结果非常差,这两种语言在BLOOM的训练数据中存在但表示不足(每种语言少于50k个标记)。这与罗曼斯语(因此相关)语言之间的翻译结果形成鲜明对比,在这些语言之间的翻译结果非常好,包括从加利西亚语(glg)翻译,该语言不包含在训练数据中,但与其他罗曼斯语言,特别是葡萄牙语(por)具有许多相似之处。然而,这确实对BLOOM在训练中包含的那些低资源语言的质量提出了质疑。

4.4、Summarization摘要

Figure 9 shows one-shot results for BLOOM models alongside OPT-175B for comparison. Each point represents a per-prompt score. The key takeaways are that BLOOM attains higher performance on multilingual summarization than OPT and that performance increases as the parameter count of the model increases. We suspect this is due to BLOOM's multilingual-focused training.

As discussed in Section 4.1, we report ROUGE-2 scores for the sake of comparability with prior work, and because there is a lack of alternatives for generation evaluation. However, we qualitatively observe that in many cases, the ROUGE-2 score understates the quality of the summaries generated by the systems.

图9显示了与OPT-175B进行比较的BLOOM模型在一次性摘要任务中的结果。每个点代表一个提示的得分。重要的发现是,与OPT相比,BLOOM在多语言摘要方面表现更好,并且随着模型参数数量的增加,性能也提高。我们怀疑这是由于BLOOM的多语言训练所致。

正如第4.1节中讨论的,我们报告ROUGE-2分数是为了与之前的工作进行比较,并且因为在生成评估方面缺乏替代方法。然而,我们在定性上观察到,在许多情况下,ROUGE-2分数低估了系统生成的摘要质量。

4.5、Code Generation代码生成

The BLOOM pretraining corpus, ROOTS, consists of around 11% of code. In Table 9, we report benchmarking results of BLOOM on HumanEval (Chen et al., 2021). We find the performance of pretrained BLOOM models to be similar to that of the similar-sized GPT models trained on the Pile (Gao et al., 2020). The Pile contains English data and around 13% of code (GitHub + StackExchange), which is similar to the code data sources and proportions in ROOTS. The Codex models, which have solely been finetuned on code, are significantly stronger than other models. Multitask finetuned BLOOMZ models do not improve significantly over BLOOM models. We hypothesize this is due to the finetuning dataset, xP3, not containing significant amounts of pure code completion. Rather, xP3 contains code-related tasks, such as estimating the time complexity of a given Python code snippet.  Additional analysis is provided in Muennighoff et al. (2022b).

BLOOM的预训练语料库ROOTS中约有11%的代码。在表9中,我们报告了BLOOM在HumanEval(Chen等人,2021年)上的基准结果。我们发现预训练的BLOOM模型的性能与在Pile(Gao等人,2020年)上训练的大小相似的GPT模型相当。Pile包含英语数据和约13%的代码(GitHub + StackExchange),与ROOTS中的代码数据源和比例相似。专门在代码上进行微调的Codex模型比其他模型要强得多。多任务微调的BLOOMZ模型在BLOOM模型上并没有显著改善。我们推测这是由于微调数据集xP3中没有包含大量的纯代码完成任务。相反,xP3包含与代码相关的任务,例如估计给定Python代码片段的时间复杂度。Muennighoff等人(2022b年)提供了更多的分析。
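HumanEval scores are conventionally reported as pass@k; Chen et al. (2021) use the unbiased estimator below (a sketch; the sample counts in the example are made up):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from Chen et al. (2021): probability that at least one of
    k samples drawn from n generations (of which c pass the unit tests) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(round(pass_at_k(n=200, c=30, k=10), 3))  # e.g. 200 samples, 30 correct, pass@10
```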

Figure 9: WikiLingua One-shot Results. Each plot represents a different language with per-prompt ROUGE-2 F-measure scores.

图9:WikiLingua一次性任务结果。每个图表示不同语言的每个提示的ROUGE-2 F-测量分数。

4.6、HELM benchmark基准

For completeness, we reproduce here evaluations from the HELM benchmark (Liang et al., 2022), which ran 5-shot evaluations of a variety of language models on English-only tasks. Despite the multilingual training, BLOOM is roughly on par in accuracy with previous-generation English-only models, such as GPT3-davinci v1 and J1-Grande v1, but behind more recent monolingual models such as InstructGPT davinci v2, Turing NLG v2, Anthropic-LM v4-s3, or OPT. Like other large language models of this size, it is not very well calibrated, but quite robust. Finally, on this benchmark, it is one of the best models for fairness, slightly more toxic than average in English, and average for bias.

为了完整起见,我们在这里重现了来自HELM基准测试(Liang等人,2022年)的评估结果,该基准测试对只使用英语的任务进行了5-shot评估。尽管进行了多语言训练,BLOOM在准确性上与先前一代的只英语模型(如GPT3-davinci v1和J1-Grande v1)大致相当,但落后于更近期的单语模型,如InstructGPT davinci v2、Turing NLG v2、Anthropic-LM v4-s3或OPT。与此大小的其他大型语言模型一样,它的校准度不高,但非常稳健。最后,在这个基准测试中,它是公平性方面表现最好的模型之一,在英语中稍微比平均水平更有害,而在偏见方面处于平均水平。

4.7、Multitask Finetuning多任务微调

Building on recent work on multitask finetuning (Sanh et al., 2022; Wei et al., 2021; Wang et al., 2022a), we explore using multilingual multitask finetuning to improve the zero-shot performance of the BLOOM model. We conducted multilingual multitask finetuning of BLOOM models using the xP3 corpus outlined in Section 3.1.4. We find that zero-shot performance significantly increases. In Figure 11, we compare the zero-shot performance of pretrained BLOOM and XGLM models with multitask finetuned BLOOMZ, T0 and mTk-Instruct (Wang et al., 2022b). BLOOM and XGLM performances are near the random baselines of 33% for NLI (XNLI) and 50% for coreference resolution (XWinograd) and sentence completion (XCOPA and XStoryCloze). After going through multilingual multitask finetuning (BLOOMZ), zero-shot performance significantly improves on the depicted held-out tasks. Despite also being multitask finetuned, T0 performs badly on the multilingual datasets shown due to it being a monolingual English model. Additional results provided in Muennighoff et al. (2022b), however, show that models finetuned on xP3 also outperform T0 on English datasets when controlling for size and architecture. This is likely due to T0's finetuning dataset (P3) containing less diverse datasets and prompts than xP3. Multitask finetuning performance has been shown to correlate with the amount of datasets and prompts (Chung et al., 2022).

基于最近关于多任务微调的研究(Sanh等人,2022年;Wei等人,2021年;Wang等人,2022a年),我们探索使用多语言多任务微调来提高BLOOM模型的零-shot性能。我们使用第3.1.4节中概述的xP3语料库对BLOOM模型进行了多语言多任务微调。我们发现零-shot性能显著提高。在图11中,我们将预训练的BLOOM和XGLM模型的零-shot性能与多任务微调的BLOOMZ、T0和mTk-Instruct(Wang等人,2022b年)进行了比较。BLOOM和XGLM的性能接近NLI(XNLI)的随机基线(33%)以及核心指代消解(XWinograd)和句子补全(XCOPA和XStoryCloze)的随机基线(50%)。经过多语言多任务微调(BLOOMZ)后,零-shot性能在所示的保留任务上显著提高。尽管也经过多任务微调,T0在多语言数据集上的表现不佳,这是因为它是一个单语英语模型。然而,Muennighoff等人(2022b年)提供的其他结果显示,在控制大小和架构时,在xP3上微调的模型在英语数据集上也优于T0。这可能是因为T0的微调数据集(P3)包含的数据集和提示比xP3较少。多任务微调的性能已经显示与数据集和提示的数量相关(Chung等人,2022年)。

Table 9: Performance on HumanEval (Chen et al., 2021). Non-BLOOM results come from prior work (Chen et al., 2021; Fried et al., 2022). The Codex model is a language model that was finetuned on code, while the GPT models (Black et al.; Wang and Komatsuzaki, 2021; Black et al., 2022) are trained on a mix of code and text like BLOOM.

表9:HumanEval(Chen等人,2021年)的性能。非BLOOM结果来自先前的研究(Chen等人,2021年;Fried等人,2022年)。Codex模型是在代码上微调的语言模型,而GPT模型(Black等人;Wang和Komatsuzaki,2021年;Black等人,2022年)是像BLOOM一样在代码和文本混合数据上训练的。

4.8、Embeddings嵌入

In Section 3.5, we have outlined the contrastive finetuning procedure for creating SGPT-BLOOM text embedding models. In Table 10, we report benchmarking results on two multilingual datasets from the Massive Text Embedding Benchmark (MTEB, Muennighoff et al., 2022a). We find that SGPT-BLOOM-7.1B-msmarco36 provides state-of-the-art performance on several classification and semantic textual similarity splits. However, with 7.1 billion parameters it is an order of magnitude larger than models like the displayed multilingual MiniLM37 and MPNet38. SGPT-BLOOM-1.7B-nli39 performs significantly worse, likely due to fewer parameters and its finetuning being shorter (NLI is a much smaller dataset than MS-MARCO). Apart from the BLOOM models, ST5-XL40 is the largest model with 1.2 billion parameters. However, as an English-only model its performance on non-English languages is poor. The languages displayed are part of the BLOOM pretraining corpus. Performance on more languages and datasets can be inspected on the MTEB leaderboard41.

在第3.5节中,我们概述了用于创建SGPT-BLOOM文本嵌入模型的对比微调过程。在表10中,我们报告了来自Massive Text Embedding Benchmark(MTEB,Muennighoff等人,2022a年)的两个多语言数据集的基准结果。我们发现SGPT-BLOOM-7.1B-msmarco36在多个分类和语义文本相似性分割上提供了最先进的性能。然而,它拥有71亿个参数,比显示的多语言MiniLM37和MPNet38模型大一个数量级。SGPT-BLOOM-1.7B-nli39的性能明显较差,可能是由于参数较少和微调时间较短(NLI数据集比MS-MARCO要小得多)。除了BLOOM模型外,ST5-XL40是最大的模型,具有12亿个参数。然而,作为一个只有英语的模型,它在非英语语言上的性能较差。显示的语言是BLOOM预训练语料库的一部分。更多语言和数据集的性能可以在MTEB排行榜上查看。

Figure 10: Results for a wide variety of language models on the 5-shot HELM benchmark. Taken from Liang et al. (2022)

图10:在5-shot HELM基准测试上各种语言模型的结果。取自Liang等人(2022年)。

Figure 11: BLOOMZ zero-shot task generalization. Five untuned prompts are evaluated for each dataset and plotted. T0 is monolingual (English) while other models are multilingual. T0 performance may be hurt by its inability to tokenize some non-English texts.

图11:BLOOMZ的零-shot任务泛化。对于每个数据集,评估了五个未微调的提示并进行绘图。T0是单语言(英语),而其他模型是多语言。T0的性能可能受到无法标记化某些非英语文本的影响。

4.9、Multilingual Probing多语言探测

Probing has emerged as a significant evaluation paradigm to analyze and interpret the inner workings of LLMs (Ettinger et al., 2016; Adi et al., 2017; Belinkov et al., 2017; Hupkes et al., 2018; Tenney et al., 2018; Belinkov and Glass, 2019; Teehan et al., 2022), although it comes with certain shortcomings (Belinkov, 2022). Examination of the LLM embeddings can help shed light on the generalizing abilities of the model apart from its training objective loss or downstream task evaluation, which is especially beneficial for examining languages lacking annotated datasets or benchmarks.

探测已经成为一种重要的评估范式,用于分析和解释LLM(Ettinger等,2016年;Adi等,2017年;Belinkov等,2017年;Hupkes等,2018年;Tenney等,2018年;Belinkov和Glass,2019年;Teehan等,2022年)的内部工作方式,尽管它也存在一些缺点(Belinkov,2022年)。检查LLM嵌入可以帮助揭示模型的泛化能力,除了其训练目标损失或下游任务评估之外,这对于检查缺乏注释数据集或基准的语言特别有益。

Table 10: Performance of BLOOM models finetuned for sentence embeddings on classifica- tion and STS datasets from MTEB (Muennighoff et al., 2022b).

表10:在MTEB(Muennighoff等,2022年)的分类和STS数据集上微调用于句子嵌入的BLOOM模型的性能。

4.9.1、Method方法

For interpreting BLOOM's multilingual generalizing abilities, we utilize the “Universal Probing” framework42 for systematic probing analysis in 104 languages and 80 morphosyntactic features (Serikov et al., 2022). The framework provides a SentEval-style (Conneau et al., 2018) probing setup and datasets for each language available in Universal Dependencies (UD; Nivre et al., 2016). We consider the following 17 languages from 7 language families present in BLOOM's pretraining corpus (Section 3.1) and UD treebanks: Arabic (Afro-Asiatic), Bambara (Mande), Basque (language isolate), Bengali, Catalan, English, French, Hindi, Marathi, Portuguese, Spanish, Urdu (Indo-European), Chinese (Sino-Tibetan), Indonesian (Austronesian), Tamil (Dravidian), Wolof, Yoruba (Niger-Congo). Our setup covers 38 morphosyntactic features in total, which represent language-specific linguistic information. We provide a dataset sample in Table 11.

为了解释BLOOM的多语言泛化能力,我们利用了“通用探测”框架42进行系统探测分析,涵盖了104种语言和80种形态句法特征(Serikov等,2022年)。该框架为每种语言提供了SentEval风格(Conneau等,2018年)的探测设置和数据集,这些语言都可以在通用依赖关系(UD;Nivre等,2016年)中找到。我们选择了来自BLOOM预训练语料库(第3.1节)和UD树库的7个语言家族中的17种语言进行探测,这些语言家族包括:阿拉伯语(阿非罗-亚细亚语系)、班巴拉语(曼德语系)、巴斯克语(孤立语系)、孟加拉语、加泰罗尼亚语、英语、法语、印地语、马拉地语、葡萄牙语、西班牙语、乌尔都语(印欧语系)、中文(汉藏语系)、印度尼西亚语(南岛语系)、泰米尔语(德拉维达语系)、沃洛夫语、约鲁巴语(尼日尔-刚果语系)。我们的设置涵盖了总共38个形态句法特征,代表了语言特定的语言信息。在表11中,我们提供了一个数据集示例。

The probing procedure is conducted as follows. First, we compute <s>-pooled representations of the input sentence at each layer of the 1.7B-parameter BLOOM variant (“BLOOM 1B7”) and BLOOM (with 176B parameters). Second, we train a binary logistic regression classifier to predict the presence of a morphosyntactic feature in the sentence. Logistic regression is chosen due to its higher selectivity as opposed to non-linear probing classifiers (Hewitt and Liang, 2019). We use the original UD training, validation, and test splits here. Third, the probing performance is evaluated by the weighted F1 score due to target class imbalance for most probing tasks. The results are averaged across three runs with different random seeds.

探测过程如下进行。首先,我们在1.7B参数的BLOOM变体(“BLOOM 1B7”)和BLOOM(176B参数)的每个层级上计算输入句子的<s>池化表示。其次,我们训练一个二元逻辑回归分类器来预测句子中是否存在形态句法特征。选择逻辑回归是因为相对于非线性探测分类器(Hewitt和Liang,2019年),它具有更高的选择性。我们在这里使用了原始的UD训练、验证和测试拆分。第三,通过F1加权分数来评估探测性能,因为大多数探测任务的目标类别不平衡。结果在具有不同随机种子的三次运行中进行平均。
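The per-layer probing step can be sketched as follows (illustrative; the real setup uses the Universal Probing framework, and the feature matrices are assumed to hold one <s>-pooled hidden-state vector per sentence with a binary label for the probed morphosyntactic feature):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def probe_layer(X_train, y_train, X_test, y_test, seeds=(0, 1, 2)) -> float:
    """Fit a binary logistic-regression probe and report weighted F1,
    averaged over three random seeds as described above."""
    scores = []
    for seed in seeds:
        clf = LogisticRegression(max_iter=1000, random_state=seed)
        clf.fit(X_train, y_train)
        scores.append(f1_score(y_test, clf.predict(X_test), average="weighted"))
    return float(np.mean(scores))
```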

Table 11: Examples of the Number task in English and Spanish. The subject number indicator is highlighted in bold. The task is to predict if the sentence includes a singular subject number (upper sentence) and a plural subject number (bottom sentence).

表11:英语和西班牙语中“Number”任务的示例。主语数量指示器以粗体显示。任务是预测句子中是否包含单数主语数量(上句)和复数主语数量(下句)。

Baselines 基准

We compare the probing performance with random guessing and logistic regression classifiers trained on the following TF-IDF features (Salton and Yang, 1973): word unigrams, character N-grams, BPE43 token N-grams, and SentencePiece44 (SP; Kudo and Richardson, 2018) token N-grams. We use the N-gram range ∈ [1; 4] and limit the TF-IDF vocabularies to the top-250k features.

我们将探测性能与随机猜测和基于逻辑回归训练的分类器进行比较,这些分类器使用以下TF-IDF特征(Salton和Yang,1973年)进行训练:单词单元、字符N-gram、BPE43标记N-gram和SentencePiece44(SP;Kudo和Richardson,2018年)标记N-gram。我们将N-gram范围限定在[1;4]之间,并将TF-IDF词汇表限制为前25万个特征。
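One of these count-based baselines, the character N-gram variant, might look as follows with scikit-learn (a sketch; the word-unigram and subword-token baselines only change the analyzer or tokenizer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Character n-grams with n in [1, 4], capped at the top 250k TF-IDF features,
# feeding a logistic regression classifier.
char_ngram_baseline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4), max_features=250_000),
    LogisticRegression(max_iter=1000),
)
# char_ngram_baseline.fit(train_sentences, train_labels)
# predictions = char_ngram_baseline.predict(test_sentences)
```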

Correlation相关性

We run statistical tests to analyze correlations between the probing performance and linguistic, dataset, and model configuration criteria (a minimal sketch of these tests is given after the list):

>> Language script: the results are divided into two groups by the language script – Latin and others (Devanagari, Tamil, and Arabic). Here, we use the non-parametric test Mann-Whitney U (Mann and Whitney, 1947).

>> Language family: the results are divided into 7 groups by the language family. We apply the ANOVA to analyze the variance between the groups.

>> Probing and pretraining dataset size: we run the Pearson correlation coefficient test (Pearson, 1895) to compute correlation between the probing performance and these data configuration criteria.

>> Effect of the model size: the results are divided into two groups by the BLOOM version. Here, we use the Mann-Whitney U test to see if there is a correlation between the number of parameters and the probing results.

我们进行统计测试,分析探测性能与语言、数据集和模型配置标准之间的相关性:

>>语言脚本:结果根据语言脚本分为两组 - 拉丁文和其他语言(天城文、泰米尔文和阿拉伯文)。在这里,我们使用非参数检验Mann-Whitney U(Mann和Whitney,1947年)。

>>语言家族:结果根据语言家族分为7个组。我们应用ANOVA来分析组间的方差。

>>探测和预训练数据集大小:我们运行皮尔逊相关系数检验(Pearson,1895年)来计算探测性能与这些数据配置标准之间的相关性。

>>模型大小的影响:结果根据BLOOM版本分为两组。在这里,我们使用Mann-Whitney U检验来查看参数数量和探测结果之间是否存在相关性。
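A minimal sketch of these tests with scipy (the score arrays are illustrative placeholders for per-language probing results):

```python
import numpy as np
from scipy import stats

latin = np.array([0.71, 0.68, 0.74, 0.70])        # probing scores, Latin-script languages
other = np.array([0.62, 0.59, 0.65])              # Devanagari / Tamil / Arabic scripts
u_stat, p_script = stats.mannwhitneyu(latin, other)          # language script

groups = [np.array([0.70, 0.72]), np.array([0.60, 0.63]), np.array([0.66, 0.68])]
f_stat, p_family = stats.f_oneway(*groups)                   # language family (ANOVA)

pretrain_tokens = np.array([1e8, 5e8, 2e9, 1e10, 4e10])
probe_scores = np.array([0.55, 0.60, 0.64, 0.70, 0.73])
r, p_size = stats.pearsonr(pretrain_tokens, probe_scores)    # dataset size correlation
```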

4.9.2、Results结果

Probing探测

Table 12 presents the results of probing experiments averaged over the probing tasks and experiment runs within each language. The overall pattern is that BLOOM-1B7 performs on par with or better than BLOOM, and both LLMs outperform the count-based baselines. In particular, the LLMs achieve more robust performance on Arabic, Basque, and Indo-European languages (e.g., Catalan, French, Hindi, Portuguese, Spanish, and Urdu), while Bengali, Wolof, and Yoruba receive the lowest scores. We attribute this behavior to transfer abilities: BLOOM infers linguistic properties better for closely related languages that comprise a significant amount of the data. For example, the performance on any Romance language is better than in English, and the results in Indic languages are close to those in high-resource languages.

表12呈现了在每种语言内对探测任务和多次实验运行取平均后的探测实验结果。总体模式是BLOOM-1B7的性能与BLOOM相当或更好，并且两个LLM的表现都优于基于计数的基线。特别是，在阿拉伯语、巴斯克语和印欧语系语言（如加泰罗尼亚语、法语、印地语、葡萄牙语、西班牙语和乌尔都语）上，LLM表现出更稳健的性能，而孟加拉语、沃洛夫语和约鲁巴语的得分最低。我们将这种行为归因于迁移能力：对于在训练数据中占比较大且彼此密切相关的语言，BLOOM能够更好地推断其语言属性。例如，在任何一种罗曼语言上的性能都优于英语，而印度语言的结果接近高资源语言。

Table 12: Probing performance (F1 averaged by layers) of the BLOOM-based classifiers and count-based baselines. The results are averaged over probing tasks, and three experiment runs within each language. Standard deviation is determined by the results along the language tasks.

表12:基于BLOOM的分类器和基于计数的基线的探测性能(按层平均的F1)。结果是在探测任务和每种语言内进行的三次实验运行中进行平均的。标准差由语言任务的结果确定。

Figure 12 presents the language-wise probing performance results for morphosyntactic features represented in at least 5 languages. The probing performance of both LLMs is similar despite the difference in size. We find that the LLMs infer Mood and Person well regardless of language. Number, NumType (numeral type), and Voice are moderately inferred in most languages. The models generally perform worse on the other categories, indicating that they do not encode such morphological information. A possible explanation for this difference in performance is the diversity of possible values of these categories. For example, Mood and Person share similar values across the languages presented, while the set of Case values is highly dependent on the language.

图12呈现了至少在5种语言中出现的形态句法特征的按语言划分的探测性能结果。尽管规模不同,两个LLM的探测性能相似。我们发现,无论语言如何,LLM都能很好地推断语气(Mood)和人称(Person)。数(Number)、数词类型(NumType)和语态(Voice)在大多数语言中的推断程度适中。模型在其他类别上通常表现较差,表明它们没有编码此类形态信息。这种性能差异的一个可能解释是这些类别可能取值的多样性。例如,语气和人称在所涉及的语言中具有相似的取值,而格(Case)的取值集合则高度依赖于具体语言。

Correlation相关性

The correlation analysis results support the conclusions on probing performance and reveal contributing factors (see Table 13). Both models show similar results on languages with different scripts. The results of BLOOM-1B7 are highly correlated with language family, probing dataset size, and pretraining dataset size. According to the Mann-Whitney U test, BLOOM-1B7 shows significantly better results (p < 0.01) than BLOOM. However, BLOOM shows more stable performance across languages regardless of the amount of data it has seen during pretraining. This might indicate better generalization abilities of the model with more parameters.

相关性分析结果支持关于探测性能的结论,并揭示了影响因素(见表13)。两个模型在不同书写系统的语言上显示出相似的结果。BLOOM-1B7的结果与语言家族、探测数据集大小和预训练数据集大小高度相关。根据Mann-Whitney U检验的结果,BLOOM-1B7的结果显著优于BLOOM(p < 0.01)。然而,无论预训练中见过的数据量多少,BLOOM在不同语言上的表现都更为稳定。这可能表明参数更多的模型具有更好的泛化能力。

Figure 12: Probing classifiers’ results by language and task category. White squares denote that the morphosyntactic category is not represented in the language.

图12:按语言和任务类别划分的探测分类器结果。白色方块表示该语言中没有表示该形态句法类别。

Table 13: Results of statistical tests and correlation analysis between probing performance and linguistic, dataset, and model configuration criteria.

表13:探测性能与语言、数据集和模型配置标准之间的统计测试和相关性分析结果。

Discussion讨论

It should be noted that the following questions remain for further research:

需要进一步研究的问题如下:

1、Generalizing abilities. BLOOM-1B7 leads in average performance on morphosyntactic feature classification for the languages in Table 12. The BLOOM results are lower, which can be interpreted as weaker grammatical generalization over the aforementioned languages. However, BLOOM-1B7's probing correlations with factors such as pretraining dataset size are more prominent, which suggests it may generalize less well to under-resourced languages than the bigger version.

2、Multilingual abilities. A separate research interest is considering languages that are not explicitly included in the models' pretraining corpus. Expanding the set of languages for probing would allow for a typological interpretation and a deeper analysis of the most learnable and hardest-to-learn linguistic features at a broader scope.

3、Under-resourced language evaluation. The under-resourced languages of the Indic and Niger-Congo families, included in the pretraining corpus in smaller shares, represent a separate subject for future probing. We also plan to investigate the results of high-resourced and under-resourced languages to reveal possible linguistic insights in these two groups.

4、Different layers and training dynamics. The analysis has focused on representations averaged across all layers and on the model at the end of training. Analyzing individual layers may reveal how morphosyntactic representations are built during processing. Similarly, investigating how properties are acquired over the course of pretraining (Choshen et al., 2022; Zhang et al., 2021; Voloshina et al., 2022) is a viable direction for research.

1、泛化能力。BLOOM-1B7在表12中的形态句法特征分类的平均性能中处于领先地位。BLOOM的结果较低,可以解释为在上述语言中更差的语法泛化能力。然而,BLOOM-1B7与预训练数据集大小等因素的探测相关结果更为突出,这使得它在资源较少的语言上的泛化能力可能不如更大的版本。

2、多语言能力。另一个研究兴趣是考虑未明确包含在模型预训练语料库中的语言。扩展探测语言的范围将允许对更广泛的范围上最易学习和难学习的语言特征进行类型学解释和深入分析。

3、资源不足语言评估。预训练语料库中包含的印度语系和尼日尔-刚果语系的资源不足语言代表了未来探测的一个独立课题。我们还计划研究高资源和资源不足语言的结果,以揭示这两个群体可能的语言洞见。

4、不同层次和训练动态。目前的分析集中在所有层的平均表示以及训练结束时的模型上。分析不同层可能会揭示形态句法表示在处理过程中是如何构建的。类似地,研究属性在预训练过程中如何被习得(Choshen et al., 2022;Zhang et al., 2021;Voloshina et al., 2022)是一个可行的研究方向。

4.10、Bias偏见

As a preliminary study into the biases learned by BLOOM, we present an evaluation on the multilingual CrowS-Pairs dataset, which combines a revised version of the CrowS-Pairs dataset developed by Nangia et al. (2020) with the French version of CrowS-Pairs introduced by Névéol et al. (2022). One challenge of this evaluation was to adapt a dataset originally intended for masked language models to autoregressive language models such as BLOOM. CrowS-Pairs relies on minimal pairs to compare a stereotyped statement and a non-stereotyped statement (e.g. “Women can't drive.” is a gender stereotype while “Men can't drive” is not). The two statements differ only by the social category targeted by the stereotype, and that social category is present in the stereotyped statement but not in the non-stereotyped statement. The evaluation aims at assessing systematic preference of models for stereotyped statements. The original “metric score” compared the pseudo-log-likelihood of the sentences in a pair to determine which sentence received a higher score from a masked language model. Here, prompts were designed to require the model to select one of the statements based on the “likely” and “realistic” nature of the situations described.
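To make the adaptation issue concrete, the sketch below scores a minimal pair with an autoregressive model by summing token log-probabilities and checking which statement is preferred. This is a simplified likelihood-based comparison under stated assumptions (the checkpoint name and scoring function are illustrative); the evaluation reported here actually relied on prompts asking the model to pick the more “likely”/“realistic” statement.

```python
# Simplified minimal-pair scoring with an autoregressive LM (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "bigscience/bloom-1b7"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_logprob(sentence):
    """Sum of token log-probabilities under the autoregressive model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predictions for tokens 2..n
    targets = ids[:, 1:]
    return log_probs.gather(-1, targets.unsqueeze(-1)).sum().item()

def prefers_stereotype(stereotyped, non_stereotyped):
    """True if the model assigns a higher score to the stereotyped statement."""
    return sentence_logprob(stereotyped) > sentence_logprob(non_stereotyped)

# Example pair from the text:
print(prefers_stereotype("Women can't drive.", "Men can't drive."))
```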

Figure 13 shows that BLOOM's overall prompt accuracy was close to .50, which suggests an overall absence of bias. We note that the scores in English and French are very close, suggesting similar overall behavior of the model on both languages. We also show results for monolingual autoregressive models, GPT-Neo (Black et al., 2021) and GPT-FR (Simoulin and Crabbé, 2021), for English and French respectively.

作为对BLOOM所习得偏见的初步研究,我们在多语言CrowS-Pairs数据集上进行了评估,该数据集结合了Nangia等人(2020)开发的CrowS-Pairs数据集的修订版本和Névéol等人(2022)引入的法语版本CrowS-Pairs。这项评估的一个挑战是,如何将最初面向掩码语言模型设计的数据集适配到BLOOM这类自回归语言模型。CrowS-Pairs依靠最小对来比较一个刻板印象陈述和一个非刻板印象陈述(例如,“女人不会开车”是一种性别刻板印象,而“男人不会开车”则不是)。这两个陈述仅在刻板印象所针对的社会类别上有所不同,且该社会类别出现在刻板印象陈述中而不出现在非刻板印象陈述中。评估旨在检验模型是否系统性地偏好刻板印象陈述。原始的“度量分数”比较一对句子的伪对数似然,以确定掩码语言模型给哪个句子打了更高的分。而这里的提示被设计为要求模型根据所描述情境的“可能性”和“现实性”选择其中一个陈述。

图13显示,BLOOM在CrowS-Pairs上的整体提示准确率接近0.50,这表明整体上没有偏见。我们注意到,英语和法语的分数非常接近,表明模型在这两种语言上的整体行为类似。我们还展示了单语自回归模型的结果:分别用于英语和法语的GPT-Neo(Black等人,2021)和GPT-FR(Simoulin和Crabbé,2021)。

Table 14 presents the results per bias type in the CrowS-Pairs dataset. The results are quite homogeneous across categories, which contrasts with previous studies on masked language models, which suggested that models were prone to bias in specific categories that differed between the models tested. Nonetheless, accuracy differs significantly from .50 (one-sample T-test, p < .05) overall for both languages, as well as for a number of bias categories, as indicated by the asterisks in the table.

表14呈现了CrowS-Pairs数据集中每种偏见类型的结果。结果在各个类别上相当一致,这与以前关于掩码语言模型的研究形成对比:那些研究表明模型在特定类别上容易出现偏见,且不同被测模型的偏见类别有所不同。然而,如表中星号所示,两种语言的整体准确率以及多个偏见类别的准确率都与0.50存在显著差异(单样本T检验,p < 0.05)。
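The significance check behind the asterisks can be illustrated as follows; this is a minimal sketch assuming the per-run accuracies for a given bias category and language are already available.

```python
# Minimal sketch of the per-category significance check
# (assumption: `accuracies` holds the eight per-run accuracies for one bias category).
from scipy import stats

def differs_from_chance(accuracies, alpha=0.05):
    # One-sample T-test against the no-bias accuracy of 0.50.
    t_stat, p_value = stats.ttest_1samp(accuracies, popmean=0.50)
    return p_value < alpha  # True -> would be marked with an asterisk in Table 14
```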

Figure 13: Overall accuracy of BLOOM on crowS-Pairs per prompt for English and French. Results on the two smallest BLOOM models and monolingual GPT models of comparable size are also shown.

图13:BLOOM在英语和法语的CrowS-Pairs上每个提示的整体准确率。同时还显示了两个最小的BLOOM模型以及规模相当的单语GPT模型的结果。

Table 14: BLOOM accuracy results on CrowS-Pairs bias categories averaged over eight runs for English and French. Significance for the one-sample T-test (p < .05) is indicated with *.

表14:BLOOM在英语和法语的CrowS-Pairs偏见类别上的准确率结果,为八次运行的平均值。单样本T检验的显著性(p < 0.05)用*表示。

Limitations 限制

Blodgett et al. (2021) discuss validity issues with the original CrowS-Pairs corpus. The CrowS-Pairs version used here differs from the original by addressing some of the issues pointed out by Blodgett et al. (2021) and by constructing 200 additional sentence pairs based on stereotypes collected from French speakers. In a recent evaluation of bias in masked language models in English and French, results obtained on the revised dataset were not significantly different from those obtained on the original dataset (Névéol et al., 2022).

However, its original validation does not naturally apply here, and comparison to other CrowS-Pairs results is more difficult. For a stronger assessment of bias, results obtained with CrowS-Pairs should be compared with other measures of bias, and also assessed for all languages in the model. However, as noted by Talat et al. (2022), very little material (corpora, measures) is available for multilingual bias assessment.

Blodgett等人(2021)讨论了原始CrowS-Pairs语料库的有效性问题。这里使用的CrowS-Pairs版本与原始版本不同:它解决了Blodgett等人(2021)指出的部分问题,并基于从法语使用者处收集的刻板印象额外构建了200个句子对。在最近一项针对英语和法语掩码语言模型的偏见评估中,修订数据集上获得的结果与原始数据集上的结果没有显著差异(Névéol等人,2022)。

然而,其原始验证并不能自然地适用于此处,与其他CrowS-Pairs结果的比较也更加困难。为了更有力地评估偏见,应将CrowS-Pairs获得的结果与其他偏见度量进行比较,并对模型涵盖的所有语言进行评估。然而,正如Talat等人(2022)指出的,可用于多语言偏见评估的材料(语料库、度量)非常有限。

Although our examinations suggest a limited presence of bias in the model, they cannot cover the breadth of possible usage scenarios. One scenario where models may have a larger impact is linguistic diversity and the language variation encountered. As the training resources for BLOOM are carefully curated, they may also capture some language variations to a larger degree than other models. This also impacts the ability of the trained models to equitably represent different variations. Such differences can aid in the propagation and legitimization of some language variants over others. Our evaluation of biases in the model is further limited to the situations, languages, and language variants that are covered by multilingual CrowS-Pairs. We therefore expect a distinction between our findings using CrowS-Pairs and wider model use (for a more detailed exploration of such differences, see Raji et al., 2021).

虽然我们的检验表明模型中偏见的存在有限,但它们无法涵盖所有可能的使用场景。模型可能产生更大影响的一个场景是语言多样性以及所遇到的语言变体。由于BLOOM的训练资源经过精心筛选,它们可能比其他模型更多地捕捉到某些语言变体,这也会影响训练出的模型公平地表示不同变体的能力。这种差异可能有助于某些语言变体相对于其他变体的传播与合法化。我们对模型偏见的评估还局限于多语言CrowS-Pairs所覆盖的情境、语言和语言变体。因此,我们预计基于CrowS-Pairs的发现与更广泛的模型使用之间会存在差异(有关这类差异的更详细探讨,参见Raji等人,2021)。

5、Conclusion结论

In this work, we present BLOOM, a 176B-parameter open-access multilingual language model. BLOOM was created by BigScience, a collaboration of hundreds of researchers, and was trained on the French government-funded Jean Zay supercomputer for 3.5 months. In this paper, we chronicled the development of BLOOM, from the creation of its training dataset ROOTS to the design of its architecture and tokenizer. We also discuss evaluation results of BLOOM and other large language models, finding it has competitive performance that improves after multitask finetuning.

在这项工作中,我们介绍了BLOOM,一个176B参数的开放获取多语言语言模型。BLOOM是由数百名研究人员组成的BigScience团队创建的,并在法国政府资助的Jean Zay超级计算机上进行了为期3.5个月的训练。在本文中,我们详细记录了BLOOM的开发过程,包括其训练数据集ROOTS的创建,以及架构和分词器的设计。我们还讨论了BLOOM和其他大型语言模型的评估结果,发现其性能具有竞争力,并在多任务微调后进一步提升。

We hope that the release of a powerful multilingual language model unlocks new applica-tions and research directions for large language models. Further, we hope that documenting our experience will help the machine learning research community organize new large-scale collaborative projects similar to BigScience. Besides enabling results that are impossible for any individual research group to achieve, this form of organization will also allow more people with different backgrounds to share their ideas and participate in the development of major advances in the field.

我们希望强大的多语言语言模型的发布能够为大型语言模型开辟新的应用和研究方向。此外,我们希望记录我们的经验能够帮助机器学习研究社区组织类似于BigScience的大规模协作项目。除了能够实现任何个体研究团队无法实现的结果外,这种组织形式还将允许更多具有不同背景的人分享他们的想法,并参与该领域的重大进展的发展。

6、Contributions贡献

Authors are assigned to each authorship category according to which aspects of the project they contributed to. Many authors appear under multiple categories because they contributed to the project in more than one way. Author order in all categories is alphabetical by first name, except for “Major Contributors”, where authors are shuffled randomly apart from Teven Le Scao, who is intentionally listed first, and “Organization”, where Thomas Wolf is intentionally listed last. A description of each category follows. For finer-grained contribution details, please see the papers mentioned under each category.

根据作者对项目各方面的贡献,将作者分配到相应的作者类别中。由于许多作者以多种方式为项目做出了贡献,他们会出现在多个类别中。所有类别中的作者均按名字的字母顺序排列,但有两处例外:“Major Contributors”类别中,除特意列在首位的Teven Le Scao外,其余作者为随机排序;“Organization”类别中特意将Thomas Wolf列在最后。每个类别的描述如下。有关更详细的贡献细节,请参阅每个类别下提到的论文。

Major Contributors lists individuals without whom BLOOM would not have happened and/or who spent more than 20% of their time on the BigScience effort as a whole.

Dataset lists individuals who contributed to data sourcing, organization, and processing efforts, including the authors of Laurençon et al. (2022), McMillan-Major et al. (2022), and Jernite et al. (2022).

Tokenization lists individuals who built the BLOOM tokenizer and authors of Mielke et al. (2021).

Prompt Engineering lists individuals who wrote, edited, and reviewed prompt templates for the datasets we consider as well as authors of Sanh et al. (2022), Bach et al. (2022), and Muennighoff et al. (2022b).

Architecture and Objective lists individuals who ran experiments to help determine BLOOM’s model architecture and training objective, including authors of Wang et al. (2022a) and Le Scao et al. (2022).

Engineering lists individuals who contributed to code and infrastructure to train BLOOM on the Jean Zay supercomputer.

Evaluation and Interpretability lists individuals who helped evaluate the BLOOM model as well as authors of Talat et al. (2022).

Broader Impacts lists authors of the ethical charter, license, and model card, in addition to individuals who studied privacy issues, social impacts, and BLOOM's carbon footprint.

Applications lists members of working groups focused on applications of BLOOM, including authors of Fries et al. (2022b), Fries et al. (2022a), and De Toni et al. (2022).

Organization lists individuals who coordinated the BigScience effort and authors of Akiki et al. (2022).

主要贡献者:列出了在没有这些人的支持下,BLOOM将无法实现,或者他们在整个BigScience项目中投入了超过20%的时间。

数据集:列出了为数据获取、组织和处理工作做出贡献的个人,包括Laurençon等人(2022)、McMillan-Major等人(2022)和Jernite等人(2022)的作者。

分词:列出了构建BLOOM分词器的个人和Mielke等人(2021)的作者。

提示工程:列出了为我们考虑的数据集编写、编辑和审查提示模板的个人,以及Sanh等人(2022)、Bach等人(2022)和Muennighoff等人(2022b)的作者。

架构和目标:列出了进行实验以帮助确定BLOOM模型架构和训练目标的个人,包括Wang等人(2022a)和Le Scao等人(2022)的作者。

工程:列出了为在Jean Zay超级计算机上训练BLOOM而做出代码和基础设施贡献的个人。

评估和可解释性:列出了帮助评估BLOOM模型的个人,以及Talat等人(2022)的作者。

更广泛的影响:列出了伦理准则、许可证和模型卡片的作者,以及研究隐私问题、社会影响和BLOOM的碳足迹的个人。

应用:列出了专注于BLOOM应用的工作组成员,包括Fries等人(2022b)、Fries等人(2022a)和De Toni等人(2022)的作者。

组织:列出了协调BigScience项目的个人,以及Akiki等人(2022)的作者。

Acknowledgments致谢

The BigScience Workshop was granted access to the HPC resources of the Institut du développement et des ressources en informatique scientifique (IDRIS) du Centre national de la recherche scientifique (CNRS) under the allocation 2021-A0101012475 made by the Grand équipement national de calcul intensif (GENCI). Model training ran on the Jean-Zay supercomputer of GENCI at IDRIS, and we thank the IDRIS team for their responsive support throughout the project, in particular Rémi Lacroix.

Roman Castagné, Thomas Wang, Benoît Sagot and Rachel Bawden's contributions were funded by Benoît Sagot's and Rachel Bawden's chairs in the PRAIRIE institute funded by the French national agency ANR as part of the “Investissements d'avenir” programme under the reference ANR-19-P3IA-0001. Aurélie Névéol's contribution was supported by ANR under grant GEM ANR-19-CE38-0012. Oskar van der Wal's contributions were financed by the Dutch Research Council (NWO) as part of Open Competition Digitalisation-SSH with project number 406.DI.19.059.

BigScience研讨会获得了法国国家科学研究中心(CNRS)Institut du développement et des ressources en informatique scientifique(IDRIS)的HPC资源的使用权限,该资源由Grand équipement national de calcul intensif(GENCI)在2021-A0101012475分配的支持下提供。模型训练在GENCI的IDRIS上的Jean-Zay超级计算机上进行,我们感谢IDRIS团队在整个项目期间对我们的快速支持,特别是Rémi Lacroix。

Roman Castagné、Thomas Wang、Benoît Sagot和Rachel Bawden的贡献由Benoît Sagot和Rachel Bawden在PRAIRIE研究所的讲席经费资助,该研究所由法国国家研究署(ANR)作为“Investissements d'avenir”计划的一部分资助,参考编号ANR-19-P3IA-0001。Aurélie Névéol的贡献得到ANR项目GEM ANR-19-CE38-0012的支持。Oskar van der Wal的贡献由荷兰科学研究组织(NWO)在Open Competition Digitalisation-SSH项目(编号406.DI.19.059)下资助。

The BigScience Workshop would also like to acknowledge the support and financing of the following organizations, organization members and affiliations of some of the participants: ESPCI and LAMSADE (Dauphine Université, PSL, CNRS) for Alexandre Allauzen; MELODI team at IRIT/University of Toulouse for Farah Benamara, Chloé Braud, Philippe Muller, and Véronique Moriceau; IRISA LinkMedia team IMATAG/CNRS for Vincent Claveau and Antoine Chaffin; Université de Lorraine ATILF UMR 7118 CNRS / UL for Mathieu Constant; University of Paris for Benoît Crabbé, Marie Candito and Antoine Simoulin; GdR TAL (CNRS) for Béatrice Daille; CNRS DR1 INSERM UMR1093 UBFC Dijon for Peter Ford Dominey; Aix-Marseille University UTLN CNRS LIS/UMR7220 for Benoît Favre and Frédéric Béchet; CEA LASTI for Bertrand Delezoide, Olivier Ferret, Adrian Popescu and Julien Tourille; Sorbonne Université LORIA for Karen Fort; CNRS DR1 LORIA UMR7503 Nancy for Claire Gardent and Christophe Cerisara; MAS Laboratory of Ecole Centrale Paris for Céline Hudelot, RCLN/LIPN UMR 7030 University Sorbonne-Paris-Nord/CNRS for Joseph Le Roux and Nadi Tomeh, Université de Paris and Necker - Enfants Malades hospital for Antoine Neuraz and Ivan Lerner, Université Paris Saclay LISN CNRS UMR9105 for Aurélie Névéol, Anne-Laure Ligozat, Caio Corro, Francois Yvon; Inria, Univ. Bordeaux and Ensta ParisTech for Pierre-Yves Oudeyer, Cédric Colas, Grgur Kovac, Tristan Karch; Inria Paris for Benoît Sagot, Djamé Seddah, Pedro Ortiz; University Toulouse CNRS for Ludovic Tanguy, Sorbonne Université, LIMICS (Sorbonne Université, Inserm, Univ. Sorbonne Paris Nord) for Xavier Tannier; I3S Laboratory, CNRS, INRIA, Université Cote d'Azur for Serena Villata and Elena Cabrio; Airbus, Central Research & Technology for Guillaume Alleon, Alexandre Arnold, and Catherine Kobus; Cloud Temple for Jean-Michel Dussoux; Illuin Technology for Robert Vesoul, Gautier Viaud, Martin d'Hoffschmidt, and Wacim Belblidia; Levia.ai for Romain Riviere; LightOn for Igor Carron, Laurent Daudet, Iacopo Poli, and Julien Launay; Nabla for Alexandre Lebrun, Martin Raison, and Samuel Humeau; Naver Labs Europe for Matthias Gallé and Laurent Besacier; Orange Labs for Géraldine Damnati, Johannes Heinecke, and Frederic Herledan; OVHcloud for Jean-Louis Queguiner and Guillaume Salou; ReciTAL for Thomas Scialom, Gilles Moyse, and Jacopo Staiano; Renault Group for Vincent Feuillard, Joan André, Francois-Paul Servant, Raphael Sourty, and Ayhan Uyanik; SYSTRAN for Jean Senellart, Josep Crego, Elise Michon, Guillaume Klein, Dakun Zhang, and Natalia Segal; Ubisoft for Guillaume Gaudron. Leipzig University and the Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) in Leipzig for Christopher Akiki.

BigScience研讨会还要感谢以下组织、组织成员和部分参与者的支持和资助:ESPCI和LAMSADE(Dauphine Université,PSL,CNRS)对Alexandre Allauzen的支持;IRIT/University of Toulouse的MELODI团队对Farah Benamara、Chloé Braud、Philippe Muller和Véronique Moriceau的支持;IRISA LinkMedia团队IMATAG/CNRS对Vincent Claveau和Antoine Chaffin的支持;Lorraine大学ATILF UMR 7118 CNRS / UL对Mathieu Constant的支持;巴黎大学对Benoît Crabbé、Marie Candito和Antoine Simoulin的支持;GdR TAL(CNRS)对Béatrice Daille的支持;CNRS DR1 INSERM UMR1093 UBFC Dijon对Peter Ford Dominey的支持;Aix-Marseille大学UTLN CNRS LIS/UMR7220对Benoît Favre和Frédéric Béchet的支持;CEA LASTI对Bertrand Delezoide、Olivier Ferret、Adrian Popescu和Julien Tourille的支持;Sorbonne大学LORIA对Karen Fort的支持;CNRS DR1 LORIA UMR7503 Nancy对Claire Gardent和Christophe Cerisara的支持;Ecole Centrale Paris的MAS实验室对Céline Hudelot的支持,RCLN/LIPN UMR 7030 University Sorbonne-Paris-Nord/CNRS对Joseph Le Roux和Nadi Tomeh的支持,巴黎大学和Necker - Enfants Malades医院对Antoine Neuraz和Ivan Lerner的支持,巴黎萨克雷大学LISN CNRS UMR9105对Aurélie Névéol、Anne-Laure Ligozat、Caio Corro、Francois Yvon的支持;Inria、Bordeaux大学和Ensta ParisTech对Pierre-Yves Oudeyer、Cédric Colas、Grgur Kovac、Tristan Karch的支持;Inria Paris对Benoît Sagot、Djamé Seddah、Pedro Ortiz的支持;图卢兹大学CNRS对Ludovic Tanguy的支持,Sorbonne大学、LIMICS(Sorbonne大学、Inserm、Univ. Sorbonne Paris Nord)对Xavier Tannier的支持;I3S实验室,CNRS,INRIA,Université Cote d'Azur对Serena Villata和Elena Cabrio的支持;空中客车公司Central Research & Technology对Guillaume Alleon、Alexandre Arnold和Catherine Kobus的支持;Cloud Temple对Jean-Michel Dussoux的支持;Illuin Technology对Robert Vesoul、Gautier Viaud、Martin d'Hoffschmidt和Wacim Belblidia的支持;Levia.ai对Romain Riviere的支持;LightOn对Igor Carron、Laurent Daudet、Iacopo Poli和Julien Launay的支持;Nabla对Alexandre Lebrun、Martin Raison和Samuel Humeau的支持;Naver Labs Europe对Matthias Gallé和Laurent Besacier的支持;Orange Labs对Géraldine Damnati、Johannes Heinecke和Frederic Herledan的支持;OVHcloud对Jean-Louis Queguiner和Guillaume Salou的支持;ReciTAL对Thomas Scialom、Gilles Moyse和Jacopo Staiano的支持;雷诺集团对Vincent Feuillard、Joan André、Francois-Paul Servant、Raphael Sourty和Ayhan Uyanik的支持;SYSTRAN对Jean Senellart、Josep Crego、Elise Michon、Guillaume Klein、Dakun Zhang和Natalia Segal的支持;Ubisoft对Guillaume Gaudron的支持。Leipzig大学和Leipzig的可扩展数据分析与人工智能中心(ScaDS.AI)对Christopher Akiki的支持。

Hugging Face provided storage for the entirety of the project, as well as compute for development and part of training the smaller BLOOM models. Many of the evaluations in this paper were made possible by compute resources donated by CoreWeave and EleutherAI.

Hugging Face为整个项目提供了存储空间,以及用于开发和训练较小的BLOOM模型的计算资源。本文中的许多评估工作得益于CoreWeave和EleutherAI捐赠的计算资源的支持。
