
LLMs之FLM-101B:《FLM-101B: An Open LLM and How to Train It with $100K Budget一个开放的LLM和如何用10万美元的预算训练它》翻译与解读

导读:2023年9月7日,文章提出了一个低成本训练大规模语言模型(LLM)的方法,即增长训练策略。通过这种策略,作者仅使用10万美元的预算(而GPT-3需要数千万美元),基于0.31T个标记的数据集,从头训练出了一个具有101B参数的双语语言模型FLM-101B。

>>采用增长训练策略,首先训练一个16B模型,然后逐步增长到51B和101B模型。增长训练能够很好地实现知识传承,从而大幅降低训练成本。

>>该模型采用FreeLM框架,实现语言建模任务和教师指导任务的统一。同时采用μP理论预测损失,有效解决了大规模模型训练的稳定性问题。

>>论文除了使用常见的知识评估 Benchmark 外,还参考IQ智商测试的思路设计了符号映射、规则理解、模式挖掘和抗干扰等四类能力评估任务。这些评估强调模型的知识学习/推理能力,而不是纯记忆能力。
>> 作者训练出的模型FLM-101B在常见的NLP任务上达到了与GPT-3和GLM-130B相当的性能水平,并且在新增的IQ风格评估任务中表现尤为出色。

总体来看,FLM-101B在低成本情况下实现了101B模型规模,并在常见知识任务和新设计的IQ样式任务上表现优异,为大规模开放模型的研究开拓了新思路。采用增长训练策略大幅降低了成本,对未来模型规模的继续扩大具有重要意义。

目录

《FLM-101B: An Open LLM and How to Train It with $100K Budget》翻译与解读

Abstract摘要

LLMs两大主要挑战(高计算成本、公平客观的评估)→提出增长策略来显著降低LLMs的训练成本、提出智商评估降低记忆影响→设计出仅10万美元的预算内的FLM-101B且可媲美GPT-3

1 Introduction引言

三类模型架构面临的当前痛点:高昂的训练成本+趋势是使用更多的训练数据—找到降低训练成本的有效解决方案

提出增长策略

成本:增长策略具有节省成本的潜力

评估:主流两类评估(知识评估【知识为导向】/NLP任务评估)→无法反映模型能力(因其存在评估数据泄露问题+导致很难区分模型的能力源自记忆还是推理)→提出更加公平客观的智商评估(基于【智商测试启发】+不太会受到模型记忆/数据泄露的影响)

本文三点贡献:第一次尝试增长策略降低训练成本+通过超参数搜索/函数保持增长/基于FreeLM的改进提高稳定性+提出基于知识为导向和智商测试改进评估方法

2 Design Overview of FLM-101B—FLM-101B设计概述

2.1 Architecture架构

骨干架构:采用FreeLM(GPT-style+两个预训练目标【语言目标和教师目标】+采用词汇量大小为10w的GPT-4的分词器)

集成xPos:xPos(可外推位置嵌入,在旋转矩阵中引入指数衰减)提高长序列建模+提高长度外推能力

模型规模:

2.2 Pre-Training Setup预训练设置

预训练阶段整合多任务指令驱动数据:OIG/COIG,通过指令数据增强LLMs的理解能力

eFLM-16B=应用了知识增强的FLM-16B模型:应用FreeLM教师信号来增强FLM

原FreeLM模型采用两个训练目标(语言建模目标+二元分类目标)→FLM-101B通过掩码和专用TOKEN统一两个目标(eFLM-16B通过表情符号替换类别标签+在结束位置预测专用符号,将分类目标转化为语言建模格式),并混合采样提高训练稳定性

2.3 Growth Strategy增长策略

依次训练三个模型(都继承其前身的知识):与传统做法不一致

功能保持增长(有助于知识传承和训练稳定性):

进度和成本效益:FLM-101B模型耗时21.54天,节省了72%的时间

2.4 The Parallelism Setup and Model Configurations并行设置和模型配置

FLM-101B硬件配置:24台服务器*8*A800-80G

并行策略:DP和MP已成为B级模型标配,但过多的MP会增加GPU通信开销→集成PP的3D并行策略可实现最佳吞吐量→再结合序列并行SP【沿着序列长度维度对Transformer核心的LayerNorm和Dropout层的输入进行切片】进一步节省GPU计算资源→再结合分布式优化器的Megatron-LM【均匀分布优化器状态】进一步减少GPU内存消耗

表2

FLM-101B模型配置:80层+使用AdamW优化器+余弦学习率调度,上下文窗口2048个标记+10w的词汇量

表1:

3 Training Stability of FLM-101B—FLM-101B的训练稳定性

超100B模型面临的显著稳定性问题:损失发散、梯度爆炸以及数值溢出/下溢等→搜索成本+训练维护成本+项目预算不可控

措施1—损失预测:越宽越好+损失预测

措施2—Bfloat16的混合精度:Bfloat16对于接近零的值具有更高的精度+消除了对损失规模调整的需求

4 Benchmark Evaluation基准评估

仅仅依靠知识本身可能无法全面反映LLM的能力,评估LLM的知识性+IQ智商测试

成本估算法:单语言的LLM(GPT-3/LLAMA-2)、多语言LLM(GLM-130B/FLM-101B)

4.1 Open LLM Evaluation—Open LLM评估

Open LLM包含四个任务:ARC(常识和事实)、HellaSwag(常识推理)、MMLU(57个多选任务+特定领域的专业知识和复杂的推理)、TruthfulQA(817个检测模型错误的事实问题)

表3:FLM-101B和包括LLAMA系列和GLM-130B在内的基线的性能。为了直观比较性能和成本,我们估算了训练过程的浮点运算(zetta = 10^21)。

Results结果:如果FLM-101B获得更多训练数据,它在这些任务上应有更佳表现

4.2 Evaluation on the Professional Knowledge-Enhanced Version专业知识增强版本的评估

表4:eFLM-16B和各基线在C-Eval上的性能。在这张表格中,eFLM-16B指的是专业知识增强的FLM-16B。请注意,C-Eval排行榜只保留了一个小数位的评估结果。

Results结果:仅用专业知识评估可能不能完全反映LLM的实际能力

4.3 Evaluation of the Growth Strategy增长策略的评估

减少计算成本的核心方法是增长策略。我们想验证增长策略在知识传承上的效果,以及模型能力随规模增长的轨迹

表5:FLM在Open LLM上的三个阶段的性能。为了减少评估过程中的计算成本,我们对HellaSwag和MMLU任务分别采样了20%和30%的items

Results结果:评估了FLM在知识相关能力和性能随训练数据数量和领域变化情况方面的表现—随着模型规模的增加,FLM的表现也在提升,意味着我们的模型在每个增长阶段后都成功地从上一个阶段继承知识

5 Evaluations Inspired by IQ Tests基于智商测试启发的评估

痛点(知识可能无法充分反映LLM的智商)→提出使用现有的与IQ智商测试相关的数据集,四方面考察=符号映射+规则理解+模式挖掘+抗干扰

5.1 Symbolic Mapping Evaluation符号映射评估

文本形式的分类任务可能已泄露在原始数据中→导致模型过度拟合标签语义,而非智能推理得出→无法衡量智力

图3:符号映射的示例。主要区别在于符号映射方法将原始标签替换为随机字符串。在这个示例中,蕴含类别被随机字符串<30mFC%4Z>替换,而非蕴含类别被替换为<?V9qP@Rx>。

5.1.1 Data Collection数据收集:基于SuperGLUE和CLUE采样300实例+用随机字符串替换原来的类别标签

表6:SuperGLUE-IQ和CLUE-IQ数据集的统计信息。“WSD”代表“词义消歧”;“SS”代表“句子相似性”;“KR”代表“关键词识别”;coref.代表“共指解析”。

5.1.2 SuperGLUE-IQ:基于SuperGLUE原始数据集构建,采样验证集(两原则筛选)

Results结果:双向编码器(GLM-130B擅长英语共指解析任务)、单向(FLM-101B和GPT-3更擅长推理任务【如BoolQ】),FLM-101B已接近GPT-3,但两类各有优势

5.1.3 CLUE-IQ:基于CLUE数据集构建,评估四个任务

Results结果:对比GLM-130B,FLM-101B有良好的中文能力且成本更低

5.2 Rule Understanding Evaluation规则理解评估

拥有推理能力的标志(理解规则并根据规则执行),规则理解评估(封闭环境下执行正确动作的能力)不同于COT(开放环境下的推理能力)

所选任务和数据的详细信息:计数任务+300个随机生成项目的双语数据集,字符串替换+两个子任务

Results结果:FLM-101B性能不一定最好但性价比最高

5.3 Pattern Mining Evaluation模式挖掘评估

模式挖掘测试:存在数据泄露问题→采用替换方法缓解→建立三个任务

Figure 4: Examples of pattern mining evaluation.

Results结果:FLM-101B性价比更高

5.4 Anti-interference Evaluation抗干扰性评估

选择的任务和数据收集:三种任务类型

图5:抗干扰评估示例。

表11:FLM-101B、GPT-3和GLM-130B的抗干扰性能评估。

Results结果:FLM-101B均高于GLM-130B,且性价比高

总结:四项附加评估中,FLM-101B在某些任务中优于GLM-130B,并且成本大大降低——两大原因(训练数据+增长策略的优势)

6 Related Work相关工作

将语言模型扩展到100B:FLM-101B性价比最高

与人类对齐:有证明LLM出现了推理能力,但仍需增强遵循指令的能力,并对齐人类偏好

LLM 评估:基础模型评估的三类基准(NLP+CK+PK)、微调的模型评估其人类对齐能力,本文章通过重新组织现有数据集+基于智商测试的额外评估

模型增长:FLM-101B首次尝试增长策略来训练100B+规模LLM

7 Conclusions and Future Work结论与未来工作

FLM-101B(仅用10万美元预算)优于基线模型——降低训练成本的关键思想:利用增长策略突破模型参数的固定数量

LLM的潜力(通向AGI的重要可能技术路径之一)→未来趋势(构建有强大推理能力但不具备大量知识的基本LLM+再通过领域扩展来更好地支持应用)

Acknowledgments致谢


《FLM-101B: An Open LLM and How to Train It with $100K Budget》翻译与解读

地址

论文地址:https://arxiv.org/abs/2309.03852

时间

2023年9月7日

最新,2023年9月17日

作者

北京市人工智能研究院+中科院+电子科技大学+哈工大+NTU

Abstract摘要

LLMs两大主要挑战(高计算成本、公平客观的评估)→提出增长策略来显著降低LLMs的训练成本、提出智商评估降低记忆影响→设计出仅10万美元的预算内的FLM-101B且可媲美GPT-3

Large language models (LLMs) have achieved remarkable success in NLP and multimodal tasks, among others. Despite these successes, two main challenges remain in developing LLMs: (i) high computational cost, and (ii) fair and objective evaluations. In this paper, we report a solution to significantly reduce LLM training cost through a growth strategy. We demonstrate that a 101B-parameter LLM with 0.31T tokens can be trained with a budget of 100K US dollars. Inspired by IQ tests, we also consolidate an additional range of evaluations on top of existing evaluations that focus on knowledge-oriented abilities. These IQ evaluations include symbolic mapping, rule understanding, pattern mining, and anti-interference. Such evaluations minimize the potential impact of memorization. Experimental results show that our model, named FLM-101B, trained with a budget of $100K, achieves performance comparable to powerful and well-known models, e.g., GPT-3 and GLM-130B, especially on the additional range of IQ evaluations. The checkpoint of FLM-101B is released at https://huggingface.co/CofeAI/FLM-101B.

大型语言模型(LLMs)在自然语言处理(NLP)和多模态任务等领域取得了显著的成功。尽管取得了这些成功,但在开发LLMs时仍然存在两个主要挑战:

(i)高计算成本和

(ii)公平客观的评估。

在本文中,我们报告了一种通过增长策略显著降低LLM训练成本的解决方案。我们证明,具有1010亿参数和0.31万亿标记的LLM可以在10万美元的预算内进行训练。

受智商测试的启发,我们还在现有的以知识为导向的评估之上整合了一系列额外的评估。这些智商评估包括符号映射、规则理解、模式挖掘和抗干扰等。这些评估可以将记忆的潜在影响降到最低。

实验结果显示,我们的模型,命名为FLM-101B,在10万美元的预算内训练,性能可与强大且知名的模型(如GPT-3和GLM-130B)相媲美,尤其是在额外的智商评估方面。FLM-101B的检查点可在https://huggingface.co/CofeAI/FLM-101B上获取。

1 Introduction引言

三类模型架构面临的当前痛点:高昂的训练成本+趋势是使用更多的训练数据—找到降低训练成本的有效解决方案

Large language models (LLMs) have demonstrated great successes in a wide range of tasks, particularly in language processing [65; 64; 11; 30] and multimodal tasks [82; 33]. Throughout their development, many model architectures have been proposed and evaluated, including decoder-only structures (e.g., the GPT series [40; 41; 3] and the LLAMA series [58; 59]), encoder-only structures (e.g., BERT [10]), and encoder-decoder structures (e.g., T5 [44]), along with their variants [29; 21; 55; 45]. Regardless of the differences in model architectures, all LLMs face the same challenge of high training cost. There is also a current trend suggesting using larger amounts of training data. For example, the LLAMA-1 [58] models use 1-1.4 T tokens for training, while LLAMA-2 [59] series use 2T tokens. A primary emphasis in LLM research hence is to find effective solutions to reduce training costs.

大型语言模型(LLMs)在各种任务中取得了巨大的成功,尤其是在语言处理[65; 64; 11; 30]和多模态任务[82; 33]方面。在它们的发展过程中,许多模型架构已经被提出和评估,包括仅解码器(decoder-only)结构(例如GPT系列[40; 41; 3]和LLAMA系列[58; 59])、仅编码器(encoder-only)结构(例如BERT[10])和编码器-解码器(encoder-decoder)结构(例如T5[44])以及它们的变种[29; 21; 55; 45]。尽管模型架构不同,但所有LLMs都面临高昂的训练成本的挑战。目前的趋势是使用更多的训练数据。例如,LLAMA-1[58]模型使用了1-1.4万亿标记进行训练,而LLAMA-2[59]系列使用了2万亿标记。因此,LLM研究的主要重点之一是找到降低训练成本的有效解决方案。

提出增长策略

成本:增长策略具有节省成本的潜力

In this paper, we present our solutions to train an LLM at the 100B-parameter scale using a growth strategy inspired by our previous research [78]. “Growth” means that the number of parameters is not fixed, but expands from small to large along the training progresses. Figure 1 illustrates three typical scenarios for growth strategies. As the FLOPs of LLMs are approximately proportional to their number of parameters [19], the area under the parameter curve represents the computational cost of training. Figure 1(a) serves as a reference for the cost with a constant number of parameters (y-axis) w.r.t. the number of tokens (x-axis). Figure 1(b) illustrates a straightforward linear growth strategy, leading to a cost-saving of exactly 50%; Figure 1(c) showcases a modest growth strategy that reduces the cost by less than 50%; in contrast, Figure 1(d) represents an aggressive growth strategy, which reduces the cost by more than 50%. This analysis informs our decision to employ the aggressive growth strategy for maximal computational savings. In our model training, we achieve aggressive growth with an enhanced growth strategy originated in our previous work MSG [78], a strategy that achieves strict function-preserving when growing.

With a fixed $100K budget, we focus on 100B+ parameters. Although the Chinchilla laws [19] suggest that training a smaller model with more data may potentially result in higher scores on some benchmarks due to more sufficient training, we believe that verifying the feasibility of a growth strategy [15; 51; 6; 78] would be a new direction and beneficial to the community of LLM as well. This is because (i) larger models have higher upper bounds for capabilities that may not be reached by scaling only the training data [69], and (ii) data can be linearly scaled up with the budget, while a growth strategy has the potential for saving cost regardless of the amount of available data, if it turns out to be feasible. Existing studies such as [19] have not extensively investigated this area because they only consider the scenarios where model sizes are fixed through training.

在本文中,我们提出了在1000亿参数规模下训练LLM的解决方案,使用了受我们之前研究[78]启发的增长策略。“增长”意味着参数数量不是固定的,而是随着训练的进行从小到大扩展。图1展示了三种典型的增长策略情景。由于LLMs的FLOPs大约与其参数数量成正比[19],因此参数曲线下的面积代表了训练的计算成本。

图1(a)给出了参数数量(y轴)相对于tokens数量(x轴)保持恒定时的成本,作为参考。

图1(b)说明了一种简单的线性增长策略,导致成本节省了正好50%;

图1(c)展示了一种适度的增长策略,使成本降低不到50%;相反,

图1(d)代表了一种激进的增长策略,可以使成本降低超过50%。

这一分析为我们采用激进的增长策略以最大限度地节省计算量提供了依据。在我们的模型训练中,我们通过源自我们之前工作MSG [78]的增强增长策略实现了激进的增长,该策略在增长过程中实现了严格的功能保持。

在一个固定的10万美元预算下,我们关注1000亿+参数。尽管Chinchilla法则[19]表明,使用更多的数据训练较小的模型可能会导致某些基准测试上得分更高,因为训练更充分,但我们认为验证增长策略的可行性[15; 51; 6; 78]将是一个新的方向,也将有益于LLM社区。这是因为

(i)较大的模型具有更高的能力上限,仅通过扩展训练数据可能无法达到[69],

(ii)数据只能随预算线性扩展,而增长策略一旦被证明可行,无论可用数据量多少都具有节省成本的潜力。现有的研究(如[19])未深入探讨这一领域,因为它们只考虑了模型大小在整个训练过程中保持固定的情况。
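The cost argument above (training FLOPs roughly proportional to parameter count, so total cost is the area under the parameter-vs-tokens curve) can be checked numerically. The snippet below is a minimal sketch, not the authors' code, and reproduces the "exactly 50%" saving of the linear-growth scenario in Figure 1(b).

```python
# Illustrative sketch (not from the paper): compare the relative compute of a
# constant-size schedule against a simple linear growth schedule, assuming
# per-token training FLOPs scale linearly with parameter count.
import numpy as np

tokens = np.linspace(0.0, 1.0, 1001)            # normalized token budget
constant_params = np.full_like(tokens, 100e9)   # fixed 100B parameters (Figure 1a)
linear_growth = 100e9 * tokens                  # grow linearly from 0 to 100B (Figure 1b)

# The average parameter count is proportional to the area under each curve.
ratio = linear_growth.mean() / constant_params.mean()
print(f"linear growth costs about {ratio:.0%} of constant-size training")  # ~50%
```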

评估:主流两类评估(知识评估【知识为导向】/NLP任务评估)→无法反映模型能力(因其存在评估数据泄露问题+导致很难区分模型的能力源自记忆还是推理)→提出更加公平客观的智商评估(基于【智商测试启发】+不太会受到模型记忆/数据泄露的影响)

Another critical challenge in LLM research is evaluation. Existing mainstream evaluations can be broadly grouped into two categories: knowledge evaluation (i.e., MMLU [17] and C-Eval [20]), and NLP tasks evaluation. Such evaluations may not fully reflect the model capability due to potential data leakage if some of the evaluation datasets were also used in model training. In addition, it is also difficult to distinguish whether the models remember a piece of knowledge or possess the capacity for reasoning and/or inference. Borrowing some ideas from Intelligence Quotient (IQ) tests (i.e., Perceptual Reasoning and Working Memory [67]), we consolidate another range of evaluations on LLMs, including symbolic mapping, rule understanding, pattern mining, and anti-interference evaluations. Symbolic mapping [71] evaluation tests the capability of LLMs in learning to use (less meaningful) symbols instead of (more meaningful) category labels for some forms of classification tasks. Rule understanding evaluation is to test the capability of understanding some given rules, and then to perform corresponding actions. Pattern mining (involving both induction and deduction), is often used in various levels of competition. It tests the pattern-finding capability (e.g., repetition of certain parts of a given input). Last but not least, anti-interference is an ability to recognize core information from noisy input [5; 84]. We believe the evaluations inspired by IQ tests are less likely to be affected by data leakage or memorization, hence providing another dimension for fair, objective, and reliable evaluations of LLMs.

LLM研究中的另一个关键挑战是评估。现有的主流评估可以大致分为两类:知识评估(例如MMLU [17]和C-Eval [20])以及NLP任务评估。由于某些评估数据集可能也用于模型训练,这些评估可能无法完全反映模型的能力,可能存在数据泄漏的问题。此外,也很难区分模型究竟是记住了某些知识,还是具有推理和/或推断的能力。

借鉴一些智商测试(例如感知推理和工作记忆[67])的思想,我们在LLMs上进行了一系列其他评估,包括符号映射、规则理解、模式挖掘和抗干扰等。符号映射[71]评估测试了LLM在某些形式的分类任务中学习使用(不太有意义的)符号而不是(更有意义的)类别标签的能力。规则理解评估是测试对给定规则的理解能力,并据此执行相应的操作。模式挖掘(包括归纳和演绎)经常出现在各种级别的竞赛中。它测试模式发现能力(例如,重复给定输入的某些部分)。最后但并非最不重要的是,抗干扰是一种从噪声输入中识别核心信息的能力[5; 84]。

我们认为受到智商测试启发的评估不太可能受到数据泄漏或记忆的影响,因此为LLMs的公平、客观和可靠评估提供了另一维度。

本文三点贡献:第一次尝试增长策略降低训练成本+通过超参数搜索/函数保持增长/基于FreeLM的改进提高稳定性+提出基于知识为导向和智商测试改进评估方法

To summarize, the paper has made the following contributions. First, to the best of our knowledge, this is the first attempt to use a growth strategy to train an LLM with 100B+ parameters from scratch. Simultaneously, it is probably the lowest-cost model with 100B+ parameters, costing only 100,000 US dollars. Second, we address several instability issues via promising approaches for hyperparameter search, function-preserving growth, and improvements based on our FreeLM [25]. Our methodology holds potential benefits for the broader research community. Third, we conduct extensive evaluations, including both the commonly used knowledge-oriented benchmarks and the new range of evaluations inspired by IQ tests. Experimental results show that, despite its low training cost, FLM-101B is competitive and robust. Lastly, we release the model checkpoints, code, related tools, et al. to promote research on bilingual Chinese and English LLMs at the scale of 100B+.

总之,本文具有以下几点贡献。

首先,据我们所知,这是第一次尝试使用增长策略从零开始训练具有1000亿+参数的LLM。同时,它可能是1000亿+参数规模中成本最低的模型,仅需10万美元。

其次,我们通过超参数搜索函数保持增长基于FreeLM的改进等有前途的方法解决了几个不稳定性问题[25]。我们的方法对更广泛的研究界有潜在的好处。

第三,我们进行了广泛的评估,包括常用的以知识为导向的基准测试以及受到智商测试启发的新范围的评估

实验结果表明,尽管训练成本较低,FLM-101B具有竞争力和稳健性。最后,我们发布了模型检查点、代码、相关工具等,以促进1000亿+规模的中英双语LLMs的研究

2 Design Overview of FLM-101B—FLM-101B设计概述

In this section, we provide an outline of FLM-101B, detailing its architecture, pre-training methods, and configuration specifics.

在本节中,我们提供FLM-101B的概要,详细介绍其架构、预训练方法和配置细节。

2.1 Architecture架构

The architecture of an LLM significantly impacts its capabilities. Current researches [80; 3] under-score the high costs associated with experimenting on diverse architectures. Hence, it is more suitable to select an architecture with great potential for cost effectiveness and model capability.

LLM的体系结构对其能力有很大的影响。当前的研究[80; 3]强调了在不同架构上进行实验所涉及的高成本。因此,更合适的做法是选择在成本效益和模型能力方面具有巨大潜力的体系结构。

骨干架构:采用FreeLM(GPT-style+两个预训练目标【语言目标和教师目标】+采用词汇量大小为10w的GPT-4的分词器)

Backbone. Among the many existing model architectures, we adopt FreeLM [25] as the backbone for our models, with modifications. FreeLM is based on GPT [41], a transformer-like architecture with a decoder-only configuration known for its exceptional performance. Different from GPT, FreeLM features two pre-training objectives: the language objective and the teacher objective (Section 2.2). We preserve the GPT-style transformer block designs, including the Pre-LayerNorm and the additional LayerNorm after the last transformer layer. We employ the tokenizer derived from GPT-4, characterized by a vocabulary size of 100, 256.

骨干结构。在众多现有的模型架构中,我们采用了FreeLM [25]作为我们模型的骨干结构,并进行了修改。FreeLM基于GPT [41],后者是一种采用仅解码器配置的类transformer架构,以其卓越性能而闻名。与GPT不同,FreeLM具有两个预训练目标:语言目标和教师目标(第2.2节)。我们保留了GPT-style的transformer块设计,包括Pre-LayerNorm以及最后一个transformer层之后附加的LayerNorm。我们采用了源自GPT-4的分词器,其词汇量大小为100,256。

集成xPos:xPos(可外推位置嵌入,在旋转矩阵中引入指数衰减)提高长序列建模+提高长度外推能力

Integration of xPos. To enhance long sequence modeling, we integrate the Extrapolatable Position Embedding (xPos) [56] in FLM-101B. This innovation draws inspiration from the principles of RoPE [54], which aims to improve the length extrapolation ability. By introducing an exponential decay into the rotation matrix, xPos strives to rectify this hurdle. To the best of our knowledge, FLM-101B is the largest model to date that incorporates the xPos technology.

xPos的整合。为了增强对长序列的建模,我们在FLM-101B中集成了可外推位置嵌入(xPos)[56]。这一创新受到RoPE [54]原理的启发,旨在提高长度外推能力。通过在旋转矩阵中引入指数衰减,xPos力图克服长度外推这一难题。据我们所知,FLM-101B是迄今为止集成xPos技术的最大模型。
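For intuition, here is a minimal, hedged sketch of the xPos idea: a RoPE-style rotation whose amplitude additionally decays with position. The decay parameterization below (gamma, the scale constant 512) is a placeholder illustration only; the exact xPos formulation and constants are those of Sun et al. [56], not what is shown here.

```python
# Minimal illustration of combining RoPE-style rotation angles with an exponential
# position-dependent decay (the core idea behind xPos). Constants are placeholders,
# not the values used in FLM-101B.
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)              # (seq_len, dim/2) rotation angles

def xpos_like_scale(positions, dim, gamma=0.4, scale_base=512.0):
    # Per-dimension decay base in (0, 1], raised to a scaled position index.
    zeta = (np.arange(0, dim, 2) / dim + gamma) / (1.0 + gamma)
    return zeta[None, :] ** (positions[:, None] / scale_base)

positions = np.arange(8)
angles = rope_angles(positions, dim=64)
scales = xpos_like_scale(positions, dim=64)
# In attention, queries are scaled by `scales` and keys by `1 / scales`, so after the
# dot product the decay depends only on the relative distance between positions.
print(angles.shape, scales.shape)                     # (8, 32) (8, 32)
```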

模型规模

Model Sizes. Benefiting from the proposed growth strategy, the FLM series produces three models with 16B, 51B, and 101B (i.e., FLM-101B) parameters in a single training. The training process is carried out in a sequential manner, starting from a smaller model (i.e., 16B) and progressively growing to larger ones (i.e., 51B and 101B).

模型规模。由于提出的增长策略,FLM系列在单一训练中产生了三个模型,分别具有16B、51B和101B(即FLM-101B)个参数。训练过程是按顺序进行的,从较小的模型(即16B)开始,逐渐增长到较大的模型(即51B和101B)。

2.2 Pre-Training Setup预训练设置

预训练阶段整合多任务指令驱动数据:OIG/COIG,通过指令数据增强LLMs的理解能力

FLM-101B. By design, FLM-101B is an English-Chinese bilingual model pre-trained with causal language modeling. It mixes English and Chinese corpora at a ratio of approximately 53.5% : 46.5% for language modeling. Inspired by the finding that instruction data can augment LLMs’ comprehension capabilities [37], we integrate multi-task instructionally prompted data: OIG (Open Instruction Generalist) 1 and COIG (Chinese Open Instruction Generalist) 2, in the pre-training stage.

FLM-101B。按设计,FLM-101B是一种经过因果语言建模预训练的中英双语模型。它在语言建模中混合了英语和汉语语料库,占比约为53.5%:46.5%。受到一项发现的启发,该发现表明指令数据可以增强LLMs的理解能力[37],我们在预训练阶段整合了多任务指令驱动数据OIG(Open Instruction Generalist)1和COIG(Chinese Open Instruction Generalist)2。

eFLM-16B=应用了知识增强的FLM-16B模型:应用FreeLM教师信号来增强FLM

eFLM-16B. To evaluate the effect of using domain-specific knowledge data (Section 4.2), we apply the FreeLM teacher signals [25] to enhance FLM. Due to computational cost, we incorporate the teacher signals only in the smallest 16B model. This knowledge-enhanced FLM-16B is named eFLM-16B.

eFLM-16B。为了评估使用特定领域知识数据的效果(第4.2节),我们应用了FreeLM教师信号[25]来增强FLM。由于计算成本的限制,我们仅在最小的16B模型中合并了教师信号。这个增强了知识的FLM-16B被命名为eFLM-16B。

原FreeLM模型采用两个训练目标(语言建模目标+二元分类目标)→FLM-101B通过掩码和专用TOKEN统一两个目标(eFLM-16B通过表情符号替换类别标签+在结束位置预测专用符号,将分类目标转化为语言建模格式),并混合采样提高训练稳定性

The original FreeLM incorporates two training objectives: language modeling objective guided by language signals and binary classification objective guided by teacher signals. In FLM-101B, we unify the two objectives by using a masking strategy and two specialized tokens. These tokens facilitate the transformation of the binary classification objective into the unified language modeling format. The unified training objective leads to training stability when the model becomes much larger in scale. Hence, for eFLM-16B, we transform this binary classification into the format of causal language modeling. Specifically, we employ two emojis: 😡 (U+1F621) and 😈 (U+1F608), from the vocabulary to replace the original binary labels of 1 and 0. We apply zero-masking to the loss for tokens in the propositions and predict one of these two special tokens at the end of each proposition. By this method, we unify the teacher objective and language modeling. Moreover, we discard the original Iterative Training approach [25] and completely mix the samples from both signals in every batch. This strategy can enhance the consistency of data sampling distribution as well as improve training stability.

原始的FreeLM包括两个训练目标:以语言信号为导向的语言建模目标和以教师信号为导向的二元分类目标。在FLM-101B中,我们通过使用屏蔽策略和两个专用token统一这两个目标。这些token有助于将二进制分类目标转化为统一的语言建模格式。当模型规模变大时,统一的训练目标可以保证训练的稳定性。因此,对于eFLM-16B,我们将这个分类转化为因果语言建模格式

具体来说,我们从词汇表中使用了两个表情符号:😡(U+1F621)和😈(U+1F608)来代替原来的二元标签1和0。我们将命题部分tokens的损失置零(零屏蔽),并在每个命题的末尾预测这两个特殊tokens中的一个。通过这种方法,我们统一了教师目标和语言建模。此外,我们放弃了原始的迭代训练方法[25],在每个批次中完全混合来自两种信号的样本。这种策略可以增强数据采样分布的一致性,提高训练稳定性。
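The conversion described above can be pictured with a small sketch. This is not the authors' code: the emoji-to-label assignment, the tokenizer call, and the single-token assumption for the emojis are all assumptions used only to illustrate how a proposition plus a binary teacher label becomes an ordinary causal-LM sample whose loss is masked everywhere except the final special token.

```python
# Hedged sketch: fold a binary teacher-signal example into the causal-LM format.
# Which emoji stands for which label is an assumption (the paper lists U+1F621 and
# U+1F608 as the replacements for labels 1 and 0).
LABEL_TOKEN = {1: "\U0001F621", 0: "\U0001F608"}

def to_lm_sample(proposition: str, label: int, tokenizer):
    prop_ids = tokenizer.encode(proposition)               # assumed tokenizer API
    special_id = tokenizer.encode(LABEL_TOKEN[label])[0]   # emoji assumed to be one token
    input_ids = prop_ids + [special_id]
    loss_mask = [0] * len(prop_ids) + [1]                  # zero-mask the proposition tokens
    return input_ids, loss_mask
```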

2.3 Growth Strategy增长策略

依次训练三个模型(都继承其前身的知识):与传统做法不一致

The essence of the low cost in scaling FLM-101B up is the growth strategy in model training. Specifically, we train three models, with 16B, 51B, and 101B parameters respectively, in a sequential manner. Each model inherits knowledge from its predecessor. This is contrary to the common practice that the models of different sizes are trained independently [58; 59].

FLM-101B能够以低成本扩展,其本质在于模型训练中的增长策略。具体而言,我们依次训练三个模型,分别具有16B、51B和101B的参数。每个模型都继承了其前身的知识。这与通常的做法相反,即不同大小的模型独立训练[58; 59]。

功能保持增长(有助于知识传承和训练稳定性)

Function-preserving Growth. Function preservation means that before and after growth, the models yield consistent outputs given the same arbitrary inputs. This property has proven beneficial for both knowledge inheritance [8; 6; 51] and training stability [78]. The growth operators used in FLM-101B training originate from [78], with improvement. Specifically, to adapt these operators to the multi-node 3D parallel framework, we implement them by extending the model structures offline and reloading the checkpoint when the next stage starts.

功能保持增长。功能保留意味着在增长之前和之后,模型在给定相同任意输入时产生一致的输出。这一特性已被证明有助于知识传承[8; 6; 51]和训练稳定性[78]。FLM-101B训练中使用的生长算子源自[78],并进行了改进。具体而言,为了使这些运算符适应多节点的3D并行框架,我们通过离线扩展模型结构并在下一阶段开始时重新加载检查点来实现它们。

进度和成本效益FLM-101B模型耗时21.54天,节省了72%的时间

Schedules and Cost-Effectiveness. Model growth scheduling is a trade-off between the pros and cons inherent to models of different sizes [78]: a smaller model is faster in computing each training step, enabling more rapid consumption of training data for broader commonsense knowledge; conversely, a larger model is better in the reduction of loss per step, indicating a deeper understanding of the nuanced linguistic patterns. We train the 16B model with 245.37B tokens, the 51B model with 39.64B tokens, and the 101B model with 26.54B tokens. The billion tokens per day of different sizes are listed in Table 1. Under this growth schedule, the total time cost for our 101B model is 21.54 days, which is 72% time-saving (or a 3.56x speedup) compared to training a 101B model from scratch (76.74 days). This is consistent with our motivations depicted in Figure 1.

进度和成本效益。模型增长调度是不同大小的模型固有优缺点之间的权衡[78]:较小的模型在计算每个训练步骤时更快,能够更快地消耗训练数据以获得更广泛的常识知识;相反,更大的模型在减少每一步的损失方面做得更好,这表明对细微的语言模式有更深入的理解。我们用245.37B标记训练16B模型,用39.64B标记训练51B模型,用26.54B标记训练101B模型。表1列出了不同规模模型每天所能处理的标记数(以十亿计)。

根据这个增长计划,我们的101B模型的总时间成本为21.54天,与从零开始训练101B模型相比节省了72%的时间(或3.56倍的速度提升)。这与我们在图1中描述的动机一致。
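A quick back-of-the-envelope check of the reported saving, under the rough assumption that training cost scales as parameters × tokens; the small gap to the reported 72% saving (3.56x speedup) comes from per-stage throughput differences listed in Table 1.

```python
# Approximate relative cost of the growth schedule vs. training 101B from scratch,
# treating cost as (parameter count) x (tokens seen). Token counts are the per-stage
# figures quoted above.
stages = [(16e9, 245.37e9), (51e9, 39.64e9), (101e9, 26.54e9)]
grown_cost = sum(params * toks for params, toks in stages)
scratch_cost = 101e9 * sum(toks for _, toks in stages)
print(f"relative cost ≈ {grown_cost / scratch_cost:.2f}")  # ≈ 0.27, i.e. roughly 3.6x cheaper
```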

2.4 The Parallelism Setup and Model Configurations并行设置和模型配置

FLM-101B硬件配置:24台服务器*8*A800-80G

FLM-101B is trained on a cluster of 24 DGX-A800 GPU (8×80G) servers. Following the growth strategy, we sequentially complete the model training for sizes 16B, 51B, and 101B on this cluster.

FLM-101B是在一组24台DGX-A800 GPU(8×80G)服务器上进行训练的。根据增长策略,我们依次在该集群上完成了16B、51B和101B大小的模型训练。

并行策略DP和MP已成为B级模型标配,但过多的MP会增加GPU通信开销→集成PP的3D并行策略可实现最佳吞吐量→再结合序列并行SP【沿着序列长度维度对Transformer核心的LayerNorm和Dropout层的输入进行切片】进一步节省GPU计算资源→再结合分布式优化器的Megetron-LM均匀分布优化器状态进一步减少GPU内存消耗

The Parallel Strategies. Data parallelism [60] and tensor model parallelism [52] have become the standard approaches for training models at the billion scale. Nevertheless, an excessive amount of tensor parallelism may escalate GPU communication overheads, hampering training efficiency. To tackle this problem, we integrate pipeline model parallelism [35] and employ a 3D parallel strategy for optimal throughput. Moreover, by employing sequence parallelism [24], we slice the inputs to the Transformer core’s LayerNorm and Dropout layers along the sequence length dimension, leading to additional savings in GPU computational resources and memory utilization. We also utilize the Megetron-LM 4 implementation of the distributed optimizer [46] to further reduce the GPU memory consumption, which is a technique that evenly distributes the optimizer states across data parallel ranks.

并行策略数据并行[60]和张量模型并行[52]已成为训练规模达到十亿级别的标准方法。然而,过多的张量并行可能会增加GPU通信开销,降低训练效率。为了解决这个问题,我们集成了流水线模型并行[35],并采用了3D并行策略以实现最佳吞吐量。此外,通过使用序列并行[24],我们沿着序列长度维度对Transformer核心的LayerNorm和Dropout层的输入进行切片,从而进一步节省GPU计算资源和内存利用率。我们还利用了分布式优化器的Megetron-LM [46]实现,以进一步减少GPU内存消耗,这是一种在数据并行队列均匀分布优化器状态的技术。

表2

Table 2 shows the parallelism configurations and training throughput in each stage of FLM-101B training under our growth strategy. In different stages, we configure different Tensor Parallel × Pipeline Parallel sizes to achieve higher throughput. The single-GPU throughput for all three training stages consistently exceeds 160 teraFLOPs/sec with a utilization rate of at least 51.3%. For comparison, GLM-130B achieves 135 teraFLOPs/sec [80] with a 42.27% utilization rate. We can also find that FLM-101B has a higher FLOP utilization rate than Megatron-LM [24] under a similar model size.

表2显示了在我们的增长策略下FLM-101B训练的每个阶段的并行配置和训练吞吐量。在不同阶段,我们配置不同的张量并行×流水线并行大小以实现更高的吞吐量。所有三个训练阶段的单GPU吞吐量始终超过160 teraFLOPs/sec,利用率至少为51.3%。作为比较,GLM-130B实现了135 teraFLOPs/sec [80]的吞吐量,利用率为42.27%。我们还可以发现,在类似的模型大小下,FLM-101B的FLOP利用率高于Megatron-LM [24]。
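The quoted utilization figure can be sanity-checked against the accelerator's peak: an A800's dense BF16 peak is about 312 teraFLOP/s (same as the A100), so 160 teraFLOP/s per GPU is roughly 51%.

```python
# Sanity check of the reported >=51.3% utilization figure.
peak_bf16_tflops = 312        # approximate A800/A100 dense BF16 peak throughput
achieved_tflops = 160
print(f"model FLOPs utilization ≈ {achieved_tflops / peak_bf16_tflops:.1%}")  # ≈ 51.3%
```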

FLM-101B模型配置80层+使用AdamW优化器+余弦学习率调度上下文窗口2048个标记+10w的词汇量

FLM-101B Configurations. The FLM-101B model is structured with a hidden state dimension of 10, 240, a layer number of 80, a context window of 2,048 tokens, 80 attention heads, and a vocabulary size of 100, 256. FLM-101B uses the AdamW optimizer [31] with β1 = 0.9 and β2 = 0.95. A cosine learning rate schedule is employed, leading to a final learning rate of 6e − 6. We use a weight decay of 0.1 and gradient clipping of 1.0.

FLM-101B配置FLM-101B模型的结构包括10,240的隐藏状态维度、80个层、2,048个标记的上下文窗口、80个注意力头和100,256的词汇量大小。FLM-101B使用AdamW优化器 [31],其中β1 = 0.9,β2 = 0.95。采用余弦学习率调度,最终学习率为6e - 6。我们使用了0.1的权重衰减和1.0的梯度裁剪。
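The stated configuration is consistent with the 101B parameter count; a rough estimate (ignoring biases, LayerNorm weights, and any teacher-objective extras) is sketched below.

```python
# Rough decoder-only parameter estimate: ~12 * n_layers * d_model^2 for the
# transformer blocks plus the token embedding matrix.
d_model, n_layers, vocab_size = 10240, 80, 100256
block_params = 12 * n_layers * d_model ** 2        # attention + MLP weight matrices
embedding_params = vocab_size * d_model
total = block_params + embedding_params
print(f"≈ {total / 1e9:.1f}B parameters")          # ≈ 101.7B
```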

表1

Table 1 presents part of the hyperparameters used in different growth stages. In each growth stage, we approximately inherit the previous learning rate and adhere to the same schedule. The learning rate at the beginning of each stage is reported in the table. In the 16B stage, 4,608k samples are used for learning rate warmup, while in later growth stages, we use fewer samples of 230.4k. Note that we do not apply batch size warmup because we address the stability issue in a different manner, detailed in Section 3.

The training duration and token consumption for each stage are also outlined in Table 1. In total, FLM-101B training is accomplished within 22 days using 311.54B tokens.

表1列出了不同增长阶段使用的一些超参数。在每个增长阶段,我们大致继承了先前的学习率并遵循相同的时间表。表中报告了每个阶段开始时的学习率。在16B阶段,用于学习率预热的样本数为4,608k,而在后来的增长阶段,我们使用了更少的样本,为230.4k。请注意,我们没有应用批量大小预热,因为我们以不同的方式解决了稳定性问题,详见第3节。

表1还概述了每个阶段的训练持续时间和标记消耗。总体而言,FLM-101B训练在22天内完成,共使用了311.54B标记

3 Training Stability of FLM-101B—FLM-101B的训练稳定性

超100B模型面临的显著稳定性问题:损失发散、梯度爆炸以及数值溢出/下溢等→搜索成本+训练维护成本+项目预算不可控

Models beyond 100B parameters [49; 80] usually suffer from a bunch of notorious stability issues including loss divergence, gradient explosion, and numerical overflow/underflow. This not only inflates the cost of searching for feasible hyperparameters like optimal learning rates, but also intensifies ongoing maintenance during training, such as babysitting, issue resolution, data adjustment, and rebooting. Moreover, this makes the budget of the whole project unpredictable. We have undertaken the following efforts to mitigate these issues.

超过100B参数的模型通常会面临一系列臭名昭著的稳定性问题,包括损失发散梯度爆炸以及数值溢出/下溢等。这不仅增加了搜索可行超参数(如最佳学习率)的成本,还加重了训练期间的持续维护工作,如监控、问题解决、数据调整和重启。此外,这也使整个项目的预算变得不可预测。为了缓解这些问题,我们采取了以下措施。

措施1—损失预测越宽越好+损失预测

Loss Prediction. The Tensor Programs theories [75; 28] unveil the universal relations across the training dynamics of a series of models with the model width tending to infinite. For certain classes of hyperparameters, this results in a parameterized mapping for their optimal value between a small model and its larger counterparts, which is termed µP [76]. Two important insights are:

>> The wider, the better: theoretically, under µP transfer, a wider model will always yield lower loss than its narrower counterparts when exposed to identical data [76]. As a direct corollary, if a narrow model converges, its wider counterparts will always converge.

>> Loss prediction: the loss value of a large model is predictable using the loss of its smaller counterparts, as claimed in GPT-4 technical report [36]. For the first time in the open-source world, µScaling [77] provides evidence that loss prediction can be achieved by combining µP [76] and (a modified) scaling law [23; 18; 19].

损失预测。张量程序(Tensor Programs)理论[75; 28]揭示了当模型宽度趋于无穷时,一系列模型的训练动态之间的普遍关系。对于某些类别的超参数,这给出了其最优值在小模型与更大模型之间的参数化映射,称为µP [76]。两个重要的见解是:

>> 越宽越好:理论上,在µP传递下,当暴露于相同的数据时,较宽的模型总是比较窄的模型产生更低的损失[76]。作为一个直接的推论,如果一个狭窄的模型收敛,它的较宽的对应模型总是收敛的。

>> 损失预测:GPT-4技术报告[36]中提出,利用较小模型的损失来预测大模型的损失值。在开源世界中,µScaling[77]首次提供了证据,证明可以通过结合µP[76]和(修改的)缩放定律[23; 18; 19]来实现损失预测。

Based on these findings, our method to solve training stability is as follows: we first determine the data distribution before the FLM-16B training starts. Next, we perform a grid search on three hyperparameters including the learning rate, initialization standard deviation, and the softmax tem-perature in the output layer. This grid search is performed by running a proxy model (less than 100M) with a hidden state dimension (“model width”) of 256 and a head number of 2. All the other structural hyperparameters and training data of the proxy model are identical to those of FLM-16B. A single run of grid search takes 24.6 hours with data parallelism on 6 nodes, which is equivalent to 6 hours per run given our 24-node infrastructure. Finally, We find a group of well-performing hyperparameters: learning rate = 4e − 4, standard deviation = 1.6e − 2, and softmax temperature = 2.0, through this grid search. Transferring these hyperparameters to the 16B model via µP [76] led to a seamless training experience devoid of instabilities. Combining with MSG [78], we also witness no post-growth divergence in FLM-51B and FLM-101B.

基于这些发现,我们解决训练稳定性的方法如下:首先,在FLM-16B训练开始之前,我们确定了数据分布。接下来,我们对三个超参数进行网格搜索,包括学习率、初始化标准差和输出层的softmax温度。这个网格搜索是通过运行一个代理模型(小于100M)进行的,该模型的隐藏状态维度(“模型宽度”)为256,头数为2。代理模型的所有其他结构超参数和训练数据与FLM-16B相同。在数据并行性为6个节点的情况下,单次运行网格搜索需要24.6小时,这相当于在我们的24节点基础设施中每次运行6小时。最后,通过这个网格搜索,我们找到了一组表现良好的超参数:学习率=4e - 4,标准差=1.6e - 2,softmax温度=2.0。将这些超参数通过µP [76]转移到16B模型后,训练体验变得平稳,没有不稳定性。结合MSG [78],我们还观察到FLM-51B和FLM-101B在增长后没有出现发散。
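The proxy-model grid search can be pictured as below. This is a hedged sketch, not the released code: `train_proxy_and_get_loss` is a hypothetical stand-in for actually training the <100M proxy (width 256, 2 heads) on the FLM-16B data distribution, and the grids are illustrative rather than the authors' actual search space.

```python
# Hedged sketch of the hyperparameter grid search on a small proxy model; the best
# combination is then carried over to FLM-16B via muP.
import itertools

def train_proxy_and_get_loss(width, n_heads, lr, init_std, softmax_temp):
    # Hypothetical placeholder: in practice this would train the proxy model and
    # return its loss. A dummy score is returned so the sketch runs end to end.
    return abs(lr - 4e-4) + abs(init_std - 1.6e-2) + abs(softmax_temp - 2.0)

learning_rates = [1e-4, 2e-4, 4e-4, 8e-4]   # illustrative grids, not the paper's
init_stds = [8e-3, 1.6e-2, 3.2e-2]
temperatures = [1.0, 2.0, 4.0]

best = min(itertools.product(learning_rates, init_stds, temperatures),
           key=lambda hp: train_proxy_and_get_loss(256, 2, *hp))
print("best (lr, init_std, softmax_temperature):", best)  # paper reports 4e-4, 1.6e-2, 2.0
```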

The full training loss curve is presented in Figure 2. The first stage (16B) stably goes through 246B tokens. Immediately afterwards, FLM grows from 16B to 51B. As expected, the training is stable. More importantly, we observe that the loss curve becomes steeper. It matches the intuition that a larger model is better in loss reduction per step. Subsequently, FLM grows to 101B. Although the training data for the 51B stage are only 40B tokens, the 101B training remains stable, and the loss curve becomes slightly steeper again. This loss curve proves the effectiveness of the growth strategy.

Our implementations of µP are largely consistent with those in µScaling [77], with modifications to handle the rotary embedding. Thus, the intermediate loss ranges for FLM-16B are also predictable with the results from multiple proxy widths at the same steps.

完整的训练损失曲线如图2所示。第一阶段(16B)稳定地完成了246B标记。紧接着,FLM从16B增长到51B。如预期的那样,训练是稳定的。更重要的是,我们观察到损失曲线变得更加陡峭。它符合直觉,即更大的模型在每一步的损失减少方面更好。随后,FLM增长到101B。尽管51B阶段的训练数据仅有40B标记,但101B的训练仍然稳定,损失曲线再次变得略微陡峭。这个损失曲线证明了增长策略的有效性

我们的µP实现在很大程度上与µScaling[77]中的实现一致,并对处理旋转嵌入进行了修改。因此,FLM-16B的中间损失范围也可以用相同步骤的多个代理宽度的结果来预测。

措施2—Bfloat16的混合精度:Bfloat16对于接近零的值具有更高的精度+消除了对损失规模调整的需求

Mixed Precision with Bfloat16. We apply mixed-precision training to save run-time memory and reduce time costs. Specifically, we choose Bfloat16 instead of FP16 due to its superior precision for values approaching zero, making it more suitable for µP. As a result, we do not encounter the FP16 underflow issue reported by [76]. To our knowledge, the FLM models are currently the largest ones successfully trained with mixed precision + µP. Moreover, Bfloat16 negates the need for loss scale adjustments, making our training procedure more promising and reproducible.

Bfloat16的混合精度。我们采用混合精度训练来节省运行时内存并降低时间成本。具体来说,我们选择了Bfloat16而不是FP16,因为它对于接近零的值具有更高的精度,更适合µP。因此,我们没有遇到[76]报告的FP16下溢问题。据我们所知,FLM模型目前是成功使用混合精度+µP进行训练的最大模型。此外,Bfloat16消除了对损失规模调整的需求,使我们的训练过程更具前景和可重复性。
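The practical difference can be seen with a one-liner: FP16's narrow exponent range flushes very small values to zero (hence the usual need for loss scaling), while BF16 keeps FP32's exponent range. NumPy has no native bfloat16, so the BF16 side is only indicated in a comment (PyTorch assumed for that check).

```python
# FP16 underflow illustration: values below FP16's smallest subnormal (~6e-8) become
# exactly zero, which is why FP16 training typically needs loss scaling.
import numpy as np

tiny_grad = np.float32(1e-8)
print(np.finfo(np.float16).tiny)        # ≈ 6.10e-05, smallest *normal* FP16 value
print(tiny_grad.astype(np.float16))     # 0.0 -> underflow in FP16
# With bfloat16 (e.g. torch.tensor(1e-8).to(torch.bfloat16)) the value survives,
# because BF16 shares FP32's 8-bit exponent range.
```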

4 Benchmark Evaluation基准评估

仅仅依靠知识本身可能无法全面反映LLM的能力,评估LLM的知识性+IQ智商测试

Many existing benchmarks (e.g., Open LLM) focus on assessing the knowledgeability of LLMs. In this section, we discuss the results of FLM on these benchmarks. We argue that knowledge alone might not comprehensively reflect LLM’s capability (see Section 4.2 for more details). Thus, in addition to the common benchmark evaluation, we borrow the concept of IQ tests and evaluate LLMs with some specific tasks in Section 5.

许多现有的基准测试(例如Open LLM)旨在评估LLM的知识性。在本节中,我们将讨论FLM在这些基准测试上的结果。我们认为,仅仅依靠知识本身可能无法全面反映LLM的能力(详见4.2节)。因此,除了常规的基准评估外,我们还借用了IQ智商测试的概念,在第5节中使用一些特定任务来评估LLM。

成本估算法:单语言的LLM(GPT-3/LLAMA-2)、多语言LLM(GLM-130B/FLM-101B)

Cost Estimation Method. Due to the considerable computational expense of LLMs, we also emphasize their associated costs in our experimental results. However, it is hard to directly compare the actual cost of LLMs due to their different infrastructures, and the different costs incurred on different hardware. To objectively compare training costs, we use the number of floating-point operations for training as the cost estimation index, which can be estimated from the model’s hyperparameters, configuration, and training data [35]. Since many models do not release the complete training configuration (e.g., GPT-3, LLAMA series), we estimate FLOPs within a range5.

成本估算方法。由于LLM的计算开销相当大,我们在实验结果中也强调了它们的相关成本。然而,由于它们具有不同的基础设施和不同硬件上产生的不同成本,因此很难直接比较LLM的实际成本。为了客观地比较训练成本,我们使用了训练的浮点运算数作为成本估算指标,这可以从模型的超参数、配置和训练数据中估算出来[35]。由于许多模型没有发布完整的训练配置(例如GPT-3,LLAMA系列),我们在一个范围内估算FLOPs

For monolingual LLMs, e.g., GPT-3, the cost from monolingual data is equal to the total cost. The computational cost of GPT-3 is calculated as 376.41 (±53.77) zettaFLOPs, and LLAMA-2 (13B) as 210.37 (±28.77) zettaFLOPs. Because the cost is linear to both model parameters and training data [19], we could calculate the cost of the remaining LLAMA models easily. For bilingual or multilingual models, it is necessary to estimate based on the amount of data in the corresponding language. The total cost of GLM-130B is 421.60 zettaFLOPs. We know that the data ratio of English and Chinese is 1:1. Hence, the cost of GLM-130B for English is 210.80 zettaFLOPs, and the same for Chinese. The data ratio of FLM-101B is 53.5% : 46.5% for English and Chinese. The total cost of FLM-101B is 52.76 zettaFLOPs. According to the data ratio, the cost for English and Chinese is 28.22 zettaFLOPs and 24.54 zettaFLOPs, respectively.

对于单语言LLM,例如GPT-3,来自单语数据的成本等于总成本。GPT-3的计算成本为376.41(±53.77)zettaFLOPs,LLAMA-2(13B)为210.37(±28.77)zettaFLOPs。由于成本与模型参数和训练数据线性相关[19],我们可以轻松计算剩余的LLAMA模型的成本。

对于双语或多语言模型,需要根据相应语言的数据量来估算。GLM-130B的总成本为421.60zettaFLOPs。我们知道英语和汉语的数据比例是1:1。因此,GLM-130B的英语成本和汉语成本都是210.80zettaFLOPs。FLM-101B的数据比例为53.5%:46.5%的英语和汉语。FLM-101B的总成本为52.76zettaFLOPs。根据数据比例,英语和汉语的成本分别为28.22zettaFLOPs和24.54zettaFLOPs。
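As a rough cross-check of the numbers above, the common C ≈ 6·N·D approximation (N parameters, D training tokens; 1 zettaFLOP = 10^21 FLOPs) already lands close to the quoted FLM-101B figure when summed over the three growth stages. The paper's own estimator is more detailed (derived from the full model configuration), which is why its GPT-3 figure (376 ± 54 zettaFLOPs) is somewhat higher than this rule of thumb.

```python
# Rule-of-thumb training cost: 6 * parameters * tokens, reported in zettaFLOPs.
ZETTA = 1e21

def train_flops(params, tokens):
    return 6 * params * tokens

gpt3_cost = train_flops(175e9, 300e9) / ZETTA                       # ≈ 315 zettaFLOPs
flm_cost = sum(train_flops(p, t) for p, t in
               [(16e9, 245.37e9), (51e9, 39.64e9), (101e9, 26.54e9)]) / ZETTA
print(f"GPT-3 ≈ {gpt3_cost:.0f} zettaFLOPs, FLM-101B ≈ {flm_cost:.1f} zettaFLOPs")  # ≈ 52 for FLM
```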

4.1 Open LLM Evaluation—Open LLM评估

Open LLM包含四个任务:ARC(常识和事实)、HellaSwag(常识推理)、MMLU(57个多选任务+特定领域的专业知识和复杂的推理)、TruthfulQA(817个检测模型错误的事实问题)

Open LLM is an open-source project 6. Its target is to track and evaluate the open-sourced LLMs and chatbots. Open LLM contains four tasks: ARC-Challenge (ARC for short), HellaSwag, MMLU, and TruthfulQA. The Open LLM Leaderboard applies the average score of these tasks as a metric.

Open LLM是一个开源项目6。它的目标是跟踪和评估开源的LLM和聊天机器人。Open LLM包含四个任务:ARC-Challenge(简称ARC),HellaSwag,MMLU和TruthfulQA。Open LLM排行榜将这些任务的平均得分作为指标。

ARC: The ARC [9] dataset is proposed for grade-school level closed book science question-answering tasks. Most problems in ARC are solvable with life experiences and Wikipedia searches. Thus, a model is expected to perform better if exposed to more commonsense and factual data.

HellaSwag: This is a sentence completion task emphasizing on commonsense inference [79]. We observe that the increase in HellaSwag performance is highly correlated with the reduction of training loss. This is intuitive because the training data is usually enriched with common sense.

MMLU: MMLU includes 57 multiple-choice tasks covering subjects spanning STEM to social science [17]. The tasks differ significantly in complexity, with many STEM-oriented questions demanding domain-specific professional knowledge and intricate reasoning to be solved.

TruthfulQA: TruthfulQA contains 817 factual questions to detect model falsehoods caused by naively mimicking human language patterns [27]. The solutions to these questions are closely associated with English Wikipedia sources. The task probes a model’s factual knowledge and resistance to popular misconceptions.

ARC:ARC [9]数据集面向中小学(grade-school)水平的闭卷科学问答任务。ARC中的大多数问题都可以通过生活经验和维基百科搜索来解决。因此,预期模型在接触更多常识和事实数据时将表现更好。

HellaSwag:这是一个强调常识推理的句子补全任务[79]。我们观察到,HellaSwag的性能提高与训练损失的降低高度相关。这是直观的,因为训练数据通常富含常识

MMLU:MMLU包括57个多选任务,涵盖从STEM到社会科学的各种学科[17]。这些任务在复杂性上差异很大,许多以STEM为导向的问题需要特定领域的专业知识复杂的推理才能解决。

TruthfulQA:TruthfulQA包含817个关于检测模型错误的事实问题,这些错误是由于模仿人类语言模式而引起的[27]。这些问题的解决方案与英语维基百科的来源密切相关。该任务探讨了模型的事实知识和抵抗流行误解的能力。

表3:FLM-101B和包括LLAMA系列和GLM-130B在内的基线的性能。为了直观比较性能和成本,我们估算了训练过程的浮点运算(zetta = 10^21)。

Table 3: Performance of FLM-101B and baselines including LLAMA series and GLM-130B. In order to visually compare the performance and cost, we estimate the floating-point operations (zetta = 10^21) of the training process.

表3:FLM-101B和包括LLAMA系列和GLM-130B在内的基线的性能。为了直观比较性能和成本,我们估算了训练过程的浮点运算(zetta = 10^21)。

Table 3 details the performance of FLM-101B and strong baselines, including LLAMA series and GLM-130B. Because GPT-3 is closed-source, we could not get the probability values for a fair comparison. As a result, we cannot list GPT-3 here. GLM-130B results are achieved by our run on an open-sourced checkpoint.

表3详细介绍了FLM-101B和强大基线的性能,包括LLAMA系列和GLM-130B。由于GPT-3是闭源的,我们无法获得公平比较的概率值。因此,我们无法在这里列出GPT-3。GLM-130B的结果是通过我们在开源检查点上运行得到的。

Results结果如果FLM-101B获得更多训练数据,它在这些任务上应有更佳表现

尽管FLM-101B在所有基线模型中平均得分43.94、排名最后,但这并不意味着模型及训练方法存在劣势。MMLU需要领域知识,而FLM-101B训练未刻意使用英语教材;同时,加入领域知识的eFLM-16B表现优于GLM-130B。此外,TruthfulQA、ARC和HellaSwag更关注常识和wiki知识:FLM-101B使用的英语数据量较少,但在TruthfulQA上已表现最佳;在ARC和HellaSwag上,FLM-101B与英语数据量相近的GLM-130B表现相当。总体来说,如果FLM-101B获得更多训练数据,它在这些任务上应有更佳表现。

Results. Among all the baseline models, FLM-101B ranks last with an average of 43.94. However, going deeper into the nature of these tasks, this does not necessarily indicate the inferiority of our model and training procedures.

结果。在所有基线模型中,FLM-101B以平均43.94分排名最后。然而,深入研究这些任务的性质,这并不一定表明我们的模型和训练程序的劣势。

(i) MMLU typically requires domain knowledge to solve. In our training of FLM-101B, no English textbook or sample exam questions are intentionally used. Nevertheless, in an FLM variant that incorporates this knowledge with FreeLM objectives (eFLM-16B, Section 2.2), even a 16B FLM model can outperform GLM-130B, supporting our claims here.

(ii) As aforementioned, TruthfulQA, ARC, and HellaSwag emphasize more on common sense and Wiki-level knowledge, and their performances improve with the increased amount of data and the reduction of training loss. With less than 0.16T English data (about one-tenth of LLAMA-2), FLM-101B already achieves the best accuracy of 41.47 among all the baselines on TruthfulQA. On ARC and HellaSwag, FLM-101B is comparable to GLM-130B with a similar amount of English data (approximately 0.2T). Also, the training data of GLM-130B includes ARC and HellaSwag, as expressly claimed in [80]. In our understanding, superior performance of FLM-101B can be expected on these three tasks if exposed to more training data.

(i) MMLU通常需要领域知识来解决。在我们的FLM-101B训练中,没有刻意使用英语教材和试题样题。然而,在FLM变体中,将这些知识与FreeLM目标(eFLM-16B,第2.2节)相结合,即使是16B FLM模型也可以优于GLM-130B,支持我们在这里的声明。

(ii) 如前所述,TruthfulQA,ARC和HellaSwag更加强调常识和维基级别的知识,它们的性能随着数据量的增加和训练损失的减少而提高。FLM-101B仅使用不到0.16T的英语数据(约为LLAMA-2的十分之一),就已经在TruthfulQA的所有基线中取得了41.47的最佳准确率。在ARC和HellaSwag上,FLM-101B与具有类似数量英语数据(约为0.2T)的GLM-130B相媲美。此外,GLM-130B的训练数据包括ARC和HellaSwag,正如[80]中明确提到的。在我们的理解中,如果暴露于更多的训练数据,FLM-101B在这三个任务上可以期望表现更好。

4.2 Evaluation on the Professional Knowledge-Enhanced Version专业知识增强版本的评估

我们对知识增强版本FLM-16B进行了实验,验证使用领域知识数据的效果。为减少训练成本,我们继续用来自MMLU和C-Eval等相近领域和格式的辅助训练数据以及其他领域知识数据的教师信号,进一步训练最小模型FLM-16B。注意,这种训练方法不同于典型的微调,它不会影响语言模型的语言能力,同时也保留了语言信号。表4展示了eFLM-16B和基线模型在C-Eval数据集上的效果。

We have also conducted experiments on a knowledge-enhanced version (eFLM-16B, detailed in Section 2.2) of the FLM to validate the effect of using domain-specific knowledge data. To reduce the training cost, we continue to train the smallest FLM-16B with teacher signals from a combination of (i) part of the auxiliary training data of MMLU [17], (ii) exam questions in similar domains and formats to C-Eval [20] 7, and (iii) other domain knowledge data. Note that, eFLM-16B is not a typical fine-tuning with additional data, which may affect the language capability of LLM. Recall that the FLM series uses FreeLM as its backbone which can learn both language and teacher signals. In this training, we preserve the language signal. Table 4 lists the result of eFLM-16B and baselines on C-Eval.

我们还对FLM的知识增强版本(eFLM-16B,详见第2.2节)进行了实验,以验证使用领域特定知识数据的效果。为了降低训练成本,我们继续使用来自以下组合的教师信号训练最小的FLM-16B:(i)MMLU的辅助训练数据的一部分,(ii)与C-Eval[20]相似领域和格式的试题,以及(iii)其他领域知识数据。请注意,eFLM-16B不是典型的使用额外数据的微调,因为这可能会影响LLM的语言能力。回顾一下,FLM系列使用FreeLM作为其基础,可以学习语言和教师信号。在这个训练中,我们保留了语言信号。表4列出了eFLM-16B和各基线在C-Eval上的结果。

表4:eFLM-16B和各基线在C-Eval上的性能。在这张表格中,eFLM-16B指的是专业知识增强的FLM-16B。请注意,C-Eval排行榜只保留了一个小数位的评估结果。

Table 4: Performance of eFLM-16B and baselines on C-eval. In this table, eFLM-16B refers to the professional-knowledge-enhanced FLM-16B. Note that C-Eval leaderboard only keeps one decimal place for the evaluation results.

表4:eFLM-16B和各基线在C-Eval上的性能。在这张表格中,eFLM-16B指的是专业知识增强的FLM-16B。请注意,C-Eval排行榜只保留了一个小数位的评估结果。

Results结果仅用专业知识评估可能不能完全反应LLM的实际能力

用专业知识增强后,表现有显著提升。在MMLU任务上,采用专业知识数据的教师信号使eFLM-16B得分达到44.50,超过了在相关领域使用多任务数据的GLM-130B模型;而未增强的FLM-16B在此任务上仅得27.02分。在C-Eval任务上,eFLM-16B的表现比GLM-130B高约2个点。这些结果表明,仅用专业知识评估可能不能完全反映LLM的实际能力,特别是不同LLM采用不同的数据集训练时。

Results. Enhanced with professional knowledge, significant improvements are observed. On MMLU task, the incorporation of the teacher signals with professional knowledge data results in a score of 44.50 for eFLM-16B (see Table 3), which surpasses GLM-130B (42.59), a model that also uses multi-task data in the related domain [80]. As a comparison, the MMLU score is 27.02 for the un-enhanced FLM-16B. On C-Eval tasks 8, we observe that eFLM-16B performs better than GLM-130B by about 2 points. As a comparison, the average C-Eval score of the vanilla FLM-16B is 27.0, which underperforms GLM-130B. These results suggest that evaluation with professional knowledge may not fully reflect the capability of LLMs, particularly when different LLMs are trained with different data collections, and some may not come with a clear list.

结果。在专业知识的增强下,我们观察到了显著的改进。在MMLU任务中,将专业知识数据与教师信号相结合,eFLM-16B的得分为44.50(见表3),超过了GLM-130B(42.59),后者也在相关领域使用多任务数据[80]。作为对比,未增强的FLM-16B的MMLU得分为27.02。在C-Eval任务中,我们观察到eFLM-16B的性能比GLM-130B提高了约2个点。作为对比,未经改进的FLM-16B的平均C-Eval得分为27.0,低于GLM-130B。这些结果表明,使用专业知识进行评估可能无法充分反映LLM的能力,尤其是当不同的LLM使用不同的数据集进行训练,且其中一些并未公布明确的数据清单时。

4.3 Evaluation of the Growth Strategy增长策略的评估

减少计算成本的核心方法是增长策略。我们想验证增长策略在知识传承上的效果,以及模型能力随规模增长的轨迹

Our core method for reducing computational cost is the growth strategy. We would like to answer the question of whether our growth strategy is effective in knowledge inheritance, and the trajectory of how model capabilities grow with size. Hence, we evaluate the performance of FLM on all the stages: 16B, 51B, and 101B. The training data for each stage is 0.245T, 0.04T, and 0.027T, respectively, in an accumulative manner according to the growth setting. Table 5 shows the performance of FLM models at each stage.

我们降低计算成本的核心方法是增长策略。我们想要回答的问题是,我们的成长策略在知识继承方面是否有效,以及模型能力如何随规模增长的轨迹。因此,我们评估了FLM在所有阶段的性能:16B、51B和101B。每个阶段的训练数据分别为0.245T、0.04T和0.027T,按照增长设置进行累积。表5显示了每个阶段FLM模型的性能。

表5:FLM在Open LLM上的三个阶段的性能。为了减少评估过程中的计算成本,我们对HellaSwag和MMLU任务分别采样了20%和30%的items

Table 5: Performance of the three stages of FLM on Open LLM. To reduce the computational cost during evaluation, we sample 20% and 30% items for HellaSwag and MMLU tasks, respectively.

表5:FLM在Open LLM上的三个阶段的性能。为了减少评估过程中的计算成本,我们对HellaSwag和MMLU任务分别采样了20%和30%的项目。

Results结果:评估了FLM在知识相关能力和性能随训练数据数量和领域变化情况方面的表现—随着模型规模的增加,FLM的表现也在提升,意味着我们的模型在每个增长阶段后都成功地从上一个阶段继承知识

Results. As expected, the performance of FLM improves with the increase in model size. FLM-101B achieves the best performance on almost all tasks. This means that our model inherits knowledge from the previous stage after each growth. We also observe that the 101B model improves the performance scores more significantly than the 51B model, with less data. This indicates that the models are successfully incorporating new weights in training after growth, and taking advantage of larger model sizes when the loss is low. Interestingly, the performance on ARC and HellaSwag increases steadily and significantly. This corresponds exactly to the steady decline of the model loss. Again, as we claimed in Section 4.1, when more training data is processed, FLM’s performance on Open LLM becomes better.

结果。正如预期的那样,FLM的性能随着模型规模的增加而提高。FLM-101B在几乎所有任务上都取得了最佳性能。这意味着我们的模型在每次增长后都从前一阶段继承了知识。我们还观察到,101B模型的性能分数相比51B模型更显著地提高,而数据量较少。这表明在增长后,模型成功地在训练中加入了新的权重,并在损失较低时利用了更大的模型规模。有趣的是,在ARC和HellaSwag上的性能稳定而显著提高。这与模型损失的稳定下降完全一致。再次强调,正如我们在第4.1节中所声明的,处理更多的训练数据时,FLM在Open LLM上的性能变得更好。

The above experiments evaluate the knowledge-related ability of FLM and how the performances depend on the amount and domain of training data. We also conduct an additional range of evaluations inspired by IQ tests in the following section.

上述实验评估了FLM的知识相关能力以及性能如何取决于训练数据的数量和领域。在接下来的部分,我们还进行了一系列受智商测试启发的额外评估

5 Evaluations Inspired by IQ Tests基于智商测试启发的评估

痛点(知识可能无法充分反映LLM的智商)→提出使用现有的与IQ智商测试数据集,四方面考察=符号映射+规则理解+模式挖掘+抗干扰

Section 4 details the evaluation of existing benchmarks, focusing on knowledge. As we discussed in Section 1, knowledge could not fully reflect the Intelligence Quotient (IQ) of LLMs. To this end, we use existing IQ-related datasets [71; 72; 53] and make necessary modifications or generate new synthetic datasets where necessary.

Specifically, the IQ test mainly considers four aspects: symbolic mapping, rule understanding, pattern mining, and anti-interference. A common key property of these tasks is that they are dependent on the inference and generalization in a new context, instead of the previously-learned knowledge. We re-organize the modified existing datasets and our newly generated datasets under these four aspects, and introduce the motivation for each aspect, as well as the detailed execution methods.

第4节详细介绍了现有基准的评估,重点关注知识。正如我们在第1节中讨论的那样,知识可能无法充分反映LLM的智商(IQ)。为此,我们使用现有的与IQ相关的数据集[71; 72; 53],并在必要时进行必要的修改或生成新的合成数据集。

具体来说,IQ测试主要考虑四个方面:符号映射、规则理解、模式挖掘抗干扰。这些任务的一个共同关键属性是它们依赖于在新上下文中的推理和概括,而不是先前学到的知识。我们根据这四个方面重新组织修改过的现有数据集和我们新生成的数据集,并介绍每个方面的动机以及详细的执行方法。

Compared Methods. Borrowing psychological ideas that the measurement of IQ is dependent on age 9, we mainly consider models trained with similar amounts of data to FLM-101B. As a milestone of LLM development, GPT-3 (175B) [3] proposed in-context learning for the first time. GLM-130B [80] is the first open English-Chinese bilingual LLM. Hence, we select them as baseline models. Both models are trained with 300 ~400 billion tokens, which are in the same range as ours. GPT-3 focuses on English, so it is not included in the Chinese-related evaluation (i.e., CLUE-IQ).

比较方法。借鉴心理学中智商测量依赖于年龄的思想,我们主要考虑使用与FLM-101B相近数据量进行训练的模型。作为LLM发展的里程碑,GPT-3(175B)[3]首次提出了上下文学习。GLM-130B[80]是第一个开放的英汉双语LLM。因此,我们选择它们作为基准模型。这两个模型都是用3000亿~4000亿个标记进行训练的,与我们处于同一范围。GPT-3侧重于英语,因此不包括在与中文相关的评估中(即CLUE-IQ)。

5.1 Symbolic Mapping Evaluation符号映射评估

文本形式的分类任务可能已泄露在原始数据中→导致模型过度拟合标签语义,而非智能推理得出→无法衡量智力

An existing study [71] points out that classification tasks (e.g., document classification, sentiment classification) in textual forms often lack generalization. This is because they often come with very indicative and meaningful category labels. Such labels may laterally appear in the raw training data or popular websites, i.e., SemEval, IMDB [32], and Yelp 10 et al.. This leads a model to over-fit the semantics of the labels instead of inferring them from the new context, while the latter is critical for measuring intelligence as well. Considering this, we use a symbolic mapping method to replace the original category labels with symbols that are unlikely to be seen in the training data. Hence, we can evaluate the LLMs’ language understanding ability as well as the generalization abilities to a new context. Because the labels are from a given scope, we form our evaluation task as in-context learning with few-shot examples for each label.

一项现有研究[71]指出,文本形式的分类任务(例如,文档分类、情感分类)通常缺乏泛化能力。这是因为它们通常带有非常指示性和有意义的类别标签。这些标签可能会在原始训练数据或流行网站(如SemEval、IMDB[32]和Yelp等)中直接出现。这导致模型过度拟合标签的语义,而不是从新上下文中推断它们,而后者对于衡量智力也至关重要。考虑到这一点,我们使用符号映射方法将原始类别标签替换为在训练数据中不太可能出现的符号。因此,我们可以评估LLM的语言理解能力以及对新上下文的概括能力。由于标签来自给定的集合,我们将评估任务构造成为每个标签提供少样本示例的上下文学习。

图3:符号映射的示例。主要区别在于符号映射方法将原始标签替换为随机字符串。在这个示例中,蕴含类别被随机字符串<30mFC%4Z>替换,而非蕴含类别被替换为<?V9qP@Rx>。

Figure 3: An example of symbolic mapping. The main difference is that the symbolic mapping method replaces the original label with random strings. In this example, we use <30mFC%4Z> and <?V9qP@Rx> to replace entailment and not entailment, respectively.

图3:符号映射的示例。主要区别在于符号映射方法将原始标签替换为随机字符串。在这个示例中,蕴含类别被随机字符串<30mFC%4Z>替换,而非蕴含类别被替换为<?V9qP@Rx>。

5.1.1 Data Collection数据收集基于SuperGLUE和CLUE采样300实例+用随机字符串替换原来的类别标签

We use the existing benchmark datasets (e.g., SuperGLUE [61], CLUE [74]) as the source and sample up to 300 instances. Then, we replace the original category labels with random strings. Figure 3 shows an example. In this case, the entailment category is replaced by random string <30mFC%4Z> while the not entailment category is replaced by <?V9qP@Rx>. This processing also mitigates the problem that these datasets may contaminate the LLM pre-training data, since both benchmarks are public with lots of reproductions. Table 6 presents the statistics and task types of the rebuilt datasets.

我们使用现有的基准数据集(例如,SuperGLUE[61]、CLUE[74])作为数据源,并采样多达300个实例。然后,我们用随机字符串替换原来的类别标签。图3显示了一个示例。在这种情况下,蕴含类别被随机字符串<30mFC%4Z>替换,而非蕴含类别被替换为 <?V9qP@Rx>。这种处理还可以在一定程度上减轻这些数据集可能污染LLM预训练数据的问题,因为这两个基准测试都是公开的,有很多复制品。表6列出了重建的数据集的统计信息和任务类型。
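A minimal sketch of the relabeling step: category names are swapped for random strings and a few-shot prompt is assembled from the relabeled instances. The prompt template and the string alphabet are illustrative assumptions, not the authors' exact construction.

```python
# Hedged sketch of building a symbolic-mapping (IQ-style) few-shot prompt.
import random
import string

def random_label(length=8):
    alphabet = string.ascii_letters + string.digits + "%@?$#"
    return "<" + "".join(random.choices(alphabet, k=length)) + ">"

def build_prompt(few_shot_examples, query, label_map):
    blocks = [f"Premise: {p}\nHypothesis: {h}\nAnswer: {label_map[y]}"
              for p, h, y in few_shot_examples]
    blocks.append(f"Premise: {query[0]}\nHypothesis: {query[1]}\nAnswer:")
    return "\n\n".join(blocks)

label_map = {"entailment": random_label(), "not_entailment": random_label()}
few_shot = [("A man is sleeping.", "Someone is asleep.", "entailment"),
            ("A dog is running.", "A cat is sitting.", "not_entailment")]
print(build_prompt(few_shot, ("It is raining.", "The street is wet."), label_map))
```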

表6:SuperGLUE-IQ和CLUE-IQ数据集的统计信息。“WSD”代表“词义消歧”;“SS”代表“句子相似性”;“KR”代表“关键词识别”;coref.代表“共指解析”。

Table 6: Statistics for SuperGLUE-IQ and CLUE-IQ datasets. “WSD” stands for “Word Sense Disambiguation”; “SS” stands for “Sentence Similarity”; “KR” stands for “Keyword Recognition”; coref. stands for “coreference resolution”.

表6:SuperGLUE-IQ和CLUE-IQ数据集的统计信息。“WSD”代表“词义消歧”;“SS”代表“句子相似性”;“KR”代表“关键词识别”;coref.代表“共指解析”。

5.1.2 SuperGLUE-IQ基于SuperGLUE原始数据集构建,采样验证集(两原则筛选)

SuperGLUE is a benchmark dataset used in evaluating the classification ability of various models including LLMs. However, the data is publicly available and many websites have reproduced this dataset. As a result, it is inevitable that the models might have already been trained on it. Thus, we build a new dataset named SuperGLUE-IQ based on the original dataset. Since the answers for the test set of SuperGLUE are not publicly available, we use a validation set here. There are two rules for selecting the sub-tasks: (i) the number of instances exceeds 100; (ii) the classification categories are fixed sets. The building process is detailed in Section 5.1.1. Table 7 lists the performance of FLM-101B and the baselines.

SuperGLUE是一个用于评估各种模型包括LLM的分类能力的基准数据集。然而,这些数据是公开可用的,许多网站都复制了这些数据集。因此,模型可能已经在其上进行了训练。因此,我们基于原始数据集构建了一个名为SuperGLUE-IQ的新数据集。由于SuperGLUE的测试集答案不公开,我们在这里使用验证集。选择子任务有两个规则:(i)实例数超过100;(ii)分类类别是固定的集合。构建过程详见第5.1.1节。表7列出了FLM-101B和基线的性能。

Results结果:双向编码器(GLM-130B擅长英语共指解析任务)、单向(FLM-101B和GPT-3更擅长推理任务【如BoolQ】),FLM-101B已接近GPT-3,但两类各有优势

Results. On BoolQ, WiC, and RTE tasks, FLM-101B and GPT-3 perform at the same level, and both outperform GLM-130B. In specific, GPT-3 and FLM-101B are more than 9 points better than GLM-130B on BoolQ. On WSC task, FLM-101B and GPT-3 perform comparably while both perform worse than GLM-130B with about an 18 points gap. The technical report of GLM-130B [80] shows that they use both the WSC and RTE datasets in training. It is interesting to observe that the performance of GLM-130B on the two tasks has such a difference. Since the original label is replaced by a random string, overfitting can be ruled out to a certain extent. We believe that the main reason lies in the structure of language models: GLM-130B contains a bidirectional encoder while FLM-101B and GPT-3 are uni-directional. This feature potentially makes GLM-130B perform better in English coreference resolution tasks, while poor in reasoning-related tasks (e.g., BoolQ). More importantly, the costs of the three models are very different. FLM-101B achieves a comparable performance with GPT-3 under about 1/13 of its computational cost.

结果。在BoolQ、WiC和RTE任务中,FLM-101B和GPT-3表现相当,都优于GLM-130B。具体来说,GPT-3和FLM-101B在BoolQ上比GLM-130B高出9个多点。在WSC任务中,FLM-101B和GPT-3表现相近,但都比GLM-130B差,差距约为18个点。GLM-130B的技术报告[80]显示,他们在训练中使用了WSC和RTE数据集。有趣的是观察到GLM-130B在这两个任务上的表现有这么大的差异。由于原始标签被随机字符串替换,可以在一定程度上排除过度拟合的可能性。我们认为主要原因在于语言模型的结构:GLM-130B包含双向编码器,而FLM-101B和GPT-3是单向的。这个特性可能使GLM-130B在英语共指解析任务中表现更好,而在与推理相关的任务(如BoolQ)中表现较差。更重要的是,这三个模型的成本非常不同。FLM-101B在大约1/13的计算成本下实现了与GPT-3相媲美的性能。

5.1.3 CLUE-IQ基于CLUE数据集构建,评估四个任务

CLUE [74] is an open benchmark for Chinese NLP tasks. Similar to SuperGLUE-IQ, we build CLUE-IQ based on the CLUE dataset. Because GPT-3 is unable to handle Chinese well, here we compare FLM-101B with GLM-130B only. There are four tasks to be evaluated, including AFQMC, CSL, OCNLI, and CLUEWSC2020.11 Similar to SuperGLUE-IQ, we follow the same two rules to filter the original CLUE. Table 8 lists the performances of FLM-101B and GLM-130B.

CLUE [74] 是一个开放的用于中文自然语言处理任务的基准测试集。与SuperGLUE-IQ类似,我们基于CLUE数据集构建了CLUE-IQ。因为GPT-3无法很好地处理中文,所以这里我们只将FLM-101B与GLM-130B进行比较。要评估的任务有四个,包括AFQMC、CSL、OCNLI和CLUEWSC2020。与SuperGLUE-IQ类似,我们遵循相同的两个规则来筛选原始的CLUE。表8列出了FLM-101B和GLM-130B的性能。

Results结果:对比GLM-130B,FLM-101B有良好的中文能力且成本更低

Results. On CLUE-IQ, our proposed FLM-101B achieves the best average performance of 42.07. Among the evaluated tasks, FLM-101B outperforms GLM-130B on AFQMC, CSL, and CLUEWSC2020. The results show that FLM-101B has good Chinese ability at the level of 100B parameters. Interestingly, FLM-101B performs better than GLM-130B on Chinese WSC, while worse than GLM-130B on English WSC. In addition, FLM-101B performs worse than GLM-130B on OCNLI. These results suggest that Chinese and English are different in nature and a model excelling in one language may not be good at both. Finally, from a cost-effective perspective, FLM-101B achieves better performance in Chinese at about 12% of the training cost of the counterpart.

结果。在CLUE-IQ上,我们提出的FLM-101B取得了42.07的最佳平均性能。在评估的任务中,FLM-101B在AFQMC、CSL和CLUEWSC2020上表现优于GLM-130B。结果表明,FLM-101B在100B参数级别具有良好的中文能力。有趣的是,FLM-101B在中文WSC上的表现优于GLM-130B,而在英文WSC上不及GLM-130B。此外,FLM-101B在OCNLI上的表现不及GLM-130B。这些结果表明,中文和英文在本质上是不同的,在一种语言上表现出色的模型未必能同时擅长两种语言。最后,从成本效益的角度看,FLM-101B在大约12%的训练成本下在中文上取得更好的性能。

5.2 Rule Understanding Evaluation规则理解评估

拥有推理能力的标志(理解规则并根据规则执行),规则理解评估(封闭环境下执行正确动作的能力)不同于COT(开放环境下的推理能力)

Symbolic mapping is able to lighten the negative effects of data overfitting. From a different perspective, we consider understanding rules and executing them according to the given rules is a strong indication of reasoning capability. To this end, we design rule understanding evaluation. Note that, this test is different from reasoning based on the chain of thought. The former focuses on the understanding ability of simple rules (e.g., counting) and performing the right action in a closed setting, while the latter focuses on reasoning ability in an open setting (e.g., different valid reasons for the same conclusion). For example, “counting an increasing sequence of numbers” is a typical task for rule understanding evaluation, which can be zero-shot.

符号映射能够减轻数据过拟合的负面影响。从不同的角度来看,我们认为理解规则并根据给定的规则执行是推理能力的有力标志。为此,我们设计了规则理解评估。需要注意的是,这个测试不同于基于思维链(CoT)的推理。前者侧重于对简单规则(例如计数)的理解能力以及在封闭环境下执行正确动作的能力,而后者侧重于在开放环境下的推理能力(例如,对相同结论的不同有效理由)。例如,“计算递增数列”是规则理解评估的典型任务,可以进行零样本测试。

所选任务和数据的详细信息计数任务+300个随机生成项目的双语数据集,字符串替换+两个子任务

Details of Selected Tasks and Data. Counting (0-shot) is the simplest test method for rule under-standing ability. Here, we build a bilingual dataset with 300 randomly generated items and report the results on 148 of them with English instructions. A typical example is “Let’s count from 10010 to 10035: 10010, 10011, 10012,”. String replacement (4-shots) is another task that examines the model’s capacity to edit the text precisely following human intention. We build two sub-tasks: Replace-Word and Replace-Lowercase, each of which contains 300 instances. Each instance starts with a clear instruction: for the “Replace-Word” task, it is like “In the following sentence, replace the specified word with the target word. word to replace: **WQHF** target word: **DFBB**”; for the “Replace-Lowercase” task, it is like “For the following text, please modify all uppercase letters to lowercase”. The counting range and words to replace are sampled with a uniform distribution. Table 9 shows the performance of our proposed FLM-101B against GPT-3 and GLM-130B on both counting and string replacement tasks.

所选任务和数据的详细信息

计数(0样本)是规则理解能力的最简单测试方法。在这里,我们构建了一个包含300个随机生成项目的双语数据集,并报告了其中148个带英文指令的项目的结果。一个典型的示例是“让我们从10010数到10035:10010、10011、10012,”。

字符串替换(4个样本)是另一项测试,它检验了模型根据人类意图精确地编辑文本的能力。我们构建了两个子任务:Replace-Word和Replace-Lowercase,每个子任务包含300个实例。每个实例都以明确的指令开头

>> 对于“Replace-Word”任务,指令如“在以下句子中,将指定的单词替换为目标单词。要替换的单词:**WQHF** 目标单词:**DFBB**”;

>> 对于“Replace-Lowercase”任务,指令如“对于以下文本,请将所有大写字母修改为小写字母”。

计数范围和要替换的单词都是根据均匀分布进行抽样的。表9显示了我们提出的FLM-101B与GPT-3和GLM-130B在计数和字符串替换任务上的性能对比。
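As a rough illustration of how the counting and string-replacement instances described above could be generated, here is a small sketch. The numeric ranges, token lengths, and prompt wording follow the examples in the text, while the remaining sampling details are assumptions; the real 4-shot replacement tasks would additionally prepend four solved demonstrations of the same form.

```python
import random
import string

def counting_instance(low=10000, high=99900, max_span=40):
    """Zero-shot counting: ask the model to continue an increasing sequence."""
    start = random.randint(low, high)          # counting range sampled uniformly
    end = start + random.randint(10, max_span)
    prompt = f"Let's count from {start} to {end}: {start}, {start + 1}, {start + 2},"
    target = ", ".join(str(n) for n in range(start + 3, end + 1))
    return prompt, target

def replace_word_instance(sentence, word_to_replace):
    """Replace-Word: swap one specified word for a random target token;
    the **WQHF** / **DFBB** format follows the example in the text."""
    new = "".join(random.choices(string.ascii_uppercase, k=4))
    prompt = (
        "In the following sentence, replace the specified word with the target word. "
        f"word to replace: **{word_to_replace}** target word: **{new}**\n{sentence}"
    )
    target = " ".join(new if w == word_to_replace else w for w in sentence.split())
    return prompt, target

def replace_lowercase_instance(text):
    """Replace-Lowercase: ask for the fully lower-cased version of the text."""
    prompt = f"For the following text, please modify all uppercase letters to lowercase\n{text}"
    return prompt, text.lower()
```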

Results结果:FLM-101B性能不一定最好,但性价比最高

Results. On counting task, FLM-101B achieves 69.59%, about 9 points better than GLM-130B. GPT-3 wins the first place in counting and Replace-Lowercase, and second place in Replace-Word. This is potentially because GPT-3 has the largest amount of English training data. This experiment shows that the advantages of each model are varied. Hence, in future work, rule understanding evaluation tasks should cover more scenarios. Finally, considering the cost of each model, the performance of FLM-101B is satisfactory.

结果。在计数任务中,FLM-101B实现了69.59%的成绩,比GLM-130B高出约9个点。GPT-3在计数和Replace-Lowercase任务中获得第一名,在Replace-Word任务中获得第二名。这可能是因为GPT-3拥有最多的英语训练数据。这个实验显示了每个模型的优势各不相同。因此,在未来的工作中,规则理解评估任务应涵盖更多的场景。最后,考虑到每个模型的成本,FLM-101B的性能令人满意。

5.3 Pattern Mining Evaluation模式挖掘评估

模式挖掘测试存在数据泄露问题→采用替换方法来缓解+建立三个任务进行评估

Pattern Mining test is common in IQ tests. In detail, it is the induction and deduction of the patterns emerging in a new context. In general, it is difficult even for humans and is frequently used in intelligence tests. Again, we face the problem that the same test data might have appeared in large quantities, so we also use replacement methods similar to Section 5.1 to alleviate this problem.

Specifically, we build a benchmark with three tasks (i.e., Head & Tail, Full Repeating, and Head Slicing) for evaluation. Head & Tail is to add a head and a tail to the given input, which should be exactly the same as the ones in the given examples. Regarding Full Repeating, the input sequence should be fully repeated once. For the Head Slicing task, the model needs to return the first fixed number of characters of the input. The number can be inferred from the preceding examples. No instruction or clue is provided except the examples.

模式挖掘测试在智商测试中很常见。具体来说,它是在新的背景下对出现的模式进行归纳和演绎。一般来说,这即使对人类来说也很困难,因此经常被用于智力测试。同样,我们面临着相同的测试数据可能已经大量出现过的问题,因此我们也使用了类似于第5.1节的替换方法来缓解这个问题。

具体来说,我们建立了一个包含三个任务(即Head & Tail、Full Repeating和Head Slicing)的基准测试用于评估。

>> Head & Tail任务是在给定的输入中添加一个头和一个尾,它们应该与给定示例中的头和尾完全相同。

>> 对于Full Repeating任务,输入序列应该完全重复一次。

>> 对于Head Slicing任务,模型需要返回输入的前面固定数量的字符。这个数字可以从前面的示例中推断出来。除了示例之外,没有提供任何指令或线索。

Figure 4: Examples of pattern mining evaluation.

Figure 4 shows examples of these tasks. We sample the input strings, heads, and tails from a uniform distribution. These tasks are actually the “alphabetical” versions of the list_functions sub-task of Big-Bench [53]. The original numerical version is so simple that most existing LLMs could achieve 90%+ accuracy. To improve the distinctiveness, we replace the numbers with characters. All these tasks require the model to discover the behavior patterns inside the given examples. Each task is 5-shot and contains 100 instances. Table 10 lists the experimental results of our proposed FLM-101B against GPT-3 and GLM-130B on pattern mining tasks.

图4显示了这些任务的示例。我们从均匀分布中抽样输入字符串、头和尾。这些任务实际上是Big-Bench [53]的list_functions子任务的“字母”版本。原始的数字版本非常简单,大多数现有的LLM可以实现90%以上的准确度。为了提高区分度,我们将数字替换为字符。所有这些任务都要求模型发现给定示例中的行为模式。每个任务都是5样本,包含100个实例。表10列出了我们提出的FLM-101B与GPT-3和GLM-130B在模式挖掘任务上的实验结果。
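The three pattern-mining tasks can be pictured with a short sketch like the following; the prompt template, string lengths, and sampling ranges are assumptions made only to illustrate the "examples-only, no instruction" format described above.

```python
import random
import string

def rand_chars(n):
    return "".join(random.choice(string.ascii_lowercase) for _ in range(n))

def head_tail_example(head, tail):
    body = rand_chars(random.randint(3, 8))
    return body, head + body + tail            # fixed head/tail wrap the input

def full_repeating_example():
    body = rand_chars(random.randint(3, 8))
    return body, body * 2                      # the input is repeated once

def head_slicing_example(k):
    body = rand_chars(random.randint(k + 2, k + 8))
    return body, body[:k]                      # return the first k characters

def build_prompt(pairs, query):
    """5-shot prompt: only solved examples are shown, no explicit instruction."""
    shots = "\n".join(f"input: {x} output: {y}" for x, y in pairs)
    return f"{shots}\ninput: {query} output:"

# Head & Tail: the head/tail pair is sampled once per instance and shared
# across the five demonstrations and the query.
head, tail = rand_chars(2), rand_chars(2)
shots = [head_tail_example(head, tail) for _ in range(5)]
query, answer = head_tail_example(head, tail)
print(build_prompt(shots, query), "->", answer)
```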

Results结果:FLM-101B性价比更高

Results. On all three tasks, FLM-101B outperforms GLM-130B by a large margin. For the head & tail and full repeating tasks, FLM-101B is a few points behind GPT-3, but outperforms the latter on the head slicing task. Considering the computational cost, FLM-101B exhibits noticeable abilities in this area.

结果。在所有三个任务中,FLM-101B都远远优于GLM-130B。在Head & Tail和Full Repeating任务中,FLM-101B落后于GPT-3几个百分点,但在Head Slicing任务上优于GPT-3。考虑到计算成本,FLM-101B在这个领域展现出明显的能力。

5.4 Anti-interference Evaluation抗干扰性评估

Anti-interference capability is critical for finding and utilizing information that is truly related to a specific goal, in an unseen and noisy context (Figure 5). We believe that in addition to generalization, anti-interference is also one of the important principles of AGI. For example, many LLMs will babble when given noisy cues. Another famous hard problem, the cocktail party problem in speech recognition [38], also suggests the importance of the anti-interference ability of intelligent agents. To this end, we conduct this anti-interference evaluation. Figure 5 shows two typical examples of this test.

抗干扰性能对于在未见过的嘈杂环境中找到并利用与特定目标真正相关的信息至关重要(见图5)。我们认为,除了泛化能力,抗干扰性也是通用人工智能(AGI)的重要原则之一。例如,当给定嘈杂的提示时,许多LLM会胡言乱语。另一个著名的难题是语音识别中的鸡尾酒会问题[38],它也表明了智能体的抗干扰能力的重要性。为此,我们进行了抗干扰性评估。图5显示了这项测试的两个典型示例。

选择的任务和数据收集:三种任务类型

Selected Tasks and Data Collection. We conduct anti-interference evaluation in three task types: multiple key retrievals, single supporting fact tracking, and two supporting facts tracking. Multiple key retrieval is a kind of puzzle that hides some important information (referred to as keys) inside a lot of irrelevant text. If the anti-interference ability of LLMs is not good enough, they will output the wrong or even meaningless words. Even if LLMs pass the first challenge, they may still fail due to multiple relevant noises. We collect a multiple key retrieval dataset in similar formats as those in [7] with at most 3 keys in each instance, exemplified in Figure 5. The single supporting fact tracking and two supporting facts tracking tasks test whether a model can find the chain of supporting facts to answer a question correctly, which is hidden inside a set of irrelevant statements. There are two sub-tasks in the babi-20 [72] benchmark (qa1 and qa2) that are aligned with this setting. Thus, we directly modify them in a generative format with 3 shots. We randomly sampled 300 questions for each of these three tasks. Table 11 shows the evaluation results on anti-interference.

选择的任务和数据收集。我们在三种任务类型中进行抗干扰性评估:多键检索、单支持事实跟踪和双支持事实跟踪。

>>多键检索是一种谜题,它将一些重要信息(称为关键词)隐藏在大量不相关的文本中。如果LLM的抗干扰能力不够好,它们会输出错误甚至毫无意义的词语。即使LLM通过了第一个挑战,它们仍然可能由于多个相关噪声而失败。我们按照[7]中的类似格式收集了一个多键检索数据集,每个实例最多有3个关键词,如图5所示。

>>单支持事实跟踪和双支持事实跟踪任务测试了模型是否能够找到隐藏在一组不相关陈述中的支持事实链以正确回答问题。babi-20 [72]基准测试中有两个子任务(qa1和qa2)与这种设置一致。因此,我们直接用3个样本将它们修改为生成格式。我们为这三个任务各随机抽取了300个问题。表11显示了抗干扰性评估的结果。
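A minimal sketch of the multiple key retrieval format, together with the pass-rate style of scoring used to report results, might look like the following. The filler sentences, key wording, and containment-based scoring rule are assumptions in the spirit of the description above, not the authors' code.

```python
import random

FILLER = [
    "The sky was overcast for most of the afternoon.",
    "A delivery truck stopped outside the library.",
    "Someone mentioned that the meeting had been moved.",
]  # irrelevant sentences; a real dataset would draw from a much larger pool

def multiple_key_retrieval_instance(num_keys=3, noise_sentences=12):
    """Hide up to `num_keys` pass-keys inside irrelevant text, then query one key."""
    keys = {f"key {i}": str(random.randint(10000, 99999)) for i in range(1, num_keys + 1)}
    lines = [f"The pass-{name} is {value}." for name, value in keys.items()]
    lines += random.choices(FILLER, k=noise_sentences)
    random.shuffle(lines)
    asked = random.choice(list(keys))
    prompt = " ".join(lines) + f"\nWhat is the pass-{asked}?"
    return prompt, keys[asked]

def passing_rate(predictions, answers):
    """Percentage of instances whose prediction contains the gold answer."""
    hits = sum(ans in pred for pred, ans in zip(predictions, answers))
    return 100.0 * hits / len(answers)
```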

图5:抗干扰评估示例。

Figure 5: Examples of anti-interference evaluation.

表11:FLM-101B、GPT-3和GLM-130B的抗干扰性能评估。

Table 11: Performance of FLM-101B, GPT-3, and GLM-130B on anti-interference evaluation.

Results结果:FLM-101B均高于GLM-130B,且性价比高

Results. Among all the baselines for this evaluation, FLM-101B achieves the second-best passing rates of 89.00%, 59.00%, and 32.33%, respectively, which is an advantage of about 11%, 3%, and 6% compared to GLM-130B. Considering the computational cost, FLM-101B delivers exciting performance.

结果。在本次评估的所有基线中,FLM-101B分别取得了89.00%、59.00%和32.33%的第二高通过率,相对于GLM-130B分别有约11%、3%和6%的优势。考虑到计算成本,FLM-101B表现出令人兴奋的性能。

总结:四项附加评估,FLM-101B在某些任务中优于GLM-130B,并且成本大大降低——两大原因(训练数据+增长策略的优势)

In conclusion, on our four additional evaluations inspired by the IQ tests, FLM-101B outperforms GLM-130B and obtains competitive results compared to GPT-3 in some tasks with much lower costs. Beyond the impact of training data, this superiority may be attributed to the fact that, in the growth strategy, the smaller models in the early stages refine a more efficient search space, which keeps taking effect as the model grows larger with increased generalization ability.

总之,在我们受到智商测试启发的四项附加评估中,FLM-101B在某些任务中优于GLM-130B,并且在成本大大降低的情况下获得了与GPT-3相竞争的结果。除了训练数据的影响外,这种优势可能是由于在增长策略中,早期较小的模型细化了更有效的搜索空间,随着模型规模的扩大和泛化能力的提高,这种优势也会继续发挥作用。

6 Related Work相关工作

将语言模型扩展到100B:FLM-101B性价比最高

Scaling Up Language Models to 100B. The burgeoning advancements in hardware and computational techniques in recent years [47; 52] have laid a robust groundwork for the expansion of language models. The benefits of scaling up LLMs include discernible advantages in language perplexity supported by studies on scaling laws [23; 18; 19; 77], as well as the emergent cognitive competencies in models [69; 4].

In the realm of 100+ billion parameters, examples of closed-source pre-trained LLMs include GPT-3 [3], Gopher [42], and PaLM [1]. For closed-source models trained on Chinese data, notable mentions are Ernie 3.0 [63], Pangu-Σ [48], and InternLM [57]. Turning our attention to open-source variants, OPT [81] and BLOOM [49] are among the counterparts to GPT-3; the Llama [58; 59] series strategically operates on a slightly reduced scale (approximately 70B parameters) but amplifies the data to 2T. GLM-130B [80] is an open-source bilingual model with decent performance in both Chinese and English tasks. Nevertheless, the development trajectory and cost of GLM-130B remain largely inaccessible to many academic and industrial entities. FLM-101B is an exemplary paradigm for achieving comparable performance with a relatively small $100K budget. It is our aspiration that this model serves as a catalyst, expediting research advancements and making them more economically feasible in this domain.

将语言模型扩展到100B。近年来硬件和计算技术的飞速发展[47; 52]为语言模型的扩展奠定了坚实的基础。扩大LLM的好处包括缩放定律研究[23; 18; 19; 77]所支持的在语言困惑度上的明显优势,以及模型中涌现的认知能力[69; 4]。

在1000亿以上参数领域,闭源的预训练LLM示例包括GPT-3 [3]、Gopher [42]和PaLM [1]。对于在中文数据上训练的闭源模型,值得一提的有Ernie 3.0 [63]、Pangu-Σ [48]和InternLM [57]。至于开源变体,OPT [81]和BLOOM [49]是GPT-3的对应模型;Llama [58; 59]系列在稍小的规模(约70B参数)上运作,但将数据量扩大到2T。GLM-130B [80]是一个开源的双语模型,在中英文任务中均表现良好。然而,GLM-130B的开发轨迹和成本在很大程度上仍是许多学术和工业实体无法企及的。FLM-101B是在相对较小的10万美元预算下实现可比性能的典范。我们希望这个模型能够作为催化剂,加速该领域的研究进展,并使其在经济上更加可行。

与人类对齐:有证据表明LLM已出现推理能力,但仍需增强遵循指令的能力,并对齐人类偏好

Aligning with Humans. Despite the evidence that foundation LLMs present reasoning abilities in zero/few-shot learning and chain-of-thought prompting [3; 70], further refinement is needed to enhance their abilities to follow instructions [68] and align with human preferences [37; 36; 13; 2]. Supervised fine-tuning releases the potential of LLMs to imitate the instruction-following formats and provide human-like responses in dialogical and problem-solving contexts [66; 73; 34; 26]. Meanwhile, policy optimization methods [50; 43] lead LLMs to generate responses that maximize rewards congruent with human preferences, e.g., being helpful and harmless [12].

On the other hand, although these post-training techniques have proven effective and successful in industrial applications, the scaling laws regarding model sizes persist even after alignment with humans: larger models provide more factual and reasonable responses [16], as well as being better calibrated with their confidence probabilities [22]. We hereby release FLM-101B as a large foundation model, making it an accessible starting point for subsequent alignment studies.

与人类对齐。尽管有证据表明基础LLM在零/少样本学习和思维链提示[3; 70]中表现出了推理能力,但仍需要进一步改进以增强它们遵循指令的能力[68],并与人类偏好对齐[37; 36; 13; 2]。监督微调释放了LLM模仿指令遵循格式、并在对话和问题解决情境中提供类似人类响应的潜力[66; 73; 34; 26]。同时,策略优化方法[50; 43]引导LLM生成最大化符合人类偏好的奖励的响应,例如有帮助且无害[12]。

另一方面,尽管这些后训练技术在工业应用中被证明是有效且成功的,但即使在与人类对齐之后,关于模型规模的缩放规律仍然存在:更大的模型提供更符合事实、更合理的响应[16],并且其置信概率的校准也更好[22]。我们在此发布FLM-101B作为一个大型基础模型,使其成为后续对齐研究的可用起点。

LLM评估:基础模型评估的三类基准(NLP+CK+PK)、微调的模型评估其人类对齐能力;本文通过重新组织现有数据集+创建新数据集,增加基于智商测试的额外评估

LLM Evaluation. Widely-used approaches to evaluate LLMs include natural language processing benchmarks [74; 61], commonsense knowledge benchmarks [9; 79; 27], and professional knowledge benchmarks [17; 20]. For chatbots after fine-tuning, automatic and semi-automatic playgrounds are developed to evaluate their human alignment abilities [83]. Although knowledge-oriented ability is important, the results can be substantially impacted by training data and domains. To measure other classes of abilities, existing research like Big-Bench [53] and babi-20 [72] include some sub-tasks relevant to IQ tests, while others still depend more on NLP and knowledge. In this work, we add additional ranges of evaluation in the IQ-test paradigms by re-organizing existing datasets as well as creating new ones where appropriate.

LLM 评估。广泛使用的LLM评估方法包括自然语言处理基准NLP[74; 61]、常识知识基准CK[9; 79; 27]和专业知识基准PK[17; 20]。

对于经过微调的聊天机器人,已经开发了自动和半自动的playgrounds,用于评估人类对齐能力[83]。尽管知识相关的能力很重要,但训练数据和领域会对结果产生实质性影响。

为了衡量其他类别的能力,像Big-Bench [53]和babi-20 [72]这样的现有研究包括一些与智商测试相关的子任务,而其他任务更依赖于自然语言处理和知识。在这项工作中,我们通过重新组织现有数据集以及创建新的数据集,在智商测试范式下添加了额外的评估范围。

模型增长:FLM-101B首次尝试增长策略来训练100B+规模LLM

Model Growth. A line of existing work studies the progressive expansion of structures in training Transformer-like models [14; 51; 15; 6; 39; 62; 78]. To our knowledge, FLM-101B presents the first attempt to use a growth strategy to train LLMs in the 100B+ scale. For a more comprehensive summary, please refer to [78].

模型增长。一系列现有工作研究了在训练Transformer类模型时逐步扩展结构[14; 51; 15; 6; 39; 62; 78]。据我们所知,FLM-101B是首次尝试使用增长策略训练100B+规模LLM的模型。有关更全面的总结,请参阅[78]。

7 Conclusions and Future Work结论与未来工作

FLM-101B(仅用10万美元预算)优于基线模型——降低训练成本的关键思想:利用增长策略突破模型参数的固定数量

In this paper, we introduce FLM-101B, an open-source LLM that is successfully trained from scratch within a $100,000 budget. The key idea of reducing the training cost of FLM-101B is to utilize the growth strategy to break through the fixed number of model parameters. To fairly evaluate LLMs, we conduct a set of evaluations inspired by IQ tests. We believe that along this pathway, better IQ evaluation methods will continue to emerge in future studies. Experimental results show that FLM-101B outperforms strong baseline models under the same computational cost.

本文介绍了FLM-101B,这是一个开源LLM,成功地在10万美元的预算内从头开始训练。减少FLM-101B训练成本的关键思想是利用增长策略突破模型参数的固定数量。为了公平评估LLM,我们进行了一系列受智商测试启发的评估。我们相信,沿着这条道路,未来研究将继续涌现出更好的智商评估方法。实验结果表明,在相同的计算成本下,FLM-101B优于强基线模型。

LLM的潜力(通向AGI的重要可能技术路径之一)→未来趋势(构建有强大推理能力但不具备大量知识的基本LLM+再领域扩展来更好的支持应用)

The power of LLMs is very exciting. We believe that LLMs are one of the important possible technical paths to AGI. For the sustainable development of LLMs, we believe that it may be an effective path to construct a basic LLM with strong reasoning capabilities but not a large amount of knowledge (for cost saving), and then expand the knowledge of the LLM in different domains to better support applications. Besides, our exploration on the growth strategy as well as training stability would potentially be beneficial for future attempts of further scaling up LLMs, e.g., beyond 1T parameters.

LLM的潜力非常令人兴奋。我们认为LLM是通向AGI的重要可能技术路径之一。为了LLM的可持续发展,我们认为,构建一个有强大推理能力但不具备大量知识的基本LLM(以节省成本),然后在不同领域扩展LLM的知识以更好地支持应用可能是一条有效的路径。此外,我们对增长策略以及训练稳定性的探索可能有助于未来尝试进一步扩展LLM,例如超过1T参数。

Acknowledgments致谢

This work is supported by the National Key R&D Program of China (2022ZD0116300) and the National Science Foundation of China (NSFC No. 62106249). We would like to thank Hanxiao Qu, Yan Tian, Xigang Cao, Xiaolong Zhang, Kailong Xie and Conghui Guo for their help on computational resources, Quanyue Ma, Hanyu Zhao, Yihui Guo and Jiahong Leng for their help on data, and all other colleagues’ strong supports for this project.

本工作得到了中国国家重点研发计划(2022ZD0116300)和中国国家自然科学基金(NSFC No. 62106249)的支持。我们要感谢Hanxiao Qu、Yan Tian、Xigang Cao、Xiaolong Zhang、Kailong Xie和Conghui Guo为本项目提供的计算资源支持,感谢Quanyue Ma、Hanyu Zhao、Yihui Guo和Jiahong Leng在数据方面提供的帮助,以及所有其他同事对本项目的大力支持。
