A Survey of Large Language Models (Personal Notes)

RSociopath

Last modified 2024-07-31 16:57:50

Tags: language models, artificial intelligence, natural language processing

First published 2024-07-31 16:57:15

Article link: https://blog.csdn.net/RSociopath/article/details/140827296

“1 INTRODUCTION”

“Statistical language models (SLM)”-----“Markov assumption”
“Neural language models (NLM)” -----“neural networks”
“Pre-trained language models (PLM)” -----“biLSTM network”
“Large language models (LLM)” -----“large-sized PLMs”

“three major differences between LLMs and PLMs”

“LLMs display some surprising emergent abilities that may not be observed in previous smaller PLMs.”
“LLMs would revolutionize the way that humans develop and use AI algorithms”
“the development of LLMs no longer draws a clear distinction between research and engineering.”

“2 OVERVIEW”

“2.1 Background for LLMs”

“Scaling Laws for LLMs.”

“KM scaling law”
“Chinchilla scaling law”

“the KM scaling law favors a larger budget allocation in model size than the data size, while the Chinchilla scaling law argues that the two sizes should be increased in equal scales, i.e., having similar values for a and b in Equation”
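
To make the "a and b" remark concrete, here is a minimal sketch of how a compute budget could be split between model size and data size. It assumes the power-law form N_opt = G * (C/6)^a and D_opt = (1/G) * (C/6)^b together with the common approximation C ≈ 6·N·D. The function name, the budget, and the exponent values are illustrative assumptions, not numbers taken from these notes.

```python
def compute_optimal_allocation(compute_flops, a, b, g=1.0):
    """Split a compute budget C into model size N and data size D.

    Assumed form (illustrative): N = g * (C/6)**a, D = (1/g) * (C/6)**b.
    When a + b == 1 this satisfies C = 6 * N * D exactly.
    """
    base = compute_flops / 6.0
    n_params = g * base ** a
    n_tokens = (1.0 / g) * base ** b
    return n_params, n_tokens


budget = 1e21  # hypothetical FLOPs budget
# Illustrative exponents only: a > b favors model size, a == b scales both equally.
for label, a, b in [("model-heavy (KM-like)", 0.73, 0.27),
                    ("equal scaling (Chinchilla-like)", 0.5, 0.5)]:
    n, d = compute_optimal_allocation(budget, a, b)
    print(f"{label}: N ~ {n:.3e} parameters, D ~ {d:.3e} tokens")
```

With a larger than b, most of any extra budget goes into parameters; with a equal to b, parameters and tokens grow at the same rate.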

“Emergent Abilities of LLMs.”

“In-context learning”
“Instruction following.”
“Step-by-step reasoning.”

“Key Techniques for LLMs.”

“Scaling.”
“Training.”
“Ability eliciting.”
“Alignment tuning.” (tuning toward human values)
“Tools manipulation.”

“2.2 Technical Evolution of GPT-series Models”

“Early Explorations.”

“With the advent of Transformer, OpenAI developed two initial GPT models, namely GPT-1 and GPT-2, which can be considered as the foundation to more powerful models subsequently”

“GPT-2 [26] increased the parameter scale to 1.5B, which was trained with a large webpage dataset WebText.”

“Capacity Leap.”

“ICL can teach (or instruct) LLMs to understand the tasks in the form of natural language text.” (ICL: in-context learning.)
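
As a small illustration of what "tasks in the form of natural language text" can look like, the sketch below assembles a few-shot prompt from a task description and some demonstrations. The `Input:`/`Output:` template and the function name are my own assumptions; the notes do not prescribe a format.

```python
def build_icl_prompt(task_description, demonstrations, query):
    """Assemble a few-shot (in-context learning) prompt as plain text.

    demonstrations: list of (input_text, output_text) pairs.
    The "Input:/Output:" layout is a hypothetical template.
    """
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The query is left open so the model completes the final "Output:".
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_icl_prompt(
    "Translate English to French.",
    [("cat", "chat"), ("house", "maison")],
    "dog",
)
print(prompt)
```

No parameters are updated here: the model only sees the demonstrations as part of its input.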

“Capacity Enhancement.”

“OpenAI has explored two major approaches to further improving the GPT-3 model, i.e., training on code data and alignment with human preference”

“Training on code data.”

“Codex [89] was introduced by OpenAI in July 2021, which was a GPT model fine-tuned on a large corpus of GitHub code.”

“there is also a speculation that training on code data can greatly increase the chain-of-thought prompting abilities of LLMs [47]”

“Human alignment.”

“a work that applied reinforcement learning (RL) to learn from the preference comparisons annotated by humans”

“InstructGPT [61] was proposed in January 2022 to improve the GPT-3 model for human alignment, which formally established a three-stage reinforcement learning from human feedback (RLHF) algorithm.”

“OpenAI describes their approach to alignment research in a technical article [113], which has summarized three promising directions: “training AI systems to use human feedback, to assist human evaluation and to do alignment research”.”

“The Milestones of Language Models.”

“ChatGPT.”
“GPT-4.”

“3 RESOURCES OF LLMS”

“3.1 Publicly Available Model Checkpoints or APIs”

“Given the huge cost of model pre-training, well-trained model checkpoints are critical to the study and development of LLMs for the research community.”

“Models with Tens of Billions of Parameters.”

“Flan-T5”

“it explores the instruction tuning from three aspects [64]: increasing the number of tasks, scaling the model size, and fine-tuning with chain-of-thought prompting data.”
“CodeGen”

“good candidate for exploring the code generation ability.”
“mT0”

“a good candidate model, which has been fine-tuned on multilingual tasks with multilingual prompts.”
“PanGu-α”

“good performance in Chinese downstream tasks in zero-shot or few-shot settings,”
“LLaMA”

“exhibited superior performance in tasks related to instruction following”

“Models with Hundreds of Billions of Parameters”

“OPT”

“aims to enable researchers to carry out reproducible research at scale”
“BLOOM (176B version) and BLOOMZ (176B version)”

“the competence in multilingual language modeling tasks”
“OPT-IML”

“good candidates for studying the effect of instruction tuning”

“Public API of LLMs.”

“OpenAI has provided seven major interfaces to the models in GPT-3 series: ada, babbage, curie, davinci (the most powerful version in GPT-3 series), text-ada-001, text-babbage-001, and text-curie-001”

“3.2 Commonly Used Corpora”

“Based on their content types, we categorize these corpora into six groups: Books, CommonCrawl, Reddit links, Wikipedia, Code, and others.”

“3.3 Library Resource”

“Transformers”
“DeepSpeed”
“Megatron-LM”
“JAX”
“Colossal-AI”
“BMTrain”
“FastMoE”

“4 PRE-TRAINING”

“Pre-training establishes the basis of the abilities of LLMs”

“4.1 Data Collection”

“4.1.1 Data Source”

“4.1.2 Data Preprocessing”

“4.1.3 Effect of Pre-training Data on LLMs”

“Mixture of Sources.”

“By pre-training on a mixture of text data from diverse sources, LLMs can acquire a broad scope of knowledge and may exhibit a strong generalization capacity.”
“Amount of Pre-training Data.”

“it is suggested that researchers should pay more attention to the amount of high-quality data for adequately training the model, especially when scaling the model parameters.”
“Quality of Pre-training Data.”

“it is essential to incorporate preprocessing methods on the pre-training corpus carefully (as illustrated in Section 4.1.2), to improve stability of the training process and avoid affecting the model performance.”
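
The notes do not record which preprocessing methods Section 4.1.2 describes, so the following is only a rough sketch of two typical steps, heuristic quality filtering and exact deduplication. The function name and the `min_words` threshold are hypothetical choices for illustration.

```python
import hashlib


def preprocess(documents, min_words=5):
    """Toy corpus cleaning: whitespace normalization, length filter, exact dedup.

    min_words is an arbitrary illustrative threshold, not a recommended value.
    """
    seen = set()
    kept = []
    for doc in documents:
        text = " ".join(doc.split())          # collapse runs of whitespace
        if len(text.split()) < min_words:     # drop very short fragments
            continue
        # Case-insensitive exact-match deduplication via a content hash.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",
    "too short",
]
print(preprocess(corpus))
```

Real pipelines usually add near-duplicate detection and classifier-based filtering; this sketch only shows the overall shape of the step.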

“4.2 Architecture”

“4.2.1 Mainstream Architectures”

“the mainstream architectures of existing LLMs can be roughly categorized into three major types, namely encoder-decoder, causal decoder, and prefix decoder, as shown in Figure 4.”

“Encoder-decoder Architecture.”

“The encoder adopts stacked multi-head self-attention layers to encode the input sequence for generating its latent representations, while the decoder performs cross-attention on these representations and autoregressively generates the target sequence.”
“Causal Decoder Architecture.”

“The causal decoder architecture incorporates the unidirectional attention mask, to guarantee that each input token can only attend to the past tokens and itself.”
“Prefix Decoder Architecture.”

“The prefix decoder architecture (a.k.a., non-causal decoder [169]) revises the masking mechanism of causal decoders, to enable performing bidirectional attention over the prefix tokens [170] and unidirectional attention only on generated tokens.”
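
The difference between the causal and prefix decoders comes down to the attention mask, which the sketch below builds as plain nested lists. Entry `mask[i][j] == 1` means position `i` may attend to position `j`. The function names are my own; this is a minimal illustration, not code from the survey.

```python
def causal_mask(seq_len):
    """Unidirectional mask: each token attends to itself and earlier tokens."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def prefix_mask(seq_len, prefix_len):
    """Prefix-decoder mask: bidirectional inside the prefix, causal afterwards."""
    mask = causal_mask(seq_len)
    p = min(prefix_len, seq_len)
    for i in range(p):
        for j in range(p):
            mask[i][j] = 1  # prefix tokens can see each other in both directions
    return mask


for row in causal_mask(4):
    print(row)
print()
for row in prefix_mask(4, 2):
    print(row)
```

In the second mask only the top-left 2×2 block changes: the two prefix tokens see each other, while the generated tokens stay strictly causal.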

“4.2.2 Detailed Configuration”

we will discuss the corresponding configurations for four major parts of the Transformer, including “normalization, position embeddings, activation functions, and attention and bias”.

“4.2.3 Pre-training Tasks”

“For training LLMs, there are two commonly used pretraining tasks, namely language modeling and denoising autoencoding.”
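
For reference, the two objectives are usually written as log-likelihoods to be maximized. The notation below is my own reconstruction under common conventions: x is a token sequence of length n, and x̃ denotes the corrupted spans that the model has to recover.

```latex
% Language modeling: predict each token from the tokens before it.
\mathcal{L}_{LM}(\mathbf{x}) = \sum_{i=1}^{n} \log P(x_i \mid x_{<i})

% Denoising autoencoding: recover the corrupted spans \tilde{\mathbf{x}}
% from the remaining, uncorrupted part of the input.
\mathcal{L}_{DAE}(\mathbf{x}) = \log P(\tilde{\mathbf{x}} \mid \mathbf{x}_{\backslash \tilde{\mathbf{x}}})
```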

“4.2.4 Summary and Discussion”

“4.3 Model Training”

“4.3.1 Optimization Setting”

“Batch Training”
“Learning Rate.”
“Optimizer.”
“Stabilizing the Training.”
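
The notes only list the headings here. As a rough illustration of the learning-rate and stabilization items, the sketch below shows linear warm-up followed by cosine decay, plus gradient clipping by global norm. The 10% floor and the clipping norm of 1.0 are assumptions for the example, not settings recorded in these notes.

```python
import math


def learning_rate(step, max_lr, warmup_steps, total_steps, min_ratio=0.1):
    """Linear warm-up to max_lr, then cosine decay down to min_ratio * max_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return max_lr * (min_ratio + (1.0 - min_ratio) * cosine)


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


for step in (0, 9, 10, 55, 100):
    print(step, round(learning_rate(step, 3e-4, 10, 100), 8))
print(clip_by_global_norm([3.0, 4.0]))
```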

“4.3.2 Scalable Training Techniques”

“3D Parallelism.”

“3D parallelism is actually a combination of three commonly used parallel training techniques, namely data parallelism, pipeline parallelism [194, 195], and tensor parallelism [66].”
“ZeRO”

“focuses on the issue of memory redundancy in data parallelism.”
“Mixed Precision Training.”
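
To see what "memory redundancy in data parallelism" means in numbers, here is a back-of-the-envelope sketch of per-GPU model-state memory under mixed-precision Adam. It assumes the usual accounting of 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states (master weights, momentum, variance). The function name and the stage numbering are assumptions for illustration, and activation memory is ignored.

```python
def zero_memory_bytes(num_params, dp_degree, stage):
    """Approximate per-GPU bytes for model states under a ZeRO-style partition.

    stage 0: everything replicated on every data-parallel rank
    stage 1: optimizer states partitioned
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= dp_degree
    if stage >= 2:
        grads /= dp_degree
    if stage >= 3:
        params /= dp_degree
    return params + grads + optim


for stage in range(4):
    gib = zero_memory_bytes(7e9, 8, stage) / 2**30
    print(f"stage {stage}: {gib:.1f} GiB per GPU")
```

Without partitioning, every rank holds the full 16 bytes per parameter; partitioning removes that duplication step by step.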

“5 ADAPTATION TUNING OF LLMS”

“increasing studies have shown that LLM’s abilities can be further adapted according to specific goals”

“5.1 Instruction Tuning”

“In essence, instruction tuning is the approach to fine-tuning pre-trained LLMs on a collection of formatted instances in the form of natural language [62], which is highly related to supervised fine-tuning [61] and multi-task prompted training”

“After instruction tuning, LLMs can demonstrate superior abilities to generalize to unseen tasks [28, 62, 64], even in a multilingual setting” (Zhao et al., 2023, p. 19)
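
A "formatted instance" can be pictured as an instruction, an optional input, and the desired output rendered into text. The sketch below shows one possible layout; the field labels and the prompt/completion split are hypothetical and not a format taken from the survey.

```python
def format_instance(instruction, output, input_text=""):
    """Render one instruction-tuning example as a prompt/completion pair.

    The "Instruction:/Input:/Output:" labels are an illustrative template.
    """
    prompt = f"Instruction: {instruction}\n"
    if input_text:
        prompt += f"Input: {input_text}\n"
    prompt += "Output:"
    # During supervised fine-tuning the loss is typically computed on the completion.
    return {"prompt": prompt, "completion": " " + output}


example = format_instance(
    "Classify the sentiment of the sentence.",
    "positive",
    input_text="I really enjoyed this survey.",
)
print(example["prompt"] + example["completion"])
```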

“5.1.1 Formatted Instance Construction”

“5.1.2 Instruction Tuning Strategies”

“instruction tuning is often more efficient since only a moderate number of instances are used for training.” (Zhao et al., 2023, p. 21)

“In addition to these optimization configurations, there are also two important aspects to consider for instruction tuning:”

“Balancing the Data Distribution.”
“Combining Instruction Tuning and Pre-Training.”
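
For the data-balancing item, one simple scheme is to sample each task in proportion to its size, with a cap so that very large datasets do not dominate the mixture. The notes do not say which strategy is used, so treat the function below, its name, and the cap value as assumptions.

```python
def mixing_weights(dataset_sizes, cap):
    """Sampling probabilities proportional to min(size, cap) for each dataset."""
    capped = {name: min(size, cap) for name, size in dataset_sizes.items()}
    total = sum(capped.values())
    return {name: size / total for name, size in capped.items()}


# Hypothetical task collection with very uneven sizes.
sizes = {"translation": 1_000_000, "qa": 50_000, "summarization": 10_000}
for name, weight in mixing_weights(sizes, cap=100_000).items():
    print(f"{name}: {weight:.3f}")
```

Without the cap, the largest task would receive about 94% of the samples; with it, the smaller tasks get a noticeably larger share.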

“5.1.3 The Effect of Instruction Tuning”

“Performance Improvement.”
“Task Generalization.”

“5.2 Alignment Tuning”

“5.2.1 Background and Criteria for Alignment”

“human alignment has been proposed to make LLMs act in line with human expectations”

“It has been shown that alignment might harm the general abilities of LLMs to some extent, which is called alignment tax in related literature”

“Alignment Criteria.”

“Helpfulness.”
“Honesty.”
“Harmlessness.”

“5.2.2 Collecting Human Feedback”

“High-quality human feedback is extremely important for aligning LLMs with human preferences and values.”

“5.2.3 Reinforcement Learning from Human Feedback”

“To align LLMs with human values, reinforcement learning from human feedback (RLHF) [70, 226] has been proposed to fine-tune LLMs with the collected human feedback data,”
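
One piece of RLHF that is easy to show in isolation is how a reward model can learn from pairwise human preferences. The sketch below uses the widely used pairwise ranking loss, -log σ(r_chosen − r_rejected). That this exact loss is what the survey describes is an assumption on my part, and the function name is hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the human-preferred response already scores higher.
print(pairwise_preference_loss(2.0, 0.0))
print(pairwise_preference_loss(0.0, 2.0))
```

The trained reward model then supplies the reward signal for the reinforcement learning stage.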

“5.3 Efficient Tuning”

“6 UTILIZATION”

“major approach to using LLMs is to design suitable prompting strategies for solving various tasks.”

“7 CAPACITY EVALUATION”

(Zhao et al., 2023, p. 30)

“7.1 Basic Evaluation Tasks”

“7.1.1 Language Generation”

“Language Modeling.”

“language modeling aims to predict the next token based on the previous tokens [15], which mainly focuses on the capacity of basic language understanding and generation.”
“Conditional Text Generation.”

“conditional text generation [48] focuses on generating texts satisfying specific task demands based on the given conditions, typically including machine translation [367], text summarization [368], and question answering [369].”
“Code Synthesis.”

“generate formal language, especially computer programs (i.e., code) that satisfy specific conditions, called code synthesis [374].”
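
For the language modeling item above, a standard evaluation metric is perplexity, the exponential of the average negative log-likelihood per token. The notes do not name a metric, so this is an illustrative sketch; the probabilities passed in are assumed to be the model's probabilities for the observed next tokens.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Hypothetical per-token probabilities from some model.
print(perplexity([0.5, 0.25, 0.125, 0.5]))
```

Lower is better: a model that assigns probability 1 to every observed token reaches the minimum perplexity of 1.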

“7.1.2 Knowledge Utilization”

“Knowledge utilization is an important ability of intelligent systems to accomplish knowledge-intensive tasks (e.g., commonsense question answering and fact completion) based on supporting factual evidence.”

“In particular, question answering (QA) and knowledge completion have been two commonly used tasks for evaluating this ability”

“Closed-Book QA.”

“LLMs should answer the question only based on the given context without using external resources.”
“Open-Book QA.”

“LLMs can extract useful evidence from the external knowledge base or document collections, and then answer the question based on the extracted evidence”
“Knowledge Completion.”

“LLMs might be (to some extent) considered as a knowledge base [341], which can be leveraged to complete or predict the missing parts of knowledge units”
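
The open-book setting can be sketched as "retrieve evidence, then answer from it". The toy retriever below ranks documents by word overlap with the question and places the best match into the prompt. The scoring rule, function names, and prompt layout are all hypothetical simplifications; real systems use proper retrievers.

```python
def retrieve(question, documents, top_k=1):
    """Rank documents by the number of lowercase words shared with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_open_book_prompt(question, documents, top_k=1):
    """Put the retrieved evidence in front of the question for the model to use."""
    evidence = retrieve(question, documents, top_k)
    context = "\n".join(f"- {doc}" for doc in evidence)
    return f"Evidence:\n{context}\n\nQuestion: {question}\nAnswer:"


docs = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
]
print(build_open_book_prompt("What is the capital of France?", docs))
```

In the closed-book setting the evidence block would simply be absent and the model would rely on what it learned during pre-training.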

“7.1.3 Complex Reasoning”

“Complex reasoning refers to the ability of understanding and utilizing supporting evidence or logic to derive conclusions or make decisions”

“Knowledge Reasoning.”

“The knowledge reasoning tasks rely on logical relations and evidence about factual knowledge to answer the given question.”
“Symbolic Reasoning”

“The symbolic reasoning tasks mainly focus on manipulating the symbols in a formal rule setting to fulfill some specific goal [51], where the operations and rules may have never been seen by LLMs during pretraining.”
“Mathematical Reasoning.”

“The mathematical reasoning tasks need to comprehensively utilize mathematical knowledge, logic, and computation for solving problems or generating proof statements.”

“7.2 Advanced Ability Evaluation”

“7.2.1 Human Alignment”

“7.2.2 Interaction with External Environment”

“LLMs have the ability to receive feedback from the external environment and perform actions according to the behavior instruction”

“7.2.3 Tool Manipulation”

“By encapsulating available tools with API calls, existing work has involved a variety of external tools”
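
"Encapsulating tools with API calls" can be pictured as a small registry that maps tool names to functions, with the model emitting a structured request that the host program dispatches. The registry contents and the request format below are hypothetical, chosen only to keep the sketch self-contained.

```python
import operator

# Hypothetical tool registry: name -> callable.
TOOLS = {
    "add": operator.add,
    "multiply": operator.mul,
}


def call_tool(request):
    """Dispatch a request of the form {"tool": name, "args": [...]} to a tool."""
    name = request["tool"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](*request["args"])


# A model-generated request (hard-coded here) and the value returned to the model.
print(call_tool({"tool": "multiply", "args": [6, 7]}))
```

The result would normally be appended to the model's context so it can continue generating with the tool output available.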

“8 CONCLUSION AND FUTURE DIRECTIONS”

“Theory and Principle.”

“To understand the underlying working mechanism of LLMs, one of the greatest mysteries is how information is distributed, organized, and utilized through the very large, deep neural network.” (Zhao et al., 2023, p. 36)
“Model Architecture.”

“It is important to investigate the effect of more efficient Transformer variants in building LLMs”
“Model Training.”

“it becomes particularly important to develop more systemic, economical pre-training approaches for optimizing LLMs, considering the factors of model effectiveness, efficiency optimization, and training stability.”
“Model Utilization.”

“it is important to develop more informative, flexible task formatting methods for prompts”
“Safety and Alignment.”

“it is necessary to improve the RLHF framework for reducing the efforts of human labelers and seek a more efficient annotation approach with guaranteed data quality,”
“Application and Ecosystem.”