GPT-3 Summary

GPT-3 has drawn a lot of attention recently. Based on the GPT-3 paper, this post summarizes GPT-3 along with some of the other few-shot learning methods the paper mentions.

GPT-3:

Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. We find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans.

Task-Agnostic Meta-Learning (the following is adapted from an article by Guojun Qi, the paper's author)

Gradient-descent-based training has two hyperparameters that are not learnable in the traditional machine-learning framework: (1) the initial model parameters and (2) the step size of each update. Model parameters are usually set by random initialization, but since most deep-learning models are non-convex, the learning outcome depends heavily on the random initial conditions; a good initialization has a large impact on how well the model learns. One important use of meta-learning is therefore to learn, via learning itself, an initialization that is suitable across multiple tasks, so that for these training tasks and the broader family of future tasks they represent, updating the model from this initialization yields a good new model faster. "Faster" here means that with only a small number of training samples and a few gradient-descent steps, we can expect to obtain a suitable model for the new task (i.e., few-shot learning).
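To make this concrete, below is a minimal sketch of MAML-style learning of an initialization on toy 1-D regression tasks. It is an illustration under assumed details (PyTorch, one inner gradient step, a made-up task distribution and learning rates), not the paper's actual setup:

```python
import torch

def forward(w, b, x):
    return w * x + b

def loss_fn(w, b, x, y):
    return ((forward(w, b, x) - y) ** 2).mean()

# The shared initialization that meta-learning optimizes.
w0 = torch.zeros(1, requires_grad=True)
b0 = torch.zeros(1, requires_grad=True)
meta_opt = torch.optim.SGD([w0, b0], lr=1e-2)
inner_lr = 0.1

for step in range(200):
    meta_loss = 0.0
    for _ in range(4):  # sample a small batch of tasks y = a * x
        a = torch.randn(1)
        x_s, x_q = torch.randn(10), torch.randn(10)  # support / query data
        y_s, y_q = a * x_s, a * x_q
        # Inner loop: one gradient step away from the shared initialization.
        gw, gb = torch.autograd.grad(loss_fn(w0, b0, x_s, y_s), [w0, b0],
                                     create_graph=True)
        w, b = w0 - inner_lr * gw, b0 - inner_lr * gb
        # Outer objective: how well the adapted model does on held-out data.
        meta_loss = meta_loss + loss_fn(w, b, x_q, y_q)
    meta_opt.zero_grad()
    meta_loss.backward()  # differentiate through the inner update
    meta_opt.step()       # improve the initialization itself
```

After training, a few gradient steps from (w0, b0) should fit a new task from this family with very little data, which is exactly the few-shot behavior described above.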

Classic meta-learning methods ignore an important issue when learning an optimal initialization over multiple tasks: how to guarantee that the learned initialization is unbiased across all tasks. A likely situation is that the initialization works well for some tasks but not particularly well for others; in that case, the meta-learner is biased across tasks.

To address this, the authors propose a task-agnostic, unbiased meta-learning method. They add a regularization term to the initial model that forces it to treat different tasks "equally". Concretely, for a classification task, task-agnosticism can be achieved by directly maximizing the entropy of the initial model's predictions across classes (entropy maximization). For general tasks such as regression or reinforcement learning, which are usually defined and optimized through a loss function or a reward function, we can treat the negative loss or the reward as the "income" of each task, and then use economic measures of income inequality to characterize the meta-learner's bias across tasks. For example, the widely used Gini coefficient can measure the meta-learner's bias across tasks; alternatives include the generalized entropy (GE) index and the Theil index. These inequality measures have different characteristics and can focus on tasks in particular loss or reward ("income") ranges. They also satisfy several properties that make them well suited as inequality measures, such as symmetry, scale invariance, non-negativity, and the principle of transfers. By minimizing the inequality measure, we obtain a meta-learner that is unbiased across tasks.
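As an illustration of the inequality idea, here is a small sketch of a Gini-coefficient penalty over per-task losses; the penalty weight and the loss values are made up for illustration, not values from the paper:

```python
import torch

def gini(losses, eps=1e-12):
    """Gini coefficient of per-task losses, treated as 'incomes'.
    0 means every task incurs the same loss; larger values mean the
    initialization favors some tasks over others."""
    x = losses.flatten()
    n = x.numel()
    pairwise = (x.unsqueeze(0) - x.unsqueeze(1)).abs().sum()
    return pairwise / (2 * n * n * x.mean() + eps)

# Hypothetical per-task losses under the current initialization.
task_losses = torch.tensor([0.9, 1.1, 0.4, 2.0])
# Meta-objective: average performance plus an inequality penalty, so
# minimizing it pushes toward a task-unbiased initialization.
meta_objective = task_losses.mean() + 0.5 * gini(task_losses)
```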

The problem with this approach, according to the GPT-3 paper, is that "this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples." This leads to the following problems:

First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models. (That is, fine-tuning still requires a fairly large dataset, and many tasks cannot provide data for fine-tuning.)

Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution. This can create problems for the pre-training plus fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then fine-tuned on very narrow task distributions.

For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) “one-shot learning”, where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only an instruction in natural language is given to the model.
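A sketch of what the three settings look like as raw prompts, using the English-to-French translation example from the paper; the model simply completes the text after the final arrow, with no weight updates:

```python
# Building zero-, one-, and few-shot prompts for in-context learning.
task_description = "Translate English to French:"
demos = [
    "sea otter => loutre de mer",
    "peppermint => menthe poivrée",
    "plush giraffe => girafe peluche",
]
query = "cheese =>"

zero_shot = "\n".join([task_description, query])
one_shot = "\n".join([task_description, demos[0], query])
few_shot = "\n".join([task_description, *demos, query])
```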

From the corresponding figure in the paper: Model performance improves with the addition of a natural language task description, and with the number of examples in the model’s context, K. Few-shot learning also improves dramatically with model size. Though the results in this case are particularly striking, the general trends with both model size and number of examples in-context hold for most tasks we study. We emphasize that these “learning” curves involve no gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning.

We also train a series of smaller models (ranging from 125 million parameters to 13 billion parameters) in order to compare their performance to GPT-3 in the zero, one and few-shot settings. Broadly, for most tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.

The advantages and disadvantages of fine-tuning:

The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task and the potential for poor generalization out-of-distribution. The paper specifically notes that GPT-3 itself can be fine-tuned, and that this is one of its future research directions.

Why is one-shot separated out from few-shot and zero-shot? Because one-shot most closely matches the way tasks are communicated to humans.

GPT-3's architecture:

We use the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.
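A sketch of what "locally banded" attention means: each position attends only to a recent window of previous positions. The window size here is illustrative; GPT-3's actual sparse pattern follows the Sparse Transformer:

```python
import torch

def banded_causal_mask(seq_len, window=4):
    """True where attention is allowed: causal (no future tokens)
    and local (only the last `window` positions)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & ((i - j) < window)

print(banded_causal_mask(6, window=3).int())
# A dense layer would use only the causal condition (j <= i).
```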

Below is Jay Alammar's introduction to GPT-3, using a trained model as the example:

The model is presented with an example. We only show it the features and ask it to predict the next word.

The model’s prediction will be wrong. We calculate the error in its prediction and update the model so next time it makes a better prediction.

Repeat millions of times:
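A minimal sketch of that loop, assuming PyTorch; the toy model, vocabulary size, and random "text" are stand-ins for the real transformer and training corpus:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(             # stand-in for the real transformer
    torch.nn.Embedding(1000, 64),
    torch.nn.Linear(64, 1000),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(0, 1000, (8, 32))       # a batch of token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]

for step in range(3):                    # in reality, millions of times
    logits = model(inputs)               # predict the next word everywhere
    loss = F.cross_entropy(logits.reshape(-1, 1000), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()                      # measure the prediction error
    opt.step()                           # update for a better next prediction
```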

How does a system process the word “robotics” and produce “A”?

High-level steps (sketched in code after this list):

  1. Convert the word to a vector (list of numbers) representing the word
  2. Compute prediction
  3. Convert resulting vector to word
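A toy version of those three steps, assuming PyTorch; the sizes are made up, and an Identity module stands in for the transformer layer stack:

```python
import torch

vocab, d_model = 1000, 64
embed = torch.nn.Embedding(vocab, d_model)   # 1. word -> vector
layers = torch.nn.Identity()                 # 2. stand-in for the layer stack
unembed = torch.nn.Linear(d_model, vocab)    # 3. vector -> word scores

token_id = torch.tensor([42])                # e.g. the id for "robotics"
hidden = layers(embed(token_id))             # compute the prediction
next_id = unembed(hidden).argmax(dim=-1)     # most likely next word's id
```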

See all these layers? This is the “depth” in “deep learning”.

Each of these layers has its own ~1.8B parameters to make its calculations. That is where the “magic” happens. This is a high-level view of that process.
