These are the notes I took after reading the GPT-2 paper in full, recording the points that struck me as important. Most of it is excerpted directly from the paper; the rest is my own thoughts while reading, plus the places I didn't understand. Corrections and pointers are welcome!
Before diving into the GPT-2 paper, Zhang Junlin's discursive article 《效果惊人的GPT 2.0模型:它告诉了我们什么》 (https://zhuanlan.zhihu.com/p/56865533) makes a great primer; it is really well written!
Abstract
The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks.
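To make the "log-linear" claim concrete, here is a small sketch of my own (not from the paper): if performance scales log-linearly with capacity, then score ≈ a·log(params) + b for some slope a > 0. The four parameter counts below are the actual GPT-2 model sizes, but the scores are purely hypothetical numbers I made up for illustration.

```python
import math

# The four GPT-2 model sizes from the paper (parameter counts).
params = [117e6, 345e6, 762e6, 1542e6]
# Hypothetical task scores, NOT taken from the paper, just to illustrate the fit:
scores = [55.0, 60.1, 63.0, 65.2]

# Log-linear means score ≈ a * log(params) + b, i.e. linear in log(params).
xs = [math.log(p) for p in params]

# Least-squares slope/intercept by hand (no external libraries needed).
n = len(xs)
mx = sum(xs) / n
my = sum(scores) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, scores)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx
print(f"score ≈ {a:.2f} * log(params) + {b:.2f}")
```

A positive slope a is what "increasing capacity improves performance in a log-linear fashion" means: each doubling of parameters buys roughly the same additive gain.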
Introduction
- Current systems are better characterized as narrow experts rather than competent generalists. We would like to move towards more general systems which can perform many tasks – eventually without the need to manually create and label a training dataset for each one.
- Multi-task learning
- The two most ambitious efforts to date have trained on a total of 10 and 17 (dataset, objective) pairs respectively.
- From a meta-learning perspective, each (dataset, objective) pair is a single training example sampled from the distribution of datasets and objectives.
- Cons: Multitask training may need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques.
Approach
- Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution P(output | input). Since a general system should be able to perform many different tasks, even for the same input, it should condition not only on the input but also on the task to be performed. That is, it should model P(output | input, task).
- Task conditioning is often implemented at the architectural level, such as with task-specific encoder-decoder networks. However, language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols. (For example, a translation training example can be written as the sequence (translate to French, english text, french text). My note: since training this model is precisely teaching it knowledge of language, it should naturally also be able to understand task instructions given in language.)
- Language modeling is also able to, in principle, learn the tasks of McCann et al. (2018) without the need for explicit supervision of which symbols are the outputs to be predicted. The problem instead becomes whether we are able to, in practice, optimize the unsupervised objective to convergence.
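The "tasks as symbol sequences" idea above can be sketched in a few lines. This is my own illustration, not code from the paper; the separator tokens and the example strings are all made up, the point is only that the task description, input, and output become one ordinary token stream, so the model effectively sees P(output | input, task) as plain text.

```python
# Minimal sketch (my own, not from the paper): serialize any (task, input, output)
# triple into a single text sequence for a language model to train on.
def as_sequence(task: str, input_text: str, output_text: str) -> str:
    # All three parts become ordinary tokens in one stream; the "=>" separator
    # is an arbitrary illustrative choice.
    return f"{task}: {input_text} => {output_text}"

# The paper's translation example, written as one training sequence:
print(as_sequence("translate to french", "hello world", "bonjour le monde"))
# A different task, same format -- no task-specific architecture needed:
print(as_sequence("answer the question", "who wrote Hamlet?", "Shakespeare"))
```

At inference time, one would feed only the prefix ("translate to french: hello world =>") and let the model continue it, which is exactly the zero-shot task transfer setting the paper is after.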