Causal Language Models vs. Sequence-to-Sequence (Seq2Seq) Models: Differences and Connections

As an NLP beginner I have already run quite a few model-training tasks, yet somehow I had never properly sorted out the distinction between these two concepts. They all just felt like Transformers to me: some are encoder-decoder, some are decoder-only. Today I finally sat down and worked it out, and it is actually fairly clean. If anything below is wrong, please point it out.

The reason I was confused is that today's LLMs all seem to take a "sequence" as input and produce a "sequence" as output. Shouldn't they all be called Seq2Seq, then? In fact the naming is a bit unfortunate, but the two model families really are distinct. The definitions read rather dryly, so a few concrete task examples make it much clearer!

The key is what kind of task we use them to model.

Seq2Seq: in practice this almost always refers to the encoder-decoder architecture. It traces back to RNNs, and the Transformer made it famous. Hugging Face defines it as:

Encoder-decoder models (also called sequence-to-sequence models) use both parts of the Transformer architecture

It must involve an encoding step!
Some of its most typical tasks are:

  • Machine translation (the original Transformer)
  • Image captioning (ViT)

Why are these two tasks typical? Because both need an encoder to process the input and turn it into a sequence of representations. Image captioning is the classic use case for ViT (Vision Transformer). The output of these tasks is a text sequence, so under the Seq2Seq formulation the input and the output do not live in the same space. In that situation, an encoder is necessary!
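Here is a minimal sketch of what the encoder-decoder workflow looks like in code. It assumes the Hugging Face transformers library; the checkpoint Helsinki-NLP/opus-mt-zh-en is just an illustrative choice, and any Seq2Seq translation model would show the same interface:

```python
# A minimal sketch of Seq2Seq translation with an encoder-decoder model.
# Assumption: the public checkpoint "Helsinki-NLP/opus-mt-zh-en" is used
# purely as an example of an encoder-decoder translation model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-zh-en")

# The encoder first maps the source sentence into hidden states; the decoder
# then generates the target sequence while cross-attending to those states.
inputs = tokenizer("我爱复旦大学", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# expected output along the lines of: "I love Fudan University."
```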

Causal Language Model:
For the tasks above, someone might object: in machine translation, I can perfectly well model English and Chinese in one shared space! I can think of my model as a "continuation" model trained over the world's text, so with a suitable prompt, say "Translate Chinese to English: 我爱复旦大学", the continuation model will simply continue with "I love Fudan University", and that continuation is the translation.
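A minimal sketch of this "translation as continuation" idea with a decoder-only model follows. It again assumes transformers; "gpt2" is only a stand-in for a causal LM, and a small GPT-2 will not actually translate Chinese well. The point is the interface: no encoder, just next-token prediction over one sequence.

```python
# A minimal sketch: a causal LM "translates" by continuing a prompt.
# Assumption: "gpt2" is used only as a placeholder decoder-only checkpoint.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# No encoder: the prompt and the answer live in the same token sequence,
# and the model simply keeps predicting the next token.
prompt = "Translate Chinese to English: 我爱复旦大学\nEnglish:"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```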

This idea was essentially proposed in GPT-2 and made explicit in GPT-3. The following passage is taken from the survey paper "A Survey of Large Language Models":

A basic understanding of this claim is that each (NLP) task can be considered as the word prediction problem based on a subset of the world text. Thus, unsupervised language modeling could be capable in solving various tasks, if it was trained to have sufficient capacity in recovering the world text. These early discussion in GPT-2's paper echoed in the interview of Ilya Sutskever by Jensen Huang: "What the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world…the more accurate you are in predicting the next word, the higher the fidelity, the more resolution you get in this process…"

Exactly. So the key is that the modeling approach differs, and OpenAI's GPT series has stuck firmly with the decoder-only approach.

Nowadays the three terms decoder-only, autoregressive, and causal language model are essentially interchangeable. In one sentence:
Given the preceding sequence, generate what follows.

That is really just the definition of autoregression: predict x_t from x_1, …, x_{t-1}.
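Written out, this is the standard chain-rule factorization that a causal language model is trained to maximize:

```latex
% Autoregressive (causal) language modeling:
% the joint probability of a sequence factorizes left-to-right.
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \dots, x_{t-1}\right)
```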

One point that is easy to get confused about: the decoder of a Seq2Seq model and the decoder of a causal language model work in almost exactly the same way, so why isn't Seq2Seq simply lumped in with auto-regressive causal LMs? Careful! The Seq2Seq decoder (in the Transformer, for instance) does predict the next token from what it has generated so far, but it additionally uses the encoder's information through cross-attention. RNN-based Seq2Seq is the same: the encoder leaves behind hidden states for the decoder to consume. The decoding step itself is still autoregressive, but because the model also conditions on a separately encoded input, Seq2Seq is not called a causal language model.
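A minimal sketch of this contrast in plain PyTorch (no real model; the tensor shapes and names are illustrative assumptions): causal self-attention masks out future positions, while cross-attention lets every decoder position see the whole encoder output.

```python
# Sketch: causal self-attention vs. cross-attention (single head, no batch).
import torch

T_dec, T_enc, d = 5, 7, 16
dec_states = torch.randn(T_dec, d)   # decoder hidden states (queries)
enc_states = torch.randn(T_enc, d)   # encoder outputs (only exist in Seq2Seq)

# Causal self-attention: position t may only look at positions <= t.
causal_mask = torch.tril(torch.ones(T_dec, T_dec, dtype=torch.bool))
self_scores = dec_states @ dec_states.T / d**0.5
self_scores = self_scores.masked_fill(~causal_mask, float("-inf"))
self_attn = torch.softmax(self_scores, dim=-1) @ dec_states

# Cross-attention (Seq2Seq only): every decoder position may look at the
# entire encoder sequence, so no causal mask is applied on the source side.
cross_scores = dec_states @ enc_states.T / d**0.5
cross_attn = torch.softmax(cross_scores, dim=-1) @ enc_states
```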

With that, these concepts are clear. Decoder-only models are now overwhelmingly dominant; the GPT series, for example, is used everywhere for text generation.

There is one more concept that sometimes comes up: Masked Language Modeling (MLM). It has practically become synonymous with encoder-only models, because encoder models such as BERT are almost always pre-trained with an MLM objective. This kind of language model uses purely bidirectional attention, precisely because it is encoder-only.

Hugging Face describes it like this:

Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. This means the model has full access to the tokens on the left and right. Masked language modeling is great for tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.

Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having “bi-directional” attention, and are often called auto-encoding models.

The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence.

Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more generally word classification), and extractive question answering.
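A minimal sketch of masked language modeling in code (assuming transformers; bert-base-uncased is the usual example checkpoint, as in the quote above):

```python
# Sketch: an encoder-only model fills in a masked token using context
# from both sides of the mask (bidirectional attention).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("Paris is the [MASK] of France."):
    print(pred["token_str"], round(pred["score"], 3))
# "capital" should come out on top.
```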

So the two notions largely coincide. That said, encoder models are no longer the mainstream.

I am writing this down because it resolved a question that had bugged me for a long time; hopefully it helps someone else too. If anything is wrong, please let me know.
