Pretrained Transformers As Universal Computation Engines (Kevin Lu et al.)
- Abstract
- 1 Introduction
- 2 Methodology
- 3 Empirical Evaluations
- 3.1 Can pretrained language models transfer to different modalities?
- 3.2 What is the importance of the pretraining modality?
- 3.3 How important is the transformer architecture compared to LSTM architecture?
- 3.4 Does language pretraining improve compute efficiency over random initialization?
- 3.5 Do the frozen attention layers attend to modality-specific tokens?
- 3.6 Does freezing the transformer prevent overfitting or underfitting?
- 3.7 Does performance scale with model size?
- 3.8 Can performance be attributed simply to better statistics for initialization?
- 3.9 Can we train a transformer by only finetuning the output layer?
- 3.10 What is the role of model depth in token mixing?
- 3.11 Can training more parameters improve performance?
- 3.12 Which parameters of the model are important to finetune?
- 3.13 Is finetuning layer norm necessary for FPT to perform well?
- 3.14 How well do the trends hold across other transformer models?
- 4 Related Work and Discussion
Abstract
Transformers pretrained on text can be extended to other modalities with little effort.
1 Introduction
Hypothesis: once a transformer has been pretrained on a data-rich modality, it can transfer to other modalities.
To test this hypothesis, finetuning updates only the input and output linear layers, the positional embeddings, and the layer-norm parameters; this setup is abbreviated FPT, for Frozen Pretrained Transformer.
The results show that FPT beats both transformers trained from scratch on the downstream task and LSTMs, and converges faster.
2 Methodology
2.1 Tasks
The experiments use classification tasks spanning multiple modalities (a toy data sketch for the two bit tasks follows the list):
- Bit memory: given 5 bitstrings of length 1000, each bit is masked out with probability 0.5; the task is to recover the masked bits.
- Bit XOR: given two bitstrings of length 5, predict their element-wise XOR.
- ListOps: given a nested sequence of list operations, predict the resulting digit.
- MNIST
- CIFAR-10
- CIFAR-10 LRA: CIFAR-10 converted to grayscale and flattened into a 1D sequence (discarding spatial structure).
- Remote homology detection: predict a protein's fold from its sequence.
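The two bit tasks are easy to synthesize. Below is a minimal sketch of plausible data generators, assuming NumPy; the function names and the use of -1 as a mask marker are illustrative choices, not the paper's exact tokenization.

```python
import numpy as np

def bit_memory_example(n_strings=5, length=1000, mask_prob=0.5, rng=None):
    """Bit memory: show n_strings random bitstrings, then a masked copy of
    one of them; the target is the original (unmasked) string."""
    rng = rng or np.random.default_rng()
    strings = rng.integers(0, 2, size=(n_strings, length))
    idx = rng.integers(n_strings)              # which string to recall
    mask = rng.random(length) < mask_prob      # positions to hide
    masked = np.where(mask, -1, strings[idx])  # -1 marks a masked bit (illustrative)
    return strings, masked, strings[idx]

def bit_xor_example(length=5, rng=None):
    """Bit XOR: two random bitstrings in, their element-wise XOR as the target."""
    rng = rng or np.random.default_rng()
    a = rng.integers(0, 2, size=length)
    b = rng.integers(0, 2, size=length)
    return a, b, a ^ b
```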
2.2 Architecture
- Output layer: a single linear layer.
- Input layer: a single linear layer.
- Layer norm: finetuned.
- Positional embeddings: finetuned (almost no gain, but the compute cost is negligible).
- Transformer: base size (GPT-2 base); see the sketch below.
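A minimal sketch of this setup, assuming PyTorch and the HuggingFace transformers GPT-2 implementation (not the paper's own code); the name-based requires_grad filter relies on HuggingFace's parameter naming (ln_1, ln_2, ln_f for layer norms, wpe for positional embeddings).

```python
import torch.nn as nn
from transformers import GPT2Model

class FrozenPretrainedTransformer(nn.Module):
    """FPT sketch: pretrained GPT-2 base in which only the input/output
    linear layers, positional embeddings, and layer norms are trainable."""
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")  # base size, 12 layers
        d = self.gpt2.config.n_embd                    # 768 for GPT-2 base
        self.in_proj = nn.Linear(input_dim, d)         # trainable input layer
        self.out_proj = nn.Linear(d, num_classes)      # trainable output layer
        # Freeze everything except layer norms and positional embeddings.
        for name, p in self.gpt2.named_parameters():
            p.requires_grad = "ln" in name or "wpe" in name

    def forward(self, x):  # x: (batch, seq_len, input_dim)
        h = self.in_proj(x)
        h = self.gpt2(inputs_embeds=h).last_hidden_state
        return self.out_proj(h)  # per-token logits
```

With this filter, only a small fraction of the parameters receive gradients; the self-attention and feedforward blocks stay at their pretrained values.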
3 Empirical Evaluations
3.1 Can pretrained language models transfer to different modalities?
- The transferred model roughly matches a transformer trained from scratch on the downstream task.
- A base-sized transformer trained from scratch has trouble converging on small datasets, whereas the transferred model converges easily, and scaling it up yields clear performance gains.
3.2 What is the importance of the pretraining modality?
- The pretraining modalities differ, but any pretraining at all beats random initialization in both final performance and convergence speed.
- Image pretraining is more favorable for image downstream tasks, but text pretraining still transfers competitively across modalities.