【2】Pretrained Transformers As Universal Computation Engines


Perry 彭儒


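Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/pengru120/article/details/122039109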


Paper title: Pretrained Transformers As Universal Computation Engines - CoRR 2021

Original paper: https://arxiv.org/abs/2103.05247

1. Abstract

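The authors study how well a Transformer pretrained on language generalizes to other modalities with minimal finetuning, in particular without finetuning the residual blocks (the self-attention and FFN layers).

They call this pretrained model the Frozen Pretrained Transformer (FPT) and finetune it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction.

Compared with models pretrained and finetuned on data from the same modality, pretraining on natural language improves both performance and compute efficiency on non-language downstream tasks. FPT is also compared against a randomly initialized Transformer and an LSTM.

[Figure: overall performance]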

2. Introduction & Methodology

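Introduction: In short, GPT-2 is pretrained on a large text corpus and then finetuned on small datasets for different tasks. Note: finetuning touches only the linear input and output layers, the positional embeddings, and the layer norm parameters.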
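The finetuning rule above amounts to a filter over parameter names. Below is a minimal sketch of that filter in plain Python. The parameter names are hypothetical, written in the style of a GPT-2 checkpoint; `input_proj` and `output_proj` are names assumed here for the new linear input and output layers, and a real implementation may name things differently.

```python
# Hypothetical parameter names in the style of a GPT-2 checkpoint.
# The real names depend on the implementation that is loaded.
PARAM_NAMES = [
    "input_proj.weight",       # new linear input layer (assumed name)
    "wpe.weight",              # positional embeddings
    "h.0.ln_1.weight",         # layer norm before self-attention
    "h.0.attn.c_attn.weight",  # self-attention weights
    "h.0.ln_2.weight",         # layer norm before the FFN
    "h.0.mlp.c_fc.weight",     # FFN weights
    "ln_f.weight",             # final layer norm
    "output_proj.weight",      # new linear output layer (assumed name)
]

# Substrings marking the parameter groups that are finetuned.
TRAINABLE_MARKERS = ("input_proj", "output_proj", "wpe", "ln_")


def is_trainable(name: str) -> bool:
    """Return True if the parameter belongs to a finetuned group."""
    return any(marker in name for marker in TRAINABLE_MARKERS)


def split_parameters(names):
    """Split parameter names into (trainable, frozen) lists."""
    trainable = [n for n in names if is_trainable(n)]
    frozen = [n for n in names if not is_trainable(n)]
    return trainable, frozen


trainable, frozen = split_parameters(PARAM_NAMES)
print("trainable:", trainable)
print("frozen:", frozen)
```

In a deep learning framework the same predicate would decide which parameters receive gradients; the self-attention and FFN weights stay frozen.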

Methodology:

I. Tasks:

Bit memory, Bit XOR, ListOps (numerical computation tasks)

MNIST, CIFAR-10, CIFAR-10 LRA (image classification tasks)

Remote homology detection (protein fold prediction)

II. The Transformer model itself is not described again here.
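To make one of the numerical tasks concrete, here is a minimal sketch of a Bit XOR example generator. It assumes, based on my reading of the paper, that the model sees two bitstrings and must predict their element-wise XOR; the function name, the default length of 5, and the flat concatenated input layout are illustrative choices, not the authors' code.

```python
import random


def make_bit_xor_example(n_bits=5, rng=None):
    """Build one (inputs, target) pair for an element-wise XOR task.

    inputs is the concatenation of two random bitstrings of length n_bits;
    target is their element-wise XOR.
    """
    rng = rng or random.Random()
    a = [rng.randint(0, 1) for _ in range(n_bits)]
    b = [rng.randint(0, 1) for _ in range(n_bits)]
    inputs = a + b
    target = [x ^ y for x, y in zip(a, b)]
    return inputs, target


inputs, target = make_bit_xor_example(rng=random.Random(0))
print("inputs:", inputs)
print("target:", target)
```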

3. Empirical Evaluations

3.1 Can pretrained language models transfer to different modalities?

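Conclusion: across the 7 tasks, FPT is roughly on par with a Transformer trained from random initialization and better than an LSTM trained from random initialization.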

3.2 What is the importance of the pretraining modality?

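Random initialization (Random): a randomly initialized GPT-2.

Bit memory pretraining (Bit): pretrained on the Bit Memory dataset.

Image pretraining (ViT): pretrained on ImageNet-21K.

Conclusion: across the 7 tasks, FPT is best overall, while the other pretrained models do better on data from their own modality.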

3.3 How important is the transformer architecture compared to LSTM architecture?

Trans.: randomly initialized Transformer

LSTM: randomly initialized LSTM

LSTM*: + 12 layers + residual connections + positional embeddings

Conclusion: the Transformer architecture has a clear advantage over the LSTM.

3.4 Does language pretraining improve compute efficiency over random initialization?

Conclusion: FPT converges much faster than the randomly initialized Transformer.

3.5 Do the frozen attention layers attend to modality-specific tokens?

Conclusion: FPT attends to semantically meaningful patterns in the data, but only on the Bit XOR task.

3.6 Does freezing the transformer prevent overfitting or underfitting?

Conclusion: FPT underfits, which can be improved by increasing model capacity. The Linformer results suggest that Transformers overfit in low-data regimes.

3.7 Does performance scale with model size?

Conclusion: unlike a Transformer trained from scratch, FPT does not overfit or fail to converge as model capacity grows.

3.8 Can performance be attributed simply to better statistics for initialization?

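Conclusion: a model that keeps only the per-layer mean and standard deviation of the FPT weights (the Statistics Only model) performs between FPT and the Random Transformer.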
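As I understand it, the Statistics Only baseline replaces each layer's pretrained weights with fresh Gaussian samples that share that layer's mean and standard deviation. The sketch below illustrates the idea on flat lists of floats. The function name and the toy weights are made up for illustration, and the paper's exact procedure may differ.

```python
import random
import statistics


def statistics_only_init(pretrained_layers, seed=0):
    """Resample each layer from N(mean, std) of its pretrained weights.

    pretrained_layers is a list of layers, each a flat list of floats.
    Only the per-layer mean and standard deviation carry over.
    """
    rng = random.Random(seed)
    new_layers = []
    for weights in pretrained_layers:
        mean = statistics.fmean(weights)
        std = statistics.pstdev(weights)
        new_layers.append([rng.gauss(mean, std) for _ in weights])
    return new_layers


# Toy "pretrained" weights, two layers of four values each (made up).
pretrained = [[0.1, -0.2, 0.3, 0.0], [1.0, 1.5, 0.5, 1.0]]
resampled = statistics_only_init(pretrained)
print(resampled)
```

If such a model only lands between FPT and the Random Transformer, then good initialization statistics explain part of the gain, but not all of it.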

3.9 Can we train a transformer by only finetuning the output layer?

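Conclusion: when FPT is used only as a feature extractor for linear classification (two tasks, Table 10): 1) convergence is faster; 2) performance drops and the model overfits (the features are not regularized).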

3.10 What is the role of model depth in token mixing?

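With finetuning layernorm: with few layers, using pretrained layers helps token mixing; by 6 layers there is no difference.

Without finetuning layernorm: the Random model never works, while the Pretrained model does, but it needs enough layers to recover the original performance.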

3.11 Can training more parameters improve performance?

Conclusion: finetuning the FFN layers can improve performance; on CIFAR-10, finetuning only the last attention layer works best.

3.12 Which parameters of the model are important to finetune?

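An ablation that finetunes only selected parameters, to see which ones the results are most sensitive to.

Conclusion: + layernorm, + input, and + positions all help, with + layernorm helping the most.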

3.13 Is finetuning layer norm necessary for FPT to perform well?

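Only the input and output layers are finetuned, treating the whole FPT as a black box.

Conclusion: internal modulation through the affine layer norm parameters helps, similar to adding finer-grained positional information.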

3.14 How well do the trends hold across other transformer models?

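Other Transformer variants are used, such as BERT, T5, and Longformer.

Conclusion: the finding that natural-language pretraining improves performance and compute efficiency on non-language downstream tasks also holds for these models.

4 Related Work and Discussion (omitted)

5 Conclusion (omitted)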
