1. Improving the time and space efficiency of self-attention
Linformer: Self-Attention with Linear Complexity (a minimal sketch of its low-rank projection idea is given after the list below)
The related-work section of the paper mentions common techniques for improving Transformer efficiency:
- Mixed precision
Mixed precision training. 2017
fairseq: A fast, extensible toolkit for sequence modeling. 2019
Quantization and training of neural networks for efficient integer-arithmetic-only inference. 2018
Training with quantization noise for extreme fixed-point compression. 2020
- Knowledge distillation
Distilling the knowledge in a neural network. 2015
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. 2019
- Sparse Attention
Generating long sequences with sparse transformers. 2019
Blockwise self-attention for long document understanding. 2019
- LSH Attention
Reformer: The efficient transformer. 2020
- Improving Optimizer Efficiency
Gpipe: Efficient training of giant neural networks using pipeline parallelism. 2019
Training deep nets with sublinear memory cost. 2016
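To make the headline item concrete: Linformer makes self-attention linear in the sequence length by projecting the keys and values along the sequence axis with learned matrices E and F, so the attention map is n × k instead of n × n. Below is a minimal single-head sketch of that idea, not the authors' released code; the class name `LinformerSelfAttention`, the argument names (`seq_len`, `proj_dim`), and the random test input are all illustrative.

```python
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    """Single-head sketch: K and V are projected along the sequence axis
    from length seq_len down to proj_dim, so attention costs O(n * proj_dim)."""
    def __init__(self, d_model, seq_len, proj_dim):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Learned length-wise projections E, F of shape (proj_dim, seq_len).
        self.E = nn.Parameter(torch.randn(proj_dim, seq_len) / seq_len ** 0.5)
        self.F = nn.Parameter(torch.randn(proj_dim, seq_len) / seq_len ** 0.5)
        self.scale = d_model ** -0.5

    def forward(self, x):                                   # x: (batch, n, d)
        Q, K, V = self.q(x), self.k(x), self.v(x)
        K = torch.einsum('kn,bnd->bkd', self.E, K)          # (batch, proj_dim, d)
        V = torch.einsum('kn,bnd->bkd', self.F, V)          # (batch, proj_dim, d)
        attn = torch.softmax(Q @ K.transpose(1, 2) * self.scale, dim=-1)  # (batch, n, proj_dim)
        return attn @ V                                     # (batch, n, d)

x = torch.randn(2, 128, 64)                                 # batch=2, n=128, d=64
print(LinformerSelfAttention(d_model=64, seq_len=128, proj_dim=32)(x).shape)
# torch.Size([2, 128, 64])
```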
2. Data augmentation
Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution
3. Inference acceleration (notes from the mt book)
- Vocabulary selection at the output layer
On Using Very Large Target Vocabulary for Neural Machine Translation. 2015
- Eliminating redundant computation
(1) Share attention weights across layers
Sharing Attention Weights for Fast Transformer. 2019
(2) Share parameters across layers
Recurrent Stacking of Layers for Compact Neural Machine Translation Models. 2019 [code: tf]
- Lightweight decoder and small models
(1) Make the decoder network "shallower" and "narrower"
Consider using knowledge distillation (see Section 7.5.3) or a deep encoder (see Section 7.3.1) together with a decoder network based on a small model
(2) Simplify the Transformer decoder network
① Replace the original Transformer self-attention with an average attention mechanism (see the cumulative-average sketch after this list)
Accelerating Neural Transformer via an Average Attention Network. 2018 [code: tf]
② Replace the attention modules with computationally lighter convolution operations
Pay Less Attention with Lightweight and Dynamic Convolutions. 2019 [paper notes] [code: fairseq] !!!
③ Models based on shared attention weights are another typical kind of lightweight model
Sharing Attention Weights for Fast Transformer. 2019
④ Using heterogeneous networks is another effective way to balance accuracy and speed
The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation. 2018 [code 1: fairseq !!!] [code 2: tensor2tensor]
- Batched inference
- Low-precision computation
- Non-autoregressive translation
- Other
Evolved Transformers
Linear Transformers Are Secretly Fast Weight Memory Systems. 2021 [code: pytorch, fairseq] !!!
DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling. ICLR 2020
DeLighT: Deep and Light-weight Transformer. ICLR 2021 [code: fairseq]
Rethinking Attention with Performers. ICLR 2021 [code: tf; includes a partial PyTorch implementation where the data is randomly initialized]
Efficient transformer for mobile applications. ICLR 2020
Learning Light-Weight Translation Models from Deep Transformer. 2020
Reformer: The efficient transformer. ICLR 2020 [code: trax]
Universal Transformers. ICLR 2019 [code: trax, tensor2tensor]
Depth-adaptive transformer. ICLR 2020
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. 2020 (see the linear-attention sketch below)
Are Pre-trained Convolutions Better than Pre-trained Transformers? 2021
Measuring and Increasing Context Usage in Context-Aware Machine Translation. 2021
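Two of the papers above (Transformers are RNNs; Linear Transformers Are Secretly Fast Weight Memory Systems) linearize attention by replacing the softmax with a kernel feature map, so the term φ(K)ᵀV can be computed once and reused for every query. A minimal non-causal sketch, assuming the φ(x) = elu(x) + 1 feature map used in "Transformers are RNNs"; the function name and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, eps=1e-6):
    """O(n) attention: softmax(Q K^T) V is replaced by
    phi(Q) (phi(K)^T V) / (phi(Q) . sum_n phi(K)), with phi(x) = elu(x) + 1."""
    Q, K = F.elu(Q) + 1, F.elu(K) + 1                              # positive feature maps
    KV = torch.einsum('bnd,bne->bde', K, V)                        # phi(K)^T V, computed once
    Z = 1.0 / (torch.einsum('bnd,bd->bn', Q, K.sum(dim=1)) + eps)  # per-query normalizer
    return torch.einsum('bnd,bde,bn->bne', Q, KV, Z)

Q = K = V = torch.randn(2, 128, 64)
print(linear_attention(Q, K, V).shape)                             # torch.Size([2, 128, 64])
```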
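Likewise, for item ① in Section 3 (Accelerating Neural Transformer via an Average Attention Network), the core trick is to replace decoder self-attention with a cumulative average over earlier positions, which can be maintained incrementally at inference so each decoding step costs O(1) in the target length. A minimal sketch that omits the paper's gating and feed-forward sublayers; the names are illustrative:

```python
import torch

def average_attention(x):
    """Cumulative-average replacement for decoder self-attention:
    position t attends uniformly to positions 0..t.  x: (batch, t, d)."""
    counts = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
    return torch.cumsum(x, dim=1) / counts

# At inference only a running sum and a step counter are kept,
# so each new target position is O(d) instead of O(t * d):
def aan_step(running_sum, step_idx, x_t):
    running_sum = running_sum + x_t
    return running_sum, running_sum / (step_idx + 1)

x = torch.randn(2, 10, 64)
print(average_attention(x).shape)                      # torch.Size([2, 10, 64])
```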
Basics
Machine learning: the meaning of low rank in low-rank matrix factorization, matrix completion, cross-validation
What does it mean for a matrix to be low-rank? (see the numeric sketch below)
Depthwise convolution vs. pointwise convolution (see the PyTorch sketch below)
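On the "meaning of low rank" item: a rank-k matrix factors into an n×k times a k×m product, so a truncated SVD both reveals the rank and gives the best rank-k approximation (Eckart–Young), while storing n·k + k + k·m numbers instead of n·m. A small numpy illustration with arbitrarily chosen sizes:

```python
import numpy as np

n, m, k = 100, 80, 5
A = np.random.randn(n, k) @ np.random.randn(k, m)    # a matrix of rank exactly k

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.sum(s > 1e-8))                               # 5: only k singular values are (numerically) non-zero

# Best rank-k approximation: keep the top-k singular triplets.
A_k = U[:, :k] * s[:k] @ Vt[:k, :]
print(np.allclose(A, A_k))                            # True: the rank-k factors recover A exactly
# Storage: n*k + k + k*m numbers instead of n*m.
```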
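On the depthwise vs. pointwise convolution item: a depthwise convolution applies one spatial filter per input channel (groups = in_channels), a pointwise convolution is a 1×1 convolution that mixes channels, and chaining them gives a depthwise-separable convolution with far fewer weights than a standard convolution. A PyTorch sketch with arbitrary channel sizes:

```python
import torch
import torch.nn as nn

c_in, c_out, k = 32, 64, 3

standard  = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1)               # c_in*c_out*k*k weights
depthwise = nn.Conv2d(c_in, c_in,  kernel_size=k, padding=1, groups=c_in)  # one k*k filter per channel
pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)                          # 1x1 conv mixes channels

x = torch.randn(1, c_in, 28, 28)
y = pointwise(depthwise(x))                  # depthwise-separable convolution
print(y.shape)                               # torch.Size([1, 64, 28, 28])

def n_weights(m):
    return sum(p.numel() for p in m.parameters() if p.dim() > 1)  # ignore biases
print(n_weights(standard), n_weights(depthwise) + n_weights(pointwise))    # 18432 vs 2336
```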