Reference: Fast Transformer Decoding: One Write-Head is All You Need
Problem analysis
1. Transformer training is fast because the decoder processes the whole sequence in one parallel pass, but incremental inference is slow because each step's output must be fed back as the input to the next step.
2. The author attributes the slowness to the memory-bandwidth cost of repeatedly loading the large key and value tensors (see the sketch after the quote below).
Quoted from the paper:
While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large “keys” and “values” tensors.
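To make the bottleneck concrete, below is a minimal NumPy sketch of incremental decoding with a key/value cache; the shapes, sizes, and function names are illustrative assumptions, not the paper's code. The point is that every decoding step re-reads the full cached K and V tensors, which is exactly the memory traffic the quote describes.

```python
import numpy as np

def attention_step(q, K, V):
    # q: (h, d) queries for the current position, one per head
    # K, V: (h, t, d) cached keys/values for all t previous positions
    logits = np.einsum("hd,htd->ht", q, K) / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return np.einsum("ht,htd->hd", weights, V)

h, d, steps = 8, 64, 16                     # hypothetical sizes
K_cache = np.zeros((h, 0, d))
V_cache = np.zeros((h, 0, d))
rng = np.random.default_rng(0)

for t in range(steps):
    # In a real decoder, q, k, v come from projecting the previous step's output.
    q = rng.standard_normal((h, d))
    k = rng.standard_normal((h, 1, d))
    v = rng.standard_normal((h, 1, d))
    K_cache = np.concatenate([K_cache, k], axis=1)   # cache grows by one position
    V_cache = np.concatenate([V_cache, v], axis=1)
    # Each step re-reads the entire (h, t, d) caches: O(h * t * d) memory traffic.
    out = attention_step(q, K_cache, V_cache)
```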
Overall approach
1. Multi-query attention: all query heads share a single set of keys and values (a minimal sketch follows).
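As a rough illustration of the idea (a sketch under assumed shapes, not the paper's TensorFlow pseudocode): the keys and values lose their head dimension, so every query head attends against the same K and V.

```python
import numpy as np

def multi_query_attention(Q, K, V):
    # Q: (h, n, d)  h query heads over n query positions
    # K, V: (m, d)  one shared key/value head over m memory positions
    logits = np.einsum("hnd,md->hnm", Q, K) / np.sqrt(Q.shape[-1])
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return np.einsum("hnm,md->hnd", weights, V)

h, n, m, d = 8, 4, 16, 64                   # hypothetical sizes
rng = np.random.default_rng(0)
Q = rng.standard_normal((h, n, d))
K = rng.standard_normal((m, d))             # no head dimension: shared by all h heads
V = rng.standard_normal((m, d))
out = multi_query_attention(Q, K, V)        # (h, n, d)
# During incremental decoding the K/V cache shrinks from (h, m, d) to (m, d),
# cutting the per-step memory traffic by roughly a factor of h.
```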
Implementation details
To be continued.