Xiao Zhou Reads Papers With You, Part 1: What Is New in Inspur's Yuan2?

周博洋K

Last modified 2023-12-02 13:10:25

Tags: deep learning, AIGC

First published 2023-12-02 13:10:08

Article link: https://blog.csdn.net/kingsoftcloud/article/details/134750896

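This is the first post in the Xiao Zhou Reads Papers With You series. It looks at the changes Inspur's Yuan2 makes to the Transformer architecture, namely Localized Filtering-based Attention (LFA), and at its method for predicting distributed training time. LFA preprocesses the QKV input to introduce local dependencies. The post also covers non-uniform pipeline parallelism (Non-uniform PP), which addresses uneven GPU memory allocation. Although some points are debated, Yuan2's innovations are a useful reference for understanding LLM optimization.

(Summary generated automatically by CSDN.)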

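I'm starting a new series, Xiao Zhou Reads Papers With You. I'll post irregularly about papers worth reading, new ones and also older ones that still hold value. Reading them yourself is best if you have the time; if that feels like too much trouble, you can read my summaries here.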

As usual: 1, 2, 3, here's the link...

IEIT-Yuan/Yuan-2.0: Yuan 2.0 Large Language Model (github.com)

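Yuan2 is Inspur's newly released LLM, built on top of Yuan1. (A gripe about Inspur here: Yuan1's pretraining data used to be publicly downloadable, more than 1 TB of corpus with a large share of Chinese text, and that download has now been shut down.)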

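The Yuan2 paper is fairly interesting. Because of the compute required, I can't verify or falsify many of its empirical experiments myself, so let's start with some of the theoretical innovations in the paper.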

1- A heavily modified Transformer (LFA):

To make this easier to follow, I'll paste the Llama2 architecture for comparison.

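The change is obvious almost at a glance: they modified the multi-head attention layer (strictly speaking, it isn't a full rewrite; they only added something in front of it)! What is a Transformer really about? It's all about the attention layer. So why change the core piece?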
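Before reading the paper's reasoning, a toy example may help show what "adding something in front of attention" can buy you. This is only a minimal sketch under my own assumptions, not the LFA module itself: `local_filter` is a hypothetical fixed two-token average, the Q/K projections are taken to be the identity, and the token vectors are made up. It shows that plain dot-product attention scores two tokens with the same content identically no matter where they sit, while a small filter over neighbouring tokens makes the scores depend on local context.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_row(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)

def local_filter(tokens, w_prev=0.5, w_self=0.5):
    """Hypothetical causal filter of width 2 (an assumption for illustration):
    each token is mixed with its predecessor; the first token sees zeros."""
    out = []
    for t, tok in enumerate(tokens):
        prev = tokens[t - 1] if t > 0 else [0.0] * len(tok)
        out.append([w_self * a + w_prev * b for a, b in zip(tok, prev)])
    return out

# Made-up embeddings: positions 1 and 3 hold the same token "a",
# but with different left neighbours ("b" and "c").
a, b, c, q = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]
tokens = [b, a, c, a, q]

# Plain attention (identity projections): the last token attends to all tokens.
plain_w = attention_row(tokens[-1], tokens)

# Same attention, but the input is passed through the local filter first.
filtered = local_filter(tokens)
filtered_w = attention_row(filtered[-1], filtered)

print("plain   :", [round(w, 3) for w in plain_w])
print("filtered:", [round(w, 3) for w in filtered_w])
```

In the plain case the weights on positions 1 and 3 come out equal, because the score only looks at each token's own content. After filtering they differ, because each representation now carries a bit of its neighbour.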

Here is the explanation given in the paper:

Attention, as a basic building block in LLMs, has showed great success across NLP tasks [9,10]. When a sequence is fed in to a language model, attention mechanism learns the weights of each pair of tokens to build the dependencies across the entire input sequence. The mechanism equally treats a token in neighbourhood and that in a distance. However, in natural language, the dependencies of words in neighbourhood are often stronger than the words faraway. The interconnection learned by Attention is global without any prior knowledge of local d