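Self-Attention

$W^q$, $W^k$, and $W^v$ are shared, learnable weight matrices: the same three matrices are applied to every input position. $d$ is the dimension of $K$ (the key vectors).

Multi-head Self-Attention

Source: https://blog.csdn.net/qq_37541097/article/details/118242600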
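To make the roles of $W^q$, $W^k$, $W^v$ and $d$ concrete, here is a minimal sketch of both variants. It assumes the standard scaled dot-product form, softmax(Q K^T / sqrt(d)) V, which these notes do not spell out. It uses plain Python lists so it runs without dependencies. All function names, the output projection `W_o`, and the toy inputs are illustrative assumptions, not code from the linked post.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V, where d is the dimension of each key."""
    d = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V), weights

def self_attention(X, W_q, W_k, W_v):
    """Project every row of X with the same shared W_q, W_k, W_v, then attend."""
    return attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Split Q, K, V column-wise into heads, attend per head, concatenate, project."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_head = len(Q[0]) // num_heads  # assumes the width divides evenly
    concat = [[] for _ in X]
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        head_out, _ = attention([r[lo:hi] for r in Q],
                                [r[lo:hi] for r in K],
                                [r[lo:hi] for r in V])
        for row, out in zip(concat, head_out):
            row.extend(out)
    return matmul(concat, W_o)

# Toy input: 3 tokens with 4 features each (made-up values).
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
# Identity matrices stand in for learned weights to keep the example readable.
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

out, weights = self_attention(X, I4, I4, I4)
multi_head_out = multi_head_self_attention(X, I4, I4, I4, I4, num_heads=2)
```

Each row of `weights` sums to 1 and says how much one token attends to every token. In the multi-head version each head scales by its own smaller key dimension `d_head`. In practice the same computation would be written with NumPy or PyTorch tensors rather than nested lists.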