The paper's main contribution is the idea of weight tying.
In the introduction, the authors point out that the model has an input embedding U and an output embedding V, that the two matrices have the same size, and that both can be used as word embeddings.
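To make the setup concrete, here is a rough formalization in my own notation (C for the vocabulary size, H for the embedding/hidden size; these labels are mine and not a verbatim quote of the paper):

```latex
% Input and output embeddings have the same shape: one row per vocabulary word.
U \in \mathbb{R}^{C \times H}, \qquad V \in \mathbb{R}^{C \times H}
% Input side: word w_t is looked up as the row  U_{w_t}.
% Output side: given the hidden state h_t, the next-word distribution is
p_t = \operatorname{softmax}(V h_t + b)
% Since both U and V contain one H-dimensional vector per word,
% either matrix can be read off as a word-embedding table.
```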
The authors then compare the quality of the input embedding to that of the output embedding and propose the following ways to improve neural network language models:
(i) We show that in the word2vec skip-gram model, the output embedding is only slightly inferior to the input embedding. This is shown using metrics that are commonly used in order to measure embedding quality.
In the traditional word2vec skip-gram model, the input embedding is slightly better than the output embedding (the output embedding is "only slightly inferior to the input embedding").
(ii) In recurrent neural network based language models, the output embedding outperforms the input embedding.
In recurrent-neural-network-based language models, the output embedding outperforms the input embedding.
(iii) By tying the two embeddings together, i.e., enforcing U = V, the joint embedding evolves in a more similar way to the output embedding than to the input embedding of the untied model.
Based on point (ii), the authors ask: why not represent the input embedding and the output embedding with one and the same embedding matrix?
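For concreteness, here is a minimal sketch of what tying U = V looks like in code. This is my own illustrative PyTorch snippet, not the authors' implementation; the LSTM architecture and the sizes are assumptions:

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    """Toy RNN language model with tied input/output embeddings (illustrative only)."""
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)        # input embedding U
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size)         # output embedding V (plus bias)
        # Weight tying: the output projection reuses the input embedding matrix, i.e. U = V.
        self.decoder.weight = self.embed.weight

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.decoder(h)                                    # logits over the vocabulary

model = TiedLM(vocab_size=10000, hidden_size=256)
logits = model(torch.randint(0, 10000, (2, 35)))                  # (batch, seq_len, vocab)
```

Note that the shapes only line up when the embedding size equals the RNN output size; otherwise an extra projection is needed before the output layer.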
(iv) Tying the input and output embeddings leads to an improvement in the perplexity of various language models. This is true both when using dropout or when not using it.
Building on point (iii), the authors find that tying the input and output embeddings makes the model better: perplexity improves across various language models, both with and without dropout.
(v) When not using dropout, we propose adding an additional projection P before V, and apply regularization to P.
When dropout is not used, a projection matrix P (of shape embedding_size × embedding_size) can be inserted before V, with regularization applied to P.
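Continuing the illustrative snippet above, the projection P could be inserted like this; I use a plain L2 penalty here as an assumed stand-in for the paper's regularizer:

```python
import torch
import torch.nn as nn

class TiedLMWithProjection(nn.Module):
    """Tied embeddings plus an extra projection P in front of the output embedding."""
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)              # shared embedding (U = V)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, hidden_size, bias=False)     # projection P, H x H
        self.decoder = nn.Linear(hidden_size, vocab_size)
        self.decoder.weight = self.embed.weight                         # weight tying

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.decoder(self.proj(h))                               # apply V to the projected state

    def proj_penalty(self, lam: float = 1e-4) -> torch.Tensor:
        # Regularization term on P only; add it to the training loss.
        return lam * self.proj.weight.pow(2).sum()
```

During training one would simply add model.proj_penalty() to the cross-entropy loss.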
(vi) Weight tying in neural translation models can reduce their size (number of parameters) to less than half of their original size without harming their performance.
Tying the input embedding and the output embedding reduces the number of model parameters (for translation models, to less than half of the original size) without hurting performance.
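A back-of-the-envelope calculation with made-up numbers (not the paper's configuration) shows why the savings are so large when the embeddings dominate the parameter count; in translation the paper additionally ties the source embedding to the target ones via a shared subword vocabulary:

```latex
% Illustrative numbers only: shared subword vocabulary C = 90\,000, embedding size H = 500.
\text{untied: } \underbrace{3\,CH}_{\text{src emb + tgt emb + output emb}}
  = 3 \times 90{,}000 \times 500 \approx 135\text{M}
\qquad
\text{tied: } \underbrace{CH}_{\text{one shared matrix}} \approx 45\text{M}
% If the encoder/decoder itself has about 40M parameters, the total drops
% from roughly 175M to roughly 85M, i.e. to less than half.
```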
Finally, the authors give their own explanation of why weight tying works.
Don't ask me how this backpropagation is derived; it is just the ordinary backpropagation of the cross-entropy loss.
The authors explain it from the angle of backpropagation. For the matrix V, since every word in the vocabulary takes part in the loss computation, every row of V is updated at each backpropagation step.
U is different: only the row of U corresponding to the current input word is updated. In short, V is updated far more often than U, so its word representations end up better.
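Written out in my own notation, paraphrasing the paper's analysis: let h_t be the hidden state at timestep t, p_t = softmax(V h_t + b) the predicted distribution, o_t the target word, and i_t the input word. The cross-entropy gradients then look like this:

```latex
% Loss at timestep t:  L_t = -\log p_t(o_t).
% Every row k of V receives a nonzero update at every step, because every word's
% score enters the softmax normalizer:
\frac{\partial L_t}{\partial V_k} = \bigl(p_t(k) - \mathbf{1}[k = o_t]\bigr)\, h_t^{\top}
% The forward pass only reads row i_t of U, so all other rows get no gradient:
\frac{\partial L_t}{\partial U_k} = 0 \quad \text{for every } k \neq i_t .
```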