An attention matrix is the [batch_size, sequence_len1, sequence_len2] tensor obtained from a [batch_size, sequence_len1, hidden_size] tensor and a [batch_size, sequence_len2, hidden_size] tensor.
Going the other way is essentially a matrix factorization: a [sequence_len1, sequence_len2] matrix is factored into a [sequence_len1, hidden_size] matrix and a [sequence_len2, hidden_size] matrix, where each hidden_size-dimensional row is the vector representation of one item of the corresponding sequence.
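Both directions can be sketched in NumPy: the forward pass is a batched inner product, and the reverse is a rank-limited factorization (here recovered via SVD). All shapes and names below are illustrative, not from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len1, seq_len2, hidden_size = 2, 4, 6, 8  # hypothetical sizes

# Per-item vector representations of the two sequences (e.g. queries and keys).
q = rng.normal(size=(batch_size, seq_len1, hidden_size))
k = rng.normal(size=(batch_size, seq_len2, hidden_size))

# Batched inner products:
# [batch, seq1, hidden] x [batch, seq2, hidden] -> [batch, seq1, seq2].
scores = np.einsum('bih,bjh->bij', q, k)

# The reverse direction is a matrix factorization: a truncated SVD of one
# [seq1, seq2] score matrix yields two factor matrices whose rank is at
# most min(seq1, seq2) <= hidden_size.
m = scores[0]
u, s, vt = np.linalg.svd(m, full_matrices=False)
a = u * s   # [seq1, rank] -- one vector per item of sequence 1
b = vt.T    # [seq2, rank] -- one vector per item of sequence 2

# a @ b.T reconstructs the original score matrix exactly.
assert np.allclose(a @ b.T, m)
```

Because the score matrix has rank at most hidden_size, the factorization view also explains why a small hidden_size limits how expressive the attention pattern can be.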