Summary
generator: GRU, optimized with policy gradient; reward = self reward + differential reward; goes from coarse ranking to fine ranking
evaluator: Bi-LSTM + self-attention, cross-entropy loss; ranks the final list
Details
generator
The GRU is treated as a policy. Its reward has two parts: self reward + differential reward.
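As a rough illustration of how a two-part reward could enter a policy-gradient objective, here is a minimal REINFORCE-style sketch. The function name `policy_gradient_loss`, the plain sum of the two rewards, and the averaging over steps are assumptions made for illustration; the notes only state that the reward has two parts.

```python
import math
from typing import Sequence


def policy_gradient_loss(
    step_probs: Sequence[float],
    self_rewards: Sequence[float],
    diff_rewards: Sequence[float],
) -> float:
    """REINFORCE-style surrogate loss: -(1/T) * sum_t r_t * log pi_t.

    step_probs[t] is the probability the policy assigned to the item it
    picked at step t. The per-step reward r_t is assumed to be the plain
    sum of the self reward and the differential reward.
    """
    if not (len(step_probs) == len(self_rewards) == len(diff_rewards)):
        raise ValueError("all inputs must have the same length")
    if not step_probs:
        raise ValueError("at least one step is required")

    total = 0.0
    for prob, r_self, r_diff in zip(step_probs, self_rewards, diff_rewards):
        reward = r_self + r_diff          # assumed combination of the two parts
        total += reward * math.log(prob)  # reward-weighted log-likelihood
    return -total / len(step_probs)
```

In a real model the probabilities would come from the GRU's output distribution and the loss would be minimized by an autodiff framework; this sketch only shows how the scalar is assembled.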
self reward
$r^{self}(x_o^t \mid u, O) = E(x_o^t \mid u, O; \Theta^E)$
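The self reward can be read as "the evaluator's score for the item at step t, given the user and the whole list". A minimal sketch under that reading follows. The `Evaluator` callable type and `toy_evaluator` are hypothetical stand-ins for the trained Bi-LSTM + self-attention model, not its actual interface.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: given a user and a list, return one score per item.
Evaluator = Callable[[str, Sequence[str]], List[float]]


def self_reward(evaluator: Evaluator, user: str,
                final_list: Sequence[str], t: int) -> float:
    """Return the evaluator's score for the item at position t of final_list."""
    scores = evaluator(user, final_list)
    return scores[t]


def toy_evaluator(user: str, items: Sequence[str]) -> List[float]:
    """Toy stand-in (assumption): scores decay with position and ignore content."""
    return [1.0 / (i + 2) for i in range(len(items))]
```

For example, `self_reward(toy_evaluator, "u1", ["a", "b", "c"], 1)` looks up the toy score of item `"b"` in the context of the full list.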