2016.9-Jiwei Li-Deep reinforcement learning for dialogue generation-arXiv-Stanford
Abstract
- Recent neural models are short-sighted
- predicting utterances one at a time
- ignoring their influence on future outcomes
- This paper: combine Seq2Seq and RL paradigms
- backbone: encoder-decoder
- Apply deep RL: model future reward
- can optimize long-term rewards designed by system developers
- Policy gradient method
- 3 conversation properties
- informativity, coherence, ease of answering
- Result (compared with standard Seq2Seq models with MLE objective)
- more interactive responses
- manages to foster a more sustained conversation
Method
- Action $a$
- the dialogue utterance to generate.
- the action space is infinite.
- State $[p_i, q_i]$ (the two previous dialogue turns)
- the dialogue history is transformed into a vector representation by feeding the state into the LSTM encoder.
- Policy
- the form of an LSTM encoder-decoder: $p_{RL}(p_{i+1} \mid p_i, q_i)$
- Reward (a sketch of all three terms follows this list)
- Ease of answering: negative log likelihood of responding to the utterance with a dull response (from a small hand-picked list).
- Information flow: negative log of the cosine similarity between the encoder representations of two consecutive turns $p_i$ and $p_{i+1}$ from the same agent, penalizing repetition.
- Semantic coherence: mutual information between the action $a$ and the previous turns, penalizing ungrammatical or incoherent replies.
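A minimal sketch of the three reward terms, assuming hypothetical helpers that are not in the paper: `log_p_forward(target, source)` returns the length-normalized log likelihood of `target` given `source` under $p_{Seq2Seq}$, `log_p_backward` does the same under the backward model, and `h_prev`/`h_curr` are encoder hidden-state vectors of consecutive turns. The weights $(0.25, 0.25, 0.5)$ are the ones reported in the paper; the cosine clipping is our addition.

```python
import numpy as np

# Small hand-picked list S of dull responses (illustrative entries).
DULL_RESPONSES = [
    "I don't know what you are talking about.",
    "I don't know.",
]

def r_ease_of_answering(action, log_p_forward):
    """r1: negative mean length-normalized log likelihood of answering
    `action` with a dull response -- high when dull replies are unlikely."""
    return -np.mean([log_p_forward(dull, action) for dull in DULL_RESPONSES])

def r_information_flow(h_prev, h_curr):
    """r2: negative log cosine similarity between encoder representations
    of two consecutive turns by the same agent -- penalizes repetition."""
    cos = np.dot(h_prev, h_curr) / (np.linalg.norm(h_prev) * np.linalg.norm(h_curr))
    return -np.log(max(cos, 1e-8))  # clipping is our assumption, not the paper's

def r_semantic_coherence(action, state, q_i, log_p_forward, log_p_backward):
    """r3: mutual information between the action and the preceding turns."""
    return log_p_forward(action, state) + log_p_backward(q_i, action)

def total_reward(action, state, q_i, h_prev, h_curr,
                 log_p_forward, log_p_backward,
                 lambdas=(0.25, 0.25, 0.5)):  # weights reported in the paper
    """Weighted sum r = l1*r1 + l2*r2 + l3*r3."""
    return (lambdas[0] * r_ease_of_answering(action, log_p_forward)
            + lambdas[1] * r_information_flow(h_prev, h_curr)
            + lambdas[2] * r_semantic_coherence(action, state, q_i,
                                                log_p_forward, log_p_backward))
```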
- Initialize the Policy Model using the Mutual Information Model
- Training steps (see the sketch after this list):
- Use the pre-trained $p_{Seq2Seq}$ and $p_{Seq2Seq}^{backward}$ as the initial models (supervised Seq2Seq models)
- Generate a list of candidates $\hat{a} \sim p_{RL}$ (note: sampled from $p_{RL}$, not from $p_{Seq2Seq}$)
- Obtain the mutual information score of each pair $(\hat{a}, [p_i, q_i])$ from $p_{Seq2Seq}$ and $p_{Seq2Seq}^{backward}$
- Use the mutual information scores as rewards to tailor $p_{RL}$ toward generating sequences with higher rewards
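A sketch of one such step under assumed model interfaces: `sample` draws a candidate response from $p_{RL}$, `score` returns a length-normalized log likelihood, and `policy_gradient_update` applies one update. These method names, `n_candidates`, and the mean baseline are illustrative choices, not from the paper.

```python
def mutual_info_score(candidate, p_i, q_i, seq2seq, seq2seq_backward):
    """Score (a_hat, [p_i, q_i]) by mutual information: forward plus backward
    length-normalized log likelihoods from the two pretrained Seq2Seq models."""
    source = p_i + " " + q_i
    return (seq2seq.score(candidate, source)
            + seq2seq_backward.score(q_i, candidate))

def init_step(p_rl, seq2seq, seq2seq_backward, p_i, q_i, n_candidates=16):
    """One initialization step: sample candidates from p_RL (not p_Seq2Seq),
    score them, and push p_RL toward higher-scoring sequences."""
    candidates = [p_rl.sample(p_i, q_i) for _ in range(n_candidates)]
    rewards = [mutual_info_score(a, p_i, q_i, seq2seq, seq2seq_backward)
               for a in candidates]
    baseline = sum(rewards) / len(rewards)  # simple variance-reduction baseline
    for a, r in zip(candidates, rewards):
        p_rl.policy_gradient_update(a, (p_i, q_i), advantage=r - baseline)
```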
- Optimize the Policy Model
- use policy gradient methods
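A minimal REINFORCE-style policy-gradient step in PyTorch, using a toy categorical policy over a vocabulary instead of the paper's full LSTM encoder-decoder; the estimator, (reward − baseline) · ∇ log p_RL(a | state), is the standard likelihood-ratio form. All names here are illustrative.

```python
import torch
from torch.distributions import Categorical

# Toy stand-in for p_RL: a linear layer mapping a fixed-size state vector to
# next-token logits (the real policy is an LSTM encoder-decoder).
state_dim, vocab_size = 64, 1000
policy = torch.nn.Linear(state_dim, vocab_size)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)

def reinforce_step(state, reward, baseline=0.0):
    """One policy-gradient update: sample an action, then follow
    (reward - baseline) * grad log p_RL(action | state)."""
    dist = Categorical(logits=policy(state))
    action = dist.sample()
    loss = -(reward - baseline) * dist.log_prob(action)  # ascend the reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action

# Example usage with a random state and a scalar reward.
a = reinforce_step(torch.randn(state_dim), reward=0.7, baseline=0.5)
```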
- Automatic Evaluation
- Length of the dialogue: number of simulated turns before an agent produces a dull response or starts repeating itself.
- Diversity: distinct unigrams and bigrams in the generated responses, scaled by the total number of generated tokens (sketch below).
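A small self-contained sketch of the distinct-n diversity metric (distinct-1 and distinct-2 in the paper): unique n-grams across all generated responses divided by the total number of generated tokens.

```python
def distinct_n(responses, n):
    """distinct-n: count of unique n-grams across all generated responses,
    scaled by the total number of generated tokens."""
    ngrams, total_tokens = set(), 0
    for response in responses:
        tokens = response.split()
        total_tokens += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total_tokens, 1)

# Example: repeated responses lower both distinct-1 and distinct-2.
responses = ["how old are you", "i don't know", "how old are you"]
print(distinct_n(responses, 1), distinct_n(responses, 2))
```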
- Human Evaluation