2016.9-Jiwei Li-Deep reinforcement learning for dialogue generation-arXiv-Stanford
Abstract
- Recent neural models are short-sighted
- predicting utterances one at a time
- ignoring their influence on future outcomes
- This paper: combine Seq2Seq and RL paradigms
- backbone: encoder-decoder
- Apply deep RL: model future reward
- can optimize long-term rewards designed by system developers
- Policy gradient method
- 3 conversation properties
- informativity, coherence, ease of answering
- Result (compared with standard Seq2Seq models with MLE objective)
- more interactive responses
- manages to foster a more sustained conversation
Method
- Action $a$
- the dialogue utterance to generate.
- the action space is infinite.
- State $[p_i, q_i]$ (the two previous dialogue turns)
- the dialogue history is transformed into a vector representation by feeding the state into the LSTM encoder.
- Policy
- the form of an LSTM encoder-decoder: $p_{RL}(p_{i+1} \mid p_i, q_i)$
- Reward (a sketch of all three terms follows this list)
- Ease of answering: negative log likelihood of responding to the utterance with a dull response (from a small hand-picked list).
- Information flow: negative log of the cosine similarity between the encoder representations of two consecutive turns $p_i$ and $p_{i+1}$ from the same agent, penalizing repetition.
- Semantic coherence: mutual information between the action $a$ and the previous turns, penalizing ungrammatical or incoherent replies.
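A minimal sketch of the three reward terms, assuming hypothetical helpers that are not in the paper: `log_p_forward(target, source)` returns the length-normalized log likelihood of `target` given `source` under $p_{Seq2Seq}$, `log_p_backward` does the same under the backward model, and `h_prev`/`h_curr` are encoder hidden-state vectors of consecutive turns. The weights $(0.25, 0.25, 0.5)$ are the ones reported in the paper; the cosine clipping is our addition.

```python
import numpy as np

# Small hand-picked list S of dull responses (illustrative entries).
DULL_RESPONSES = [
    "I don't know what you are talking about.",
    "I don't know.",
]

def r_ease_of_answering(action, log_p_forward):
    """r1: negative mean length-normalized log likelihood of answering
    `action` with a dull response -- high when dull replies are unlikely."""
    return -np.mean([log_p_forward(dull, action) for dull in DULL_RESPONSES])

def r_information_flow(h_prev, h_curr):
    """r2: negative log cosine similarity between encoder representations
    of two consecutive turns by the same agent -- penalizes repetition."""
    cos = np.dot(h_prev, h_curr) / (np.linalg.norm(h_prev) * np.linalg.norm(h_curr))
    return -np.log(max(cos, 1e-8))  # clipping is our assumption, not the paper's

def r_semantic_coherence(action, state, q_i, log_p_forward, log_p_backward):
    """r3: mutual information between the action and the preceding turns."""
    return log_p_forward(action, state) + log_p_backward(q_i, action)

def total_reward(action, state, q_i, h_prev, h_curr,
                 log_p_forward, log_p_backward,
                 lambdas=(0.25, 0.25, 0.5)):  # weights reported in the paper
    """Weighted sum r = l1*r1 + l2*r2 + l3*r3."""
    return (lambdas[0] * r_ease_of_answering(action, log_p_forward)
            + lambdas[1] * r_information_flow(h_prev, h_curr)
            + lambdas[2] * r_semantic_coherence(action, state, q_i,
                                                log_p_forward, log_p_backward))
```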
- Initialize the Policy Model using the Mutual Information Model
- Training steps (see the sketch after this list):
- Use the pre-trained $p_{Seq2Seq}$ and $p_{Seq2Seq}^{backward}$ as the initial models (supervised Seq2Seq models)
- Generate a list of candidates $\hat{a} \sim p_{RL}$ (note: sampled from $p_{RL}$, not from $p_{Seq2Seq}$)
- Obtain the mutual information score of each pair $(\hat{a}, [p_i, q_i])$ from $p_{Seq2Seq}$ and $p_{Seq2Seq}^{backward}$
- Use the mutual information scores as rewards to tailor $p_{RL}$ toward generating sequences with higher rewards
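A sketch of one such step under assumed model interfaces: `sample` draws a candidate response from $p_{RL}$, `score` returns a length-normalized log likelihood, and `policy_gradient_update` applies one update. These method names, `n_candidates`, and the mean baseline are illustrative choices, not from the paper.

```python
def mutual_info_score(candidate, p_i, q_i, seq2seq, seq2seq_backward):
    """Score (a_hat, [p_i, q_i]) by mutual information: forward plus backward
    length-normalized log likelihoods from the two pretrained Seq2Seq models."""
    source = p_i + " " + q_i
    return (seq2seq.score(candidate, source)
            + seq2seq_backward.score(q_i, candidate))

def init_step(p_rl, seq2seq, seq2seq_backward, p_i, q_i, n_candidates=16):
    """One initialization step: sample candidates from p_RL (not p_Seq2Seq),
    score them, and push p_RL toward higher-scoring sequences."""
    candidates = [p_rl.sample(p_i, q_i) for _ in range(n_candidates)]
    rewards = [mutual_info_score(a, p_i, q_i, seq2seq, seq2seq_backward)
               for a in candidates]
    baseline = sum(rewards) / len(rewards)  # simple variance-reduction baseline
    for a, r in zip(candidates, rewards):
        p_rl.policy_gradient_update(a, (p_i, q_i), advantage=r - baseline)
```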
- Optimize the Policy Model
- use policy gradient methods
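A minimal REINFORCE-style policy-gradient step in PyTorch, using a toy categorical policy over a vocabulary instead of the paper's full LSTM encoder-decoder; the estimator, (reward − baseline) · ∇ log p_RL(a | state), is the standard likelihood-ratio form. All names here are illustrative.

```python
import torch
from torch.distributions import Categorical

# Toy stand-in for p_RL: a linear layer mapping a fixed-size state vector to
# next-token logits (the real policy is an LSTM encoder-decoder).
state_dim, vocab_size = 64, 1000
policy = torch.nn.Linear(state_dim, vocab_size)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)

def reinforce_step(state, reward, baseline=0.0):
    """One policy-gradient update: sample an action, then follow
    (reward - baseline) * grad log p_RL(action | state)."""
    dist = Categorical(logits=policy(state))
    action = dist.sample()
    loss = -(reward - baseline) * dist.log_prob(action)  # ascend the reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action

# Example usage with a random state and a scalar reward.
a = reinforce_step(torch.randn(state_dim), reward=0.7, baseline=0.5)
```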
- Automatic Evaluation
- Length of the dialogue: number of simulated turns before an agent produces a dull response or starts repeating itself.
- Diversity: distinct unigrams and bigrams in the generated responses, scaled by the total number of generated tokens (sketch below).
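A small self-contained sketch of the distinct-n diversity metric (distinct-1 and distinct-2 in the paper): unique n-grams across all generated responses divided by the total number of generated tokens.

```python
def distinct_n(responses, n):
    """distinct-n: count of unique n-grams across all generated responses,
    scaled by the total number of generated tokens."""
    ngrams, total_tokens = set(), 0
    for response in responses:
        tokens = response.split()
        total_tokens += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total_tokens, 1)

# Example: repeated responses lower both distinct-1 and distinct-2.
responses = ["how old are you", "i don't know", "how old are you"]
print(distinct_n(responses, 1), distinct_n(responses, 2))
```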
- Human Evaluation