[Paper Learning] Prompt-Based Monte-Carlo Tree Search for Goal-oriented Dialogue Policy Planning

Prompt-Based Monte-Carlo Tree Search for Goal-oriented Dialogue Policy Planning

Authors: Xiao Yu, Maximillian Chen, Zhou Yu

Abstract

Planning for goal-oriented dialogue often requires simulating future dialogue interactions and estimating task progress.

Many approaches thus consider training neural networks to perform look-ahead search algorithms such as A* search and Monte Carlo Tree Search (MCTS).

However, such training often requires abundant annotated data, which may be noisy or scarce in low-resource settings.

We introduce GDP-ZERO, an approach using Open-Loop MCTS to perform goal-oriented dialogue policy planning without any model training.

GDP-ZERO prompts an LLM to act as a policy prior, value function, user simulator, and system model during the tree search.

We evaluate GDP-ZERO on the goal-oriented task PersuasionForGood, and its responses are preferred over ChatGPT up to 59.32% of the time.

Code is available here.

1. Introduction

In many goal-oriented conversation tasks, interacting parties must take the initiative, executing conversational strategies to lead the conversation toward a desired outcome (e.g., a successful negotiation or effective emotional support). It is therefore imperative to have high-quality dialogue policy planners that can prescribe an ‘optimal’ strategy at each turn of the dialogue.

Optimal policy planning is a difficult task. In many goal-oriented tasks such as persuasion, individual persuaders may adopt different strategies, making it difficult to train or evaluate a policy planner. Moreover, ‘optimality’ in these complex tasks may require expert domain knowledge (e.g., negotiation skills).

In this work, we contribute a novel approach, Goal-oriented Dialogue Planning with Zero training (GDP-ZERO). GDP-ZERO prompts an LLM to perform planning by simulating future dialogue interactions (see Figure 1).

Unlike previous approaches, we treat policy planning as a stochastic game, and use prompting for every stage of an open-loop tree search.

We evaluate GDP-ZERO on PersuasionForGood due to its difficult planning task (Wang et al., 2019).

Figure 1: Using GDP-ZERO for persuasion with zero model training.

2. Related Work

Prompting Methods Prompting has largely focused on dialogue response generation, conversation synthesis, and dialogue understanding. To date, prompting has not been used for policy planning.

Dialogue Policy Planning Research on dialogue policy planning can be categorized into neural-focused and algorithm-focused approaches.

  • Neural-focused approaches use annotated dialogues to train dedicated classifiers or value functions that predict the next dialogue act without explicit look-ahead planning. For many goal-oriented dialogues, however, both the annotated strategies and the dialogue responses can be suboptimal or noisy, as different people can respond differently even given the same context.
  • To reduce the reliance on labeled datasets, much work has combined neural networks with search algorithms such as A* search and tree search. However, these methods still require model training for dialogue simulation or value function estimation, and are therefore highly dependent on training data quality.

3. Method (important!!!)

In this work, we introduce GDP-ZERO, an algorithm-focused dialogue policy planner for goal-oriented dialogue tasks like persuasion. GDP-ZERO uses zero model training and instead performs Open-Loop MCTS at decision time, prompting an LLM to simulate user and system responses, evaluate current task progress, and predict a prior over next dialogue acts. Our approach has two main differences from existing policy planning work (a sketch of the prompted roles follows the list below):

  • we use few-shot prompting to bypass the need for model training on noisy data.
  • we use Open-Loop MCTS to reduce compounding simulation errors by continuously re-generating system and user responses during the tree search.
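
To make the four prompted roles concrete, the sketch below shows how a single LLM backbone might be asked to play each of them. These prompt strings are illustrative assumptions only, not the paper's actual templates (see Appendix B for those); the persuasion-specific wording is guessed from the P4G task.

```python
# Illustrative prompts for the four roles GDP-ZERO assigns to one LLM.
# These are NOT the paper's exact templates (see Appendix B); they only
# sketch the idea of reusing a single prompted model for every component.

def policy_prior_prompt(history: str, dialogue_acts: list[str]) -> str:
    """Ask the LLM which persuasion strategy the system should use next."""
    return (
        f"The following is a persuasion dialogue:\n{history}\n"
        f"Which strategy should the persuader use next? "
        f"Choose one of: {', '.join(dialogue_acts)}."
    )

def system_model_prompt(history: str, dialogue_act: str) -> str:
    """Ask the LLM to realize a chosen dialogue act as a system utterance."""
    return (
        f"The following is a persuasion dialogue:\n{history}\n"
        f"Write the persuader's next reply, using the strategy '{dialogue_act}'."
    )

def user_simulator_prompt(history: str) -> str:
    """Ask the LLM to respond as the user (persuadee)."""
    return (
        f"The following is a persuasion dialogue:\n{history}\n"
        f"Write the persuadee's next reply."
    )

def value_prompt(history: str) -> str:
    """Ask the LLM to judge how likely the conversation is to succeed."""
    return (
        f"The following is a persuasion dialogue:\n{history}\n"
        f"How likely is the persuadee to donate? Answer with a number in [0, 1]."
    )
```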

3.1 Problem Definition

We first formulate planning as a Markov Decision Process (MDP).
A $t$-turn dialogue between a user and a system can be defined as
$$h=(a_{0}^{sys}, u_{1}^{sys}, u_{1}^{usr}, \dots, a_{t-1}^{sys}, u_{t}^{sys}, u_{t}^{usr}),$$
where $a_{i}^{sys}$ is the system's dialogue act at turn $i$, $u_{i}^{sys}$ is the system's response, and $u_{i}^{usr}$ is the user's utterance at turn $i$.

We define the task of planning the next system dialogue act $a_{i+1}^{sys}$ as an MDP problem $\langle S, A, R, P, \gamma \rangle$ (a minimal code encoding of this formulation follows the list below):

  • $a_{i}^{sys}$ represents an action $a_i \in A$ taken by the system at the $i$-th turn;
  • the corresponding dialogue history $h=(a_{0}^{sys}, u_{1}^{sys}, u_{1}^{usr}, \dots, a_{t-1}^{sys}, u_{t}^{sys}, u_{t}^{usr})$ also represents a state $s_i \in S$;
  • $R$ is a reward function over $(s, a)$, denoted $R(s,a)$, representing the likelihood of a desired conversational outcome, such as persuading a user to donate to a charity;
  • $P: S \times A \rightarrow S$ is the transition function, representing the probability of transitioning from state $s_i$ to state $s_{i+1}$ after executing $a_i$ at the $i$-th turn;
  • $\gamma \in [0,1)$ is the discount factor.
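
As a concrete (assumed) encoding of this formulation, the state can be stored as the running dialogue history and an action as a dialogue-act label. The example act names below are taken from P4G-style strategies and are illustrative.

```python
# A minimal, assumed encoding of the MDP in Section 3.1: the state is the
# dialogue history h, and an action is the system's next dialogue act.
from dataclasses import dataclass, field

DialogueAct = str  # e.g. "emotional appeal", "logical appeal" (P4G-style strategies)

@dataclass
class Turn:
    sys_act: DialogueAct   # a_i^{sys}: the system's dialogue act at turn i
    sys_utt: str           # u_i^{sys}: the system's utterance
    usr_utt: str           # u_i^{usr}: the user's utterance

@dataclass
class State:
    """A state s_i is simply the dialogue history h up to turn i."""
    turns: list = field(default_factory=list)  # list[Turn]

    def render(self) -> str:
        """Flatten the history into text so it can be placed into an LLM prompt."""
        return "\n".join(
            f"Persuader ({t.sys_act}): {t.sys_utt}\nPersuadee: {t.usr_utt}"
            for t in self.turns
        )
```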

3.2 Dialogue Planning as a Stochastic MDP


However, when simulating dialogue interactions during tree search, generating a slightly improbable system or user response for a state $s$ and storing it in the search tree could lead to a large compounding error for the rest of the subtree. This is because the state space representing all possible responses is large, and dialogue responses are diverse. We thus treat dialogue policy planning as a stochastic MDP, where the simulated next state $s' \leftarrow P(s,a)$ is drawn from a large unknown distribution and might not be representative of the most probable $s'$. Unlike previous uses of closed-loop MCTS for dialogue, which assume a deterministic transition, this formulation allows a potentially different $s'$ to be returned given the same dialogue context $s$ and system action $a$.
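
A sketch of this open-loop transition, under the assumption that the LLM backbone is exposed as a simple text-in/text-out callable: every call re-samples fresh system and user utterances, so repeated visits to the same $(s, a)$ pair can yield different successor states.

```python
import random
from typing import Callable

LLM = Callable[[str], str]  # assumed text-in / text-out interface to the backbone model

def open_loop_transition(history: str, dialogue_act: str, llm: LLM) -> str:
    """Stochastic transition P(s, a): sample a fresh system utterance realizing
    `dialogue_act`, then a fresh simulated user reply, and append both to the
    history. Calling this twice with the same (s, a) may return different s'."""
    sys_utt = llm(f"{history}\nPersuader (use strategy '{dialogue_act}'):")
    usr_utt = llm(f"{history}\nPersuader: {sys_utt}\nPersuadee:")
    return f"{history}\nPersuader: {sys_utt}\nPersuadee: {usr_utt}"

# Usage with a dummy backbone (a real backbone would be a sampled LLM call):
if __name__ == "__main__":
    dummy_llm = lambda prompt: random.choice(["Sure.", "Tell me more.", "I'm not sure."])
    print(open_loop_transition("Persuadee: Hi!", "emotional appeal", dummy_llm))
```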

3.3 GDP-ZERO

Figure 2: GDP-ZERO with a ChatGPT backbone. During Selection, simulations are either sampled from the cache or newly generated. During Expansion and Evaluation, we prompt ChatGPT for the prior policy $\pi$ and value estimation.

Selection Given a tree state $s^{tr}$, the action $a^*$ with the highest Predictor Upper Confidence Tree Bound (PUCT) is selected to traverse the tree:
$$PUCT(s^{tr}, a)=Q(s^{tr},a)+c_p\frac{\sqrt{\sum_{a}N(s^{tr},a)}}{1+N(s^{tr},a)}$$

where $N$ records the number of times a $(s^{tr}, a)$ pair has been visited, and $c_p$ is a hyperparameter controlling exploration (see Appendix A for details). We repeat this process until $s^{tr}$ becomes a leaf node.
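
A minimal sketch of the selection step, implementing the PUCT expression exactly as written above (AlphaZero-style variants additionally weight the exploration term by the prior $p(a \mid s)$). The use of $Q_0$ as the initial value for unvisited actions is an assumption based on Appendix A, where $Q_0$ is said to promote exploration.

```python
import math

def puct_select(actions, Q, N, c_p=1.0, Q0=0.25):
    """Pick a* = argmax_a PUCT(s, a), with
    PUCT(s, a) = Q(s, a) + c_p * sqrt(sum_a N(s, a)) / (1 + N(s, a)).
    Q and N map each candidate dialogue act to its mean value and visit count;
    unvisited actions fall back to the optimistic initial value Q0."""
    total_visits = sum(N.get(a, 0) for a in actions)
    def puct(a):
        return Q.get(a, Q0) + c_p * math.sqrt(total_visits) / (1 + N.get(a, 0))
    return max(actions, key=puct)

# Example: the under-explored action can win despite a lower mean value.
acts = ["logical appeal", "emotional appeal"]
print(puct_select(acts, Q={"logical appeal": 0.6, "emotional appeal": 0.4},
                  N={"logical appeal": 8, "emotional appeal": 1}))
```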

Expansion Once a leaf node is reached, we treat an LLM $M_{\theta}$ as a prior policy by prompting it to generate a distribution over next dialogue acts. This is done by sampling $M_{\theta}$ at temperature $\tau = 1.0$ for $m$ times and converting the sampled dialogue acts into a distribution (see Appendix A).
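
A sketch of how the sampled dialogue acts could be turned into a prior distribution, assuming the backbone is sampled $m$ times and each reply is parsed into one of the known dialogue-act labels (the parsing step is an assumption):

```python
import random
from collections import Counter
from typing import Callable, Sequence

def llm_prior(sample_act: Callable[[], str], dialogue_acts: Sequence[str], m: int = 10) -> dict:
    """Prompt-based prior policy: sample the LLM m times (at temperature 1.0),
    keep only replies that parse to a known dialogue act, and normalize the
    counts into a probability distribution over the action space."""
    counts = Counter(a for a in (sample_act() for _ in range(m)) if a in dialogue_acts)
    total = sum(counts.values()) or 1  # avoid division by zero if nothing parsed
    return {a: counts[a] / total for a in dialogue_acts}

# Example with a dummy sampler standing in for the prompted LLM:
acts = ["emotional appeal", "logical appeal", "proposition of donation"]
print(llm_prior(lambda: random.choice(acts), acts, m=10))
```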

Evaluation We model the value of a state $v(s^{tr})$ as the probability that its dialogue context $h^{tr}$ can lead to task success.
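
The value estimate can be sketched as another prompt to the same LLM, for example asking several times how likely the user is to donate and averaging the parsed answers. The prompt wording and parsing below are assumptions; the paper's actual value prompt is shown in Appendix B.

```python
from typing import Callable

def estimate_value(history: str, llm: Callable[[str], str], n_samples: int = 3) -> float:
    """Approximate v(s^{tr}) = P(task success | h^{tr}) by sampling the LLM a few
    times with a hypothetical scoring prompt and averaging the parsed numbers."""
    prompt = (f"{history}\n"
              "On a scale from 0 to 1, how likely is the persuadee to donate? "
              "Answer with a single number.")
    scores = []
    for _ in range(n_samples):
        reply = llm(prompt)
        try:
            scores.append(min(1.0, max(0.0, float(reply.strip().split()[0]))))
        except (ValueError, IndexError):
            continue  # skip replies that do not parse into a number
    return sum(scores) / len(scores) if scores else 0.0
```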

Backpropagation After evaluation, the estimated value is propagated back along the visited path, updating the value estimates $Q(s^{tr}, a)$ and visit counts $N(s^{tr}, a)$ of every $(s^{tr}, a)$ pair traversed during selection.
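
A sketch of the standard MCTS backpropagation update, maintaining $Q(s^{tr}, a)$ as the running mean of the values observed below each visited edge (the discount factor $\gamma$ is ignored here for simplicity):

```python
from dataclasses import dataclass, field

@dataclass
class Node:  # minimal node carrying the per-action statistics used during selection
    Q: dict = field(default_factory=dict)  # dialogue act -> mean value
    N: dict = field(default_factory=dict)  # dialogue act -> visit count

def backpropagate(path, value: float) -> None:
    """Standard MCTS backup: `path` lists the (node, action) pairs visited from
    the root to the evaluated leaf; each edge's visit count is incremented and
    its Q value is moved toward `value` as a running average."""
    for node, action in path:
        node.N[action] = node.N.get(action, 0) + 1
        q = node.Q.get(action, 0.0)
        node.Q[action] = q + (value - q) / node.N[action]
```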

Appendix-A: Additional details on GDP-ZERO

GDP-ZERO requires a generative LLM as a backbone model and takes a dialogue history $h_i$ at the $i$-th turn as input. For each state, GDP-ZERO keeps a cache of size $k$ storing newly generated user and system utterances. We use $c_p=1.0$ and $Q_0 = \{0.0, 0.25, 0.5\}$ to promote exploration.
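
A sketch of the per-state simulation cache, under the assumption that a state generates a fresh utterance until $k$ responses have been cached and thereafter re-samples uniformly from the cache (matching Figure 2, where simulations are "either sampled from cache or newly generated"):

```python
import random
from typing import Callable

def cached_simulation(cache: list, generate: Callable[[], str], k: int) -> str:
    """Open-loop response cache: while fewer than k responses exist for this
    state, call the (prompted) LLM for a new one and store it; afterwards reuse
    a uniformly sampled cached response to bound the number of LLM calls."""
    if len(cache) < k:
        response = generate()
        cache.append(response)
        return response
    return random.choice(cache)
```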

Appendix-B: Prompting Details on P4G


[Figures: prompt templates used on P4G.]

Appendix-C: Ablation Studies

[Figures: ablation study results.]

4. Experiments


4.1 Static Evaluation

[Figures: static evaluation results.]

4.2 Interactive Human Evaluation

[Figure: interactive human evaluation results.]

Appendix-D: Analysis of GDP-ZERO Dialogues

[Figures: example dialogues produced with GDP-ZERO.]

Appendix-E: GDP-ZERO Setup on P4G

[Figures: GDP-ZERO setup details on P4G.]

(The original paper can be found here.)
