End-to-End Reinforcement Learning of Dialogue Agents for Information Access

This paper proposes KB-InfoBot, a dialogue agent that provides users with an entity from a knowledge base (KB) by interactively asking for its attributes. All components of the KB-InfoBot are trained in an end-to-end fashion using reinforcement learning. Goal-oriented dialogue systems typically need to interact with an external database to access real-world knowledge (e.g., movies playing in a city). Previous systems achieved this by issuing a symbolic query to the database and adding retrieved results to the dialogue state. However, such symbolic operations break the differentiability of the system and prevent end-to-end training of neural dialogue agents. In this paper, we address this limitation by replacing symbolic queries with an induced "soft" posterior distribution over the KB that indicates which entities the user is interested in. We also provide a modified version of the episodic REINFORCE algorithm, which allows the KB-InfoBot to explore and learn both the policy for selecting dialogue acts and the posterior over the KB for retrieving the correct entities. Experimental results show that the end-to-end trained KB-InfoBot outperforms competitive rule-based baselines, as well as agents which are not end-to-end trainable.
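The episodic REINFORCE update the abstract refers to can be pictured generically: sample actions from the policy for a whole episode, then scale every step's log-probability gradient by the episode return. Below is a minimal NumPy sketch of plain episodic REINFORCE with a toy linear-softmax policy; the paper's modified variant (which also samples from the KB posterior) is not reproduced here, and the state/reward definitions are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy linear-softmax policy over n_act dialogue acts from a state vector.
n_state, n_act = 4, 3
W = np.zeros((n_act, n_state))

def run_episode(W, horizon=5):
    """Sample one episode; collect per-step log-prob gradients and the return."""
    grads, total_r = [], 0.0
    for _ in range(horizon):
        s = rng.normal(size=n_state)      # stand-in dialogue state features
        p = softmax(W @ s)
        a = rng.choice(n_act, p=p)
        # Gradient of log p[a] w.r.t. W for a linear-softmax policy:
        # d log p_a / d W_k = (1[k == a] - p_k) * s
        g = -np.outer(p, s)
        g[a] += s
        grads.append(g)
        total_r += float(a == 0)          # stand-in reward: act 0 is "good"
    return grads, total_r

# Episodic REINFORCE: scale each step's log-prob gradient by the return.
alpha = 0.01
for _ in range(200):
    grads, R = run_episode(W)
    for g in grads:
        W += alpha * R * g
```

The key property exploited by the paper is that this update only needs samples and log-probability gradients, so a differentiable posterior over KB entities can be learned with the same signal.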


Goal-oriented dialogue systems help users complete specific tasks, such as booking a flight or searching a database, by interacting with them in natural language. In this work, we present KB-InfoBot, a dialogue agent that identifies entities of interest to the user from a knowledge base (KB), by interactively asking for attributes of that entity which helps constrain the search. Such an agent finds application in interactive search settings. Figure 1 shows a dialogue example between a user searching for a movie and the proposed KB-InfoBot.


A typical goal-oriented dialogue system consists of four basic components: a language understanding (LU) module for identifying user intents and extracting associated slots (Yao et al., 2014; Hakkani-Tür et al., 2016; Chen et al., 2016), a dialogue state tracker which tracks the user goal and dialogue history (Henderson et al., 2014; Henderson, 2015), a dialogue policy which selects the next system action based on the current state (Young et al., 2013), and a natural language generator (NLG) for converting dialogue acts into natural language (Wen et al., 2015; Wen et al., 2016a). For successful completion of user goals, it is also necessary to equip the dialogue policy with real-world knowledge from a database. Previous end-to-end systems achieved this by constructing a symbolic query from the current belief states of the agent and retrieving results from the database which match the query (Wen et al., 2016b; Williams and Zweig, 2016; Zhao and Eskenazi, 2016). Unfortunately, such operations make the model non-differentiable, and various components in a dialogue system are usually trained separately.
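The four components form a simple pipeline: LU → state tracker → policy → NLG. The schematic sketch below illustrates the data flow only; all class, function, and slot names are invented for illustration and are not the paper's interfaces.

```python
# Schematic goal-oriented dialogue pipeline: LU -> tracker -> policy -> NLG.
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    slots: dict = field(default_factory=dict)   # tracked slot -> value
    history: list = field(default_factory=list)

def understand(utterance: str) -> dict:
    """LU: map an utterance to an intent and slot values (toy keyword rule)."""
    slots = {}
    if "bill murray" in utterance.lower():
        slots["actor"] = "bill murray"
    return {"intent": "inform", "slots": slots}

def track(state: DialogueState, lu_out: dict) -> DialogueState:
    """State tracker: fold newly extracted slots into the running state."""
    state.slots.update(lu_out["slots"])
    state.history.append(lu_out)
    return state

def policy(state: DialogueState) -> str:
    """Dialogue policy: choose the next system act from the current state."""
    return "inform_results" if "actor" in state.slots else "request_actor"

def generate(act: str) -> str:
    """NLG: render a dialogue act as natural language."""
    templates = {"request_actor": "Which actor are you looking for?",
                 "inform_results": "Here are the matching movies."}
    return templates[act]

state = DialogueState()
state = track(state, understand("I want a movie with Bill Murray"))
reply = generate(policy(state))
```

The non-differentiable step the paragraph criticizes sits between the tracker and the policy, where a symbolic database query would normally be issued.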


In our work, we replace SQL-like queries with a probabilistic framework for inducing a posterior distribution of the user target over KB entities. We build this distribution from the belief tracker multinomials over attribute-values and binomial probabilities of the user not knowing the value of an attribute. The policy network receives as input this full distribution to select its next action. In addition to making the model end-to-end trainable, this operation also provides a principled framework to propagate the uncertainty inherent in language understanding to the dialogue policy, making the agent robust to LU errors. Our entire model is differentiable, which means that in theory our system can be trained completely end-to-end using only a reinforcement signal from the user that indicates whether a dialogue is successful or not. However, in practice, we find that with random initialization the agent is unable to see any rewards if the database is large; even when it does, credit assignment is tough. Hence, at the beginning of training, we first have an imitation-learning phase (Argall et al., 2009) where both the belief tracker and policy network are trained to mimic a rule-based agent. Then, on switching to reinforcement learning, the agent is able to improve further and increase its average reward. Such a bootstrapping approach has been shown effective when applying reinforcement learning to solve hard problems, especially those with long decision horizons (Silver et al., 2016).
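The induced "soft" posterior can be sketched as follows: each KB entity is scored by a product over slots, where each slot contributes either the tracker's belief in that entity's value, or (with the "don't know" probability) a uniform match. This is a simplified sketch of the idea only; the paper's exact formulation (including missing KB values and count-based normalization) is more involved, and the toy KB and numbers below are invented.

```python
import numpy as np

# Toy KB: each entity (row) stores a value index for each slot (column).
kb = np.array([[0, 1],
               [1, 1],
               [2, 0]])                        # 3 entities, 2 slots

# Belief tracker output: per slot, a multinomial over that slot's values,
# plus the probability q_j that the user doesn't know / constrain slot j.
slot_beliefs = [np.array([0.7, 0.2, 0.1]),     # slot 0 has 3 values
                np.array([0.4, 0.6])]          # slot 1 has 2 values
p_dontknow = [0.1, 0.5]

def soft_posterior(kb, slot_beliefs, p_dontknow):
    """Soft posterior over KB entities, replacing a symbolic query."""
    n = kb.shape[0]
    scores = np.ones(n)
    for j, (belief, q) in enumerate(zip(slot_beliefs, p_dontknow)):
        # With prob. q the user places no constraint on slot j (uniform
        # mass over entities); otherwise weight by the tracker's belief
        # in each entity's value for that slot.
        scores *= q / n + (1.0 - q) * belief[kb[:, j]]
    return scores / scores.sum()               # normalize to a distribution

post = soft_posterior(kb, slot_beliefs, p_dontknow)
```

Because every operation here is differentiable in the tracker outputs, the posterior can be fed to the policy network and trained end-to-end, which is exactly what the symbolic query prevented.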


