The User Simulator for Task-completion Dialogue: Paper Walkthrough and Source Code Analysis

Article link: https://blog.csdn.net/weixin_44384749/article/details/116715935

Summary

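Problem: building a suitable domain-specific dataset for a dialogue system is expensive and time-consuming.
Solution: build a user simulator that stands in for the real user. The agent interacts with the simulated user, picks its next action, and receives a reward that drives reinforcement learning.

The paper implements two tasks: movie-ticket booking and movie seeking.

Components of the user simulator: user goal, user action, dialogue status, NLU, NLG, and metrics.
[Figure: components of the user simulator]
Overall flow of the task-completion dialogue system:

The rough flow of the code is as follows. If you spot any mistakes, corrections are welcome. Thanks!
[Figure: rough flow of the code]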

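Section-by-section summary of the paper

Abstract

Key problems the paper sets out to solve:

RL requires interaction with an environment, so a conventional static corpus cannot be used for training directly.
Each task needs its own annotated, task-specific corpus.
For task-oriented dialogue, collecting or annotating a dataset takes a great deal of domain knowledge.

In short: building a suitable dataset is expensive and time-consuming.
Solution: build a user simulator.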

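Advantages of a user simulator:

Because the agent interacts with the simulator, reinforcement-learning dialogue agents can be trained online.
Training on a simulator gives an effective starting point: once agents have mastered the simulator, they can be deployed in a real environment to interact with humans and keep training online.

The user simulator proposed in the paper covers two tasks: movie-ticket booking and movie seeking.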

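Introduction

A practical dialogue system consists of natural language understanding (NLU), natural language generation (NLG), knowledge bases (KBs), state trackers (ST), a dialogue policy (DP), and actions.

In each turn of human-machine interaction:

The dialogue system uses NLU to map the user's utterance to a structured, machine-readable semantic frame.
The KBs and ST maintain and track the evolving internal dialogue state.
Given that state, the DP selects an appropriate dialogue action.
NLG converts the dialogue action into a natural-language reply.

The dialogue system selects actions while interacting with the user and receives the corresponding rewards, which is what makes reinforcement learning applicable.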

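A dialogue policy can be specified explicitly with rules, but the rule-based approach has drawbacks:

It is hard to design a good policy by reasoning alone.
The optimal policy changes as user behaviour changes, and fixed rules cannot cope with this non-stationarity.

RL can therefore replace the rule-based approach and learn the optimal policy automatically from experience.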

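Why user simulation matters

Two families of methods are commonly used to find the optimal policy: supervised learning (SL) and reinforcement learning (RL).

SL: the policy is trained to imitate the actions observed from an expert.
This needs a large amount of labelled data, and the dialogue state space is hard to explore fully, which keeps a supervised learner from finding an optimal policy.
RL: with no expert-generated examples, the agent learns from a reward signal alone.
RL needs samples drawn from an environment, and it cannot start learning from real users.
Hence the RL agent is trained against a user simulator.

User simulator: its purpose is to generate natural, reasonable conversations so that the RL agent can explore the policy space. A simulation-based approach lets the agent explore trajectories that may not appear in previously observed data, which overcomes the central limitation of imitation-based approaches. A dialogue agent trained on a simulator is an effective starting point; it can then be deployed against humans and improved further through RL.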

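Task-completion dialogue system

Through natural-language interaction between the user simulator and the agent, the dialogue system helps the user book movie tickets or find the movie they want.

The agent gathers the customer's preferences and then books the tickets or identifies the movie of interest. When the conversation ends, the environment judges the outcome as success or failure based on (1) whether a movie was booked and (2) whether that movie satisfies the user's constraints.

Dataset:
The data were collected via Amazon Mechanical Turk and annotated with 11 intents (e.g. inform, request, confirm_question, confirm_answer) and 29 slots (e.g. moviename, starttime, theater, numberofpeople).
In total, 280 dialogues in the movie domain were labelled, averaging about 11 turns per dialogue.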

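User simulator

The simulator follows the agenda-based user simulation method: a stack-like representation of the user agenda gives a mechanism for explicitly encoding the dialogue history and the user goal, and state transitions and user action generation can be modelled as sequences of push and pop operations. The rule-based user simulator is described in detail below.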

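User goal

In task-oriented dialogue, the user simulator first generates a user goal. The agent does not know this goal but has to help the user achieve it, and the whole conversation implicitly revolves around it. A user goal has two parts:

inform_slots: a set of slot-value pairs that act as the user's constraints.
request_slots: a set of slots whose values the user does not know but wants to obtain from the agent during the conversation.

To make goals more realistic, the slots in a user goal are subject to some constraints.

For movie-ticket booking:
Required slots: moviename, theater, starttime, date, numberofpeople; the remaining slots are optional.
request_slots: ticket is the default slot (it is the objective of the task).

The user-goal dataset is built with two mechanisms:

The first user turn usually contains some, or even all, of the user's requirements, so the first mechanism extracts every slot from the first user turn (greeting turns excluded).
The second mechanism extracts every slot that first appears in any user turn and merges them all into one user goal.

For simulation, these user goals are dumped to a file that serves as the user-goal database; each time a dialogue is run, one user goal is sampled at random from it.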
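
As a rough illustration of the structure just described, here is a minimal sketch of a user-goal database and the per-dialogue sampling step. The two goals and the helper name sample_goal are made up for this example; the real project loads its goals from the pickled file passed via --goal_file_path.

```python
import random

# Hypothetical miniature "database" of user goals, invented for illustration.
USER_GOALS = [
    {
        "diaact": "request",
        "inform_slots": {"moviename": "zoolander 2", "city": "seattle",
                         "numberofpeople": "2", "date": "tomorrow"},
        "request_slots": {"ticket": "UNK", "theater": "UNK", "starttime": "UNK"},
    },
    {
        "diaact": "request",
        "inform_slots": {"moviename": "deadpool", "theater": "regal meridian 16",
                         "starttime": "9:25 pm", "date": "tomorrow",
                         "numberofpeople": "3"},
        "request_slots": {"ticket": "UNK"},
    },
]


def sample_goal(goals, rng=random):
    """Pick one user goal uniformly at random for a new dialogue."""
    return rng.choice(goals)


goal = sample_goal(USER_GOALS, random.Random(0))
print(sorted(goal["inform_slots"]))   # constraints the user already knows
print(sorted(goal["request_slots"]))  # values the user wants from the agent
```

Note that ticket appears in request_slots of every goal, matching the "default slot" rule above.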

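User action

First user act: in movie-ticket booking the user speaks first, and the content of the user's first turn is generated at random from the user goal.

To keep this realistic, generation is subject to a few restrictions. For example:
the user's first utterance usually carries a request intent;
it contains at least one inform_slot, and if the user knows the moviename, that inform_slot is included in the first turn.

Throughout the conversation the user simulator maintains a stack-like user agenda. The user state is factored into an agenda (A) and a user goal (G), where the goal consists of constraints (C) and requests (R).

At each time step t, the user simulator generates the next user action from its current state and the last system action, and then updates the current state.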
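
The push/pop idea can be sketched with a plain Python list used as a stack. This is a toy illustration of the agenda concept only, not the project's RuleSimulator; the function names, the tuple layout (intent, slot, value), and the "thanks" act used when the agenda is empty are all assumptions of this sketch.

```python
def build_agenda(goal):
    """Stack of pending user acts; the end of the list is the top of the stack."""
    agenda = [("request", slot, None) for slot in goal["request_slots"]]
    agenda += [("inform", slot, value)
               for slot, value in goal["inform_slots"].items()]
    return agenda


def next_user_act(agenda, goal, system_act):
    """Update the agenda from the last system act, then pop the next user act."""
    intent, slot = system_act
    if intent == "request" and slot in goal["inform_slots"]:
        # The agent asked for a constraint: move its answer to the top.
        agenda[:] = [act for act in agenda if act[1] != slot]
        agenda.append(("inform", slot, goal["inform_slots"][slot]))
    elif intent == "inform":
        # The agent supplied a value: the matching request is no longer pending.
        agenda[:] = [act for act in agenda
                     if not (act[0] == "request" and act[1] == slot)]
    return agenda.pop() if agenda else ("thanks", None, None)


goal = {"inform_slots": {"moviename": "zoolander 2"},
        "request_slots": {"starttime": "UNK"}}
agenda = build_agenda(goal)
first = next_user_act(agenda, goal, ("request", "moviename"))
second = next_user_act(agenda, goal, ("inform", "starttime"))
print(first, second)
```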

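To train and test a policy without an NLU component in the loop, an error model is introduced to imitate NLU noise, so that communication between user and agent is noisy. The error model has two noise channels, (1) intent level and (2) slot level, and the slot level has three kinds of noise:

slot deletion: simulates the case where the NLU fails to recognise the slot at all.
incorrect slot value: simulates the case where the slot name is recognised correctly but its value is wrong, e.g. because of bad word segmentation.
incorrect slot: simulates the case where neither the slot nor its value is recognised correctly.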
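
The three slot-level noise types can be sketched as follows. This is a simplified stand-in written for this walkthrough, not the project's error model: the function name corrupt_slots, the slot_values table, and the uniform choice among the three noise types are assumptions.

```python
import random


def corrupt_slots(inform_slots, slot_values, slot_err_prob, rng=random):
    """Return a noisy copy of inform_slots that imitates slot-level NLU errors."""
    noisy = {}
    for slot, value in inform_slots.items():
        if rng.random() >= slot_err_prob:
            noisy[slot] = value                      # recognised correctly
            continue
        kind = rng.choice(["deletion", "wrong_value", "wrong_slot"])
        if kind == "deletion":
            continue                                 # slot is dropped entirely
        if kind == "wrong_value":
            # Keep the slot name, swap in a different value when one exists.
            others = [v for v in slot_values[slot] if v != value] or [value]
            noisy[slot] = rng.choice(others)
        else:
            # Replace the slot with another slot and a random value for it.
            names = [s for s in slot_values if s != slot] or [slot]
            wrong = rng.choice(names)
            noisy[wrong] = rng.choice(slot_values[wrong])
    return noisy


# Hypothetical value table, invented for illustration.
SLOT_VALUES = {"moviename": ["zoolander 2", "deadpool"],
               "city": ["seattle", "portland"]}
clean = {"moviename": "zoolander 2", "city": "seattle"}
print(corrupt_slots(clean, SLOT_VALUES, 0.5, random.Random(0)))
```

With slot_err_prob set to 0.0, as in the run command later in this post, the slots pass through unchanged.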

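If the agent action is inform(task_complete), the agent believes it has gathered all the information it needs and is ready to book the tickets. The user simulator then checks whether its current stack is empty and whether the constraints are satisfied, to verify that the agent booked the right tickets.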

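Dialogue status

A dialogue is in one of three states:

no_outcome_yet: the agent has not issued inform(task_complete) and the maximum number of turns has not been reached.
success: within the maximum number of turns, the agent has answered all of the user's requests and booked the right tickets.
failure: every other case.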
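
The three cases above can be written as one small decision function. This is a sketch under assumptions: the constant values, the function name, and the boolean inputs are chosen for illustration and are not taken from the project's dialog_config.

```python
# Assumed constant values, for illustration only.
NO_OUTCOME_YET, SUCCESS, FAILURE = 0, 1, -1


def dialog_status(agent_done, turn, max_turn, requests_answered, ticket_ok):
    """Classify a dialogue as still running, successful, or failed.

    agent_done        -- the agent has issued inform(task_complete)
    requests_answered -- every user request has received an answer
    ticket_ok         -- the booked ticket satisfies the user's constraints
    """
    if agent_done:
        ok = turn <= max_turn and requests_answered and ticket_ok
        return SUCCESS if ok else FAILURE
    # Not finished yet: keep going unless the turn budget is used up.
    return NO_OUTCOME_YET if turn < max_turn else FAILURE


print(dialog_status(False, 4, 40, False, False))   # still running
print(dialog_status(True, 12, 40, True, True))     # success
print(dialog_status(True, 12, 40, True, False))    # wrong ticket, so failure
```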

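Natural language understanding (NLU)

The NLU is an LSTM; a single model performs intent prediction and slot filling at the same time.

The NLU models intents and slots jointly: the label set is the concatenation of the IOB-format slot tags and the intent tags, and one extra token is appended to the end of each utterance.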
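
To show what such a joint label sequence could look like, here is a small sketch that builds IOB slot tags for an utterance and attaches the intent to the appended end token. The token name `<EOS>`, the helper make_labels, and the choice to put the intent label on that final token are assumptions of this example.

```python
def make_labels(tokens, slot_spans, intent):
    """Build IOB slot tags plus one intent tag on an appended end token.

    slot_spans is a list of (start, end, slot_name) with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, slot in slot_spans:
        tags[start] = "B-" + slot             # first token of the slot value
        for i in range(start + 1, end):
            tags[i] = "I-" + slot             # continuation tokens
    return tokens + ["<EOS>"], tags + [intent]


tokens = ["book", "two", "tickets", "for", "zoolander", "2"]
spans = [(1, 2, "numberofpeople"), (4, 6, "moviename")]
words, labels = make_labels(tokens, spans, "request")
for word, label in zip(words, labels):
    print(word, label)
```

A single sequence tagger trained on pairs like these predicts slots and intent in one pass.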

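Natural language generation (NLG)

The user simulator can operate at the dialogue-act level (emitting structured semantic frames to the agent) or at the utterance level (emitting natural language to the agent). The utterance level requires an NLG component, and the framework provides one. Because the labelled dataset is small, the authors found empirically that a purely model-based NLG does not generalise well and would inject a lot of noise into policy training. The strategy is therefore hybrid, templates first and the model second:
1. Template-based NLG: outputs predefined, rule-based templates for the dialogue acts it covers.
2. Model-based NLG: takes a dialogue act as input and outputs a sentence with slot placeholders (e.g. "I'd like the movie to start at {start-time}"), which are then replaced with the actual values. The decoder uses beam search with a beam size of 3. (The beam-size-3 sentences can also serve as input noise when training the DM.)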
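
The "templates first, model second" strategy might be sketched like this. The template table, the key format, and the model_nlg stub (which simply echoes the act with placeholders instead of running a trained LSTM decoder) are all hypothetical.

```python
# Hypothetical templates keyed by (dialogue act, sorted slot names).
TEMPLATES = {
    ("request", ("starttime",)): "What time would you like to see it?",
    ("inform", ("moviename",)): "I want to watch {moviename}.",
}


def model_nlg(dia_act):
    """Stand-in for the trained model: returns a sentence with placeholders."""
    slots = " ".join("{%s}" % slot for slot in dia_act["inform_slots"])
    return (dia_act["diaact"] + " " + slots).strip()


def generate(dia_act):
    """Use a template when one matches, otherwise fall back to the model."""
    slot_names = list(dia_act["inform_slots"]) + list(dia_act["request_slots"])
    key = (dia_act["diaact"], tuple(sorted(slot_names)))
    sentence = TEMPLATES.get(key) or model_nlg(dia_act)
    # Replace the slot placeholders with the actual values.
    return sentence.format(**dia_act["inform_slots"])


covered = {"diaact": "inform",
           "inform_slots": {"moviename": "zoolander 2"}, "request_slots": {}}
uncovered = {"diaact": "inform",
             "inform_slots": {"city": "seattle"}, "request_slots": {}}
print(generate(covered))     # comes from the template table
print(generate(uncovered))   # falls back to the model stub
```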

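Usage

The user simulator is used to train agents in two settings, task-completion dialogue (movie-ticket booking) and KB-InfoBot:
[1] Task-completion dialogue: the agent's goal is to help the user book a movie successfully.
[2] KB-InfoBot: the agent converses with a user that has only two intents; user goals are completed using those two intents and 6 slots.

Agent quality is measured with three metrics:
success rate (the fraction of dialogues in which the task is completed)
average reward
average turns

Generally speaking, a good policy should have a high success rate, a high average reward, and a low average number of turns. Success rate can be taken as the primary metric for evaluating an agent.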
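
As a quick illustration of how the three metrics relate to per-dialogue results, here is a minimal sketch. The record layout (success flag, total reward, number of turns) and the function name summarize are assumptions; the dictionary keys mirror the success_rate, ave_reward, and ave_turns names used in the source code later in this post.

```python
def summarize(episodes):
    """episodes: list of (success, total_reward, turns) tuples, one per dialogue."""
    n = len(episodes)
    return {
        "success_rate": sum(1 for ok, _, _ in episodes if ok) / n,
        "ave_reward": sum(reward for _, reward, _ in episodes) / n,
        "ave_turns": sum(turns for _, _, turns in episodes) / n,
    }


# Two made-up dialogues: one success, one failure.
records = [(True, 70.0, 10), (False, -50.0, 20)]
stats = summarize(records)
print(stats)
```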

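Discussion

Rule-based simulator: it makes it possible to train an agent with reinforcement learning for task-completion dialogue, but it has an obvious drawback: the rules have to be hand-crafted, which is time-consuming.

Directions for improving the user simulator in the task-completion setting:
[1] Allow the user goal to change during the conversation, which makes conversations more complex and more realistic.
[2] Use a model-based user simulator for task-completion dialogue. Advantage: being data-driven, it can easily be applied to other domains given enough data. Disadvantage: being data-driven introduces uncertainty, which is risky when the simulator is used to train RL agents, because the agents may learn wrong behaviour or reach a spurious dialogue "success".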

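Background

Deep Q Network

Experience replay: transitions are stored in an experience pool.
A neural network computes the Q-values.
The q_target parameters are frozen temporarily (to break correlations); the method is off-policy.

DQN update steps:

Fill the memory with some transitions before learning starts, then learn once every few steps.
Choose an action based on the current observation; taking the action yields the next observation and a reward.
Check whether learning has finished; if not, move on to the next episode and keep learning until it has.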
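
The two DQN ingredients listed above, an experience pool and a frozen target network, can be sketched without any deep-learning library. Here target_q is a hypothetical callable standing in for the frozen network (it returns one Q-value per action), and the class and function names are invented for this example.

```python
import random
from collections import deque


class ReplayPool:
    """Fixed-size experience pool; the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)


def td_targets(batch, target_q, gamma=0.9):
    """y = r for terminal transitions, else r + gamma * max_a Q_target(s', a)."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q(next_state))
        targets.append(reward + bootstrap)
    return targets


pool = ReplayPool(capacity=2)
pool.add("s0", 0, -1, "s1", False)
pool.add("s1", 1, -1, "s2", False)
pool.add("s2", 0, 80, None, True)       # pushes the oldest transition out
frozen_q = lambda state: [1.0, 3.0]     # stand-in for the frozen target network
print(td_targets(list(pool.buffer), frozen_q, gamma=0.5))
```

Because the targets come from the frozen copy rather than from the network being updated, the regression target stays fixed between synchronisations.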

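Components of the User Simulator

Execution flow of the RL-agent code:

Initialise parameters, load the dictionaries and the set of user goals, and split the goals into training, validation and test sets.
Load the movie dataset, the act set, the slot set, and the movie dictionary.
Instantiate AgentDQN, which in turn instantiates DQN:
  input-to-hidden: Wxh, bxh
  hidden-to-output: Wd, bd
  update: Wxh, bxh, Wd, bd
  regularize: Wxh, Wd
Set the user simulator parameters.
Instantiate the user simulator (RuleSimulator).
Load the trained NLG model, which includes:
  hidden_size: 100
  output_size: 1047
  bd: [1, 1047]
  Wd: [100, 1047]
  WLSTM: [1148, 400]
  bah: [1, 400]
  Wah: [1116, 400]
  model: lstm_tanh
Then finish loading the model.
Load the trained NLU model.
Instantiate the Dialog Manager.
Start the simulated dialogue loop.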

User Simulator source code analysis

Run Commands:
python run.py --agt 5 --usr 1 --max_turn 40 --movie_kb_path .\deep_dialog\data\movie_kb.1k.p --goal_file_path .\deep_dialog\data\user_goals_first_turn_template.part.movie.v1.p --intent_err_prob 0.00 --slot_err_prob 0.00 --episodes 500 --act_level 1 --run_mode 1

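Training with the rule-based user simulator

Detailed code flow:

After the parameters are initialised, execution enters the run_episodes function,
where num_episodes is 500 and
status: {'successes': 0, 'count': 0, 'cumulative_reward': 0}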

run_episodes(num_episodes, status)

def run_episodes(count, status):
    successes = 0
    cumulative_reward = 0
    cumulative_turns = 0
    
    if agt == 9 and params['trained_model_path'] is None and warm_start == 1:  # skipped unless all three conditions hold
        print('warm_start starting ...')
        warm_start_simulation()
        print('warm_start finished, start RL training ...')
    
    for episode in range(count):  # training loop
        print("Episode: %s" % (episode))
        # Reset state for the new dialogue:
        # start a new episode, refreshing the current state and the tracked slots
        dialog_manager.initialize_episode()
        """    The output looks like this:
        	Episode: 0
			New episode, user goal:
			{
			  "request_slots": {
			    "ticket": "UNK"
			  },
			  "diaact": "request",
			  "inform_slots": {
			    "city": "seattle",
			    "numberofpeople": "2",
			    "theater": "regal meridian 16",
			    "starttime": "9:25 pm",
			    "date": "tomorrow",
			    "moviename": "zoolander 2"
			  }
			}
			Turn 0 usr: request, inform_slots: {'moviename': 'zoolander 2', 'theater': 'regal meridian 16'}, request_slots: {'ticket': 'UNK'}
		"""
        episode_over = False
        
        # Loop until the dialogue ends
        while not episode_over:
            # Run one agent/user exchange; returns whether the dialogue is over and the reward
            episode_over, reward = dialog_manager.next_turn()
            # Accumulate the reward
            cumulative_reward += reward
            # When the dialogue is over
            if episode_over:
                if reward > 0:
                    print("Successful Dialog!")
                    successes += 1
                else:
                    print("Failed Dialog!")

                # Add the number of turns this dialogue took
                cumulative_turns += dialog_manager.state_tracker.turn_count
        
        # agt == 9 (the DQN agent in this walkthrough); this block runs only when no trained model was supplied
        if agt == 9 and params['trained_model_path'] is None:
            agent.predict_mode = True
            # Simulate a batch of dialogues with the current policy
            # simulation_res holds success_rate, ave_reward and ave_turns
            simulation_res = simulation_epoch(simulation_epoch_size)
            
            performance_records['success_rate'][episode] = simulation_res['success_rate']
            performance_records['ave_turns'][episode] = simulation_res['ave_turns']
            performance_records['ave_reward'][episode] = simulation_res['ave_reward']
            
            if simulation_res['success_rate'] >= best_res['success_rate']:
                if simulation_res['success_rate'] >= success_rate_threshold: # threshold = 0.30
                    agent.experience_replay_pool = []  # discard the previously accumulated experience
                    simulation_epoch(simulation_epoch_size)   # refill the pool by simulating again
                
            if simulation_res['success_rate'] > best_res['success_rate']:
                best_model['model'] = copy.deepcopy(agent)
                best_res['success_rate'] = simulation_res['success_rate']
                best_res['ave_reward'] = simulation_res['ave_reward']
                best_res['ave_turns'] = simulation_res['ave_turns']
                best_res['epoch'] = episode
            
            # Snapshot the current DQN as the frozen target network
            agent.clone_dqn = copy.deepcopy(agent.dqn)
            # Train the agent on mini-batches from the experience pool
            agent.train(batch_size, 1)
            agent.predict_mode = False
            
            print ("Simulation success rate %s, Ave reward %s, Ave turns %s, Best success rate %s" % (performance_records['success_rate'][episode], performance_records['ave_reward'][episode], performance_records['ave_turns'][episode], best_res['success_rate']))
            if episode % save_check_point == 0 and params['trained_model_path'] == None: # save the model every 10 episodes
                save_model(params['write_model_dir'], agt, best_res['success_rate'], best_model['model'], best_res['epoch'], episode)
                save_performance_records(params['write_model_dir'], agt, performance_records)
        
        print("Progress: %s / %s, Success rate: %s / %s Avg reward: %.2f Avg turns: %.2f" % (episode+1, count, successes, episode+1, float(cumulative_reward)/(episode+1), float(cumulative_turns)/(episode+1)))
    print("Success rate: %s / %s Avg reward: %.2f Avg turns: %.2f" % (successes, count, float(cumulative_reward)/count, float(cumulative_turns)/count))
    status['successes'] += successes
    status['count'] += count
    
    if agt == 9 and params['trained_model_path'] == None:
        save_model(params['write_model_dir'], agt, float(successes)/count, best_model['model'], best_res['epoch'], count)
        save_performance_records(params['write_model_dir'], agt, performance_records)

Instantiating the DialogManager class

The conversation between user and agent is driven by calling its initialize_episode, next_turn, and reward functions:

class DialogManager:
    """ A dialog manager to mediate the interaction between an agent and a customer """
    
    def __init__(self, agent, user, act_set, slot_set, movie_dictionary):
        self.agent = agent
        self.user = user
        self.act_set = act_set
        self.slot_set = slot_set
        self.state_tracker = StateTracker(act_set, slot_set, movie_dictionary)
        self.user_action = None
        self.reward = 0
        self.episode_over = False

    def initialize_episode(self):
        """ Refresh state for new dialog """
        
        self.reward = 0
        self.episode_over = False
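        # Reset the state tracker
        self.state_tracker.initialize_episode()
        # Let the user simulator start a new episode and produce the first user action
        self.user_action = self.user.initialize_episode()
        # Update the state tracker with that user action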
        self.state_tracker.update(user_action=self.user_action)
        
        if dialog_config.run_mode < 3:
            print("New episode, user goal:")
            print(json.dumps(self.user.goal, indent=2))
        self.print_function(user_action=self.user_action)
        # Reset the agent
        self.agent.initialize_episode()

    def next_turn(self, record_training_data=True):
        """ This function initiates each subsequent exchange between agent and user (agent first) """
        
        ########################################################################
        #   Agent's turn:
        #   get the current state from the state tracker
        #   and let the agent map that state to an action
        ########################################################################
        self.state = self.state_tracker.get_state_for_agent()
        self.agent_action = self.agent.state_to_action(self.state)
        
        ########################################################################
        #   Register the agent action with the state tracker,
        #   which updates the dialogue state based on this latest action
        ########################################################################
        self.state_tracker.update(agent_action=self.agent_action)
        #   Attach the natural-language (NL) rendering to the agent's dialogue act
        self.agent.add_nl_to_action(self.agent_action)  # add NL to Agent Dia_Act
        self.print_function(agent_action=self.agent_action['act_slot_response'])
        
        ########################################################################
        #   User's turn:
        #   take the latest system action from the dialogue history,
        #   let the user simulator produce user_action, episode_over and dialog_status,
        #   then compute the reward from the dialogue status
        ########################################################################
        self.sys_action = self.state_tracker.dialog_history_dictionaries()[-1]
        self.user_action, self.episode_over, dialog_status = self.user.next(self.sys_action)
        self.reward = self.reward_function(dialog_status)
        
        ########################################################################
        #  Update the state tracker with the latest user action
        ########################################################################
        if self.episode_over != True:
            self.state_tracker.update(user_action=self.user_action)
            self.print_function(user_action=self.user_action)

        ########################################################################
        #  Inform agent of the outcome for this time step (s_t, a_t, r, s_{t+1}, episode_over)
        ########################################################################
        if record_training_data:
            # Store this transition in the experience replay pool
            self.agent.register_experience_replay_tuple(self.state, self.agent_action, self.reward, self.state_tracker.get_state_for_agent(), self.episode_over)
        
        return (self.episode_over, self.reward)

    
    def reward_function(self, dialog_status):
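        """
        Failed dialogue: reward is -max_turn; successful dialogue: reward is 2 * max_turn
        (-10 and 20 when max_turn is 10). Every other turn: reward is -1.
        """
        if dialog_status == dialog_config.FAILED_DIALOG:  
            reward = - self.user.max_turn  # e.g. -10 when max_turn is 10
        elif dialog_status == dialog_config.SUCCESS_DIALOG:
            reward = 2 * self.user.max_turn  # e.g. 20 when max_turn is 10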
        else:
            reward = -1
        return reward
    
    def reward_function_without_penalty(self, dialog_status):
        """
        Reward function without penalties.
        Failed dialogue: reward is 0; successful dialogue: reward is 2 * max_turn.
        Every other turn: reward is 0.
        """
        if dialog_status == dialog_config.FAILED_DIALOG:
            reward = 0
        elif dialog_status == dialog_config.SUCCESS_DIALOG:
            reward = 2*self.user.max_turn
        else:
            reward = 0
        return reward
    
    
    def print_function(self, agent_action=None, user_action=None):
        """ Print Function """
            
        if agent_action:
            if dialog_config.run_mode == 0:
                if self.agent.__class__.__name__ != 'AgentCmd':
                    print("Turn %d sys: %s" % (agent_action['turn'], agent_action['nl']))
            elif dialog_config.run_mode == 1:
                if self.agent.__class__.__name__ != 'AgentCmd':
                    print("Turn %d sys: %s, inform_slots: %s, request slots: %s" % (agent_action['turn'], agent_action['diaact'], agent_action['inform_slots'], agent_action['request_slots']))
            elif dialog_config.run_mode == 2: # debug mode
                print("Turn %d sys: %s, inform_slots: %s, request slots: %s" % (agent_action['turn'], agent_action['diaact'], agent_action['inform_slots'], agent_action['request_slots']))
                print ("Turn %d sys: %s" % (agent_action['turn'], agent_action['nl']))
            
            if dialog_config.auto_suggest == 1:
                print('(Suggested Values: %s)' % (self.state_tracker.get_suggest_slots_values(agent_action['request_slots'])))
        elif user_action:
            if dialog_config.run_mode == 0:
                print("Turn %d usr: %s" % (user_action['turn'], user_action['nl']))
            elif dialog_config.run_mode == 1: 
                print("Turn %s usr: %s, inform_slots: %s, request_slots: %s" % (user_action['turn'], user_action['diaact'], user_action['inform_slots'], user_action['request_slots']))
            elif dialog_config.run_mode == 2: # debug mode, show both
                print("Turn %d usr: %s, inform_slots: %s, request_slots: %s" % (user_action['turn'], user_action['diaact'], user_action['inform_slots'], user_action['request_slots']))
                print("Turn %d usr: %s" % (user_action['turn'], user_action['nl']))
            
            if self.agent.__class__.__name__ == 'AgentCmd': # command line agent
                user_request_slots = user_action['request_slots']
                if 'ticket' in user_request_slots: del user_request_slots['ticket']
                if len(user_request_slots) > 0:
                    possible_values = self.state_tracker.get_suggest_slots_values(user_action['request_slots'])
                    for slot in possible_values.keys():
                        if len(possible_values[slot]) > 0:
                            print('(Suggested Values: %s: %s)' % (slot, possible_values[slot]))
                        elif len(possible_values[slot]) == 0:
                            print('(Suggested Values: there is no available %s)' % (slot))
                else:
                    kb_results = self.state_tracker.get_current_kb_results()
                    print('(Number of movies in KB satisfying current constraints: %s)' % len(kb_results))
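
Coming back to reward_function above: since the terminal reward scales with max_turn, it helps to work out what a whole dialogue earns. The standalone sketch below is not part of the project; it assumes every call to next_turn yields -1 except the last one, which yields 2 * max_turn on success or -max_turn on failure. The name episode_return and its parameters are hypothetical.

```python
def episode_return(num_exchanges, success, max_turn):
    """Total reward of one dialogue that ends on its num_exchanges-th exchange."""
    per_turn_penalty = -1 * (num_exchanges - 1)   # every non-final exchange costs 1
    final = 2 * max_turn if success else -max_turn
    return per_turn_penalty + final


# With --max_turn 40, as in the run command shown earlier:
print(episode_return(5, True, 40))     # short successful dialogue
print(episode_return(20, False, 40))   # long failed dialogue
```

Shorter successful dialogues earn more, which is why a good policy shows both a high average reward and a low average number of turns.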