Dialogue dataset generation
The dataset covers two domains, restaurants and hotels.
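To build a training corpus for each domain, dialogues collected in an earlier user trial of a statistical dialogue manager (Gašić et al., 2015) were randomly sampled and shown to workers recruited through Amazon Mechanical Turk (AMT). The workers were shown one dialogue at a time and asked to write, in natural English, an appropriate system response for each system dialogue act (DA). For each domain, we collected about 5,000 system utterances from about 1,000 randomly sampled dialogues. Each categorical value was replaced by a token representing its slot, and slots that appeared more than once in a DA were merged into one. After processing the utterances and grouping them by their delexicalised DA, we obtained 248 distinct DAs in the restaurant domain and 164 in the hotel domain. The average number of slots per DA was 2.25 and 1.95, respectively. The system was implemented with the Theano library (Bergstra et al., 2010; Bastien et al., 2012) and trained by splitting each collected corpus into training, validation, and test sets in a 3:1:1 ratio.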
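The delexicalisation and the 3:1:1 split described above can be sketched roughly as follows. This is a minimal illustration, not the original pipeline: the function names `delexicalise` and `split_3_1_1`, the `SLOT_<NAME>` token format, and the fixed random seed are all assumptions. The sketch also treats every slot with a value as categorical and assumes slot values contain no commas.

```python
import random
import re


def delexicalise(da: str) -> str:
    """Replace each slot value with a token for its slot and merge repeated slots.

    Assumption: the DA looks like act(slot=value,slot=value,...) and
    values never contain commas. The SLOT_<NAME> token is hypothetical.
    """
    match = re.fullmatch(r"(\w+)\((.*)\)", da.strip())
    if match is None:
        raise ValueError(f"unrecognised dialogue act: {da!r}")
    act, body = match.groups()

    merged = {}  # slot name -> whether the slot carried a value
    for pair in filter(None, (p.strip() for p in body.split(","))):
        slot, sep, _value = pair.partition("=")
        # setdefault keeps only the first occurrence, merging repeats.
        merged.setdefault(slot.strip(), bool(sep))

    parts = [
        f"{slot}=SLOT_{slot.upper()}" if has_value else slot
        for slot, has_value in merged.items()
    ]
    return f"{act}({','.join(parts)})"


def split_3_1_1(items, seed=0):
    """Shuffle and split into train/validation/test in a 3:1:1 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 3 // 5
    n_valid = len(items) // 5
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test


delex = delexicalise("inform(name=none,area=citycentre,near='X')")
train, valid, test = split_3_1_1(range(5000))
print(delex)
print(len(train), len(valid), len(test))
```

With about 5,000 utterances per domain, a split like this gives roughly 3,000 training, 1,000 validation, and 1,000 test utterances.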
What does the dataset look like?
{0: {'src': "inform(name=none,area=citycentre,near='X')", 'sys_summ': 'There is sorry no information matching constraints near X .', 'scores': {'informativeness': 6.0, 'naturalness': 4.0, 'quality': 5.0}, 'ref_summs': ['I am sorry but there are no venues near X in the city centre .', 'I am sorry but there are no venues near X in the city centre .', 'I am sorry but there are no venues near X in the city centre .', 'I am sorry but there are no venues near X in the city centre .', 'I am sorry but there are no venues near X in the city centre .', 'I am sorry but there are no venues near X in the city centre .', 'There are no places you are looking for near X in the centre of town .', 'There are no places you are looking for near X in the centre of town .', 'There are no places you are looking for near X in the centre of town .', 'There are no places you are looking for near X in the centre of town .', 'There are no places you are looking for near X in the centre of town .', 'There are no places you are looking for near X in the centre of town .', 'I am sorry but there are no venues near X in the city centre .', 'I am sorry but there are no venues near X in the city centre .', 'I am sorry but there are no venues near X in the city centre .', 'There are no places you are looking for near X in the centre of town .', 'There are no places you are looking for near X in the centre of town .', 'There are no places you are looking for near X in the centre of town .']},1:{}}
Each source is paired with 18 references. The example above comes from one of the two domains (hotel or restaurant).
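To check the structure of a record, one can count its references and see how many are distinct. The sketch below rebuilds the record shown above with the same field values; the variable names `sorry`, `places`, and `record` are only illustrative, and the list multiplication is shorthand for the repeated strings in the original order.

```python
from collections import Counter

# The two reference sentences that appear in the example record.
sorry = "I am sorry but there are no venues near X in the city centre ."
places = "There are no places you are looking for near X in the centre of town ."

record = {
    "src": "inform(name=none,area=citycentre,near='X')",
    "sys_summ": "There is sorry no information matching constraints near X .",
    "scores": {"informativeness": 6.0, "naturalness": 4.0, "quality": 5.0},
    # Same order as in the example: 6 + 6 + 3 + 3 references.
    "ref_summs": [sorry] * 6 + [places] * 6 + [sorry] * 3 + [places] * 3,
}

ref_counts = Counter(record["ref_summs"])
n_refs = len(record["ref_summs"])
n_distinct = len(ref_counts)

print(n_refs, n_distinct)
for text, count in ref_counts.items():
    print(count, text)
```

For this record the 18 references reduce to two distinct sentences, each repeated nine times, so it may be worth deduplicating references before computing reference-based metrics.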