NTCIR-15 Dialogue Evaluation Task (DialEval-1)

DialEval-1 consists of two subtasks: Nugget Detection (ND) and Dialogue Quality (DQ). Both aim to evaluate customer-helpdesk dialogues automatically. The ND subtask is to classify whether a customer or helpdesk turn is a nugget, where a nugget is a turn that helps towards problem solving; the DQ subtask is to assign quality scores to each dialogue in terms of three criteria: task accomplishment, customer satisfaction, and efficiency. Official evaluation results were received for 18 runs from 7 teams. IMTKU (Dept. of Information Management, Tamkang University, Taiwan) and I, on behalf of ZEALS, achieved top-1 or top-2 results on various evaluation metrics for both ND and DQ by utilizing XLM-RoBERTa via HuggingFace Transformers, fastai, and blurr.

NTCIR? DialEval? ND? DQ?

NII (National Institute of Informatics) Testbeds and Community for Information Access Research, a.k.a. NTCIR, is a series of evaluation workshops designed to enhance research in information access technologies, including information retrieval, question answering, text summarization, extraction, etc. It can be seen as an Asia-Pacific counterpart of TREC (Text REtrieval Conference; America), CLEF (Conference and Labs of the Evaluation Forum; Europe), and FIRE (Forum for Information Retrieval Evaluation; Southern Asia).

DialEval-1 is one of the initiatives, like DSTC (Dialog System Technology Challenges), that study the complexity of the dialogue phenomenon and various dialogue-related problems. Unlike typical DSTC tasks, however, DialEval-1 focuses on reducing the cost of corpus curation and data analytics. Although WOZ (Wizard-of-Oz), a common approach nowadays, can distribute the workloads of data collection, annotation, and evaluation to crowd-sourcing, the time spent on reaching an agreement on annotation/evaluation standards nonetheless grows linearly, if not worse. In other words, human annotation and evaluation have two types of issues:

  • Scalability: costly and hard to decentralize;

  • Measurability: likely unrepeatable/inconsistent even for the same system.

To overcome the issues, DialEval-1 proposes to assess dialogues automatically. Given a customer-helpdesk dialogue (Figure 1), for example, can a system predict which turn of the dialogue is helpful, and by how much?

Therefore, DialEval-1 continues the ND and DQ subtasks of the Short Text Conversation task (STC-3) at NTCIR-14 [1], and further constructs a new test collection so that the performance evaluation is fair and realistic.

What to predict, exactly?

Allow me to briefly describe the expected outcomes and how they are evaluated. There will be no math formulae; if the reader is interested in rigorous definitions, please refer to the overview papers [1][2].

Nugget Detection

Figure 2. Nugget state transitions [1][2]

A nugget is a turn by either Helpdesk or Customer that helps Customer transition from the Current State (including the Initial State) towards the Target State (i.e., the problem being solved). The STC-3 and DialEval-1 organizers define the following 7 nugget types (a minimal sketch follows the list):

  • CNaN / HNaN: Customer or Helpdesk’s non-nuggets that are irrelevant to the problem-solving situation;

  • CNUG / HNUG: Customer or Helpdesk’s regular nuggets that are relevant to the problem-solving situation;

  • CNUG* / HNUG*: Customer or Helpdesk’s goal nuggets that confirm and provide solutions, respectively;

  • CNUG0: Customer’s trigger nuggets that initiate the dialogues with certain problem descriptions.

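To make the label space concrete, here is a minimal, made-up sketch of how a short dialogue might be labeled; the utterances and labels below are invented purely for illustration.

# The seven nugget types defined by the organizers.
CUSTOMER_LABELS = {"CNUG0", "CNUG", "CNUG*", "CNaN"}
HELPDESK_LABELS = {"HNUG", "HNUG*", "HNaN"}

# A made-up three-turn dialogue: the customer opens with a problem description
# (trigger nugget), the helpdesk provides a solution (goal nugget), and the
# customer confirms that it worked (goal nugget).
dialogue = [
    ("customer", "My Internet access can't be connected.", "CNUG0"),
    ("helpdesk", "Please reboot the router and try again.", "HNUG*"),
    ("customer", "Rebooted, and it works now. Thanks!", "CNUG*"),
]

for sender, utterance, nugget in dialogue:
    # Customer turns take C* labels; helpdesk turns take H* labels.
    expected = CUSTOMER_LABELS if sender == "customer" else HELPDESK_LABELS
    assert nugget in expected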

Dialogue Quality

DQ uses subjective scores to quantify the quality of a dialogue as a whole. The organizers define 3 score types:

  • A-score: Accomplishment; has the problem been solved? To what extent?

  • S-score: Satisfaction; how satisfied Customer is with the dialogue;

  • E-score: Effectiveness; how effective and efficient the dialogue is.

Each score is on a 5-point scale of ranks, ranging from -2 to 2.

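Because each dialogue is judged by multiple annotators, a system may need to reproduce the spread of their votes over the five ranks. A minimal sketch with made-up votes (the real annotator counts live in the datasets [1][2]):

from collections import Counter

RANKS = [-2, -1, 0, 1, 2]

# Hypothetical A-score votes from 19 annotators for one dialogue.
a_votes = [1, 1, 2, 0, 1, 1, 1, 2, 1, 0, 1, 1, 1, 2, 1, 0, 1, 1, 1]

counts = Counter(a_votes)
a_dist = [counts[r] / len(a_votes) for r in RANKS]
print(a_dist)  # [0.0, 0.0, 3/19, 13/19, 3/19]: a distribution, not a single rank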

Evaluation Metrics

Since the issues are about inconsistent human assessment, the gold standard of the datasets is not trivial classes or ranks but distributions. Roughly speaking, the distributions are the annotators' votes for each class/rank, so the organizers evaluate a system's prediction of a class/rank by comparing how similar the predicted distribution is to the gold standard's. More specifically, the metrics for ND are Root Normalized Sum of Squares (RNSS) and Jensen-Shannon Divergence (JSD), and the ones for DQ are Normalized Match Distance (NMD) and Root Symmetric Normalized Order-aware Divergence (RSNOD). Again, I will spare the readers the math formulae; please see the papers if interested.

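Still, for intuition, here is a minimal, unofficial sketch of two of the measures, JSD and NMD, following my reading of the definitions in [1][2]; the organizers' implementations are the authoritative ones.

import numpy as np

def jsd(p, q):
    # Jensen-Shannon Divergence: the mean KL divergence of p and q
    # against their mixture m; symmetric, and bounded by 1 in log base 2.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = (p + q) / 2
    kl = lambda a, b: float(np.sum(a[a > 0] * np.log2(a[a > 0] / b[a > 0])))
    return (kl(p, m) + kl(q, m)) / 2

def nmd(p, q):
    # Normalized Match Distance: the L1 distance between the cumulative
    # distributions, divided by (#bins - 1); being order-aware, missing
    # the gold rank by one bin costs less than missing it by four.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) / (len(p) - 1))

gold = [0.0, 0.0, 3/19, 13/19, 3/19]  # the A-score distribution from above
pred = [0.0, 0.1, 0.2, 0.5, 0.2]      # a system's predicted distribution
print(jsd(gold, pred), nmd(gold, pred))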

How do we approach the tasks?

Despite their architectural differences, almost all participants of STC-3 modeled ND and DQ as classification tasks. We adopt the same tactic for DialEval-1 and instead pay more attention to tokenization and optimization, because STC-3 showed that none of the participants' architectures outperformed the BiLSTM-with-GloVe baselines [1][2]. Our approach is therefore simply fine-tuning state-of-the-art pre-trained models with well-established tricks of tokenization and optimization.

Tokenization

Based on our preliminary trials, we find that XLM-RoBERTa works well for both Simplified Chinese and English. Although the rationale behind it hasn’t been fully examined, we speculate that the sentencepiece-based unigram subwords may have been helpful. Besides that, in order to simulate the turn structure of a dialogue, we not only utilize XLM-RoBERTa’s special tokens, namely BOS (beginning of sentence; <s>), EOS (end of sentence; </s>), and SEP (separator of sentences; </s> </s>), but also customize several tokens in fastai style to provide some minimal context.

For example, consider a tokenized turn below:

xxlen ▁3 <s> xxtrn ▁1 xxsdr ▁customer ▁@ ▁China ▁Uni com ▁Customer ▁Service ▁in ▁Gu ang dong ▁Shi t ! ▁What ▁is ▁your ▁staff ▁service ▁doing ▁on ▁earth ? ▁I ▁have ▁called ▁the ▁staff ▁service ▁for ▁3 ▁hours , ▁but ▁no ▁one ▁answer ▁my ▁phone ▁call . ▁It ▁is ▁no ▁wonder ▁that ▁customer ▁evaluation ▁is ▁so ▁bad . ▁Shi t ! ▁I ▁am ▁at ▁Kang le ▁Middle ▁Road . </s>

Here, xxlen and xxtrn stand for the length of the dialogue in turns and the position of the turn within the dialogue, respectively. The numbers right next to them provide those features of the turns. The same trick goes for xxsdr, which differentiates whether the sender is Customer or Helpdesk. When a turn's context says xxtrn ▁1 xxsdr ▁customer, the nugget type is almost certainly CNUG0.

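A minimal sketch of how such contexts can be produced with HuggingFace tokenizers (not our exact preprocessing code; the helper mark_turn is hypothetical):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
# Register the fastai-style context tokens so sentencepiece never splits them;
# the model then needs model.resize_token_embeddings(len(tok)) accordingly.
tok.add_tokens(["xxlen", "xxtrn", "xxsdr"])

def mark_turn(n_turns, turn_idx, sender, text):
    # Prepend the dialogue length, turn position, and sender, mirroring
    # the layout of the tokenized turn shown above.
    return f"xxlen {n_turns} <s> xxtrn {turn_idx} xxsdr {sender} {text} </s>"

print(tok.tokenize(mark_turn(3, 1, "customer", "The Internet access can't be connected.")))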

As for DQ, a whole dialogue can be tokenized in a similar fashion, where xxlen could be useful for certain quality scores:

xxlen ▁3 <s> xxtrn ▁1 xxsdr ▁customer ▁@ ▁China ▁Uni com ▁Customer ▁Service ▁in ▁Gu ang dong ▁Shi t ! ▁What ▁is ▁your ▁staff ▁service ▁doing ▁on ▁earth ? ▁I ▁have ▁called ▁the ▁staff ▁service ▁for ▁3 ▁hours , ▁but ▁no ▁one ▁answer ▁my ▁phone ▁call . ▁It ▁is ▁no ▁wonder ▁that ▁customer ▁evaluation ▁is ▁so ▁bad . ▁Shi t ! ▁I ▁am ▁at ▁Kang le ▁Middle ▁Road . </s> </s> xxtrn ▁2 xxsdr ▁help desk ▁Hello ! ▁We ▁are ▁sorry ▁for ▁the ▁in con veni ence . ▁100 10 ▁is ▁our ▁service ▁hot ▁line . ▁We ▁may ▁not ▁answer ▁your ▁phone ▁call ▁during ▁the ▁busy ▁hour ▁of ▁tele traf fic . ▁We ▁sincer ely ▁ap ologi ze ▁for ▁that ! ▁What ▁can ▁I ▁do ▁for ▁you ? ▁Thank ▁you ! </s> </s> xxtrn ▁3 xxsdr ▁customer ▁The ▁Uni com ▁Internet ▁access ▁in ▁Z ha oq ing ▁Nur sing ▁School ▁can ' t ▁be ▁connected . ▁What ▁is ▁wrong ▁with ▁it ? ▁You ▁have ▁repair ed ▁it ▁for ▁the ▁whole ▁afternoon ▁in ▁the ▁area . ▁What ▁are ▁you ▁doing ▁on ▁earth ? ▁Shi t ! ▁Why ▁can ▁the ▁China ▁Mobile ▁service ▁hot line ▁be ▁got ▁through ? ▁Shi t ! ▁The ▁service ▁hot line ▁can ' t ▁be ▁got ▁through ▁the ▁whole ▁morning . ▁ 651 ▁I ▁bought ▁a ▁watch ▁last ▁year ▁and ▁the ▁service ▁hot line ▁can ' t ▁be ▁got ▁through ▁within ▁24 ▁hours . ▁I ▁won ' t ▁for give ▁you ! ▁No ▁phone ▁call ▁is ▁answered ! </s>

Optimization

Thanks to the great work of HuggingFace, fastai, and blurr, a stable fine-tuning scheme enables us to rapidly trial-and-error our way to a sufficiently good combination of hyper-parameters. For instance, the core steps for fine-tuning an ND model can be as short as this:

from functools import partial
from fastai.text.all import *  # Learner, accuracy, F1Score, LabelSmoothingCrossEntropyFlat, ...
# HF_BaseModelWrapper, HF_BaseModelCallback, and hf_splitter come from blurr
# (the exact import path depends on the blurr version).

dls = ...  # fastai's DataLoaders
lrnr = Learner(
    dls,
    HF_BaseModelWrapper(hf_model),  # blurr's HuggingFace model wrapper
    opt_func=partial(SOME_OPTIMIZER, decouple_wd=True),  # decoupled weight decay
    loss_func=LabelSmoothingCrossEntropyFlat(),
    metrics=[
        accuracy,
        partial(top_k_accuracy, k=2),
        F1Score(average='weighted'),
        MatthewsCorrCoef(),
        ...
    ],
    cbs=[HF_BaseModelCallback],
    splitter=hf_splitter,  # layer groups for discriminative learning rates
    path=DATA_DIR,
).to_fp16()  # mixed-precision training
lrnr.create_opt()

for ...:
    # iteratively decrease base_lr and/or factor
    lrnr.fit_one_cycle(n_epoch, lr_max=slice(base_lr/factor, base_lr))

Admittedly, there are many moving parts in this fine-tuning scheme. After all, the most time-consuming step of fine-tuning is the Grad Student Algorithm (a.k.a. Grad Student Descent), i.e., figuring out a nice combination of magical numbers, a stable optimizer, a reasonable loss function, and other techniques such as discriminative training and mixed precision. Fortunately, with the help of slanted triangular learning rates, each fit_one_cycle(…) takes only minutes to finish.

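To make those magical numbers slightly less magical, fastai's learning-rate finder offers a cheap starting point for base_lr (a sketch reusing the Learner above; the values below are hypothetical):

lrnr.lr_find()     # runs the LR range test and plots loss vs. learning rate
base_lr = 2e-5     # hypothetical value read off the plot
factor = 2.6 ** 4  # a common fastai heuristic for discriminative LR ranges
lrnr.fit_one_cycle(n_epoch, lr_max=slice(base_lr/factor, base_lr))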

One particular realization of ours is that there is no need for gradual unfreezing. It works pretty well with AWD-LSTM, but is somewhat insignificant for fine-tuning Transformers.

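For contrast, a ULMFiT-style schedule for AWD-LSTM would look roughly like the sketch below, whereas for XLM-RoBERTa we simply train the whole stack from the first cycle:

# Gradual unfreezing (pays off with AWD-LSTM):
lrnr.freeze()       # start by training only the classification head
lrnr.fit_one_cycle(1, lr_max=base_lr)
lrnr.freeze_to(-2)  # then unfreeze one layer group at a time
lrnr.fit_one_cycle(1, lr_max=slice(base_lr/factor, base_lr))
lrnr.unfreeze()
lrnr.fit_one_cycle(n_epoch, lr_max=slice(base_lr/factor, base_lr))

# With the Transformer, we skip the freezing dance entirely:
lrnr.unfreeze()
lrnr.fit_one_cycle(n_epoch, lr_max=slice(base_lr/factor, base_lr))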

That’s it?

Yes, mostly. In our experience, although there are more techniques to explore, the bottom line is that, unless we discover a substantially better architecture (for classification) and/or an alternative modeling perspective (that is not simply classification), a good beginning (of tokenization and optimization) almost assures success.

While the official report and datasets won’t be published until the end of the year 2020 [2], the STC-3 datasets are available for anyone who wants to give it a shot: https://sakai-lab.github.io/stc3-dataset/ [1].

REFERENCES

  1. Zeng, Kato, and Sakai: Overview of the NTCIR-14 Short Text Conversation Task: Dialogue Quality and Nugget Detection Subtasks, Proceedings of NTCIR-14, 2019.

  2. Zeng et al.: Overview of the NTCIR-15 Dialogue Evaluation Task, Proceedings of NTCIR-15, to appear, 2020.

Originally published at https://medium.com/@TianJianJiang/ntcir-15-dialogue-evaluation-task-dialeval-1-d41178624566
