Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges

This paper presents the results of the WMT19 Metrics Shared Task. Participants were asked to score, with automatic metrics, the output of the translation systems competing in the WMT19 News Translation Task. Thirteen research groups submitted 24 metrics, 10 of which are reference-less "metrics" and constitute submissions to the joint task with the WMT19 Quality Estimation Task, "QE as a Metric". In addition, we computed 11 baseline metrics: 8 commonly applied baselines (BLEU, sentBLEU, NIST, WER, PER, TER, CDER and chrF) and 3 reimplementations (chrF+, sacreBLEU-BLEU and sacreBLEU-chrF). Metrics were evaluated at the system level, i.e. how well a metric correlates with the WMT19 official manual ranking, and at the segment level, i.e. how well a metric correlates with human judgements. This year, direct assessment was the only form of manual evaluation.

BLEU and NIST

The metrics BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) were computed using mteval-v13a.pl from the OpenMT Evaluation Campaign. The tool includes its own tokenization. We ran mteval with the flag --international-tokenization.
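For illustration, a minimal sketch of how such a scoring run could be driven from Python is shown below; the SGML file names are hypothetical, and the -s/-r/-t arguments follow the usual NIST mteval convention rather than anything specified above. Only the --international-tokenization flag is taken from the text.

```python
import subprocess

# Hypothetical SGML file names; mteval-v13a.pl expects source, reference and
# system output in NIST SGML format (-s/-r/-t is the usual mteval convention).
cmd = [
    "perl", "mteval-v13a.pl",
    "-s", "newstest2019-src.sgm",   # source segments
    "-r", "newstest2019-ref.sgm",   # reference translations
    "-t", "system-output.sgm",      # MT output to be scored
    "--international-tokenization",
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(result.stdout)  # BLEU and NIST scores are printed to standard output
```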

TER, WER, PER and CDER.

The metrics TER (Snover et al., 2006), WER, PER and CDER (Leusch et al., 2006) were produced by the Moses scorer, which is used in Moses model optimization. We used the standard tokenizer script from the Moses toolkit for tokenization.
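As a reminder of what these error rates measure, the sketch below computes plain word error rate (word-level edit distance normalized by reference length). It is only an illustration, not the Moses scorer, and it omits the block-move operation of CDER and the position independence of PER.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on mat", "the cat sat on the mat"))  # 1/6 ≈ 0.167
```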

sentBLEU.

The metric sentBLEU is computed using the script sentence-bleu, which is part of the Moses toolkit. It is a smoothed version of BLEU for scoring at the segment level. We used the standard tokenizer script from the Moses toolkit for tokenization.
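The snippet below is not the Moses sentence-bleu script itself, but NLTK's smoothed sentence-level BLEU gives a comparable feel for why smoothing is needed when scoring single segments.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat sat on the mat".split()]
hypothesis = "the cat is on the mat".split()

# method1 adds a small constant to zero n-gram counts, so short segments with
# no higher-order n-gram matches still receive a non-zero score.
smooth = SmoothingFunction().method1
score = sentence_bleu(reference, hypothesis, smoothing_function=smooth)
print(f"smoothed sentence BLEU: {score:.3f}")
```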

chrF and chrF+.

The metrics chrF and chrF+ (Popović, 2015, 2017) are computed using their original Python implementation, see Table 2. We ran chrF++.py with the parameters -nw 0 -b 3 to obtain the chrF score and with -nw 1 -b 3 to obtain the chrF+ score. Note that chrF intentionally removes all spaces before matching the n-grams, detokenizing the segments but also concatenating words.
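A simplified illustration of the chrF idea follows: character n-gram precision and recall, averaged over n-gram orders and combined with β = 3 so that recall is weighted more heavily, with spaces removed before matching as noted above. This is not the official chrF++.py (which also handles word n-grams for chrF+ and sentence-level averaging) and will not produce identical numbers.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    s = text.replace(" ", "")  # chrF removes spaces before matching
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 3.0) -> float:
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p = sum(precisions) / max_n
    r = sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    # F_beta combination: beta = 3 weights recall three times as much as precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(chrf("the cat is on the mat", "the cat sat on the mat"))
```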

sacreBLEU-BLEU and sacreBLEU-chrF.

The metrics sacreBLEU-BLEU and sacreBLEU-chrF (Post, 2018a) are re-implementations of BLEU and chrF respectively. We ran sacreBLEU-chrF with the same parameters as chrF, but their scores are slightly different. The signature strings produced by sacreBLEU for BLEU and chrF respectively are BLEU+case.lc+lang.de-en+numrefs.1+smooth.exp+tok.intl+version.1.3.6 and chrF3+case.mixed+lang.de-en+numchars.6+numrefs.1+space.False+tok.13a+version.1.3.6.
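A minimal sketch of the corresponding sacreBLEU Python API calls, assuming a recent sacreBLEU release (exact call signatures may differ slightly from version 1.3.6 used in the task); the lowercasing and international tokenization mirror the BLEU signature string above, and the toy sentences are placeholders.

```python
import sacrebleu

hypotheses = ["The cat is on the mat .", "There is a dog ."]
references = ["The cat sat on the mat .", "A dog is outside ."]

# Mirrors the BLEU signature above: lowercased, international tokenization.
bleu = sacrebleu.corpus_bleu(hypotheses, [references],
                             lowercase=True, tokenize="intl")
# chrF with the default character order and beta = 3.
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

print(bleu.score, chrf.score)
```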

Chinese word segmentation.

Unfortunately, the tokenization scripts mentioned above do not support Chinese. To score Chinese with the baseline metrics, we pre-processed the MT outputs and reference translations with a segmentation script.

BEER (Stanojević and Sima’an, 2015) is a trained evaluation metric with a linear model that combines sub-word feature indicators (character n-grams) and global word-order features (skip bigrams) to achieve a language-agnostic and fast-to-compute evaluation metric. BEER has participated in previous editions of the metrics task.

BERTr (Mathur et al., 2019) uses contextual word embeddings to compare the MT output with the reference translation. The BERTr score of a translation is the average recall score over all reference tokens, using a relaxed version of token matching based on BERT embeddings: for each reference token, the maximum cosine similarity between its embedding and the embedding of any token in the MT output. BERTr uses bert_base_uncased embeddings for the to-English language pairs, and bert_base_multilingual_cased embeddings for all other language pairs.
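The recall computation can be sketched over precomputed token embeddings as follows; extracting the BERT embeddings themselves is omitted, and the array shapes and toy inputs are assumptions for illustration only.

```python
import numpy as np

def bertr_recall(ref_emb: np.ndarray, hyp_emb: np.ndarray) -> float:
    """Average over reference tokens of the maximum cosine similarity to any
    MT-output token (a sketch of the relaxed recall described above).

    ref_emb: (num_ref_tokens, dim) token embeddings of the reference
    hyp_emb: (num_hyp_tokens, dim) token embeddings of the MT output
    """
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    hyp = hyp_emb / np.linalg.norm(hyp_emb, axis=1, keepdims=True)
    sim = ref @ hyp.T                     # cosine similarity matrix
    return float(sim.max(axis=1).mean())  # best match per reference token, averaged

# Toy example with random "embeddings", just to show the shapes involved.
rng = np.random.default_rng(0)
print(bertr_recall(rng.normal(size=(7, 768)), rng.normal(size=(9, 768))))
```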

CharacTER (Wang et al., 2016b,a), identical to the 2016 setup, is a character-level metric inspired by the commonly applied translation edit rate (TER). It is defined as the minimum number of character edits required to adjust a hypothesis until it completely matches the reference, normalized by the length of the hypothesis sentence.
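A simplified sketch of this normalization is shown below, using plain character-level edit distance divided by hypothesis length; the full CharacTER metric additionally allows word shifts before the character edits, which is omitted here.

```python
def character_score(hypothesis: str, reference: str) -> float:
    """Character-level edit distance normalized by hypothesis length.

    Simplified sketch: the full CharacTER metric also permits word shifts
    before computing character edits, which this version does not model.
    """
    h, r = hypothesis, reference
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i] + [0] * len(h)
        for j, hc in enumerate(h, start=1):
            cost = 0 if rc == hc else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[len(h)] / max(len(h), 1)

print(character_score("the cat is on the mat", "the cat sat on the mat"))
```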

EED (Stanchev et al., 2019) is a character-based metric, which builds upon CDER. It is defined as the minimum number of operations of an extension to the conventional edit distance containing a “jump” operation. The edit distance operations (insertions, deletions and substitutions) are performed at the character level and jumps are performed when a blank space is reached. Furthermore, the coverage of multiple characters in the hypothesis is penalised by the introduction of a coverage penalty. The sum of the length of the reference and the coverage penalty is used as the normalisation term.

Enhanced Sequential Inference Model (ESIM; Chen et al., 2017; Mathur et al., 2019) is a neural model proposed for Natural Language Inference that has been adapted for MT evaluation. It uses cross-sentence attention and sentence matching heuristics to generate a representation of the translation and the reference, which is fed to a feedforward regressor. The metric is trained on singly-annotated Direct Assessment data that has been collected for evaluating WMT systems: all WMT 2018 to-English data for the to-English language pairs, and all WMT 2018 data for all other language pairs.

YiSi (Lo, 2019) is a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. YiSi-1 is an MT evaluation metric that measures the semantic similarity between a machine translation and human references by aggregating the idf-weighted lexical semantic similarities based on contextual embeddings extracted from BERT, optionally incorporating shallow semantic structures (denoted YiSi-1_srl). YiSi-0 is the degenerate version of YiSi-1 that is ready to deploy to any language; it uses the longest common character substring to measure lexical similarity. YiSi-2 is the bilingual, reference-less version for MT quality estimation, which uses contextual embeddings extracted from BERT to evaluate the cross-lingual lexical semantic similarity between the input and the MT output. Like YiSi-1, YiSi-2 can exploit shallow semantic structures as well (denoted YiSi-2_srl).
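As an illustration of the YiSi-0 lexical similarity, the sketch below computes a longest-common-character-substring similarity between two tokens; the normalization chosen here is an assumption for illustration, not necessarily the one used by YiSi.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous character substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def lexical_similarity(a: str, b: str) -> float:
    # Hypothetical normalization: shared-substring coverage of both tokens.
    if not a or not b:
        return 0.0
    return 2 * longest_common_substring(a, b) / (len(a) + len(b))

print(lexical_similarity("translation", "translations"))  # ≈ 0.957
```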
