[ THUNLP-MT (2/10) ] BLEU: a Method for Automatic Evaluation of Machine Translation | NIST

只眷恋两小无猜

Article link: https://blog.csdn.net/qq_33387068/article/details/90243789

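This post covers the main ideas behind the BLEU evaluation metric. The original paper was published by IBM at ACL'02 and is by now a well-worn classic. BLEU is routinely used today to evaluate machine translation systems (the paper had been cited 8,924 times at the time of writing). The post also covers NIST, a variant of BLEU.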

BLEU, original paper:
BLEU: a Method for Automatic Evaluation of Machine Translation
NIST, original paper:
Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics

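What makes a good translation scoring method

It should be both sensitive and consistent.
》Sensitivity
It can tell apart systems of similar quality.

》Consistency
Its assessment of system performance does not change with the choice of test text or reference translations.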

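BLEU: main principles

The full name is bilingual evaluation understudy (rendered in Chinese as 双语互译质量评估辅助工具).

》Principle: the closer a machine translation is to a professional human translation, the better it is. BLEU therefore scores the co-occurring units, that is, the n-grams shared by the system translation and the references.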

$\begin{aligned} \text{BLEU} &= \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \tag1 \\ & \text{\# summing } w_n \log p_n \text{ is the same as taking } \log \prod_{n=1}^{N} p_n^{w_n}, \\ & \text{\# so if any single } p_n \text{ is zero, the whole BLEU score is zero.} \end{aligned}$

$\text{BP}=\left\{ \begin{array}{ll} 1, & c > r \\ e^{(1-r/c)}, & c \le r \end{array} \right. \tag2$

$\begin{aligned} p_n &= \frac{ \sum_{C \in \{Candidate\}} \sum_{n\text{-}gram \in C} Count_{clip}(n\text{-}gram) } { \sum_{C' \in \{Candidate\}} \sum_{n\text{-}gram' \in C'} Count(n\text{-}gram') } \tag3 \\ & \text{\# Candidate denotes the system translations (the translated sentences)} \end{aligned}$

$Count_{clip} = min( count, maxReferenceCount ) \tag4$

$\begin{aligned} \text{log BLEU} &= \text{log BP} + \sum_{n=1}^{N} w_n* \text{log}p_n \\ &= min(1-\frac{r}{c},0) + \sum_{n=1}^{N} w_n* \text{log}p_n \tag5 \end{aligned}$

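In (1), $\text{BP}$ is the brevity penalty, a factor that penalizes translations that are too short. $p_n$ is the modified n-gram precision computed over the whole test corpus (corpus-based n-gram precision). Typically $N=4$ and $w_n = 1/N$.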

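(2) defines the brevity penalty; the co-occurrence scoring itself is done by $p_n$ in (3). Here $c$ and $r$ are the lengths of the candidate translation and of the reference translation. When the candidate is longer than the reference, $\text{BP}=1$; when the two lengths are equal, $\text{BP}=e^{0}=1$ as well; when the candidate is shorter, $\text{BP}<1$. In other words, only candidates that are too short are penalized, so the factor favors longer candidates.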
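The piecewise definition in (2) is easy to check numerically. The snippet below is a minimal sketch, not code from the paper; the function name `brevity_penalty` and the handling of an empty candidate are assumptions made for illustration.

```python
import math


def brevity_penalty(c: int, r: int) -> float:
    """Brevity penalty of equation (2): c = candidate length, r = reference length."""
    if c <= 0:
        # Assumption: an empty candidate receives the maximum penalty.
        return 0.0
    if c > r:
        return 1.0
    # c <= r: the exponent 1 - r/c is <= 0, so the result is <= 1.
    return math.exp(1.0 - r / c)


# Shorter candidates are penalized; equal or longer ones are not.
for c in (4, 7, 10):
    print(c, brevity_penalty(c, r=7))
```

Note that the penalty reaches exactly 1 at $c = r$ and stays there for longer candidates; overly long output is already punished by the precision terms.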

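(3) sums the clipped n-gram counts over all translated sentences and divides by the total n-gram count over all candidate sentences in the test set. Both $C$ and $C'$ range over the system translations. The numerator counts the candidate n-grams that also appear in a reference translation (after clipping); the denominator is the sum of the counts of all n-grams in the system translations.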

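(4) shows how the clipped n-gram count is computed. For each n-gram it is the smaller of two values: the largest number of times the n-gram occurs in any single reference translation (written maxReferenceCount in (4)), and the number of times it occurs in the system output.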
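The clipping in (4) and the ratio in (3) can be sketched for a single sentence as follows. This is an illustrative sketch only: the helper names `ngrams` and `modified_precision` are made up here, and tokenization is assumed to be lowercased whitespace splitting with punctuation dropped.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Return (clipped count, total count) of candidate n-grams, as in (3) and (4)."""
    cand_counts = Counter(ngrams(candidate, n))
    # maxReferenceCount: the largest count of each n-gram in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped, total


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
print(modified_precision(candidate, references, 1))
print(modified_precision(candidate, references, 2))
```

For a whole test corpus, the clipped counts and the total counts would be summed separately over all sentences before dividing, which is what the double sums in (3) express.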

(5) is the BLEU metric after taking the logarithm.
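To see why the $\min$ form in (5) agrees with the piecewise definition in (2), the two cases can be checked separately (a short derivation using only the definitions above):

$\begin{aligned} c > r &: \quad \log \text{BP} = \log 1 = 0, \qquad 1-\frac{r}{c} > 0 \ \Rightarrow\ \min(1-\frac{r}{c},0) = 0 \\ c \le r &: \quad \log \text{BP} = \log e^{(1-r/c)} = 1-\frac{r}{c}, \qquad 1-\frac{r}{c} \le 0 \ \Rightarrow\ \min(1-\frac{r}{c},0) = 1-\frac{r}{c} \end{aligned}$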

As an example, take the source sentence “猫咪在垫子上。” (“The cat is on the mat.”)
System translation: the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
$N = 2$ (usually $N=4$)

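When $n = 1$,
Unmodified precision on the test text (every one of the 7 candidate words, “the”, appears in a reference):
$\begin{aligned} p_1 &= \frac{ \sum_{C \in \{Candidate\}} \sum_{1\text{-}gram \in C} Count(1\text{-}gram) } { \sum_{C' \in \{Candidate\}} \sum_{1\text{-}gram' \in C'} Count(1\text{-}gram') } \\ &= \frac{7}{7} = 1 \end{aligned}$

Modified precision on the test text (clipping prevents an n-gram in the system translation from being counted more often than it occurs in any single reference):
$\begin{aligned} Count_{clip} &= min( count, maxReferenceCount ) \\ &= min( 7, 2 ) \\ &= 2 \end{aligned}$
so the modified precision is $p_1 = 2/7$.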

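When $n = 2$,
Unmodified precision on the test text (none of the 6 candidate bigrams “the the” appears in a reference):
$\begin{aligned} p_2 &= \frac{ 0 } { 6 } \\ &= 0 \end{aligned}$

Modified precision on the test text (clipping prevents an n-gram in the system translation from being counted more often than it occurs in any single reference):
$\begin{aligned} Count_{clip} &= min( count, maxReferenceCount ) \\ &= min( 6, 0 ) \\ &= 0 \end{aligned}$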

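$c = 7,\ r = 7$ (take the reference whose length is closest to the candidate length; if several tie, take the shortest)

$\begin{aligned} \text{BP} &= e^{(1-r/c)} & & (c \le r) \\ &= e^{0} = 1 \end{aligned}$

$\begin{aligned} p_2 &= \frac{0}{6} = 0 \quad \text{\# warning: } \log p_2 = \log 0 \text{ is undefined} \end{aligned}$

$\begin{aligned} \text{BLEU} &= \text{BP} \cdot \exp(\sum_{n=1}^{2} w_n \log p_n) \\ &= \text{BP} \cdot \exp(\sum_{n=1}^{2} \frac{1}{2} \log p_n) \\ &= \text{BP} \cdot \exp( \frac{1}{2}\log(2/7) + \frac{1}{2}\log(0)) \quad \text{\# warning: } \log 0 \text{; the source code handles this term by returning 0} \\ &= \text{BP} \cdot \exp( \frac{1}{2}\log(2/7)) \\ &= \exp( \frac{1}{2}\log(2/7)) = \sqrt{2/7} \approx 0.535 \end{aligned}$

Note that under a strict reading of (1), $p_2 = 0$ makes the whole score 0; the value 0.535 is what remains once the zero term is dropped.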
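The whole worked example can be reproduced in a few lines. This is a rough sketch under stated assumptions, not the implementation the post refers to: the function names, the `skip_zero_terms` flag, and the lowercased, punctuation-free tokenization are all introduced here for illustration. With `skip_zero_terms=False` the function follows (1) strictly; with `True` it drops any order whose precision is zero, which mimics the workaround described above.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Clipped and total n-gram counts of one candidate sentence."""
    cand = ngram_counts(candidate, n)
    ref_counts = [ngram_counts(ref, n) for ref in references]
    clipped = sum(
        min(count, max(rc[gram] for rc in ref_counts)) for gram, count in cand.items()
    )
    return clipped, sum(cand.values())


def bleu(candidate, references, max_n, skip_zero_terms=False):
    c = len(candidate)
    if c == 0:
        return 0.0
    # Reference length closest to c; ties go to the shorter reference.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    log_sum = 0.0
    for n in range(1, max_n + 1):
        clipped, total = clipped_precision(candidate, references, n)
        if clipped == 0:
            if skip_zero_terms:
                continue  # drop the log(0) term instead of letting it zero the score
            return 0.0    # strict definition: any p_n = 0 gives BLEU = 0
        log_sum += (1.0 / max_n) * math.log(clipped / total)
    return bp * math.exp(log_sum)


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
print(bleu(candidate, references, max_n=2))                        # strict reading of (1)
print(bleu(candidate, references, max_n=2, skip_zero_terms=True))  # zero term dropped
```

The strict call returns 0, while the relaxed call returns $\sqrt{2/7} \approx 0.535$. In practice, smoothing the zero counts is a common alternative to dropping terms.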

NIST

NIST - National Institute of Standards and Technology

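》The geometric mean of the co-occurrence precisions (= a weighted average of their logarithms) is replaced by an arithmetic mean
The scores of the co-occurring n-grams of each order are combined by arithmetic averaging. (BLEU combines them with a weighted geometric mean.)

》A variant of the brevity penalty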

$\text{Score} = \sum_{n=1}^N \left\{ \frac {\sum_{\text{all co-occurring } w_1...w_n} \text{Info}(w_1...w_n) } {\sum_{\text{all } w_1...w_n \text{ in the system translation}} (1) } \right\} \cdot \exp \left\{ \beta \log^2 \left[ \min\left( \frac{L_{sys}}{\overline{L_{ref}}}, 1 \right) \right] \right\}$

$\text{Info}(w_1...w_n) = \log \frac{\text{number of occurrences of } w_1...w_{n-1}}{\text{number of occurrences of } w_1...w_{n}}$
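The two NIST ingredients above, the information weight and the length penalty, can be sketched as follows. This is only an illustration under several assumptions that the formulas here do not fix: the logarithm in Info is taken to base 2, the "prefix count" of a unigram is taken to be the total number of reference words, and $\beta$ is chosen so that the penalty equals 0.5 when the system output is 2/3 as long as the average reference. All function names are made up for this sketch.

```python
import math
from collections import Counter


def ngram_counts(sentences, n):
    """n-gram counts accumulated over a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


def info_weights(reference_sentences, n):
    """Info(w_1..w_n) = log2(count(w_1..w_{n-1}) / count(w_1..w_n)) over the references."""
    full = ngram_counts(reference_sentences, n)
    if n == 1:
        # Assumption: the empty prefix "occurs" once per reference word.
        total_words = sum(full.values())
        return {gram: math.log2(total_words / count) for gram, count in full.items()}
    prefix = ngram_counts(reference_sentences, n - 1)
    return {gram: math.log2(prefix[gram[:-1]] / count) for gram, count in full.items()}


# Assumption: beta makes the penalty 0.5 at a length ratio of 2/3.
BETA = math.log(0.5) / math.log(2 / 3) ** 2


def nist_length_penalty(l_sys, l_ref_avg, beta=BETA):
    """exp(beta * log^2(min(L_sys / L_ref_avg, 1))); requires l_sys > 0."""
    ratio = min(l_sys / l_ref_avg, 1.0)
    return math.exp(beta * math.log(ratio) ** 2)


references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
print(info_weights(references, 2)[("the", "cat")])  # rarer continuation, higher weight
print(info_weights(references, 2)[("on", "the")])   # fully predictable continuation
print(nist_length_penalty(2, 3))
```

Rare n-grams carry more information than frequent ones, so matching them contributes more to the score. A continuation that always follows its prefix, such as "the" after "on" here, carries zero weight.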

To do this we used the F-ratio measure, namely the between-system score variance divided by within-system score variance.

That is, the F-ratio is the variance of the scores across different systems divided by the variance of the scores within a single system.
the superior F-ratios of information-weighted counts and the comparable correlations
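As a rough illustration of the F-ratio, the sketch below assumes the standard one-way ANOVA form (mean square between groups divided by mean square within groups); the quoted sentence does not spell out the exact estimator, so this is an assumption. The system names and scores are invented for the example.

```python
from statistics import mean


def f_ratio(scores_by_system):
    """Between-system variance divided by within-system variance (one-way ANOVA F)."""
    groups = list(scores_by_system.values())
    k = len(groups)                       # number of systems
    n_total = sum(len(g) for g in groups)  # number of scores overall
    grand_mean = mean(x for g in groups for x in g)
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / (k - 1)
    within = sum((x - mean(g)) ** 2 for g in groups for x in g) / (n_total - k)
    return between / within


# Hypothetical per-document scores of two systems (made-up numbers).
scores = {"system_A": [1.0, 2.0, 3.0], "system_B": [2.0, 3.0, 4.0]}
print(f_ratio(scores))
```

A metric with a higher F-ratio separates systems more clearly relative to its own noise, which is the sensitivity property described at the start of the post.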