BLEU, a Machine Translation Evaluation Metric: The Calculation in Detail

1. Introduction

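BLEU (Bilingual Evaluation Understudy) is an evaluation metric most readers will already know, and an introduction is only a quick web search away. The original paper is "BLEU: a Method for Automatic Evaluation of Machine Translation", from IBM.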

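This post works through one example to show in detail how BLEU is computed, and reads the source code of NLTK's nltk.align.bleu_score module along the way.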

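First, the formula:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log P_n\right)$$

where

$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

Here $P_n$ is the modified n-gram precision, $w_n$ is its weight, $N$ is the largest n-gram order used, $c$ is the length of the candidate translation and $r$ is the reference length.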

Note that the BLEU score here is computed for a single translation (one sample).
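To make the formula concrete, here is a minimal sketch that combines already-computed precisions and a brevity penalty into one score. The function name `combine_bleu` and the input values are hypothetical and only illustrate the arithmetic. Returning 0.0 when some $P_n$ is zero is an assumption of this sketch, since $\log 0$ is undefined.

```python
import math

def combine_bleu(precisions, weights, brevity_penalty):
    """BLEU = BP * exp(sum_n w_n * log P_n) for a single candidate."""
    # Assumption of this sketch: a zero precision gives a score of 0.0,
    # because log(0) is undefined.
    if any(p == 0 for p in precisions):
        return 0.0
    log_sum = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty * math.exp(log_sum)

# Hypothetical inputs: two precisions of 0.5, equal weights, no penalty.
# With uniform weights the exponential term is the geometric mean of the
# precisions, so the result is close to 0.5.
print(combine_bleu([0.5, 0.5], [0.5, 0.5], 1.0))
```

With uniform weights the score is simply BP times the geometric mean of the $P_n$, which is why a single poor precision pulls the whole score down.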

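NLTK's nltk.align.bleu_score module implements these formulas. It consists of three functions: two private functions that compute $P_n$ and BP, and one function that combines them into the BLEU score.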

# Compute the BLEU score
def bleu(candidate, references, weights)

# (1) Private function: computes the modified n-gram precision
def _modified_precision(candidate, references, n)

# (2) Private function: computes the brevity penalty (BP)
def _brevity_penalty(candidate, references)

Example:

Candidate translation (Predicted)
It is a guide to action which ensures that the military always obeys the commands of the party

Reference translations (Gold Standard)
1: It is a guide to action that ensures that the military will forever heed Party commands
2: It is the guiding principle which guarantees the military forces always being under the command of the Party
3: It is the practical guide for the army always to heed the directions of the party

2. Computing the Modified n-gram Precision (i.e. $P_n$)

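from collections import Counter
from nltk.util import ngrams  # yields the n-grams of a token sequence

def _modified_precision(candidate, references, n):
    # Count every n-gram in the candidate
    counts = Counter(ngrams(candidate, n))

    # The candidate has fewer than n tokens, so it has no n-grams
    if not counts:
        return 0

    # For each candidate n-gram, the largest count found in any single reference
    max_counts = {}
    for reference in references:
        reference_counts = Counter(ngrams(reference, n))
        for ngram in counts:
            max_counts[ngram] = max(max_counts.get(ngram, 0), reference_counts[ngram])

    # Clip each candidate count by that maximum
    clipped_counts = dict((ngram, min(count, max_counts[ngram])) for ngram, count in counts.items())

    # Clipped total over unclipped total (relies on true division, as in Python 3)
    return sum(clipped_counts.values()) / sum(counts.values())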

Here we take $N = 4$, that is, we compute the precisions from 1-gram up to 4-gram.

Modified 1-gram precision:

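First count how many times each word occurs in the candidate, then how many times it occurs in each reference. Max is the largest of the counts in the 3 references, and Min is the smaller of the candidate count and Max.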

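| Word | Candidate | Ref 1 | Ref 2 | Ref 3 | Max | Min |
|---|---|---|---|---|---|---|
| the | 3 | 1 | 4 | 4 | 4 | 3 |
| obeys | 1 | 0 | 0 | 0 | 0 | 0 |
| a | 1 | 1 | 0 | 0 | 1 | 1 |
| which | 1 | 0 | 1 | 0 | 1 | 1 |
| ensures | 1 | 1 | 0 | 0 | 1 | 1 |
| guide | 1 | 1 | 0 | 1 | 1 | 1 |
| always | 1 | 0 | 1 | 1 | 1 | 1 |
| is | 1 | 1 | 1 | 1 | 1 | 1 |
| of | 1 | 0 | 1 | 1 | 1 | 1 |
| to | 1 | 1 | 0 | 1 | 1 | 1 |
| commands | 1 | 1 | 0 | 0 | 1 | 1 |
| that | 1 | 2 | 0 | 0 | 2 | 1 |
| It | 1 | 1 | 1 | 1 | 1 | 1 |
| action | 1 | 1 | 0 | 0 | 1 | 1 |
| party | 1 | 0 | 0 | 1 | 1 | 1 |
| military | 1 | 1 | 1 | 0 | 1 | 1 |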

Then add up the Min values of all words, add up the candidate counts of all words, and divide the first sum by the second:

$$P_1 = \frac{3 + 0 + 14 \times 1}{3 + 15 \times 1} = \frac{17}{18} \approx 0.9444$$
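The 1-gram table can be rebuilt with a few lines of standard-library Python. This is only a sketch: it assumes plain whitespace tokenization and case-sensitive matching (so "Party" and "party" are different words), which is what the counts in the table imply. The row order may differ from the table.

```python
from collections import Counter

candidate = ("It is a guide to action which ensures that the military "
             "always obeys the commands of the party").split()
references = [
    "It is a guide to action that ensures that the military will forever heed Party commands".split(),
    "It is the guiding principle which guarantees the military forces always being under the command of the Party".split(),
    "It is the practical guide for the army always to heed the directions of the party".split(),
]

cand_counts = Counter(candidate)
ref_counts = [Counter(ref) for ref in references]

rows = []
for word, count in cand_counts.items():
    per_ref = [rc[word] for rc in ref_counts]   # counts in Ref 1..3
    max_ref = max(per_ref)                      # "Max" column
    rows.append((word, count, *per_ref, max_ref, min(count, max_ref)))  # last entry is "Min"

for row in rows:
    print(*row)

clipped_total = sum(row[-1] for row in rows)
print(clipped_total, len(candidate))  # 17 18
```

The last line prints the numerator and denominator of $P_1$.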

The same procedure gives the higher-order precisions:

Modified 2-gram precision:

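| 2-gram | Candidate | Ref 1 | Ref 2 | Ref 3 | Max | Min |
|---|---|---|---|---|---|---|
| ensures that | 1 | 1 | 0 | 0 | 1 | 1 |
| guide to | 1 | 1 | 0 | 0 | 1 | 1 |
| which ensures | 1 | 0 | 0 | 0 | 0 | 0 |
| obeys the | 1 | 0 | 0 | 0 | 0 | 0 |
| commands of | 1 | 0 | 0 | 0 | 0 | 0 |
| that the | 1 | 1 | 0 | 0 | 1 | 1 |
| a guide | 1 | 1 | 0 | 0 | 1 | 1 |
| of the | 1 | 0 | 1 | 1 | 1 | 1 |
| always obeys | 1 | 0 | 0 | 0 | 0 | 0 |
| the commands | 1 | 0 | 0 | 0 | 0 | 0 |
| to action | 1 | 1 | 0 | 0 | 1 | 1 |
| the party | 1 | 0 | 0 | 1 | 1 | 1 |
| is a | 1 | 1 | 0 | 0 | 1 | 1 |
| action which | 1 | 0 | 0 | 0 | 0 | 0 |
| It is | 1 | 1 | 1 | 1 | 1 | 1 |
| military always | 1 | 0 | 0 | 0 | 0 | 0 |
| the military | 1 | 1 | 1 | 0 | 1 | 1 |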

$$P_2 = \frac{10}{17} = 0.588235294$$

Modified 3-gram precision:

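| 3-gram | Candidate | Ref 1 | Ref 2 | Ref 3 | Max | Min |
|---|---|---|---|---|---|---|
| ensures that the | 1 | 1 | 0 | 0 | 1 | 1 |
| which ensures that | 1 | 0 | 0 | 0 | 0 | 0 |
| action which ensures | 1 | 0 | 0 | 0 | 0 | 0 |
| a guide to | 1 | 1 | 0 | 0 | 1 | 1 |
| military always obeys | 1 | 0 | 0 | 0 | 0 | 0 |
| the commands of | 1 | 0 | 0 | 0 | 0 | 0 |
| commands of the | 1 | 0 | 0 | 0 | 0 | 0 |
| to action which | 1 | 0 | 0 | 0 | 0 | 0 |
| the military always | 1 | 0 | 0 | 0 | 0 | 0 |
| obeys the commands | 1 | 0 | 0 | 0 | 0 | 0 |
| It is a | 1 | 1 | 0 | 0 | 1 | 1 |
| of the party | 1 | 0 | 0 | 1 | 1 | 1 |
| is a guide | 1 | 1 | 0 | 0 | 1 | 1 |
| that the military | 1 | 1 | 0 | 0 | 1 | 1 |
| always obeys the | 1 | 0 | 0 | 0 | 0 | 0 |
| guide to action | 1 | 1 | 0 | 0 | 1 | 1 |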

$$P_3 = \frac{7}{16} = 0.4375$$

Modified 4-gram precision:

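| 4-gram | Candidate | Ref 1 | Ref 2 | Ref 3 | Max | Min |
|---|---|---|---|---|---|---|
| to action which ensures | 1 | 0 | 0 | 0 | 0 | 0 |
| action which ensures that | 1 | 0 | 0 | 0 | 0 | 0 |
| guide to action which | 1 | 0 | 0 | 0 | 0 | 0 |
| obeys the commands of | 1 | 0 | 0 | 0 | 0 | 0 |
| which ensures that the | 1 | 0 | 0 | 0 | 0 | 0 |
| commands of the party | 1 | 0 | 0 | 0 | 0 | 0 |
| ensures that the military | 1 | 1 | 0 | 0 | 1 | 1 |
| a guide to action | 1 | 1 | 0 | 0 | 1 | 1 |
| always obeys the commands | 1 | 0 | 0 | 0 | 0 | 0 |
| that the military always | 1 | 0 | 0 | 0 | 0 | 0 |
| the commands of the | 1 | 0 | 0 | 0 | 0 | 0 |
| the military always obeys | 1 | 0 | 0 | 0 | 0 | 0 |
| military always obeys the | 1 | 0 | 0 | 0 | 0 | 0 |
| is a guide to | 1 | 1 | 0 | 0 | 1 | 1 |
| It is a guide | 1 | 1 | 0 | 0 | 1 | 1 |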

$$P_4 = \frac{4}{15} = 0.266666667$$
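As a cross-check on the four tables, the sketch below recomputes $P_1$ to $P_4$ as exact fractions. It is a standard-library variant of the clipping logic, not the NLTK function itself: the local `ngrams` helper and the name `modified_precision` are stand-ins written for this example, tokenization is a plain whitespace split, and the candidate is assumed to have at least n tokens.

```python
from collections import Counter
from fractions import Fraction

def ngrams(tokens, n):
    """All contiguous n-token tuples, in order (local stand-in helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    # Assumes the candidate has at least n tokens, so the denominator is non-zero.
    counts = Counter(ngrams(candidate, n))
    ref_counters = [Counter(ngrams(ref, n)) for ref in references]
    # Clip each candidate count by its largest count in any single reference
    clipped = sum(min(count, max(rc[gram] for rc in ref_counters))
                  for gram, count in counts.items())
    return Fraction(clipped, sum(counts.values()))

candidate = ("It is a guide to action which ensures that the military "
             "always obeys the commands of the party").split()
references = [
    "It is a guide to action that ensures that the military will forever heed Party commands".split(),
    "It is the guiding principle which guarantees the military forces always being under the command of the Party".split(),
    "It is the practical guide for the army always to heed the directions of the party".split(),
]

for n in range(1, 5):
    print(n, modified_precision(candidate, references, n))
# 1 17/18
# 2 10/17
# 3 7/16
# 4 4/15
```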

We then take $w_1 = w_2 = w_3 = w_4 = 0.25$, i.e. uniform weights.

Therefore:

$$\sum_{n=1}^{N} w_n \log P_n = 0.25 \log P_1 + 0.25 \log P_2 + 0.25 \log P_3 + 0.25 \log P_4 = -0.684055269517$$
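The weighted log sum is easy to verify numerically. The short sketch below plugs in the four precisions from the tables (natural logarithm, uniform weights) and also checks that exponentiating the sum gives the geometric mean of the precisions. It uses `math.prod`, which needs Python 3.8 or later.

```python
import math

precisions = [17 / 18, 10 / 17, 7 / 16, 4 / 15]  # P_1 .. P_4 from the tables
weights = [0.25] * 4                             # uniform weights

log_sum = sum(w * math.log(p) for w, p in zip(weights, precisions))
print(round(log_sum, 6))  # -0.684055

# With uniform weights, exp(log_sum) is the geometric mean of the precisions
geo_mean = math.prod(precisions) ** 0.25
print(math.isclose(math.exp(log_sum), geo_mean))  # True
```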

3. Computing the Brevity Penalty

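import math

def _brevity_penalty(candidate, references):
    c = len(candidate)
    ref_lens = (len(reference) for reference in references)
    # Python compares tuples element by element, e.g. (0, 1) > (1, 0) is False.
    # The key below uses this to pick the reference length closest to the
    # candidate length and, when several are equally close, the shortest one.
    # For example, with a candidate of length 10 and references of lengths
    # 9 and 11, r = 9.
    r = min(ref_lens, key=lambda ref_len: (abs(ref_len - c), ref_len))
    print('r:', r)  # debug output

    if c > r:
        return 1
    else:
        # Relies on true division (the default in Python 3)
        return math.exp(1 - r / c)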
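The tuple-comparison trick used to pick r can be tried in isolation. In this sketch, `pick_reference_length` is a hypothetical helper name that wraps the same `min(..., key=...)` expression and takes lengths directly.

```python
def pick_reference_length(c, ref_lens):
    """Reference length closest to c; ties go to the shorter length."""
    return min(ref_lens, key=lambda ref_len: (abs(ref_len - c), ref_len))

print((0, 1) > (1, 0))                       # False: the first elements decide
print(pick_reference_length(10, [9, 11]))    # 9: both are 1 away, the shorter wins
print(pick_reference_length(10, [11, 9]))    # 9: the order of references does not matter
print(pick_reference_length(18, [16, 18, 16]))  # 18
```

The key `(abs(ref_len - c), ref_len)` sorts first by distance to the candidate length and only then by the length itself, which is what makes the tie-break deterministic.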

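Next we compute BP (Brevity Penalty), a penalty for translations that are too short. From the formula, BP takes values in (0, 1]: the shorter the candidate is relative to the reference, the closer BP gets to 0.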

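The candidate has 18 words, and the references have 16, 18 and 16 words respectively.
So $c = 18$ and $r = 18$ (the reference length closest to the candidate length is taken as $r$).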

Since $c \le r$, we get $\mathrm{BP} = e^{1 - 18/18} = e^{0} = 1$.
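To see how the penalty behaves away from this case, here is a small sketch of the BP formula as a function of the two lengths. The name `brevity_penalty_from_lengths` and the lengths 12 and 20 are made up for illustration.

```python
import math

def brevity_penalty_from_lengths(c, r):
    """BP = 1 if c > r, else exp(1 - r / c)."""
    return 1.0 if c > r else math.exp(1 - r / c)

print(brevity_penalty_from_lengths(18, 18))  # 1.0, the example in this post
print(brevity_penalty_from_lengths(20, 18))  # 1.0, longer candidates are not penalized
print(brevity_penalty_from_lengths(12, 18))  # exp(-0.5), about 0.607
```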

4. Putting It All Together

Finally:

$$\mathrm{BLEU} = 1 \cdot \exp(-0.684055269517) = 0.504566684006$$
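The whole calculation fits in one short standard-library script. This is a sketch under a few assumptions, not NLTK's implementation: tokens come from a plain whitespace split, the helper names (`ngrams`, `modified_precision`, `brevity_penalty`, `sentence_bleu_sketch`) are chosen for this example, and a zero precision is mapped to a score of 0.0.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token tuples, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    counts = Counter(ngrams(candidate, n))
    if not counts:
        return 0.0
    ref_counters = [Counter(ngrams(ref, n)) for ref in references]
    clipped = sum(min(count, max(rc[gram] for rc in ref_counters))
                  for gram, count in counts.items())
    return clipped / sum(counts.values())

def brevity_penalty(candidate, references):
    c = len(candidate)
    # Closest reference length, ties broken toward the shorter one
    r = min((len(ref) for ref in references),
            key=lambda ref_len: (abs(ref_len - c), ref_len))
    return 1.0 if c > r else math.exp(1 - r / c)

def sentence_bleu_sketch(candidate, references, weights):
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, len(weights) + 1)]
    if min(precisions) == 0:
        return 0.0  # assumption of this sketch: log(0) is undefined
    log_sum = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(candidate, references) * math.exp(log_sum)

candidate = ("It is a guide to action which ensures that the military "
             "always obeys the commands of the party").split()
references = [
    "It is a guide to action that ensures that the military will forever heed Party commands".split(),
    "It is the guiding principle which guarantees the military forces always being under the command of the Party".split(),
    "It is the practical guide for the army always to heed the directions of the party".split(),
]

score = sentence_bleu_sketch(candidate, references, [0.25] * 4)
print(round(score, 6))  # 0.504567
```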

BLEU takes values in [0, 1]: 0 is the worst score and 1 is the best.

Working through the calculation shows that the BLEU score is essentially the "modified n-gram precision" combined with a "brevity penalty" factor.
