The BLEU Evaluation Metric

Definition

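  1. BLEU (Bilingual Evaluation Understudy) is a metric for evaluating machine translation output. It was introduced in the paper "BLEU: a Method for Automatic Evaluation of Machine Translation".
  2. In essence, the BLEU algorithm measures how similar two sentences are.
  3. BLEU has several variants that differ in the n-gram order they use. The common ones are BLEU-1, BLEU-2, BLEU-3, and BLEU-4, where an n-gram is a sequence of n consecutive words. BLEU-1 measures word-level accuracy, while higher-order BLEU also reflects the fluency of the sentence.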

Computation

  1. Computing BLEU roughly follows these steps:

    • Extract the n-grams of the candidate sentence and of the reference sentence, count how many of them match, and compute the match ratio:
      $$\text{match ratio}=\frac{\text{number of n-grams matched between candidate and reference}}{\text{number of n-grams in candidate}}$$

      Example (matching ignores case):

      candidate: It is a nice day today
      reference: Today is a nice day

      • Matching with 1-grams

        candidate: {it, is, a, nice, day, today}
        reference: {today, is, a, nice, day}
        

        Here {today, is, a, nice, day} match, so the match ratio is 5/6.

      • Matching with 2-grams

        candidate: {it is, is a, a nice, nice day, day today}
        reference: {today is, is a, a nice, nice day}
        

        Here {is a, a nice, nice day} match, so the match ratio is 3/5.

      • Matching with 3-grams

        candidate: {it is a, is a nice, a nice day, nice day today}
        reference: {today is a, is a nice, a nice day}
        

        Here {is a nice, a nice day} match, so the match ratio is 2/4.

      • Matching with 4-grams

        candidate: {it is a nice, is a nice day, a nice day today}
        reference: {today is a nice, is a nice day}
        

        Here {is a nice day} matches, so the match ratio is 1/3.
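The four match ratios above can be reproduced with a few lines of Python. This is only a sketch: the helper names `ngrams` and `match_ratio` are made up for illustration, tokens are lowercased and split on whitespace, and the count is not yet clipped.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def match_ratio(candidate, reference, n):
    """Unclipped ratio: (matched candidate n-grams, total candidate n-grams)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = set(ngrams(reference.lower().split(), n))
    matched = sum(1 for gram in cand if gram in ref)
    return matched, len(cand)


candidate = "It is a nice day today"
reference = "Today is a nice day"
for n in range(1, 5):
    matched, total = match_ratio(candidate, reference, n)
    print(f"{n}-gram: {matched}/{total}")
# 1-gram: 5/6
# 2-gram: 3/5
# 3-gram: 2/4
# 4-gram: 1/3
```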

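    • Modify the matched n-gram counts so that they take into account how often each word occurs in the reference text, instead of rewarding candidates that simply generate many plausible words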

      Example:

      candidate: the the the the

      reference: The cat is standing on the ground

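      With plain 1-gram matching the match ratio would be 1, because every "the" in the candidate also occurs in the reference. That is clearly unreasonable, so the way word occurrences are counted is refined.

      Instead of counting every occurrence of a word in the candidate, the count is clipped to the smaller of its occurrence counts in the candidate and in the reference:
      $$\operatorname{count}_{k}=\min\left(c_{k}, s_{k}\right)$$
      where $k$ indexes the $k$-th word of the machine translation (candidate), $c_{k}$ is the number of times that word occurs in the machine translation, and $s_{k}$ is the number of times it occurs in the human translation (reference).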
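A minimal sketch of the clipping rule, applied to the "the the the the" example. The function name `clipped_unigram_counts` is hypothetical, and lowercasing the reference (so that "The" and "the" count as the same word) is an assumption of this sketch.

```python
from collections import Counter


def clipped_unigram_counts(candidate_tokens, reference_tokens):
    """count_k = min(c_k, s_k) for every distinct word k of the candidate."""
    cand_counts = Counter(candidate_tokens)
    ref_counts = Counter(reference_tokens)
    return {word: min(c, ref_counts[word]) for word, c in cand_counts.items()}


candidate = "the the the the".split()
# Assumption: compare case-insensitively, so lowercase the reference.
reference = "The cat is standing on the ground".lower().split()

# Unclipped: every candidate word that appears anywhere in the reference counts.
reference_vocab = set(reference)
unclipped = sum(1 for word in candidate if word in reference_vocab)
clipped = clipped_unigram_counts(candidate, reference)

print(unclipped / len(candidate))              # 1.0
print(sum(clipped.values()) / len(candidate))  # 0.5
```

"the" occurs four times in the candidate but only twice in the reference, so the clipped count is 2 and the 1-gram precision drops from 1 to 2/4.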

      From this, the BLEU formula can be defined. First, some notation:

      • The human translations (references) are denoted $s_{j}$ with $j \in \mathrm{M}$, where $\mathrm{M}$ means there are $\mathrm{M}$ reference translations
      • The machine translations are denoted $c_{i}$ with $i \in \mathrm{E}$, where $\mathrm{E}$ means there are $\mathrm{E}$ translations in total
      • $n$ refers to the set of phrases (n-grams) that are $n$ words long, and $k$ indexes the $k$-th such phrase, out of $\mathrm{K}$
      • $h_{k}(c_{i})$ is the number of times the $k$-th phrase occurs in the machine translation $c_{i}$
      • $h_{k}(s_{i,j})$ is the number of times the $k$-th phrase occurs in the human translation $s_{i,j}$

      This gives the precision for each n-gram order:
      $$P_{n}=\frac{\sum_{i}^{\mathrm{E}} \sum_{k}^{\mathrm{K}} \min\left(h_{k}(c_{i}), \max_{j \in \mathrm{M}} h_{k}(s_{i,j})\right)}{\sum_{i}^{\mathrm{E}} \sum_{k}^{\mathrm{K}} h_{k}(c_{i})}$$
      The first sum runs over all translated sentences, since several sentences may be evaluated at once. The second sum runs over all n-grams of one translated sentence. $\max_{j \in \mathrm{M}} h_{k}(s_{i,j})$ is the largest number of times the $k$-th phrase occurs in any single one of the $\mathrm{M}$ human translations of the $i$-th sentence.
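The formula for $P_{n}$ can be sketched directly in Python. The names `ngram_counts` and `modified_precision_parts` are illustrative only (this is not NLTK's implementation), and the four sentences are the ones used in the nltk example of this post.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """h_k(.) for every n-gram k of one sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision_parts(candidates, references_per_candidate, n):
    """Return (numerator, denominator) of P_n summed over all sentences."""
    numerator = 0
    denominator = 0
    for cand, refs in zip(candidates, references_per_candidate):
        cand_counts = ngram_counts(cand, n)
        # max over the M references of h_k(s_ij), for every n-gram k
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        numerator += sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        denominator += sum(cand_counts.values())
    return numerator, denominator


cand = "it is a guide to action which ensures that the military always obeys the commands of the party".split()
refs = [
    "it is a guide to action that ensures that the military will forever heed party commands".split(),
    "it is the guiding principle which guarantees the military forces always being under the command of the party".split(),
    "it is the practical guide for the army always to heed the directions of the party".split(),
]
parts = [modified_precision_parts([cand], [refs], n) for n in range(1, 5)]
print(parts)
# [(17, 18), (10, 17), (7, 16), (4, 15)]
```

Returning the numerator and denominator separately, rather than their ratio, makes it easy to add them up across sentences for a corpus-level score.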

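    • The n-gram precision tends to improve as the sentence gets shorter. To counter this, BLEU adds a length penalty factor, the brevity penalty (BP), to the final score:
      $$BP=\begin{cases} 1 & \text{if } l_{c}>l_{s} \\ e^{1-\frac{l_{s}}{l_{c}}} & \text{if } l_{c} \le l_{s} \end{cases}$$
      where $l_{c}$ is the length of the machine translation and $l_{s}$ is the effective length of the reference translation. When there are several references, the length closest to that of the machine translation is used. When the machine translation is longer than the reference, the penalty factor is 1, meaning no penalty. A penalty is applied only when the machine translation is shorter than the reference (at equal length the second branch also evaluates to $e^{0}=1$).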
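A small sketch of the brevity penalty. The function names are hypothetical, and two details are assumptions of this sketch rather than something stated above: ties between equally close reference lengths are broken in favour of the shorter one, and an empty candidate gets a penalty of 0.

```python
import math


def closest_ref_length(candidate_len, reference_lens):
    """Effective reference length l_s: the one closest to l_c.

    Assumption: on a tie, the shorter reference length wins.
    """
    return min(reference_lens, key=lambda ref_len: (abs(ref_len - candidate_len), ref_len))


def brevity_penalty(candidate_len, reference_lens):
    ref_len = closest_ref_length(candidate_len, reference_lens)
    if candidate_len > ref_len:
        return 1.0
    if candidate_len == 0:
        return 0.0  # assumption: avoid dividing by zero for an empty candidate
    return math.exp(1 - ref_len / candidate_len)


print(brevity_penalty(18, [16, 18, 16]))  # 1.0 (closest reference has length 18)
print(brevity_penalty(3, [4, 4]))         # exp(1 - 4/3), a penalty below 1
```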

    • The final BLEU formula

      To balance the contribution of the different n-gram orders, the log precisions of all orders are combined as a weighted sum. Typically $N$ is 4, so at most 4-gram precision is counted, and $W_{n}$ is $1/N$, i.e. uniform weighting. The final formula is:
      $$BLEU=BP \times \exp\left(\sum_{n=1}^{N} W_{n} \log P_{n}\right)$$
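Putting the pieces together, the final formula is a brevity penalty times a weighted geometric mean. The sketch below uses a made-up function name, `bleu_from_parts`, and returns 0 when any weighted precision is 0 (an assumption, since $\log 0$ is undefined). The precisions are the ones from the nltk example in this post.

```python
import math


def bleu_from_parts(precisions, bp, weights=None):
    """BLEU = BP * exp(sum_n W_n * log P_n)."""
    if weights is None:
        # Uniform weights W_n = 1/N
        weights = [1 / len(precisions)] * len(precisions)
    # Assumption of this sketch: a zero precision gives a zero score.
    if any(p == 0 for p, w in zip(precisions, weights) if w > 0):
        return 0.0
    log_sum = sum(w * math.log(p) for p, w in zip(precisions, weights) if w > 0)
    return bp * math.exp(log_sum)


precisions = [17 / 18, 10 / 17, 7 / 16, 4 / 15]
print(bleu_from_parts(precisions, bp=1.0))                         # about 0.5046 (BLEU-4)
print(bleu_from_parts(precisions, bp=1.0, weights=[0.5, 0.5, 0, 0]))  # about 0.7454 (BLEU-2)
```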

  2. Tools

    • nltk

      • Individual BLEU: the score of a single n-gram order only

        from nltk.translate.bleu_score import sentence_bleu
        
        sentence1 = "it is a guide to action which ensures that the military always obeys the commands of the party"
        sentence2 = "it is a guide to action that ensures that the military will forever heed party commands"
        sentence3 = "it is the guiding principle which guarantees the military forces always being under the command of the party"
        sentence4 = "it is the practical guide for the army always to heed the directions of the party"
        
        candidate = list(sentence1.split(" "))
        reference = [list(sentence2.split(" ")), list(sentence3.split(" ")), list(sentence4.split(" "))]
        
        print('Individual 1-gram: {}'.format(sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))))
        print('Individual 2-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0, 1, 0, 0))))
        print('Individual 3-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0, 0, 1, 0))))
        print('Individual 4-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0, 0, 0, 1))))
        
        # Individual 1-gram: 0.9444444444444444
        # Individual 2-gram: 0.5882352941176471
        # Individual 3-gram: 0.4375
        # Individual 4-gram: 0.26666666666666666
        
        1. Computing $P_{1}$

        | 1-gram | candidate | reference 1 | reference 2 | reference 3 | $\max_{j \in \mathrm{M}} h(s)$ | $\min(h(c), \max_{j \in \mathrm{M}} h(s))$ |
        |---|---|---|---|---|---|---|
        | it | 1 | 1 | 1 | 1 | 1 | 1 |
        | is | 1 | 1 | 1 | 1 | 1 | 1 |
        | a | 1 | 1 | 0 | 0 | 1 | 1 |
        | guide | 1 | 1 | 0 | 1 | 1 | 1 |
        | to | 1 | 1 | 0 | 1 | 1 | 1 |
        | action | 1 | 1 | 0 | 0 | 1 | 1 |
        | which | 1 | 0 | 1 | 0 | 1 | 1 |
        | ensures | 1 | 1 | 0 | 0 | 1 | 1 |
        | that | 1 | 2 | 0 | 0 | 2 | 1 |
        | the | 3 | 1 | 4 | 4 | 4 | 3 |
        | military | 1 | 1 | 1 | 0 | 1 | 1 |
        | always | 1 | 0 | 1 | 1 | 1 | 1 |
        | obeys | 1 | 0 | 0 | 0 | 0 | 0 |
        | commands | 1 | 1 | 0 | 0 | 1 | 1 |
        | of | 1 | 0 | 1 | 1 | 1 | 1 |
        | party | 1 | 1 | 1 | 1 | 1 | 1 |

        $$P_{1}=\frac{1+1+1+1+1+1+1+1+1+3+1+1+0+1+1+1}{1+1+1+1+1+1+1+1+1+3+1+1+1+1+1+1}=\frac{17}{18}=0.9444444444444444$$

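        2. Computing $P_{2}$

        | 2-gram | candidate | reference 1 | reference 2 | reference 3 | $\max_{j \in \mathrm{M}} h(s)$ | $\min(h(c), \max_{j \in \mathrm{M}} h(s))$ |
        |---|---|---|---|---|---|---|
        | ensures that | 1 | 1 | 0 | 0 | 1 | 1 |
        | guide to | 1 | 1 | 0 | 0 | 1 | 1 |
        | which ensures | 1 | 0 | 0 | 0 | 0 | 0 |
        | obeys the | 1 | 0 | 0 | 0 | 0 | 0 |
        | commands of | 1 | 0 | 0 | 0 | 0 | 0 |
        | that the | 1 | 1 | 0 | 0 | 1 | 1 |
        | a guide | 1 | 1 | 0 | 0 | 1 | 1 |
        | of the | 1 | 0 | 1 | 1 | 1 | 1 |
        | always obeys | 1 | 0 | 0 | 0 | 0 | 0 |
        | the commands | 1 | 0 | 0 | 0 | 0 | 0 |
        | to action | 1 | 1 | 0 | 0 | 1 | 1 |
        | the party | 1 | 0 | 1 | 1 | 1 | 1 |
        | is a | 1 | 1 | 0 | 0 | 1 | 1 |
        | action which | 1 | 0 | 0 | 0 | 0 | 0 |
        | it is | 1 | 1 | 1 | 1 | 1 | 1 |
        | military always | 1 | 0 | 0 | 0 | 0 | 0 |
        | the military | 1 | 1 | 1 | 0 | 1 | 1 |

        $$P_{2}=\frac{10}{17}=0.5882352941176471$$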

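        3. Computing $P_{3}$

        | 3-gram | candidate | reference 1 | reference 2 | reference 3 | $\max_{j \in \mathrm{M}} h(s)$ | $\min(h(c), \max_{j \in \mathrm{M}} h(s))$ |
        |---|---|---|---|---|---|---|
        | ensures that the | 1 | 1 | 0 | 0 | 1 | 1 |
        | which ensures that | 1 | 0 | 0 | 0 | 0 | 0 |
        | action which ensures | 1 | 0 | 0 | 0 | 0 | 0 |
        | a guide to | 1 | 1 | 0 | 0 | 1 | 1 |
        | military always obeys | 1 | 0 | 0 | 0 | 0 | 0 |
        | the commands of | 1 | 0 | 0 | 0 | 0 | 0 |
        | commands of the | 1 | 0 | 0 | 0 | 0 | 0 |
        | to action which | 1 | 0 | 0 | 0 | 0 | 0 |
        | the military always | 1 | 0 | 0 | 0 | 0 | 0 |
        | obeys the commands | 1 | 0 | 0 | 0 | 0 | 0 |
        | it is a | 1 | 1 | 0 | 0 | 1 | 1 |
        | of the party | 1 | 0 | 1 | 1 | 1 | 1 |
        | is a guide | 1 | 1 | 0 | 0 | 1 | 1 |
        | that the military | 1 | 1 | 0 | 0 | 1 | 1 |
        | always obeys the | 1 | 0 | 0 | 0 | 0 | 0 |
        | guide to action | 1 | 1 | 0 | 0 | 1 | 1 |

        $$P_{3}=\frac{7}{16}=0.4375$$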

        4. Computing $P_{4}$

        | 4-gram | candidate | reference 1 | reference 2 | reference 3 | $\max_{j \in \mathrm{M}} h(s)$ | $\min(h(c), \max_{j \in \mathrm{M}} h(s))$ |
        |---|---|---|---|---|---|---|
        | to action which ensures | 1 | 0 | 0 | 0 | 0 | 0 |
        | action which ensures that | 1 | 0 | 0 | 0 | 0 | 0 |
        | guide to action which | 1 | 0 | 0 | 0 | 0 | 0 |
        | obeys the commands of | 1 | 0 | 0 | 0 | 0 | 0 |
        | which ensures that the | 1 | 0 | 0 | 0 | 0 | 0 |
        | commands of the party | 1 | 0 | 0 | 0 | 0 | 0 |
        | ensures that the military | 1 | 1 | 0 | 0 | 1 | 1 |
        | a guide to action | 1 | 1 | 0 | 0 | 1 | 1 |
        | always obeys the commands | 1 | 0 | 0 | 0 | 0 | 0 |
        | that the military always | 1 | 0 | 0 | 0 | 0 | 0 |
        | the commands of the | 1 | 0 | 0 | 0 | 0 | 0 |
        | the military always obeys | 1 | 0 | 0 | 0 | 0 | 0 |
        | military always obeys the | 1 | 0 | 0 | 0 | 0 | 0 |
        | is a guide to | 1 | 1 | 0 | 0 | 1 | 1 |
        | it is a guide | 1 | 1 | 0 | 0 | 1 | 1 |

        $$P_{4}=\frac{4}{15}=0.26666666666666666$$

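      • Cumulative BLEU: the precisions of the individual n-gram orders are combined with their weights into a weighted geometric mean. Note the following:

        1. BLEU-4 does not look at 4-grams only. It is the cumulative score over 1-grams to 4-grams, with the 1-gram, 2-gram, 3-gram, and 4-gram precisions each weighted 25%
        2. By default (when the weights argument is not given), both sentence_bleu() and corpus_bleu() compute the BLEU-4 score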
        print('Cumulative 1-gram: {}'.format(sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))))
        print('Cumulative 2-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0))))
        print('Cumulative 3-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0))))
        print('Cumulative 4-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))))
        
        # Cumulative 1-gram: 0.9444444444444444
        # Cumulative 2-gram: 0.7453559924999299
        # Cumulative 3-gram: 0.6270220769211224
        # Cumulative 4-gram: 0.5045666840058485
        
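        1. Computing BLEU-1

          The translated sentence has length 18, and the reference sentences have lengths 16, 18, and 16. The reference whose length is closest to that of the translated sentence is chosen (length 18), so the penalty factor is 1, i.e. no penalty.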

          import math

          math.exp(1 * math.log(0.9444444444444444))
          # 0.9444444444444444
          
        2. Computing BLEU-2

          math.exp(0.5 * math.log(0.9444444444444444) + 0.5 * math.log(0.5882352941176471))
          # 0.7453559924999299
          
        3. Computing BLEU-3

          math.exp(0.33 * math.log(0.9444444444444444) + 0.33 * math.log(0.5882352941176471) + 0.33 * math.log(0.4375))
          # 0.6270220769211224
          
        4. Computing BLEU-4

          math.exp(0.25 * math.log(0.9444444444444444) + 0.25 * math.log(0.5882352941176471) 
                   + 0.25 * math.log(0.4375) + 0.25 * math.log(0.26666666666666666))
          # 0.5045666840058485
          
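      • Computing corpus-level BLEU with corpus_bleu()

        1. Comparison of the input parameters of sentence_bleu() and corpus_bleu() (pay particular attention to the references and hypothesis parameters of sentence_bleu(), and to the list_of_references and hypotheses parameters of corpus_bleu())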

          def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
                            smoothing_function=None):
              """
              :param references: reference sentences
              :type references: list(list(str))
              :param hypothesis: a hypothesis sentence
              :type hypothesis: list(str)
              :param weights: weights for unigrams, bigrams, trigrams and so on
              :type weights: list(float)
              :return: The sentence-level BLEU score.
              :rtype: float
              """
              return corpus_bleu([references], [hypothesis], weights, smoothing_function)
          
          references = [ ["This", "is", "a", "cat"], ["This", "is", "a", "feline"] ]
          hypothesis = ["This", "is", "cat"]
          sentence_bleu(references, hypothesis)
          
          def corpus_bleu(list_of_references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=None):
              """
              :param list_of_references: a corpus of lists of reference sentences, w.r.t. hypotheses
              :type list_of_references: list(list(list(str)))
              :param hypotheses: a list of hypothesis sentences
              :type hypotheses: list(list(str))
              :param weights: weights for unigrams, bigrams, trigrams and so on
              :type weights: list(float)
              :return: The corpus-level BLEU score.
              :rtype: float
              """
          
        2. Computing the corpus-level BLEU score

          from nltk.translate.bleu_score import corpus_bleu
          
          s1 = "the dog bit the man"
          s2 = "the dog had bit the man"
          
          s3 = "it was not unexpected"
          s4 = "no one was surprised"
          
          s5 = "the man bit him first"
          s6 = "the man had bitten the dog"
          
          s7 = "the dog bit the man"
          s8 = "it was not surprising"
          s9 = "the man had just bitten him"
          
          candidates = [list(s7.split(" ")), list(s8.split(" ")), list(s9.split(" "))]
          references = [
              [list(s1.split(" ")), list(s2.split(" "))],
              [list(s3.split(" ")), list(s4.split(" "))],
              [list(s5.split(" ")), list(s6.split(" "))]
          ]
          
          print('Corpus BLEU: {}'.format(corpus_bleu(references, candidates)))
          # Corpus BLEU: 0.5719285395120957
          

          Note: averaging the BLEU scores of the individual sentences is not the same as computing the corpus-level BLEU directly, as the following shows:

          reference1 = [list(s1.split(" ")), list(s2.split(" "))]
          candidate1 = list(s7.split(" "))
          
          reference2 = [list(s3.split(" ")), list(s4.split(" "))]
          candidate2 = list(s8.split(" "))
          
          reference3 = [list(s5.split(" ")), list(s6.split(" "))]
          candidate3 = list(s9.split(" "))
          
          print('Sentence1 BLEU: ', sentence_bleu(reference1, candidate1))
          print('Sentence2 BLEU: ', sentence_bleu(reference2, candidate2))
          print('Sentence3 BLEU: ', sentence_bleu(reference3, candidate3))
          print('Average Sentence BLEU: ', (sentence_bleu(reference1, candidate1) + 
                                                  sentence_bleu(reference2, candidate2) + 
                                                  sentence_bleu(reference3, candidate3)) / 3)
          
          # Sentence1 BLEU:  1.0
          # Sentence2 BLEU:  8.636168555094496e-78
          # Sentence3 BLEU:  6.562069055463047e-78
          # Average Sentence BLEU:  0.3333333333333333
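The tiny scores for sentences 2 and 3 come from a zero 4-gram precision. As far as I can tell, when no smoothing function is given NLTK replaces a zero precision with the smallest positive float (`sys.float_info.min`) so that the logarithm is defined; treat that as an assumption about NLTK's internals. Under that assumption, the score of sentence 2 can be reproduced by hand, using its precisions 3/4, 2/3, 1/2, and 0/1:

```python
import math
import sys

# Modified precisions of "it was not surprising" against its two references.
precisions = [3 / 4, 2 / 3, 1 / 2, 0.0]

# Assumption: a zero precision is replaced by the smallest positive float.
safe = [p if p > 0 else sys.float_info.min for p in precisions]

# Work in log space; the brevity penalty is 1 (both lengths are 4).
score = math.exp(sum(0.25 * math.log(p) for p in safe))
print(score)  # on the order of 1e-77, i.e. effectively zero
```

One missing 4-gram match is enough to push a sentence-level score to practically zero, which is why the average of the sentence scores (about 0.33) is so far from the corpus-level score (about 0.57).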
          

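          The correct approach is as follows: for each n-gram order i, add up the numerators and the denominators of the per-sentence i-gram precisions separately, which gives four unified individual precisions ($P_{i}, i \in \{1,2,3,4\}$), and then apply the formula. In particular, when computing the brevity penalty BP, the translation length is the sum of the lengths of all translated sentences, and the reference length is the sum of the lengths of the references that are closest in length to their corresponding translated sentences.
          $$BLEU=BP \times \exp\left(\sum_{n=1}^{4} 0.25 \log P_{n}\right)$$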

          import math
          from collections import Counter
          from nltk.translate.bleu_score import modified_precision
          p_numerators = Counter()
          p_denominators = Counter()
          for refs, hyps in zip(references, candidates):
              for i in range(1, 5):
                  p_i = modified_precision(refs, hyps, i)
                  p_numerators[i] += p_i.numerator
                  p_denominators[i] += p_i.denominator
                  
          print(p_numerators, p_denominators)
          # Counter({1: 13, 2: 8, 3: 5, 4: 2}) Counter({1: 15, 2: 12, 3: 9, 4: 6})
          res = 0
          for i in range(1, 5):
              res += 0.25 * math.log(p_numerators[i] / p_denominators[i])
              
          # The brevity penalty is 1 in this example
          print(math.exp(res))
          # 0.5719285395120957
          
          list(zip(references, candidates))
          # [([['the', 'dog', 'bit', 'the', 'man'],
          #    ['the', 'dog', 'had', 'bit', 'the', 'man']],
          #   ['the', 'dog', 'bit', 'the', 'man']),
          #  ([['it', 'was', 'not', 'unexpected'], ['no', 'one', 'was', 'surprised']],
          #   ['it', 'was', 'not', 'surprising']),
          #  ([['the', 'man', 'bit', 'him', 'first'],
          #    ['the', 'man', 'had', 'bitten', 'the', 'dog']],
          #   ['the', 'man', 'had', 'just', 'bitten', 'him'])]
          
    • sacrebleu

      • Computing sentence BLEU

        import sacrebleu
        sentence1 = "it is a guide to action which ensures that the military always obeys the commands of the party"
        sentence2 = "it is a guide to action that ensures that the military will forever heed party commands"
        sentence3 = "it is the guiding principle which guarantees the military forces always being under the command of the party"
        sentence4 = "it is the practical guide for the army always to heed the directions of the party"
        bleu = sacrebleu.sentence_bleu(sentence1, [sentence2, sentence3, sentence4])
        print("Sentence BLEU: ", bleu)
        # Sentence BLEU:  BLEU = 50.46 94.4/58.8/43.8/26.7 (BP = 1.000 ratio = 1.000 hyp_len = 18 ref_len = 18)
        
      • Computing corpus BLEU

        refs = [['the dog bit the man', 'it was not unexpected', 'the man bit him first'],
                ['the dog had bit the man', 'no one was surprised', 'the man had bitten the dog']]
        sys = ['the dog bit the man', "it was not surprising", 'the man had just bitten him']
        bleu = sacrebleu.corpus_bleu(sys, refs)
        print("Corpus BLEU: ", bleu)
        # Corpus BLEU:  BLEU = 57.19 86.7/66.7/55.6/33.3 (BP = 1.000 ratio = 1.000 hyp_len = 15 ref_len = 15)
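Note that the `refs` layout in the sacrebleu example is transposed compared with the nltk one: nltk groups the references by sentence, while here each inner list holds one reference for every sentence. A minimal sketch of the conversion in plain Python follows; the variable names are illustrative, and it assumes every sentence has the same number of references (otherwise `zip` silently truncates).

```python
# nltk-style grouping: one list of references per candidate sentence.
per_sentence_refs = [
    ["the dog bit the man", "the dog had bit the man"],
    ["it was not unexpected", "no one was surprised"],
    ["the man bit him first", "the man had bitten the dog"],
]

# sacrebleu-style grouping: one list per reference "stream",
# each aligned with the list of candidate sentences.
ref_streams = [list(stream) for stream in zip(*per_sentence_refs)]

for stream in ref_streams:
    print(stream)
# ['the dog bit the man', 'it was not unexpected', 'the man bit him first']
# ['the dog had bit the man', 'no one was surprised', 'the man had bitten the dog']
```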
        
    • multi-bleu.perl: requires the sentences to be tokenized beforehand, which means the score it reports depends on how the tokenizer splits the text

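    • mteval-v14.pl: the script has a standard built-in tokenizer, so untokenized sentences can be passed in directly; the score it computes is the same as the one computed by sacrebleu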
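To see why tokenization matters, here is a toy illustration. Both tokenizers are hypothetical stand-ins written for this sketch, not the ones used by multi-bleu.perl, mteval-v14.pl, or sacrebleu. The same sentence pair gets a different clipped 1-gram precision depending on whether punctuation is split off.

```python
import re
from collections import Counter


def unigram_precision(cand_tokens, ref_tokens):
    """Clipped 1-gram precision as (matched, total)."""
    ref_counts = Counter(ref_tokens)
    cand_counts = Counter(cand_tokens)
    matched = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    return matched, len(cand_tokens)


def whitespace_tokenize(text):
    return text.split()


def punctuation_tokenize(text):
    # Split words and punctuation marks into separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


candidate = "The cat sat on the mat."
reference = "The cat sat on the mat!"

for tokenize in (whitespace_tokenize, punctuation_tokenize):
    matched, total = unigram_precision(tokenize(candidate), tokenize(reference))
    print(f"{tokenize.__name__}: {matched}/{total}")
# whitespace_tokenize: 5/6
# punctuation_tokenize: 6/7
```

With whitespace splitting, "mat." and "mat!" are different tokens and the whole word is lost; once punctuation is split off, only the punctuation mark itself fails to match.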
