How the BLEU Score Is Computed

Sophie'sCookingLab

Last modified 2025-03-05 09:58:47

First published 2025-03-05 09:54:02

Article link: https://blog.csdn.net/weixin_40566713/article/details/146034669

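BLEU (Bilingual Evaluation Understudy) is an automatic metric for evaluating machine translation, text generation, and similar tasks. It measures how similar a candidate sentence (the model output) is to a reference sentence (a human-written gold answer) by computing n-gram precision.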

1. How BLEU Is Computed

The BLEU score is built from the following components:

1.1 Computing n-gram precision

BLEU judges the quality of generated text by matching n-grams (contiguous sequences of n words) against the reference.

Example:

Reference translation (Reference): "this is a test"
Model output (Candidate): "this is test"

n-gram computation:

1-gram (single-word matches):
- Candidate: ['this', 'is', 'test']
- Reference: ['this', 'is', 'a', 'test']
- Matched words: "this", "is", "test" → 3/3 = 1.0
2-gram (adjacent word pairs):
- Candidate: ['this is', 'is test']
- Reference: ['this is', 'is a', 'a test']
- Matched phrases: "this is" → 1/2 = 0.5

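The precision for each n-gram order is the clipped match count divided by the number of n-grams in the candidate, where each candidate n-gram is credited at most as many times as it occurs in the reference:
p_n = \frac{\sum_{g \in G_n(C)} \min(\mathrm{Count}_C(g), \mathrm{Count}_R(g))}{\sum_{g \in G_n(C)} \mathrm{Count}_C(g)}
Here G_n(C) is the set of distinct n-grams in the candidate C, and \mathrm{Count}_C(g) and \mathrm{Count}_R(g) are the number of times g occurs in the candidate and in the reference.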
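The precision computation above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the helper names `ngrams` and `modified_precision` are made up for this example, and it assumes whitespace tokenization and a single reference.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Return (clipped match count, total candidate n-grams) for order n."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate count by how often that n-gram occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched, total


reference = "this is a test".split()
candidate = "this is test".split()
for n in (1, 2):
    matched, total = modified_precision(candidate, reference, n)
    print(f"{n}-gram precision: {matched}/{total}")
```

Clipping matters for repetitive output: a candidate made of seven copies of "the" gets credit for at most as many "the" tokens as the reference contains.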

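1.2 Averaging the n-gram precisions (geometric mean)

To balance the contributions of the different n-gram orders, BLEU computes the precisions for 1-grams through 4-grams and takes their weighted geometric mean:
\exp\left(\sum_{n=1}^{N} w_n \log p_n\right)

By default N = 4 and every weight is 0.25:
\exp\left(0.25 (\log p_1 + \log p_2 + \log p_3 + \log p_4)\right) = (p_1 p_2 p_3 p_4)^{1/4}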
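As a rough sketch of the averaging step, the weighted geometric mean can be computed in log space. The function name `weighted_geometric_mean` is hypothetical, and returning 0 as soon as any precision is 0 is an assumption that matches the unsmoothed definition.

```python
import math


def weighted_geometric_mean(precisions, weights=None):
    """Combine n-gram precisions as exp(sum_n w_n * log p_n)."""
    if weights is None:
        # Uniform weights, e.g. 0.25 each for four n-gram orders.
        weights = [1.0 / len(precisions)] * len(precisions)
    if any(p == 0 for p in precisions):
        # log(0) is undefined; the geometric mean of a set containing 0 is 0.
        return 0.0
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))


print(weighted_geometric_mean([1.0, 0.5], [0.5, 0.5]))
```

Because the mean is geometric rather than arithmetic, a single zero precision drives the whole score to 0, which is why short sentences with no 4-gram match score 0 unless smoothing is applied.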

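1.3 Brevity Penalty (BP) for short outputs

Precision alone favors short outputs: a candidate that is much shorter than the reference can still reach a high n-gram precision and therefore an inflated BLEU score. BLEU therefore introduces a Brevity Penalty (BP) to penalize outputs that are too short:
BP = \begin{cases} 1, & c > r \\ \exp(1 - r/c), & c \le r \end{cases}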

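where:

c is the length of the candidate translation.
r is the length of the reference translation (with multiple references, the length closest to the candidate's can be used).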
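A small sketch of the penalty, assuming lengths are counted in tokens. The names `brevity_penalty` and `closest_ref_length` are illustrative; returning 0 for an empty candidate and breaking ties toward the shorter reference are assumptions of this sketch, not something the text above specifies.

```python
import math


def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r / c)."""
    if c == 0:
        # Assumption: an empty candidate gets the maximum penalty.
        return 0.0
    if c > r:
        return 1.0
    return math.exp(1 - r / c)


def closest_ref_length(candidate_len, reference_lens):
    """Pick the reference length closest to the candidate; ties go to the shorter one."""
    return min(reference_lens, key=lambda r: (abs(r - candidate_len), r))


r = closest_ref_length(5, [6, 9])
print(r, round(brevity_penalty(5, r), 3))
```

The penalty only ever lowers the score: candidates longer than the reference are not rewarded, since their extra tokens already reduce precision.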

The final BLEU score is:
BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)

2. BLEU Example

Example 1: Computing BLEU by hand

Suppose:

Reference translation (Reference): "the cat is on the mat"
Machine translation (Candidate): "the cat on the mat"

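Step 1: Compute the n-gram precisions

n-gram	Candidate n-grams	Reference n-grams	Matched / total	Precision
1-gram	the, cat, on, the, mat	the, cat, is, on, the, mat	5/5	1.0
2-gram	the cat, cat on, on the, the mat	the cat, cat is, is on, on the, the mat	3/4	0.75
3-gram	the cat on, cat on the, on the mat	the cat is, cat is on, is on the, on the mat	1/3	0.33
4-gram	the cat on the, cat on the mat	the cat is on, cat is on the, is on the mat	0/2	0

All five candidate words occur in the reference ("the" appears twice in both), so the 1-gram precision is 1.0. Only "on the mat" matches among the 3-grams, and no candidate 4-gram occurs in the reference.

Step 2: Compute the geometric mean

(p_1 p_2 p_3 p_4)^{1/4} = (1.0 \times 0.75 \times 0.33 \times 0)^{1/4} = 0

Step 3: Compute BP

c = 5 (candidate length)
r = 6 (reference length)
BP = \exp(1 - 6/5) = \exp(-0.2) \approx 0.819

Final BLEU score:
BLEU = BP \times (p_1 p_2 p_3 p_4)^{1/4} = 0.819 \times 0 = 0

Because the candidate shares no 4-gram with the reference, the unsmoothed BLEU-4 score is 0 even though only one word is missing. This is why sentence-level BLEU is usually computed with smoothing, or with a smaller maximum n-gram order.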
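The hand calculation can be double-checked with a short script. This is a sketch under stated assumptions: one reference, uniform weights, no smoothing, and a made-up function name `bleu` that also returns the intermediate counts for inspection.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU; returns (score, per-order (matched, total), BP)."""
    stats = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        stats.append((matched, sum(cand.values())))
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0, stats, 0.0
    bp = 1.0 if c > r else math.exp(1 - r / c)
    if any(matched == 0 for matched, _ in stats):
        # A zero precision makes the geometric mean, and hence BLEU, zero.
        return 0.0, stats, bp
    log_mean = sum(math.log(m / t) for m, t in stats) / max_n
    return bp * math.exp(log_mean), stats, bp


reference = "the cat is on the mat".split()
candidate = "the cat on the mat".split()
score, stats, bp = bleu(candidate, reference)
print(stats, round(bp, 3), score)

# Restricting to 1-grams and 2-grams avoids the zero 4-gram precision.
bleu2, _, _ = bleu(candidate, reference, max_n=2)
print(round(bleu2, 3))
```

With `max_n=2` the score is BP times the square root of 1.0 × 0.75, which shows how strongly the choice of maximum n-gram order affects short sentences.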

3. Python Implementation

Computing BLEU with NLTK:

from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'on', 'the', 'mat']

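# The candidate shares no 4-gram with the reference, so without smoothing the
# default BLEU-4 score is effectively 0 and NLTK emits a warning. Passing
# smoothing_function=SmoothingFunction().method1 gives a usable sentence-level score.
score = sentence_bleu(reference, candidate)
print(f"BLEU Score: {score:.4f}")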
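The candidate "the cat on the mat" has 5 of 5 unigrams, 3 of 4 bigrams, 1 of 3 trigrams, and 0 of 2 four-grams in common with the reference, so some form of smoothing is needed to get a non-zero sentence-level score. The sketch below is pure Python and only illustrates the idea of epsilon smoothing (replacing a zero match count with a small constant), in the spirit of NLTK's `SmoothingFunction().method1`. The function name `smoothed_bleu` and the default `epsilon=0.1` are assumptions of this example, not NLTK's code.

```python
import math


def smoothed_bleu(matches, totals, c, r, epsilon=0.1):
    """BLEU from per-order match counts, with zero counts replaced by epsilon."""
    precisions = [
        (m if m > 0 else epsilon) / t
        for m, t in zip(matches, totals)
    ]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Uniform weights: average the log precisions.
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean)


# Match counts for "the cat on the mat" against "the cat is on the mat".
score = smoothed_bleu(matches=[5, 3, 1, 0], totals=[5, 4, 3, 2], c=5, r=6)
print(f"Smoothed BLEU: {score:.4f}")  # roughly 0.27
```

The smoothed value depends heavily on the chosen epsilon, so scores are only comparable when the same smoothing method is used throughout.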

Computing BLEU over multiple sentences with corpus_bleu:

from nltk.translate.bleu_score import corpus_bleu

references = [
    [['the', 'cat', 'is', 'on', 'the', 'mat']],
    [['there', 'is', 'a', 'cat', 'on', 'the', 'mat']]
]
candidates = [
    ['the', 'cat', 'on', 'the', 'mat'],
    ['there', 'is', 'a', 'cat', 'on', 'mat']
]

score = corpus_bleu(references, candidates)
print(f"Corpus BLEU Score: {score:.4f}")
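Corpus-level BLEU is not the average of the sentence-level scores: the match counts and lengths are summed over all sentences first, and the precisions and BP are computed once from the pooled totals. The following pure-Python sketch illustrates that pooling. The function `pooled_corpus_bleu` is hypothetical and, unlike NLTK, takes exactly one reference per candidate as a flat token list.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def pooled_corpus_bleu(references, candidates, max_n=4):
    """Corpus BLEU sketch: one reference per candidate, counts pooled over the corpus."""
    matched = [0] * max_n
    total = [0] * max_n
    c = r = 0
    for ref, cand in zip(references, candidates):
        c += len(cand)
        r += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = ngram_counts(cand, n)
            ref_counts = ngram_counts(ref, n)
            matched[n - 1] += sum(min(k, ref_counts[g]) for g, k in cand_counts.items())
            total[n - 1] += sum(cand_counts.values())
    if c == 0 or 0 in matched:
        return 0.0, matched, total
    bp = 1.0 if c > r else math.exp(1 - r / c)
    log_mean = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    return bp * math.exp(log_mean), matched, total


references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
candidates = [
    "the cat on the mat".split(),
    "there is a cat on mat".split(),
]
score, matched, total = pooled_corpus_bleu(references, candidates)
print(matched, total, round(score, 4))
```

Pooling is why the corpus score here is well above 0 even though the first sentence alone has no 4-gram match: the second sentence contributes two matching 4-grams to the shared count.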

4. Limitations of BLEU

Although BLEU is a widely used evaluation metric in NLP, it has several shortcomings:

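No notion of meaning: BLEU relies purely on n-gram matches and cannot understand what a sentence means. For example, "the cat is on the mat" and "the feline is on the rug" are close in meaning, yet the BLEU score is very low.
Harsh on short sentences: BP shrinks exponentially as the candidate gets shorter than the reference, and short sentences often have no matching higher-order n-grams, so a reasonable translation can still receive a very low score.
Limited sensitivity to word order: unigram precision ignores order entirely, and higher-order n-grams capture only local order, so a candidate that rearranges whole phrases can still score fairly high.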
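The first and third points can be made concrete with a small experiment. This sketch uses a hypothetical `precision` helper (clipped n-gram precision against a single reference) and two made-up candidates: a paraphrase and a fully scrambled copy of the reference.

```python
from collections import Counter


def precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(k, ref[g]) for g, k in cand.items())
    return matched / sum(cand.values())


reference = "the cat is on the mat".split()
paraphrase = "the feline is on the rug".split()   # same meaning, different words
scrambled = "mat the on is cat the".split()       # same words, wrong order

for name, cand in [("paraphrase", paraphrase), ("scrambled", scrambled)]:
    p1 = precision(cand, reference, 1)
    p2 = precision(cand, reference, 2)
    print(f"{name}: 1-gram={p1:.2f}, 2-gram={p2:.2f}")
```

The paraphrase loses credit for every synonym, while the scrambled sentence keeps a perfect 1-gram precision and is only caught by the higher-order n-grams.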

To compensate for these weaknesses, NLP evaluation today usually pairs BLEU with other metrics such as ROUGE, METEOR, and BERTScore.