The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Evaluation Metric

彬彬侠

Last modified 2025-04-08 23:38:03

First published 2025-04-08 21:33:05

Article link: https://blog.csdn.net/u013172930/article/details/147078085

What is the ROUGE metric?

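ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of metrics for evaluating automatic summarization. It measures how similar machine-generated text (automatic summaries, machine translations, and so on) is to human-written reference text (reference summaries, reference translations). ROUGE is primarily recall-oriented: each score is computed from the overlap between the generated text and the reference text, normalized by the size of the reference.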

Common ROUGE variants include ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S; ROUGE-N and ROUGE-L are the most widely used.

Main ROUGE sub-metrics

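ROUGE-N: a metric based on n-gram overlap.
- ROUGE-1: overlap of single words (unigrams).
- ROUGE-2: overlap of two-word sequences (bigrams).
Formula:
$\text{ROUGE-N} = \frac{\sum_{\text{n-gram}_i \in \text{reference}} \min\big(\text{count}_{\text{generated}}(\text{n-gram}_i),\ \text{count}_{\text{reference}}(\text{n-gram}_i)\big)}{\sum_{\text{n-gram}_i \in \text{reference}} \text{count}_{\text{reference}}(\text{n-gram}_i)}$
In other words, ROUGE-N is the fraction of the reference text's n-grams that also appear in the generated text (recall). Dividing by the number of n-grams in the generated text instead gives the precision variant.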
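ROUGE-L: measures similarity using the longest common subsequence (LCS) of the two texts, which lets it capture word-order information.

Formula:
$\text{ROUGE-L} = \frac{LCS(\text{generated}, \text{reference})}{\text{length of reference}}$
Here $LCS(\text{generated}, \text{reference})$ is the length, in words, of the longest common subsequence of the generated text and the reference text. This is the recall form of ROUGE-L.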
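ROUGE-S: based on skip-bigram overlap. A skip-bigram is any pair of words taken in sentence order, with gaps allowed between the two words.
ROUGE-W: based on a weighted longest common subsequence. It extends ROUGE-L by giving more weight to consecutive matches than to matches scattered across the text.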
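
The ROUGE-L and ROUGE-S definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`lcs_length`, `rouge_l_recall`, `rouge_s_recall`), the whitespace tokenization, the unlimited skip distance for ROUGE-S, and the two toy sentences are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_recall(generated, reference):
    """LCS length divided by the number of reference tokens."""
    if not reference:
        return 0.0
    return lcs_length(generated, reference) / len(reference)


def rouge_s_recall(generated, reference):
    """Skip-bigram recall, assuming any gap size is allowed."""
    # combinations() keeps the original word order inside each pair.
    gen = Counter(combinations(generated, 2))
    ref = Counter(combinations(reference, 2))
    total = sum(ref.values())
    # Counter & Counter keeps the minimum count of each shared pair.
    return sum((gen & ref).values()) / total if total else 0.0


# Hypothetical toy sentences, tokenized on whitespace.
reference = "police killed the gunman".split()
generated = "police kill the gunman".split()

print(rouge_l_recall(generated, reference))  # 0.75 (LCS: police, the, gunman)
print(rouge_s_recall(generated, reference))  # 0.5  (3 of 6 skip-bigrams shared)
```

Both functions divide by a quantity taken from the reference, which is what makes them recall scores.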

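A worked ROUGE example

Suppose we have the following reference summary (the reference text) and generated summary (the summary produced by the model):

Reference summary (Reference Text):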

The cat sat on the mat.

Generated summary (Generated Text):

The cat sat on the rug.

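1. Computing ROUGE-1 (unigram overlap)

ROUGE-1 measures the overlap of single words (unigrams).

Unigrams in the reference summary: ["The", "cat", "sat", "on", "the", "mat"]
Unigrams in the generated summary: ["The", "cat", "sat", "on", "the", "rug"]

Calculation:

The overlapping unigrams are: ["The", "cat", "sat", "on", "the"] (matching here is case-sensitive, so "The" and "the" are separate tokens).
The number of overlapping unigrams is 5.
The reference summary has 6 unigrams, and the generated summary has 6 unigrams.

$\text{ROUGE-1} = \frac{\text{number of overlapping unigrams}}{\text{number of unigrams in the reference summary}} = \frac{5}{6} \approx 0.8333$

So the ROUGE-1 score is approximately 0.8333. Because both texts contain 6 unigrams, precision and recall are equal in this example.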
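
The ROUGE-1 calculation above can be reproduced with a short Python sketch. The helper names (`ngrams`, `rouge_n_recall`) and the regex tokenizer, which drops the trailing period and keeps case as in the unigram lists above, are assumptions made for this illustration rather than part of any ROUGE library.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(generated, reference, n=1):
    """Clipped n-gram overlap divided by the reference n-gram count."""
    # Assumed tokenizer: runs of word characters, case preserved.
    gen = ngrams(re.findall(r"\w+", generated), n)
    ref = ngrams(re.findall(r"\w+", reference), n)
    # Counter & Counter keeps the minimum count per n-gram, so a repeated
    # n-gram is never credited more often than it occurs in the reference.
    overlap = sum((gen & ref).values())
    total = sum(ref.values())
    return overlap / total if total else 0.0


score = rouge_n_recall("The cat sat on the rug.", "The cat sat on the mat.", n=1)
print(round(score, 4))  # 0.8333
```

Passing a different `n` reuses the same clipped-count logic for longer n-grams.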

2. Computing ROUGE-2 (bigram overlap)

ROUGE-2 measures the overlap of two-word sequences (bigrams).

Bigrams in the reference summary: [("The", "cat"), ("cat", "sat"), ("sat", "on"), ("on", "the"), ("the", "mat")]
Bigrams in the generated summary: [("The", "cat"), ("cat", "sat"), ("sat", "on"), ("on", "the"), ("the", "rug")]
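
For reference, bigram lists like the two above can be produced by pairing each token with its successor. The `bigrams` helper and its regex tokenizer are assumptions made for this sketch.

```python
import re


def bigrams(text):
    """Return the adjacent word pairs of a sentence, in order."""
    # Assumed tokenizer: runs of word characters, so the final period is dropped.
    tokens = re.findall(r"\w+", text)
    # zip pairs each token with the one that follows it.
    return list(zip(tokens, tokens[1:]))


reference_bigrams = bigrams("The cat sat on the mat.")
generated_bigrams = bigrams("The cat sat on the rug.")
print(reference_bigrams)
print(generated_bigrams)
```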

Calculation: