Metrics for Evaluating Generated Stories: Distinct-3 (D-3), Repetition-4 (R-4), Lexical Repetition (LR-n), and BARTScore (BAS)

Article link: https://blog.csdn.net/qq_44154915/article/details/138925171

The paper "Analysis of LLM-Based Narrative Generation Using the Agent-Based Simulation" uses several metrics to evaluate generated stories, including Distinct-3 (D-3), Repetition-4 (R-4), Lexical Repetition (LR-n), and BARTScore (BAS). Each metric is described below.

Distinct-3 (D-3)

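Definition: Distinct-3 is the ratio of unique 3-grams to all 3-grams in the text.

Formula:
$\text{Distinct-3} = \frac{D3}{T3} \times 100$
where $D3$ is the number of unique 3-grams in the text and $T3$ is the total number of 3-grams in the text.

Meaning: The closer Distinct-3 is to 100, the more diverse the text is at the 3-gram level. The metric is used to evaluate the diversity of generated text and to flag overly repetitive output.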
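
The formula above translates almost directly into code. The following is a minimal sketch, not the paper's implementation: the function name `distinct_n` and the choice to pass a pre-tokenized list are assumptions made for illustration.

```python
def distinct_n(tokens, n=3):
    """Percentage of n-grams in `tokens` that are unique (Distinct-n)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens: there are no n-grams to count
    return len(set(ngrams)) / len(ngrams) * 100


# Toy input: the 3-gram ("a", "b", "c") occurs twice, so 3 of the 4 3-grams are unique.
score = distinct_n("a b c a b c".split())
print(score)
```

How the text is tokenized (case folding, punctuation handling) changes the result, so scores are only comparable when the same tokenization is used throughout.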

Repetition-4 (R-4)

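Definition: Repetition-4 is the percentage of sentences that contain at least one repeated 4-gram.

Formula:
$\text{Repetition-4} = \frac{\sum_{t=1}^{T} I(R_t \ge 1)}{T} \times 100$
where $T$ is the total number of sentences, $R_t$ is the number of 4-grams that are repeated within sentence $t$, and $I(x)$ is an indicator function that equals 1 when $x$ is true and 0 otherwise.

Meaning: Repetition-4 measures how repetitive the generated text is. The higher the value, the more sentences contain repeated 4-grams.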
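
A short sketch of this computation follows. It assumes that a sentence counts as repetitive when at least one of its 4-grams occurs more than once; the names `has_repeated_ngram` and `repetition_4` are hypothetical, and sentences are assumed to be already split and tokenized.

```python
from collections import Counter


def has_repeated_ngram(tokens, n=4):
    """True if some n-gram occurs more than once within `tokens`."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return any(c > 1 for c in counts.values())


def repetition_4(sentences):
    """Percentage of sentences (token lists) that repeat at least one 4-gram."""
    if not sentences:
        return 0.0
    flagged = sum(has_repeated_ngram(s) for s in sentences)
    return flagged / len(sentences) * 100


sentences = [
    "a b c d a b c d".split(),  # ("a", "b", "c", "d") occurs twice
    "a b c d e".split(),        # no 4-gram repeats
]
print(repetition_4(sentences))
```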

Lexical Repetition (LR-n)

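Definition: Lexical Repetition (LR-n) is the average percentage of 4-grams that occur at least n times in the generated text.

Formula:
$\text{Lexical Repetition} = \frac{\sum_{g=1}^{G} I(L_g \ge n)}{G} \times 100$
where $G$ is the number of distinct 4-grams in the text, $L_g$ is the number of times 4-gram $g$ occurs, and $I(x)$ is an indicator function that equals 1 when $x$ is true and 0 otherwise.

Meaning: LR-n measures how often 4-grams recur in the generated text, which helps detect redundancy.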
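
As an illustration, the percentage for a single text can be computed as below. This is a sketch under stated assumptions: `lexical_repetition` is a hypothetical name, the denominator is taken to be the number of distinct 4-grams, and averaging over several texts is left out.

```python
from collections import Counter


def lexical_repetition(tokens, n, gram=4):
    """Percentage of distinct `gram`-grams that occur at least `n` times."""
    counts = Counter(tuple(tokens[i:i + gram]) for i in range(len(tokens) - gram + 1))
    if not counts:
        return 0.0
    frequent = sum(1 for c in counts.values() if c >= n)
    return frequent / len(counts) * 100


# Distinct 4-grams: abcd (twice), bcda, cdab, dabc, so 1 of 4 occurs at least twice.
print(lexical_repetition("a b c d a b c d".split(), n=2))
```

Note that `n` here is the occurrence threshold from LR-n, while `gram` is the n-gram length, which the definition fixes at 4.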

BARTScore (BAS)

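Definition: BARTScore (BAS) evaluates how relevant a generated story is to its condition. The paper uses a BART model pre-trained on ParaBank2. BAS measures the relevance between the generated text and the given condition as the average log-likelihood of the target tokens.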

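Formula:
BAS is usually below 0. The paper multiplies it by -1 to make it positive, so a smaller value indicates higher relevance to the condition.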
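
The description above can be written out as a formula. This is a sketch derived from the "average log-likelihood of the target tokens" definition, not a formula quoted from the paper: the symbols $x$ (the condition), $y_1, \dots, y_m$ (the tokens of the generated story), and $\theta$ (the BART parameters) are notation assumed here, and scoring the story given the condition is likewise an assumption.

$\text{BAS} = -\frac{1}{m} \sum_{t=1}^{m} \log p(y_t \mid y_{<t}, x; \theta)$

Each log-probability is at most 0, so the negated average is non-negative, and a more likely story gets a smaller BAS.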

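Meaning: BARTScore evaluates how relevant the generated text is to a specific condition (such as a character profile or story theme), which helps check that the generated content is plausible and consistent.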

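Summary

Together, these metrics assess the quality of generated stories in terms of diversity, repetition, and relevance to the condition. They help researchers understand and improve the quality of generated text.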

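Worked Example

The following example contains both repeated and varied phrases, which makes it easier to see how each metric is computed and what it means.

Suppose we have the following AI-generated text: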

The brave knight fought the dragon. The dragon was fierce and strong. The knight used a magical sword to defeat the dragon. The brave knight became a hero. The hero's victory was celebrated by everyone in the kingdom.

1. Distinct-3 (D-3)

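First, extract all 3-grams (sequences of three consecutive words). Punctuation is dropped, case is kept, and the whole text is treated as one token sequence, so some 3-grams span sentence boundaries: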

The brave knight
brave knight fought
knight fought the
fought the dragon
the dragon The
dragon The dragon
The dragon was
dragon was fierce
was fierce and
fierce and strong
and strong The
strong The knight
The knight used
knight used a
used a magical
a magical sword
magical sword to
sword to defeat
to defeat the
defeat the dragon
the dragon The
dragon The brave
The brave knight
brave knight became
knight became a
became a hero
a hero The
hero The hero’s
The hero’s victory
hero’s victory was
victory was celebrated
was celebrated by
celebrated by everyone
by everyone in
everyone in the
in the kingdom

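Count the unique 3-grams $D3$ and the total 3-grams $T3$. The list has 36 entries, and two of them ("The brave knight" and "the dragon The") each appear twice, which leaves 34 unique 3-grams:

Number of unique 3-grams $D3$: 34
Total number of 3-grams $T3$: 36

$\text{Distinct-3} = \frac{34}{36} \times 100 \approx 94.4$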
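
The counts for the sample text can be double-checked by recounting in code. This is a minimal sketch; the tokenization rule (keep letters and apostrophes, drop punctuation, preserve case) is an assumption chosen to match the 3-gram list above.

```python
import re
from collections import Counter

text = (
    "The brave knight fought the dragon. The dragon was fierce and strong. "
    "The knight used a magical sword to defeat the dragon. "
    "The brave knight became a hero. "
    "The hero's victory was celebrated by everyone in the kingdom."
)

# Assumed tokenization: words with apostrophes kept, punctuation dropped, case preserved.
tokens = re.findall(r"[A-Za-z']+", text)
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

total = sum(trigrams.values())   # all 3-grams, duplicates included
unique = len(trigrams)           # distinct 3-grams
repeated = sorted(" ".join(g) for g, c in trigrams.items() if c > 1)

print(total, unique, round(unique / total * 100, 1))
print(repeated)
```

Under this tokenization the recount gives 36 3-grams in total, 34 of them unique, with "The brave knight" and "the dragon The" as the only repeated ones.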

2. Repetition-4 (R-4)

Next, extract all 4-grams (sequences of four consecutive words), again over the whole text:

The brave knight fought
brave knight fought the
knight fought the dragon
fought the dragon The
the dragon The dragon
dragon The dragon was
The dragon was fierce
dragon was fierce and
was fierce and strong
fierce and strong The
and strong The knight
strong The knight used
The knight used a
knight used a magical
used a magical sword
a magical sword to
magical sword to defeat
sword to defeat the
to defeat the dragon
defeat the dragon The
the dragon The brave
dragon The brave knight
The brave knight became
brave knight became a
knight became a hero
became a hero The
a hero The hero’s
hero The hero’s victory
The hero’s victory was
hero’s victory was celebrated
victory was celebrated by
was celebrated by everyone
celebrated by everyone in
by everyone in the
everyone in the kingdom

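Count the repeated 4-grams within each sentence, ignoring 4-grams that cross a sentence boundary. No sentence in the sample contains a 4-gram that occurs more than once. (The phrases that do repeat in the text, "The brave knight" and "the dragon The", are 3-grams, and the second one spans a sentence boundary.)

Sentence 1: 0
Sentence 2: 0
Sentence 3: 0
Sentence 4: 0
Sentence 5: 0

Since every $R_t = 0$, the indicator is 0 for every sentence, so the number of sentences with a repeated 4-gram is 0.

Total number of sentences: $T = 5$

$\text{Repetition-4} = \frac{0}{5} \times 100 = 0$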
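
The per-sentence counts can be checked with a few lines of code. This sketch assumes sentences are split on periods and tokenized by keeping letters and apostrophes; `repeated_4grams` is a hypothetical helper name.

```python
import re
from collections import Counter

text = (
    "The brave knight fought the dragon. The dragon was fierce and strong. "
    "The knight used a magical sword to defeat the dragon. "
    "The brave knight became a hero. "
    "The hero's victory was celebrated by everyone in the kingdom."
)

# Assumed sentence splitting: break on periods, then tokenize each sentence.
sentences = [re.findall(r"[A-Za-z']+", s) for s in text.split(".") if s.strip()]


def repeated_4grams(tokens):
    """Number of distinct 4-grams that occur more than once in `tokens`."""
    counts = Counter(zip(tokens, tokens[1:], tokens[2:], tokens[3:]))
    return sum(1 for c in counts.values() if c > 1)


per_sentence = [repeated_4grams(s) for s in sentences]
r4 = sum(1 for r in per_sentence if r >= 1) / len(sentences) * 100
print(per_sentence, r4)
```

For this text every sentence has a count of 0, so the resulting Repetition-4 is 0.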

3. Lexical Repetition (LR-n)

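Compute the percentage of 4-grams that occur at least twice in the whole text (LR-2):

All 35 4-grams listed above are distinct, so none occurs twice. (The repeated phrase "the dragon The" is a 3-gram, and its two occurrences continue differently: "the dragon The dragon" and "the dragon The brave".)

Number of distinct 4-grams: $G = 35$

Number of 4-grams occurring at least twice: $\sum_{g=1}^{G} I(L_g \ge 2) = 0$

$\text{Lexical Repetition (LR-2)} = \frac{0}{35} \times 100 = 0$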
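
The same check can be run over the whole text for LR-2. As before, this is only a sketch: the tokenization rule is an assumption, and the denominator is taken to be the number of distinct 4-grams.

```python
import re
from collections import Counter

text = (
    "The brave knight fought the dragon. The dragon was fierce and strong. "
    "The knight used a magical sword to defeat the dragon. "
    "The brave knight became a hero. "
    "The hero's victory was celebrated by everyone in the kingdom."
)

tokens = re.findall(r"[A-Za-z']+", text)
four_grams = Counter(zip(tokens, tokens[1:], tokens[2:], tokens[3:]))

n = 2  # occurrence threshold for LR-2
frequent = [g for g, c in four_grams.items() if c >= n]
lr = len(frequent) / len(four_grams) * 100
print(len(four_grams), frequent, lr)
```

The text yields 35 distinct 4-grams and none of them reaches the threshold, so LR-2 is 0.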

4. BARTScore (BAS)

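BARTScore evaluates how relevant the generated story is to a given condition (such as a character profile or story theme). Suppose the condition is:

Condition: The story is about a brave knight's battle with a dragon.

A BART model scores the relevance between the generated text and the condition. Suppose the raw score is -4.5. Multiplying it by -1 gives the reported value, and after this sign flip a smaller value indicates higher relevance:

$\text{BARTScore} = -(-4.5) = 4.5$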
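
The sign flip and the averaging step can be sketched without loading a model. The per-token log-probabilities below are made-up numbers chosen to reproduce the hypothetical -4.5, and `bas_from_logprobs` is a hypothetical helper; a real evaluation would obtain the log-probabilities from the BART model.

```python
def bas_from_logprobs(token_logprobs):
    """Negated mean token log-likelihood, so the reported value is positive."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


# Hypothetical per-token log-probabilities; their mean is -4.5.
logprobs = [-4.0, -5.0, -4.5]
print(bas_from_logprobs(logprobs))
```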

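Summary

This example shows how the metrics evaluate the diversity, repetition, and relevance of generated text. A high Distinct-3 value means the text is diverse at the 3-gram level. Repetition-4 and Lexical Repetition quantify how much 4-gram repetition the text contains, and both stay low when 4-grams are rarely reused. BARTScore (4.5 in the hypothetical case above) reflects how relevant the text is to the given condition, with smaller values meaning higher relevance.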