Unsupervised Metrics for Evaluating Sentence Diversity in NLP

Mimasss

Last modified: 2023-02-02 12:57:00


Category: NLP. Tags: natural language processing, artificial intelligence

First published: 2023-02-01 13:00:00

Original link: https://mimas.top/2022/11/28/NLP_sentence_diversity_evaluation/


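This article discusses why evaluating the diversity of generated sentences matters in a project and introduces several unsupervised evaluation methods, including BERTScore, MoverScore, and Perplexity. BERTScore computes token-level similarity with a BERT model, MoverScore combines contextualized embeddings with Earth Mover's Distance, and Perplexity is a simple metric that can be affected by many factors.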


Reposted from my personal blog.

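While working on a project, I needed to evaluate and filter generated sentences by diversity. I therefore surveyed some existing unsupervised metrics for sentence diversity, to keep as a reference.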

BERTScore

paper: BERTScore: Evaluating Text Generation with BERT


Each token is matched to the token in the other sentence with which it has the largest inner product.

$R_{BERT} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{T} \hat{x}_j, \quad P_{BERT} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{T} \hat{x}_j, \quad F_{BERT} = 2\frac{P_{BERT}\cdot R_{BERT}}{P_{BERT} + R_{BERT}}$
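The greedy matching above can be sketched in a few lines of plain Python. This is only an illustration, not the reference implementation: it assumes the contextual token embeddings have already been produced by some BERT-like model and are passed in as lists of non-zero vectors. The function name `greedy_bertscore` is made up for this example.

```python
import math

def _normalize(v):
    # Scale a vector to unit length so inner products become cosine similarities.
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def greedy_bertscore(ref_emb, cand_emb):
    """Return (precision, recall, f1) from token embeddings.

    ref_emb:  embeddings of the reference tokens x (assumed precomputed)
    cand_emb: embeddings of the candidate tokens x_hat (assumed precomputed)
    """
    ref = [_normalize(v) for v in ref_emb]
    cand = [_normalize(v) for v in cand_emb]
    # sim[i][j] = x_i^T x_hat_j
    sim = [[sum(a * b for a, b in zip(r, c)) for c in cand] for r in ref]
    # Recall: every reference token picks its best-matching candidate token.
    recall = sum(max(row) for row in sim) / len(ref)
    # Precision: every candidate token picks its best-matching reference token.
    precision = sum(
        max(sim[i][j] for i in range(len(ref))) for j in range(len(cand))
    ) / len(cand)
    total = precision + recall
    f1 = 2 * precision * recall / total if total != 0 else 0.0
    return precision, recall, f1

# Toy 2-d "embeddings": the candidate covers only one of two reference tokens.
p, r, f = greedy_bertscore([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0]])
print(p, r, f)
```

In the toy call the single candidate token matches the first reference token perfectly, so precision is high while recall is penalized for the uncovered second token.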

Importance Weighting

based on inverse document frequency

$idf(w) = -\log \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}[w \in x^{(i)}]$
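As a rough sketch of the formula above, the idf weights can be computed from a set of $M$ tokenized reference sentences. The helper name `idf_weights` and the toy corpus are assumptions for illustration, and tokens that never appear in the corpus are simply not handled here.

```python
import math

def idf_weights(reference_corpus):
    """reference_corpus: list of M tokenized sentences (each a list of tokens)."""
    m = len(reference_corpus)
    doc_freq = {}
    for sentence in reference_corpus:
        # set() so that a token counts once per sentence, matching the indicator I[w in x^(i)].
        for token in set(sentence):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    # idf(w) = -log( (1/M) * number of sentences containing w )
    return {token: -math.log(df / m) for token, df in doc_freq.items()}

weights = idf_weights([["the", "cat"], ["the", "dog"]])
print(weights)
```

A token that occurs in every sentence ("the") gets weight 0, while rarer tokens get larger weights, which is the point of the importance weighting.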

rescaling

$\hat{R}_{BERT} = \frac{R_{BERT} - b}{1-b}$

b: empirical lower bound, calculated using Common Crawl monolingual datasets
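To see what the rescaling does, here is a tiny sketch. The numbers 0.92 and 0.85 are made-up values chosen for illustration, not baselines from the paper.

```python
def rescale(score, baseline):
    # Linear rescaling: the baseline b maps to 0, and a perfect score of 1 stays at 1.
    return (score - baseline) / (1.0 - baseline)

# A raw score of 0.92 looks high, but against a hypothetical baseline of 0.85
# it sits a bit less than halfway between "baseline" and "perfect".
print(rescale(0.92, 0.85))
```

Because raw cosine-based scores tend to occupy a narrow range near the top, spreading them out this way makes them easier to read without changing their ranking.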

Comparison

machine translation evaluation -> $F_{BERT}$

text generation in English -> 24-layer $RoBERTa_{large}$

non-English language -> $BERT_{multi}$

BLEURT

paper: BLEURT: Learning Robust Metrics for Text Generation. ACL 2020

Architecture

BERT + linear head

pre-training scheme

random perturbations of Wikipedia sentences augmented with a diverse set of lexical and semantic-level supervision signals

mask-filling with BERT -> lexical alterations
backtranslation
randomly dropping out words -> to recognize void predictions and sentence truncation in NLG systems
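Of the three perturbations listed above, word dropping is the easiest to sketch. The snippet below is a simplified illustration and does not reproduce the paper's exact procedure: the function name `drop_words`, the per-word drop probability, and the rule of keeping at least one word are all assumptions.

```python
import random

def drop_words(sentence, drop_prob=0.2, seed=None):
    """Return a perturbed copy of `sentence` with some words randomly removed."""
    rng = random.Random(seed)  # a seed makes the perturbation reproducible
    words = sentence.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    # Assumption: never return an empty string for a non-empty input.
    if not kept and words:
        kept = [rng.choice(words)]
    return " ".join(kept)

original = "the quick brown fox jumps over the lazy dog"
print(drop_words(original, drop_prob=0.3, seed=0))
```

Pairing each original sentence with its perturbed version gives synthetic (reference, candidate) pairs on which a metric model can be pre-trained.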

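pre-training objective: a weighted sum of the losses on the individual pre-training signals, which include scores from existing metrics (e.g. BLEU, ROUGE, BERTScore)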

BARTScore

paper: BARTScore: Evaluating Generated Text as Text Generation

ExplainaBoard：http://explainaboard.nlpedia.ai/leaderboard/task-meval/


evaluation perspectives:

Informativeness
Relevance
Fluency
Coherence
Factuality
Semantic Coverage
Adequacy

BARTScore

$BARTScore = \sum_{t=1}^{m} \omega_t \log p(y_t | y_{<t}, x, \theta)$
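The formula is just a weighted sum of per-token log-probabilities. The sketch below assumes those log-probabilities $\log p(y_t | y_{<t}, x, \theta)$ have already been obtained from a seq2seq model such as BART; it does not call any model, and the function name `weighted_log_likelihood` is hypothetical.

```python
import math

def weighted_log_likelihood(token_log_probs, weights=None):
    """Sum of w_t * log p(y_t | y_<t, x) over the m target tokens.

    token_log_probs: per-token log-probabilities (assumed precomputed)
    weights:         per-token weights w_t; uniform weights of 1.0 if omitted
    """
    if weights is None:
        weights = [1.0] * len(token_log_probs)
    if len(weights) != len(token_log_probs):
        raise ValueError("weights and token_log_probs must have the same length")
    return sum(w * lp for w, lp in zip(weights, token_log_probs))

# Two target tokens with probabilities 0.5 and 0.25 under the model.
score = weighted_log_likelihood([math.log(0.5), math.log(0.25)])
print(score)  # equals log(0.5 * 0.25)
```

Scores are always non-positive, and values closer to 0 mean the model finds $y$ more likely given $x$. Swapping what is used as $x$ and $y$ (source, hypothesis, reference) changes which aspect of quality the score reflects.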

using prompt to augment metrics

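I didn't fully follow this paper: it opens with a long list of evaluation perspectives but ends up with a single metric, BARTScore. After a look at ExplainaBoard, my guess is that the task and the input pair $\{x, y\}$ differ from case to case, so BARTScore reflects a different aspect of sentence quality in each.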

MoverScore

paper: MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance

Word Mover's Distance (WMD)

$WMD(x^n, y^n) := \min_{F \in \mathbb{R}_{+}^{|x^n| \times |y^n|}} \langle C, F \rangle, \quad \text{s.t. } F\mathbf{1} = f_{x^n}, \; F^{T}\mathbf{1} = f_{y^n}$

$C_{ij} = d(x_i^n, y_j^n)$: the distance between the i-th n-gram of x and the j-th n-gram of y.

$F$: transportation flow matrix, with $F_{ij}$ denoting the amount of flow traveling from the i-th n-gram $x_i^n$ in $x^n$ to the j-th n-gram $y_j^n$ in $y^n$.
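To make the objective and the two constraints concrete, the sketch below evaluates $\langle C, F \rangle$ for a given flow and checks whether that flow is feasible. It does not solve the minimization (that needs a linear-programming or optimal-transport solver); the function names are hypothetical, and non-negativity of $F$ is assumed, as is standard for transport flows.

```python
def transport_cost(C, F):
    # <C, F>: sum over all (i, j) of C_ij * F_ij
    return sum(c * f for c_row, f_row in zip(C, F) for c, f in zip(c_row, f_row))

def is_feasible(F, fx, fy, tol=1e-9):
    """Check F >= 0, F 1 = fx (row sums) and F^T 1 = fy (column sums)."""
    if len(F) != len(fx) or any(len(row) != len(fy) for row in F):
        return False
    nonneg = all(f >= -tol for row in F for f in row)
    rows_ok = all(abs(sum(row) - w) <= tol for row, w in zip(F, fx))
    cols_ok = all(
        abs(sum(F[i][j] for i in range(len(F))) - w) <= tol
        for j, w in enumerate(fy)
    )
    return nonneg and rows_ok and cols_ok

# Two n-grams per sentence with uniform weights; identical n-grams cost 0, different ones cost 1.
C = [[0.0, 1.0], [1.0, 0.0]]
fx = [0.5, 0.5]
fy = [0.5, 0.5]
aligned = [[0.5, 0.0], [0.0, 0.5]]   # send each n-gram to its identical counterpart
crossed = [[0.0, 0.5], [0.5, 0.0]]   # send each n-gram to the other one

print(is_feasible(aligned, fx, fy), transport_cost(C, aligned))
print(is_feasible(crossed, fx, fy), transport_cost(C, crossed))
```

Both flows satisfy the constraints, but the aligned one is cheaper, so WMD takes its cost as the distance.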