[Text Generation] Evaluation Metric: BERTScore

BERTScore is a metric for evaluating machine translation and text generation, proposed by Cornell University at ICLR 2020. It computes cosine similarities between the contextual embeddings of two sentences, addressing the failure of n-gram matching methods to capture semantics and long-distance dependencies. Experiments show that BERTScore outperforms existing metrics on multiple benchmarks and correlates more strongly with human judgments; it is especially well suited to image captioning. The implementation is simple: scores can be computed directly with a pretrained model.

BERTScore

This metric comes from the ICLR 2020 paper by Cornell University: BERTSCORE: EVALUATING TEXT GENERATION WITH BERT.

It builds on a pretrained BERT model: each sentence is represented by contextual embeddings, and the score is computed from cosine similarities between the token embeddings of the two sentences.
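Concretely, the paper scores a reference token $x_i$ against a candidate token $\hat{x}_j$ with cosine similarity; since the embeddings are pre-normalized, this reduces to an inner product:

$$\cos(x_i, \hat{x}_j) = \frac{x_i^\top \hat{x}_j}{\lVert x_i \rVert\,\lVert \hat{x}_j \rVert} = x_i^\top \hat{x}_j$$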

Common drawbacks of n-gram matching metrics:

  • Semantically-correct phrases are penalized because they differ from the surface form of the reference.

    Solution: in contrast to string matching (e.g., in BLEU) or matching heuristics (e.g., in METEOR), BERTScore computes similarity using contextualized token embeddings, which have been shown to be effective for paraphrase detection (see the BLEU sketch after this list).

  • n-gram models fail to capture distant dependencies and penalize semantically-critical ordering changes.

    Solution: contextualized embeddings are trained to effectively capture distant dependencies and ordering.
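To make the first drawback concrete, here is a small sketch using the three example sentences from the paper's introduction (assuming NLTK is installed; sentence-level BLEU needs smoothing on such short inputs):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["people", "like", "foreign", "cars"]
paraphrase = ["consumers", "prefer", "imported", "cars"]            # meaning preserved
surface_match = ["people", "like", "visiting", "places", "abroad"]  # meaning changed

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))
print(sentence_bleu([reference], surface_match, smoothing_function=smooth))
# BLEU ranks the semantically wrong candidate higher, because it shares more
# surface n-grams ("people", "like", "people like") with the reference.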

Experimental results:
(1) In machine translation, BERTSCORE shows stronger system-level and segment-level correlations with human judgments than existing metrics on multiple common benchmarks.
(2) BERTSCORE is well-correlated with human annotators for image captioning, surpassing SPICE.

How BERTScore works

Download a pretrained model and run it in eval mode on a batch of sentences to get the contextual embedding of every token. For the candidate sentence (sentence_can) and the reference sentence (sentence_ref), build the token-level similarity matrix and compute idf weights; greedily take the maximum similarity for each token, multiply it by that token's idf weight, and output the BERTScore.
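A minimal sketch of this pipeline with the Hugging Face transformers library (assumptions: the last hidden layer is used, special tokens are kept, and idf weighting is skipped, whereas the official bert_score implementation selects a tuned layer, strips [CLS]/[SEP], and weights by idf):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")
model.eval()

def embed(sentence):
    # contextual embedding of every token, L2-normalized so that the
    # inner products below are cosine similarities
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    return hidden / hidden.norm(dim=-1, keepdim=True)

cand_emb = embed("我们都曾经年轻过")
ref_emb = embed("虽然我们都年少,但还是懂事的")

sim = cand_emb @ ref_emb.T      # token-level similarity matrix
P = sim.max(dim=1)[0].mean()    # greedy best match for each candidate token
R = sim.max(dim=0)[0].mean()    # greedy best match for each reference token
F = 2 * P * R / (P + R)
print(f"P={P.item():.3f} R={R.item():.3f} F={F.item():.3f}")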

The paper also applies plus-one smoothing (in the idf computation) and baseline rescaling to fine-tune BERTScore.
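From the paper: given $M$ reference sentences $\{x^{(i)}\}_{i=1}^{M}$, the idf weight of a token $w$ is

$$\mathrm{idf}(w) = -\log \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}\left[w \in x^{(i)}\right]$$

with plus-one smoothing applied to the counts. Rescaling maps the raw score linearly against an empirical baseline $b$ estimated from random sentence pairs, e.g.

$$\hat{R}_{\mathrm{BERT}} = \frac{R_{\mathrm{BERT}} - b}{1 - b}$$

which makes the scores more readable without changing their ranking.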

BERTScore P / R / F
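The definitions from the paper, with $x$ the reference tokens, $\hat{x}$ the candidate tokens, and embeddings pre-normalized (idf weighting omitted):

$$R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^\top \hat{x}_j \qquad P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^\top \hat{x}_j \qquad F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}$$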

Conclusions: for machine translation, F_BERT is the commonly used metric; for evaluating English generation, the 24-layer RoBERTa-large model is recommended; for multilingual tasks, multilingual BERT is recommended, and for Chinese the library automatically selects "bert-base-chinese".
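The backbone can also be chosen explicitly; model_type and rescale_with_baseline are arguments of bert_score.score, though exact defaults may vary across library versions (cands_en and refs_en below are placeholder English lists):

from bert_score import score

# English generation with the recommended RoBERTa-large backbone;
# rescale_with_baseline maps scores against the precomputed baseline for lang
P, R, F1 = score(cands_en, refs_en, model_type="roberta-large",
                 lang="en", rescale_with_baseline=True)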

Code example: it's very simple!

!pip install bert_score

from bert_score import score

# data
cands = ['我们都曾经年轻过']
refs = ['虽然我们都年少,但还是懂事的']

P, R, F1 = score(cands, refs, lang="zh", verbose=True)

print(f"System level F1 score: {F1.mean():.3f}") 
# System level F1 score: 0.959

P, R, and F1 are computed as follows (excerpt from the bert_score source):

import torch

def greedy_cos_idf(ref_embedding, ref_masks, ref_idf, hyp_embedding, hyp_masks, hyp_idf, all_layers=False):
    """
    Compute greedy matching based on cosine similarity.
    """
    # ...
    batch_size = ref_embedding.size(0)
    # token-level cosine similarity matrix (embeddings are pre-normalized):
    # shape (batch, hyp_len, ref_len)
    sim = torch.bmm(hyp_embedding, ref_embedding.transpose(1, 2))
    # outer product of the attention masks zeroes out padding positions
    masks = torch.bmm(hyp_masks.unsqueeze(2).float(), ref_masks.unsqueeze(1).float())

    masks = masks.float().to(sim.device)
    sim = sim * masks

    # greedy matching: each hypothesis token takes its most similar reference
    # token (precision side); each reference token takes its most similar
    # hypothesis token (recall side)
    word_precision = sim.max(dim=2)[0]
    word_recall = sim.max(dim=1)[0]

    # normalize the idf weights so they sum to 1 within each sentence
    hyp_idf.div_(hyp_idf.sum(dim=1, keepdim=True))
    ref_idf.div_(ref_idf.sum(dim=1, keepdim=True))
    precision_scale = hyp_idf.to(word_precision.device)
    recall_scale = ref_idf.to(word_recall.device)

    # idf-weighted averages of the per-token similarities, then harmonic mean
    P = (word_precision * precision_scale).sum(dim=1)
    R = (word_recall * recall_scale).sum(dim=1)
    F = 2 * P * R / (P + R)
    # ...
    return P, R, F

If you want a plot of the similarity matrix:

import matplotlib.pyplot as plt

font = {'family': 'SimHei', 'size':'10'}
plt.rc('font', **font)

from bert_score import plot_example

cand = cands[0]
ref = refs[0]
plot_example(cand, ref, lang="zh")

# BERT splits Chinese into individual characters; Chinese labels will not
# render if the server (e.g., Colab) has no Chinese font installed

GitHub source: https://github.com/Tiiiger/bert_score/blob/3227a18cb2a546978a00beb94bf138fd72fef8cf/bert_score/score.py
