Text Summarization Notes

hit56笔记

Last modified: 2023-04-04 14:52:58

First published: 2023-02-27 14:45:58

Article link: https://blog.csdn.net/zh515858237/article/details/129241843

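Contents

1. What is the current best text summarization model?
2. What are the main text summarization algorithms today?
3. How to evaluate text summarization quality
4. References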

1. What is the current best text summarization model?

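The Pegasus model is Transformer-based and has 568 million parameters. It is pretrained on one of the following corpora, and the corpus can be chosen to match the type of summarization task:

The Common Crawl-derived C4 dataset: 750 GB of text crawled from 350 million web pages, mostly non-news content.
The HugeNews dataset: 1.5 billion articles (3.8 TB in total) collected from news websites, mostly news content.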

The Pegasus model architecture is shown below:
[Figure: Pegasus model architecture]

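The base architecture of PEGASUS is a standard Transformer encoder-decoder. In the figure's example, GSG (Gap Sentences Generation) and MLM (masked language modeling) are used together as pretraining objectives. The original text has three sentences. One sentence is replaced by [MASK1] and serves as the target text to generate (GSG). The other two sentences stay in the input, but some of their tokens are randomly replaced by [MASK2] (MLM).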
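To make the two objectives concrete, here is a minimal sketch of how one (input, target) pair could be built from a three-sentence document. The function name `make_pretraining_example` and its parameters are hypothetical and exist only for illustration. It also assumes whitespace tokenization and lets the caller choose the gap sentence, whereas PEGASUS itself selects gap sentences by an importance score.

```python
import random


def make_pretraining_example(sentences, gap_index, mlm_rate=0.15, seed=0):
    """Build one (source, target) pair in the style of GSG + MLM.

    sentences : list of sentence strings from one document
    gap_index : index of the sentence hidden behind [MASK1] (hypothetical
                parameter; the caller picks the sentence in this sketch)
    mlm_rate  : probability that a remaining token becomes [MASK2]
    """
    rng = random.Random(seed)
    input_parts = []
    for i, sentence in enumerate(sentences):
        if i == gap_index:
            # GSG: the whole sentence collapses into a single mask token
            input_parts.append("[MASK1]")
            continue
        # MLM: every token of the other sentences is masked independently
        tokens = sentence.split()
        masked = ["[MASK2]" if rng.random() < mlm_rate else tok for tok in tokens]
        input_parts.append(" ".join(masked))
    source = " ".join(input_parts)
    target = sentences[gap_index]  # the decoder has to regenerate this sentence
    return source, target


doc = [
    "Pegasus is mythical .",
    "It is pure white .",
    "It names the model .",
]
source, target = make_pretraining_example(doc, gap_index=1, mlm_rate=0.3)
print("encoder input :", source)
print("decoder target:", target)
```

The encoder sees the corrupted document and the decoder is trained to emit the hidden sentence, which is why this objective resembles summarization: the model has to produce a sentence from the context around it.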


2. What are the main text summarization algorithms today?

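In my view, the quality of these summarization approaches decreases from top to bottom:

Supervised abstractive summarization: Pegasus, BART, and the NMT-style models used for machine translation
Unsupervised abstractive summarization: GPT
Supervised extractive summarization: BertSum (see its demo site and code repository)
Unsupervised extractive summarization: TextRank, BERT-based extractive summarization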

For example, Bert Extractive Summarizer is a project that makes it easy to extract text summaries with Google's BERT. Here is a simple code example:

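from summarizer import Summarizer  # pip install bert-extractive-summarizer

body = 'Text body that you want to summarize with BERT'
body2 = 'Something else you want to summarize with BERT'

# Create the summarizer (loads a pretrained BERT model)
model = Summarizer()

# Calling the model on a text returns its extractive summary
print(model(body))
print(model(body2))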
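TextRank, listed above among the unsupervised extractive methods, needs no pretrained model, so it can be sketched in plain Python. The version below is a simplified illustration, not a reference implementation: the sentence splitter is naive, the helper names (`split_sentences`, `similarity`, `textrank_summary`) are hypothetical, and the similarity is the word-overlap measure normalized by log sentence lengths, computed here over sets of distinct words.

```python
import math
import re


def split_sentences(text):
    # Naive split after ., ! or ? (an assumption; real systems use a proper tokenizer)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    # Shared words, normalized by the log of each sentence's distinct-word count
    denom = math.log(len(words_a) or 1) + math.log(len(words_b) or 1)
    return len(words_a & words_b) / denom if denom > 0 else 0.0


def textrank_summary(text, top_k=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= top_k:
        return sentences
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    # Weighted, undirected sentence graph stored as a dense matrix
    weights = [[similarity(words[i], words[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Weighted PageRank update: each neighbour j passes on a share of its score
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    # Return the selected sentences in their original document order
    return [sentences[i] for i in sorted(best)]


text = (
    "Text summarization shortens a long document. "
    "Extractive summarization selects important sentences from the document. "
    "Abstractive summarization generates new sentences about the document. "
    "My cat likes to sleep all day."
)
for sentence in textrank_summary(text, top_k=2):
    print(sentence)
```

Sentences that share vocabulary with many other sentences collect the highest scores, while an off-topic sentence with no overlap keeps only the base score of `1 - damping`.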

3. How to evaluate text summarization quality

import numpy as np
import rouge  # pip install rouge

def evaluate_summary(y_test, predicted):
    '''
    Calculate ROUGE F-scores and print them.
    :parameter
        :param y_test: string or list - reference summary (or summaries)
        :param predicted: string or list - generated summary (or summaries)
    '''
    rouge_score = rouge.Rouge()
    # get_scores takes the hypotheses first, then the references
    scores = rouge_score.get_scores(predicted, y_test, avg=True)
    score_1 = round(scores['rouge-1']['f'], 2)
    score_2 = round(scores['rouge-2']['f'], 2)
    score_L = round(scores['rouge-l']['f'], 2)
    print("rouge1:", score_1, "| rouge2:", score_2, "| rougeL:", score_L,
          "--> avg rouge:", round(np.mean([score_1, score_2, score_L]), 2))

## Apply the function to predicted
## (dtf_test, predicted and i come from earlier steps that are not shown here)
evaluate_summary(dtf_test["y"][i], predicted[i])

[Figure: output of evaluate_summary]

The results show that 31% of unigrams (ROUGE-1) and 7% of bigrams (ROUGE-2) are present in both summaries, while the longest common subsequences (ROUGE-L) are reported to match by 7%. Overall, the average score is 20%. These figures are not mutually consistent: 31%, 7%, and 7% average to 15%, not 20%, so the ROUGE-L value is probably misreported. Please note that ROUGE scores don't measure how fluent the summary is; for that I usually use the good old human eye.
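To show what these percentages count, here is a minimal from-scratch sketch of ROUGE-N (n-gram overlap) and ROUGE-L (longest common subsequence). It assumes whitespace tokenization with no stemming, and the helper names (`ngrams`, `rouge_n`, `lcs_length`, `rouge_l`) are illustrative, so values from a ROUGE library may differ somewhat.

```python
from collections import Counter


def ngrams(tokens, n):
    # Multiset of all contiguous n-grams in the token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(precision, recall):
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n):
    ref, cand = ngrams(reference.split(), n), ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())  # clipped count of shared n-grams
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall, f1(precision, recall)


def lcs_length(a, b):
    # Dynamic programming over two rows: longest common subsequence length
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    precision = lcs / max(len(cand), 1)
    recall = lcs / max(len(ref), 1)
    return precision, recall, f1(precision, recall)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print("ROUGE-1 F:", round(rouge_n(reference, candidate, 1)[2], 2))
print("ROUGE-2 F:", round(rouge_n(reference, candidate, 2)[2], 2))
print("ROUGE-L F:", round(rouge_l(reference, candidate)[2], 2))
```

In this toy pair, five of the six unigrams and three of the five bigrams match, and the longest common subsequence is "the cat on the mat". None of these counts says anything about fluency, which is why a human read is still needed.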