Paper Notes: Generating Wikipedia by Summarizing Long Sequences

xff1994

Article link: https://blog.csdn.net/xff1994/article/details/103227666

1. Introduction

Problem:

Multi-document abstractive summarization, where the input is a collection of related documents from which a summary is distilled.

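Abstractive vs. extractive

Extractive summarization: selects sentences from the source text to form the summary. The drawback is limited expressiveness; the advantage is that the selected sentences are grammatical.
Abstractive summarization: generates new text as the summary. The drawback is that the output may be little more than a string of words that makes no grammatical sense.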

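Input and output

Input: the title of the Wikipedia article, $T(a_i)$, plus non-Wikipedia documents: the documents cited by the article ($C_i$) and the pages returned by a Google search for the title ($S_i$).
Output: the corresponding Wikipedia article ($a_i$).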

2. Model

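Because $(C_i, S_i)$ can be very large and exceed hardware limits, the model is split into two stages: an extractive summarization step first selects a subset of the input, and that subset is then used to train the abstractive model. This mirrors how people summarize: first highlight the important sentences in the text, then write the summary from those sentences.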

extractive stage

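For each article $a_i$, the paragraphs in the input are first ranked, giving a paragraph list $\{p_{R_i(j)}^i\}$, where $R_i(j)$ is the rank of the $j$-th paragraph $p_j^i$ of $(C_i, S_i)$. The first $L$ tokens of the ranked list are then used as the input to the second stage.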
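The ranking notation can be made concrete with a small sketch. The function name `rank_paragraphs` and the `score` callback are illustrative assumptions, not names from the paper; any of the extractors would plug in as `score`.

```python
from typing import Callable, List, Tuple


def rank_paragraphs(
    paragraphs: List[str], score: Callable[[str], float]
) -> Tuple[List[int], List[str]]:
    """Return (ranks, ranked) where ranks[j] plays the role of R_i(j)
    (0 = best) and ranked is the paragraph list sorted by rank."""
    # Sort paragraph indices by descending score; Python's sort is stable,
    # so ties keep their original order.
    order = sorted(
        range(len(paragraphs)),
        key=lambda j: score(paragraphs[j]),
        reverse=True,
    )
    ranks = [0] * len(paragraphs)
    for r, j in enumerate(order):
        ranks[j] = r
    ranked = [paragraphs[j] for j in order]
    return ranks, ranked


if __name__ == "__main__":
    paras = ["a b c", "d e f g", "h"]
    # Toy score: paragraph length in whitespace tokens (an assumption).
    print(rank_paragraphs(paras, lambda p: len(p.split())))
```

With a constant score every paragraph ties, so the original order is kept, which matches the behaviour of a trivial pass-through extractor.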
The paper uses five extraction methods:

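Identity: a trivial extractor that simply uses the first $L$ tokens of the original input.
tf-idf: treats ranking paragraphs as a query retrieval problem in which the paragraphs are the documents and the article title is the query. For each word in the title $T(a_i)$, a tf-idf score is computed over $\{p_j^i\}$:

$$N_w \cdot \log\left(\frac{N_d}{N_{dw}}\right)$$

where $N_w$, $N_d$, and $N_{dw}$ are the count of the word in the document, the total number of documents, and the total number of documents containing the word, respectively.
TextRank: a graph-based algorithm similar to PageRank.
SumBasic: assigns scores to words based on word frequency, then scores sentences from their word scores and selects the highest-scoring sentence. After each selection the word frequencies are recomputed and the process repeats until the desired summary length is reached.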
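cheating: ranks paragraphs using the ground-truth article $a_i$ itself (in the paper, by recall of the article's bigrams), which gives a reference point for how much extraction quality matters.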
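As an illustration of the tf-idf extractor in the list above, here is a minimal sketch that scores each paragraph against the title. It assumes whitespace tokenization with lowercasing, and it assumes a paragraph's score is the sum of the per-word values $N_w \cdot \log(N_d / N_{dw})$ over the title words; the function name `tfidf_scores` is hypothetical.

```python
import math
from collections import Counter
from typing import List


def tfidf_scores(title: str, paragraphs: List[str]) -> List[float]:
    """Score each paragraph (document) against the title (query)."""
    # Per-paragraph word counts; whitespace tokenization is an assumption.
    docs = [Counter(p.lower().split()) for p in paragraphs]
    n_d = len(docs)  # N_d: total number of documents
    query = set(title.lower().split())
    # N_dw: number of documents containing each query word.
    n_dw = {w: sum(1 for d in docs if w in d) for w in query}

    scores = []
    for d in docs:
        s = 0.0
        for w in query:
            if n_dw[w] > 0:  # skip title words that appear nowhere
                # d[w] is N_w: count of the word in this document.
                s += d[w] * math.log(n_d / n_dw[w])
        scores.append(s)
    return scores


if __name__ == "__main__":
    paras = [
        "the transformer model uses attention",
        "a model of the weather",
        "cooking pasta at home",
    ]
    print(tfidf_scores("transformer model", paras))
```

Sorting the paragraphs by these scores in descending order would give the ranked list used by the second stage.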

abstractive stage

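The ranked paragraphs $\{p_{R_i(j)}^i\}$ are concatenated in rank order, with the title $T(a_i)$ as a prefix, to form the raw input, which is then tokenized:

$$m_i = \mathrm{tokenize}\left(T(a_i) \,\Vert\, \{p_{R_i(j)}^i\}\right)$$

The first $L$ tokens are kept as the input sequence:

$$m_i^L = (m_i^1, \ldots, m_i^L)$$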
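The construction of the input sequence can be sketched as follows. This is only an illustration: the function name `build_input` is hypothetical, and whitespace splitting stands in for whatever tokenizer is actually used.

```python
from typing import Callable, List, Sequence


def build_input(
    title: str,
    ranked_paragraphs: Sequence[str],
    L: int,
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Prefix the title, concatenate paragraphs in rank order,
    tokenize, and keep only the first L tokens."""
    raw = " ".join([title, *ranked_paragraphs])
    tokens = tokenize(raw)
    return tokens[:L]  # shorter inputs are returned unchanged


if __name__ == "__main__":
    print(build_input("Some Title", ["first para here", "second para"], L=4))
```

Because truncation happens after ranking, a better extractor means the $L$ tokens that survive are more relevant to the article.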

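The abstractive model is denoted $W$, i.e. $a_i = W(m_i^L)$. The task is treated as generating a short sequence (typically fewer than 500 tokens) from a long one ($L$ up to 11000). The paper modifies the Transformer by removing the encoder, leaving a decoder-only model.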
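A minimal sketch of how a decoder-only model can handle an input/output pair, assuming the input tokens and target tokens are joined into one sequence with a separator and modelled left to right. The separator string `<sep>` and both function names are assumptions made for illustration, not details taken from these notes.

```python
from typing import List, Sequence

SEP = "<sep>"  # hypothetical separator token between input and target


def to_lm_sequence(
    inputs: Sequence[str], targets: Sequence[str], sep: str = SEP
) -> List[str]:
    """Join input and target into a single sequence for a decoder-only model."""
    return list(inputs) + [sep] + list(targets)


def causal_mask(n: int) -> List[List[bool]]:
    """mask[q][k] is True when position q may attend to position k (k <= q)."""
    return [[k <= q for k in range(n)] for q in range(n)]


if __name__ == "__main__":
    seq = to_lm_sequence(["a", "b"], ["x"])
    print(seq)
    for row in causal_mask(len(seq)):
        print(row)
```

With no encoder, every position attends only to earlier positions, so at inference time the model is given the input tokens plus the separator and generates the summary token by token.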
