18_EMNLP_Keyphrase Generation with Correlation Constraints

Fitz1318

Article link: https://blog.csdn.net/Fitz1318/article/details/108005322

Abstract

In this paper, we study automatic keyphrase generation. Although conventional approaches to this task show promising results, they neglect correlation among keyphrases, resulting in duplication and coverage issues. To solve these problems, we propose a new sequence-to-sequence architecture for keyphrase generation named CorrRNN, which captures correlation among multiple keyphrases in two ways. First, we employ a coverage vector to indicate whether each word in the source document has been summarized by previous phrases, which improves the coverage of the keyphrases. Second, preceding phrases are taken into account to eliminate duplicate phrases and improve result coherence. Experimental results show that our model significantly outperforms the state-of-the-art method on benchmark datasets in terms of both accuracy and diversity.

1 Introduction

A keyphrase is a piece of text that can summarize a long document, organize its contents and highlight important concepts, like "virtual organizations" in Table 1. It gives readers a rough understanding of a document without their having to read its content, and has many potential applications, such as information retrieval, text summarization and document classification.
Keyphrases can be categorized into present keyphrases, which appear in the source document, and absent keyphrases, which do not. Conventional approaches extract important text spans as candidate phrases and rank them as keyphrases (Hulth, 2003; Medelyan et al., 2008; Liu et al., 2011; Wu et al., 2015; Wang et al., 2016). These approaches show promising results on present keyphrases but cannot handle absent keyphrases.
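
The present/absent distinction can be illustrated with a minimal sketch. It assumes lowercase whitespace tokenization and exact contiguous matching, which is a simplification (published evaluations typically also apply stemming). The function name `split_present_absent` and the sample document string are hypothetical, introduced only for illustration.

```python
from typing import List, Tuple


def split_present_absent(document: str, keyphrases: List[str]) -> Tuple[List[str], List[str]]:
    """Split keyphrases into those that occur in the document and those that do not."""
    doc_tokens = document.lower().split()
    present, absent = [], []
    for phrase in keyphrases:
        phrase_tokens = phrase.lower().split()
        n = len(phrase_tokens)
        # A phrase counts as "present" if its tokens occur contiguously in the document.
        found = n > 0 and any(
            doc_tokens[i:i + n] == phrase_tokens
            for i in range(len(doc_tokens) - n + 1)
        )
        (present if found else absent).append(phrase)
    return present, absent


# Hypothetical one-line document, not taken from Table 1.
document = "resolving norm conflict in virtual organizations of software agents"
present, absent = split_present_absent(
    document, ["virtual organizations", "multi agent systems"]
)
print("present:", present)
print("absent:", absent)
```

An extractive method can only ever return phrases from the first list, which is why absent keyphrases call for a generative model.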

To predict absent keyphrases, Meng et al. (2017) proposed a generative method.
The approach employs a sequence-to-sequence (Seq2Seq) framework (Sutskever et al., 2014) with a copy mechanism (Gu et al., 2016) to encourage rare word generation. The encoder compresses the text into a dense vector and the decoder generates a phrase with a Recurrent Neural Network (RNN) language model, achieving state-of-the-art performance. Since a document corresponds to multiple keyphrases, the approach splits each document into multiple document-keyphrase pairs that serve as training instances. This approach, however, neglects the correlation among target keyphrases, since it does not model the one-to-many relationship between the document and its keyphrases.
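
The splitting step described above can be sketched as follows. This is only an illustration of the data layout, not Meng et al.'s actual preprocessing code; the function name `to_training_pairs` and the sample strings are assumptions.

```python
from typing import List, Tuple


def to_training_pairs(document: str, keyphrases: List[str]) -> List[Tuple[str, str]]:
    """Turn one document with several keyphrases into independent (source, target) pairs."""
    return [(document, phrase) for phrase in keyphrases]


# Hypothetical document text and keyphrases, for illustration only.
pairs = to_training_pairs(
    "a document about norm conflict in virtual organizations",
    ["virtual organizations", "norm conflict", "multi agent systems"],
)
for source, target in pairs:
    print(target, "<-", source)
```

Each pair is trained on in isolation, so nothing tells the decoder which other targets belong to the same document.
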
Therefore, keyphrase prediction depends only on the source document and ignores the keyphrases that have already been generated. As a consequence, the generated keyphrases suffer from duplication and coverage issues. A duplication issue arises when at least two phrases express the same meaning, which prevents readers from obtaining more information from the keyphrases. For example, three keyphrases in Table 1 have an identical meaning: "multi agent systems", "multi agent" and "agent systems". A coverage issue means that some key points in the document are not covered by the keyphrases, such as "norm conflict" and "norm inconsistency" in Table 1.
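
Both issues can be made concrete with a rough sketch. The paper defines duplication by meaning; the check below substitutes a much cruder proxy (one phrase's token set contained in another's), and the coverage check is plain exact matching against a list of key points. Both heuristics and the helper names `find_near_duplicates` and `uncovered` are assumptions made for illustration, not the paper's metrics.

```python
from itertools import combinations
from typing import List, Tuple


def find_near_duplicates(phrases: List[str]) -> List[Tuple[str, str]]:
    """Return phrase pairs where one phrase's token set is contained in the other's."""
    pairs = []
    for a, b in combinations(phrases, 2):
        tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
        if tokens_a <= tokens_b or tokens_b <= tokens_a:
            pairs.append((a, b))
    return pairs


def uncovered(key_points: List[str], phrases: List[str]) -> List[str]:
    """Return the key points that no predicted phrase matches exactly."""
    predicted = {p.lower() for p in phrases}
    return [k for k in key_points if k.lower() not in predicted]


# Phrases quoted in the text above as examples from Table 1.
predicted = ["virtual organizations", "multi agent systems", "multi agent", "agent systems"]
duplicates = find_near_duplicates(predicted)
missing = uncovered(["norm conflict", "norm inconsistency"], predicted)
print("near-duplicates:", duplicates)
print("uncovered key points:", missing)
```
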
To mitigate such issues, we mimic how a human assigns keyphrases to an arbitrary document. Given the document in Table 1, an annotator will read it and generate keyphrases according to their understanding of the content, like "virtual organizations" and "multi agent systems". After that, instead of generating duplicate phrases like "agent systems" and "multi agent", the annotator will review the document and the preceding keyphrases, then generate a phrase like "norm conflict" to cover topics that have not been summarized by previous phrases. The iteration does not stop until all of the document's topics are covered by keyphrases.
We propose a new sequence-to-sequence architecture, CorrRNN, which is capable of capturing correlation among keyphrases. Notably, the correlation constraints in this paper require that keyphrases cover all topics in the source document and differ from each other. Specifically, we employ a coverage mechanism (Tu et al., 2016) to memorize which parts of the source document have been covered by previous phrases. In this way, the document coverage is modeled explicitly, enabling the generated keyphrases to cover more topics. Furthermore, we propose a review mechanism that considers the previous keyphrases in the generation process, in order to avoid repetition in the final results. Concretely, the review mechanism uses a novel architecture to explicitly model the correlation between the keyphrases that have been generated and the keyphrase that is about to be generated. It extends the existing Seq2Seq model and captures the one-to-many relationship in keyphrase generation. Augmented with the coverage mechanism and the review mechanism, CorrRNN not only inherits the advantages of the Seq2Seq model, but also improves coverage and diversity in the generation process.
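
The two mechanisms can be illustrated with a simplified sketch. It is not CorrRNN's actual formulation, which is given in Section 3 of the paper and is not reproduced here. Two assumptions are made: coverage is the running sum of past attention weights and is subtracted from the raw scores with a fixed `penalty` (Tu et al.'s model instead feeds coverage into the attention function through learned parameters), and the review step is plain dot-product attention over earlier decoder states. All function and variable names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_coverage(
    scores: List[float], coverage: List[float], penalty: float = 1.0
) -> Tuple[List[float], List[float]]:
    """Attend over source words, discouraging words that were already attended to."""
    adjusted = [s - penalty * c for s, c in zip(scores, coverage)]
    attention = softmax(adjusted)
    # The coverage vector accumulates the attention each source word has received.
    new_coverage = [c + a for c, a in zip(coverage, attention)]
    return attention, new_coverage


def review_context(query: List[float], past_states: List[List[float]]) -> List[float]:
    """Summarize earlier decoder states with dot-product attention."""
    if not past_states:
        return [0.0] * len(query)
    weights = softmax([sum(q * s for q, s in zip(query, state)) for state in past_states])
    return [
        sum(w * state[i] for w, state in zip(weights, past_states))
        for i in range(len(query))
    ]


scores = [2.0, 1.0, 0.5]      # hypothetical raw attention scores for three source words
coverage = [0.0, 0.0, 0.0]    # nothing has been covered yet
first, coverage = attend_with_coverage(scores, coverage)
second, coverage = attend_with_coverage(scores, coverage)
print("step 1 attention:", first)
print("step 2 attention:", second)
print("review context:", review_context([1.0, 0.0], [[0.5, 0.5]]))
```

In this toy run the first source word receives less attention at the second step than at the first, because its accumulated coverage lowers its adjusted score. That is the intuition behind using coverage to steer later phrases toward topics not yet summarized.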

We test our model on three benchmark datasets.
The results show that our model outperforms state-of-the-art methods by a large margin, demonstrating the effectiveness of the correlation constraints.
In addition, our model is better than heuristic rules at improving diversity, since it instills the correlation knowledge into the model in an end-to-end fashion.
Our contributions in this paper are three-fold: (1) the proposal of modeling the one-to-many correlation for keyphrase generation, (2) the proposal of a new architecture, CorrRNN, for keyphrase generation, and (3) empirical verification of the effectiveness of CorrRNN on public datasets.
In the remainder of this paper, we first review the related work in Section 2, then elaborate on the proposed model in Section 3. After that, we list the experiment settings in Section 4, followed by results and discussion in Section 5. Finally, Section 6 presents the conclusion and future work.