19_DMKD_A Review of Keyphrase Extraction

A Review of Keyphrase Extraction

Keyphrase extraction is a textual information processing task concerned with the automatic extraction of representative and characteristic phrases from a document that express all the key aspects of its content. Keyphrases constitute a succinct conceptual summary of a document, which is very useful in digital information management systems for semantic indexing, faceted search, document clustering and classification. This article introduces keyphrase extraction, provides a well-structured review of the existing work, offers interesting insights on the different evaluation approaches, highlights open issues and presents a comparative experimental study of popular unsupervised techniques on five datasets.

INTRODUCTION

Keyphrase extraction is concerned with automatically extracting a set of representative phrases from a document that concisely summarize its content (Hasan and Ng, 2014). There exist both supervised and unsupervised keyphrase extraction methods. Unsupervised methods are popular because they are domain independent and do not need labeled training data, i.e., manual extraction of the keyphrases, which comes with subjectivity issues as well as a significant investment in time and money. Supervised methods, on the other hand, have more powerful modeling capabilities and typically achieve higher accuracy than the unsupervised ones according to previous studies (Kim et al., 2013; Caragea et al., 2014; Meng et al., 2017).

The versatility of keyphrases renders keyphrase extraction a very important document processing task. Keyphrases can be used for semantically indexing a collection of documents either in place of their full-text or in addition to it, enabling semantic and faceted search (Gutwin et al., 1999). In addition, they can be used for query expansion in the context of pseudo-relevance feedback (Song et al., 2006). They can also serve as features for document clustering and classification (Hulth and Megyesi, 2006). Furthermore, the set of extracted keyphrases can be viewed as an extreme summary of the corresponding document for human inspection, while the individual keyphrases can guide the extraction of sentences in automatic document summarization systems (Zhang et al., 2004). Keyphrase extraction is particularly important in the (academic) publishing industry for carrying out a number of important tasks, such as the recommendation of new articles or books to customers, highlighting missing citations to authors, identifying potential reviewers for submissions and the analysis of content trends (Augenstein et al., 2017).

There exist a number of noteworthy keyphrase extraction surveys. Hasan and Ng (2014) focus on the errors that are made by state-of-the-art keyphrase extractors: (a) evaluation errors (when a returned keyphrase is semantically equivalent to a gold one but it is evaluated as erroneous), (b) redundancy errors (when a method returns correct but semantically equivalent keyphrases), (c) infrequency errors (when a keyphrase appears one or two times in a text and the method fails to detect it), and (d) overgeneration errors (when a system correctly returns a phrase as a keyphrase because it contains a word that appears frequently in the document, but erroneously outputs additional phrases as keyphrases that contain this frequent word). Although their analysis is not based on a large number of documents, it is quite interesting and well-presented. An earlier survey by the same authors presents the results of an experimental study of state-of-the-art unsupervised keyphrase extraction methods, conducted with the aim of gaining deeper insights into these methods (Hasan and Ng, 2010). The main conclusions are the following: (a) methods should be evaluated on multiple datasets, (b) post-processing steps (e.g., phrase formation) have a large impact on the performance of methods, and (c) TfIdf is a strong baseline. Boudin et al. (2016) study the effect of document pre-processing pipelines on the keyphrase extraction process, while Florescu and Caragea (2017a) examine how keyphrase extraction is affected by phrase ranking schemes.

Our article constitutes a contemporary review of the keyphrase extraction task, containing the following main contributions:

  • A systematic presentation of both unsupervised (Section 2) and supervised (Section 3) keyphrase extraction methods via comprehensive categorization schemes based on the main properties of these methods. Our article reviews 37 additional methods compared to Hasan and Ng (2014). In addition, we contribute a timeline of unsupervised and supervised methods to shed light on their evolution, as well as a presentation of the main types of features employed in supervised methods, along with a discussion of the issue of class imbalance.

  • We present the different approaches that can be followed for evaluating keyphrase extraction methods, as well as the different evaluation measures that exist, along with their popularity in the literature (Section 4).

  • We provide a list of popular keyphrase extraction datasets, including their sources and properties, as well as a comprehensive catalogue of commercial APIs and free software (Section 5) related to keyphrase extraction.

  • We present a thorough empirical study, both quantitative and qualitative, among commercial APIs and state-of-the-art unsupervised methods, which allows us to gain a deeper understanding of how the results are affected by different evaluation approaches, evaluation measures and ground truth standards (Section 6).

The article search strategy that we followed involved searching for “keyphrase extraction” in the following databases of scientific literature: Google Scholar, Springer Link, IEEE Xplore, ACM Digital Library and DBLP. We focused mainly on articles appearing in the high-quality journals and conference proceedings that are listed in Appendix C.

UNSUPERVISED METHODS

The basic steps of an unsupervised keyphrase extraction system are the following (Hasan and Ng, 2010, 2014):

  1. Selection of the candidate lexical units based on some heuristics. Examples of such heuristics are the exclusion of stopwords and the selection of words that belong to a specific part-of-speech (POS).

  2. Ranking of the candidate lexical units.

  3. Formation of the keyphrases by selecting words from the top-ranked ones or by selecting a phrase with a high rank score or whose parts have a high score.
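
The three steps above can be illustrated with a minimal Python sketch. It is not any particular published method: the stopword list, the use of raw term frequency as the ranking score, the sum-of-word-scores phrase formation and the function name `extract_keyphrases` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "from", "in", "is",
             "of", "on", "or", "that", "the", "to", "with"}


def extract_keyphrases(text, top_n=3):
    """Toy three-step pipeline: select candidates, rank words, form phrases."""
    # Words become alphabetic tokens; every other non-space character becomes
    # its own token so that punctuation can act as a phrase boundary.
    tokens = re.findall(r"[a-z]+|[^a-z\s]", text.lower())

    # Step 1: candidate selection. Stopwords and punctuation are excluded and
    # split the text into candidate phrases (a POS filter could be used instead).
    candidates, current = [], []
    for token in tokens:
        if token in STOPWORDS or not token.isalpha():
            if current:
                candidates.append(tuple(current))
            current = []
        else:
            current.append(token)
    if current:
        candidates.append(tuple(current))

    # Step 2: rank the candidate words, here simply by term frequency.
    word_score = Counter(word for phrase in candidates for word in phrase)

    # Step 3: form keyphrases by scoring each phrase as the sum of its word scores.
    phrase_score = {phrase: sum(word_score[w] for w in phrase)
                    for phrase in set(candidates)}
    ranked = sorted(phrase_score, key=lambda p: (-phrase_score[p], p))
    return [" ".join(phrase) for phrase in ranked[:top_n]]


if __name__ == "__main__":
    sample = ("Keyphrase extraction is useful. "
              "Keyphrase extraction is the extraction of phrases.")
    print(extract_keyphrases(sample, top_n=2))
```

Summing word scores favours longer phrases, which is one reason why the choice of phrase formation step can change the results noticeably.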

[...]

CONCLUSIONS AND FUTURE DIRECTIONS

Keyphrases are multi-purpose knowledge gems. They constitute a concise summary of documents that is extremely useful both for human inspection and machine consumption, in support of tasks such as faceted search, document classification and clustering, query expansion and document recommendation. Our article reviews the existing body of work on keyphrase extraction and presents a comprehensive organization of the material that aims to help newcomers and veterans alike navigate the large amount of prior art and grasp its evolution.

We present a large number of both unsupervised and supervised keyphrase extraction methods, including recent deep learning methods, categorizing them according to their main features and properties, and highlighting their strengths and weaknesses. We discuss the challenges that supervised methods face, namely the subjectivity that characterizes the existing annotated datasets and the imbalance of keyphrases versus non-keyphrases. In addition, we discuss how keyphrase extraction methods are currently evaluated, and present a long list of free and commercial keyphrase extraction software and APIs, as well as the main collections of documents with associated keyphrases that are used for obtaining experimental results.

Our review includes an extensive empirical evaluation study of keyphrase extraction. We compare commercial APIs, as well as unsupervised methods ourselves, while for supervised methods we include a table with results collected from the corresponding papers. The results show that simple unsupervised methods, such as TfIdf, are strong baselines that should be considered in empirical studies and that deep learning methods achieve state-of-the-art results. Among unsupervised methods, we notice that graph-based methods work better for short documents, while statistical methods work better for long documents.
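
As a rough illustration of why TfIdf is such a cheap baseline, the sketch below scores the terms of each document in a small collection. The exact weighting is an assumption (raw term count multiplied by log(N / df)), since several TfIdf variants exist and the text above does not specify one; the function name `tfidf_scores` is hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Return one {term: score} dict per tokenised document.

    Assumed weighting: tf = raw count, idf = log(N / df).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        scores.append({term: count * math.log(n_docs / doc_freq[term])
                       for term, count in term_freq.items()})
    return scores


if __name__ == "__main__":
    docs = [["graph", "ranking", "graph"], ["graph", "statistics"]]
    for doc_scores in tfidf_scores(docs):
        print(sorted(doc_scores.items(), key=lambda item: -item[1]))
```

Terms that occur in every document receive a score of zero under this weighting, so only terms that distinguish a document from the rest of the collection are ranked highly.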

Our evaluation study presents a thorough analysis of the exact and partial matching approaches, concluding with the recommendation of considering their average, and highlighting the need for approaches that take into account the semantic similarity of predicted and golden keyphrases. In addition, our study investigates how the different golden keyphrase sources (authors and readers) affect the evaluation of keyphrase extraction methods, concluding that they play a significant role and should be explicitly considered and reported in empirical studies.
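
The difference between the two matching approaches can be sketched as follows. This is a simplified illustration under stated assumptions: partial matching is taken here to mean comparing the sets of words contained in the predicted and golden keyphrases, no stemming or case normalisation is applied, and all function names are hypothetical.

```python
def f1_score(predicted, gold):
    """F1 between two sets of items (0.0 when there is no overlap)."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def exact_match_f1(predicted, gold):
    """A predicted keyphrase counts only if it equals a golden keyphrase."""
    return f1_score(predicted, gold)


def partial_match_f1(predicted, gold):
    """Assumed word-level variant: compare the sets of words in the phrases."""
    predicted_words = {w for phrase in predicted for w in phrase.split()}
    gold_words = {w for phrase in gold for w in phrase.split()}
    return f1_score(predicted_words, gold_words)


def average_f1(predicted, gold):
    """Average of the exact and partial scores."""
    return (exact_match_f1(predicted, gold) + partial_match_f1(predicted, gold)) / 2


if __name__ == "__main__":
    predicted = ["keyphrase extraction", "graph ranking"]
    gold = ["keyphrase extraction", "ranking methods"]
    print(exact_match_f1(predicted, gold),
          partial_match_f1(predicted, gold),
          average_f1(predicted, gold))
```

Exact matching gives no credit for "graph ranking" against "ranking methods", while the word-level variant credits the shared word; averaging the two sits between the strict and the lenient view.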

Much progress remains to be made in this challenging task, as the accuracy of state-of-the-art systems has not yet reached satisfactory levels. At the moment, the most exciting developments in mastering language are coming from the frontier of deep learning and unsupervised language models (Devlin et al., 2019; Yang et al., 2019; Cer et al., 2018). The exploitation of such models for keyphrase extraction and/or generation appears to be the most interesting future direction.
