Problem:
Existing keyphrase extraction systems commonly suffer from two problems: 1) they are complex and slow; 2) over-generation (i.e. extracting redundant keyphrases).
Resources:
1. Code: https://github.com/swisscom/ai-research-keyphrase-extraction
Related work:
1. Unsupervised Keyphrase Extraction
Graph-based: TextRank (Mihalcea and Tarau, 2004); SingleRank (Wan and Xiao, 2008); WordAttractionRank (Wang et al., 2015)
Others: KeyCluster (Liu et al., 2009); TopicRank (Bougouin et al., 2013)
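Unlike the work above, EmbedRank, proposed in this paper, uses state-of-the-art semantic document embedding methods to represent both the document and the candidate keyphrases as vectors in a high-dimensional space, rather than simply averaging word vectors. This makes it possible to compute a meaningful distance between the document and each candidate phrase (improving informativeness) and a semantic distance between candidate phrases (improving diversity).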
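A minimal sketch of how the two distances could be combined, assuming cosine similarity as the measure and a simple greedy trade-off between them. The function names, the `trade_off` parameter, and the toy 2-dimensional vectors are all hypothetical and only for illustration; real embeddings would come from a sentence or document embedding model, and the notes above do not specify the actual selection rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_keyphrases(doc_vec, cand_vecs, top_n=2, trade_off=0.5):
    """Greedily pick phrases that are close to the document (informativeness)
    and far from phrases already picked (diversity).

    trade_off=1.0 ranks by similarity to the document only;
    lower values penalize redundancy more strongly.
    """
    selected = []
    remaining = set(cand_vecs)
    while remaining and len(selected) < top_n:

        def score(phrase):
            informativeness = cosine(cand_vecs[phrase], doc_vec)
            # Highest similarity to any phrase already selected (0.0 if none yet).
            redundancy = max(
                (cosine(cand_vecs[phrase], cand_vecs[s]) for s in selected),
                default=0.0,
            )
            return trade_off * informativeness - (1 - trade_off) * redundancy

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy vectors (assumed, not real embeddings): two near-duplicate phrases
# and one phrase that is less similar to the document but more distinct.
doc_vec = [1.0, 1.0]
candidates = {
    "neural network": [1.0, 0.9],
    "neural networks": [1.0, 0.95],
    "training data": [0.2, 1.0],
}

by_similarity_only = select_keyphrases(doc_vec, candidates, top_n=2, trade_off=1.0)
with_diversity = select_keyphrases(doc_vec, candidates, top_n=2, trade_off=0.5)
print(by_similarity_only)
print(with_diversity)
```

Ranking by similarity to the document alone tends to return near-duplicates, which is the over-generation problem noted at the top; the redundancy penalty is one way to counter it.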
2. Word and Sentence Embeddings
Words: Word2Vec (Mikolov et al., 2013)
Sentences: Skip-Thought (Kiros et al., 2015)
Paragraph: Paragra