TextRank: Keyword and Sentence Extraction

yichudu

Article link: https://blog.csdn.net/chuchus/article/details/77993499

1. Introduction

TextRank is a graph-based model for keyword and sentence extraction. It has a lot in common with Google's PageRank, and both are unsupervised.

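Notation: $S(V_i)$ is the score of vertex $V_i$, $In(V_i)$ is the set of vertices that point to $V_i$, $Out(V_j)$ is the set of vertices that $V_j$ points to, and $d$ is a damping factor between 0 and 1.

The basic idea of the TextRank model is voting: an edge from $V_j$ to $V_i$ counts as a vote for $V_i$, and the vote carries more weight when $V_j$ itself has a high score:

$S(V_i)=(1-d)+d*\sum_{j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j) \tag{1}$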
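Equation (1) is usually solved by repeated substitution. The following is a minimal sketch, not the paper's implementation: the function name `textrank_scores`, the default `d=0.85`, the tolerance, and the iteration cap are all assumptions chosen for illustration.

```python
def textrank_scores(out_links, d=0.85, tol=1e-6, max_iter=100):
    """Iterate Eq. (1) until the scores stop changing.

    out_links maps each vertex to the set of vertices it points to.
    d, tol and max_iter are illustrative defaults, not prescribed values.
    """
    vertices = set(out_links)
    for targets in out_links.values():
        vertices.update(targets)

    # Derive In(V_i) from Out(V_j).
    in_links = {v: set() for v in vertices}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].add(src)

    scores = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        new_scores = {}
        for v in vertices:
            # Each in-neighbour u splits its score evenly over Out(u).
            votes = sum(scores[u] / len(out_links[u]) for u in in_links[v])
            new_scores[v] = (1 - d) + d * votes
        change = max(abs(new_scores[v] - scores[v]) for v in vertices)
        scores = new_scores
        if change < tol:
            break
    return scores


graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"a"}}
scores = textrank_scores(graph)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy graph, `a` receives votes from both `b` and `c`, so it ends up with the highest score.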

Since this is a graph-based ranking model, the key step is building the graph.

Two vertices are connected if they co-occur within a window of at most N words, where N can be set anywhere from 2 to 10.

directed edge
If the graph is directed, the edge direction follows the natural flow of the text.
undirected edge
An undirected edge can be treated as a pair of arcs, one in each direction.
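The co-occurrence rule and the two edge types can be sketched as follows. This is an illustration under assumptions: the function name `build_cooccurrence_graph` and its parameters are made up here, and the input is taken to be an already filtered list of words.

```python
def build_cooccurrence_graph(words, window=2, directed=False):
    """Connect words that co-occur within `window` consecutive words.

    Returns a dict mapping each word to the set of words it points to.
    For an undirected graph every edge is stored in both directions.
    """
    out_links = {}
    for i, w in enumerate(words):
        out_links.setdefault(w, set())  # every word becomes a vertex
        # Words at positions i+1 .. i+window-1 share a window with words[i].
        for j in range(i + 1, min(i + window, len(words))):
            u = words[j]
            if u == w:
                continue  # skip self-loops
            out_links[w].add(u)  # direction follows the text order
            if not directed:
                out_links.setdefault(u, set()).add(w)
    return out_links


words = ["Matlab", "code", "ambiguity", "functions"]
print(build_cooccurrence_graph(words, window=2))
print(build_cooccurrence_graph(words, window=2, directed=True))
```

With `window=2` only adjacent words are linked; a larger window gives a denser graph.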

A sentence is made up of words, and these words need some light processing before they can be added to the graph as vertices.

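pre-processing
Filter out some words using a stop-word list and part-of-speech tags; the remaining words become vertices.
The paper's experiments show that keeping only nouns and adjectives works well.
processing
Build the graph from the text.
post-processing
Keywords that are adjacent in the text are collapsed into a multi-word keyword.
The paper gives this example: in the text "Matlab code for plotting ambiguity functions", if Matlab and code are both selected as keywords, they are adjacent and can be merged into Matlab code, which is more expressive.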
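The pre-processing and post-processing steps can be sketched like this. Everything here is an assumption made for illustration: the function names, the hand-written part-of-speech tags (`NOUN`, `ADJ`, and so on), the tiny stop-word set, and the keyword set that stands in for the output of the ranking step.

```python
def filter_candidates(tagged_words, stop_words, keep_tags=("NOUN", "ADJ")):
    """Pre-processing: keep non-stop-words whose tag is in keep_tags."""
    return [
        word
        for word, tag in tagged_words
        if tag in keep_tags and word.lower() not in stop_words
    ]


def collapse_keywords(tokens, keywords):
    """Post-processing: merge runs of adjacent keywords into one phrase."""
    phrases, run = [], []
    # The trailing None is a sentinel that flushes the last run.
    for tok in list(tokens) + [None]:
        if tok is not None and tok in keywords:
            run.append(tok)
            continue
        if run:
            phrase = " ".join(run)
            if phrase not in phrases:
                phrases.append(phrase)
        run = []
    return phrases


# Hand-tagged tokens; a real pipeline would get these from a POS tagger.
tagged = [
    ("Matlab", "NOUN"), ("code", "NOUN"), ("for", "ADP"),
    ("plotting", "VERB"), ("ambiguity", "NOUN"), ("functions", "NOUN"),
]
print(filter_candidates(tagged, stop_words={"for"}))

tokens = [word for word, _ in tagged]
# Hypothetical ranking result: these words were selected as keywords.
print(collapse_keywords(tokens, keywords={"Matlab", "code", "functions"}))
```

Note that collapsing works on the original token sequence, not on the filtered list, so that words separated by a removed token are not merged by mistake.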

The data set used in the paper's experiments is a collection of 500 abstracts from the Inspec database.
Results: precision 31.2, recall 43.1, F-measure 36.2.
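As a quick consistency check, assuming the F-measure here is the usual harmonic mean of precision $P$ and recall $R$, the three reported numbers agree with each other:

$$F = \frac{2PR}{P+R} = \frac{2 \times 31.2 \times 43.1}{31.2 + 43.1} = \frac{2689.44}{74.3} \approx 36.2$$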

Applied to sentences instead of words, TextRank can be used to generate summaries automatically.
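For sentences, the vertices are whole sentences and the edges need a weight. The sketch below assumes the word-overlap similarity and the weighted form of the score described in the TextRank paper; the function names, the fixed iteration count, and the toy sentences are assumptions for illustration only.

```python
import math


def sentence_similarity(s1, s2):
    """Shared words divided by log(len(s1)) + log(len(s2))."""
    if not s1 or not s2:
        return 0.0
    denom = math.log(len(s1)) + math.log(len(s2))
    if denom == 0:  # both sentences have exactly one word
        return 0.0
    return len(set(s1) & set(s2)) / denom


def rank_sentences(sentences, d=0.85, iterations=50):
    """Return sentence indices ordered from highest to lowest score."""
    n = len(sentences)
    # Symmetric weight matrix; a sentence is not linked to itself.
    w = [
        [sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        # Weighted vote: sentence j passes on its score in proportion
        # to the edge weight w[j][i] relative to its total edge weight.
        scores = [
            (1 - d) + d * sum(
                w[j][i] / out_weight[j] * scores[j]
                for j in range(n) if w[j][i] > 0
            )
            for i in range(n)
        ]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


sentences = [
    ["textrank", "builds", "a", "graph"],
    ["the", "graph", "ranks", "words"],
    ["textrank", "ranks", "words", "in", "a", "graph"],
    ["bananas", "are", "yellow"],
]
print(rank_sentences(sentences))
```

A summary would then be formed from the top-ranked sentences; here the third sentence shares the most words with the others and ranks first, while the unrelated last sentence ranks last.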