textrank算法的关键词提取

textrank虽然没有用在任务中提取关键词,但是还是做了来对比一下其他两个关键词算法的效果,在这里也简单说一下。
思想
1.如果一个单词出现在很多单词后面的话,那么说明这个单词比较重要
2.一个TextRank值很高的单词后面跟着的一个单词,那么这个单词的TextRank值会相应地因此而提高
3.通过词之间的相邻关系构建网络,然后用PageRank迭代计算每个节点的rank值,排序rank值即可得到关键词。PageRank本来是用来解决网页排名的问题,网页之间的链接关系即为图的边,迭代计算公式如下
在这里插入图片描述
实现
设置一个长度为N的滑动窗口,所有在这个窗口之内的词都视作词结点的相邻结点;则TextRank构建的词图为无向图。下图给出了由一个文档构建的词图(去掉了停用词并按词性做了筛选),考虑到不同词对可能有不同的共现(co-occurrence),TextRank将共现作为无向图边的权值。
主要代码
建立有权无向图


    def build_word_graph(self, window=2, pos=None):
        """Build a graph representation of the document in which nodes/vertices
        are words and edges represent co-occurrence relation. Syntactic filters
        can be applied to select only words of certain Part-of-Speech.
        Co-occurrence relations can be controlled using the distance between
        word occurrences in the document.

        As the original paper does not give precise details on how the word
        graph is constructed, we make the following assumptions from the example
        given in Figure 2: 1) sentence boundaries **are not** taken into account
        and, 2) stopwords and punctuation marks **are** considered as words when
        computing the window.

        Args:
            window (int): the window for connecting two words in the graph,
                defaults to 2.
            pos (set): the set of valid pos for words to be considered as nodes
                in the graph, defaults to ('NOUN', 'PROPN', 'ADJ').
        """

        if pos is None:
            pos = {'NOUN', 'PROPN', 'ADJ'}

        # flatten document as a sequence of (word, pass_syntactic_filter) tuples
        text = [(word, sentence.pos[i] in pos) for sentence in self.sentences
                for i, word in enumerate(sentence.stems)]

        # add nodes to the graph
        self.graph.add_nodes_from([word for word, valid in text if valid])

        # add edges to the graph
        for i, (node1, is_in_graph1) in enumerate(text):

            # speed up things
            if not is_in_graph1:
                continue

            for j in range(i + 1, min(i + window, len(text))):
                node2, is_in_graph2 = text[j]
                if is_in_graph2 and node1 != node2:
                    self.graph.add_edge(node1, node2)

参考
https://www.aclweb.org/anthology/W04-3252.pdf

展开阅读全文
©️2020 CSDN 皮肤主题: 深蓝海洋 设计师: CSDN官方博客 返回首页
实付0元
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、C币套餐、付费专栏及课程。

余额充值