textrank算法的关键词提取

最新推荐文章于 2024-06-30 19:34:18 发布

qq_41824131

最新推荐文章于 2024-06-30 19:34:18 发布

阅读量918

点赞数

文章标签： python 机器学习

本文链接：https://blog.csdn.net/qq_41824131/article/details/107051074

版权

textrank虽然没有用在任务中提取关键词，但是还是做了来对比一下其他两个关键词算法的效果，在这里也简单说一下。
思想
1.如果一个单词出现在很多单词后面的话，那么说明这个单词比较重要
2.一个TextRank值很高的单词后面跟着的一个单词，那么这个单词的TextRank值会相应地因此而提高
3.通过词之间的相邻关系构建网络，然后用PageRank迭代计算每个节点的rank值，排序rank值即可得到关键词。PageRank本来是用来解决网页排名的问题，网页之间的链接关系即为图的边，迭代计算公式如下
在这里插入图片描述
实现
设置一个长度为N的滑动窗口，所有在这个窗口之内的词都视作词结点的相邻结点；则TextRank构建的词图为无向图。下图给出了由一个文档构建的词图（去掉了停用词并按词性做了筛选），考虑到不同词对可能有不同的共现（co-occurrence），TextRank将共现作为无向图边的权值。
主要代码
建立有权无向图


    def build_word_graph(self, window=2, pos=None):
        """Build a graph representation of the document in which nodes/vertices
        are words and edges represent co-occurrence relation. Syntactic filters
        can be applied to select only words of certain Part-of-Speech.
        Co-occurrence relations can be controlled using the distance between
        word occurrences in the document.

        As the original paper does not give precise details on how the word
        graph is constructed, we make the following assumptions from the example
        given in Figure 2: 1) sentence boundaries **are not** taken into account
        and, 2) stopwords and punctuation marks **are** considered as words when
        computing the window.

        Args:
            window (int): the window for connecting two words in the graph,
                defaults to 2.
            pos (set): the set of valid pos for words to be considered as nodes
                in the graph, defaults to ('NOUN', 'PROPN', 'ADJ').
        """

        if pos is None:
            pos = {'NOUN', 'PROPN', 'ADJ'}

        # flatten document as a sequence of (word, pass_syntactic_filter) tuples
        text = [(word, sentence.pos[i] in pos) for sentence in self.sentences
                for i, word in enumerate(sentence.stems)]

        # add nodes to the graph
        self.graph.add_nodes_from([word for word, valid in text if valid])

        # add edges to the graph
        for i, (node1, is_in_graph1) in enumerate(text):

            # speed up things
            if not is_in_graph1:
                continue

            for j in range(i + 1, min(i + window, len(text))):
                node2, is_in_graph2 = text[j]
                if is_in_graph2 and node1 != node2:
                    self.graph.add_edge(node1, node2)