Keyword Extraction Based on the TextRank Algorithm

Latest recommended article published on 2024-06-30 19:34:18

拉克丝の碎花裙


Category: Notes. Tags: python

Article link: https://blog.csdn.net/qq_51945755/article/details/121020869


2021SC@SDUSC

Source code:

    def textrank(self, sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False):
        """
        Extract keywords from sentence using TextRank algorithm.
        Parameter:
            - topK: return how many top keywords. `None` for all possible words.
            - withWeight: if True, return a list of (word, weight);
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v'].
                        if the POS of w is not in this list, it will be filtered.
            - withFlag: if True, return a list of pair(word, weight) like posseg.cut
                        if False, return a list of words
        """
        self.pos_filt = frozenset(allowPOS)
        g = UndirectWeightedGraph()
        cm = defaultdict(int)
        words = tuple(self.tokenizer.cut(sentence))
        #===========================================================================
        for i, wp in enumerate(words):
            if self.pairfilter(wp):
                for j in xrange(i + 1, i + self.span):
                    if j >= len(words):
                        break
                    if not self.pairfilter(words[j]):
                        continue
                    if allowPOS and withFlag:
                        cm[(wp, words[j])] += 1
                    else:
                        cm[(wp.word, words[j].word)] += 1

        for terms, w in cm.items():
            g.addEdge(terms[0], terms[1], w)
        nodes_rank = g.rank()
        #==========================================================================
        if withWeight:
            tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
        else:
            tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)

        if topK:
            return tags[:topK]
        else:
            return tags

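textrank() takes five parameters, used in essentially the same way as those of the TF-IDF based extract_tags(). The only difference is the default value of allowPOS: for textrank() it is ('ns', 'n', 'vn', 'v'), so by default only words with these four parts of speech are kept.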

The #============= comment lines divide the source into three parts.

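Part one:
pos_filt stores allowPOS as a frozenset, g is a newly created undirected weighted graph, and cm is a defaultdict(int): a dictionary in which a missing key starts at the integer 0, so cm[key] += 1 never raises a KeyError.
words holds the result of segmenting the sentence with self.tokenizer.cut(sentence), which is a posseg cut, so each item carries both a word and its POS flag.
self.tokenizer is set as follows: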

self.tokenizer = self.postokenizer = jieba.posseg.dt
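
The role of defaultdict(int) in part one can be seen in a minimal sketch. The word pairs below are made up for illustration and are not taken from jieba.

```python
from collections import defaultdict

# Hypothetical word pairs, standing in for pairs found inside the window.
pairs = [("machine", "learning"), ("learning", "model"), ("machine", "learning")]

cm = defaultdict(int)   # a missing key is created with value int() == 0
for pair in pairs:
    cm[pair] += 1       # no KeyError on the first occurrence of a pair

print(dict(cm))
```

With a plain dict, the first cm[pair] += 1 would raise a KeyError unless each key were initialized beforehand.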

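Part two:
enumerate() wraps an iterable (such as a list, tuple, or string) and yields (index, item) pairs, so the loop gets both the position and the element.
Source of pairfilter(wp):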

    def pairfilter(self, wp):
        return (wp.flag in self.pos_filt and len(wp.word.strip()) >= 2
                and wp.word.lower() not in self.stop_words)

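It returns True only if the word's POS is in pos_filt, the stripped word is at least two characters long, and its lowercase form is not in stop_words; otherwise it returns False.
This decides whether a segmented word is eligible to become a node in the undirected graph.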
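
The three checks can be tried in isolation with a small sketch. The Pair namedtuple and the stop-word set here are stand-ins invented for the example; the only assumption about jieba's pair objects is that they expose .word and .flag.

```python
from collections import namedtuple

# Stand-in for the pair objects produced by posseg (assumption: only the
# .word and .flag attributes matter for the filter).
Pair = namedtuple("Pair", ["word", "flag"])

pos_filt = frozenset(("ns", "n", "vn", "v"))
stop_words = {"the", "of"}  # hypothetical stop-word set


def pairfilter(wp):
    """Same three checks: allowed POS, length >= 2, not a stop word."""
    return (wp.flag in pos_filt
            and len(wp.word.strip()) >= 2
            and wp.word.lower() not in stop_words)


print(pairfilter(Pair("算法", "n")))   # True: noun, two characters
print(pairfilter(Pair("词", "n")))     # False: only one character
print(pairfilter(Pair("漂亮", "a")))   # False: 'a' is not in pos_filt
print(pairfilter(Pair("The", "n")))    # False: lowercase form is a stop word
```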

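For each word at index i that passes pairfilter, the inner for loop examines the next self.span - 1 words (self.span defaults to 5). The loop stops as soon as j runs past the end of words, and it skips any word that fails pairfilter. For every remaining pair, the co-occurrence count stored in cm is increased by 1 (cm is a defaultdict(int), so each count starts at 0). When withFlag is true, the key is the pair of (word, flag) objects; otherwise it is the pair of plain word strings.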
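
The sliding-window counting can be sketched on plain strings. This is a simplified illustration: the pairfilter checks are left out, and the token list and the span of 3 are chosen only to keep the example small.

```python
from collections import defaultdict


def cooccurrence(tokens, span=5):
    """Count ordered pairs (tokens[i], tokens[j]) with i < j < i + span."""
    cm = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(i + 1, i + span):
            if j >= len(tokens):   # ran past the end of the token list
                break
            cm[(w, tokens[j])] += 1
    return cm


tokens = ["a", "b", "c", "a", "b"]   # hypothetical, already-segmented input
cm = cooccurrence(tokens, span=3)
for pair, count in sorted(cm.items()):
    print(pair, count)
```

Note that ("a", "b") and ("b", "a") are separate keys in cm, because the key records the order in which the two words appeared.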

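Each counted pair is then added to the undirected graph as an edge whose weight is its co-occurrence count, and g.rank() computes a score for every node. The sorting itself happens in part three.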
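
To give an idea of what a ranking step over such a graph can look like, here is a minimal sketch of weighted PageRank on an undirected graph. The damping factor 0.85, the 10 iterations, the starting scores, and the lack of any final normalization are assumptions of this sketch; it is not a copy of UndirectWeightedGraph.rank().

```python
from collections import defaultdict


def rank(edges, d=0.85, iterations=10):
    """Weighted PageRank over an undirected graph given as {(u, v): weight}."""
    graph = defaultdict(list)
    for (u, v), w in edges.items():
        graph[u].append((v, w))   # undirected: store the edge in both directions
        graph[v].append((u, w))

    # Total edge weight attached to each node.
    out_sum = {n: sum(w for _, w in nbrs) for n, nbrs in graph.items()}
    # Start every node with the same score.
    score = {n: 1.0 / len(graph) for n in graph}

    for _ in range(iterations):
        for n in sorted(graph):   # fixed order keeps the result deterministic
            s = sum(w / out_sum[m] * score[m] for m, w in graph[n])
            score[n] = (1 - d) + d * s
    return score


edges = {("a", "b"): 2, ("a", "c"): 1, ("b", "d"): 1}   # hypothetical counts
scores = rank(edges)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

Nodes with more and heavier edges end up with higher scores, which is why frequently co-occurring words surface as keywords.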

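Part three:

The result is finalized according to the values of withWeight and topK.

If withWeight is True, nodes_rank.items() is sorted by score in descending order, which gives a list of (word, weight) tuples.

Otherwise only the keys of nodes_rank are sorted by score in descending order, which gives a list of words.

The first topK keywords are returned; if topK is None (or otherwise falsy), all results are returned.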
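
The effect of withWeight and topK can be reproduced on a small dictionary. The finalize helper and the scores below are hypothetical; only the two sorted() calls and the topK slice follow the source shown earlier.

```python
from operator import itemgetter

# Hypothetical scores, standing in for the dictionary returned by g.rank().
nodes_rank = {"text": 0.5, "algorithm": 1.0, "keyword": 0.8}


def finalize(nodes_rank, topK=20, withWeight=False):
    """Sort by score, descending, then keep the first topK entries."""
    if withWeight:
        tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
    else:
        tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)
    return tags[:topK] if topK else tags


print(finalize(nodes_rank, topK=2))
# ['algorithm', 'keyword']
print(finalize(nodes_rank, topK=2, withWeight=True))
# [('algorithm', 1.0), ('keyword', 0.8)]
print(finalize(nodes_rank, topK=None))
# ['algorithm', 'keyword', 'text']
```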

Test example:

Test code:

import jieba
import jieba.analyse as analyse

jieba.initialize()
boundary = "=" * 40
# Read the lyrics as bytes (jieba accepts bytes and decodes them itself)
# and close the file when done.
with open('lyric.txt', 'rb') as f:
    content = f.read()
print(boundary)
print(1, ", topK=5")
tags = analyse.textrank(content, topK=5)
print(tags)
print(boundary)
print(2, ", topK=15")
tags = analyse.textrank(content, topK=15)
print(tags)
print(boundary)
print(3, ", topK=5, withWeight=True")
tags = analyse.textrank(content, topK=5, withWeight=True)
print(tags)
print(boundary)
print(4, ", topK=5, withWeight=True, allowPOS=('n','v')")
tags = analyse.textrank(content, topK=5, withWeight=True, allowPOS=('n', 'v'))
print(tags)
print(boundary)
print(5, ", topK=5, withWeight=True, allowPOS=('n','v'), withFlag=True")
tags = analyse.textrank(content, topK=5, withWeight=True, allowPOS=('n', 'v'), withFlag=True)
print(tags)
print(boundary)
print(6, ", topK=5, withWeight=True, allowPOS=('n','v'), withFlag=True, set_idf_path()")
# set_idf_path() configures the TF-IDF extractor (extract_tags); textrank()
# does not use IDF weights, so this output should match test 5.
analyse.set_idf_path("../extra_dict/idf.txt.big")
tags = analyse.textrank(content, topK=5, withWeight=True, allowPOS=('n', 'v'), withFlag=True)
print(tags)
print(boundary)
print(7, ", topK=5, withWeight=True, allowPOS=('n','v'), withFlag=True, set_stop_words()")
# set_stop_words() adds the words in this file to the stop-word set
# checked by pairfilter().
analyse.set_stop_words("../extra_dict/stop_words.txt")
tags = analyse.textrank(content, topK=5, withWeight=True, allowPOS=('n', 'v'), withFlag=True)
print(tags)

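Note: because allowPOS has a non-empty default value, withFlag works in textrank() without passing allowPOS explicitly. In extract_tags(), by contrast, withFlag only takes effect when allowPOS is supplied.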