2021SC@SDUSC
Let's learn how this method is used directly from the source code:
def extract_tags(self, sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False):
    """
    Extract keywords from sentence using TF-IDF algorithm.
    Parameter:
        - topK: return how many top keywords. `None` for all possible words.
        - withWeight: if True, return a list of (word, weight);
                      if False, return a list of words.
        - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v', 'nr'].
                    if the POS of w is not in this list, it will be filtered.
        - withFlag: only work with allowPOS is not empty.
                    if True, return a list of pair(word, weight) like posseg.cut
                    if False, return a list of words
    """
    if allowPOS:
        allowPOS = frozenset(allowPOS)
        words = self.postokenizer.cut(sentence)
    else:
        words = self.tokenizer.cut(sentence)
    #========================================================
    freq = {}
    for w in words:
        if allowPOS:
            if w.flag not in allowPOS:
                continue
            elif not withFlag:
                w = w.word
        wc = w.word if allowPOS and withFlag else w
        if len(wc.strip()) < 2 or wc.lower() in self.stop_words:
            continue
        freq[w] = freq.get(w, 0.0) + 1.0
    total = sum(freq.values())
    for k in freq:
        kw = k.word if allowPOS and withFlag else k
        freq[k] *= self.idf_freq.get(kw, self.median_idf) / total
    #=========================================================
    if withWeight:
        tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
    else:
        tags = sorted(freq, key=freq.__getitem__, reverse=True)
    if topK:
        return tags[:topK]
    else:
        return tags
The docstring explains how the method is used.
We can see that the method accepts five parameters:
sentence: the text from which to extract keywords
topK: the number of keywords to return (ordered by descending weight)
withWeight: whether the return value carries weights
allowPOS: a POS (part-of-speech) filter, passed in as a tuple
withFlag: whether the return value includes POS tags
The source is divided into three parts by the "#===========" comments:
Part one:
If allowPOS is non-empty, it is frozen into a frozenset,
and jieba.posseg.cut() is used to tokenize sentence (keeping POS tags);
if allowPOS is empty, jieba.cut() is used to tokenize sentence instead.
The result is stored in words.
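This branch can be sketched in pure Python. The mock tokenizers below are stand-ins for jieba's real `posseg.cut` (which yields pair objects with `.word` and `.flag` attributes) and `cut` (which yields plain strings); splitting on whitespace and tagging everything as 'n' is purely illustrative.

```python
from collections import namedtuple

# Hypothetical stand-in for the pair objects jieba.posseg.cut yields.
Pair = namedtuple("Pair", ["word", "flag"])

def mock_posseg_cut(sentence):
    # Pretend every token is a noun ('n'); real jieba assigns real POS tags.
    return [Pair(w, "n") for w in sentence.split()]

def mock_cut(sentence):
    # Plain tokenization: strings only, no POS information.
    return sentence.split()

def tokenize(sentence, allowPOS=()):
    """Mirrors part one of extract_tags: POS-tagged cut when allowPOS
    is given (after freezing it into a frozenset), plain cut otherwise."""
    if allowPOS:
        allowPOS = frozenset(allowPOS)
        words = mock_posseg_cut(sentence)
    else:
        words = mock_cut(sentence)
    return words, allowPOS

words, allow = tokenize("machine learning", allowPOS=("n",))
print(words[0].word, words[0].flag)  # -> machine n
```

The frozenset conversion makes the subsequent `w.flag not in allowPOS` membership tests cheap and guards against the filter being mutated later.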
Part two:
Iterate over the values in words.
The first `if allowPOS:`:
If allowPOS is non-empty, POS filtering is applied, and values whose POS does not match are skipped (continue);
for matching values, check whether the POS tag should be kept; if not, discard it (w = w.word).
wc extracts the word part of w: if allowPOS is non-empty and withFlag=True, then wc = w.word; otherwise wc = w.
The second `if len(wc.strip())...`:
If wc is shorter than two characters (after stripping leading/trailing whitespace) or wc appears in stop_words, it is excluded from the calculation.
`freq[w] = ...`:
Count the frequency of each qualifying word and store it in the freq dict.
The for loop:
Weight the raw frequencies using IDF: each count is divided by the total and multiplied by the word's IDF value (falling back to the median IDF for unseen words).
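Part two boils down to a term-frequency count followed by a TF-IDF reweighting. Here is a minimal sketch over plain string tokens (the `allowPOS` branches removed); the `idf_freq`, `median_idf`, and `stop_words` values are made-up stand-ins for the IDF table and stop-word list jieba ships with.

```python
# Hypothetical stand-ins for jieba's bundled IDF dictionary and stop words.
idf_freq = {"learning": 5.0, "deep": 3.0}
median_idf = 4.0                     # fallback IDF for words not in the table
stop_words = {"the", "a", "of"}

def tfidf_weights(tokens):
    """Mirrors part two of extract_tags: count frequencies, then
    multiply each count by idf / total to get a TF-IDF weight."""
    freq = {}
    for w in tokens:
        # Skip one-character tokens and stop words, as the source does.
        if len(w.strip()) < 2 or w.lower() in stop_words:
            continue
        freq[w] = freq.get(w, 0.0) + 1.0
    total = sum(freq.values())
    for k in freq:
        freq[k] *= idf_freq.get(k, median_idf) / total
    return freq

print(tfidf_weights(["deep", "learning", "deep", "the"]))
# "deep": (2/3) * 3.0 = 2.0; "learning": (1/3) * 5.0 ≈ 1.667; "the" is dropped
```

Note that dividing by `total` turns the raw count into a term frequency, so the final weight is the classic TF × IDF product.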
Part three:
The result is finalized according to withWeight and topK.
If withWeight is True, freq.items() is sorted in descending order of weight;
otherwise only the keys of freq are sorted, in descending order of their weights.
The first topK keywords are returned; if topK is None, all results are returned.
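The sorting and slicing step can be isolated as a small helper. This is a sketch over an ordinary dict of weights; the function name `rank` is my own, not part of jieba.

```python
from operator import itemgetter

def rank(freq, topK=20, withWeight=False):
    """Mirrors part three of extract_tags: sort by weight descending,
    then keep the first topK entries (or everything if topK is falsy)."""
    if withWeight:
        # (word, weight) pairs, sorted by the weight in position 1
        tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
    else:
        # words only, but still ordered by their weights
        tags = sorted(freq, key=freq.__getitem__, reverse=True)
    return tags[:topK] if topK else tags

weights = {"deep": 2.0, "learning": 1.67, "model": 0.5}
print(rank(weights, topK=2))                   # -> ['deep', 'learning']
print(rank(weights, topK=2, withWeight=True))  # -> [('deep', 2.0), ('learning', 1.67)]
```

`freq.__getitem__` as the sort key is a compact way to order the keys by their values without building intermediate tuples.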
Test example:
Code:
import jieba.analyse as analyse
import jieba
jieba.initialize()
boundary = "="*40
with open('lyric.txt', 'rb') as f:
    content = f.read()
print(boundary)
print(1,",topK=5")
tags = analyse.extract_tags(content,topK=5)
print(tags)
print(boundary)
print(2,",topK=15")
tags = analyse.extract_tags(content,topK=15)
print(tags)
print(boundary)
print(3,",topK=5,withWeight = True")
tags = analyse.extract_tags(content,topK=5,withWeight=True)
print(tags)
print(boundary)
print(4,",topK=5,withWeight = True,allowPOS = ('n','v')")
tags = analyse.extract_tags(content,topK=5,withWeight=True,allowPOS=('n','v'))
print(tags)
print(boundary)
print(5,",topK=5,withWeight = True,allowPOS = ('n','v'),withFlag=True")
tags = analyse.extract_tags(content,topK=5,withWeight=True,allowPOS=('n','v'),withFlag=True)
print(tags)
print(boundary)
print(6,",topK=5,withWeight = True,allowPOS = ('n','v'),withFlag=True,set_idf_path()")
analyse.set_idf_path("../extra_dict/idf.txt.big")
tags = analyse.extract_tags(content,topK=5,withWeight=True,allowPOS=('n','v'),withFlag=True)
print(tags)
print(boundary)
print(7,",topK=5,withWeight = True,allowPOS = ('n','v'),withFlag=True,set_stop_words()")
analyse.set_stop_words("../extra_dict/stop_words.txt")
tags = analyse.extract_tags(content,topK=5,withWeight=True,allowPOS=('n','v'),withFlag=True)
print(tags)