jieba analyse.extract_tags

Latest recommended article published 2024-04-26 19:13:06

江_小_白

Tags: python

Article link: https://blog.csdn.net/qq_45193988/article/details/127280731

I never fully understood how jieba extracts keywords until I came across this:

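from operator import itemgetter

# Method of the TFIDF class in jieba/analyse/tfidf.py
def extract_tags(self, sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False):
    # (1) Chinese word segmentation (POS-tagged when allowPOS is given)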
    if allowPOS:
        allowPOS = frozenset(allowPOS)
        words = self.postokenizer.cut(sentence)
    else:
        words = self.tokenizer.cut(sentence)

    # (2) Count term frequency (TF)
    freq = {}
    for w in words:
        if allowPOS:
            if w.flag not in allowPOS:
                continue
            elif not withFlag:
                w = w.word
        wc = w.word if allowPOS and withFlag else w
        if len(wc.strip()) < 2 or wc.lower() in self.stop_words:
            continue
        freq[w] = freq.get(w, 0.0) + 1.0
    total = sum(freq.values())

    # (3) Apply IDF: scale each count by idf / total to get the TF-IDF weight
    for k in freq:
        kw = k.word if allowPOS and withFlag else k
        freq[k] *= self.idf_freq.get(kw, self.median_idf) / total

    # (4) Sort by weight to get the keyword list
    if withWeight:
        tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
    else:
        tags = sorted(freq, key=freq.__getitem__, reverse=True)
    if topK:
        return tags[:topK]
    else:
        return tags
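
To see the arithmetic of steps (2) to (4) on a concrete input, here is a minimal standalone sketch. It is not jieba's code: the input is already split into words, and the IDF table, median fallback, stop-word set and the function name toy_extract_tags are all made-up values for illustration.

```python
from operator import itemgetter

# Hypothetical toy data. jieba segments the text itself and loads
# real IDF values from its bundled idf.txt.
TOY_IDF = {"keyword": 2.0, "extraction": 3.0, "model": 1.0}
TOY_MEDIAN_IDF = 2.0  # fallback for words missing from the table
TOY_STOP_WORDS = {"the", "of"}


def toy_extract_tags(words, top_k=2):
    """Rank words by (count / total) * idf, following steps (2) to (4)."""
    # (2) Count occurrences, skipping stop words and tokens shorter
    #     than two characters.
    freq = {}
    for w in words:
        if len(w.strip()) < 2 or w.lower() in TOY_STOP_WORDS:
            continue
        freq[w] = freq.get(w, 0.0) + 1.0
    total = sum(freq.values())

    # (3) Turn each count into a TF-IDF weight.
    for w in freq:
        freq[w] *= TOY_IDF.get(w, TOY_MEDIAN_IDF) / total

    # (4) Sort by weight, highest first, and keep the top_k entries.
    return sorted(freq.items(), key=itemgetter(1), reverse=True)[:top_k]


words = ["keyword", "extraction", "of", "the", "keyword", "model", "a"]
print(toy_extract_tags(words))
# [('keyword', 1.0), ('extraction', 0.75)]
```

Four words survive the filter, so total is 4. "keyword" occurs twice and scores 2 * 2.0 / 4 = 1.0, while "extraction" occurs once and scores 1 * 3.0 / 4 = 0.75, which is why a rarer word with a high IDF can rank close to a more frequent one.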

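The extract_tags() function takes raw text as input and returns the text's keywords. The code falls into four parts: (1) Chinese word segmentation, (2) computing term frequency (TF), (3) computing IDF, and (4) sorting all the words to obtain the keyword set. The TF and IDF steps deserve the most attention. Part (2) builds a dictionary, freq, that records how many times each word occurs in the text, skipping stop words and tokens shorter than two characters. Part (3) applies IDF: each count is divided by the total count and multiplied by the word's IDF, with self.median_idf as the fallback for words missing from the table. IDF has to be computed from a corpus, and jieba.analyse ships with an idf.txt that records the IDF value of each word in its vocabulary. You can also supply an idf.txt built from your own corpus; see the fxsjy/jieba documentation for details.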
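
The lookup self.idf_freq.get(kw, self.median_idf) depends on how the IDF table and its fallback are built. The sketch below shows one way to do that, assuming the IDF file holds one "word idf_value" pair per line and that the fallback is the middle value of the sorted IDFs. The name load_idf and the sample values are invented for illustration; this is not jieba's loader.

```python
def load_idf(text):
    """Parse 'word idf_value' lines; return (idf table, median fallback)."""
    idf_freq = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        word, value = line.split()
        idf_freq[word] = float(value)
    # Fallback weight for words missing from the table:
    # the middle element of the sorted IDF values.
    median_idf = sorted(idf_freq.values())[len(idf_freq) // 2]
    return idf_freq, median_idf


# Invented sample contents in the assumed idf.txt layout.
SAMPLE = """\
分词 7.5
关键词 9.0
模型 6.0
"""

idf_freq, median_idf = load_idf(SAMPLE)
print(idf_freq.get("模型", median_idf))      # word in the table: 6.0
print(idf_freq.get("未登录词", median_idf))  # unseen word gets the median: 7.5
```

To point jieba at your own file, the jieba README describes jieba.analyse.set_idf_path(file_name).
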
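This lays out in detail how analyse.extract_tags uses the TF-IDF model to extract keywords, so I am noting it down here. The original post is linked for reference (https://zhuanlan.zhihu.com/p/95358646).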