2021SC@SDUSC
Let's learn how this method is used directly from the source code:
def extract_tags(self, sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False):
    """
    Extract keywords from sentence using TF-IDF algorithm.
    Parameter:
        - topK: return how many top keywords. `None` for all possible words.
        - withWeight: if True, return a list of (word, weight);
                      if False, return a list of words.
        - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v', 'nr'].
                    if the POS of w is not in this list, it will be filtered.
        - withFlag: only work with allowPOS is not empty.
                    if True, return a list of pair(word, weight) like posseg.cut
                    if False, return a list of words
    """
    if allowPOS:
        allowPOS = frozenset(allowPOS)
        words = self.postokenizer.cut(sentence)
    else:
        words = self.tokenizer.cut(sentence)
    #========================================================
    freq = {}
    for w in words:
        if allowPOS:
            if w.flag not in allowPOS:
                continue
            elif not withFlag:
                w = w.word
        wc = w.word if allowPOS and withFlag else w
        if len(wc.strip()) < 2 or wc.lower() in self.stop_words:
            continue
        freq[w] = freq.get(w, 0.0) + 1.0
    total = sum(freq.values())
    for k in freq:
        kw = k.word if allowPOS and withFlag else k
        freq[k] *= self.idf_freq.get(kw, self.median_idf) / total
    #=========================================================
    if withWeight:
        tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
    else:
        tags = sorted(freq, key=freq.__getitem__, reverse=True)
    if topK:
        return tags[:topK]
    else:
        return tags
The docstring explains how the method is used.
We can see that the method accepts five parameters:
sentence: the text from which to extract keywords
topK: the number of keywords to return (ordered by descending weight)
withWeight: whether the return value carries weights
allowPOS: a POS (part-of-speech) filter, passed in as a tuple
withFlag: whether the return value includes POS tags
The source is divided into three parts by the "#===========" comments:
Part one:
If allowPOS is non-empty, it is frozen into a frozenset,
and jieba.posseg.cut() is used to tokenize sentence (keeping POS tags);
if allowPOS is empty, jieba.cut() is used to tokenize sentence instead.
The result is stored in words.
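This branch can be sketched in pure Python. The mock tokenizers below are stand-ins for jieba's real `posseg.cut` (which yields pair objects with `.word` and `.flag` attributes) and `cut` (which yields plain strings); splitting on whitespace and tagging everything as 'n' is purely illustrative.

```python
from collections import namedtuple

# Hypothetical stand-in for the pair objects jieba.posseg.cut yields.
Pair = namedtuple("Pair", ["word", "flag"])

def mock_posseg_cut(sentence):
    # Pretend every token is a noun ('n'); real jieba assigns real POS tags.
    return [Pair(w, "n") for w in sentence.split()]

def mock_cut(sentence):
    # Plain tokenization: strings only, no POS information.
    return sentence.split()

def tokenize(sentence, allowPOS=()):
    """Mirrors part one of extract_tags: POS-tagged cut when allowPOS
    is given (after freezing it into a frozenset), plain cut otherwise."""
    if allowPOS:
        allowPOS = frozenset(allowPOS)
        words = mock_posseg_cut(sentence)
    else:
        words = mock_cut(sentence)
    return words, allowPOS

words, allow = tokenize("machine learning", allowPOS=("n",))
print(words[0].word, words[0].flag)  # -> machine n
```

The frozenset conversion makes the subsequent `w.flag not in allowPOS` membership tests cheap and guards against the filter being mutated later.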
Part two:
Iterate over the values in words.
The first `if allowPOS:`:
If allowPOS is non-empty, POS filtering is applied, and values whose POS does not match are skipped (continue);
for matching values, check whether the POS tag should be kept; if not, discard it (w = w.word).
wc extracts the word part of w: if allowPOS is non-empty and withFlag=True, then wc = w.word; otherwise wc = w.
The second `if len(wc.strip())...`:
If wc is shorter than two characters (after stripping leading/trailing whitespace) or wc appears in stop_words, it is excluded from the calculation.
`freq[w] = ...`:
Count the frequency of each qualifying word and store it in the freq dict.
The for loop:
Weight the raw frequencies using IDF: each count is divided by the total and multiplied by the word's IDF value (falling back to the median IDF for unseen words).
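Part two boils down to a term-frequency count followed by a TF-IDF reweighting. Here is a minimal sketch over plain string tokens (the `allowPOS` branches removed); the `idf_freq`, `median_idf`, and `stop_words` values are made-up stand-ins for the IDF table and stop-word list jieba ships with.

```python
# Hypothetical stand-ins for jieba's bundled IDF dictionary and stop words.
idf_freq = {"learning": 5.0, "deep": 3.0}
median_idf = 4.0                     # fallback IDF for words not in the table
stop_words = {"the", "a", "of"}

def tfidf_weights(tokens):
    """Mirrors part two of extract_tags: count frequencies, then
    multiply each count by idf / total to get a TF-IDF weight."""
    freq = {}
    for w in tokens:
        # Skip one-character tokens and stop words, as the source does.
        if len(w.strip()) < 2 or w.lower() in stop_words:
            continue
        freq[w] = freq.get(w, 0.0) + 1.0
    total = sum(freq.values())
    for k in freq:
        freq[k] *= idf_freq.get(k, median_idf) / total
    return freq

print(tfidf_weights(["deep", "learning", "deep", "the"]))
# "deep": (2/3) * 3.0 = 2.0; "learning": (1/3) * 5.0 ≈ 1.667; "the" is dropped
```

Note that dividing by `total` turns the raw count into a term frequency, so the final weight is the classic TF × IDF product.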
Part three:
The result is finalized according to withWeight and topK.
If withWeight is True, freq.items() is sorted in descending order of weight;
otherwise only the keys of freq are sorted, in descending order of their weights.
The first topK keywords are returned; if topK is None, all results are returned.
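The sorting and slicing step can be isolated as a small helper. This is a sketch over an ordinary dict of weights; the function name `rank` is my own, not part of jieba.

```python
from operator import itemgetter

def rank(freq, topK=20, withWeight=False):
    """Mirrors part three of extract_tags: sort by weight descending,
    then keep the first topK entries (or everything if topK is falsy)."""
    if withWeight:
        # (word, weight) pairs, sorted by the weight in position 1
        tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
    else:
        # words only, but still ordered by their weights
        tags = sorted(freq, key=freq.__getitem__, reverse=True)
    return tags[:topK] if topK else tags

weights = {"deep": 2.0, "learning": 1.67, "model": 0.5}
print(rank(weights, topK=2))                   # -> ['deep', 'learning']
print(rank(weights, topK=2, withWeight=True))  # -> [('deep', 2.0), ('learning', 1.67)]
```

`freq.__getitem__` as the sort key is a compact way to order the keys by their values without building intermediate tuples.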
Test example:
Code:
import jieba.analyse as analyse
import jieba
jieba.initialize()
boundary = "="*40
with open('lyric.txt', 'rb') as f:
    content = f.read()
print(boundary)
print(1,",topK=5")
tags = analyse.extract_tags(content,topK=5)
print(tags)
print(boundary)
print(2,",topK=15")
tags = analyse.extract_tags(content,topK=15)
print(tags)
print(boundary)
print(3,",topK=5,withWeight = True")
tags = analyse.extract_tags(content,topK=5,withWeight=True)
print(tags)
print(boundary)
print(4,",topK=5,withWeight = True,allowPOS = ('n','v')")
tags = analyse.extract_tags(content,topK=5,withWeight=True,allowPOS=('n','v'))
print(tags)
print(boundary)
print(5,",topK=5,withWeight = True,allowPOS = ('n','v'),withFlag=True")
tags = analyse.extract_tags(content,topK=5,withWeight=True,allowPOS=('n','v'),withFlag=True)
print(tags)
print(boundary)
print(6,",topK=5,withWeight = True,allowPOS = ('n','v'),withFlag=True,set_idf_path()")
analyse.set_idf_path("../extra_dict/idf.txt.big")
tags = analyse.extract_tags(content,topK=5,withWeight=True,allowPOS=('n','v'),withFlag=True)
print(tags)
print(boundary)
print(7,",topK=5,withWeight = True,allowPOS = ('n','v'),withFlag=True,set_stop_words()")
analyse.set_stop_words("../extra_dict/stop_words.txt")
tags = analyse.extract_tags(content,topK=5,withWeight=True,allowPOS=('n','v'),withFlag=True)
print(tags)