jieba Word Segmentation: The TF-IDF Algorithm (Part 2)

Claire_Mk

Article link: https://blog.csdn.net/Claire_Mk/article/details/120816706

2021SC@SDUSC
Keyword extractor

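import os
from operator import itemgetter

import jieba
import jieba.posseg

# _get_abs_path and DEFAULT_IDF are defined at module level in the jieba
# source and are omitted from this excerpt.


class KeywordExtractor(object):

    # Default English stop words; subclasses copy this set into self.stop_words.
    STOP_WORDS = set((
        "the", "of", "is", "and", "to", "in", "that", "we", "for", "an", "are",
        "by", "be", "as", "on", "with", "can", "if", "from", "which", "you", "it",
        "this", "then", "at", "have", "all", "not", "one", "has", "or", "that"
    ))

    def set_stop_words(self, stop_words_path):
        # Add every line of a user-supplied UTF-8 file to the stop-word set.
        abs_path = _get_abs_path(stop_words_path)
        if not os.path.isfile(abs_path):
            raise Exception("jieba: file does not exist: " + abs_path)
        with open(abs_path, 'rb') as f:
            content = f.read().decode('utf-8')
        for line in content.splitlines():
            self.stop_words.add(line)

    def extract_tags(self, *args, **kwargs):
        # Abstract method: concrete extractors such as TFIDF override it.
        raise NotImplementedError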

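STOP_WORDS holds the default stop words, such as "the", "of", "is", "and" and "to". These are common English function words that carry little meaning on their own, so they contribute nothing to later analysis of the tokens and are dropped before keywords are scored. The set_stop_words method extends this set: it reads a user-supplied file such as stopwords.txt and adds each line as a stop word.
There are two ways to apply stop words. The first leaves segmentation unchanged and filters the tokens afterwards with a set difference. The second is to call extract_tags, which extracts feature words with the TF-IDF algorithm and removes stop words before scoring; the stop-word dictionary can be specified manually.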
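The first approach, filtering tokens after segmentation, can be sketched without jieba. This is a minimal illustration under stated assumptions: the helper names `load_stop_words` and `remove_stop_words`, the temporary stop-word file, and the token list are all hypothetical and not part of jieba. The length and lowercase checks follow the filter used in extract_tags.

```python
import os
import tempfile

# A small subset of the default English stop words, for illustration only.
DEFAULT_STOP_WORDS = {"the", "of", "is", "and", "to", "in"}


def load_stop_words(path, base=DEFAULT_STOP_WORDS):
    """Return a new set: the base stop words plus one word per line of `path`."""
    stop_words = set(base)
    with open(path, "rb") as f:
        content = f.read().decode("utf-8")
    for line in content.splitlines():
        word = line.strip()
        if word:  # skip blank lines
            stop_words.add(word)
    return stop_words


def remove_stop_words(tokens, stop_words):
    """Keep tokens that are at least 2 characters long and not stop words."""
    return [t for t in tokens
            if len(t.strip()) >= 2 and t.lower() not in stop_words]


# Hypothetical stop-word file, written to a temporary location for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("的\n我们\n\n")
    demo_path = tmp.name

stop_words = load_stop_words(demo_path)
os.remove(demo_path)

# Hypothetical output of a segmenter.
tokens = ["我们", "的", "The", "算法", "提取", "关键词", "is", "a"]
filtered = remove_stop_words(tokens, stop_words)
```

Unlike set_stop_words, this sketch strips each line and skips blank ones, which is a design choice of the example rather than the library's behavior.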

class IDFLoader(object):

    def __init__(self, idf_path=None):
        self.path = ""
        self.idf_freq = {}
        self.median_idf = 0.0
        if idf_path:
            self.set_new_path(idf_path)

    def set_new_path(self, new_idf_path):
        # Reload only when the path actually changes.
        if self.path != new_idf_path:
            self.path = new_idf_path
            with open(new_idf_path, 'rb') as f:
                content = f.read().decode('utf-8')
            self.idf_freq = {}
            # Each line has the form "word idf_value".
            for line in content.splitlines():
                word, freq = line.strip().split(' ')
                self.idf_freq[word] = float(freq)
            # Median IDF, used later for words missing from the table.
            self.median_idf = sorted(
                self.idf_freq.values())[len(self.idf_freq) // 2]

    def get_idf(self):
        return self.idf_freq, self.median_idf

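Term frequency (TF) is the number of times a given word appears in a document. The count is usually normalized (typically divided by the total number of words in the document) so that it does not favor long documents. The idea behind inverse document frequency (IDF) is that the fewer documents contain a term t, the larger its IDF, and the better the term distinguishes between categories. The IDF of a word is obtained by dividing the total number of documents by the number of documents that contain the word, then taking the logarithm of the quotient.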
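The two definitions can be checked on a toy example. The sketch below is illustrative only: the corpus and the function names `term_frequency` and `inverse_document_frequency` are made up, and the code computes IDF directly from a corpus, whereas jieba reads precomputed IDF values from a file.

```python
import math


def term_frequency(term, document):
    """Normalized TF: occurrences of `term` divided by the document length."""
    return document.count(term) / len(document)


def inverse_document_frequency(term, corpus):
    """IDF: log(total documents / documents containing `term`)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        raise ValueError("term does not occur in the corpus")
    return math.log(len(corpus) / containing)


# Hypothetical toy corpus of pre-tokenized documents.
corpus = [
    ["jieba", "分词", "算法"],
    ["关键词", "提取", "算法"],
    ["分词", "算法", "算法", "实现"],
]

doc = corpus[2]
tf_rare = term_frequency("实现", doc)                    # 1 / 4
idf_rare = inverse_document_frequency("实现", corpus)    # log(3 / 1)
idf_common = inverse_document_frequency("算法", corpus)  # log(3 / 3) = 0
tfidf_rare = tf_rare * idf_rare
tfidf_common = term_frequency("算法", doc) * idf_common
```

A word that appears in every document gets an IDF of zero, so it scores zero however often it occurs, while a word found in a single document scores higher even with a lower count.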
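IDFLoader parses idf.txt to obtain each word and its IDF value, and builds a dictionary with key = word and value = idf. It also records the median IDF, which TFIDF uses for words that are not in the dictionary.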
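The parsing step and the median fallback can be sketched on an in-memory string. The function name `parse_idf` and the IDF values are hypothetical, chosen only for illustration; the "word value" line format and the `len // 2` median convention follow IDFLoader.

```python
def parse_idf(content):
    """Parse 'word idf' lines into a dict and return it with the median IDF."""
    idf_freq = {}
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        word, value = line.split(" ")
        idf_freq[word] = float(value)
    # Same convention as IDFLoader: the element at index len // 2 of the
    # sorted values (the upper median when the count is even).
    median_idf = sorted(idf_freq.values())[len(idf_freq) // 2]
    return idf_freq, median_idf


# Hypothetical IDF values, made up for illustration only.
sample = "算法 5.0\n分词 9.0\n关键词 7.0\n提取 6.0\n"
idf_freq, median_idf = parse_idf(sample)

known = idf_freq.get("分词", median_idf)        # found in the table
unknown = idf_freq.get("未登录词", median_idf)  # falls back to the median
```

Using the median as the default gives unseen words a middling weight: they are neither treated as extremely rare nor discarded.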
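extract_tags:
Extracts keywords from a sentence using the TF-IDF algorithm. topK is the number of top keywords to return; None returns all possible words. withWeight: if True, a list of (word, weight) pairs is returned; if False, a list of words is returned. allowPOS is the list of allowed parts of speech, e.g. ['ns', 'n', 'vn', 'v', 'nr']; words whose part of speech is not in the list are filtered out.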


class TFIDF(KeywordExtractor):

    def __init__(self, idf_path=None):
        self.tokenizer = jieba.dt
        self.postokenizer = jieba.posseg.dt
        self.stop_words = self.STOP_WORDS.copy()
        self.idf_loader = IDFLoader(idf_path or DEFAULT_IDF)
        self.idf_freq, self.median_idf = self.idf_loader.get_idf()

    def set_idf_path(self, idf_path):
        new_abs_path = _get_abs_path(idf_path)
        if not os.path.isfile(new_abs_path):
            raise Exception("jieba: file does not exist: " + new_abs_path)
        self.idf_loader.set_new_path(new_abs_path)
        self.idf_freq, self.median_idf = self.idf_loader.get_idf()

    def extract_tags(self, sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False):
        """
        Extract keywords from sentence using TF-IDF algorithm.
        Parameter:
            - topK: return how many top keywords. `None` for all possible words.
            - withWeight: if True, return a list of (word, weight);
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v','nr'].
                        if the POS of w is not in this list,it will be filtered.
            - withFlag: only work with allowPOS is not empty.
                        if True, return a list of pair(word, weight) like posseg.cut
                        if False, return a list of words
        """
        if allowPOS:
            allowPOS = frozenset(allowPOS)
            words = self.postokenizer.cut(sentence)
        else:
            words = self.tokenizer.cut(sentence)
        freq = {}
        for w in words:
            if allowPOS:
                # Drop words whose part of speech is not allowed.
                if w.flag not in allowPOS:
                    continue
                elif not withFlag:
                    w = w.word
            wc = w.word if allowPOS and withFlag else w
            # Skip tokens shorter than 2 characters and stop words.
            if len(wc.strip()) < 2 or wc.lower() in self.stop_words:
                continue
            freq[w] = freq.get(w, 0.0) + 1.0
        total = sum(freq.values())
        for k in freq:
            kw = k.word if allowPOS and withFlag else k
            # TF-IDF = count / total * idf; unknown words use the median IDF.
            freq[k] *= self.idf_freq.get(kw, self.median_idf) / total

        if withWeight:
            tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
        else:
            tags = sorted(freq, key=freq.__getitem__, reverse=True)
        if topK:
            return tags[:topK]
        else:
            return tags
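To see the scoring loop in isolation, here is a simplified stand-alone sketch that takes already-tokenized words, so it runs without jieba. It is not the library's API: the function name `score_keywords`, the tokens, and the IDF values are hypothetical, and the part-of-speech handling (allowPOS, withFlag) is left out.

```python
from operator import itemgetter


def score_keywords(words, idf_freq, median_idf, stop_words=frozenset(),
                   topK=20, withWeight=False):
    """Rank pre-tokenized `words` by TF-IDF, following the loop above."""
    freq = {}
    for w in words:
        # Same filter as extract_tags: short tokens and stop words are skipped.
        if len(w.strip()) < 2 or w.lower() in stop_words:
            continue
        freq[w] = freq.get(w, 0.0) + 1.0
    total = sum(freq.values())
    for k in freq:
        # Words missing from the IDF table fall back to the median IDF.
        freq[k] *= idf_freq.get(k, median_idf) / total
    if withWeight:
        tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
    else:
        tags = sorted(freq, key=freq.__getitem__, reverse=True)
    return tags[:topK] if topK else tags


# Hypothetical tokens and IDF values, made up for illustration only.
words = ["算法", "分词", "算法", "的", "the", "关键词", "算法"]
idf_freq = {"算法": 2.0, "分词": 8.0}
median_idf = 5.0

ranked = score_keywords(words, idf_freq, median_idf,
                        stop_words={"the"}, withWeight=True)
top_one = score_keywords(words, idf_freq, median_idf,
                         stop_words={"the"}, topK=1)
```

With these made-up values the most frequent token does not rank first: its low IDF outweighs its count, which is the behavior TF-IDF is designed to produce.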
