The jieba library: the Tokenizer() class in detail (3): adding and deleting dictionary words

拉克丝の碎花裙

Article link: https://blog.csdn.net/qq_51945755/article/details/121237775

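This article walks through how the jieba word segmentation library adds and deletes words, covering the internal logic of `add_word` and `del_word`. When a word is added, its frequency is determined from the arguments and the frequency dictionary is updated; a part-of-speech tag can optionally be recorded as well. Deleting a word actually just sets its frequency to 0. The article is aimed at readers working in natural language processing and Python development.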

2021SC@SDUSC

Source code:

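    def add_word(self, word, freq=None, tag=None):
        """
        Add a word to dictionary.

        freq and tag can be omitted, freq defaults to be a calculated value
        that ensures the word can be cut out.
        """
        # Make sure the dictionary has been loaded
        self.check_initialized()
        # Decode the word to a text (unicode) string
        word = strdecode(word)
        # If freq is None, use the value returned by suggest_freq(); otherwise use freq itself
        freq = int(freq) if freq is not None else self.suggest_freq(word, False)
        # Store the word in the frequency dictionary
        self.FREQ[word] = freq
        self.total += freq
        # Record the part-of-speech tag, if one was given
        if tag:
            self.user_word_tag_tab[word] = tag
        # Add every prefix of word that is not yet in the dictionary, with frequency 0
        # (xrange comes from jieba's compatibility module; it is range on Python 3)
        for ch in xrange(len(word)):
            wfrag = word[:ch + 1]
            if wfrag not in self.FREQ:
                self.FREQ[wfrag] = 0
        # A frequency of 0 means the word is being deleted: register it as a force-split word
        if freq == 0:
            finalseg.add_force_split(word)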

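As before, the first step is to check whether jieba has been initialized, because the dictionary is only loaded during initialization; if it has not happened yet, `check_initialized()` triggers it.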
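
The check-then-load pattern described above can be sketched in isolation. This is a minimal illustration, not jieba's code: the class `LazyDictionary` and its two-entry dictionary are hypothetical, and the real library reads its dictionary from a file.

```python
class LazyDictionary:
    """Toy stand-in for a tokenizer that loads its dictionary on first use."""

    def __init__(self):
        self.initialized = False
        self.FREQ = {}
        self.total = 0

    def initialize(self):
        # Hypothetical tiny dictionary; the real library reads a dictionary file.
        self.FREQ = {"北京": 5, "大学": 3}
        self.total = sum(self.FREQ.values())
        self.initialized = True

    def check_initialized(self):
        # Load the dictionary only if it has not been loaded yet.
        if not self.initialized:
            self.initialize()

    def add_word(self, word, freq):
        # Without this call the new word would be added to an empty dictionary.
        self.check_initialized()
        self.FREQ[word] = freq
        self.total += freq


d = LazyDictionary()
d.add_word("北京大学", 4)
print(d.initialized, d.total)
```

Because the check runs at the start of `add_word`, the caller never has to initialize anything explicitly before adding a word.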

Source code of strdecode(sentence):

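def strdecode(sentence):
    # text_type comes from jieba's compatibility module (str on Python 3)
    if not isinstance(sentence, text_type):
        try:
            # Byte strings are decoded as UTF-8 first
            sentence = sentence.decode('utf-8')
        except UnicodeDecodeError:
            # Fall back to GBK, dropping bytes that cannot be decoded
            sentence = sentence.decode('gbk', 'ignore')
    return sentence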

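If sentence is not already a text string, it is decoded as 'utf-8'; if that fails, it is decoded as 'gbk' instead, ignoring any bytes that cannot be decoded.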
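
To see the fallback in action, here is a small Python 3 sketch of the same idea. The function name `decode_text` is made up for this example, and `str` is used in place of `text_type`.

```python
def decode_text(data):
    """Return text: str passes through; bytes are tried as UTF-8, then GBK."""
    if not isinstance(data, str):
        try:
            data = data.decode("utf-8")
        except UnicodeDecodeError:
            # GBK fallback; undecodable bytes are silently dropped
            data = data.decode("gbk", "ignore")
    return data


utf8_bytes = "结巴分词".encode("utf-8")
gbk_bytes = "结巴分词".encode("gbk")

print(decode_text("结巴分词"))   # already text, returned unchanged
print(decode_text(utf8_bytes))   # decoded on the first attempt
print(decode_text(gbk_bytes))    # UTF-8 fails, GBK fallback is used
```

All three calls should yield the same text, which is why `add_word` can accept either text or encoded bytes.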

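If freq is None, `add_word` calls suggest_freq(word, False) to obtain a frequency at which the word can be cut out, and then stores the word in the FREQ dictionary with that frequency. If freq is given, it is converted with int() and used directly.

If tag is None (or any other falsy value, such as an empty string), the word's part of speech is not added to the self.user_word_tag_tab dictionary.

In other words, if you want to add a word and have it recognized, the word's frequency can be left out of the custom dictionary entirely.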
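
The bookkeeping described above can be reproduced with plain dictionaries. The sketch below is a toy, not jieba itself: `ToyDictionary` is a hypothetical class, and its `suggest_freq` is a placeholder that simply returns at least 1, which is an assumption made for illustration and not jieba's actual formula.

```python
class ToyDictionary:
    """Toy model of the frequency and tag tables updated by add_word."""

    def __init__(self, freq):
        self.FREQ = dict(freq)
        self.total = sum(self.FREQ.values())
        self.user_word_tag_tab = {}

    def suggest_freq(self, word):
        # Placeholder only (assumption): jieba computes this value differently.
        return max(self.FREQ.get(word, 0), 1)

    def add_word(self, word, freq=None, tag=None):
        # An explicit freq wins; otherwise fall back to the suggested value.
        freq = int(freq) if freq is not None else self.suggest_freq(word)
        self.FREQ[word] = freq
        self.total += freq
        # The tag is stored only when it is truthy.
        if tag:
            self.user_word_tag_tab[word] = tag
        # Every missing prefix of the word is registered with frequency 0.
        for end in range(1, len(word) + 1):
            prefix = word[:end]
            if prefix not in self.FREQ:
                self.FREQ[prefix] = 0


toy = ToyDictionary({"云": 10, "计算": 20})
toy.add_word("云计算", freq=5, tag="n")   # explicit frequency and tag
toy.add_word("区块链")                    # frequency and tag both omitted
print(toy.FREQ)
print(toy.user_word_tag_tab)
```

In this toy run the prefixes "云计", "区" and "区块" end up in FREQ with frequency 0, and only "云计算" gets an entry in the tag table.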

Deleting a word:

Source code:

    def del_word(self, word):
        """
        Convenient function for deleting a word.
        """
        # Set the frequency to 0; add_word then calls finalseg.add_force_split(word)
        self.add_word(word, 0)
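
Since deletion is just an `add_word` call with frequency 0, its effect can be traced with a small toy model. Everything here is illustrative: `ToyDictionary` is hypothetical, and `force_split_words` is an assumed stand-in for whatever `finalseg.add_force_split` records internally.

```python
class ToyDictionary:
    """Toy model showing what 'deleting' a word does to the tables."""

    def __init__(self, freq):
        self.FREQ = dict(freq)
        self.total = sum(self.FREQ.values())
        # Assumed stand-in for the state kept by finalseg.add_force_split
        self.force_split_words = set()

    def add_word(self, word, freq):
        self.FREQ[word] = freq
        self.total += freq
        if freq == 0:
            self.force_split_words.add(word)

    def del_word(self, word):
        # Deleting is adding with frequency 0.
        self.add_word(word, 0)


toy = ToyDictionary({"自定义": 7, "词": 3})
toy.del_word("自定义")
print("自定义" in toy.FREQ, toy.FREQ["自定义"], toy.total)
```

Two details follow from the source shown above: the key stays in FREQ with frequency 0 rather than being removed, and `total` is not reduced by the word's old frequency, because only the new value (0) is added to it.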