The jieba library: the Tokenizer() class in detail, part 5: tokenize

拉克丝の碎花裙

Article link: https://blog.csdn.net/qq_51945755/article/details/120993352

2021SC@SDUSC

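The tests in the official documentation already make the usage clear, so I won't repeat them here. Let's analyze the source code instead~

Source code: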

    def tokenize(self, unicode_sentence, mode="default", HMM=True):
        """
        Tokenize a sentence and yields tuples of (word, start, end)

        Parameter:
            - sentence: the str(unicode) to be segmented.
            - mode: "default" or "search", "search" is for finer segmentation.
            - HMM: whether to use the Hidden Markov Model.
        """
        if not isinstance(unicode_sentence, text_type):
            raise ValueError("jieba: the input parameter should be unicode.")
        start = 0
        if mode == 'default':
            for w in self.cut(unicode_sentence, HMM=HMM):
                width = len(w)
                yield (w, start, start + width)
                start += width
        else:
            for w in self.cut(unicode_sentence, HMM=HMM):
                width = len(w)
                if len(w) > 2:
                    for i in xrange(len(w) - 1):
                        gram2 = w[i:i + 2]
                        if self.FREQ.get(gram2):
                            yield (gram2, start + i, start + i + 2)
                if len(w) > 3:
                    for i in xrange(len(w) - 2):
                        gram3 = w[i:i + 3]
                        if self.FREQ.get(gram3):
                            yield (gram3, start + i, start + i + 3)
                yield (w, start, start + width)
                start += width

As you can see, the method takes three parameters, unicode_sentence, mode and HMM, and the last two have default values (mode="default", HMM=True).

The first part, the if statement, checks whether the argument passed as unicode_sentence is a unicode str (text_type). If it is not, a ValueError is raised.
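To make the check concrete, here is a minimal sketch of the same guard outside of jieba. It assumes Python 3, where jieba's text_type alias is simply str. The helper names require_unicode and is_accepted are made up for illustration and are not part of jieba.

```python
# Assumption: Python 3, where jieba's text_type alias is just str.
text_type = str


def require_unicode(unicode_sentence):
    """Hypothetical helper mirroring the guard at the top of tokenize()."""
    if not isinstance(unicode_sentence, text_type):
        raise ValueError("jieba: the input parameter should be unicode.")
    return unicode_sentence


def is_accepted(value):
    """Return True if the guard lets the value through, False otherwise."""
    try:
        require_unicode(value)
        return True
    except ValueError:
        return False
```

A str passes, while bytes or any other type is rejected. One thing worth keeping in mind: tokenize is a generator function, so in the real method the ValueError only surfaces once you start iterating over the result, not at the moment of the call.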

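The second part does the actual segmentation (start records the starting position of the current word) and uses an if/else statement to choose between the two modes, default and search.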

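If mode == 'default', default mode is used: the sentence is cut in accurate mode, the result is iterated over, and each word is packed into a tuple together with its start and end position in the sentence and yielded to the caller.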
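The offset bookkeeping of the default branch can be sketched on its own. This is only an illustration: the function tokenize_default is a hypothetical stand-in, and the word list is hand-written in place of what self.cut(...) would return.

```python
def tokenize_default(words):
    """Yield (word, start, end) for an already-segmented list of words."""
    start = 0
    for w in words:
        width = len(w)
        # end is exclusive, just like a Python slice bound
        yield (w, start, start + width)
        start += width


# Hypothetical segmentation result standing in for self.cut(...)
sentence = "我来到北京"
tokens = list(tokenize_default(["我", "来到", "北京"]))
```

Because end is exclusive, sentence[start:end] gives back the word itself, as long as the words concatenate back to the original sentence.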

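If mode != 'default' (any other value lands in the else branch), search mode is used: the sentence is again cut in accurate mode and the result is iterated over. For every word longer than 2 characters, all of its 2-character slices are checked against the dictionary (self.FREQ), and for every word longer than 3 characters, all of its 3-character slices are checked as well. Each slice that is a dictionary word is yielded as a tuple together with its position, and finally the word itself is yielded with its own position.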
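Here is a small sketch of the search branch that runs without jieba. Everything in it is an assumption made for illustration: tokenize_search is a hypothetical function, the word list replaces the output of self.cut(...), and toy_freq is a tiny hand-made dictionary playing the role of self.FREQ.

```python
def tokenize_search(words, freq):
    """Yield (token, start, end), adding in-dictionary 2-grams and 3-grams."""
    start = 0
    for w in words:
        width = len(w)
        if width > 2:
            for i in range(width - 1):
                gram2 = w[i:i + 2]
                if freq.get(gram2):  # missing or zero frequency is skipped
                    yield (gram2, start + i, start + i + 2)
        if width > 3:
            for i in range(width - 2):
                gram3 = w[i:i + 3]
                if freq.get(gram3):
                    yield (gram3, start + i, start + i + 3)
        yield (w, start, start + width)  # the full word always comes last
        start += width


# Toy dictionary; "中国科" has frequency 0, so it is treated as "not a word".
toy_freq = {"中国": 3, "科学": 5, "学院": 2, "科学院": 4, "中国科": 0, "中国科学院": 1}
search_tokens = list(tokenize_search(["我", "来自", "中国科学院"], toy_freq))
```

With this toy input, "我" and "来自" are too short to be split further, while "中国科学院" first yields 中国, 科学 and 学院, then 科学院, and then the full word, each with its own offsets. Note that the check is a truthiness test on freq.get(...), not a membership test, so an entry with frequency 0 is not yielded.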

Doesn't the search-mode source look familiar? Right, it is the twin brother of cut_for_search().

For details, see this post

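Practically identical, right~ The only difference is that tokenize also carries the start and end positions along.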
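The twin relationship can be checked with a small self-contained sketch. Both functions below are simplified stand-ins written for this post (they take a pre-segmented word list and a toy dictionary instead of calling self.cut and self.FREQ), so treat them as an illustration of the shared logic rather than as jieba's own code.

```python
def cut_for_search_sketch(words, freq):
    """Stand-in for the cut_for_search logic: yields words only."""
    for w in words:
        if len(w) > 2:
            for i in range(len(w) - 1):
                if freq.get(w[i:i + 2]):
                    yield w[i:i + 2]
        if len(w) > 3:
            for i in range(len(w) - 2):
                if freq.get(w[i:i + 3]):
                    yield w[i:i + 3]
        yield w


def tokenize_search_sketch(words, freq):
    """Stand-in for tokenize in search mode: same loops, plus offsets."""
    start = 0
    for w in words:
        if len(w) > 2:
            for i in range(len(w) - 1):
                if freq.get(w[i:i + 2]):
                    yield (w[i:i + 2], start + i, start + i + 2)
        if len(w) > 3:
            for i in range(len(w) - 2):
                if freq.get(w[i:i + 3]):
                    yield (w[i:i + 3], start + i, start + i + 3)
        yield (w, start, start + len(w))
        start += len(w)


# Hypothetical inputs: a hand-written segmentation and a toy dictionary.
demo_words = ["我", "来自", "中国科学院"]
demo_freq = {"中国": 3, "科学": 5, "学院": 2, "科学院": 4}
words_only = [w for w, _, _ in tokenize_search_sketch(demo_words, demo_freq)]
same_sequence = words_only == list(cut_for_search_sketch(demo_words, demo_freq))
```

Dropping the offsets from the tokenize-style output gives exactly the sequence the cut_for_search-style function produces, which is the sense in which the two are twins.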