jieba Source Code Analysis: The Four Segmentation Modes (Part 5)

叮叮咚咚乐呵呵

Tags: python nlp

Article link: https://blog.csdn.net/qq_47229425/article/details/122182510

2021SC@SDUSC
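Having analyzed the other functions in the Tokenizer class that segmentation depends on, we now turn to the code of the cut function, which all four segmentation modes call directly.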
Accurate mode is the default. Passing cut_all=True selects full mode instead, and passing use_paddle=True selects paddle mode.
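Before reading cut itself, it may help to see what the mode-specific regular expressions do in isolation. The following is a minimal sketch using only the standard library. The two patterns are copied from the comments in the cut_all branch of the listing; the helper name split_blocks is hypothetical and is not part of jieba.

```python
import re

# Patterns copied from the comments in the cut_all branch of cut().
re_han_cut_all = re.compile("([\u4E00-\u9FD5]+)", re.U)
re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U)


def split_blocks(sentence):
    """Hypothetical helper: split a sentence into Han and non-Han blocks."""
    # Because the pattern has a capturing group, re.split keeps the
    # matched Han runs in the result instead of discarding them.
    return [blk for blk in re_han_cut_all.split(sentence) if blk]


# Runs of Chinese characters become separate blocks.
print(split_blocks("我爱Python编程"))            # ['我爱', 'Python', '编程']

# Inside a non-Han block, the skip pattern splits on any character that is
# not a letter, digit, '+', '#', or newline (here: the spaces).
print(re_skip_cut_all.split("C++ and C#"))      # ['C++', 'and', 'C#']
```

Keeping '+' and '#' out of the skip pattern is what lets tokens such as C++ and C# survive as single pieces in this sketch.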

# Main function of jieba segmentation; the return value is an iterable generator
    def cut(self, sentence, cut_all=False, HMM=True):
        '''
        The main function that segments an entire sentence that contains
        Chinese characters into separated words.
        Parameter:
            - sentence: The str(unicode) to be segmented.
            - cut_all: Model type. True for full pattern, False for accurate pattern.
            - HMM: Whether to use the Hidden Markov Model.
        '''
        # Decode the input to unicode
        sentence = strdecode(sentence)
        # Choose the regular expressions for the selected mode
        if cut_all:
            re_han = re_han_cut_all  #re.compile("([\u4E00-\u9FD5]+)", re.U)
            re_skip = re_skip_cut_all #re.compile("[^a-zA-Z0-9+#\n]", re.U)
        else:
            re_han = re_han_default