An Analysis of jieba's Segmentation Functions

Latest recommended article published on 2024-11-13 17:43:27

Claire_Mk

Tags: python, natural language processing, machine learning

Article link: https://blog.csdn.net/Claire_Mk/article/details/121462812

2021SC@SDUSC The main segmentation functions jieba provides are as follows:

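jieba.cut: this method takes three main input parameters: the string to segment; cut_all, which controls whether full mode is used; and HMM, which controls whether the HMM model is used. (The signature in the source listing further down also accepts use_paddle.)
'''
Main jieba segmentation function; returns a generator.
Parameters:
- sentence: the text to segment.
- cut_all: segmentation mode. True for full mode, False for accurate mode.
- HMM: whether to use the Hidden Markov Model.
'''
As this shows, jieba.cut returns an iterable generator, so a for loop can be used to get each segmented word in turn (alternatively, jieba.lcut returns the result directly as a list).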

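cut_all=True (the value of HMM is ignored) selects full mode: every word that appears in the dictionary is emitted. The implementing function is __cut_all.
cut_all=False, HMM=False selects accurate mode without HMM: the segmentation with the highest joint probability under a unigram language model is chosen. The implementing function is __cut_DAG_NO_HMM.
cut_all=False, HMM=True selects accurate mode with HMM: starting from the highest-probability segmentation, an HMM is used to recognize out-of-vocabulary words. The implementing function is __cut_DAG.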
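
To make the difference between full mode and accurate mode without HMM more concrete, here is a minimal, self-contained sketch. It does not call jieba: the toy table `FREQ` and the helper names `build_dag`, `cut_full`, `best_route`, and `cut_exact` are made up for illustration, and jieba's real functions also handle runs of English letters and digits and use a far larger prefix dictionary.

```python
import math

# Hypothetical toy word-frequency table (an assumption for this sketch).
FREQ = {
    "北京": 5, "大学": 5, "北京大学": 3, "大学生": 2,
    "北": 1, "京": 1, "大": 1, "学": 1, "生": 1,
}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    n = len(sentence)
    dag = {}
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # unknown character: fall back to itself
    return dag


def cut_full(sentence):
    """Full mode: emit every multi-character dictionary word found in the DAG."""
    dag = build_dag(sentence)
    words, last_end = [], -1
    for start in range(len(sentence)):
        ends = dag[start]
        if len(ends) == 1 and start > last_end:
            # A position not covered by any longer word is emitted on its own.
            words.append(sentence[start:ends[0] + 1])
            last_end = ends[0]
        else:
            for end in ends:
                if end > start:
                    words.append(sentence[start:end + 1])
                    last_end = end
    return words


def best_route(sentence, dag):
    """Right-to-left dynamic programming over unigram log probabilities."""
    n = len(sentence)
    log_total = math.log(TOTAL)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1]) or 1) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    return route


def cut_exact(sentence):
    """Accurate mode without HMM: follow the highest-probability path."""
    dag = build_dag(sentence)
    route = best_route(sentence, dag)
    words, x = [], 0
    while x < len(sentence):
        y = route[x][1] + 1
        words.append(sentence[x:y])
        x = y
    return words


print(cut_full("北京大学生"))   # ['北京', '北京大学', '大学', '大学生']
print(cut_exact("北京大学生"))  # ['北京', '大学生']
```

With this toy table, full mode lists overlapping words, while accurate mode commits to a single non-overlapping path.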


    def cut(self, sentence, cut_all=False, HMM=True, use_paddle=False):
        """
        The main function that segments an entire sentence that contains
        Chinese characters into separated words.

        Parameter:
            - sentence: The str(unicode) to be segmented.
            - cut_all: Model type. True for full pattern, False for accurate pattern.
            - HMM: Whether to use the Hidden Markov Model.
        """
        is_paddle_installed = check_paddle_install['is_paddle_installed']
        sentence = strdecode(sentence)
        if use_paddle and is_paddle_installed:
            # if sentence is null, it will raise core exception in paddle.
            if sentence is None or len(sentence) == 0:
                return
            import jieba.lac_small.predict as predict
            results = predict.get_sent(sentence)
            for sent in results:
                if sent is None:
                    continue
                yield sent
            return
        re_han = re_han_default
        re_skip = re_skip_default
        if cut_all:
            cut_block = self.__cut_all
        elif HMM:
            cut_block = self.__cut_DAG
        else:
            cut_block = self.__cut_DAG_NO_HMM
        blocks = re_han.split(sentence)
        for blk in blocks:
            if not blk:
                continue
            if re_han.match(blk):
                for word in cut_block(blk):
                    yield word
            else:
                tmp = re_skip.split(blk)
                for x in tmp:
                    if re_skip.match(x):
                        yield x
                    elif not cut_all:
                        for xx in x:
                            yield xx
                    else:
                        yield x
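
The second half of cut splits the sentence into blocks before any dictionary lookup happens. The sketch below isolates that step so it can be run on its own. `RE_HAN` and `RE_SKIP` are simplified stand-ins for jieba's re_han_default and re_skip_default (the exact patterns are an assumption here), and `split_blocks` plus the placeholder `cut_block` argument are hypothetical names.

```python
import re

# Assumed, simplified patterns: "han" blocks are runs of CJK characters,
# ASCII letters, digits and a few symbols; "skip" matches whitespace.
RE_HAN = re.compile(r"([\u4E00-\u9FD5a-zA-Z0-9+#&._%-]+)")
RE_SKIP = re.compile(r"(\r\n|\s)")


def split_blocks(sentence, cut_block, cut_all=False):
    """Yield tokens the way the loop in cut does, delegating han blocks to cut_block."""
    for blk in RE_HAN.split(sentence):
        if not blk:
            continue
        if RE_HAN.match(blk):
            # Segmentable block: hand it to the mode-specific cutter.
            yield from cut_block(blk)
        else:
            # Punctuation and whitespace: never looked up in the dictionary.
            for x in RE_SKIP.split(blk):
                if RE_SKIP.match(x):
                    yield x
                elif not cut_all:
                    yield from x  # one character at a time
                else:
                    yield x


# Placeholder cutter that returns each han block unchanged.
tokens = list(split_blocks("你好，world 2021！", lambda blk: [blk]))
print(tokens)  # ['你好', '，', 'world', ' ', '2021', '！']
```

Passing a real cutter in place of the lambda would segment only the han blocks, while punctuation and whitespace pass through untouched.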

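jieba.cut_for_search: this method takes two parameters: the string to segment, and whether to use the HMM model. It is intended for segmenting text when a search engine builds an inverted index, so its granularity is finer.
The code below shows how this works. The sentence is first segmented in accurate mode. Then, for each word longer than 2 characters, a sliding window picks out every 2-character sub-word that is in the prefix dictionary; for each word longer than 3 characters, a sliding window likewise picks out every 3-character sub-word in the prefix dictionary. The sub-words are yielded before the word itself. (xrange is the name supplied by jieba's compatibility layer; on Python 3 it is bound to range.)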


    def cut_for_search(self, sentence, HMM=True):
        """
        Finer segmentation for search engines.
        """
        words = self.cut(sentence, HMM=HMM)
        for w in words:
            if len(w) > 2:
                for i in xrange(len(w) - 1):
                    gram2 = w[i:i + 2]
                    if self.FREQ.get(gram2):
                        yield gram2
            if len(w) > 3:
                for i in xrange(len(w) - 2):
                    gram3 = w[i:i + 3]
                    if self.FREQ.get(gram3):
                        yield gram3
            yield w
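
The sliding-window step can be tried in isolation with a small sketch. It does not call jieba: the toy table `FREQ` and the helper name `search_subwords` are hypothetical, and the function handles a single word that is assumed to have already come out of accurate-mode segmentation.

```python
# Hypothetical toy frequency table standing in for jieba's prefix dictionary.
FREQ = {"中国": 10, "科学": 8, "学院": 6, "科学院": 4, "中国科学院": 2}


def search_subwords(word, freq=FREQ):
    """Return the 2- and 3-character sub-words found in freq, then the word itself."""
    out = []
    if len(word) > 2:
        for i in range(len(word) - 1):
            gram2 = word[i:i + 2]
            if freq.get(gram2):
                out.append(gram2)
    if len(word) > 3:
        for i in range(len(word) - 2):
            gram3 = word[i:i + 3]
            if freq.get(gram3):
                out.append(gram3)
    out.append(word)  # the full word always comes last
    return out


print(search_subwords("中国科学院"))
# ['中国', '科学', '学院', '科学院', '中国科学院']
```

A 2-character word such as "中国" is returned unchanged, because neither window applies to it.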