Word Segmentation Methods in jieba

tuqinag

Category: Natural Language Processing. Tags: natural language processing, jieba, Chinese word segmentation

Article link: https://blog.csdn.net/tuqinag/article/details/54743474

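This article describes the word segmentation methods used by the jieba library: full-mode segmentation, segmentation without a Hidden Markov Model (HMM), and HMM-based segmentation that builds words from characters. Full mode lists every possible word found in the sentence, for example when segmenting "我来到北京清华大学". The character-based method treats segmentation as a sequence labeling problem over characters, which handles dictionary words and out-of-vocabulary words in the same way, and uses an HMM to tag each character.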

(Summary generated automatically by CSDN.)

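I have recently been reading about natural language processing, and some of the code I wrote uses the jieba library, whose results I find reasonably good. So I also read through its word segmentation code (the keyword extraction code was covered in an earlier post), and here I would like to share some of the methods it uses.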

First, the entry-point function:

import re

# Raw strings keep \. and \s from being read as (invalid) string escapes.
re_han_default = re.compile(r"([\u4E00-\u9FD5a-zA-Z0-9+#&\._]+)", re.U)
re_skip_default = re.compile(r"(\r\n|\s)", re.U)
re_han_cut_all = re.compile(r"([\u4E00-\u9FD5]+)", re.U)
re_skip_cut_all = re.compile(r"[^a-zA-Z0-9+#\n]", re.U)

def cut(self, sentence, cut_all=False, HMM=True):
    '''
    The main function that segments an entire sentence that contains
    Chinese characters into separated words.

    Parameter:
        - sentence: The str(unicode) to be segmented.
        - cut_all: Model type. True for full pattern, False for accurate pattern.
        - HMM: Whether to use the Hidden Markov Model.
    '''

    sentence = strdecode(sentence)

    # Pick the regular expressions for the requested mode.
    if cut_all:
        re_han = re_han_cut_all
        re_skip = re_skip_cut_all
    else:
        re_han = re_han_default  # matches runs of Chinese characters, ASCII letters, digits and + # & . _
        re_skip = re_skip_default  # splits on whitespace; the whitespace pieces are yielded unchanged
    # Pick the routine that segments one matched block.
    if cut_all:
        cut_block = self.__cut_all
    elif HMM:
        cut_block = self.__cut_DAG
    else:
        cut_block = self.__cut_DAG_NO_HMM
    blocks = re_han.split(sentence)
    for blk in blocks:
        if not blk:
            continue
        if re_han.match(blk):
            for word in cut_block(blk):
                yield word  # yield turns this function into a generator
        else:
            tmp = re_skip.split(blk)
            for x in tmp:
                if re_skip.match(x):
                    yield x
                elif not cut_all:
                    # In accurate mode, emit the remaining characters one by one.
                    for xx in x:
                        yield xx
                else:
                    yield x
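To see what the `blocks` loop in `cut` iterates over, the splitting step can be tried on its own. The sketch below reuses the default regular expression shown above (written as a raw string). `split_blocks` is a hypothetical helper name introduced only for illustration; it is not part of jieba, and the sample sentence is made up.

```python
import re

# Same pattern as re_han_default above, written as a raw string.
re_han_default = re.compile(r"([\u4E00-\u9FD5a-zA-Z0-9+#&\._]+)", re.U)


def split_blocks(sentence):
    """Yield (block, matched) pairs in the order the loop in cut() visits them.

    Hypothetical helper for illustration only, not a jieba function.
    """
    for blk in re_han_default.split(sentence):
        if not blk:
            # split() produces empty strings at the edges; cut() skips them too.
            continue
        yield blk, bool(re_han_default.match(blk))


blocks = list(split_blocks("我来到北京清华大学, hello world!"))
for blk, matched in blocks:
    print(repr(blk), matched)
```

Because the pattern contains a capturing group, `split` keeps the matched runs in its result. Blocks flagged `True` are the ones `cut` would hand to `cut_block`; the rest (punctuation and whitespace here) go through the `re_skip` branch.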

Segmenting in full mode

As for what "full mode" means, it will become clear once you have read the code.
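As a rough intuition before reading the library code, full mode can be pictured as listing every dictionary word that appears anywhere in the sentence, overlaps included. The sketch below is a simplified stand-in under that assumption and is not jieba's implementation: `toy_cut_all` and `toy_dict` are hypothetical names, and the mini-dictionary is made up for this one sentence.

```python
def toy_cut_all(sentence, dictionary):
    """List every dictionary word occurring in the sentence, ordered by start position.

    Simplified illustration only; jieba's own full-mode routine differs in detail.
    """
    words = []
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            piece = sentence[start:end]
            if piece in dictionary:
                words.append(piece)
    return words


# Hypothetical mini-dictionary; the real dictionary ships with jieba.
toy_dict = {"我", "来到", "北京", "清华", "清华大学", "华大", "大学"}
result = toy_cut_all("我来到北京清华大学", toy_dict)
print("/".join(result))  # 我/来到/北京/清华/清华大学/华大/大学
```

Note that overlapping candidates such as 清华, 清华大学, 华大 and 大学 all appear in the output, which is what distinguishes this mode from a segmentation that picks a single split.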

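# Builds self.FREQ. The parameter f is the dictionary file
# (an open file object rather than a path string, since its lines are iterated below).
def gen_pfdict(self, f):
    lfreq = {}
    ltotal = 0  # sum of all frequencies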
    f_name = resolve_filename(f)
    for lineno, line in enumerate(f, 1):
        try:
            line = line.strip().decode('utf-8')
            word, freq = line.s