jieba分词源码分析

最新推荐文章于 2023-07-21 17:10:53 发布

Flyingzhan

最新推荐文章于 2023-07-21 17:10:53 发布

阅读量1k

点赞数

分类专栏： NLP 文章标签： nlp

本文链接：https://blog.csdn.net/Flyingzhan/article/details/114002920

版权

NLP 专栏收录该内容

4 篇文章 0 订阅

订阅专栏

jieba分词源码分析

jieba分词是开源的中文分词库，里面包含了分词，核心词提取等功能，使用范围非常广。下面介绍一下jieba分词的源码，方便之后查找回忆。

1：前缀词典

基于词典的切词方法需要一个好的语料库，jieba分词的作者在这里https://github.com/fxsjy/jieba/issues/7描述了语料库来源，主要来源于人民日报的语料库。初始化时会根据原始语料库生成前缀词典，可以用来得到之后的词频数据等。

@staticmethod
    def gen_pfdict(f):
        lfreq = {}
        ltotal = 0
        f_name = resolve_filename(f)
        for lineno, line in enumerate(f, 1):
            try:
                line = line.strip().decode('utf-8')
                word, freq = line.split(' ')[:2]
                freq = int(freq)
                lfreq[word] = freq
                ltotal += freq
                for ch in xrange(len(word)):
                    wfrag = word[:ch + 1]
                    if wfrag not in lfreq:
                        lfreq[wfrag] = 0
            except ValueError:
                raise ValueError(
                    'invalid dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
        f.close()
        return lfreq, ltotal

2：DAG

DAG也就是有向无环图，jieba分词使用dict存储sentence的DAG，key为单词的下标，value为该单词到句子末尾的所有前缀词（前缀词典中存在）。

def get_DAG(self, sentence):
        self.check_initialized()
        DAG = {}
        N = len(sentence)
        for k in xrange(N):
            tmplist = []
            i = k
            frag = sentence[k]
            while i < N and frag in self.FREQ:
                if self.FREQ[frag]:
                    tmplist.append(i)
                i += 1
                frag = sentence[k:i + 1]
            if not tmplist:
                tmplist.append(k)
            DAG[k] = tmplist
        return DAG

3：切词

jieba分词支持下面四种分词模式：

精确模式，试图将句子最精确地切开，适合文本分析；
全模式，把句子中所有的可以成词的词语都扫描出来, 速度非常快，但是不能解决歧义；
搜索引擎模式，在精确模式的基础上，对长词再次切分，提高召回率，适合用于搜索引擎分词。
paddle模式，利用PaddlePaddle深度学习框架，训练序列标注（双向GRU）网络模型实现分词。同时支持词性标注。paddle模式使用需安装paddlepaddle-tiny，pip install paddlepaddle-tiny==1.6.1。目前paddle模式支持jieba v0.40及以上版本。jieba v0.40以下版本，请升级jieba，pip install jieba --upgrade

3.1 全模式

只有句子长度等于1的时候考虑是否是英语（其实就是当前单词），全模式实际上是把DAG中所有可能的切词都取出来了（前缀词典中存在）。

def __cut_all(self, sentence):
        dag = self.get_DAG(sentence)
        old_j = -1
        eng_scan = 0
        eng_buf = u''
        for k, L in iteritems(dag):
            if eng_scan == 1 and not re_eng.match(sentence[k]):
                eng_scan = 0
                yield eng_buf
            if len(L) == 1 and k > old_j:
                word = sentence[k:L[0] + 1]
                if re_eng.match(word):
                    if eng_scan == 0:
                        eng_scan = 1
                        eng_buf = word
                    else:
                        eng_buf += word
                if eng_scan == 0:
                    yield word
                old_j = L[0]
            else:
                for j in L:
                    if j > k:
                        yield sentence[k:j + 1]
                        old_j = j
        if eng_scan == 1:
            yield eng_buf

3.2 精确模式（HMM）

DAG能够得到语料库中存在的所有可见前缀，和最大匹配算法类似，我们可以从前向贪心的找到最大词频的前缀输出。但是这样的方法不能保证一定产生全局最优的结果，因此需要进一步求解概率最大的分词序列。我们可以直接使用动态规划算法，求解最有分词路径。route[idx]记录的是开头为idx的切词结果，例如route[idx]=idx+3代表从idx到句末中，最先在idx+3的地方切词产生的序列能够得到最大的概率。

def calc(self, sentence, DAG, route):
        N = len(sentence)
        route[N] = (0, 0)
        logtotal = log(self.total)
        for idx in xrange(N - 1, -1, -1):
            route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) -
                              logtotal + route[x + 1][0], x) for x in DAG[idx])

然后根据动态规划的结果，从左开始往右根据route把切词打印出来，这里会把单个单词存储到buf里面，当buf的长度大于1时，会使用veterbi算法进行新词发现，因为例如对于一句话『他来到了网易杭研大厦』，杭研没在前缀词典里，那么杭，杭研，杭研大，杭研大厦都不会被存储到DAG里面，只有单个单词『杭』。同样的『研』也是单个的单词，因此buf就是『杭研』，使用veterbi算法对新词进行切词。

def __cut_DAG(self, sentence):
        DAG = self.get_DAG(sentence)
        route = {}
        self.calc(sentence, DAG, route)
        x = 0
        buf = ''
        N = len(sentence)
        while x < N:
            y = route[x][1] + 1
            l_word = sentence[x:y]
            if y - x == 1:
                buf += l_word
            else:
                if buf:
                    if len(buf) == 1:
                        yield buf
                        buf = ''
                    else:
                        if not self.FREQ.get(buf):
                            recognized = finalseg.cut(buf)
                            for t in recognized:
                                yield t
                        else:
                            for elem in buf:
                                yield elem
                        buf = ''
                yield l_word
            x = y

        if buf:
            if len(buf) == 1:
                yield buf
            elif not self.FREQ.get(buf):
                recognized = finalseg.cut(buf)
                for t in recognized:
                    yield t
            else:
                for elem in buf:
                    yield elem

veterbi算法比较直观，使用动态规划的思路求全局最优路径，详细的介绍可以看别的博客https://www.zhihu.com/question/20136144。

算法非常的直观，和上面的图一样，从头开始列出所有可能的k种state，然后使用动态规划概率最大的序列，其中值得注意的是path的key是当前运行到的单词的state，不是开始的state。

def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{}]  # tabular
    path = {}
    for y in states:  # init
        V[0][y] = start_p[y] + emit_p[y].get(obs[0], MIN_FLOAT)
        path[y] = [y]
    for t in xrange(1, len(obs)):
        V.append({})
        newpath = {}
        for y in states:
            em_p = emit_p[y].get(obs[t], MIN_FLOAT)
            (prob, state) = max(
                [(V[t - 1][y0] + trans_p[y0].get(y, MIN_FLOAT) + em_p, y0) for y0 in PrevStatus[y]])
            V[t][y] = prob
            newpath[y] = path[state] + [y]
        path = newpath

    (prob, state) = max((V[len(obs) - 1][y], y) for y in 'ES')

    return (prob, path[state])

下面是veterbi算法的执行函数，在cut里面会设置必须可分的词，如果使用add_force_split添加了单词的话，会强制数据单个字符，cut里面会将句子分成汉字和英语，如果是汉字的话，会使用veterbi算法进行切分，如果是英语的话直接打印单个字符。

def __cut(sentence):
    global emit_P
    prob, pos_list = viterbi(sentence, 'BMES', start_P, trans_P, emit_P)
    begin, nexti = 0, 0
    # print pos_list, sentence
    for i, char in enumerate(sentence):
        pos = pos_list[i]
        if pos == 'B':
            begin = i
        elif pos == 'E':
            yield sentence[begin:i + 1]
            nexti = i + 1
        elif pos == 'S':
            yield char
            nexti = i + 1
    if nexti < len(sentence):
        yield sentence[nexti:]

re_han = re.compile("([\u4E00-\u9FD5]+)")
re_skip = re.compile("([a-zA-Z0-9]+(?:\.\d+)?%?)")


def add_force_split(word):
    global Force_Split_Words
    Force_Split_Words.add(word)

def cut(sentence):
    sentence = strdecode(sentence)
    blocks = re_han.split(sentence)
    for blk in blocks:
        if re_han.match(blk):
            for word in __cut(blk):
                if word not in Force_Split_Words:
                    yield word
                else:
                    for c in word:
                        yield c
        else:
            tmp = re_skip.split(blk)
            for x in tmp:
                if x:
                    yield x

如果不使用veterbi算法的话，就只需要把单个的词组合起来输出。

def __cut_DAG_NO_HMM(self, sentence):
        DAG = self.get_DAG(sentence)
        route = {}
        self.calc(sentence, DAG, route)
        x = 0
        N = len(sentence)
        buf = ''
        while x < N:
            y = route[x][1] + 1
            l_word = sentence[x:y]
            if re_eng.match(l_word) and len(l_word) == 1:
                buf += l_word
                x = y
            else:
                if buf:
                    yield buf
                    buf = ''
                yield l_word
                x = y
        if buf:
            yield buf
            buf = ''

示例代码：

# encoding=utf-8
import jieba

jieba.enable_paddle()# 启动paddle模式。 0.40版之后开始支持，早期版本不支持
strs=["我来到北京清华大学","乒乓球拍卖完了","中国科学技术大学"]
for str in strs:
    seg_list = jieba.cut(str,use_paddle=True) # 使用paddle模式
    print("Paddle Mode: " + '/'.join(list(seg_list)))

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # 全模式

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # 精确模式

seg_list = jieba.cut("他来到了网易杭研大厦")  # 默认是精确模式
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # 搜索引擎模式
print(", ".join(seg_list))