jieba Source Code Study Notes (10) - A First Look at Part-of-Speech Tagging


keineahnung2345

Categories: Machine Learning, NLP, jieba Source Code Study Notes. Tags: jieba, nlp

Article link: https://blog.csdn.net/keineahnung2345/article/details/86688409

Preface
Directory structure of jieba/posseg
jieba/posseg/__init__.py
References

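Preface

Besides word segmentation, jieba also provides part-of-speech tagging and keyword extraction.
Part-of-speech tagging is implemented in the posseg module.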

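The jieba documentation describes the part-of-speech tagging feature as follows:

Tags the part of speech of each word after the sentence has been segmented, using a notation compatible with ictclas.

For more on ictclas, see ICTCLAS 汉语词性标注集 (the ICTCLAS Chinese part-of-speech tag set).
(The references also list several tag comparison tables; interested readers can consult them there.)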
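To make the notation a little more concrete, here is a minimal sketch that attaches English names to a handful of tags. The table TAG_NAMES and the helper describe are names made up for this note, the table covers only a few common tags, and the sample pairs are written by hand rather than produced by jieba.

```python
# -*- coding: utf-8 -*-
# A few ICTCLAS-style tags with English names.
# TAG_NAMES and describe() are hypothetical helpers, not part of jieba.
TAG_NAMES = {
    'n': 'noun',
    'nr': 'person name',
    'ns': 'place name',
    'v': 'verb',
    'a': 'adjective',
    'd': 'adverb',
    'r': 'pronoun',
    'm': 'numeral',
    'p': 'preposition',
    'c': 'conjunction',
}


def describe(word, flag):
    """Format a (word, tag) pair as 'word/tag (English name)'."""
    return '%s/%s (%s)' % (word, flag, TAG_NAMES.get(flag, 'unlisted tag'))


# Hand-written (word, tag) pairs, used only to exercise the table.
sample = [('我', 'r'), ('爱', 'v'), ('北京', 'ns')]
described = [describe(word, flag) for word, flag in sample]
for line in described:
    print(line)
```

For the complete tag set, the comparison tables in the references are the place to look.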

Directory structure of jieba/posseg

jieba/posseg:
    char_state_tab.p
    char_state_tab.py
    prob_emit.p
    prob_emit.py
    prob_start.p
    prob_start.py
    prob_trans.p
    prob_trans.py
    viterbi.py
    __init__.py

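__init__.py and viterbi.py contain the part-of-speech tagging code; the remaining .p and .py files hold the HMM parameters.

__init__.py defines the POSTokenizer class. Its cut function segments a sentence and returns the part of speech of each resulting word, and it can run either with or without the HMM.
When the HMM is enabled, cut indirectly calls the viterbi function in viterbi.py to discover new words, that is, words missing from the dictionary.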
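As a rough picture of what a Viterbi decoder does with start, transition and emission parameters, here is a generic sketch in log space. It is not the code in viterbi.py: the state names (a word position joined with a tag, such as "B-n"), the probabilities, and the MIN_FLOAT stand-in are all toy values assumed for this note.

```python
# -*- coding: utf-8 -*-
import math

MIN_FLOAT = -3.14e100  # stand-in log-probability for "impossible" (assumed value)


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best log-probability, best state path) for the observations."""
    # V[t][s]: best log-probability of any path that ends in state s at time t
    V = [{s: start_p[s] + emit_p[s].get(obs[0], MIN_FLOAT) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            emit = emit_p[s].get(obs[t], MIN_FLOAT)
            score, prev = max(
                (V[t - 1][p] + trans_p[p].get(s, MIN_FLOAT) + emit, p)
                for p in states
            )
            V[t][s] = score
            back[t][s] = prev
    # Pick the best final state, then follow the back-pointers to the start.
    best_score, last = max((V[-1][s], s) for s in states)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best_score, path


log = math.log
# Toy parameters: "B-n"/"E-n" = begin/end of a noun, "S-v" = single-char verb.
states = ['B-n', 'E-n', 'S-v']
start_p = {'B-n': log(0.4), 'E-n': MIN_FLOAT, 'S-v': log(0.6)}
trans_p = {
    'B-n': {'E-n': log(1.0)},
    'E-n': {'B-n': log(0.5), 'S-v': log(0.5)},
    'S-v': {'B-n': log(0.7), 'S-v': log(0.3)},
}
emit_p = {
    'B-n': {'苹': log(0.8), '吃': log(0.2)},
    'E-n': {'果': log(0.9), '苹': log(0.1)},
    'S-v': {'吃': log(0.9), '果': log(0.1)},
}

score, path = viterbi('吃苹果', states, start_p, trans_p, emit_p)
print(path)  # ['S-v', 'B-n', 'E-n']
```

With these toy numbers the decoder reads the first character as a one-character verb and the remaining two as a two-character noun, which is the kind of grouping that lets an HMM propose a word the dictionary does not contain.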

We first look at the overall structure of posseg/__init__.py, and only then move on to the core algorithm.

jieba/posseg/__init__.py

Importing other modules

from __future__ import absolute_import, unicode_literals
import os
import re
import sys
import jieba
import pickle
from .._compat import *
from .viterbi import viterbi

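The pair class

The pair class has two attributes, word and flag, which hold the word itself and its part of speech.
The __cut_DAG_NO_HMM and __cut_DAG functions of POSTokenizer pack each segmented word and its tag into a pair object before returning it.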

class pair(object):

    def __init__(self, word, flag):
        self.word = word
        self.flag = flag  # part-of-speech tag

    def __unicode__(self):
        return '%s/%s' % (self.word, self.flag)

    def __repr__(self):
        return 'pair(%r, %r)' % (self.word, self.flag)

    def __str__(self):
        if PY2:
            return self.__unicode__().encode(default_encoding)
        else:
            return self.__unicode__()

    def __iter__(self):
        return iter((self.word, self.flag))

    def __lt__(self, other):
        return self.word < other.word

    def __eq__(self, other):
        return isinstance(other, pair) and self.word == other.word and self.flag == other.flag

    def __hash__(self):
        return hash(self.word)

    def encode(self, arg):
        return self.__unicode__().encode(arg)
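To see what these special methods buy the caller, the sketch below repeats the class in a trimmed, Python 3-only form (no PY2 branch, no encode) and exercises it. The words and tags are written by hand for illustration; they are not jieba output.

```python
# -*- coding: utf-8 -*-
class pair(object):
    """Trimmed Python 3-only copy of the class above, for demonstration."""

    def __init__(self, word, flag):
        self.word = word
        self.flag = flag  # part-of-speech tag

    def __str__(self):
        return '%s/%s' % (self.word, self.flag)

    def __repr__(self):
        return 'pair(%r, %r)' % (self.word, self.flag)

    def __iter__(self):
        return iter((self.word, self.flag))

    def __lt__(self, other):
        return self.word < other.word

    def __eq__(self, other):
        return (isinstance(other, pair) and self.word == other.word
                and self.flag == other.flag)

    def __hash__(self):
        return hash(self.word)


p = pair('北京', 'ns')

word, flag = p   # __iter__ allows tuple-style unpacking
text = str(p)    # __str__ gives the "word/tag" form: '北京/ns'

# __eq__ compares both fields, while __hash__ uses only the word, so two
# pairs with the same word but different tags share a hash yet stay distinct.
unique = {pair('北京', 'ns'), pair('北京', 'ns'), pair('北京', 'n')}

# __lt__ makes pairs sortable by word.
ordered = sorted([pair('b', 'x'), pair('a', 'y')])

print(word, flag, text, len(unique), ordered)
```

The unpacking behaviour is what makes a loop of the form "for word, flag in ..." work on tagging results.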

The POSTokenizer class

The POSTokenizer class defines the __cut_DAG_NO_HMM and __cut_DAG functions, which implement the core part-of-speech tagging algorithm.

class POSTokenizer(object):

    def __init__(self, tokenizer=None):
        ...

    def __repr__(self):
        ...

    def __getattr__(self, name):
        ...

    def initialize(self, dictionary=None):
        ...

    def load_word_tag(self, f):
        ...

    def makesure_userdict_loaded(self):
        ...

    def __cut(self, sentence):
        ...

    def __cut_detail(self, sentence):
        ...

    def __cut_DAG_NO_HMM(self, sentence):
        ...

    def __cut_DAG(self, sentence):
        ...

    def __cut_internal(self, sentence, HMM=True):
        ...

    def _lcut_internal(self, sentence):
        ...

    def _lcut_internal_no_hmm(self, sentence):
        ...

    def cut(self, sentence, HMM=True):
        ...

    def lcut(self, *args, **kwargs):
        ...
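Since the method bodies are elided here, the following is only a sketch of one plausible way a public cut can switch between two double-underscore helpers. ToyTagger and its methods are hypothetical names, and the bodies are placeholders rather than jieba's logic. The sketch also shows Python's name mangling, which applies to any method whose name starts with two underscores, such as __cut_DAG.

```python
class ToyTagger(object):
    """Hypothetical stand-in showing the HMM / no-HMM switch."""

    def __cut_hmm(self, sentence):
        # Inside the class body this name is stored as _ToyTagger__cut_hmm.
        yield (sentence, 'hmm')

    def __cut_no_hmm(self, sentence):
        yield (sentence, 'no-hmm')

    def cut(self, sentence, HMM=True):
        # Choose the helper once, then stream its results to the caller.
        cut_block = self.__cut_hmm if HMM else self.__cut_no_hmm
        for item in cut_block(sentence):
            yield item

    def lcut(self, *args, **kwargs):
        # List-returning convenience wrapper around the generator.
        return list(self.cut(*args, **kwargs))


tagger = ToyTagger()
with_hmm = tagger.lcut('abc')
without_hmm = tagger.lcut('abc', HMM=False)

# From outside the class only the mangled name exists.
has_mangled = hasattr(tagger, '_ToyTagger__cut_hmm')
has_plain = hasattr(tagger, '__cut_hmm')

print(with_hmm, without_hmm, has_mangled, has_plain)
```

Name mangling is why helpers like these are effectively private to the class, while single-underscore names such as _lcut_internal remain reachable under their written name.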

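Global variables and functions related to POSTokenizer

Building on the POSTokenizer and pair classes defined above, the module defines several global variables and functions.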

# default Tokenizer instance

dt = POSTokenizer(jieba.dt)

# global functions

initialize = dt.initialize


def _lcut_internal(s):
    return dt._lcut_internal(s)


def _lcut_internal_no_hmm(s):
    return dt._lcut_internal_no_hmm(s)


def cut(sentence, HMM=True):
    """
    Global `cut` function that supports parallel processing.
    Note that this only works using dt, custom POSTokenizer
    instances are not supported.
    """
    global dt
    if jieba.pool is None:
        for w in dt.cut(sentence, HMM=HMM):
            yield w
    else:
        parts = strdecode(sentence).splitlines(True)
        if HMM:
            result = jieba.pool.map(_lcut_internal, parts)
        else:
            result = jieba.pool.map(_lcut_internal_no_hmm, parts)
        for r in result:
            for w in r:
                yield w


def lcut(sentence, HMM=True):
    return list(cut(sentence, HMM))
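The parallel branch of the global cut follows a split, map, flatten pattern, which can be sketched without jieba. In the sketch below, toy_lcut is a hypothetical stand-in for the module-level _lcut_internal wrappers, and a thread-backed pool from multiprocessing.dummy is assumed in place of jieba.pool so that the example runs anywhere.

```python
from multiprocessing.dummy import Pool  # thread-backed pool with the same map() API


def toy_lcut(line):
    """Hypothetical tagger: label every whitespace-separated token 'x'."""
    return [(token, 'x') for token in line.split()]


def cut(text, pool=None):
    """Yield tagged tokens, line-parallel when a pool is supplied."""
    if pool is None:
        for item in toy_lcut(text):
            yield item
    else:
        # keepends=True keeps each line's newline, so the parts
        # concatenate back to the original text exactly.
        parts = text.splitlines(True)
        # pool.map returns results in input order, one list per line.
        for result in pool.map(toy_lcut, parts):
            for item in result:
                yield item


text = 'a b\nc d\ne'
parts = text.splitlines(True)

serial = list(cut(text))
pool = Pool(2)
try:
    parallel = list(cut(text, pool))
finally:
    pool.close()
    pool.join()

print(parts)
print(serial == parallel)
```

Because the unit of work is a line, the parallel path can only help on multi-line input. Handing pool.map a plain one-argument module-level function, rather than a method of a tokenizer instance, is presumably also what ties the global cut to the default dt instance, as its docstring notes.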

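References

jieba documentation
ICTCLAS 汉语词性标注集 (ICTCLAS Chinese part-of-speech tag set)
计算所汉语词性标记集 (ICT Chinese part-of-speech tag set)
jieba（结巴）分词种词性简介 (Overview of part-of-speech tags in jieba segmentation)
彙整中文與英文的詞性標註代號：結巴斷詞器與FastTag / Identify the Part of Speech in Chinese and English
結巴斷詞器的詞性標註分析 (Analysis of part-of-speech tagging in the jieba segmenter)
pos_tag_mapping
luw2007/词性标记.md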
