jieba Source Code Study Notes (1) - A First Look at the Segmentation Feature

keineahnung2345

Article link: https://blog.csdn.net/keineahnung2345/article/details/86609671

Preface
jieba/__init__.py
References

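Preface

jieba's word segmentation is implemented by the jieba module itself together with finalseg.
The jieba module itself contains four files: __init__.py, __main__.py, _compat.py, and dict.txt.
__init__.py defines the Tokenizer class and a number of global functions that perform the segmentation itself.
__main__.py defines how jieba is used from the command line.
_compat.py handles Python 2/3 compatibility.
dict.txt is the dictionary; it records each word's frequency and part-of-speech tag.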
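
To make the role of dict.txt more concrete, here is a minimal sketch of reading such entries. It assumes each line has the form "word frequency tag" separated by single spaces; that layout, the sample entries, and the helper name parse_dict_line are assumptions for illustration, not jieba's own loader.

```python
def parse_dict_line(line):
    """Split one dictionary line into (word, freq, tag).

    Assumes the layout "word freq tag"; this helper is hypothetical
    and is not part of jieba.
    """
    word, freq, tag = line.strip().split(" ")
    return word, int(freq), tag


# Made-up sample entries, only for illustration.
sample = ["北京 100 ns", "分詞 20 n"]
entries = [parse_dict_line(line) for line in sample]
for word, freq, tag in entries:
    print(word, freq, tag)
```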

jieba/__init__.py

The Tokenizer class

__init__.py defines a class called Tokenizer.
It provides methods such as cut, cut_for_search, and tokenize, and is responsible for the actual segmentation work.

class Tokenizer(object):

    def __init__(self, dictionary=DEFAULT_DICT):
        ...

    def __repr__(self):
        return '<Tokenizer dictionary=%r>' % self.dictionary

    def gen_pfdict(self, f):
        ...

    def initialize(self, dictionary=None):
        ...

    def check_initialized(self):
        ...

    def calc(self, sentence, DAG, route):
        ...

    def get_DAG(self, sentence):
        ...

    def __cut_all(self, sentence):
        ...

    def __cut_DAG_NO_HMM(self, sentence):
        ...

    def __cut_DAG(self, sentence):
        ...

    def cut(self, sentence, cut_all=False, HMM=True):
        ...

    def cut_for_search(self, sentence, HMM=True):
        ...

    def lcut(self, *args, **kwargs):
        ...

    def lcut_for_search(self, *args, **kwargs):
        ...

    _lcut = lcut
    _lcut_for_search = lcut_for_search

    def _lcut_no_hmm(self, sentence):
        ...

    def _lcut_all(self, sentence):
        ...

    def _lcut_for_search_no_hmm(self, sentence):
        ...

    def get_dict_file(self):
        ...

    def load_userdict(self, f):
        ...

    def add_word(self, word, freq=None, tag=None):
        ...
    
    def del_word(self, word):
        ...

    def suggest_freq(self, segment, tune=False):
        ...

    def tokenize(self, unicode_sentence, mode="default", HMM=True):
        ...
    
    def set_dictionary(self, dictionary_path):
        ...

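Global functions tied to Tokenizer

According to the usage described in the README, we can call jieba.cut directly to segment text. How is that possible?

After defining the Tokenizer class, __init__.py creates an instance of it named dt.
It then defines the global functions one by one, binding each to the corresponding method of dt, as the following code shows: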

# default Tokenizer instance
dt = Tokenizer()

# global functions
get_FREQ = lambda k, d=None: dt.FREQ.get(k, d)
add_word = dt.add_word
calc = dt.calc
cut = dt.cut
lcut = dt.lcut
cut_for_search = dt.cut_for_search
lcut_for_search = dt.lcut_for_search
del_word = dt.del_word
get_DAG = dt.get_DAG
get_dict_file = dt.get_dict_file
initialize = dt.initialize
load_userdict = dt.load_userdict
set_dictionary = dt.set_dictionary
suggest_freq = dt.suggest_freq
tokenize = dt.tokenize
user_word_tag_tab = dt.user_word_tag_tab


def _lcut_all(s):
    return dt._lcut_all(s)


def _lcut(s):
    return dt._lcut(s)


def _lcut_no_hmm(s):
    return dt._lcut_no_hmm(s)


def _lcut_all(s):
    return dt._lcut_all(s)


def _lcut_for_search(s):
    return dt._lcut_for_search(s)


def _lcut_for_search_no_hmm(s):
    return dt._lcut_for_search_no_hmm(s)

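Take cut = dt.cut as an example: it defines a global name cut and binds it to the cut method of the dt object.
As a result, we do not have to create a Tokenizer object ourselves and can simply call jieba.cut to segment text.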
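
The same pattern can be tried in isolation. The sketch below uses a hypothetical ToyTokenizer that only splits on whitespace (it is not jieba's Tokenizer) to show how a default instance and a few module-level aliases let callers skip creating the object themselves.

```python
class ToyTokenizer(object):
    """Hypothetical stand-in for jieba's Tokenizer: splits on whitespace."""

    def cut(self, sentence):
        for word in sentence.split():
            yield word

    def lcut(self, sentence):
        return list(self.cut(sentence))


# default instance, playing the role of dt
dt = ToyTokenizer()

# module-level names bound to the instance's methods
cut = dt.cut
lcut = dt.lcut

print(lcut("a b c"))
# A bound method remembers the instance it was taken from.
print(cut.__self__ is dt)
```

Because cut is a bound method, calling cut(sentence) is equivalent to calling dt.cut(sentence); no self argument has to be passed.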

The global function _get_abs_path

All the global functions above point to methods owned by dt. Besides these, a global function _get_abs_path is also defined.

_get_abs_path = lambda path: os.path.normpath(os.path.join(os.getcwd(), path))

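Its parameter path is the dictionary's file name. The function joins the current working directory with path, normalizes the result, and returns it. If path is already absolute, os.path.join discards the working directory, so the path is only normalized.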
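
A quick way to see this behaviour is to call the lambda with a relative and an absolute path. The file name dict.txt and the data directory below are placeholders chosen for illustration; nothing needs to exist on disk because the functions only manipulate strings.

```python
import os

_get_abs_path = lambda path: os.path.normpath(os.path.join(os.getcwd(), path))

# Relative input: the working directory is prepended, then ".." is collapsed.
relative = _get_abs_path(os.path.join("data", "..", "dict.txt"))

# Absolute input: os.path.join keeps only the absolute part.
absolute = _get_abs_path(os.path.abspath("dict.txt"))

print(relative)
print(relative == absolute)
```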

The global function _replace_file

This function moves (in other words, renames) a file.

if os.name == 'nt':
    from shutil import move as _replace_file
else:
    _replace_file = os.rename

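The if-else here deals with differences in how renaming behaves across operating systems, so that _replace_file works on each of them. On Windows, os.rename raises an error when the destination already exists, which is presumably why shutil.move is used there.
According to the referenced question "I don't understand os.name=='nt': . what is nt and os.name", os.name == 'nt' on the first line means that the current operating system is Windows.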
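
As a small illustration, the selected function can be exercised on a temporary file. The file names cache.tmp and cache.bin are made up for this sketch, and writing to a temporary name before renaming is just one plausible use, not a statement about where jieba calls it.

```python
import os
import tempfile

if os.name == 'nt':
    from shutil import move as _replace_file
else:
    _replace_file = os.rename

# Work inside a fresh temporary directory so nothing else is touched.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "cache.tmp")
dst = os.path.join(workdir, "cache.bin")

with open(src, "w") as f:
    f.write("data")

_replace_file(src, dst)

# After the call the old name is gone and the new name exists.
moved = (not os.path.exists(src)) and os.path.exists(dst)
print(moved)
```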

Regular expressions

re_userdict = re.compile('^(.+?)( [0-9]+)?( [a-z]+)?$', re.U)

re_eng = re.compile('[a-zA-Z0-9]', re.U)

# \u4E00-\u9FD5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
# \r\n|\s : whitespace characters. Will not be handled.
# re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)", re.U)
# Adding "-" symbol in re_han_default
re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)

re_skip_default = re.compile("(\r\n|\s)", re.U)
re_han_cut_all = re.compile("([\u4E00-\u9FD5]+)", re.U)
re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U)

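Several regular expressions are defined here; they come into play during segmentation and when loading dictionaries.
They will be covered separately in another article.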
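
Until then, a short experiment gives a feel for two of them. The sample strings are made up, and the patterns are written here as raw strings (a small change from the excerpt above) so that the backslashes reach the re module untouched.

```python
import re

re_userdict = re.compile(r'^(.+?)( [0-9]+)?( [a-z]+)?$', re.U)
re_han_default = re.compile(r"([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)

# A user-dictionary style line: word, optional frequency, optional tag.
full = re_userdict.match("雲計算 5 n").groups()
word_only = re_userdict.match("創新辦").groups()

# split() keeps the captured blocks, so matched and unmatched pieces alternate.
blocks = re_han_default.split("我愛Python 3，你呢？")

print(full)
print(word_only)
print(blocks)
```

Note that the captured frequency and tag keep their leading space, so a caller would need to strip them before use.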

Logging-related functions

log_console = logging.StreamHandler(sys.stderr)
default_logger = logging.getLogger(__name__)
default_logger.setLevel(logging.DEBUG)
default_logger.addHandler(log_console)

def setLogLevel(log_level):
    global logger
    default_logger.setLevel(log_level)

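As its name suggests, default_logger is the default logger of this module. It writes to stderr through log_console and starts at the DEBUG level; setLogLevel changes that level.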
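
The effect of setLogLevel can be observed with a self-contained copy of this setup. The logger name jieba_demo and the messages are placeholders for this sketch; the real module uses __name__.

```python
import logging
import sys

log_console = logging.StreamHandler(sys.stderr)
default_logger = logging.getLogger("jieba_demo")  # hypothetical logger name
default_logger.setLevel(logging.DEBUG)
default_logger.addHandler(log_console)


def setLogLevel(log_level):
    default_logger.setLevel(log_level)


default_logger.debug("shown: level is DEBUG")
setLogLevel(logging.INFO)
default_logger.debug("hidden: below INFO")
default_logger.info("shown: INFO passes")
```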

Functions for parallel segmentation

pool = None

def _pcut(sentence, cut_all=False, HMM=True):
    parts = strdecode(sentence).splitlines(True)
    if cut_all:
        result = pool.map(_lcut_all, parts)
    elif HMM:
        result = pool.map(_lcut, parts)
    else:
        result = pool.map(_lcut_no_hmm, parts)
    for r in result:
        for w in r:
            yield w


def _pcut_for_search(sentence, HMM=True):
    parts = strdecode(sentence).splitlines(True)
    if HMM:
        result = pool.map(_lcut_for_search, parts)
    else:
        result = pool.map(_lcut_for_search_no_hmm, parts)
    for r in result:
        for w in r:
            yield w


def enable_parallel(processnum=None):
    """
    Change the module's `cut` and `cut_for_search` functions to the
    parallel version.
    Note that this only works using dt, custom Tokenizer
    instances are not supported.
    """
    global pool, dt, cut, cut_for_search
    from multiprocessing import cpu_count
    if os.name == 'nt':
        raise NotImplementedError(
            "jieba: parallel mode only supports posix system")
    else:
        from multiprocessing import Pool
    dt.check_initialized()
    if processnum is None:
        processnum = cpu_count()
    pool = Pool(processnum)
    cut = _pcut
    cut_for_search = _pcut_for_search


def disable_parallel():
    global pool, dt, cut, cut_for_search
    if pool:
        pool.close()
        pool = None
    cut = dt.cut
    cut_for_search = dt.cut_for_search
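
The split, map, and flatten shape of _pcut can be tried without jieba. This sketch swaps in two things that are not in the original: a thread-backed pool from multiprocessing.dummy (jieba itself uses multiprocessing.Pool, i.e. worker processes) and a hypothetical per-line worker toy_lcut that only splits on whitespace.

```python
from multiprocessing.dummy import Pool  # thread-backed pool with the same map() API


def toy_lcut(line):
    """Hypothetical per-line worker: splits on whitespace."""
    return line.split()


def pcut(sentence, pool):
    # splitlines(True) keeps the line endings, as in _pcut
    parts = sentence.splitlines(True)
    # map() returns the per-line results in input order
    result = pool.map(toy_lcut, parts)
    for r in result:
        for w in r:
            yield w


pool = Pool(2)
words = list(pcut("a b\nc d\ne", pool))
pool.close()
pool.join()
print(words)
```

Since pool.map preserves the order of its inputs, the flattened words come out in the same order as in the original text, even though the lines are processed concurrently.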