Notes on Reading the jieba Source Code (3) - A First Look at the Tokenizer for Word Segmentation
Introduction
jieba/__init__.py is responsible for word segmentation; the previous post gave a brief overview of its architecture. The core of jieba/__init__.py is the Tokenizer class, which is the focus of this post.
The Tokenizer class in jieba/__init__.py
Class structure
First, let's look again at all the function names in the Tokenizer class:
class Tokenizer(object):
    def __init__(self, dictionary=DEFAULT_DICT):
        ...
    def __repr__(self):
        return '<Tokenizer dictionary=%r>' % self.dictionary
    def gen_pfdict(self, f):
        ...
    def initialize(self, dictionary=None):
        ...
    def check_initialized(self):
        ...
    def calc(self, sentence, DAG, route):
        ...
    def get_DAG(self, sentence):
        ...
    def __cut_all(self, sentence):
        ...
    def __cut_DAG_NO_HMM(self, sentence):
        ...
    def __cut_DAG(self, sentence):
        ...
    def cut(self, sentence, cut_all=False, HMM=True):
        ...
    def cut_for_search(self, sentence, HMM=True):
        ...
    def lcut(self, *args, **kwargs):
        ...
    def lcut_for_search(self, *args, **kwargs):
        ...
    _lcut = lcut
    _lcut_for_search = lcut_for_search
    def _lcut_no_hmm(self, sentence):
        ...
    def _lcut_all(self, sentence):
        ...
    def _lcut_for_search_no_hmm(self, sentence):
        ...
    def get_dict_file(self):
        ...
    def load_userdict(self, f):
        ...
    def add_word(self, word, freq=None, tag=None):
        ...
    def del_word(self, word):
        ...
    def suggest_freq(self, segment, tune=False):
        ...
    def tokenize(self, unicode_sentence, mode="default", HMM=True):
        ...
    def set_dictionary(self, dictionary_path):
        ...
The __init__ function
The Tokenizer class has two functions, __init__ and initialize, both of which perform initialization. However, __init__ is the lighter-weight of the two: it only defines a few attributes. Loading the dictionary required for segmentation is deferred to initialize.
class Tokenizer(object):
    #...
    def __init__(self, dictionary=DEFAULT_DICT):
        self.lock = threading.RLock()
        if dictionary == DEFAULT_DICT:
            self.dictionary = dictionary
        else:
            self.dictionary = _get_abs_path(dictionary)
        self.FREQ = {}
        self.total = 0
        self.user_word_tag_tab = {}
        self.initialized = False
        self.tmp_dir = None
        self.cache_file = None
The lock attribute is a threading.RLock object; we will see what it is for later, in the initialize function.
The __repr__ function
This overrides the __repr__ function of the object class.
class Tokenizer(object):
    #...
    def __repr__(self):
        return '<Tokenizer dictionary=%r>' % self.dictionary
For more on __repr__, see object.__repr__(self).
The get_dict_file function
This function opens the dictionary file and returns the open file object. By default it reads dict.txt, but users can also supply a custom dictionary.
class Tokenizer(object):
    #...
    def get_dict_file(self):
        if self.dictionary == DEFAULT_DICT:
            return get_module_res(DEFAULT_DICT_NAME)
        else:
            return open(self.dictionary, 'rb')
The gen_pfdict function
initialize calls the gen_pfdict function:
self.FREQ, self.total = self.gen_pfdict(self.get_dict_file())
As this line shows, gen_pfdict reads an already opened dictionary file object and returns the frequency of each word together with the sum of all word frequencies. Here is its definition:
class Tokenizer(object):
    #...
    def gen_pfdict(self, f):  # gen_pfdict takes a file opened in binary read mode
        lfreq = {}  # frequency of each word
        ltotal = 0  # sum of all word frequencies
        # both are returned at the end of the function
        # resolve_filename is defined in _compat.py;
        # it retrieves the name of an already opened file
        f_name = resolve_filename(f)
        for lineno, line in enumerate(f, 1):  # read the file f line by line
            try:
                # the file was opened in binary mode,
                # so decode is used to convert each line from bytes to str
                line = line.strip().decode('utf-8')
                # update lfreq and ltotal
                word, freq = line.split(' ')[:2]
                freq = int(freq)
                lfreq[word] = freq
                ltotal += freq
                # treat the first ch+1 characters of word as a word with
                # frequency 0 and add it to the lfreq dict;
                # we will see the point of this in the get_DAG function
                # xrange also works under Python 3 because _compat.py
                # defines xrange as an alias for Python 3's range function
                for ch in xrange(len(word)):
                    wfrag = word[:ch + 1]
                    if wfrag not in lfreq:
                        lfreq[wfrag] = 0
            # .decode('utf-8') may raise a UnicodeDecodeError;
            # inspect.getmro(UnicodeDecodeError) tells us that
            # ValueError is a parent class of UnicodeDecodeError,
            # so this except clause also catches UnicodeDecodeError
            except ValueError:
                raise ValueError(
                    'invalid dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
        # remember that the argument f is an already opened file;
        # close it here
        f.close()
        return lfreq, ltotal
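The prefix-insertion step is the interesting part, so here is a minimal sketch of it in isolation, using a tiny in-memory word list instead of the real dict.txt (the words and frequencies are made up for illustration):

```python
# Standalone sketch of gen_pfdict's prefix handling.
def gen_pfdict(entries):
    lfreq = {}   # frequency of each word
    ltotal = 0   # sum of all frequencies
    for word, freq in entries:
        lfreq[word] = freq
        ltotal += freq
        # every prefix of word is also recorded, with frequency 0,
        # unless it is already a word in its own right
        for ch in range(len(word)):
            wfrag = word[:ch + 1]
            if wfrag not in lfreq:
                lfreq[wfrag] = 0
    return lfreq, ltotal

lfreq, ltotal = gen_pfdict([(u'北京大学', 100), (u'北京', 50)])
# u'北京大学' and u'北京' keep their own frequencies, while the pure
# prefixes u'北' and u'北京大' are stored with frequency 0.
```

Storing every prefix, even with frequency 0, is what lets get_DAG later probe `word[:k] in self.FREQ` with a plain dict lookup instead of a trie.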
The initialize function
The job of initialize is to load the dictionary. Like __init__, it performs initialization, but unlike __init__ it does not run when the object is created; it runs only when the user first accesses the dictionary or starts segmenting.
initialize uses the get_dict_file, gen_pfdict, and _get_abs_path functions introduced earlier, along with the DICT_WRITING and default_logger variables.
The definition of initialize also uses the tempfile and marshal modules, as well as the RLock class from threading. Let's first look at what each of these does:
threading.Lock
When we want a section of code to run to completion without interference from other threads, we can use threading.Lock. For usage, see MULTITHREADING : USING LOCKS IN THE WITH STATEMENT (CONTEXT MANAGER):
Option 1:
with some_lock:
    # do something...
Option 2:
some_lock.acquire()
try:
    # do something...
finally:
    some_lock.release()
The difference between Lock and RLock
However, initialize uses threading.RLock rather than threading.Lock. For the difference between the two, see: class threading.RLock and What is the difference between Lock and RLock.
In the example from those links, functions a and b both need the lock, and a calls b. Using threading.Lock there would deadlock the program, so threading.RLock must be used instead.
An RLock can be acquired repeatedly by the thread that already holds it. Consider the following examples:
First, test threading.Lock:
import time
import threading

lock = threading.Lock()
with lock:
    with lock:
        time.sleep(1)
If you actually run this code, you will find that it takes far longer than one second; in fact it never finishes. The second with lock: can never acquire lock, because the same thread already holds it, so the program deadlocks.
Now test threading.RLock:
import time
import threading

lock = threading.RLock()
with lock:
    with lock:
        time.sleep(1)
This version finishes in about one second, because threading.RLock is designed for exactly this scenario.
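The a-calls-b scenario described above can be reproduced directly as well. A minimal sketch (the function names a and b follow the linked example; with threading.Lock in place of RLock, the call to a() would hang forever):

```python
import threading

lock = threading.RLock()

def b():
    # second acquisition by the same thread; only legal with an RLock
    with lock:
        return 'b done'

def a():
    # first acquisition; a calls b while still holding the lock
    with lock:
        return b()

print(a())  # prints 'b done'
```

This mirrors initialize, which takes self.lock and may then call other code paths that also want the same lock.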
tempfile
tempfile.gettempdir
Return the name of the directory used for temporary files.
This defines the default value for the dir argument to all functions in this module.
Python searches a standard list of directories to find one which the
calling user can create files in.
import tempfile
tempfile.gettempdir()
# '/var/folders/gp/cf5j4s914r73_4bxc4cnksg80000gn/T'  # on Mac
# 'C:\\Users\\user\\AppData\\Local\\Temp'             # on Windows
In short, tempfile.gettempdir finds a directory that temporary files can be written to.
tempfile.mkstemp
tempfile.mkstemp([suffix=''[, prefix='tmp'[, dir=None[, text=False]]]]):
Creates a temporary file in the most secure manner possible.
If dir is specified, the file will be created in that directory.
mkstemp() returns a tuple containing an OS-level handle to an open file (as would be returned by os.open()) and the absolute pathname of that file, in that order.
In short, tempfile.mkstemp creates a temporary file in the most secure way possible and returns a file descriptor plus the file's absolute path.
os.fdopen
Return an open file object connected to the file descriptor fd.
That is, given a file descriptor fd, it returns an open file object.
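initialize combines these two calls: mkstemp creates the file and fdopen wraps the returned descriptor so it can be used as an ordinary file object. A small sketch of the same pattern (the written content is arbitrary):

```python
import os
import tempfile

# create a temporary file; we get back an OS-level file descriptor
# and the file's absolute path, in that order
fd, fpath = tempfile.mkstemp()

# wrap the descriptor in a regular binary file object and write to it,
# just as initialize does when dumping the cache
with os.fdopen(fd, 'wb') as f:
    f.write(b'hello')

# the path returned by mkstemp points at the same file
with open(fpath, 'rb') as f:
    data = f.read()  # b'hello'

os.remove(fpath)  # clean up the temporary file
```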
marshal
See Serializing Data Using the marshal Module: marshal.dump and marshal.load are tools for saving and loading Python objects.
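The cache file written by initialize holds exactly a (FREQ, total) pair, so the round trip looks like this. The sketch below uses the in-memory variants marshal.dumps/marshal.loads to stay self-contained; initialize uses the file variants dump/load, and the frequency values here are made up:

```python
import marshal

freq = {u'北': 0, u'北京': 50}  # stand-in for self.FREQ
total = 50                      # stand-in for self.total

# serialize the (FREQ, total) pair to bytes;
# initialize does the same with marshal.dump into the cache file
blob = marshal.dumps((freq, total))

# deserialize it back; the pair round-trips losslessly
freq2, total2 = marshal.loads(blob)
```

marshal is faster than pickle for simple built-in types like dicts of strings and ints, which is presumably why jieba uses it for the cache.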
The initialize code
DICT_WRITING = {}

class Tokenizer(object):
    #...
    def initialize(self, dictionary=None):
        # abs_path is the absolute path of the dictionary.
        # if the caller passed a dictionary argument, abs_path must be updated;
        # otherwise use the self.dictionary already set in __init__()
        if dictionary:
            abs_path = _get_abs_path(dictionary)
            if self.dictionary == abs_path and self.initialized:
                # the dictionary is already loaded, so return
                return
            else:
                self.dictionary = abs_path
                self.initialized = False
        else:
            abs_path = self.dictionary

        # loading the dictionary must run to completion, so take the lock
        with self.lock:
            # this try-except looks like a no-op, but entering the with block
            # waits until any other thread currently writing the cache for
            # this dictionary has finished
            try:
                with DICT_WRITING[abs_path]:
                    pass
            except KeyError:
                pass
            # if self.initialized is True, the dictionary is already loaded;
            # return immediately
            if self.initialized:
                return

            default_logger.debug("Building prefix dict from %s ..." % (abs_path or 'the default dictionary'))
            t1 = time.time()
            # decide the name of the cache file
            if self.cache_file:
                cache_file = self.cache_file
            # default dictionary
            elif abs_path == DEFAULT_DICT:
                cache_file = "jieba.cache"
            # custom dictionary
            else:
                cache_file = "jieba.u%s.cache" % md5(
                    abs_path.encode('utf-8', 'replace')).hexdigest()
            # expand cache_file to its absolute path
            cache_file = os.path.join(
                self.tmp_dir or tempfile.gettempdir(), cache_file)
            # directory of the cache file
            # prevent absolute path in self.cache_file
            tmpdir = os.path.dirname(cache_file)

            load_from_cache_fail = True
            # load cache_file:
            # first check that cache_file exists and is a regular file;
            # if not, skip this part.
            # if it is, and a custom dictionary (not DEFAULT_DICT) is in use,
            # also require that cache_file was modified later than the custom
            # dictionary. when all conditions hold, load self.FREQ and
            # self.total from the cache file and set load_from_cache_fail
            # to False
            if os.path.isfile(cache_file) and (abs_path == DEFAULT_DICT or
                    # os.path.getmtime: last modification time of a file
                    os.path.getmtime(cache_file) > os.path.getmtime(abs_path)):
                default_logger.debug(
                    "Loading model from cache %s" % cache_file)
                try:
                    with open(cache_file, 'rb') as cf:
                        # marshal is introduced above
                        self.FREQ, self.total = marshal.load(cf)
                    load_from_cache_fail = False
                except Exception:
                    load_from_cache_fail = True

            # if loading cache_file failed, re-read the dictionary file to
            # obtain self.FREQ and self.total, then regenerate the cache file
            if load_from_cache_fail:
                # reuse the lock already stored in the DICT_WRITING dict if
                # there is one, so that concurrent initializations of the
                # same dictionary share a single lock
                wlock = DICT_WRITING.get(abs_path, threading.RLock())
                DICT_WRITING[abs_path] = wlock
                # another lock is needed here, guarding the file-writing block
                with wlock:
                    self.FREQ, self.total = self.gen_pfdict(self.get_dict_file())
                    default_logger.debug(
                        "Dumping model to file cache %s" % cache_file)
                    try:
                        # tmpdir is the cache-file directory chosen above
                        # prevent moving across different filesystems
                        fd, fpath = tempfile.mkstemp(dir=tmpdir)
                        # use marshal.dump to write the freshly built
                        # (self.FREQ, self.total) into temp_cache_file
                        with os.fdopen(fd, 'wb') as temp_cache_file:
                            marshal.dump(
                                (self.FREQ, self.total), temp_cache_file)
                        # rename the temporary file to cache_file
                        _replace_file(fpath, cache_file)
                    except Exception:
                        default_logger.exception("Dump cache file failed.")

                try:
                    del DICT_WRITING[abs_path]
                except KeyError:
                    pass

            # self.initialized is used later to check whether self.FREQ and
            # self.total have been given meaningful values
            self.initialized = True
            default_logger.debug(
                "Loading model cost %.3f seconds." % (time.time() - t1))
            default_logger.debug("Prefix dict has been built successfully.")
Only after initialize finishes does self.FREQ hold meaningful values; self.FREQ is then used during segmentation.
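As a concrete illustration of the cache naming above, here is the custom-dictionary branch in isolation. The dictionary path is made up, and md5 comes from hashlib (which is where jieba imports it from):

```python
import os
import tempfile
from hashlib import md5

abs_path = '/home/user/mydict.txt'  # hypothetical custom dictionary path

# custom dictionaries get a cache name derived from the md5 of their path,
# so different dictionaries never collide on the same cache file
cache_file = "jieba.u%s.cache" % md5(
    abs_path.encode('utf-8', 'replace')).hexdigest()

# the cache lives in the system temp directory (or self.tmp_dir if set)
cache_file = os.path.join(tempfile.gettempdir(), cache_file)
```

The default dictionary instead always maps to the fixed name "jieba.cache".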
The check_initialized function
This checks whether self.FREQ and self.total have been given meaningful values. If not, it calls initialize to load them from the dictionary.
class Tokenizer(object):
    #...
    def check_initialized(self):
        if not self.initialized:
            self.initialize()
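Taken together, __init__, initialize, and check_initialized implement a lazy-loading pattern: construction is cheap, and the expensive load happens at most once, on first use, guarded by a lock. A generic sketch of the same pattern (the class and attribute names are my own; the load_count attribute exists only to demonstrate the load runs once):

```python
import threading

class LazyResource(object):
    """Generic sketch of the __init__/initialize/check_initialized pattern."""

    def __init__(self):
        # cheap construction: just attributes, no loading
        self.lock = threading.RLock()
        self.initialized = False
        self.data = None
        self.load_count = 0

    def initialize(self):
        with self.lock:
            if self.initialized:          # another caller got here first
                return
            self.data = {'loaded': True}  # stand-in for the expensive load
            self.load_count += 1
            self.initialized = True

    def check_initialized(self):
        if not self.initialized:
            self.initialize()

    def query(self):
        self.check_initialized()  # every public entry point calls this
        return self.data

r = LazyResource()
r.query()
r.query()  # the expensive load ran exactly once
```

Re-checking self.initialized inside the lock is what makes this safe when several threads call query() at the same time.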
Core segmentation functions
The functions listed here implement the core functionality of Tokenizer; the following posts will introduce them one by one.
class Tokenizer(object):
    #...
    def calc(self, sentence, DAG, route):
        ...
    def get_DAG(self, sentence):
        ...
    def __cut_all(self, sentence):
        ...
    def __cut_DAG_NO_HMM(self, sentence):
        ...
    def __cut_DAG(self, sentence):
        ...
    def cut(self, sentence, cut_all=False, HMM=True):
        ...
    def cut_for_search(self, sentence, HMM=True):
        ...
Segmentation wrappers
Wrappers around the segmentation functions are defined here. The original segmentation functions return generators; the l-prefixed functions below call them and convert their return values to lists, which makes them easier to use.
class Tokenizer(object):
    #...
    def lcut(self, *args, **kwargs):
        return list(self.cut(*args, **kwargs))

    def lcut_for_search(self, *args, **kwargs):
        return list(self.cut_for_search(*args, **kwargs))

    _lcut = lcut
    _lcut_for_search = lcut_for_search

    def _lcut_no_hmm(self, sentence):
        return self.lcut(sentence, False, False)

    def _lcut_all(self, sentence):
        return self.lcut(sentence, True)

    def _lcut_for_search_no_hmm(self, sentence):
        return self.lcut_for_search(sentence, False)

    def tokenize(self, unicode_sentence, mode="default", HMM=True):
        ...
In addition, the tokenize function wraps each word returned by cut into a triple of (word, start position, end position) before yielding it.
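The wrapping itself is straightforward: walk through the words yielded by the segmenter while keeping a running character offset. A sketch with a fake segmenter that just splits the sentence in half (the real tokenize delegates to Tokenizer.cut):

```python
def fake_cut(sentence):
    # stand-in for Tokenizer.cut: pretend the segmenter found two words
    for w in (sentence[:2], sentence[2:]):
        if w:
            yield w

def tokenize(sentence):
    # mirror of the wrapping done by Tokenizer.tokenize:
    # turn each word into a (word, start, end) triple
    start = 0
    for w in fake_cut(sentence):
        width = len(w)
        yield (w, start, start + width)
        start += width

list(tokenize(u'永和服装'))
# [(u'永和', 0, 2), (u'服装', 2, 4)]
```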
Custom dictionaries
jieba supports user-defined dictionaries. Since this is not part of the core functionality, the relevant functions are only listed here without further commentary.
class Tokenizer(object):
    #...
    def load_userdict(self, f):
        '''
        Load personalized dict to improve detect rate.
        Parameter:
            - f : A plain text file contains words and their occurrences.
                  Can be a file-like object, or the path of the dictionary file,
                  whose encoding must be utf-8.
        Structure of dict file:
        word1 freq1 word_type1
        word2 freq2 word_type2
        ...
        Word type may be ignored
        '''
        self.check_initialized()
        if isinstance(f, string_types):
            f_name = f
            f = open(f, 'rb')
        else:
            f_name = resolve_filename(f)
        for lineno, ln in enumerate(f, 1):
            line = ln.strip()
            if not isinstance(line, text_type):
                try:
                    line = line.decode('utf-8').lstrip('\ufeff')
                except UnicodeDecodeError:
                    raise ValueError('dictionary file %s must be utf-8' % f_name)
            if not line:
                continue
            # match won't be None because there's at least one character
            word, freq, tag = re_userdict.match(line).groups()
            if freq is not None:
                freq = freq.strip()
            if tag is not None:
                tag = tag.strip()
            self.add_word(word, freq, tag)

    def add_word(self, word, freq=None, tag=None):
        """
        Add a word to dictionary.
        freq and tag can be omitted, freq defaults to be a calculated value
        that ensures the word can be cut out.
        """
        self.check_initialized()
        word = strdecode(word)
        freq = int(freq) if freq is not None else self.suggest_freq(word, False)
        self.FREQ[word] = freq
        self.total += freq
        if tag:
            self.user_word_tag_tab[word] = tag
        for ch in xrange(len(word)):
            wfrag = word[:ch + 1]
            if wfrag not in self.FREQ:
                self.FREQ[wfrag] = 0
        if freq == 0:
            finalseg.add_force_split(word)

    def del_word(self, word):
        """
        Convenient function for deleting a word.
        """
        self.add_word(word, 0)

    def suggest_freq(self, segment, tune=False):
        """
        Suggest word frequency to force the characters in a word to be
        joined or split.
        Parameter:
            - segment : The segments that the word is expected to be cut into,
                        If the word should be treated as a whole, use a str.
            - tune : If True, tune the word frequency.
        Note that HMM may affect the final result. If the result doesn't change,
        set HMM=False.
        """
        self.check_initialized()
        ftotal = float(self.total)
        freq = 1
        if isinstance(segment, string_types):
            word = segment
            for seg in self.cut(word, HMM=False):
                freq *= self.FREQ.get(seg, 1) / ftotal
            freq = max(int(freq * self.total) + 1, self.FREQ.get(word, 1))
        else:
            segment = tuple(map(strdecode, segment))
            word = ''.join(segment)
            for seg in segment:
                freq *= self.FREQ.get(seg, 1) / ftotal
            freq = min(int(freq * self.total), self.FREQ.get(word, 0))
        if tune:
            self.add_word(word, freq)
        return freq

    def set_dictionary(self, dictionary_path):
        with self.lock:
            abs_path = _get_abs_path(dictionary_path)
            if not os.path.isfile(abs_path):
                raise Exception("jieba: file does not exist: " + abs_path)
            self.dictionary = abs_path
            self.initialized = False
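The arithmetic in suggest_freq is easy to follow with toy numbers. In the "join" case, the suggested frequency must just beat the product of the probabilities of the segments the word currently splits into, so the whole word wins during path selection. A sketch with a made-up frequency table standing in for self.FREQ and self.total:

```python
# Toy stand-ins for self.FREQ and self.total (values made up for illustration)
FREQ = {u'台': 100, u'中': 200, u'台中': 0}
total = 1000.0

# join case: we want u'台中' to be cut out as one word, but the segmenter
# currently splits it into u'台' and u'中'.
# suggest_freq multiplies the probabilities of those segments...
freq = 1.0
for seg in (u'台', u'中'):
    freq *= FREQ.get(seg, 1) / total

# ...then suggests a frequency one above the equivalent count
suggested = max(int(freq * total) + 1, FREQ.get(u'台中', 1))
# (100/1000) * (200/1000) * 1000 = 20, so the suggestion is 21
```

The "split" case runs the same product in reverse: it caps the word's frequency at the product, so the path through the individual segments becomes at least as likely as the whole word.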
References
I don't understand os.name=='nt': what is nt and os.name
What is the difference between Lock and RLock
MULTITHREADING : USING LOCKS IN THE WITH STATEMENT (CONTEXT MANAGER)
tempfile — Generate temporary files and directories
Serializing Data Using the marshal Module