Notes on Reading the jieba Source Code (3) - A First Look at the Tokenizer for Word Segmentation
Introduction
jieba/__init__.py is responsible for word segmentation; the previous post gave a brief overview of its architecture. The core of jieba/__init__.py is the Tokenizer class, which is the focus of this post.
The Tokenizer class in jieba/__init__.py
Class structure
First, let's look again at all the function names in the Tokenizer class:
class Tokenizer(object):
    def __init__(self, dictionary=DEFAULT_DICT):
        ...
    def __repr__(self):
        return '<Tokenizer dictionary=%r>' % self.dictionary
    def gen_pfdict(self, f):
        ...
    def initialize(self, dictionary=None):
        ...
    def check_initialized(self):
        ...
    def calc(self, sentence, DAG, route):
        ...
    def get_DAG(self, sentence):
        ...
    def __cut_all(self, sentence):
        ...
    def __cut_DAG_NO_HMM(self, sentence):
        ...
    def __cut_DAG(self, sentence):
        ...
    def cut(self, sentence, cut_all=False, HMM=True):
        ...
    def cut_for_search(self, sentence, HMM=True):
        ...
    def lcut(self, *args, **kwargs):
        ...
    def lcut_for_search(self, *args, **kwargs):
        ...
    _lcut = lcut
    _lcut_for_search = lcut_for_search
    def _lcut_no_hmm(self, sentence):
        ...
    def _lcut_all(self, sentence):
        ...
    def _lcut_for_search_no_hmm(self, sentence):
        ...
    def get_dict_file(self):
        ...
    def load_userdict(self, f):
        ...
    def add_word(self, word, freq=None, tag=None):
        ...
    def del_word(self, word):
        ...
    def suggest_freq(self, segment, tune=False):
        ...
    def tokenize(self, unicode_sentence, mode="default", HMM=True):
        ...
    def set_dictionary(self, dictionary_path):
        ...
The __init__ function
The Tokenizer class has two functions, __init__ and initialize, both of which perform initialization. However, __init__ is the lighter-weight of the two: it only defines a few attributes. Loading the dictionary required for segmentation is deferred to initialize.
class Tokenizer(object):
    #...
    def __init__(self, dictionary=DEFAULT_DICT):
        self.lock = threading.RLock()
        if dictionary == DEFAULT_DICT:
            self.dictionary = dictionary
        else:
            self.dictionary = _get_abs_path(dictionary)
        self.FREQ = {}
        self.total = 0
        self.user_word_tag_tab = {}
        self.initialized = False
        self.tmp_dir = None
        self.cache_file = None
The lock attribute is a threading.RLock object; we will see what it is for later, in the initialize function.
The __repr__ function
This overrides the __repr__ function of the object class.
class Tokenizer(object):
    #...
    def __repr__(self):
        return '<Tokenizer dictionary=%r>' % self.dictionary
For more on __repr__, see object.__repr__(self).
The get_dict_file function
This function opens the dictionary file and returns the open file object. By default it reads dict.txt, but users can also supply a custom dictionary.
class Tokenizer(object):
    #...
    def get_dict_file(self):
        if self.dictionary == DEFAULT_DICT:
            return get_module_res(DEFAULT_DICT_NAME)
        else:
            return open(self.dictionary, 'rb')
The gen_pfdict function
initialize calls the gen_pfdict function:
self.FREQ, self.total = self.gen_pfdict(self.get_dict_file())
As this line shows, gen_pfdict reads an already opened dictionary file object and returns the frequency of each word together with the sum of all word frequencies. Here is its definition:
class Tokenizer(object):
    #...
    def gen_pfdict(self, f):  # gen_pfdict takes a file opened in binary read mode
        lfreq = {}  # frequency of each word
        ltotal = 0  # sum of all word frequencies
        # both are returned at the end of the function
        # resolve_filename is defined in _compat.py;
        # it retrieves the name of an already opened file
        f_name = resolve_filename(f)
        for lineno, line in enumerate(f, 1):  # read the file f line by line
            try:
                # the file was opened in binary mode,
                # so decode is used to convert each line from bytes to str
                line = line.strip().decode('utf-8')
                # update lfreq and ltotal
                word, freq = line.split(' ')[:2]
                freq = int(freq)
                lfreq[word] = freq
                ltotal += freq
                # treat the first ch+1 characters of word as a word with
                # frequency 0 and add it to the lfreq dict;
                # we will see the point of this in the get_DAG function
                # xrange also works under Python 3 because _compat.py
                # defines xrange as an alias for Python 3's range function
                for ch in xrange(len(word)):
                    wfrag = word[:ch + 1]
                    if wfrag not in lfreq:
                        lfreq[wfrag] = 0
            # .decode('utf-8') may raise a UnicodeDecodeError;
            # inspect.getmro(UnicodeDecodeError) tells us that
            # ValueError is a parent class of UnicodeDecodeError,
            # so this except clause also catches UnicodeDecodeError
            except ValueError:
                raise ValueError(
                    'invalid dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
        # remember that the argument f is an already opened file;
        # close it here
        f.close()
        return lfreq, ltotal
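The prefix-insertion step is the interesting part, so here is a minimal sketch of it in isolation, using a tiny in-memory word list instead of the real dict.txt (the words and frequencies are made up for illustration):

```python
# Standalone sketch of gen_pfdict's prefix handling.
def gen_pfdict(entries):
    lfreq = {}   # frequency of each word
    ltotal = 0   # sum of all frequencies
    for word, freq in entries:
        lfreq[word] = freq
        ltotal += freq
        # every prefix of word is also recorded, with frequency 0,
        # unless it is already a word in its own right
        for ch in range(len(word)):
            wfrag = word[:ch + 1]
            if wfrag not in lfreq:
                lfreq[wfrag] = 0
    return lfreq, ltotal

lfreq, ltotal = gen_pfdict([(u'北京大学', 100), (u'北京', 50)])
# u'北京大学' and u'北京' keep their own frequencies, while the pure
# prefixes u'北' and u'北京大' are stored with frequency 0.
```

Storing every prefix, even with frequency 0, is what lets get_DAG later probe `word[:k] in self.FREQ` with a plain dict lookup instead of a trie.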
The initialize function
The job of initialize is to load the dictionary. Like __init__, it performs initialization, but unlike __init__ it does not run when the object is created; it runs only when the user first accesses the dictionary or starts segmenting.
initialize uses the get_dict_file, gen_pfdict, and _get_abs_path functions introduced earlier, along with the DICT_WRITING and default_logger variables.
The definition of initialize also uses the tempfile and marshal modules, as well as the RLock class from threading. Let's first look at what each of these does:
threading.Lock
When we want a section of code to run to completion without interference from other threads, we can use threading.Lock. For usage, see MULTITHREADING : USING LOCKS IN THE WITH STATEMENT (CONTEXT MANAGER):
Option 1:
with some_lock:
    # do something...
Option 2:
some_lock.acquire()
try:
    # do something...
finally:
    some_lock.release()
The difference between Lock and RLock
However, initialize uses threading.RLock rather than threading.Lock. For the difference between the two, see: class threading.RLock and What is the difference between Lock and RLock.
In the example from those links, functions a and b both need the lock, and a calls b. Using threading.Lock there would deadlock the program, so threading.RLock must be used instead.
An RLock can be acquired repeatedly by the thread that already holds it. Consider the following examples:
First, test threading.Lock:
import time
import threading

lock = threading.Lock()
with lock:
    with lock:
        time.sleep(1)
If you actually run this code, you will find that it takes far longer than one second; in fact it never finishes. The second with lock: can never acquire lock, because the same thread already holds it, so the program deadlocks.
Now test threading.RLock:
import time
import threading

lock = threading.RLock()
with lock:
    with lock:
        time.sleep(1)
This version finishes in about one second, because threading.RLock is designed for exactly this scenario.
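The a-calls-b scenario described above can be reproduced directly as well. A minimal sketch (the function names a and b follow the linked example; with threading.Lock in place of RLock, the call to a() would hang forever):

```python
import threading

lock = threading.RLock()

def b():
    # second acquisition by the same thread; only legal with an RLock
    with lock:
        return 'b done'

def a():
    # first acquisition; a calls b while still holding the lock
    with lock:
        return b()

print(a())  # prints 'b done'
```

This mirrors initialize, which takes self.lock and may then call other code paths that also want the same lock.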
tempfile
tempfile.gettempdir
Return the name of the directory used for temporary files.
This defines the default value for the dir argument to all functions in this module.
Python searches a standard list of directories to find one which the
calling user can create files in.
import tempfile
tempfile.gettempdir()
# '/var/folders/gp/cf5j4s914r73_4bxc4cnksg80000gn/T'  # on Mac
# 'C:\\Users\\user\\AppData\\Local\\Temp'             # on Windows
In short, tempfile.gettempdir finds a directory that temporary files can be written to.
tempfile.mkstemp
tempfile.mkstemp([suffix=''[, prefix='tmp'[, dir=None[, text=False]]]]):
Creates a temporary file in the most secure manner possible.
If dir is specified, the file will be created in that directory.
mkstemp() returns a tuple containing an OS-level handle to an open file (as would be returned by os.open()) and the absolute pathname of that file, in that order.
In short, tempfile.mkstemp creates a temporary file in the most secure way possible and returns a file descriptor plus the file's absolute path.
os.fdopen
Return an open file object connected to the file descriptor fd.
That is, given a file descriptor fd, it returns an open file object.
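initialize combines these two calls: mkstemp creates the file and fdopen wraps the returned descriptor so it can be used as an ordinary file object. A small sketch of the same pattern (the written content is arbitrary):

```python
import os
import tempfile

# create a temporary file; we get back an OS-level file descriptor
# and the file's absolute path, in that order
fd, fpath = tempfile.mkstemp()

# wrap the descriptor in a regular binary file object and write to it,
# just as initialize does when dumping the cache
with os.fdopen(fd, 'wb') as f:
    f.write(b'hello')

# the path returned by mkstemp points at the same file
with open(fpath, 'rb') as f:
    data = f.read()  # b'hello'

os.remove(fpath)  # clean up the temporary file
```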
marshal
See Serializing Data Using the marshal Module: marshal.dump and marshal.load are tools for saving and loading Python objects.
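The cache file written by initialize holds exactly a (FREQ, total) pair, so the round trip looks like this. The sketch below uses the in-memory variants marshal.dumps/marshal.loads to stay self-contained; initialize uses the file variants dump/load, and the frequency values here are made up:

```python
import marshal

freq = {u'北': 0, u'北京': 50}  # stand-in for self.FREQ
total = 50                      # stand-in for self.total

# serialize the (FREQ, total) pair to bytes;
# initialize does the same with marshal.dump into the cache file
blob = marshal.dumps((freq, total))

# deserialize it back; the pair round-trips losslessly
freq2, total2 = marshal.loads(blob)
```

marshal is faster than pickle for simple built-in types like dicts of strings and ints, which is presumably why jieba uses it for the cache.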
The initialize code
DICT_WRITING = {}

class Tokenizer(object):
    #...
    def initialize(self, dictionary=None):
        # abs_path is the absolute path of the dictionary.
        # if the caller passed a dictionary argument, abs_path must be updated;
        # otherwise use the self.dictionary already set in __init__()
        if dictionary:
            abs_path = _get_abs_path(dictionary)
            if self.dictionary == abs_path and self.initialized:
                # the dictionary is already loaded, so return
                return
            else:
                self.dictionary = abs_path
                self.initialized = False
        else:
            abs_path = self.dictionary

        # loading the dictionary must run to completion, so take the lock
        with self.lock:
            # this try-except looks like a no-op, but entering the with block
            # waits until any other thread currently writing the cache for
            # this dictionary has finished
            try:
                with DICT_WRITING[abs_path]:
                    pass
            except KeyError:
                pass
            # if self.initialized is True, the dictionary is already loaded;
            # return immediately
            if self.initialized:
                return

            default_logger.debug("Building prefix dict from %s ..." % (abs_path or 'the default dictionary'))
            t1 = time.time()
            # decide the name of the cache file
            if self.cache_file:
                cache_file = self.cache_file
            # default dictionary
            elif abs_path == DEFAULT_DICT:
                cache_file = "jieba.cache"
            # custom dictionary
            else:
                cache_file = "jieba.u%s.cache" % md5(
                    abs_path.encode('utf-8', 'replace')).hexdigest()
            # expand cache_file to its absolute path
            cache_file = os.path.join(
                self.tmp_dir or tempfile.gettempdir(), cache_file)
            # directory of the cache file
            # prevent absolute path in self.cache_file
            tmpdir = os.path.dirname(cache_file)

            load_from_cache_fail = True
            # load cache_file:
            # first check that cache_file exists and is a regular file;
            # if not, skip this part.
            # if it is, and a custom dictionary (not DEFAULT_DICT) is in use,
            # also require that cache_file was modified later than the custom
            # dictionary. when all conditions hold, load self.FREQ and
            # self.total from the cache file and set load_from_cache_fail
            # to False
            if os.path.isfile(cache_file) and (abs_path == DEFAULT_DICT or
                    # os.path.getmtime: last modification time of a file
                    os.path.getmtime(cache_file) > os.path.getmtime(abs_path)):
                default_logger.debug(
                    "Loading model from cache %s" % cache_file)
                try:
                    with open(cache_file, 'rb') as cf:
                        # marshal is introduced above
                        self.FREQ, self.total = marshal.load(cf)
                    load_from_cache_fail = False
                except Exception:
                    load_from_cache_fail = True

            # if loading cache_file failed, re-read the dictionary file to
            # obtain self.FREQ and self.total, then regenerate the cache file
            if load_from_cache_fail:
                # reuse the lock already stored in the DICT_WRITING dict if
                # there is one, so that concurrent initializations of the
                # same dictionary share a single lock
                wlock = DICT_WRITING.get(abs_path, threading.RLock())
                DICT_WRITING[abs_path] = wlock
                # another lock is needed here, guarding the file-writing block
                with wlock:
                    self.FREQ, self.total = self.gen_pfdict(self.get_dict_file())
                    default_logger.debug(
                        "Dumping model to file cache %s" % cache_file)
                    try:
                        # tmpdir is the cache-file directory chosen above
                        # prevent moving across different filesystems
                        fd, fpath = tempfile.mkstemp(dir=tmpdir)
                        # use marshal.dump to write the freshly built
                        # (self.FREQ, self.total) into temp_cache_file
                        with os.fdopen(fd, 'wb') as temp_cache_file:
                            marshal.dump(
                                (self.FREQ, self.total), temp_cache_file)
                        # rename the temporary file to cache_file
                        _replace_file(fpath, cache_file)
                    except Exception:
                        default_logger.exception("Dump cache file failed.")

                try:
                    del DICT_WRITING[abs_path]
                except KeyError:
                    pass

            # self.initialized is used later to check whether self.FREQ and
            # self.total have been given meaningful values
            self.initialized = True
            default_logger.debug(
                "Loading model cost %.3f seconds." % (time.time() - t1))
            default_logger.debug("Prefix dict has been built successfully.")
Only after initialize finishes does self.FREQ hold meaningful values; self.FREQ is then used during segmentation.
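As a concrete illustration of the cache naming above, here is the custom-dictionary branch in isolation. The dictionary path is made up, and md5 comes from hashlib (which is where jieba imports it from):

```python
import os
import tempfile
from hashlib import md5

abs_path = '/home/user/mydict.txt'  # hypothetical custom dictionary path

# custom dictionaries get a cache name derived from the md5 of their path,
# so different dictionaries never collide on the same cache file
cache_file = "jieba.u%s.cache" % md5(
    abs_path.encode('utf-8', 'replace')).hexdigest()

# the cache lives in the system temp directory (or self.tmp_dir if set)
cache_file = os.path.join(tempfile.gettempdir(), cache_file)
```

The default dictionary instead always maps to the fixed name "jieba.cache".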
The check_initialized function
This checks whether self.FREQ and self.total have been given meaningful values. If not, it calls initialize to load them from the dictionary.
class Tokenizer(object):
    #...
    def check_initialized(self):
        if not self.initialized:
            self.initialize()
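Taken together, __init__, initialize, and check_initialized implement a lazy-loading pattern: construction is cheap, and the expensive load happens at most once, on first use, guarded by a lock. A generic sketch of the same pattern (the class and attribute names are my own; the load_count attribute exists only to demonstrate the load runs once):

```python
import threading

class LazyResource(object):
    """Generic sketch of the __init__/initialize/check_initialized pattern."""

    def __init__(self):
        # cheap construction: just attributes, no loading
        self.lock = threading.RLock()
        self.initialized = False
        self.data = None
        self.load_count = 0

    def initialize(self):
        with self.lock:
            if self.initialized:          # another caller got here first
                return
            self.data = {'loaded': True}  # stand-in for the expensive load
            self.load_count += 1
            self.initialized = True

    def check_initialized(self):
        if not self.initialized:
            self.initialize()

    def query(self):
        self.check_initialized()  # every public entry point calls this
        return self.data

r = LazyResource()
r.query()
r.query()  # the expensive load ran exactly once
```

Re-checking self.initialized inside the lock is what makes this safe when several threads call query() at the same time.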
Core segmentation functions
The functions listed here implement the core functionality of Tokenizer; the following posts will introduce them one by one.
class Tokenizer(object):
    #...
    def calc(self, sentence, DAG, route):
        ...
    def get_DAG(self, sentence):
        ...
    def __cut_all(self, sentence):
        ...
    def __cut_DAG_NO_HMM(self, sentence):
        ...
    def __cut_DAG(self, sentence):
        ...
    def cut(self, sentence, cut_all=False, HMM=True):
        ...
    def cut_for_search(self, sentence, HMM=True):
        ...
Segmentation wrappers
Wrappers around the segmentation functions are defined here. The original segmentation functions return generators; the l-prefixed functions below call them and convert their return values to lists, which makes them easier to use.
class Tokenizer(object):
    #...
    def lcut(self, *args, **kwargs):
        return list(self.cut(*args, **kwargs))

    def lcut_for_search(self, *args, **kwargs):
        return list(self.cut_for_search(*args, **kwargs))

    _lcut = lcut
    _lcut_for_search = lcut_for_search

    def _lcut_no_hmm(self, sentence):
        return self.lcut(sentence, False, False)

    def _lcut_all(self, sentence):
        return self.lcut(sentence, True)

    def _lcut_for_search_no_hmm(self, sentence):
        return self.lcut_for_search(sentence, False)

    def tokenize(self, unicode_sentence, mode="default", HMM=True):
        ...
In addition, the tokenize function wraps each word returned by cut into a triple of (word, start position, end position) before yielding it.
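The wrapping itself is straightforward: walk through the words yielded by the segmenter while keeping a running character offset. A sketch with a fake segmenter that just splits the sentence in half (the real tokenize delegates to Tokenizer.cut):

```python
def fake_cut(sentence):
    # stand-in for Tokenizer.cut: pretend the segmenter found two words
    for w in (sentence[:2], sentence[2:]):
        if w:
            yield w

def tokenize(sentence):
    # mirror of the wrapping done by Tokenizer.tokenize:
    # turn each word into a (word, start, end) triple
    start = 0
    for w in fake_cut(sentence):
        width = len(w)
        yield (w, start, start + width)
        start += width

list(tokenize(u'永和服装'))
# [(u'永和', 0, 2), (u'服装', 2, 4)]
```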
Custom dictionaries
jieba supports user-defined dictionaries. Since this is not part of the core functionality, the relevant functions are only listed here without further commentary.
class Tokenizer(object):
    #...
    def load_userdict(self, f):
        '''
        Load personalized dict to improve detect rate.
        Parameter:
            - f : A plain text file contains words and their occurrences.
                  Can be a file-like object, or the path of the dictionary file,
                  whose encoding must be utf-8.
        Structure of dict file:
        word1 freq1 word_type1
        word2 freq2 word_type2
        ...
        Word type may be ignored
        '''
        self.check_initialized()
        if isinstance(f, string_types):
            f_name = f
            f = open(f, 'rb')
        else:
            f_name = resolve_filename(f)
        for lineno, ln in enumerate(f, 1):
            line = ln.strip()
            if not isinstance(line, text_type):
                try:
                    line = line.decode('utf-8').lstrip('\ufeff')
                except UnicodeDecodeError:
                    raise ValueError('dictionary file %s must be utf-8' % f_name)
            if not line:
                continue
            # match won't be None because there's at least one character
            word, freq, tag = re_userdict.match(line).groups()
            if freq is not None:
                freq = freq.strip()
            if tag is not None:
                tag = tag.strip()
            self.add_word(word, freq, tag)

    def add_word(self, word, freq=None, tag=None):
        """
        Add a word to dictionary.
        freq and tag can be omitted, freq defaults to be a calculated value
        that ensures the word can be cut out.
        """
        self.check_initialized()
        word = strdecode(word)
        freq = int(freq) if freq is not None else self.suggest_freq(word, False)
        self.FREQ[word] = freq
        self.total += freq
        if tag:
            self.user_word_tag_tab[word] = tag
        for ch in xrange(len(word)):
            wfrag = word[:ch + 1]
            if wfrag not in self.FREQ:
                self.FREQ[wfrag] = 0
        if freq == 0:
            finalseg.add_force_split(word)

    def del_word(self, word):
        """
        Convenient function for deleting a word.
        """
        self.add_word(word, 0)

    def suggest_freq(self, segment, tune=False):
        """
        Suggest word frequency to force the characters in a word to be
        joined or split.
        Parameter:
            - segment : The segments that the word is expected to be cut into,
                        If the word should be treated as a whole, use a str.
            - tune : If True, tune the word frequency.
        Note that HMM may affect the final result. If the result doesn't change,
        set HMM=False.
        """
        self.check_initialized()
        ftotal = float(self.total)
        freq = 1
        if isinstance(segment, string_types):
            word = segment
            for seg in self.cut(word, HMM=False):
                freq *= self.FREQ.get(seg, 1) / ftotal
            freq = max(int(freq * self.total) + 1, self.FREQ.get(word, 1))
        else:
            segment = tuple(map(strdecode, segment))
            word = ''.join(segment)
            for seg in segment:
                freq *= self.FREQ.get(seg, 1) / ftotal
            freq = min(int(freq * self.total), self.FREQ.get(word, 0))
        if tune:
            self.add_word(word, freq)
        return freq

    def set_dictionary(self, dictionary_path):
        with self.lock:
            abs_path = _get_abs_path(dictionary_path)
            if not os.path.isfile(abs_path):
                raise Exception("jieba: file does not exist: " + abs_path)
            self.dictionary = abs_path
            self.initialized = False
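The arithmetic in suggest_freq is easy to follow with toy numbers. In the "join" case, the suggested frequency must just beat the product of the probabilities of the segments the word currently splits into, so the whole word wins during path selection. A sketch with a made-up frequency table standing in for self.FREQ and self.total:

```python
# Toy stand-ins for self.FREQ and self.total (values made up for illustration)
FREQ = {u'台': 100, u'中': 200, u'台中': 0}
total = 1000.0

# join case: we want u'台中' to be cut out as one word, but the segmenter
# currently splits it into u'台' and u'中'.
# suggest_freq multiplies the probabilities of those segments...
freq = 1.0
for seg in (u'台', u'中'):
    freq *= FREQ.get(seg, 1) / total

# ...then suggests a frequency one above the equivalent count
suggested = max(int(freq * total) + 1, FREQ.get(u'台中', 1))
# (100/1000) * (200/1000) * 1000 = 20, so the suggestion is 21
```

The "split" case runs the same product in reverse: it caps the word's frequency at the product, so the path through the individual segments becomes at least as likely as the whole word.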
References
I don't understand os.name=='nt': what is nt and os.name
What is the difference between Lock and RLock
MULTITHREADING : USING LOCKS IN THE WITH STATEMENT (CONTEXT MANAGER)
tempfile — Generate temporary files and directories
Serializing Data Using the marshal Module