2021SC@SDUSC
一、gen_pfdict函数
在初始化的过程中,会构建前缀词典。构建前缀词典的入口函数是gen_pfdict(self, f),解析离线统计词典文本文件,每一行分别对应着词、词频、词性,将词和词频提取出来,以词为key,以词频为value,加入到前缀词典中。对于每个词,再分别获取它的前缀词,如果前缀词已经存在于前缀词典中,则不处理;如果该前缀词不在前缀词典中,则将其词频置为0,便于后续构建有向无环图。
def gen_pfdict(f):
lfreq = {}
ltotal = 0
f_name = resolve_filename(f)
for lineno, line in enumerate(f, 1):
try:
line = line.strip().decode('utf-8')
word, freq = line.split(' ')[:2]
freq = int(freq)
lfreq[word] = freq
ltotal += freq
for ch in xrange(len(word)):
wfrag = word[:ch + 1]
if wfrag not in lfreq:
lfreq[wfrag] = 0
except ValueError:
raise ValueError(
'invalid dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
f.close()
return lfreq, ltotal
为什么jieba没有使用trie树作为前缀词典存储的数据结构?
参考jieba中的issue–不用Trie,减少内存加快速度;优化代码细节 #187,本处直接引用该issue的comment,如下,
对于get_DAG()函数来说,用Trie数据结构,特别是在Python环境,内存使用量过大。
二、initialize函数
如果cache缓存中存在词库,就不会进入initialize。
利用get_DAG(sentence)函数获得待切分句子的DAG,首先检测(check_initialized)进程是否已经加载词库,若未初始化词库则调用Tolenizer中的initialize函数进行初始化,initialize中判断有无已经缓存的前缀词典cache_file文件,若有相应的cache文件则直接使用 marshal.load 方法加载前缀词典,若无则通过gen_pfdict对指定的词库dict.txt进行计算生成前缀词典,到jieba进程的初始化工作完成后就调用get_DAG获得句子的DAG;
def initialize(self, dictionary=None):
default_logger.debug("Building prefix dict from %s ..." % (abs_path or 'the default dictionary'))
t1 = time.time()
if self.cache_file:
cache_file = self.cache_file
# default dictionary
elif abs_path == DEFAULT_DICT:
cache_file = "jieba.cache"
# custom dictionary
else:
cache_file = "jieba.u%s.cache" % md5(
abs_path.encode('utf-8', 'replace')).hexdigest()
cache_file = os.path.join(
self.tmp_dir or tempfile.gettempdir(), cache_file)
# prevent absolute path in self.cache_file
tmpdir = os.path.dirname(cache_file)
load_from_cache_fail = True
if os.path.isfile(cache_file) and (abs_path == DEFAULT_DICT or
os.path.getmtime(cache_file) > os.path.getmtime(abs_path)):
default_logger.debug(
"Loading model from cache %s" % cache_file)
try:
with open(cache_file, 'rb') as cf:
self.FREQ, self.total = marshal.load(cf)
load_from_cache_fail = False
except Exception:
load_from_cache_fail = True
self.initialized = True
default_logger.debug(
"Loading model cost %.3f seconds." % (time.time() - t1))
default_logger.debug("Prefix dict has been built successfully.")
三、中文分词器
class ChineseTokenizer(Tokenizer):
def __call__(self, text, **kargs):
words = jieba.tokenize(text, mode="search")
token = Token()
for (w, start_pos, stop_pos) in words:
if not accepted_chars.match(w) and len(w) <= 1:
continue
token.original = token.text = w
token.pos = start_pos
token.startchar = start_pos
token.endchar = stop_pos
yield token