2021SC@SDUSC
Source code:
def add_word(self, word, freq=None, tag=None):
    """
    Add a word to dictionary.
    freq and tag can be omitted, freq defaults to be a calculated value
    that ensures the word can be cut out.
    """
    # Check whether the tokenizer has been initialized
    self.check_initialized()
    # Decode word into a unicode string
    word = strdecode(word)
    # Determine freq from the argument: if freq is None, use the return
    # value of suggest_freq(); otherwise use int(freq)
    freq = int(freq) if freq is not None else self.suggest_freq(word, False)
    # Add the word to the frequency dictionary
    self.FREQ[word] = freq
    self.total += freq
    # Record the part-of-speech tag
    if tag:
        self.user_word_tag_tab[word] = tag
    # Add every prefix of word that is not yet in the dictionary,
    # with a frequency of 0
    for ch in xrange(len(word)):
        wfrag = word[:ch + 1]
        if wfrag not in self.FREQ:
            self.FREQ[wfrag] = 0
    # A frequency of 0 is used to delete a word
    if freq == 0:
        finalseg.add_force_split(word)
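As before, the first step checks whether the jieba library has been initialized, because the dictionary is loaded only during initialization.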
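The initialization check can be illustrated with a minimal sketch of lazy loading. The class `LazyDict`, its sample entries, and the `load_count` counter are hypothetical and exist only for this illustration; they are not jieba's actual implementation.

```python
class LazyDict:
    """Hypothetical stand-in for a tokenizer that loads its dictionary lazily."""

    def __init__(self):
        self.initialized = False
        self.FREQ = {}
        self.load_count = 0  # how many times the dictionary was loaded

    def initialize(self):
        # Stand-in for reading the real dictionary file (made-up entries).
        self.FREQ = {"中国": 100, "北京": 80}
        self.load_count += 1
        self.initialized = True

    def check_initialized(self):
        # Load the dictionary only if it has not been loaded yet.
        if not self.initialized:
            self.initialize()

    def add_word(self, word, freq):
        # Make sure the dictionary exists before modifying it.
        self.check_initialized()
        self.FREQ[word] = freq


d = LazyDict()
d.add_word("云计算", 10)
d.add_word("大数据", 10)
```

In this sketch the dictionary is loaded once, on the first call to `add_word`, and both the loaded entries and the added words end up in `d.FREQ`.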
Source code of strdecode(sentence):
def strdecode(sentence):
    if not isinstance(sentence, text_type):
        try:
            sentence = sentence.decode('utf-8')
        except UnicodeDecodeError:
            sentence = sentence.decode('gbk', 'ignore')
    return sentence
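If sentence is not already a text (unicode) string, it is decoded as 'utf-8'; if that raises UnicodeDecodeError, it is decoded as 'gbk' instead, and any bytes that still cannot be decoded are ignored.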
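The decoding fallback can be tried in isolation with the sketch below. It assumes Python 3, where the text type is `str`; the function name `decode_text` and the sample string are chosen here for illustration and are not part of jieba.

```python
def decode_text(data):
    """Return text, decoding bytes as UTF-8 first and GBK as a fallback."""
    if not isinstance(data, str):
        try:
            data = data.decode("utf-8")
        except UnicodeDecodeError:
            # Bytes that are not valid UTF-8 are retried as GBK;
            # bytes that GBK cannot decode are dropped instead of raising.
            data = data.decode("gbk", "ignore")
    return data


utf8_bytes = "结巴分词".encode("utf-8")
gbk_bytes = "结巴分词".encode("gbk")

from_utf8 = decode_text(utf8_bytes)   # decoded on the first attempt
from_gbk = decode_text(gbk_bytes)     # decoded by the GBK fallback
unchanged = decode_text("结巴分词")    # already text, returned as-is
```

All three calls yield the same text, so callers can pass either text or bytes in one of the two encodings.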
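If freq is None, add_word calls suggest_freq(word, False) to obtain a frequency high enough for the word to be segmented out as a whole, and then stores word in the FREQ dictionary with that frequency.
If tag is None (or any other falsy value, such as an empty string), the word's part-of-speech tag is not added to the self.user_word_tag_tab dictionary.
In other words, if you only want to add a word and have it recognized, you can omit its frequency from the custom dictionary entirely.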
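The bookkeeping described above can be sketched on a plain dict. This is a simplified model, not jieba's code: the standalone function, the `total` argument, and the `suggested` parameter (a hypothetical stand-in for the value returned by `suggest_freq(word, False)`) are assumptions made for illustration.

```python
def add_word(freq_table, total, word, freq=None, suggested=1):
    """Mimic the bookkeeping steps of add_word on a plain dict.

    `suggested` is a hypothetical stand-in for suggest_freq(word, False).
    Returns the updated total frequency.
    """
    # Use the given frequency if there is one, otherwise the suggested value.
    freq = int(freq) if freq is not None else suggested
    freq_table[word] = freq
    total += freq
    # Register every missing prefix of the word with frequency 0.
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix not in freq_table:
            freq_table[prefix] = 0
    return total


freq_table = {"云": 50}
total = 50
total = add_word(freq_table, total, "云计算")              # frequency omitted
total = add_word(freq_table, total, "大数据", freq="20")   # frequency given as a string
```

After these calls the existing entry "云" keeps its frequency of 50, while the new prefixes "云计", "大", and "大数" are present with frequency 0.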
Deleting a word:
Source code:
def del_word(self, word):
    """
    Convenient function for deleting a word.
    """
    # Set the frequency to 0, which makes add_word call
    # finalseg.add_force_split(word)
    self.add_word(word, 0)
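The sketch below models deletion as "set the frequency to 0". It is a simplification under stated assumptions: `force_split_words` is a hypothetical stand-in for whatever `finalseg.add_force_split` records, and `is_word` assumes that lookups treat a zero frequency as "not a word".

```python
force_split_words = set()  # hypothetical stand-in for the state kept by finalseg


def add_word(freq_table, word, freq):
    freq_table[word] = freq
    if freq == 0:
        # A zero frequency marks the word as deleted.
        force_split_words.add(word)


def del_word(freq_table, word):
    # Deleting is just adding the word again with frequency 0.
    add_word(freq_table, word, 0)


def is_word(freq_table, word):
    # Assumption: an entry with frequency 0 counts as absent.
    return bool(freq_table.get(word, 0))


freq_table = {"自定义词": 5}
was_word = is_word(freq_table, "自定义词")
del_word(freq_table, "自定义词")
still_word = is_word(freq_table, "自定义词")
```

Note that the key is not removed from the table in this model; it stays with a frequency of 0, so only lookups that check the frequency see the word as deleted.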