Chinese Word Segmentation Algorithms

柳成荫~

Published 2021-12-28 19:47:45

Article link: https://blog.csdn.net/qq_41335232/article/details/122201056

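Segmentation algorithms

Rule-based segmentation

Maximum matching

Forward maximum matching fixes a maximum word length and, at each step, scans from the current position for the longest dictionary word within that limit. If no multi-character word matches, a single character is emitted, so the scan always advances.

Worked example

Maximum word length: 4. Here s1 is the remaining input, s2 is the output so far, and w is the candidate word being tested.

s1	s2	w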
结合成分子时	null	结合成分
结合成分子时	null	结合成
结合成分子时	null	结合
成分子时	结合/	成分子时
成分子时	结合/	成分子
成分子时	结合/	成分
子时	结合/成分/	子时
子时	结合/成分/	子
时	结合/成分/子/	时
null	结合/成分/子/时	null

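Implementation

First, the code that builds the word-probability vocabulary; the other algorithms in this article reuse it.

import json
from collections import Counter

class Tokenizer:
    def __init__(self, path, delta=0, trained=False) -> None:
        self.vocab = self.get_vocab(path, delta, trained)

    def get_vocab(self, path, delta, trained):
        # trained=True: path is a JSON file of word probabilities saved by an earlier run
        if trained:
            with open(path, mode='r', encoding='utf-8') as f:
                return json.load(f)
        # Otherwise path is a corpus of whitespace-separated "word/tag" tokens
        words = []
        with open(path, mode='r', encoding='utf-8') as f:
            for content in f.read().split():
                words.append(content.split("/")[0])  # keep the word, drop the tag
        counter = Counter(words)
        # Add-delta smoothing: P(w) = (count + delta) / (N + V * delta)
        total = len(words) + len(counter) * delta
        vocab = {key: (value + delta) / total for key, value in counter.items()}
        # Cache the probabilities so that later runs can pass trained=True
        with open("chinese_word_split/data/vocab.json", mode='w', encoding='utf-8') as f:
            json.dump(vocab, f, ensure_ascii=False)
        return vocab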
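
The normalisation in get_vocab is add-delta (Lidstone) smoothing. As a quick check of the formula, here is a minimal standalone sketch on a made-up six-token corpus; the function name smoothed_probs and the toy data are illustrative assumptions, not part of the original code.

```python
from collections import Counter

def smoothed_probs(words, delta=1.0):
    """Add-delta smoothing: P(w) = (count(w) + delta) / (N + V * delta)."""
    counts = Counter(words)
    total = len(words)        # N: number of tokens in the corpus
    vocab_size = len(counts)  # V: number of distinct words
    denom = total + vocab_size * delta
    return {w: (c + delta) / denom for w, c in counts.items()}

# Made-up toy corpus (assumption): 6 tokens, 5 distinct words
toy_words = ["结合", "成", "分子", "时", "结合", "成分"]
probs = smoothed_probs(toy_words, delta=1.0)

print(probs["结合"])        # (2 + 1) / (6 + 5 * 1) = 3/11
print(sum(probs.values()))  # probabilities of the seen words sum to 1 (up to rounding)
```

Because V counts only the words that were actually seen, a word that never occurs in the corpus still has no entry in the vocabulary.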

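def max_match(s, vocab, max_word_length=4):
    """Forward maximum matching: greedily take the longest word found in vocab."""
    words = []
    while len(s) > 0:
        length = min(max_word_length, len(s))
        for i in range(length):
            w = s[0:length - i]  # try the longest candidate first, then shrink
            # A single character is always accepted, so the loop always advances
            if w in vocab or len(w) == 1:
                words.append(w)
                s = s[length - i:]
                break
    return words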

if __name__ == '__main__':
    # Forward slashes work on every platform and avoid backslash-escape problems
    tokenizer = Tokenizer('chinese_word_split/data/PeopleDaily_clean.txt', delta=1)
    print(max_match("结合成分子时", tokenizer.vocab))
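
The corpus file is not needed to try the algorithm. The sketch below runs forward maximum matching on a small hypothetical vocabulary and records every candidate it tests, so the trace can be compared with the s1/s2/w table. The helper name max_match_trace and the toy vocabulary are assumptions made for illustration.

```python
def max_match_trace(s, vocab, max_word_length=4):
    """Forward maximum matching that also records every candidate it tests."""
    words, tried = [], []
    while s:
        length = min(max_word_length, len(s))
        for size in range(length, 0, -1):  # longest candidate first
            w = s[:size]
            tried.append(w)
            if w in vocab or size == 1:    # a single character always matches
                words.append(w)
                s = s[size:]
                break
    return words, tried

# Hypothetical toy vocabulary (assumption), standing in for the corpus vocabulary
toy_vocab = {"结合", "合成", "成分", "分子"}
words, tried = max_match_trace("结合成分子时", toy_vocab)

print("/".join(words))  # 结合/成分/子/时
print(tried)            # the w column of the table, top to bottom
```

The greedy result 结合/成分/子/时 differs from the 结合/成/分子/时 produced by the maximum-probability method, because maximum matching never reconsiders a word once it has been accepted.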


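Fewest-words segmentation

Omitted, because it is nearly identical to the maximum-probability method: giving every word the same cost (a constant cost of 1 per word instead of the negative log-probability) turns the maximum-probability method into fewest-words segmentation. The only difference is whether the graph is weighted or unweighted.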
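
To make the unit-cost idea concrete, here is a minimal sketch of fewest-words segmentation as a shortest-path dynamic programme in which every word costs 1. The function name fewest_words and the toy vocabulary are assumptions; this is one possible formulation, not code from the original article.

```python
def fewest_words(s, vocab, max_word_length=4):
    """Segment s into the smallest number of words (every word costs 1)."""
    n = len(s)
    best = [0] + [float("inf")] * n  # best[k]: fewest words covering s[:k]
    back = [0] * (n + 1)             # back[k]: start of the last word ending at k
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_length), end):
            w = s[start:end]
            # Single characters are always allowed, as in the other algorithms
            if (len(w) == 1 or w in vocab) and best[start] + 1 < best[end]:
                best[end] = best[start] + 1
                back[end] = start
    # Follow the back pointers from the end of the sentence
    words, k = [], n
    while k > 0:
        words.insert(0, s[back[k]:k])
        k = back[k]
    return words

# Hypothetical toy vocabulary (assumption)
toy_vocab = {"结合", "合成", "成分", "分子"}
result = fewest_words("结合成分子时", toy_vocab)
print(result, len(result))
```

With this vocabulary the minimum is four words, and several four-word segmentations tie; which one is returned depends only on the order in which ties are broken. Replacing the constant 1 with a per-word negative log-probability gives the weighted variant.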

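Maximum-probability method

Extracting candidate words (with predecessor clues)

Worked example

Maximum word length: 4. Each accepted candidate is annotated with its key (start position, word length - 1).

s1	s2	w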
结合成分子时	null	结
结合成分子时	结(0,0)/	结合
结合成分子时	结/结合(0,1)/	结合成
结合成分子时	结/结合/	结合成分
合成分子时	结/结合/	合
合成分子时	结/结合/合(1,0)/	合成
合成分子时	结/结合/合/合成(1,1)/	合成分
合成分子时	结/结合/合/合成/	合成分子
成分子时	结/结合/合/合成/	成
成分子时	结/结合/合/合成/成(2,0)/	成分
成分子时	结/结合/合/合成/成/成分(2,1)/	成分子
成分子时	结/结合/合/合成/成/成分/	成分子时
分子时	结/结合/合/合成/成/成分/	分
分子时	结/结合/合/合成/成/成分/分(3,0)/	分子
分子时	结/结合/合/合成/成/成分/分/分子(3,1)/	分子时
子时	结/结合/合/合成/成/成分/分/分子/	子
子时	结/结合/合/合成/成/成分/分/分子/子(4,0)/	子时
时	结/结合/合/合成/成/成分/分/分子/子/	时
null	结/结合/合/合成/成/成分/分/分子/子/时(5,0)/	null

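Implementation

# Method of the Tokenizer class
def get_candidates(self, s, max_word_length=4):
    candidates = {}  # (start position, word length - 1) -> word
    j = 0            # start position of the current window in the original sentence
    while len(s) > 0:
        for i in range(min(max_word_length, len(s))):
            w = s[0:i + 1]
            # Single characters are always candidates, so every position is covered
            if len(w) == 1 or (w in self.vocab):
                candidates[(j, i)] = w
        s = s[1:]    # move the window one character to the right
        j += 1
    return candidates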

tokenizer = Tokenizer('chinese_word_split/data/PeopleDaily_clean.txt', delta=1)
print(tokenizer.get_candidates("结合成分子时"))
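
The same extraction can be tried without the corpus. The following standalone sketch uses a hypothetical toy vocabulary chosen so that the output matches the worked example; the free function and the vocabulary are assumptions, not the article's Tokenizer method.

```python
def get_candidates(s, vocab, max_word_length=4):
    """Map (start, length - 1) to every candidate word found in s."""
    candidates = {}
    for start in range(len(s)):
        for extra in range(min(max_word_length, len(s) - start)):
            w = s[start:start + extra + 1]
            if extra == 0 or w in vocab:  # single characters are always kept
                candidates[(start, extra)] = w
    return candidates

# Hypothetical toy vocabulary (assumption)
toy_vocab = {"结合", "合成", "成分", "分子"}
candidates = get_candidates("结合成分子时", toy_vocab)
for key, word in candidates.items():
    print(key, word)
```

A key (start, extra) also gives the position of the word's last character, start + extra, which is what the predecessor search relies on.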


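Finding predecessors

Worked example

The predecessor information is encoded in the candidate keys. A key (index1, index2) means the word starts at position index1 and ends at position index1 + index2. For example, the key of 分 is (3,0), so the predecessors of a word starting at position 3 are found as follows:

index = 3 - 1 = 2; every key (index1, index2) with index1 + index2 = index is a potential predecessor

(index1,index2) = (2,0) => candidates => 成

(index1,index2) = (1,1) => candidates => 合成

(index1,index2) = (0,2) => candidates => not present, so it is discarded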
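
The lookup above can be sketched as a small standalone function. The candidate table is hard-coded from the worked example, and the helper name predecessor_keys is an assumption; unlike the article's get_prior, this sketch returns candidate keys rather than list positions, and an empty list instead of [-1].

```python
# Candidate table from the worked example: (start, length - 1) -> word
candidates = {
    (0, 0): "结", (0, 1): "结合", (1, 0): "合", (1, 1): "合成",
    (2, 0): "成", (2, 1): "成分", (3, 0): "分", (3, 1): "分子",
    (4, 0): "子", (5, 0): "时",
}

def predecessor_keys(start, candidates):
    """Keys of the candidates that end immediately before position `start`."""
    if start == 0:
        return []                   # words at position 0 have no predecessor
    end = start - 1                 # a predecessor must end here
    keys = []
    for extra in range(end + 1):
        key = (end - extra, extra)  # key[0] + key[1] == end
        if key in candidates:
            keys.append(key)
    return keys

for key in predecessor_keys(3, candidates):
    print(key, candidates[key])     # (2, 0) 成, then (1, 1) 合成
```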

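Implementation

# Method of the DAG class
def get_prior(self, index):
    # Words starting at position 0 have no predecessor; -1 marks "none"
    if index == 0:
        return [-1]
    index = index - 1  # a predecessor must end at the previous character
    priors = []
    for i in range(index + 1):
        temp = (index - i, i)  # start + (length - 1) == index
        if temp in self.candidates:
            # Convert the candidate into its position in self.words
            priors.append(self.candidates_index[self.candidates[temp]])
    return priors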

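Building the DAG

The DAG is stored as a static linked list: the nodes sit in one array, and each node refers to its predecessors by array position. In the table, index is the position of the word's first character, and the prob values are arbitrary placeholders.

text	prob	best	priors	index
结	20	-1	[-1]	0
结合	20	-1	[-1]	0
合	20	-1	[0]	1
…	…	…	…	…
子	20	-1	[5,6]	4
时	20	-1	[7,8]	5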
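
As an illustration of the static-linked-list layout, here is a minimal sketch that builds the node array from the candidate table of the worked example. The Node dataclass, build_nodes, and the uniform placeholder probabilities are assumptions; the sketch also uses an empty priors list, rather than [-1], for words with no predecessor.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str    # candidate word
    index: int   # position of the word's first character
    cost: float  # negative log-probability
    priors: list = field(default_factory=list)  # array positions of predecessors
    best: int = -1  # array position of the best predecessor, -1 if none

def build_nodes(candidates, probs):
    """Turn the candidate table into an array of nodes linked by position."""
    position = {key: i for i, key in enumerate(candidates)}
    nodes = []
    for (start, extra), word in candidates.items():
        # A predecessor (s, e) must end right before `start`: s + e == start - 1
        priors = [position[(start - 1 - e, e)]
                  for e in range(start)
                  if (start - 1 - e, e) in position]
        nodes.append(Node(word, start, -math.log(probs[word]), priors))
    return nodes

candidates = {
    (0, 0): "结", (0, 1): "结合", (1, 0): "合", (1, 1): "合成",
    (2, 0): "成", (2, 1): "成分", (3, 0): "分", (3, 1): "分子",
    (4, 0): "子", (5, 0): "时",
}
probs = {word: 0.1 for word in candidates.values()}  # placeholder probabilities
nodes = build_nodes(candidates, probs)
for i, node in enumerate(nodes):
    print(i, node.text, node.index, node.priors)
```

Looking positions up by candidate key rather than by word text is a deliberate choice in this sketch: it keeps the links unambiguous when the same word occurs more than once in a sentence.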

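Implementation

import numpy as np

class DAG:
    class Word:
        def __init__(self, prob, priors, text, index) -> None:
            self.best = -1  # position of the best predecessor in DAG.words
            self.priors = priors  # positions of all predecessors in DAG.words
            self.prob = -np.log(prob)  # probability converted to a negative-log cost
            self.index = index  # position of the word's first character
            self.text = text  # the candidate word

    def __init__(self, candidates, vocab) -> None:
        self.words = []
        index = 0
        self.candidates = candidates
        # Maps word text to its position in self.words; this assumes that no
        # candidate text occurs twice in the sentence.
        self.candidates_index = {value: i for i, value in enumerate(candidates.values())}
        # Assumption: a single character missing from vocab gets the smallest known probability
        floor = min(vocab.values())
        for (index, _), candidate in candidates.items():
            prob = vocab.get(candidate, floor)
            word = DAG.Word(prob, self.get_prior(index), candidate, index)
            self.words.append(word)
        # Start position of the last candidate, i.e. the last character of the sentence
        self.final_index = index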

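Forward computation

Worked example

The cost of a word is the negative log of its probability. The cumulative cost of a word is its own cost plus the smallest cumulative cost among its predecessors.

Word	Cost	Predecessors	Best predecessor	Cumulative cost
结	3.573	Null	Null	0+3.573=3.573
结合	3.543	Null	Null	0+3.543=3.543
合	3.518	结	结	3.573+3.518=7.091
合成	4.194	结	结	3.573+4.194=7.767
成	2.800	合, 结合	结合	3.543+2.800=6.343
成分	3.908	合, 结合	结合	3.543+3.908=7.451
分	2.862	成, 合成	成	6.343+2.862=9.205
分子	3.465	成, 合成	成	6.343+3.465=9.808
子	3.304	分, 成分	成分	7.451+3.304=10.755
子时	6.000	分, 成分	成分	7.451+6.000=13.451
时	2.478	子, 分子	分子	9.808+2.478=12.286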
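
The table can be reproduced with a few lines of standalone code. The costs are copied from the table; the list layout (parallel lists, with predecessors given as list positions) and the function name forward are assumptions chosen for brevity.

```python
# Rows of the table, in order; predecessors are given as positions in this list
words = ["结", "结合", "合", "合成", "成", "成分", "分", "分子", "子", "子时", "时"]
cost = [3.573, 3.543, 3.518, 4.194, 2.800, 3.908, 2.862, 3.465, 3.304, 6.000, 2.478]
priors = [[], [], [0], [0], [2, 1], [2, 1], [4, 3], [4, 3], [6, 5], [6, 5], [8, 7]]

def forward(cost, priors):
    """Return cumulative costs and the best predecessor of every word."""
    total = list(cost)
    best = [-1] * len(cost)  # -1 means "no predecessor"
    for i, candidates in enumerate(priors):
        if candidates:
            # Predecessors always appear earlier in the list, so their totals are final
            best[i] = min(candidates, key=lambda p: total[p])
            total[i] += total[best[i]]
    return total, best

total, best = forward(cost, priors)
for w, t, b in zip(words, total, best):
    print(w, round(t, 3), words[b] if b != -1 else "Null")
```

The single left-to-right pass is sufficient because candidates are listed by start position, so every predecessor is processed before the words that follow it.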

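Implementation

def forward(self):
    # Forward pass: accumulate costs and record the best predecessor of each word
    for word in self.words:
        min_prob = 0  # a word at the start of the sentence adds no predecessor cost
        min_prior = -1
        for prior in word.priors:
            if prior == -1:
                continue  # -1 is the "no predecessor" marker, not a list index
            if min_prior == -1 or self.words[prior].prob < min_prob:
                min_prob = self.words[prior].prob
                min_prior = prior
        word.prob += min_prob
        word.best = min_prior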

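Backtracking

Worked example

Find the best end point

The candidate end words are 子时 and 时. 时 has the lower cumulative cost (12.286 versus 13.451), so 时 is the end point.

Backtrack along the best predecessors: 时 => 分子 => 成 => 结合, which gives the segmentation 结合/成/分子/时.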
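
The two steps can be sketched in isolation, using the cumulative costs and best predecessors from the forward-computation table as hard-coded input. The parallel-list layout and the function name backtrack are assumptions made for illustration.

```python
# Values taken from the forward-computation table
words = ["结", "结合", "合", "合成", "成", "成分", "分", "分子", "子", "子时", "时"]
start = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5]  # position of each word's first character
total = [3.573, 3.543, 7.091, 7.767, 6.343, 7.451, 9.205, 9.808, 10.755, 13.451, 12.286]
best = [-1, -1, 0, 0, 1, 1, 4, 4, 5, 5, 7]  # best predecessor, -1 if none

def backtrack(words, start, total, best, sentence_length):
    """Pick the cheapest word ending at the sentence end, then follow `best`."""
    finals = [i for i, w in enumerate(words)
              if start[i] + len(w) == sentence_length]
    node = min(finals, key=lambda i: total[i])
    path = []
    while node != -1:
        path.append(words[node])
        node = best[node]
    return path[::-1]  # collected end-to-start, so reverse it

segmentation = backtrack(words, start, total, best, len("结合成分子时"))
print("/".join(segmentation))  # 结合/成/分子/时
```

An end word is identified by where it ends, not where it starts, which is why both 子时 (starting at 4) and 时 (starting at 5) are considered.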

Implementation

def backward(self):
    # Among all words that end at the last character, pick the one with the lowest cumulative cost
    min_final_word = None
    for word in self.words:
        # word.index is the start position, so the end position is index + length - 1
        if word.index + len(word.text) - 1 == self.final_index:
            if min_final_word is None or word.prob < min_final_word.prob:
                min_final_word = word
    # Backtrack from that end point along the best predecessors
    results = []
    results.append(min_final_word.text)
    while min_final_word.best != -1:
        min_final_word = self.words[min_final_word.best]
        results.insert(0, min_final_word.text)
    return results

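if __name__ == '__main__':
    tokenizer = Tokenizer('chinese_word_split/data/PeopleDaily_clean.txt', delta=1)
    dag = DAG(tokenizer.get_candidates("结合成分子时"), tokenizer.vocab)
    dag.forward()          # fill in cumulative costs and best predecessors
    print(dag.backward())  # backtrack from the cheapest end word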