Chinese Word Segmentation: Reverse Maximum Matching (RMM)

程裕强

Categories: Natural Language Processing, Python

Article link: https://blog.csdn.net/chengyuqiang/article/details/102719876

1. Dictionary

./data/rmm_dic.utf8

南京市
南京市长
长江大桥
人民解放军
大桥
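
The dictionary is a plain UTF-8 text file with one word per line. As a minimal sketch of how such a file can be read, the snippet below writes the five sample entries to a temporary file and loads them back. The helper name `load_dictionary` and the temporary directory are assumptions made for illustration; they are not part of the original program.

```python
import os
import tempfile

def load_dictionary(dic_path):
    """Read one word per line; return (set of words, length of the longest word)."""
    words = set()
    with open(dic_path, 'r', encoding='utf8') as f:
        for line in f:
            line = line.strip()  # drop surrounding whitespace and the newline
            if line:             # skip blank lines
                words.add(line)
    longest = max((len(w) for w in words), default=0)
    return words, longest

# Hypothetical setup: write the sample entries to a temporary file
# so that the sketch runs without ./data/rmm_dic.utf8 being present.
sample = ["南京市", "南京市长", "长江大桥", "人民解放军", "大桥"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rmm_dic.utf8")
    with open(path, 'w', encoding='utf8') as f:
        f.write("\n".join(sample) + "\n")
    words, longest = load_dictionary(path)

print(len(words), longest)
```

The longest entry, 人民解放军, has 5 characters, so the matcher never needs to try a candidate longer than 5.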

2. The RMM algorithm

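# Reverse maximum matching (RMM)
class RMM(object):
    def __init__(self, dic_path):
        self.dictionary = set()
        self.maximum = 0  # length of the longest word in the dictionary
        # Load the dictionary, one word per line
        with open(dic_path, 'r', encoding='utf8') as f:
            for line in f:
                # Strip leading and trailing whitespace (including the newline)
                line = line.strip()
                if not line:
                    continue
                self.dictionary.add(line)
                if len(line) > self.maximum:
                    self.maximum = len(line)

    def cut(self, text):
        result = []
        # index marks the end of the part of text that is still unsegmented
        index = len(text)
        while index > 0:
            print('index='+str(index))
            word = None
            # range(start, stop[, step]): try candidate lengths from longest to shortest
            for size in range(self.maximum, 0, -1):
                print('size='+str(size))
                # The slice would start at index - size; skip sizes longer than the remaining text
                if index - size < 0:
                    continue
                # Slice backwards from index to get the longest candidate word
                piece = text[(index - size):index]
                print('piece='+str(piece))
                # The candidate is in the dictionary: accept it as a word
                if piece in self.dictionary:
                    word = piece
                    result.append(word)
                    # Move the boundary left past the matched word
                    index -= size
                    break
            if word is None:
                # No dictionary word ends here: keep the single character as a token
                # instead of silently dropping it
                result.append(text[index - 1])
                index -= 1
        # Words were collected from right to left, so reverse them
        return result[::-1]

def main():
    text = "南京市长江大桥"
    tokenizer = RMM('./data/rmm_dic.utf8')
    print(tokenizer.cut(text))

if __name__ == '__main__':
    main()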

Output

index=7
size=5
piece=市长江大桥
size=4
piece=长江大桥
index=3
size=5
size=4
size=3
piece=南京市
['南京市', '长江大桥']

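Walkthrough:
1. Initial values
Length of the string to segment: index = 7
Length of the longest dictionary word: maximum = 5
2. First pass of the outer while loop
index = 7
(1) First pass of the for loop
size = 5
The longest candidate (5 characters) is 市长江大桥, which is not a word; continue with the next pass
(2) Second pass of the for loop
size = 4
The candidate is 长江大桥, which is in the dictionary
Remaining length to segment: index = 3
Break out of the for loop
3. Second pass of the outer while loop
index = 3
(1) First pass of the for loop
size = 5, larger than the remaining length 3, so it is skipped
(2) Second pass of the for loop
size = 4, larger than the remaining length 3, so it is skipped
(3) Third pass of the for loop
size = 3, equal to the remaining length 3
The candidate is 南京市, which is in the dictionary
Remaining length to segment: index = 0
Break out of the for loop
4. Third check of the outer while loop
The condition index > 0 no longer holds, so the loop exits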
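
The steps above can be condensed into a small function without the debug printing. This is a sketch under stated assumptions, not the original program: the function name `rmm_cut` is hypothetical, the dictionary is passed as an in-memory set rather than read from a file, candidate lengths are capped at the remaining text length, and a character that ends no dictionary word is emitted as a single-character token.

```python
def rmm_cut(text, dictionary):
    """Segment text by reverse maximum matching against a set of words."""
    # Longest dictionary word; at least 1 so that a single character is always tried.
    max_len = max([len(w) for w in dictionary] + [1])
    result = []
    index = len(text)  # end of the still-unsegmented prefix
    while index > 0:
        # Try the longest candidate that fits, then shorter ones.
        for size in range(min(max_len, index), 0, -1):
            piece = text[index - size:index]
            # Assumption: fall back to a single character when nothing matches.
            if size == 1 or piece in dictionary:
                result.append(piece)
                index -= size
                break
    return result[::-1]  # words were collected right to left

words = {"南京市", "南京市长", "长江大桥", "人民解放军", "大桥"}
print(rmm_cut("南京市长江大桥", words))  # ['南京市', '长江大桥']
print(rmm_cut("在南京市", words))        # ['在', '南京市']
```

Capping the candidate length at `min(max_len, index)` avoids the skipped size=5 and size=4 passes seen in the second while pass of the walkthrough, without changing the result.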
