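This is a widely used mechanical word segmentation method. It relies on a segmentation dictionary and one basic evaluation principle, "longest word first", to split text into words. The principle is reasonable in most cases, but it also causes some segmentation errors. The method needs minimal language resources (only a word list, with no morphological, syntactic, or semantic knowledge), is simple to implement, and has a short development cycle, which makes it a simple and practical approach.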
Below is a simple Python implementation of the MM algorithm:
# Dictionary format: {entry1: category1, entry2: category2, ..., entryN: categoryM}
dict = {}
max_len = 0  # length of the longest entry in dict

def segment(str):
    '''Implements the MM (Maximum Matching) method.

    Using a predefined dictionary, this function segments the given
    string str into entries, each tagged with its category from the
    dictionary. A list of (entry, category) pairs is returned.
    '''
    global dict, max_len
    ret = []
    c = 0
    while c < len(str):
        # Longest candidate: bounded by max_len and by the remaining text
        tl = min(max_len, len(str) - c)
        while tl > 1:
            t = str[c:c+tl]
            if t in dict:  # match
                ret.append((t, dict[t]))
                break
            else:  # truncate the last char
                tl -= 1
        else:  # no match: emit a single character
            tl = 1  # always advance, even when max_len is 0
            ret.append((str[c], 'CHAR'))
        c += tl
    return ret
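To illustrate the kind of segmentation error that "longest word first" can cause, here is a minimal sketch run on a well-known ambiguous phrase. It is a compact restatement of forward maximum matching rather than the listing above. The function name `mm_segment` and the toy `lexicon` are assumptions made up for this illustration.

```python
def mm_segment(text, lexicon):
    """Forward maximum matching: at each position take the longest lexicon entry."""
    max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - pos), 1, -1):
            if text[pos:pos + size] in lexicon:
                break
        else:
            size = 1  # no multi-character entry matched; emit one character
        tokens.append(text[pos:pos + size])
        pos += size
    return tokens

# Hypothetical toy lexicon, chosen only to trigger the ambiguity.
lexicon = {"研究", "研究生", "生命", "起源"}
print(mm_segment("研究生命的起源", lexicon))
# prints ['研究生', '命', '的', '起源']
```

The intended reading is 研究 / 生命 / 的 / 起源 ("study the origin of life"), but the greedy match takes 研究生 ("graduate student") first and leaves 命 stranded as a single character. With only a word list available, the method has no way to detect that the shorter first word yields a better overall segmentation.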
MSN Space Link: http://spaces.msn.com/vanzolo/blog/cns!4A43F3D396FBF12F!1186.entry?_c11_blogpart_blogpart=blogview&_c=blogpart#permalink