Dictionary-Based Word Segmentation: The Shortest-Path Method

lalalalala

Article link: https://blog.csdn.net/lalalalala/article/details/725850

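The goal of the shortest-path method is to produce as few words as possible after segmentation. It first matches every dictionary word that occurs in the sentence, then builds a directed acyclic graph (DAG) in which each word is an edge (with weight 1) and each gap between words (a candidate cut point) is a node. The graph has a single start node (the beginning of the sentence) and a single end node (the end of the sentence). A shortest path from start to end is a segmentation with the fewest words.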
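
As a small illustration of this graph construction, the sketch below lists the word edges for a toy sentence. The dictionary `toy_dict` and the helper name `build_edges` are assumptions made for this example and do not come from the original post.

```python
def build_edges(sentence, words):
    """Return (start, end) edges, one per dictionary word found in the sentence.

    Nodes are the cut points 0..len(sentence); every edge has weight 1.
    """
    edges = []
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            if sentence[start:end] in words:
                edges.append((start, end))
    return edges


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"研究", "研究生", "生命", "命", "起源"}
sentence = "研究生命起源"
for start, end in build_edges(sentence, toy_dict):
    print(start, end, sentence[start:end])
```

Each printed line is one edge of the DAG: the two cut points it connects and the word it spans.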

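Like forward maximum matching, the shortest-path method needs nothing more than a word list, yet it gives better results. The main reason is that it takes into account how neighbouring words fit together, extending the segmentation process from one dimension to two. The price is higher time complexity, and to keep the graph connected the matching step must also fall back to single characters.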
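
The difference from forward maximum matching can be seen on a small case. The sketch below is only an illustration: the function names, the toy dictionary, and the sentence are assumptions, and the shortest path is computed with a simple left-to-right dynamic program (valid here because the cut points are already in topological order) rather than with the implementation shown later in this post.

```python
def forward_max_match(sentence, words, max_len):
    """Greedy baseline: always take the longest dictionary word at the cursor."""
    tokens, pos = [], 0
    while pos < len(sentence):
        for size in range(min(max_len, len(sentence) - pos), 0, -1):
            piece = sentence[pos:pos + size]
            if size == 1 or piece in words:  # a single char is always accepted
                tokens.append(piece)
                pos += size
                break
    return tokens


def fewest_tokens(sentence, words, max_len):
    """Shortest path over cut points; single chars keep the graph connected."""
    n = len(sentence)
    best = [0] + [n + 1] * n  # best[i]: fewest tokens covering sentence[:i]
    prev = [0] * (n + 1)      # prev[i]: start of the last token on that path
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            is_edge = end - start == 1 or sentence[start:end] in words
            if is_edge and best[start] + 1 < best[end]:
                best[end], prev[end] = best[start] + 1, start
    tokens, end = [], n
    while end > 0:
        tokens.append(sentence[prev[end]:end])
        end = prev[end]
    return tokens[::-1]


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"研究", "研究生", "生命科学", "科学"}
print(forward_max_match("研究生命科学", toy_dict, max_len=4))
print(fewest_tokens("研究生命科学", toy_dict, max_len=4))
```

On this input the greedy method commits to 研究生 and is left with a stray 命, giving three tokens, while the shortest path finds the two-token split 研究 / 生命科学.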

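The shortest-path method is easy to extend with word frequencies. With word frequency as the edge weight and 2-gram frequency as the node weight, a frequency-aware shortest-path segmentation follows directly. Where the data is available, N-gram frequencies can be applied to path selection, and even lexical information can be brought in, so that the best path is extracted by matching against a lexical graph.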
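
A minimal sketch of the word-frequency extension follows. It covers edge weights only, not the 2-gram node weights. Turning a frequency into a cost via the negative log of the relative frequency is one common choice and an assumption here, as are the `freq` table, its counts, and the fallback count of 1 for single characters missing from the table.

```python
import math


def weighted_segment(sentence, freq, max_len):
    """Minimum-cost path where each word edge costs -log(relative frequency)."""
    total = sum(freq.values())

    def cost(word):
        # Single chars missing from freq get a count of 1 (assumed smoothing).
        return -math.log(freq.get(word, 1) / total)

    n = len(sentence)
    best = [0.0] + [math.inf] * n  # best[i]: lowest cost covering sentence[:i]
    prev = [0] * (n + 1)           # prev[i]: start of the last token on that path
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            if end - start > 1 and word not in freq:
                continue  # multi-char strings are edges only if in the table
            candidate = best[start] + cost(word)
            if candidate < best[end]:
                best[end], prev[end] = candidate, start
    tokens, end = [], n
    while end > 0:
        tokens.append(sentence[prev[end]:end])
        end = prev[end]
    return tokens[::-1]


# Hypothetical counts, for illustration only.
freq = {"研究": 50, "研究生": 5, "生命": 40, "命": 2, "起源": 20}
print(weighted_segment("研究生命起源", freq, max_len=3))
```

With unit weights the two three-token splits of this sentence tie; with these assumed counts the frequent words 研究 and 生命 make 研究 / 生命 / 起源 the cheaper path.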

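The shortest-path method also helps with segmenting proper nouns. Segmenting a proper noun often produces short words or single characters, so such edges can be identified before path selection and given a lower weight, which makes path selection favour them.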
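
The effect of lowering the weight of such edges can be sketched as follows. How the edges are recognised is outside the scope of this example: `cheap_edges` is a hypothetical, hand-written stand-in for the output of a name recogniser, and the weight 0.5 is an arbitrary assumption.

```python
def segment_with_edge_weights(sentence, words, max_len, cheap_edges, cheap=0.5):
    """Minimum-cost path; edges in cheap_edges cost `cheap` instead of 1."""
    n = len(sentence)
    best = [0.0] + [float("inf")] * n
    prev = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if end - start > 1 and sentence[start:end] not in words:
                continue  # not an edge
            weight = cheap if (start, end) in cheap_edges else 1.0
            if best[start] + weight < best[end]:
                best[end], prev[end] = best[start] + weight, start
    tokens, end = [], n
    while end > 0:
        tokens.append(sentence[prev[end]:end])
        end = prev[end]
    return tokens[::-1]


# Hypothetical inputs: 王 and 明 are assumed to be flagged as name characters.
toy_dict = {"明天", "天天", "跑步"}
name_edges = {(0, 1), (1, 2)}
print(segment_with_edge_weights("王明天天跑步", toy_dict, 2, name_edges))
```

With unit weights, 王 / 明天 / 天 / 跑步 and 王 / 明 / 天天 / 跑步 both cost 4. Once the two flagged single-character edges are cheaper, the second path wins and the name characters stay separate from 明天.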

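A shortest-path algorithm returns only one result path. Other paths with the same weight are discarded, and so are the k shortest paths that come close to the optimum, which often means the correct segmentation is lost. These problems can be addressed by using a k-shortest-paths algorithm instead.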
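
The loss of tied paths is easy to demonstrate. The sketch below is not a general k-shortest-paths algorithm; it only keeps every predecessor that achieves the minimum, so that all segmentations with the fewest tokens can be listed. The function name and the toy dictionary are assumptions for this example, and the recursive expansion can grow quickly when a sentence has many ties.

```python
def all_fewest_token_segmentations(sentence, words, max_len):
    """Return every segmentation that uses the minimum number of tokens."""
    n = len(sentence)
    best = [0] + [n + 1] * n
    preds = [[] for _ in range(n + 1)]  # preds[i]: starts of optimal last tokens
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if end - start > 1 and sentence[start:end] not in words:
                continue  # not an edge
            count = best[start] + 1
            if count < best[end]:
                best[end], preds[end] = count, [start]
            elif count == best[end]:
                preds[end].append(start)  # keep the tie instead of dropping it

    def expand(end):
        if end == 0:
            return [[]]
        return [path + [sentence[start:end]]
                for start in preds[end] for path in expand(start)]

    return expand(n)


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"研究", "研究生", "生命", "起源"}
for tokens in all_fewest_token_segmentations("研究生命起源", toy_dict, max_len=3):
    print(" / ".join(tokens))
```

For this sentence two three-token segmentations tie, and a single-path search returns only one of them depending on its tie-breaking order.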

Below is a simple Python implementation of the shortest-path method:

''' Implements the SPM (Shortest Path Matching) method.
'''

import re

# {entry1: category1, entry2: category2, ..., entryN: categoryM}
word_dict = {}
# a string containing the delimiting punctuation characters
punc = ''
# length of the longest entry in word_dict
max_len = 0

def segment(text):
    ''' Segment the given string so that the number of tokens after
    segmentation is minimal. The algorithm can be summarized as follows:
    1. use the delimiting punctuation to split the string into short sentences.
    2. pick the first sentence.
    3. find all known words in the picked sentence.
    4. organize all words into a DAG.
    5. find the shortest path from the start to the end, which is the segmentation we want.
    6. pick the next sentence and repeat from 3 until all sentences have been processed.
    '''
    ret = []
    if punc:
        # escape the delimiters so they are safe inside a character class
        cls = re.escape(punc)
        pattern = '([^%s]+)([%s]*)' % (cls, cls)
    else:
        # no delimiters configured: treat the whole string as one sentence
        pattern = '(.+)()'
    re_sent = re.compile(pattern, re.DOTALL)
    for match in re_sent.finditer(text):
        sen = match.group(1)
        dag = organize(sen)
        path = find_path(dag)
        for i, l in path:
            ret.append(sen[i:i + l])
        # append the punctuation after the sentence, if any
        # NOTICE: a run of consecutive punctuation is kept as a single token
        if match.group(2):
            ret.append(match.group(2))
    return ret

def organize(sentence):
    ''' Find all known words in the given sentence and organize them into a DAG.
    Each node of the DAG is represented by a list of hops:
    [hop1, hop2]
    A hop is the distance from this node to the next one. A node can have
    several hops, one for each way to segment the chars that follow it.
    There is an ending node, [0], to make traversing the DAG easier.
    The DAG itself is a list of such nodes:
    [[2, 5], ..., [0]]
    '''
    dag = []
    n = len(sentence)
    for c in range(n):
        hops = []
        # try the longest candidate first, then truncate one char and retry
        for tl in range(min(max_len, n - c), 1, -1):
            if sentence[c:c + tl] in word_dict:
                hops.append(tl)
        # a single char is always a hop, which keeps the graph connected
        hops.append(1)
        dag.append(hops)

    dag.append([0])
    return dag

def find_path(dag):
    ''' Uses the standard Dijkstra algorithm to find the shortest path
    in the given DAG. Returns the path in this format:
    [(0, 2), (2, 3), (5, 1), (6, 4)]
    Each tuple is (n, l), where n is the start index of the token and
    l is the length of the token.
    '''
    size = len(dag)
    wt = [size + 1] * size   # best known distance from the start to each node
    done = [False] * size    # nodes whose distance is final
    pre = [-1] * size        # predecessor of each node on its best path
    wt[0] = 0
    es = [0]                 # frontier: nodes reached but not yet finalized
    while es:
        # pick the frontier node with the smallest distance
        c = min(es, key=lambda e: wt[e])
        es.remove(c)
        done[c] = True

        for hop in dag[c]:
            t = c + hop
            if hop == 0 or done[t]:
                continue
            d = wt[c] + 1    # every edge (word) has weight 1
            if d < wt[t]:
                wt[t] = d
                pre[t] = c
                if t not in es:
                    es.append(t)
    # walk back from the ending node to recover the path
    c = size - 1
    path = []
    while pre[c] != -1:
        path.append((pre[c], c - pre[c]))
        c = pre[c]
    path.reverse()
    return path

MSN Space Link: http://spaces.msn.com/vanzolo/blog/cns!4A43F3D396FBF12F!1198.entry?_c11_blogpart_blogpart=blogview&_c=blogpart#permalink