#Translation# An Introduction to Suffix Trees

Written on January 6, 2013 | Category: Data Structures

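After reading a great many unreliable introductions to the suffix tree, I found this article, the best one I have come across online so far. It walks through the construction of a whole suffix tree using three rules, combines text with diagrams, and is very easy to follow. It also respects the terminology of Ukkonen's original paper and clearly explains every concept that appears in a suffix tree. I spent three hours translating it to share with fellow learners; some parts have been modified or dropped.

The main text follows:

I will first try to explain Ukkonen's algorithm using a simple string (one with no repeated characters), and then describe the complete algorithm.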

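First, a few preliminary statements.

1. What we are building is basically a search trie, so there is a root node. Edges leave it and lead to new nodes, and so on down to the leaf nodes.

2. Unlike in a search trie, however, the edge labels are not single characters. Instead, each edge is labelled with a pair of integers [from, to], which are indices into the input string. Each edge therefore carries a substring of arbitrary length but takes only O(1) space (the pair of indices).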
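To make the [from, to] labelling concrete, here is a minimal C++ sketch. The names `EdgeLabel` and `label_text` are hypothetical, and the sketch assumes a half-open range (from inclusive, to exclusive), which is the convention the # symbol follows in the walkthrough.

```cpp
#include <cassert>
#include <string>

// Hypothetical edge label: two indices into the input text, half-open [from, to).
struct EdgeLabel {
    int from;  // index of the first character on the edge
    int to;    // index one past the last character on the edge
};

// Recover the substring an edge stands for. The edge itself stores only two
// integers, no matter how long this substring is.
std::string label_text(const std::string& text, EdgeLabel e) {
    return text.substr(e.from, e.to - e.from);
}
```

For the string abcabxabcd, the label {1, 6} stands for bcabx, yet the edge still occupies only two integers.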

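Basic principle

I will first demonstrate how to build the suffix tree of a string with no repeated characters:

abc

The algorithm works in steps, from left to right. Each step processes one character of the input string. A step may involve more than one individual operation, but the total number of operations is O(n).

We start from the left and insert the character a into the suffix tree, labelling the new edge [0, #]. This means the edge represents the substring that starts at index 0 and ends at index #. (I use the symbol # for the current end; its value is now 1, right after the a.)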

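So we have the initial suffix tree: the root with a single edge labelled [0, #], which at this point represents the substring a.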

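Now we move on to index 2, the character b. Our goal at each step is to bring every suffix up to the current index. We can do this by:

1. Extending the existing a edge so that it becomes ab;

2. Inserting one new edge for b.

The tree then has two edges leaving the root, labelled [0, #] and [1, #], which represent ab and b.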

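We observe two things:

  1. The edge for ab has the same label as in the initial tree: [0, #]. Its meaning changes automatically, because all we did was update # to 2;
  2. Each edge needs only O(1) space, because it records nothing more than a pair of integer indices.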
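The effect of the shared # can be sketched in a few lines of C++. This is only an illustration for strings without repeated characters; the type `OpenLeaves` and its members are names I made up for the sketch, not part of the original answer.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Every leaf edge stores only its start index and reads its end from one
// shared counter, so a single increment extends all leaves at once.
struct OpenLeaves {
    std::string text;
    int end = 0;                  // plays the role of "#"
    std::vector<int> leaf_start;  // start index of each leaf edge

    // Process the next character of a string with no repeated characters:
    // add one new leaf edge, then bump "#" (which extends every leaf).
    void step() {
        leaf_start.push_back(end);
        ++end;
    }

    // Current label of leaf i, i.e. text[leaf_start[i], end).
    std::string label(int i) const {
        return text.substr(leaf_start[i], end - leaf_start[i]);
    }
};
```

After two calls to `step()` on abc the two leaves read ab and b; after a third call they read abc, bc and c, although no stored edge was touched.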

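Next we increment # again and insert the character c. We append c to every existing edge in the tree and then insert one new edge for the suffix c.

The tree now has three edges leaving the root, labelled [0, #], [1, #] and [2, #], which represent abc, bc and c.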

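We observe:

  1. After each step the tree is a correct suffix tree for the string read so far;
  2. There are as many steps as there are characters in the string;
  3. Every operation takes O(1) time.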

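First extension: simple repetitions

The algorithm above works fine for that input, so let's look at a more complex string:

abcabxabcd

Steps 1 to 3: these proceed exactly as in the previous example.

Step 4: we move # to index 4. This automatically updates all existing edges, which now read abca, bca and ca.

Next we need to insert the final suffix of the current step, a, at the root.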

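Before we do this, we introduce two more variables (in addition to #):

  1. The active point, which is a triple (active_node, active_edge, active_length);
  2. The remainder, an integer that records how many new suffixes still need to be inserted.

The exact meaning of these two will become clearer as we go; for now we can say this:

  1. In the abc example the active point was always (root, None, 0). In other words, whenever a new edge had to be inserted, it was inserted at the root.
  2. The remainder was set to 1 at the beginning of each step. That is, the number of suffixes we had to insert was 1, namely the single new character itself.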

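Now these two variables are going to change. When we insert the current final character a at the root, we notice that there is already an outgoing edge starting with a, namely abca. In this situation we do the following:

  1. We do not insert a new edge at the root;
  2. Instead, we note that the suffix a is already in the tree, so we leave the tree alone;
  3. We set the active point to (root, 'a', 1). This means the active point now lies on the outgoing edge of the root that starts with a, after 1 character on that edge;
  4. We increment remainder, which is now 2.

Note: when the final suffix we need to insert already exists in the tree, we do not change the tree at all; we only update the active point and remainder. The tree is then no longer a strictly standard suffix tree, but it still contains all suffixes, because the final suffix a is contained implicitly.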

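Step 5: we move # to index 5. This automatically updates the edges to abcab, bcab and cab.

Because remainder is 2, we need to insert the two final suffixes of the current index: ab and b. This is because:

  1. The suffix a from the previous step was never properly inserted; it was only contained implicitly in the edge abca. So it has been carried over, and in the current step it has grown from a to ab.
  2. We also need to insert the new final suffix b.

In practice this means we go to the active point (which points to just after the a on the edge that is now abcab) and insert the final character b. But it turns out that b is already present on that same edge.

So we do nothing to the tree. We only:

  1. Update the active point to (root, 'a', 2) (same node and edge as before, but now pointing to just after the b, so active_length becomes 2);
  2. Increment remainder to 3, because again we have not inserted anything.

To be clear: we had to insert ab and b in the current step, but because ab was already present we updated the active point and did not even try to insert b. Why? Because if ab is in the tree, every suffix of it must be in the tree too. It may only be contained implicitly, but it must be there, because of the way we have built the tree step by step.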

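Step 6: we move # to index 6. The edges are automatically updated to abcabx, bcabx and cabx.

Because remainder is 3, we have to insert abx, bx and x. The active point tells us where ab ends, so we only need to jump there and insert x. More precisely, x is not there yet, so we split the edge abcabx and insert an internal node: the edge now reads ab, and the new node has two outgoing edges, cabx and x.

Edges are still represented as pairs of integer indices, so splitting and inserting take O(1) time.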
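A split of this kind can be sketched as follows. The types `Edge` and `Node` and the function `split_and_add` are hypothetical names for this illustration; children are kept in a `std::map` for brevity, so each lookup costs a logarithm of the alphabet size rather than strictly O(1).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Edge { int from; int to; int child; };  // label text[from, to) and target node
struct Node { std::map<char, Edge> out; };     // outgoing edges keyed by first character

// Split the edge leaving `parent` whose label starts with `c` after `len`
// characters, then hang a new leaf edge [pos, end) off the new internal node.
// Only indices are rewritten, so the work does not depend on label length.
int split_and_add(std::vector<Node>& nodes, const std::string& text,
                  int parent, char c, int len, int pos, int end) {
    Edge old = nodes[parent].out.at(c);            // copy before the vector grows
    int mid = static_cast<int>(nodes.size());
    nodes.push_back(Node{});                       // new internal node
    int leaf = static_cast<int>(nodes.size());
    nodes.push_back(Node{});                       // new leaf node
    nodes[parent].out[c] = Edge{old.from, old.from + len, mid};
    nodes[mid].out[text[old.from + len]] = Edge{old.from + len, old.to, old.child};
    nodes[mid].out[text[pos]] = Edge{pos, end, leaf};
    return mid;
}
```

Applied to step 6, splitting the a edge of the root after 2 characters turns [0, 6) into [0, 2) for ab, and the new node receives [2, 6) for cabx and [5, 6) for x.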

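So far we have dealt with the suffix abx and decremented remainder to 2. Now we need to insert the next remaining suffix, bx. But before we do that we need to update the active point.

The rule that applies after splitting and inserting an edge is called Rule 1. It applies when the active node is the root; for other cases there is Rule 3, which is introduced later.
Here is Rule 1. After an insertion from the root:

  1. active_node remains unchanged;
  2. active_edge is set to the first character of the next suffix to insert, which is b in this case;
  3. active_length is decremented.

The new active point triple (root, 'b', 1) indicates that the next insertion has to be made on the edge bcabx, after the b in this case. We check whether x is already present there. If it were, we would end the current step and do nothing. Since it is not, we split the edge and insert the character.

Again this takes O(1) time. We update remainder to 1 and, following Rule 1, set the active point to (root, 'x', 0).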

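There is one more thing we need to do, which brings us to Rule 2:

If we split an edge and insert a new node, and that node is not the first node created during the current step, we connect the previously created node to the new node. This connection is called a suffix link. We will see later why it is so useful. Suffix links are drawn as dotted lines.

After inserting the suffix bx, the node created for abx gets a suffix link to the node created for bx.

At this point we still have the suffix x to insert. Because the active_length in the active point (root, 'x', 0) is 0, this last suffix is inserted directly at the root, since no edge there starts with x.

With that, the three remaining suffixes abx, bx and x have all been inserted.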

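Step 7: we set # to 7, which automatically appends the next character, a, to all leaf edges. Then we try to insert the new final character at the active point (root, 'x', 0) and find that a is already there. So we insert nothing, and only update the active point to (root, 'a', 1) and increment remainder, which is now 2.

Step 8: # = 8. Because remainder is 2, we need to insert ab and b. We try to insert ab and, as in the earlier example, only need to update the active point to (root, 'a', 2) and increment remainder to 3, because b is already there. This time we notice that the active point now sits at the end of an edge. Let us call that node node1 (the node where the edge ab ends); the active point can then be rewritten as (node1, None, 0).

Step 9: # = 9. Here we reach the last difficulty in understanding the suffix tree.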

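Second extension: using suffix links

# now covers the character c, which is automatically appended to the leaf edges, and we go to the active point to see whether we can insert c. It turns out that c already exists there, so we set the active point to (node1, 'c', 1), increment remainder and do nothing else.

Step 10: after several steps of incrementing, remainder is now 4. So in step 10 we first need to insert abcd, which is left over from the previous three steps (a, b and c were each left uninserted), by inserting d at the active point.

We insert d at the active point. (In the original figure the active node is marked in red.)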

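Here is the last rule, Rule 3:

After splitting an edge from an active_node that is not the root, we follow the suffix link leaving that node, if there is one, and set active_node to the node it points to. If there is no suffix link, we set active_node to the root. active_edge and active_length remain unchanged.

So after applying Rule 3 the active point is (node2, 'c', 1), where node2 is the node reached through the suffix link from node1.

Since the suffix abcd has been inserted, we decrement remainder to 3 and go on to insert bcd. Rule 3 has already moved the active point to node2, so we only need to insert d at the active point.

Inserting d triggers Rule 2, so we must create a suffix link.

Observe that suffix links let us reset the active point so that the next insertion can be made in O(1) time.

Step 10 is not finished yet, because remainder is 2. We apply Rule 3 again to reset the active point. The current active_node (marked in red in the original figure) has no suffix link, so we reset it to the root, and the active point becomes (root, 'c', 1).

That is, the next insertion happens on the outgoing edge of the root that starts with c. So we insert d there.

After decrementing, remainder is 1. Rule 2 applies again, and we add a new suffix link from the previously created node to the new one.

Finally, with remainder at 1 and the active node being the root, we use Rule 1 to update the active point to (root, 'd', 0). This means the last insertion adds d at the root.

With that, all steps are complete.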

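Some final observations:

  1. In each step we increment #. This automatically updates all leaf nodes in O(1) time.
  2. This does not, however, deal with the suffixes left over from earlier steps, which are only contained implicitly.
  3. remainder tells us how many additional insertions we still need to make, and the active point tells us exactly where to make them.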

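To summarise the three rules:
Rule 1:
After an insertion from the root:

  1. active_node remains unchanged;
  2. active_edge is set to the first character of the next suffix to insert (b in the example above);
  3. active_length is decremented.

Rule 2:
If we split an edge and insert a new node, and that node is not the first node created during the current step, we connect the previously created node to the new node. This connection is called a suffix link (drawn as a dotted line).

Rule 3:
After splitting an edge from an active_node that is not the root, we follow the suffix link leaving that node, if there is one, and set active_node to the node it points to. If there is no suffix link, we set active_node to the root. active_edge and active_length remain unchanged.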
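The three rules can be put together in a compact C++ sketch. This is one possible arrangement, not the code of the original answer: all names are my own, children are stored in a `std::map` for brevity, `active_edge` holds an index into the text rather than a character, and a suffix link value of 0 stands for "no link, go to the root".

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct SuffixTree {
    static constexpr int kOpen = 1 << 30;   // "to" of a leaf edge: stands for "#"
    struct Node {
        int from, to;                       // edge entering this node is text[from, to)
        int link;                           // suffix link target; 0 (root) when absent
        std::map<char, int> next;           // children keyed by first edge character
    };
    std::string text;
    std::vector<Node> nodes;                // nodes[0] is the root
    int active_node = 0, active_edge = 0, active_length = 0;
    int remainder = 0;
    int pos = -1;                           // index of the last character read
    int need_link = 0;                      // node waiting for a suffix link (0 = none)

    SuffixTree() { nodes.push_back(Node{0, 0, 0, {}}); }

    int new_node(int from, int to) {
        nodes.push_back(Node{from, to, 0, {}});
        return static_cast<int>(nodes.size()) - 1;
    }
    int edge_length(int v) const { return std::min(nodes[v].to, pos + 1) - nodes[v].from; }
    void add_link(int v) {                  // Rule 2
        if (need_link > 0) nodes[need_link].link = v;
        need_link = v;
    }
    bool walk_down(int v) {                 // active point lies past this edge: move onto its node
        int len = edge_length(v);
        if (active_length < len) return false;
        active_edge += len; active_length -= len; active_node = v;
        return true;
    }
    void extend(char c) {                   // one step of the algorithm
        text.push_back(c); ++pos;
        need_link = 0; ++remainder;
        while (remainder > 0) {
            if (active_length == 0) active_edge = pos;
            char e = text[active_edge];
            auto it = nodes[active_node].next.find(e);
            if (it == nodes[active_node].next.end()) {
                int leaf = new_node(pos, kOpen);            // new leaf edge at a node
                nodes[active_node].next[e] = leaf;
                add_link(active_node);
            } else {
                int nxt = it->second;
                if (walk_down(nxt)) continue;
                if (text[nodes[nxt].from + active_length] == c) {
                    ++active_length;                        // already present: stop this step
                    add_link(active_node);
                    break;
                }
                int split = new_node(nodes[nxt].from, nodes[nxt].from + active_length);
                nodes[active_node].next[e] = split;
                int leaf = new_node(pos, kOpen);
                nodes[split].next[c] = leaf;
                nodes[nxt].from += active_length;
                nodes[split].next[text[nodes[nxt].from]] = nxt;
                add_link(split);
            }
            --remainder;
            if (active_node == 0 && active_length > 0) {    // Rule 1
                --active_length;
                active_edge = pos - remainder + 1;
            } else {                                        // Rule 3
                active_node = nodes[active_node].link;
            }
        }
    }
    bool contains(const std::string& p) const {             // substring test from the root
        int v = 0;
        std::size_t i = 0;
        while (i < p.size()) {
            auto it = nodes[v].next.find(p[i]);
            if (it == nodes[v].next.end()) return false;
            v = it->second;
            int end = std::min(nodes[v].to, pos + 1);
            for (int k = nodes[v].from; k < end && i < p.size(); ++k, ++i)
                if (text[k] != p[i]) return false;
        }
        return true;
    }
};
```

Feeding abcabxabcd to `extend` one character at a time should leave remainder at 0 and produce 16 nodes: the root, five internal nodes (ab, b, abc, bc, c) and ten leaves, which agrees with the walkthrough above.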

References:
http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english


The problem

Matching string sequences is a problem that computer programmers face on a regular basis. Some programming tasks, such as data compression or DNA sequencing, can benefit enormously from improvements in string matching algorithms. This article discusses a relatively unknown data structure, the suffix tree, and shows how its characteristics can be used to attack difficult string matching problems.

Imagine that you've just been hired as a programmer working on a DNA sequencing project. Researchers are busy slicing and dicing viral genetic material, producing fragmented sequences of nucleotides. They send these sequences to your server, which is then expected to locate the sequences in a database of genomes. The genome for a given virus can have hundreds of thousands of nucleotide bases, and you have hundreds of viruses in your database. You are expected to implement this as a client/server project that gives real-time feedback to the impatient Ph.D.s. What's the best way to go about it?

It is obvious at this point that a brute force string search is going to be terribly inefficient. This type of search would require you to perform a string comparison at every single nucleotide in every genome in your database. Testing a long fragment that has a high hit rate of partial matches would make your client/server system look like an antique batch processing machine. Your challenge is to come up with an efficient string matching solution.

The intuitive solution

Since the database that you are testing against is invariant, preprocessing it to simplify the search seems like a good idea. One preprocessing approach is to build a search trie. For searching through input text, a straightforward approach to a search trie yields a thing called a suffix trie. (The suffix trie is just one step away from my final destination, the suffix tree.) A trie is a type of tree that has N possible branches from each node, where N is the number of characters in the alphabet. The word 'suffix' is used in this case to refer to the fact that the trie contains all of the suffixes of a given block of text (perhaps a viral genome.)



Figure 1
The Suffix Trie Representing "BANANAS"

Figure 1 shows a Suffix trie for the word BANANAS. There are two important facts to note about this trie. First, starting at the root node, each of the suffixes of BANANAS is found in the trie, starting with BANANAS, ANANAS, NANAS, and finishing up with a solitary S. Second, because of this organization, you can search for any substring of the word by starting at the root and following matches down the tree until exhausted.

The second point is what makes the suffix trie such a nice construct. If you have an input text of length N, and a search string of length M, a traditional brute force search will take as many as N*M character comparisons to complete. Optimized searching techniques, such as the Boyer-Moore algorithm, can guarantee searches that require no more than M+N comparisons, with even better average performance. But the suffix trie demolishes this performance by requiring just M character comparisons, regardless of the length of the text being searched!
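A naive suffix trie and its search routine can be sketched in C++ as follows. The names `TrieNode`, `build_suffix_trie` and `trie_contains` are made up for this illustration, and the children are held in a `std::map`, so each of the M steps is a map lookup rather than a single comparison.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Naive suffix trie: one node per distinct substring, one character per edge.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> child;
};

// Insert every suffix of `text`, character by character.
std::unique_ptr<TrieNode> build_suffix_trie(const std::string& text) {
    auto root = std::make_unique<TrieNode>();
    for (std::size_t i = 0; i < text.size(); ++i) {
        TrieNode* node = root.get();
        for (std::size_t j = i; j < text.size(); ++j) {
            auto& slot = node->child[text[j]];
            if (!slot) slot = std::make_unique<TrieNode>();
            node = slot.get();
        }
    }
    return root;
}

// Substring test: at most pattern.size() steps, independent of the text length.
bool trie_contains(const TrieNode& root, const std::string& pattern) {
    const TrieNode* node = &root;
    for (char c : pattern) {
        auto it = node->child.find(c);
        if (it == node->child.end()) return false;
        node = it->second.get();
    }
    return true;
}
```

The nested loop in `build_suffix_trie` is where the cost goes: every suffix is walked in full, so building the trie is quadratic in the length of the text even though each search afterwards is short.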

Remarkable as this might seem, it means I could determine if the word BANANAS was in the collected works of William Shakespeare by performing just seven character comparisons. Of course, there is just one little catch: the time needed to construct the trie.

The reason you don't hear much about the use of suffix tries is the simple fact that constructing one requires O(N^2) time and space. This quadratic performance rules out the use of suffix tries where they are needed most: to search through long blocks of data.

Under the spreading suffix tree

A reasonable way past this dilemma was proposed by Edward McCreight in 1976, when he published his paper on what came to be known as the suffix tree. The suffix tree for a given block of data retains the same topology as the suffix trie, but it eliminates nodes that have only a single descendant. This process, known as path compression, means that individual edges in the tree now may represent sequences of text instead of single characters.



Figure 2
The Suffix Tree Representing "BANANAS"

Figure 2 shows what the suffix trie from Figure 1 looks like when converted to a suffix tree. You can see that the tree still has the same general shape, just far fewer nodes. By eliminating every node with just a single descendant, the count is reduced from 23 to 11.
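The two node counts can be checked by brute force. The sketch below uses a hypothetical helper, `count_nodes`, that enumerates every distinct substring (each one is a trie node, with the empty string as the root) and records which characters can follow it; path compression keeps the root plus every node that does not have exactly one child. It is quadratic and only meant for short inputs.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Returns {suffix trie node count, suffix tree node count} for `text`.
std::pair<int, int> count_nodes(const std::string& text) {
    // For every distinct substring w, collect the distinct characters that
    // follow some occurrence of w. These are w's children in the trie.
    std::map<std::string, std::set<char>> followers;
    const std::size_t n = text.size();
    for (std::size_t i = 0; i <= n; ++i)
        for (std::size_t len = 0; i + len <= n; ++len) {
            std::set<char>& f = followers[text.substr(i, len)];
            if (i + len < n) f.insert(text[i + len]);
        }
    int trie_nodes = static_cast<int>(followers.size());
    int tree_nodes = 0;
    for (const auto& entry : followers)
        if (entry.first.empty() || entry.second.size() != 1)
            ++tree_nodes;  // root, leaves and branching nodes survive compression
    return {trie_nodes, tree_nodes};
}
```

For BANANAS this gives 23 and 11: the root, seven leaves (one per suffix, since the final S occurs only once) and the three branching nodes A, NA and ANA.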

In fact, the reduction in the number of nodes is such that the time and space requirements for constructing a suffix tree are reduced from O(N^2) to O(N). In the worst case, a suffix tree can be built with a maximum of 2N nodes, where N is the length of the input text. So for a one-time investment proportional to the length of the input text, we can create a tree that turbocharges our string searches.

Even you can make a tree

McCreight's original algorithm for constructing a suffix tree had a few disadvantages. Principal among them was the requirement that the tree be built in reverse order, meaning characters were added from the end of the input. This ruled the algorithm out for online processing, making it much more difficult to use for applications such as data compression.

Twenty years later, Esko Ukkonen from the University of Helsinki came to the rescue with a slightly modified version of the algorithm that works from left to right. Both my sample code and the descriptions that follow are based on Ukkonen's work, published in the September 1995 issue of Algorithmica.

For a given string of text, T, Ukkonen's algorithm starts with an empty tree, then progressively adds each of the N prefixes of T to the suffix tree. For example, when creating the suffix tree for BANANAS, B is inserted into the tree, then BA, then BAN, and so on. When BANANAS is finally inserted, the tree is complete.



Figure 3
Progressively Building the Suffix Tree

Suffix tree mechanics

Adding a new prefix to the tree is done by walking through the tree and visiting each of the suffixes of the current tree. We start at the longest suffix (BAN in Figure 3), and work our way down to the shortest suffix, which is the empty string. Each suffix ends at a node of one of these three types:

  • A leaf node. In Figure 4, the nodes labeled 1, 2, 4, and 5 are leaf nodes.
  • An explicit node. The non-leaf nodes that are labeled 0 and 3 in Figure 4 are explicit nodes. They represent a point on the tree where two or more edges part ways.
  • An implicit node. In Figure 4, prefixes such as BO, BOO, and OO all end in the middle of an edge. These positions are referred to as implicit nodes. They would represent nodes in the suffix trie, but path compression eliminated them. As the tree is built, implicit nodes are sometimes converted to explicit nodes.



Figure 4
BOOKKEEPER after adding BOOK

In Figure 4, there are five suffixes in the tree (including the empty string) after adding BOOK to the structure. Adding the next prefix, BOOKK, to the tree means visiting each of the suffixes in the existing tree, and adding the letter K to the end of the suffix.

The first four suffixes, BOOK, OOK, OK, and K, all end at leaf nodes. Because of the path compression applied to suffix trees, adding a new character to a leaf node will always just add to the string on that node. It will never create a new node, regardless of the letter being added.

After all of the leaf nodes have been updated, we still need to add character 'K' to the empty string, which is found at node 0. Since there is already an edge leaving node 0 that starts with letter K, we don't have to do anything. The newly added suffix K will be found at node 0, and will end at the implicit node found one character down along the edge leading to node 2.

The final shape of the resulting tree is shown in Figure 5.



Figure 5
The same tree after adding BOOKK

Things get knotty

Updating the tree in Figure 4 was relatively easy. We performed two types of updates: the first was simply the extension of an edge, and the second was an implicit update, which involved no work at all. Adding BOOKKE to the tree shown in Figure 5 will demonstrate the two other types of updates. In the first type, a new node is created to split an existing edge at an implicit node, followed by the addition of a new edge. The second type of update consists of adding a new edge to an explicit node.



Figure 6
The Split and Add Update

When adding BOOKKE to the tree in Figure 5, we once again start with the longest suffix, BOOKK, and work our way to the shortest, the empty string. Updating the longer suffixes is trivial as long as we are updating leaf nodes. In Figure 5, the suffixes that end in leaf nodes are BOOKK, OOKK, OKK, and KK. The first tree in Figure 6 shows what the tree looks like after these suffixes have been updated using the simple string extension.

The first suffix in Figure 5 that doesn't terminate at a leaf node is K. When updating a suffix tree, the first non-leaf node is defined as the active point of the tree. All of the suffixes that are longer than the suffix defined by the active point will end in leaf nodes. None of the suffixes after this point will terminate in leaf nodes.

The suffix K terminates in an implicit node part way down the edge defined by KKE. When testing non-leaf nodes, we need to see if they have any descendants that match the new character being appended. In this case, that would be E.

A quick look at the first K in KKE shows that it only has a single descendant: K. So this means we have to add a descendant to represent the letter E. This is a two-step process. First, we split the edge holding the arc so that it has an explicit node at the end of the suffix being tested. The middle tree in Figure 6 shows what the tree looks like after the split.

Once the edge has been split, and the new node has been added, you have a tree that looks like that in the third position of Figure 6. Note that the K node, which has now grown to be KE, has become a leaf node.

Updating an explicit node

After updating suffix K, we still have to update the next shorter suffix, which is the empty string. The empty string ends at explicit node 0, so we just have to check to see if it has a descendant that starts with the letter E. A quick look at the tree in Figure 6 shows that node 0 doesn't have such a descendant, so another leaf node is added, which yields the tree shown in Figure 7.



Figure 7

Generalizing the algorithm

By taking advantage of a few of the characteristics of the suffix tree, we can generate a fairly efficient algorithm. The first important trait is this: once a leaf node, always a leaf node. Any node that we create as a leaf will never be given a descendant, it will only be extended through character concatenation. More importantly, every time we add a new suffix to the tree, we are going to automatically extend the edges leading into every leaf node by a single character. That character will be the last character in the new suffix.

This makes management of the edges leading into leaf nodes easy. Any time we create a new leaf node, we automatically set its edge to represent all the characters from its starting point to the end of the input text. Even if we don't know what those characters are, we know they will be added to the tree eventually. Because of this, once a leaf node is created, we can just forget about it! If the edge is split, its starting point may change, but it will still extend all the way to the end of the input text.

This means that we only have to worry about updating explicit and implicit nodes at the active point, which was the first non-leaf node. Given this, we would have to progress from the active point to the empty string, testing each node for update eligibility.

However, we can save some time by stopping our update earlier. As we walk through the suffixes, we will add a new edge to each node that doesn't have a descendant edge starting with the correct character. When we finally do reach a node that has the correct character as a descendant, we can simply stop updating. Knowing how the construction algorithm works, you can see that if you find a certain character as a descendant of a particular suffix, you are bound to also find it as a descendant of every smaller suffix.

The point where you find the first matching descendant is called the end point. The end point has an additional feature that makes it particularly useful. Since we were adding leaves to every suffix between the active point and the end point, we now know that every suffix longer than the end point is a leaf node. This means the end point will turn into the active point on the next pass over the tree!

By confining our updates to the suffixes between the active point and the end point, we cut way back on the processing required to update the tree. And by keeping track of the end point, we automatically know what the active point will be on the next pass. A first pass at the update algorithm using this information might look something like this (in C-like pseudo code) :

C:
  Update( new_suffix )
  {
    current_suffix = active_point
    test_char = last_char in new_suffix
    while ( true ) {
      if current_suffix ends at an explicit node {
        if the node has a descendant edge starting with test_char
          break                      // end point found: stop without moving on
        create new leaf edge starting at the explicit node
      } else {
        if the implicit node's next char is test_char
          break                      // end point found: stop without moving on
        split the edge at the implicit node
        create new leaf edge starting at the split in the edge
      }
      if current_suffix is the empty string
        break                        // nothing shorter left to visit
      current_suffix = next_smaller_suffix( current_suffix )
    }
    active_point = current_suffix    // the end point becomes the next active point
  }

The Suffix Pointer

The pseudo-code algorithm shown above is more or less accurate, but it glosses over one difficulty. As we are navigating through the tree, we move to the next smaller suffix via a call to next_smaller_suffix(). This routine has to find the implicit or explicit node corresponding to a particular suffix.

If we do this by simply walking down the tree until we find the correct node, our algorithm isn't going to run in linear time. To get around this, we have to add one additional pointer to the tree: the suffix pointer. The suffix pointer is a pointer found at each internal node. Each internal node represents a sequence of characters that start at the root. The suffix pointer points to the node that is the first suffix of that string. So if a particular string contains characters 0 through N of the input text, the suffix pointer for that string will point to the node that is the termination point for the string starting at the root that represents characters 1 through N of the input text.

Figure 8 shows the suffix tree for the string ABABABC. The first suffix pointer is found at the node that represents ABAB. The first suffix of that string would be BAB, and that is where the suffix pointer at ABAB points. Likewise, BAB has its own suffix pointer, which points to the node for AB.
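The chain of suffix pointers for ABABABC can be reproduced by brute force. The helper below, `suffix_pointers`, is a hypothetical illustration rather than anything from STREE.CPP: it treats every substring that is followed by at least two different characters as an internal node and maps it to the same string with its first character dropped (the empty string stands for the root).

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Map each internal-node string to the string its suffix pointer leads to.
// Quadratic in the text length; meant for short inputs only.
std::map<std::string, std::string> suffix_pointers(const std::string& text) {
    std::map<std::string, std::set<char>> followers;  // substring -> next characters seen
    const std::size_t n = text.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t len = 1; i + len < n; ++len)
            followers[text.substr(i, len)].insert(text[i + len]);
    std::map<std::string, std::string> links;
    for (const auto& entry : followers)
        if (entry.second.size() >= 2)                    // branching, so an internal node
            links[entry.first] = entry.first.substr(1);  // drop the first character
    return links;
}
```

For ABABABC this yields four internal nodes linked as ABAB, BAB, AB, B and finally the root, which matches the pointers described for Figure 8.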



Figure 8
The suffix tree for ABABABC with suffix pointers shown as dashed lines

The suffix pointers are built at the same time the update to the tree is taking place. As I move from the active point to the end point, I keep track of the parent node of each of the new leaves I create. Each time I create a new edge, I also create a suffix pointer from the parent node of the last leaf edge I created to the current parent node. (Obviously, I can't do this for the first edge created in the update, but I do for all the remaining edges.)

With the suffix pointers in place, navigating from one suffix to the next is simply a matter of following a pointer. This critical addition to the algorithm is what reduces it to an O(N) algorithm.

Tree houses

To help illustrate this article, I wrote a short program, STREE.CPP, that reads in a string of text from standard input and builds a suffix tree using fully documented C++. A second version, STREED.CPP, has extensive debug output as well. Links to both are available at the bottom of this article.

Understanding STREE.CPP is really just a matter of understanding the workings of the data structures that it contains. The most important data structure is the Edge object. The class definition for Edge is:

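C++:
  class Edge {
      public :
          int first_char_index;   // index in the input text of the first character on this edge
          int last_char_index;    // index in the input text of the last character on this edge
          int end_node;           // node this edge leads to; also serves as the edge id
          int start_node;         // node this edge leaves from; node 0 is the root
          void Insert();          // put this edge into the hash table
          void Remove();          // take this edge out of the hash table
          Edge();
          Edge( int init_first_char_index,
                int init_last_char_index,
                int parent_node );
          int SplitEdge( Suffix &s );
          static Edge Find( int node, int c );   // look up the edge leaving node that starts with c
          static int Hash( int node, int c );    // hash key from node number and first character
  };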

Each time a new edge in the suffix tree is created, a new Edge object is created to represent it. The four data members of the object are defined as follows:

first_char_index, last_char_index:
Each of the edges in the tree has a sequence of characters from the input text associated with it. To ensure that the storage size of each edge is identical, we just store two indices into the input text to represent the sequence.
start_node:
The number of the node that represents the starting node for this edge. Node 0 is the root of the tree.
end_node:
The number of the node that represents the end node for this edge. Each time an edge is created, a new end node is created as well. The end node for every edge will not change over the life of the tree, so this can be used as an edge id as well.

One of the most frequent tasks performed when building the suffix tree is to search for the edge emanating from a particular node based on the first character in its sequence. On a byte oriented computer, there could be as many as 256 edges originating at a single node. To make the search reasonably quick and easy, I store the edges in a hash table, using a hash key based on their starting node number and the first character of their substring. The Insert() and Remove() member functions are used to manage the transfer of edges in and out of the hash table.
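The idea of keying edges on (starting node, first character) can be sketched with a standard hash map. This is not the hash table from STREE.CPP; `EdgeRec`, `edge_key` and `EdgeTable` are names invented for the illustration, and the key packing is just one convenient choice.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

struct EdgeRec {
    int first_char_index;
    int last_char_index;
    int start_node;
    int end_node;
};

// Pack (start node, first character) into a single integer key.
inline std::uint64_t edge_key(int node, unsigned char c) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(node)) << 8) | c;
}

class EdgeTable {
public:
    void insert(const EdgeRec& e, unsigned char first_char) {
        table_[edge_key(e.start_node, first_char)] = e;
    }
    void remove(int node, unsigned char first_char) {
        table_.erase(edge_key(node, first_char));
    }
    // Returns nullptr when no edge starting with `c` leaves `node`.
    const EdgeRec* find(int node, unsigned char c) const {
        auto it = table_.find(edge_key(node, c));
        return it == table_.end() ? nullptr : &it->second;
    }
private:
    std::unordered_map<std::uint64_t, EdgeRec> table_;
};
```

Because a node can have at most one outgoing edge per first character, the pair identifies an edge uniquely, and a lookup takes expected constant time no matter how many edges leave the node.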

The second important data structure used when building the suffix tree is the Suffix object. Remember that updating the tree is done by working through all of the suffixes of the string currently stored in the tree, starting with the longest, and ending at the end point. A Suffix is simply a sequence of characters that starts at node 0 and ends at some point in the tree.

It makes sense that we can then safely represent any suffix by defining just the position in the tree of its last character, since we know the first character starts at node 0, the root. The Suffix object, whose definition is shown here, defines a given suffix using that system:

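C++:
  class Suffix {
      public :
          int origin_node;        // node from which the remaining characters are followed
          int first_char_index;   // first index of the remaining characters in the input text
          int last_char_index;    // last index; less than first_char_index when none remain
          Suffix( int node, int start, int stop );
          int Explicit();         // nonzero when the suffix ends on an explicit node
          int Implicit();         // nonzero when the suffix ends on an implicit node
          void Canonize();        // move origin_node to the closest parent of the end point
  };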

The Suffix object defines the last character in a string by starting at a specific node, then following the string of characters in the input sequence pointed to by the first_char_index and last_char_index members. For example, in Figure 8, the longest suffix "ABABABC" would have an origin_node of 0, a first_char_index of 0, and a last_char_index of 6.

Ukkonen's algorithm requires that we work with these Suffix definitions in canonical form. The Canonize() function is called to perform this transformation any time a Suffix object is modified. The canonical representation of the suffix simply requires that the origin_node in the Suffix object be the closest parent to the end point of the string. This means that the suffix string represented by the pair (0, "ABABABC"), would be canonized by moving first to (1, "ABABC"), then (4, "ABC"), and finally (8, "").

When a suffix string ends on an explicit node, the canonical representation will use an empty string to define the remaining characters in the string. An empty string is defined by setting first_char_index to be greater than last_char_index. When this is the case, we know that the suffix ends on an explicit node. If first_char_index is less than or equal to last_char_index, it means that the suffix string ends on an implicit node.
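The index test for explicit and implicit positions is small enough to show directly. `SuffixPos` below is a hypothetical stand-in that mirrors only the three data members of the Suffix class; the `span` helper is my own addition.

```cpp
#include <cassert>

struct SuffixPos {
    int origin_node;
    int first_char_index;
    int last_char_index;

    // An empty remaining string (first > last) means the suffix ends on a node.
    bool is_explicit() const { return first_char_index > last_char_index; }
    bool is_implicit() const { return !is_explicit(); }

    // Number of characters still to be followed past origin_node.
    int span() const {
        return is_explicit() ? 0 : last_char_index - first_char_index + 1;
    }
};
```

A position such as {8, 7, 6} has no characters left and is explicit, while {0, 2, 4} still has three characters to follow and, once in canonical form, ends in the middle of an edge.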

Given these data structure definitions, I think you will find the code in STREE.CPP to be a straightforward implementation of the Ukkonen algorithm. For additional clarity, use STREED.CPP to dump copious debug information out at runtime.

Acknowledgments

I was finally convinced to tackle suffix tree construction by reading Jesper Larsson's paper for the 1996 IEEE Data Compression Conference. Jesper was also kind enough to provide me with sample code and pointers to Ukkonen's paper.

References

E.M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23:262-272, 1976.

E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, September 1995.

Source Code

Good news - this source code has been updated. It was originally published in 1996, pre-standard, and needed just a few nips and tucks to work properly in today's world. These new versions of the code should be pretty portable - they build properly with g++ 3.x, 4.x and Visual C++ 2003.

stree2006.cpp
A simple program that builds a suffix tree from an input string.
streed2006.cpp
The same program with much debugging code added.

The original code is here for the curious, but should not be used:

stree.cpp
A simple program that builds a suffix tree from an input string.
streed.cpp
The same program with much debugging code added.

