Suffix Trees and Suffix Arrays

Suffix tree:
A suffix tree is a data structure that supports efficient string matching and queries.

The suffix tree T of a string S consisting of m words is a rooted directed tree with exactly m leaves, numbered 1 to m. Every internal node other than the root has at least two children, and every edge is labeled with a non-empty substring of S. No two edges leaving the same node have labels beginning with the same word. The key property of the suffix tree is: for any leaf i, concatenating the labels of the edges on the path from the root to that leaf spells out exactly the suffix of S beginning at position i, i.e. S[i..m]. The label of a node is defined as the concatenation of the labels of all edges on the path from the root to that node.

The figure shows the suffix tree of the string "I know you know I know ". Internal nodes are drawn as circles and leaves as rectangles; in this example there are six leaves, numbered 1 to 6. The terminating character is omitted from the figure.

Similarly, a suffix tree built over several strings is called a generalized suffix tree. Given n strings S_1, ..., S_n, where S_i has length m_i, their generalized suffix tree T is a rooted directed tree with m_1 + m_2 + ... + m_n leaves, each labeled with a two-number tuple (k, l), where k ranges from 1 to n and l ranges from 1 to m_k. Every internal node other than the root has at least two children, and every edge is labeled with a non-empty substring of words from the strings; no two edges leaving the same node have labels whose first word is the same. For any leaf (i, j), concatenating the labels of the edges on the path from the root to that leaf spells out exactly the suffix of S_i beginning at position j, that is, S_i[j..m_i].
In string processing, suffix trees and suffix arrays are both very powerful tools. Suffix trees are widely known, while suffix arrays are far less well documented. The suffix array is in fact a very elegant substitute for the suffix tree: it is easier to implement, supports much of the suffix tree's functionality with comparable time complexity, and uses far less memory. Arguably, in programming contests the suffix array is more practical than the suffix tree. This article therefore introduces the basic concepts of the suffix array, its construction, the construction of the companion longest-common-prefix array, and finally discusses applications of suffix arrays through a few examples.


Suffix array:

First, some necessary definitions:

Character set: a character set Σ is a totally ordered set; that is, any two distinct elements α and β of Σ are comparable: either α < β or β < α (equivalently, α > β). The elements of Σ are called characters.
String: a string S is an array of n characters in sequence; n is the length of S, written len(S). The i-th character of S is written S[i].
Substring: the substring S[i..j] of S, with i ≤ j, is the segment of S from position i to position j, i.e. the string formed by S[i], S[i+1], ..., S[j] in order.
Suffix: a suffix is the special substring running from some position i to the end of the string. The suffix of S beginning at i is written Suffix(S, i); that is, Suffix(S, i) = S[i..len(S)].

String comparison here means the usual "dictionary order" (lexicographic) comparison: for strings u and v, compare u[i] and v[i] for i = 1, 2, ...; if they are equal, increment i; otherwise, if u[i] < v[i] then u < v, and if u[i] > v[i] then u > v (that is, v < u), and the comparison ends. If i exceeds len(u) or len(v) without a decision, then u < v if len(u) < len(v), u = v if len(u) = len(v), and u > v if len(u) > len(v).
From this definition, two suffixes u and v of S starting at different positions can never compare equal, because the necessary condition for u = v, namely len(u) = len(v), cannot hold here.

From now on, fix a character set Σ and a string S with len(S) = n and S[n] = '$'; that is, S ends with the special character '$', which is smaller than every character of Σ. All characters of S other than S[n] belong to Σ. For this fixed S, the suffix beginning at position i is written simply Suffix(i), omitting the parameter S.

Suffix array: the suffix array SA is a one-dimensional array holding a permutation SA[1], SA[2], ..., SA[n] of 1..n such that Suffix(SA[i]) < Suffix(SA[i+1]) for 1 ≤ i < n. In other words, sort the n suffixes of S in increasing order and write their starting positions into SA in that order.
Rank array: the rank array is Rank = SA^-1; that is, if SA[i] = j then Rank[j] = i. Clearly Rank[i] holds the "rank" of Suffix(i), i.e. its position when all suffixes are sorted in increasing order.


Construction
How do we construct the suffix array? The most direct and simplest method is to treat the suffixes of S as ordinary strings and sort them in increasing order with a general string sort.
This approach is clearly clumsy: it ignores the close relationships between the suffixes, so it cannot be efficient. Even with a relatively fast string sort such as multikey quicksort, the worst-case time complexity is still O(n^2), which does not meet our needs.
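As a concrete reference point, the brute-force method can be sketched in a few lines of Python (an illustrative sketch only; the sample string "banana$" and the 1-based positions follow this article's conventions):

```python
def naive_suffix_array(s):
    """Brute force: sort the suffixes of s as ordinary strings.

    With plain comparison sorting this is O(n^2 log n) in the worst
    case, which is exactly why the doubling algorithm is needed."""
    n = len(s)
    # 1-based start positions, ordered by the suffix starting there
    sa = sorted(range(1, n + 1), key=lambda i: s[i - 1:])
    # Rank = SA^-1: rank[j] = i whenever sa[i] = j
    rank = [0] * (n + 1)
    for pos, start in enumerate(sa, start=1):
        rank[start] = pos
    return sa, rank

sa, rank = naive_suffix_array("banana$")
print(sa)         # [7, 6, 4, 2, 1, 5, 3]  ($, a$, ana$, anana$, banana$, na$, nana$)
print(rank[1:])   # [5, 4, 7, 3, 6, 2, 1]
```

The output makes the SA/Rank inverse relationship easy to check by hand before moving on to the faster construction.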
Below we introduce the doubling algorithm, which fully exploits the relationships between the suffixes and brings the worst-case construction time down to O(n log n).

For a string u, define the k-prefix of u as
    u_k = u          if len(u) ≤ k,
    u_k = u[1..k]    if len(u) > k.

Define the k-prefix comparison relations <_k, =_k and ≤_k:
for two strings u and v,
u <_k v  if and only if  u_k < v_k
u =_k v  if and only if  u_k = v_k
u ≤_k v  if and only if  u_k ≤ v_k

Intuitively, these subscripted comparison operators compare the first k characters of the two strings lexicographically. The one special point is that, for the strict comparisons, it does not matter if a string is shorter than k characters, as long as the comparison is decided (one string found greater or smaller) before the k characters are exhausted.
From the properties of the prefix comparison operators we obtain the following very important properties:
Property 1.1: for k ≥ n, Suffix(i) <_k Suffix(j) is equivalent to Suffix(i) < Suffix(j).
Property 1.2: Suffix(i) =_2k Suffix(j) is equivalent to
Suffix(i) =_k Suffix(j) and Suffix(i+k) =_k Suffix(j+k).
Property 1.3: Suffix(i) <_2k Suffix(j) is equivalent to
Suffix(i) <_k Suffix(j), or (Suffix(i) =_k Suffix(j) and Suffix(i+k) <_k Suffix(j+k)).
One issue: when i + k > n or j + k > n, Suffix(i+k) or Suffix(j+k) is an undefined expression. In fact this never matters: in that case Suffix(i) or Suffix(j) has length at most k, so its k-prefix ends with '$', and the k-prefix comparison cannot come out equal. The first k characters already decide the comparison, so the later expressions can be ignored. This is exactly why we required S to end with '$'.

Define the k-suffix array SA_k: it holds a permutation SA_k[1], SA_k[2], ..., SA_k[n] of 1..n such that Suffix(SA_k[i]) ≤_k Suffix(SA_k[i+1]) for 1 ≤ i < n. In other words, sort all the suffixes in increasing order under the k-prefix relation and write their starting positions into SA_k in that order.
Define the k-rank array Rank_k: Rank_k[i] is the "rank" of Suffix(i) under the k-prefix relation, i.e. 1 plus the number of j satisfying Suffix(j) <_k Suffix(i). Rank_k is easily computed from SA_k in O(n) time.
Suppose SA_k and Rank_k have been computed; then SA_2k and Rank_2k are easy to obtain. By Properties 1.2 and 1.3, the 2k-prefix relation can be expressed equivalently as a constant number of k-prefix comparisons, and the Rank_k array gives a constant-time way to perform <_k and =_k comparisons:
Suffix(i) <_k Suffix(j)  if and only if  Rank_k[i] < Rank_k[j]
Suffix(i) =_k Suffix(j)  if and only if  Rank_k[i] = Rank_k[j]
Comparing Suffix(i) and Suffix(j) under the 2k-prefix relation therefore takes constant time, so sorting all the suffixes under ≤_2k is no different from an ordinary sort: each Suffix(i) simply has primary key Rank_k[i] and secondary key Rank_k[i+k]. With an O(n log n) sort such as quicksort, constructing SA_2k from SA_k and Rank_k costs O(n log n); a smarter choice is radix sort, which costs O(n).
Once SA_2k is computed, Rank_2k can be derived from it in O(n) time. Hence SA_2k and Rank_2k can be obtained from SA_k and Rank_k in O(n) time.
Only one question remains: how to construct SA_1 and Rank_1. This is very simple: the operators <_1, =_1 and ≤_1 just compare the first character of each string, so sorting the suffixes by their first character yields SA_1. Quicksort, for instance, does this in O(n log n).
Thus SA_1 and Rank_1 can be computed in O(n log n) time.
From SA_1 and Rank_1 we compute SA_2 and Rank_2 in O(n) time; likewise SA_4 and Rank_4 in another O(n); and so on, successively computing
SA_2 and Rank_2, SA_4 and Rank_4, SA_8 and Rank_8, ..., up to SA_m and Rank_m, where m = 2^k and m ≥ n. By Property 1.1, SA_m is equivalent to SA. This requires log n rounds of O(n) work, so
the suffix array SA and the rank array Rank can be computed in O(n log n) time.
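The rounds described above can be sketched in Python. This is a minimal illustrative sketch: each round uses a comparison sort keyed on the pair (Rank_k[i], Rank_k[i+k]), so it runs in O(n log^2 n); replacing the sort with the radix sort mentioned above would give O(n log n).

```python
def suffix_array_doubling(s):
    """Doubling algorithm sketch: sort by (Rank_k[i], Rank_k[i+k])
    to obtain SA_2k, doubling k until all ranks are distinct."""
    n = len(s)
    # round 0: rank by first character (an order-preserving Rank_1)
    rank = [ord(c) for c in s]
    sa = list(range(n))                # 0-based positions internally
    k = 1
    while True:
        # key: primary = Rank_k[i], secondary = Rank_k[i+k] (or -1 past the end)
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # renumber ranks densely, ties getting equal ranks
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:      # all ranks distinct: SA_k equals SA
            break
        k *= 2
    return [i + 1 for i in sa]         # back to the article's 1-based positions

print(suffix_array_doubling("banana$"))   # [7, 6, 4, 2, 1, 5, 3]
```

Note how the '$' terminator guarantees that all suffixes eventually get distinct ranks, which is what makes the early-exit test valid.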

Longest common prefix
The suffix array SA of a string S can now be computed in O(n log n) time. SA alone already enables a lot, for instance pattern matching in O(m log n) time, where m and n are the lengths of the pattern and the text. But to unleash the full power of suffix arrays, we need to compute one more auxiliary tool: the longest common prefix (LCP).
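The O(m log n) pattern matching mentioned above works by binary-searching the suffix array for the range of suffixes whose m-prefix equals the pattern; each probe costs O(m) character comparisons. A hedged Python sketch (1-based SA as in the article; the string "banana$" is just a running example):

```python
def search(s, sa, p):
    """Locate pattern p in s via binary search over the (1-based)
    suffix array sa. Returns the half-open range [lo, hi) of SA
    positions (0-based) whose suffixes start with p."""
    n, m = len(sa), len(p)
    # lower bound: first suffix whose m-prefix is >= p
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid] - 1:sa[mid] - 1 + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # upper bound: first suffix whose m-prefix is > p
    lo, hi = start, n
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid] - 1:sa[mid] - 1 + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return start, lo

s = "banana$"
sa = [7, 6, 4, 2, 1, 5, 3]
lo, hi = search(s, sa, "ana")
print(hi - lo)                 # 2 occurrences; start positions are sa[lo:hi]
```

Because all occurrences of p are consecutive in SA, the two binary searches bound the whole occurrence list at once.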
For two strings u, v, define lcp(u, v) = max{ i | u =_i v }: compare the corresponding characters of u and v from the start; the largest position up to which they remain equal is the length of the longest common prefix of the two strings.
For positive integers i, j with 1 ≤ i, j ≤ n, define LCP(i, j) = lcp(Suffix(SA[i]), Suffix(SA[j])), i.e. the length of the longest common prefix of the i-th and j-th suffixes in the suffix array.
Two obvious properties of LCP:
Property 2.1: LCP(i, j) = LCP(j, i)
Property 2.2: LCP(i, i) = len(Suffix(SA[i])) = n - SA[i] + 1
These let us consider only the case i < j when computing LCP(i, j): if i > j we can swap i and j, and if i = j we can output n - SA[i] + 1 directly.

Computing LCP(i, j) straight from the definition, by comparing corresponding characters, is clearly inefficient at O(n) per query, so we must preprocess appropriately to lower the cost of each LCP computation.
Careful analysis reveals a very nice property of the LCP function:
for i < j, LCP(i, j) = min{ LCP(k-1, k) | i+1 ≤ k ≤ j }  (LCP Theorem)

To prove the LCP Theorem, we first prove the LCP Lemma:
for any 1 ≤ i < j < k ≤ n, LCP(i, k) = min{ LCP(i, j), LCP(j, k) }.
Proof: let p = min{LCP(i, j), LCP(j, k)}; then LCP(i, j) ≥ p and LCP(j, k) ≥ p.
Let Suffix(SA[i]) = u, Suffix(SA[j]) = v, Suffix(SA[k]) = w.
From u =_{LCP(i,j)} v we get u =_p v; similarly v =_p w.
Hence Suffix(SA[i]) =_p Suffix(SA[k]), i.e. LCP(i, k) ≥ p.  (1)

Now suppose LCP(i, k) = q > p. Then
u[1] = w[1], u[2] = w[2], ..., u[q] = w[q].
But min{LCP(i, j), LCP(j, k)} = p means u[p+1] ≠ v[p+1] or v[p+1] ≠ w[p+1].
Let u[p+1] = x, v[p+1] = y, w[p+1] = z. Since u ≤ v ≤ w and the three strings agree on their first p characters, clearly x ≤ y ≤ z; and since p < q, i.e. p + 1 ≤ q, we must have x = z, hence x = y = z, contradicting u[p+1] ≠ v[p+1] or v[p+1] ≠ w[p+1].
So q > p is impossible, i.e. LCP(i, k) ≤ p.  (2)
Combining (1) and (2), LCP(i, k) = p = min{LCP(i, j), LCP(j, k)}, which proves the LCP Lemma.

The LCP Theorem can now be proved as follows:
for j - i = 1 and j - i = 2 it clearly holds.
Assume the LCP Theorem holds for j - i = m. For j - i = m + 1,
the LCP Lemma gives LCP(i, j) = min{LCP(i, i+1), LCP(i+1, j)};
since j - (i+1) ≤ m, we have LCP(i+1, j) = min{LCP(k-1, k) | i+2 ≤ k ≤ j}, so for j - i = m + 1 we still have
LCP(i, j) = min{LCP(i, i+1), min{LCP(k-1, k) | i+2 ≤ k ≤ j}} = min{LCP(k-1, k) | i+1 ≤ k ≤ j}.
By mathematical induction, the LCP Theorem holds.

An immediate corollary of the LCP Theorem:
LCP Corollary: for i ≤ j < k, LCP(j, k) ≥ LCP(i, k).

Define a one-dimensional array height, with height[i] = LCP(i-1, i) for 1 < i ≤ n, and height[1] = 0.
By the LCP Theorem, LCP(i, j) = min{ height[k] | i+1 ≤ k ≤ j }; that is, computing LCP(i, j) amounts to querying the minimum of the elements of height with indices in the range i+1 to j. Since the height array is fixed, this is the classic RMQ (Range Minimum Query) problem.
RMQ can be preprocessed in O(n log n) time with a segment tree or similar static structure, after which each query costs O(log n); better still, the standard RMQ algorithm preprocesses in O(n) time and answers each query in constant time.
For a fixed string S, the height array is clearly fixed. So as long as we can compute height efficiently, RMQ preprocessing makes each LCP(i, j) computation constant-time. Only one problem remains: computing the height array as efficiently as possible.
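The O(n log n)-preprocessing, O(1)-query option can be sketched with a sparse table over height. This is a hedged sketch with 0-based indices, whereas the article's height array is 1-based; the example array is hypothetical.

```python
def build_sparse_table(height):
    """Sparse table for RMQ: st[j][i] = min of height[i .. i + 2^j - 1].
    O(n log n) preprocessing; the 'standard RMQ algorithm' mentioned
    above would reduce preprocessing to O(n)."""
    n = len(height)
    st = [height[:]]
    j = 1
    while (1 << j) <= n:
        prev = st[-1]
        st.append([min(prev[i], prev[i + (1 << (j - 1))])
                   for i in range(n - (1 << j) + 1)])
        j += 1
    return st

def range_min(st, l, r):
    """Minimum of height[l..r] (inclusive, 0-based) in O(1):
    cover the range with two overlapping power-of-two blocks."""
    j = (r - l + 1).bit_length() - 1
    return min(st[j][l], st[j][r - (1 << j) + 1])

st = build_sparse_table([0, 3, 1, 4, 2])
print(range_min(st, 1, 3))   # 1
```

With height in hand, LCP(i, j) is just range_min over the positions corresponding to height[i+1..j].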
As with constructing the suffix array, we should not treat the n suffixes as unrelated ordinary strings, but exploit the relationships between them as much as possible. We now prove a very useful property.
For convenience, define h[i] = height[Rank[i]], i.e. height[i] = h[SA[i]]. The h array satisfies:
Property 3: for i > 1 with Rank[i] > 1, we always have h[i] ≥ h[i-1] - 1.
To prove Property 3, we first establish two facts.

Let i < n and j < n, with Suffix(i) and Suffix(j) satisfying lcp(Suffix(i), Suffix(j)) > 1. Then:
Fact 1: Suffix(i) < Suffix(j) is equivalent to Suffix(i+1) < Suffix(j+1).
Fact 2: lcp(Suffix(i+1), Suffix(j+1)) = lcp(Suffix(i), Suffix(j)) - 1.
These look surprising but are quite natural: lcp(Suffix(i), Suffix(j)) > 1 means Suffix(i) and Suffix(j) share their first character; call it α. Then Suffix(i) is α followed by Suffix(i+1), and Suffix(j) is α followed by Suffix(j+1). When comparing Suffix(i) and Suffix(j), the first character α always matches, so the remainder of the comparison is exactly the comparison of Suffix(i+1) and Suffix(j+1); this gives Fact 1. Fact 2 is proved similarly.

Property 3 can now be proved:
If h[i-1] ≤ 1 the claim is immediate, since h[i] ≥ 0 ≥ h[i-1] - 1.
If h[i-1] > 1, i.e. height[Rank[i-1]] > 1, then Rank[i-1] > 1, because height[1] = 0.
Let j = i - 1 and k = SA[Rank[j] - 1]. Clearly Suffix(k) < Suffix(j).
From h[i-1] = lcp(Suffix(k), Suffix(j)) > 1 and Suffix(k) < Suffix(j):
by Fact 2, lcp(Suffix(k+1), Suffix(i)) = h[i-1] - 1;
by Fact 1, Rank[k+1] < Rank[i], i.e. Rank[k+1] ≤ Rank[i] - 1.
Then by the LCP Corollary,
LCP(Rank[i]-1, Rank[i]) ≥ LCP(Rank[k+1], Rank[i])
= lcp(Suffix(k+1), Suffix(i))
= h[i-1] - 1.
Since h[i] = height[Rank[i]] = LCP(Rank[i]-1, Rank[i]), we finally obtain h[i] ≥ h[i-1] - 1.

By Property 3, we can let i run from 1 to n and compute each h[i] as follows:
If Rank[i] = 1, then h[i] = 0. Character comparisons: 0.
If i = 1 or h[i-1] ≤ 1, compare Suffix(i) and Suffix(SA[Rank[i]-1]) character by character from the first character until a mismatch, which yields h[i]. Character comparisons: h[i] + 1, which is at most h[i] - h[i-1] + 2.
Otherwise, i > 1, Rank[i] > 1 and h[i-1] > 1; by Property 3, Suffix(i) and Suffix(SA[Rank[i]-1]) agree on at least their first h[i-1] - 1 characters, so the comparison can start at position h[i-1] and continue until a mismatch, which yields h[i]. Character comparisons: h[i] - h[i-1] + 2.
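The three cases above translate almost line-for-line into code (this amortized trick is the one commonly attributed to Kasai et al.; a minimal Python sketch, with s ending in '$' and a 1-based SA as in the article):

```python
def build_height(s, sa):
    """Compute height[1..n] via the h array: h[i] >= h[i-1] - 1 lets us
    carry over all but one matched character between iterations."""
    n = len(s)
    rank = [0] * (n + 1)
    for pos in range(1, n + 1):
        rank[sa[pos - 1]] = pos            # Rank = SA^-1
    height = [0] * (n + 1)                 # height[1] stays 0 by definition
    h = 0
    for i in range(1, n + 1):              # iterate suffixes by start position
        if rank[i] == 1:
            h = 0                          # case 1: no predecessor in SA
        else:
            j = sa[rank[i] - 2]            # suffix ranked just before Suffix(i)
            if h > 0:
                h -= 1                     # Property 3: first h[i-1]-1 chars match
            while i + h <= n and j + h <= n and s[i + h - 1] == s[j + h - 1]:
                h += 1                     # extend the match character by character
            height[rank[i]] = h
    return height[1:]                      # height[1..n]

s = "banana$"
sa = [7, 6, 4, 2, 1, 5, 3]
print(build_height(s, sa))   # [0, 0, 1, 3, 0, 0, 2]
```

Note that h never drops by more than 1 per step and never exceeds n, which is the telescoping argument behind the O(n) bound below.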

Let SA[1] = p (so that h[p] = 0 and position p costs no comparisons). It is then not hard to see that the total number of character comparisons is at most the telescoping sum of h[i] - h[i-1] + 2 over the remaining positions, which is at most h[n] + 2n ≤ 3n.
That is, the whole algorithm runs in O(n) time.
Having computed the h array, the relation height[i] = h[SA[i]] gives the height array in O(n) time. Hence
the height array can be computed in O(n) time.

Combined with the RMQ method, after O(n)-time-and-space preprocessing we can compute LCP(i, j) in constant time for any pair (i, j).
Since lcp(Suffix(i), Suffix(j)) = LCP(Rank[i], Rank[j]), we can therefore compute the longest common prefix of any two suffixes of S in constant time. This is one of the main reasons the suffix array handles so many string problems so powerfully.

Wikipedia entry: suffix tree
Suffix tree for the string BANANA padded with $. The six paths from the root to a leaf (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the boxes give the start position of the corresponding suffix. Suffix links drawn dashed.

In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a certain data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations.

The suffix tree for a string S is a tree whose edges are labeled with strings, and such that each suffix of S corresponds to exactly one path from the tree's root to a leaf. It is thus a radix tree for the suffixes of S.

Constructing such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.
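For illustration only, the substring query can be demonstrated with a naive *uncompressed* suffix trie, built by inserting every suffix character by character. This sketch is O(n^2) in time and space; a real suffix tree compresses single-child chains onto edges and is built in linear time by Ukkonen's algorithm.

```python
def build_suffix_trie(s):
    """Naive suffix trie of s + '$': nested dicts, one level per character.
    Quadratic construction; for exposition, not for production use."""
    s = s + "$"
    root = {}
    for i in range(len(s)):
        node = root
        for c in s[i:]:                    # insert suffix s[i:]
            node = node.setdefault(c, {})
    return root

def is_substring(root, p):
    """Check whether p occurs in s by walking p from the root: O(len(p))."""
    node = root
    for c in p:
        if c not in node:
            return False
        node = node[c]
    return True

trie = build_suffix_trie("BANANA")
print(is_substring(trie, "NAN"))   # True
print(is_substring(trie, "NAB"))   # False
```

Every substring of s is a prefix of some suffix, which is why walking p from the root answers the substring question.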

History

The concept was first introduced as a position tree by Weiner in 1973[1] in a paper which Donald Knuth subsequently characterized as "Algorithm of the Year 1973". The construction was greatly simplified by McCreight in 1976[2], and also by Ukkonen in 1995[3][4]. Ukkonen provided the first linear-time online construction of suffix trees, now known as Ukkonen's algorithm.

Definition

The suffix tree for the string S of length n is defined as a tree such that ([5] page 90):

  • the paths from the root to the leaves have a one-to-one relationship with the suffixes of S,
  • edges spell non-empty strings,
  • and all internal nodes (except perhaps the root) have at least two children.

Since such a tree does not exist for all strings, S is padded with a terminal symbol not seen in the string (usually denoted $). This ensures that no suffix is a prefix of another, and that there will be n leaf nodes, one for each of the n suffixes of S. Since all internal non-root nodes are branching, there can be at most n - 1 such nodes, and n + (n - 1) + 1 = 2n nodes in total.

Suffix links are a key feature for linear-time construction of the tree. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string χα, where χ is a single character and α is a string (possibly empty), it has a suffix link to the internal node representing α. See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree.

Functionality

A suffix tree for a string S of length n can be built in Θ(n) time, if the alphabet is constant or integer [6]. Otherwise, the construction time depends on the implementation. The costs below are given under the assumption that the alphabet is constant. If it is not, the cost depends on the implementation (see below).

Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built for the set of strings D = {S1,S2,...,SK} of total length n = n1 + n2 + ... + nK. You can:

  • Search for strings:
    • Check if a string P of length m is a substring in O(m) time ([5] page 92).
    • Find the first occurrence of the patterns P1,...,Pq of total length m as substrings in O(m) time, when the suffix tree is built using Ukkonen's algorithm.
    • Find all z occurrences of the patterns P1,...,Pq of total length m as substrings in O(m + z) time ([5] page 123).
    • Search for a regular expression P in time expected sublinear on n ([7]).
    • Find, for each suffix P[i...m] of a pattern P, the length of the longest match between a prefix of P[i...m] and a substring in D, in Θ(m) time ([5] page 132). This is termed the matching statistics for P.
  • Find properties of the strings:
    • Find the longest common substrings of the strings Si and Sj in Θ(ni + nj) time ([5] page 125).
    • Find all maximal pairs, maximal repeats or supermaximal repeats in Θ(n + z) time ([5] page 144).
    • Find the Lempel-Ziv decomposition in Θ(n) time ([5] page 166).
    • Find the longest repeated substrings in Θ(n) time.
    • Find the most frequently occurring substrings of a minimum length in Θ(n) time.
    • Find the shortest strings from Σ that do not occur in D, in O(n + z) time, if there are z such strings.
    • Find the shortest substrings occurring only once in Θ(n) time.
    • Find, for each i, the shortest substrings of Si not occurring elsewhere in D in Θ(n) time.

The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in Θ(n) time ([5] chapter 8). You can then also:

  • Find the longest common prefix between the suffixes Si[p..ni] and Sj[q..nj] in Θ(1) ([5] page 196).
  • Search for a pattern P of length m with at most k mismatches in O(kn + z) time, where z is the number of hits ([5] page 200).
  • Find all z maximal palindromes in Θ(n) time ([5] page 198), or Θ(gn) time if gaps of length g are allowed, or Θ(kn) if k mismatches are allowed ([5] page 201).
  • Find all z tandem repeats in O(n log n + z), and k-mismatch tandem repeats in O(kn log(n / k) + z) ([5] page 204).
  • Find the longest substrings common to at least k strings in D for k = 2..K in Θ(n) time ([5] page 205).

Uses

Suffix trees are often used in bioinformatics applications, where they are used for searching for patterns in DNA or protein sequences, which can be viewed as long strings of characters. The ability to search efficiently with mismatches might be the suffix tree's greatest strength. It is also used in data compression, where on the one hand it is used to find repeated data and on the other hand it can be used for the sorting stage of the Burrows-Wheeler transform. Variants of the LZW compression schemes use it (LZSS). A suffix tree is also used in something called suffix tree clustering which is a data clustering algorithm used in some search engines.

Implementation

If each node and edge can be represented in Θ(1) space, the entire tree can be represented in Θ(n) space. The total length of the edges in the tree is O(n^2), but each edge can be stored as the position and length of a substring of S, giving a total space usage of Θ(n) computer words. The worst-case space usage of a suffix tree is seen with a Fibonacci string, giving the full 2n nodes.

An important choice when implementing a suffix tree is how to represent the parent-child relationships between nodes. The most common choice is linked lists called sibling lists: each node has a pointer to its first child and to the next node in the child list it is a part of. Hash maps, sorted/unsorted arrays (with array doubling), and balanced search trees may also be used, giving different running-time properties. We are interested in:

  • The cost of finding the child on a given character.
  • The cost of inserting a child.
  • The cost of enlisting all children of a node (divided by the number of children in the table below).

Let σ be the size of the alphabet. Then you have the following costs:

                                  Lookup      Insertion   Traversal
Sibling lists / unsorted arrays   O(σ)        Θ(1)        Θ(1)
Hash maps                         Θ(1)        Θ(1)        O(σ)
Balanced search tree              O(log σ)    O(log σ)    O(1)
Sorted arrays                     O(log σ)    O(σ)        O(1)
Hash maps + sibling lists         O(1)        O(1)        O(1)

Note that the insertion cost is amortised, and that the costs for hashing assume perfect hashing.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

References

  1. P. Weiner (1973). "Linear pattern matching algorithm". 14th Annual IEEE Symposium on Switching and Automata Theory: 1-11.
  2. Edward M. McCreight (1976). "A Space-Economical Suffix Tree Construction Algorithm". Journal of the ACM 23 (2): 262-272.
  3. E. Ukkonen (1995). "On-line construction of suffix trees". Algorithmica 14 (3): 249-260.
  4. R. Giegerich and S. Kurtz (1997). "From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction". Algorithmica 19 (3): 331-353.
  5. Gusfield, Dan (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press. ISBN 0-521-58519-8.
  6. Martin Farach (1997). "Optimal suffix tree construction with large alphabets". Foundations of Computer Science, 38th Annual Symposium on: 137-143.
  7. Ricardo A. Baeza-Yates and Gaston H. Gonnet (1996). "Fast text searching for regular expressions or automaton searching on tries". Journal of the ACM 43: 915-936. DOI:10.1145/235809.235810. ISSN 0004-5411.
