快速生成后缀树的McCreight算法及其实现
作者: ljs
2011-07-03
(版权所有,转载请注明)
McCreight算法(简称mcc算法)是基于蛮力法,即已知输入文本串T的内容(注:Ukkonen算法是online的,所以不要求事先知道T的全部内容),逐步缩短插入到树中的后缀长度,直到将最后一个后缀(等于末尾那个字符)插入到前面已经生成的树中为止。它与蛮力法的区别是,T的最后一个字符必须与前面的n-1个字符中的任何一个字符不同(n是T的长度),换句话说,T的最后一个字符不属于字母表(希腊字母大写SIGMA)中任何字符,这样生成的Suffix Tree的特点是,所有的后缀都终止于叶子结点,而且每个叶子结点必定对应一个后缀。也就是说,任何内部结点都不会是后缀的终止结点。这个要求是McCreight算法和Ukkonen算法的假设前提。
mcc算法基本流程可以描述如下:如上图所示,这里树的左枝对应插入后缀Suffix[i-1]之后的效果,v和u都是内部结点,其中v是head[i-1]对应的内部结点,u是该树枝中v的上一个内部结点(u最接近v,也可以等于v)。接下来在插入Suffix[i]时,先沿着u的suffix link(因为在插入完Suffix[i-1]之后,除了v可能没有suffix link外,其余的内部结点都有suffix link - 这一点可以用归纳法证明),找到树T[i-1]中的内部结点s(u)(注意:suffix link指向的结点一定都是内部结点)。这时开始进行插入Suffix[i]的操作,接下来分两步完成插入第i个后缀,第一步使用快速扫描(fast scan),沿着s(u)结点往树叶方向搜索,直到找到w结点为止,这个w结点就是v的suffix link应该指向的结点(但是此时很有可能这个suffix link还不存在),建立v到w的suffix link;第二步在找到w的基础上,使用慢速扫描(slow scan),即沿着w结点往树叶方向搜索,直到找到head[i]为止。这时就可以结束插入Suffix[i]的工作。需要注意,在fastscan和slowscan中都需要记录u'的结点位置,这样在插入下一个后缀Suffix[i+1]时可以快速jump到s(u'),从结点s(u')开始,而不需要像蛮力法那样从root结点去搜索了,这就是为什么mcc算法能够达到O(n)线性复杂度的原因。
mcc算法执行过程中,需要注意几点:
1)suffix link指向的结点一定是内部结点(包括root,别忘了root也是内部结点)。因为u结点永远都是内部结点,所以不需要给叶子结点(tail)建立suffix link。
2)只有head结点可能没有suffix link,但是其它的内部结点都已经有了指向另外一个内部结点的suffix link。如果head结点还没有suffix link, 在插入下一个后缀时该head结点的suffix link会被建立。数学归纳法可证明,除了当前的head[i]结点以外的任何内部结点都一定有suffix link。
3)fast scan的目标是找w,slow scan的目标是找head[i];在fast scan结束时,找到的w结点一定是一个内部结点(即有两个或以上孩子的分岔结点)。
4)不管是在fast scan还是slow scan阶段中,一旦新建了一个内部结点(该结点可能w,也可能是head[i]),那么应该立即结束当前后缀的插入工作。如果新建的内部结点是w,那么它也一定是head[i]结点;如果新建的内部结点不是w结点,那么说明w是已经存在的一个内部结点。
5) tail[i]永远不空,因为文本串是T=S$这样的形式(见上面解释)。
作者: ljs
2011-07-03
(版权所有,转载请注明)
McCreight算法(简称mcc算法)是基于蛮力法,即已知输入文本串T的内容(注:Ukkonen算法是online的,所以不要求事先知道T的全部内容),逐步缩短插入到树中的后缀长度,直到将最后一个后缀(等于末尾那个字符)插入到前面已经生成的树中为止。它与蛮力法的区别是,T的最后一个字符必须与前面的n-1个字符中的任何一个字符不同(n是T的长度),换句话说,T的最后一个字符不属于字母表(希腊字母大写SIGMA)中任何字符,这样生成的Suffix Tree的特点是,所有的后缀都终止于叶子结点,而且每个叶子结点必定对应一个后缀。也就是说,任何内部结点都不会是后缀的终止结点。这个要求是McCreight算法和Ukkonen算法的假设前提。
mcc算法的核心思想是suffix link(后缀连接)和head/tail的概念。所谓结点X的suffix link指向的结点Y,指的是如果从根结点出发到X结点终止时的字符串等于xW(其中小写字母x表示单个字符,W表示一个字符串),那么从根结点出发到Y结点终止时的字符串等于W。head[i]指的是后缀树T中,Suffix[i]与所有后缀共享的前缀中最长的前缀。
mcc算法基本流程可以描述如下:如上图所示,这里树的左枝对应插入后缀Suffix[i-1]之后的效果,v和u都是内部结点,其中v是head[i-1]对应的内部结点,u是该树枝中v的上一个内部结点(u最接近v,也可以等于v)。接下来在插入Suffix[i]时,先沿着u的suffix link(因为在插入完Suffix[i-1]之后,除了v可能没有suffix link外,其余的内部结点都有suffix link - 这一点可以用归纳法证明),找到树T[i-1]中的内部结点s(u)(注意:suffix link指向的结点一定都是内部结点)。这时开始进行插入Suffix[i]的操作,接下来分两步完成插入第i个后缀,第一步使用快速扫描(fast scan),沿着s(u)结点往树叶方向搜索,直到找到w结点为止,这个w结点就是v的suffix link应该指向的结点(但是此时很有可能这个suffix link还不存在),建立v到w的suffix link;第二步在找到w的基础上,使用慢速扫描(slow scan),即沿着w结点往树叶方向搜索,直到找到head[i]为止。这时就可以结束插入Suffix[i]的工作。需要注意,在fastscan和slowscan中都需要记录u'的结点位置,这样在插入下一个后缀Suffix[i+1]时可以快速jump到s(u'),从结点s(u')开始,而不需要像蛮力法那样从root结点去搜索了,这就是为什么mcc算法能够达到O(n)线性复杂度的原因。
mcc算法执行过程中,需要注意几点:
1)suffix link指向的结点一定是内部结点(包括root,别忘了root也是内部结点)。因为u结点永远都是内部结点,所以不需要给叶子结点(tail)建立suffix link。
2)只有head结点可能没有suffix link,但是其它的内部结点都已经有了指向另外一个内部结点的suffix link。如果head结点还没有suffix link, 在插入下一个后缀时该head结点的suffix link会被建立。数学归纳法可证明,除了当前的head[i]结点以外的任何内部结点都一定有suffix link。
3)fast scan的目标是找w,slow scan的目标是找head[i];在fast scan结束时,找到的w结点一定是一个内部结点(即有两个或以上孩子的分岔结点)。
4)不管是在fast scan还是slow scan阶段中,一旦新建了一个内部结点(该结点可能w,也可能是head[i]),那么应该立即结束当前后缀的插入工作。如果新建的内部结点是w,那么它也一定是head[i]结点;如果新建的内部结点不是w结点,那么说明w是已经存在的一个内部结点。
5) tail[i]永远不空,因为文本串是T=S$这样的形式(见上面解释)。
mcc算法实现:
import java.util.LinkedList;
import java.util.List;
/**
*
* Build Suffix Tree using McCreight Algorithm
*
* Copyright (c) 2011 ljs (http://blog.csdn.net/ljsspace/)
* Licensed under GPL (http://www.opensource.org/licenses/gpl-license.php)
*
* @author ljs
* 2011-07-03
*
*/
public class McCreightAlgorithm {
private class SuffixNode {
private String text;
private List<SuffixNode> children = new LinkedList<SuffixNode>();
private SuffixNode link;
private int start;
private int end;
private int pathlen;
public SuffixNode(String text,int start,int end,int pathlen){
this.text = text;
this.start = start;
this.end = end;
this.pathlen = pathlen;
}
public SuffixNode(String text){
this.text = text;
this.start = -1;
this.end = -1;
this.pathlen = 0;
}
public int getLength(){
if(start == -1) return 0;
else return end - start + 1;
}
public String getString(){
if(start != -1){
return this.text.substring(start,end+1);
}else{
return "";
}
}
public boolean isRoot(){
return start == -1;
}
public String getCoordinate(){
return "[" + start+".." + end + "/" + this.pathlen + "]";
}
public String toString(){
return getString() + "(" + getCoordinate()
+ ",link:" + ((this.link==null)?"N/A":this.link.getCoordinate())
+ ",children:" + children.size() +")";
}
}
private class State{
private SuffixNode u; //parent(head)
private SuffixNode w; //s(head[i-1])
private SuffixNode v; //head[i-1]
private int j; //the global index of text starting from 0 to text.length()
private boolean finished; //is this suffix insertion finished?
}
private SuffixNode root;
private String text;
public McCreightAlgorithm(String text){
this.text = text;
}
//build a suffix-tree for a string of text
private void buildSuffixTree() throws Exception{
if(root==null){
root = new SuffixNode(text);
root.link = root; //link to itself
}
SuffixNode u = root;
SuffixNode v = root;
State state = new State();
for(int i=0;i<text.length();i++){
//process each suffix
SuffixNode s = u.link;
int uvLen=v.pathlen - u.pathlen;
if(u.isRoot() && !v.isRoot()){
uvLen--;
}
int j = s.pathlen + i;
//init state
state.u = s;
state.w = s; //if uvLen = 0
state.v = s;
state.j = j;
state.finished = false;
//execute fast scan
if(uvLen > 0) {
fastscan(state,s,uvLen,j);
}
//establish the suffix link with v
SuffixNode w = state.w;
v.link = w;
//execute slow scan
if(!state.finished){
j = state.j;
state.u = w; //w must be an internal node when state.finished=false, then it must have a suffix link, so u can be updated.
slowscan(state,w,j);
}
u = state.u;
v = state.v;
}
}
//slow scan until head(=state.v) is found
private void slowscan(State state,SuffixNode currNode,int j){
boolean done = false;
int keyLen = text.length() - j;
for(int i=0;i<currNode.children.size();i++){
SuffixNode child = currNode.children.get(i);
//use min(child.key.length, key.length)
int childKeyLen = child.getLength();
int len = childKeyLen<keyLen?childKeyLen:keyLen;
int delta = 0;
for(;delta<len;delta++){
if(text.charAt(j+delta) != text.charAt(child.start+delta)){
break;
}
}
if(delta==0){//this child doesn't match any character with the new key
//order keys by lexi-order
if(text.charAt(j) < text.charAt(child.start)){
//e.g. child="e" (currNode="abc")
// abc abc
// / \ =========> / | \
// e f insert "c^" c^ e f
int pathlen = text.length() - j + currNode.pathlen;
SuffixNode node = new SuffixNode(text,j,text.length()-1,pathlen);
currNode.children.add(i,node);
//state.u = currNode; //currNode is already registered as state.u, so commented out
state.v = currNode;
state.finished = true;
done = true;
break;
}else{ //key.charAt(0)>child.key.charAt(0)
//don't forget to add the largest new key after iterating all children
continue;
}
}else{//current child's key partially matches with the new key
if(delta==len){
if(keyLen>childKeyLen){ //suffix tree with ^ ending can't have other two cases
//e.g. child="ab"
// ab ab
// / \ ==========> / | \
// e f insert "abc^" c^ e f
//recursion
state.u = child;
j += childKeyLen;
state.j = j;
slowscan(state,child,j);
}
}else{//0<delta<len
//e.g. child="abc"
// abc ab
// / \ ==========> / \
// e f insert "abd^" c d^
// / \
// e f
//insert the new node: ab
int nodepathlen = child.pathlen
- (child.getLength()-delta);
SuffixNode node = new SuffixNode(text,
child.start,child.start + delta - 1,nodepathlen);
node.children = new LinkedList<SuffixNode>();
int tailpathlen = (text.length() - (j + delta)) + nodepathlen;
SuffixNode tail = new SuffixNode(text,
j+delta,text.length()-1,tailpathlen);
//update child node: c
child.start += delta;
if(text.charAt(j+delta)<text.charAt(child.start)){
node.children.add(tail);
node.children.add(child);
}else{
node.children.add(child);
node.children.add(tail);
}
//update parent
currNode.children.set(i, node);
//state.u = currNode; //currNode is already registered as state.u, so commented out
state.v = node;
state.finished = true;
}
done = true;
break;
}
}
if(!done){
int pathlen = text.length() - j + currNode.pathlen;
SuffixNode node = new SuffixNode(text,j,text.length()-1,pathlen);
currNode.children.add(node);
//state.u = currNode; //currNode is already registered as state.u, so commented out
state.v = currNode;
state.finished = true;
}
}
//fast scan until w is found
private void fastscan(State state,SuffixNode currNode,int uvLen,int j){
for(int i=0;i<currNode.children.size();i++){
SuffixNode child = currNode.children.get(i);
if(text.charAt(child.start) == text.charAt(j)){
int len = child.getLength();
if(uvLen==len){
//then we find w
//uvLen = 0;
//need slow scan after this child
state.u = child;
state.w = child;
state.j = j+len;
}else if(uvLen<len){
//branching and cut child short
//e.g. child="abc",uvLen = 2
// abc ab
// / \ ================> / \
// e f suffix part: "abd^" c d^
// / \
// e f
//insert the new node: ab; child is now c
int nodepathlen = child.pathlen
- (child.getLength()-uvLen);
SuffixNode node = new SuffixNode(text,
child.start,child.start + uvLen - 1,nodepathlen);
node.children = new LinkedList<SuffixNode>();
int tailpathlen = (text.length() - (j + uvLen)) + nodepathlen;
SuffixNode tail = new SuffixNode(text,
j+uvLen,text.length()-1,tailpathlen);
//update child node: c
child.start += uvLen;
if(text.charAt(j+uvLen)<text.charAt(child.start)){
node.children.add(tail);
node.children.add(child);
}else{
node.children.add(child);
node.children.add(tail);
}
//update parent
currNode.children.set(i, node);
//uvLen = 0;
//state.u = currNode; //currNode is already registered as state.u, so commented out
state.w = node;
state.finished = true;
state.v = node;
}else{//uvLen>len
//e.g. child="abc", uvLen = 4
// abc
// / \ ================>
// e f suffix part: "abcdefg^"
//
//
//jump to next node
uvLen -= len;
state.u = child;
j += len;
state.j = j;
fastscan(state,child,uvLen,j);
}
break;
}
}
}
//for test purpose only
public void printTree(){
System.out.format("The suffix tree for S = %s is: %n",this.text);
this.print(0, this.root);
}
private void print(int level, SuffixNode node){
for (int i = 0; i < level; i++) {
System.out.format(" ");
}
System.out.format("|");
for (int i = 0; i < level; i++) {
System.out.format("-");
}
//System.out.format("%s(%d..%d/%d)%n", node.getString(),node.start,node.end,node.pathlen);
System.out.format("(%d,%d)%n", node.start,node.end);
for (SuffixNode child : node.children) {
print(level + 1, child);
}
}
public static void main(String[] args) throws Exception {
//test suffix-tree
System.out.println("****************************");
String text = "xbxb^"; //the last char must be unique!
McCreightAlgorithm stree = new McCreightAlgorithm(text);
stree.buildSuffixTree();
stree.printTree();
System.out.println("****************************");
text = "mississippi^";
stree = new McCreightAlgorithm(text);
stree.buildSuffixTree();
stree.printTree();
System.out.println("****************************");
text = "GGGGGGGGGGGGCGCAAAAGCGAGCAGAGAGAAAAAAAAAAAAAAAAAAAAAA^";
stree = new McCreightAlgorithm(text);
stree.buildSuffixTree();
stree.printTree();
System.out.println("****************************");
text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ^";
stree = new McCreightAlgorithm(text);
stree.buildSuffixTree();
stree.printTree();
System.out.println("****************************");
text = "AAAAAAAAAAAAAAAAAAAAAAAAAA^";
stree = new McCreightAlgorithm(text);
stree.buildSuffixTree();
stree.printTree();
System.out.println("****************************");
text = "minimize"; //the last char e is different from other chars, so it is ok.
stree = new McCreightAlgorithm(text);
stree.buildSuffixTree();
stree.printTree();
System.out.println("****************************");
//the example from McCreight's: A Space-Economical Suffix Tree Construction Algorithm
text = "bbbbbababbbaabbbbbc^";
stree = new McCreightAlgorithm(text);
stree.buildSuffixTree();
stree.printTree();
}
}
测试输出:
****************************
The suffix tree for S = xbxb^ is:
|(-1,-1)
|-(4,4)
|-(1,1)
|--(4,4)
|--(2,4)
|-(0,1)
|--(4,4)
|--(2,4)
****************************
The suffix tree for S = mississippi^ is:
|(-1,-1)
|-(11,11)
|-(1,1)
|--(11,11)
|--(8,11)
|--(2,4)
|---(8,11)
|---(5,11)
|-(0,11)
|-(8,8)
|--(10,11)
|--(9,11)
|-(2,2)
|--(4,4)
|---(8,11)
|---(5,11)
|--(3,4)
|---(8,11)
|---(5,11)
****************************
The suffix tree for S = GGGGGGGGGGGGCGCAAAAGCGAGCAGAGAGAAAAAAAAAAAAAAAAAAAAAA^ is:
|(-1,-1)
|-(15,15)
|--(16,16)
|---(17,17)
|----(18,18)
|-----(35,35)
|------(36,36)
|-------(37,37)
|--------(38,38)
|---------(39,39)
|----------(40,40)
|-----------(41,41)
|------------(42,42)
|-------------(43,43)
|--------------(44,44)
|---------------(45,45)
|----------------(46,46)
|-----------------(47,47)
|------------------(48,48)
|-------------------(49,49)
|--------------------(50,50)
|---------------------(51,51)
|----------------------(52,53)
|----------------------(53,53)
|---------------------(53,53)
|--------------------(53,53)
|-------------------(53,53)
|------------------(53,53)
|-----------------(53,53)
|----------------(53,53)
|---------------(53,53)
|--------------(53,53)
|-------------(53,53)
|------------(53,53)
|-----------(53,53)
|----------(53,53)
|---------(53,53)
|--------(53,53)
|-------(53,53)
|------(53,53)
|-----(19,53)
|-----(53,53)
|----(19,53)
|----(53,53)
|---(19,53)
|---(53,53)
|--(19,19)
|---(27,27)
|----(32,53)
|----(28,29)
|-----(32,53)
|-----(30,53)
|---(20,20)
|----(25,53)
|----(21,53)
|--(53,53)
|-(12,12)
|--(15,15)
|---(16,53)
|---(26,53)
|--(13,13)
|---(22,53)
|---(14,53)
|-(0,0)
|--(22,22)
|---(32,53)
|---(23,23)
|----(29,29)
|-----(32,53)
|-----(30,53)
|----(24,53)
|--(12,12)
|---(15,15)
|----(16,53)
|----(26,53)
|---(13,13)
|----(22,53)
|----(14,53)
|--(1,1)
|---(12,53)
|---(2,2)
|----(12,53)
|----(3,3)
|-----(12,53)
|-----(4,4)
|------(12,53)
|------(5,5)
|-------(12,53)
|-------(6,6)
|--------(12,53)
|--------(7,7)
|---------(12,53)
|---------(8,8)
|----------(12,53)
|----------(9,9)
|-----------(12,53)
|-----------(10,10)
|------------(12,53)
|------------(11,53)
|-(53,53)
****************************
The suffix tree for S = ABCDEFGHIJKLMNOPQRSTUVWXYZ^ is:
|(-1,-1)
|-(0,26)
|-(1,26)
|-(2,26)
|-(3,26)
|-(4,26)
|-(5,26)
|-(6,26)
|-(7,26)
|-(8,26)
|-(9,26)
|-(10,26)
|-(11,26)
|-(12,26)
|-(13,26)
|-(14,26)
|-(15,26)
|-(16,26)
|-(17,26)
|-(18,26)
|-(19,26)
|-(20,26)
|-(21,26)
|-(22,26)
|-(23,26)
|-(24,26)
|-(25,26)
|-(26,26)
****************************
The suffix tree for S = AAAAAAAAAAAAAAAAAAAAAAAAAA^ is:
|(-1,-1)
|-(0,0)
|--(1,1)
|---(2,2)
|----(3,3)
|-----(4,4)
|------(5,5)
|-------(6,6)
|--------(7,7)
|---------(8,8)
|----------(9,9)
|-----------(10,10)
|------------(11,11)
|-------------(12,12)
|--------------(13,13)
|---------------(14,14)
|----------------(15,15)
|-----------------(16,16)
|------------------(17,17)
|-------------------(18,18)
|--------------------(19,19)
|---------------------(20,20)
|----------------------(21,21)
|-----------------------(22,22)
|------------------------(23,23)
|-------------------------(24,24)
|--------------------------(25,26)
|--------------------------(26,26)
|-------------------------(26,26)
|------------------------(26,26)
|-----------------------(26,26)
|----------------------(26,26)
|---------------------(26,26)
|--------------------(26,26)
|-------------------(26,26)
|------------------(26,26)
|-----------------(26,26)
|----------------(26,26)
|---------------(26,26)
|--------------(26,26)
|-------------(26,26)
|------------(26,26)
|-----------(26,26)
|----------(26,26)
|---------(26,26)
|--------(26,26)
|-------(26,26)
|------(26,26)
|-----(26,26)
|----(26,26)
|---(26,26)
|--(26,26)
|-(26,26)
****************************
The suffix tree for S = minimize is:
|(-1,-1)
|-(7,7)
|-(1,1)
|--(4,7)
|--(2,7)
|--(6,7)
|-(0,1)
|--(2,7)
|--(6,7)
|-(2,7)
|-(6,7)
****************************
The suffix tree for S = bbbbbababbbaabbbbbc^ is:
|(-1,-1)
|-(19,19)
|-(5,5)
|--(12,19)
|--(6,6)
|---(7,19)
|---(9,10)
|----(11,19)
|----(16,19)
|-(0,0)
|--(5,5)
|---(12,19)
|---(6,6)
|----(7,19)
|----(9,19)
|--(1,1)
|---(5,5)
|----(12,19)
|----(6,19)
|---(2,2)
|----(5,5)
|-----(12,19)
|-----(6,19)
|----(3,3)
|-----(5,19)
|-----(4,4)
|------(5,19)
|------(18,19)
|-----(18,19)
|----(18,19)
|---(18,19)
|--(18,19)
|-(18,19)
参考资料:
EDWARD M. McCREIGHT, Journal of the Association for Computing Machinery, Vol 23, No. 2, April 1976, A Space-Economical Suffix Tree Construction Algorithm