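The LCP (Longest Common Prefix) array stores, for each pair of adjacent suffixes in a suffix array, the length of their longest common prefix. It is essential information in many suffix array applications.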
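To make the definition concrete, here is a minimal sketch that computes the LCP array directly from the definition by comparing each suffix with the one before it in the suffix array. The class and method names (`NaiveLcp`, `commonPrefixLength`, `lcpByDefinition`) are hypothetical, and the convention that the first entry is 0 is an assumption matching the output shown later. Its worst case is O(n^2), so it is only meant as an illustration or a cross-check.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the original code.
final class NaiveLcp {
    // Length of the longest common prefix of the suffixes starting at a and b.
    static int commonPrefixLength(String text, int a, int b) {
        int k = 0;
        while (a + k < text.length() && b + k < text.length()
                && text.charAt(a + k) == text.charAt(b + k)) {
            k++;
        }
        return k;
    }

    // lcp[i] = LCP of suffixes sa[i] and sa[i-1]; lcp[0] = 0 by convention (assumed).
    static int[] lcpByDefinition(String text, int[] sa) {
        int[] lcp = new int[sa.length];
        for (int i = 1; i < sa.length; i++) {
            lcp[i] = commonPrefixLength(text, sa[i], sa[i - 1]);
        }
        return lcp;
    }

    public static void main(String[] args) {
        // 0-based suffix array of "mississippi#"
        int[] sa = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};
        // prints [0, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
        System.out.println(Arrays.toString(lcpByDefinition("mississippi#", sa)));
    }
}
```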
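Let SA be the suffix array, and define LCP(i) as the length of the longest common prefix of suffix SA[i] and suffix SA[i-1]. For a text position p, write LCP(p) (with a slight abuse of notation) for the LCP between the p-th suffix of the input text T and the suffix immediately before it in the suffix array. The p-th and (p-1)-th suffixes of T then satisfy LCP(p) >= LCP(p-1) - 1.
So once LCP(p-1) is known, computing LCP(p) can skip the first LCP(p-1)-1 characters of the p-th suffix and start comparing at the next character against the suffix just before p in the suffix array. Call that the q-th suffix of T, where q = SA[Rank[p]-2] (Rank is 1-based, SA is 0-based). From there LCP(p) is computed by definition, character by character. Because the matched length drops by at most 1 from one p to the next and never exceeds n, the total number of character comparisons is O(n), so the whole LCP array is computed in O(n) time.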
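The inequality above is stated without justification, so here is a short derivation sketch. The symbols are introduced only for this sketch: T_p is the suffix of T starting at position p, lcp(x, y) is the length of the longest common prefix of two strings, and H(p) is the LCP of T_p with the suffix just before it in SA (the quantity the paragraph above writes as LCP(p) for a text position p). The case H(p-1) <= 1 is trivial because H(p) >= 0, so assume H(p-1) >= 2.

```latex
\begin{aligned}
&\text{Let } T_{q'} \text{ be the suffix just before } T_{p-1} \text{ in } SA,
  \quad h = H(p-1) = \operatorname{lcp}(T_{q'}, T_{p-1}) \ge 2. \\
&T_{q'} < T_{p-1} \ \text{and}\ T[q'] = T[p-1]
  \;\Longrightarrow\; T_{q'+1} < T_{p}, \quad \operatorname{lcp}(T_{q'+1}, T_{p}) = h - 1. \\
&\text{Let } T_{q} \text{ be the suffix just before } T_{p} \text{ in } SA:
  \quad T_{q'+1} \le T_{q} < T_{p}. \\
&\text{Any string sorted between two strings shares their common prefix, so} \\
&H(p) = \operatorname{lcp}(T_{q}, T_{p}) \;\ge\; \operatorname{lcp}(T_{q'+1}, T_{p}) = h - 1 = H(p-1) - 1.
\end{aligned}
```

For the text mississippi#, the values at text positions 1, 2, 3, 4 are 4, 3, 2, 1: each step loses exactly one character, which is the tight case of the bound.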
Implementation:
/**
 * Derive LCP (Longest Common Prefix) from Suffix Array in O(n)
 *
 * Copyright (c) 2011 ljs (http://blog.csdn.net/ljsspace/)
 * Licensed under GPL (http://www.opensource.org/licenses/gpl-license.php)
 *
 * @author ljs
 * 2011-07-20
 */
public class LCP {
    // rank[p] is 1-based (not 0-based): rank[p] = (index of suffix p in sa) + 1
    public int[] solve(String text, int[] sa, int[] rank) {
        if (text == null) return null;
        int len = text.length();
        if (len == 0) return null;

        // lcpz[p]: LCP of suffix[p] and the suffix just before it in sa
        int[] lcpz = new int[len];
        for (int p = 0; p < len; p++) {
            int r = rank[p] - 1; // 0-based index of suffix[p] in sa
            if (r == 0) {
                // p == sa[0]: the smallest suffix has no predecessor, so LCP = 0.
                // Skip it instead of ending the loop, so later positions are still processed.
                lcpz[p] = 0;
                continue;
            }
            int q = sa[r - 1]; // same as sa[rank[p]-2]
            int h = (p > 0) ? lcpz[p - 1] : 0;
            // for h > 1, jump h-1 chars for suffix[p] and suffix[q];
            // for h <= 1, calculate LCP by definition (i.e. start with lcp = 0)
            int lcp = (h > 1) ? h - 1 : 0;
            // extend the match; reaching the end of the text also ends the match,
            // so the count stays correct even without a unique terminator
            while (p + lcp < len && q + lcp < len
                    && text.charAt(p + lcp) == text.charAt(q + lcp)) {
                lcp++;
            }
            lcpz[p] = lcp;
        }

        // reorder from text order to suffix array order
        int[] LCP = new int[len];
        for (int i = 0; i < len; i++) {
            LCP[i] = lcpz[sa[i]];
        }
        return LCP;
    }
    public static void main(String[] args) {
        String text = "mississippi#";
        LCP solver = new LCP();
        int[] LCP = solver.solve(text,
                new int[]{11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2},  // sa (0-based)
                new int[]{6, 5, 12, 10, 4, 11, 9, 3, 8, 7, 2, 1}); // rank (1-based)
        System.out.format("LCP array for text: %s%n", text);
        for (int i = 0; i < LCP.length; i++) {
            System.out.format(" %d", LCP[i]);
        }
    }
}
Test output:
LCP array for text: mississippi#
0 0 1 1 4 0 0 1 0 2 1 3
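The sa and rank arrays passed to solve in the test are hardcoded. As a rough sketch of where such inputs could come from, the following builds a 0-based suffix array by plain sorting and derives the 1-based rank array from it. `NaiveSuffixArray`, `buildSa` and `buildRank` are hypothetical names, not part of the original code, and the sort compares whole substrings, so it is far slower than a real suffix array construction and only suitable for small test inputs.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the original code.
final class NaiveSuffixArray {
    // Sorts suffix start positions by comparing the suffixes as strings.
    static int[] buildSa(String text) {
        Integer[] order = new Integer[text.length()];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> text.substring(a).compareTo(text.substring(b)));
        int[] sa = new int[order.length];
        for (int i = 0; i < sa.length; i++) {
            sa[i] = order[i];
        }
        return sa;
    }

    // 1-based inverse of sa: rank[sa[i]] = i + 1.
    static int[] buildRank(int[] sa) {
        int[] rank = new int[sa.length];
        for (int i = 0; i < sa.length; i++) {
            rank[sa[i]] = i + 1;
        }
        return rank;
    }

    public static void main(String[] args) {
        int[] sa = buildSa("mississippi#");
        // prints [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
        System.out.println(Arrays.toString(sa));
        // prints [6, 5, 12, 10, 4, 11, 9, 3, 8, 7, 2, 1]
        System.out.println(Arrays.toString(buildRank(sa)));
    }
}
```

Both printed arrays match the ones used in the test, so the results could be passed straight to solve.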