String Algorithm 字符串算法专题

最新推荐文章于 2022-05-20 19:30:59 发布

u010944926

最新推荐文章于 2022-05-20 19:30:59 发布

阅读量612

点赞数

分类专栏： C/C++数据结构

C/C++数据结构专栏收录该内容

58 篇文章 0 订阅

订阅专栏

这个专题主要要处理的字符串匹配（String Matching Problem）strstr 问题：

假设有一个字符串Text T，长度：n，即T[0...n-1]

现在要在T中找Pattern P，长度：m，即P[0...m-1] (n>=m)

常用的算法有：

1）暴力法 Brute Force Method

2）Rabin-Karp String Matching Algorithm

3）String Matching with Finite Automata

4）KMP Algorithm

5）Boyce-Moore Algorithm

6）后缀树 Suffix Trees

1）暴力法 Brute Force Method:

[java]view plaincopy 
   
 package String;  
   
 public class BruteForce {  
   
     public static void main(String[] args) {  
         String T = "mississippi";  
         String P = "ssi";  
         bruteForce(T, P);  
     }  
   
     // Time: O(nm), Space: O(1)  
     public static void bruteForce(String T, String P) {  
         int n = T.length();     // Text  
         int m = P.length();     // Pattern  
           
         for(int i=0; i<=n-m; i++) {      // i表示在T上的offset，注意最后一个开始检查位置是n-m  
             int j = 0;  
             while(j < m && T.charAt(i+j) == P.charAt(j)) {   // j表示匹配到P的哪个位置了  
                 j++;  
             }  
             if(j == m) {        // j已经全部匹配完P的长度，返回第一个匹配开始的地点  
                 System.out.println("Pattern found at index " + i);  
             }  
         }  
     }  
 }  

http://www.geeksforgeeks.org/searching-for-patterns-set-1-naive-pattern-searching/

2）Rabin-Karp String Matching Algorithm

Rabin-Karp的预处理时间是O(m)，匹配时间O( ( n - m + 1 ) m )既然与朴素算法的匹配时间一样，而且还多了一些预处理时间，那为什么我们还要学习这个算法呢？虽然Rain-Karp在最坏的情况下与朴素匹配一样，但是实际应用中往往比朴素算法快很多。而且该算法的期望匹配时间是O(n)【参照《算法导论》】，但是Rabin-Karp算法需要进行数值运算，速度必然不会比KMP算法快，那我们有了KMP算法以后为什么还要学习Rabin-Karp算法呢？个人认为学习的是一种思想，一种解题的思路，当我们见识的越多，眼界也就也开阔，面对实际问题的时候，就能找到更加合适的算法。比如二维模式匹配，Rabin-Karp就是一种好的选择。
而且Rabin-Karp算法非常有趣，将字符当作数字来处理，基本思路：如果Tm是一个长度为 |P| 的T的子串，且转换为数值后模上一个数（一般为素数）与模式字符串P转换成数值后模上同一个数的值相同，则Tm可能是一个合法的匹配。

该算法的难点就在于p和t的值可能很大，导致不能方便的对其进行处理。对这个问题有一个简单的补救办法，用一个合适的数q来计算p和t的模。每个字符其实十一个十进制的整数，所以p，t以及递归式都可以对模q进行，所以可以在O(m)的时间里计算出模q的p值，在O（n - m + 1）时间内计算出模q的所有t值。参见《算法导论》或http://net.pku.edu.cn/~course/cs101/2007/resource/Intro2Algorithm/book6/chap34.htm
递推式是如下这个式子：
ts+1 = (d ( ts -T[s + 1]h) + T[s + m + 1 ] ) mod q
例如，如果d = 10 （十进制）m= 5, ts = 31415,我们希望去掉最高位数字T[s + 1] = 3,再加入一个低位数字（假定 T[s+5+1] = 2)就得到：
ts+1 = 10(31415 - 1000*3) +2 = 14152

平均，最好时间复杂度：O(n+m)

最坏时间复杂度：O(nm)

最差情况发生于所有的text的字串和pattern都有相同的哈希值，使算法退化到O(nm)

[java]view plaincopy 
   
 package String;  
   
 public class RabinKarp {  
   
     public static void main(String[] args) {  
         String T = "mississippi";  
         String P = "ssi";  
         int q = 101;  
         search(P, T, q);  
     }  
       
     public static int d = 256;  
       
     public static void search(String P, String T, int q) {  
         int M = P.length();  
         int N = T.length();  
         int i, j;  
         int p=0;        // hash value for pattern   
         int t=0;        // hash value for txt  
         int h=1;  
           
         // The value of h would be "pow(d, M-1)%q"  
         for(i=0; i<M-1; i++) {  
             h = (h*d)%q;  
         }  
           
         // Calculate the hash value of pattern and first window of text  
         for(i=0; i<M; i++) {  
             p = (d*p + P.charAt(i)) % q;  
             t = (d*t + T.charAt(i)) % q;  
         }  
           
         // Slide the pattern over text one by one   
         for(i=0; i<=N-M; i++) {  
             // Chaeck the hash values of current window of text and pattern  
             // If the hash values match then only check for characters on by one  
             if(p == t) {  
                 // Check for characters one by one  
                 for(j=0; j<M; j++) {  
                     if(T.charAt(i+j) != P.charAt(j)) {  
                         break;  
                     }  
                 }  
                 if(j == M) {        // if p == t and pat[0...M-1] = txt[i, i+1, ...i+M-1]  
                     System.out.println("Pattern found at index " + i);  
                 }  
             }  
               
             // Calulate hash value for next window of text: Remove leading digit,   
             // add trailing digit  
             if(i < N-M) {  
                 // Rehash, O(1)  
                 t = ( d*(t - T.charAt(i)*h) + T.charAt(i+M) ) % q;  
                 // We might get negative value of t, converting it to positive  
                 if(t < 0) {  
                     t = t + q;  
                 }  
             }  
         }  
     }  
   
 }  

http://www.geeksforgeeks.org/searching-for-patterns-set-3-rabin-karp-algorithm/

http://www.youtube.com/watch?v=d3TZpfnpJZ0

3）String Matching with Finite Automata

假设要对文本字符串T进行扫描，找出模式P的所有出现位置。这个方法可以通过一些办法先对模式P进行预处理，然后只需要对T的每个文本字符检查一次，并且检查每个文本字符所用时间为常数，所以在预处理建好自动机之后进行匹配所需时间只是Θ（n）。

　　假设文本长度为n，模式长度为m，则自动机将会有0,1，...，m这么多种状态，并且初始状态为0。先抛开自动机是怎样计算出来的细节，只关注自动机的作用。在从文本从左到右扫描时，对于每一个字符a，根据自动机当前的状态还有a的值可以找出自动机的下一个状态，这样一直扫描下去，并且一定自动机状态值变为m的时候我们就可以认为成功进行了一次匹配。先看下面简单的例子：

假设现在文本和模式只有三种字符a,b,c，已经文本T为"abababaca",模式P为"ababaca"，根据模式P建立自动机如下图(b)（先不管实现细节）：

　（a)图为一些状态转化细节

如图(c),对照自动机转换图(b),一个个的扫描文本字符，扫描前状态值初始化为0，这样在i = 9的时候状态值刚好变成7 = m，所以完成一个匹配。

　　现在问题只剩下怎样根据给出的模式P计算出相应的一个自动机了。这个过程实际上并没有那么困难，下面只是介绍自动机的构建，而详细的证明过程可以参考书本。

　　还是用上面的那里例子，建立模式P = "ababaca"的有限自动机。首先需要明白一点，如果当前的状态值为k，其实说明当前文本的后缀与模式的前缀的最大匹配长度为k，这时读进下一个文本字符，即使该字符匹配，状态值最多变成k + 1.假设当前状态值为5，说明文本当前位置的最后5位为"ababa"，等于模式的前5位。

　　如果下一位文本字符是"c"，则状态值就可以更新为6.如果下一位是"a"，这时我们需要重新找到文本后缀与模式前缀的最大匹配长度。简单的寻找方法可以是令k = 6(状态值最大的情况），判断文本后k位与模式前k位是否相等，不等的话就k = k - 1继续找。由于刚才文本后5位"ababa"其实就是模式的前5位，所以实际上构建自动机时不需要用到文本。这样可以找到这种情况状态值将变为1(只有a一位相等）。同理可以算出下一位是"b"时状态值该变为4（模式前4位"abab"等于"ababab"的后缀）

　　下面是书本伪代码：∑代表字符集，δ(q,a)可以理解为读到加进字符a后的状态值

　　用上面的方法计算自动机，如果字符种数为k，则建立自动机预处理的时间是O(m ^ 3 * k)，有方法可以将时间改进为O(m * k)。预处理完后需要Θ（n）的处理时间。

[java]view plaincopy 
   
 package String;  
   
 public class FiniteAutomata {  
   
     public static int getNextState(String pat, int M, int state, int x) {  
         // If the character c is same as next character in pattern,  
         // then simply increment state  
         if(state < M && x == pat.charAt(state)) {  
             return state + 1;  
         }  
           
         int ns, i;      // ns stores the result which is next state  
           
         // ns finally contains the longest prefix which is also suffix  
         // in "pat[0..state-1]c"  
        
         // Start from the largest possible value and stop when you find  
         // a prefix which is also suffix  
         for(ns = state; ns > 0; ns--) {  
             if(pat.charAt(ns-1) == x) {  
                 for(i=0; i<ns-1; i++) {  
                     if(pat.charAt(i) != pat.charAt(state-ns+1+i)) {  
                         break;  
                     }  
                 }  
                 if(i == ns-1) {  
                     return ns;  
                 }  
             }  
         }  
         return 0;  
     }  
       
     public static int NO_OF_CHARS = 256;  
       
     /* This function builds the TF table which represents Finite Automata for a 
        given pattern  */  
     public static void computeTF(String pat, int M, int[][] TF) {  
         int state, x;  
         for(state=0; state<=M; state++) {  
             for(x=0; x<NO_OF_CHARS; x++) {  
                 TF[state][x] = getNextState(pat, M, state, x);  
             }  
         }  
     }  
       
     /* Prints all occurrences of pat in txt */  
     public static void search(String pat, String txt) {  
         int M = pat.length();  
         int N = txt.length();  
           
         int[][] TF = new int[M+1][NO_OF_CHARS];  
           
         computeTF(pat, M, TF);  
           
         // Process txt over FA.  
         int i, state = 0;  
         for(i=0; i<N; i++) {  
             state = TF[state][txt.charAt(i)];  
             if(state == M) {  
                 System.out.println("Pattern found at index " +(i-M+1));  
             }  
         }  
     }  
       
     public static void main(String[] args) {  
         String T = "mississippi";  
         String P = "ssi";  
         search(P, T);  
     }  
 }  

http://www.cnblogs.com/jolin123/p/3443543.html

http://www.geeksforgeeks.org/searching-for-patterns-set-5-finite-automata/

4）KMP Algorithm

KMP算法，最易懂的莫过于Princeton的Robert Sedgewick讲解的，他用自动机模型来讲解。难点在于建立dfa表，精妙处在于维护一个X的变量，每次基于match和mismatch两种情况，再结合X位置的情况，来推出当前位置的情况，而且更新X的值。

有了dfa表后，search就变成了线性的过程，要点是i指针一直向前走，从来不往后退。j表示不同的状态，也是match上字符的个数。

Time Complexity: O(n+m)

Space Complexity: O(m)

[java]view plaincopy 
   
 package String;  
   
 public class KMP {  
   
     private static int[][] dfa;  
       
     // return offset of first match; N if no match  
     public static int search(String text, String pat) {  
           
         createDFA(pat);  
           
         // simulate operation of DFA on text  
         int M = pat.length();  
         int N = text.length();  
           
         int i, j;  
         for(i=0, j=0; i<N && j<M; i++) {  
             j = dfa[text.charAt(i)][j];  
         }  
         if(j == M) {        // found  
             System.out.println("Find pattern at index: " + (i-M));  
             return i - M;  
         }  
         System.out.println("Not found");  
         return N;       // not found  
     }  
       
     // create the DFA from a String  
     public static void createDFA(String pat) {  
         int R = 256;  
         int M = pat.length();       // build DFA from pattern  
         dfa = new int[R][M];  
         dfa[pat.charAt(0)][0] = 1;  
         for(int X=0, j=1; j<M; j++) {  
             for(int c=0; c<R; c++) {  
                 dfa[c][j] = dfa[c][X];      // Copy mismatch cases.   
             }  
             dfa[pat.charAt(j)][j] = j+1;    // Set match case.   
             X = dfa[pat.charAt(j)][X];      // Update restart state.   
         }  
     }  
       
       
     // ======================================= char array  
       
     // return offset of first match; N if no match  
     public static int search(char[] text, char[] pattern) {  
           
         createDFA(pattern, 256);  
           
         int M = pattern.length;  
         int N = text.length;  
         int i, j;  
         for(i=0, j=0; i<N && j<M; i++) {  
             j = dfa[text[i]][j];  
         }  
         if(j == M) {        // found  
             System.out.println("Find pattern at index: " + (i-M));  
             return i - M;  
         }  
         System.out.println("Not found");  
         return N;       // not found  
     }  
       
     // create the DFA from a character array over R-character alphabet  
     public static void createDFA(char[] pattern, int R) {  
         int M = pattern.length;  
         dfa = new int[R][M];  
           
         dfa[pattern[0]][0] = 1;  
         for(int X=0, j=1; j<M; j++) {  
             for(int c=0; c<R; c++) {  
                 dfa[c][j] = dfa[c][X];  
             }  
             dfa[pattern[j]][j] = j+1;  
             X = dfa[pattern[j]][X];  
         }  
     }  
       
     public static void main(String[] args) {  
         String T = "mississippi";  
         String P = "ssi";  
         search(T, P);  
         search(T.toCharArray(), P.toCharArray());  
     }  
   
 }  

http://algs4.cs.princeton.edu/53substring/KMP.java.html

https://www.cs.princeton.edu/courses/archive/fall10/cos226/demo/53KnuthMorrisPratt.pdf

http://www.cmi.ac.in/~kshitij/talks/kmp-talk/kmp.pdf

5）Boyce-Moore Algorithm

[java]view plaincopy 
   
 package String;  
   
 public class BoyerMoore {  
   
     private final int R;     // the radix  
     private int[] right;     // the bad-character skip array  
   
     private char[] pattern;  // store the pattern as a character array  
     private String pat;      // or as a string  
   
     // pattern provided as a string  
     public BoyerMoore(String pat) {  
         this.R = 256;  
         this.pat = pat;  
   
         // position of rightmost occurrence of c in the pattern  
         right = new int[R];  
         for (int c = 0; c < R; c++)  
             right[c] = -1;  
         for (int j = 0; j < pat.length(); j++)  
             right[pat.charAt(j)] = j;  
     }  
   
     // pattern provided as a character array  
     public BoyerMoore(char[] pattern, int R) {  
         this.R = R;  
         this.pattern = new char[pattern.length];  
         for (int j = 0; j < pattern.length; j++)  
             this.pattern[j] = pattern[j];  
   
         // position of rightmost occurrence of c in the pattern  
         right = new int[R];  
         for (int c = 0; c < R; c++)  
             right[c] = -1;  
         for (int j = 0; j < pattern.length; j++)  
             right[pattern[j]] = j;  
     }  
   
     // return offset of first match; N if no match  
     public int search(String txt) {  
         int M = pat.length();  
         int N = txt.length();  
         int skip;  
         for (int i = 0; i <= N - M; i += skip) {  
             skip = 0;  
             for (int j = M-1; j >= 0; j--) {  
                 if (pat.charAt(j) != txt.charAt(i+j)) {  
                     skip = Math.max(1, j - right[txt.charAt(i+j)]);  
                     break;  
                 }  
             }  
             if (skip == 0) return i;    // found  
         }  
         return N;                       // not found  
     }  
   
   
     // return offset of first match; N if no match  
     public int search(char[] text) {  
         int M = pattern.length;  
         int N = text.length;  
         int skip;  
         for (int i = 0; i <= N - M; i += skip) {  
             skip = 0;  
             for (int j = M-1; j >= 0; j--) {  
                 if (pattern[j] != text[i+j]) {  
                     skip = Math.max(1, j - right[text[i+j]]);  
                     break;  
                 }  
             }  
             if (skip == 0) return i;    // found  
         }  
         return N;                       // not found  
     }  
   
   
   
     // test client  
     public static void main(String[] args) {  
         String pat = "ssi";  
         String txt = "mississippi";  
         char[] pattern = pat.toCharArray();  
         char[] text    = txt.toCharArray();  
   
         BoyerMoore boyermoore1 = new BoyerMoore(pat);  
         BoyerMoore boyermoore2 = new BoyerMoore(pattern, 256);  
         int offset1 = boyermoore1.search(txt);  
         int offset2 = boyermoore2.search(text);  
   
         System.out.println("Find in offset: " + offset1);  
         System.out.println("Find in offset: " + offset2);  
     }  
 }  

https://www.youtube.com/watch?v=rDPuaNw9_Eo

http://algs4.cs.princeton.edu/53substring/BoyerMoore.java.html

6）后缀树 Suffix Trees

在写后缀树之前，还得先介绍两种也很常用的存储String的数据结构：Trie和Ternary Search Tree

Trie的总结可参考这里：http://blog.csdn.net/fightforyourdream/article/details/18332799

Trie的优点在于查找速度很快，缺点在于内存需要很多，因为每个节点都要存26个指针，指向其孩子。

因此Ternary Search Tree应运而生。它结合了BST的内存高效和Trie的时间高效。

具体Ternary Search Tree的解释可以参考：

http://www.cnblogs.com/rush/archive/2012/12/30/2839996.html

http://www.geeksforgeeks.org/ternary-search-tree/

举个例子：

把字符串AB，ABCD，ABBA和BCD插入到三叉搜索树中，首先往树中插入了字符串AB，接着我们插入字符串ABCD，由于ABCD与AB有相同的前缀AB，所以C节点都是存储到B的CenterChild中，D存储到C的CenterChild中；当插入ABBA时，由于ABBA与AB有相同的前缀AB，而B字符少于字符C，所以B存储到C的LeftChild中；当插入BCD时，由于字符B大于字符A，所以B存储到C的RightChild中。

其实还可以用Hashtable来存放字符串，它内存高效，但是无法排序。

后缀树视频：

https://www.youtube.com/watch?v=hLsrPsFHPcQ

v_JULY_v的从Trie树（字典树）谈到后缀树（10.28修订）http://blog.csdn.net/v_july_v/article/details/6897097

后缀树组

u010944926

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
String Algorithm 字符串算法专题

这个专题主要要处理的字符串匹配（String Matching Problem）strstr 问题：假设有一个字符串Text T，长度：n，即T[0...n-1]现在要在T中找Pattern P，长度：m，即P[0...m-1] (n>=m)常用的算法有：1）暴力法 Brute Force Method2）Rabin-Karp String
复制链接

扫一扫