Written-Test Algorithm Problems (52): Introduction - The KMP Algorithm (D.E. Knuth, J.H. Morris, V.R. Pratt Algorithm)

Topic: the KMP algorithm (D.E. Knuth, J.H. Morris, V.R. Pratt Algorithm)

Analysis:

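  • The KMP algorithm finds a given character or pattern string inside a main string (the text). Assume the text is an array T[1,n] of length n and the pattern is an array P[1,m] of length m, where n>m and every element is a character drawn from a finite alphabet;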

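  • The conventional approach slides the pattern as a window from left to right over every position of the text. At each position it starts comparing from the first character of the pattern: if the characters are equal it compares the next pair, otherwise it moves the window to the next position. In the left-hand figure below, the top row of letters is the text and the pattern is nano. This approach makes pointless comparisons at index=2, index=6 and index=10, because it does not use the properties of the pattern itself. For example, during the attempt at index=2 it has already seen that the character at index=3 is a, which can never match the first character of the pattern, so the window can jump straight to index=4. KMP uses exactly this kind of information about the pattern: after a failed attempt it skips all the following positions that cannot possibly match, instead of only moving to the position right after the start of the failed attempt.

    [Figures: left, conventional matching of the pattern nano against the text; right, overlay function values for the string abaabcaba]
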
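  • Overlay function: to describe the character structure of the pattern, KMP introduces the overlay function, which measures how far the left end and the right end of a string overlap each other. The right-hand figure above shows the overlay function values for the string abaabcaba; in that figure a value of 0 means an overlap of one character (the sample code uses -1 for no overlap). The mathematical definition of self-overlay is:

    For the string: a_0 a_1 … a_{j-1} a_j

    the self-overlay OF(j)=k is defined by: a_0 a_1 … a_{k-1} a_k = a_{j-k} a_{j-k+1} … a_{j-1} a_j (0<=j<=Pattern_len), where k is the largest such value with k<j.

    It follows from this definition that if an attempt has matched the first j characters of the pattern, OF(j) gives the length of the overlap between the tail and the head of the matched part, and no position before that overlap can start a match. The sliding window can therefore move (j-OF(j)) positions at once and resume comparing from character OF(j)+1 of the pattern (the earlier characters are already known to match because of the overlay function). Note that k must be as large as possible, otherwise some matches would be missed. In this statement OF(j) is the number of characters at the left end and the right end of the length-j prefix that match each other, 0<=j<=Pattern_len;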

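  • The OF(k) values of the prefixes of the pattern are computed recursively. Suppose OF(j)=k is already known for the first j characters of the pattern; then for the first j+1 characters:

    If pattern[k+1]==pattern[j+1], the self-overlay extends to the right: OF(j+1)=OF(j)+1=k+1

    If pattern[k+1]!=pattern[j+1], the self-overlay has to shrink to the left, and the known fact becomes OF(k)=h for the first k characters:

            If pattern[h+1]==pattern[j+1], then OF(j+1)=OF(k)+1=h+1

            If pattern[h+1]!=pattern[j+1], repeat the step with OF(h), and so on until the leftmost character is reached; if no candidate matches, the first j+1 characters have no overlap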

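  • In KMP, when an attempt starts at position m of the text and fails after j characters of the pattern have matched, the next attempt does not go back to position m+1 but jumps directly to position m+(j-overlay_func[j]). Here m is a position in the text (not the pattern length used in the first bullet), and overlay_func[j] is the overlap length of the first j characters of the pattern;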

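  • The time complexity is O(M+N), where N is the length of the text and M the length of the pattern. Other string matching algorithms include BM (Boyer-Moore) and Horspool; the Sunday algorithm (Boyer-Moore-Horspool-Sunday) is a refinement of BM. KMP is linear in the worst case, and BM can also be made linear with suitable refinements, yet BM is commonly reported to run about 3-5 times faster than KMP in practice. The Sunday algorithm is usually regarded as faster still, because it moves the window farther after each failed attempt;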
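
To make the difference between the conventional sliding-window comparison and KMP concrete, here is a minimal C++ sketch that counts the text-versus-pattern character comparisons of both methods. The names SearchStats, naive_search and kmp_search are hypothetical and do not come from the original post. The border table stores overlap lengths (0 for no overlap), which is an assumption that differs from the index-based table of the sample code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Index of the first match (or -1) plus the number of character
// comparisons made between text and pattern during the search.
struct SearchStats {
    long position;
    long comparisons;
};

// Conventional approach: restart from pattern[0] at every window position.
SearchStats naive_search(const std::string& text, const std::string& pattern) {
    SearchStats stats{-1, 0};
    if (pattern.empty() || pattern.size() > text.size()) return stats;
    for (std::size_t start = 0; start + pattern.size() <= text.size(); ++start) {
        std::size_t k = 0;
        while (k < pattern.size()) {
            ++stats.comparisons;
            if (text[start + k] != pattern[k]) break;
            ++k;
        }
        if (k == pattern.size()) {
            stats.position = static_cast<long>(start);
            return stats;
        }
    }
    return stats;
}

// KMP: border[i] is the length of the longest proper prefix of
// pattern[0..i] that is also a suffix of it (assumed convention).
SearchStats kmp_search(const std::string& text, const std::string& pattern) {
    SearchStats stats{-1, 0};
    if (pattern.empty() || pattern.size() > text.size()) return stats;
    std::vector<std::size_t> border(pattern.size(), 0);
    for (std::size_t i = 1, k = 0; i < pattern.size(); ++i) {
        while (k > 0 && pattern[i] != pattern[k]) k = border[k - 1];
        if (pattern[i] == pattern[k]) ++k;
        border[i] = k;
    }
    std::size_t t = 0;  // position in text, never moves backwards
    std::size_t p = 0;  // position in pattern
    while (t < text.size()) {
        ++stats.comparisons;
        if (text[t] == pattern[p]) {
            ++t;
            ++p;
            if (p == pattern.size()) {
                stats.position = static_cast<long>(t - p);
                return stats;
            }
        } else if (p == 0) {
            ++t;                // first character failed: advance the text
        } else {
            p = border[p - 1];  // reuse the overlap instead of rewinding t
        }
    }
    return stats;
}
```

On a text of twenty a characters followed by b, with the pattern aaaab, both functions locate the match at index 16; the conventional version needs 85 comparisons and the KMP version 37, which stays below twice the text length.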

Example:

#include <cstdio>

// Builds the overlay table of pattern. pat_len must be at least 1.
// The caller owns the returned array and releases it with delete[].
int* ComputeOverlay(const char *pattern, int pat_len) {

    /**
     * overlay_func stores the self-overlay of every prefix of the pattern.
     * Each prefix starts at the leftmost character and grows to the right
     * by one character at a time.
     * */
    int *overlay_func=new int[pat_len];
    int index;
    /**
     * A single character, or a prefix without any overlap, gets the value -1.
     * */
    overlay_func[0]=-1;

    for(int i=1;i<pat_len;i++) {

        /**
         * The self-overlay of the prefix ending at i is derived from
         * the one of the prefix ending at i-1.
         * */
        index=overlay_func[i-1];

        /**
         * pattern[i] is the newly added character. If it differs from
         * pattern[index+1], picture the prefix as:
         * &&&&-------&&&&+
         * &&&& marks the characters shared by the left and right ends of
         * the prefix ending at i-1. If - and + differ, the current overlap
         * cannot be extended, so a shorter overlap has to be found inside
         * &&&&. Because the left and right ends are identical,
         * overlay_func[index] gives the overlap inside the left/right end.
         * */
        while(index>=0 && pattern[i]!=pattern[index+1])
            index=overlay_func[index];

        if(pattern[i]==pattern[index+1])
            overlay_func[i]=index+1;
        else
            /**
             * index is below 0 and the characters at i and index+1
             * differ, so the self-overlay is -1.
             * */
            overlay_func[i]=-1;

    }

    // Debug output: print the overlay table.
    for(int i=0;i<pat_len;i++)
        printf("%d, ",overlay_func[i]);
    return overlay_func;
}

// Returns the start index of the first occurrence of pattern in target,
// or -1 if there is none.
int kmp_func(const char *target, int tar_len, const char *pattern, int pat_len) {

    // Guard against an empty pattern, which ComputeOverlay cannot handle.
    if(pat_len<=0 || tar_len<pat_len)
        return -1;

    /**
     * Compute the self-overlay of the pattern first, once for all.
     * */
    int *overlay_func=ComputeOverlay(pattern, pat_len);

    int pat_index=0;
    int tar_index=0;

    /**
     * Traverse target from left to right.
     * */
    while(pat_index<pat_len && tar_index<tar_len) {
        if(target[tar_index]==pattern[pat_index]) {
            /**
             * The characters at the current positions of target and
             * pattern are equal, so both indexes move one step right.
             * */
            tar_index++;
            pat_index++;
        } else if(pat_index==0)
            /**
             * The pattern index is at the first character, so the very
             * first character differs: just advance the target index.
             * */
            tar_index++;
        else
            /**
             * The pattern index is past the first character, so all
             * characters up to pat_index-1 matched. Following the KMP
             * rule, tar_index is not moved back; only pat_index moves,
             * which slides the window so that its left end lines up with
             * the overlap at the right end of the matched part and skips
             * the unnecessary comparisons in between.
             * */
            pat_index=overlay_func[pat_index-1]+1;
    }

    delete [] overlay_func;

    /**
     * If pat_index reached the end, pattern was matched in target and
     * tar_index-pat_index is the start position of the pattern in the text.
     * */
    if(pat_index==pat_len)
        return tar_index-pat_index;
    return -1;
}

int main() {
    const char *target="annbcdanacadsannannanna";
    const char *pattern="annanna";
    printf("\n%d",kmp_func(target, 23, pattern, 7));
    return 0;
}
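
The sample above stores, for each prefix, the index of the last overlapping character (-1 for no overlap), whereas the analysis states the jump m+(j-overlay_func[j]) in terms of overlap lengths. The following is a minimal sketch of that length-based convention, extended to report every occurrence rather than only the first. The names overlap_lengths and find_all are hypothetical, and the use of std::string and std::vector is an assumption made for brevity, not the original author's code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// overlap_len[j] = number of characters shared by the left end and the
// right end of the first j characters of pattern (0 <= j <= pattern.size()).
std::vector<std::size_t> overlap_lengths(const std::string& pattern) {
    std::vector<std::size_t> overlap_len(pattern.size() + 1, 0);
    std::size_t k = 0;  // overlap length of the prefix handled so far
    for (std::size_t j = 2; j <= pattern.size(); ++j) {
        // pattern[j-1] is the newly added character; shrink the overlap
        // until it can be extended or nothing is left.
        while (k > 0 && pattern[j - 1] != pattern[k]) k = overlap_len[k];
        if (pattern[j - 1] == pattern[k]) ++k;
        overlap_len[j] = k;
    }
    return overlap_len;
}

// Returns the start index of every occurrence of pattern in text.
std::vector<std::size_t> find_all(const std::string& text, const std::string& pattern) {
    std::vector<std::size_t> hits;
    if (pattern.empty() || pattern.size() > text.size()) return hits;
    const std::vector<std::size_t> overlap_len = overlap_lengths(pattern);
    std::size_t start = 0;  // window start in the text (m in the analysis)
    std::size_t j = 0;      // pattern characters matched in this window
    while (start + pattern.size() <= text.size()) {
        while (j < pattern.size() && text[start + j] == pattern[j]) ++j;
        if (j == pattern.size()) hits.push_back(start);
        if (j == 0) {
            ++start;                      // first character failed
        } else {
            start += j - overlap_len[j];  // jump to m + (j - OF(j))
            j = overlap_len[j];           // these characters already match
        }
    }
    return hits;
}
```

For the pattern annanna the table is 0, 0, 0, 0, 1, 2, 3, 4, which is the index-based table printed by the sample plus one, with an extra leading 0 for the empty prefix. On the sample text, find_all reports the overlapping occurrences at indexes 13 and 16.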

 

Reference links:
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
http://www.matrix67.com/blog/archives/115
http://blog.csdn.net/v_JULY_v/article/details/6111565

Reposted from: https://www.cnblogs.com/leo-chen-2014/p/3758683.html
