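The previous methods read the text one letter at a time, performing a state transition for each letter. The methods in this section do not read every letter.
The author notes that this approach does not work in every case, so the regular expression used below has been changed.
This section no longer studies (AT|GA) ((AG|AAA)*) but ((GA|AAA)*) (TA|AG), which is the reverse. Here is the figure: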
Figure 5.19: Glushkov automaton built on the regular expression ((GA|AAA)*) (TA|AG).
Given a regular expression, we compute the length ℓmin of its shortest occurrence. Any method based on skipping text characters must examine at least one out of every ℓmin characters to avoid missing an occurrence. Hence, in general we will use a window of length ℓmin.
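This defines ℓmin, the length of the shortest string the regular expression can generate. It can be computed with a shortest-path algorithm on the automaton.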
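To make the shortest-path idea concrete, here is a minimal sketch that runs a breadth-first search over the Glushkov automaton of ((GA|AAA)*) (TA|AG). The state numbering (0 for the initial state, 1 to 9 for the letter positions) and the names `LABEL`, `FOLLOW`, `FINAL` and `lmin` are my own assumptions for illustration, not taken from the book.

```python
from collections import deque

# Glushkov automaton for ((GA|AAA)*)(TA|AG).
# Assumed numbering: state 0 is initial, states 1..9 are the letter positions
# G1 A2 A3 A4 A5 T6 A7 A8 G9. Every arrow entering state x is labeled LABEL[x].
LABEL = {1: "G", 2: "A", 3: "A", 4: "A", 5: "A", 6: "T", 7: "A", 8: "A", 9: "G"}
START = [1, 3, 6, 8]  # states that can begin GA, AAA, TA or AG
FOLLOW = {
    0: START, 1: [2], 2: START, 3: [4], 4: [5], 5: START,
    6: [7], 7: [], 8: [9], 9: [],
}
FINAL = {7, 9}


def lmin(follow, final, initial=0):
    """Length of the shortest accepted string: BFS distance to the nearest final state."""
    dist = {initial: 0}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if state in final:
            return dist[state]  # BFS reaches states in order of distance
        for nxt in follow[state]:
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return None  # no final state is reachable


print(lmin(FOLLOW, FINAL))  # 2
```

Every transition consumes exactly one character, so an unweighted BFS is enough and no general shortest-path algorithm is needed.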
5.5.1 Multistring matching approach
This method [Wat96], which we call MultiStringRE, consists of generating the prefixes of length ℓmin for all the strings matching the regular expression, Pref(RE). In the regular expression RE = ((GA|AAA)*) (TA|AG) we have ℓmin(RE) = 2, and the set of length-2 prefixes of strings matching the pattern is Pref(RE) = {GA, AA, TA, AG}. A more complex example would be RE = (AT|GA)(AG|AAA)((AG|AAA)+), where ℓmin(RE) = 6 and the set of prefixes is Pref(RE) = {ATAGAG, ATAGAA, ATAAAA, GAAGAG, GAAGAA, GAAAAA}.
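First, find all the prefixes of length ℓmin of the strings that match the pattern.
The author gives pseudocode for an algorithm that stores these prefixes in a trie. It receives an ε-free NFA and ℓmin and returns Pref in trie form and Active at the leaves.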
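As a rough illustration of what such a routine computes, the sketch below walks the Glushkov automaton of ((GA|AAA)*) (TA|AG) for ℓmin steps and records, for each prefix, the set of states that are active after reading it. It uses a plain dictionary instead of a trie, and the state numbering and the name `compute_pref` are assumptions of mine, not the book's code. It also assumes every state can still reach a final state, which holds for this automaton.

```python
# Glushkov automaton for ((GA|AAA)*)(TA|AG); the numbering is an assumption:
# state 0 is initial, states 1..9 are the positions G1 A2 A3 A4 A5 T6 A7 A8 G9.
LABEL = {1: "G", 2: "A", 3: "A", 4: "A", 5: "A", 6: "T", 7: "A", 8: "A", 9: "G"}
START = [1, 3, 6, 8]
FOLLOW = {
    0: START, 1: [2], 2: START, 3: [4], 4: [5], 5: START,
    6: [7], 7: [], 8: [9], 9: [],
}


def compute_pref(label, follow, lmin, initial=0):
    """Map every length-lmin prefix to the set of states active after reading it."""
    level = {"": frozenset([initial])}
    for _ in range(lmin):
        next_level = {}
        for prefix, states in level.items():
            # Group the reachable states by the character that leads to them.
            by_char = {}
            for state in states:
                for target in follow[state]:
                    by_char.setdefault(label[target], set()).add(target)
            for ch, targets in by_char.items():
                next_level[prefix + ch] = frozenset(targets)
        level = next_level
    return level


pref = compute_pref(LABEL, FOLLOW, 2)
print(sorted(pref))  # ['AA', 'AG', 'GA', 'TA']
```

The keys are the set Pref(RE) = {GA, AA, TA, AG} quoted above, and the values play the role of Active: they tell the verification step which states to start from.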
The effectiveness of this method depends basically on two values: ℓmin (the search is faster for larger ℓmin) and the size of Pref(RE) (the search is faster for less prefixes).
Efficiency depends on two factors: ℓmin and the size of Pref(RE).
This method does not need the initial self-loop.
MultiStringRE(N = (Q, Σ, I, F, Δ), ℓmin)
Preprocessing
    /* Construction of Pref */
    (Pref, Active) ← ComputePref(N, ℓmin)
    /* Construction of the DFA (Figure 5.17) without initial self-loop */
    Produce bit-parallel version N′ = (Qn, Σ, In, Fn, Bn) of N
    (B, Td) ← BuildTran(N′)
Searching
    /* Multipattern search of Pref. Check each occurrence with the DFA */
    For (pos, i) ∈ output of multipattern search of Pref Do
        D ← Active(i), j ← pos + 1
        While j ≤ n AND D & Fn = 0^(m+1) AND D ≠ 0^(m+1) Do
            D ← Td[D] & B[t_j], j ← j + 1
        End of while
        If D & Fn ≠ 0^(m+1) Then
            Report an occurrence beginning at pos + 1 - ℓmin
        End of if
    End of for
Figure 5.22: MultiStringRE search algorithm. It receives an NFA and the minimum length of a string accepted by it and reports the initial positions of occurrences. We assume that the verification is done with the bit-parallel Glushkov simulation of Section 5.4.2. Consequently, we assume a bit map representation of Active.
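Another example follows, but it is too long to reproduce here.
A description of the algorithm:
1. Compute ℓmin and build Pref(RE).
2. Compute the Td and B tables and determine the final states.
3. Run the multipattern search for Pref. Starting from the first occurrence found, assign D its initial value. (By the nature of multipattern search, a match may span word boundaries.)
4. While the end of the text has not been reached, no accepting state is active, and D is not 0, update D ← Td[D] & B[tj] (the same operation as in the bit-parallel Glushkov algorithm). Reaching the end means the whole text has been scanned; reaching an accepting state means an occurrence can be reported; D = 0 means no transition is possible, so the candidate is rejected.
5. If step 4 stopped because an accepting state was reached, report the occurrence. Then go back to step 3 and continue with the next prefix occurrence.
It seems to be simply a combination of multipattern search and bit-parallelism.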
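The steps above can be sketched roughly as follows. This is a simplified illustration under several assumptions of mine: Python sets of states stand in for the bit-parallel tables Td and B, a naive dictionary lookup on every window stands in for a real multipattern search, positions are 0-based, and the automaton numbering and function names are hypothetical.

```python
# Glushkov automaton for ((GA|AAA)*)(TA|AG); the numbering is an assumption:
# state 0 is initial, states 1..9 are the positions G1 A2 A3 A4 A5 T6 A7 A8 G9.
LABEL = {1: "G", 2: "A", 3: "A", 4: "A", 5: "A", 6: "T", 7: "A", 8: "A", 9: "G"}
START = [1, 3, 6, 8]
FOLLOW = {
    0: START, 1: [2], 2: START, 3: [4], 4: [5], 5: START,
    6: [7], 7: [], 8: [9], 9: [],
}
FINAL = frozenset({7, 9})


def compute_pref(label, follow, lmin, initial=0):
    """Map every length-lmin prefix to the states active after reading it."""
    level = {"": frozenset([initial])}
    for _ in range(lmin):
        next_level = {}
        for prefix, states in level.items():
            for state in states:
                for target in follow[state]:
                    key = prefix + label[target]
                    next_level[key] = next_level.get(key, frozenset()) | {target}
        level = next_level
    return level


def multistring_re(text, label, follow, final, lmin):
    """Return the 0-based start positions of occurrences in text."""
    pref = compute_pref(label, follow, lmin)
    starts = []
    for begin in range(len(text) - lmin + 1):
        # Naive stand-in for the multipattern search of Pref.
        active = pref.get(text[begin:begin + lmin])
        if active is None:
            continue
        j = begin + lmin
        # Verification: advance until the text ends, a final state is
        # active, or no state is active any more.
        while j < len(text) and not (active & final) and active:
            active = frozenset(
                t for s in active for t in follow[s] if label[t] == text[j]
            )
            j += 1
        if active & final:
            starts.append(begin)
    return starts


print(multistring_re("CCGATACC", LABEL, FOLLOW, FINAL, 2))  # [2, 4]
```

In "CCGATACC" the prefix GA at position 2 is extended to GATA, and the prefix TA at position 4 is already a complete match, so both starting positions are reported.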
5.5.2 Gnu's heuristic based on necessary factors
A heuristic used in Gnu Grep consists of selecting a necessary set of factors. We call it MultiFactRE. In the simplest case, we may find that a given string must appear in every occurrence of the regular expression. For example, if we look for (AG | GA) ATA ((TT)*), then the string ATA is a necessary factor. The idea in general is to find a set of necessary factors and perform a multipattern search for all of them. There are many ways to choose a suitable set. Note that Pref is just a particular case of this approach. The advantage of Pref is that we know where the match should start, while the general method may need a verification in both directions starting from the factor found.
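So Grep uses this method: find the fixed parts of the expression and use them as the set of necessary factors.
The earlier prefix method is a special case of this one, in which we only search forward from the start. In general, we need to verify in both directions.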
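A very reduced sketch of the idea, using the example from the text: ATA must appear in every occurrence of (AG | GA) ATA ((TT)*), so a cheap substring test can rule out lines before the regular expression is tried at all. Note the simplification, which is my own: instead of verifying in both directions from the factor found, this sketch just reruns Python's `re` engine on the whole line. The function name `search_lines` is hypothetical.

```python
import re

PATTERN = re.compile(r"(AG|GA)ATA((TT)*)")
NECESSARY_FACTOR = "ATA"  # must occur in every match of PATTERN


def search_lines(lines):
    """Return (line, match start) for each line containing an occurrence."""
    hits = []
    for line in lines:
        # Cheap filter: without the necessary factor there can be no match.
        if NECESSARY_FACTOR not in line:
            continue
        # Simplified verification: run the full regular expression on the line.
        match = PATTERN.search(line)
        if match:
            hits.append((line, match.start()))
    return hits


print(search_lines(["CCAGATATTCC", "CCATACC", "GGGG"]))
# [('CCAGATATTCC', 2)]
```

The second line shows why verification is still needed: it contains the factor ATA but no full occurrence.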
The selection of the best set of necessary factors has two parts. The first part is an algorithm that detects the correct candidate sets. The second part is a function that evaluates the cost to search using a candidate set and the number of potential matches it produces. A good measure for evaluating a set is its overall probability of occurrence, but finer considerations may include knowledge of the search algorithm used.
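Choosing a good set of necessary factors has two parts: an algorithm that correctly detects the candidate sets, and a function that evaluates the cost of searching with a candidate set and the number of potential matches it produces. The overall probability of occurrence is a good measure, and a finer evaluation can also take the search algorithm used into account.
The author then gives code that finds the best set of necessary factors on the parse tree; its result is a tuple (all, pref, suff, fact).
How it works: the code works recursively on the parse tree of the regular expression and returns (all, pref, suff, fact), where all is the set of all the strings matching the expression, pref is the best set of prefixes, suff is the best set of suffixes, and fact is the best set of factors. Our answer is the fourth element of the tuple returned. If this is θ, then no finite set of necessary factors exists.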
This method gives better results than MultiStringRE because it has the potential of choosing the best set. In the example ((GA|AAA)*) (TA|AG), instead of choosing a set of four strings as MultiStringRE does, it can choose {TA, AG}, which is smaller.
This is better than the MultiStringRE method described earlier, because the set it chooses is smaller and usually more effective.
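To get a feel for the recursion over the parse tree, here is a simplified sketch of my own, not the book's code. Its conventions are assumptions: the parse tree is built from tuples, `None` stands for an infinite set of strings, a starred subexpression contributes only the empty string, and the cost of a set is the sum of 4^(-length) over its strings (a crude probability of occurrence on a four-letter alphabet).

```python
SIGMA = 4  # assumed alphabet size (DNA: A, C, G, T)


def cost(strings):
    """Assumed cost: rough probability that some string of the set occurs."""
    if strings is None:  # None stands for an infinite, unusable set
        return float("inf")
    return sum(SIGMA ** -len(s) for s in strings)


def best(*candidates):
    return min(candidates, key=cost)


def concat(xs, ys):
    if xs is None or ys is None:
        return None
    return {x + y for x in xs for y in ys}


def union(xs, ys):
    if xs is None or ys is None:
        return None
    return xs | ys


def best_factor(node):
    """Return (all, pref, suff, fact) for a parse tree given as nested tuples."""
    kind = node[0]
    if kind == "lit":  # ("lit", "GA"): a fixed string
        s = {node[1]}
        return s, s, s, s
    if kind == "star":  # ("star", child): may match the empty string
        return None, {""}, {""}, {""}
    a1, p1, s1, f1 = best_factor(node[1])
    a2, p2, s2, f2 = best_factor(node[2])
    if kind == "alt":  # ("alt", left, right)
        return union(a1, a2), union(p1, p2), union(s1, s2), union(f1, f2)
    if kind == "cat":  # ("cat", left, right)
        all_ = concat(a1, a2)
        pref = best(p1, concat(a1, p2))
        suff = best(s2, concat(s1, a2))
        fact = best(f1, f2, concat(s1, p2))
        return all_, pref, suff, fact
    raise ValueError(f"unknown node kind: {kind}")


# ((GA|AAA)*)(TA|AG)
tree = ("cat",
        ("star", ("alt", ("lit", "GA"), ("lit", "AAA"))),
        ("alt", ("lit", "TA"), ("lit", "AG")))
print(sorted(best_factor(tree)[3]))  # ['AG', 'TA']
```

On this example the sketch selects {TA, AG}, the same set mentioned in the text, because the starred part can be empty and so only the final (TA|AG) is guaranteed to appear.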
5.5.3 An approach based on BNDM
Our final technique able to skip characters [NR99a, Nav01b] is an extension of BNDM (Sections 2.4.2, 4.3.2, and 4.2.2) to regular expressions. We call it RegularBNDM. It has the benefit of using the same space as a forward search.
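This is an algorithm extended from BNDM; it works in a single direction.
Hmm... this one is also rather long, and the arrows have to be reversed again. I have not studied BNDM either. I will leave it for now and come back to it if my teacher says it is useful.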