# [Polish Black Magic (continuously updated), UPD 16.5.6] Small-Space Multiple-Pattern Matching

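Yesterday the veteran Claris recommended this Polish "black magic" paper to me. It is mainly about using hashing to solve the problems that an Aho-Corasick automaton can solve, so for the flourishing of the black-magic cause I am starting to translate it today. Progress may be very, very slow…. Continuously updated online.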

## Section 1. Introduction

Multiple-pattern matching, the task of locating the occurrences of s
patterns of total length m in a single text of length n, is a
fundamental problem in the field of string algorithms. The algorithm
by Aho and Corasick [2] solves this problem using O(n+m) time and O(m)
working space in addition to the space needed for the text and
patterns. To list all occ occurrences rather than, e.g., the leftmost
ones, extra O(occ) time is necessary. When the space is limited, we
can use a compressed Aho-Corasick automaton [11]. In extreme cases,
one could apply a linear-time constant-space single-pattern matching
algorithm sequentially for each pattern in turn, at the cost of
increasing the running time to O(n · s + m). Well-known examples of
such algorithms include those by Galil and Seiferas [8], Crochemore
and Perrin [5], and Karp and Rabin [13] (see [3] for a recent survey).

It is easy to generalize Karp-Rabin matching to handle multiple patterns in O(n + m) expected time and
O(s) working space provided that all patterns are of the same length [10]. To do this, we store the fingerprints
of the patterns in a hash table, and then slide a window over the text maintaining the fingerprint of the
fragment currently in the window. The hash table lets us check if the fragment is an occurrence of a pattern.
If so, we report it and update the hash table so that every pattern is returned at most once. This is a very simple and actually applied idea [1], but it is not clear how to extend it for patterns with many distinct
lengths. In this paper we develop a dictionary matching algorithm which works for any set of patterns in
O(n log n + m) time and O(s) working space, assuming that read-only random access to the text and the
patterns is available. If required, we can compute for every pattern its longest prefix occurring in the text,
also in O(n log n + m) time and O(s) working space.

In a very recent independent work Clifford et al. [4] gave a dictionary matching algorithm in the streaming
model. In this setting the patterns and later the text are scanned once only (as opposed to read-only random
access) and an occurrence needs to be reported immediately after its last character is read. Their algorithm
uses $O(s \log \ell)$ space and takes $O(\log\log(s + \ell))$ time per character, where $\ell$ is the length of the longest pattern ($\frac{m}{s} \le \ell \le m$). Even though some of the ideas used in both results are similar, one should note that
the streaming and read-only models are quite different. In particular, computing the longest prefix occurring
in the text for every pattern requires $\Omega(m \log \min(n, |\Sigma|))$ bits of space in the streaming model, as opposed
to the O(s) working space achieved by our solution in the read-only setting.

As a prime application of our dictionary matching algorithm, we show how to approximate the Lempel-Ziv
77 (LZ77) parse [18] of a text of length n using working space proportional to the number of phrases
(again, we assume read-only random access to the text). Computing the LZ77 parse in small space is an
issue of high importance, with space being a frequent bottleneck of today’s systems. Moreover, LZ77 is
useful not only for data compression, but also as a way to speed up algorithms [15]. We present a general
approximation algorithm working in O(z) space for inputs admitting LZ77 parsing with z phrases. For any
$\varepsilon \in (0,1]$, the algorithm can be used to produce a parse consisting of $(1+\varepsilon)z$ phrases in $O(\varepsilon^{-1} n \log n)$ time.
To the best of our knowledge, approximating LZ77 factorization in small space has not been considered
before, and our algorithm is significantly more efficient than methods producing the exact answer. A recent
sublinear-space algorithm, due to Kärkkäinen et al. [12], runs in O(nd) time and uses O(n/d) space, for any
parameter d. An earlier online solution by Gąsieniec et al. [9] uses O(z) space and takes $O(z^2 \log^2 z)$ time
for each character appended. Other previous methods use significantly more space when the parse is small
relative to n; see [7] for a recent discussion.

Structure of the paper. Sect. 2 introduces terminology and recalls several known concepts. This is
followed by the description of our dictionary matching algorithm. In Sect. 3 we show how to process
patterns of length at most s and in Sect. 4 we handle longer patterns, with different procedures for repetitive
and non-repetitive ones. In Sect. 5 we extend the algorithm to compute, for every pattern, the longest
prefix occurring in the text. Finally, in Sect. 7, we apply the dictionary matching algorithm to construct an
approximation of the LZ77 parsing, and in Sect. 6 we explain how to modify the algorithms to make them
Las Vegas.

Model of computation. Our algorithms are designed for the word-RAM with Ω(log n)-bit words and
assume integer alphabet of polynomial size. The usage of Karp-Rabin fingerprints makes them Monte
Carlo randomized: the correct answer is returned with high probability, i.e., the error probability is inverse
polynomial with respect to input size, where the degree of the polynomial can be set arbitrarily large. With
some additional effort, our algorithms can be turned into Las Vegas randomized, where the answer is always
correct and the time bounds hold with high probability. Throughout the whole paper, we assume read-only
random access to the text and the patterns, and we do not include their sizes while measuring space
consumption.

## Section 2. Preliminaries

We consider finite words over an integer alphabet Σ = {0, … , σ − 1}, where σ = poly(n + m). For a word
w = w[1] … w[n] ∈ $\Sigma^n$, we define the length of w as |w| = n. For 1 ≤ i ≤ j ≤ n, a word u = w[i] … w[j]
is called a subword of w. By w[i..j] we denote the occurrence of u at position i, called a fragment of w. A
fragment with i = 1 is called a prefix and a fragment with j = n is called a suffix.

A positive integer p is called a period of w whenever w[i] = w[i + p] for all i = 1, 2, … , |w| − p. In this
case, the prefix w[1..p] is often also called a period of w. The length of the shortest period of a word w is
denoted as per(w). A word w is called periodic if per(w) ≤ |w|/2 and highly periodic if per(w) ≤ |w|/3. The
well-known periodicity lemma [6] says that if p and q are both periods of w, and p + q ≤ |w|, then gcd(p, q)
is also a period of w. We say that word w is primitive if per(w) is not a proper divisor of |w|. Note that the
shortest period w[1.. per(w)] is always primitive.
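The definitions above can be checked by brute force. The following Python sketch is only illustrative: the function names are made up for this post, indexing is 0-based, and the running time is quadratic rather than anything close to the small-space bounds discussed later.

```python
def shortest_period(w):
    """Return per(w): the smallest p >= 1 with w[i] == w[i + p] for all valid i."""
    n = len(w)
    for p in range(1, n + 1):
        # p is a period when every symbol equals the one p positions later
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
    return 0  # only reached for the empty word


def is_periodic(w):
    """per(w) <= |w| / 2, written with integers to avoid fractions."""
    return 2 * shortest_period(w) <= len(w)


def is_highly_periodic(w):
    """per(w) <= |w| / 3."""
    return 3 * shortest_period(w) <= len(w)
```

For instance, `abab` has shortest period 2, so it is periodic but not highly periodic, while `ababab` is highly periodic.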

### 2.1 Fingerprints

Our randomized construction is based on Karp-Rabin fingerprints; see [13]. Fix a word $w[1..n]$ over an
alphabet $\Sigma = \{0, \ldots, \sigma-1\}$, a constant $c \ge 1$, a prime number $p > \max(\sigma, n^{c+4})$, and choose $x \in \mathbb{Z}_p$ uniformly
at random. We define the fingerprint of a subword $w[i..j]$ as $\Phi(w[i..j]) = w[i] + w[i+1]x + \ldots + w[j]x^{j-i} \bmod p$.
With probability at least $1 - \frac{1}{n^c}$, no two distinct subwords of the same length have equal fingerprints. The
situation when this happens for some two subwords is called a false-positive. From now on when stating the
results we assume that there are no false-positives to avoid repeating that the answers are correct with high
probability. For dictionary matching, we assume that no two distinct subwords of w = T P1 … Ps have equal
fingerprints. Fingerprints let us easily locate many patterns of the same length. A straightforward solution
described in the introduction builds a hash table mapping fingerprints to patterns. However, then we can
only guarantee that the hash table is constructed correctly with probability $1 - O(\frac{1}{s^c})$ (for an arbitrary
constant $c$), and we would like to bound the error probability by $O(\frac{1}{(n+m)^c})$. Hence we replace the hash table
with a deterministic dictionary as explained below. Although it increases the time by O(s log s), the extra
term becomes absorbed in the final complexities.
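As a rough illustration of the fingerprint definition, here is a minimal Python sketch. The names `P` and `make_phi` are hypothetical, the fixed Mersenne prime $2^{61}-1$ is an assumption that is only adequate for toy inputs (the text requires a prime larger than $\max(\sigma, n^{c+4})$), and $x = 0$ is excluded for convenience.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; assumed large enough for small examples


def make_phi(x, p=P):
    """Return the map w -> w[0] + w[1]*x + ... + w[k]*x^k mod p for a fixed x."""
    def phi(w):
        h, power = 0, 1
        for symbol in w:          # symbols are non-negative integers
            h = (h + symbol * power) % p
            power = power * x % p
        return h
    return phi


x = random.randrange(1, P)        # random base
phi = make_phi(x)
u, v = [3, 1, 4], [1, 5]
# Concatenation rule that follows from the definition: Phi(uv) = Phi(u) + x^|u| * Phi(v)
assert phi(u + v) == (phi(u) + pow(x, len(u), P) * phi(v)) % P
```

The concatenation rule checked at the end is what makes sliding-window and prefix-based fingerprint computations cheap.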

#### Theorem 1

Given a text $T$ of length $n$ and patterns $P_1, \ldots, P_s$, each of length exactly $\ell$, we can compute the leftmost occurrence of every pattern $P_i$ in $T$ using $O(n + s\ell + s \log s)$ total time and $O(s)$ space.

#### Proof

We calculate the fingerprint $\Phi(P_j)$ of every pattern. Then we build in $O(s \log s)$ time [16] a deterministic dictionary $D$ with an entry mapping $\Phi(P_j)$ to $j$. For multiple identical patterns we create just one entry, and at the end we copy the answers to all instances of the pattern. Then we scan the text $T$ with a sliding window of length $\ell$ while maintaining the fingerprint $\Phi(T[i..i+\ell-1])$ of the current window. Using $D$, we can find in $O(1)$ time an index $j$ such that $\Phi(T[i..i+\ell-1]) = \Phi(P_j)$, if any, and update the answer for $P_j$ if needed (i.e., if there was no occurrence of $P_j$ before). If we precompute $x^{-1}$, the fingerprints $\Phi(T[i..i+\ell-1])$ can be updated in $O(1)$ time while increasing $i$.
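The scan described in the proof might look roughly like the Python sketch below. It is not the authors' implementation: a built-in `dict` stands in for the deterministic dictionary, the prime $2^{61}-1$ and the function name `leftmost_occurrences` are assumptions, positions are 0-based, and answers are only correct with high probability (Monte Carlo). The modular inverse via `pow(x, -1, p)` needs Python 3.8 or newer.

```python
import random


def leftmost_occurrences(text, patterns, p=(1 << 61) - 1, seed=None):
    """Leftmost 0-based occurrence (or None) of each pattern; all patterns
    are non-empty bytes objects of one common length."""
    x = random.Random(seed).randrange(1, p)
    ell = len(patterns[0])

    def phi(w):
        h, power = 0, 1
        for symbol in w:
            h = (h + symbol * power) % p
            power = power * x % p
        return h

    # One dictionary entry per distinct fingerprint; duplicates share it.
    representative = {}
    rep_of = [representative.setdefault(phi(pat), j)
              for j, pat in enumerate(patterns)]

    answer = [None] * len(patterns)
    n = len(text)
    if n >= ell:
        x_inv = pow(x, -1, p)            # precomputed inverse of x
        top = pow(x, ell - 1, p)         # weight of the newest symbol
        h = phi(text[:ell])
        for i in range(n - ell + 1):
            j = representative.get(h)
            if j is not None and answer[j] is None:
                answer[j] = i            # first time this pattern is seen
            if i + ell < n:
                # drop text[i], divide by x, append text[i + ell]
                h = ((h - text[i]) * x_inv + text[i + ell] * top) % p
    # copy the answers to all instances of identical patterns
    return [answer[r] for r in rep_of]
```

Calling it on `b"abracadabra"` with the patterns `abr`, `cad`, `xyz`, `abr` should report positions 0, 4, none, and 0, barring a fingerprint collision.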

### 2.2 Tries

A trie of a collection of strings P1, … , Ps is a rooted tree whose nodes correspond to prefixes of the strings.
The root represents the empty word and the edges are labeled with single characters. The node corresponding
to a particular prefix is called its locus. In a compacted trie unary nodes that do not represent any Pi are
dissolved and the labels of their incident edges are concatenated. The dissolved nodes are called implicit as
opposed to the explicit nodes, which remain stored. The locus of a string in a compacted trie might therefore
be explicit or implicit. All edges outgoing from the same node are stored on a list sorted according to the
first character, which is unique among these edges. The labels of edges of a compacted trie are stored as
pointers to the respective fragments of strings Pi. Consequently, a compacted trie can be stored in space
proportional to the number of explicit nodes, which is O(s).
Consider two compacted tries T1 and T2. We say that (possibly implicit) nodes v1 ∈ T1 and v2 ∈ T2 are
twins if they are loci of the same string. Note that every v1 ∈ T1 has at most one twin v2 ∈ T2.

Lemma 2. Given two compacted tries T1 and T2 constructed for s1 and s2 strings, respectively, in O(s1 + s2)
total time and space we can find for each explicit node v1 ∈ T1 a node v2 ∈ T2 such that if v1 has a twin in
T2, then v2 is its twin. (If v1 has no twin in T2, the algorithm returns an arbitrary node v2 ∈ T2.)

Proof. We recursively traverse both tries while maintaining a pair of nodes v1 ∈ T1 and v2 ∈ T2, starting
with the roots of T1 and T2, that satisfies the following invariant: either v1 and v2 are twins, or v1 has no twin in
T2. If v1 is explicit, we store v2 as the candidate for its twin. Next, we list the (possibly implicit) children
of v1 and v2 and match them according to the edge labels with a linear scan. We recurse on all pairs of
matched children. If both v1 and v2 are implicit, we simply advance to their immediate children. The last
step is repeated until we reach an explicit node in at least one of the tries, so we keep it implicit in the
implementation to make sure that the total number of operations is O(s1 + s2). If a node v ∈ T1 is not
visited during the traversal, for sure it has no twin in T2. Otherwise, we compute a single candidate for its
twin.
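The simultaneous traversal can be illustrated on plain (uncompacted) tries, where every node is explicit. This is a deliberate simplification of Lemma 2, not the compacted version: the nested-`dict` representation and the names `build_trie` and `twin_loci` are assumptions of this sketch, and carrying the label strings along makes it slower than the linear bound of the lemma.

```python
def build_trie(strings):
    """Uncompacted trie as nested dicts: child character -> child node."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def twin_loci(t1, t2):
    """Return the strings whose loci exist in both tries (the twin pairs)."""
    twins = []
    stack = [(t1, t2, "")]            # both roots are loci of the empty word
    while stack:
        v1, v2, label = stack.pop()
        twins.append(label)           # invariant: v1 and v2 are twins here
        for ch, child1 in v1.items():
            child2 = v2.get(ch)       # match children by edge label
            if child2 is not None:
                stack.append((child1, child2, label + ch))
    return sorted(twins)
```

Nodes of the first trie that are never reached by the traversal have no twin, which mirrors the last remark of the proof.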

## Section 3. Short Patterns

To handle the patterns of length not exceeding a given threshold $\ell$, we first build a compacted trie for those
patterns. Construction is easy if the patterns are sorted lexicographically: we insert them one by one into
the compacted trie first naively traversing the trie from the root, then potentially partitioning one edge into
two parts, and finally adding a leaf if necessary. Thus, the following result suffices to efficiently build the
tries.

Lemma 3. One can lexicographically sort strings $P_1, \ldots, P_s$ of total length $m$ in $O(m + \sigma^{\varepsilon})$ time using $O(s)$
space, for any constant $\varepsilon > 0$.

Proof. We separately sort the $\sqrt{m} + \sigma^{\varepsilon/2}$ longest strings and all the remaining strings, and then merge both
sorted lists. Note that these longest strings can be found in $O(s)$ time using a linear-time selection algorithm.

Long strings are sorted using insertion sort. If the longest common prefixes between adjacent (in the
sorted order) strings are computed and stored, inserting $P_j$ can be done in $O(j + |P_j|)$ time. In more detail,
let $S_1, S_2, \ldots, S_{j-1}$ be the sorted list of already processed strings. We start with $k := 1$ and keep increasing
$k$ by one as long as $S_k$ is lexicographically smaller than $P_j$ while maintaining the length of the longest common prefix
between $S_k$ and $P_j$, denoted $\ell$. After increasing $k$ by one, we update $\ell$ using the length of the longest common prefix
between $S_{k-1}$ and $S_k$, denoted $\ell'$, as follows. If $\ell' > \ell$, we keep $\ell$ unchanged. If $\ell' = \ell$, we try to iteratively
increase $\ell$ by one as long as possible. In both cases, the new value of $\ell$ allows us to lexicographically compare
$S_k$ and $P_j$ in constant time. Finally, $\ell' < \ell$ guarantees that $P_j < S_k$ and we may terminate the procedure.
Sorting the $\sqrt{m} + \sigma^{\varepsilon/2}$ longest strings using this approach takes $O(m + (\sqrt{m} + \sigma^{\varepsilon/2})^2) = O(m + \sigma^{\varepsilon})$ time.
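The insertion step with stored longest common prefixes can be sketched in Python as follows. The names (`lcp_from`, `is_smaller`, `insert_with_lcp`) and the 0-based list layout, with `lcps[k]` holding the common prefix length of `strings[k-1]` and `strings[k]`, are assumptions of this sketch rather than anything fixed by the text.

```python
def lcp_from(a, b, start):
    """LCP length of a and b, given that they agree on the first `start` symbols."""
    i = start
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i


def is_smaller(s, p, common):
    """Decide s < p in constant time, given common = lcp(s, p)."""
    if common == len(s):
        return len(s) < len(p)       # s is a prefix of p
    if common == len(p):
        return False                 # p is a proper prefix of s
    return s[common] < p[common]


def insert_with_lcp(strings, lcps, p):
    """Insert p into the sorted list `strings`, keeping `lcps` up to date."""
    prev = 0   # lcp(strings[k-1], p), known once strings[k-1] < p is established
    cur = 0    # lcp(strings[k], p)
    k = 0
    while k < len(strings):
        if lcps[k] > prev:
            cur = prev                               # value stays unchanged
        elif lcps[k] == prev:
            cur = lcp_from(strings[k], p, prev)      # try to extend it
        else:
            cur = lcps[k]                            # p < strings[k]: stop
            break
        if not is_smaller(strings[k], p, cur):
            break
        prev = cur
        k += 1
    strings.insert(k, p)
    lcps.insert(k, prev)             # LCP with the new predecessor (0 if none)
    if k + 1 < len(strings):
        lcps[k + 1] = cur            # LCP with the new successor
```

Because symbol comparisons always resume from the current common prefix length, one insertion performs about $j + |P_j|$ elementary steps, matching the bound quoted above (ignoring the cost of Python's list insertion).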

The remaining strings are of length at most $\sqrt{m}$ each, and if there are any, then $s \ge \sigma^{\varepsilon/2}$. We sort these
strings by iteratively applying radix sort, treating each symbol from $\Sigma$ as a sequence of $\frac{2}{\varepsilon}$ symbols from
$\{0, 1, \ldots, \sigma^{\varepsilon/2} - 1\}$. Then a single radix sort takes time and space proportional to the number of strings
involved plus the alphabet size, which is $O(s + \sigma^{\varepsilon/2}) = O(s)$. Furthermore, because the numbers of strings
involved in the subsequent radix sorts sum up to $m$, the total time complexity is $O(m + \sigma^{\varepsilon/2}\sqrt{m}) = O(m + \sigma^{\varepsilon})$.
Finally, the merging takes time linear in the sum of the lengths of all the involved strings, so the total
complexity is as claimed.

Next, we partition $T$ into $O(\frac{n}{\ell})$ overlapping blocks $T_1 = T[1..2\ell]$, $T_2 = T[\ell+1..3\ell]$, $T_3 = T[2\ell+1..4\ell]$, ….
Notice that each subword of length at most $\ell$ is completely contained in some block. Thus, we can consider every block separately. The suffix tree of each block $T_i$ takes $O(\ell \log \ell)$ time [17] and $O(\ell)$ space to construct and store (the suffix
tree is discarded after processing the block). We apply Lemma 2 to the suffix tree and the compacted trie
of patterns; this takes $O(\ell + s)$ time. For each pattern $P_j$ we obtain a node such that the corresponding subword is equal to $P_j$ provided that $P_j$ occurs in $T_i$. We compute the leftmost occurrence $T_i[b..e]$ of the subword, which takes constant time if we store additional data at every explicit node of the suffix tree, and then we check whether $T_i[b..e] = P_j$ using fingerprints. For this, we precompute the fingerprints of all patterns, and for each block $T_i$ we precompute the fingerprints of its prefixes in $O(\ell)$ time and space, which
allows us to determine the fingerprint of any of its subwords in constant time.
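Two ingredients of this step are easy to sketch in Python: the overlapping blocks and the prefix fingerprints that give the fingerprint of any subword of a block in constant time. The suffix tree part is omitted. The names `blocks` and `PrefixFingerprints` are hypothetical, positions are 1-based and inclusive to match the text, and the fingerprint base `x` and prime `p` are passed in by the caller.

```python
def blocks(n, ell):
    """1-based inclusive (start, end) pairs: T[1..2l], T[l+1..3l], ... for n >= 1."""
    out, start = [], 1
    while True:
        end = min(start + 2 * ell - 1, n)
        out.append((start, end))
        if end == n:
            return out
        start += ell


class PrefixFingerprints:
    """Fingerprints of all prefixes of a block, plus inverse powers of x."""

    def __init__(self, block, x, p):
        self.p = p
        self.pref = [0]                  # pref[k] = Phi(block[1..k])
        self.inv = [1]                   # inv[k] = x^(-k) mod p
        x_inv = pow(x, -1, p)            # needs Python 3.8+
        power = 1
        for symbol in block:
            self.pref.append((self.pref[-1] + symbol * power) % p)
            power = power * x % p
            self.inv.append(self.inv[-1] * x_inv % p)

    def fingerprint(self, b, e):
        """Phi(block[b..e]) in O(1): subtract the prefix, then shift by x^-(b-1)."""
        return (self.pref[e] - self.pref[b - 1]) * self.inv[b - 1] % self.p
```

With `n = 10` and `ell = 3` the blocks are `(1, 6)`, `(4, 9)`, `(7, 10)`, so any subword of length at most 3 lies inside one of them.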

In total, we spend $O(m + \sigma^{\varepsilon})$ for preprocessing and $O(\ell \log \ell + s)$ for each block. Since $\sigma = (n + m)^{O(1)}$,
for small enough $\varepsilon$ this yields the following result.

Theorem 4. Given a text $T$ of length $n$ and patterns $P_1, \ldots, P_s$ of total length $m$, using $O(n \log \ell + s\frac{n}{\ell} + m)$
total time and $O(s + \ell)$ space we can compute the leftmost occurrences in $T$ of every pattern $P_j$ of length at most $\ell$.

## Section 4. Long Patterns

To handle patterns longer than a certain threshold, we first distribute them into groups according to the
value of $\lfloor \log_{4/3} |P_j| \rfloor$. Patterns longer than the text can be ignored, so there are $O(\log n)$ groups. Each
group is handled separately, and from now on we consider only patterns $P_j$ satisfying $\lfloor \log_{4/3} |P_j| \rfloor = i$.

We classify the patterns into classes depending on the periodicity of their prefixes and suffixes. We
set $\ell = \lceil (4/3)^i \rceil$ and define $\alpha_j$ and $\beta_j$ as, respectively, the prefix and the suffix of length $\ell$ of $P_j$. Since
$\frac{2}{3}(|\alpha_j| + |\beta_j|) = \frac{4}{3}\ell \ge |P_j|$, the following fact yields a classification of the patterns into three classes: either
$P_j$ is highly periodic, or $\alpha_j$ is not highly periodic, or $\beta_j$ is not highly periodic. The intuition behind this
classification is that if the prefix or the suffix is not repetitive, then we will not see it many times in a short
subword of the text. On the other hand, if both the prefix and suffix are repetitive, then there is some
structure that we can take advantage of.
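The grouping and the three-way classification can be sketched in Python as below. The function names and class labels are invented for this sketch, the period is computed by brute force (not by the small-space method of Lemma 6), and exact integer arithmetic replaces the floating-point logarithm.

```python
def shortest_period(w):
    """Brute-force per(w) for a non-empty word."""
    n = len(w)
    return next(p for p in range(1, n + 1)
                if all(w[i] == w[i + p] for i in range(n - p)))


def highly_periodic(w):
    return 3 * shortest_period(w) <= len(w)


def group_index(length):
    """floor(log_{4/3} length): the largest i with 4^i <= length * 3^i."""
    i = 0
    while 4 ** (i + 1) <= length * 3 ** (i + 1):
        i += 1
    return i


def window_length(i):
    """ceil((4/3)^i), computed with integer ceiling division."""
    return -(-(4 ** i) // (3 ** i))


def classify(pattern):
    ell = window_length(group_index(len(pattern)))
    alpha, beta = pattern[:ell], pattern[-ell:]      # prefix and suffix of length ell
    if highly_periodic(pattern):
        return "highly periodic"
    if not highly_periodic(alpha):
        return "prefix"      # alpha is not highly periodic
    return "suffix"          # by the classification above, beta is not highly periodic
```

For example, `aaaaaab` falls into group 6 with window length 6: its prefix `aaaaaa` is highly periodic, the whole pattern is not, so the suffix `aaaaab` must be the non-repetitive part.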

Fact 5. Suppose x and y are a prefix and a suffix of a word w, respectively. If |x| + |y| ≥ |w| + p and p is a
period of both x and y, then p is a period of w.
Proof. We need to prove that w[i] = w[i + p] for all i = 1,2, … , |w| − p. If i + p ≤ |x| this follows from p
being a period of x, and if i ≥ |w| − |y|+1 from p being a period of y. Because |x|+ |y| ≥ |w|+ p, these two
cases cover all possible values of i.

To assign every pattern to the appropriate class, we compute the periods of Pj, αj and βj using small
space. Roughly the same result has been proved in [14], but for completeness we provide the full proof here.

Lemma 6. Given a read-only string w one can decide in O(|w|) time and constant space if w is periodic
and if so, compute per(w).
Proof. Let $v$ be the prefix of $w$ of length $\lceil \frac{1}{2}|w| \rceil$ and $p$ be the starting position of the second occurrence of $v$
in $w$, if any. We claim that if $\mathrm{per}(w) \le \frac{1}{2}|w|$, then $\mathrm{per}(w) = p - 1$. Observe first that in this case $v$ occurs
at position $\mathrm{per}(w) + 1$. Hence, $\mathrm{per}(w) \ge p - 1$. Moreover, $p - 1$ is a period of $w[1..|v| + p - 1]$ along with
$\mathrm{per}(w)$. By the periodicity lemma, $\mathrm{per}(w) \le \frac{1}{2}|w| \le |v|$ implies that $\gcd(p - 1, \mathrm{per}(w))$ is also a period of
that prefix. Thus $\mathrm{per}(w) > p - 1$ would contradict the primitivity of $w[1..\mathrm{per}(w)]$.

The algorithm computes the position p using a linear time constant-space pattern matching algorithm.
If it exists, it uses letter-by-letter comparison to determine whether w[1..p − 1] is a period of w. If so, by the
discussion above per(w) = p − 1 and the algorithm returns this value. Otherwise, 2per(w) > |w|, i.e., w is
not periodic. The algorithm runs in linear time and uses constant space.
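The procedure from the proof of Lemma 6 can be sketched in Python as follows. Two caveats: `str.find` is used as a stand-in for the linear-time constant-space pattern matching algorithm, so this sketch does not reproduce the space bound, and the function name `period_if_periodic` is hypothetical. Indices are 0-based, so the returned value corresponds to $p - 1$ in the text.

```python
def period_if_periodic(w):
    """Return per(w) if w is periodic (per(w) <= |w|/2), otherwise None."""
    n = len(w)
    v = w[:(n + 1) // 2]                 # prefix of length ceil(n / 2)
    pos = w.find(v, 1)                   # start of the second occurrence of v
    if pos == -1:
        return None                      # no candidate period at most n / 2
    # letter-by-letter check that pos is a period of the whole word
    if all(w[i] == w[i + pos] for i in range(n - pos)):
        return pos
    return None
```

For `abcabcabc` the prefix `abcab` reappears at offset 3, which is then confirmed as the period; for `abaab` there is no second occurrence of `aba`, so the word is reported as not periodic.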

### 4.1 Patterns Without Highly Periodic Prefixes

Below we show how to deal with patterns with non-highly periodic prefixes $\alpha_j$. Patterns with non-highly
periodic suffixes $\beta_j$ can be processed using the same method after reversing the text and the patterns.

Lemma 7. Let $\ell$ be an arbitrary integer. Suppose we are given a text $T$ of length $n$ and patterns $P_1, \ldots, P_s$ such that for $1 \le j \le s$ we have $\ell \le |P_j| < \frac{4}{3}\ell$ and $\alpha_j = P_j[1..\ell]$ is not highly periodic. We can compute the
leftmost and the rightmost occurrence of each pattern $P_j$ in $T$ using $O(n + s(1 + \frac{n}{\ell}) \log s + s\ell)$ time and $O(s)$
space.

The algorithm scans the text $T$ with a sliding window of length $\ell$. Whenever it encounters a subword equal to the prefix $\alpha_j$ of some $P_j$, it creates a request to verify whether the corresponding suffix $\beta_j$ of length
$\ell$ occurs at the appropriate position. The request is processed when the sliding window reaches that position.
This way the algorithm detects the occurrences of all the patterns. In particular, we may store the leftmost
and rightmost occurrence of each pattern.

We use the fingerprints to compare the subwords of T with αj and βj. To this end, we precompute Φ(αj)
and Φ(βj) for each j. We also build a deterministic dictionary D [16] with an entry mapping Φ(αj) to j for
every pattern (if there are multiple patterns with the same value of Φ(αj), the dictionary maps a fingerprint
to a list of indices). These steps take $O(s\ell)$ and $O(s \log s)$ time, respectively. Pending requests are maintained
in a priority queue $Q$, implemented using a binary heap, as pairs containing the pattern index (as a value)
and the position where the occurrence of βj is anticipated (as a key).

Algorithm 1 provides a detailed description of the processing phase. Let us analyze its time and space
complexities. Due to the properties of Karp-Rabin fingerprints, line 2 can be implemented in O(1) time.
Also, the loops in lines 3 and 5 take extra O(1) time even if the respective collections are empty. Apart from
these, every operation can be assigned to a request, each of them taking O(1) (lines 3 and 5-6) or O(log |Q|)
(lines 4 and 8) time. To bound |Q|, we need to look at the maximum number of pending requests.

Fact 8. For any pattern $P_j$ just $O(1 + \frac{n}{\ell})$ requests are created and at any time at most one of them is pending.

Proof. Note that there is a one-to-one correspondence between requests concerning $P_j$ and the occurrences
of $\alpha_j$ in $T$. The distance between two such occurrences must be at least $\frac{1}{3}\ell$, because otherwise the period of $\alpha_j$ would be at most $\frac{1}{3}\ell$, thus making $\alpha_j$ highly periodic. This yields the $O(1 + \frac{n}{\ell})$ upper bound on the total number of requests. Additionally, any request is pending for at most $|P_j| - \ell < \frac{1}{3}\ell$ iterations of the
main for loop. Thus, the request corresponding to an occurrence of $\alpha_j$ is already processed before the next
occurrence appears.

Hence, the scanning phase uses $O(s)$ space and takes $O(n + s(1 + \frac{n}{\ell}) \log s)$ time. Taking preprocessing
into account, we obtain bounds claimed in Lemma 7.
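The listing of Algorithm 1 is not reproduced in this post, so the following Python sketch reconstructs the scanning phase only from the prose description: its structure, the name `scan_with_requests`, and the use of `heapq` and a `dict` are assumptions. It takes `ell` as a parameter, assumes every pattern satisfies $\ell \le |P_j| < \frac{4}{3}\ell$ (neither this nor the non-periodicity of the prefixes is checked), uses 0-based positions, and is Monte Carlo like the rest of the fingerprint-based code.

```python
import heapq
import random


def scan_with_requests(text, patterns, ell, p=(1 << 61) - 1, seed=None):
    """Return (leftmost, rightmost) 0-based occurrences of every pattern."""
    x = random.Random(seed).randrange(1, p)

    def phi(w):
        h, power = 0, 1
        for symbol in w:
            h = (h + symbol * power) % p
            power = power * x % p
        return h

    by_alpha = {}                                    # Phi(alpha_j) -> indices j
    for j, pat in enumerate(patterns):
        by_alpha.setdefault(phi(pat[:ell]), []).append(j)
    beta_fp = [phi(pat[-ell:]) for pat in patterns]  # Phi(beta_j)

    leftmost = [None] * len(patterns)
    rightmost = [None] * len(patterns)
    n = len(text)
    if n < ell:
        return leftmost, rightmost

    x_inv, top = pow(x, -1, p), pow(x, ell - 1, p)
    h = phi(text[:ell])
    queue = []                  # heap of (window position where beta_j is due, j)
    for i in range(n - ell + 1):
        for j in by_alpha.get(h, ()):                # alpha_j found at i
            heapq.heappush(queue, (i + len(patterns[j]) - ell, j))
        while queue and queue[0][0] == i:            # requests due now
            _, j = heapq.heappop(queue)
            if h == beta_fp[j]:                      # beta_j found where expected
                start = i + ell - len(patterns[j])
                if leftmost[j] is None:
                    leftmost[j] = start
                rightmost[j] = start
        if i + ell < n:                              # slide the window by one
            h = ((h - text[i]) * x_inv + text[i + ell] * top) % p
    return leftmost, rightmost
```

Since $|\alpha_j| + |\beta_j| = 2\ell \ge |P_j|$, matching both the prefix and the suffix at the right offsets covers the whole pattern, which is why no further verification is attempted here.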

### 4.2 Highly Periodic Patterns

Lemma 9. Let $\ell$ be an arbitrary integer. Given a text $T$ of length $n$ and a collection of highly periodic patterns $P_1, \ldots, P_s$ such that for $1 \le j \le s$ we have $\ell \le |P_j| < \frac{4}{3}\ell$, we can compute the leftmost occurrence of each pattern $P_j$ in $T$ using $O(n + s(1 + \frac{n}{\ell}) \log s + s\ell)$ total time and $O(s)$ space.

The solution is basically the same as in the proof of Lemma 7, except that the algorithm ignores certain
shiftable occurrences. An occurrence of x at position i of T is called shiftable if there is another occurrence of x at position i − per(x). The remaining occurrences are called non-shiftable. Notice that the leftmost
occurrence is always non-shiftable, so indeed we can safely ignore some of the shiftable occurrences of the
patterns. Because $2\,\mathrm{per}(P_j) \le \frac{2}{3}|P_j| \le \frac{8}{9}\ell < \ell$, the following fact implies that if an occurrence of $P_j$ is
non-shiftable, then the occurrence of αj at the same position is also non-shiftable.
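To make the notion of shiftable occurrences concrete, here is a brute-force Python sketch that lists the non-shiftable occurrences of a word `x` in `w`. The helper names are hypothetical, positions are 0-based, `x` is assumed non-empty, and nothing here is meant to be efficient.

```python
def shortest_period(x):
    """Brute-force per(x) for a non-empty word."""
    n = len(x)
    return next(p for p in range(1, n + 1)
                if all(x[i] == x[i + p] for i in range(n - p)))


def occurrences(x, w):
    """All (possibly overlapping) start positions of x in w."""
    out, i = [], w.find(x)
    while i != -1:
        out.append(i)
        i = w.find(x, i + 1)
    return out


def non_shiftable_occurrences(x, w):
    """Occurrences at i such that x does not also occur at i - per(x)."""
    per = shortest_period(x)
    occ = set(occurrences(x, w))
    return [i for i in sorted(occ) if i - per not in occ]
```

In `zababababz` the word `abab` occurs at positions 1, 3 and 5, but only the first of them is non-shiftable, so a run of a periodic pattern contributes a single occurrence that has to be examined.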

Fact 10. Let y be a prefix of x such that |y| ≥ 2 per(x). Suppose x has a non-shiftable occurrence at position
i in w. Then, the occurrence of y at position i is also non-shiftable.

Proof. Note that $\mathrm{per}(y) + \mathrm{per}(x) \le |y|$, so the periodicity lemma implies that $\mathrm{per}(y) = \mathrm{per}(x)$.
Let $x = \rho^k\rho'$ where $\rho$ is the shortest period of $x$. Suppose that the occurrence of $y$ at position $i$ is
shiftable, meaning that $y$ occurs at position $i - \mathrm{per}(x)$. Since $|y| \ge \mathrm{per}(x)$, $y$ occurring at position $i - \mathrm{per}(x)$
implies that $\rho$ occurs at the same position. Thus $w[i-\mathrm{per}(x)..i+|x|-1] = \rho^{k+1}\rho'$. But then $x$ clearly occurs
at position $i - \mathrm{per}(x)$, which contradicts the assumption that its occurrence at position $i$ is non-shiftable.

Consequently, we may generate requests only for the non-shiftable occurrences of αj. In other words, if
an occurrence of $\alpha_j$ is shiftable, we do not create the requests and proceed immediately to line 5. To detect
and ignore such shiftable occurrences, we maintain the position of the last occurrence of every $\alpha_j$. However,
if there are multiple patterns sharing the same prefix $\alpha_{j_1} = \ldots = \alpha_{j_k}$, we need to be careful so that the
time to detect a shiftable occurrence is O(1) rather than O(k). To this end, we build another deterministic
dictionary, which stores for each Φ(αj) a pointer to the variable where we maintain the position of the
previously encountered occurrence of αj. The variable is shared by all patterns with the same prefix αj.

It remains to analyze the complexity of the modified algorithm. First, we need to bound the number
of non-shiftable occurrences of a single $\alpha_j$. Assume that there are non-shiftable occurrences of $\alpha_j$ at positions
$i' < i$ such that $i' \ge i - \frac{1}{2}\ell$. Then $i - i' \le \frac{1}{2}\ell$ is a period of $T[i'..i + \ell - 1]$. By the periodicity lemma, $\mathrm{per}(\alpha_j)$ divides $i - i'$, and therefore $\alpha_j$ occurs at position $i - \mathrm{per}(\alpha_j)$, which contradicts the assumption that the occurrence at position $i$ is non-shiftable. Consequently, the non-shiftable occurrences of every $\alpha_j$ are at least $\frac{1}{2}\ell$ characters apart, and the total number of requests and the maximum number of pending requests
can be bounded by $O(s(1 + \frac{n}{\ell}))$ and $O(s)$, respectively, as in the proof of Lemma 7. Taking into account
the time and space to maintain the additional components, which are $O(n + s \log s)$ and $O(s)$, respectively,
the final bounds remain the same.

### 4.3 Summary

Theorem 11. Given a text $T$ of length $n$ and patterns $P_1, \ldots, P_s$ of total length $m$, using $O(n \log n + m + s\frac{n}{\ell} \log s)$ total time and $O(s)$ space we can compute the leftmost occurrences in $T$ of every pattern $P_j$ of length at least $\ell$.

Proof. The algorithm distributes the patterns into O(log n) groups according to their lengths, and then into
three classes according to their repetitiveness, which takes O(m) time and O(s) space in total. Then, it
applies either Lemma 7 or Lemma 9 on every class. It remains to show that the running times of all those
calls sum up to the claimed bound. Each of them can be seen as $O(n)$ plus $O(|P_j| + (1 + \frac{n}{|P_j|}) \log s)$ per every
pattern $P_j$. Because $\ell \le |P_j| \le n$ and there are $O(\log n)$ groups, this sums up to $O(n \log n + m + s\frac{n}{\ell} \log s)$.

Using Thm. 4 for all patterns of length at most min(n, s), and (if s ≤ n) Thm. 11 for patterns of length
at least s, we obtain our main theorem.
Theorem 12. Given a text T of length n and patterns P1, … , Ps of total length m, we can compute the
leftmost occurrence in T of every pattern Pj using O(n log n + m) total time and O(s) space.
