DDJ Article: Full-Text Searching & the Burrows-Wheeler Transform

Originally published April 8, 2004, 09:17

Finding any character sequence in source text—and fast!

By Kendall Willets

Kendall is a software engineer living in San Francisco and can be contacted at kendall@willets.org.

When it comes to full-text indexing, we usually think of methods such as inverted indices that break text on word boundaries, consequently requiring search terms to be whole words only. Yet all of us probably have had the experience of searching for not-quite-words—C++, VM/CMS, SQL*Plus, <blink>, and the like—that were skipped by an inverted index, or broken into less-distinctive pieces. The same goes when you are working with data such as DNA sequences, where you need to quickly find any sequence of symbols.

In this article, I examine an indexing method that lets you find any character sequence in the source text—in time only proportional to the sequence length—using a structure that can compress the entire source text and index into less space than the text alone. This technique is exceptionally fast at detecting and counting occurrences of any string in the source text. The fact that you can build a string match incrementally—adding one character at a time and seeing the result at each step—gives you the flexibility to explore variable patterns such as regular expressions with maximum effectiveness.

In "Fast String Searching with Suffix Trees" (DDJ, August 1996), Mark Nelson addressed full-text indexing using suffix trees, while in "Data Compression with the Burrows-Wheeler Transform" (DDJ, September 1996), he focused on the use of the Burrows-Wheeler Transform (BWT) for compression. While the BWT is commonly known as a data-compression technique, researchers have found that block-sorted data has a structure that lends itself naturally to search, while using space close to its minimal compressed size. This was first demonstrated in the FM index (see "Opportunistic Data Structures with Applications," by Paolo Ferragina and Giovanni Manzini, Proceedings of the 41st IEEE Symposium on Foundations of Computer Science, 2000; http://www.mfn.unipmn.it/~manzini/fmindex/). In short, the same transformation that yields high compression ratios, by grouping similar substrings together, also lets you find arbitrary substrings with little overhead.

Block Sorting Reviewed

When block sorting, you construct a list of n blocks, each a copy of the source text S of length n (usually terminated by a special end-of-string character $), cyclically shifted by zero to n-1 positions. When you sort the blocks, you get a matrix M; see Figure 1(a). The first column of M is called F, and the last, L. F has a simple structure, containing all the characters of S in sorted order, duplicates included. Column L has a more complex structure that contains enough information to reconstruct the original string, and it usually forms the basis for BWT compression.
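As a concrete illustration, here is a minimal Python sketch of the construction using "banana$" (my own rendering; the article's code, bwtindex.cc, is C++):

```python
# Minimal block-sorting sketch: build the sorted matrix M of cyclic
# shifts of S (terminated by '$'), then read off columns F and L.
def block_sort(s):
    n = len(s)
    M = sorted(s[i:] + s[:i] for i in range(n))   # all cyclic shifts, sorted
    F = "".join(row[0] for row in M)    # first column: characters of S, sorted
    L = "".join(row[-1] for row in M)   # last column: the BWT of S
    return M, F, L

M, F, L = block_sort("banana$")
print(F)   # -> $aaabnn
print(L)   # -> annb$aa
```

Note that '$' sorts before the letters, so the row beginning with '$' comes first.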

In its naive form, M contains n² characters, but a simple trick represents each block by its first character and a link to the block starting one character to the right; see Figures 1(b) and 1(c). To decode a block, you read its first character, follow its link to the next block, read its character and link, and repeat the process until the desired number of characters has been read. This character-and-link representation cuts the space requirement from O(n²) to O(n).

The links act as a permutation on M, which I call FL because it permutes the orderly F column to the higher entropy L column. FL is the permutation caused by shifting M one column to the left and resorting it; each row i moves to a new position FL[i].

Since the F column is a sorted list of characters, the next space saver is to stop storing the F column explicitly and instead record the starting position of each character's range in F, using an array that I call "C"; Figure 1(c). At a given position i in M, you look through C to find the section c of F that contains i. The F() method in bwtindex.cc (available electronically; see "Resource Center," page 5) applies this idea. Also see Figure 1(d).

By storing only FL and C, you have a reasonable—but not minimal—representation of M, and you can decode successive characters of the source from left to right. The decode method in bwtindex.cc shows how to carry out this iteration. See Figure 1(e).
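A compact Python sketch of this representation follows (my own rendering of the ideas behind the F() and decode methods; bwtindex.cc itself is C++, and these names are mine):

```python
from bisect import bisect_right

def build_index(s):
    M = sorted(s[i:] + s[:i] for i in range(len(s)))
    rank = {row: i for i, row in enumerate(M)}
    # FL[i]: where row i lands after a one-character left shift
    FL = [rank[row[1:] + row[0]] for row in M]
    chars = sorted(set(s))
    starts, pos = [], 0
    for c in chars:            # C: start of each character's run in F
        starts.append(pos)
        pos += s.count(c)
    return FL, chars, starts

def F_at(chars, starts, i):
    # binary-search C for the section of F containing position i
    return chars[bisect_right(starts, i) - 1]

def decode(FL, chars, starts, i, k):
    # read k characters starting at block i, following FL links
    out = []
    for _ in range(k):
        out.append(F_at(chars, starts, i))
        i = FL[i]
    return "".join(out)

FL, chars, starts = build_index("banana$")
# block 4 of the sorted matrix is the unshifted string itself
print(decode(FL, chars, starts, 4, 7))   # -> banana$
```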

Useful Properties of the Permutation

Figure 2 shows how FL is order-preserving on blocks that start with the same character. That is, given two blocks i and j that both start with c, lexical comparison implies that if i<j, then FL[i]<FL[j]. This is one of the core elements of the BWT.

This order-preserving property means that FL consists of sections of integers in ascending order, one section for each character. You can search one of these sections for a target value quickly, using binary search.
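Both the definition of FL and its order-preserving property can be checked directly on a toy input (again a Python sketch with my own names):

```python
# FL[i] is where row i of the sorted matrix M lands after a
# one-character left shift (i.e., after dropping its first character).
def build_FL(s):
    M = sorted(s[i:] + s[:i] for i in range(len(s)))
    rank = {row: i for i, row in enumerate(M)}
    return [rank[row[1:] + row[0]] for row in M]

FL = build_FL("banana$")
print(FL)        # -> [4, 0, 5, 6, 3, 1, 2]

# Rows 1..3 of M all start with 'a'; FL is ascending on that section.
print(FL[1:4])   # -> [0, 5, 6]
```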

Pattern Matching

If you pick an arbitrary pattern string P, say "abr," one way to find all occurrences of it is to search the sorted blocks in M, finding the range of blocks that start with a, then narrowing it to blocks prefixed by "ab," and so on, extending the pattern from left to right. This method is workable, but a more efficient algorithm (first developed by Ferragina and Manzini) works in the opposite direction, extending and matching the pattern one character to the left at each turn.

To understand this method inductively, first consider how to match one character c, then how to extend a single character beyond a pattern that has already been matched. The answer to the first problem is easy, since you know blocks in the range C[c]...C[c+1]-1 start with c. I call this range "Rc."

To left-extend a pattern match, consider the string cP formed by prepending a character c onto the already matched string P. Use FL, starting from the range of locations prefixed by P, mapping FL inversely to find the interval of blocks prefixed by cP, as follows.

Given the next character c and the range RP of blocks prefixed by P, you need to find the range RcP of blocks prefixed by cP. You know two things about RcP:

  • It must fall within the range Rc of blocks starting with c, that is, RcP is a subrange of Rc.
  • FL must map all blocks in RcP into blocks in RP, because every block starting with cP must left-shift to a block that starts with P.

My approach is to start with the widest possible range Rc, and narrow it down to those entries that FL maps into RP. Because of the sorting, you know that entries prefixed by cP form a contiguous range. Since FL is order-preserving on Rc, you can find RcP as follows:

  • Scan Rc from the start until you find the lowest position i that FL maps into RP.
  • Scan backwards from the end to find the highest position j that FL maps into RP (in practice, you use binary search, but the idea is the same).

The resulting [i,j] range is RcP, the range of blocks prefixed by cP. Figures 3(a), 3(b), 3(c) and 3(d) show this narrowing-down process; the refine method implements this algorithm in bwtindex.cc.

The result at each step in this process is a start/end position for a range of blocks prefixed by the pattern matched so far. The difference between these positions is the number of matches in the text, and starting from any position in this range, you can decode characters in and following each match.
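The whole backward search can be sketched as follows (a Python sketch with my own names, using bisect for the binary search the article mentions; this stands in for the refine method of bwtindex.cc rather than reproducing it):

```python
from bisect import bisect_left

def count_matches(s, pattern):
    M = sorted(s[i:] + s[:i] for i in range(len(s)))
    rank = {row: i for i, row in enumerate(M)}
    FL = [rank[row[1:] + row[0]] for row in M]
    C, pos = {}, 0
    for ch in sorted(set(s)):      # C: start of each character's run in F
        C[ch] = pos
        pos += s.count(ch)
    # match the last character first: Rc = [C[c], C[c] + count(c))
    c = pattern[-1]
    if c not in C:
        return 0
    lo, hi = C[c], C[c] + s.count(c)
    # left-extend one character at a time: narrow Rc to the blocks whose
    # FL image falls inside the current range (binary search is valid
    # because FL is ascending within each character's section)
    for c in reversed(pattern[:-1]):
        if c not in C:
            return 0
        a, b = C[c], C[c] + s.count(c)
        lo = bisect_left(FL, lo, a, b)
        hi = bisect_left(FL, hi, a, b)
        if lo >= hi:
            return 0
    return hi - lo

print(count_matches("banana$", "ana"))   # -> 2
```

The running time per pattern character is one binary search, independent of the text length, which is what makes the incremental left-extension so cheap.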

The Location Problem

There is one valuable piece of information you haven't found: the exact offset of each match within the original text. I call this the "location problem," because there is virtually no information in a sorted block to tell you how far you are from the start or end of the text, unless you decode and count all the characters in between.

There are a number of solutions to the location problem that I won't address here except to say that all of them require extra information beyond FL and C, or any BWT representation. The simple but bulky solution is just to save the offset of each block in an array of n integers, reducing the problem to a simple lookup, but adding immensely to the space requirement. The problem is how to get the equivalent information into less space.

Some approaches rely on marking an explicit offset milepost at only a few chosen blocks, so you quickly encounter a milepost while decoding forward from any block. Others use the text itself as a key, to index locations by unique substrings. Another possibility lets you jump ahead many characters at a time from certain blocks, so as to reach the end of the text more quickly while counting forward. The variety of possible solutions makes it impossible to cover them here.
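As one illustration of the milepost idea (a toy sketch of my own, not from the article's code): mark every k-th text offset, then walk FL forward from an unmarked block until a milepost is hit, and subtract the number of steps taken.

```python
SAMPLE = 3   # mark every 3rd text offset (an unrealistically small rate)

def build_locator(s):
    n = len(s)
    rots = sorted(range(n), key=lambda i: s[i:] + s[:i])  # offsets, sorted
    rank = {off: i for i, off in enumerate(rots)}
    # following FL moves from the block at offset p to the one at p + 1
    FL = [rank[(rots[i] + 1) % n] for i in range(n)]
    # mileposts: record the text offset of every SAMPLE-th position
    marks = {rank[p]: p for p in range(0, n, SAMPLE)}
    return FL, marks

def locate(FL, marks, i, n):
    # walk one character right at a time until a milepost, then back off
    k = 0
    while i not in marks:
        i = FL[i]
        k += 1
    return (marks[i] - k) % n
```

At most SAMPLE steps are needed per lookup, trading decode time against milepost storage.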

A Word About Compression

Recall that I promised a full-text index that consumes only a few bits per character, but so far you've only seen a structure taking at least one int per character—hardly an improvement. However, the integers in FL have a distribution that makes them highly compressible. You already know FL contains long sections of integers in ascending order. Another useful fact is that consecutive entries often differ by only one; in normal text, as many as 70 percent of these differences are one, with the distribution falling off rapidly with magnitude. My own experiments using simple differential and gamma coding have shrunk FL to fewer than 4 bits per character, and more sophisticated methods (see "Second Step Algorithms in the Burrows-Wheeler Compression Algorithm," by Sebastian Deorowicz; Software: Practice and Experience, Volume 32, Issue 2, 2002) have shrunk FL to even more competitive levels.
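As a toy illustration of differential plus Elias gamma coding (one possible scheme; the article's actual coder is not shown here):

```python
def gamma(x):
    # Elias gamma code: (bit-length - 1) zeros, then x in binary
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def encode_section(values):
    # store the first value + 1, then the positive gaps between neighbors
    bits = gamma(values[0] + 1)
    for prev, cur in zip(values, values[1:]):
        bits += gamma(cur - prev)
    return bits

# gaps of 1 cost a single bit each, which is why the near-sorted
# ascending sections of FL compress so well
print(encode_section([3, 4, 5, 9, 10]))   # -> 0010011001001
```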

The practical problem with compression is that elements of FL then vary in size, so finding an element FL[i] requires scanning from the beginning of the packed array. To eliminate most of the scanning, you need to use a separate bucket structure, which records the value and position of the first element of each bucket. To find FL[i], you scan forward from the beginning of the closest bucket preceding i, adding the encoded differences from that point until position i is reached. The process is laborious, but does not affect the higher level search and decoding algorithms.
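The bucket scheme can be sketched like this (a simplified model of my own: I assume a single ascending run of gap-encoded values, where real FL data has one ascending run per character):

```python
def gamma(x):
    # Elias gamma code: (bit-length - 1) zeros, then x in binary
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits, pos):
    z = 0
    while bits[pos] == "0":
        z += 1
        pos += 1
    return int(bits[pos:pos + z + 1], 2), pos + z + 1

BUCKET = 4   # elements per bucket (tiny, for illustration)

def compress(values):
    bits, buckets = "", []
    for k, v in enumerate(values):
        if k % BUCKET == 0:
            buckets.append((len(bits), v))    # checkpoint: bit offset + value
        else:
            bits += gamma(v - values[k - 1])  # positive gap to the neighbor
    return bits, buckets

def lookup(bits, buckets, i):
    # decode forward from the nearest preceding checkpoint
    pos, val = buckets[i // BUCKET]
    for _ in range(i % BUCKET):
        gap, pos = gamma_decode(bits, pos)
        val += gap
    return val

vals = [0, 1, 2, 4, 5, 6, 9, 10, 13, 14]
bits, buckets = compress(vals)
print(lookup(bits, buckets, 7))   # -> 10
```

At most BUCKET - 1 gaps are decoded per access, so the scan is bounded regardless of how large FL grows.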


