Efficient in-memory extensible inverted file

Abstract

The growing amount of on-line data demands efficient parallel and distributed indexing mechanisms to manage large resource requirements and unpredictable system failures. Parallel and distributed indices built using commodity hardware like personal computers (PCs) can substantially reduce cost because PCs are produced in bulk, achieving economies of scale. However, PCs have a limited amount of random access memory (RAM), and the effective utilization of RAM for in-memory inversion is crucial. This paper presents an analytical investigation and an empirical evaluation of storage-efficient in-memory extensible inverted files, which are represented by fixed- or variable-sized linked list nodes. The size of these linked list nodes is determined by minimizing the storage waste or maximizing the storage utilization under different conditions, which lead to different storage allocation schemes. Minimizing storage waste also reduces the number of address indirections (i.e., chaining). We evaluated our storage allocation schemes using a number of reference collections. We found that the arrival rate scheme is the best in terms of both storage utilization and the mean number of chainings per term. The final storage utilization can be over 90% in our evaluation if a sufficient number of documents is indexed, and the mean number of chainings is small (less than 2.6 for all the reference collections). We have also shown that our best storage allocation scheme can be used for our extensible compressed inverted file, whose final storage utilization can likewise be over 90% provided that a sufficient number of documents is indexed. The proposed storage allocation schemes can also be used by compressed extensible inverted files with word positions.

Keywords: Information retrieval; Indexing; Optimization



1. Introduction

As more and more data are made available on-line, it becomes increasingly difficult to manage a single inverted file. This difficulty arises from the substantial resource requirement for large-scale indexing and from the long indexing time, which makes the system vulnerable to unpredictable system failures. For example, the very large collection (VLC) from TREC [1] requires 100 Gb of storage and the TREC terabyte track requires 426 Gb [2]. The WebBase repository [3] requires 220 Gb, estimated to be only 4% of the indexable web pages. The volume of well-written, non-English content is also increasing. In the near future, Japanese patent data from NTCIR [4] may be as large as 160 Gb. One way to manage such large quantities of data is to create the index by merging smaller indices, which are built using multiple machines indexing different document subsets in parallel [5]. This limits the impact of system failures to individual machines and increases indexing speed.

Acquiring the computing machines in bulk using commodity hardware substantially reduces monetary costs. Also, commodity hardware, like personal computers (PCs), makes in-memory inversion an attractive proposition because random access memory (RAM) for the PC market is relatively cheap and fast, and because RAM has the potential to be upgraded later at lower prices (e.g. DDR-300 RAM to DDR-400 RAM). However, PCs can only hold a relatively small amount of RAM (e.g., 4 Gb) compared with mainframe computers. Efficient RAM utilization becomes an important issue for in-memory inversion using a large number of PCs because, given the large volume of data on-line, the entire inverted index typically cannot be stored in RAM. Instead, the inverted file is typically built in relatively small batches [6] and [7]. For each batch, a partial index is built and held in RAM, which is written out as a run on disk. The runs are then merged into the final inverted file.

Efficient RAM utilization can reduce indexing time by reducing the number of inverted files to merge, because more documents can be indexed per run. During updates, temporary indices are maintained in memory and then integrated into the main inverted file in batches. Lester and Zobel [8] showed that the amortized time cost to integrate the temporary index with the main inverted file is reduced for different inverted file maintenance methods (i.e., re-build, re-merge and in-place methods) when more documents are indexed. Therefore, during both initial construction and update, making better use of memory resources can reduce overall costs. Since indexing is memory-intensive whereas loading/flushing data is disk- or network-intensive, efficient in-memory inversion also has the potential to better balance system resource utilization, making it crucial in index construction.

The major contribution of this paper is to enhance existing, simple-to-implement, single-pass in-memory inversion so that it is storage-efficient when creating partial inverted files and/or a temporary index, by developing novel storage allocation schemes that predict the needed storage with minimal storage waste. The partial index created by our in-memory inversion can be merged with the main inverted file or with other partial inverted files. The temporary index created by our in-memory inversion can also be searched while it is being built. This reduces the latency before recently indexed documents become available for searching, which is important for certain applications (e.g. searching recently available news articles).

An evaluation was carried out to determine which of our storage allocation schemes is the best and whether the results are comparable to existing methods (Section 5). The evaluation was carried out using 3.5 Gb of test data from the VLC. The best allocation scheme was the arrival rate scheme, which achieved 95% final storage utilization for this VLC dataset. To ascertain the generality of the results, various independent datasets for both English (TREC-2, TREC-6 and TREC-2005) and Chinese (NTCIR-5) were also used to evaluate the best storage allocation scheme. We also showed that the indexing speed of our best storage allocation scheme is similar to the indexing speeds reported by others [6] and [9].

The rest of this paper is organized as follows. Section 2 discusses our extensible inverted file structures, the modifications of our extensible inverted file to incorporate compressed postings and word positions, and the related storage wastes. This section also provides the rationale behind the choice of the data structure for our storage allocation schemes and the rationale behind the need to optimize their storage wastes. Section 3 describes the first approach, which determines optimal node sizes using a stepwise optimization strategy and results in three related storage allocation schemes. Section 4 discusses the second approach, which determines the optimal node size that minimizes the asymptotic worst-case storage waste per period for individual terms. Section 5 evaluates these storage allocation schemes and discusses the best scheme in terms of storage utilization, the mean number of chainings, robustness in performance and indexing speed. This section also shows that our storage allocation schemes can be used for allocating nodes to store compressed postings, using the best storage allocation scheme and variable byte compression as an example. Section 6 discusses the related work and describes how the storage allocation schemes can predict node sizes for our extended inverted file that incorporates compressed postings and word positions. Section 7 concludes.

2. Extensible inverted file and storage wastes

This section describes the structure of the extensible inverted file and the related considerations for handling compressed postings and word position information. This section also discusses the storage wastes of the extensible inverted file and the rationale to optimize them, as well as the rationale for using the variable-size linked list data structure.

2.1. Extensible inverted file

An in-memory inverted file can be considered as a set of inverted lists, which are implemented as linked lists. Each linked list node holds a set of (basic) postings of the form ⟨di, tf(di, tk)⟩, where each basic posting consists of a document identifier di and the within-document term frequency tf(di, tk) of the kth term in the ith document. The rest of this paper assumes that, unless otherwise indicated, all postings are basic postings. If a linked list node can hold a variable number of postings, then two additional fields of information other than postings are stored in each node, namely a node size variable and an extension pointer. The node size variable specifies the amount of storage allocated for the current node and the extension pointer facilitates the chaining of nodes.

Fig. 1 shows the conceptual structure of the extensible inverted file, implemented using a set of variable-size linked list nodes. The dictionary data structure holds the set of index terms and the start address of the corresponding variable-size linked list. A new node is allocated whenever the linked list of the kth index term tk is full (e.g., the linked list of the index term “Network” in Fig. 1) and a new posting for tk arrives. The size of the new node is determined using one of the storage allocation schemes discussed in the next two sections. If the linked list nodes hold a fixed number of postings per node, then the node size variable can be discarded, saving storage space.



Fig. 1. The conceptual structure of our extensible inverted file, represented as variable-sized nodes. The start and last pointers point to, respectively, the first and last linked list nodes of the inverted list.

Each dictionary entry for an index term has a start pointer and a last pointer, which point, respectively, to the beginning of the inverted list and to the last linked list node of the inverted list. The last pointer reduces the traversal of the linked lists when a new posting for the index term is inserted. During insertion, the last linked list node needs to be exclusively locked to maintain the integrity of the data for concurrent access [10]. To reduce memory usage, the start pointers can be stored in a file, since start pointers are used only for retrieval and not for inserting new postings. For clarity of presentation, additional information that each dictionary entry may contain (e.g. document frequency) is not shown in Fig. 1. In particular, each dictionary entry should hold a variable, say mpos, which indicates the position of the unfilled portion of the last node, to improve the posting insertion speed.
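To make this structure concrete, the following Python sketch (with illustrative names that are not from the original system) shows one way the dictionary entries, the variable-size nodes and the posting insertion could be organized; the allocation scheme is left as a pluggable function.

class Node:
    def __init__(self, size):
        self.size = size        # node size variable: capacity in postings
        self.postings = []      # filled portion of the node
        self.ext = None         # extension pointer to the next chained node

class DictEntry:
    def __init__(self):
        self.start = None       # start pointer: first node of the inverted list
        self.last = None        # last pointer: node currently being filled
        self.mpos = 0           # position of the unfilled portion of the last node
        self.df = 0             # document frequency, used by the allocation schemes

class ExtensibleInvertedFile:
    def __init__(self, alloc_scheme):
        self.dictionary = {}                  # term -> DictEntry
        self.alloc_scheme = alloc_scheme      # returns a node size in postings

    def add_posting(self, term, doc_id, tf):
        entry = self.dictionary.setdefault(term, DictEntry())
        entry.df += 1
        if entry.last is None or entry.mpos == entry.last.size:
            node = Node(self.alloc_scheme(term, entry))   # last node full: chain a new one
            if entry.last is None:
                entry.start = node
            else:
                entry.last.ext = node
            entry.last, entry.mpos = node, 0
        entry.last.postings.append((doc_id, tf))
        entry.mpos += 1

With this interface, the F16 scheme is simply alloc_scheme = lambda term, entry: 16, while the VGR, TGR, AR and AAR schemes of Sections 3 and 4 compute the size from collection statistics.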

The extensible inverted file can support storing a special type of posting for block addressing inverted files [9] and [11] that index a fixed-size text block instead of variable-size documents. This special type of posting, called a block-address posting in this paper, has only the di field, without the term frequency tf(di, tk) field of the basic posting, where di is the block identifier instead of the document identifier and tk is the kth term. Our storage waste optimization discussed in Sections 3 and 4 can minimize the storage wastes of the nodes that store basic postings or block-address postings, because the storage of either kind of posting is a constant (i.e., c1) in our storage waste optimization.

The extensible inverted file can support storage of compressed postings [12], as well as word positions. For compressed postings (e.g., γ [13] or variable byte compression [14]), each dictionary entry keeps track of the bit position (again using mpos) or the byte position of the unfilled portion of the last node. The new compressed posting is inserted at mpos as a bit/byte string. If the last node does not have enough memory for a new compressed posting, then the new compressed posting is split between the last node and the newly allocated node.
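As an illustration of this splitting behaviour, the sketch below uses standard variable-byte coding (7 data bits per byte, with a terminator bit on the last byte) and writes the encoded posting into the unfilled portion of the last node, spilling any remainder into a freshly allocated node; the byte-array node layout is an assumption made only for this example.

def vbyte_encode(n):
    """Standard variable-byte code: 7 data bits per byte, high bit set on the last byte."""
    out = []
    while True:
        out.insert(0, n & 0x7F)
        if n < 128:
            break
        n >>= 7
    out[-1] |= 0x80
    return bytes(out)

def append_compressed(last_node, mpos, encoded, allocate_node):
    """Write `encoded` starting at byte offset `mpos` of `last_node`,
    spilling any remainder into a new node obtained from `allocate_node()`."""
    free = len(last_node) - mpos
    last_node[mpos:mpos + min(free, len(encoded))] = encoded[:free]
    if len(encoded) > free:                      # posting does not fit: split it
        new_node = allocate_node()
        new_node[0:len(encoded) - free] = encoded[free:]
        return new_node, len(encoded) - free
    return last_node, mpos + len(encoded)

# Example: a 64-byte node that is almost full.
node = bytearray(64)
node, mpos = append_compressed(node, 62, vbyte_encode(300000), lambda: bytearray(64))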

There are two general approaches to storing postings with word positions. One approach stores the word positions in the nodes, and one way (as in [6]) to do this is to store the posting followed by a sequence of word positions, i.e. ⟨di, tf(di, tk)⟩.⟨pos(di, tk, 1), …, pos(di, tk, tf(di, tk))⟩, where di is the identifier of the ith document, tf(di, tk) is the within-document frequency of the kth term in the ith document, and pos(di, tk, x) is the xth word position of the kth term in the ith document. In this case, the node size includes the storage to hold the word positions as well as the postings. Another approach stores extended postings of the form ⟨di, tf(di, tk), f(di, tk)⟩, where f(di, tk) is the file position in an auxiliary file that stores the sequence of word positions of the kth term in the ith document. In this approach, the word positions are stored in the auxiliary file. Whenever a new extended posting is added, the last position of the auxiliary file is stored as f(di, tk) of this new extended posting, and the word positions of the term associated with the new extended posting are appended sequentially to the auxiliary file. These word positions in the auxiliary file can be compressed, for example, using differential coding compression [9], [12], [13], [14], [15] and [16]. If the within-document term frequency is one (i.e. tf(di, tk) = 1), then f(di, tk) of the extended posting can directly store the single word position of the kth term in the ith document, saving both storage and access time. For both approaches that store word positions, the storage allocation schemes can be modified to determine the node sizes, as discussed in Section 5.
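A minimal sketch of the second (auxiliary-file) approach follows, with an on-disk layout that is our assumption rather than the paper's: the word positions are appended to the end of the auxiliary file and the starting offset is recorded in the extended posting, except when tf = 1, in which case the single position is stored in place of the offset.

def add_extended_posting(positions_file, doc_id, positions):
    # Return an extended posting <d_i, tf, f> for one term's occurrences in one document.
    tf = len(positions)
    if tf == 1:
        return (doc_id, 1, positions[0])    # store the single word position directly
    offset = positions_file.tell()          # f(d_i, t_k): current end of the auxiliary file
    for p in positions:                     # positions could also be delta-coded here
        positions_file.write(p.to_bytes(4, "little"))
    return (doc_id, tf, offset)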

2.2. Rationale for variable-size linked lists

The variable-size linked-list data structure is chosen here because it is used in many in-memory inverted files (where the nodes are sometimes called buckets or fixed list blocks) and because it is relatively simple to implement and to analyze. Instead of linked lists, postings can be stored in RAM blocks that expand to hold more postings as they are inserted. This type of RAM block expansion may involve copying and moving data chunks. Our work can be considered as extending the RAM block approach: the block is pre-allocated with storage to hold the expected amount of postings so that it is not necessary to copy or move data chunks in the RAM blocks. This pre-allocation avoids memory fragmentation, and the difficulty is shifted to predicting the expected number of postings for each term instead of relying on advanced storage allocators. If a fast storage allocator is used so that the allocation time is amortized to a constant, then the storage utilization may be sacrificed. Instead of using advanced storage allocators, dynamic data structures like hash tables, skip lists and balanced trees can be used that support deletion of postings as well as insertion of postings. However, the storage utilizations of these dynamic data structures are typically low (i.e., no more than 60% if each node contains at least one 4-byte pointer and one 6-byte posting). It is possible to store multiple postings per node in these data structures, but in this case the problem of optimizing the storage waste per node re-appears, whether one is dealing with a dynamic data structure (e.g., balanced trees) or a variable-size linked list. Therefore, we propose to use variable-size linked lists in this study because they are simple to program, use simple and fast storage allocators, are commonly used by in-memory inverted files, can easily be adapted to store compressed postings and can be optimized for storage waste in the same way as other dynamic data structures (e.g. balanced trees).

Our choice of using linked lists to store (compressed) postings implies that our extensible inverted files are designed largely for append-only operations, where direct deletions and direct modifications can be avoided. Deletions can be done indirectly by filtering document identifiers that are known to be invalid (or deleted) using a registry of stale documents [8] and [17], because search engines can trade off data consistency against availability [17] according to the CAP theorem [18] and [19]. Few deletions or modifications are expected during in-memory inversion because the incoming documents are relatively fresh. On the other hand, deletions or modifications may occur more often after the inverted file has been transferred to disk than during in-memory inversion. Since disk storage cost per byte is much cheaper than RAM, deletion by filtering document identifiers is a practical solution for large-scale search engines. Similar to deletions, modifications can be implemented effectively as a deletion operation (implemented as filtering) followed by a document insertion. When the registry of stale document identifiers becomes large or when the temporary index is full, the main inverted file on disk can be maintained by re-building, re-merging or in-place updating approaches [8]. Therefore, the choice of an append-only data structure, like the variable-size linked lists, may not be a severe handicap.

The use of the variable-size linked list representation of inverted lists requires some consideration of how to merge partial inverted files. The first approach saves the in-memory inverted lists, which are represented as linked lists, as contiguous lists on disk. This requires the system to follow the extension pointers of the linked list nodes when transferring the in-memory inverted file to disk. Tracing all the extension pointers incurs some additional time due to cache misses. However, most of the time cost is due to transferring data to disk, provided that the mean number of chainings per term is not large. Once the in-memory inverted file is transferred to disk as a set of contiguous lists, the conventional inverted file-merging algorithm can be used to merge these partial inverted files on disk. An alternative approach dumps the in-memory inverted file onto disk as it is. During the first level of partial inverted file merging, the merging algorithm combines two inverted lists on disk, represented by two sets of linked lists, into a single contiguous inverted list on disk. The merged partial inverted file then has a set of contiguous inverted lists on disk, and it can subsequently be merged with other merged partial inverted files using the conventional merging algorithm. However, following extension pointers on disk requires file seek operations that incur more time than a cache miss. Therefore, we prefer the first approach, because the time cost of a cache miss is less than that of a file seek and because this approach is applicable to the inverted file maintenance methods mentioned by Lester and Zobel [8] (i.e., re-build, re-merge and in-place updates) using an in-memory temporary index (called the document buffer in [8]).
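The first approach can be sketched as follows, reusing the classes from the sketch in Section 2.1 (the on-disk record layout is illustrative, not the one used in the actual system): each term's chained nodes are traversed once and the postings are written out as one contiguous list, after which conventional run merging applies.

def flush_to_disk(inverted_file, out):
    # Write each in-memory inverted list as a contiguous list in the binary file `out`.
    for term in sorted(inverted_file.dictionary):
        entry = inverted_file.dictionary[term]
        postings, node = [], entry.start
        while node is not None:              # follow the extension pointers once per term
            postings.extend(node.postings)
            node = node.ext
        out.write(term.encode("utf-8") + b"\x00")
        out.write(len(postings).to_bytes(4, "little"))
        for doc_id, tf in postings:
            out.write(doc_id.to_bytes(4, "little") + tf.to_bytes(2, "little"))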

2.3. Rationale for storage waste optimization

The success of representing inverted lists by linked lists rests on the ability to accurately predict the required storage so that the RAM storage utilization is maximized. If the final storage utilization were low (say 60%), other data structures that can support deletion should be used instead. The storage utilization of the extensible inverted file is the ratio of the total storage P of all (compressed) postings to the total storage (i.e., P+S, where S is the total storage waste). Maximization of the storage utilization U can be considered as the minimization of the storage waste of the extensible inverted file as follows:


max U = max P/(P+S) ≡ min S

since P is treated as a constant which is fixed for a particular data collection.

Storage wastes in the extensible inverted file can be divided into two types:

(a) The first type of storage waste is the storage overhead ε that includes the storage for extension pointers and for the node size variables; and

(b) The second type of storage waste is the latent free storage, which has been allocated but is not yet filled with posting information. If this latent free storage were not counted as storage waste, then the optimal linked-list node size would be as large as possible, so that the overhead appears minimal relative to the total storage allocated to that node.

The storage waste of each node in the extensible inverted file is the sum of these two types of storage wastes of the node.

There are many advantages to optimizing storage waste. First, it maximizes the storage utilization, which can reduce the number of inverted file merge operations and the amortized time cost of inverted file maintenance [8]. Second, it can indirectly reduce the number of chainings per term. This can reduce (a) the time to search the temporary index when it is built on the fly, (b) the time to write the in-memory inverted lists to disk as contiguous (unchained) lists when the partial inverted file is transferred to disk, and (c) the number of file seeks when merging two partial inverted files on disk if these inverted files represent inverted lists as linked lists. Third, the analysis of optimizing the storage wastes can be applied not just to linked lists but to other dynamic data structures (e.g. balanced trees) where each node holds more than one (compressed) posting. In this case, the optimization analysis of these dynamic data structures treats the storage waste of each node of these dynamic structures as a constant ε with a different value.

3. Stepwise storage allocation approach

The stepwise storage allocation approach determines the optimal node size for the incoming documents based on statistics of the current set of indexed documents. This approach optimizes the expected worst-case storage waste E(W(S(N))) after N documents are indexed as follows:


E(W(S(N))) = (1/N) Σn=1..N W(S(n))    (1)

where E(.) is the expectation operator, W(.) returns the worst-case function of its argument, and S(n) is the storage waste after indexing n documents. The reason for optimizing the expected storage waste is to minimize the area under the curve of storage waste against the number of documents indexed, so that the storage waste is kept to a minimum for the different numbers of documents indexed (up to N documents). This approach assumes that the optimal node size for N documents is close to the optimal node size for N+ΔN documents, where ΔN is a small quantity compared with N; this assumption holds when N is sufficiently large. Also, this approach assumes that the measured system parameters for determining the optimal node size are smooth, without large discontinuities. Otherwise, parameters (e.g. the size of the vocabulary) obtained after indexing N documents may vary substantially, leading to drastic changes in the optimal node size and implying that we cannot predict the optimal node size based on past statistics.

This approach has three related storage allocation schemes. The first storage allocation scheme determines the optimal node size after indexing N documents, which is the same as the optimal node size for a static collection of N documents. This scheme is called the fixed-sized node scheme (F16). The next storage allocation scheme is called the vocabulary growth rate (VGR) scheme. It extends the formula of the F16 scheme by determining the optimal node size based on the parameter values extracted at the time when a new node is allocated. The assumption is that this optimal node size remains more or less the same between the time that the node is allocated and the time that the node is filled (i.e., the system behavior should be smooth). Unfortunately, the VGR scheme allocates the same optimal node size at a given time instance for common terms and non-common terms, which are known to have widely different numbers of postings and different desirable node sizes. Thus, the final storage allocation scheme, called the term growth rate (TGR) scheme, determines the optimal node size for individual terms. The first two schemes optimize the expected worst-case storage waste E(W(S(N))) for all terms (as in Eq. (1)). The TGR scheme optimizes the expected worst-case storage waste E(W(S(N, tk))) for the kth term after indexing N documents, where S(n, tk) is the storage waste of the kth term after indexing n documents. Similar to Eq. (1), the quantity E(W(S(N, tk))) is defined as follows:


E(W(S(N, tk))) = (1/N) Σn=1..N W(S(n, tk))    (2)

3.1. Fixed-sized node scheme (F16)

The fixed-sized node storage allocation scheme allocates storage to hold B postings for each new node. The overhead εp of a node allocated by this scheme is the storage for the extension pointer. Assuming that each posting occupies c1 bytes, the node requires c1×B+εp bytes. The storage waste S(n, tk) for term tk after indexing n documents is the latent free storage of the last node plus the storage overhead of all the chained nodes. The latent free storage of the last node is c1(⌈df(n, tk)/B⌉B − df(n, tk)), where df(n, tk) is the number of documents that contain the kth term and ⌈.⌉ is the ceiling function. The storage overhead of all the chained nodes (including the last unfilled node) is due to the extension pointers and amounts to εp⌈df(n, tk)/B⌉. The storage waste S(n, tk) for term tk is


S(n, tk) = c1(⌈df(n, tk)/B⌉B − df(n, tk)) + εp⌈df(n, tk)/B⌉

The relative frequency estimate of the probability p(tk) that the kth term appears in a document is df(n, tk)/n. Hence, df(n, tk) = p(tk)n. The above storage waste for n indexed documents can be rewritten as

S(n) = Σk=1..D(n) [c1(⌈p(tk)n/B⌉B − p(tk)n) + εp⌈p(tk)n/B⌉]

where D(n) is the number of unique terms after indexing n documents. Since it is hard to optimize the closed form of S(n), which contains discontinuous functions (e.g., the ceiling function), an upper bound and a lower bound of S(n) are considered as follows. An upper bound W(S(n)) of S(n) is the storage overhead due to the extension pointers of all the chained nodes plus the latent free space, where the last node is assumed to have no postings. Hence

W(S(n)) = Σk=1..D(n) [εp(p(tk)n/B + 1) + c1B]    (3)

A lower bound of S(n) is the total storage of the extension pointers; this bound assumes that there is no latent free space in the last node. Therefore, the lower bound of S(n) is

Σk=1..D(n) εp⌈p(tk)n/B⌉

The above two bounds differ only by the amount of latent free space and the storage of one extension pointer. As the number of indexed documents increases, these two bounds converge to S(n), since the latent free space becomes small compared with the storage overhead of the set of chained filled nodes. Thus, the optimal node sizes derived from the upper and lower bounds are valid approximations for large collections.

By disregarding the storage waste due to the latent-free space, the lower bound of the storage waste can be approximated as


εpnψ(n)/B

where ψ(n) = Σk=1..D(n) p(tk) is called the total nominal (term) arrival rate after indexing n documents. This lower bound is not useful for the purpose of finding the optimal node size because it has no optimal value for finite values of B, and because it discounts the latent free storage, encouraging the undesirable allocation of larger than necessary node sizes. Alternatively, optimizing the upper bound W(S(n)) (as in Eq. (3)) of S(n) limits the storage waste after indexing n documents, as well as accounting for the latent free storage. Based on Eq. (3), the worst-case (upper bound) storage waste after indexing n documents is W(S(n)) = εpnψ(n)/B + (εp + c1B)D(n). Substituting W(S(n)) into Eq. (1), the expected worst-case storage waste after indexing N documents is E(W(S(N))) = (1/N) Σn=1..N [εpnψ(n)/B + (εp + c1B)D(n)].

The optimal value of B that leads to the global optimal expected worst-case storage waste can be found by differentiating E(W(S(N))) with respect to B, i.e., dE(W(S(N)))/dB = −εpcd/B² + c1cp, where cd = (1/N)Σn=1..N nψ(n) and cp = (1/N)Σn=1..N D(n) are constants. Since the second derivative of E(W(S(N))) is always positive for positive εp, N and B, the global turning point must be a minimum. The optimal worst-case expected storage waste E(W(S(N))) occurs when the first derivative of E(W(S(N))) is zero, which yields the optimal node size Bopt,F(N) = √(εpcd/(c1cp)) for a static collection of N documents. Since the ratio of cd/cp to Nψ(N)/D(N) is close to one, we take Bopt,F(N) ≈ √(εpNψ(N)/(c1D(N))) in this paper. We define a quantity, called α(N), as D(N)/N, which is a measure of the vocabulary growth rate. Using this quantity, Bopt,F(N) is simplified to:


Bopt,F(N) = √(εpψ(N)/(c1α(N)))    (4)

However, the vocabulary size and the number of documents typically follow Heaps' law [20], where D(N) = πN^μ and π and μ are constants; therefore, (4) is re-expressed as follows:

Bopt,F(N) = √(εpψ(N)N^(1−μ)/(c1π))    (5)

This shows that, when the vocabulary size follows Heaps' law, there is no optimum fixed-sized node for collections that grow indefinitely. However, the optimal node sizes determined by Eqs. (4) and (5) grow slowly as N increases, which is consistent with the assumption that the optimal node size after indexing N documents is similar to the optimal node size after indexing N+ΔN documents. For simplicity, Eq. (4) is used to derive the optimal node size for the next storage allocation scheme for dynamic collections, and we check whether Eq. (4) is a reasonable approximation by comparing its predicted value with the experimentally determined optimal node size in Section 5.
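As a rough numerical illustration of Eq. (4) as reconstructed above, the sketch below uses the storage constants given later in Section 5.1 (εp = 4 bytes per extension pointer, c1 = 6 bytes per posting) together with purely hypothetical values of ψ(N) and α(N).

import math

def b_opt_fixed(eps_p, c1, psi, alpha):
    # Eq. (4) as reconstructed above: B_opt,F(N) = sqrt(eps_p * psi(N) / (c1 * alpha(N)))
    return math.sqrt(eps_p * psi / (c1 * alpha))

# psi = 230 and alpha = 0.6 are made-up values chosen only to show the order of
# magnitude; the paper's VLC data yields an optimum of about 15.9 postings per node.
print(b_opt_fixed(eps_p=4, c1=6, psi=230.0, alpha=0.6))   # ~16 postings per node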

3.2. Vocabulary growth rate scheme (VGR)

The previous (F16) scheme is not suitable for dynamic collections because (a) the vocabulary size is not constant (Fig. 2) and (b) different document sets and different languages have different vocabulary growth rates α. Therefore, this (VGR) scheme extends the F16 scheme to dynamic collections by assuming that the optimal node size determined after indexing N documents remains similar in the near future (i.e., when the new node is allocated), i.e. E(W(S(N))) ≈ E(W(S(N+ΔN))), where ΔN is a small quantity compared with N. Effectively, the parameters for determining the node size have to be smooth over time, which is likely to be true when the number of indexed documents is large. There are at least two choices of parameters, based on Eqs. (4) and (5). The two choices differ by whether to estimate α(N) after indexing N documents or to estimate π and μ after indexing N documents. We decided to estimate α(N) to obtain the optimal node size because it is simpler than estimating two parameters (i.e., π and μ).



Fig. 2. The vocabulary growth rate, the estimated total nominal arrival rate ψ(n), the estimated vocabulary growth rate α̂(n) obtained by averaging a fixed amount of running past history (i.e., averaging the past 90 backward-differenced values), and the storage allocation based on Eq. (5).

Here, α(N) is estimated in a piece-wise linear manner using differencing operations (i.e., α̂(N) = D(N) − D(N−1)) at different time points. Hence, Bopt,F(N) in Eq. (4) becomes Bopt,V(N), whose constant storage overhead is ε (including the node size variable) instead of εp, i.e.


Bopt,V(N) = √(εψ(N)/(c1α̂(N)))    (6)

In practice, ψ(N) is approximated by summing the estimated arrival rates of all terms after indexing N documents (i.e., ψ(N) ≈ Σk=1..D(N) df(N, tk)/N, where df(N, tk) is the number of documents containing tk after indexing N documents). Fig. 2 shows the vocabulary growth rate estimated by backward differencing (gray) the number of distinct terms after indexing N and N−1 documents, using the TREC VLC data [1]. The estimated value of α is not smooth, violating the assumption made for the VGR scheme. To reduce discontinuities, α̂(N) is smoothed by taking the running average of a fixed number (i.e., 90 in this case) of past backward differences. In Fig. 2, the smoothed α̂(N) is smoother than the raw backward-difference estimates of α(N), and the (calculated optimal) storage allocation becomes smoother over the number of documents indexed. According to Fig. 2, the total nominal arrival rate ψ(N) is already very smooth, and it is used directly to estimate the optimal node sizes. The remaining parameters (i.e., ε and c1) for Bopt,V(N) are constants.
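A sketch of this estimation procedure follows (the bookkeeping class is illustrative; the window of 90 backward differences follows Fig. 2, and the node-size formula is Eq. (6) as reconstructed above).

from collections import deque
import math

class VGRAllocator:
    # Estimate alpha(N) by a running average of the last 90 backward differences of the
    # vocabulary size D(N), then size new nodes by Eq. (6) as reconstructed above.
    def __init__(self, eps=6, c1=6, window=90):
        self.eps, self.c1 = eps, c1        # eps = extension pointer + node size variable
        self.diffs = deque(maxlen=window)  # past backward differences D(n) - D(n-1)
        self.prev_vocab, self.psi = 0, 0.0

    def observe_document(self, vocab_size, psi):
        self.diffs.append(vocab_size - self.prev_vocab)
        self.prev_vocab, self.psi = vocab_size, psi   # psi: total nominal arrival rate

    def node_size(self):
        if not self.diffs:
            return 1
        alpha_hat = max(sum(self.diffs) / len(self.diffs), 1e-6)
        return max(1, round(math.sqrt(self.eps * self.psi / (self.c1 * alpha_hat))))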

3.3. Term growth rate scheme (TGR)

The previous (VGR) scheme allocates nodes of the same size for both common and non-common terms if the number of indexed documents is the same. This appears to be counter-intuitive since common terms are expected to occur in more documents, resulting in more postings, than non-common terms. Therefore, the storage waste should be optimized for individual terms, and not for the aggregate storage wastes of all index terms (as in VGR).

For the TGR scheme, we optimize the expected worst-case storage waste E(W(S(N, tk))) for term tk after indexing N documents. Similar to E(W(S(N))) in Eq. (1), E(W(S(N, tk))) is defined as in Eq. (2) based on the quantity W(S(n, tk)), which is the worst-case (i.e., upper bound) storage waste for storing the postings of the kth term after indexing n documents. By analogy to W(S(n)) in Eq. (3),


W(S(n, tk)) = ε(q(tk, n) + 1) + c1B(tk)

after indexing n documents where:

ε is the constant storage overhead for a variable-size linked list node, which is the sum of the storage for the extension pointer εp and the node size variable εs (i.e., ε = εp + εs);

q(tk, n) is the number of chained and filled nodes for the term tk after indexing n documents;

B(tk) is the node size for the term tk.

If the number of documents indexed is large, then the size of each node is approximately the same as B(tk). Therefore, q(tk, n) is approximated as df(n, tk)/B(tk) (= np(tk)/B(tk)) and W(S(n, tk)) becomes ε(np(tk)/B(tk) + 1) + c1B(tk). After some calculus manipulation, the optimal node size Bopt,T(N, tk) of tk for this scheme after indexing N documents is approximated as


Bopt,T(N, tk) ≈ √(ε(N+1)p(tk)/(2c1)) ≈ √(εdf(N, tk)/(2c1))    (7)

Since the node size depends on df(N, tk) (i.e., the number of documents that contain term k after indexing N documents), it varies with the number of indexed documents. When df(N, tk) is large, Bopt,T(N, tk) increases slowly, since it grows only as the square root of df(N, tk). This is consistent with our approximation that q(N, tk) ≈ df(N, tk)/B(N, tk) after indexing N documents.
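Based on Eq. (7) as reconstructed above (the constant inside the square root is our reading of the derivation), the TGR node size for a term follows directly from its current document frequency, for example:

import math

def tgr_node_size(df_term, eps=6, c1=6):
    # Eq. (7) as reconstructed above: the node size grows as the square root of df(N, t_k)
    return max(1, round(math.sqrt(eps * df_term / (2 * c1))))

print(tgr_node_size(100))     # about 7 postings for a rarer term
print(tgr_node_size(10000))   # about 71 postings for a common term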

4. Stationary storage allocation approach

The stationary storage allocation approach consists of two related schemes that allocate new nodes periodically under some steady state conditions. The period u is measured in terms of the number of documents and it is a constant for different terms. The two storage allocation schemes of this approach minimize the asymptotic worst-case storage waste W(U(∞, tk)) of the term tk per period:


W(U(∞, tk)) = lim r→∞ W(S(u×r, tk))/r    (8)

where W(S(u×r, tk)) is the worst-case storage waste for the term tk after indexing u×r documents.

The first storage allocation scheme is called the arrival rate (AR) scheme; it allocates storage based on the arrival rate of a term, scaled by the largest node size that can be represented by the node size variable. The second scheme is called the adaptive arrival rate (AAR) scheme; it extends the AR scheme with a running estimate of the arrival rate of each term, so that larger nodes can be allocated quickly in response to any sudden upsurge of term occurrences in documents.

4.1. AR allocation scheme

Given that the system is under steady state and the system has indexed u×r documents, the storage waste is the storage overhead due to the extension pointers of r+1 nodes (including the new node) and the latent free space of the new node (i.e., c1×BO(u×r, tk) where BO(u×r, tk) is the size of the node determined by this scheme after indexing u×r documents). Therefore, the worst-case storage waste W(S(u×r, tk)) for the term tk is


W(S(u×r, tk)) = εp(r+1) + c1((r+1)BO(u×r, tk) − u×r×p(tk) − 1)

where u×r×p(tk) is the number of documents that contain the kth term after indexing u×r documents and the new node stores only the new posting. Substituting the above into Eq. (8), the asymptotic worst-case storage waste per period W(U(∞, tk)) is

W(U(∞, tk)) = lim r→∞ (1/r)[εp(r+1) + c1((r+1)BO(u×r, tk) − u×r×p(tk) − 1)]

Taking the limits inside, W(U(∞, tk)) becomes

W(U(∞, tk)) = εp + c1(BO(∞, tk) − u×p(tk))

Since at least one new node is expected to be allocated for every u documents indexed (otherwise u cannot be the period), W(U(∞, tk)) ≥ εp. Using this inequality and the above, we obtain the optimal asymptotic node size Bopt,O(∞, tk) as follows:

Bopt,O(∞, tk) = u×p(tk)

The period u is the largest number m of postings that can be stored in any node. This number depends on the number of bytes εs allocated to the node size variable (i.e., m = 2^(8εs)). In order to ensure that the node size Bopt,O(∞, tk) is between 1 and m (= u), Bopt,O(∞, tk) is approximated as

Bopt,O(∞, tk) ≈ ⌈m×df(N, tk)/N⌉    (9)

where N is the number of documents indexed, df(N, tk) is the number of documents that contain the term tk after indexing N documents, and df(N, tk)/N is the relative frequency estimate of p(tk).
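A sketch of Eq. (9) as reconstructed above follows; m is determined by the width of the node size variable (65,535 for the two-byte variable of the ARp configuration in Section 5.1), and the values in the example call are hypothetical.

import math

def ar_node_size(df_term, n_docs, m=65535):
    # Eq. (9) as reconstructed above: scale the relative frequency df(N, t_k)/N by m,
    # keeping the node size between 1 and m postings.
    return min(m, max(1, math.ceil(m * df_term / n_docs)))

print(ar_node_size(df_term=1200, n_docs=100000))   # 787 postings for this term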

4.2. AAR allocation scheme

If the arrival rate of individual terms is allowed to vary slowly over time, then the arrival rate can be approximated piece-wise linearly, i.e., df(N, tk)/N ≈ (df(N, tk) − df(N−L, tk))/L for some time lag L. The time lag L can be defined so that there is no additional storage waste, as follows. When the last node is full, the earliest document identifier id1 in the last node is, say, N−L. The difference between id1 and N is the number of documents indexed between the current document and id1. The number of postings Bopt,A(N−L, tk) in the node allocated by this scheme is the number of documents between the current document and id1 that contain tk. Thus, Bopt,A(N−L, tk) = df(N, tk) − df(N−L, tk) and df(N, tk)/N ≈ Bopt,A(N−L, tk)/(N − id1).

One problem with this estimation is that the number of postings in the node is not large for terms with small arrival rates. It is possible for a term to occur in two consecutive documents and never again afterwards; in that case, the storage allocation would have large errors. In view of this, the number of postings in a node should be larger than some constant b (> 1). Substituting the approximation of df(N, tk)/N into Eq. (9), the number of postings to be allocated is


Bopt,A(N, tk) ≈ max(b, ⌈m×(df(N, tk) − df(N−L, tk))/(N − id1)⌉)    (10)
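The AAR variant thus substitutes the locally estimated arrival rate into the AR formula; a sketch based on Eq. (10) as reconstructed above, with b = 2 as used in Section 5.1, follows.

import math

def aar_node_size(df_now, df_at_alloc, n_docs, id1, m=65535, b=2):
    # Eq. (10) as reconstructed above: use the postings accumulated since the last node
    # was allocated, divided by the number of documents indexed since document id1.
    rate = (df_now - df_at_alloc) / max(n_docs - id1, 1)
    return min(m, max(b, math.ceil(m * rate)))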

5. Comparing storage allocation schemes

This section examines which of the storage allocation schemes discussed in the previous two sections is the best in terms of storage utilization and the number of chainings (or address indirections). Afterwards (in Section 5.5), we evaluate whether good performance can be achieved by the best storage allocation scheme for different datasets, in order to ascertain its generality.

5.1. Set up

A subset of the VLC [1] from TREC, spanning NEWS01 to NEWS04, is used for the evaluation in Sections 5.1–5.4. This subset requires 3.5 Gb to store 1.67 million documents. This amount of data is used because the evaluation was carried out by constructing the extensible inverted file in memory, using a SUN server with 4 Gb of RAM. The extension pointer occupies 4 bytes (i.e., εp = 4) and the node size variable occupies 2 bytes (i.e., εs = 2). With the exception of the arrival rate scheme, the node size variable indicates the number of postings that the node can store by default. Each posting requires 6 bytes to store the document identifier and the associated term frequency. The storage allocation schemes discussed in the previous two sections are given acronyms as follows. The (optimal) fixed-sized node scheme is F16, whose node size is determined using Eq. (4). Likewise, the vocabulary growth rate scheme is VGR, using Eq. (6), and the term growth rate scheme is TGR, using Eq. (7). For the arrival rate scheme, which uses Eq. (9) to determine node sizes, two variants are experimented with, in order to examine the effect of the node size variable. One variant, denoted as ARb, uses the node size variable to specify node size in terms of the number of bytes. The other, denoted as ARp, specifies node size in terms of the number of postings. AAR refers to the adaptive arrival rate scheme, which uses Eq. (10) to determine node sizes. The minimum number b of postings to be allocated for the AAR scheme is two.

5.2. Performance measures

Two performance measures are used here. The first is storage utilization, defined as the storage (in bytes) for all postings divided by the total inverted file storage (in bytes), which includes the storage for postings, the overhead storage (i.e., ε) and the latent free storage. This storage utilization is a micro-average measure because it is computed over the index as a whole; therefore, it directly indicates the amount of RAM used by the index. Alternatively, a (macro-average) storage utilization could be measured as the average of the storage utilization over the index terms. This macro-average storage utilization is not preferred because it cannot directly indicate the amount of RAM used by the index, since the term occurrence statistics are highly skewed. It is also important to examine the time development of the storage utilization because (a) we assume that the system behavior does not change abruptly (i.e., it is smooth) and (b) we are optimizing the expected storage waste over all N indexed documents or over regular intervals of u documents.

The second measure is the number of chainings per index term, since it indicates the likelihood of cache misses. If the inverted lists are stored on disk exactly as they are in RAM, the number of chainings per index term indicates the minimum number of logical file seeks per inverted list, which is a major factor in determining retrieval speed. Obviously, the minimum number of chainings is one, because of the chaining from the dictionary to the first node.
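To make the micro/macro distinction above concrete, the small sketch below computes both averages from hypothetical per-term byte counts; the micro-average weighs each term by its storage and therefore reflects actual RAM usage, whereas the macro-average does not.

def utilizations(per_term):
    # per_term: list of (posting_bytes, total_bytes) pairs, one per index term.
    micro = sum(p for p, _ in per_term) / sum(t for _, t in per_term)
    macro = sum(p / t for p, t in per_term) / len(per_term)
    return micro, macro

# One common term with high utilization dominates the micro-average:
print(utilizations([(600000, 620000), (6, 16), (12, 16)]))   # ~0.97 vs ~0.70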

The storage for the parameters of all the storage allocation schemes is not significant compared with the storage of the final extensible inverted file. The VGR scheme holds 90 backward difference values to compute the running average of the vocabulary growth rate, plus a few other values (e.g., ε, c1). For the TGR scheme, no backward difference values are stored. Instead, the dictionary stores the document frequency df(N, tk) after indexing N documents for tk, together with the start and last pointers (not shown in Fig. 1). Alternatively, df(N, tk) can be stored in the storage allocated for the extension pointer of the last linked-list node of tk, thereby obviating the need for any additional storage. Because df(N, tk) is needed only when new nodes are allocated, neither method needs to update df(N, tk) for every posting insertion, which saves processing time. Similar to the TGR scheme, the AR and AAR schemes store df(N, tk) for each index term, plus a parameter value for m. Based on this discussion, the storage for the parameter values of the storage allocation schemes is not significant.

Many inverted file construction methods [21], [22] and [23] are known to index at high speed. However, it is difficult to obtain a valid comparison of indexing speed because the indexing speed depends on many practical factors, like operating system settings and detailed programming optimizations that may not be part of the retrieval system per se. For example, our system can index faster using a faster dictionary lookup data structure (e.g. burst tries [24]). In Section 5.6, we compare the indexing speed of our storage allocation schemes with existing ones [6] and [9].

5.3. Storage utilization

An experiment was carried out to assess whether the optimal fixed node size determined using Eq. (4) is close to the observed result. Since the size is fixed, there is no node size variable (i.e., εs=0) and the overhead is just the storage of the next pointer (i.e., 4 bytes). The node with only one posting has a constant storage utilization rate (i.e., 60%), independent of the number of indexed documents. Fig. 3 shows the storage utilization variation over the number of indexed documents for different fixed-sized nodes. In general, the larger the fixed-node size, the longer it takes for the storage utilization to reach its asymptotic value, which is expected since it takes longer to fill the latent free storage. Notice that the storage utilization curves are smooth, as implied by the assumption that the system behavior is smooth.



Fig. 3. Storage utilization performance with different fixed-sized nodes against the different numbers of documents indexed.

It is difficult to determine the best final storage utilization based on Fig. 3, since the final storage utilization rates for fixed-sized nodes of 12, 16 and 20 postings per node were very close to each other after indexing 1.67 million documents. Fig. 4 was therefore plotted to visualize the best final storage utilization rate after indexing 1.67 million documents; it shows how the final storage utilization rate varies with the fixed node size. The best final storage utilization rate is 92.5%, achieved using 16 postings per node. This node size is very close to the optimal value of 15.9 postings calculated using Eq. (4). The final storage utilization is not sensitive to the exact value of the optimal fixed node size. Therefore, the use of a near-optimal fixed-sized node that is slightly larger than the true optimal size is preferred, because chaining performance will improve and because some margin is provided for indexing more documents later. In addition, since the storage utilization is insensitive to the exact value of the optimal node size, the approximations and assumptions made for the VGR and TGR schemes should hold.



Fig. 4. Final storage utilization performance after indexing 1.67 million documents for different fixed-sized nodes. The optimal node size is determined by using Eq. (4).

The storage utilization curves for different number of indexed documents are shown in Fig. 5 for different storage allocation schemes. The TGR and ARp schemes tied for the best asymptotic utilization rate, followed by ARb and F16, and then by VGR. Although ARp and TGR asymptotically achieved similar storage utilization, it appeared that TGR was able to reach the asymptotic performance earlier than ARp.



Fig. 5. The storage utilization of different storage allocation schemes against the number of indexed documents.

Surprisingly, the storage utilization of AAR is the worst. Since the node size is at least two postings per node for the AAR scheme (i.e., b = 2), the lower storage utilization is due to latent free space allocated to nodes that was never used. This suggests that the estimated probabilities were too large, i.e. p(tk) << B(N−L, tk)/(N − id1). Such over-estimation is due to the fact that the denominator, N − id1, was small, so that (a) B(N−L, tk) can easily be close to N − id1, causing over-estimation, and (b) quantization errors of 1/(N − id1) become significant, as they are amplified by a factor of m (i.e., 65,535 in this case). For example, suppose that p(tk) = 0.65, but the closest estimated probability based on relative frequency is 0.75 for N − id1 = 4. The estimation error of 0.1 results in allocating 6554 (i.e., 65,535×0.1) more postings than necessary. If the quantization errors cause under-estimation, then the node will be filled more quickly and a new node will be allocated. Therefore, the observable errors are biased towards allocating more memory than necessary. These quantization errors are due to changes in the topic focus of the incoming documents. When there is a new topic, a new node is allocated, which is quickly filled since the same term occurs in several incoming documents. When a new node is allocated again, there will be large quantization errors because the previous node was small (i.e., N − id1 is small). If there is a topic change at this point, the allocated large node will remain mostly unfilled.

It is surprising that the asymptotic storage utilization of the VGR scheme is worse than that of the F16 scheme, because the VGR scheme is more sophisticated than the F16 scheme. Since the storage utilization curves for the F16, VGR and TGR schemes in Fig. 5 are smooth, the assumption that the optimal value of the node size does not vary significantly over the near future should hold (i.e., E(W(S(N))) ≈ E(W(S(N+ΔN)))). Therefore, the relatively lower storage utilization for VGR is due to problems with the estimation of α. According to Fig. 2, as the number of documents indexed increases, α decreases towards zero. However, the optimal node size varies as the reciprocal of the square root of α, so the relative errors become noticeable as α tends to zero after indexing more and more documents. These estimation errors of α translate into large changes in the optimal node sizes as the number of documents indexed increases, because they are amplified by the semi-monotonically increasing function ψ(N). Over-allocation due to under-estimating α is easier to observe than under-allocation due to over-estimating α, because new nodes are simply allocated whenever a node turns out to be too small.

The storage utilization of the TGR scheme was better than that of the VGR scheme, as expected, and was amongst the best, together with ARp. It owes its success to optimizing node sizes for individual terms, as well as to the fact that the estimation error reduces as the number N of indexed documents increases (i.e., (N+1)p(tk) ≈ df(N, tk)). Since the node size was optimized for individual terms, the TGR scheme approaches the asymptotic storage utilization much more quickly than the VGR scheme.

In Fig. 5, the storage utilization curve of the ARp scheme has some saw-tooth patterns. These patterns repeat approximately every 65,000 documents, which roughly corresponds to the period u. These patterns can be explained as follows. When most of the nodes are filled, the storage utilization is at its (local) peak. Immediately after the local peak, new nodes are allocated and therefore there is a relatively fast drop of storage utilization, producing a nearby (local) trough. As the nodes are filled steadily, the storage utilization improves steadily until most of the nodes are filled again.

Similarly, the storage utilization of the ARb scheme has some saw-tooth patterns. However, these patterns are not as apparent as those for the ARp scheme because the period u is much shorter for the ARb scheme (as m is smaller for the ARb scheme). Although the asymptotic storage utilization of the ARp scheme outperformed that of the ARb scheme by only a small margin, the ARb scheme is able to reach the asymptotic storage utilization much more quickly than the ARp scheme. However, if time is measured in terms of the number (i.e., r) of saw-tooth cycles (i.e., periods u), then both the ARb and ARp schemes need about 20 saw-tooth cycles to reach the asymptotic storage utilization. Therefore, the rate of convergence to the asymptotic storage utilization should be measured in terms of the number of saw-tooth cycles instead of the physical time or the number of documents indexed.

5.4. Determining the best scheme

Fig. 6 can be used to find the best storage allocation scheme. The access time is defined as the mean number of chainings (or address indirection operations) per index term for retrieval. The best scheme is ARp (i.e., the one nearest to the top left-hand corner in Fig. 6). In general, the mean number of chainings per term for F16 is larger than that of the more sophisticated schemes, as expected. Also, schemes that optimize node sizes for individual terms (i.e., TGR, AR and AAR) have better chaining performance than schemes that optimize node sizes based on the entire vocabulary (i.e., F16 and VGR).



Fig. 6. Scatter diagram of the final (or near-asymptotic) storage utilization against the mean number of chainings per index term for different storage allocation schemes.

ARp has the best storage utilization rate (95%) and the second-smallest mean number of chainings (i.e., 1.8) after AAR (i.e., 1.5), which has the worst storage utilization rate (40%). Since AAR has a much better mean number of chainings per term than any other scheme but a much lower storage utilization rate, this confirms that AAR allocated nodes that were too large, resulting in more latent free storage and less need for chaining. For the AR schemes, the mean number of chainings per term for ARp is much better than that for ARb because larger node sizes, due to scaling with a larger value of m, result in less chaining and, unlike with AAR, the latent free storage was eventually used as more documents arrived. This suggests that the arrival rate scheme's prediction of large nodes is quite robust, since substantially scaling the value of m did not have an impact on the storage utilization. Otherwise, prediction errors of node sizes would have been translated into useless latent free space, substantially degrading storage utilization as for AAR. In principle, according to the law of large numbers, the prediction of p(tk), whose quality is measured by the variance σk of the estimate of p(tk), improves as the number n of indexed documents increases, with the estimation error shrinking in proportion to 1/√n.

TGR, which tied with ARp for the best asymptotic storage utilization rate, has more chainings than ARp. This is not entirely surprising, since ARp can allocate large node sizes for common terms, as the node sizes are linearly scaled with df(N, tk) and the maximum node size m, whereas, for common terms, TGR can only allocate node sizes that scale sub-linearly, as √df(N, tk), and without any knowledge of the maximum node size m. The impact of m can be observed by comparing the performances of the ARp and ARb schemes (Fig. 6), where m = 65,535 for ARp and m = 255 for ARb. Therefore, the mean chaining performance of TGR can be better than that of the AR scheme if m is sufficiently small. On average, ARp required just half the number of chainings compared with TGR. Since TGR reaches its best storage utilization much more quickly than ARp, if the available RAM is small and storage utilization is paramount, TGR may be the better storage allocation scheme in this special case, out-performing the ARb and ARp schemes.

5.5. Evaluating robustness of the best schemes

In this subsection, we evaluate the robustness of the best two storage allocation schemes found in the previous subsection, ARp and TGR, using four datasets: a subset of the TREC-2 English dataset, the TREC-6 English dataset, the TREC-2005 Robust track dataset and the NTCIR-5 Chinese dataset. The statistics of these datasets are given in Table 1. The TREC datasets are articles in English and have about one third to two thirds of the number of documents used in the previous evaluation. The TREC-2 dataset is included for the indexing speed comparison. We chose the NTCIR-5 Chinese dataset for evaluation because the Chinese language is very different from alphabetic languages like English. Our system uses a Chinese word indexing strategy [25] that indexes the longest word in a given Chinese word list matching the running text. The datasets used here are newswire articles, whereas the previous VLC dataset is from the web and contains some newsgroup data. Except for the TREC-2 data, each document is stored in its own file rather than being read from a file holding a batch of documents. The disk storage is calculated by counting the actual number of bytes occupied by the document content. The storage of the allocated disk blocks is the number of disk blocks used times the disk block size; this storage is obtained using the du facility in Linux.

Table 1.

Statistics of the test datasets

Dataset | TREC-6 | TREC-2005 | NTCIR-5 | TREC-2 (a)
Language | English | English | Chinese | English
Number of documents | 566,077 | 1,033,461 | 901,446 | 510,637
Number of files | 566,077 | 1,033,461 | 901,446 | 1,025
Number of unique index terms (×10^6) | 2.22 | 1.31 | 1.32 | 0.62
Number of postings (×10^6) | 91.5 | 180 | 213 | 58
Storage for content (Gb) | 2.1 | 3.0 | 1.0 | 0.9
Storage of allocated disk blocks (Gb) | 3.43 | 5.50 | 3.66 | 1.24

(a) TREC-2 is for comparison.

The SUN server used in the previous evaluation had a heavy load, multitasking various resource-demanding jobs, which made timing the indexing processes difficult. Instead, we used a PC-server in this evaluation for timing purposes because the PC-server has a lighter load. We used the CPU time because (a) the PC-server may be loaded by other users, since it is a computing node in our computer cluster, (b) it is easier to compare performance using CPU time, and (c) the inversion is done in RAM. The PC-server has an AMD Opteron 242 (1.6 GHz) processor with 1 Mb of cache and 3 Gb of RAM (DDR 300 MHz). The spindle speed of the disk is 7200 rpm. This is a reasonably fast PC-server, although it is not the fastest available at present.

We obtain the predicted storage utilization in Table 2 by looking up the storage utilization curves of ARp and TGR in Fig. 5 using the number of indexed documents rounded to the nearest hundred, i.e., 566,100 for TREC-6, 1,033,500 for TREC-2005 and 901,500 for NTCIR-5. The predicted final storage utilizations based on the operating curves of ARp and TGR in Fig. 5 are not very different from the final storage utilizations achieved by the ARp and TGR schemes (within three percentage points) for the three datasets in Table 2. The mean number of chainings per term for the ARp scheme remains less than two, similar to the result in the previous subsection (i.e., 1.8). By comparison, the mean number of chainings per term for the TGR scheme is at least twice that of the ARp scheme. Therefore, the ARp scheme is preferred over the TGR scheme if the mean number of chainings per term is an important performance measure.

Table 2.

Indexing efficiency of the ARp and TGR schemes

                                      Scheme   TREC-6    TREC-2005   NTCIR-5
Final storage utilization             ARp      88.3%     93.0%       94.8%
  (predicted using Fig. 5)                     (88.9%)   (93.1%)     (92.5%)
Final storage utilization             TGR      90.9%     93.4%       94.5%
  (predicted using Fig. 5)                     (93.0%)   (94.1%)     (93.8%)
Mean number of chainings per term     ARp      1.26      1.90        1.75
                                      TGR      2.57      5.44        6.58
Indexing time (seconds)               ARp      2074      3322        4563
                                      TGR      2178      3343        4493

5.6. Indexing speed comparison

A subset of the TREC-2 dataset was used in [9], where the reported indexing speed is among the highest. We used this dataset to show that the time efficiency of our indexing scheme is not significantly lower than that of [9]. We also use the results of a recent single-pass in-memory inversion method by Heinz and Zobel [6], because their work is similar to ours and because they report data on their document-level inverted index construction process, which eases comparison.

For a valid comparison, the TREC-2 files as distributed by TREC are used; these files typically contain more than one document. This reduces the time needed to find and read the files by a noticeable amount (about 100 s in this experiment). We also replaced our tokenizer (as used for the other datasets here) with a simpler one that extracts tokens as strings over the set of alphanumeric characters, as in [9].

We observe that the indexing time is a function of the total number of occurrences of all index terms. Therefore, we use the number of terms indexed per second to compare indexing speed. Table 3 shows the estimated indexing rates of the single-pass index construction method by Heinz and Zobel [6], our ARp scheme, and the block-addressing inverted index construction by Navarro et al. [9], in terms of the number of terms indexed per second. In Table 3, we observe that the indexing rates of the different schemes over different collections are similar (around 210,000+ terms indexed per second). The estimated indexing rates of the index construction method by Heinz and Zobel [6] are based on the lowest indexing times they report for document-level inverted index construction (i.e., their best results). However, their results are based on elapsed time. Our results in Table 3 show that comparisons of indexing rate need to be interpreted carefully, because the rate depends on how it is measured and on many other factors, e.g., the file organization of the documents and the tokenization process.
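
For example, the rate for our ARp scheme on TREC-2 in Table 3 follows directly from the tabulated figures: 137 × 10^6 indexed terms in 494 s is approximately 278 × 10^3 terms per second.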

Table 3.

Indexing rate in terms of the number of terms indexed per second

References            Collection   # index terms   Time (s)   # terms indexed per   Final storage   Mean # chainings
                                   (×10^6)                    second (×10^3)        utilization     per term
Heinz and Zobel [6]   WebV         324(a)          1475(a)    220                   -               -
                      WebXX        1262(a)         5763(a)    219                   -               -
Navarro et al. [9]    TREC-2       137(b)          600        229                   -               -
Ours (ARp)            TREC-2       137             494        278                   88.5%           1.25
                      TREC-6       239             939        254                   89.5%           1.61
                      TREC-2005    355             1453       245                   94.0%           2.56

(a) Based on data in [6].
(b) Based on our data.

The final storage utilizations of the ARp scheme for the TREC-2, TREC-6 and TREC-2005 data are within one percentage point of the storage utilizations predicted using the ARp curve in Fig. 5 (i.e., 88.7% for TREC-2, 88.9% for TREC-6 and 93.1% for TREC-2005). We also observe that the final storage utilizations of the ARp scheme for the TREC-6 and TREC-2005 data using the original tokenizer and the simplified tokenizer are within two percentage points of each other (i.e., 88.3% and 89.5% for TREC-6, and 93.0% and 94.0% for TREC-2005). The mean numbers of chainings per term of the ARp scheme for the TREC-2 and TREC-6 datasets are similar to those in Table 2 and remain below two on average. However, the mean number of chainings per term for the TREC-2005 dataset using the simplified tokenizer is larger than two.

5.7. Extensible compressed inverted file

This subsection shows that the extensible inverted file can be compressed using integer compression techniques [9], [12], [13], [14], [15] and [16]. Postings are compressed here using variable byte compression [8] and [13] because of its simplicity and because it is a single-pass compression method (unlike, for example, parameterized compression methods). The experimental setup is the same as in the previous subsection, except that the storage allocation scheme is modified for compressed postings. Specifically, when the memory of a node is exhausted but some compressed data remain, the remaining compressed data are stored in the newly chained node that follows. This modification can be applied to our other storage allocation schemes (e.g., TGR). Apart from this modification, we evaluated the ARp storage allocation scheme (assuming that the storage for each posting is six bytes) and a variant, called the ARpr scheme, in which the calculated node size is multiplied by the compression ratio R, defined as the storage of the compressed posting information divided by the storage of the uncompressed posting information. The rationale for scaling the node size by R is that the compressed postings occupy, on average, only a fraction R of the storage of the original posting information, so the node size can be reduced accordingly (on average). Multiplying the node size by R is called R scaling here.
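
To illustrate the mechanics described above, the sketch below (Python, a minimal sketch under our own assumptions rather than the paper's implementation) shows one common variable byte code, which stores seven data bits per byte and flags the last byte of each value, together with the spill-over rule: compressed bytes that do not fit in the current node are written to a newly chained node. For the ARpr variant, the capacity passed in would simply be the ARp node size scaled by R.

```python
# Minimal sketch, not the paper's code.

def vbyte_encode(n: int) -> bytes:
    """One common variable byte variant: 7 data bits per byte, high bit on the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)   # lower seven bits, continuation implied
        n >>= 7
    out.append(n | 0x80)       # final byte carries the stop flag
    return bytes(out)

def append_compressed_posting(nodes, capacity_bytes, d_gap):
    """Append one compressed d-gap to a term's chained nodes.

    nodes          -- list of bytearray nodes; the last one is the current node
    capacity_bytes -- node capacity; for ARpr this is the ARp size scaled by R
    d_gap          -- difference between consecutive document numbers
    """
    for byte in vbyte_encode(d_gap):
        if not nodes or len(nodes[-1]) >= capacity_bytes:
            nodes.append(bytearray())      # chain a new node for the spill-over
        nodes[-1].append(byte)

# Example: postings for documents 3, 7 and 1000 stored as d-gaps 3, 4 and 993.
nodes = []
for gap in (3, 4, 993):
    append_compressed_posting(nodes, capacity_bytes=3, d_gap=gap)
print([bytes(n) for n in nodes])   # second node holds the spilled final byte
```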

Table 4 shows the results of building the extensible compressed inverted file for the TREC-2, TREC-6 and TREC-2005 English documents and some combinations of these collections. The compression ratio R is 37% for all the collections. R is used to predict the storage utilization by looking up the ARp storage utilization curve in Fig. 5 at the equivalent number of documents indexed, defined as the number of documents indexed times the compression ratio R. The rationale for using the equivalent number of documents indexed to predict the storage utilization is that the storage utilization is a function of the total allocated storage, and the number of documents indexed serves as an estimate of the total allocated storage. For extensible compressed inverted files, this estimate is scaled by the compression ratio R since the storage for postings is reduced by that factor. Since the storage utilizations in Fig. 5 are measured after indexing every 100 documents, the equivalent number of documents indexed is rounded to the nearest hundred.

Table 4.

Final and predicted storage utilization of our extensible compressed inverted file

(Combined) TREC    R scaling   Time (s)   Mean # chainings   R (%)   Final storage      Predicted storage    Equivalent #
data collection                           per term                   utilization (%)    utilization (%)      documents indexed
2                  No          521        1.30               37      68.2               74.7                 190,500
                   Yes         500        2.20               37      84.7               74.7                 190,500
6                  No          979        1.25               37      72.3               74.5                 210,000
                   Yes         937        1.95               37      85.3               74.5                 210,000
2005               No          1477       1.64               37      84.3               85.5                 380,000
                   Yes         1442       3.31               37      90.9               85.5                 380,000
2+2005             No          1978       1.70               37      88.0               89.5                 569,600
                   Yes         1952       3.52               37      91.7               89.5                 569,600
6+2005             No          2514       1.58               37      88.2               89.8                 590,000
                   Yes         2396       3.08               37      91.5               89.8                 590,000
2+6+2005           No          3035       1.66               37      90.0               91.9                 779,800
                   Yes         2916       3.39               37      91.9               91.9                 779,800

We observe from Table 4 that the final storage utilizations achieved using the ARp scheme and the corresponding predicted storage utilizations based on the equivalent number of documents indexed are within three percentage points for collections with over 380,000 equivalent documents. The predicted and final storage utilizations appear closer together for the larger data collections (e.g., TREC-2 + TREC-6 + TREC-2005), which might be due to the smaller slope of the storage utilization curve as the number of documents indexed increases. The ARp scheme without R scaling has a consistently lower storage utilization than the ARpr scheme with R scaling, but the ARp scheme also has a consistently lower mean number of chainings per term, because it allocates larger but fewer nodes than the ARpr scheme. We also observe in Table 4 that the difference in final storage utilization between the ARp and ARpr schemes shrinks for larger collections, while the difference in the mean number of chainings per term grows steadily. Therefore, if there is enough RAM to index a large number of documents, we prefer the ARp scheme without R scaling because of its good final storage utilization and its small mean number of chainings per term.

In Table 4, we observe that the indexing times of the extensible compressed inverted file for the TREC-2, TREC-6 and TREC-2005 data collections are similar to the corresponding indexing times of the uncompressed extensible inverted file in Table 3. We also observe that the indexing time of a combined collection is close to the sum of the indexing times of its individual collections (e.g., 1978 ≈ 521 + 1477 for the combined TREC-2 and TREC-2005 collection). The mean number of chainings per term (in Table 4) using variable byte compression without R scaling is similar to that without compression (Table 3) for the corresponding data collections.

6. Related work

The inverted file is popular for indexing archival databases and free text. It is considered [26] the best choice for Internet searches. It can retrieve documents quickly [27] and it can be compressed [28] to require as little storage as signatures [29]. For dynamic environments (e.g., the Internet), inverted files have been modified to support (incremental or batch) updates. Cutting and Pedersen [30] modified the B-tree structure with a heap data structure to store postings, thereby improving the storage utilization rate (from 66% to 86%) and reducing indexing time. Tomasic et al. [31] used a dual-list strategy that stores short and long inverted lists separately; its asymptotic storage utilization rate can reach about 88%. Our work can be considered an extension of theirs, using variable-size (linked) lists instead of pre-defined short and long lists. Brown et al. [32] used a persistent object store to manage an incremental inverted file, which has a low RAM requirement. Heinz and Zobel [6] showed that their single-pass (in-memory) inversion approach was the preferred method for inverted file construction, but its RAM storage utilization was not reported. Zobel et al. [33] used a fixed-size block of RAM to hold inverted lists; if the RAM block overflows, the block is written to disk. The storage utilization rate is good (93–98%) and is robust to different block sizes. Our work can be considered an extension of their work in which the size of the RAM block is predicted without the need to move data chunks. Similar to our approach, but tested with a smaller collection, Shieh and Chung [34] used run-time statistics to determine the allocation of free space for the construction of inverted files, linearly interpolating the predicted number of arrivals under different extreme conditions. Here, our simple-to-implement single-pass method, which uses an existing (in-memory) inverted file structure, achieved final storage utilization rates between 87% and 95%, depending on the amount of data indexed, over various reference datasets that include Chinese documents. We found that the storage utilization of our storage allocation schemes increases in the long run as the amount of data indexed increases.

In-memory inversion is relevant to the recent substantial interest [35] in building and accessing parallel and distributed indices [36], [37] and [38] for information retrieval (IR). Initial interest included partitioning the inverted file [23], [39] and [40] and using specialized parallel hardware [41], [42], [43] and [44] for efficiency. As computational power increases, parallel architectures for IR use less specialized hardware (e.g., workstations). One solution uses symmetric multiprocessors [45] in a shared-everything memory organization. An alternative uses low-cost servers interconnected by a local area network in a shared-nothing memory organization to index [23], [24] and [46] and retrieve [36] and [37] documents concurrently, possibly acting as state-of-the-art locally distributed web servers [47] for giant web services [48]. The hardware and software of the system should be balanced for effective system utilization [45], for instance using (software) pipelining [25] and [49]. Couvreur et al. [50] analyzed the tradeoff between cost and performance using different types of hardware. A more recent evaluation examined retrieval efficiency [51] for distributed IR. As the current trend in indexing large data collections is parallel indexing (in batches) (e.g., [5] and [7]), our storage allocation schemes can be used to build these partitioned indices. Our results serve as a reference for PC-based parallel indexing (e.g., [5]) since the amount of RAM in PCs is similar to the amount of RAM in our server.

The extensible inverted file can store compressed postings [9], [11], [12], [13], [14], [15], [16] and [28], and this is complementary to compressing postings to achieve better memory utilization for in-memory inversion. Specifically, the optimal node size for storing compressed postings can be determined by multiplying the optimal node size described in Sections 3 and 4 by the compression ratio R. This works with the better AR scheme because the storage utilization of this scheme is resilient to simple scaling, as attested by the similar storage utilization rates of ARp and ARb (92.5% and 92%, respectively), where the scaling factor between the largest node sizes of ARp and ARb is as large as six. It is possible not to multiply the optimal node size by R, but this effectively enlarges the optimal node size by a factor of 1/R.

Our proposed storage allocation schemes can also determine node sizes for inverted files that include word positions. One general approach stores the word positions after the corresponding basic posting. In this case, the storage should be allocated conservatively because it is difficult to predict the within-document term frequency and hence the amount of word position data. Hence, when a new node is allocated for the index term tk in the current document n, the storage of the new node is the sum X_new of:

(1) the storage specified by our storage allocation schemes (say c1 × B_old bytes);

(2) the storage for the word positions of tk in the current document n (say Y bytes); and

(3) the storage for the word positions of the remaining B_old − 1 postings, assuming that the kth term occurs only once in each of those documents.

Therefore, the storage for posting information in the new node is X_new = c1·B_old + Y + c2·(B_old − 1), where c2 is the storage for one word position. The storage overhead of the new node that stores word positions is ε bytes, and the storage utilization V_new of this new node is V_new = X_new/(ε + X_new). The storage utilization V_old of an equivalent new node that stores the same number of postings but no word position information is V_old = c1·B_old/(ε + c1·B_old). Since B_old ≥ 1 and Y > 0, we have X_new > c1·B_old, and because x/(ε + x) increases with x, it follows that V_new > V_old. Therefore, the storage utilization of an extensible inverted file that stores word positions should be larger than that of the corresponding extensible inverted file without word position information. There are other variants for determining the number of word positions to reserve per document. For example, we can use the average term frequency in a document and its standard deviation to estimate the minimum number of word positions to store, with a prescribed confidence level (e.g., 95%).
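
The bookkeeping above can be stated compactly in code. The sketch below simply evaluates the formulas; only c1 = 6 bytes per posting comes from the earlier experiments, while the values of c2 and ε in the example call are assumed for illustration.

```python
# Direct evaluation of the formulas above; numeric defaults are illustrative.
def utilization_with_positions(B_old, Y, c1=6, c2=4, epsilon=8):
    """Return (X_new, V_new, V_old) for a new node that also stores word positions.

    B_old   -- node size in postings given by the storage allocation scheme
    Y       -- bytes of word positions of tk in the current document n
    c1      -- bytes per basic posting (6 in the earlier experiments)
    c2      -- bytes per word position (assumed value)
    epsilon -- per-node storage overhead in bytes (assumed value)
    """
    X_new = c1 * B_old + Y + c2 * (B_old - 1)
    V_new = X_new / (epsilon + X_new)               # utilization with positions
    V_old = (c1 * B_old) / (epsilon + c1 * B_old)   # utilization without positions
    return X_new, V_new, V_old

# Since x/(epsilon + x) increases with x and X_new > c1*B_old, V_new > V_old.
print(utilization_with_positions(B_old=10, Y=12))   # -> (108, ~0.93, ~0.88)
```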

The other general approach to handling word position information with inverted files is to use extended postings (Section 2.1) of the form ⟨di, tf(di, tk), f(di, tk)⟩. Our storage allocation schemes determine the optimal node size for extended postings by simply changing the constant c1 in the original storage allocation schemes to c1 + c3, where c3 is the storage for the file position. The storage utilization of this approach will be lower than that of storing word positions in the nodes because the auxiliary file positions of the extended postings are overhead rather than information. For a valid comparison, the storage utilization of this approach should be compared against another scheme or approach that also uses extended postings. In summary, both storing word positions in the nodes and storing them in auxiliary files can make use of the storage allocation schemes in Sections 3 and 4.

7. Conclusion

Several storage allocation schemes are proposed for in-memory extensible inverted file construction; these schemes are based on minimizing the storage waste under different conditions. Minimizing storage waste is equivalent to maximizing storage utilization, which is important for reducing the number of inverted files to be merged and for reducing the amortized time cost of inverted file maintenance [8]. Minimizing the storage waste also indirectly reduces the number of chainings and the access time, since an address indirection may lead to a cache miss (in RAM) or a file seek (on disk). Reducing the access time is important for the extensible inverted file both as a temporary in-memory index when it is being searched and as a partial index when it is being merged to form larger partial indexes or the final inverted file.

Our storage allocation schemes were evaluated using a sizeable (i.e., 3.5 Gb) document subset of the VLC. In our experiments, the best scheme was the arrival rate (AR) scheme, which determines the node size from term arrival rates. The AR scheme was found to be the best because it achieves the best final storage utilization rate of 95% on this subset of the VLC data while, at the same time, achieving the second-best mean number of chainings per term (i.e., 1.8). The adaptive AR scheme has the best mean number of chainings per term (i.e., 1.2) but the lowest final storage utilization (i.e., 42%). The TGR scheme has a final storage utilization similar to that of the AR scheme, but its mean number of chainings per term is 3.8, about double that of the AR scheme. Therefore, the AR scheme is our clear best scheme when performance is measured in terms of both the final storage utilization rate and the mean number of chainings per term.

We have also evaluated the ARp scheme using four additional reference data collections (i.e., TREC-2, TREC-6, TREC-2005 and NTCIR-5). The final storage utilization of the ARp scheme for these four data collections can be predicted within three percentage points using the storage utilization curve of the ARp scheme derived from the VLC data collection as a kind of operating curve. The indexing speed (i.e., the number of terms indexed per second) of our system can be increased by optimizing the program code (e.g., using a simpler tokenizer) and the operating environment (e.g., combining documents into a single file for fast disk access). The resulting indexing speed of our system was similar to the indexing speeds reported by others [6] and [9]. This illustrates that our storage allocation schemes do not incur a significant time overhead when calculating node sizes.

We evaluated the ARp scheme for storing variable byte compressed postings [9] and [13] using the TREC-2, TREC-6 and TREC-2005 collections and some of their combinations. The measured compression ratio R is almost constant (i.e., 37%). For compressed postings, the predicted storage utilization is determined by looking up the operating ARp curve in Fig. 5 at the equivalent number of documents indexed, defined as the number of documents indexed times R. We observe that when the equivalent number of documents indexed is 210,000 or more, the final storage utilization of the ARp scheme is close to (i.e., within three percentage points of) the corresponding predicted storage utilization. The storage utilization of the ARp scheme for compressed postings is thus similar to that for uncompressed postings when the number of documents indexed is sufficiently large.

The storage allocation schemes can also determine node sizes for compressed inverted files with word position information. For nodes that store word positions directly, the node size can be the sum of the calculated optimal node size and the storage for the word positions of the index term in the current indexed document. For nodes that store extended postings, the optimal node size can be determined by the storage allocation schemes using the storage of an extended posting as c1.


Acknowledgments

We thank the Center for Intelligent Information Retrieval, University of Massachusetts, for facilitating Robert Luk in developing part of the IR system while he was on leave there. We are grateful to ROCLING for providing its word list. This work is supported by the Hong Kong Polytechnic University Grant no. A-PE36.


References

[1] D. Hawking, N. Craswell, P.B. Thistlewaite, Overview of TREC-7 very large collection track, in: Proceedings of The Seventh TREC Conference, 1998, pp. 40–52.

[2] C. Clarke, N. Craswell, I. Soboroff. Terabyte track, http://www-nlpir.nist.gov/projects/terabyte/, 2003.

[3] J. Hirai, H. Garcia-Molina, A. Paepcke, S. Raghavan, WebBase: a repository of web pages, in: Proceedings of The Ninth International World Wide Web Conference, 2000, pp. 277–293.

[4] NTCIR Patent Retrieval Task, http://www.slis.tsukuba.ac.jp/~fujii/ntcir5/cfp-en.html, 2005.

[5] L.A. Barroso, J. Dean and U. Hölzle, Web search for a planet: the Google cluster architecture, IEEE Micro. 23 (2003) (2), pp. 22–28.

[6] S. Heinz and J. Zobel, Efficient single-pass index construction for text databases, J. Am. Soc. Inform. Sci. Technol. 54 (2003) (8), pp. 713–729.

[7] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999.

[8] N. Lester, J. Zobel and H. Williams, Efficient online index maintenance for contiguous inverted lists, Informat. Process. Manage. 42 (2006), pp. 916–933.

[9] G. Navarro, E.S. de Moura, M. Neubert and R. Baeza-Yates, Adding compression to block addressing inverted indexes, Inform. Retriev. 3 (2000), pp. 49–77.

[10] A. MacFarlane, S.E. Robertson, J.A. McCann, On concurrency control of inverted files, in: F.C. Johnson (Ed.), Proceedings of the 18th MCS IRSG Annual Colloquium on Information Retrieval Research, 26–27 March 1996, pp. 67–79.

[11] U. Manber, S. Wu, Glimpse: a tool to search through entire file systems, in: Proceedings of the USENIX Winter 1994 Technical Conference, 1994, pp. 23–32.

[12] N. Ziviani, E.S. de Moura, G. Navarro and R. Baeza-Yates, Compression: a key for next-generation text retrieval systems, IEEE Comput. 33 (2000) (11), pp. 37–44.

[13] P. Elias, Universal codeword sets and the representation of the integers, IEEE Trans. Inform. Theory 21 (1975) (2), pp. 194–203.

[14] A. Trotman, Compressing inverted files, Inform. Retriev. 6 (2003) (1), pp. 5–19.

[15] H. Williams and J. Zobel, Compressing integers for fast file access, Comput. J. 42 (1999) (3), pp. 193–201.

[16] S.W. Golomb, Run-length encodings, IEEE Trans. Inform. Theory 12 (1966) (3), pp. 399–401.

[17] E.A. Brewer, Combining systems and databases: a search engine retrospective, in: J.M. Hellerstein and M. Stonebraker (Eds.), Readings in Database Systems, Fourth Edition, MIT Press, Cambridge, MA (2005).

[18] A. Fox, E.A. Brewer, Harvest, yield, and scalable tolerant systems, in: Proceedings of the 16th SOSP, St. Malo, France, October 1997.

[19] S. Gilbert and N. Lynch, Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, Sigact News 33 (2002) (2), pp. 51–59.

[20] H.S. Heaps, Information Retrieval: Computational and Theoretical Aspects, Academic Press, New York (1978).

[21] C.L.A. Clarke, G. V. Cormack, Dynamic inverted indexes for a distributed full-text retrieval system, Technical Report MT-95-01, University of Waterloo, 1995.

[22] B.A. Ribeiro-Neto, J.P. Kitajima, G. Navarro, C. Santana, N. Ziviani, Parallel generation of inverted files for distributed text collections, in: Proceedings of the 18th International Conference of the Chilean Society of Computer Science, Chile, 1998, pp. 149–157.

[23] B. Ribeiro-Neto, E.S. Moura, M.S. Neubert, N. Ziviani, Efficient distributed algorithms to build inverted files, in: Proceedings of The 22nd Annual International ACM SIGIR conference on Research and development in information retrieval, Berkeley, 1999, pp. 105–112.

[24] S. Heinz, J. Zobel and H.E. Williams, Burst tries: a fast, efficient data structure for string keys, ACM Trans. Inform. Syst. 20 (2002) (2), pp. 192–223.

[25] C. Kit, Y. Liu and N. Liang, On methods of Chinese automatic word segmentation, J. Chin. Inform. Process. 3 (1989) (1), pp. 13–20.

[26] S. Melnik, S. Raghavan, B. Yang and H. Garcia-Molina, Building a distributed full-text index for the Web, ACM Trans. Inform. Syst. 19 (2001) (3), pp. 217–247.

[27] J. Zobel, A. Moffat and K. Ramamohanarao, Inverted files versus signature files for text indexing, ACM Trans. Database Syst. 23 (1998) (4), pp. 453–490.

[28] I. Witten, A. Moffat and T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann Publishers, Los Altos, CA (1999).

[29] C. Faloutsos and S. Christodoulakis, Description and performance analysis of signature file methods, ACM Trans. Office Inform. Syst. 5 (1987) (3), pp. 237–257.

[30] D. Cutting, J. Pedersen, Optimizations for dynamic inverted index maintenance, in: Proceedings of the Thirteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1990, pp. 405–411.

[31] A. Tomasic, H. Garcia-Molina, K.A. Shoens, Incremental updates of inverted lists for text document retrieval, in: Proceedings of The ACM SIGMOD International Conference on Management of Data, 1994, pp. 289–300.

[32] E.W. Brown, J.P. Callan, W.B. Croft, Fast incremental indexing for full-text information retrieval, in: Proceedings of The 20th VLDB Conference, 1994, pp. 192–202.

[33] J. Zobel, A. Moffat, R. Sacks-Davis, Storage management for files of dynamic records, in: Proceedings of the Fourth Australian Database Conference, 1993, pp. 26–38.

[34] W.-Y. Shieh and C.-P. Chung, A statistics-based approach to incrementally update inverted files, Inform. Process. Manage. 41 (2005) (2), pp. 275–288.

[35] A. Tomasic and H. Garcia-Molina, Issues in parallel information retrieval, Bull. Tech. Committee Data Eng. 17 (1994) (3), pp. 41–49.

[36] C. Badue, B.A. Ribeiro-Neto, R. Baeza-Yates, N. Ziviani, Distributed query processing using partitioned inverted files, in: Proceedings of the Eighth International Symposium on String Processing and Information Retrieval (SPIRE 2001), 2001, pp. 10–20.

[37] A. MacFarlane, J.A. McCann, S.E. Robertson, Parallel search using partitioned inverted files, in: Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE 2000), 2000, pp. 209–220.

[38] J.P. Callan, Z. Lu, W.B. Croft, Searching distributed collections with inference networks, in: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, 1995, pp. 21–28.

[39] C. Stanfill, Partitioned posting files: a parallel inverted file structure for information retrieval, in: Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Brussels, 1990, pp. 413–428.

[40] B.-S. Jeong and E. Omiecinski, Inverted file partitioning schemes in multiple disk systems, IEEE Trans. Parallel Distribut. Syst. 6 (1995) (2), pp. 142–153.

[41] S.-H. Chung, S.-C. Oh, K.R. Ryu, S.-H. Park, Parallel information retrieval on a distributed memory multiprocessor system, in: Proceedings of the International Conference on Algorithms and Architectures for Parallel Processing (ICAPP 97), 1997, pp. 163–176.

[42] P. Bailey, D. Hawking, A parallel architecture for query processing over a terabyte of text, Technical Report TR-CS-96-04, The Australia National University, June 1996.

[43] N. Goharian, T. El-Ghazawi, D. Grossman, Enterprise text processing: a sparse matrix approach, in: Proceedings of the International Conference on Information Technology: Coding and Computing, 2001, pp. 71–75.

[44] C. Stanfill and B. Kahle, Parallel free-text search on the connection machine system, Commun. ACM 29 (1986) (12), pp. 1229–1239.

[45] Z. Lu, K.S. McKinley, B. Cahoon, The hardware/software balancing act for information retrieval on symmetric multiprocessors, in: Proceedings of Euro-Par 98, 1998, pp. 521–527.

[46] S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina, Building a distributed full-text index for the Web, in: Proceedings of The 10th International Conference on World Wide Web, Hong Kong, 2001, pp. 396–406.

[47] V. Cardellini, E. Casalicchio, M. Colajanni and P.S. Yu, The state of the art in locally distributed web servers, ACM Comput. Surveys 34 (2002) (2), pp. 263–311.

[48] E.A. Brewer, Lessons from giant-scale services, IEEE Internet Comput. 5 (2001) (4), pp. 46–55.

[49] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke and S. Raghavan, Searching the web, ACM Trans. Internet Technol. 1 (2001) (1), pp. 2–43.

[50] T.R. Couvreur, R.N. Benzel, S.F. Miller, D.N. Zeitler, D.L. Lee, M. Singhai, N. Shivaratri and W.Y.P. Wong, An analysis of performance and cost factors in searching large text databases using parallel search systems, J. Am. Soc. Inform. Sci. 45 (1994) (7), pp. 443–464.

[51] B. Cahoon, K.S. McKinley and Z. Lu, Evaluating the performance of distributed architectures for information retrieval using a variety of workloads, ACM Trans. Inform. Syst. 18 (2000) (1), pp. 1–43.


 