Notes on Chapter 4 of Introduction to Information Retrieval (in English)

Abstract

Chapter 4 focuses on how to construct an inverted index, a process called index construction or indexing. Because of hardware constraints, we need to introduce new ideas to cope with the characteristics of computer hardware that are relevant for indexing. First, Section 4.2 introduces an efficient single-machine algorithm, blocked sort-based indexing. Then, single-pass in-memory indexing has even better scaling properties because it does not hold the vocabulary in memory. To accommodate collections that change frequently, we need dynamic indexing, covered in Section 4.5. Finally, we cover some complicating issues that can arise in indexing, including security and indexes for ranked retrieval.

Hardware Basics

Knowledge Foundation

  • Access to data in memory is much faster than access to data on disk.
  • When doing a disk read or write, it takes a while for the disk head to move to the part of the disk where the data are located.
  • Operating systems generally read and write entire blocks.
  • Data transfers from disk to memory are handled by the system bus, not by the processor.
  • Servers used in IR systems typically have several gigabytes (GB) of main memory, sometimes tens of GB.
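
As a rough back-of-the-envelope example (using figures close to the book's typical system parameters, taken here as assumptions: about 5 ms average seek time and about 0.02 μs per byte transferred), reading 10 MB stored in 100 non-contiguous chunks costs roughly 100 × 5 ms = 0.5 s of seek time plus 10^7 bytes × 0.02 μs ≈ 0.2 s of transfer time, so seek time dominates unless the data is laid out in large contiguous blocks.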

Summary: Influencers of indexing

Index construction refers to the entire process of turning a document collection into an inverted index.

  • consider the size of main memory and the CPU clock rate;

e.g.,
if memory is large enough, keeping all the documents in memory makes building the inverted index fast.

  • keep as much data as possible in memory
  • account for the time spent on disk seeks

Blocked Sort-Based Indexing (BSBI)

It is used when memory cannot hold the entire collection.

The specific process is as follows:

  1. Divide the entire document collection into parts of equal size (each part fills roughly one block) and read a block into memory.
  2. Parse each block into term_ID-doc_ID pairs;
    to make index building more efficient, we replace the term itself with a term_ID; of course we must maintain a mapping table between terms and term_IDs, which is kept in memory.
  3. Sort the block's term_ID-doc_ID pairs by term_ID, and build a temporary postings list from the doc_IDs that share the same term_ID.
  4. Write the block's index to disk, then continue with the next block.
  5. When the whole collection has been processed, the inverted indexes of all blocks are merged, and finally we get a complete inverted index.

  • BSBI Algorithm

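The book's BSBI pseudocode figure is not reproduced here; instead, here is a minimal Python sketch of the procedure described above (function names and the pickle-based on-disk block format are assumptions for illustration, not the book's implementation):

```python
# Minimal BSBI sketch (illustrative only; names and the pickle-based block
# format are assumptions, not the book's implementation).
import os
import pickle
from collections import defaultdict

def bsbi_index(pair_stream, block_size, workdir="bsbi_blocks"):
    """Build an inverted index from a stream of (term_id, doc_id) pairs."""
    os.makedirs(workdir, exist_ok=True)
    block_files, buffer = [], []
    for term_id, doc_id in pair_stream:          # step 2: termID-docID pairs
        buffer.append((term_id, doc_id))
        if len(buffer) >= block_size:            # a block's worth of pairs
            block_files.append(_write_block(buffer, workdir, len(block_files)))
            buffer = []
    if buffer:
        block_files.append(_write_block(buffer, workdir, len(block_files)))
    return _merge_blocks(block_files)            # step 5: merge all block indexes

def _write_block(pairs, workdir, block_no):
    pairs.sort()                                 # step 3: sort by termID, then docID
    index = defaultdict(list)
    for term_id, doc_id in pairs:                # temporary postings list per termID
        if not index[term_id] or index[term_id][-1] != doc_id:
            index[term_id].append(doc_id)
    path = os.path.join(workdir, f"block{block_no}.pkl")
    with open(path, "wb") as f:                  # step 4: write the block index to disk
        pickle.dump(dict(index), f)
    return path

def _merge_blocks(block_files):
    merged = defaultdict(list)                   # a real system uses an external k-way merge
    for path in block_files:
        with open(path, "rb") as f:
            for term_id, postings in pickle.load(f).items():
                merged[term_id].extend(postings)
    return {t: sorted(set(p)) for t, p in merged.items()}
```

A real implementation would parse the documents into the term_ID-doc_ID pair stream and perform an external k-way merge instead of loading every block back into memory.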
All in all, the key steps are steps 2 and 3, so the time complexity analysis focuses on the sorting step.

Time Complexity
Θ(T log T)
T: an upper bound for the number of items we must sort (i.e., the number of term_ID-doc_ID pairs)

Nevertheless, the actual indexing time is usually dominated by the time it takes to parse the documents and to do the final merge.

  • Disadvantages
    • Cost of maintaining the term-to-term_ID mapping table
    • The larger the document collection, the larger the mapping table
    • A large mapping data structure may not fit in memory

Single-Pass In-Memory Indexing (SPIMI)

A more scalable alternative is single-pass in-memory indexing, which uses terms instead of term_IDs.
Thus, SPIMI can index collections of any size as long as there is enough disk space available.

The specific process is as follows:

  1. Split the entire document collection into term-doc_ID pairs.
  2. Process this series of term-doc_ID pairs as a stream.
  3. In memory, a dictionary (which can be implemented as a hash table) is built as terms are encountered; the term is used to find the postings list to which the doc_ID should be appended.
  4. When memory is full, sort the dictionary by term and then write the sorted dictionary-postings block to disk.
  5. Merge the multiple sorted dictionary-postings blocks.
  • SPIMI Algorithm

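The book's SPIMI pseudocode figure is not reproduced here; the following is a minimal Python sketch of the idea (names and the on-disk block format are assumptions for illustration):

```python
# Minimal SPIMI sketch (illustrative only; names and the on-disk block format
# are assumptions). Terms, not termIDs, are the dictionary keys, and postings
# lists grow dynamically as the stream is consumed.
import os
import pickle
from collections import defaultdict

def spimi_index(token_stream, max_pairs_per_block, workdir="spimi_blocks"):
    """Build an inverted index from a stream of (term, doc_id) pairs."""
    os.makedirs(workdir, exist_ok=True)
    block_files, dictionary, count = [], defaultdict(list), 0
    for term, doc_id in token_stream:            # step 2: process the pair stream
        postings = dictionary[term]              # step 3: term added on first encounter
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
        count += 1
        if count >= max_pairs_per_block:         # step 4: "memory" is full
            block_files.append(_write_sorted_block(dictionary, workdir, len(block_files)))
            dictionary, count = defaultdict(list), 0
    if dictionary:
        block_files.append(_write_sorted_block(dictionary, workdir, len(block_files)))
    return _merge(block_files)                   # step 5: merge the sorted blocks

def _write_sorted_block(dictionary, workdir, block_no):
    sorted_terms = sorted(dictionary.items())    # sort the dictionary by term
    path = os.path.join(workdir, f"block{block_no}.pkl")
    with open(path, "wb") as f:
        pickle.dump(sorted_terms, f)
    return path

def _merge(block_files):
    merged = defaultdict(list)                   # a real system streams a k-way merge
    for path in block_files:
        with open(path, "rb") as f:
            for term, postings in pickle.load(f):
                merged[term].extend(postings)
    return {t: sorted(set(p)) for t, p in merged.items()}
```
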
According to steps 2 and 3, the time complexity is O(T), where T is the same as in BSBI; no sorting of the full pair stream is required, since each posting is appended to its list directly.

MapReduce: Distributed Indexing

Collections are often so large that we cannot perform index construction efficiently on a single machine.

To deal with this situation, we use the idea of divide and conquer.

  • Main Idea

Distribute the document collection across a computer cluster for processing, using a master-slave architecture.
The master node controls the entire process; for example, when one machine breaks down while processing its part of the document collection, the master reassigns that part to another machine.

  • Map
    The document collection is divided into n splits, each of which is processed by an analyzer into sorted term-doc_ID pairs; this is essentially the per-block part of the BSBI or SPIMI algorithm.

  • Reduce

The term space is divided into two partitions, a-i and j-z; the a-i pairs from every analyzer are handed to one inverter and the j-z pairs to another, so only two inverters are needed in this example.

Note:
Analyzer and inverter are roles rather than dedicated computers; what's more, a single machine can serve as both an analyzer and an inverter.

  • Map in detail
  1. The document collection resides on an independent machine. The collection is divided into splits and transmitted to the machines serving as analyzers (the I/O time needs to be considered here).
  2. Each split is turned into term-doc_ID pairs using BSBI or SPIMI.
  • Reduce in detail
  1. The partition files are transferred to the inverters (the I/O time should be considered here).
  2. Each inverter sorts its pairs, with time complexity O(T log T).
  3. The inverted index is written to an independent machine. (Note: we need to consider the size of the dictionary and the size of the postings; the dictionary size grows with the number of terms, and the I/O time must also be considered.)
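
As a toy illustration of the map and reduce phases described above (all names here are assumptions; a real system such as Hadoop runs the analyzers and inverters on different machines and handles the data transfer):

```python
# Toy MapReduce-style sketch of distributed indexing (illustrative only).
from collections import defaultdict

def map_phase(split):
    """Analyzer: parse one split into (term, doc_id) pairs."""
    pairs = []
    for doc_id, text in split:
        for term in text.lower().split():
            pairs.append((term, doc_id))
    return pairs

def partition(term):
    """Assign a term to one of two inverters: a-i vs j-z."""
    return 0 if term[0] <= "i" else 1

def reduce_phase(pairs):
    """Inverter: sort its pairs and build postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)

# Usage: two splits processed by analyzers, two inverters in the reduce phase.
splits = [[(1, "caesar was killed"), (2, "brutus killed caesar")],
          [(3, "caesar conquered gaul")]]
buckets = [[], []]
for split in splits:                                 # map phase
    for term, doc_id in map_phase(split):
        buckets[partition(term)].append((term, doc_id))
index_parts = [reduce_phase(b) for b in buckets]     # reduce phase
print(index_parts[0])   # e.g. {'brutus': [2], 'caesar': [1, 2, 3], ...}
```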

In the discussion above, the premise is that the document collection does not change; in reality, documents are updated, inserted, and deleted.
Thus, we need to introduce dynamic indexing.

Dynamic index

  • Main Idea

An auxiliary index (holding newly added documents) is constructed in memory, and an invalidation bit vector (marking deleted documents) is maintained. When the auxiliary index becomes large enough, it is merged with the main index.

When we query, we must merge the results from both the main index and the auxiliary index, and then filter the results according to the invalidation bit vector.

  • How to construct

Introduce log2(T/n) indexes I0, I1, ..., Ii, where index Ii has size 2^i * n.

T: an upper bound for the total number of postings
n: the size of the in-memory auxiliary index (the number of postings it holds when full)

  1. At the beginning, when the auxiliary index is full, I0 is built and saved to disk, and the auxiliary index is cleared.
  2. When the auxiliary index is full for the second time, it is merged with I0 to form I1 (I0 is removed).
  3. When the auxiliary index is full for the third time, I0 is constructed again.
  4. When the auxiliary index is full for the fourth time, I0, I1, and the auxiliary index are merged to form I2 (I0 and I1 are removed).

We can see a pattern: the set of on-disk indexes present after each round follows an increasing binary counter:

idx      I3  I2  I1  I0
No.1      0   0   0   0
No.2      0   0   0   1
No.3      0   0   1   0
No.4      0   0   1   1
No.5      0   1   0   0
No.6      0   1   0   1
No.7      0   1   1   0
No.8      0   1   1   1
No.9      1   0   0   0
No.10     1   0   0   1
No.11     1   0   1   0
  • Dynamic Algorithm

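The book's logarithmic-merging pseudocode figure is not reproduced here; below is a minimal Python sketch of the same idea (class and method names are assumptions; deletions via the invalidation bit vector are omitted):

```python
# Minimal sketch of logarithmic merging (illustrative; simplified).
# An in-memory auxiliary index of capacity n, plus on-disk indexes
# I0, I1, ... of sizes n, 2n, 4n, ... (here simulated as in-memory sets).
class LogarithmicMerge:
    def __init__(self, n):
        self.n = n                 # capacity of the auxiliary index
        self.aux = set()           # in-memory auxiliary index
        self.indexes = {}          # level i -> on-disk index I_i (simulated)

    def add(self, term, doc_id):
        self.aux.add((term, doc_id))
        if len(self.aux) >= self.n:          # auxiliary index is full
            self._flush()

    def _flush(self):
        carry, level = self.aux, 0
        self.aux = set()
        # Like binary addition: keep merging upward while a same-size index exists.
        while level in self.indexes:
            carry = carry | self.indexes.pop(level)
            level += 1
        self.indexes[level] = carry          # I_level now holds the merged postings

    def search(self, term):
        # A query must consult the auxiliary index and every I_i.
        results = {d for t, d in self.aux if t == term}
        for idx in self.indexes.values():
            results |= {d for t, d in idx if t == term}
        return sorted(results)
```

Each flush behaves like adding 1 to a binary counter, which is exactly the pattern in the table above.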

Access Control

Generally, the postings of an inverted index are sorted by document ID, which helps compression.
In that case a new posting only needs to be appended at the end.
However, for ranked retrieval, postings need to be ordered by weight; when an insertion happens, the list must be scanned to determine the insertion position.

In general, access authorization means deciding whether a user may access a given document, which is managed through an ACL (access control list); the access control list can itself be represented with an inverted index structure.

The dictionary consists of the users, while each user's postings list holds the document_IDs that the user can access. In other words, this stores a user-doc_ID matrix.
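
A tiny sketch of this idea (assumed names, for illustration only):

```python
# ACL represented as an inverted index: each user is a dictionary entry whose
# postings list holds the doc_IDs the user may access; query results are
# filtered against this list.
acl_index = {
    "alice": [1, 2, 5],        # postings list: documents alice may read
    "bob":   [2, 3],
}

def filter_results(user, result_doc_ids):
    """Intersect ranked-retrieval results with the user's ACL postings."""
    allowed = set(acl_index.get(user, []))
    return [d for d in result_doc_ids if d in allowed]

print(filter_results("alice", [5, 3, 2]))   # -> [5, 2]
```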

These are my own notes and summary, drawing on Manning's original book and a CSDN blogger's post:

There are inevitably some mistakes in the English; please bear with me.
Let's keep learning together~~
