Notes on Chapter 1 of Introduction to Information Retrieval (English)

Latest recommended article published on 2024-07-30 21:53:07

Braylon1002

Category: Data Mining. Tags: python, information retrieval, Boolean index, inverted index

Article link: https://blog.csdn.net/qq_40742298/article/details/107408569

Abstract

The first chapter of the book introduces several core concepts and basics: what IR is, the three categories of retrieval by data scale, the move from grepping to indexing, and how to evaluate retrieval results. Finally, building on the inverted index, the author describes the steps of Boolean retrieval and several key algorithms in detail.

introduction

what is IR

IR(Information Retrieval) is finding material of unstructured nature that satisfies an information need from within large collections.

The focus is on unstructured data, because most of the data we have today is unstructured. Unstructured data is data that a computer cannot use directly; speech, text, video, and pictures are all examples. Structured data, by contrast, can be loosely understood as what a relational database stores.

clustering and classification

Given a document collection, clustering automatically groups the documents based on their content, with no topics given in advance. Classification, by contrast, assigns documents to topics that are defined in advance.

three categories

The scale of data processed by information retrieval can be divided into large-scale (such as web search), small-scale (such as personal information retrieval), and medium-sized (such as search for enterprises, institutions and specific fields).

how to evaluate retrieval results

Precision: the fraction of the returned documents that are actually relevant to the information need.

Recall: the fraction of all documents relevant to the information need that the retrieval system returns.
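The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall` and the docID values are made up for the example, and returning 0.0 for an empty set is an assumed convention.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: docIDs returned by the system
    relevant:  docIDs that truly satisfy the information need
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 documents are truly relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)  # precision is 2/4, recall is 2/3
```

Here two of the four returned documents are relevant, and two of the three relevant documents were found.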

grepping

This approach is much like a manual search: start at the beginning and read through all the text to find what you want. A computer is faster and makes fewer errors than a person, but a linear scan is still not enough in the era of big data.
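A grep-style search can be sketched as below. The function name `grep_search`, the toy documents, and the simple regex tokenizer are all assumptions for illustration. The point is that every query re-reads every document, so the cost grows with the size of the whole collection.

```python
import re


def grep_search(docs, word):
    """Return the IDs of documents containing `word`, by scanning all text."""
    word = word.lower()
    matches = []
    for doc_id, text in docs.items():
        # Each query re-tokenizes the full text of each document.
        if word in re.findall(r"\w+", text.lower()):
            matches.append(doc_id)
    return matches


# Hypothetical toy collection.
docs = {
    1: "I did enact Julius Caesar",
    2: "The noble Brutus hath told you",
}
print(grep_search(docs, "Brutus"))  # [2]
```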

So a new idea emerged.

Indexing

An index is built in advance. Its simplest form is a term-document incidence matrix made up of Boolean values.

Each row represents a term and each column represents a document. If document A contains term B, the entry at row B, column A is 1; otherwise it is 0.

The obvious drawback is that the matrix wastes a great deal of storage space, because it is extremely sparse: almost all of its entries are 0.
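As a rough sketch, a term-document incidence matrix and a Boolean AND query over it might look like this. The three documents and the names `matrix` and `boolean_and` are hypothetical, and lowercasing is the only normalization applied.

```python
import re

# Hypothetical toy collection.
docs = {
    "A": "Brutus killed Caesar",
    "B": "Caesar was ambitious",
    "C": "Brutus was noble",
}

doc_ids = sorted(docs)
tokens = {d: set(re.findall(r"\w+", docs[d].lower())) for d in doc_ids}
vocab = sorted(set().union(*tokens.values()))

# matrix[term] is one row: a 0/1 entry per document, in doc_ids order.
matrix = {t: [int(t in tokens[d]) for d in doc_ids] for t in vocab}


def boolean_and(term1, term2):
    """Documents containing both terms: element-wise AND of two rows."""
    row = [a & b for a, b in zip(matrix[term1], matrix[term2])]
    return [d for d, bit in zip(doc_ids, row) if bit]


print(boolean_and("brutus", "caesar"))  # ['A']

# Even in this tiny example half of the cells are 0.
zeros = sum(row.count(0) for row in matrix.values())
print(zeros, "of", len(vocab) * len(doc_ids), "cells are 0")
```

With a realistic vocabulary and collection, the share of zero cells becomes far larger, which is the storage problem described above.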

To solve this problem, we record only the things that do occur, that is, only the 1 entries.

Inverted index

Building an inverted index comprises four steps:

Collect the documents to be indexed
Tokenize the text
Do linguistic preprocessing, producing normalized tokens as the indexing terms
Build the index

Doc1:
I did enact Julius Caesar. I was killed i’ the Capitol; Brutus killed me.

Doc2:
So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Taking the two documents above as an example, the process in detail is:

The two documents are each assigned their own docID.
Every token is paired with the docID of its document, and the pairs are sorted alphabetically by term, which gives the table on the left.
Entries with the same term are merged.
Terms and docIDs are then separated, which gives the table on the right.
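The steps above can be sketched in Python for Doc1 and Doc2. This is a minimal illustration, not the book's implementation: the function name `build_inverted_index` is made up, and the "linguistic preprocessing" is reduced to lowercasing plus a simple regex tokenizer.

```python
import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar. I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}


def build_inverted_index(docs):
    # Pair every token with its docID, then sort the pairs by term and docID.
    pairs = sorted(
        (token, doc_id)
        for doc_id, text in docs.items()
        for token in re.findall(r"[a-z']+", text.lower())
    )
    # Merge duplicate pairs and collect the docIDs of each term in one list.
    index = defaultdict(list)
    for term, doc_id in pairs:
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(docs)
print(index["brutus"])  # [1, 2]
print(index["killed"])  # [1]
```

Because the pairs are sorted before grouping, every postings list comes out sorted by docID, which the intersection algorithm relies on.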

Note:
Terms are stored in a dictionary. Each term has a pointer to its postings list.
The dictionary also stores some summary information, such as the document frequency of each term.
Each postings list records which documents the term appears in, plus optional information such as the term frequency and the positions of the term in each document.
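One possible in-memory layout for this structure is sketched below. The field names `df` and `postings`, the function name, and the toy documents are assumptions for illustration only.

```python
import re
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {"df": ..., "postings": {doc_id: [positions]}}."""
    postings = defaultdict(dict)
    for doc_id, text in sorted(docs.items()):
        for pos, token in enumerate(re.findall(r"[a-z']+", text.lower())):
            postings[token].setdefault(doc_id, []).append(pos)
    return {
        # Document frequency = number of documents containing the term.
        term: {"df": len(by_doc), "postings": by_doc}
        for term, by_doc in sorted(postings.items())
    }


# Hypothetical toy collection.
docs = {1: "Brutus killed Caesar", 2: "Caesar praised Caesar"}
index = build_positional_index(docs)

entry = index["caesar"]
print(entry["df"])                # 2
print(entry["postings"])          # {1: [2], 2: [0, 2]}
print(len(entry["postings"][2]))  # term frequency in doc 2: 2
```

The term frequency is not stored separately here; it is just the length of the position list for that document.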

Storage overhead in inverted index

We need a storage scheme that keeps queries efficient while using little space.

Thus a hybrid scheme that combines singly linked lists and variable-length arrays is used.

Usually, singly linked lists are used for postings held in memory, while postings stored on disk are written as contiguous inverted records.
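One way to read the hybrid scheme is as a linked list whose nodes are small arrays. The sketch below follows that reading; the class names `Block` and `PostingsList` and the block size of 4 are assumptions, and the original post may have intended a different layout.

```python
class Block:
    """One node of the linked list: a small array of docIDs plus a next pointer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.doc_ids = []
        self.next = None


class PostingsList:
    """Postings stored as a singly linked list of fixed-capacity blocks."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.head = self.tail = Block(block_size)

    def append(self, doc_id):
        # Start a new block only when the last one is full.
        if len(self.tail.doc_ids) == self.tail.capacity:
            new_block = Block(self.block_size)
            self.tail.next = new_block
            self.tail = new_block
        self.tail.doc_ids.append(doc_id)

    def __iter__(self):
        block = self.head
        while block is not None:
            yield from block.doc_ids
            block = block.next

    def num_blocks(self):
        count, block = 0, self.head
        while block is not None:
            count, block = count + 1, block.next
        return count


plist = PostingsList(block_size=4)
for doc_id in range(1, 11):
    plist.append(doc_id)
print(list(plist))         # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(plist.num_blocks())  # 3
```

Compared with one node per posting, this layout needs one pointer per block, and appending a docID never requires copying an existing array.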

[Figure: storage schematic]

However, I don't yet know exactly how the inverted files are generated. I'll share that later.

Database Index and Inverted Index

First, let's look at how an index works in a database.

As you know, a database index typically uses a B-tree structure (B+ trees are not discussed here).

Search process:

In a database, indexes and data are stored separately, and the address of a record is found through the index.

In an inverted index, the term index locates a term's position in the term dictionary. The dictionary entry leads to the postings list, and the postings list gives the IDs of the matching documents.

In other words, the term index is the equivalent of an index file, and the term dictionary is the equivalent of a data file.
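The lookup chain from term index to term dictionary to postings list can be sketched as follows. This is a simplified illustration: the sparse sampling with `STRIDE = 2`, the list-based dictionary, and the data values are all assumptions, not a description of how any particular search engine stores its files.

```python
from bisect import bisect_right

# Term dictionary: all terms in sorted order, each with its postings list
# (hypothetical data).
term_dictionary = [
    ("ambitious", [2]),
    ("brutus", [1, 2]),
    ("caesar", [1, 2]),
    ("capitol", [1]),
    ("killed", [1]),
    ("noble", [2]),
]

STRIDE = 2  # assumption: the term index samples every 2nd dictionary entry

# Term index: a small sorted sample of (term, position in the dictionary).
term_index = [
    (term_dictionary[i][0], i) for i in range(0, len(term_dictionary), STRIDE)
]
index_keys = [term for term, _ in term_index]


def lookup(term):
    """Return the postings list for `term`, or [] if the term is absent."""
    # 1. Term index: find the last sampled term that is <= the query term.
    slot = bisect_right(index_keys, term) - 1
    if slot < 0:
        return []
    # 2. Term dictionary: scan the short run of entries starting there.
    start = term_index[slot][1]
    for entry, postings in term_dictionary[start:start + STRIDE]:
        if entry == term:
            # 3. Postings list: the docIDs of the matching documents.
            return postings
    return []


print(lookup("capitol"))  # [1]
print(lookup("brutus"))   # [1, 2]
print(lookup("zebra"))    # []
```

The term index stays small enough to search quickly, while the full dictionary only needs to be read in short runs.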

crucial algorithms

1. intersect the postings lists

This function merges the postings lists for a query that combines several terms with AND. It takes a list of postings lists p, uses len to find out how many there are, and walks through them by index. Each postings list is intersected, element by element, with the result of the previous merges, and the final result list is returned. Every postings list must already be sorted by docID.

Python implementation:

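# The code that generates the data is omitted.
def Intersect(p):
    """Intersect several postings lists, each sorted by docID."""
    if not p:
        return []
    r = p[0]  # running result of the merges so far
    for i in range(1, len(p)):
        j, k = 0, 0  # j walks p[i], k walks r
        r0 = []
        while j < len(p[i]) and k < len(r):
            if p[i][j] == r[k]:
                # docID appears in both lists: keep it, advance both pointers
                r0.append(r[k])
                j, k = j + 1, k + 1
            elif p[i][j] > r[k]:
                k = k + 1  # r is behind, advance it
            else:
                j = j + 1  # p[i] is behind, advance it
        r = r0
    return r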

2. sort the postings lists

Python implementation:

def sort(p):
    """Return the postings lists ordered from shortest to longest."""
    # Intersecting the shortest lists first keeps intermediate results small.
    return sorted(p, key=len)
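The two algorithms are meant to be used together: order the postings lists by length, then intersect them from shortest to longest. The sketch below is self-contained, so it uses its own hypothetical helper names (`intersect_two`, `intersect_all`) rather than the functions above, and the docID values are made up.

```python
def intersect_two(a, b):
    """Merge-intersect two postings lists that are sorted by docID."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i, j = i + 1, j + 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def intersect_all(postings_lists):
    """Intersect any number of postings lists, shortest first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    result = list(ordered[0])
    for plist in ordered[1:]:
        if not result:
            break  # nothing left to match, stop early
        result = intersect_two(result, plist)
    return result


# Hypothetical postings lists for a three-term AND query.
postings = [[1, 2, 3, 4, 5, 8], [2, 8], [1, 2, 4, 8, 16]]
print(intersect_all(postings))  # [2, 8]
```

Starting from the shortest list means the intermediate result can never be longer than that list, so later merges have less work to do.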