Notes on Chapter 1 of Introduction to Information Retrieval (in English)

Abstract

The first chapter of the book introduces the core concepts and background of information retrieval: what IR is, the three categories of systems by data scale, the move from grepping to indexing, and how retrieval results are evaluated. It then uses the inverted index to describe the steps of Boolean retrieval and several key algorithms in detail.

introduction

what is IR

IR (Information Retrieval) is finding material of an unstructured nature that satisfies an information need from within large collections.


Unstructured data is the focus because most of the data we have today is unstructured, meaning it has no clear structure that a computer can use directly. Structured data, by contrast, can be understood simply as what relational databases store. Speech, text, video, and pictures are all unstructured data.

clustering and classification

Given a document collection, clustering is the task of automatically grouping documents based on their content, with no topics given in advance. Classification, in contrast, defines the topics in advance and assigns each document to one of them.

three categories

The scale of data handled by information retrieval systems falls into three categories: large-scale (such as web search), small-scale (such as personal information retrieval), and medium-scale (such as enterprise, institutional, and domain-specific search).

how to evaluate retrieval results

Precision: the fraction of the returned documents that are relevant to the information need.

Recall: the fraction of all documents relevant to the information need that the retrieval system returns.
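
The two measures can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall` and the sample document IDs are made up for this note, not taken from the book.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: doc IDs returned by the system
    relevant:  doc IDs judged relevant to the information need
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall([1, 2, 3, 4], [2, 3, 4, 7, 8, 9])
print(p, r)  # 0.75 0.5
```

Returning more documents tends to raise recall but can lower precision, which is why the two are reported together.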

grepping

This approach is much like a manual search: start at the beginning and read through all the text to find what you want. A computer does this faster and makes fewer errors, but a linear scan is still not enough at big-data scale.
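
A rough sketch of the grepping idea, assuming two short sample documents and a hypothetical helper `grep_docs`. It uses plain substring matching, so it is cruder than a real grep:

```python
docs = {
    1: "I did enact Julius Caesar. I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}


def grep_docs(docs, word):
    """Return the IDs of documents whose text contains `word` (case-insensitive)."""
    word = word.lower()
    # Every query re-reads every document, so the cost grows with the collection.
    return [doc_id for doc_id, text in docs.items() if word in text.lower()]


print(grep_docs(docs, "capitol"))  # [1]
print(grep_docs(docs, "caesar"))   # [1, 2]
```

Nothing is prepared in advance here, so each new query pays the full scanning cost again.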

So, a new idea appears.

Indexing

The index is built in advance, which gives a term-document incidence matrix made up of Boolean values.

Each row represents a term and each column represents a document. If document A contains term B, the cell at row B, column A is 1; otherwise it is 0.
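
As a small sketch, the matrix can be built for two sample documents and queried by AND-ing rows. The names `matrix` and `boolean_and` are hypothetical, and tokenization is simplified to lowercasing and splitting on spaces:

```python
docs = {
    "Doc1": "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    "Doc2": "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

doc_ids = sorted(docs)
tokens = {d: set(text.lower().split()) for d, text in docs.items()}
terms = sorted(set().union(*tokens.values()))

# matrix[term] is one row: a list of 0/1 values, one per document column.
matrix = {t: [1 if t in tokens[d] else 0 for d in doc_ids] for t in terms}


def boolean_and(term_a, term_b):
    """Answer `term_a AND term_b` by AND-ing the two rows element-wise."""
    row = [a & b for a, b in zip(matrix[term_a], matrix[term_b])]
    return [d for d, bit in zip(doc_ids, row) if bit]


print(matrix["brutus"])                  # [1, 1]
print(boolean_and("brutus", "capitol"))  # ['Doc1']
```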

The obvious drawback is that the matrix takes up too much storage: it is extremely sparse, so almost all of the space is spent on zeros.
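
A back-of-the-envelope calculation shows how sparse the matrix gets. The collection sizes below are assumptions chosen purely for illustration:

```python
# Assumed sizes, for illustration only.
num_docs = 1_000_000      # documents in the collection
tokens_per_doc = 1_000    # average tokens per document
vocab_size = 500_000      # distinct terms

cells = vocab_size * num_docs         # total cells in the matrix
max_ones = num_docs * tokens_per_doc  # upper bound on the number of 1 entries
zero_fraction = 1 - max_ones / cells

print(cells)                    # 500000000000
print(f"{zero_fraction:.1%}")   # 99.8%
```

Even in the worst case, where every token in a document is a distinct term, at least 99.8% of the cells are 0 under these assumptions.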

To solve this problem, we record only the things that do occur, that is, the positions holding a 1.

Inverted index

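Building an inverted index comprises four steps:

  1. Collect the documents to be indexed
  2. Tokenize the text, turning each document into a list of tokens
  3. Do linguistic preprocessing, producing normalized tokens that serve as the indexing terms
  4. Build the index, consisting of a dictionary and postings lists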

  • Doc1:
    I did enact Julius Caesar. I was killed i’ the Capitol; Brutus killed me.
  • Doc2:
    So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Following the figure for this example, let us walk through the process in detail.

  1. Each of the two documents is assigned its own docID.
  2. All tokens, each paired with its document ID, are sorted alphabetically by term; this gives the table on the left.
  3. Multiple occurrences of the same term in the same document are merged.
  4. Terms and document IDs are separated into a dictionary and postings lists; this gives the table on the right.
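
The steps above can be sketched as follows. This is a minimal version, assuming a simplified tokenizer (lowercase, letters only) and hypothetical function names `tokenize` and `build_inverted_index`:

```python
import re

docs = {
    1: "I did enact Julius Caesar. I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(docs):
    # Steps 1-2: emit (term, docID) pairs and sort them by term, then by docID.
    pairs = sorted(
        (term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)
    )
    # Steps 3-4: merge duplicate pairs and group the docIDs under each term.
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index


index = build_inverted_index(docs)
print(index["brutus"])  # [1, 2]
print(index["killed"])  # [1]
```

Because the pairs are sorted before grouping, every postings list comes out in ascending docID order, which the intersection algorithm relies on.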

Attention:
Terms are stored in a dictionary. Each term has a pointer to its postings list.
The dictionary also stores some summary information, such as the document frequency.
The postings list stores which documents the term appears in, plus other information such as the term frequency and the positions of the term in each document.
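
One possible in-memory layout for these pieces is sketched below. The field names `df` and `postings` are assumptions made for this note; a real system would use a more compact layout.

```python
import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar. I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

# dictionary[term] -> {"df": document frequency, "postings": {docID: [positions]}}
dictionary = defaultdict(lambda: {"df": 0, "postings": {}})

for doc_id, text in docs.items():
    for position, term in enumerate(re.findall(r"[a-z]+", text.lower())):
        entry = dictionary[term]
        if doc_id not in entry["postings"]:
            entry["postings"][doc_id] = []
            entry["df"] += 1  # first time this term is seen in this document
        entry["postings"][doc_id].append(position)

caesar = dictionary["caesar"]
print(caesar["df"])                # 2, the document frequency
print(len(caesar["postings"][2]))  # 2, the term frequency in Doc2
print(caesar["postings"][2])       # [5, 12], token positions in Doc2
```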

Storage overhead in inverted index

We need a scheme that ensures query efficiency while using less storage space.

Thus a hybrid scheme is used: a linked list of fixed-length arrays for each term.
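
One way to read the hybrid scheme is sketched below, assuming each linked-list node holds a small fixed-capacity array of docIDs. The class names `Block` and `PostingsList` and the block size are hypothetical:

```python
from array import array


class Block:
    """One node of the linked list: a small fixed-capacity array of docIDs."""

    def __init__(self, capacity):
        self.ids = array("I")  # compact unsigned-int storage
        self.capacity = capacity
        self.next = None


class PostingsList:
    def __init__(self, block_size=4):
        self.block_size = block_size
        self.head = self.tail = Block(block_size)

    def append(self, doc_id):
        if len(self.tail.ids) == self.tail.capacity:
            # The current block is full: link a new one instead of reallocating.
            new_block = Block(self.block_size)
            self.tail.next = new_block
            self.tail = new_block
        self.tail.ids.append(doc_id)

    def __iter__(self):
        block = self.head
        while block is not None:
            yield from block.ids
            block = block.next


postings = PostingsList(block_size=2)
for doc_id in [1, 2, 4, 11, 31]:
    postings.append(doc_id)
print(list(postings))  # [1, 2, 4, 11, 31]
```

The arrays avoid storing one pointer per posting, while the links between blocks keep appends cheap.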

Usually, the dictionary is kept in memory, while the postings lists (the inverted records) are stored on disk.

[Figure: storage schematic]

However, I don't yet fully understand how the actual inverted files are generated. I'll share that later.

Database Index and Inverted Index

First, let's look at the principle behind indexes in a database.

As you know, databases use a B-tree structure (we won't discuss the B+ tree here).

search process:

In a database, indexes and data are stored separately, and the address of a record can be found through the index.

In an inverted index, the term index locates a term's position in the term dictionary; the term dictionary then leads to the term's postings list, from which the documents are found by their IDs.

In other words, the term index is the equivalent of an index file, and the term dictionary is the equivalent of a data file.
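
The lookup chain can be illustrated with a toy example. Here a binary search over a sorted term list stands in for the term index, which is an assumption made for brevity; the names `term_dictionary`, `postings_lists`, and `lookup` are hypothetical.

```python
from bisect import bisect_left

# Sorted term dictionary and, in parallel, the postings list of each term.
term_dictionary = ["ambitious", "brutus", "caesar", "capitol", "killed"]
postings_lists = [[2], [1, 2], [1, 2], [1], [1]]


def lookup(term):
    """Binary-search the sorted dictionary, then follow it to the postings list."""
    i = bisect_left(term_dictionary, term)
    if i < len(term_dictionary) and term_dictionary[i] == term:
        return postings_lists[i]
    return []  # the term does not occur in the collection


print(lookup("caesar"))  # [1, 2]
print(lookup("rome"))    # []
```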

crucial algorithms

  1. intersect

This function is the merge step for postings lists when a query ANDs multiple terms. It gets the number of postings lists with len, visits each postings list by index in a loop, walks through the elements of that list, intersects it with the result merged so far, and finally returns the result list.

Python implementation:

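# The data generation code above is omitted
def Intersect(p):
    """Intersect a list of postings lists, each sorted by ascending docID."""
    if not p:
        return []
    r = p[0]  # running result, starts as the first postings list
    for i in range(1, len(p)):
        j, k = 0, 0  # j walks p[i], k walks the running result r
        r0 = []
        while j < len(p[i]) and k < len(r):
            if p[i][j] == r[k]:
                # docID is in both lists: keep it and advance both pointers
                r0.append(r[k])
                j, k = j + 1, k + 1
            elif p[i][j] > r[k]:
                k = k + 1  # r is behind: advance it
            else:
                j = j + 1  # p[i] is behind: advance it
        r = r0
    return r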

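  2. sort the postings lists

The postings lists are sorted by length so that the intersection can start from the shortest lists, which keeps the intermediate results small.

Python implementation: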

def sort(p):
    """Return the postings lists ordered by increasing length."""
    # key=len compares list lengths directly, so the input lists are not modified
    return sorted(p, key=len)
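
Putting the two algorithms together, a multi-term AND query might be processed as in the sketch below. The helper names `intersect_two` and `intersect_all` and the sample postings lists are made up for illustration:

```python
def intersect_two(a, b):
    """Intersect two docID lists that are sorted in ascending order."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def intersect_all(postings_lists):
    """AND together several postings lists, shortest first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:  # nothing can match any more, stop early
            break
        result = intersect_two(result, postings)
    return result


# Hypothetical postings lists for the query: brutus AND caesar AND calpurnia
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
calpurnia = [2, 31, 54, 101]
print(intersect_all([brutus, caesar, calpurnia]))  # [2]
```

Starting from the shortest list means the running result can never be longer than that list, so later merges have less to compare.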

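These are my own notes and summary, drawing on Manning's original book and on posts by two CSDN bloggers.

By the way, my English inevitably contains some mistakes, so please bear with me.
Let's keep learning together.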
