Abstract
The first chapter of the book introduces the core concepts and background of information retrieval (IR): what IR is, the three categories of IR systems by data scale, the move from grepping to indexing, and how retrieval results are evaluated. It then uses the inverted index to walk through the steps of Boolean retrieval and several key algorithms in detail.
introduction
what is IR
IR (Information Retrieval) is finding material of an unstructured nature that satisfies an information need from within large collections.
Unstructured data is the focus because most of the data we have today is unstructured, meaning it has no clear structure that a computer can use directly. Structured data, by contrast, can be thought of simply as what relational databases store. Speech, text, video, and pictures are all unstructured data.
clustering and classification
Given a document collection, clustering is the task of automatically grouping documents based on their content, with no topics given in advance. Classification, by contrast, assigns documents to topics that are defined in advance.
three categories
The scale of data handled by information retrieval systems falls into three categories: large-scale (such as web search), medium-scale (such as search for enterprises, institutions, and specific domains), and small-scale (such as personal information retrieval).
how to evaluate retrieval results
Precision: the percentage of returned documents that are actually relevant to the information need.
Recall: the percentage of all documents relevant to the information need that the retrieval system returns.
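As a small illustration of the two measures, the sketch below computes both from a set of returned document IDs and a set of truly relevant IDs. The function name `precision_recall` and the sample IDs are made up for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 7, 8, 9])
print(p, r)  # 0.75 0.5
```

Returning everything in the collection would push recall to 1.0 while precision collapses, which is why the two numbers are usually read together.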
grepping
This approach is much like a manual search: start at the beginning and read through all the text to find what you want. A computer is faster and makes fewer errors, but a linear scan is still not enough in the age of big data.
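To make the cost of this approach concrete, here is a minimal sketch of a linear scan. The function name `linear_scan` and the two tiny documents are assumptions for illustration only.

```python
def linear_scan(docs, word):
    """Return the IDs of documents containing `word`, by reading every document."""
    matches = []
    for doc_id, text in docs.items():
        # Every query re-reads the full text of every document.
        if word.lower() in text.lower().split():
            matches.append(doc_id)
    return matches


docs = {
    1: "I did enact Julius Caesar",
    2: "The noble Brutus hath told you Caesar was ambitious",
}
print(linear_scan(docs, "brutus"))  # [2]
```

The work per query grows with the total size of the collection, since nothing is precomputed.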
So, a new idea appears.
Indexing
The idea is to build an index in advance. The simplest form is a term-document incidence matrix made up of Boolean values.
Each row represents a term and each column represents a document. If doc A contains word B, the entry at row B, column A is 1; otherwise it is 0.
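The matrix can be sketched in a few lines. The three documents and the helper `boolean_and` below are invented for illustration; a Boolean AND query is answered by AND-ing two rows position by position.

```python
docs = {
    "A": "brutus killed caesar",
    "B": "caesar was ambitious",
    "C": "brutus was noble",
}

doc_ids = sorted(docs)
vocab = sorted({w for text in docs.values() for w in text.split()})

# One row per term, one column per document; 1 means "term occurs in doc".
matrix = {
    term: [1 if term in docs[d].split() else 0 for d in doc_ids]
    for term in vocab
}


def boolean_and(term1, term2):
    """Answer `term1 AND term2` by AND-ing the two rows position by position."""
    row = [a & b for a, b in zip(matrix[term1], matrix[term2])]
    return [d for d, bit in zip(doc_ids, row) if bit]


print(matrix["brutus"])              # [1, 0, 1]
print(boolean_and("brutus", "was"))  # ['C']
```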
The obvious drawback is that the matrix takes up too much storage space: because it is extremely sparse, almost all of that space is spent on zeros.
To solve this problem, we record only the things that do occur, that is, the 1 entries.
Inverted index
Building an inverted index comprises four steps:
- Collect the documents to be indexed
- Tokenize the text
- Do linguistic preprocessing, producing normalized tokens, which are the indexing terms
- Index the documents that each term occurs in
Take two example documents:
- Doc1: I did enact Julius Caesar. I was killed i’ the Capitol; Brutus killed me.
- Doc2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Based on the figure above, we can walk through the indexing process in detail.
- The two documents are each assigned a docID.
- All tokens, each paired with its docID, are sorted alphabetically, which gives the table on the left.
- Multiple occurrences of the same term in the same document are merged.
- Terms and docIDs are then separated, with the docIDs grouped under each term, which gives the table on the right.
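The steps above can be sketched roughly as follows, using the two example documents. The tokenizer (a lowercase regular expression) and the function name `build_inverted_index` are simplifying assumptions, not the book's preprocessing.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar. I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}


def build_inverted_index(docs):
    # Step 1: emit a (term, docID) pair for every token.
    pairs = [
        (token, doc_id)
        for doc_id, text in docs.items()
        for token in re.findall(r"[a-z']+", text.lower())
    ]
    # Step 2: sort by term, then by docID.
    pairs.sort()
    # Steps 3 and 4: merge duplicate pairs and group docIDs under each term.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        index[term] = sorted({doc_id for _, doc_id in group})
    return index


index = build_inverted_index(docs)
print(index["brutus"])  # [1, 2]
print(index["killed"])  # [1]
print(index["noble"])   # [2]
```

Note that "killed" occurs twice in Doc1 but appears only once in its postings list, which is the effect of the merge step.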
Note:
Terms are stored in a dictionary, and each term has a pointer to its postings list.
The dictionary also stores summary information, such as the document frequency.
The postings list stores which documents the term appears in, plus other information such as the term frequency and the positions of the term in each document.
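One possible in-memory layout for this is sketched below. It assumes a plain nested `dict` in place of real pointers, whitespace tokenization, and a hypothetical function name `build_positional_index`.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {"df": int, "postings": {doc_id: [positions]}}."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            postings[token].setdefault(doc_id, []).append(pos)
    # Document frequency is the number of documents in the postings list.
    return {
        term: {"df": len(plist), "postings": plist}
        for term, plist in postings.items()
    }


docs = {1: "brutus killed caesar", 2: "caesar praised caesar"}
index = build_positional_index(docs)
entry = index["caesar"]
print(entry["df"])                # 2
print(entry["postings"])          # {1: [2], 2: [0, 2]}
print(len(entry["postings"][2]))  # term frequency in doc 2: 2
```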
Storage overhead in inverted index
We need a scheme that gives both efficient queries and low storage overhead.
Thus a hybrid scheme emerges that combines singly linked lists with variable-length arrays.
Usually, the singly linked lists are kept in memory and the postings are stored on disk.
[Figure: storage schematic]
However, I don’t yet know exactly how the inverted files themselves are generated. I’ll share that later.
Database Index and Inverted Index
First, let's look at how an index works in a database.
As you know, databases use a B-tree structure (we won't discuss B+ trees here).
search process:
In a database, the index and the data are stored separately, and the address of a record is found through the index.
In an inverted index, the term index locates the term's position in the term dictionary; the term's postings list then gives the IDs of the documents to fetch.
In other words, the term index plays the role of the index file, and the term dictionary plays the role of the data file.
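The lookup path can be illustrated with a very rough sketch. Here a sorted Python list searched with `bisect` stands in for the ordered term dictionary; this is an assumption for illustration, not how a real engine or a B-tree is laid out, and the terms and postings are hypothetical.

```python
from bisect import bisect_left

# Hypothetical data: a sorted term dictionary and a parallel list of postings.
terms = ["ambitious", "brutus", "caesar", "noble"]
postings = [[2], [1, 2], [1, 2], [2]]


def lookup(term):
    """Binary-search the sorted dictionary, then follow it to the postings list."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []


print(lookup("brutus"))  # [1, 2]
print(lookup("cicero"))  # []
```

As with a database index, the ordered structure is what makes the lookup logarithmic rather than a scan over all terms.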
crucial algorithms
- intersect
This function merges the postings lists (inverted record tables) for an AND query over multiple terms. It gets the number of postings lists with the len function, loops over the lists by index, walks each list in step with the previous merged result while keeping the docIDs that appear in both, and finally returns the result list.
Python implementation:
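# (the code that generates the data is omitted)
def Intersect(p):
    # p is a non-empty list of postings lists, each sorted by docID
    r = p[0]
    for i in range(1, len(p)):
        j, k = 0, 0
        r0 = []
        # walk p[i] and the running result r in step
        while j < len(p[i]) and k < len(r):
            if p[i][j] == r[k]:
                # docID is in both lists: keep it and advance both pointers
                r0.append(r[k])
                j, k = j + 1, k + 1
            elif p[i][j] > r[k]:
                k = k + 1
            else:
                j = j + 1
        r = r0
    return r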
- sort the postings lists
Python implementation:
def sort(p):
    # order the postings lists by length, shortest first;
    # sorted() returns a new outer list and leaves each postings list unchanged
    return sorted(p, key=len)
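Sorting by length matters when several postings lists are intersected: starting from the shortest list keeps every intermediate result small. The sketch below combines the two ideas in a self-contained form; the names `intersect_two` and `intersect_all` and the sample postings are assumptions for illustration.

```python
def intersect_two(a, b):
    """Merge-intersect two sorted postings lists."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(postings_lists):
    """Intersect several lists, shortest first, stopping once the result is empty."""
    ordered = sorted(postings_lists, key=len)
    if not ordered:
        return []
    result = ordered[0]
    for plist in ordered[1:]:
        if not result:
            break
        result = intersect_two(result, plist)
    return result


# Hypothetical postings lists for a three-term AND query.
lists = [[1, 2, 4, 11, 31, 45, 173, 174], [2, 31], [1, 2, 4, 5, 6, 16, 57, 132]]
print(intersect_all(lists))  # [2]
```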
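These are my own notes and summary, drawing on Manning's original book and on posts by two CSDN bloggers.
By the way, my English inevitably contains some mistakes, so please bear with me.
Let's keep learning together~~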