Abstract
In the previous two chapters, the authors developed the ideas underlying inverted indexes for handling Boolean and proximity queries. However, we still need a better approach to improve robustness to typographical errors. Accordingly, this chapter introduces data structures that support searching for terms in the vocabulary of an inverted index. It then presents the idea of wildcard queries. Section 3.3 turns to other forms of imprecisely posed queries, focusing on spelling errors. Finally, we study a method for finding vocabulary terms that are phonetically close to the query terms.
Introduction
Hash Table & Search Tree
- Hash Table
- Hash function: a function that maps a keyword in the lookup table to the address corresponding to that keyword.
Hash(KEY) = ADDR
- Hash table: a data structure accessed directly by keyword; that is, the hash table establishes a direct mapping between a keyword and its storage address.
- Implementing a hash table in Python
- If the slot is None, insert.
- If there is a collision:
- If key1 == key2 and key1 already exists: replace key1's data with the new data.
- If key1 != key2: use linear probing. Look forward for a position until an empty slot is found, then insert there.
class HashTable:
    def __init__(self):
        self.size = 11
        self.slots = [None] * self.size   # holds the keys
        self.data = [None] * self.size    # holds the data values

    def hashfunction(self, key, size):
        return key % size

    def rehash(self, oldhash, size):
        # linear probing: move to the next slot
        return (oldhash + 1) % size

    def put(self, key, data):
        hashvalue = self.hashfunction(key, len(self.slots))
        if self.slots[hashvalue] is None:      # slot is empty: store directly
            self.slots[hashvalue] = key
            self.data[hashvalue] = data
        elif self.slots[hashvalue] == key:     # same key already stored: replace data
            self.data[hashvalue] = data
        else:                                  # collision: probe for the next free slot
            nextslot = self.rehash(hashvalue, len(self.slots))
            while self.slots[nextslot] is not None and self.slots[nextslot] != key:
                nextslot = self.rehash(nextslot, len(self.slots))
            if self.slots[nextslot] is None:   # found an empty slot
                self.slots[nextslot] = key
                self.data[nextslot] = data
            else:                              # found the same key while probing
                self.data[nextslot] = data

    def get(self, key):
        startslot = self.hashfunction(key, len(self.slots))
        data = None
        stop = False
        found = False
        position = startslot
        while self.slots[position] is not None and not found and not stop:
            if self.slots[position] == key:
                found = True
                data = self.data[position]     # fixed typo: was "postion"
            else:
                position = self.rehash(position, len(self.slots))
                if position == startslot:      # probed the whole table
                    stop = True
        return data

    def __getitem__(self, key):
        return self.get(key)

    def __setitem__(self, key, data):
        self.put(key, data)
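As a quick standalone check of the linear-probing behaviour described above, here is a minimal sketch that uses the same hash function (key % 11); the keys 77 and 44 are chosen only because they collide at slot 0:

```python
# Minimal open-addressing demo with the same hash function (key % 11).
size = 11
slots = [None] * size

def put(key):
    i = key % size
    while slots[i] is not None and slots[i] != key:
        i = (i + 1) % size          # linear probing: step to the next slot
    slots[i] = key

put(77)   # 77 % 11 == 0 -> stored in slot 0
put(44)   # 44 % 11 == 0 -> collision with 77, probed forward to slot 1
```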
Advantages
- The hash table's main advantage is its high query efficiency: O(1).
Shortcomings
- hash collisions
- does not support proximity (prefix) queries
- the hash function may need to be redesigned again and again as requirements change
To overcome these issues, search trees make a big contribution.
- Search Tree
- binary search tree
Let x be a node in a binary search tree holding the keyword key. The left child's key is smaller than the parent's and the right child's key is larger; a useful consequence is that an in-order traversal visits the nodes in sorted order.
Attention:
- If the left subtree of any node is non-empty, the values of all nodes in the left subtree are less than the value of its root node;
- If the right subtree of any node is non-empty, the values of all nodes in the right subtree are greater than the value of its root node;
- The left and right subtrees of any node are themselves binary search trees;
- There are no nodes with equal key values.
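The properties above can be sketched as a minimal binary search tree; the sample keys are made up for illustration, and the point is that an in-order traversal returns them sorted:

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    # smaller keys go left, larger go right; equal keys are not inserted
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def inorder(root):
    # in-order traversal visits the keys in ascending order
    return inorder(root.left) + [root.key] + inorder(root.right) if root else []

root = None
for k in ["m", "d", "t", "a", "h"]:
    root = insert(root, k)
```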
- B-Tree
A B-tree is a data structure designed for secondary storage and is commonly used in databases and file systems.
Structurally, every internal node of a B-tree has a number of children in the interval [a, b], where a and b are appropriate positive integers. This is especially advantageous when some of the dictionary is disk-resident, in which case this collapsing serves the function of prefetching imminent binary tests.
All in all, the conclusions are as follows:
- advantages of search trees
- supports proximity queries
- shortcomings of search trees
- lower query efficiency than a hash table (O(log n) rather than O(1))
- extra rebalancing work is needed to keep the tree balanced.
Tail wildcard query
input: ab*
A query where the wildcard appears at the end is a trailing wildcard query. It can be answered with a search tree by walking down the tree in the order a, b:
- Compare at the root node: because a is in a-m, go left.
- Because ab is between a-hu, go left.
- The remaining subtree contains exactly the matching terms; traverse it and collect their postings.
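The same subtree walk can be sketched with a sorted vocabulary list standing in for the tree (the contiguous prefix range corresponds to the matching subtree); the vocabulary here is made up for illustration:

```python
from bisect import bisect_left, bisect_right

vocab = sorted(["abandon", "about", "above", "bat", "cat"])

def prefix_query(prefix):
    # all terms in [prefix, prefix + '\uffff'): the contiguous block of
    # terms starting with the prefix, i.e. the matching subtree
    lo = bisect_left(vocab, prefix)
    hi = bisect_right(vocab, prefix + "\uffff")
    return vocab[lo:hi]
```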
Head wildcard query
To handle a leading wildcard query, we need to introduce the reverse B-tree.
- A reverse B-tree is a B-tree built over the terms written backwards, so the search order is reversed.
The specific process is as follows:
input: *cba
The system searches in the order "a, b, c".
process:
- Compare at the root node: because a is in [a, m], go left.
- Because ba is in [aa, uh], go left.
The subtree below then contains exactly the terms meeting the conditions.
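A minimal sketch of the idea, with a sorted list of reversed terms standing in for the reverse B-tree (the vocabulary is made up):

```python
vocab = ["abc", "bba", "cba", "dcba"]
# "reverse B-tree": index the terms written backwards
rev = sorted(t[::-1] for t in vocab)

def suffix_query(suffix):
    # *cba -> search the reversed terms for the prefix "abc"
    p = suffix[::-1]
    return sorted(t[::-1] for t in rev if t.startswith(p))
```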
General wildcard query
For example, for abc*cba, split it into abc* and *cba and apply the two techniques above, respectively. Note, however, that the intersected results must then be post-filtered against abc*cba: for example, abcba matches both abc* and *cba, but not abc*cba.
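The split-and-post-filter step can be sketched as follows; fnmatch stands in for the final pattern check, and the small vocabulary is made up so that abcba survives the candidate step but fails the post-filter:

```python
import fnmatch

vocab = ["abcba", "abcdcba", "abccba"]

def wildcard(pattern, vocab):
    # split abc*cba into a prefix query and a suffix query, intersect them
    prefix, suffix = pattern.split("*")
    candidates = [t for t in vocab if t.startswith(prefix) and t.endswith(suffix)]
    # post-filter: abcba matches abc* and *cba but not the full pattern
    return [t for t in candidates if fnmatch.fnmatchcase(t, pattern)]
```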
Permuterm index
- Method:
$ marks the end of a word (the canonical form): ab is stored as ab$, and all of its rotations are generated (ab$, b$a, $ab), each pointing back to ab.
- When handling a single wildcard query
To search for *b, add "$" to get *b$, then rotate so the wildcard sits at the end: b$*. Now search the tree for entries beginning with b$. If b$a is found, the term ab meets the requirement.
- When processing multiple wildcard queries
To query a*b*, add "$" to get a*b*$. Rotate so a wildcard sits at the end, query the entries beginning with $a, and then post-filter the results against a*b*.
This approach has an unavoidable defect: the dictionary becomes very large, since every rotation of every term is stored.
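The rotation scheme above can be sketched with a plain dictionary standing in for the tree (a minimal sketch over a made-up two-term vocabulary):

```python
def rotations(term):
    # ab -> ab$, b$a, $ab: every rotation points back to the original term
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

index = {}
for term in ["ab", "abc"]:
    for r in rotations(term):
        index.setdefault(r, set()).add(term)

def permuterm_query(rotated_prefix):
    # *b -> *b$ -> rotated to b$* -> find rotations starting with "b$"
    return {t for r, ts in index.items() if r.startswith(rotated_prefix) for t in ts}
```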
K-GRAM Index
Definition: a k-gram is a sequence of k consecutive letters.
input: grammar
3-grams: {gra, ram, amm, mma, mar}
index structure
The dictionary of the k-gram index is the set of k-grams of all words.
The postings list of a k-gram is the list of words containing that k-gram.
Indexing rule
A "$" must be added at the beginning and end of each word before building the index;
Query method
input: com*
method: use 3-gram
- Add "$": the pattern becomes "$com*$".
- Take its 3-grams: {"$co", "com", "om*", "m*$"}; only the grams without "*" ("$co" and "com") are looked up.
- Search for terms matching all of those grams.
- The results are not necessarily accurate, so they are post-filtered against "com*".
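The steps above can be sketched as follows (the small vocabulary is made up for illustration):

```python
import fnmatch
from collections import defaultdict

def kgrams(term, k=3):
    # k-grams of a term, with "$" marking the boundaries
    t = "$" + term + "$"
    return [t[i:i + k] for i in range(len(t) - k + 1)]

vocab = ["computer", "company", "dog"]
index = defaultdict(set)
for term in vocab:
    for g in kgrams(term):
        index[g].add(term)

def wildcard_query(pattern):
    # k-grams of "$com*$" are $co, com, om*, m*$;
    # only those without "*" can be looked up in the index
    p = "$" + pattern + "$"
    grams = [p[i:i + 3] for i in range(len(p) - 2) if "*" not in p[i:i + 3]]
    candidates = set.intersection(*(index[g] for g in grams))
    # post-filter, since the intersection is not guaranteed to be exact
    return sorted(t for t in candidates if fnmatch.fnmatchcase(t, pattern))
```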
- summary
The k-gram index is slow at query time, because each query must first pass through the k-gram index (the original single-level index becomes a two-level one) and be post-filtered before the normal inverted index can be consulted for docIDs.
The permuterm index needs no post-filtering, but its space consumption is very large.
Edit distance
Edit distance is computed by dynamic programming: the two words index the rows and columns of a two-dimensional matrix.
- algorithm
def minEditDist(sm, sn):
    # dynamic-programming edit distance between strings sm and sn
    m, n = len(sm) + 1, len(sn) + 1
    matrix = [[0] * n for i in range(m)]
    for i in range(1, m):              # cost of deleting i characters
        matrix[i][0] = matrix[i - 1][0] + 1
    for j in range(1, n):              # cost of inserting j characters
        matrix[0][j] = matrix[0][j - 1] + 1
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if sm[i - 1] == sn[j - 1] else 1
            matrix[i][j] = min(matrix[i - 1][j] + 1,         # deletion
                               matrix[i][j - 1] + 1,         # insertion
                               matrix[i - 1][j - 1] + cost)  # substitution
    return matrix[m - 1][n - 1]
If we want to calculate the similarity of two strings, the similarity formula is:
1 - distance / max(len(str1), len(str2))
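The formula can be sketched as a self-contained snippet (it includes a compact rolling-array edit distance so it runs on its own; this is equivalent to the full-matrix version above, just keeping one row at a time):

```python
def edit_distance(a, b):
    # classic DP, keeping only one row of the matrix at a time
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def similarity(a, b):
    # 1 - distance / max(length), as in the formula above
    return 1 - edit_distance(a, b) / max(len(a), len(b))
```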
Note: edit distance has a drawback. Computing the edit distance between the query and every term is far too inefficient, because the dictionary of an inverted index can contain tens of millions of terms;
Calculate Jaccard via the k-gram index
Jaccard coefficient: given sets A and B,
J(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)
Suppose A and B are two words of lengths m and n respectively; then A and B have m - k + 1 and n - k + 1 k-grams respectively (without the boundary "$").
Given a query Q, after computing the k-grams of Q, we can traverse the k-gram index and compute the Jaccard coefficient between each candidate word and Q.
|A ∩ B| can be understood as the number of overlapping k-grams, and |A| + |B| - |A ∩ B| as the total number of distinct k-grams (the sum of the two words' k-gram counts minus the overlap). We keep the words whose Jaccard coefficient is above a threshold.
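A minimal sketch of the coefficient over k-gram sets (with the boundary "$" added, per the indexing rule above):

```python
def kgram_set(word, k=3):
    # set of k-grams of "$" + word + "$"
    w = "$" + word + "$"
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def jaccard(a, b, k=3):
    # |A ∩ B| / (|A| + |B| - |A ∩ B|)
    ga, gb = kgram_set(a, k), kgram_set(b, k)
    inter = len(ga & gb)
    return inter / (len(ga) + len(gb) - inter)
```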
Summary: you can use the k-gram index to shortlist candidates before computing the edit distance.
Nevertheless, the above applies only to spelling errors in individual words. A query like "where are your home?" cannot be caught this way, because each individual word is spelled correctly.
- Improved detection method:
When a query returns very few results, the query phrase itself is suspect. Each word is then replaced in turn with nearby alternatives, looking for the variant phrase that returns many results.
summary of spelling correction
We just need to remember one main principle:
Prefer the candidates closest to the query (smallest edit distance), and when the distance is the same, prefer the more common words.
phonetic correction
- soundex Algorithm
Soundex is a phonetic algorithm that uses the pronunciation of an English word to compute an approximate four-character code. The first character is a letter and the last three are digits.
simple example:
suppose:
- a e h i o u w y -> 0
- b f p v -> 1
- c g j k q s x z -> 2
- d t -> 3
- l -> 4
- m n -> 5
- r -> 6
- Replace everything from the 2nd character onward using the table above.
- If duplicate digits appear in a run, keep a single one.
- Delete the 0s.
- Return the first 4 characters, padding with 0 if necessary.
For example:
quality & quantity
quality -> Q004030 -> Q43 -> Q430
quantity -> Q0053030 -> Q533
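The steps above can be sketched as follows (a simplified Soundex variant that follows the mapping table and reproduces the two worked examples):

```python
def soundex(word):
    # build the letter-to-digit table from the mapping above
    table = {}
    for letters, digit in [("aehiouwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
                           ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            table[ch] = digit
    word = word.lower()
    # keep the first letter, map the rest to digits
    code = word[0].upper() + "".join(table[c] for c in word[1:])
    # collapse runs of duplicate digits, then drop the 0s
    collapsed = [code[0]]
    for c in code[1:]:
        if c != collapsed[-1]:
            collapsed.append(c)
    kept = "".join(c for c in collapsed if c != "0")
    # first four characters, padded with 0
    return (kept + "000")[:4]
```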
These are my own notes and summary, drawing on Manning's original book and the CSDN blogger:
- iteye_17686: (link)
Please forgive any mistakes in my English.
Good luck with your studies~~