Abstract
In the previous two chapters, the authors developed the ideas underlying inverted indexes for handling Boolean and proximity queries. However, we still need a better approach to improve robustness to typographical errors. Accordingly, this chapter introduces data structures that support searching for terms in the vocabulary of an inverted index. It then presents the idea of wildcard queries. Section 3.3 turns to other forms of imprecisely posed queries, focusing on spelling errors. Finally, we study a method for finding vocabulary terms that are phonetically close to the query terms.
Introduction
Hash Table & Search Tree
- Hash Table
- Hash function: a function that maps a keyword in the lookup table to the address corresponding to that keyword.
Hash(KEY) = ADDR
- Hash table: a data structure accessed directly by keyword; that is, the hash table establishes a direct mapping between a keyword and its storage address.
- Implementing a hash table in Python
- If the slot is None, insert.
- If there is a collision:
- If key1 == key2 and key1 already exists: replace key1's data with the new data.
- If key1 != key2: use linear probing. Look forward for a position until an empty slot is found, then insert there.
class HashTable:
    def __init__(self):
        self.size = 11
        self.slots = [None] * self.size   # holds the keys
        self.data = [None] * self.size    # holds the data values

    def hashfunction(self, key, size):
        return key % size

    def rehash(self, oldhash, size):
        # linear probing: move to the next slot
        return (oldhash + 1) % size

    def put(self, key, data):
        hashvalue = self.hashfunction(key, len(self.slots))
        if self.slots[hashvalue] is None:      # slot is empty: store directly
            self.slots[hashvalue] = key
            self.data[hashvalue] = data
        elif self.slots[hashvalue] == key:     # same key already stored: replace data
            self.data[hashvalue] = data
        else:                                  # collision: probe for the next free slot
            nextslot = self.rehash(hashvalue, len(self.slots))
            while self.slots[nextslot] is not None and self.slots[nextslot] != key:
                nextslot = self.rehash(nextslot, len(self.slots))
            if self.slots[nextslot] is None:   # found an empty slot
                self.slots[nextslot] = key
                self.data[nextslot] = data
            else:                              # found the same key while probing
                self.data[nextslot] = data

    def get(self, key):
        startslot = self.hashfunction(key, len(self.slots))
        data = None
        stop = False
        found = False
        position = startslot
        while self.slots[position] is not None and not found and not stop:
            if self.slots[position] == key:
                found = True
                data = self.data[position]     # fixed typo: was "postion"
            else:
                position = self.rehash(position, len(self.slots))
                if position == startslot:      # probed the whole table
                    stop = True
        return data

    def __getitem__(self, key):
        return self.get(key)

    def __setitem__(self, key, data):
        self.put(key, data)
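As a quick standalone check of the linear-probing behaviour described above, here is a minimal sketch that uses the same hash function (key % 11); the keys 77 and 44 are chosen only because they collide at slot 0:

```python
# Minimal open-addressing demo with the same hash function (key % 11).
size = 11
slots = [None] * size

def put(key):
    i = key % size
    while slots[i] is not None and slots[i] != key:
        i = (i + 1) % size          # linear probing: step to the next slot
    slots[i] = key

put(77)   # 77 % 11 == 0 -> stored in slot 0
put(44)   # 44 % 11 == 0 -> collision with 77, probed forward to slot 1
```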
Advantages
- The hash table's main advantage is its high query efficiency: O(1).
Shortcomings
- hash collisions
- does not support proximity (prefix) queries
- the hash function may need to be redesigned again and again as requirements change
To overcome these issues, search trees make a big contribution.
- Search Tree
- binary search tree
Let x be a node in a binary search tree holding the keyword key. The left child's key is smaller than the parent's and the right child's key is larger; a useful consequence is that an in-order traversal visits the nodes in sorted order.
Attention:
- If the left subtree of any node is non-empty, the values of all nodes in the left subtree are less than the value of its root node;
- If the right subtree of any node is non-empty, the values of all nodes in the right subtree are greater than the value of its root node;
- The left and right subtrees of any node are themselves binary search trees;
- There are no nodes with equal key values.
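The properties above can be sketched as a minimal binary search tree; the sample keys are made up for illustration, and the point is that an in-order traversal returns them sorted:

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    # smaller keys go left, larger go right; equal keys are not inserted
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def inorder(root):
    # in-order traversal visits the keys in ascending order
    return inorder(root.left) + [root.key] + inorder(root.right) if root else []

root = None
for k in ["m", "d", "t", "a", "h"]:
    root = insert(root, k)
```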
- B-Tree
A B-tree is a data structure designed for secondary storage and is commonly used in databases and file systems.
Structurally, every internal node of a B-tree has a number of children in the interval [a, b], where a and b are appropriate positive integers. This is especially advantageous when some of the dictionary is disk-resident, in which case this collapsing serves the function of prefetching imminent binary tests.
All in all, the conclusions are as follows:
- advantages of search trees
- supports proximity queries
- shortcomings of search trees
- lower query efficiency than a hash table (O(log n) rather than O(1))
- extra rebalancing work is needed to keep the tree balanced.
Tail wildcard query
input: ab*
A query where the wildcard appears at the end is a trailing wildcard query. It can be answered with a search tree by walking down the tree in the order a, b:
- Compare at the root node: because a is in a-m, go left.
- Because ab is between a-hu, go left.
- The remaining subtree contains exactly the matching terms; traverse it and collect their postings.
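The same subtree walk can be sketched with a sorted vocabulary list standing in for the tree (the contiguous prefix range corresponds to the matching subtree); the vocabulary here is made up for illustration:

```python
from bisect import bisect_left, bisect_right

vocab = sorted(["abandon", "about", "above", "bat", "cat"])

def prefix_query(prefix):
    # all terms in [prefix, prefix + '\uffff'): the contiguous block of
    # terms starting with the prefix, i.e. the matching subtree
    lo = bisect_left(vocab, prefix)
    hi = bisect_right(vocab, prefix + "\uffff")
    return vocab[lo:hi]
```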
Head wildcard query
To handle a leading wildcard query, we need to introduce the reverse B-tree.
- A reverse B-tree is a B-tree built over the terms written backwards, so the search order is reversed.
The specific process is as follows:
input: *cba
The system searches in the order "a, b, c".
process:
- Compare at the root node: because a is in [a, m], go left.
- Because ba is in [aa, uh], go left.
The subtree below then contains exactly the terms meeting the conditions.
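A minimal sketch of the idea, with a sorted list of reversed terms standing in for the reverse B-tree (the vocabulary is made up):

```python
vocab = ["abc", "bba", "cba", "dcba"]
# "reverse B-tree": index the terms written backwards
rev = sorted(t[::-1] for t in vocab)

def suffix_query(suffix):
    # *cba -> search the reversed terms for the prefix "abc"
    p = suffix[::-1]
    return sorted(t[::-1] for t in rev if t.startswith(p))
```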
General wildcard query
For example, for abc*cba, split it into abc* and *cba and apply the two techniques above, respectively. Note, however, that the intersected results must then be post-filtered against abc*cba: for example, abcba matches both abc* and *cba, but not abc*cba.
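The split-and-post-filter step can be sketched as follows; fnmatch stands in for the final pattern check, and the small vocabulary is made up so that abcba survives the candidate step but fails the post-filter:

```python
import fnmatch

vocab = ["abcba", "abcdcba", "abccba"]

def wildcard(pattern, vocab):
    # split abc*cba into a prefix query and a suffix query, intersect them
    prefix, suffix = pattern.split("*")
    candidates = [t for t in vocab if t.startswith(prefix) and t.endswith(suffix)]
    # post-filter: abcba matches abc* and *cba but not the full pattern
    return [t for t in candidates if fnmatch.fnmatchcase(t, pattern)]
```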
Permuterm index
- Method:
$ marks the end of a word (the canonical form): ab is stored as ab$, and all of its rotations are generated (ab$, b$a, $ab), each pointing back to ab.
- When handling a single wildcard query
To search for *b, add "$" to get *b$, then rotate so the wildcard sits at the end: b$*. Now search the tree for entries beginning with b$. If b$a is found, the term ab meets the requirement.
- When processing multiple wildcard queries
To query a*b*, add "$" to get a*b*$. Rotate so a wildcard sits at the end, query the entries beginning with $a, and then post-filter the results against a*b*.
This approach has an unavoidable defect: the dictionary becomes very large, since every rotation of every term is stored.
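The rotation scheme above can be sketched with a plain dictionary standing in for the tree (a minimal sketch over a made-up two-term vocabulary):

```python
def rotations(term):
    # ab -> ab$, b$a, $ab: every rotation points back to the original term
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

index = {}
for term in ["ab", "abc"]:
    for r in rotations(term):
        index.setdefault(r, set()).add(term)

def permuterm_query(rotated_prefix):
    # *b -> *b$ -> rotated to b$* -> find rotations starting with "b$"
    return {t for r, ts in index.items() if r.startswith(rotated_prefix) for t in ts}
```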
K-GRAM Index
Definition: a k-gram is a sequence of k consecutive letters.
input: grammar
3-grams: {gra, ram, amm, mma, mar}
index structure
The dictionary of the k-gram index is the set of k-grams of all words.
The postings list of a k-gram is the list of words containing that k-gram.
Indexing rule
A "$" must be added at the beginning and end of each word before building the index;
Query method
input: com*
method: use 3-gram
- Add "$": the pattern becomes "$com*$".
- Take its 3-grams: {"$co", "com", "om*", "m*$"}; only the grams without "*" ("$co" and "com") are looked up.
- Search for terms matching all of those grams.
- The results are not necessarily accurate, so they are post-filtered against "com*".
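The steps above can be sketched as follows (the small vocabulary is made up for illustration):

```python
import fnmatch
from collections import defaultdict

def kgrams(term, k=3):
    # k-grams of a term, with "$" marking the boundaries
    t = "$" + term + "$"
    return [t[i:i + k] for i in range(len(t) - k + 1)]

vocab = ["computer", "company", "dog"]
index = defaultdict(set)
for term in vocab:
    for g in kgrams(term):
        index[g].add(term)

def wildcard_query(pattern):
    # k-grams of "$com*$" are $co, com, om*, m*$;
    # only those without "*" can be looked up in the index
    p = "$" + pattern + "$"
    grams = [p[i:i + 3] for i in range(len(p) - 2) if "*" not in p[i:i + 3]]
    candidates = set.intersection(*(index[g] for g in grams))
    # post-filter, since the intersection is not guaranteed to be exact
    return sorted(t for t in candidates if fnmatch.fnmatchcase(t, pattern))
```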
- summary
The k-gram index is slow at query time, because each query must first pass through the k-gram index (the original single-level index becomes a two-level one) and be post-filtered before the normal inverted index can be consulted for docIDs.
The permuterm index needs no post-filtering, but its space consumption is very large.
Edit distance
Edit distance is computed by dynamic programming: the two words index the rows and columns of a two-dimensional matrix.
- algorithm
def minEditDist(sm, sn):
    # dynamic-programming edit distance between strings sm and sn
    m, n = len(sm) + 1, len(sn) + 1
    matrix = [[0] * n for i in range(m)]
    for i in range(1, m):              # cost of deleting i characters
        matrix[i][0] = matrix[i - 1][0] + 1
    for j in range(1, n):              # cost of inserting j characters
        matrix[0][j] = matrix[0][j - 1] + 1
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if sm[i - 1] == sn[j - 1] else 1
            matrix[i][j] = min(matrix[i - 1][j] + 1,         # deletion
                               matrix[i][j - 1] + 1,         # insertion
                               matrix[i - 1][j - 1] + cost)  # substitution
    return matrix[m - 1][n - 1]
If we want to calculate the similarity of two strings, the similarity formula is:
1 - distance / max(len(str1), len(str2))
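The formula can be sketched as a self-contained snippet (it includes a compact rolling-array edit distance so it runs on its own; this is equivalent to the full-matrix version above, just keeping one row at a time):

```python
def edit_distance(a, b):
    # classic DP, keeping only one row of the matrix at a time
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def similarity(a, b):
    # 1 - distance / max(length), as in the formula above
    return 1 - edit_distance(a, b) / max(len(a), len(b))
```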
Note: edit distance has a drawback. Computing the edit distance between the query and every term is far too inefficient, because the dictionary of an inverted index can contain tens of millions of terms;
Calculate Jaccard via the k-gram index
Jaccard coefficient: given sets A and B,
J(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)
Suppose A and B are two words of lengths m and n respectively; then A and B have m - k + 1 and n - k + 1 k-grams respectively (without the boundary "$").
Given a query Q, after computing the k-grams of Q, we can traverse the k-gram index and compute the Jaccard coefficient between each candidate word and Q.
|A ∩ B| can be understood as the number of overlapping k-grams, and |A| + |B| - |A ∩ B| as the total number of distinct k-grams (the sum of the two words' k-gram counts minus the overlap). We keep the words whose Jaccard coefficient is above a threshold.
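A minimal sketch of the coefficient over k-gram sets (with the boundary "$" added, per the indexing rule above):

```python
def kgram_set(word, k=3):
    # set of k-grams of "$" + word + "$"
    w = "$" + word + "$"
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def jaccard(a, b, k=3):
    # |A ∩ B| / (|A| + |B| - |A ∩ B|)
    ga, gb = kgram_set(a, k), kgram_set(b, k)
    inter = len(ga & gb)
    return inter / (len(ga) + len(gb) - inter)
```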
Summary: you can use the k-gram index to shortlist candidates before computing the edit distance.
Nevertheless, the above applies only to spelling errors in individual words. A query like "where are your home?" cannot be caught this way, because each individual word is spelled correctly.
- Improved detection method:
When a query returns very few results, the query phrase itself is suspect. Each word is then replaced in turn with nearby alternatives, looking for the variant phrase that returns many results.
summary of spelling correction
We just need to remember one main principle:
Prefer the candidates closest to the query (smallest edit distance), and when the distance is the same, prefer the more common words.
phonetic correction
- soundex Algorithm
Soundex is a phonetic algorithm that uses the pronunciation of an English word to compute an approximate four-character code. The first character is a letter and the last three are digits.
simple example:
suppose:
- a e h i o u w y -> 0
- b f p v -> 1
- c g j k q s x z -> 2
- d t -> 3
- l -> 4
- m n -> 5
- r -> 6
- Replace everything from the 2nd character onward using the table above.
- If duplicate digits appear in a run, keep a single one.
- Delete the 0s.
- Return the first 4 characters, padding with 0 if necessary.
For example:
quality & quantity
quality -> Q004030 -> Q43 -> Q430
quantity -> Q0053030 -> Q533
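The steps above can be sketched as follows (a simplified Soundex variant that follows the mapping table and reproduces the two worked examples):

```python
def soundex(word):
    # build the letter-to-digit table from the mapping above
    table = {}
    for letters, digit in [("aehiouwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
                           ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            table[ch] = digit
    word = word.lower()
    # keep the first letter, map the rest to digits
    code = word[0].upper() + "".join(table[c] for c in word[1:])
    # collapse runs of duplicate digits, then drop the 0s
    collapsed = [code[0]]
    for c in code[1:]:
        if c != collapsed[-1]:
            collapsed.append(c)
    kept = "".join(c for c in collapsed if c != "0")
    # first four characters, padded with 0
    return (kept + "000")[:4]
```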
These are my own notes and summary, drawing on Manning's original book and the CSDN blogger:
- iteye_17686: (link)
Please forgive any mistakes in my English.
Good luck with your studies~~