weak-and算法原理演示（wand）

最新推荐文章于 2023-04-16 00:57:36 发布

可克

最新推荐文章于 2023-04-16 00:57:36 发布

阅读量2.2k

点赞数

分类专栏： c/c++/ios python

python 同时被 2 个专栏收录

41 篇文章 0 订阅

订阅专栏

c/c++/ios

29 篇文章 0 订阅

订阅专栏

weak-and算法原理演示（wand）

分类：算法 2013-01-14 11:38 2147人阅读评论(2) 收藏举报

wand 信息检索 weak-and

推荐一个在信息检索中用到的weak-and算法，这个算法在广告系统中有成熟的应用。

简单来说，一般我们在计算文本相关性的时候，会通过倒排索引的方式进行查询，通过倒排索引已经要比全量遍历节约大量时间，但是有时候仍然很慢。

原因是很多时候我们其实只是想要top n个结果，一些结果明显较差的也进行了复杂的相关性计算，而weak-and算法通过计算每个词的贡献上限来估计文档的相关性上限，从而建立一个阈值对倒排中的结果进行减枝，从而得到提速的效果。

从我实际测试的结果看，对于短文本的效果不如长文本的明显，但是在视频的电影数据上面看，仍然减少了50%的耗时（top 100），并且该算法可以通过牺牲精度来进一步提升速度，非常不错。

以下代码是一个算法原理演示，实现了主要的算法逻辑以验证算法的有效性，供大家参考，该实现优化了原始算法的一些逻辑尽量减少了无谓的循环

[python] view plain copy

#!/usr/bin/python
#wangben updated 20130108
class WAND:
'''''implement wand algorithm'''
def __init__(self, InvertIndex, last_docid):
self.invert_index = InvertIndex #InvertIndex: term -> docid1, docid2, docid3 ...
self.current_doc = 0
self.current_invert_index = {}
self.query_terms = []
self.threshold = 2
self.sort_terms = []
self.LastID = 2000000000 #big num
self.debug_count = 0
self.last_docid = last_docid
def __InitQuery(self, query_terms):
'''''check terms len > 0'''
self.current_doc = -1
self.current_invert_index.clear()
self.query_terms = query_terms
self.sort_terms[:] = []
self.debug_count = 0
for term in query_terms:
#initial start pos from the first position of term's invert_index
self.current_invert_index[term] = [ self.invert_index[term][0], 0 ] #[ docid, index ]
def __SortTerms(self):
if len(self.sort_terms) == 0:
for term in self.query_terms:
if term in self.current_invert_index:
doc_id = self.current_invert_index[term][0]
self.sort_terms.append([ int(doc_id), term ])
self.sort_terms.sort()
def __PickTerm(self, pivot_index):
return 0
def __FindPivotTerm(self):
score = 0
for i in range(0, len(self.sort_terms)):
score += 1
if score >= self.threshold:
return [ self.sort_terms[i][1], i]
return [ None, len(self.sort_terms) ]
def __IteratorInvertIndex(self, change_term, docid, pos):
'''''move to doc id > docid'''
doc_list = self.invert_index[change_term]
i = 0
for i in range(pos, len(doc_list)):
if doc_list[i] >= docid:
pos = i
docid = doc_list[i]
break
return [ docid, pos ]
def __AdvanceTerm(self, change_index, docid ):
change_term = self.sort_terms[change_index][1]
pos = self.current_invert_index[change_term][1]
(new_doc, new_pos) = \
self.__IteratorInvertIndex(change_term, docid, pos)
self.current_invert_index[change_term] = \
[ new_doc , new_pos ]
self.sort_terms[change_index][0] = new_doc
def __Next(self):
if self.last_docid == self.current_doc:
return None
while True:
self.debug_count += 1
#sort terms by doc id
self.__SortTerms()
#find pivot term > threshold
(pivot_term, pivot_index) = self.__FindPivotTerm()
if pivot_term == None:
#no more candidate
return None
#debug_info:
for i in range(0, pivot_index + 1):
print self.sort_terms[i][0],self.sort_terms[i][1],"|",
print ""
pivot_doc_id = self.current_invert_index[pivot_term][0]
if pivot_doc_id == self.LastID: #!!
return None
if pivot_doc_id <= self.current_doc:
change_index = self.__PickTerm(pivot_index)
self.__AdvanceTerm( change_index, self.current_doc + 1 )
else:
first_docid = self.sort_terms[0][0]
if pivot_doc_id == first_docid:
self.current_doc = pivot_doc_id
return self.current_doc
else:
#pick all preceding term
for i in range(0, pivot_index):
change_index = i
self.__AdvanceTerm( change_index, pivot_doc_id )
def DoQuery(self, query_terms):
self.__InitQuery(query_terms)
while True:
candidate_docid = self.__Next()
if candidate_docid == None:
break
print "candidate_docid:",candidate_docid
#insert candidate_docid to heap
#update threshold
print "debug_count:",self.debug_count
if __name__ == "__main__":
testIndex = {}
testIndex["t1"] = [ 0, 1, 2, 3, 6 , 2000000000]
testIndex["t2"] = [ 3, 4, 5, 6, 2000000000 ]
testIndex["t3"] = [ 2, 5, 2000000000 ]
testIndex["t4"] = [ 4, 6, 2000000000 ]
w = WAND(testIndex, 6)
w.DoQuery(["t1", "t2", "t3", "t4"])

输出结果中会展示next中循环的次数，以及最后被选为candidate的docid

这里省略了建立堆的过程，使用了一个默认阈值2作为doc的删选条件，候选doc和query doc采用重复词的个数计算UB，这里只是一个算法演示，实际使用的时候需要根据自己的相关性公式进行调整（关于Upper Bound需要注意的是，需要在预处理阶段，把每个词可能会共现的最大相关性的分值计算出来作为该词的UB）

输出结果如下，展示了docid的相关变化过程，以及最后的循环次数14次

[html] view plain copy