September 22, 2016
Introduction
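(1) sub-linear time, (2) storage-efficient, (3) easy to implement
(1) When searching for k-NN, first convert the problem into a Hamming-distance search.
(2) A Hamming-distance search can be converted into a search over the Hamming distance of each substring; the query results of the m substrings are then merged to form the candidate set.
(3) Finally, remove from the candidate set every result whose Hamming distance is greater than r.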
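The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: codes are stored as Python integers, q is assumed to be divisible by m, and each substring table is probed within radius floor(r/m), which follows from the pigeonhole principle (two codes that differ in at most r bits must differ in at most floor(r/m) bits in at least one of the m substrings). All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of differing bits between two integer-encoded binary codes."""
    return bin(a ^ b).count("1")


def split(code, q, m):
    """Split a q-bit code into m contiguous substrings of s = q // m bits."""
    s = q // m
    mask = (1 << s) - 1
    return [(code >> (i * s)) & mask for i in range(m)]


def neighbors_within(sub, s, radius):
    """Yield every s-bit value within Hamming distance `radius` of `sub`."""
    for k in range(radius + 1):
        for positions in combinations(range(s), k):
            flipped = sub
            for p in positions:
                flipped ^= 1 << p
            yield flipped


def build_index(codes, q, m):
    """One hash table per substring position: substring value -> code ids."""
    tables = [defaultdict(list) for _ in range(m)]
    for idx, code in enumerate(codes):
        for i, sub in enumerate(split(code, q, m)):
            tables[i][sub].append(idx)
    return tables


def r_neighbor_search(query, codes, tables, q, m, r):
    """Return the ids of all codes within Hamming distance r of the query."""
    s = q // m
    sub_radius = r // m  # assumed per-substring radius (pigeonhole argument)
    candidates = set()
    # Steps 1-2: probe each substring table and merge the results.
    for i, sub in enumerate(split(query, q, m)):
        for probe in neighbors_within(sub, s, sub_radius):
            candidates.update(tables[i].get(probe, ()))
    # Step 3: discard candidates whose full Hamming distance exceeds r.
    return sorted(i for i in candidates if hamming(query, codes[i]) <= r)
```

A call such as `r_neighbor_search(query, codes, build_index(codes, q, m), q, m, r)` should return the same ids as a brute-force scan over `codes`, while only verifying the codes that landed in a probed bucket.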
Two problems
Define
• r-neighbors
A binary code g is an r-neighbor of the query binary code if and only if g differs from the query code in at most r bits.
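The definition translates directly into a bit-level check. The sketch below assumes binary codes are held as Python integers; the function names are illustrative only.

```python
def hamming_distance(g, h):
    """Count the bit positions in which g and h differ (popcount of XOR)."""
    return bin(g ^ h).count("1")


def is_r_neighbor(g, query, r):
    """True if g differs from the query code in at most r bits."""
    return hamming_distance(g, query) <= r
```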
r-neighbor search
Tackle r-neighbor
Intuitive Method
As r increases, L(q,r) grows sharply. This method is practical only for small search radii and short codes.
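To see how fast the count grows, the sketch below assumes that L(q,r) denotes the number of distinct q-bit buckets within Hamming radius r of a query, i.e. the sum of the binomial coefficients C(q,k) for k = 0..r. The notes do not spell out this definition, so treat it as an assumption; `num_buckets` is a hypothetical name.

```python
from math import comb


def num_buckets(q, r):
    """Buckets within Hamming radius r of a q-bit query: sum of C(q, k), k = 0..r."""
    return sum(comb(q, k) for k in range(r + 1))


if __name__ == "__main__":
    # Print the growth for a 64-bit code as the radius increases.
    for r in range(0, 11, 2):
        print(r, num_buckets(64, r))
```

Under this assumption, an 8-bit code with r = 2 already requires 37 bucket lookups, and the count for longer codes quickly exceeds any realistic database size.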
Multi-Index hashing
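Two propositions
MIH for r-neighbor Search
(1) number of lookups
The number of buckets that must be searched.
(2) number of candidates tested
This depends on the substring length s. All binary-code substrings in one hash bucket are identical. If a bucket is among the buckets to be probed, then every binary code in that bucket is a candidate. A substring of length s yields 2^s hash buckets, and the database holds n binary codes in total; assuming the codes are uniformly distributed, each hash bucket contains n/2^s codes on average. The number of candidates for one substring is therefore lookups * n/2^s.
The upper bound on the cost depends on s.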
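The candidate estimate can be tabulated to see its dependence on s. This is a rough sketch under stated assumptions: codes are uniformly distributed, s divides q so that m = q/s, and each substring table is probed within radius floor(r/m). The last assumption is not stated in the notes, and all function names are hypothetical.

```python
from math import comb


def lookups_per_substring(s, sub_radius):
    """Buckets probed in one substring table: all s-bit values within sub_radius."""
    return sum(comb(s, k) for k in range(sub_radius + 1))


def expected_candidates(lookups, n, s):
    """lookups * n / 2^s, assuming n codes spread uniformly over 2^s buckets."""
    return lookups * n / 2 ** s


def candidates_by_length(q, r, n, lengths):
    """Rows of (s, lookups, expected candidates) for each substring length s."""
    rows = []
    for s in lengths:
        m = q // s  # number of substrings (assumes s divides q)
        lookups = lookups_per_substring(s, r // m)
        rows.append((s, lookups, expected_candidates(lookups, n, s)))
    return rows
```

Short substrings keep the number of lookups small but fill each bucket with many codes; long substrings do the opposite. This trade-off is why the cost bound depends on s.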
Choosing an Effective Substring Length
search ratio r/q
Plots cost as a function of substring length s, for 240-bit codes, different database sizes n, and different search ratios.
Complexity
k-NEAREST NEIGHBOR SEARCH
EXPERIMENTS
Multi-Index Hashing vs. Linear Scan
Substring Optimization
Method:
(1) Initial: a random bit is assigned to the first substring.
(2) A bit is assigned to substring j that is maximally correlated with the bit assigned to substring j − 1. After this step, every substring contains one bit.
(3) Repeat:
An unused bit is assigned to substring j if the maximum correlation between that bit and the other bits already assigned to substring j is minimal.
This approach significantly decreases the correlation between bits within a single substring. This should make the distribution of codes within substring buckets more uniform, and thereby lower the number of candidates within a given search radius.
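The greedy procedure above can be sketched as follows. The sketch takes a precomputed symmetric matrix of absolute bit correlations as input, fixes the initial bit instead of drawing it at random so the result is reproducible, and visits the substrings round-robin in step (3). The visiting order and the function name `assign_bits` are assumptions, since the notes do not specify them.

```python
def assign_bits(corr, m, first_bit=0):
    """Greedily partition bit indices into m substrings.

    corr: symmetric q x q matrix (list of lists) of absolute bit correlations.
    Returns a list of m lists of bit indices.
    """
    q = len(corr)
    unused = set(range(q))
    substrings = [[] for _ in range(m)]

    # Step 1: seed the first substring (random in the notes, fixed here).
    substrings[0].append(first_bit)
    unused.remove(first_bit)

    # Step 2: seed substring j with the unused bit that is most correlated
    # with the seed of substring j - 1.
    for j in range(1, m):
        prev = substrings[j - 1][0]
        best = max(sorted(unused), key=lambda b: corr[b][prev])
        substrings[j].append(best)
        unused.remove(best)

    # Step 3: repeatedly give substring j the unused bit whose maximum
    # correlation with the bits already in substring j is smallest.
    j = 0
    while unused:
        target = substrings[j]
        best = min(sorted(unused), key=lambda b: max(corr[b][a] for a in target))
        target.append(best)
        unused.remove(best)
        j = (j + 1) % m  # assumed round-robin order over substrings
    return substrings
```

With a correlation matrix in which bits 0 and 1 are strongly correlated, and so are bits 2 and 3, the procedure places each correlated pair in different substrings, which is the intended effect.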
Future work
each substring hash table, thereby making the distribution of substrings as uniform as possible. However, this entropic approach is left to future work.