This is part of my general survey on LSH in CDN class.
NN search
Given a set P of n points, design an algorithm that, given any point q, returns a point in P that is closest to q (its “nearest neighbor” in P).
ANN search
An approximate nearest neighbor search problem is to find the point, whose distance from the query is at most c times the distance from the query to its nearest points. Its formal definition is denoted as followed:
Given a set P and a point q, find a point pi in P so that
d ( p i , q ) ≤ c × m i n p j d ( p j , q ) {d(p_i, q) \leq c \times min_{p_j}d(p_j, q)} d(pi,q)≤c×minpjd(pj,q)
The appealing point of this approach is that it is almost as good as the exact one in many cases. In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter.
General approaches to ANN problem are mainly two kinds, tree-based approach and LSH.
LSH
LSH refers to a family of functions (LSH families) to hash data points into buckets so that data points near each other are located in the same buckets with high probability, while data points far from each other are likely to be in different buckets. This makes it easier to identify observations in high dimensional space. The formal definition is:
A family of hashing functions is ( r , c r , p 1 , p 2 ) − L S H {(r,cr,p_1,p_2)-LSH} (r,cr,p

这篇博客介绍了局部敏感哈希(LSH)在CDN场景下的应用,主要讨论了近似最近邻(ANN)搜索问题。LSH通过创建一种概率模型,使得相近的点被分配到同一个桶中的概率较高,从而简化高维空间中的数据比较。文章举例说明了LSH的工作原理,并提到了Min Hash和Sim Hash等变种,用于快速估算集合相似性。
最低0.47元/天 解锁文章
4万+

被折叠的 条评论
为什么被折叠?



