joey 周琦
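This article first introduces the concept and uses of locality-sensitive hashing, then describes two common methods for computing similarity quickly and finding near neighbors: minHash and simHash.
Locality-sensitive hashing (LSH)
Definition
Let us start with the fairly precise English description on Wikipedia [1].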
An LSH family F is defined for a metric space M=(M,d), a threshold R>0 and an approximation factor c>1. This family F is a family of functions h:M→S which map elements from the metric space to a bucket s∈S. The LSH family satisfies the following conditions for any two points p,q∈M, using a function h∈F which is chosen uniformly at random:
- if d(p,q) ≤ R, then h(p)=h(q) (i.e., p and q collide) with probability at least P1,
- if d(p,q) ≥ cR, then h(p)=h(q) with probability at most P2.
A family is interesting when P1>P2. Such a family F is called (R,cR,P1,P2)-sensitive.
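To make the definition concrete, here is a minimal sketch using the bit-sampling family for Hamming distance. This is a standard textbook example, not one taken from the quoted text. Each hash function h_i simply returns the i-th coordinate of a binary vector, so the probability that a uniformly chosen h_i collides on p and q is exactly 1 - d(p,q)/n. The names `hamming`, `collision_probability`, and the sample vectors are illustrative assumptions.

```python
from fractions import Fraction


def hamming(p, q):
    """Hamming distance: number of coordinates where p and q differ."""
    return sum(a != b for a, b in zip(p, q))


def collision_probability(p, q):
    """Exact Pr[h(p) == h(q)] for h drawn uniformly from the
    bit-sampling family {h_i(x) = x[i] : 0 <= i < n}."""
    n = len(p)
    agreeing = sum(1 for i in range(n) if p[i] == q[i])
    return Fraction(agreeing, n)


# Illustrative 8-bit points (assumed data, not from the article).
p = (0, 1, 1, 0, 1, 0, 0, 1)
q_near = (0, 1, 1, 0, 1, 0, 0, 0)  # differs from p in 1 coordinate
q_far = (1, 0, 0, 1, 1, 0, 0, 1)   # differs from p in 4 coordinates

R, c = 1, 4
n = len(p)
P1 = 1 - Fraction(R, n)      # lower bound on collisions when d(p,q) <= R
P2 = 1 - Fraction(c * R, n)  # upper bound on collisions when d(p,q) >= cR

print(hamming(p, q_near), collision_probability(p, q_near))  # 1 7/8
print(hamming(p, q_far), collision_probability(p, q_far))    # 4 1/2
print(P1 > P2)                                               # True
```

Under these assumptions the family is (1, 4, 7/8, 1/2)-sensitive: the near pair collides with probability at least P1, the far pair with probability at most P2, and P1 > P2.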
Wikipedia defines a metric space as follows:
In mathematics, a metric space is a set for which distances between all members of the set are defined. Those distances, taken together, are called a metric on the set.
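Based on the above, here is my own restatement. Locality-sensitive hashing is a family of functions defined for a metric space M=(M,d) that maps the elements of the metric space to corresponding buckets. It satisfies the following properties: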
- if d(p,q)