Locality-sensitive hashing

Locality-sensitive hashing (LSH) reduces the dimensionality of high-dimensional data. LSH hashes input items so that similar items map to the same “buckets” with high probability (the number of buckets being much smaller than the universe of possible input items). LSH differs from conventional and cryptographic hash functions because it aims to maximize the probability of a “collision” for similar items.[1] Locality-sensitive hashing has much in common with data clustering and nearest neighbor search.

Hashing-based approximate nearest neighbor search algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH), or data-dependent methods, such as locality-preserving hashing (LPH).[2][3]

Definitions

An LSH family[1][4][5] $\mathcal{F}$ is defined for a metric space $\mathcal{M} = (M, d)$, a threshold $R > 0$ and an approximation factor $c > 1$. This family $\mathcal{F}$ is a family of functions $h : \mathcal{M} \to S$ which map elements from the metric space to a bucket $s \in S$. The LSH family satisfies the following conditions for any two points $p, q \in \mathcal{M}$, using a function $h \in \mathcal{F}$ which is chosen uniformly at random:

  • if $d(p, q) \le R$, then $h(p) = h(q)$ (i.e., p and q collide) with probability at least $P_1$,
  • if $d(p, q) \ge cR$, then $h(p) = h(q)$ with probability at most $P_2$.

A family is interesting when $P_1 > P_2$. Such a family $\mathcal{F}$ is called $(R, cR, P_1, P_2)$-sensitive.

Alternatively[6] it is defined with respect to a universe of items U that have a similarity function $\phi : U \times U \to [0,1]$. An LSH scheme is a family of hash functions H coupled with a probability distribution D over the functions such that a function $h \in H$ chosen according to D satisfies the property that $\Pr_{h \in H}[h(a) = h(b)] = \phi(a, b)$ for any $a, b \in U$.

Locality-preserving hashing

A locality-preserving hash is a hash function f that maps a point or points in a multidimensional coordinate space to a scalar value, such that for any three points A, B and C,

$|A - B| < |B - C| \Rightarrow |f(A) - f(B)| < |f(B) - f(C)|.$

In other words, these are hash functions where the relative distance between the input values is preserved in the relative distance between the output hash values; input values that are closer to each other will produce output hash values that are closer to each other.

This is in contrast to cryptographic hash functions and checksums, which are designed to have maximum output difference between adjacent inputs.

Locality preserving hashes are related to space-filling curves.

Amplification

Given a $(d_1, d_2, p_1, p_2)$-sensitive family $\mathcal{F}$, we can construct new families $\mathcal{G}$ by either the AND-construction or the OR-construction of $\mathcal{F}$.[1]

To create an AND-construction, we define a new family $\mathcal{G}$ of hash functions g, where each function g is constructed from k random functions $h_1, \ldots, h_k$ from $\mathcal{F}$. We then say that for a hash function $g \in \mathcal{G}$, $g(x) = g(y)$ if and only if $h_i(x) = h_i(y)$ for all $i = 1, 2, \ldots, k$. Since the members of $\mathcal{F}$ are independently chosen for any $g \in \mathcal{G}$, $\mathcal{G}$ is a $(d_1, d_2, p_1^k, p_2^k)$-sensitive family.

To create an OR-construction, we define a new family $\mathcal{G}$ of hash functions g, where each function g is constructed from k random functions $h_1, \ldots, h_k$ from $\mathcal{F}$. We then say that for a hash function $g \in \mathcal{G}$, $g(x) = g(y)$ if and only if $h_i(x) = h_i(y)$ for one or more values of i. Since the members of $\mathcal{F}$ are independently chosen for any $g \in \mathcal{G}$, $\mathcal{G}$ is a $(d_1, d_2, 1 - (1 - p_1)^k, 1 - (1 - p_2)^k)$-sensitive family.
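
As a quick numeric sketch of these two constructions (the values $p_1 = 0.8$, $p_2 = 0.4$ and $k = 5$ are illustrative, not taken from any reference), the AND-construction drives both probabilities down while widening the gap between them, and the OR-construction drives both up:

```python
# AND/OR amplification of collision probabilities (illustrative values).
p1, p2, k = 0.8, 0.4, 5

and_p1, and_p2 = p1 ** k, p2 ** k                      # AND: both probabilities shrink
or_p1, or_p2 = 1 - (1 - p1) ** k, 1 - (1 - p2) ** k    # OR: both probabilities grow

print(f"AND-construction: ({and_p1:.5f}, {and_p2:.5f})")  # (0.32768, 0.01024)
print(f"OR-construction:  ({or_p1:.5f}, {or_p2:.5f})")    # (0.99968, 0.92224)
```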

Applications

LSH has been applied to several problem domains.[citation needed]

Methods

Bit sampling for Hamming distance

One of the easiest ways to construct an LSH family is by bit sampling.[5] This approach works for the Hamming distance over d-dimensional vectors $\{0,1\}^d$. Here, the family $\mathcal{F}$ of hash functions is simply the family of all the projections of points on one of the $d$ coordinates, i.e., $\mathcal{F} = \{h : \{0,1\}^d \to \{0,1\} \mid h(x) = x_i \text{ for some } i \in \{1, \ldots, d\}\}$, where $x_i$ is the $i$th coordinate of $x$. A random function $h$ from $\mathcal{F}$ simply selects a random bit from the input point. This family has the following parameters: $P_1 = 1 - R/d$, $P_2 = 1 - cR/d$.[clarification needed]
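
A minimal sketch of this family (the function name and toy vectors below are illustrative, not from any reference):

```python
import random

def sample_bit_hash(d):
    """Draw h from F: project a d-dimensional binary vector onto one random coordinate."""
    i = random.randrange(d)
    return lambda x: x[i]          # h(x) = x_i

d = 8
p = [0, 1, 1, 0, 1, 0, 0, 1]
q = [0, 1, 1, 0, 1, 0, 1, 1]       # Hamming distance 1 from p
h = sample_bit_hash(d)
# Over random choices of h, Pr[h(p) == h(q)] = 1 - (Hamming distance)/d = 1 - 1/8.
print(h(p), h(q))
```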

Min-wise independent permutations

Main article: MinHash

Suppose U is composed of subsets of some ground set of enumerable items S and the similarity function of interest is the Jaccard index J. If π is a permutation on the indices of S, for $A \subseteq S$ let $h(A) = \min_{a \in A} \{\pi(a)\}$. Each possible choice of π defines a single hash function h mapping input sets to elements of S.

Define the function family H to be the set of all such functions and let D be the uniform distribution. Given two sets $A, B \subseteq S$, the event that $h(A) = h(B)$ corresponds exactly to the event that the minimizer of π over $A \cup B$ lies inside $A \cap B$. As h was chosen uniformly at random, $\Pr[h(A) = h(B)] = J(A, B)$ and $(H, D)$ define an LSH scheme for the Jaccard index.
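
A minimal sketch of this scheme, assuming the ground set S is {0, ..., n-1} (the sets, n, and the Monte Carlo check below are illustrative):

```python
import random

n = 10  # size of the ground set S = {0, ..., n-1}

def sample_minhash():
    """Draw h from H: a uniformly random permutation pi, with h(A) = min_{a in A} pi(a)."""
    pi = list(range(n))
    random.shuffle(pi)
    return lambda A: min(pi[a] for a in A)

A = {0, 1, 2, 3}
B = {2, 3, 4, 5}                    # J(A, B) = |A ∩ B| / |A ∪ B| = 2/6

trials = 100_000
collisions = 0
for _ in range(trials):
    h = sample_minhash()            # the same h must be applied to both sets
    collisions += (h(A) == h(B))
print(collisions / trials)          # ≈ 0.333
```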

Because the symmetric group on n elements has size n!, choosing a truly random permutation from the full symmetric group is infeasible for even moderately sized n. Because of this fact, there has been significant work on finding a family of permutations that is "min-wise independent": a permutation family for which each element of the domain has equal probability of being the minimum under a randomly chosen π. It has been established that a min-wise independent family of permutations is at least of size $\operatorname{lcm}(1, 2, \cdots, n) \ge e^{n - o(n)}$,[14] and that this bound is tight.[15]

Because min-wise independent families are too big for practical applications, two variant notions of min-wise independence have been introduced: restricted min-wise independent permutation families, and approximate min-wise independent families. Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most k.[16] Approximate min-wise independence differs from the property by at most a fixed ε.[17]

Open source methods

Nilsimsa Hash

Main article: Nilsimsa Hash

Nilsimsa is an anti-spam focused locality-sensitive hashing algorithm.[18] The goal of Nilsimsa is to generate a hash digest of an email message such that the digests of two similar messages are similar to each other. The paper suggests that Nilsimsa satisfies three requirements:

  1. The digest identifying each message should not vary significantly for changes that can be produced automatically.
  2. The encoding must be robust against intentional attacks.
  3. The encoding should support an extremely low risk of false positives.

TLSH

TLSH is a locality-sensitive hashing algorithm designed for a range of security and digital forensic applications.[19] The goal of TLSH is to generate a hash digest of a document such that if two digests have a low distance between them, then it is likely that the messages are similar to each other.

Testing performed in the paper on a range of file types identified the Nilsimsa hash as having a significantly higher false positive rate when compared to other similarity digest schemes such as TLSH, Ssdeep and Sdhash.

An implementation of TLSH is available as open-source software.[20]

Random projection

Main article: Random projection

Figure: sketch of $1 - \theta/\pi$ versus $\cos(\theta)$. For small angles (not too close to orthogonal), $1 - \frac{\theta}{\pi}$ is a fairly good approximation to $\cos(\theta)$.

The random projection method of LSH due to Moses Charikar,[6] called SimHash (also sometimes called arccos[21]), is designed to approximate the cosine distance between vectors. The basic idea of this technique is to choose a random hyperplane (defined by a normal unit vector r) at the outset and use the hyperplane to hash input vectors.

Given an input vector v and a hyperplane defined by r, we let $h(v) = \operatorname{sgn}(v \cdot r)$. That is, $h(v) = \pm 1$ depending on which side of the hyperplane v lies.

Each possible choice of r defines a single function. Let H be the set of all such functions and let D be the uniform distribution once again. It is not difficult to prove that, for two vectors $u, v$, $\Pr[h(u) = h(v)] = 1 - \frac{\theta(u,v)}{\pi}$, where $\theta(u,v)$ is the angle between u and v. $1 - \frac{\theta(u,v)}{\pi}$ is closely related to $\cos(\theta(u,v))$.

In this instance hashing produces only a single bit. Two vectors' bits match with probability $1 - \theta(u,v)/\pi$, which closely tracks the cosine of the angle between them.
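
A minimal sketch of the random-hyperplane hash, with a Monte Carlo check of the collision probability (the vectors and trial count are illustrative):

```python
import math
import random

def sample_hyperplane_hash(d):
    """Draw h: a random normal vector r, with h(v) = sign(v . r)."""
    r = [random.gauss(0.0, 1.0) for _ in range(d)]
    return lambda v: 1 if sum(vi * ri for vi, ri in zip(v, r)) >= 0 else -1

u = [1.0, 2.0, 0.0]
v = [2.0, 1.0, 1.0]

trials = 50_000
agree = 0
for _ in range(trials):
    h = sample_hyperplane_hash(3)   # the same h is applied to both vectors
    agree += (h(u) == h(v))

norm_u = math.sqrt(sum(a * a for a in u))
norm_v = math.sqrt(sum(b * b for b in v))
theta = math.acos(sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v))
print(agree / trials, 1 - theta / math.pi)   # the empirical rate should be close to 1 - theta/pi
```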

Stable distributions

The hash function[22] $h_{\mathbf{a},b}({\boldsymbol{\upsilon}}) : \mathcal{R}^d \to \mathcal{N}$ maps a d-dimensional vector $\boldsymbol{\upsilon}$ onto a set of integers. Each hash function in the family is indexed by a choice of random $\mathbf{a}$ and $b$, where $\mathbf{a}$ is a d-dimensional vector with entries chosen independently from a stable distribution and $b$ is a real number chosen uniformly from the range [0, r]. For a fixed $\mathbf{a}, b$ the hash function $h_{\mathbf{a},b}$ is given by $h_{\mathbf{a},b}({\boldsymbol{\upsilon}}) = \left\lfloor \frac{\mathbf{a} \cdot \boldsymbol{\upsilon} + b}{r} \right\rfloor$.
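
A minimal sketch of this family, using a Gaussian (2-stable) distribution for the entries of a (the bucket width r and the sample vectors are illustrative):

```python
import math
import random

def sample_stable_hash(d, r):
    """Draw h_{a,b}: a has Gaussian (2-stable) entries, b is uniform in [0, r]."""
    a = [random.gauss(0.0, 1.0) for _ in range(d)]
    b = random.uniform(0.0, r)
    return lambda v: math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / r)

h = sample_stable_hash(d=4, r=2.0)
print(h([1.0, 0.0, 2.0, 1.0]), h([1.1, 0.0, 2.0, 0.9]))  # nearby vectors often land in the same bucket
```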

Other construction methods for hash functions have been proposed to better fit the data.[23] In particular, k-means hash functions are better in practice than projection-based hash functions, but without any theoretical guarantee.

LSH algorithm for nearest neighbor search

One of the main applications of LSH is to provide a method for efficient approximate nearest neighbor search algorithms. Consider an LSH family $\mathcal{F}$. The algorithm has two main parameters: the width parameter k and the number of hash tables L.

In the first step, we define a new family $\mathcal{G}$ of hash functions g, where each function g is obtained by concatenating k functions $h_1, \ldots, h_k$ from $\mathcal{F}$, i.e., $g(p) = [h_1(p), \ldots, h_k(p)]$. In other words, a random hash function g is obtained by concatenating k randomly chosen hash functions from $\mathcal{F}$. The algorithm then constructs L hash tables, each corresponding to a different randomly chosen hash function g.

In the preprocessing step we hash all n points from the data set S into each of the L hash tables. Given that the resulting hash tables have only n non-zero entries, one can reduce the amount of memory used per hash table to $O(n)$ using standard hash functions.

Given a query point q, the algorithm iterates over the L hash functions g. For each g considered, it retrieves the data points that are hashed into the same bucket as q. The process is stopped as soon as a point within distance $cR$ from q is found.
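
A minimal sketch of this table-based search, assuming the bit-sampling family for Hamming distance (the data set, k, L and cR values are illustrative):

```python
import random
from collections import defaultdict

def build_index(points, d, k, L):
    """Build L hash tables; each table uses g = a concatenation of k random coordinates."""
    tables = []
    for _ in range(L):
        g = [random.randrange(d) for _ in range(k)]
        table = defaultdict(list)
        for idx, p in enumerate(points):
            table[tuple(p[i] for i in g)].append(idx)
        tables.append((g, table))
    return tables

def query(tables, points, q, cR):
    """Probe q's bucket in each table; stop at the first point within Hamming distance cR."""
    for g, table in tables:
        for idx in table.get(tuple(q[i] for i in g), []):
            if sum(a != b for a, b in zip(points[idx], q)) <= cR:
                return idx
    return None

d = 16
points = [[random.randint(0, 1) for _ in range(d)] for _ in range(1000)]
tables = build_index(points, d, k=6, L=10)
q = points[0][:]; q[3] ^= 1                  # a query one bit away from points[0]
print(query(tables, points, q, cR=2))
```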

Given the parameters k and L, the algorithm has the following performance guarantees:

  • preprocessing time: $O(nLkt)$, where t is the time to evaluate a function $h \in \mathcal{F}$ on an input point p;
  • space: $O(nL)$, plus the space for storing data points;
  • query time: $O(L(kt + dnP_2^k))$;
  • the algorithm succeeds in finding a point within distance $cR$ from q (if there exists a point within distance R) with probability at least $1 - (1 - P_1^k)^L$.

For a fixed approximation ratio $c = 1 + \epsilon$ and probabilities $P_1$ and $P_2$, one can set $k = \frac{\log n}{\log 1/P_2}$ and $L = n^{\rho}$, where $\rho = \frac{\log P_1}{\log P_2}$. Then one obtains the following performance guarantees:

  • preprocessing time: $O(n^{1+\rho} kt)$;
  • space: $O(n^{1+\rho})$, plus the space for storing data points;
  • query time: $O(n^{\rho}(kt + d))$.
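
As a quick numeric sketch of this parameter setting (n, P1 and P2 below are illustrative values, not from any reference):

```python
import math

n, P1, P2 = 1_000_000, 0.8, 0.4

rho = math.log(P1) / math.log(P2)              # rho = log P1 / log P2 ≈ 0.24
k = math.ceil(math.log(n) / math.log(1 / P2))  # k = log n / log(1/P2)
L = math.ceil(n ** rho)                        # L = n^rho, roughly n^0.24 tables here
print(rho, k, L)
```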

Worked example: MinHash signatures and banding

First, what is LSH good for? It can quickly find similar data points in a massive data set. That may sound abstract, so here is a concrete example. Suppose you have a huge collection of web pages (pages stored locally, not the whole internet) and you want to find the pages most similar to one particular page, say this post. The most naive approach is to scan the entire collection and compare every page against yours one by one. What "similar" means is up to you; you might measure it with Cosine Similarity or the Jaccard Coefficient (see my other post on common similarity and correlation measures). As for constructing the feature vectors, there are many options, such as using the TF-IDF values of the terms after tokenization, which we will not expand on here.

Now suppose you have already extracted features for every page. The task becomes: given one feature vector, how do you quickly find similar feature vectors among a huge number of them? This is where LSH comes in. Look at its name, Locality Sensitive Hashing. Everyone knows hashing: lookups run in O(1) time, which is extremely fast. So what does "locality sensitive" mean? It means that if two original data points are similar, their hashed values remain similar to some degree. That is locality-sensitive hashing.

Compare this with an ordinary hash. Take the hash function f(x) = (x * 7) % 10 and two inputs x1 = 123 and x2 = 124. Hashing them gives f(x1) = 1 and f(x2) = 8. The original values are very close (say, under Euclidean distance), but the hashed values are far apart: this hash does not preserve similarity, so it is not locality-sensitive.

LSH, then, is a hash that preserves similarity to some degree after hashing. Why only "to some degree"? A hash function's range is usually finite, while the data to be hashed is unpredictable and may far exceed that range, so it is unavoidable that more than one data point ends up with the same hash value. Why does that matter? Imagine a hash function so good that it perfectly preserves the similarity of the original data, and suppose it has 10 distinct output values. Now hash 11 mutually dissimilar data points with it. Since there are only 10 distinct values, some bucket must contain more than one point, even though the 11 points are completely dissimilar and "should" all receive different hash values. We would like that to hold, but in practice it is very hard, so similarity can only be preserved to a degree. The root cause is that for similarity search the hash function typically maps high-dimensional data into a lower-dimensional space, because computing in the high-dimensional space is too expensive.

What does dimensionality reduction do to similarity? The number of dimensions reflects, to some extent, how much information the data carries; generally, more dimensions mean more information. For a person, a one-dimensional record (name) carries less information than a three-dimensional record (name, height, weight), and the more we know, the more accurately we can judge how similar two things are. So reducing dimensionality loses some information, and in the reduced space it is very hard to preserve the original similarities 100%, which is why we said "preserves similarity to some degree".

So on one hand we want to reduce the dimensionality, and on the other we do not want to lose information; that is impossible, so we compromise: accept a small loss in the accuracy of the similarity measure in exchange for lower dimensionality, which is usually worth it. How does LSH actually do this? In one sentence: it hashes the data from the original space into a new space such that data points that are similar in the original space are similar in the new space with high probability, while data points that are dissimilar in the original space are similar in the new space with low probability.

In practice some dimensionality reduction is usually performed before applying LSH. The overall pipeline is roughly as follows. First, the data points (raw data or extracted feature vectors) are arranged into a matrix. A first round of hash functions (several of them, drawn from some hash family) turns this matrix into a "Signature Matrix", which can be read directly as the dimensionality-reduced data. Then LSH hashes the Signature Matrix again, which determines the bucket each data point finally lands in. When a new data point arrives, say the feature vector of a web page for which we want to find similar pages, we hash that vector, see which bucket it falls into, and the pages already in that bucket are the candidate similar pages. This drastically reduces the number of pages to compare and greatly improves efficiency. Note the word "candidate": the bucket may also contain pages that are not similar to the newly hashed page, for the dimensionality-reduction reasons discussed above. The clever part of LSH is that the probability of this happening can be controlled, as described below.

Because the first-round hash functions differ depending on the similarity measure (there is no universal hash function), we cover two cases below: (1) using Jaccard similarity, and (2) using Cosine similarity. After the first step produces the Signature Matrix, the second step is the same in both cases.

First, the hash functions for the Jaccard similarity case.

Suppose there are four web pages (treated as documents), and the occurrence of terms in each page is represented by a 0/1 matrix: 1 means the corresponding term occurs in the document and 0 means it does not. There is no need to count how many times a term occurs, since similarity is measured with the Jaccard index.

The goal is now to find a hash function such that, after hashing, the Jaccard similarity between these documents is preserved as well as possible: if two documents have high Jaccard similarity, their hash values should be equal with high probability, and if their Jaccard similarity is low, their hash values should differ with high probability. This technique is called Min-Hashing.

How does Min-Hashing work? First generate a number of random permutations and use each one to permute the rows of the term-document matrix. The hash function maps a column C to the index of the first row, in the permuted order, whose entry in column C is 1.

For example, with three permutations: under the first permutation, the first row whose entry is 1 in column one is row 2; doing the same for columns two, three and four gives 1, 2 and 1. So under this permutation the four columns (four documents) are hashed to 2, 1, 2, 1, meaning each document is now one-dimensional. Applying the other two permutations and hashing again gives two more rows of values, so in the end each document has been reduced from 7 dimensions to 3. Comparing the pairwise similarities of the reduced documents with the original ones shows that the result is fairly accurate. A small sketch of this signature construction follows.
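
A minimal sketch of this construction (the 7x4 term-document matrix and the number of permutations below are illustrative, not the ones from the original figure):

```python
import random

# Build a MinHash signature matrix from a 0/1 term-document matrix.
M = [            # rows = terms, columns = documents
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]
n_rows, n_docs, n_perms = len(M), len(M[0]), 3

signature = []
for _ in range(n_perms):
    order = list(range(n_rows))
    random.shuffle(order)          # a random permutation of the rows
    row = []
    for c in range(n_docs):
        # signature value = position, in permuted order, of the first row with a 1 in column c
        row.append(next(pos for pos, r in enumerate(order) if M[r][c] == 1))
    signature.append(row)

print(signature)                   # each document is now represented by n_perms values
```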

Min-Hashing has a remarkable property: for two documents, the probability that their Min-Hash values are equal is exactly their Jaccard similarity before the reduction, $\Pr[h(C_1) = h(C_2)] = J(C_1, C_2)$. Here is a sketch of the proof.

Take a term x (a row of the matrix) whose position after the permutation is the first row containing a 1 when the two columns C1 and C2 are OR-ed together, i.e., $\pi(x) = \min(\pi(C_1 \cup C_2))$. Then x must occur in C1 (its entry in C1 is 1), or in C2, or in both: that 1 has to come from C1 or C2, so $x \in C_1 \cup C_2$.

The probability that x occurs in both C1 and C2 (both entries are 1) is the probability that x belongs to the intersection of C1 and C2. So the question becomes: given that x belongs to $C_1 \cup C_2$, what is the probability that it belongs to $C_1 \cap C_2$? That is simply the size of the intersection divided by the size of the union, which is exactly the Jaccard similarity: $\Pr[x \in C_1 \cap C_2 \mid x \in C_1 \cup C_2] = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|} = J(C_1, C_2)$.

Since the hash value of a column is precisely the position of the first 1 after the permutation, $h(C_1) = h(C_2)$ exactly when this minimizing term lies in both columns, and substituting gives $\Pr[h(C_1) = h(C_2)] = J(C_1, C_2)$.

This completes the proof, and it means we have found the hash function needed for the first step. Note again that this hash function only works for Jaccard similarity; there is no universal hash function. With these hash functions we can compute the Signature Matrix, and once we have the Signature Matrix we can run LSH on it.

For the LSH step, first split the Signature Matrix into bands, each band containing some number of rows. Then hash each band into buckets. The number of buckets should be large enough that two different bands are hashed into different buckets. Then, if at least one band of one document shares a bucket with the corresponding band of another document, the two documents form a candidate pair, i.e., they are very likely to be similar.

Let us work out the probabilities with an example. Suppose two documents are 80% similar, their columns in the Signature Matrix are C1 and C2, and the Signature Matrix is split into 20 bands of 5 rows each. The probability that a given band of C1 is identical to the corresponding band of C2 is 0.8^5 = 0.328, so the probability that C1 and C2 share no identical band across all 20 bands is (1 - 0.328)^20 = 0.00035. What does this 0.00035 mean? If the two documents are 80% similar, LSH declares them dissimilar with probability only 0.00035, a tiny probability.

Now consider another example where the two documents are only 30% similar, again with 20 bands of 5 rows each. The probability that a given band of C1 is identical to the corresponding band of C2 is 0.3^5 = 0.00243, and the probability that at least one band of C1 matches the corresponding band of C2 across the 20 bands is 1 - (1 - 0.00243)^20 = 0.0474. In other words, if the two documents are 30% similar, LSH declares them similar with probability only 0.0474, so they will almost never be considered similar. Quite remarkable.
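
A small numeric check of the two calculations above (20 bands of 5 rows each; 0.8 and 0.3 are the assumed column agreement rates):

```python
# Banding probabilities for the two worked examples.
bands, rows = 20, 5

for s in (0.8, 0.3):
    p_band = s ** rows                           # one band of C1 and C2 is identical
    p_candidate = 1 - (1 - p_band) ** bands      # at least one band collides
    print(s, round(p_band, 5), round(p_candidate, 5))
# s = 0.8: p_band ≈ 0.32768, p_candidate ≈ 0.99965 (missed with probability ≈ 0.00035)
# s = 0.3: p_band ≈ 0.00243, p_candidate ≈ 0.0474
```
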
Even better, these probabilities can be tuned by choosing the number of bands and the number of rows per band. Beyond that, they can also be controlled with the AND and OR constructions, which we will not expand on here.
        
That covers the core of LSH. To finish, a brief note on the first-step hash functions when the similarity measure is Cosine similarity. They use random hyperplanes: generate some random hyperplanes, and hash the point corresponding to a feature vector according to which side of each hyperplane it lies on.

The intuition: imagine two intersecting hyperplanes splitting the space into four regions. If the points of two feature vectors fall into the same region, the vectors are relatively close, the angle between them is small, and their Cosine similarity is high. Compare this with the Signature Matrix construction for Jaccard similarity, where three permutations were used; here the analogue is three random hyperplanes. For each column C (with Cosine similarity the entries need not be 0/1; a column is just a feature vector), determine on which side of each hyperplane the corresponding point lies, giving +1 or -1 per hyperplane. With three hyperplanes this yields three values, reducing the original 7-dimensional data to 3 dimensions, much like the Jaccard case. Once the Signature Matrix is obtained, LSH proceeds as before.