# Near-Duplicate Detection and Clustering over Massive Document Collections -- the Locality Sensitive Hash Algorithm

The clustering algorithm can be described as follows:
If the distance between the input and its nearest cluster is above a threshold, create a new cluster for the input.
Otherwise, add the input to that cluster and update the cluster center.
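
A minimal sketch of this single-pass rule is shown below. It assumes Euclidean distance and a running mean as the cluster center; neither choice is stated in the text, and the name `leader_cluster` is made up for illustration.

```python
import math

def leader_cluster(points, threshold):
    """Single pass: each point joins its nearest cluster or starts a new one."""
    centers = []  # running mean of each cluster (assumed center definition)
    counts = []   # number of points absorbed by each cluster
    labels = []   # cluster index assigned to each input point
    for p in points:
        best, best_dist = -1, math.inf
        for i, c in enumerate(centers):
            d = math.dist(p, c)  # Euclidean distance (an assumption)
            if d < best_dist:
                best, best_dist = i, d
        if best == -1 or best_dist > threshold:
            # No cluster is close enough: the input becomes a new cluster.
            centers.append(list(p))
            counts.append(1)
            labels.append(len(centers) - 1)
        else:
            # Join the nearest cluster and update its center incrementally.
            counts[best] += 1
            n = counts[best]
            centers[best] = [c + (x - c) / n for c, x in zip(centers[best], p)]
            labels.append(best)
    return labels, centers

labels, centers = leader_cluster([(0, 0), (0.1, 0), (5, 5), (5.1, 5)], threshold=1.0)
```

The inner loop compares the input against every existing center, which is the cost that LSH-style fingerprints are meant to avoid on large collections.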

Locality Sensitive Hash (LSH)

Charikar's simhash
Moses S. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In STOC ’02: Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 380–388, New York, NY, USA. ACM.

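How does LSH solve the KNN problem for high-dimensional data? We can refer to the paper "Detecting near-duplicates for web crawling", published by Google at WWW 2007. That paper sets out to find duplicate web pages, which is essentially the same problem as ours: both come down to using LSH to solve the KNN problem.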

We experimentally validate that for a repository of 8 billion webpages, 64-bit simhash fingerprints and k = 3 are reasonable.

Charikar's simhash is a dimensionality reduction technique. It maps high-dimensional vectors to small-sized fingerprints.

Computation:
Given a set of features extracted from a document and their corresponding weights, we use simhash to generate an f-bit fingerprint as follows.
We maintain an f-dimensional vector V, each of whose dimensions is initialized to zero.
A feature is hashed into an f-bit hash value.
These f bits (unique to the feature) increment/decrement the f components of the vector by the weight of that feature as follows:
- if the i-th bit of the hash value is 1, the i-th component of V is incremented by the weight of that feature;
- if the i-th bit of the hash value is 0, the i-th component of V is decremented by the weight of that feature.
When all features have been processed, some components of V are positive while others are negative. The signs of components determine the corresponding bits of the final fingerprint.
For our system, we used the original C++ implementation of simhash, done by Moses Charikar himself.
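
The computation above can be sketched in a few lines of Python. This is not the C++ implementation mentioned in the text: the per-feature hash (SHA-256 truncated to f bits), the `{feature: weight}` input format, and the convention that a positive component yields a 1 bit are all assumptions made for illustration.

```python
import hashlib

def simhash(weighted_features, f=64):
    """Compute an f-bit simhash fingerprint from a {feature: weight} mapping (f <= 256)."""
    v = [0.0] * f  # the f-dimensional vector V, initialized to zero
    for feature, weight in weighted_features.items():
        # Hash the feature to f bits (SHA-256 is an arbitrary stand-in hash).
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & ((1 << f) - 1)
        for i in range(f):
            if (h >> i) & 1:
                v[i] += weight  # bit i is 1: increment component i
            else:
                v[i] -= weight  # bit i is 0: decrement component i
    # The sign of each component decides the corresponding fingerprint bit.
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

doc = {"locality": 2.0, "sensitive": 1.0, "hash": 3.0}
fp = simhash(doc)
```

Because each bit is decided by a weighted vote, changing a few low-weight features usually flips only a few bits, which is why similar documents end up with fingerprints at small Hamming distance.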

Definition: Given a collection of f-bit fingerprints and a query fingerprint F, identify whether an existing fingerprint differs from F in at most k bits.
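
As a baseline, the definition can be written as a direct linear scan. The function names here are hypothetical, and the scan is only a reference point: it touches every stored fingerprint, which is impractical at the scale discussed in this post.

```python
def hamming_distance(a, b):
    """Number of bit positions in which the integers a and b differ."""
    return bin(a ^ b).count("1")

def has_near_duplicate(fingerprints, query, k=3):
    """True if some stored fingerprint differs from query in at most k bits."""
    return any(hamming_distance(fp, query) <= k for fp in fingerprints)

stored = [0b10110000, 0b00001111]
print(has_near_duplicate(stored, 0b10110011, k=3))
```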

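Approach 1: build a sorted table of all existing fingerprints, then probe it with every fingerprint F' whose Hamming distance from F is at most k. The number of probes is prohibitive: for 64-bit fingerprints and k = 3, we need C(64,3) = 41664 probes.

Approach 2: precompute every F' that lies within Hamming distance k of some existing fingerprint, so that one lookup per query suffices. Here the storage is prohibitive instead, since the table can grow by a factor of up to 41664.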
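
The probe count for the sorted-table approach can be checked directly, and the probing itself is easy to sketch. In the snippet below, `probes_within` and `naive_lookup` are illustrative names. Note that C(64,3) counts the probes that flip exactly 3 bits; including 0, 1 and 2 flips brings the total to 43745.

```python
import bisect
import math
from itertools import combinations

def probes_within(query, k, f=64):
    """Yield every f-bit value whose Hamming distance from query is at most k."""
    for r in range(k + 1):
        for positions in combinations(range(f), r):
            probe = query
            for pos in positions:
                probe ^= 1 << pos  # flip one of the chosen bit positions
            yield probe

def naive_lookup(sorted_table, query, k=3, f=64):
    """Binary-search the sorted table once per probe."""
    for probe in probes_within(query, k, f):
        i = bisect.bisect_left(sorted_table, probe)
        if i < len(sorted_table) and sorted_table[i] == probe:
            return True
    return False

print(math.comb(64, 3))                               # 41664
print(sum(math.comb(64, r) for r in range(4)))        # 43745
```

Tens of thousands of binary searches per query is what makes this approach too slow for an online crawler, even though the table itself stays small.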

We now develop a practical algorithm that lies in between the two approaches outlined above: it is possible to solve the problem with a small number of probes and by duplicating the table of fingerprints by a small factor.

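Split the 64 bits into 6 blocks of 11, 11, 11, 11, 10 and 10 bits. There are C(6,3) = 20 ways to choose 3 of the 6 blocks. For each choice, a permutation π moves the bits of the chosen blocks into the most significant positions. The value of d′ is the total number of bits in the chosen blocks, so d′ = 31, 32 or 33 (close to d). On average, each probe returns at most 2^(34−31) = 8 permuted fingerprints, so in practice the number of candidates to check per probe should be small.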
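
The 20-table scheme can be sketched as follows. This is a simplified illustration, not the paper's implementation: instead of storing permuted fingerprints in sorted tables and binary-searching on the top d′ bits, each table is a Python dict keyed by the bits of the chosen blocks. The class and function names are made up. The scheme works because 3 differing bits can touch at most 3 of the 6 blocks, so at least 3 blocks match exactly in some table.

```python
from collections import defaultdict
from itertools import combinations

BLOCK_SIZES = [11, 11, 11, 11, 10, 10]  # 64 bits in total

def block_spans(sizes=BLOCK_SIZES):
    """Return (shift, width) per block, most significant block first."""
    total, used, spans = sum(sizes), 0, []
    for width in sizes:
        used += width
        spans.append((total - used, width))
    return spans

def prefix_key(fp, chosen, spans):
    """Concatenate the bits of the chosen blocks (the top d' bits after π)."""
    key = 0
    for b in chosen:
        shift, width = spans[b]
        key = (key << width) | ((fp >> shift) & ((1 << width) - 1))
    return key

class BlockTables:
    def __init__(self, k=3, sizes=BLOCK_SIZES):
        self.k = k
        self.spans = block_spans(sizes)
        # Choose the blocks that must match exactly: 6 - 3 = 3 blocks, 20 ways.
        self.choices = list(combinations(range(len(sizes)), len(sizes) - k))
        self.tables = [defaultdict(list) for _ in self.choices]

    def add(self, fp):
        for chosen, table in zip(self.choices, self.tables):
            table[prefix_key(fp, chosen, self.spans)].append(fp)

    def query(self, fp):
        """Return stored fingerprints within Hamming distance k of fp."""
        found = set()
        for chosen, table in zip(self.choices, self.tables):
            for cand in table.get(prefix_key(fp, chosen, self.spans), ()):
                if bin(cand ^ fp).count("1") <= self.k:
                    found.add(cand)
        return found

tables = BlockTables()
tables.add(0x0123456789ABCDEF)
near = 0x0123456789ABCDEF ^ (1 << 0) ^ (1 << 30) ^ (1 << 63)  # 3 bits flipped
print(tables.query(near) == {0x0123456789ABCDEF})
```

Each query costs 20 lookups plus a Hamming check on the few candidates returned, in exchange for storing every fingerprint 20 times.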
