Implementation Details
The LSH implementation roughly follows these steps:
- Minhash each vector some number of times. The number of hashes is an input parameter. The hashing function is defined in com.invincea.spark.hash.Hasher: each element of the input vector is hashed, and the minimum hash value for the vector is returned. Minhashing produces a set of signatures for each vector.
- Chop up each vector's minhash signatures into bands, where each band contains an equal number of signatures. Bands with a greater number of signatures will produce clusters with greater similarity. A greater number of bands will increase the probability that similar vector signatures hash to the same value.
- Order each of the vector bands such that for each band the vector's data for that band are grouped tog
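
The minhashing and banding steps can be illustrated with a small plain-Python sketch. This is not the code in com.invincea.spark.hash.Hasher: the linear hash family, the prime modulus, the representation of a vector as a set of non-zero indices, and all function names below are assumptions made for illustration only.

```python
import random

# Large prime modulus for the hash family (an assumption for this sketch).
PRIME = 2_147_483_647


def make_hashers(num_hashes, seed=0):
    """Build num_hashes random (a, b) pairs, one pair per hash function."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(nonzero_indices, hashers):
    """Hash every element with each hash function and keep the minimum.

    Assumes the vector is non-empty and is given as a collection of
    integer indices (a hypothetical representation).
    """
    return [min((a * x + b) % PRIME for x in nonzero_indices)
            for a, b in hashers]


def split_into_bands(signature, num_bands):
    """Chop a signature into num_bands bands of equal length."""
    if len(signature) % num_bands != 0:
        raise ValueError("signature length must be divisible by num_bands")
    rows = len(signature) // num_bands
    return [tuple(signature[i * rows:(i + 1) * rows])
            for i in range(num_bands)]


def collision_probability(similarity, rows_per_band, num_bands):
    """Standard LSH estimate: chance that two vectors with the given
    Jaccard similarity agree on every signature in at least one band."""
    return 1.0 - (1.0 - similarity ** rows_per_band) ** num_bands


if __name__ == "__main__":
    hashers = make_hashers(num_hashes=6, seed=1)
    signature = minhash_signature({1, 5, 9}, hashers)
    for band_number, band in enumerate(split_into_bands(signature, 3)):
        print(band_number, band)
    print(collision_probability(0.8, rows_per_band=2, num_bands=3))
```

The collision_probability helper reflects the trade-off described in the steps: more signatures per band make a match stricter, while more bands give similar vectors more chances to share a band.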