[spark-hash学习]minhash算法实现细节

本文详细介绍了Spark-Hash库中Minhash算法的实现步骤,包括多次minhash计算,将签名分块,按顺序排列并进行哈希分组,以及可选的过滤操作,以提高相似向量的检测效率。
摘要由CSDN通过智能技术生成

Implementation Details

Implementation of LSH follows the rough steps

  1. minhash each vector some number of times. The number of times to hash is an input parameter. The hashing function is defined in com.invincea.spark.hash.Hasher. Essentially each element of the input vector is hashed and the minimum hash value for the vector is returned. Minhashing produces a set of signatures for each vector.
  2. Chop up each vector's minhash signatures into bands where each band contains an equal number of signatures. Bands with a greater number of signatures will produce clusters withgreater similarity. A greater number of bands will increase the probabilty that similar vector signatures hash to the same value.
  3. Order each of the vector bands such that for each band the vector's data for that band are grouped tog
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值