minhash算法 java,为LSH Minhash算法生成随机哈希函数

文章探讨了如何通过创建两个基础哈希函数h1和h2,然后利用公式gi=h1(x)+i*h2(x)生成多个哈希函数,以提高效率。这种方法减少了代码复杂性,且对Bloom Filter的性能没有负面影响。该方法为生成大量随机哈希函数提供了一种简洁的替代方案。
摘要由CSDN通过智能技术生成

I'm programming a minhashing algorithm in Java that requires me to generate an arbitrary number of random hash functions (240 hash functions in my case), and run any number of integers through it (2000 at the moment).

In order to do that, I've been generating random numbers a, b, and c (from the range 1 - 2001) for each of the 240 hash functions. Then, my hash function returns h = ((a*x) + b) % c, where h is the return value and x is one of the integers run through it.

Is this an efficient implementation of random hashing, or is there a more common/acceptable way to do it?

This post was asking a similar question, but I'm still somewhat confused by the wording of the answer: Minhash implementation how to find hash functions for permutations

解决方案

When I was working with Bloom filters a few years ago, I ran across an article that describes how to generate multiple hash functions very simply, with a minimum of code. The method he describes works very well. See Less Hashing, Same Performance: Building a Better Bloom Filter.

The basic idea is to create two hash functions, call them h1 and h2, with which you can then simulate multiple hash functions, g1 through gk, using the formula:

gi = h1(x) + i*h2(x)

i varies from 1 to k (the number of hash functions you want).

The paper is well worth reading, even if you decide not to implement his idea. Although after reading it I can't imagine not wanting to implement it. It made my Bloom filter code a whole lot more tractable and didn't negatively impact performance.

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值