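Introduction
The Bloom filter was invented by Burton Bloom in 1970.
Why use a Bloom filter
Suppose there are 1 billion URLs of 64 bytes each. Storing them in a hash table that resolves collisions by chaining (linked lists) takes at least 100 GB. A Bloom filter backed by a bitmap of 10 billion bits needs only about 1.2 GB, so the memory advantage is clear. In addition, a Bloom filter is compute-intensive, whereas a hash table is memory-intensive.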
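The figures above can be checked with a little arithmetic. This is only a back-of-the-envelope sketch: it counts raw key bytes and bitmap bits, and treats the per-entry pointer and table overhead of the hash table as unquantified.

```latex
\begin{align*}
\text{raw URL data} &= 10^{9} \times 64\ \text{B} = 6.4 \times 10^{10}\ \text{B} = 64\ \text{GB} \\
\text{bitmap}       &= \frac{10^{10}\ \text{bits}}{8\ \text{bits/B}} = 1.25 \times 10^{9}\ \text{B} \approx 1.2\ \text{GB} \\
\text{bits per URL} &= \frac{10^{10}}{10^{9}} = 10
\end{align*}
```

The hash table must hold the 64 GB of keys plus linked-list pointers and bucket slots, which is how the total reaches the order of 100 GB, while the bitmap stores no keys at all.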
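Algorithm
A negative answer is always correct; a positive answer may be wrong ("false" is always false, "true" may be false).
1. Allocate a bit array of N bits, all initially 0.
2. Map each object to k values using k hash functions.
3. Reduce each of the k values to an index into the array. To insert, set those k bits to 1; to query, check whether all k bits are 1.
Time complexity: O(k) per insert or query.
See the appendix for details.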
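The three steps above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the Guava implementation: the class name `SimpleBloomFilter` is hypothetical, keys are limited to `String`, the k "hash functions" are simulated by mixing `String.hashCode()` with a per-function seed, and the class is not thread-safe.

```java
import java.util.BitSet;

/** Minimal, non-thread-safe Bloom filter sketch for String keys (illustrative only). */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashFunctions;

    SimpleBloomFilter(int numBits, int numHashFunctions) {
        if (numBits <= 0 || numHashFunctions <= 0) {
            throw new IllegalArgumentException("numBits and numHashFunctions must be > 0");
        }
        this.bits = new BitSet(numBits); // step 1: N bits, all 0
        this.numBits = numBits;
        this.numHashFunctions = numHashFunctions;
    }

    /** Insert: set the k bits selected for the key. */
    void put(String key) {
        for (int i = 0; i < numHashFunctions; i++) {
            bits.set(index(key, i));
        }
    }

    /** Query: false means "definitely absent", true means "possibly present". */
    boolean mightContain(String key) {
        for (int i = 0; i < numHashFunctions; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Steps 2 and 3: the i-th hash value of the key, reduced to an index in [0, numBits).
    // Assumption: a seeded 64-bit mixer stands in for k independent hash functions.
    private int index(String key, int i) {
        long h = mix(key.hashCode() + i * 0x9E3779B97F4A7C15L);
        return (int) Math.floorMod(h, (long) numBits);
    }

    // SplitMix64-style finalizer, used here only to spread the bits of the input.
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }
}
```

Both `put` and `mightContain` touch exactly k bits, which is where the O(k) cost comes from. A key that was added always maps to bits that are already set, so the filter never reports a false negative.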
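Key points
- Number of buckets (bits) per element
- Hash functions: their number (for m bits and n elements, more functions lower the false-positive rate only up to the optimum k = (m/n) ln 2; beyond that the bit array fills up faster and the rate rises again) and their quality (independent, uniformly distributed, simple, and fast to compute)
- False positive: an element that is not in the set is wrongly reported as present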
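How bits per element, the number of hash functions, and the false-positive rate relate can be made concrete with the standard formulas, where m is the number of bits, n the number of expected elements, k the number of hash functions, and p the target false-positive rate. The sketch below is illustrative: the class name `BloomSizing` and the method `falsePositiveRate` are hypothetical, while the first two methods follow the formulas used in the Guava source quoted later in this document.

```java
/** Sizing formulas for a Bloom filter (illustrative sketch). */
final class BloomSizing {
    private BloomSizing() {}

    /** m = -n * ln(p) / (ln 2)^2: bits needed for n elements at false-positive rate p. */
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** k = (m / n) * ln 2, rounded, and at least 1. */
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /** p = (1 - e^(-k*n/m))^k: expected false-positive rate after n insertions. */
    static double falsePositiveRate(long n, long m, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }
}
```

For example, `optimalNumOfBits(1000, 0.03)` gives 7298 bits (about 7.3 bits per element) and `optimalNumOfHashFunctions(1000, 7298)` gives 5. Calling `falsePositiveRate(1000, 7298, 20)` returns a noticeably higher rate than with k = 5, which shows that adding hash functions past the optimum hurts rather than helps.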
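Use cases
Testing whether an element is a member of a set
Characteristics
- If the Bloom filter reports that an element is not in the set, it is definitely not in the set
- If the Bloom filter reports that an element is in the set, it may still not be in the set
Because a very small amount of memory can answer membership queries over a very large data set, it is often used for
1. Cache systems
2. URL deduplication
3. Counting a site's daily unique visitors (UV)
4. Phone-number blacklists
5. Spam filtering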
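In most of these uses the filter sits in front of a slower, exact store and is consulted first. The sketch below illustrates that pattern for a cache-style lookup. Everything in it is hypothetical: the class `FilteredLookup`, the `Predicate<String>` standing in for a Bloom filter's membership query, and the `Map` standing in for the slow store.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

/** Guards an expensive lookup with an approximate membership test (illustrative sketch). */
final class FilteredLookup {
    private final Predicate<String> mightContain; // stand-in for a Bloom filter query
    private final Map<String, String> slowStore;  // stand-in for a database or remote cache
    private int slowLookups = 0;

    FilteredLookup(Predicate<String> mightContain, Map<String, String> slowStore) {
        this.mightContain = mightContain;
        this.slowStore = slowStore;
    }

    Optional<String> get(String key) {
        if (!mightContain.test(key)) {
            // "Not in the set" is always correct, so the slow store can be skipped.
            return Optional.empty();
        }
        // "In the set" may be a false positive, so the exact store has the final say.
        slowLookups++;
        return Optional.ofNullable(slowStore.get(key));
    }

    /** Number of times the slow store was actually consulted. */
    int slowLookups() {
        return slowLookups;
    }
}
```

A false positive costs only one wasted trip to the slow store, while every true negative saves one. That asymmetry is why a small false-positive rate is acceptable in these scenarios.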
Source code analysis
Taking Guava's BloomFilter as an example
public final class BloomFilter<T> implements Predicate<T>, Serializable {
/**
* A strategy to translate T instances, to {@code numHashFunctions} bit indexes.
*
* <p>Implementations should be collections of pure functions (i.e. stateless).
*/
interface Strategy extends java.io.Serializable {
/**
* Sets {@code numHashFunctions} bits of the given bit array, by hashing a user element.
*
* <p>Returns whether any bits changed as a result of this operation.
*/
<T> boolean put(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits);
/**
* Queries {@code numHashFunctions} bits of the given bit array, by hashing a user element;
* returns {@code true} if and only if all selected bits are set.
*/
<T> boolean mightContain(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits);
/**
* Identifier used to encode this strategy, when marshalled as part of a BloomFilter. Only
* values in the [-128, 127] range are valid for the compact serial form. Non-negative values
* are reserved for enums defined in BloomFilterStrategies; negative values are reserved for any
* custom, stateful strategy we may define (e.g. any kind of strategy that would depend on user
* input).
*/
int ordinal();
}
/** The bit set of the BloomFilter (not necessarily power of 2!) */
private final LockFreeBitArray bits;
/** Number of hashes per element */
private final int numHashFunctions;
/** The funnel to translate Ts to bytes */
private final Funnel<? super T> funnel;
/** The strategy we employ to map an element T to {@code numHashFunctions} bit indexes. */
private final Strategy strategy;
/** Creates a BloomFilter. */
private BloomFilter(
LockFreeBitArray bits, int numHashFunctions, Funnel<? super T> funnel, Strategy strategy) {
checkArgument(numHashFunctions > 0, "numHashFunctions (%s) must be > 0", numHashFunctions);
checkArgument(
numHashFunctions <= 255, "numHashFunctions (%s) must be <= 255", numHashFunctions);
this.bits = checkNotNull(bits);
this.numHashFunctions = numHashFunctions;
this.funnel = checkNotNull(funnel);
this.strategy = checkNotNull(strategy);
}
/**
* Returns {@code true} if the element <i>might</i> have been put in this Bloom filter, {@code
* false} if this is <i>definitely</i> not the case.
*/
public boolean mightContain(T object) {
return strategy.mightContain(object, funnel, numHashFunctions, bits);
}
/**
* Puts an element into this {@code BloomFilter}. Ensures that subsequent invocations of {@link
* #mightContain(Object)} with the same element will always return {@code true}.
*
* @return true if the Bloom filter's bits changed as a result of this operation. If the bits
* changed, this is <i>definitely</i> the first time {@code object} has been added to the
* filter. If the bits haven't changed, this <i>might</i> be the first time {@code object} has
* been added to the filter. Note that {@code put(t)} always returns the <i>opposite</i>
* result to what {@code mightContain(t)} would have returned at the time it is called.
* @since 12.0 (present in 11.0 with {@code void} return type})
*/
public boolean put(T object) {
return strategy.put(object, funnel, numHashFunctions, bits);
}
/**
* Returns the probability that {@linkplain #mightContain(Object)} will erroneously return {@code
* true} for an object that has not actually been put in the {@code BloomFilter}.
*
* <p>Ideally, this number should be close to the {@code fpp} parameter passed in {@linkplain
* #create(Funnel, int, double)}, or smaller. If it is significantly higher, it is usually the
* case that too many elements (more than expected) have been put in the {@code BloomFilter},
* degenerating it.
*
* @since 14.0 (since 11.0 as expectedFalsePositiveProbability())
*/
public double expectedFpp() {
// You down with FPP? (Yeah you know me!) Who's down with FPP? (Every last homie!)
return Math.pow((double) bits.bitCount() / bitSize(), numHashFunctions);
}
/**
* Returns an estimate for the total number of distinct elements that have been added to this
* Bloom filter. This approximation is reasonably accurate if it does not exceed the value of
* {@code expectedInsertions} that was used when constructing the filter.
*
* @since 22.0
*/
public long approximateElementCount() {
long bitSize = bits.bitSize();
long bitCount = bits.bitCount();
/**
* Each insertion is expected to reduce the # of clear bits by a factor of
* `numHashFunctions/bitSize`. So, after n insertions, expected bitCount is `bitSize * (1 - (1 -
* numHashFunctions/bitSize)^n)`. Solving that for n, and approximating `ln x` as `x - 1` when x
* is close to 1 (why?), gives the following formula.
*/
double fractionOfBitsSet = (double) bitCount / bitSize;
return DoubleMath.roundToLong(
-Math.log1p(-fractionOfBitsSet) * bitSize / numHashFunctions, RoundingMode.HALF_UP);
}
/** Returns the number of bits in the underlying bit array. */
@VisibleForTesting
long bitSize() {
return bits.bitSize();
}
@Override
public boolean equals(@Nullable Object object) {
if (object == this) {
return true;
}
if (object instanceof BloomFilter) {
BloomFilter<?> that = (BloomFilter<?>) object;
return this.numHashFunctions == that.numHashFunctions
&& this.funnel.equals(that.funnel)
&& this.bits.equals(that.bits)
&& this.strategy.equals(that.strategy);
}
return false;
}
@Override
public int hashCode() {
return Objects.hashCode(numHashFunctions, funnel, strategy, bits);
}
/**
* Returns a {@code Collector} expecting the specified number of insertions, and yielding a {@link
* BloomFilter} with false positive probability 3%.
*
* <p>Note that if the {@code Collector} receives significantly more elements than specified, the
* resulting {@code BloomFilter} will suffer a sharp deterioration of its false positive
* probability.
*
* <p>The constructed {@code BloomFilter} will be serializable if the provided {@code Funnel<T>}
* is.
*
* <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
* ensuring proper serialization and deserialization, which is important since {@link #equals}
* also relies on object identity of funnels.
*
* @param funnel the funnel of T's that the constructed {@code BloomFilter} will use
* @param expectedInsertions the number of expected insertions to the constructed {@code
* BloomFilter}; must be positive
* @return a {@code Collector} generating a {@code BloomFilter} of the received elements
* @since 23.0
*/
public static <T> Collector<T, ?, BloomFilter<T>> toBloomFilter(
Funnel<? super T> funnel, long expectedInsertions) {
return toBloomFilter(funnel, expectedInsertions, 0.03);
}
/**
* Returns a {@code Collector} expecting the specified number of insertions, and yielding a {@link
* BloomFilter} with the specified expected false positive probability.
*
* <p>Note that if the {@code Collector} receives significantly more elements than specified, the
* resulting {@code BloomFilter} will suffer a sharp deterioration of its false positive
* probability.
*
* <p>The constructed {@code BloomFilter} will be serializable if the provided {@code Funnel<T>}
* is.
*
* <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
* ensuring proper serialization and deserialization, which is important since {@link #equals}
* also relies on object identity of funnels.
*
* @param funnel the funnel of T's that the constructed {@code BloomFilter} will use
* @param expectedInsertions the number of expected insertions to the constructed {@code
* BloomFilter}; must be positive
* @param fpp the desired false positive probability (must be positive and less than 1.0)
* @return a {@code Collector} generating a {@code BloomFilter} of the received elements
* @since 23.0
*/
public static <T> Collector<T, ?, BloomFilter<T>> toBloomFilter(
Funnel<? super T> funnel, long expectedInsertions, double fpp) {
checkNotNull(funnel);
checkArgument(
expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions);
checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);
checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);
return Collector.of(
() -> BloomFilter.create(funnel, expectedInsertions, fpp),
BloomFilter::put,
(bf1, bf2) -> {
bf1.putAll(bf2);
return bf1;
},
Collector.Characteristics.UNORDERED,
Collector.Characteristics.CONCURRENT);
}
/**
* Creates a {@link BloomFilter} with the expected number of insertions and expected false
* positive probability.
*
* <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,
* will result in its saturation, and a sharp deterioration of its false positive probability.
*
* <p>The constructed {@code BloomFilter} will be serializable if the provided {@code Funnel<T>}
* is.
*
* <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
* ensuring proper serialization and deserialization, which is important since {@link #equals}
* also relies on object identity of funnels.
*
* @param funnel the funnel of T's that the constructed {@code BloomFilter} will use
* @param expectedInsertions the number of expected insertions to the constructed {@code
* BloomFilter}; must be positive
* @param fpp the desired false positive probability (must be positive and less than 1.0)
* @return a {@code BloomFilter}
*/
public static <T> BloomFilter<T> create(
Funnel<? super T> funnel, int expectedInsertions, double fpp) {
return create(funnel, (long) expectedInsertions, fpp);
}
/**
* Creates a {@link BloomFilter} with the expected number of insertions and expected false
* positive probability.
*
* <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,
* will result in its saturation, and a sharp deterioration of its false positive probability.
*
* <p>The constructed {@code BloomFilter} will be serializable if the provided {@code Funnel<T>}
* is.
*
* <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
* ensuring proper serialization and deserialization, which is important since {@link #equals}
* also relies on object identity of funnels.
*
* @param funnel the funnel of T's that the constructed {@code BloomFilter} will use
* @param expectedInsertions the number of expected insertions to the constructed {@code
* BloomFilter}; must be positive
* @param fpp the desired false positive probability (must be positive and less than 1.0)
* @return a {@code BloomFilter}
* @since 19.0
*/
public static <T> BloomFilter<T> create(
Funnel<? super T> funnel, long expectedInsertions, double fpp) {
return create(funnel, expectedInsertions, fpp, BloomFilterStrategies.MURMUR128_MITZ_64);
}
@VisibleForTesting
static <T> BloomFilter<T> create(
Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
checkNotNull(funnel);
checkArgument(
expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions);
checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);
checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);
checkNotNull(strategy);
if (expectedInsertions == 0) {
expectedInsertions = 1;
}
/*
* TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size
* is proportional to -log(p), but there is not much of a point after all, e.g.
* optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares!
*/
long numBits = optimalNumOfBits(expectedInsertions, fpp);
int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
try {
return new BloomFilter<T>(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy);
} catch (IllegalArgumentException e) {
throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
}
}
/**
* Creates a {@link BloomFilter} with the expected number of insertions and a default expected
* false positive probability of 3%.
*
* <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,
* will result in its saturation, and a sharp deterioration of its false positive probability.
*
* <p>The constructed {@code BloomFilter} will be serializable if the provided {@code Funnel<T>}
* is.
*
* <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
* ensuring proper serialization and deserialization, which is important since {@link #equals}
* also relies on object identity of funnels.
*
* @param funnel the funnel of T's that the constructed {@code BloomFilter} will use
* @param expectedInsertions the number of expected insertions to the constructed {@code
* BloomFilter}; must be positive
* @return a {@code BloomFilter}
*/
public static <T> BloomFilter<T> create(Funnel<? super T> funnel, int expectedInsertions) {
return create(funnel, (long) expectedInsertions);
}
/**
* Creates a {@link BloomFilter} with the expected number of insertions and a default expected
* false positive probability of 3%.
*
* <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,
* will result in its saturation, and a sharp deterioration of its false positive probability.
*
* <p>The constructed {@code BloomFilter} will be serializable if the provided {@code Funnel<T>}
* is.
*
* <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
* ensuring proper serialization and deserialization, which is important since {@link #equals}
* also relies on object identity of funnels.
*
* @param funnel the funnel of T's that the constructed {@code BloomFilter} will use
* @param expectedInsertions the number of expected insertions to the constructed {@code
* BloomFilter}; must be positive
* @return a {@code BloomFilter}
* @since 19.0
*/
public static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions) {
return create(funnel, expectedInsertions, 0.03); // FYI, for 3%, we always get 5 hash functions
}
// Cheat sheet:
//
// m: total bits
// n: expected insertions
// b: m/n, bits per insertion
// p: expected false positive probability
//
// 1) Optimal k = b * ln2
// 2) p = (1 - e ^ (-kn/m))^k
// 3) For optimal k: p = 2 ^ (-k) ~= 0.6185^b
// 4) For optimal k: m = -nlnp / ((ln2) ^ 2)
/**
* Computes the optimal k (number of hashes per element inserted in Bloom filter), given the
* expected insertions and total number of bits in the Bloom filter.
*
* <p>See http://en.wikipedia.org/wiki/File:Bloom_filter_fp_probability.svg for the formula.
*
* @param n expected insertions (must be positive)
* @param m total number of bits in Bloom filter (must be positive)
*/
static int optimalNumOfHashFunctions(long n, long m) {
// (m / n) * log(2), but avoid truncation due to division!
return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}
/**
* Computes m (total bits of Bloom filter) which is expected to achieve, for the specified
* expected insertions, the required false positive probability.
*
* <p>See http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives for the
* formula.
*
* @param n expected insertions (must be positive)
* @param p false positive rate (must be 0 < p < 1)
*/
static long optimalNumOfBits(long n, double p) {
if (p == 0) {
p = Double.MIN_VALUE;
}
return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}
}
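The two estimates in the class above, `expectedFpp()` and `approximateElementCount()`, both depend only on the fraction of bits that are set. With m bits and k hash functions, a given bit is still 0 after n insertions with probability about `e^(-k*n/m)`, so the set fraction is about `1 - e^(-k*n/m)`; solving for n gives `n ≈ -(m/k) * ln(1 - setFraction)`. The sketch below restates both formulas as plain static functions so they can be tried in isolation. The class name `BloomEstimates` is hypothetical, and `Math.round` is used in place of Guava's `DoubleMath.roundToLong`.

```java
/** Stand-alone versions of the two estimates, driven by bit counts (illustrative sketch). */
final class BloomEstimates {
    private BloomEstimates() {}

    /** Probability that all k probed bits are set for a key that was never added. */
    static double expectedFpp(long bitCount, long bitSize, int numHashFunctions) {
        return Math.pow((double) bitCount / bitSize, numHashFunctions);
    }

    /** Estimated number of distinct insertions: -(m / k) * ln(1 - bitCount / m). */
    static long approximateElementCount(long bitCount, long bitSize, int numHashFunctions) {
        double fractionOfBitsSet = (double) bitCount / bitSize;
        // log1p(-x) computes ln(1 - x) accurately even when x is small.
        return Math.round(-Math.log1p(-fractionOfBitsSet) * bitSize / numHashFunctions);
    }
}
```

For instance, with 100,000 bits, 5 hash functions, and 39,347 bits set, `approximateElementCount` returns 10000 and `expectedFpp` is a little under 1%.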
Filter strategies
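The strategies in the source that follows do not run k separate hash functions. They compute one 128-bit hash, split it into two base values, and derive the k bit indexes as `hash1 + i * hash2`. The sketch below isolates that index derivation in the style of `MURMUR128_MITZ_64`. It is a simplified illustration: the class name `DoubleHashing` is hypothetical, and the two base hashes are passed in directly instead of being taken from a real hash function.

```java
/** Derives k bit indexes from two base hashes (illustrative sketch). */
final class DoubleHashing {
    private DoubleHashing() {}

    /** Returns k indexes in [0, bitSize), one per simulated hash function. */
    static long[] indexes(long hash1, long hash2, int k, long bitSize) {
        long[] result = new long[k];
        long combinedHash = hash1;
        for (int i = 0; i < k; i++) {
            // Clear the sign bit so the value is non-negative, then reduce modulo bitSize.
            result[i] = (combinedHash & Long.MAX_VALUE) % bitSize;
            // Adding hash2 each round yields hash1 + i * hash2 without a multiplication;
            // arithmetic overflow simply wraps around.
            combinedHash += hash2;
        }
        return result;
    }
}
```

With `hash1 = 5`, `hash2 = 3`, `k = 4`, and `bitSize = 10`, the combined values are 5, 8, 11, 14, so the indexes are 5, 8, 1, 4.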
/**
* Collections of strategies of generating the k * log(M) bits required for an element to be mapped
* to a BloomFilter of M bits and k hash functions. These strategies are part of the serialized form
* of the Bloom filters that use them, thus they must be preserved as is (no updates allowed, only
* introduction of new versions).
*
* <p>Important: the order of the constants cannot change, and they cannot be deleted - we depend on
* their ordinal for BloomFilter serialization.
*
* @author Dimitris Andreou
* @author Kurt Alfred Kluever
*/
enum BloomFilterStrategies implements BloomFilter.Strategy {
/**
* See "Less Hashing, Same Performance: Building a Better Bloom Filter" by Adam Kirsch and Michael
* Mitzenmacher. The paper argues that this trick doesn't significantly deteriorate the
* performance of a Bloom filter (yet only needs two 32bit hash functions).
*/
MURMUR128_MITZ_32() {
@Override
public <T> boolean put(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
long bitSize = bits.bitSize();
long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
int hash1 = (int) hash64;
int hash2 = (int) (hash64 >>> 32);
boolean bitsChanged = false;
for (int i = 1; i <= numHashFunctions; i++) {
int combinedHash = hash1 + (i * hash2);
// Flip all the bits if it's negative (guaranteed positive number)
if (combinedHash < 0) {
combinedHash = ~combinedHash;
}
bitsChanged |= bits.set(combinedHash % bitSize);
}
return bitsChanged;
}
@Override
public <T> boolean mightContain(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
long bitSize = bits.bitSize();
long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
int hash1 = (int) hash64;
int hash2 = (int) (hash64 >>> 32);
for (int i = 1; i <= numHashFunctions; i++) {
int combinedHash = hash1 + (i * hash2);
// Flip all the bits if it's negative (guaranteed positive number)
if (combinedHash < 0) {
combinedHash = ~combinedHash;
}
if (!bits.get(combinedHash % bitSize)) {
return false;
}
}
return true;
}
},
/**
* This strategy uses all 128 bits of {@link Hashing#murmur3_128} when hashing. It looks different
* than the implementation in MURMUR128_MITZ_32 because we're avoiding the multiplication in the
* loop and doing a (much simpler) += hash2. We're also changing the index to a positive number by
* AND'ing with Long.MAX_VALUE instead of flipping the bits.
*/
MURMUR128_MITZ_64() {
@Override
public <T> boolean put(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
long bitSize = bits.bitSize();
byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
boolean bitsChanged = false;
long combinedHash = hash1;
for (int i = 0; i < numHashFunctions; i++) {
// Make the combined hash positive and indexable
bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
combinedHash += hash2;
}
return bitsChanged;
}
@Override
public <T> boolean mightContain(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
long bitSize = bits.bitSize();
byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
long combinedHash = hash1;
for (int i = 0; i < numHashFunctions; i++) {
// Make the combined hash positive and indexable
if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
return false;
}
combinedHash += hash2;
}
return true;
}
private /* static */ long lowerEight(byte[] bytes) {
return Longs.fromBytes(
bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
}
private /* static */ long upperEight(byte[] bytes) {
return Longs.fromBytes(
bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
}
};
/**
* Models a lock-free array of bits.
*
* <p>We use this instead of java.util.BitSet because we need access to the array of longs and we
* need compare-and-swap.
*/
static final class LockFreeBitArray {
private static final int LONG_ADDRESSABLE_BITS = 6;
final AtomicLongArray data;
private final LongAddable bitCount;
LockFreeBitArray(long bits) {
this(new long[Ints.checkedCast(LongMath.divide(bits, 64, RoundingMode.CEILING))]);
}
// Used by serialization
LockFreeBitArray(long[] data) {
checkArgument(data.length > 0, "data length is zero!");
this.data = new AtomicLongArray(data);
this.bitCount = LongAddables.create();
long bitCount = 0;
for (long value : data) {
bitCount += Long.bitCount(value);
}
this.bitCount.add(bitCount);
}
/** Returns true if the bit changed value. */
boolean set(long bitIndex) {
if (get(bitIndex)) {
return false;
}
int longIndex = (int) (bitIndex >>> LONG_ADDRESSABLE_BITS);
long mask = 1L << bitIndex; // only cares about low 6 bits of bitIndex
long oldValue;
long newValue;
do {
oldValue = data.get(longIndex);
newValue = oldValue | mask;
if (oldValue == newValue) {
return false;
}
} while (!data.compareAndSet(longIndex, oldValue, newValue));
// We turned the bit on, so increment bitCount.
bitCount.increment();
return true;
}
boolean get(long bitIndex) {
return (data.get((int) (bitIndex >>> 6)) & (1L << bitIndex)) != 0;
}
/**
* Careful here: if threads are mutating the atomicLongArray while this method is executing, the
* final long[] will be a "rolling snapshot" of the state of the bit array. This is usually good
* enough, but should be kept in mind.
*/
public static long[] toPlainArray(AtomicLongArray atomicLongArray) {
long[] array = new long[atomicLongArray.length()];
for (int i = 0; i < array.length; ++i) {
array[i] = atomicLongArray.get(i);
}
return array;
}
/** Number of bits */
long bitSize() {
return (long) data.length() * Long.SIZE;
}
/**
* Number of set bits (1s).
*
* <p>Note that because of concurrent set calls and uses of atomics, this bitCount is a (very)
* close *estimate* of the actual number of bits set. It's not possible to do better than an
* estimate without locking. Note that the number, if not exactly accurate, is *always*
* underestimating, never overestimating.
*/
long bitCount() {
return bitCount.sum();
}
LockFreeBitArray copy() {
return new LockFreeBitArray(toPlainArray(data));
}
/**
* Combines the two BitArrays using bitwise OR.
*
* <p>NOTE: Because of the use of atomics, if the other LockFreeBitArray is being mutated while
* this operation is executing, not all of those new 1's may be set in the final state of this
* LockFreeBitArray. The ONLY guarantee provided is that all the bits that were set in the other
* LockFreeBitArray at the start of this method will be set in this LockFreeBitArray at the end
* of this method.
*/
void putAll(LockFreeBitArray other) {
checkArgument(
data.length() == other.data.length(),
"BitArrays must be of equal length (%s != %s)",
data.length(),
other.data.length());
for (int i = 0; i < data.length(); i++) {
long otherLong = other.data.get(i);
long ourLongOld;
long ourLongNew;
boolean changedAnyBits = true;
do {
ourLongOld = data.get(i);
ourLongNew = ourLongOld | otherLong;
if (ourLongOld == ourLongNew) {
changedAnyBits = false;
break;
}
} while (!data.compareAndSet(i, ourLongOld, ourLongNew));
if (changedAnyBits) {
int bitsAdded = Long.bitCount(ourLongNew) - Long.bitCount(ourLongOld);
bitCount.add(bitsAdded);
}
}
}
@Override
public boolean equals(@Nullable Object o) {
if (o instanceof LockFreeBitArray) {
LockFreeBitArray lockFreeBitArray = (LockFreeBitArray) o;
// TODO(lowasser): avoid allocation here
return Arrays.equals(toPlainArray(data), toPlainArray(lockFreeBitArray.data));
}
return false;
}
@Override
public int hashCode() {
// TODO(lowasser): avoid allocation here
return Arrays.hashCode(toPlainArray(data));
}
}
}
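The core of `LockFreeBitArray` above is a compare-and-swap retry loop on one 64-bit word. The sketch below reduces it to that loop so the pattern is easier to see. It is a simplified illustration with a hypothetical class name, `CasBitArray`, and it leaves out the bit counter, `putAll`, and serialization support of the original.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Thread-safe bit array that sets bits with compare-and-swap (illustrative sketch). */
final class CasBitArray {
    private final AtomicLongArray words;

    CasBitArray(long numBits) {
        // Round up to a whole number of 64-bit words.
        this.words = new AtomicLongArray((int) ((numBits + 63) >>> 6));
    }

    /** Sets the bit and returns true only if this call changed it from 0 to 1. */
    boolean set(long bitIndex) {
        int wordIndex = (int) (bitIndex >>> 6); // which 64-bit word holds the bit
        long mask = 1L << bitIndex;             // a long shift uses only the low 6 bits
        long oldValue;
        long newValue;
        do {
            oldValue = words.get(wordIndex);
            newValue = oldValue | mask;
            if (oldValue == newValue) {
                return false; // already set, possibly by another thread
            }
            // Retry if another thread modified the word between get and compareAndSet.
        } while (!words.compareAndSet(wordIndex, oldValue, newValue));
        return true;
    }

    boolean get(long bitIndex) {
        return (words.get((int) (bitIndex >>> 6)) & (1L << bitIndex)) != 0;
    }

    /** Capacity in bits, always a multiple of 64. */
    long bitSize() {
        return (long) words.length() * Long.SIZE;
    }
}
```

Because bits are only ever turned on, a failed `compareAndSet` means another thread changed the same word, and re-reading the word and retrying is enough; no lock is needed.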
References
http://llimllib.github.io/bloomfilter-tutorial/
http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf
Cassandra's BloomFilter
HBase's BloomFilter