What is a Bloom filter
A Bloom filter is an array of bits. When an element is "added" to a Bloom filter, the element is hashed and bit[hashval % nbits] is set to 1. This is similar to how keys are mapped to buckets in a hash table. To check whether an item is present, the hash is computed and the filter checks whether the corresponding bit is set.
Of course, this is subject to collisions. When a collision happens, the filter returns a false positive: it reports an entry as present even though it was never added. A Bloom filter never returns a false negative, that is, it never claims that something is absent when in fact it is present.
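The add and lookup steps described above can be sketched in a few lines of Python. This is a toy illustration, not how Redis implements it: the class name TinyBloomFilter, the use of SHA-256, and the double-hashing trick for deriving several bit positions from one digest are all assumptions made for this note.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative Bloom filter: a bit array plus k derived bit positions."""

    def __init__(self, nbits: int, nhashes: int = 1) -> None:
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray((nbits + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Derive nhashes positions from a single SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, so the stride is never 0
        for i in range(self.nhashes):
            yield (h1 + i * h2) % self.nbits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)  # bit[hashval % nbits] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True only means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = TinyBloomFilter(nbits=1024, nhashes=3)
bf.add("test_lj")
print(bf.might_contain("test_lj"))  # True: an added item is never reported absent
```

A lookup for a key that was never added usually returns False, but it can return True if other keys happened to set the same bits. That is the false positive described above.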
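Collisions?
Two factors affect the probability of a hash collision in a Bloom filter:
- the number of times each element is hashed (derived from bits_per_element)
- the fraction of the array that is set to 1 (fill ratio): if most positions in the array are already 1, the filter rarely answers "not present", so the probability of a false positive rises sharply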
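To see how the two factors interact, here is a rough Python estimate. It uses the standard textbook approximations (hash outputs treated as independent and uniform), and the function names are made up for this note. An absent item is reported as present only when every probed bit is already 1, so the false positive rate is roughly fill_ratio raised to the number of hashes.

```python
import math


def expected_fill_ratio(nbits: int, nhashes: int, nitems: int) -> float:
    """Approximate fraction of bits set to 1 after inserting nitems elements."""
    return 1.0 - math.exp(-nhashes * nitems / nbits)


def false_positive_rate(fill_ratio: float, nhashes: int) -> float:
    """Chance that all nhashes probed bits are already 1 for an absent item."""
    return fill_ratio ** nhashes


# The same 10,000-bit array with 7 hashes, at three different load levels.
for nitems in (100, 500, 1000):
    fill = expected_fill_ratio(nbits=10_000, nhashes=7, nitems=nitems)
    print(nitems, round(fill, 3), false_positive_rate(fill, 7))
```

As more items go in, the fill ratio climbs and the estimated false positive rate climbs with it, which is why a filter loaded past its planned capacity degrades quickly.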
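Scaling
Bloom filters cannot be "rebalanced" because there is no way to know which entries are part of the filter (the filter can only determine whether a given entry is not present, but does not actually store the entries).
A Bloom filter cannot be resized in place, because it cannot be rebalanced the way a list can. It only sets bit[hashval % nbits] to 1 and does not actually record the elements themselves. So, to accept more elements once the initial size has been reached, the filter creates a new byte array and "stacks" it on top of the original one. Adding an element then takes two steps: first hash bpe times to make sure the element is not in the original array, then hash bpe times again to set the bits in the new array. The new array is generally longer than the old one: its length is len(old array) * EXPANSION, where EXPANSION defaults to 2. The larger size reduces the chance of filling up again.
If the initial size of the Bloom filter is set very small, performance suffers badly. Scaling leaves many byte arrays stacked on top of each other, and a query has to start from the newest (top) array and may perform n * bpe hashes in total, n times as many as for a single array. An insert also costs n times as much, because it must first confirm that the element is in none of the n arrays in the stack.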
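The cost of a small initial size can be estimated with a short Python sketch. It assumes that each new sub-filter holds EXPANSION times as many items as the previous one and that every sub-filter uses the same number of hashes. The second assumption is a simplification, and both function names are hypothetical.

```python
def sub_filter_capacities(nitems: int, initial_capacity: int, expansion: int = 2) -> list[int]:
    """Capacities of the stacked sub-filters needed to hold nitems elements."""
    capacities = [initial_capacity]
    while sum(capacities) < nitems:
        # Each new sub-filter is `expansion` times larger than the previous one.
        capacities.append(capacities[-1] * expansion)
    return capacities


def worst_case_hashes(capacities: list[int], hashes_per_filter: int) -> int:
    """Upper bound on hash probes for one lookup: every sub-filter is checked."""
    return len(capacities) * hashes_per_filter


small = sub_filter_capacities(1_000_000, initial_capacity=10)
large = sub_filter_capacities(1_000_000, initial_capacity=1_000_000)
print(len(small), worst_case_hashes(small, 10))
print(len(large), worst_case_hashes(large, 10))
```

With an initial capacity of 10, holding one million items takes 17 stacked sub-filters, so a lookup may need 17 times as many hash probes as a filter that was sized correctly from the start.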
Actual memory footprint of a Redis Bloom filter
$$\text{bits\_per\_item} = -\log(\text{error}) / \ln(2)$$
$$\text{memory} = \text{capacity} * \text{bits\_per\_item}$$
1% error rate requires 7 hash functions and 10.08 bits per item.
0.1% error rate requires 10 hash functions and 14.4 bits per item.
0.01% error rate requires 14 hash functions and 20.16 bits per item. (roughly 20 bits times the number of elements)
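To make the arithmetic concrete, here is a small Python sketch of the two formulas. It rests on two assumptions that are my reading of the numbers, not something taken from the Redis source. First, log in the formula is taken as the base-2 logarithm, which reproduces the 7, 10, and 14 hash functions listed above once rounded up. Second, the listed bits-per-item figures appear to be that rounded hash count times 1.44, an approximation of 1/ln(2). The function names are made up for this note.

```python
import math


def hash_count(error: float) -> int:
    """Number of hash functions, assuming log is base 2 and the result is rounded up."""
    return math.ceil(-math.log2(error))


def bits_per_item(error: float) -> float:
    """bits_per_item = -log(error) / ln(2), with log taken as base 2 (assumption)."""
    return -math.log2(error) / math.log(2)


def listed_bits_per_item(error: float) -> float:
    """Reproduces the listed figures: rounded-up hash count times 1.44 (about 1/ln 2)."""
    return hash_count(error) * 1.44


def memory_bits(capacity: int, error: float) -> float:
    """memory = capacity * bits_per_item, in bits (divide by 8 for bytes)."""
    return capacity * bits_per_item(error)


for error in (0.01, 0.001, 0.0001):
    print(error, hash_count(error), round(listed_bits_per_item(error), 2))
```

The unrounded formula gives slightly smaller values than the listed ones (about 9.59 bits per item at 1%), because the listed figures round the hash count up before multiplying.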
Redis commands
bf.info
appId:11426> bf.info bf_0
Capacity # Return the number of unique items that can be stored in this Bloom filter before scaling would be required
3000000
Size # Return memory size: number of bytes allocated for this Bloom filter
5037368
Number of filters # Return the number of sub-filters (watch over time whether this sub-filter count grows)
2
Number of items inserted
1339642
Expansion rate
2
bf.add, bf.exists
appId:11426> bf.add bf_0 test_lj
1 # "1" means the item was added
appId:11426> bf.add bf_0 test_lj
0 # "0" means the item already exists
appId:11426> bf.exists bf_0 test_lj
1 # 1 means the item very probably exists
appId:11426> bf.exists bf_0 test_lj_2
0 # 0 means the item definitely does not exist
bf.madd, bf.mexists
appId:11426> bf.madd bf_0 test1 test2
1
1
appId:11426> bf.mexists bf_0 test1 test2 test3
1
1
0
bf.reserve (key name, error_rate, capacity, expansion)
appId:11426> bf.reserve lj_filter 0.001 10 expansion 2
OK
appId:11426> bf.info lj_filter
Capacity # a filter declared with a capacity of 10
10
Size # memory footprint in bytes
176
Number of filters
1
Number of items inserted
0
Expansion rate
2
appId:11426> bf.add lj_filter test1
1
appId:11426> bf.info lj_filter
Capacity
10
Size
176
Number of filters
1
Number of items inserted
1
Expansion rate
2
appId:11426> bf.add lj_filter test2
1
appId:11426> bf.add lj_filter test3
1
appId:11426> bf.add lj_filter test4
1
appId:11426> bf.add lj_filter test5
1
appId:11426> bf.add lj_filter test6
1
appId:11426> bf.add lj_filter test7
1
appId:11426> bf.add lj_filter test8
1
appId:11426> bf.add lj_filter test9
1
appId:11426> bf.add lj_filter test10
1
appId:11426> bf.info lj_filter
Capacity # 10 values inserted, no scaling yet
10
Size
176
Number of filters
1
Number of items inserted
10
Expansion rate
2
appId:11426> bf.add lj_filter test11
1
appId:11426> bf.info lj_filter
Capacity # the 11th insert added one new sub-filter with 2x the capacity (10 + 20 = 30)
30
Size
344
Number of filters
2
Number of items inserted
11
Expansion rate
2
appId:11426> bf.add lj_filter test12
1
appId:11426> bf.add lj_filter test13
1
appId:11426> bf.add lj_filter test14
1
appId:11426> bf.add lj_filter test15
1
appId:11426> bf.add lj_filter test16
1
appId:11426> bf.add lj_filter test17
1
appId:11426> bf.add lj_filter test18
1
appId:11426> bf.add lj_filter test19
1
appId:11426> bf.add lj_filter test20
1
appId:11426> bf.info lj_filter
Capacity # 20 values inserted, no further scaling yet
30
Size
344
Number of filters
2
Number of items inserted
20
Expansion rate
2
appId:11426> bf.add lj_filter test21
1
appId:11426> bf.add lj_filter test22
1
appId:11426> bf.add lj_filter test23
1
appId:11426> bf.add lj_filter test24
1
appId:11426> bf.add lj_filter test25
1
appId:11426> bf.add lj_filter test26
1
appId:11426> bf.add lj_filter test27
1
appId:11426> bf.add lj_filter test28
1
appId:11426> bf.add lj_filter test29
1
appId:11426> bf.add lj_filter test30
1
appId:11426> bf.info lj_filter
Capacity # 30 values inserted, still no scaling, because the newly stacked filter has a capacity of 20
30
Size
344
Number of filters
2
Number of items inserted
30
Expansion rate
2
appId:11426> bf.add lj_filter test31
1
appId:11426> bf.info lj_filter
Capacity # the 31st insert triggers scaling (10 + 20 + 40 = 70)
70
Size
560
Number of filters
3
Number of items inserted
31
Expansion rate
2