How Roaring Bitmaps Work

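I. What are bitmaps for?
  1. A bitmap is a bit array, stored as Array[Byte], that represents a set of integers, Set[Integer]. It stores an integer n by setting bit n of the array: "if n is in the set, set arr[n] = 1".
  2. Because of this representation, a bitmap can use the CPU's bitwise-and and bitwise-or instructions to compute the intersection and union of two integer sets very quickly. Each instruction handles a whole machine word (for example 64 bits) at once, so the cost is linear in the bitmap length with a very small constant, not O(1).
    Suppose there are 1 billion documents, numbered from 1 to 1 billion. How do we find the documents that contain both the word "carrier" and the word "pigeon"?
    Represent the set of document IDs containing "carrier" as arr1: Array[Byte] and the set of document IDs containing "pigeon" as arr2: Array[Byte]. The set of documents containing both words is the bitwise AND of the two bit arrays.
  3. Plain bitmaps have one drawback: when the largest value in the set is very large but the number of elements is small, they waste an enormous amount of space.
    For example, the set [1, 1000000000] contains only 2 integers, yet it needs about 1 billion bits (roughly 125 MB) to represent.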
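
The three points above can be sketched in a few lines of Python. This is only an illustration: the `Bitmap` class, the `bytes_needed` helper, and the sample document IDs are made up for this example and do not come from any library.

```python
class Bitmap:
    """Plain (uncompressed) bitmap over the integers 0..size-1."""

    def __init__(self, size):
        self.size = size
        self.bits = bytearray((size + 7) // 8)  # one bit per possible integer

    def add(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)  # set bit n

    def __contains__(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))

    def intersect(self, other):
        out = Bitmap(min(self.size, other.size))
        for i in range(len(out.bits)):
            # One AND handles 8 set members here; a CPU word handles 64.
            out.bits[i] = self.bits[i] & other.bits[i]
        return out

    def to_list(self):
        return [n for n in range(self.size) if n in self]


def bytes_needed(max_value):
    """Bytes a plain bitmap needs to cover 0..max_value, however few are set."""
    return (max_value + 1 + 7) // 8


# Hypothetical document IDs containing each word.
carrier = Bitmap(64)
pigeon = Bitmap(64)
for doc_id in (3, 8, 21, 40):
    carrier.add(doc_id)
for doc_id in (8, 13, 40, 55):
    pigeon.add(doc_id)

both = carrier.intersect(pigeon)
print(both.to_list())               # [8, 40]
print(bytes_needed(1_000_000_000))  # 125000001 bytes for the set [1, 1000000000]
```

The last line shows the waste described in point 3: two integers cost about 125 MB.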
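II. What are Roaring bitmaps for?

Roaring bitmaps build on traditional bitmaps and use compression to solve the sparsity problem. Specifically, a Roaring bitmap partitions a set of 32-bit integers into buckets (containers) by their high 16 bits, giving at most 2^16 = 65536 containers. To store an integer, it finds the container for the integer's high 16 bits (creating one if none exists) and puts the integer's low 16 bits into that container. There are two common container types: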

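  1. ArrayContainer
    Used when a container holds no more than 4096 values. It is essentially a sorted array of unsigned short (exactly 16 bits): Array[Short]. The array starts with length 4 and grows automatically as values are added, up to a maximum length of 4096, so an ArrayContainer occupies between 4 * 2 B = 8 B initially and 4096 * 2 B = 8 KB at most. It also maintains a counter that tracks the cardinality in real time.

  2. BitmapContainer
    Used when a container holds more than 4096 values. It is essentially a traditional bitmap with a fixed length of 2^16 bits (8 KB), able to hold any of the 2^16 possible low-16-bit values. Physically it is an array of unsigned long (64 bits, 8 B each) with a fixed length of 1024: Array[Long] (size = 1024), so every such bitmap takes exactly 8 KB. It also has a counter. The 4096 threshold is the point where both representations cost the same: 4096 * 2 B = 8 KB.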
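
A minimal sketch of the storage scheme described above, under some simplifying assumptions: `RoaringSketch`, `split`, and `kind` are hypothetical names, a sorted Python list stands in for ArrayContainer, a `bytearray` of 8192 bytes stands in for the 1024-word BitmapContainer, and the cardinality counters are omitted.

```python
import bisect

ARRAY_MAX = 4096           # array containers hold at most this many values
BITMAP_BYTES = 65536 // 8  # 2^16 bits = 8192 bytes = 8 KB


def split(n):
    """Split a 32-bit integer into (high 16 bits, low 16 bits)."""
    return n >> 16, n & 0xFFFF


class RoaringSketch:
    def __init__(self):
        # high 16 bits -> container (sorted list or 8 KB bytearray)
        self.containers = {}

    def add(self, n):
        high, low = split(n)
        c = self.containers.setdefault(high, [])  # create the container if missing
        if isinstance(c, list):
            i = bisect.bisect_left(c, low)
            if i < len(c) and c[i] == low:
                return  # already present
            if len(c) < ARRAY_MAX:
                c.insert(i, low)  # keep the array sorted
                return
            # The array already holds 4096 values: convert it to a bitmap.
            bitmap = bytearray(BITMAP_BYTES)
            for v in c:
                bitmap[v >> 3] |= 1 << (v & 7)
            self.containers[high] = c = bitmap
        c[low >> 3] |= 1 << (low & 7)

    def kind(self, high):
        c = self.containers[high]
        return "bitmap" if isinstance(c, bytearray) else "array"


rb = RoaringSketch()
rb.add(1)
rb.add(1_000_000_000)
print(sorted(rb.containers))  # [0, 15258]
```

With this layout the set [1, 1000000000] needs two small array containers instead of a billion-bit array.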

III. How do Roaring bitmaps compute exist, union, and intersect?
  1. Checking whether an integer N is in the set
    To check if an integer N exists, get N’s 16 most significant bits (N / 2^16, using integer division) and use them to find N’s corresponding container in the Roaring bitmap.

    If the container doesn’t exist, then N is not in the Roaring bitmap.

    Checking for existence in array and bitmap containers works differently:

    Bitmap: check if the bit at N % 2^16 is set.
    Array: use binary search to find N % 2^16 in the sorted array.
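
The lookup can be sketched as follows. The representation is an assumption made for illustration: `containers` is a plain dict keyed by the high 16 bits, a sorted list plays the array container, and an 8192-byte `bytearray` plays the bitmap container. The function name `contains` is hypothetical.

```python
import bisect


def contains(containers, n):
    """Return True if the 32-bit integer n is stored in `containers`."""
    high, low = n >> 16, n & 0xFFFF  # same as n // 2**16 and n % 2**16
    c = containers.get(high)
    if c is None:
        return False  # no container for this chunk, so n is absent
    if isinstance(c, bytearray):
        # Bitmap container: test the bit at position `low`.
        return bool(c[low >> 3] & (1 << (low & 7)))
    # Array container: binary search in the sorted array.
    i = bisect.bisect_left(c, low)
    return i < len(c) and c[i] == low


# Chunk 0 is an array container; chunk 2 is a bitmap container with bit 7 set.
bitmap = bytearray(8192)
bitmap[0] |= 1 << 7
containers = {0: [1, 5, 9], 2: bitmap}

print(contains(containers, 5))              # True
print(contains(containers, 6))              # False
print(contains(containers, (2 << 16) | 7))  # True
print(contains(containers, 1 << 16))        # False
```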

  2. Computing intersect
    To intersect Roaring bitmaps A and B, it is sufficient to intersect matching containers in A and B.

    This is possible because of how integers are partitioned in Roaring bitmaps: matching containers in A and B store integers with the same 16 most significant bits (the same chunks).

    Intersection algorithms vary by the types of the containers involved, as do the resulting container types:

    Bitmap / Bitmap: Compute the bitwise AND of the two bitmaps. If the cardinality is <= 4,096, store the result in an array container, otherwise store it in a bitmap container.
    Bitmap / Array: Iterate over the array, checking for the existence of each 16-bit integer in the bitmap. If the integer exists, add it to the resulting array container. Intersections of a bitmap container and an array container always create an array container.
    Array / Array: Intersections of two array containers always create a new array container. The algorithm is chosen by a cardinality heuristic: it is either a simple merge (as used in merge sort) or a galloping intersection.

    If there is a container in either Roaring bitmap without a corresponding container in the other, it will not exist in the result: the intersection of an empty set and any set is an empty set.
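
The three cases above can be sketched like this. It is a simplified illustration, not any library's implementation: a Roaring bitmap is modelled as a dict from high 16 bits to a container, a sorted list plays the array container, an 8192-byte `bytearray` plays the bitmap container, all function names are hypothetical, and the array/array case always uses the simple merge (the galloping variant is left out).

```python
ARRAY_MAX = 4096


def to_bitmap(values):
    bm = bytearray(8192)
    for v in values:
        bm[v >> 3] |= 1 << (v & 7)
    return bm


def popcount(bm):
    return sum(bin(byte).count("1") for byte in bm)


def bitmap_to_array(bm):
    return [v for v in range(65536) if bm[v >> 3] & (1 << (v & 7))]


def intersect_containers(a, b):
    a_is_bm, b_is_bm = isinstance(a, bytearray), isinstance(b, bytearray)
    if a_is_bm and b_is_bm:
        # Bitmap / Bitmap: bitwise AND, then pick the container type by cardinality.
        out = bytearray(x & y for x, y in zip(a, b))
        return out if popcount(out) > ARRAY_MAX else bitmap_to_array(out)
    if a_is_bm or b_is_bm:
        # Bitmap / Array: probe the bitmap for each array value; result is an array.
        bm, arr = (a, b) if a_is_bm else (b, a)
        return [v for v in arr if bm[v >> 3] & (1 << (v & 7))]
    # Array / Array: simple merge of two sorted arrays.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            out.append(a[i])
            i += 1
            j += 1
    return out


def intersect(A, B):
    result = {}
    for high in A.keys() & B.keys():  # only chunks present on both sides
        c = intersect_containers(A[high], B[high])
        if len(c) > 0:  # skip chunks whose intersection is empty
            result[high] = c
    return result


print(intersect({0: [1, 2], 1: [9]}, {0: [2, 3], 2: [9]}))  # {0: [2]}
```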

  3. Computing union
    To union Roaring bitmaps A and B, union all matching containers in A and B.

    Union algorithms vary by the container types involved, as do the resulting container types:

    Bitmap / Bitmap: Compute the bitwise OR of the two bitmaps. Unions of two bitmap containers will always create another bitmap container.
    Bitmap / Array: Copy the bitmap and set corresponding bits for all the integers in the array container. Unions of a bitmap and array container will always create another bitmap container.
    Array / Array: If the sum of the cardinalities of the two array containers is <= 4,096, the resulting container will be an array container. In this case, add all integers from both arrays to a new array container. Otherwise, optimistically assume the resulting container will be a bitmap: create a new bitmap container and set all corresponding bits for all integers in both arrays. If the cardinality of the resulting container is <= 4,096, convert the bitmap container back into an array container.

    Finally, add all containers in A and B that do not have a matching container to the result. Remember: this is a union, so all integers in Roaring bitmaps A and B must be in the resulting set.
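
A sketch of the union rules above, under the same illustrative assumptions as before: a Roaring bitmap is a dict from high 16 bits to a container, a sorted list plays the array container, and an 8192-byte `bytearray` plays the bitmap container. All function names are hypothetical.

```python
ARRAY_MAX = 4096


def to_bitmap(values):
    bm = bytearray(8192)
    for v in values:
        bm[v >> 3] |= 1 << (v & 7)
    return bm


def popcount(bm):
    return sum(bin(byte).count("1") for byte in bm)


def bitmap_to_array(bm):
    return [v for v in range(65536) if bm[v >> 3] & (1 << (v & 7))]


def union_containers(a, b):
    a_is_bm, b_is_bm = isinstance(a, bytearray), isinstance(b, bytearray)
    if a_is_bm and b_is_bm:
        # Bitmap / Bitmap: bitwise OR, always a bitmap.
        return bytearray(x | y for x, y in zip(a, b))
    if a_is_bm or b_is_bm:
        # Bitmap / Array: copy the bitmap, then set the array's bits.
        bm, arr = (a, b) if a_is_bm else (b, a)
        out = bytearray(bm)
        for v in arr:
            out[v >> 3] |= 1 << (v & 7)
        return out
    if len(a) + len(b) <= ARRAY_MAX:
        # Array / Array, small: the result must fit in an array container.
        return sorted(set(a) | set(b))
    # Array / Array, large: assume a bitmap, convert back if it turns out small.
    out = to_bitmap(a)
    for v in b:
        out[v >> 3] |= 1 << (v & 7)
    return out if popcount(out) > ARRAY_MAX else bitmap_to_array(out)


def union(A, B):
    result = {}
    for high in A.keys() | B.keys():
        if high in A and high in B:
            result[high] = union_containers(A[high], B[high])
        else:
            # Unmatched container: copy it into the result unchanged.
            c = A[high] if high in A else B[high]
            result[high] = c[:]
    return result


u = union({0: [1], 1: [5]}, {0: [2], 2: [7]})
print(sorted(u.items()))  # [(0, [1, 2]), (1, [5]), (2, [7])]
```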
