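1. Introduction
This post follows on from "Presto (openLooKeng) BloomFilter index optimization: a code walkthrough" (王飞活's blog, CSDN). It continues the code walkthrough with another index type in Presto (strictly speaking, openLooKeng): the Bitmap index. For the official description of the Bitmap index, see the openLooKeng documentation.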
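2. Bitmap
A bitmap index suits low-cardinality columns, that is, columns whose values can be enumerated, such as gender or type. The idea is to record, for each distinct column value, whether each row holds that value, which gives a two-dimensional array of booleans. For example, put an animal's habitat (land, water, or air) on the vertical axis and the data rows on the horizontal axis; each cell of the boolean map then says whether that row has that habitat.
Once the index is built, a query such as type = 'LAND' only needs to pick out the rows whose bit is 1 in the first row of the map (the LAND row).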
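The boolean map described above can be sketched in a few lines. This is a minimal illustration, not the openLooKeng implementation: it uses java.util.BitSet as a stand-in for a compressed bitmap, and the class name HabitatBitmapSketch, its methods, and the sample rows are all made up for the example.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

class HabitatBitmapSketch {
    // Build one BitSet per distinct value; bit i is set when row i holds that value.
    static Map<String, BitSet> build(String[] column) {
        Map<String, BitSet> index = new LinkedHashMap<>();
        for (int row = 0; row < column.length; row++) {
            index.computeIfAbsent(column[row], k -> new BitSet()).set(row);
        }
        return index;
    }

    // Answer "column = value" by returning the row numbers whose bit is 1.
    static int[] rowsMatching(Map<String, BitSet> index, String value) {
        BitSet bits = index.get(value);
        return bits == null ? new int[0] : bits.stream().toArray();
    }

    public static void main(String[] args) {
        // Made-up sample rows, one habitat per row.
        String[] type = {"LAND", "WATER", "LAND", "AIR", "WATER"};
        Map<String, BitSet> index = build(type);
        System.out.println(Arrays.toString(rowsMatching(index, "LAND"))); // prints [0, 2]
    }
}
```

The query never scans the column itself: it reads one bitmap and lists its set bits, which is why the approach pays off when the number of distinct values is small.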
In Presto (openLooKeng), the bitmap is backed by RoaringBitmap. For an introduction to RoaringBitmap, see GitHub - RoaringBitmap/RoaringBitmap: A better compressed bitset in Java.
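Presto relies on a few RoaringBitmap interfaces, all of which appear in the code walked through in the following sections: RoaringBitmap.bitmapOf, runOptimize, serialize, RoaringBitmap.or, and iterator.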
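3. Code walkthrough: building the Bitmap index in Presto
Building a Bitmap index in Presto follows the same process as building a BloomFilter index (see "Presto (openLooKeng) BloomFilter index optimization: a code walkthrough", 王飞活's blog, CSDN). The only differences are the addValues implementation used when the index is built and the match implementation used when the index is queried. The BitMap addValues code is walked through below: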
public boolean addValues(List<Pair<String, List<Object>>> values)
        throws IOException
{
    checkClosed();
    // values can only be added once
    if (!updateAllowed.getAndSet(false)) {
        throw new UnsupportedOperationException("Unable to update index. " +
                "An existing Btree index can not be updated because all values must be added together since the " +
                "position of the values is important.");
    }
    // The values argument holds all of this operator's data for one column:
    // in each Pair, the key is the column name and the value is the list of all values in that column
    if (values.size() != 1) {
        throw new UnsupportedOperationException("Only single column is supported.");
    }
    List<Object> columnValues = values.get(0).getSecond();
    // In positions, the key is a column value and the value is the list of all indexes (positions) where it occurs
    Map<Object, ArrayList<Integer>> positions = new HashMap<>();
    // Build positions: for each key, collect all of its indexes into a list; null values are skipped
    for (int i = 0; i < columnValues.size(); i++) {
        Object value = columnValues.get(i);
        if (value != null) {
            positions.computeIfAbsent(value, k -> new ArrayList<>()).add(i);
        }
    }
    if (positions.isEmpty()) {
        return true;
    }
    // For each key's index list, build a RoaringBitmap and serialize it to a byte array,
    // then pair the key with the serialized byte array,
    // and collect the pairs for all keys into a List
    List<kotlin.Pair> bitmaps = new ArrayList<>(positions.size());
    for (Map.Entry<Object, ArrayList<Integer>> e : positions.entrySet()) {
        int[] valuePositions = ArrayUtils.toPrimitive(e.getValue().toArray(new Integer[0]));
        RoaringBitmap rr = RoaringBitmap.bitmapOf(valuePositions);
        rr.runOptimize();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        rr.serialize(dos);
        dos.close();
        Object value = convertToSupportedType(e.getKey());
        bitmaps.add(new kotlin.Pair(value, bos.toByteArray()));
    }
    // Sort the pairs by key so the tree can be built from ordered input
    Collections.sort(bitmaps, (o1, o2) -> ((Comparable) o1.component1()).compareTo(o2.component1()));
    // Turn the List into a tree
    getBtreeWriteOptimized(bitmaps.iterator().next().component1(), bitmaps.iterator());
    return true;
}
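The build steps in addValues (group positions by value, turn each group into a bitmap, serialize it, and store it under its key in sorted order) can be tried out with the JDK alone. The sketch below is an approximation under stated assumptions: java.util.BitSet and its toByteArray stand in for RoaringBitmap and its serialize, a TreeMap stands in for the B-tree, and the class BitmapBuildSketch with its methods is hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.TreeMap;

class BitmapBuildSketch {
    // Group row positions by value, then store each group as serialized bytes
    // keyed by value in sorted order. Null values are skipped, as in addValues.
    static TreeMap<Integer, byte[]> build(List<Integer> columnValues) {
        TreeMap<Integer, BitSet> positions = new TreeMap<>();
        for (int i = 0; i < columnValues.size(); i++) {
            Integer value = columnValues.get(i);
            if (value != null) {
                positions.computeIfAbsent(value, k -> new BitSet()).set(i);
            }
        }
        TreeMap<Integer, byte[]> tree = new TreeMap<>();
        // BitSet.toByteArray is the stand-in for RoaringBitmap.serialize.
        positions.forEach((value, bits) -> tree.put(value, bits.toByteArray()));
        return tree;
    }

    // Reverse of the serialization step: bytes back to row positions.
    static int[] positionsOf(byte[] serialized) {
        return BitSet.valueOf(serialized).stream().toArray();
    }

    public static void main(String[] args) {
        // Made-up column with one null row.
        TreeMap<Integer, byte[]> tree = build(Arrays.asList(3, 1, null, 3, 2, 1));
        tree.forEach((value, bytes) ->
                System.out.println(value + " -> " + Arrays.toString(positionsOf(bytes))));
    }
}
```

Keeping the keys sorted is what later makes range predicates cheap: a range over values becomes a contiguous slice of the tree.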
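4. Code walkthrough: matching against the Bitmap index in Presto
Matching mainly calls the lookUp method below to check whether the values in the where filter can be found in the index. The code is walked through here: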
public Iterator<Integer> lookUp(Object expression)
{
    checkClosed();
    if (expression instanceof Domain) {
        Domain predicate = (Domain) expression;
        List<Range> ranges = ((SortedRangeSet) (predicate.getValues())).getOrderedRanges();
        try {
            ArrayList<RoaringBitmap> allMatches = new ArrayList<>();
            for (Range range : ranges) {
                // For a single-value filter such as where id = 1: use the filter value (the 1 in where id = 1)
                // to find the matching byte array in the bitmaps tree built above, then deserialize it into a RoaringBitmap
                if (range.isSingleValue()) {
                    // unique value(for example: id=1, id in (1,2) (IN operator gives single exact values one by one)), bound: EXACTLY
                    Object value = getActualValue(predicate.getType(), range.getSingleValue());
                    Object byteArray = getBtreeReadOptimized().get(value);
                    if (byteArray != null) {
                        RoaringBitmap bitmap = byteArrayToBitmap(value, byteArray);
                        allMatches.add(bitmap);
                    }
                }
                else {
                    // Range queries, for example id > 2 and id < 10
                    // <, <=, >=, >, BETWEEN
                    boolean highBoundless = range.getHigh().isUpperUnbounded();
                    boolean lowBoundless = range.getLow().isLowerUnbounded();
                    ConcurrentNavigableMap<Object, byte[]> concurrentNavigableMap = null;
                    if (highBoundless && !lowBoundless) {
                        // Only a lower bound: the lower bound comes from the filter and the upper bound is the last key in the bitmaps tree;
                        // the byte arrays of the values in this range are deserialized into n RoaringBitmaps
                        // >= or >
                        Object low = getActualValue(predicate.getType(), range.getLow().getValue());
                        Object high = getBtreeReadOptimized().lastKey();
                        boolean fromInclusive = range.getLow().getBound().equals(Marker.Bound.EXACTLY);
                        if (getBtreeReadOptimized().comparator().compare(low, high) > 0) {
                            Object temp = low;
                            low = high;
                            high = temp;
                        }
                        concurrentNavigableMap = getBtreeReadOptimized().subMap(low, fromInclusive, high, true);
                    }
                    else if (!highBoundless && lowBoundless) {
                        // Only an upper bound: the lower bound is the first key in the bitmaps tree and the upper bound comes from the filter;
                        // the byte arrays of the values in this range are deserialized into n RoaringBitmaps
                        // <= or <
                        Object low = getBtreeReadOptimized().firstKey();
                        Object high = getActualValue(predicate.getType(), range.getHigh().getValue());
                        boolean toInclusive = range.getHigh().getBound().equals(Marker.Bound.EXACTLY);
                        if (getBtreeReadOptimized().comparator().compare(low, high) > 0) {
                            Object temp = low;
                            low = high;
                            high = temp;
                        }
                        concurrentNavigableMap = getBtreeReadOptimized().subMap(low, true, high, toInclusive);
                    }
                    else if (!highBoundless && !lowBoundless) {
                        // Both an upper and a lower bound: take the values between the filter's bounds;
                        // their byte arrays are deserialized into n RoaringBitmaps
                        // BETWEEN
                        Object low = getActualValue(predicate.getType(), range.getLow().getValue());
                        Object high = getActualValue(predicate.getType(), range.getHigh().getValue());
                        if (getBtreeReadOptimized().comparator().compare(low, high) > 0) {
                            Object temp = low;
                            low = high;
                            high = temp;
                        }
                        concurrentNavigableMap = getBtreeReadOptimized().subMap(low, true, high, true);
                    }
                    else {
                        // Neither an upper nor a lower bound
                        // This case, combined gives a range of boundless for both high and low end
                        throw new UnsupportedOperationException("No use for bitmap index as all values are matched due to no bounds.");
                    }
                    for (Map.Entry<Object, byte[]> e : concurrentNavigableMap.entrySet()) {
                        if (e != null) {
                            RoaringBitmap bitmap = byteArrayToBitmap(e.getKey(), e.getValue());
                            allMatches.add(bitmap);
                        }
                    }
                }
            }
            if (allMatches.size() == 0) {
                return Collections.emptyIterator();
            }
            if (allMatches.size() == 1) {
                return allMatches.get(0).iterator();
            }
            // Combine all the RoaringBitmaps with OR into one new RoaringBitmap
            return RoaringBitmap.or(allMatches.iterator()).iterator();
        }
        catch (Exception e) {
            throw new UnsupportedOperationException("Unsupported expression type.", e);
        }
    }
    else {
        throw new UnsupportedOperationException("Unsupported expression type.");
    }
}
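In matches, if a RoaringBitmap can be found for the filter condition, the stripe contains the filtered value. For example, with where id = 1, if a RoaringBitmap is found for the value 1, the stripe is kept; otherwise the stripe can be skipped.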
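The lookup and the stripe decision can be approximated with the JDK as follows. This is a sketch under assumptions, not the openLooKeng code: a TreeMap of java.util.BitSet stands in for the B-tree of serialized RoaringBitmaps, a null bound is a hypothetical way to say "unbounded", and BitmapLookupSketch, lookUp, keepStripe, and bits are illustrative names. One deliberate difference: where the original swaps low and high when they are out of order, this sketch simply returns no matches.

```java
import java.util.BitSet;
import java.util.NavigableMap;
import java.util.TreeMap;

class BitmapLookupSketch {
    // OR together the bitmaps of every key inside the range.
    // A null bound means "unbounded" on that side (an assumption of this sketch).
    static BitSet lookUp(NavigableMap<Integer, BitSet> tree,
                         Integer low, boolean lowInclusive,
                         Integer high, boolean highInclusive) {
        if (low == null && high == null) {
            throw new UnsupportedOperationException("no bounds: every value matches");
        }
        BitSet result = new BitSet();
        if (tree.isEmpty()) {
            return result;
        }
        // Fill a missing bound from the tree's own first or last key.
        Integer from = low != null ? low : tree.firstKey();
        Integer to = high != null ? high : tree.lastKey();
        boolean fromInc = low == null || lowInclusive;
        boolean toInc = high == null || highInclusive;
        if (from > to) {
            return result; // TreeMap.subMap would throw for an inverted range
        }
        for (BitSet bits : tree.subMap(from, fromInc, to, toInc).values()) {
            result.or(bits);
        }
        return result;
    }

    // The stripe is worth reading only if at least one row matched.
    static boolean keepStripe(BitSet matches) {
        return !matches.isEmpty();
    }

    // Helper to build a BitSet from row numbers.
    static BitSet bits(int... rows) {
        BitSet b = new BitSet();
        for (int r : rows) {
            b.set(r);
        }
        return b;
    }

    public static void main(String[] args) {
        // Made-up index: value -> rows holding that value.
        NavigableMap<Integer, BitSet> tree = new TreeMap<>();
        tree.put(1, bits(1, 5));
        tree.put(2, bits(4));
        tree.put(3, bits(0, 3));
        System.out.println(lookUp(tree, 1, true, 1, true));                // id = 1
        System.out.println(lookUp(tree, 2, false, null, false));           // id > 2
        System.out.println(keepStripe(lookUp(tree, 7, true, 7, true)));    // id = 7
    }
}
```

A single-value predicate is just the degenerate range with both bounds equal and inclusive, so one code path covers both branches of the original method.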