An analysis of ClickHouse's isNullAt implementation

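class ColumnNullable final : public COWHelper<IColumn, ColumnNullable>
{
public:
    // ... (other members omitted)

    /// Row n is NULL when the n-th byte of the null map is non-zero.
    bool isNullAt(size_t n) const override
    {
        return assert_cast<const ColumnUInt8 &>(*null_map).getData()[n] != 0;
    }
};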

Let's go through the code above piece by piece:

ColumnUInt8

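ColumnUInt8 is an instantiation of the ColumnVector template and is responsible for describing a column's memory layout. The column in question may be an ordinary data column or a mask column such as the null map discussed later in this post.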

ColumnUInt8 is defined in Columns/ColumnsNumber.h:

namespace DB
{

/** Columns with numbers. */

using ColumnUInt8 = ColumnVector<UInt8>;
using ColumnUInt16 = ColumnVector<UInt16>;
using ColumnUInt32 = ColumnVector<UInt32>;
using ColumnUInt64 = ColumnVector<UInt64>;
using ColumnUInt128 = ColumnVector<UInt128>;
using ColumnUInt256 = ColumnVector<UInt256>;

using ColumnInt8 = ColumnVector<Int8>;
using ColumnInt16 = ColumnVector<Int16>;
using ColumnInt32 = ColumnVector<Int32>;
using ColumnInt64 = ColumnVector<Int64>;
using ColumnInt128 = ColumnVector<Int128>;
using ColumnInt256 = ColumnVector<Int256>;

using ColumnFloat32 = ColumnVector<Float32>;
using ColumnFloat64 = ColumnVector<Float64>;

using ColumnUUID = ColumnVector<UUID>;

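}

Columns/ColumnVector.h:
template <typename T>
class ColumnVector final : public COWHelper<ColumnVectorHelper, ColumnVector<T>>
{
public:
    using ValueType = T;
    using Container = PaddedPODArray<ValueType>;

    /// Direct access to the underlying array.
    Container & getData() { return data; }

    /// Const overload; this is the one a const method such as isNullAt ends up calling.
    const Container & getData() const { return data; }

protected:
    Container data;
};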

Container is a PaddedPODArray, that is, a PODArray (POD stands for Plain Old Data). It can be thought of, roughly, as a std::vector. See PODArray.h for the full implementation.

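PODArray_fwd.h:
/// Template parameter list shown here without its default arguments.
template <typename T, size_t initial_bytes, typename TAllocator>
using PaddedPODArray = PODArray<T, initial_bytes, TAllocator, PADDING_FOR_SIMD - 1, PADDING_FOR_SIMD>;

Common/PODArray.h: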
/** A dynamic array for POD types.
  * Designed for a small number of large arrays (rather than a lot of small ones).
  * To be more precise - for use in ColumnVector.
  * It differs from std::vector in that it does not initialize the elements.
  *
  * Made noncopyable so that there are no accidental copies. You can copy the data using `assign` method.
  *
  * Only part of the std::vector interface is supported.
  *
  * The default constructor creates an empty object that does not allocate memory.
  * Then the memory is allocated at least initial_bytes bytes.
  *
  * If you insert elements with push_back, without making a `reserve`, then PODArray is about 2.5 times faster than std::vector.
  *
  * The template parameter `pad_right` - always allocate at the end of the array as many unused bytes.
  * Can be used to make optimistic reading, writing, copying with unaligned SIMD instructions.
  *
  * The template parameter `pad_left` - always allocate memory before 0th element of the array (rounded up to the whole number of elements)
  *  and zero initialize -1th element. It allows to use -1th element that will have value 0.
  * This gives performance benefits when converting an array of offsets to array of sizes.
  *
  * Some methods using allocator have TAllocatorParams variadic arguments.
  * These arguments will be passed to corresponding methods of TAllocator.
  * Example: pointer to Arena, that is used for allocations.
  *
  * Why Allocator is not passed through constructor, as it is done in C++ standard library?
  * Because sometimes we have many small objects, that share same allocator with same parameters,
  *  and we must avoid larger object size due to storing the same parameters in each object.
  * This is required for states of aggregate functions.
  *
  * TODO Pass alignment to Allocator.
  * TODO Allow greater alignment than alignof(T). Example: array of char aligned to page size.
  */
template <typename T, size_t initial_bytes, typename TAllocator, size_t pad_right_, size_t pad_left_>
class PODArray : public PODArrayBase<sizeof(T), initial_bytes, TAllocator, pad_right_, pad_left_>
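
The `pad_left` remark in the comment above is easier to see with a small example. The following is only a sketch under my own assumptions: `OffsetsWithLeftPad` and `offsetsToSizes` are hypothetical names, and a plain std::vector with one extra leading slot stands in for the padded memory that PODArray manages itself. It shows why a readable, zeroed element at index -1 removes the special case for the first row when turning cumulative offsets into per-row sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not ClickHouse code: keeps cumulative end offsets
// with one zeroed slot in front, imitating the effect of `pad_left`.
class OffsetsWithLeftPad
{
public:
    explicit OffsetsWithLeftPad(const std::vector<uint64_t> & offsets)
        : storage(offsets.size() + 1, 0)
    {
        for (size_t i = 0; i < offsets.size(); ++i)
            storage[i + 1] = offsets[i];
    }

    size_t size() const { return storage.size() - 1; }

    // Pointer to element 0; data()[-1] is inside the buffer and always reads 0.
    const uint64_t * data() const { return storage.data() + 1; }

private:
    std::vector<uint64_t> storage;
};

// Size of row i is offsets[i] - offsets[i - 1], with no branch for i == 0.
std::vector<uint64_t> offsetsToSizes(const OffsetsWithLeftPad & offsets)
{
    std::vector<uint64_t> sizes(offsets.size());
    const uint64_t * p = offsets.data();
    const std::ptrdiff_t rows = static_cast<std::ptrdiff_t>(offsets.size());
    for (std::ptrdiff_t i = 0; i < rows; ++i)
        sizes[static_cast<size_t>(i)] = p[i] - p[i - 1];
    return sizes;
}

// Usage: rows ending at offsets 3, 5, 5, 9 have sizes 3, 2, 0, 4.
inline void offsetsToSizesExample()
{
    const std::vector<uint64_t> raw{3, 5, 5, 9};
    const OffsetsWithLeftPad offsets(raw);
    assert((offsetsToSizes(offsets) == std::vector<uint64_t>{3, 2, 0, 4}));
}
```

Without the zeroed slot, the loop would need either an `if (i == 0)` branch or a separately handled first iteration.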


The official architecture documentation describes ColumnUInt8 as follows:

Various IColumn implementations (ColumnUInt8, ColumnString, and so on) are responsible for the memory layout of columns. The memory layout is usually a contiguous array. For the integer type of columns, it is just one contiguous array, like std::vector. For String and Array columns, it is two vectors: one for all array elements, placed contiguously, and a second one for offsets to the beginning of each array. There is also ColumnConst that stores just one value in memory, but looks like a column.
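
To make the "two vectors" layout in that quote concrete, here is a minimal sketch. `SimpleStringColumn` is a hypothetical type of my own and not the real ColumnString. It also assumes one particular convention, storing the end position of each row, which the quote does not prescribe. It needs C++17 for std::string_view.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical, simplified string column; not the real ColumnString.
struct SimpleStringColumn
{
    std::vector<char> chars;        // bytes of all rows, stored back to back
    std::vector<uint64_t> offsets;  // offsets[i] = end of row i inside chars (assumed convention)

    void insert(std::string_view value)
    {
        chars.insert(chars.end(), value.begin(), value.end());
        offsets.push_back(chars.size());
    }

    size_t size() const { return offsets.size(); }

    // Row n starts where row n - 1 ended (or at 0 for the first row).
    std::string_view at(size_t n) const
    {
        const uint64_t begin = (n == 0) ? 0 : offsets[n - 1];
        return std::string_view(chars.data() + begin, offsets[n] - begin);
    }
};

// Usage: three rows, one of them empty, share a single 10-byte buffer.
inline void simpleStringColumnExample()
{
    SimpleStringColumn col;
    col.insert("click");
    col.insert("");
    col.insert("house");
    assert(col.at(2) == "house");
    assert(col.chars.size() == 10);
}
```

A fixed-width column such as ColumnUInt8 needs only the first of these two arrays, because every row has the same size.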

null_map

null_map records the null information for every row of the column and is stored as an array. A non-zero entry means the corresponding column cell is null.
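
The relationship between the values and the null map can be sketched as follows. `SimpleNullableInt32Column` is a hypothetical stand-in of mine, not ColumnNullable itself. The choice to store a placeholder 0 in the value array for null rows is an assumption of this sketch, made so that both arrays always have one entry per row. It needs C++17 for std::optional.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified stand-in for a nullable Int32 column.
struct SimpleNullableInt32Column
{
    std::vector<int32_t> nested;    // one value per row; placeholder 0 for null rows (assumption)
    std::vector<uint8_t> null_map;  // one byte per row; non-zero means the row is null

    void insert(std::optional<int32_t> value)
    {
        nested.push_back(value.value_or(0));
        null_map.push_back(value.has_value() ? 0 : 1);
    }

    // Same test as in the excerpt at the top of the post.
    bool isNullAt(size_t n) const { return null_map[n] != 0; }

    std::optional<int32_t> at(size_t n) const
    {
        if (isNullAt(n))
            return std::nullopt;
        return nested[n];
    }
};

// Usage: a stored 0 and a null are distinguished only by the null map.
inline void simpleNullableColumnExample()
{
    SimpleNullableInt32Column col;
    col.insert(0);
    col.insert(std::nullopt);
    assert(!col.isNullAt(0));
    assert(col.isNullAt(1));
    assert(col.nested[0] == col.nested[1]);
}
```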

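There is one question I have not been able to settle: why is null_map a uint8 array rather than a bitmap array? Is it for performance? With a bitmap, checking whether a value is null requires computing a byte index and a bit offset and then applying a mask, whereas with one uint8 per row the check costs almost nothing.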

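// Space-saving variant: if null_map were a bitmap, testing row n would look like this:
bool isNullAt(size_t n)
{
    // uint8_t rather than char: 0x80 does not fit in a signed char.
    static const uint8_t mask_lookup_dict[8] = { 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80 };
    size_t index = n >> 3;    // byte that holds row n
    size_t offset = n & 0x7;  // bit position inside that byte
    uint8_t c = null_map[index];
    return (mask_lookup_dict[offset] & c) != 0;
}

// Time-saving variant: with one byte per row, testing row n is a single load and compare:
bool isNullAt(size_t n)
{
    return null_map[n] != 0;
}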
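
To compare the two layouts side by side, here is a self-contained sketch. `BitNullMap` and `ByteNullMap` are hypothetical types of mine, not ClickHouse code. The sketch only shows that both layouts answer the same question, and how their memory footprints and per-row operations differ. It does not establish why ClickHouse chose the byte layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bit-packed layout: row n lives in bit (n % 8) of byte (n / 8).
struct BitNullMap
{
    std::vector<uint8_t> bytes;

    explicit BitNullMap(size_t rows) : bytes((rows + 7) / 8, 0) {}

    // Writing one row is a read-modify-write of a byte shared with 7 other rows.
    void setNull(size_t n) { bytes[n >> 3] |= static_cast<uint8_t>(1u << (n & 7)); }

    // Reading one row needs a shift and a mask on top of the load.
    bool isNullAt(size_t n) const { return ((bytes[n >> 3] >> (n & 7)) & 1u) != 0; }
};

// Byte-per-row layout: row n is simply bytes[n].
struct ByteNullMap
{
    std::vector<uint8_t> bytes;

    explicit ByteNullMap(size_t rows) : bytes(rows, 0) {}

    void setNull(size_t n) { bytes[n] = 1; }
    bool isNullAt(size_t n) const { return bytes[n] != 0; }
};

// Usage: 20 rows take 3 bytes when bit-packed and 20 bytes with one byte per row.
inline void nullMapLayoutExample()
{
    BitNullMap bits(20);
    ByteNullMap bytes(20);
    bits.setNull(9);
    bytes.setNull(9);
    assert(bits.isNullAt(9) && bytes.isNullAt(9));
    assert(bits.bytes.size() == 3 && bytes.bytes.size() == 20);
}
```

The bit-packed map is eight times smaller. The byte map keeps each row independently addressable, so it has the same shape as any other ColumnUInt8 and needs no index arithmetic per row.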