An Analysis of ClickHouse's isNullAt Implementation

maray

Category: Database Technology. Tags: clickhouse, c++

Article link: https://blog.csdn.net/maray/article/details/128625157

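class ColumnNullable final : public COWHelper<IColumn, ColumnNullable>
{
public:
    // ... other members omitted ...

    /// A row is null when its entry in the null map is nonzero.
    bool isNullAt(size_t n) const override
    {
        return assert_cast<const ColumnUInt8 &>(*null_map).getData()[n] != 0;
    }
};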

Let's walk through the code above piece by piece:

ColumnUInt8

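ColumnUInt8 is an instantiation of the ColumnVector template and is responsible for describing the column's in-memory layout. Such a column may hold ordinary data, or it may serve as a bitmap-like mask column, as null_map does here.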

ColumnUInt8 is defined in Columns/ColumnsNumber.h:

namespace DB
{

/** Columns with numbers. */

using ColumnUInt8 = ColumnVector<UInt8>;
using ColumnUInt16 = ColumnVector<UInt16>;
using ColumnUInt32 = ColumnVector<UInt32>;
using ColumnUInt64 = ColumnVector<UInt64>;
using ColumnUInt128 = ColumnVector<UInt128>;
using ColumnUInt256 = ColumnVector<UInt256>;

using ColumnInt8 = ColumnVector<Int8>;
using ColumnInt16 = ColumnVector<Int16>;
using ColumnInt32 = ColumnVector<Int32>;
using ColumnInt64 = ColumnVector<Int64>;
using ColumnInt128 = ColumnVector<Int128>;
using ColumnInt256 = ColumnVector<Int256>;

using ColumnFloat32 = ColumnVector<Float32>;
using ColumnFloat64 = ColumnVector<Float64>;

using ColumnUUID = ColumnVector<UUID>;

}

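// ColumnVector (abridged)
template <typename T>
class ColumnVector final : public COWHelper<ColumnVectorHelper, ColumnVector<T>>
{
public:
    using ValueType = T;
    using Container = PaddedPODArray<ValueType>;

    Container & getData()
    {
        return data;
    }

    // The const overload is the one called from the const method isNullAt.
    const Container & getData() const
    {
        return data;
    }

protected:
    Container data;
};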

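Container is a PaddedPODArray, which is an alias of PODArray (a dynamic array for Plain Old Data types). It can be loosely thought of as a std::vector. See PODArray.h for the full implementation: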

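PODArray_fwd.h:
// Default template arguments omitted here.
template <typename T, size_t initial_bytes, typename TAllocator>
using PaddedPODArray = PODArray<T, initial_bytes, TAllocator, PADDING_FOR_SIMD - 1, PADDING_FOR_SIMD>;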


Common/PODArray.h
/** A dynamic array for POD types.
  * Designed for a small number of large arrays (rather than a lot of small ones).
  * To be more precise - for use in ColumnVector.
  * It differs from std::vector in that it does not initialize the elements.
  *
  * Made noncopyable so that there are no accidental copies. You can copy the data using `assign` method.
  *
  * Only part of the std::vector interface is supported.
  *
  * The default constructor creates an empty object that does not allocate memory.
  * Then the memory is allocated at least initial_bytes bytes.
  *
  * If you insert elements with push_back, without making a `reserve`, then PODArray is about 2.5 times faster than std::vector.
  *
  * The template parameter `pad_right` - always allocate at the end of the array as many unused bytes.
  * Can be used to make optimistic reading, writing, copying with unaligned SIMD instructions.
  *
  * The template parameter `pad_left` - always allocate memory before 0th element of the array (rounded up to the whole number of elements)
  *  and zero initialize -1th element. It allows to use -1th element that will have value 0.
  * This gives performance benefits when converting an array of offsets to array of sizes.
  *
  * Some methods using allocator have TAllocatorParams variadic arguments.
  * These arguments will be passed to corresponding methods of TAllocator.
  * Example: pointer to Arena, that is used for allocations.
  *
  * Why Allocator is not passed through constructor, as it is done in C++ standard library?
  * Because sometimes we have many small objects, that share same allocator with same parameters,
  *  and we must avoid larger object size due to storing the same parameters in each object.
  * This is required for states of aggregate functions.
  *
  * TODO Pass alignment to Allocator.
  * TODO Allow greater alignment than alignof(T). Example: array of char aligned to page size.
  */
template <typename T, size_t initial_bytes, typename TAllocator, size_t pad_right_, size_t pad_left_>
class PODArray : public PODArrayBase<sizeof(T), initial_bytes, TAllocator, pad_right_, pad_left_>
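
The comment above says that left padding makes a zero-valued "-1th element" available, which helps when converting an array of offsets to an array of sizes. The following is a minimal sketch of that idea only, not PODArray itself. `PaddedOffsets` and `offsetsToSizes` are hypothetical names introduced for illustration, and a plain std::vector with one leading zero slot stands in for the padded memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an offsets array with one zeroed slot of left padding.
// storage[0] plays the role of the "-1th element"; real offsets start at storage[1].
struct PaddedOffsets
{
    std::vector<std::uint64_t> storage{0};

    void push_back(std::uint64_t offset) { storage.push_back(offset); }

    // Number of real offsets, excluding the padding slot.
    std::size_t size() const { return storage.size() - 1; }

    // Logical index i in [-1, size()); at(-1) is always 0.
    std::uint64_t at(std::ptrdiff_t i) const
    {
        assert(i >= -1 && i < static_cast<std::ptrdiff_t>(size()));
        return storage[static_cast<std::size_t>(i + 1)];
    }
};

// Converts cumulative offsets to per-row sizes with no special case for row 0,
// because at(-1) yields 0.
std::vector<std::uint64_t> offsetsToSizes(const PaddedOffsets & offsets)
{
    std::vector<std::uint64_t> sizes;
    sizes.reserve(offsets.size());
    const std::ptrdiff_t rows = static_cast<std::ptrdiff_t>(offsets.size());
    for (std::ptrdiff_t i = 0; i < rows; ++i)
        sizes.push_back(offsets.at(i) - offsets.at(i - 1));
    return sizes;
}
```

Without the zeroed slot, the loop would need a branch for the first row. With it, every row is handled by the same subtraction.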

The official architecture documentation describes ColumnUInt8 as follows:

Various IColumn implementations (ColumnUInt8, ColumnString, and so on) are responsible for the memory layout of columns. The memory layout is usually a contiguous array. For the integer type of columns, it is just one contiguous array, like std::vector. For String and Array columns, it is two vectors: one for all array elements, placed contiguously, and a second one for offsets to the beginning of each array. There is also ColumnConst that stores just one value in memory, but looks like a column.
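
The two-vector layout described in the quote can be sketched as follows. This is a simplified illustration, not ClickHouse's ColumnString: `SimpleStringColumn` and its members are hypothetical, and storing the end position of each string as its offset is an assumption made here to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical simplified string column: all characters live in one contiguous
// array, and a second array holds one offset per row.
// Assumption of this sketch: offsets[n] is the position where row n ends.
struct SimpleStringColumn
{
    std::vector<char> chars;
    std::vector<std::size_t> offsets;

    void insert(const std::string & value)
    {
        chars.insert(chars.end(), value.begin(), value.end());
        offsets.push_back(chars.size());
    }

    std::size_t size() const { return offsets.size(); }

    // Copies row n out of the shared character array.
    std::string getDataAt(std::size_t n) const
    {
        assert(n < size());
        const std::size_t begin = (n == 0) ? 0 : offsets[n - 1];
        const std::size_t end = offsets[n];
        return std::string(chars.begin() + static_cast<std::ptrdiff_t>(begin),
                           chars.begin() + static_cast<std::ptrdiff_t>(end));
    }
};
```

Inserting "ab", "" and "xyz" leaves five characters in `chars` and the offsets 2, 2, 5, so the empty string costs only one offset entry.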

null_map

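null_map records the null information for every row of the column, stored as an array. A nonzero value at a given position means that the corresponding column cell is null.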
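
To make the role of null_map concrete, here is a minimal sketch of a nullable column built from a data array plus a one-byte-per-row null map. `SimpleNullableInt32` and `countNulls` are hypothetical names, and writing a placeholder 0 into the data array for null rows is an assumption of this sketch rather than something the code shown earlier confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplified nullable column of 32-bit integers.
struct SimpleNullableInt32
{
    std::vector<std::int32_t> values;   // data array; placeholder 0 where the row is null (assumption)
    std::vector<std::uint8_t> null_map; // one byte per row; nonzero means null

    void insertValue(std::int32_t v)
    {
        values.push_back(v);
        null_map.push_back(0);
    }

    void insertNull()
    {
        values.push_back(0);
        null_map.push_back(1);
    }

    std::size_t size() const { return null_map.size(); }

    // Same test as in the isNullAt shown at the top: nonzero byte means null.
    bool isNullAt(std::size_t n) const
    {
        assert(n < size());
        return null_map[n] != 0;
    }
};

// Counts null rows by scanning the null map.
std::size_t countNulls(const SimpleNullableInt32 & col)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < col.size(); ++i)
        count += col.isNullAt(i) ? 1 : 0;
    return count;
}
```

The data array and the null map always have the same length, so row n of one lines up with row n of the other.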

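There is one question I have not resolved: why is null_map a uint8 array rather than a bitmap array? Is it for performance? With a bitmap, checking whether a value is null requires computing a byte index and a bit offset and then applying a mask, whereas with one uint8 per row the check costs almost nothing.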

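// Space-saving variant: if null_map were a bitmap, checking whether row n is null would look like this:
bool isNullAt(size_t n)
{
    // unsigned char, because 0x80 does not fit in a signed char
    static const unsigned char mask_lookup_dict[8] = { 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80 };
    size_t index = n >> 3;   // byte that holds the bit for row n
    size_t offset = n & 0x7; // position of the bit inside that byte
    unsigned char c = null_map[index];
    return (mask_lookup_dict[offset] & c) != 0;
}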

// Time-saving variant: with one char per row, checking whether row n is null looks like this:
bool isNullAt(size_t n)
{
   return null_map[n] != 0;
}
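
The trade-off between the two variants above can be checked with a small self-contained sketch. `ByteNullMap` and `BitNullMap` are hypothetical types written for this comparison. They are not ClickHouse classes, and the sketch measures only memory footprint, not speed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One byte per row: the lookup is a single indexed load.
struct ByteNullMap
{
    std::vector<std::uint8_t> bytes;

    explicit ByteNullMap(std::size_t rows) : bytes(rows, 0) {}

    void setNull(std::size_t n) { bytes[n] = 1; }
    bool isNullAt(std::size_t n) const { return bytes[n] != 0; }
    std::size_t memoryBytes() const { return bytes.size(); }
};

// One bit per row: eight times smaller, but each lookup needs a shift and a mask.
struct BitNullMap
{
    std::vector<std::uint8_t> bits;

    explicit BitNullMap(std::size_t rows) : bits((rows + 7) / 8, 0) {}

    void setNull(std::size_t n)
    {
        bits[n >> 3] = static_cast<std::uint8_t>(bits[n >> 3] | (1u << (n & 7)));
    }

    bool isNullAt(std::size_t n) const
    {
        return ((bits[n >> 3] >> (n & 7)) & 1u) != 0;
    }

    std::size_t memoryBytes() const { return bits.size(); }
};
```

For 20 rows the byte map occupies 20 bytes and the bitmap 3 bytes, while both answer isNullAt identically. Whether the extra shift and mask matter in practice depends on the workload and would have to be measured.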