class ColumnNullable final : public COWHelper<IColumn, ColumnNullable>
{
    bool isNullAt(size_t n) const override
    {
        return assert_cast<const ColumnUInt8 &>(*null_map).getData()[n] != 0;
    }
};
Let's walk through the code above piece by piece:
ColumnUInt8
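ColumnUInt8 is an instantiation of the ColumnVector template and is responsible for describing the column's memory layout. The column here may be a data column, or it may be a mask column such as null_map.
ColumnUInt8 is defined in Columns/ColumnsNumber.h: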
namespace DB
{
/** Columns with numbers. */
using ColumnUInt8 = ColumnVector<UInt8>;
using ColumnUInt16 = ColumnVector<UInt16>;
using ColumnUInt32 = ColumnVector<UInt32>;
using ColumnUInt64 = ColumnVector<UInt64>;
using ColumnUInt128 = ColumnVector<UInt128>;
using ColumnUInt256 = ColumnVector<UInt256>;
using ColumnInt8 = ColumnVector<Int8>;
using ColumnInt16 = ColumnVector<Int16>;
using ColumnInt32 = ColumnVector<Int32>;
using ColumnInt64 = ColumnVector<Int64>;
using ColumnInt128 = ColumnVector<Int128>;
using ColumnInt256 = ColumnVector<Int256>;
using ColumnFloat32 = ColumnVector<Float32>;
using ColumnFloat64 = ColumnVector<Float64>;
using ColumnUUID = ColumnVector<UUID>;
}
template <typename T>
class ColumnVector final : public COWHelper<ColumnVectorHelper, ColumnVector<T>>
{
public:
    using ValueType = T;
    using Container = PaddedPODArray<ValueType>;

    Container & getData()
    {
        return data;
    }
};
Container is a PODArray (an array of Plain Old Data elements); as a first approximation, think of it as a std::vector. See PODArray.h for the implementation details:
PODArray_fwd.h:
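// Default template arguments omitted.
template <typename T, size_t initial_bytes, typename TAllocator>
using PaddedPODArray = PODArray<T, initial_bytes, TAllocator, PADDING_FOR_SIMD - 1, PADDING_FOR_SIMD>;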
Common/PODArray.h
/** A dynamic array for POD types.
* Designed for a small number of large arrays (rather than a lot of small ones).
* To be more precise - for use in ColumnVector.
* It differs from std::vector in that it does not initialize the elements.
*
* Made noncopyable so that there are no accidental copies. You can copy the data using `assign` method.
*
* Only part of the std::vector interface is supported.
*
* The default constructor creates an empty object that does not allocate memory.
* Then the memory is allocated at least initial_bytes bytes.
*
* If you insert elements with push_back, without making a `reserve`, then PODArray is about 2.5 times faster than std::vector.
*
* The template parameter `pad_right` - always allocate at the end of the array as many unused bytes.
* Can be used to make optimistic reading, writing, copying with unaligned SIMD instructions.
*
* The template parameter `pad_left` - always allocate memory before 0th element of the array (rounded up to the whole number of elements)
* and zero initialize -1th element. It allows to use -1th element that will have value 0.
* This gives performance benefits when converting an array of offsets to array of sizes.
*
* Some methods using allocator have TAllocatorParams variadic arguments.
* These arguments will be passed to corresponding methods of TAllocator.
* Example: pointer to Arena, that is used for allocations.
*
* Why Allocator is not passed through constructor, as it is done in C++ standard library?
* Because sometimes we have many small objects, that share same allocator with same parameters,
* and we must avoid larger object size due to storing the same parameters in each object.
* This is required for states of aggregate functions.
*
* TODO Pass alignment to Allocator.
* TODO Allow greater alignment than alignof(T). Example: array of char aligned to page size.
*/
template <typename T, size_t initial_bytes, typename TAllocator, size_t pad_right_, size_t pad_left_>
class PODArray : public PODArrayBase<sizeof(T), initial_bytes, TAllocator, pad_right_, pad_left_>
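The `pad_left` remark in the comment above (a zero-valued element at index -1 that helps when converting offsets to sizes) is easier to see with a small example. The following is a minimal sketch, not ClickHouse code: it emulates the left padding with a plain std::vector that keeps one extra leading slot, and the names `PaddedOffsets` and `offsetsToSizes` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a left-padded array: storage[0] is the zeroed
// "-1th" element, and logical element i lives at storage[i + 1].
struct PaddedOffsets
{
    std::vector<uint64_t> storage{0};

    void push_back(uint64_t offset) { storage.push_back(offset); }
    size_t size() const { return storage.size() - 1; }

    // Signed index so that at(-1) is valid and returns 0.
    uint64_t at(std::ptrdiff_t i) const
    {
        assert(i >= -1 && i < static_cast<std::ptrdiff_t>(size()));
        return storage[static_cast<size_t>(i + 1)];
    }
};

// Convert cumulative end offsets to per-row sizes.
// Row 0 needs no special case because at(-1) is 0.
std::vector<uint64_t> offsetsToSizes(const PaddedOffsets & offsets)
{
    std::vector<uint64_t> sizes;
    sizes.reserve(offsets.size());
    const std::ptrdiff_t rows = static_cast<std::ptrdiff_t>(offsets.size());
    for (std::ptrdiff_t i = 0; i < rows; ++i)
        sizes.push_back(offsets.at(i) - offsets.at(i - 1));
    return sizes;
}
```

With offsets 3, 5, 9 the loop yields sizes 3, 2, 4. Without the padded element, the first iteration would need a branch for `i == 0`.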
The official architecture documentation describes ColumnUInt8 as follows:
Various IColumn implementations (ColumnUInt8, ColumnString, and so on) are responsible for the memory layout of columns. The memory layout is usually a contiguous array. For the integer type of columns, it is just one contiguous array, like std::vector. For String and Array columns, it is two vectors: one for all array elements, placed contiguously, and a second one for offsets to the beginning of each array. There is also ColumnConst that stores just one value in memory, but looks like a column.
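To make the two-vector layout for String columns concrete, here is a minimal sketch. It is an illustration only, not the ColumnString implementation: the type `SimpleStringColumn` is hypothetical, and it assumes each offset marks the end of a row's characters, which may differ in detail from what ClickHouse does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical two-vector string column: all characters stored back to back,
// plus one cumulative end offset per row.
struct SimpleStringColumn
{
    std::vector<char> chars;
    std::vector<size_t> offsets;

    void insert(const std::string & value)
    {
        chars.insert(chars.end(), value.begin(), value.end());
        offsets.push_back(chars.size());
    }

    size_t size() const { return offsets.size(); }

    // Row n spans [offsets[n - 1], offsets[n]); row 0 starts at 0.
    std::string get(size_t n) const
    {
        assert(n < offsets.size());
        const std::ptrdiff_t begin = static_cast<std::ptrdiff_t>(n == 0 ? 0 : offsets[n - 1]);
        const std::ptrdiff_t end = static_cast<std::ptrdiff_t>(offsets[n]);
        return std::string(chars.begin() + begin, chars.begin() + end);
    }
};
```

Inserting "ab", "" and "xyz" leaves five characters in `chars` and the offsets 2, 2, 5, so an empty string costs only one offset entry.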
null_map
It records the null information for the entire column and is an array. A nonzero entry means the corresponding column cell is null.
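The relationship between the data and the null map can be sketched as two parallel arrays of equal length. This is a simplified illustration under my own assumptions, not the ColumnNullable implementation: the type `SimpleNullableInt32Column` and its `insertValue` / `insertNull` methods are hypothetical, and storing 0 as the placeholder for null rows is just a choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical nullable column: nested values plus a byte-per-row null map.
struct SimpleNullableInt32Column
{
    std::vector<int32_t> values;   // nested data; holds a placeholder for null rows
    std::vector<uint8_t> null_map; // one byte per row; nonzero means null

    void insertValue(int32_t value)
    {
        values.push_back(value);
        null_map.push_back(0);
    }

    void insertNull()
    {
        values.push_back(0); // placeholder keeps both arrays the same length
        null_map.push_back(1);
    }

    bool isNullAt(size_t n) const
    {
        assert(n < null_map.size());
        return null_map[n] != 0;
    }
};
```

Because both arrays always have the same length, row `n` of the data and row `n` of the null map refer to the same cell.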
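There is one thing I have not figured out here: why is null_map a uint8 array rather than a bitmap array? Is it for performance? With a bitmap, checking whether a value is null requires extra offset and bit-mask computation, whereas with uint8 the check costs almost nothing.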
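// Space-saving approach: assuming a bitmap, checking whether row n is null would look like this:
bool isNullAt(size_t n)
{
    // unsigned char avoids a narrowing error for 0x80 where char is signed
    static const unsigned char mask_lookup_dict[8] = { 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80 };
    size_t index = n >> 3;   // byte that holds the bit for row n
    size_t offset = n & 0x7; // position of that bit within the byte
    unsigned char c = null_map[index];
    return (mask_lookup_dict[offset] & c) != 0;
}
// Time-saving approach: assuming a char array, checking whether row n is null would look like this:
bool isNullAt(size_t n)
{
    return null_map[n] != 0;
}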
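The two variants above can be put side by side in a self-contained form to make the space and time trade-off visible. This is only a sketch for comparison: the free functions `isNullInByteMap`, `isNullInBitmap` and `packToBitmap` are hypothetical names, and the bit order (least significant bit first) is an assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte map: one uint8_t per row, a single load and compare.
bool isNullInByteMap(const std::vector<uint8_t> & null_map, size_t n)
{
    assert(n < null_map.size());
    return null_map[n] != 0;
}

// Bitmap: one bit per row, least significant bit first within each byte.
// Needs a shift and a mask in addition to the load.
bool isNullInBitmap(const std::vector<uint8_t> & bitmap, size_t n)
{
    assert((n >> 3) < bitmap.size());
    return ((bitmap[n >> 3] >> (n & 0x7)) & 1u) != 0;
}

// Pack a byte map into the equivalent bitmap (8 rows per byte, rounded up).
std::vector<uint8_t> packToBitmap(const std::vector<uint8_t> & null_map)
{
    std::vector<uint8_t> bitmap((null_map.size() + 7) / 8, 0);
    for (size_t n = 0; n < null_map.size(); ++n)
        if (null_map[n] != 0)
            bitmap[n >> 3] = static_cast<uint8_t>(bitmap[n >> 3] | (1u << (n & 0x7)));
    return bitmap;
}
```

For 10 rows the byte map takes 10 bytes and the bitmap takes 2, so the bitmap is roughly eight times smaller, at the price of the extra index arithmetic on every lookup.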