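Background
ClickHouse's single-node GroupBy performance is widely praised in the industry. How does it achieve such good performance, and where do the optimizations lie? This article explores the secrets behind GroupBy in ClickHouse.
Any discussion of GroupBy has to cover aggregate functions. The GroupBy design in ClickHouse is excellent. The ClickHouse execution engine is built on the Volcano model, and a major performance cost of that model is the overhead of virtual function calls. To address this pain point, ClickHouse uses the C++ CRTP (Curiously Recurring Template Pattern) in its GroupBy design to reduce the cost of virtual calls. In addition, the GroupBy processing flow is fully decoupled from the computation performed by the aggregate functions. A major benefit of this design is that we can patch the kernel to implement all kinds of custom aggregate functions to meet business needs.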
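To make the CRTP idea concrete, here is a minimal sketch. All names in it (IToyAgg, ToyAggHelper, CrtpCount, CrtpSum, runBatch) are hypothetical and are not ClickHouse classes; the state is simplified to a single integer. The point it illustrates is that the only virtual call happens once per batch, while the per-row add call is resolved at compile time through the template parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified interface: callers pay one virtual call per batch.
struct IToyAgg
{
    virtual ~IToyAgg() = default;
    virtual void addBatch(std::uint64_t & state, const int * values, std::size_t n) const = 0;
};

// CRTP helper: Derived is known at compile time, so the add() call inside the
// loop is a plain (non-virtual) call that the compiler is free to inline.
template <typename Derived>
struct ToyAggHelper : IToyAgg
{
    void addBatch(std::uint64_t & state, const int * values, std::size_t n) const override
    {
        assert(values != nullptr || n == 0);
        const auto & self = static_cast<const Derived &>(*this);
        for (std::size_t i = 0; i < n; ++i)
            self.add(state, values[i]);
    }
};

// Counts rows; the value itself is ignored.
struct CrtpCount final : ToyAggHelper<CrtpCount>
{
    void add(std::uint64_t & state, int /*value*/) const { ++state; }
};

// Sums the values.
struct CrtpSum final : ToyAggHelper<CrtpSum>
{
    void add(std::uint64_t & state, int value) const { state += static_cast<std::uint64_t>(value); }
};

// The caller only sees the interface and issues a single virtual call.
std::uint64_t runBatch(const IToyAgg & func, const std::vector<int> & values)
{
    std::uint64_t state = 0;
    func.addBatch(state, values.data(), values.size());
    return state;
}
```

With the values {1, 2, 3, 4}, runBatch(CrtpCount{}, ...) yields 4 and runBatch(CrtpSum{}, ...) yields 10, each through one virtual dispatch.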
Let's now look at how aggregate functions are implemented.
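Aggregate functions
In ClickHouse, implementing an aggregate function consists of two parts.
- Defining the aggregate function (the implementation class plus the data class).
- Registering the aggregate function with ClickHouse.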
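The two parts can be pictured with a toy name-to-creator registry. This is only a sketch of the general pattern: IToyAggregateFunction, RegisteredCount, ToyFactory and registerToyCount are hypothetical names, and the signatures are assumptions made for illustration, not the ClickHouse factory API.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Part 1 (definition): a hypothetical, stripped-down function interface
// and one concrete implementation.
struct IToyAggregateFunction
{
    virtual ~IToyAggregateFunction() = default;
    virtual std::string getName() const = 0;
};

struct RegisteredCount final : IToyAggregateFunction
{
    std::string getName() const override { return "count"; }
};

// Part 2 (registration): a toy factory mapping a name to a creator callback.
class ToyFactory
{
public:
    using Creator = std::function<std::unique_ptr<IToyAggregateFunction>()>;

    void registerFunction(const std::string & name, Creator creator)
    {
        assert(creator != nullptr);
        if (!creators.emplace(name, std::move(creator)).second)
            throw std::logic_error("aggregate function already registered: " + name);
    }

    std::unique_ptr<IToyAggregateFunction> get(const std::string & name) const
    {
        const auto it = creators.find(name);
        if (it == creators.end())
            throw std::out_of_range("unknown aggregate function: " + name);
        return it->second();
    }

private:
    std::unordered_map<std::string, Creator> creators;
};

// Each function ships a small registration hook that the engine would call at startup.
void registerToyCount(ToyFactory & factory)
{
    factory.registerFunction("count", [] { return std::make_unique<RegisteredCount>(); });
}
```

Keeping the lookup keyed by name is what lets a new function be added without touching the code that drives the aggregation: the driver asks the factory for "count" and only ever works with the interface.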
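Below we use the aggregate function Count to see what has to be done to implement an aggregate function in ClickHouse.
[Figure: class diagram of the classes involved in the aggregate function Count]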
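The IAggregateFunction class
ClickHouse defines a unified aggregate function interface (IAggregateFunction). Every concrete aggregate function must implement this interface to perform its computation. The example used here is the implementation of the aggregate function Count.
The core methods of IAggregateFunction are listed below.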
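- getName(): returns the name of the concrete aggregate function. In this example it returns Count.
- getReturnType(): returns the data type of the aggregate function's final result.
- create(): a data management function that constructs an empty state with placement new. In this example the state is AggregateFunctionCountData.
- destroy(): destroys the state that was created by create().
- add(...): the entry point through which callers pass data to the aggregate function for computation.
- addBatch(...): this function is also very important, even though it is nothing more than a for loop that calls add. Looping inside the function reduces the number of virtual calls (one per batch instead of one per row) and makes it more likely that the compiler inlines add.
- merge(...): merges two aggregation states. For example, if two threads each compute a partial aggregate, their results must be combined at the end.
- serialize(...): serializes the aggregation state, typically when spilling to disk or when a distributed query needs to store or transmit intermediate results.
- deserialize(...): deserializes bytes back into an aggregation state.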
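The lifecycle described by these methods can be sketched end to end with a toy Count. Everything here is a simplified assumption: ToyCountData stands in for AggregateFunctionCountData, the methods take plain char pointers and standard streams instead of the real argument types, the state is serialized as raw bytes, and result() is a hypothetical helper for reading the final value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <new>
#include <ostream>
#include <sstream>

// Hypothetical state for Count: just a counter.
struct ToyCountData
{
    std::uint64_t count = 0;
};

// The function object holds no state; every method works on caller-owned memory.
struct ToyCountFunction
{
    // Construct an empty state in place; it must later be released with destroy().
    void create(char * place) const { new (place) ToyCountData(); }
    void destroy(char * place) const noexcept { state(place).~ToyCountData(); }

    // One row contributes one to the count.
    void add(char * place, std::size_t /*row_num*/) const { ++state(place).count; }

    // A plain loop over add(); callers pay for a single call per batch.
    void addBatch(char * place, std::size_t rows) const
    {
        for (std::size_t i = 0; i < rows; ++i)
            add(place, i);
    }

    // Fold another partial state into this one.
    void merge(char * place, const char * rhs) const { state(place).count += state(rhs).count; }

    // Raw-byte encoding, chosen only to keep the sketch short.
    void serialize(const char * place, std::ostream & out) const
    {
        const std::uint64_t value = state(place).count;
        out.write(reinterpret_cast<const char *>(&value), sizeof(value));
    }

    // Expects a freshly created (empty) state.
    void deserialize(char * place, std::istream & in) const
    {
        std::uint64_t value = 0;
        in.read(reinterpret_cast<char *>(&value), sizeof(value));
        state(place).count = value;
    }

    std::uint64_t result(const char * place) const { return state(place).count; }

private:
    static ToyCountData & state(char * place) { return *reinterpret_cast<ToyCountData *>(place); }
    static const ToyCountData & state(const char * place) { return *reinterpret_cast<const ToyCountData *>(place); }
};

// Two partial aggregates (as if from two threads) are merged, then the merged
// state takes a round trip through serialize/deserialize.
std::uint64_t countInTwoParts(std::size_t rows_a, std::size_t rows_b)
{
    ToyCountFunction func;
    alignas(ToyCountData) char a[sizeof(ToyCountData)];
    alignas(ToyCountData) char b[sizeof(ToyCountData)];
    alignas(ToyCountData) char restored[sizeof(ToyCountData)];

    func.create(a);
    func.create(b);
    func.addBatch(a, rows_a);
    func.addBatch(b, rows_b);
    func.merge(a, b);

    std::stringstream buffer;
    func.serialize(a, buffer);
    func.create(restored);
    func.deserialize(restored, buffer);
    assert(buffer.good());

    const std::uint64_t total = func.result(restored);
    func.destroy(restored);
    func.destroy(b);
    func.destroy(a);
    return total;
}
```

Because the function object never owns the state, the caller decides where the memory lives (a stack buffer here, a memory pool in a real engine), which is why create and destroy are explicit rather than tied to a constructor and destructor.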
Part of the IAggregateFunction code is shown below:
/** Aggregate functions interface.
  * Instances of classes with this interface do not contain the data itself for aggregation,
  * but contain only metadata (description) of the aggregate function,
  * as well as methods for creating, deleting and working with data.
  * The data resulting from the aggregation (intermediate computing states) is stored in other objects
  * (which can be created in some memory pool),
  * and IAggregateFunction is the external interface for manipulating them.
  */
class IAggregateFunction
{
public:
    /// Get main function name.
    virtual String getName() const = 0;

    /// Get the result type.
    virtual DataTypePtr getReturnType() const = 0;

    /** Data manipulating functions. */

    /** Create empty data for aggregation with `placement new` at the specified location.
      * You will have to destroy them using the `destroy` method.
      */
    virtual void create(AggregateDataPtr __restrict place) const = 0;

    /// Delete data for aggregation.
    virtual void destroy(AggregateDataPtr __restrict place) const noexcept = 0;

    /** Adds a value into aggregation data on which place points to.
      * columns points to columns containing arguments of aggregation function.
      * row_num is number of row which should be added.
      * Additional parameter arena should be used instead of standard memory allocator if the addition requires memory allocation.
      */
    virtual void add(AggregateDataPtr __restrict place, const IColumn ** columns, size_t row_num, Arena * arena) const = 0;

    /// Merges state (on which place points to) with other state of current aggregation function.
    virtual void merge(AggregateDataPtr __restrict place, ConstAggregateDataPtr rhs, Arena * arena) const = 0;

    /// Serializes state (to transmit it over the network, for example).
    virtual void serialize(ConstAggregateDataPtr __restrict place, WriteBuffer & buf) const = 0;

    /// Deserializes state. This function is called only for empty (just created) states.
    virtual void deserialize(AggregateDataPtr __restrict place, ReadBuffer & buf, Arena * arena) const = 0;

    /** The inner loop that uses the function pointer is better than using the virtual function.
      * The reason is that in the case of virtual functions GCC 5.1.2 generates code,
      * which, at each iteration of the loop, reloads the function address (the offset value in the virtual function table) from memory to the register.
      * This gives a performance drop on simple queries around 12%.
      * After the appearance of better compilers, the code can be removed.
      */
    using AddFunc = void (*)(const IAggregateFunction *, AggregateDataPtr, const IColumn **, size_t, Arena *);
    virtual AddFunc getAddressOfAddFunction() const = 0;

    /** Contains a loop with calls to "add" function. You can collect arguments into array "places"
      * and do a single call to "