#**KADB向量化执行引擎技术**
**随着企业的发展和数字化转型,企业的数据量越来越大,
数字化转型的加速推进,使得业务数据爆炸性增长,海量数据已成为数字化时代最为典型的标志之一。**
**传统的集中式数据库已经不能满足决策者需要,为了满足大数据下的分析需求,分部署数据库出现了,早期的分部署式数据库如:Greenplum、vertica、gbase现在的Tidb、gaussdb、kadb等**
**KADB是北京人大金仓科技股份有限公司基于postgres数据库,借鉴Greenplum架构研发的一款分布式数据库,KADB完美集成了postgresXX能力和Greenplum的计算能,同时KADB还推出了全新的向量化执行引擎传统的QUERY执行器在处理表达式时,是一行一行的处理模式。
比如在处理(x+y)这个表达式时,首先读取一条记录的X值,然后读取这条记录的Y值,然后执行+操作符。然后将结果返回给上层node。
然而,向量化执行器,一个操作符可以处理多个值,比如(x+y) ,x, y并不是scalar值,而是一批值的向量,向量化执行的结果也不是scalar值,而是向量。
相比传统的query引擎,向量化执行更适合处理大批量的数据。**
为了更好的理解和比较,我们采用最新的KADB来做一下不同执行引擎的测试
# 集群配置 #
| 集群角色 | 服务器数量|实例数量 | 服务器配置 |
| :----:| :----: |:----: | :----: |
| 管理节点 | 2 | | CPU: E5-2620 v4 @ 2.10GHz 内存:256G 硬盘:STAT 7200 |
| 计算节点 | 6 | 8 |CPU: E5-2620 v4 @ 2.10GHz 内存:256G 硬盘:STAT 7200 |
# 数据量 #
采用标准的TPC_H程序生成测试数据
| TPC_H数据量 | lineitem表数据条数 | lineitem表大小|
| :----:| :----: |:----:|
|1GB | 6百万 |652 MB|
|100GB | 6亿 | 64 GB|
|1TB | 60亿 | 652 GB
## 表结构 ##
采用TPC_H lineitem 表作为测试表,存储模式为列存。
CREATE TABLE lineitem
(L_ORDERKEY BIGINT,
L_PARTKEY INT,
L_SUPPKEY INT,
L_LINENUMBER INTEGER,
L_QUANTITY float,
L_EXTENDEDPRICE float,
L_DISCOUNT float,
L_TAX float,
L_RETURNFLAG CHAR(1),
L_LINESTATUS CHAR(1),
L_SHIPDATE DATE,
L_COMMITDATE DATE,
L_RECEIPTDATE DATE,
L_SHIPINSTRUCT CHAR(25),
L_SHIPMODE CHAR(10),
L_COMMENT VARCHAR(44))
using ao_column;
<h1 align = "center">ORCA、PostgreSQL的JIT、向量化引擎 三种执行引擎对比</h1>
**ORCA、PostgreSQL的JIT、向量化引擎 三种执行引擎,其中ORCA是从Greenplum集成过来的,JIT是从PostgreSQL继承过来的,向量化引擎是KADB最新研发的。下面测试在不同数据量下三种执行引擎的效果。 注:数据都是通过标准TPC_H产生的数据**
# 1.数据量 1GB #
## ORCA执行引擎 ##
默认ORCA是开启的。
postgres=# show optimizer;
optimizer
-----------
on
(1 row)
postgres=# select
postgres-# l_returnflag,
postgres-# l_linestatus,
postgres-# sum(l_quantity) as sum_qty,
postgres-# sum(l_extendedprice) as sum_base_price,
postgres-# sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
postgres-# sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
postgres-# avg(l_quantity) as avg_qty,
postgres-# avg(l_extendedprice) as avg_price,
postgres-# avg(l_discount) as avg_disc,
postgres-# count(*) as count_order
postgres-# from
postgres-# lineitem_1g
postgres-# where
postgres-# l_shipdate <= date '1998-12-01' - interval '90 days'
postgres-# group by
postgres-# l_returnflag,
postgres-# l_linestatus
postgres-# ;
l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty |
avg_price | avg_disc | count_order
--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+----
----------------+----------------------+-------------
N| F| 991417 | 1487504710.3799996 | 1413082168.0540998 | 1469649223.1943755 | 25.516471920522985 | 38284.467760848296 | 0.05009342667421628 | 38854
A| F| 37734107 | 56586554400.73018 | 53758257134.86981 | 55909065222.82771 | 25.522005853257337 | 38273.12973462179 | 0.049985295838423954 | 1478493
N| O| 74476040 | 111701729697.74016 | 106118230307.60542 | 110367043872.49702 | 25.50222676958499 | 38249.11798890832 | 0.04999658605376632 | 2920374
R| F| 37719753 | 56568041380.90002 | 53741292684.60405 | 55889619119.83172 | 25.50579361269077 | 38250.854626099666 | 0.05000940583015323 | 1478870
(4 rows)
Time: 534.271 ms
postgres=#
## JIT执行引擎 ##
打开JIT执行引起调整相关参数强制走JIT执行引擎。
set optimizer=off;(JIT不支持ORAC所以要先关闭ORAC)
set jit to on;
set jit_optimize_above_cost to 0;
set jit_above_cost to 0;
set jit_inline_above_cost to 0;
postgres=# select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem_1g
where
l_shipdate <= date '1998-12-01' - interval '90 days'
group by
l_returnflag,
l_linestatus
;
l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty |
avg_price | avg_disc | count_order
--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+----
----------------+---------------------+-------------
A| F| 37734107 | 56586554400.730194 | 53758257134.86982 | 55909065222.82772 | 25.522005853257337 | 38273.129734621805 | 0.04998529583842396 | 1478493
N| F| 991417 | 1487504710.3799999 | 1413082168.0540996 | 1469649223.1943755 | 25.516471920522985 | 38284.4677608483 | 0.05009342667421627 | 38854
N| O| 74476040 | 111701729697.74011 | 106118230307.60542 | 110367043872.49704 | 25.50222676958499 | 38249.11798890831 | 0.04999658605376633 | 2920374
R| F| 37719753 | 56568041380.90002 | 53741292684.60404 | 55889619119.83172 | 25.50579361269077 | 38250.854626099666 | 0.05000940583015323 | 1478870
(4 rows)
Time: 1251.055 ms (00:01.251)
postgres=#
## 向量化执行引擎 ##
打开向量化执行引擎
set optimizer = off ;(向量化也不支持ORAC所以也要先关闭ORAC)
set enable_groupagg to off;
set jit to off;
set enable_vectorize_engine to on;
postgres=# select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem_1g
where
l_shipdate <= date '1998-12-01' - interval '90 days'
group by
l_returnflag,
l_linestatus
;
l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty |avg_price | avg_disc | count_order
--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------
R| F| 37719753 | 56568041380.90002 | 53741292684.60406 | 55889619119.83171 | 25.50579361269077 | 38250.854626099666 | 0.050009405830153224 | 1478870
A| F| 37734107 | 56586554400.73018 | 53758257134.86983 | 55909065222.82771 | 25.522005853257337 | 38273.12973462179 | 0.04998529583842394 | 1478493
N| O| 74476040 | 111701729697.74014 | 106118230307.60542 | 110367043872.497 | 25.50222676958499 | 38249.11798890832 | 0.04999658605376632 | 2920374
N| F| 991417 | 1487504710.3799996 | 1413082168.0540996 | 1469649223.1943755 | 25.516471920522985 | 38284.467760848296 | 0.05009342667421627 | 38854
(4 rows)
Time: 280.208 ms
postgres=#
# 结果对比 #
| 执行引擎类型 | 时间 |
| :-----:| :----: |
| ORCA | 534.271 ms |
| JIT | 1251.055 ms |
| 向量化 | 280.208 ms |
**从上面可以看出数据量为1GB时,向量化执行引擎是三种执行中用时最短的大概是GPORCA
执行引擎的2倍,是JIT的6倍。那随着数据量增长是否还会有这样的效果呢?下面我增加数据量进行测试。**
----------
# 2.数据量 100GB #
## ORCA执行引擎 ##
postgres=# show optimizer;
optimizer
-----------
on
(1 row)
Time: 0.204 ms
postgres=# select
postgres-# l_returnflag,
postgres-# l_linestatus,
postgres-# sum(l_quantity) as sum_qty,
postgres-# sum(l_extendedprice) as sum_base_price,
postgres-# sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
postgres-# sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
postgres-# avg(l_quantity) as avg_qty,
postgres-# avg(l_extendedprice) as avg_price,
postgres-# avg(l_discount) as avg_disc,
postgres-# count(*) as count_order
postgres-# from
postgres-# lineitem_100g
postgres-# where
postgres-# l_shipdate <= date '1998-12-01' - interval '90 days'
postgres-# group by
postgres-# l_returnflag,
postgres-# l_linestatus
postgres-# ;
l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order
--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------
A| F| 3775127758 | 5660776097194.451 | 5377736398183.81 | 5592847429515.931 | 25.499370423275426 |38236.11698430491 | 0.05000224353247317 | 148047881
N| O| 7436302976 | 11150725681373.66 | 10593195308234.557 | 11016932248183.568 | 25.500009404374193 | 38237.227646361185 | 0.04999791831849612 | 291619617
R| F| 3775724970 | 5661603032745.284 | 5378513563915.388 | 5593662252667.158 | 25.50006628406532 | 38236.697258452594 | 0.05000130434119794 | 148067261
N| F| 98553062 | 147771098385.98004 | 140384965965.03406 | 145999793032.77475 | 25.501556956882876 |38237.19938880452 | 0.049985284338125296 | 3864590
(4 rows)
Time: 39110.600 ms (00:39.111)
postgres=#
## JIT执行引擎 ##
postgres=# set optimizer = off;
SET
Time: 310.773 ms
postgres=# select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem_100g
where
l_shipdate <= date '1998-12-01' - interval '90 days'
group by
l_returnflag,
l_linestatus
;
l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order
--------------+--------------+------------+-------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------
A| F| 3775127758 | 5660776097194.451 | 5377736398183.811 | 5592847429515.931 | 25.499370423275426 | 38236.11698430491 | 0.05000224353247317 | 148047881
N| F| 98553062 | 147771098385.98 | 140384965965.03403 | 145999793032.77478 | 25.501556956882876 | 38237.19938880451 | 0.049985284338125296 | 3864590
N| O| 7436302976 | 11150725681373.66 | 10593195308234.559 | 11016932248183.568 | 25.500009404374193 | 38237.227646361185 | 0.04999791831849612 | 291619617
R| F| 3775724970 | 5661603032745.282 | 5378513563915.389 | 5593662252667.159 | 25.50006628406532 | 38236.69725845258 | 0.05000130434119794 | 148067261
(4 rows)
Time: 58297.991 ms (00:58.298)
postgres=#
## 向量化执行引擎 ##
postgres=# set enable_groupagg to off;
SET
Time: 18.182 ms
postgres=# set jit to off;
SET
Time: 0.556 ms
postgres=# set enable_vectorize_engine to on;
SET
Time: 0.630 ms
postgres=# select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem_100g
where
l_shipdate <= date '1998-12-01' - interval '90 days'
group by
l_returnflag,
l_linestatus
;
l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order
--------------+--------------+------------+-------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------
R| F| 3775724970 | 5661603032745.284 | 5378513563915.388 | 5593662252667.158 | 25.50006628406532 | 38236.697258452594 | 0.05000130434119794 | 148067261
A| F| 3775127758 | 5660776097194.452 | 5377736398183.81 | 5592847429515.932 | 25.499370423275426 | 38236.11698430491 | 0.05000224353247317 | 148047881
N| O| 7436302976 | 11150725681373.66 | 10593195308234.555 | 11016932248183.566 | 25.500009404374193 | 38237.227646361185 | 0.049997918318496114 | 291619617
N| F| 98553062 | 147771098385.98 | 140384965965.03406 | 145999793032.77478 | 25.501556956882876 | 38237.19938880451 | 0.04998528433812528 | 3864590
(4 rows)
Time: 19781.192 ms (00:19.781)
postgres=#
# 结果对比 #
| 执行引擎类型 | 时间 |
| :-----:| :----: |
| ORCA | 39110.600 ms |
| JIT | 58297.991 ms |
| 向量化 | 19781.192 ms|
**从测试结果来看,当数据量是100GB的时候,向量化执行引擎仍然是三种执行引擎里用时最短的,是ORCA2倍,是JIT的3倍,可以看出随着数据量的增长,JIT引擎的用时有所提升,但是和向量化引擎和ORCA引擎还是有差距,那么我们在增加数据量来看一下对比情况。**
----------
# 3. 数据量 1TB #
## ORCA执行引擎 ##
postgres=# show optimizer;
optimizer
-----------
on
(1 row)
Time: 0.204 ms
postgres=# select
postgres-# l_returnflag,
postgres-# l_linestatus,
postgres-# sum(l_quantity) as sum_qty,
postgres-# sum(l_extendedprice) as sum_base_price,
postgres-# sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
postgres-# sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
postgres-# avg(l_quantity) as avg_qty,
postgres-# avg(l_extendedprice) as avg_price,
postgres-# avg(l_discount) as avg_disc,
postgres-# count(*) as count_order
postgres-# from
postgres-# lineitem_100g
postgres-# where
postgres-# l_shipdate <= date '1998-12-01' - interval '90 days'
postgres-# group by
postgres-# l_returnflag,
postgres-# l_linestatus
postgres-# ;
l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order
--------------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------
N| F| 1009474721 | 1513655546292.845 | 1437964915505.205 1495505427205.7922 | 25.501252901992732 |38237.82021433879 | 0.050007751109846676 |39585299
A| F| 38663494247 | 57975897720343.05 | 55077056575228.49 | 57280139974700.33 | 25.499902314719474 |38237.09567046249 | 0.05000115197512918 | 1516221269
N| O| 76132027266 | 114160174445769.75 | 108452120147030.14 | 112790257632072.64 | 25.500031386008317 |38237.36390505108 | 0.04999960482532665 | 2985566022
R| F| 38663342979 | 57975227940306.44 | 55076481003328.086 | 57279552747411.33 | 25.500078164568315 |38237.067210909765 | 0.04999968871567838 | 1516204881
(4 rows)
Time: 340687.305 ms (05:40.687)
## JIT执行引擎 ##
postgres=# set optimizer = off ;
SET
Time: 21.191 ms
postgres=# select
postgres-# l_returnflag,
postgres-# l_linestatus,
postgres-# sum(l_quantity) as sum_qty,
postgres-# sum(l_extendedprice) as sum_base_price,
postgres-# sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
postgres-# sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
postgres-# avg(l_quantity) as avg_qty,
postgres-# avg(l_extendedprice) as avg_price,
postgres-# avg(l_discount) as avg_disc,
postgres-# count(*) as count_order
postgres-# from
postgres-# lineitem_1t
postgres-# where
postgres-# l_shipdate <= date '1998-12-01' - interval '90 days'
postgres-# group by
postgres-# l_returnflag,
postgres-# l_linestatus
postgres-# ;
l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order
--------------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+-------------------+----------------------+-------------
A| F| 38663494247 | 57975897720343.055 | 55077056575228.5 | 57280139974700.336 | 25.499902314719474 |38237.0956704625 | 0.050001151975129186 | 1516221269
N| F| 1009474721 | 1513655546292.845 | 1437964915505.205 | 1495505427205.7922 | 25.501252901992732 |38237.82021433879 | 0.05000775110984667 |39585299
N| O| 76132027266 | 114160174445769.75 | 108452120147030.16 | 112790257632072.62 | 25.500031386008317 |38237.36390505108 | 0.049999604825326636 | 2985566022
R| F| 38663342979 | 57975227940306.43 | 55076481003328.08 | 57279552747411.33 | 25.500078164568315 |38237.06721090976 | 0.049999688715678385 | 1516204881
(4 rows)
Time: 1625708.799 ms (27:05.709)
## 向量化执行引擎 ##
set optimizer = off ;
set enable_groupagg to off;
set jit to off;
set enable_vectorize_engine to on;
postgres=# select
postgres-# l_returnflag,
postgres-# l_linestatus,
postgres-# sum(l_quantity) as sum_qty,
postgres-# sum(l_extendedprice) as sum_base_price,
postgres-# sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
postgres-# sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
postgres-# avg(l_quantity) as avg_qty,
postgres-# avg(l_extendedprice) as avg_price,
postgres-# avg(l_discount) as avg_disc,
postgres-# count(*) as count_order
postgres-# from
postgres-# lineitem_1t
postgres-# where
postgres-# l_shipdate <= date '1998-12-01' - interval '90 days'
postgres-# group by
postgres-# l_returnflag,
postgres-# l_linestatus
postgres-# ;
l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order
--------------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+- -------------------+----------------------+-------------
R| F| 38663342979 | 57975227940306.44 | 55076481003328.08 | 57279552747411.33 | 25.500078164568315 |38237.067210909765 | 0.0499996887156784 | 1516204881
A| F| 38663494247 | 57975897720343.05 | 55077056575228.5 | 57280139974700.35 | 25.499902314719474 |38237.09567046249 | 0.05000115197512918 | 1516221269
N| O| 76132027266 | 114160174445769.75 | 108452120147030.16 | 112790257632072.62 | 25.500031386008317 |38237.36390505108 | 0.04999960482532665 | 2985566022
N| F| 1009474721 | 1513655546292.8452 | 1437964915505.2053 | 1495505427205.7922 | 25.501252901992732 |38237.82021433879 | 0.050007751109846656 |39585299
(4 rows)
Time: 935449.207 ms (15:35.449)
# 结果对比 #
| 执行引擎类型 | 时间 |
| :-----:| :----: |
| ORCA | 340687.305 ms |
| JIT | 1625708.799 ms |
| 向量化 | 135449.207 ms |
**当数据量达到1TB的时候向量化引擎仍然是最快的,与ORCA相比依然有着2~3倍的提升,与JIT相比提升就更大了。所以从以上的测试可以看出KADB新增的向量化引擎大大提高了计算的效率,虽然目前还存在一些限制条件但是在未来,向量化引擎将大大提升数据库的性能。**
KADB向量化执行引擎
于 2023-09-12 10:45:58 首次发布