Starrocks(2.0.1) vs clickhouse (20.4.2.9)集群 SSB性能测试对比
作者:聂雄超 (kelvin)
测试结论
Star schema benchmark(以下简称SSB)是学术界和工业界广泛使用的一个星型模型测试集(来源论文),通过这个测试集合可以方便的对比各种OLAP产品的基础性能指标。Clickhouse 通过改写SSB,将星型模型打平转化成宽表,改造成了一个单表测试benchmark(参考链接)。本报告记录了StarRocks和Clickhouse在SSB单表数据集上的性能对比结果,并记录了在用户经常碰到的低基数聚合场景下StarRocks和ClickHouse的性能对比结果。测试结论如下:
- 在标准测试数据集的 13 个查询上,ClickHouse没有手动设置lowcardinality(string),StarRocks 的整体查询性能是 ClickHouse 的 4.89 倍;StarRocks无需手动设置,自动开启低基数字典。ClickHouse需要根据列的基数手动设置,不易使用。。
- 在标准测试数据集上,我们选取了一些常见的低基数聚合场景。ClickHouse的整体查询时间是StarRocks的 2.2倍
- 测试准备
硬件准备
机器 | 6台 服务器 |
---|---|
cpu | Intel® Xeon® Gold 5218 CPU @ 2.30GHz 32core ,64线程 |
内存 | 512GB |
网络带宽 | 20Gbits/s |
磁盘 | HDD盘(12*4TB) |
- 软件环境
StarRocks和Clickhouse部署在相同配置的机器上分别进行启动测试。
- StarRocks部署3BE 3FE
- Clickhouse部署六个节点后建立分布式表
内核版本:Linux 3.10.0-862.el7.x86_64
操作系统版本:Red Hat 4.8.5-28
软件版本: StarRocks2.0.1、ClickHouse 20.4.2.9
- 测试数据与结果
- 测试数据
表名 | 行数 | 解释 |
---|---|---|
lineorder | 6亿 | SSB商品订单表 |
customer | 300万 | SSB客户表 |
part | 140万 | SSB 零部件表 |
supplier | 20万 | SSB 供应商表 |
dates | 2556 | 日期表 |
lineorder_flat | 6亿 | SSB打平后的宽表 |
测试SQL
- 建表
use ssb;
CREATE TABLE `lineorder` (
`lo_orderkey` int(11) NOT NULL COMMENT "",
`lo_linenumber` int(11) NOT NULL COMMENT "",
`lo_custkey` int(11) NOT NULL COMMENT "",
`lo_partkey` int(11) NOT NULL COMMENT "",
`lo_suppkey` int(11) NOT NULL COMMENT "",
`lo_orderdate` int(11) NOT NULL COMMENT "",
`lo_orderpriority` varchar(16) NOT NULL COMMENT "",
`lo_shippriority` int(11) NOT NULL COMMENT "",
`lo_quantity` int(11) NOT NULL COMMENT "",
`lo_extendedprice` int(11) NOT NULL COMMENT "",
`lo_ordtotalprice` int(11) NOT NULL COMMENT "",
`lo_discount` int(11) NOT NULL COMMENT "",
`lo_revenue` int(11) NOT NULL COMMENT "",
`lo_supplycost` int(11) NOT NULL COMMENT "",
`lo_tax` int(11) NOT NULL COMMENT "",
`lo_commitdate` int(11) NOT NULL COMMENT "",
`lo_shipmode` varchar(11) NOT NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`lo_orderkey`)
COMMENT "OLAP"
PARTITION BY RANGE(`lo_orderdate`)
(PARTITION p1 VALUES [("-2147483648"), ("19930101")),
PARTITION p2 VALUES [("19930101"), ("19940101")),
PARTITION p3 VALUES [("19940101"), ("19950101")),
PARTITION p4 VALUES [("19950101"), ("19960101")),
PARTITION p5 VALUES [("19960101"), ("19970101")),
PARTITION p6 VALUES [("19970101"), ("19980101")),
PARTITION p7 VALUES [("19980101"), ("19990101")))
DISTRIBUTED BY HASH(`lo_orderkey`) BUCKETS 48
PROPERTIES (
"replication_num" = "1"
);
CREATE TABLE IF NOT EXISTS `customer` (
`c_custkey` int(11) NOT NULL COMMENT "",
`c_name` varchar(26) NOT NULL COMMENT "",
`c_address` varchar(41) NOT NULL COMMENT "",
`c_city` varchar(11) NOT NULL COMMENT "",
`c_nation` varchar(16) NOT NULL COMMENT "",
`c_region` varchar(13) NOT NULL COMMENT "",
`c_phone` varchar(16) NOT NULL COMMENT "",
`c_mktsegment` varchar(11) NOT NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`c_custkey`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`c_custkey`) BUCKETS 12
PROPERTIES (
"replication_num" = "1"
);
CREATE TABLE IF NOT EXISTS `dates` (
`d_datekey` int(11) NOT NULL COMMENT "",
`d_date` varchar(20) NOT NULL COMMENT "",
`d_dayofweek` varchar(10) NOT NULL COMMENT "",
`d_month` varchar(11) NOT NULL COMMENT "",
`d_year` int(11) NOT NULL COMMENT "",
`d_yearmonthnum` int(11) NOT NULL COMMENT "",
`d_yearmonth` varchar(9) NOT NULL COMMENT "",
`d_daynuminweek` int(11) NOT NULL COMMENT "",
`d_daynuminmonth` int(11) NOT NULL COMMENT "",
`d_daynuminyear` int(11) NOT NULL COMMENT "",
`d_monthnuminyear` int(11) NOT NULL COMMENT "",
`d_weeknuminyear` int(11) NOT NULL COMMENT "",
`d_sellingseason` varchar(14) NOT NULL COMMENT "",
`d_lastdayinweekfl` int(11) NOT NULL COMMENT "",
`d_lastdayinmonthfl` int(11) NOT NULL COMMENT "",
`d_holidayfl` int(11) NOT NULL COMMENT "",
`d_weekdayfl` int(11) NOT NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`d_datekey`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`d_datekey`) BUCKETS 1
PROPERTIES (
"replication_num" = "1"
);
CREATE TABLE IF NOT EXISTS `supplier` (
`s_suppkey` int(11) NOT NULL COMMENT "",
`s_name` varchar(26) NOT NULL COMMENT "",
`s_address` varchar(26) NOT NULL COMMENT "",
`s_city` varchar(11) NOT NULL COMMENT "",
`s_nation` varchar(16) NOT NULL COMMENT "",
`s_region` varchar(13) NOT NULL COMMENT "",
`s_phone` varchar(16) NOT NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`s_suppkey`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`s_suppkey`) BUCKETS 12
PROPERTIES (
"replication_num" = "1"
);
CREATE TABLE IF NOT EXISTS `part` (
`p_partkey` int(11) NOT NULL COMMENT "",
`p_name` varchar(23) NOT NULL COMMENT "",
`p_mfgr` varchar(7) NOT NULL COMMENT "",
`p_category` varchar(8) NOT NULL COMMENT "",
`p_brand` varchar(10) NOT NULL COMMENT "",
`p_color` varchar(12) NOT NULL COMMENT "",
`p_type` varchar(26) NOT NULL COMMENT "",
`p_size` int(11) NOT NULL COMMENT "",
`p_container` varchar(11) NOT NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`p_partkey`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`p_partkey`) BUCKETS 12
PROPERTIES (
"replication_num" = "1"
);
CREATE TABLE `lineorder_flat` (
`LO_ORDERDATE` date NOT NULL COMMENT "",
`LO_ORDERKEY` int(11) NOT NULL COMMENT "",
`LO_LINENUMBER` tinyint(4) NOT NULL COMMENT "",
`LO_CUSTKEY` int(11) NOT NULL COMMENT "",
`LO_PARTKEY` int(11) NOT NULL COMMENT "",
`LO_SUPPKEY` int(11) NOT NULL COMMENT "",
`LO_ORDERPRIORITY` varchar(100) NOT NULL COMMENT "",
`LO_SHIPPRIORITY` tinyint(4) NOT NULL COMMENT "",
`LO_QUANTITY` tinyint(4) NOT NULL COMMENT "",
`LO_EXTENDEDPRICE` int(11) NOT NULL COMMENT "",
`LO_ORDTOTALPRICE` int(11) NOT NULL COMMENT "",
`LO_DISCOUNT` tinyint(4) NOT NULL COMMENT "",
`LO_REVENUE` int(11) NOT NULL COMMENT "",
`LO_SUPPLYCOST` int(11) NOT NULL COMMENT "",
`LO_TAX` tinyint(4) NOT NULL COMMENT "",
`LO_COMMITDATE` date NOT NULL COMMENT "",
`LO_SHIPMODE` varchar(100) NOT NULL COMMENT "",
`C_NAME` varchar(100) NOT NULL COMMENT "",
`C_ADDRESS` varchar(100) NOT NULL COMMENT "",
`C_CITY` varchar(100) NOT NULL COMMENT "",
`C_NATION` varchar(100) NOT NULL COMMENT "",
`C_REGION` varchar(100) NOT NULL COMMENT "",
`C_PHONE` varchar(100) NOT NULL COMMENT "",
`C_MKTSEGMENT` varchar(100) NOT NULL COMMENT "",
`S_NAME` varchar(100) NOT NULL COMMENT "",
`S_ADDRESS` varchar(100) NOT NULL COMMENT "",
`S_CITY` varchar(100) NOT NULL COMMENT "",
`S_NATION` varchar(100) NOT NULL COMMENT "",
`S_REGION` varchar(100) NOT NULL COMMENT "",
`S_PHONE` varchar(100) NOT NULL COMMENT "",
`P_NAME` varchar(100) NOT NULL COMMENT "",
`P_MFGR` varchar(100) NOT NULL COMMENT "",
`P_CATEGORY` varchar(100) NOT NULL COMMENT "",
`P_BRAND` varchar(100) NOT NULL COMMENT "",
`P_COLOR` varchar(100) NOT NULL COMMENT "",
`P_TYPE` varchar(100) NOT NULL COMMENT "",
`P_SIZE` tinyint(4) NOT NULL COMMENT "",
`P_CONTAINER` varchar(100) NOT NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`LO_ORDERDATE`, `LO_ORDERKEY`)
COMMENT "OLAP"
PARTITION BY RANGE(`LO_ORDERDATE`)
(START ("1992-01-01") END ("1999-01-01") EVERY (INTERVAL 1 YEAR))
DISTRIBUTED BY HASH(`LO_ORDERKEY`) BUCKETS 48
PROPERTIES (
"replication_num" = "1"
);
- 生成数据
由于脚本为python3解释器执行,服务器版本为python2.7,因此需用修改为python2.7
# 下载测试脚本
wget https://starrocks-public.oss-cn-zhangjiakou.aliyuncs.com/ssb-poc-0.9.3.zip
unzip ssb-poc-0.9.3.zip
cd ssb-poc
make && make install
# 生成100G数据脚本
cd output
bin/gen-ssb.sh 100 data_dir
# 测试100G数据
bin/create_db_table.sh ddl_100
- 查询参数调整
-- 默认为1,并行度的设置极大影响查询效率
# 设置并行度,建议是每个集群节点逻辑核数的一半,64线程设置为32,经测试再调大该参数,查询性能基本没有变化
set global parallel_fragment_exec_instance_num = 32;
- 单表测试
--Q1.1
SELECT sum(lo_extendedprice * lo_discount) AS `revenue`
FROM lineorder_flat
WHERE lo_orderdate >= '1993-01-01' and lo_orderdate <= '1993-12-31' AND lo_discount BETWEEN 1 AND 3 AND lo_quantity < 25;
--Q1.2
SELECT sum(lo_extendedprice * lo_discount) AS revenue FROM lineorder_flat
WHERE lo_orderdate >= '1994-01-01' and lo_orderdate <= '1994-01-31' AND lo_discount BETWEEN 4 AND 6 AND lo_quantity BETWEEN 26 AND 35;
--Q1.3
SELECT sum(lo_extendedprice * lo_discount) AS revenue
FROM lineorder_flat
WHERE weekofyear(lo_orderdate) = 6 AND lo_orderdate >= '1994-01-01' and lo_orderdate <= '1994-12-31'
AND lo_discount BETWEEN 5 AND 7 AND lo_quantity BETWEEN 26 AND 35;
--Q2.1
SELECT sum(lo_revenue), year(lo_orderdate) AS year, p_brand
FROM lineorder_flat
WHERE p_category = 'MFGR#12' AND s_region = 'AMERICA'
GROUP BY year, p_brand
ORDER BY year, p_brand;
--Q2.2
SELECT
sum(lo_revenue), year(lo_orderdate) AS year, p_brand
FROM lineorder_flat
WHERE p_brand >= 'MFGR#2221' AND p_brand <= 'MFGR#2228' AND s_region = 'ASIA'
GROUP BY year, p_brand
ORDER BY year, p_brand;
--Q2.3
SELECT sum(lo_revenue), year(lo_orderdate) AS year, p_brand
FROM lineorder_flat
WHERE p_brand = 'MFGR#2239' AND s_region = 'EUROPE'
GROUP BY year, p_brand
ORDER BY year, p_brand;
--Q3.1
SELECT c_nation, s_nation, year(lo_orderdate) AS year, sum(lo_revenue) AS revenue FROM lineorder_flat
WHERE c_region = 'ASIA' AND s_region = 'ASIA' AND lo_orderdate >= '1992-01-01' AND lo_orderdate <= '1997-12-31'
GROUP BY c_nation, s_nation, year
ORDER BY year ASC, revenue DESC;
--Q3.2
SELECT c_city, s_city, year(lo_orderdate) AS year, sum(lo_revenue) AS revenue
FROM lineorder_flat
WHERE c_nation = 'UNITED STATES' AND s_nation = 'UNITED STATES' AND lo_orderdate >= '1992-01-01' AND lo_orderdate <= '1997-12-31'
GROUP BY c_city, s_city, year
ORDER BY year ASC, revenue DESC;
--Q3.3
SELECT c_city, s_city, year(lo_orderdate) AS year, sum(lo_revenue) AS revenue
FROM lineorder_flat
WHERE c_city in ( 'UNITED KI1' ,'UNITED KI5') AND s_city in ( 'UNITED KI1' ,'UNITED KI5') AND lo_orderdate >= '1992-01-01' AND lo_orderdate <= '1997-12-31'
GROUP BY c_city, s_city, year
ORDER BY year ASC, revenue DESC;
--Q3.4
SELECT c_city, s_city, year(lo_orderdate) AS year, sum(lo_revenue) AS revenue
FROM lineorder_flat
WHERE c_city in ('UNITED KI1', 'UNITED KI5') AND s_city in ( 'UNITED KI1', 'UNITED KI5') AND lo_orderdate >= '1997-12-01' AND lo_orderdate <= '1997-12-31'
GROUP BY c_city, s_city, year
ORDER BY year ASC, revenue DESC;
--Q4.1
SELECT year(lo_orderdate) AS year, c_nation, sum(lo_revenue - lo_supplycost) AS profit FROM lineorder_flat
WHERE c_region = 'AMERICA' AND s_region = 'AMERICA' AND p_mfgr in ( 'MFGR#1' , 'MFGR#2')
GROUP BY year, c_nation
ORDER BY year ASC, c_nation ASC;
--Q4.2
SELECT year(lo_orderdate) AS year,
s_nation, p_category, sum(lo_revenue - lo_supplycost) AS profit
FROM lineorder_flat
WHERE c_region = 'AMERICA' AND s_region = 'AMERICA' AND lo_orderdate >= '1997-01-01' and lo_orderdate <= '1998-12-31' AND p_mfgr in ( 'MFGR#1' , 'MFGR#2')
GROUP BY year, s_nation, p_category
ORDER BY year ASC, s_nation ASC, p_category ASC;
--Q4.3
SELECT year(lo_orderdate) AS year, s_city, p_brand,
sum(lo_revenue - lo_supplycost) AS profit
FROM lineorder_flat
WHERE s_nation = 'UNITED STATES' AND lo_orderdate >= '1997-01-01' and lo_orderdate <= '1998-12-31' AND p_category = 'MFGR#14'
GROUP BY year, s_city, p_brand
ORDER BY year ASC, s_city ASC, p_brand ASC;
单表低基数测试SQL
--Q1
select count(*),lo_shipmode from lineorder_flat group by lo_shipmode;
--Q2
select count(distinct lo_shipmode) from lineorder_flat;
--Q3
select count(*),lo_shipmode,lo_orderpriority from lineorder_flat group by lo_shipmode,lo_orderpriority;
--Q4
select count(*),lo_shipmode,lo_orderpriority from lineorder_flat group by lo_shipmode,lo_orderpriority,lo_shippriority;
--Q5
select count(*),lo_shipmode,s_city from lineorder_flat group by lo_shipmode,s_city;
--Q6
select count(*) from lineorder_flat group by c_city,s_city;
--Q7
select count(*) from lineorder_flat group by lo_shipmode,lo_orderdate;
--Q8
select count(*) from lineorder_flat group by lo_orderdate,s_nation,s_region;
--Q9
select count(*) from lineorder_flat group by c_city,s_city,c_nation,s_nation;
--Q10
select count(*) from (select count(*) from lineorder_flat group by lo_shipmode,lo_orderpriority,p_category,s_nation,c_nation) t;
--Q11
select count(*) from (select count(*) from lineorder_flat_distributed group by lo_shipmode,lo_orderpriority,p_category,s_nation,c_nation,p_mfgr) t;
--Q12
select count(*) from (select count(*) from lineorder_flat group by substr(lo_shipmode,2),lower(lo_orderpriority),p_category,s_nation,c_nation,s_region,p_mfgr) t;
单表测试
SSB单表测试结果
starrocks 2.0.1(ms) | clickhouse 20.4.2.9 (ms) | clickhouse/starrocks 性能对比 | |
---|---|---|---|
Q1.1 | 70 | 27 | 0.39 |
Q1.2 | 20 | 21 | 1.05 |
Q1.3 | 50 | 18 | 0.36 |
Q2.1 | 90 | 376 | 4.18 |
Q2.2 | 80 | 309 | 3.86 |
Q2.3 | 70 | 481 | 6.87 |
Q3.1 | 200 | 792 | 3.96 |
Q3.2 | 80 | 848 | 10.60 |
Q3.3 | 60 | 622 | 10.37 |
Q3.4 | 20 | 33 | 1.65 |
Q4.1 | 130 | 919 | 7.07 |
Q4.2 | 110 | 441 | 4.01 |
Q4.3 | 80 | 295 | 3.69 |
sum | 1060 | 5182 | 4.89 |
低基数聚合测试结果
查询类型 | 结果集的基数 | starrocks 2.0.1(s)并行度=1 | starrocks 2.0.1(s)并行度=32 | clickhouse 20.4.2.9 (s) | clickhouse/starrocks 性能对比 | |
---|---|---|---|---|---|---|
Q1 | group by 1个低基数列(<50) | 7 | 1.88 | 0.25 | 0.199 | 0.80 |
Q2 | count distinct 1个低基数列(<50) | 1 | 1.56 | 0.21 | 0.365 | 1.74 |
Q3 | group by 2个低基数列 | 35 | 2.88 | 0.28 | 2.732 | 9.76 |
Q4 | group by 2个低基数列,一个int列 | 35 | 2.99 | 0.32 | 3.465 | 10.83 |
Q5 | group by 4个低基数列(7*250) | 1750 | 2.87 | 0.32 | 0.996 | 3.11 |
Q6 | group by 2个低基数列(250*250) | 62500 | 3.74 | 0.83 | 1.947 | 2.35 |
Q7 | group by 1个低基数列(<50)和1个日期列 | 16842 | 2.09 | 0.28 | 0.656 | 2.34 |
Q8 | group by 2个低基数列(<50)和2个日期列 | 60150 | 2.08 | 0.42 | 0.978 | 2.33 |
Q9 | group by 4个低基数列 | 62500 | 7.14 | 1.27 | 3.308 | 2.60 |
Q10 | group by 5个低基数列(<50) | 546875 | 19.22 | 2.9 | 4.46 | 1.54 |
Q11 | group by 6个低基数列(<50) | 546875 | 24.81 | 3.08 | 5.254 | 1.71 |
Q12 | group by 7个包含函数计算低基数列(<50) | 469750 | 28.99 | 3.57 | 5.868 | 1.64 |
sum | 100.25 | 13.73 | 30.228 | 2.20 |
Clickhouse SSB测试报告:Clickhouse SSB测试报告