1. Introduction
Sometimes you have a pile of CSV files to analyze, but the data is huge (over 100 million rows) and a hand-written program would be painfully slow. If you urgently need a tool for this kind of problem, take a look at NiceFlow: it can query 100-million-row datasets with results in milliseconds.
Reading that, you may assume I am bragging and be ready to scroll away. Don't. I ran these tests myself, and the millisecond-level results below are fully reproducible. Let's take a look.
One thing up front: the analysis engine underneath NiceFlow is DuckDB, and NiceFlow builds more powerful features on top of it. So this benchmark is really a test of DuckDB's data-analysis performance.
If, after seeing the results, you think NiceFlow might be of some use to you, see the article "Still using Pandas for data processing in Python? NiceFlow handles data processing without a single line of code".
2. Hardware Environment
Machines | 1 home/office PC |
---|---|
CPU | Intel i5-13600K |
RAM | 64 GB |
Disk | SSD |
3. Software Environment
OS | Win10 |
---|---|
Python | 3.10 |
DuckDB | 0.9.2 |
4. Test Data
The data was generated with the ssb-dbgen repository.
Most people generate this data on Linux; I generated it on Windows by compiling the code into an .exe with the TDM-GCC compiler.
Generation commands (-s sets the scale factor, -T selects which table to generate: l = lineorder, p = part, s = supplier, d = date, c = customer):
dbgen.exe -s 20 -T l
dbgen.exe -s 20 -T p
dbgen.exe -s 20 -T s
dbgen.exe -s 20 -T d
dbgen.exe -s 20 -T c
Generated data:
SSB table | Rows | Size on disk | Description |
---|---|---|---|
lineorder | 119,994,608 | 11.2 GB | order line items |
customer | 600,000 | 55.3 MB | customer info |
part | 1,000,000 | 83.1 MB | part info |
supplier | 40,000 | 3.26 MB | supplier info |
dates | 2,556 | 227 KB | date dimension |
5. Table Definitions
CREATE TABLE IF NOT EXISTS lineorder (
lo_orderkey INTEGER ,
lo_linenumber INTEGER ,
lo_custkey INTEGER ,
lo_partkey INTEGER ,
lo_suppkey INTEGER ,
lo_orderdate INTEGER ,
lo_orderpriority VARCHAR ,
lo_shippriority INTEGER ,
lo_quantity INTEGER ,
lo_extendedprice INTEGER ,
lo_ordtotalprice INTEGER ,
lo_discount INTEGER ,
lo_revenue INTEGER ,
lo_supplycost INTEGER ,
lo_tax INTEGER ,
lo_commitdate INTEGER ,
lo_shipmode VARCHAR
) ;
CREATE TABLE IF NOT EXISTS customer (
c_custkey INTEGER ,
c_name varchar ,
c_address varchar ,
c_city varchar ,
c_nation varchar ,
c_region varchar ,
c_phone varchar ,
c_mktsegment varchar
);
CREATE TABLE IF NOT EXISTS dates (
d_datekey INTEGER ,
d_date varchar ,
d_dayofweek varchar ,
d_month varchar ,
d_year INTEGER ,
d_yearmonthnum INTEGER ,
d_yearmonth varchar ,
d_daynuminweek INTEGER ,
d_daynuminmonth INTEGER ,
d_daynuminyear INTEGER ,
d_monthnuminyear INTEGER ,
d_weeknuminyear INTEGER ,
d_sellingseason varchar ,
d_lastdayinweekfl INTEGER ,
d_lastdayinmonthfl INTEGER ,
d_holidayfl INTEGER ,
d_weekdayfl INTEGER
) ;
CREATE TABLE IF NOT EXISTS supplier (
s_suppkey INTEGER ,
s_name varchar ,
s_address varchar ,
s_city varchar ,
s_nation varchar ,
s_region varchar ,
s_phone varchar
);
CREATE TABLE IF NOT EXISTS part (
p_partkey INTEGER ,
p_name varchar ,
p_mfgr varchar ,
p_category varchar ,
p_brand varchar ,
p_color varchar ,
p_type varchar ,
p_size INTEGER ,
p_container varchar
) ;
6. Data Import
-- import took 20 s
COPY lineorder FROM 'F:/07_数据源大全/ssb-dbgen/data/lineorder.csv' (DELIMITER '|');
-- import took 49 ms
COPY supplier FROM 'F:/07_数据源大全/ssb-dbgen/data/supplier.csv' (DELIMITER '|');
-- import took 1.158 s
COPY part FROM 'F:/07_数据源大全/ssb-dbgen/data/part.csv' (DELIMITER '|');
-- import took 863 ms
COPY customer FROM 'F:/07_数据源大全/ssb-dbgen/data/customer.csv' (DELIMITER '|');
-- import took 21 ms
COPY dates FROM 'F:/07_数据源大全/ssb-dbgen/data/dates.csv' (DELIMITER '|');
7. Disk Usage
Database size after the import (note that roughly 11.3 GB of CSV input compressed down to 3.8 GB inside DuckDB):
CALL pragma_database_size();
database_name | database_size | block_size | total_blocks | used_blocks | free_blocks | wal_size | memory_usage | memory_limit |
---|---|---|---|---|---|---|---|---|
1 | 3.8GB | 262144 | 14852 | 14843 | 9 | 242KB | 2.9GB | 53.6GB |
8. Benchmark Queries
--Q1.1
SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM lineorder, dates
WHERE
lo_orderdate = d_datekey
AND d_year = 1993
AND lo_discount BETWEEN 1 AND 3
AND lo_quantity < 25;
--Q1.2
SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM lineorder, dates
WHERE
lo_orderdate = d_datekey
AND d_yearmonth = 'Jan1994'
AND lo_discount BETWEEN 4 AND 6
AND lo_quantity BETWEEN 26 AND 35;
--Q1.3
SELECT
SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM lineorder, dates
WHERE
lo_orderdate = d_datekey
AND d_weeknuminyear = 6
AND d_year = 1994
AND lo_discount BETWEEN 5 AND 7
AND lo_quantity BETWEEN 26 AND 35;
--Q2.1
SELECT SUM(lo_revenue), d_year, p_brand
FROM lineorder, dates, part, supplier
WHERE
lo_orderdate = d_datekey
AND lo_partkey = p_partkey
AND lo_suppkey = s_suppkey
AND p_category = 'MFGR#12'
AND s_region = 'AMERICA'
GROUP BY d_year, p_brand
ORDER BY p_brand;
--Q2.2
SELECT SUM(lo_revenue), d_year, p_brand
FROM lineorder, dates, part, supplier
WHERE
lo_orderdate = d_datekey
AND lo_partkey = p_partkey
AND lo_suppkey = s_suppkey
AND p_brand BETWEEN 'MFGR#2221' AND 'MFGR#2228'
AND s_region = 'ASIA'
GROUP BY d_year, p_brand
ORDER BY d_year, p_brand;
--Q2.3
SELECT SUM(lo_revenue), d_year, p_brand
FROM lineorder, dates, part, supplier
WHERE
lo_orderdate = d_datekey
AND lo_partkey = p_partkey
AND lo_suppkey = s_suppkey
AND p_brand = 'MFGR#2239'
AND s_region = 'EUROPE'
GROUP BY d_year, p_brand
ORDER BY d_year, p_brand;
--Q3.1
SELECT
c_nation,
s_nation,
d_year,
SUM(lo_revenue) AS REVENUE
FROM customer, lineorder, supplier, dates
WHERE
lo_custkey = c_custkey
AND lo_suppkey = s_suppkey
AND lo_orderdate = d_datekey
AND c_region = 'ASIA'
AND s_region = 'ASIA'
AND d_year >= 1992
AND d_year <= 1997
GROUP BY c_nation, s_nation, d_year
ORDER BY d_year ASC, REVENUE DESC;
--Q3.2
SELECT
c_city,
s_city,
d_year,
SUM(lo_revenue) AS REVENUE
FROM customer, lineorder, supplier, dates
WHERE
lo_custkey = c_custkey
AND lo_suppkey = s_suppkey
AND lo_orderdate = d_datekey
AND c_nation = 'UNITED STATES'
AND s_nation = 'UNITED STATES'
AND d_year >= 1992
AND d_year <= 1997
GROUP BY c_city, s_city, d_year
ORDER BY d_year ASC, REVENUE DESC;
--Q3.3
SELECT
c_city,
s_city,
d_year,
SUM(lo_revenue) AS REVENUE
FROM customer, lineorder, supplier, dates
WHERE
lo_custkey = c_custkey
AND lo_suppkey = s_suppkey
AND lo_orderdate = d_datekey
AND (
c_city = 'UNITED KI1'
OR c_city = 'UNITED KI5'
)
AND (
s_city = 'UNITED KI1'
OR s_city = 'UNITED KI5'
)
AND d_year >= 1992
AND d_year <= 1997
GROUP BY c_city, s_city, d_year
ORDER BY d_year ASC, REVENUE DESC;
--Q3.4
SELECT
c_city,
s_city,
d_year,
SUM(lo_revenue) AS REVENUE
FROM customer, lineorder, supplier, dates
WHERE
lo_custkey = c_custkey
AND lo_suppkey = s_suppkey
AND lo_orderdate = d_datekey
AND (
c_city = 'UNITED KI1'
OR c_city = 'UNITED KI5'
)
AND (
s_city = 'UNITED KI1'
OR s_city = 'UNITED KI5'
)
AND d_yearmonth = 'Dec1997'
GROUP BY c_city, s_city, d_year
ORDER BY d_year ASC, REVENUE DESC;
--Q4.1
-- Note: the /*+SET_VAR(...)*/ hints in the Q4 queries come from the Doris version of the
-- SSB suite; DuckDB parses them as ordinary comments and ignores them.
SELECT /*+SET_VAR(parallel_fragment_exec_instance_num=4, enable_vectorized_engine=true, batch_size=4096, enable_cost_based_join_reorder=true, enable_projection=true) */
d_year,
c_nation,
SUM(lo_revenue - lo_supplycost) AS PROFIT
FROM dates, customer, supplier, part, lineorder
WHERE
lo_custkey = c_custkey
AND lo_suppkey = s_suppkey
AND lo_partkey = p_partkey
AND lo_orderdate = d_datekey
AND c_region = 'AMERICA'
AND s_region = 'AMERICA'
AND (
p_mfgr = 'MFGR#1'
OR p_mfgr = 'MFGR#2'
)
GROUP BY d_year, c_nation
ORDER BY d_year, c_nation;
--Q4.2
SELECT /*+SET_VAR(parallel_fragment_exec_instance_num=2, enable_vectorized_engine=true, batch_size=4096, enable_cost_based_join_reorder=true, enable_projection=true) */
d_year,
s_nation,
p_category,
SUM(lo_revenue - lo_supplycost) AS PROFIT
FROM dates, customer, supplier, part, lineorder
WHERE
lo_custkey = c_custkey
AND lo_suppkey = s_suppkey
AND lo_partkey = p_partkey
AND lo_orderdate = d_datekey
AND c_region = 'AMERICA'
AND s_region = 'AMERICA'
AND (
d_year = 1997
OR d_year = 1998
)
AND (
p_mfgr = 'MFGR#1'
OR p_mfgr = 'MFGR#2'
)
GROUP BY d_year, s_nation, p_category
ORDER BY d_year, s_nation, p_category;
--Q4.3
SELECT /*+SET_VAR(parallel_fragment_exec_instance_num=2, enable_vectorized_engine=true, batch_size=4096, enable_cost_based_join_reorder=true, enable_projection=true) */
d_year,
s_city,
p_brand,
SUM(lo_revenue - lo_supplycost) AS PROFIT
FROM dates, customer, supplier, part, lineorder
WHERE
lo_custkey = c_custkey
AND lo_suppkey = s_suppkey
AND lo_partkey = p_partkey
AND lo_orderdate = d_datekey
AND s_nation = 'UNITED STATES'
AND (
d_year = 1997
OR d_year = 1998
)
AND p_category = 'MFGR#14'
GROUP BY d_year, s_city, p_brand
ORDER BY d_year, s_city, p_brand;
9. Query Times
Query | DuckDB (single machine, ~120 M rows) (ms) | Doris 1.2.0-rc01 (~600 M rows) (ms) | Doris 1.1.3 (~600 M rows) (ms) | Doris 0.15.0 RC04 (~600 M rows) (ms) |
---|---|---|---|---|
Q1.1 | 125 | 40 | 18 | 350 |
Q1.2 | 117 | 30 | 100 | 80 |
Q1.3 | 114 | 20 | 70 | 80 |
Q2.1 | 143 | 350 | 940 | 20680 |
Q2.2 | 133 | 320 | 750 | 18250 |
Q2.3 | 109 | 300 | 720 | 14760 |
Q3.1 | 218 | 650 | 2150 | 22190 |
Q3.2 | 132 | 260 | 510 | 8360 |
Q3.3 | 144 | 220 | 450 | 6200 |
Q3.4 | 110 | 60 | 70 | 160 |
Q4.1 | 205 | 840 | 1480 | 24320 |
Q4.2 | 207 | 460 | 560 | 6310 |
Q4.3 | 160 | 610 | 660 | 10170 |
- Notes on the results
The Doris numbers are copied directly from the official Doris SSB benchmark results. The Doris dataset is scale 100 (~600 million rows), while the DuckDB run here uses scale 20 (~120 million rows). DuckDB and Doris are also different classes of systems (an embedded single-node engine versus a distributed MPP database), so with different data volumes the numbers cannot be compared head-to-head; they are provided for reference only.
Still, DuckDB's query times stay consistently in the 100-250 ms range, and import is extremely fast, which makes it very useful for lightweight data-analysis scenarios.
My conclusion: when resources are limited and you need to analyze data at the hundred-million-row scale, DuckDB is an excellent choice. That is also a major reason NiceFlow uses DuckDB as its analysis engine.