1亿数据分析跑不动,看NiceFlow Python ETL工具如何实现单机查询毫秒出结果

1. 说明

在某些情况下,我们会有这样的一些场景,手中有一些csv需要进行分析,但是数据量很大,超过1亿,写程序执行会非常慢,急需一个工具来解决这类问题,那么你不妨看看NiceFlow,让你1亿数据查询毫秒出结果

在看到这里的时候,你可能会认为博主在吹牛批,然后准备划走,那么你先别划走,博主亲自使用数据测试,保证毫秒出结果的结论100%可靠,下面让我们来看看。

在此先说明,NiceFlow底层所使用的数据分析引擎是DuckDB,NiceFlow则是基于DuckDB基础上扩展更加强大的功能,所以此次测试,我们主要测试DuckDB数据分析处理能力。

看完测试结论后,觉得NiceFlow有那么一勒勒用,可以看看这篇文章Python数据处理还在用Pandas? NiceFlow无需一行代码搞定数据处理

2. 硬件环境

机器数量1台家用办公电脑
CPUi5 13600K
内存64G
磁盘SSD

3. 软件环境

操作系统Win10
Python版本3.10
DuckDB0.9.2

4. 测试数据量

数据生成的仓库 ssb-dbgen

别人都是在linux上生成数据的,我的在windows上生成的,因为我用了TDM-GCC编译器,将上面的代码编译为exe

数据生成的命令

dbgen.exe -s 20 -T l
dbgen.exe -s 20 -T p
dbgen.exe -s 20 -T s
dbgen.exe -s 20 -T d
dbgen.exe -s 20 -T c

生成的数据的信息

SSB表名行数磁盘大小备注
lineorder11999460811.2G商品订单明细表表
customer60000055.3M客户信息表
part100000083.1M零件信息表
supplier400003.26M供应商信息表
dates2556227K日期表

5. 建表语句

CREATE TABLE IF NOT EXISTS lineorder (
  lo_orderkey INTEGER ,
  lo_linenumber INTEGER  ,
  lo_custkey INTEGER ,
  lo_partkey INTEGER ,
  lo_suppkey INTEGER ,
  lo_orderdate INTEGER ,
  lo_orderpriority VARCHAR ,
  lo_shippriority INTEGER ,
  lo_quantity INTEGER ,
  lo_extendedprice INTEGER ,
  lo_ordtotalprice INTEGER ,
  lo_discount INTEGER ,
  lo_revenue INTEGER ,
  lo_supplycost INTEGER ,
  lo_tax INTEGER ,
  lo_commitdate INTEGER ,
  lo_shipmode VARCHAR 
) ;

CREATE TABLE IF NOT EXISTS customer (
  c_custkey INTEGER ,
  c_name varchar ,
  c_address varchar ,
  c_city varchar ,
  c_nation varchar ,
  c_region varchar ,
  c_phone varchar ,
  c_mktsegment varchar 
);

CREATE TABLE IF NOT EXISTS dates (
  d_datekey INTEGER ,
  d_date varchar ,
  d_dayofweek varchar ,
  d_month varchar ,
  d_year INTEGER ,
  d_yearmonthnum INTEGER ,
  d_yearmonth varchar ,
  d_daynuminweek INTEGER ,
  d_daynuminmonth INTEGER ,
  d_daynuminyear INTEGER ,
  d_monthnuminyear INTEGER ,
  d_weeknuminyear INTEGER ,
  d_sellingseason varchar ,
  d_lastdayinweekfl INTEGER ,
  d_lastdayinmonthfl INTEGER ,
  d_holidayfl INTEGER ,
  d_weekdayfl INTEGER 
) ;

 CREATE TABLE IF NOT EXISTS supplier (
  s_suppkey INTEGER ,
  s_name varchar ,
  s_address varchar ,
  s_city varchar ,
  s_nation varchar ,
  s_region varchar ,
  s_phone varchar 
);

CREATE TABLE IF NOT EXISTS part (
  p_partkey INTEGER ,
  p_name varchar ,
  p_mfgr varchar ,
  p_category varchar ,
  p_brand varchar ,
  p_color varchar ,
  p_type varchar ,
  p_size INTEGER ,
  p_container varchar 
) ;

6. 数据导入

-- 导入耗时20s
COPY lineorder FROM 'F:/07_数据源大全/ssb-dbgen/data/lineorder.csv' (DELIMITER '|');

-- 导入耗时49ms
COPY supplier FROM 'F:/07_数据源大全/ssb-dbgen/data/supplier.csv' (DELIMITER '|');


-- 导入耗时1.158s
COPY part FROM 'F:/07_数据源大全/ssb-dbgen/data/part.csv' (DELIMITER '|');

-- 导入耗时863ms
COPY customer FROM 'F:/07_数据源大全/ssb-dbgen/data/customer.csv' (DELIMITER '|');

-- 导入耗时21ms
COPY dates FROM 'F:/07_数据源大全/ssb-dbgen/data/dates.csv' (DELIMITER '|');

7. 磁盘占用

查询导入db后所占磁盘空间

CALL pragma_database_size();

database_namedatabase_sizeblock_sizetotal_blocksused_blocksfree_blockswal_sizememory_usagememory_limit
13.8GB26214414852148439242KB2.9GB53.6GB

8. 查询测试语句

--Q1.1
SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM lineorder, dates
WHERE
    lo_orderdate = d_datekey
    AND d_year = 1993
    AND lo_discount BETWEEN 1 AND 3
    AND lo_quantity < 25;
--Q1.2
SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM lineorder, dates
WHERE
    lo_orderdate = d_datekey
    AND d_yearmonth = 'Jan1994'
    AND lo_discount BETWEEN 4 AND 6
    AND lo_quantity BETWEEN 26 AND 35;
    
--Q1.3
SELECT
    SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM lineorder, dates
WHERE
    lo_orderdate = d_datekey
    AND d_weeknuminyear = 6
    AND d_year = 1994
    AND lo_discount BETWEEN 5 AND 7
    AND lo_quantity BETWEEN 26 AND 35;
    
--Q2.1
SELECT SUM(lo_revenue), d_year, p_brand
FROM lineorder, dates, part, supplier
WHERE
    lo_orderdate = d_datekey
    AND lo_partkey = p_partkey
    AND lo_suppkey = s_suppkey
    AND p_category = 'MFGR#12'
    AND s_region = 'AMERICA'
GROUP BY d_year, p_brand
ORDER BY p_brand;

--Q2.2
SELECT SUM(lo_revenue), d_year, p_brand
FROM lineorder, dates, part, supplier
WHERE
    lo_orderdate = d_datekey
    AND lo_partkey = p_partkey
    AND lo_suppkey = s_suppkey
    AND p_brand BETWEEN 'MFGR#2221' AND 'MFGR#2228'
    AND s_region = 'ASIA'
GROUP BY d_year, p_brand
ORDER BY d_year, p_brand;

--Q2.3
SELECT SUM(lo_revenue), d_year, p_brand
FROM lineorder, dates, part, supplier
WHERE
    lo_orderdate = d_datekey
    AND lo_partkey = p_partkey
    AND lo_suppkey = s_suppkey
    AND p_brand = 'MFGR#2239'
    AND s_region = 'EUROPE'
GROUP BY d_year, p_brand
ORDER BY d_year, p_brand;

--Q3.1
SELECT
    c_nation,
    s_nation,
    d_year,
    SUM(lo_revenue) AS REVENUE
FROM customer, lineorder, supplier, dates
WHERE
    lo_custkey = c_custkey
    AND lo_suppkey = s_suppkey
    AND lo_orderdate = d_datekey
    AND c_region = 'ASIA'
    AND s_region = 'ASIA'
    AND d_year >= 1992
    AND d_year <= 1997
GROUP BY c_nation, s_nation, d_year
ORDER BY d_year ASC, REVENUE DESC;

--Q3.2
SELECT
    c_city,
    s_city,
    d_year,
    SUM(lo_revenue) AS REVENUE
FROM customer, lineorder, supplier, dates
WHERE
    lo_custkey = c_custkey
    AND lo_suppkey = s_suppkey
    AND lo_orderdate = d_datekey
    AND c_nation = 'UNITED STATES'
    AND s_nation = 'UNITED STATES'
    AND d_year >= 1992
    AND d_year <= 1997
GROUP BY c_city, s_city, d_year
ORDER BY d_year ASC, REVENUE DESC;

--Q3.3
SELECT
    c_city,
    s_city,
    d_year,
    SUM(lo_revenue) AS REVENUE
FROM customer, lineorder, supplier, dates
WHERE
    lo_custkey = c_custkey
    AND lo_suppkey = s_suppkey
    AND lo_orderdate = d_datekey
    AND (
        c_city = 'UNITED KI1'
        OR c_city = 'UNITED KI5'
    )
    AND (
        s_city = 'UNITED KI1'
        OR s_city = 'UNITED KI5'
    )
    AND d_year >= 1992
    AND d_year <= 1997
GROUP BY c_city, s_city, d_year
ORDER BY d_year ASC, REVENUE DESC;

--Q3.4
SELECT
    c_city,
    s_city,
    d_year,
    SUM(lo_revenue) AS REVENUE
FROM customer, lineorder, supplier, dates
WHERE
    lo_custkey = c_custkey
    AND lo_suppkey = s_suppkey
    AND lo_orderdate = d_datekey
    AND (
        c_city = 'UNITED KI1'
        OR c_city = 'UNITED KI5'
    )
    AND (
        s_city = 'UNITED KI1'
        OR s_city = 'UNITED KI5'
    )
    AND d_yearmonth = 'Dec1997'
GROUP BY c_city, s_city, d_year
ORDER BY d_year ASC, REVENUE DESC;

--Q4.1
SELECT /*+SET_VAR(parallel_fragment_exec_instance_num=4, enable_vectorized_engine=true, batch_size=4096, enable_cost_based_join_reorder=true, enable_projection=true) */
    d_year,
    c_nation,
    SUM(lo_revenue - lo_supplycost) AS PROFIT
FROM dates, customer, supplier, part, lineorder
WHERE
    lo_custkey = c_custkey
    AND lo_suppkey = s_suppkey
    AND lo_partkey = p_partkey
    AND lo_orderdate = d_datekey
    AND c_region = 'AMERICA'
    AND s_region = 'AMERICA'
    AND (
        p_mfgr = 'MFGR#1'
        OR p_mfgr = 'MFGR#2'
    )
GROUP BY d_year, c_nation
ORDER BY d_year, c_nation;

--Q4.2
SELECT /*+SET_VAR(parallel_fragment_exec_instance_num=2, enable_vectorized_engine=true, batch_size=4096, enable_cost_based_join_reorder=true, enable_projection=true) */  
    d_year,
    s_nation,
    p_category,
    SUM(lo_revenue - lo_supplycost) AS PROFIT
FROM dates, customer, supplier, part, lineorder
WHERE
    lo_custkey = c_custkey
    AND lo_suppkey = s_suppkey
    AND lo_partkey = p_partkey
    AND lo_orderdate = d_datekey
    AND c_region = 'AMERICA'
    AND s_region = 'AMERICA'
    AND (
        d_year = 1997
        OR d_year = 1998
    )
    AND (
        p_mfgr = 'MFGR#1'
        OR p_mfgr = 'MFGR#2'
    )
GROUP BY d_year, s_nation, p_category
ORDER BY d_year, s_nation, p_category;

--Q4.3
SELECT /*+SET_VAR(parallel_fragment_exec_instance_num=2, enable_vectorized_engine=true, batch_size=4096, enable_cost_based_join_reorder=true, enable_projection=true) */
    d_year,
    s_city,
    p_brand,
    SUM(lo_revenue - lo_supplycost) AS PROFIT
FROM dates, customer, supplier, part, lineorder
WHERE
    lo_custkey = c_custkey
    AND lo_suppkey = s_suppkey
    AND lo_partkey = p_partkey
    AND lo_orderdate = d_datekey
    AND s_nation = 'UNITED STATES'
    AND (
        d_year = 1997
        OR d_year = 1998
    )
    AND p_category = 'MFGR#14'
GROUP BY d_year, s_city, p_brand
ORDER BY d_year, s_city, p_brand;

9. 查询耗时

QueryDuckDB(单机1亿数据) (ms)Doris 1.2.0-rc01(6亿数据)(ms)Doris 1.1.3(6亿数据) (ms)Doris 0.15.0 RC04(6亿数据)(ms)
Q1.11254018350
Q1.21173010080
Q1.3114207080
Q2.114335094020680
Q2.213332075018250
Q2.310930072014760
Q3.1218650215022190
Q3.21322605108360
Q3.31442204506200
Q3.41106070160
Q4.1205840148024320
Q4.22074605606310
Q4.316061066010170
  • 结果说明

doris的测试结果我是直接抄袭的doris官网测试结果, doris测试结果对应的数据集为scale 100, 约6亿条,duckdb测试的数据集为scale 20, 约1亿条, 因为duckdb和doris所属类型不同,数据量不同,查询性能自然不能简单对比,因此上述结果仅供参考

但是DuckDB查询耗时基本稳定在100-250ms之间,且导入速度极快,这在轻量级数据分析场景中非常有用

个人得出的结论是,在资源受限的情况下,需要分析亿级数据,DuckDB是一个非常好的选择,这也是博主NiceFlow工具中使用DuckDB作为分析引擎的一个重要原因。

  • 23
    点赞
  • 20
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
Python ETL是指使用Python编程语言进行数据提取、转换和加载(Extract, Transform, Load)的过程。引用中提到了一个纯Python开发的ETL框架pyetl,相比其他ETL工具,pyetl可以灵活地对每个字段添加自定义函数,使得数据转换过程更加灵活。而且,pyetl是纯Python代码操作,更加符合开发人员的习惯,相对于专业ETL工具更轻量。引用中介绍了使用Python进行ETL数据处理的实战案例,其中第一步是数据提取,可以使用Pythonpandas库来读取各种不同的数据源,如CSV文件,并将其转换为DataFrame对象。所以,Python ETL是一种使用Python进行数据处理的方法,可以从各种数据源中提取数据,并通过一系列的处理和转换,最终将数据导入到目标系统中。<span class="em">1</span><span class="em">2</span><span class="em">3</span> #### 引用[.reference_title] - *1* [python ETL工具 pyetl](https://download.csdn.net/download/weixin_38606206/14842276)[target="_blank" data-report-click={"spm":"1018.2226.3001.9630","extra":{"utm_source":"vip_chatgpt_common_search_pc_result","utm_medium":"distribute.pc_search_result.none-task-cask-2~all~insert_cask~default-1-null.142^v93^chatsearchT3_2"}}] [.reference_item style="max-width: 50%"] - *2* *3* [使用Python进行ETL数据处理](https://blog.csdn.net/qq_44273429/article/details/130245912)[target="_blank" data-report-click={"spm":"1018.2226.3001.9630","extra":{"utm_source":"vip_chatgpt_common_search_pc_result","utm_medium":"distribute.pc_search_result.none-task-cask-2~all~insert_cask~default-1-null.142^v93^chatsearchT3_2"}}] [.reference_item style="max-width: 50%"] [ .reference_list ]

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值