在之前写的另外一篇文章“Trafodion 数据加载介绍”中提到,Trafodion数据加载主要有以下四种方式,
- Insert
- Upsert
- Upsert Using Load
- Load
从性能上面来看,这四种加载方式是依次递增的,即Load >Upsert Using Load >Upsert >Insert。
下面我们就现场验证一下这几种数据加载方式各自的性能,我所使用的集群配置为:4个节点的Hadoop,Trafodion装在3个节点的RegionServer上,每个Node的内存为132G。现在通过以下步骤来测试性能,
- 数据准备—创建两张表相同结构的表,一个表3千万数据,另外一张为空表
SQL>select count(*) from eboxdata_30m;
(EXPR)
--------------------
30000000
--- 1 row(s) selected.
SQL>create table eboxdata_30m_new like eboxdata_30m;
--- SQL operation complete.
- Insert测试—由于insert语句最慢,我们通过qid拿到当前insert语句的统计信息发现,在11分45秒的时候约插入20万条记录,这意味着,每秒插入数据约300条。如果3千万全部插入,则约需要耗时27个小时。
SQL>insert into eboxdata_30m_new select * from eboxdata_30m;
SQL>get statistics for qid MXID11001049697212347845223497595000000000706U3333300_1609_SQL_CUR_7 default;
....
Qid MXID11001049697212347845223497595000000000706U3333300_1609_SQL_CUR_7
Compile Start Time 2016/12/14 13:11:24.930565
Compile End Time 2016/12/14 13:11:26.585146
Compile Elapsed Time 0:00:01.654581
Execute Start Time 2016/12/14 13:11:26.585298
Execute End Time -1
Execute Elapsed Time 0:11:45.164579
State OPEN
Rows Affected -1
SQL Error Code 0
Stats Error Code 0
Query Type SQL_INSERT_NON_UNIQUE
Sub Query Type SQL_STMT_NA
Estimated Accessed Rows 0
Estimated Used Rows 0
Parent Qid NONE
Parent Query System NONE
Child Qid NONE
Number of SQL Processes 19
Number of Cpus 3
Transaction Id 4294967298
Source String insert into eboxdata_30m_new select * from eboxdata_30m
SQL Source Length 55
Rows Returned 0
First Row Returned Time -1
Last Error before AQR 0
Number of AQR retries 0
Delay before AQR 0
No. of times reclaimed 0
Cancel Time -1
Last Suspend Time -1
Stats Collection Type OPERATOR_STATS
LC RC Id PaId ExId Frag TDB Name DOP Dispatches Oper CPU Time Est. Records Used Act. Records Used Details
12 . 13 . 7 0 EX_ROOT 1 1 29 0 0 1174
11 . 12 13 6 0 EX_SPLIT_TOP 1 1 7 29,999,970 0
10 . 11 12 6 0 EX_SEND_TOP 6 18 1,138 29,999,970 0
9 . 10 11 6 2 EX_SEND_BOTTOM 6 12 558 29,999,970 0
8 . 9 10 6 2 EX_SPLIT_BOTTOM 6 48 1,280 29,999,970 0 261561976
6 7 8 9 5 2 EX_TUPLE_FLOW 6 3,177 160,763 29,999,970 0
. . 7 8 4 2 EX_TRAF_INSERT 6 3,171 39,525,341 1 197,967 TRAFODION.PKSAAS.EBOXDATA_30M_NEW|0|54230353
5 . 6 8 3 2 EX_SORT 6 1,571,909 149,789,880 29,999,970 200,388 704|89924|24
4 . 5 6 2 2 EX_SPLIT_TOP 6 1,568,842 4,135,959 29,999,970 29,999,833
3 . 4 5 2 2 EX_SEND_TOP 72 1,579,269 67,948,195 29,999,970 29,999,833
2 . 3 4 2 3 EX_SEND_BOTTOM 72 3,158,220 457,945,988 29,999,970 29,999,909
1 . 2 3 2 3 EX_SPLIT_BOTTOM 12 3,157,637 9,283,550 29,999,970 30,000,000 628902186
. . 1 2 1 3 EX_TRAF_KEY_SELECT 23 3,026,037 309,730,635 29,999,970 30,000,000 TRAFODION.PKSAAS.EBOXDATA_30M|30000000|7527968500
--- SQL operation complete.
- Upsert测试—同样使用上述方法,拿到upsert执行语句的统计信息,发现在6分钟的时候已经插入了将近9百万的数据了,以此估算每秒约插入数据约2.5万条,全部插入约需20分钟。
SQL>control query cancel qid MXID11001049697212347845223497595000000000706U3333300_1609_SQL_CUR_7;
SQL>upsert into eboxdata_30m_new select * from eboxdata_30m;
SQL>get statistics for qid MXID11001049515212347845223198599000000000206U3333300_522_SQL_CUR_29 default;
....
Qid MXID11001049515212347845223198599000000000206U3333300_522_SQL_CUR_29
Compile Start Time 2016/12/14 13:35:00.893745
Compile End Time 2016/12/14 13:35:07.080435
Compile Elapsed Time 0:00:06.186690
Execute Start Time 2016/12/14 13:35:07.080609
Execute End Time -1
Execute Elapsed Time 0:06:09.589123
State OPEN
Rows Affected -1
SQL Error Code 0
Stats Error Code 0
Query Type SQL_INSERT_NON_UNIQUE
Sub Query Type SQL_STMT_NA
Estimated Accessed Rows 0
Estimated Used Rows 0
Parent Qid NONE
Parent Query System NONE
Child Qid NONE
Number of SQL Processes 19
Number of Cpus 3
Transaction Id 4294967299
Source String upsert into eboxdata_30m_new select * from eboxdata_30m
SQL Source Length 55
Rows Returned 0
First Row Returned Time -1
Last Error before AQR 0
Number of AQR retries 0
Delay before AQR 0
No. of times reclaimed 0
Cancel Time -1
Last Suspend Time -1
Stats Collection Type OPERATOR_STATS
LC RC Id PaId ExId Frag TDB Name DOP Dispatches Oper CPU Time Est. Records Used Act. Records Used Details
12 . 13 . 7 0 EX_ROOT 1 1 30 0 0 1382
11 . 12 13 6 0 EX_SPLIT_TOP 1 1 31 29,999,970 0
10 . 11 12 6 0 EX_SEND_TOP 6 18 1,321 29,999,970 0
9 . 10 11 6 2 EX_SEND_BOTTOM 6 12 517 29,999,970 0
8 . 9 10 6 2 EX_SPLIT_BOTTOM 6 48 1,106 29,999,970 0 276976607
6 7 8 9 5 2 EX_TUPLE_FLOW 6 141,906 1,161,857 29,999,970 0
. . 7 8 4 2 EX_TRAF_VSBB_UPSERT 6 141,900 34,293,113 1 8,928,848 TRAFODION.PKSAAS.EBOXDATA_30M_NEW|0|2454851950
5 . 6 8 3 2 EX_SORT 6 1,702,355 167,387,942 29,999,970 8,936,375 27163|89932|24
4 . 5 6 2 2 EX_SPLIT_TOP 6 1,560,567 4,261,383 29,999,970 29,999,819
3 . 4 5 2 2 EX_SEND_TOP 72 1,579,262 69,870,689 29,999,970 29,999,819
2 . 3 4 2 3 EX_SEND_BOTTOM 72 3,158,252 460,964,057 29,999,970 29,999,895
1 . 2 3 2 3 EX_SPLIT_BOTTOM 12 3,157,635 9,503,227 29,999,970 30,000,000 635988911
. . 1 2 1 3 EX_TRAF_KEY_SELECT 23 3,026,053 317,312,896 29,999,970 30,000,000 TRAFODION.PKSAAS.EBOXDATA_30M|30000000|7527968500
--- SQL operation complete.
- Upsert Using Load测试,从下面的结果可以看出,导入3千万的数据花了12分钟,这比upsert语句要快将近一倍,相当于每秒4万多条记录。
SQL>upsert using load into eboxdata_30m_new select * from eboxdata_30m;
....
--- 30000000 row(s) inserted.
Start Time 2016/12/15 09:16:22.879596
End Time 2016/12/15 09:28:23.098801
Elapsed Time 00:12:00.219205
Compile Time 00:00:00.046010
Execution Time 00:12:00.172971
Table Name Records Records Hdfs Hdfs I/O Hdfs Access
Accessed Used I/Os Bytes Time(usec)
TRAFODION.PKSAAS.EBOXDATA_30M
30000000 30000000 3017 7527968500 9287063
TRAFODION.PKSAAS.EBOXDATA_30M_NEW
0 30000000 247956 8248464412 889703853
SQL>get statistics for qid MXID11000047548212348461421382158000000000406U3333300_1108_SQL_CUR_14 default;
....
LC RC Id PaId ExId Frag TDB Name DOP Dispatches Oper CPU Time Est. Records Used Act. Records Used Details
12 . 13 . 7 0 EX_ROOT 1 1 41 0 0 1133
11 . 12 13 6 0 EX_SPLIT_TOP 1 1 6 29,999,970 0
10 . 11 12 6 0 EX_SEND_TOP 6 18 1,086 29,999,970 0
9 . 10 11 6 2 EX_SEND_BOTTOM 6 12 399 29,999,970 0
8 . 9 10 6 2 EX_SPLIT_BOTTOM 6 43 447 29,999,970 0 268009646
6 7 8 9 5 2 EX_TUPLE_FLOW 6 190,246 1,046,102 29,999,970 0
. . 7 8 4 2 EX_TRAF_VSBB_UPSERT_LOAD 6 190,238 29,003,168 1 5,751,963 TRAFODION.PKSAAS.EBOXDATA_30M_NEW|0|1581492354
5 . 6 8 3 2 EX_SORT 6 1,753,154 164,131,820 29,999,970 5,752,791 17520|89908|24
4 . 5 6 2 2 EX_SPLIT_TOP 6 1,563,018 4,278,627 29,999,970 29,999,835
3 . 4 5 2 2 EX_SEND_TOP 72 1,579,278 69,549,083 29,999,970 29,999,835
2 . 3 4 2 3 EX_SEND_BOTTOM 72 3,158,237 459,331,880 29,999,970 29,999,911
1 . 2 3 2 3 EX_SPLIT_BOTTOM 12 3,157,643 9,414,684 29,999,970 30,000,000 590013078
. . 1 2 1 3 EX_TRAF_KEY_SELECT 23 3,025,840 232,149,739 29,999,970 30,000,000 TRAFODION.PKSAAS.EBOXDATA_30M|30000000|7527968500
--- SQL operation complete.
- Load测试,通过以下结果可以发现,load最快,从统计信息可以看出它走的是hbase bulk load,总用时花了42秒,相当于每秒70万条记录,这远远超过了上述三种加载方式。
SQL>load into eboxdata_30m_new select * from eboxdata_30m;
UTIL_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------
Task: LOAD Status: Started Object: TRAFODION.PKSAAS.EBOXDATA_30M_NEW
Task: CLEANUP Status: Started Object: TRAFODION.PKSAAS.EBOXDATA_30M_NEW
Task: CLEANUP Status: Ended Object: TRAFODION.PKSAAS.EBOXDATA_30M_NEW
Task: PREPARATION Status: Started Object: TRAFODION.PKSAAS.EBOXDATA_30M_NEW
Rows Processed: 30000000
Task: PREPARATION Status: Ended ET: 00:00:39.754
Task: COMPLETION Status: Started Object: TRAFODION.PKSAAS.EBOXDATA_30M_NEW
Task: COMPLETION Status: Ended ET: 00:00:01.852
--- SQL operation complete.
Start Time 2016/12/15 09:33:20.473559
End Time 2016/12/15 09:34:02.888691
Elapsed Time 00:00:42.415132
Compile Time 00:00:00.010635
Execution Time 00:00:42.404151
SQL>get statistics for qid MXID11000047548212348461421382158000000000406U3333300_1248_SQL_CUR_22 default;
...
LC RC Id PaId ExId Frag TDB Name DOP Dispatches Oper CPU Time Est. Records Used Act. Records Used Details
1 . 2 . 2 0 EX_ROOT 1 0 0 0 0
. . 1 2 1 0 HBASE_BULK_LOAD 1 1 0 0 0
--- SQL operation complete.