Apache Doris Profile&Explain详解
一、简述
Apache Doris中运行EXPLAIN + SQL就可以得到SQL对应的Query Plan,再结合Apche Doris的Profile可以了解Doris是如何处理SQL语句,用于分析查询语句或是结构的性能瓶颈,从而帮助选择更好的索引和写出更优化的查询语句。
二、Plan分析
2.1 sql准备
tpcds query96.sql为例
explain
-- explain graph 生成对应执行计划图表
select count(*)
from store_sales
,household_demographics
,time_dim
, store
where ss_sold_time_sk = time_dim.t_time_sk
and ss_hdemo_sk = household_demographics.hd_demo_sk
and ss_store_sk = s_store_sk
and time_dim.t_hour = 8
and time_dim.t_minute >= 30
and household_demographics.hd_dep_count = 5
and store.s_store_name = 'ese'
order by count(*) limit 100;
2.2 explain结果分析
Query Plan可以分为逻辑执行计划(Logical Query Plan)和物理执行计划(Physical Query Plan),当前讲述的Query Plan默认指逻辑执行计划;tpcds query96.sql对应的Query Plan展示如下。
-- graph
┌───────────────┐
│[8: ResultSink]│
│[Fragment: 4] │
│RESULT SINK │
└───────────────┘
│
│
┌─────────────┐
│[8: TOP-N] │
│[Fragment: 4]│
└─────────────┘
│
│
┌────────────────────────────────┐
│[13: AGGREGATE (merge finalize)]│
│[Fragment: 4] │
└────────────────────────────────┘
│
│
┌──────────────┐
│[12: EXCHANGE]│
│[Fragment: 4] │
└──────────────┘
│
│
┌────────────────────┐
│[12: DataStreamSink]│
│[Fragment: 0] │
│STREAM DATA SINK │
│ EXCHANGE ID: 12 │
│ UNPARTITIONED │
└────────────────────┘
│
│
┌─────────────────────────────────┐
│[7: AGGREGATE (update serialize)]│
│[Fragment: 0] │
└─────────────────────────────────┘
│
│
┌───────────────────────────────┐
│[6: HASH JOIN] │
│[Fragment: 0] │
│join op: INNER JOIN (BROADCAST)│
└───────────────────────────────┘
┌───────────┴─────────────────────────────────────┐
│ │
┌───────────────────────────────┐ ┌──────────────┐
│[4: HASH JOIN] │ │[11: EXCHANGE]│
│[Fragment: 0] │ │[Fragment: 0] │
│join op: INNER JOIN (BROADCAST)│ └──────────────┘
└───────────────────────────────┘ │
┌───────────────┴─────────────────────┐ │
│ │ ┌────────────────────┐
┌───────────────────────────────┐ ┌──────────────┐ │[11: DataStreamSink]│
│[2: HASH JOIN] │ │[10: EXCHANGE]│ │[Fragment: 3] │
│[Fragment: 0] │ │[Fragment: 0] │ │STREAM DATA SINK │
│join op: INNER JOIN (BROADCAST)│ └──────────────┘ │ EXCHANGE ID: 11 │
└───────────────────────────────┘ │ │ UNPARTITIONED │
┌─────────┴──────────┐ │ └────────────────────┘
│ │ ┌────────────────────┐ ┌┘
┌──────────────────┐ ┌─────────────┐ │[10: DataStreamSink]│ │
│[0: OlapScanNode] │ │[9: EXCHANGE]│ │[Fragment: 2] │ ┌─────────────────┐
│[Fragment: 0] │ │[Fragment: 0]│ │STREAM DATA SINK │ │[5: OlapScanNode]│
│TABLE: store_sales│ └─────────────┘ │ EXCHANGE ID: 10 │ │[Fragment: 3] │
└──────────────────┘ │ │ UNPARTITIONED │ │TABLE: store │
│ └────────────────────┘ └─────────────────┘
┌───────────────────┐ │
│[9: DataStreamSink]│ │
│[Fragment: 1] │ ┌─────────────────────────────┐
│STREAM DATA SINK │ │[3: OlapScanNode] │
│ EXCHANGE ID: 09 │ │[Fragment: 2] │
│ UNPARTITIONED │ │TABLE: household_demographics│
└───────────────────┘ └─────────────────────────────┘
│
│
┌─────────────────┐
│[1: OlapScanNode]│
│[Fragment: 1] │
│TABLE: time_dim │
└─────────────────┘
-- 非graph
PLAN FRAGMENT 0
OUTPUT EXPRS:<slot 11> <slot 10> count(*)
PARTITION: UNPARTITIONED
RESULT SINK
8:TOP-N
| order by: <slot 11> <slot 10> count(*) ASC
| offset: 0
| limit: 100
|
13:AGGREGATE (merge finalize)
| output: count(<slot 10> count(*))
| group by:
| cardinality=-1
|
12:EXCHANGE
PLAN FRAGMENT 1
OUTPUT EXPRS:
PARTITION: HASH_PARTITIONED: `default_cluster:tpcds`.`store_sales`.`ss_item_sk`, `default_cluster:tpcds`.`store_sales`.`ss_ticket_number`
STREAM DATA SINK
EXCHANGE ID: 12
UNPARTITIONED
7:AGGREGATE (update serialize)
| output: count(*)
| group by:
| cardinality=1
|
6:HASH JOIN
| join op: INNER JOIN (BROADCAST)
| hash predicates:
| colocate: false, reason: Tables are not in the same group
| equal join conjunct: `ss_store_sk` = `s_store_sk`
| runtime filters: RF000[in] <- `s_store_sk`
| cardinality=2880403
|
|----11:EXCHANGE
|
4:HASH JOIN
| join op: INNER JOIN (BROADCAST)
| hash predicates:
| colocate: false, reason: Tables are not in the same group
| equal join conjunct: `ss_hdemo_sk` = `household_demographics`.`hd_demo_sk`
| runtime filters: RF001[in] <- `household_demographics`.`hd_demo_sk`
| cardinality=2880403
|
|----10:EXCHANGE
|
2:HASH JOIN
| join op: INNER JOIN (BROADCAST)
| hash predicates:
| colocate: false, reason: Tables are not in the same group
| equal join conjunct: `ss_sold_time_sk` = `time_dim`.`t_time_sk`
| runtime filters: RF002[in] <- `time_dim`.`t_time_sk`
| cardinality=2880403
|
|----9:EXCHANGE
|
0:OlapScanNode
TABLE: store_sales
PREAGGREGATION: OFF. Reason: conjunct on `ss_sold_time_sk` which is StorageEngine value column
PREDICATES: `default_cluster:tpcds.store_sales`.`__DORIS_DELETE_SIGN__` = 0
runtime filters: RF000[in] -> `ss_store_sk`, RF001[in] -> `ss_hdemo_sk`, RF002[in] -> `ss_sold_time_sk`
partitions=1/1
rollup: store_sales
tabletRatio=3/3
tabletList=20968,20972,20976
cardinality=2880403
avgRowSize=67.95811
numNodes=3
PLAN FRAGMENT 2
OUTPUT EXPRS:
PARTITION: HASH_PARTITIONED: `default_cluster:tpcds`.`store`.`s_store_sk`
STREAM DATA SINK
EXCHANGE ID: 11
UNPARTITIONED
5:OlapScanNode
TABLE: store
PREAGGREGATION: OFF. Reason: null
PREDICATES: `store`.`s_store_name` = 'ese', `default_cluster:tpcds.store`.`__DORIS_DELETE_SIGN__` = 0
partitions=1/1
rollup: store
tabletRatio=3/3
tabletList=20773,20777,20781
cardinality=23
avgRowSize=1798.8695
numNodes=3
PLAN FRAGMENT 3
OUTPUT EXPRS:
PARTITION: HASH_PARTITIONED: `default_cluster:tpcds`.`household_demographics`.`hd_demo_sk`
STREAM DATA SINK
EXCHANGE ID: 10
UNPARTITIONED
3:OlapScanNode
TABLE: household_demographics
PREAGGREGATION: OFF. Reason: null
PREDICATES: `household_demographics`.`hd_dep_count` = 5, `default_cluster:tpcds.household_demographics`.`__DORIS_DELETE_SIGN__` = 0
partitions=1/1
rollup: household_demographics
tabletRatio=3/3
tabletList=20848,20852,20856
cardinality=14399
avgRowSize=2.8781166
numNodes=3
PLAN FRAGMENT 4
OUTPUT EXPRS:
PARTITION: HASH_PARTITIONED: `default_cluster:tpcds`.`time_dim`.`t_time_sk`
STREAM DATA SINK
EXCHANGE ID: 09
UNPARTITIONED
1:OlapScanNode
TABLE: time_dim
PREAGGREGATION: OFF. Reason: null
PREDICATES: `time_dim`.`t_hour` = 8, `time_dim`.`t_minute` >= 30, `default_cluster:tpcds.time_dim`.`__DORIS_DELETE_SIGN__` = 0
partitions=1/1
rollup: time_dim
tabletRatio=3/3
tabletList=20713,20717,20721
cardinality=172799
avgRowSize=11.671202
numNodes=3
2.2.1 常见属性说明
Colocate Join 适合几张表按照相同字段分桶,并高频根据相同字段 Join 的场景,比如电商的不少应用都按照商家 Id 分桶,并高频按照商家 Id 进行 Join。
2.2.2 plan分析
-
Query96的Query Plan分为五个Plan Fragment,编号从0~4
-
分析Query Plan可以采用自底向上的方式进行,逐个进行分析
-
最底部的Plan Fragment为Fragment 4分析
- 主要负责扫描time_dim表,并提前执行相关查询条件,即谓词下推
- 对于聚合表(Aggregate Key),doris会根据不同查询选择是否开启PREAGGREGATION,上图中time_dim的预聚合为关闭状态,关闭状态之下会读取time_dim的全部维度列,当表中维度列多的时候,这个可能会成为影响性能的一个关键因素
- 如果time_dim表有选择Range Partition进行数据划分,Query Plan中的partitions会表征查询命中几个分区,无关分区被自动过滤会有效减少扫描数据量
- 如果有物化视图,doris会根据查询去自动选择物化视图,如果没有物化视图,那么查询自动命中base table,也就是上图中展示的rollup: time_dim,可参考doris测试物化视图
- 当time_dim数据扫描完成之后,Fragment 4的执行过程也就随之结束,此时它将扫描得到的数据传递给你其它Fragment,EXCHANGE ID : 09表示数据传递给了标号为9的接收节点,可通过graph查看
-
对于Query96的Query Plan而言,Fragment 2, 3, 4功能类似,只是负责扫描的表不同;具体到查询中的Order/Aggregation/Join算子,都在Fragment 1中进行,着重分析Fragment 1
- Fragment 1集成了三个Join算子的执行,采用默认的BROADCAST方式进行执行,也就是小表向大表广播的方式进行,如果两个Join的表都是大表,建议采用SHUFFLE的方式进行
- 目前doris只支持HASH JOIN,也就是采用哈希算法进行Join
- 其中有一个colocate字段,这个用来表述两张Join表采用同样的分区/分桶方式,如此执行Join的过程中可以直接在本地执行,不用进行数据的移动
- Join执行完成之后,就是执行上成的Aggregation, Order by和TOP-N的算子
三、Doris-Profile简述
可通过8030页面的QueryProfile模块查看任务执行详情,以下为query96.sql实际执行的QueryProfile部分内容,各指标名详情可参考:Apache Doris查询分析
Query:
Summary:
- Query ID: 7dd4ba245012441c-b0aadbed39f80f20
- Start Time: 2022-04-15 15:52:22
- End Time: 2022-04-15 15:52:22
- Total: 611ms
- Query Type: Query
- Query State: EOF
- Doris Version: 0.15.0-rc04
- User: root
- Default Db: default_cluster:tpcds
- Sql Statement: /* ApplicationName=DBeaver Enterprise 7.0.0 - SQLEditor <20220321常用命令-doris.sql> */ select count(*)
from store_sales
,household_demographics
,time_dim
, store
where ss_sold_time_sk = time_dim.t_time_sk
and ss_hdemo_sk = household_demographics.hd_demo_sk
and ss_store_sk = s_store_sk
and time_dim.t_hour = 8
and time_dim.t_minute >= 30
and household_demographics.hd_dep_count = 5
and store.s_store_name = 'ese'
order by count(*) limit 100
- Is Cached: No
Execution Summary:
- Analysis Time: 636.648us
- Plan Time: 19.230ms
- Schedule Time: 125.121ms
- Wait and Fetch Result Time: 466.30ms
Execution Profile 7dd4ba245012441c-b0aadbed39f80f20:(Active: 611.44ms, % non-child: 100.00%)
Fragment 0:
Instance 7dd4ba245012441c-b0aadbed39f80f2d (host=TNetworkAddress(hostname:10.192.119.70, port:9060)):(Active: 586.950ms, % non-child: 0.00%)
- FragmentCpuTime: 756.962us
- MemoryLimit: 2.00 GB
- PeakMemoryUsage: 48.01 KB
- PeakReservation: 0.00
- PeakUsedReservation: 0.00
- RowsProduced: 1
BlockMgr:
- BlockWritesOutstanding: 0
- BlocksCreated: 0
- BlocksRecycled: 0
- BufferedPins: 0
- BytesWritten: 0.00
- MaxBlockSize: 8.00 MB
- TotalBufferWaitTime: 0ns
- TotalEncryptionTime: 0ns
- TotalIntegrityCheckTime: 0ns
- TotalReadBlockTime: 0ns
DataBufferSender (dst_fragment_instance_id=7dd4ba245012441c-b0aadbed39f80f2d):
- AppendBatchTime: 124.481us
- ResultSendTime: 119.257us
- TupleConvertTime: 4.217us
- NumSentRows: 1
SORT_NODE (id=8):(Active: 587.36ms, % non-child: 0.01%)
- PeakMemoryUsage: 16.00 KB
- RowsReturned: 1
- RowsReturnedRate: 1
AGGREGATION_NODE (id=13):(Active: 586.958ms, % non-child: 0.10%)
- Probe Method: HashTable Linear Probing
- BuildTime: 10.533us
- GetResultsTime: 0ns
- HTResize: 0
- HTResizeTime: 0ns
- HashBuckets: 0
- HashCollisions: 0
- HashFailedProbe: 0
- HashFilledBuckets: 0
- HashProbe: 0
- HashTravelLength: 0
- LargestPartitionPercent: 0
- MaxPartitionLevel: 0
- NumRepartitions: 0
- PartitionsCreated: 0
- PeakMemoryUsage: 28.00 KB
- RowsProcessed: 0
- RowsRepartitioned: 0
- RowsReturned: 1
- RowsReturnedRate: 1
- SpilledPartitions: 0
EXCHANGE_NODE (id=12):(Active: 586.364ms, % non-child: 95.96%)
- BytesReceived: 32.00 B
- ConvertRowBatchTime: 7.320us
- DataArrivalWaitTime: 586.282ms
- DeserializeRowBatchTimer: 22.637us
- FirstBatchArrivalWaitTime: 349.530ms
- PeakMemoryUsage: 12.01 KB
- RowsReturned: 3
- RowsReturnedRate: 5
- SendersBlockedTotalTimer(*): 0ns
Fragment 1:
Instance 7dd4ba245012441c-b0aadbed39f80f23 (host=TNetworkAddress(hostname:10.192.119.68, port:9060)):(Active: 472.511ms, % non-child: 0.10%)
- FragmentCpuTime: 5.714ms
- MemoryLimit: 2.00 GB
- PeakMemoryUsage: 610.00 KB
- PeakReservation: 0.00
- PeakUsedReservation: 0.00
- RowsProduced: 1
BlockMgr:
- BlockWritesOutstanding: 0
- BlocksCreated: 0
- BlocksRecycled: 0
- BufferedPins: 0
- BytesWritten: 0.00
- MaxBlockSize: 8.00 MB
- TotalBufferWaitTime: 0ns
- TotalEncryptionTime: 0ns
- TotalIntegrityCheckTime: 0ns
- TotalReadBlockTime: 0ns
DataStreamSender (dst_id=12, dst_fragments=[7dd4ba245012441c-b0aadbed39f80f2d]):(Active: 186.357us, % non-child: 0.03%)
- BytesSent: 16.00 B
- IgnoreRows: 0
- LocalBytesSent: 0.00
- OverallThroughput: 83.84375 KB/sec
- PeakMemoryUsage: 16.00 KB
- SerializeBatchTime: 7.0us
- UncompressedRowBatchSize: 16.00 B
AGGREGATION_NODE (id=7):(Active: 471.713ms, % non-child: 0.14%)
- Probe Method: HashTable Linear Probing
- BuildTime: 45.223us
- GetResultsTime: 0ns
- HTResize: 0
- HTResizeTime: 0ns
- HashBuckets: 0
- HashCollisions: 0
- HashFailedProbe: 0
- HashFilledBuckets: 0
- HashProbe: 0
- HashTravelLength: 0
- LargestPartitionPercent: 0
- MaxPartitionLevel: 0
- NumRepartitions: 0
- PartitionsCreated: 0
- PeakMemoryUsage: 280.00 KB
- RowsProcessed: 0
- RowsRepartitioned: 0
- RowsReturned: 1
- RowsReturnedRate: 2
- SpilledPartitions: 0
HASH_JOIN_NODE (id=6):(Active: 470.881ms, % non-child: 0.08%)
- ExecOption: Hash Table Built Asynchronously
- BuildBuckets: 1.024K (1024)
- BuildRows: 1
- BuildTime: 1.129ms
- HashTableMaxList: 1
- HashTableMinList: 1
- LoadFactor: 4562146422526312400.00
- PeakMemoryUsage: 308.00 KB
- ProbeRows: 341
- ProbeTime: 34.697us
- PushDownComputeTime: 156.171us
- PushDownTime: 4.423us
- RowsReturned: 341
- RowsReturnedRate: 724
- Active:表示该节点(包含其所有子节点)的执行时间
- BuildTime:扫描右表并构建hash表的时间
- ProbeTime:获取左表并搜索hashtable进行匹配并输出的时间