Exchange reuse in Spark

Overview

If the same Exchange or Subquery appears more than once in a query, Spark can reuse the first occurrence instead of recomputing it.
With AE (Adaptive Execution) disabled, reuse is implemented by two dedicated rules that rewrite the physical plan.
With AE enabled, reuse is achieved during stage splitting and scheduling, through special reused query stages.
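
Both code paths hang off one configuration switch. A minimal sketch of toggling it (spark.sql.adaptive.enabled is Spark's standard AQE flag):

    // Choose which reuse mechanism the rest of this post exercises.
    spark.conf.set("spark.sql.adaptive.enabled", "false") // rule-based reuse (ReuseExchange / ReuseSubquery)
    spark.conf.set("spark.sql.adaptive.enabled", "true")  // stage-based reuse under AE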

The ReuseExchange and ReuseSubquery rules

With AE disabled, exchange reuse is implemented by the ReuseExchange and ReuseSubquery rules.
Both rules run as part of the preparation step that turns the sparkPlan into the executedPlan.
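
A hedged sketch of where these rules sit (condensed from QueryExecution in Spark 3.x; surrounding details omitted):

    // executedPlan is sparkPlan with every preparation rule applied in order;
    // ReuseExchange and ReuseSubquery are two of those rules.
    lazy val executedPlan: SparkPlan =
      preparations.foldLeft(sparkPlan) { case (plan, rule) => rule.apply(plan) }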

The ReuseExchange rule replaces every Exchange that has already appeared in the plan with a ReusedExchangeExec pointing at the earlier occurrence.

The ReuseSubquery rule does the same for subqueries: every ExecSubqueryExpression whose plan has already appeared is replaced with a ReusedSubqueryExec.
In both ReusedExchangeExec and ReusedSubqueryExec, every execute method simply delegates to the corresponding execute method of the child, so the shared plan is computed only once.

    // ReuseExchange: group candidate exchanges by schema, then match with
    // sameResult, which compares canonicalized plans.
    val exchanges = mutable.HashMap[StructType, ArrayBuffer[Exchange]]()
    def reuse: PartialFunction[Exchange, SparkPlan] = {
      case exchange: Exchange =>
        val sameSchema = exchanges.getOrElseUpdate(exchange.schema, ArrayBuffer[Exchange]())
        val samePlan = sameSchema.find { e =>
          exchange.sameResult(e)
        }
        if (samePlan.isDefined) {
          // Keep the output of this exchange, the following plans require that to resolve
          // attributes.
          // samePlan.get is the exchange that appeared earlier in the plan.
          ReusedExchangeExec(exchange.output, samePlan.get)
        } else {
          sameSchema += exchange
          exchange
        }
    }

    // ReuseSubquery: the same idea, applied to subquery expressions in the plan.
    val subqueries = mutable.HashMap[StructType, ArrayBuffer[BaseSubqueryExec]]()
    plan transformAllExpressions {
      case sub: ExecSubqueryExpression =>
        val sameSchema =
          subqueries.getOrElseUpdate(sub.plan.schema, ArrayBuffer[BaseSubqueryExec]())
        val sameResult = sameSchema.find(_.sameResult(sub.plan))
        if (sameResult.isDefined) {
          sub.withNewPlan(ReusedSubqueryExec(sameResult.get))
        } else {
          sameSchema += sub.plan
          sub
        }
    }
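
The reuse node itself is trivial. A hedged sketch of ReusedExchangeExec's delegation (condensed from Spark's source; most members omitted):

    // A leaf node that forwards all execution to the exchange it reuses.
    // Only `output` differs, so downstream operators can still resolve
    // their attributes against this node.
    case class ReusedExchangeExec(override val output: Seq[Attribute], child: Exchange)
      extends LeafExecNode {

      override protected def doExecute(): RDD[InternalRow] =
        child.execute()

      override protected[sql] def doExecuteBroadcast[T](): broadcast.Broadcast[T] =
        child.executeBroadcast()
    }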

How AE implements reuse

AE implements exchange reuse at scheduling time, when the plan is split into query stages.
Take the following query as an example:

SELECT user.id, name, count(salary) AS c1
FROM user JOIN sal
ON user.id = sal.id
GROUP BY user.id, name
UNION ALL
SELECT user.id, name, count(salary) AS c1
FROM user JOIN sal
ON user.id = sal.id
GROUP BY user.id, name

The original plan is as follows:

== Optimized Logical Plan ==
Union
:- Aggregate [id#277L, name#278], [id#277L, name#278, count(salary#282) AS c1#279L]
:  +- Project [id#277L, name#278, salary#282]
:     +- Join Inner, (id#277L = id#281L)
:        :- Filter isnotnull(id#277L)
:        :  +- Relation default.user[id#277L,name#278] parquet
:        +- Filter isnotnull(id#281L)
:           +- Relation default.sal[id#281L,salary#282] parquet
+- Aggregate [id#277L, name#278], [id#277L, name#278, count(salary#282) AS c1#287L]
   +- Project [id#277L, name#278, salary#282]
      +- Join Inner, (id#277L = id#281L)
         :- Filter isnotnull(id#277L)
         :  +- Relation default.user[id#277L,name#278] parquet
         +- Filter isnotnull(id#281L)
            +- Relation default.sal[id#281L,salary#282] parquet

== Physical Plan ==

Union
:- HashAggregate(keys=[id#277L, name#278], functions=[count(salary#282)], output=[id#277L, name#278, c1#279L])
:  +- Exchange hashpartitioning(id#277L, name#278, 5), ENSURE_REQUIREMENTS, [id=#186]
:     +- HashAggregate(keys=[id#277L, name#278], functions=[partial_count(salary#282)], output=[id#277L, name#278, count#292L])
:        +- Project [id#277L, name#278, salary#282]
:           +- BroadcastHashJoin [id#277L], [id#281L], Inner, BuildLeft, false
:              :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#181]
:              :  +- Project [id#277L, name#278]
:              :     +- Filter isnotnull(id#277L)
:              :        +- FileScan parquet default.user[id#277L,name#278] Batched: true, DataFilters: [isnotnull(id#277L)], Format: Parquet...
:              +- Project [id#281L, salary#282]
:                 +- Filter isnotnull(id#281L)
:                    +- FileScan parquet default.sal[id#281L,salary#282] Batched: true, DataFilters: [isnotnull(id#281L)], Format: Parquet...
+- HashAggregate(keys=[id#277L, name#278], functions=[count(salary#282)], output=[id#277L, name#278, c1#287L])
   +- Exchange hashpartitioning(id#277L, name#278, 5), ENSURE_REQUIREMENTS, [id=#193]
      +- HashAggregate(keys=[id#277L, name#278], functions=[partial_count(salary#282)], output=[id#277L, name#278, count#294L])
         +- Project [id#277L, name#278, salary#282]
            +- BroadcastHashJoin [id#277L], [id#281L], Inner, BuildLeft, false
               :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#188]
               :  +- Project [id#277L, name#278]
               :     +- Filter isnotnull(id#277L)
               :        +- FileScan parquet default.user[id#277L,name#278] Batched: true, DataFilters: [isnotnull(id#277L)], Format: Parquet...
               +- Project [id#281L, salary#282]
                  +- Filter isnotnull(id#281L)
                     +- FileScan parquet default.sal[id#281L,salary#282] Batched: true, DataFilters: [isnotnull(id#281L)], Format: Parquet...

AE's first round of stage splitting and submission

Union
:- HashAggregate(keys=[id#277L, name#278], functions=[count(salary#282)], output=[id#277L, name#278, c1#279L])
:  +- Exchange hashpartitioning(id#277L, name#278, 5), ENSURE_REQUIREMENTS, [id=#229]
:     +- HashAggregate(keys=[id#277L, name#278], functions=[partial_count(salary#282)], output=[id#277L, name#278, count#292L])
:        +- Project [id#277L, name#278, salary#282]
:           +- BroadcastHashJoin [id#277L], [id#281L], Inner, BuildLeft, false
:              :- BroadcastQueryStage 0
:              :  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#224]
:              :     +- *(1) Project [id#277L, name#278]
:              :        +- *(1) Filter isnotnull(id#277L)
:              :           +- *(1) ColumnarToRow
:              :              +- FileScan parquet default.user[id#277L,name#278] Batched: true, DataFilters: [isnotnull(id#277L)], Format: Parquet...
:              +- Project [id#281L, salary#282]
:                 +- Filter isnotnull(id#281L)
:                    +- FileScan parquet default.sal[id#281L,salary#282] Batched: true, DataFilters: [isnotnull(id#281L)], Format: Parquet...
+- HashAggregate(keys=[id#277L, name#278], functions=[count(salary#282)], output=[id#277L, name#278, c1#287L])
   +- Exchange hashpartitioning(id#277L, name#278, 5), ENSURE_REQUIREMENTS, [id=#255]
      +- HashAggregate(keys=[id#277L, name#278], functions=[partial_count(salary#282)], output=[id#277L, name#278, count#294L])
         +- Project [id#277L, name#278, salary#282]
            +- BroadcastHashJoin [id#277L], [id#281L], Inner, BuildLeft, false
               :- BroadcastQueryStage 1
               :  +- ReusedExchange [id#277L, name#278], BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#224]
               +- Project [id#281L, salary#282]
                  +- Filter isnotnull(id#281L)
                     +- FileScan parquet default.sal[id#281L,salary#282] Batched: true, DataFilters: [isnotnull(id#281L)], Format: Parquet....

At this point AE submits BroadcastQueryStage 0 and BroadcastQueryStage 1 for execution. Note that BroadcastQueryStage 1 wraps a ReusedExchange pointing back at the BroadcastExchange of stage 0 (id=#224).
Executing a BroadcastQueryStage means materializing val relationFuture: Future[broadcast.Broadcast[Any]] in BroadcastExchangeExec: the child SparkPlan is executed and collected, the rows are turned into a broadcastable relation, and the relation is broadcast to the executors.
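
Because BroadcastQueryStage 1 reuses the very same BroadcastExchangeExec instance, the broadcast job runs only once. A hedged sketch of why (condensed from BroadcastExchangeExec; error handling, timeouts and metrics omitted):

    // relationFuture is a lazy val: the first stage to materialize it starts
    // the job; a reusing stage just gets the same Future back.
    lazy val relationFuture: Future[broadcast.Broadcast[Any]] = Future {
      val input = child.executeCollect()   // run and collect the child plan
      val relation = mode.transform(input) // e.g. build a HashedRelation
      sparkContext.broadcast(relation)     // ship it to the executors
    }(BroadcastExchangeExec.executionContext)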

AE's second round of stage splitting and submission

Union
:- HashAggregate(keys=[id#277L, name#278], functions=[count(salary#282)], output=[id#277L, name#278, c1#279L])
:  +- ShuffleQueryStage 2
:     +- Exchange hashpartitioning(id#277L, name#278, 5), ENSURE_REQUIREMENTS, [id=#325]
:        +- *(2) HashAggregate(keys=[id#277L, name#278], functions=[partial_count(salary#282)], output=[id#277L, name#278, count#292L])
:           +- *(2) Project [id#277L, name#278, salary#282]
:              +- *(2) BroadcastHashJoin [id#277L], [id#281L], Inner, BuildLeft, false
:                 :- BroadcastQueryStage 0
:                 :  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#224]
:                 :     +- *(1) Project [id#277L, name#278]
:                 :        +- *(1) Filter isnotnull(id#277L)
:                 :           +- *(1) ColumnarToRow
:                 :              +- FileScan parquet default.user[id#277L,name#278] Batched: true, DataFilters: [isnotnull(id#277L)], Format: Parquet...
:                 +- *(2) Project [id#281L, salary#282]
:                    +- *(2) Filter isnotnull(id#281L)
:                       +- *(2) ColumnarToRow
:                          +- FileScan parquet default.sal[id#281L,salary#282] Batched: true, DataFilters: [isnotnull(id#281L)], Format: Parquet...
+- HashAggregate(keys=[id#277L, name#278], functions=[count(salary#282)], output=[id#277L, name#278, c1#287L])
   +- ShuffleQueryStage 3
      +- ReusedExchange [id#277L, name#278, count#294L], Exchange hashpartitioning(id#277L, name#278, 5), ENSURE_REQUIREMENTS, [id=#325]

At this point AE submits ShuffleQueryStage 2 and ShuffleQueryStage 3 for execution.
A ShuffleQueryStage exposes val mapOutputStatisticsFuture: Future[MapOutputStatistics]; AE waits on this Future, which completes when the shuffle map stage finishes.
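
A hedged sketch of that Future (condensed from ShuffleExchangeExec; metrics omitted):

    // The shuffle map stage is submitted at most once per exchange instance;
    // the resulting MapOutputStatistics (bytes per partition) feed AE's
    // re-optimization of the remaining plan.
    lazy val mapOutputStatisticsFuture: Future[MapOutputStatistics] = {
      if (inputRDD.getNumPartitions == 0) {
        Future.successful(null)                        // nothing to shuffle
      } else {
        sparkContext.submitMapStage(shuffleDependency) // a SimpleFutureAction
      }
    }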

Both stages above resolve to the same Future object, because ShuffleQueryStage 3 wraps a ReusedExchange that points at the very Exchange instance (id=#325) of ShuffleQueryStage 2. That shared Future is what guarantees a QueryStage is never submitted and executed twice.
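
The deduplication itself happens while AE turns exchanges into stages. A hedged sketch of the lookup (modeled on AdaptiveSparkPlanExec.createQueryStages; stageFor is a simplified stand-in for the real control flow):

    // Stages are cached by canonicalized plan. A second, semantically equal
    // exchange gets a "reused" stage that shares the cached stage's exchange,
    // and therefore its lazy result Future, so no job is submitted twice.
    private val stageCache = mutable.HashMap[SparkPlan, QueryStageExec]()

    private def stageFor(e: Exchange): QueryStageExec =
      stageCache.get(e.canonicalized) match {
        case Some(existing) => reuseQueryStage(existing, e) // wrap a ReusedExchange
        case None =>
          val stage = newQueryStage(e)
          stageCache.put(e.canonicalized, stage)
          stage
      }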
