How Spark 3.4.x Solves the Slow from_json + regexp_replace Combined Expression Problem


Background

The slow from_json + regexp_replace combined-expression problem described in the earlier post "Solving the slow from_json + regexp_replace combined expression in Spark 3.1.1" has in fact been fixed in Spark 3.4.x.
The fix is tracked in SPARK-44700: keep spark.sql.optimizer.collapseProjectAlwaysInline set to false (which is the default).
But how exactly does Spark 3.4.x solve it?
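A quick way to check the flag on a live session (a sketch, assuming an active SparkSession named spark; the config key is the one referenced by SPARK-44700):

    // Prints the effective value; false is the default in 3.4.x.
    println(spark.conf.get("spark.sql.optimizer.collapseProjectAlwaysInline"))
    // Setting it to true would restore the old always-inline behavior:
    // spark.conf.set("spark.sql.optimizer.collapseProjectAlwaysInline", "true")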

Analysis

Take the following SQL as an example:

    Seq("""{"a":1, "b":0.8}""").toDF("s").write.saveAsTable("t")
    // tip: set spark.sql.planChangeLog.level=warn to log rule-by-rule plan changes
    val df = sql(
      """
        |SELECT j.*
        |FROM   (SELECT from_json(regexp_replace(s, 'a', 'new_a'), 'new_a INT, b DOUBLE') AS j
        |        FROM   t) tmp
        |""".stripMargin)
    df.explain(true)
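The commented-out hint in the snippet refers to plan-change logging. To see the rule-by-rule transformations quoted below, raise the log level before running the query (spark.sql.planChangeLog.level exists since Spark 3.1):

    // Log every rule application that changes the plan, at WARN level.
    spark.conf.set("spark.sql.planChangeLog.level", "warn")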

Spark 3.1.1 performs the following transformations:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
!Project [j#17.new_a AS new_a#19, j#17.b AS b#20]                                                                                                                    Project [from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).new_a AS new_a#19, from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).b AS b#20]
!+- Project [from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)) AS j#17]   +- Relation[s#18] parquet
!   +- Relation[s#18] parquet                                                                                                                                        
           
09:46:57.649 WARN org.apache.spark.sql.catalyst.rules.PlanChangeLogger: 
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.OptimizeJsonExprs ===
!Project [from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).new_a AS new_a#19, from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).b AS b#20]   Project [from_json(StructField(new_a,IntegerType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).new_a AS new_a#19, from_json(StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).b AS b#20]
 +- Relation[s#18] parquet                                                                                                                                                                                                                                                                                                          +- Relation[s#18] parquet
   

The final physical plan is as follows:

== Physical Plan ==
Project [from_json(StructField(new_a,IntegerType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).new_a AS new_a#19, from_json(StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).b AS b#20]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[s#18] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/jiahong.li/xmalaya/github/spark/spark-warehouse/org.apache.spark.sq..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<s:string>

Here from_json/regexp_replace is evaluated once per extracted column (new_a, b): the more columns you pull out of j, the more times the expensive expressions run, which is exactly the performance loss.
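Instead of eyeballing the explain output, you can count the duplicated expressions directly (a sketch that pokes at Catalyst internals; df is the DataFrame defined above):

    // Walk every plan node and count JsonToStructs (from_json) occurrences:
    // 2 on Spark 3.1.1 (duplicated per column), 1 on Spark 3.4.x.
    import org.apache.spark.sql.catalyst.expressions.JsonToStructs
    val fromJsonCount = df.queryExecution.optimizedPlan
      .flatMap(_.expressions)
      .map(_.collect { case j: JsonToStructs => j }.size)
      .sum
    println(s"from_json occurrences: $fromJsonCount")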

In Spark 3.4.0 neither of the transformations above (CollapseProject, OptimizeJsonExprs) fires, and the generated physical plan is:

== Physical Plan ==
*(2) Project [j#17.new_a AS new_a#20, j#17.b AS b#21]
+- Project [from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)) AS j#17]
   +- *(1) ColumnarToRow
      +- FileScan parquet spark_catalog.default.t[s#18] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/jiahong.li/xmalaya/ultimate-git/spark/spark-warehouse/org...., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<s:string>

No matter how many columns new_a, b, ... are extracted from the single struct column j, from_json/regexp_replace is evaluated only once, in the lower Project.
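Conversely, on Spark 3.4.x you can bring the old duplicated plan back by flipping the SPARK-44700 flag before planning the query (a sketch; the flag is read at optimization time, so set it before the DataFrame is planned):

    // Force the pre-3.4 behavior: always inline when collapsing Projects.
    spark.conf.set("spark.sql.optimizer.collapseProjectAlwaysInline", "true")
    // Re-run the same query (plans are cached per DataFrame, so build it anew):
    val df2 = sql(
      """
        |SELECT j.*
        |FROM   (SELECT from_json(regexp_replace(s, 'a', 'new_a'), 'new_a INT, b DOUBLE') AS j
        |        FROM   t) tmp
        |""".stripMargin)
    df2.explain(true)  // from_json/regexp_replace now appear once per column again
    spark.conf.set("spark.sql.optimizer.collapseProjectAlwaysInline", "false")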

Let's analyze why:

The key lies in the CollapseProject rule:

  def apply(plan: LogicalPlan): LogicalPlan = {
    apply(plan, conf.getConf(SQLConf.COLLAPSE_PROJECT_ALWAYS_INLINE))
  }
  ...
  def apply(plan: LogicalPlan, alwaysInline: Boolean): LogicalPlan = {
    plan.transformUpWithPruning(_.containsPattern(PROJECT), ruleId) {
      // Two adjacent Projects are merged only if canCollapseExpressions allows it.
      case p1 @ Project(_, p2: Project)
          if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline) =>
        p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList))
      ...
    }
  }

The most important piece is the canCollapseExpressions method, which decides whether the two Projects can be merged:

  def canCollapseExpressions(
      consumers: Seq[Expression],
      producerMap: Map[Attribute, Expression],
      alwaysInline: Boolean = false): Boolean = {
    consumers
      .filter(_.references.exists(producerMap.contains))
      .flatMap(collectReferences)
      .groupBy(identity)
      .mapValues(_.size)
      .forall {
        case (reference, count) =>
          val producer = producerMap.getOrElse(reference, reference)
          val relatedConsumers = consumers.filter(_.references.contains(reference))
          def cheapToInlineProducer: Boolean = trimAliases(producer) match {
            case e @ (_: CreateNamedStruct | _: UpdateFields | _: CreateMap | _: CreateArray) =>
              var nonCheapAccessSeen = false
              def nonCheapAccessVisitor(): Boolean = {
                try {
                  nonCheapAccessSeen
                } finally {
                  nonCheapAccessSeen = true
                }
              }

              !relatedConsumers.exists(findNonCheapAccesses(_, reference, e, nonCheapAccessVisitor))

            case other => isCheap(other)
          }

          producer.deterministic && (count == 1 || alwaysInline || cheapToInlineProducer)
      }
  }
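For completeness: apply actually goes through an overload that first turns the lower Project's named output into the producerMap consumed above, and the reference count comes from collectReferences, which keeps duplicate attribute occurrences. The following is paraphrased from the Spark 3.4 sources, so treat the exact shapes as approximate:

    // Overload used from apply: builds the attribute-to-expression map from
    // the producer Project's output, then delegates to the method above.
    def canCollapseExpressions(
        consumers: Seq[Expression],
        producers: Seq[NamedExpression],
        alwaysInline: Boolean): Boolean = {
      canCollapseExpressions(consumers, getAliasMap(producers), alwaysInline)
    }

    // Collects every attribute reference, keeping duplicates, so a producer
    // referenced twice (like j in our plan) ends up with count == 2.
    private def collectReferences(e: Expression): Seq[Attribute] = e.collect {
      case a: Attribute => a
    }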

For our JsonToStructs(RegExpReplace(...)) producer, the handling falls into the case other branch:

  def isCheap(e: Expression): Boolean = e match {
    case _: Attribute | _: OuterReference => true
    case _ if e.foldable => true
    // PythonUDF is handled by the rule ExtractPythonUDFs
    case _: PythonUDF => true
    // Alias and ExtractValue are very cheap.
    case _: Alias | _: ExtractValue => e.children.forall(isCheap)
    case _ => false
  }

JsonToStructs matches none of the cheap patterns, so isCheap falls through to the final case _ => false and returns false, which makes cheapToInlineProducer false. Meanwhile count is 2, because the upper Project references j twice (once for new_a, once for b), and alwaysInline keeps its default of false.
With all three disjuncts false, canCollapseExpressions returns false, so in Spark 3.4.0 CollapseProject is not applied to this plan and there is no performance loss.
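This is easy to confirm from the plan itself (a sketch; it counts the Project nodes that survive optimization for the df above):

    // 2 on Spark 3.4.x (no collapse), 1 on Spark 3.1.1 (collapsed).
    import org.apache.spark.sql.catalyst.plans.logical.Project
    val numProjects = df.queryExecution.optimizedPlan.collect {
      case p: Project => p
    }.size
    println(s"Project nodes after optimization: $numProjects")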

Summary

The difference between the plans comes down to how the CollapseProject rule behaves differently in Spark 3.1.1 and Spark 3.4.0.
