How Spark 3.4.x Solves the Slow from_json + regexp_replace Combined Expression Problem


Background

The slow from_json + regexp_replace combined-expression problem described in the earlier post "Solving the slow from_json + regexp_replace combined expression in Spark 3.1.1" has in fact been fixed in Spark 3.4.x.
The fix is tracked in SPARK-44700: keep spark.sql.optimizer.collapseProjectAlwaysInline set to false (which is the default).
But how exactly does Spark 3.4.x solve it?
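A quick way to check the flag on a live session (a sketch, assuming an active SparkSession named spark; the config key is the one referenced by SPARK-44700):

    // Prints the effective value; false is the default in 3.4.x.
    println(spark.conf.get("spark.sql.optimizer.collapseProjectAlwaysInline"))
    // Setting it to true would restore the old always-inline behavior:
    // spark.conf.set("spark.sql.optimizer.collapseProjectAlwaysInline", "true")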

Analysis

Take the following SQL as an example:

    Seq("""{"a":1, "b":0.8}""").toDF("s").write.saveAsTable("t")
    // tip: set spark.sql.planChangeLog.level=warn to log rule-by-rule plan changes
    val df = sql(
      """
        |SELECT j.*
        |FROM   (SELECT from_json(regexp_replace(s, 'a', 'new_a'), 'new_a INT, b DOUBLE') AS j
        |        FROM   t) tmp
        |""".stripMargin)
    df.explain(true)
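The commented-out hint in the snippet refers to plan-change logging. To see the rule-by-rule transformations quoted below, raise the log level before running the query (spark.sql.planChangeLog.level exists since Spark 3.1):

    // Log every rule application that changes the plan, at WARN level.
    spark.conf.set("spark.sql.planChangeLog.level", "warn")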

Spark 3.1.1 performs the following transformations:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
!Project [j#17.new_a AS new_a#19, j#17.b AS b#20]                                                                                                                    Project [from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).new_a AS new_a#19, from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).b AS b#20]
!+- Project [from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)) AS j#17]   +- Relation[s#18] parquet
!   +- Relation[s#18] parquet                                                                                                                                        
           
09:46:57.649 WARN org.apache.spark.sql.catalyst.rules.PlanChangeLogger: 
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.OptimizeJsonExprs ===
!Project [from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).new_a AS new_a#19, from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).b AS b#20]   Project [from_json(StructField(new_a,IntegerType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).new_a AS new_a#19, from_json(StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).b AS b#20]
 +- Relation[s#18] parquet                                                                                                                                                                                                                                                                                                          +- Relation[s#18] parquet
   

The final physical plan is as follows:

== Physical Plan ==
Project [from_json(StructField(new_a,IntegerType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).new_a AS new_a#19, from_json(StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).b AS b#20]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[s#18] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/jiahong.li/xmalaya/github/spark/spark-warehouse/org.apache.spark.sq..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<s:string>

Here from_json/regexp_replace is evaluated once per extracted column (new_a, b): the more columns you pull out of j, the more times the expensive expressions run, which is exactly the performance loss.
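Instead of eyeballing the explain output, you can count the duplicated expressions directly (a sketch that pokes at Catalyst internals; df is the DataFrame defined above):

    // Walk every plan node and count JsonToStructs (from_json) occurrences:
    // 2 on Spark 3.1.1 (duplicated per column), 1 on Spark 3.4.x.
    import org.apache.spark.sql.catalyst.expressions.JsonToStructs
    val fromJsonCount = df.queryExecution.optimizedPlan
      .flatMap(_.expressions)
      .map(_.collect { case j: JsonToStructs => j }.size)
      .sum
    println(s"from_json occurrences: $fromJsonCount")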

In Spark 3.4.0 neither of the transformations above (CollapseProject, OptimizeJsonExprs) fires, and the generated physical plan is:

== Physical Plan ==
*(2) Project [j#17.new_a AS new_a#20, j#17.b AS b#21]
+- Project [from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)) AS j#17]
   +- *(1) ColumnarToRow
      +- FileScan parquet spark_catalog.default.t[s#18] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/jiahong.li/xmalaya/ultimate-git/spark/spark-warehouse/org...., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<s:string>

No matter how many columns new_a, b, ... are extracted from the single struct column j, from_json/regexp_replace is evaluated only once, in the lower Project.
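Conversely, on Spark 3.4.x you can bring the old duplicated plan back by flipping the SPARK-44700 flag before planning the query (a sketch; the flag is read at optimization time, so set it before the DataFrame is planned):

    // Force the pre-3.4 behavior: always inline when collapsing Projects.
    spark.conf.set("spark.sql.optimizer.collapseProjectAlwaysInline", "true")
    // Re-run the same query (plans are cached per DataFrame, so build it anew):
    val df2 = sql(
      """
        |SELECT j.*
        |FROM   (SELECT from_json(regexp_replace(s, 'a', 'new_a'), 'new_a INT, b DOUBLE') AS j
        |        FROM   t) tmp
        |""".stripMargin)
    df2.explain(true)  // from_json/regexp_replace now appear once per column again
    spark.conf.set("spark.sql.optimizer.collapseProjectAlwaysInline", "false")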

Let's analyze why:

The key lies in the CollapseProject rule:

  def apply(plan: LogicalPlan): LogicalPlan = {
    apply(plan, conf.getConf(SQLConf.COLLAPSE_PROJECT_ALWAYS_INLINE))
  }
  ...
  def apply(plan: LogicalPlan, alwaysInline: Boolean): LogicalPlan = {
    plan.transformUpWithPruning(_.containsPattern(PROJECT), ruleId) {
      // Two adjacent Projects are merged only if canCollapseExpressions allows it.
      case p1 @ Project(_, p2: Project)
          if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline) =>
        p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList))
      ...
    }
  }

The most important piece is the canCollapseExpressions method, which decides whether the two Projects can be merged:

  def canCollapseExpressions(
      consumers: Seq[Expression],
      producerMap: Map[Attribute, Expression],
      alwaysInline: Boolean = false): Boolean = {
    consumers
      .filter(_.references.exists(producerMap.contains))
      .flatMap(collectReferences)
      .groupBy(identity)
      .mapValues(_.size)
      .forall {
        case (reference, count) =>
          val producer = producerMap.getOrElse(reference, reference)
          val relatedConsumers = consumers.filter(_.references.contains(reference))
          def cheapToInlineProducer: Boolean = trimAliases(producer) match {
            case e @ (_: CreateNamedStruct | _: UpdateFields | _: CreateMap | _: CreateArray) =>
              var nonCheapAccessSeen = false
              def nonCheapAccessVisitor(): Boolean = {
                try {
                  nonCheapAccessSeen
                } finally {
                  nonCheapAccessSeen = true
                }
              }

              !relatedConsumers.exists(findNonCheapAccesses(_, reference, e, nonCheapAccessVisitor))

            case other => isCheap(other)
          }

          producer.deterministic && (count == 1 || alwaysInline || cheapToInlineProducer)
      }
  }
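For completeness: apply actually goes through an overload that first turns the lower Project's named output into the producerMap consumed above, and the reference count comes from collectReferences, which keeps duplicate attribute occurrences. The following is paraphrased from the Spark 3.4 sources, so treat the exact shapes as approximate:

    // Overload used from apply: builds the attribute-to-expression map from
    // the producer Project's output, then delegates to the method above.
    def canCollapseExpressions(
        consumers: Seq[Expression],
        producers: Seq[NamedExpression],
        alwaysInline: Boolean): Boolean = {
      canCollapseExpressions(consumers, getAliasMap(producers), alwaysInline)
    }

    // Collects every attribute reference, keeping duplicates, so a producer
    // referenced twice (like j in our plan) ends up with count == 2.
    private def collectReferences(e: Expression): Seq[Attribute] = e.collect {
      case a: Attribute => a
    }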

For our JsonToStructs(RegExpReplace(...)) producer, the handling falls into the case other branch:

  def isCheap(e: Expression): Boolean = e match {
    case _: Attribute | _: OuterReference => true
    case _ if e.foldable => true
    // PythonUDF is handled by the rule ExtractPythonUDFs
    case _: PythonUDF => true
    // Alias and ExtractValue are very cheap.
    case _: Alias | _: ExtractValue => e.children.forall(isCheap)
    case _ => false
  }

JsonToStructs matches none of the cheap patterns, so isCheap falls through to the final case _ => false and returns false, which makes cheapToInlineProducer false. Meanwhile count is 2, because the upper Project references j twice (once for new_a, once for b), and alwaysInline keeps its default of false.
With all three disjuncts false, canCollapseExpressions returns false, so in Spark 3.4.0 CollapseProject is not applied to this plan and there is no performance loss.
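This is easy to confirm from the plan itself (a sketch; it counts the Project nodes that survive optimization for the df above):

    // 2 on Spark 3.4.x (no collapse), 1 on Spark 3.1.1 (collapsed).
    import org.apache.spark.sql.catalyst.plans.logical.Project
    val numProjects = df.queryExecution.optimizedPlan.collect {
      case p: Project => p
    }.size
    println(s"Project nodes after optimization: $numProjects")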

Summary

The difference between the plans comes down to how the CollapseProject rule behaves differently in Spark 3.1.1 and Spark 3.4.0.
