spark SQL(12)show函数的执行流程

DataSet中的show()调用select()执行,并打印结果。

  def show(truncate: Boolean): Unit = show(20, truncate)
  //
  def show(numRows: Int, truncate: Boolean): Unit = if (truncate) {
    println(showString(numRows, truncate = 20))
  } else {
    println(showString(numRows, truncate = 0))
  }

showString() 调用了getRows(),把结果美化一下,封装到StringBuilder对象中返回,然后打印。

  private[sql] def showString(
      _numRows: Int,
      truncate: Int = 20,
      vertical: Boolean = false): String = {
    val numRows = _numRows.max(0).min(ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH - 1)
    // Get rows represented by Seq[Seq[String]], we may get one more line if it has more data.
    val tmpRows = getRows(numRows, truncate)

    val hasMoreData = tmpRows.length - 1 > numRows
    val rows = tmpRows.take(numRows + 1)

    val sb = new StringBuilder

    //......

    sb.toString()
  }

getRows()

先toDF()从Dataset.ofRows()得到生成了LogicalPlan的DataFrame,记为newDf;
然后对newDf的LogicalPlan增加了一条dataType的操作计划作为castCols,传给select()
最后select().take() 返回数据。

select()是一个withPlan(),withPlan()居然是@inline的,难道借鉴了C++的思想?

take()其实就是head(),head()对withPlan()调用了collectFromPlan!

  def toDF(): DataFrame = new Dataset[Row](sparkSession, queryExecution, RowEncoder(schema))
  //
  private[sql] def getRows(
      numRows: Int,
      truncate: Int): Seq[Seq[String]] = {
    val newDf = toDF()
    val castCols = newDf.logicalPlan.output.map { col =>
      // Since binary types in top-level schema fields have a specific format to print,
      // so we do not cast them to strings here.
      if (col.dataType == BinaryType) {
        Column(col)
      } else {
        Column(col).cast(StringType)
      }
    }
    val data = newDf.select(castCols: _*).take(numRows + 1)
    ...
    }
    
  def select(cols: Column*): DataFrame = withPlan {
    Project(cols.map(_.named), logicalPlan)
  }
  
  /** A convenient function to wrap a logical plan and produce a DataFrame. */
  @inline private def withPlan(logicalPlan: LogicalPlan): DataFrame = {
    Dataset.ofRows(sparkSession, logicalPlan)
  }
  
  def take(n: Int): Array[T] = head(n)
  
  def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)

collectFromPlan()

先调用GenerateSafeProjection,生成代码。
然后执行SparkPlan.executeCollect()拿到文件前n行数据,生成一个byteArrayRdd写入压缩的DataOutputStream对象,
把byteArrayRdd解压缩成Array[InternalRow],就有了RDD的每一行,再对每行套用查询计划生成的代码。

看起来简单的查询大致就是这样。

object GenerateSafeProjection extends CodeGenerator[Seq[Expression], Projection] {

  /**
   * Dataset.scala
   * Collect all elements from a spark plan.
   */
  private def collectFromPlan(plan: SparkPlan): Array[T] = {
    // This projection writes output to a `InternalRow`, which means applying this projection is not
    // thread-safe. Here we create the projection inside this method to make `Dataset` thread-safe.
    val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
    plan.executeCollect().map { row =>
      // The row returned by SafeProjection is `SpecificInternalRow`, which ignore the data type
      // parameter of its `get` method, so it's safe to use null here.
      objProj(row).get(0, null).asInstanceOf[T]
    }
  }
  
  // SparkPlan.scala
  /**
   * Runs this query returning the result as an array.
   */
  def executeCollect(): Array[InternalRow] = {
    val byteArrayRdd = getByteArrayRdd()

    val results = ArrayBuffer[InternalRow]()
    byteArrayRdd.collect().foreach { countAndBytes =>
      decodeUnsafeRows(countAndBytes._2).foreach(results.+=)
    }
    results.toArray
  }
  // 
  private def getByteArrayRdd(n: Int = -1): RDD[(Long, Array[Byte])] = {
    execute().mapPartitionsInternal { iter =>
      var count = 0
      val buffer = new Array[Byte](4 << 10)  // 4K
      val codec = CompressionCodec.createCodec(SparkEnv.get.conf)
      val bos = new ByteArrayOutputStream()
      val out = new DataOutputStream(codec.compressedOutputStream(bos))
      // `iter.hasNext` may produce one row and buffer it, we should only call it when the limit is
      // not hit.
      while ((n < 0 || count < n) && iter.hasNext) {
        val row = iter.next().asInstanceOf[UnsafeRow]
        out.writeInt(row.getSizeInBytes)
        row.writeToStream(out, buffer)
        count += 1
      }
      out.writeInt(-1)
      out.flush()
      out.close()
      Iterator((count, bos.toByteArray))
    }
  }
  /**
   * Returns the result of this query as an RDD[InternalRow] by delegating to `doExecute` after
   * preparations.
   *
   * Concrete implementations of SparkPlan should override `doExecute`.
   */
  final def execute(): RDD[InternalRow] = executeQuery {
    if (isCanonicalizedPlan) {
      throw new IllegalStateException("A canonicalized plan is not supposed to be executed.")
    }
    // 调用子类的doExecute()
    doExecute()
  }

doExecute的执行,还是会对RDD的每一个分区,用sparkContext.runJob()启动一个spark任务。

比如SparkPlan中的executeTake():

  /**
   * Runs this query returning the first `n` rows as an array.
   *
   * This is modeled after `RDD.take` but never runs any job locally on the driver.
   */
  def executeTake(n: Int): Array[InternalRow] = {
    if (n == 0) {
      return new Array[InternalRow](0)
    }

    val childRDD = getByteArrayRdd(n).map(_._2)

    val buf = new ArrayBuffer[InternalRow]
    val totalParts = childRDD.partitions.length
    var partsScanned = 0
    while (buf.size < n && partsScanned < totalParts) {
      // The number of partitions to try in this iteration. It is ok for this number to be
      // greater than totalParts because we actually cap it at totalParts in runJob.
      var numPartsToTry = 1L
      if (partsScanned > 0) {
        // If we didn't find any rows after the previous iteration, quadruple and retry.
        // Otherwise, interpolate the number of partitions we need to try, but overestimate
        // it by 50%. We also cap the estimation in the end.
        val limitScaleUpFactor = Math.max(sqlContext.conf.limitScaleUpFactor, 2)
        if (buf.isEmpty) {
          numPartsToTry = partsScanned * limitScaleUpFactor
        } else {
          val left = n - buf.size
          // As left > 0, numPartsToTry is always >= 1
          numPartsToTry = Math.ceil(1.5 * left * partsScanned / buf.size).toInt
          numPartsToTry = Math.min(numPartsToTry, partsScanned * limitScaleUpFactor)
        }
      }

      val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
      val sc = sqlContext.sparkContext
      val res = sc.runJob(childRDD,
        (it: Iterator[Array[Byte]]) => if (it.hasNext) it.next() else Array.empty[Byte], p)

      buf ++= res.flatMap(decodeUnsafeRows)

      partsScanned += p.size
    }

    if (buf.size > n) {
      buf.take(n).toArray
    } else {
      buf.toArray
    }
  }
  • 2
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
Spark SQL的窗口函数是一种用于对数据进行分组和排序的函数。它可以在查询结果中创建一个窗口,然后在窗口内进行聚合、排序和分析操作。窗口函数的语法结构如下所示: <窗口函数>(参数) OVER ( [PARTITION BY <列清单>] [ORDER BY <排序用清单列>] [ASC/DESC] (ROWS | RANGE) <范围条件> ) 其中,关键字OVER表示这是一个窗口函数,不是普通的聚合函数。PARTITION BY子句用于指定分组字段,ORDER BY子句用于指定排序字段。ROWS/RANGE窗口子句用于控制窗口的尺寸边界,有两种选项:ROW表示物理窗口,数据筛选基于排序后的索引;RANGE表示逻辑窗口,数据筛选基于值。[1] 在Spark SQL中,窗口函数可以通过使用Column提供的over函数来实现。下面是一个示例代码,展示了如何使用窗口函数进行累加历史和全部统计: ```scala import org.apache.spark.sql.expressions._ val first_2_now_window = Window.partitionBy("pcode").orderBy("event_date") df_userlogs_date.select( $"pcode", $"event_date", sum($"duration").over(first_2_now_window).as("sum_duration") ).show ``` 以上代码使用了窗口函数sum来计算每个pcode和event_date的duration的累加值,并将结果命名为sum_duration。通过使用over函数和指定的窗口(first_2_now_window),我们可以在每个窗口内进行累加计算。最后,使用show方法展示结果。[3] 总结来说,Spark SQL的窗口函数是一种强大的工具,可以在查询结果中创建窗口,并在窗口内进行聚合、排序和分析操作。它可以通过使用Column提供的over函数来实现。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值