Spark SQL functions.scala Source Code Walkthrough (Part 4): Non-aggregate functions (Based on Spark 3.3.0)

Shockang

Article link: https://blog.csdn.net/Shockang/article/details/121906186

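Preface

This article is part of the column "1000个问题搞定大数据技术体系" (Mastering the Big Data Technology Stack in 1000 Questions). The column is the author's original work; please credit the source when quoting it, and point out any shortcomings or errors in the comments. Thank you!

For the column's table of contents and references, see 1000个问题搞定大数据技术体系.

Main Text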

Column

  /**
   * Returns a [[Column]] based on the given column name.
   *
   * @group normal_funcs
   * @since 1.3.0
   */
  def col(colName: String): Column = Column(colName)

  /**
   * Returns a [[Column]] based on the given column name.
   * Alias of [[col]].
   *
   * @group normal_funcs
   * @since 1.3.0
   */
  def column(colName: String): Column = Column(colName)

  /**
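   * Creates a [[Column]] of literal value.
   * If the object passed in is already a [[Column]], it is returned directly.
   * If the object is a Scala Symbol, it is converted into a [[Column]] as well.
   * Otherwise, a new [[Column]] is created to represent the literal value.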
   *
   * @group normal_funcs
   * @since 1.3.0
   */
  def lit(literal: Any): Column = typedLit(literal)

  /**
   * Creates a [[Column]] of literal value.
   *
   * An alias of `typedlit`; using `typedlit` directly is encouraged.
   *
   * @group normal_funcs
   * @since 2.2.0
   */
  def typedLit[T : TypeTag](literal: T): Column = typedlit(literal)

  /**
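   * Creates a [[Column]] of literal value.
   * If the object passed in is already a [[Column]], it is returned directly.
   * If the object is a Scala Symbol, it is converted into a [[Column]] as well.
   * Otherwise, a new [[Column]] is created to represent the literal value.
   * The difference between this function and [[lit]] is that this function can handle
   * parameterized Scala types, e.g. List, Seq and Map.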
   *
   * @group normal_funcs
   * @since 3.2.0
   */
  def typedlit[T : TypeTag](literal: T): Column = literal match {
    case c: Column => c
    case s: Symbol => new ColumnName(s.name)
    case _ => Column(Literal.create(literal))
  }

array

  /**
   * Creates a new array column. The input columns must all have the same data type.
   *
   * @group normal_funcs
   * @since 1.4.0
   */
  @scala.annotation.varargs
  def array(cols: Column*): Column = withExpr { CreateArray(cols.map(_.expr)) }

  /**
   * Creates a new array column. The input columns must all have the same data type.
   *
   * @group normal_funcs
   * @since 1.4.0
   */
  @scala.annotation.varargs
  def array(colName: String, colNames: String*): Column = {
    array((colName +: colNames).map(col) : _*)
  }

map/map_from_arrays

  /**
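   * Creates a new map column.
   * The input columns must be grouped as key-value pairs, e.g. (key1, value1, key2, value2, ...).
   * The key columns must all have the same data type, and cannot be null.
   * The value columns must all have the same data type.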
   *
   * @group normal_funcs
   * @since 2.0
   */
  @scala.annotation.varargs
  def map(cols: Column*): Column = withExpr { CreateMap(cols.map(_.expr)) }

  /**
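   * Creates a new map column.
   * The array in the first column is used for keys.
   * The array in the second column is used for values.
   * None of the elements in the key array should be null.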
   *
   * @group normal_funcs
   * @since 2.4
   */
  def map_from_arrays(keys: Column, values: Column): Column = withExpr {
    MapFromArrays(keys.expr, values.expr)
  }
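The two pairing rules described in the comments above can be modeled in plain Scala. This is only a sketch of the semantics, not Spark's `CreateMap` or `MapFromArrays` code: the names `MapPairingModel`, `fromAlternating` and `fromArrays` are made up for illustration, and the length checks are assumptions about sensible input rather than Spark's exact error handling.

```scala
object MapPairingModel {
  // Models map(key1, value1, key2, value2, ...): consecutive arguments form one entry.
  def fromAlternating[A](flat: Seq[A]): Seq[(A, A)] = {
    require(flat.length % 2 == 0, "arguments must come in key-value pairs")
    flat.grouped(2).map(pair => (pair(0), pair(1))).toSeq
  }

  // Models map_from_arrays(keys, values): the i-th key is paired with the i-th value.
  def fromArrays[K, V](keys: Seq[K], values: Seq[V]): Seq[(K, V)] = {
    require(keys.length == values.length, "keys and values must have the same length")
    keys.zip(values)
  }

  def main(args: Array[String]): Unit = {
    println(fromAlternating(Seq("k1", "v1", "k2", "v2")))
    println(fromArrays(Seq(0.0, 1.0), Seq(1, 2)))
  }
}
```

The sketch returns a sequence of pairs instead of a `Map` on purpose, so it makes no claim about how duplicate keys are treated.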

broadcast


  /**
   * Marks a DataFrame as small enough for use in broadcast joins.
   * The following example marks the DataFrame `right` for a broadcast hash join.
   * {{{
   *   // left and right are DataFrames
   *   left.join(broadcast(right), "joinKey")
   * }}}
   *
   * @group normal_funcs
   * @since 1.5.0
   */
  def broadcast[T](df: Dataset[T]): Dataset[T] = {
    Dataset[T](df.sparkSession,
      ResolvedHint(df.logicalPlan, HintInfo(strategy = Some(BROADCAST))))(df.exprEnc)
  }
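As the code above shows, `broadcast` itself only attaches a hint to the logical plan. To illustrate why a small table is a good broadcast candidate, here is a plain-Scala sketch of the idea behind a broadcast hash join: hash the small side once, then stream the large side past it. It is a simplified inner equi-join under stated assumptions, not Spark's physical operator, and `BroadcastHashJoinModel` and its parameter names are hypothetical.

```scala
object BroadcastHashJoinModel {
  // Inner equi-join in which `small` plays the role of the broadcast side.
  def join[K, L, R](large: Seq[(K, L)], small: Seq[(K, R)]): Seq[(K, L, R)] = {
    // Build phase: hash the small side once. In a cluster, each executor would hold a copy.
    val hashed: Map[K, Seq[R]] =
      small.groupBy(_._1).map { case (key, rows) => key -> rows.map(_._2) }
    // Probe phase: walk the large side and look up each key, without moving `large` around.
    for {
      (key, leftValue) <- large
      rightValue <- hashed.getOrElse(key, Seq.empty)
    } yield (key, leftValue, rightValue)
  }

  def main(args: Array[String]): Unit = {
    val left = Seq(1 -> "a", 2 -> "b", 3 -> "c")
    val right = Seq(1 -> "x", 1 -> "y", 3 -> "z")
    join(left, right).foreach(println)
  }
}
```

The memory cost of the build phase grows with the size of the hashed side, which is why the hint is meant for DataFrames that are "small enough".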

coalesce


  /**
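   * Returns the first column that is not null, or null if all inputs are null.
   * For example, coalesce(a, b, c) returns a if a is not null;
   * returns b if a is null and b is not null;
   * and returns c if both a and b are null but c is not null.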
   *
   * @group normal_funcs
   * @since 1.3.0
   */
  @scala.annotation.varargs
  def coalesce(e: Column*): Column = withExpr { Coalesce(e.map(_.expr)) }
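The "first non-null wins" rule can be written down in a few lines of plain Scala, using `None` as a stand-in for SQL NULL. This is a sketch of the semantics only (the name `CoalesceModel` is made up), not Spark's `Coalesce` expression.

```scala
object CoalesceModel {
  // Returns the first argument that is defined, or None if every argument is None.
  def coalesce[A](values: Option[A]*): Option[A] =
    values.collectFirst { case Some(value) => value }

  def main(args: Array[String]): Unit = {
    println(coalesce(Some("a"), Some("b"), Some("c"))) // a is not null
    println(coalesce(None, Some("b"), Some("c")))      // a is null, b is not
    println(coalesce(None, None, Some("c")))           // only c is not null
    println(coalesce[String](None, None, None))        // everything is null
  }
}
```

Note that NaN is a value, not null, so a NaN input is returned as-is; this matches the `coalesce(d)` result in the output section of this article.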

input_file_name

  /**
   * Creates a string column for the file name of the current Spark task.
   *
   * @group normal_funcs
   * @since 1.6.0
   */
  def input_file_name(): Column = withExpr { InputFileName() }

isnan/isnull


  /**
   * Returns true if the column is NaN.
   *
   * @group normal_funcs
   * @since 1.6.0
   */
  def isnan(e: Column): Column = withExpr { IsNaN(e.expr) }

  /**
   * Returns true if the column is null.
   *
   * @group normal_funcs
   * @since 1.6.0
   */
  def isnull(e: Column): Column = withExpr { IsNull(e.expr) }
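NaN and null are easy to confuse, so here is a small plain-Scala model that keeps them apart: `None` stands in for SQL NULL, while `Some(Double.NaN)` is a value that is present but not a number. The object name `NanVersusNull` is hypothetical, and treating a null input as "not NaN" is an assumption of this sketch.

```scala
object NanVersusNull {
  // A missing value is null, whatever its type would have been.
  def isNull(value: Option[Double]): Boolean = value.isEmpty

  // Only a present value can be NaN; a missing value is treated as "not NaN" here.
  def isNaN(value: Option[Double]): Boolean = value.exists(_.isNaN)

  def main(args: Array[String]): Unit = {
    val samples = Seq(Some(0.0), Some(Double.NaN), None)
    samples.foreach(v => println(s"$v isNull=${isNull(v)} isNaN=${isNaN(v)}"))
  }
}
```

This is consistent with the output section, where the two NaN rows give `true` for `isnan(d)` but `false` for `d IS NULL`.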

monotonically_increasing_id


  /**
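   * A column expression that generates monotonically increasing 64-bit integers.
   * The generated IDs are guaranteed to be monotonically increasing and unique, but not consecutive.
   * The current implementation puts the partition ID in the upper 31 bits, and the record number
   * within each partition in the lower 33 bits.
   * The assumption is that the DataFrame has fewer than 1 billion partitions, and each partition
   * has fewer than 8 billion records.
   *
   * As an example, consider a DataFrame with two partitions, each with 3 records.
   * This expression would return the following IDs: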
   * {{{
   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
   * }}}
   *
   * @group normal_funcs
   * @since 1.4.0
   */
  @deprecated("Use monotonically_increasing_id()", "2.0.0")
  def monotonicallyIncreasingId(): Column = monotonically_increasing_id()

  /**
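   * A column expression that generates monotonically increasing 64-bit integers.
   * The generated IDs are guaranteed to be monotonically increasing and unique, but not consecutive.
   * The current implementation puts the partition ID in the upper 31 bits, and the record number
   * within each partition in the lower 33 bits.
   * The assumption is that the DataFrame has fewer than 1 billion partitions, and each partition
   * has fewer than 8 billion records.
   * As an example, consider a DataFrame with two partitions, each with 3 records.
   * This expression would return the following IDs: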
   *
   * {{{
   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
   * }}}
   *
   * @group normal_funcs
   * @since 1.6.0
   */
  def monotonically_increasing_id(): Column = withExpr { MonotonicallyIncreasingID() }
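The bit layout described in the comment (partition ID in the upper 31 bits, record number in the lower 33 bits) can be reproduced with plain arithmetic. The sketch below follows that description and regenerates the six IDs from the example; `MonotonicIdLayout` and its method names are made up for illustration and are not part of Spark.

```scala
object MonotonicIdLayout {
  private val RecordBits = 33 // the low 33 bits hold the record number within a partition

  // Compose an ID from a partition ID and a per-partition record number.
  def id(partitionId: Int, recordNumber: Long): Long =
    (partitionId.toLong << RecordBits) + recordNumber

  // Decompose an ID back into its two parts.
  def partitionOf(id: Long): Int = (id >>> RecordBits).toInt
  def recordOf(id: Long): Long = id & ((1L << RecordBits) - 1)

  // All IDs for `partitions` partitions holding `recordsPerPartition` records each.
  def idsFor(partitions: Int, recordsPerPartition: Int): Seq[Long] =
    (0 until partitions).flatMap(p => (0 until recordsPerPartition).map(r => id(p, r.toLong)))

  def main(args: Array[String]): Unit =
    println(idsFor(2, 3).mkString(", ")) // prints 0, 1, 2, 8589934592, 8589934593, 8589934594
}
```

The layout also explains the limits in the comment: 2^31 is a little over 2 billion partitions and 2^33 is a little over 8.5 billion records per partition. The single-partition DataFrame in the practice section yields 0, 1, 2, 3 because every row has partition ID 0.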

nanvl


  /**
   * Returns col1 if it is not NaN, or col2 if col1 is NaN.
   * Both inputs should be floating point columns (DoubleType or FloatType).
   *
   * @group normal_funcs
   * @since 1.5.0
   */
  def nanvl(col1: Column, col2: Column): Column = withExpr { NaNvl(col1.expr, col2.expr) }
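For a single pair of non-null doubles, the rule in the comment reduces to one conditional. The following plain-Scala sketch (the name `NanvlModel` is made up, and null handling is deliberately left out) shows it:

```scala
object NanvlModel {
  // Returns col1 unless it is NaN, in which case col2 is returned.
  def nanvl(col1: Double, col2: Double): Double =
    if (col1.isNaN) col2 else col1

  def main(args: Array[String]): Unit =
    Seq(0.0, -0.0, Double.NaN).foreach(d => println(nanvl(d, 1.0)))
}
```

This agrees with the `nanvl(d, 1)` table in the output section, where only the NaN rows are replaced by 1.0.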

negate


  /**
   * Unary minus, i.e. negates the expression.
   * {{{
   *   // Select the `amount` column and negate all of its values.
   *   // Scala:
   *   df.select( -df("amount") )
   *
   *   // Java:
   *   df.select( negate(df.col("amount")) );
   * }}}
   *
   * @group normal_funcs
   * @since 1.3.0
   */
  def negate(e: Column): Column = -e

not


  /**
   * Inversion of a boolean expression, i.e. NOT.
   * {{{
   *   // Scala: select the rows that are not active (isActive === false)
   *   df.filter( !df("isActive") )
   *
   *   // Java:
   *   df.filter( not(df.col("isActive")) );
   * }}}
   *
   * @group normal_funcs
   * @since 1.3.0
   */
  def not(e: Column): Column = !e

rand/randn


  /**
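   * Generates a random column with independent and identically distributed (i.i.d.) samples
   * uniformly distributed in [0.0, 1.0).
   * Note: the function is non-deterministic in the general case.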
   *
   * @group normal_funcs
   * @since 1.4.0
   */
  def rand(seed: Long): Column = withExpr { Rand(seed) }

  /**
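   * Generates a random column with independent and identically distributed (i.i.d.) samples
   * uniformly distributed in [0.0, 1.0).
   * Note: the function is non-deterministic in the general case.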
   *
   * @group normal_funcs
   * @since 1.4.0
   */
  def rand(): Column = rand(Utils.random.nextLong)

  /**
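   * Generates a column with independent and identically distributed (i.i.d.) samples from
   * the standard normal distribution.
   * Note: the function is non-deterministic in the general case.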
   *
   * @group normal_funcs
   * @since 1.4.0
   */
  def randn(seed: Long): Column = withExpr { Randn(seed) }

  /**
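   * Generates a column with independent and identically distributed (i.i.d.) samples from
   * the standard normal distribution.
   * Note: the function is non-deterministic in the general case.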
   *
   * @group normal_funcs
   * @since 1.4.0
   */
  def randn(): Column = randn(Utils.random.nextLong)
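To show what a fixed seed does and does not guarantee, here is a plain-Scala sketch that uses `scala.util.Random` as a stand-in generator. Spark uses its own generator, so the numbers will not match the output section; `SeededRandomModel` and its method names are hypothetical.

```scala
import scala.util.Random

object SeededRandomModel {
  // n uniform samples in [0.0, 1.0), analogous in spirit to rand(seed).
  def uniform(seed: Long, n: Int): Seq[Double] = {
    val rng = new Random(seed)
    Seq.fill(n)(rng.nextDouble())
  }

  // n standard normal samples, analogous in spirit to randn(seed).
  def gaussian(seed: Long, n: Int): Seq[Double] = {
    val rng = new Random(seed)
    Seq.fill(n)(rng.nextGaussian())
  }

  def main(args: Array[String]): Unit = {
    println(uniform(1L, 4) == uniform(1L, 4)) // same seed, same sequence
    println(gaussian(1L, 4))
  }
}
```

A fixed seed makes the sequence of draws repeatable, but which row receives which draw still depends on how rows are split into partitions and ordered within them. That is one plausible reading of the "non-deterministic in the general case" note. The zero-argument overloads simply pick a random seed, which is why the column headers for `rand()` and `randn()` in the output section show arbitrary seed values.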

spark_partition_id


  /**
   * Partition ID.
   *
   * Note: this is non-deterministic because it depends on data partitioning and task scheduling.
   *
   * @group normal_funcs
   * @since 1.6.0
   */
  def spark_partition_id(): Column = withExpr { SparkPartitionID() }

sqrt

  /**
   * Computes the square root of the specified float value.
   *
   * @group math_funcs
   * @since 1.3.0
   */
  def sqrt(e: Column): Column = withExpr { Sqrt(e.expr) }

  /**
   * Computes the square root of the specified float value.
   *
   * @group math_funcs
   * @since 1.5.0
   */
  def sqrt(colName: String): Column = sqrt(Column(colName))

struct


  /**
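   * Creates a new struct column.
   * If the input column is a column in a DataFrame, or a derived column expression that is named
   * (i.e. aliased), its name is retained as the StructField's name; otherwise, the newly generated
   * StructField's name is auto-generated as `col` with a suffix `index + 1`, i.e. col1, col2, col3, ...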
   *
   * @group normal_funcs
   * @since 1.4.0
   */
  @scala.annotation.varargs
  def struct(cols: Column*): Column = withExpr { CreateStruct.create(cols.map(_.expr)) }

  /**
   * Creates a new struct column that composes multiple input columns.
   *
   * @group normal_funcs
   * @since 1.4.0
   */
  @scala.annotation.varargs
  def struct(colName: String, colNames: String*): Column = {
    struct((colName +: colNames).map(col) : _*)
  }
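The field-naming rule from the first `struct` comment (named inputs keep their name, unnamed ones become `col` plus `index + 1`) can be sketched in plain Scala. Here `Some(name)` models a named or aliased input and `None` an unnamed expression; `StructFieldNames` is a made-up helper, not a Spark API.

```scala
object StructFieldNames {
  // Keep the name of a named input; otherwise generate "col<index + 1>" from its position.
  def fieldNames(inputs: Seq[Option[String]]): Seq[String] =
    inputs.zipWithIndex.map { case (name, index) => name.getOrElse(s"col${index + 1}") }

  def main(args: Array[String]): Unit =
    println(fieldNames(Seq(Some("d"), None, Some("total"), None)))
}
```

The generated suffix comes from the position in the argument list, not from a counter of unnamed inputs, so an unnamed expression in second position becomes `col2` even when the first input is named.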

when


  /**
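   * Evaluates a list of conditions and returns one of multiple possible result expressions.
   * If otherwise is not defined at the end, null is returned for unmatched conditions.
   *
   * {{{
   *   // Example: encoding the gender string column into an integer.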
   *
   *   // Scala:
   *   people.select(when(people("gender") === "male", 0)
   *     .when(people("gender") === "female", 1)
   *     .otherwise(2))
   *
   *   // Java:
   *   people.select(when(col("gender").equalTo("male"), 0)
   *     .when(col("gender").equalTo("female"), 1)
   *     .otherwise(2))
   * }}}
   *
   * @group normal_funcs
   * @since 1.4.0
   */
  def when(condition: Column, value: Any): Column = withExpr {
    CaseWhen(Seq((condition.expr, lit(value).expr)))
  }
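The evaluation order of a chained `when(...).when(...).otherwise(...)` can be modeled for a single row in plain Scala: take the value of the first branch whose condition holds, else fall back to the `otherwise` value, else null. In this sketch `None` stands in for SQL NULL, conditions are passed in already evaluated, and `CaseWhenModel` with its methods is hypothetical.

```scala
object CaseWhenModel {
  // Each branch is (condition already evaluated for this row, result value).
  // The first branch whose condition is true wins; otherwise the fallback is used.
  def caseWhen[A](branches: Seq[(Boolean, A)], otherwise: Option[A] = None): Option[A] =
    branches.collectFirst { case (true, value) => value }.orElse(otherwise)

  // The gender-encoding example from the comment above, applied to one value.
  def encodeGender(gender: String): Option[Int] =
    caseWhen(Seq((gender == "male") -> 0, (gender == "female") -> 1), otherwise = Some(2))

  def main(args: Array[String]): Unit = {
    Seq("male", "female", "unknown").foreach(g => println(s"$g -> ${encodeGender(g)}"))
    println(caseWhen(Seq(false -> 0))) // no branch matches and there is no fallback
  }
}
```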

bitwise_not


  /**
   * Computes the bitwise NOT (~) of a number.
   *
   * @group normal_funcs
   * @since 1.4.0
   */
  @deprecated("Use bitwise_not", "3.2.0")
  def bitwiseNOT(e: Column): Column = bitwise_not(e)

  /**
   * Computes the bitwise NOT (~) of a number.
   *
   * @group normal_funcs
   * @since 3.2.0
   */
  def bitwise_not(e: Column): Column = withExpr { BitwiseNot(e.expr) }
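For integer types, bitwise NOT flips every bit, which in two's complement is the same as `-x - 1`. A minimal plain-Scala check of that identity (the object name `BitwiseNotModel` is made up, and only the `Int` case is shown):

```scala
object BitwiseNotModel {
  // Flip every bit of a 32-bit integer.
  def bitwiseNot(x: Int): Int = ~x

  def main(args: Array[String]): Unit =
    println(bitwiseNot(1)) // prints -2
}
```

This is why `bitwise_not(lit(1))` shows -2 in the output section.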

expr


  /**
   * Parses the expression string into the column that it represents, similar to
   * [[Dataset#selectExpr]].
   * {{{
   *   // get the number of words of each length
   *   df.groupBy(expr("length(word)")).count()
   * }}}
   *
   * @group normal_funcs
   */
  def expr(expr: String): Column = {
    val parser = SparkSession.getActiveSession.map(_.sessionState.sqlParser).getOrElse {
      new SparkSqlParser()
    }
    Column(parser.parseExpression(expr))
  }

greatest

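  /**
   * Returns the greatest value of the list of values, skipping null values.
   * This function takes at least 2 parameters.
   * It returns null if and only if all parameters are null.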
   *
   * @group normal_funcs
   * @since 1.5.0
   */
  @scala.annotation.varargs
  def greatest(exprs: Column*): Column = withExpr { Greatest(exprs.map(_.expr)) }

  /**
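   * Returns the greatest value of the list of column names, skipping null values.
   * This function takes at least 2 parameters.
   * It returns null if and only if all parameters are null.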
   *
   * @group normal_funcs
   * @since 1.5.0
   */
  @scala.annotation.varargs
  def greatest(columnName: String, columnNames: String*): Column = {
    greatest((columnName +: columnNames).map(Column.apply): _*)
  }

least

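  /**
   * Returns the least value of the list of values, skipping null values.
   * This function takes at least 2 parameters.
   * It returns null if and only if all parameters are null.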
   *
   * @group normal_funcs
   * @since 1.5.0
   */
  @scala.annotation.varargs
  def least(exprs: Column*): Column = withExpr { Least(exprs.map(_.expr)) }

  /**
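   * Returns the least value of the list of column names, skipping null values.
   * This function takes at least 2 parameters.
   * It returns null if and only if all parameters are null.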
   *
   * @group normal_funcs
   * @since 1.5.0
   */
  @scala.annotation.varargs
  def least(columnName: String, columnNames: String*): Column = {
    least((columnName +: columnNames).map(Column.apply): _*)
  }
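The null-skipping behavior shared by `greatest` and `least` can be modeled in plain Scala with `None` standing in for SQL NULL. The sketch orders doubles with `java.lang.Double.compare`, which ranks NaN above every other value; that choice is an assumption made to match the NaN rows in the output section, and its handling of signed zeros may differ from Spark's. `GreatestLeastModel` is a made-up name.

```scala
object GreatestLeastModel {
  // Assumption: NaN sorts above every other double, as java.lang.Double.compare does.
  private val nanLargest: Ordering[Double] = new Ordering[Double] {
    def compare(a: Double, b: Double): Int = java.lang.Double.compare(a, b)
  }

  def greatest(values: Option[Double]*): Option[Double] = pick(values)(_.max(nanLargest))

  def least(values: Option[Double]*): Option[Double] = pick(values)(_.min(nanLargest))

  // Drop nulls first; the result is null (None) only when nothing is left.
  private def pick(values: Seq[Option[Double]])(choose: Seq[Double] => Double): Option[Double] = {
    require(values.length >= 2, "at least 2 parameters are required")
    val nonNull = values.flatten
    if (nonNull.isEmpty) None else Some(choose(nonNull))
  }

  def main(args: Array[String]): Unit = {
    println(greatest(Some(Double.NaN), Some(1.0)))
    println(least(Some(Double.NaN), Some(1.0)))
    println(greatest(None, Some(1.0)))
    println(least(None, None))
  }
}
```

Under this ordering a NaN input wins in `greatest` and loses in `least`, which is the pattern visible in the `greatest(d, 1)` and `least(d, 1)` tables in the output section.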

Practice

Code

package com.shockang.study.spark.sql.functions

import com.shockang.study.spark.util.Utils.formatPrint
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

/**
 *
 * @author Shockang
 */
object NonAggregateFunctionsExample {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val spark = SparkSession.builder().appName("NonAggregateFunctionsExample").master("local[*]").getOrCreate()

    import spark.implicits._

    val df = Seq(0.0d, -0.0d, 0.0d / 0.0d, Double.NaN).toDF("d")

    // Column
    formatPrint("""df.select(col("d")).show()""")
    df.select(col("d")).show()

    formatPrint("""df.select(column("d")).show()""")
    df.select(column("d")).show()

    formatPrint("""df.select(lit(1)).show()""")
    df.select(lit(1)).show()

    formatPrint("""df.select(typedLit(List(1, 2, 3))).show()""")
    df.select(typedLit(List(1, 2, 3))).show()

    formatPrint("""df.select(typedlit(Map(1 -> 1, 2 -> 2, 3 -> 3))).show()""")
    df.select(typedlit(Map(1 -> 1, 2 -> 2, 3 -> 3))).show()

    // array
    formatPrint("""df.select(array($"d")).show()""")
    df.select(array($"d")).show()

    formatPrint("""df.select(array("d")).show()""")
    df.select(array("d")).show()

    // map/map_from_arrays
    formatPrint("""df.as[Double].map(1 -> _).toDF("a", "b").select(map($"a" + 1, $"b")).show()""")
    df.as[Double].map(1 -> _).toDF("a", "b").select(map($"a" + 1, $"b")).show()

    formatPrint("""df.as[Double].map(Array(_)).toDF("d").select(map_from_arrays($"d", typedlit(Array(1)))).show()""")
    df.as[Double].map(Array(_)).toDF("d").select(map_from_arrays($"d", typedlit(Array(1)))).show()

    // broadcast
    formatPrint("""df.as[Double].map((_, 1)).join(broadcast(df.as[Double].map((_, 2)))).show()""")
    df.as[Double].map((_, 1)).join(broadcast(df.as[Double].map((_, 2)))).show()

    // coalesce
    formatPrint("""df.select(coalesce($"d")).show()""")
    df.select(coalesce($"d")).show()

    // input_file_name
    formatPrint("""df.select(input_file_name()).show()""")
    df.select(input_file_name()).show()

    // isnan/isnull
    formatPrint("""df.select(isnan($"d")).show()""")
    df.select(isnan($"d")).show()

    formatPrint("""df.select(isnull($"d")).show()""")
    df.select(isnull($"d")).show()

    // monotonically_increasing_id
    formatPrint("""df.select(monotonicallyIncreasingId()).show()""")
    df.select(monotonicallyIncreasingId()).show()

    formatPrint("""df.select(monotonically_increasing_id()).show()""")
    df.select(monotonically_increasing_id()).show()

    // nanvl
    formatPrint("""df.select(nanvl($"d", lit(1))).show()""")
    df.select(nanvl($"d", lit(1))).show()

    // negate
    formatPrint("""df.select(-df("d")).show()""")
    df.select(-df("d")).show()

    formatPrint("""df.select(negate(df("d"))).show()""")
    df.select(negate(df("d"))).show()

    // not
    formatPrint("""df.select(not(isnan($"d"))).show()""")
    df.select(not(isnan($"d"))).show()

    // rand/randn
    formatPrint("""df.select(rand(1L)).show()""")
    df.select(rand(1L)).show()

    formatPrint("""df.select(rand()).show()""")
    df.select(rand()).show()

    formatPrint("""df.select(randn(1L)).show()""")
    df.select(randn(1L)).show()

    formatPrint("""df.select(randn()).show()""")
    df.select(randn()).show()

    // spark_partition_id
    formatPrint("""df.select(spark_partition_id()).show()""")
    df.select(spark_partition_id()).show()

    // sqrt
    formatPrint("""df.select(sqrt($"d")).show()""")
    df.select(sqrt($"d")).show()

    formatPrint("""df.select(sqrt("d")).show()""")
    df.select(sqrt("d")).show()

    // struct
    formatPrint("""df.select(struct($"d")).show()""")
    df.select(struct($"d")).show()

    formatPrint("""df.select(struct("d")).show()""")
    df.select(struct("d")).show()

    // when
    formatPrint("""df.select(when(isnan($"d"), 0).otherwise(1)).show()""")
    df.select(when(isnan($"d"), 0).otherwise(1)).show()

    // bitwise_not
    formatPrint("""df.select(bitwiseNOT(lit(1))).show()""")
    df.select(bitwiseNOT(lit(1))).show()

    formatPrint("""df.select(bitwise_not(lit(1))).show()""")
    df.select(bitwise_not(lit(1))).show()

    // expr
    formatPrint("""df.select(expr("d + 1")).show()""")
    df.select(expr("d + 1")).show()

    // greatest
    formatPrint("""df.select(greatest($"d", lit(1))).show()""")
    df.select(greatest($"d", lit(1))).show()

    formatPrint("""df.as[Double].map((_, 1)).toDF("a", "b").select(greatest("a", "b")).show()""")
    df.as[Double].map((_, 1)).toDF("a", "b").select(greatest("a", "b")).show()

    // least
    formatPrint("""df.select(least($"d", lit(1))).show()""")
    df.select(least($"d", lit(1))).show()

    formatPrint("""df.as[Double].map((_, 1)).toDF("a", "b").select(least("a", "b")).show()""")
    df.as[Double].map((_, 1)).toDF("a", "b").select(least("a", "b")).show()

    spark.stop()
  }
}

Output

========== df.select(col("d")).show() ==========
+----+
|   d|
+----+
| 0.0|
|-0.0|
| NaN|
| NaN|
+----+

========== df.select(column("d")).show() ==========
+----+
|   d|
+----+
| 0.0|
|-0.0|
| NaN|
| NaN|
+----+

========== df.select(lit(1)).show() ==========
+---+
|  1|
+---+
|  1|
|  1|
|  1|
|  1|
+---+

========== df.select(typedLit(List(1, 2, 3))).show() ==========
+---------+
|  [1,2,3]|
+---------+
|[1, 2, 3]|
|[1, 2, 3]|
|[1, 2, 3]|
|[1, 2, 3]|
+---------+

========== df.select(typedlit(Map(1 -> 1, 2 -> 2, 3 -> 3))).show() ==========
+------------------------------+
|keys: [1,2,3], values: [1,2,3]|
+------------------------------+
|          {1 -> 1, 2 -> 2, ...|
|          {1 -> 1, 2 -> 2, ...|
|          {1 -> 1, 2 -> 2, ...|
|          {1 -> 1, 2 -> 2, ...|
+------------------------------+

========== df.select(array($"d")).show() ==========
+--------+
|array(d)|
+--------+
|   [0.0]|
|  [-0.0]|
|   [NaN]|
|   [NaN]|
+--------+

========== df.select(array("d")).show() ==========
+--------+
|array(d)|
+--------+
|   [0.0]|
|  [-0.0]|
|   [NaN]|
|   [NaN]|
+--------+

========== df.as[Double].map(1 -> _).toDF("a", "b").select(map($"a" + 1, $"b")).show() ==========
+---------------+
|map((a + 1), b)|
+---------------+
|     {2 -> 0.0}|
|    {2 -> -0.0}|
|     {2 -> NaN}|
|     {2 -> NaN}|
+---------------+

========== df.as[Double].map(Array(_)).toDF("d").select(map_from_arrays($"d", typedlit(Array(1)))).show() ==========
+-----------------------+
|map_from_arrays(d, [1])|
+-----------------------+
|             {0.0 -> 1}|
|            {-0.0 -> 1}|
|             {NaN -> 1}|
|             {NaN -> 1}|
+-----------------------+

========== df.as[Double].map((_, 1)).join(broadcast(df.as[Double].map((_, 2)))).show() ==========
+----+---+----+---+
|  _1| _2|  _1| _2|
+----+---+----+---+
| 0.0|  1| 0.0|  2|
| 0.0|  1|-0.0|  2|
| 0.0|  1| NaN|  2|
| 0.0|  1| NaN|  2|
|-0.0|  1| 0.0|  2|
|-0.0|  1|-0.0|  2|
|-0.0|  1| NaN|  2|
|-0.0|  1| NaN|  2|
| NaN|  1| 0.0|  2|
| NaN|  1|-0.0|  2|
| NaN|  1| NaN|  2|
| NaN|  1| NaN|  2|
| NaN|  1| 0.0|  2|
| NaN|  1|-0.0|  2|
| NaN|  1| NaN|  2|
| NaN|  1| NaN|  2|
+----+---+----+---+

========== df.select(coalesce($"d")).show() ==========
+-----------+
|coalesce(d)|
+-----------+
|        0.0|
|       -0.0|
|        NaN|
|        NaN|
+-----------+

========== df.select(input_file_name()).show() ==========
+-----------------+
|input_file_name()|
+-----------------+
|                 |
|                 |
|                 |
|                 |
+-----------------+

========== df.select(isnan($"d")).show() ==========
+--------+
|isnan(d)|
+--------+
|   false|
|   false|
|    true|
|    true|
+--------+

========== df.select(isnull($"d")).show() ==========
+-----------+
|(d IS NULL)|
+-----------+
|      false|
|      false|
|      false|
|      false|
+-----------+

========== df.select(monotonicallyIncreasingId()).show() ==========
+-----------------------------+
|monotonically_increasing_id()|
+-----------------------------+
|                            0|
|                            1|
|                            2|
|                            3|
+-----------------------------+

========== df.select(monotonically_increasing_id()).show() ==========
+-----------------------------+
|monotonically_increasing_id()|
+-----------------------------+
|                            0|
|                            1|
|                            2|
|                            3|
+-----------------------------+

========== df.select(nanvl($"d", lit(1))).show() ==========
+-----------+
|nanvl(d, 1)|
+-----------+
|        0.0|
|       -0.0|
|        1.0|
|        1.0|
+-----------+

========== df.select(-df("d")).show() ==========
+-----+
|(- d)|
+-----+
| -0.0|
|  0.0|
|  NaN|
|  NaN|
+-----+

========== df.select(negate(df("d"))).show() ==========
+-----+
|(- d)|
+-----+
| -0.0|
|  0.0|
|  NaN|
|  NaN|
+-----+

========== df.select(not(isnan($"d"))).show() ==========
+--------------+
|(NOT isnan(d))|
+--------------+
|          true|
|          true|
|         false|
|         false|
+--------------+

========== df.select(rand(1L)).show() ==========
+-------------------+
|            rand(1)|
+-------------------+
| 0.6363787615254752|
| 0.5993846534021868|
|  0.134842710012538|
|0.07684163905460906|
+-------------------+

========== df.select(rand()).show() ==========
+--------------------------+
|rand(-1907264262282864604)|
+--------------------------+
|        0.9449925707673885|
|       0.13270416690882902|
|        0.5197808830765432|
|         0.850147127687052|
+--------------------------+

========== df.select(randn(1L)).show() ==========
+------------------+
|          randn(1)|
+------------------+
|1.6845611254444919|
|1.2276070094376463|
|0.7360632906893385|
|  0.45082574888574|
+------------------+

========== df.select(randn()).show() ==========
+---------------------------+
|randn(-2990680574684490535)|
+---------------------------+
|        -0.4880556599505849|
|         0.5549722607972974|
|         1.1801461686705492|
|        -0.5697861532546645|
+---------------------------+

========== df.select(spark_partition_id()).show() ==========
+--------------------+
|SPARK_PARTITION_ID()|
+--------------------+
|                   0|
|                   0|
|                   0|
|                   0|
+--------------------+

========== df.select(sqrt($"d")).show() ==========
+-------+
|SQRT(d)|
+-------+
|    0.0|
|   -0.0|
|    NaN|
|    NaN|
+-------+

========== df.select(sqrt("d")).show() ==========
+-------+
|SQRT(d)|
+-------+
|    0.0|
|   -0.0|
|    NaN|
|    NaN|
+-------+

========== df.select(struct($"d")).show() ==========
+---------+
|struct(d)|
+---------+
|    {0.0}|
|   {-0.0}|
|    {NaN}|
|    {NaN}|
+---------+

========== df.select(struct("d")).show() ==========
+---------+
|struct(d)|
+---------+
|    {0.0}|
|   {-0.0}|
|    {NaN}|
|    {NaN}|
+---------+

========== df.select(when(isnan($"d"), 0).otherwise(1)).show() ==========
+------------------------------------+
|CASE WHEN isnan(d) THEN 0 ELSE 1 END|
+------------------------------------+
|                                   1|
|                                   1|
|                                   0|
|                                   0|
+------------------------------------+

========== df.select(bitwiseNOT(lit(1))).show() ==========
+---+
| ~1|
+---+
| -2|
| -2|
| -2|
| -2|
+---+

========== df.select(bitwise_not(lit(1))).show() ==========
+---+
| ~1|
+---+
| -2|
| -2|
| -2|
| -2|
+---+

========== df.select(expr("d + 1")).show() ==========
+-------+
|(d + 1)|
+-------+
|    1.0|
|    1.0|
|    NaN|
|    NaN|
+-------+

========== df.select(greatest($"d", lit(1))).show() ==========
+--------------+
|greatest(d, 1)|
+--------------+
|           1.0|
|           1.0|
|           NaN|
|           NaN|
+--------------+

========== df.as[Double].map((_, 1)).toDF("a", "b").select(greatest("a", "b")).show() ==========
+--------------+
|greatest(a, b)|
+--------------+
|           1.0|
|           1.0|
|           NaN|
|           NaN|
+--------------+

========== df.select(least($"d", lit(1))).show() ==========
+-----------+
|least(d, 1)|
+-----------+
|        0.0|
|       -0.0|
|        1.0|
|        1.0|
+-----------+

========== df.as[Double].map((_, 1)).toDF("a", "b").select(least("a", "b")).show() ==========
+-----------+
|least(a, b)|
+-----------+
|        0.0|
|       -0.0|
|        1.0|
|        1.0|
+-----------+
