Chapter 4: Spark SQL Basics (Part 2): The Advantages of Dataset

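This article walks through how SparkSession, the core component of Spark SQL, is built and configured, including the appName, master, and config methods. It then explains DataFrame as an abstraction over structured data and shows how to use Row objects. Finally, it discusses why Dataset was introduced: it improves on DataFrame by adding type safety and more efficient serialization. Compared with RDD, DataFrame reduces serialization overhead and the number of GC cycles, but it still lacks compile-time type checking and leaves room for improvement in serialization. Dataset combines the strengths of RDD and DataFrame and uses Encoders to access data more efficiently.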


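Spark Series Table of Contents

Chapter 1: Getting to Know Spark
Chapter 2: Spark Core, the Core Model (Part 1)
Chapter 2: Spark Core, the Core Model (Part 2)
Chapter 3: Advanced Spark Core Programming (Part 1)
Chapter 3: Advanced Spark Core Programming (Part 2)
Chapter 4: Spark SQL Basics (Part 1)
Chapter 4: Spark SQL Basics (Part 2)
Chapter 5: Advanced Spark SQL (Part 1)
Chapter 5: Advanced Spark SQL (Part 2)
Chapter 5: Advanced Spark SQL (Part 3)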

Chapter 4: Spark SQL Basics (Part 2)

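8. Core Objects

8.1 SparkSession

SparkSession is the session object of the Spark SQL component and the entry point for working with DataFrames and Datasets.

The core source code in the SparkSession companion object that builds a SparkSession is as follows: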

@Stable
object SparkSession extends Logging {

  /**
   * Builder for [[SparkSession]].
   */
  @Stable
  class Builder extends Logging {

    private[this] val options = new scala.collection.mutable.HashMap[String, String]

    private[this] val extensions = new SparkSessionExtensions

    private[this] var userSuppliedContext: Option[SparkContext] = None

    private[spark] def sparkContext(sparkContext: SparkContext): Builder = synchronized {
      userSuppliedContext = Option(sparkContext)
      this
    }

    /**
     * Sets a name for the application, which will be shown in the Spark web UI.
     * If no application name is set, a randomly generated name will be used.
     *
     * @since 2.0.0
     */
    def appName(name: String): Builder = config("spark.app.name", name)

    /**
     * Sets a config option. Options set using this method are automatically propagated to
     * both `SparkConf` and SparkSession's own configuration.
     *
     * @since 2.0.0
     */
    def config(key: String, value: String): Builder = synchronized {
      options += key -> value
      this
    }

    /**
     * Sets a config option. Options set using this method are automatically propagated to
     * both `SparkConf` and SparkSession's own configuration.
     *
     * @since 2.0.0
     */
    def config(key: String, value: Long): Builder = synchronized {
      options += key -> value.toString
      this
    }

    /**
     * Sets a config option. Options set using this method are automatically propagated to
     * both `SparkConf` and SparkSession's own configuration.
     *
     * @since 2.0.0
     */
    def config(key: String, value: Double): Builder = synchronized {
      options += key -> value.toString
      this
    }

    /**
     * Sets a config option. Options set using this method are automatically propagated to
     * both `SparkConf` and SparkSession's own configuration.
     *
     * @since 2.0.0
     */
    def config(key: String, value: Boolean): Builder = synchronized {
      options += key -> value.toString
      this
    }

    /**
     * Sets a list of config options based on the given `SparkConf`.
     *
     * @since 2.0.0
     */
    def config(conf: SparkConf): Builder = synchronized {
      conf.getAll.foreach { case (k, v) => options += k -> v }
      this
    }

    /**
     * Sets the Spark master URL to connect to, such as "local" to run locally, "local[4]" to
     * run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
     *
     * @since 2.0.0
     */
    def master(master: String): Builder = config("spark.master", master)

    /**
     * Enables Hive support, including connectivity to a persistent Hive metastore, support for
     * Hive serdes, and Hive user-defined functions.
     *
     * @since 2.0.0
     */
    def enableHiveSupport(): Builder = synchronized {
      if (hiveClassesArePresent) {
        config(CATALOG_IMPLEMENTATION.key, "hive")
      } else {
        throw new IllegalArgumentException(
          "Unable to instantiate SparkSession with Hive support because " +
            "Hive classes are not found.")
      }
    }

    /**
     * Inject extensions into the [[SparkSession]]. This allows a user to add Analyzer rules,
     * Optimizer rules, Planning Strategies or a customized parser.
     *
     * @since 2.2.0
     */
    def withExtensions(f: SparkSessionExtensions => Unit): Builder = synchronized {
      f(extensions)
      this
    }

    /**
     * Gets an existing [[SparkSession]] or, if there is no existing one, creates a new
     * one based on the options set in this builder.
     *
     * This method first checks whether there is a valid thread-local SparkSession,
     * and if yes, return that one. It then checks whether there is a valid global
     * default SparkSession, and if yes, return that one. If no valid global default
     * SparkSession exists, the method creates a new SparkSession and assigns the
     * newly created SparkSession as the global default.
     *
     * In case an existing SparkSession is returned, the non-static config options specified in
     * this builder will be applied to the existing SparkSession.
     *
     * @since 2.0.0
     */
    def getOrCreate(): SparkSession = synchronized {
      assertOnDriver()
      // Get the session from current thread's active session.
      var session = activeThreadSession.get()
      if ((session ne null) && !session.sparkContext.isStopped) {
        applyModifiableSettings(session)
        return session
      }

      // Global synchronization so we will only set the default session once.
      SparkSession.synchronized {
        // If the current thread does not have an active session, get it from the global session.
        session = defaultSession.get()
        if ((session ne null) && !session.sparkContext.isStopped) {
          applyModifiableSettings(session)
          return session
        }

        // No active nor global default session. Create a new one.
        val sparkContext = userSuppliedContext.getOrElse {
          val sparkConf = new SparkConf()
          options.foreach { case (k, v) => sparkConf.set(k, v) }

          // set a random app name if not given.
          if (!sparkConf.contains("spark.app.name")) {
            sparkConf.setAppName(java.util.UUID.randomUUID().toString)
          }

          SparkContext.getOrCreate(sparkConf)
          // Do not update `SparkConf` for existing `SparkContext`, as it's shared by all sessions.
        }

        applyExtensions(
          sparkContext.getConf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS).getOrElse(Seq.empty),
          extensions)

        session = new SparkSession(sparkContext, None, None, extensions)
        options.foreach { case (k, v) => session.initialSessionOptions.put(k, v) }
        setDefaultSession(session)
        setActiveSession(session)
        registerContextListener(sparkContext)
      }

      return session
    }

    private def applyModifiableSettings(session: SparkSession): Unit = {
      val (staticConfs, otherConfs) =
        options.partition(kv => SQLConf.staticConfKeys.contains(kv._1))

      otherConfs.foreach { case (k, v) => session.sessionState.conf.setConfString(k, v) }

      if (staticConfs.nonEmpty) {
        logWarning("Using an existing SparkSession; the static sql configurations will not take" +
          " effect.")
      }
      if (otherConfs.nonEmpty) {
        logWarning("Using an existing SparkSession; some spark core configurations may not take" +
          " effect.")
      }
    }
  }

  /**
   * Creates a [[SparkSession.Builder]] for constructing a [[SparkSession]].
   *
   * @since 2.0.0
   */
  def builder(): Builder = new Builder

  /**
   * Changes the SparkSession that will be returned in this thread and its children when
   * SparkSession.getOrCreate() is called. This can be used to ensure that a given thread receives
   * a SparkSession with an isolated session, instead of the global (first created) context.
   *
   * @since 2.0.0
   */
  def setActiveSession(session: SparkSession): Unit = {
    activeThreadSession.set(session)
  }

  /**
   * Clears the active SparkSession for current thread. Subsequent calls to getOrCreate will
   * return the first created context instead of a thread-local override.
   *
   * @since 2.0.0
   */
  def clearActiveSession(): Unit = {
    activeThreadSession.remove()
  }

  /**
   * Sets the default SparkSession that is returned by the builder.
   *
   * @since 2.0.0
   */
  def setDefaultSession(session: SparkSession): Unit = {
    defaultSession.set(session)
  }

  /**
   * Clears the default SparkSession that is returned by the builder.
   *
   * @since 2.0.0
   */
  def clearDefaultSession(): Unit = {
    defaultSession.set(null)
  }

  /**
   * Returns the active SparkSession for the current thread, returned by the builder.
   *
   * @note Return None, when calling this function on executors
   *
   * @since 2.2.0
   */
  def getActiveSession: Option[SparkSession] = {
    if (TaskContext.get != null) {
      // Return None when running on executors.
      None
    } else {
      Option(activeThreadSession.get)
    }
  }

  /**
   * Returns the default SparkSession that is returned by the builder.
   *
   * @note Return None, when calling this function on executors
   *
   * @since 2.2.0
   */
  def getDefaultSession: Option[SparkSession] = {
    if (TaskContext.get != null) {
      // Return None when running on executors.
      None
    } else {
      Option(defaultSession.get)
    }
  }

  /**
   * Returns the currently active SparkSession, otherwise the default one. If there is no default
   * SparkSession, throws an exception.
   *
   * @since 2.4.0
   */
  def active: SparkSession = {
    getActiveSession.getOrElse(getDefaultSession.getOrElse(
      throw new IllegalStateException("No active or default Spark session found")))
  }
 
}

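A typical way to combine these builder methods:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// conf and SparkConfUtil come from the author's own project and are not shown here;
// a plain SparkConf stands in for conf.
val conf = new SparkConf()

// 1. Obtain the SparkSession object
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("spark_sql")
  // appName and master are shorthand for config calls like these:
  //.config("spark.app.name", "spark_sql2")
  //.config("spark.master", "local[*]")
  .config(conf)
  .config("spark.sql.warehouse.dir", SparkConfUtil.sparkSqlWarehouse)
  .config("spark.serializer", SparkConfUtil.sparkSerializer)
  .config("spark.sql.streaming.checkpointLocation", SparkConfUtil.sparkStreamCheckPointLocaltion)
  // Enable Hive support (requires the Hive classes on the classpath)
  .enableHiveSupport()
  .getOrCreate()

// Bring the implicit conversions and Encoders into scope (for example toDF and toDS)
import spark.implicits._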
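The lookup order in getOrCreate (thread-local session, then global default, then a new session) can be observed directly. The following is a minimal sketch, assuming a local master; the application name and the choice of spark.sql.shuffle.partitions as the non-static option are illustrative assumptions, not part of the original example.

```scala
import org.apache.spark.sql.SparkSession

object SessionReuseDemo {
  // Returns (same instance?, option value seen on the first session, shared SparkContext?)
  def run(): (Boolean, String, Boolean) = {
    // First call: if no usable session exists yet, a new one is created.
    val first = SparkSession.builder()
      .master("local[*]")              // assumption: local mode for the demo
      .appName("session_reuse_demo")   // hypothetical application name
      .getOrCreate()

    // Second call: the builder finds the existing session, applies the
    // non-static option to it, and returns that same session.
    val second = SparkSession.builder()
      .config("spark.sql.shuffle.partitions", "8")
      .getOrCreate()

    // newSession() shares the SparkContext but keeps its own SQL configuration.
    val isolated = first.newSession()

    val result = (
      first eq second,
      first.conf.get("spark.sql.shuffle.partitions"),
      isolated.sparkContext eq first.sparkContext
    )
    first.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Because the second builder call reuses the existing session, static SQL settings passed to it would only trigger the warning logged by applyModifiableSettings rather than take effect.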

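8.2 DataFrame

A DataFrame is a distributed collection of data organized into named columns.
Conceptually, it is equivalent to a table in a relational database.
It can be thought of as an RDD of Row objects plus a schema that records the data type of each column.
- A Row object represents one record in a DataFrame; in essence it is a fixed-length array of fields.
- In Scala and Java, Row provides a family of get methods that read each field's value by index.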
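The "RDD of Row objects plus a schema" description maps onto the createDataFrame(rowRDD, schema) overload. Here is a minimal sketch, assuming a local session; the column names and sample records are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RowSchemaDemo {
  // The schema names each column and declares its data type.
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("age", IntegerType, nullable = false)
  ))

  def build(spark: SparkSession): DataFrame = {
    // An RDD of Row objects: each Row is a fixed-length array of field values.
    val rows = spark.sparkContext.parallelize(Seq(Row("tom", 13), Row("jerry", 12)))
    // Attaching the schema turns the untyped rows into named, typed columns.
    spark.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local mode for the demo
      .appName("row_schema_demo")    // hypothetical application name
      .getOrCreate()
    val df = build(spark)
    df.printSchema()
    df.filter(df("age") > 12).show()
    spark.stop()
  }
}
```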

The source code of the get methods is as follows:

def apply(i: Int): Any = get(i)

def get(i: Int): Any

def getBoolean(i: Int): Boolean = getAnyValAs[Boolean](i)

def getByte(i: Int): Byte = getAnyValAs[Byte](i)

def getShort(i: Int): Short = getAnyValAs[Short](i)

def getInt(i: Int): Int = getAnyValAs[Int](i)

def getLong(i: Int): Long = getAnyValAs[Long](i)

def getFloat(i: Int): Float = getAnyValAs[Float](i)

def getDouble(i: Int): Double = getAnyValAs[Double](i)

def getString(i: Int): String = getAs[String](i)

def getDecimal(i: Int): java.math.BigDecimal = getAs[java.math.BigDecimal](i)

def getDate(i: Int): java.sql.Date = getAs[java.sql.Date](i)

def getTimestamp(i: Int): java.sql.Timestamp = getAs[java.sql.Timestamp](i)

def getSeq[T](i: Int): Seq[T] = getAs[Seq[T]](i)

def getList[T](i: Int): java.util.List[T] =
getSeq[T](i).asJava

def getMap[K, V](i: Int): scala.collection.Map[K, V] = getAs[Map[K, V]](i)

def getAs[T](i: Int): T = get(i).asInstanceOf[T]

def getAs[T](fieldName: String): T = getAs[T](fieldIndex(fieldName))

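Example:

import org.apache.spark.sql.Row

// A small local collection of Row objects; each Row is one record.
val input = Seq(Row("tom", 13))
// Typed getter: returns String
val names = input.map(row => row.getString(0))
// or: apply(i) returns Any
val namesAsAny = input.map(row => row(0))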
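The getters listed above differ mainly in what they cast to and in how they treat null. The sketch below uses a standalone Row with made-up values (no SparkSession is needed) to show that a wrong getter compiles and only fails when it runs.

```scala
import org.apache.spark.sql.Row
import scala.util.Try

object RowGetterDemo {
  // Hypothetical record: a name, an age, and a missing third field.
  val row: Row = Row("tom", 13, null)

  def main(args: Array[String]): Unit = {
    val name: String = row.getString(0)     // typed getter
    val age: Int = row.getInt(1)            // primitive getter
    val viaGetAs: Int = row.getAs[Int](1)   // generic getter; the caller picks the cast
    val missing: Boolean = row.isNullAt(2)  // check for null before a primitive getter

    // Asking for the wrong type compiles, but the cast fails at runtime.
    val wrongType = Try(row.getInt(0))

    println(s"$name, $age, $viaGetAs, $missing, ${wrongType.isFailure}")
  }
}
```

This runtime-only failure is the same weakness that the DataFrame drawbacks later in this section describe.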

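Question 1: What are the advantages and disadvantages of the RDD abstraction?

Advantages and disadvantages of the RDD abstraction

Advantages
- 1. Powerful
  - Many built-in operators, such as groupBy, map, and filter
  - Handles both structured and unstructured data conveniently
- 2. Object-oriented programming
  - Stores objects directly
  - Type conversions are type-safe (checked at compile time)
Disadvantages
- 1. Highly general-purpose
  - As a result, there are no optimizations for specific scenarios; for example, processing structured data is far more cumbersome than with SQL.
- 2. Large serialized output
  - Java serialization is used by default, and the data lives on the Java heap, which leads to frequent GC.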
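One common way to reduce the serialization cost of RDDs is to switch the serializer from the Java default to Kryo through spark.serializer. The following is a hedged sketch rather than a tuning recommendation: the Click class, the application name, and the sample data are assumptions made for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this illustration.
case class Click(userId: Long, url: String)

object KryoConfDemo {
  def buildConf(): SparkConf =
    new SparkConf()
      .setMaster("local[*]")           // assumption: local mode for the demo
      .setAppName("kryo_conf_demo")    // hypothetical application name
      // Replace the default Java serializer with Kryo.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registered classes are written with a small id instead of the full class name.
      .registerKryoClasses(Array(classOf[Click]))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(buildConf()).getOrCreate()
    val clicks = spark.sparkContext.parallelize(
      Seq(Click(1L, "/home"), Click(1L, "/cart"), Click(2L, "/home")))
    // groupBy shuffles the Click objects themselves, so they pass through the serializer.
    val perUser = clicks.groupBy(_.userId).mapValues(_.size).collect()
    perUser.foreach(println)
    spark.stop()
  }
}
```

Kryo shrinks the serialized form, but the records are still ordinary objects on the JVM heap, so the GC pressure described above remains.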

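Question 2: Having learned from the strengths and weaknesses of RDD, what does DataFrame do differently?

DataFrame introduces the schema and off-heap storage

schema
- Structural information about the data
- With the schema, Spark can understand the data it holds
off-heap
- Memory outside the JVM heap, managed directly by the operating system rather than by the JVM
- Spark can serialize data into off-heap memory in binary form

Note: off-heap memory is like territory and the schema is like a map. With both a map and its own territory, Spark can call the shots itself: it is no longer constrained by the JVM, and therefore no longer troubled by GC.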
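Both ideas are visible from user code: the schema through printSchema, and off-heap memory through the spark.memory.offHeap.* settings. The sketch below assumes a local session and an arbitrary 512 MB off-heap budget. Off-heap allocation is disabled by default, and which operations actually place data there depends on the Spark version, so treat the configuration as an illustration rather than a guarantee.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object OffHeapSchemaDemo {
  def session(): SparkSession =
    SparkSession.builder()
      .master("local[*]")                // assumption: local mode for the demo
      .appName("offheap_schema_demo")    // hypothetical application name
      // Allow Spark to allocate memory outside the JVM heap.
      .config("spark.memory.offHeap.enabled", true)
      // Off-heap budget in bytes (512 MB here, an arbitrary choice).
      .config("spark.memory.offHeap.size", 512L * 1024 * 1024)
      .getOrCreate()

  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("tom", 13), ("jerry", 12)).toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    val spark = session()
    // The schema is the "map": column names and types that Spark can read.
    sample(spark).printSchema()
    spark.stop()
  }
}
```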

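Question 3: Did DataFrame fix the shortcomings of RDD? At what cost?

Advantages and disadvantages of DataFrame

Advantages
- 1. Very convenient for processing structured data
- 2. Reduces serialization overhead and the number of GC cycles
- 3. Compatible with Hive, with support for HQL, UDFs, and so on
Disadvantages
- 1. Type safety cannot be checked at compile time; problems surface only at runtime
- 2. Serialization overhead still has room for improvement
- 3. Poor support for domain objects
  - An RDD stores its data directly as objects
  - A DataFrame stores Row objects and cannot hold user-defined objects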
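The first drawback is easy to reproduce. In the sketch below, which assumes a local session and a deliberately misspelled column name, the code compiles because column names are plain strings, and the mistake is reported only when Spark analyzes the query.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.{Failure, Try}

object UntypedColumnDemo {
  // Returns true when the misspelled column is rejected with an AnalysisException.
  def failsAtRuntime(): Boolean = {
    val spark = SparkSession.builder()
      .master("local[*]")                // assumption: local mode for the demo
      .appName("untyped_column_demo")    // hypothetical application name
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("tom", 13)).toDF("name", "age")

    // "nmae" is a typo for "name": the compiler cannot see it,
    // so the error appears only when the query is analyzed.
    val outcome = Try(df.select("nmae").collect()) match {
      case Failure(_: AnalysisException) => true
      case _                             => false
    }
    spark.stop()
    outcome
  }

  def main(args: Array[String]): Unit =
    println(s"rejected at runtime: ${failsAtRuntime()}")
}
```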

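Question 4: Can the shortcomings of DataFrame be improved further?

8.3 Dataset

Dataset is an interface added in Spark 1.6 that builds on Spark SQL's optimized execution engine.
The typed Dataset API is currently available only in Scala and Java.
It combines the advantages of RDD and DataFrame and introduces a new concept, the Encoder.
- When data is serialized, the Encoder generates bytecode that interacts with off-heap memory, so individual fields can be accessed on demand without deserializing the whole object.
At the time of writing, Spark does not provide a public API for writing custom Encoders; Encoders are obtained through spark.implicits._ or the factory methods in org.apache.spark.sql.Encoders.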
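A short sketch of the typed API shows what Dataset adds over DataFrame. The Person case class, the sample data, and the local session are assumptions for illustration; the Encoder itself comes from Encoders.product and spark.implicits._, not from a hand-written implementation.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// Hypothetical domain class; defined at the top level so Spark can derive an Encoder for it.
case class Person(name: String, age: Int)

object TypedDatasetDemo {
  // The Encoder describes how Person maps to Spark's internal binary row format.
  val personEncoder: Encoder[Person] = Encoders.product[Person]

  def olderThan(spark: SparkSession, minAge: Int): Array[String] = {
    import spark.implicits._
    val people: Dataset[Person] = Seq(Person("tom", 13), Person("jerry", 12)).toDS()
    // Field access is checked by the compiler: p.nmae would not compile.
    people.filter(p => p.age > minAge).map(p => p.name).collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")               // assumption: local mode for the demo
      .appName("typed_dataset_demo")    // hypothetical application name
      .getOrCreate()
    println(personEncoder.schema.treeString)
    olderThan(spark, 12).foreach(println)
    spark.stop()
  }
}
```

The domain object is stored and queried as Person rather than Row, which addresses the "poor support for domain objects" drawback listed for DataFrame.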

Datasets are very similar to RDDs; the difference is that, when objects are transferred over the network,