Spark | A Source-Level Walkthrough of SparkSession Initialization

This article takes a deep dive into the SparkSession initialization process, covering the Builder, enableHiveSupport, and getOrCreate methods. SparkSession is built with the Builder design pattern: getOrCreate first tries to reuse an existing SparkSession and only creates a new one if none exists. Initialization centers on SharedState and SessionState, which maintain state shared across sessions and state local to a single session, respectively. The article also discusses the Catalog, including the implementations and roles of SessionCatalog, HiveSessionCatalog, and ExternalCatalog, as well as how SparkSession integrates with Hive and the configuration related to metastore_db.

Table of Contents

The SparkSession companion object

Builder

enableHiveSupport

getOrCreate

The SparkSession companion class

SharedState

SessionState

The SparkSession initialization process

1. SparkSession.getOrCreate()

2. Initializing SharedState

3. Initializing SessionState

BaseSessionStateBuilder

SessionStateBuilder and HiveSessionStateBuilder

SparkSession reuse

Catalog

SessionCatalog

HiveSessionCatalog

ExternalCatalog

InMemoryCatalog

HiveExternalCatalog

About metastore_db


spark.version=2.4.4

Apache Spark 2.0 introduced SparkSession, which gives users a single, unified entry point to Spark's functionality. For example, there is no longer any need to explicitly create a SparkConf, SparkContext, or SQLContext, because these objects are already encapsulated inside the SparkSession. In addition, SparkSession lets users write Spark programs against the DataFrame and Dataset APIs.

SparkSession is implemented with the Builder design pattern: if no SparkSession object has been created yet, a new SparkSession and its associated contexts are instantiated.
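
A quick illustration of the first point (a minimal sketch; the app name, config key, and sample data are arbitrary and only for demonstration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("session-demo")
  .master("local[*]")
  .getOrCreate()

// The SparkContext and SQLContext are wrapped inside the session
val sc = spark.sparkContext
val sqlContext = spark.sqlContext
println(sc.appName)                                     // session-demo
println(spark.conf.get("spark.sql.shuffle.partitions")) // 200 (default)

// DataFrame / Dataset APIs are available directly on the session
import spark.implicits._
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
df.show()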


The SparkSession companion object

Let's go through the main pieces involved in SparkSession initialization. A SparkSession is usually created like this:

 val spark = SparkSession.builder().config(ConfigUtil.getSparkConf(className)).enableHiveSupport().getOrCreate()

Builder

We create the SparkSession through builder(), so what does builder() actually do?

/**
   * Creates a [[SparkSession.Builder]] for constructing a [[SparkSession]].
   */
  def builder(): Builder = new Builder

Builder can be viewed as an inner class of the SparkSession companion object. Its main members include:

SparkSessionExtensions: extension points for SparkSession that allow users to add analyzer rules, optimizer rules, planning strategies, a customized parser, and so on.

  /**
   * Builder for [[SparkSession]].
   */
  @InterfaceStability.Stable
  class Builder extends Logging {

    private[this] val options = new scala.collection.mutable.HashMap[String, String]

    private[this] val extensions = new SparkSessionExtensions

    private[this] var userSuppliedContext: Option[SparkContext] = None

    private[spark] def sparkContext(sparkContext: SparkContext): Builder = synchronized {
      userSuppliedContext = Option(sparkContext)
      this
    }

    /**
     * Inject extensions into the [[SparkSession]].
     * This allows a user to add Analyzer rules,
     * Optimizer rules, Planning Strategies or a customized parser.
     */
    def withExtensions(f: SparkSessionExtensions => Unit): Builder = synchronized {
      f(extensions)
      this
    }
    
    ......

Next, let's look at the definition of SparkSessionExtensions:

/**
 * This current provides the following extension points:
 * Analyzer Rules、Check Analysis Rules、Optimizer Rules、
 * Planning Strategies、Customized Parser、(External) Catalog listeners.
 *
 * The extensions can be used by calling withExtension on the [[SparkSession.Builder]], for
 * example:
 * {{{
 *   SparkSession.builder()
 *     .master("...")
 *     .conf("...", true)
 *     .withExtensions { extensions =>
 *       extensions.injectResolutionRule { session =>
 *         ...
 *       }
 *       extensions.injectParser { (session, parser) =>
 *         ...
 *       }
 *     }
 *     .getOrCreate()
 * }}}
 *
 * Note that none of the injected builders should assume that the [[SparkSession]] is fully
 * initialized and should not touch the session's internals (e.g. the SessionState).
 */
@DeveloperApi
@Experimental
@InterfaceStability.Unstable
class SparkSessionExtensions {
  type RuleBuilder = SparkSession => Rule[LogicalPlan]
  type CheckRuleBuilder = SparkSession => LogicalPlan => Unit
  type StrategyBuilder = SparkSession => Strategy
  type ParserBuilder = (SparkSession, ParserInterface) => ParserInterface

  private[this] val resolutionRuleBuilders = mutable.Buffer.empty[RuleBuilder]

  /**
   * Build the analyzer resolution `Rule`s using the given [[SparkSession]].
   */
  private[sql] def buildResolutionRules(session: SparkSession): Seq[Rule[LogicalPlan]] = {
    resolutionRuleBuilders.map(_.apply(session))
  }

  /**
   * Inject an analyzer resolution `Rule` builder into the [[SparkSession]]. These analyzer
   * rules will be executed as part of the resolution phase of analysis.
   */
  def injectResolutionRule(builder: RuleBuilder): Unit = {
    resolutionRuleBuilders += builder
  }
  
  ......
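
As an illustration of how these extension points are used from the Builder (a minimal sketch; the rule below is a do-nothing placeholder, not a real analyzer rule):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A placeholder resolution rule that returns the plan unchanged
case class NoopResolutionRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions { extensions =>
    // Registered builders are invoked later, when the SessionState is created
    extensions.injectResolutionRule(session => NoopResolutionRule(session))
  }
  .getOrCreate()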

enableHiveSupport

Since Spark 2.0.0, SparkSession.builder.enableHiveSupport is recommended as the replacement for HiveContext:

/**
 * An instance of the Spark SQL execution engine that integrates with data stored in Hive.
 * Configuration for Hive is read from hive-site.xml on the classpath.
 */
@deprecated("Use SparkSession.builder.enableHiveSupport instead", "2.0.0")
class HiveContext private[hive](_sparkSession: SparkSession)
  extends SQLContext(_sparkSession) with Logging {

  ......
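
For comparison, a minimal migration sketch (assuming Hive classes and a hive-site.xml are available on the classpath; the query is arbitrary):

// Before Spark 2.0 (now deprecated):
// val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
// hiveContext.sql("SHOW DATABASES").show()

// Since Spark 2.0:
val spark = org.apache.spark.sql.SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()
spark.sql("SHOW DATABASES").show()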

Now let's look at how this method is actually implemented in SparkSession. In short, it sets spark.sql.catalogImplementation=hive (the default is in-memory):

    /**
     * Enables Hive support, including connectivity to a persistent Hive metastore, support for
     * Hive serdes, and Hive user-defined functions.
     */
    def enableHiveSupport(): Builder = synchronized {
      if (hiveClassesArePresent) {
        config(CATALOG_IMPLEMENTATION.key, "hive")
      } else {
        throw new IllegalArgumentException(
          "Unable to instantiate SparkSession with Hive support because " +
            "Hive classes are not found.")
      }
    }
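
A quick way to confirm what enableHiveSupport() did (a minimal sketch; it assumes the Hive classes are on the classpath, otherwise the call throws the IllegalArgumentException shown above):

val spark = org.apache.spark.sql.SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// The builder option ends up in the SparkConf used to create the SparkContext
println(spark.sparkContext.getConf.get("spark.sql.catalogImplementation", "in-memory")) // hive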

A static SQL configuration is a cross-session, immutable Spark configuration; spark.sql.catalogImplementation (CATALOG_IMPLEMENTATION) is one of them:

/**
 * Static SQL configuration is a cross-session, immutable Spark configuration. External users can
 * see the static sql configs via `SparkSession.conf`, but can NOT set/unset them.
 */
object StaticSQLConf {

  import SQLConf.buildStaticConf

  val WAREHOUSE_PATH = buildStaticConf("spark.sql.warehouse.dir")
    .doc("The default location for managed databases and tables.")
    .stringConf
    .createWithDefault(Utils.resolveURI("spark-warehouse").toString)

  val CATALOG_IMPLEMENTATION = buildStaticConf("spark.sql.catalogImplementation")
    .internal()
    .stringConf
    .checkValues(Set("hive", "in-memory"))
    .createWithDefault("in-memory")

  val GLOBAL_TEMP_DATABASE = buildStaticConf("spark.sql.globalTempDatabase")
    .internal()
    .stringConf
    .createWithDefault("global_temp")
  
  ......
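
The "immutable" part can be observed directly (a minimal sketch; the exact exception type and message may vary between versions, but Spark rejects runtime modification of static SQL configs):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Static configs can be read through the session conf...
println(spark.conf.get("spark.sql.warehouse.dir"))

// ...but setting one at runtime is rejected
try {
  spark.conf.set("spark.sql.catalogImplementation", "hive")
} catch {
  case e: Exception => println(s"Rejected: ${e.getMessage}")
}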

getOrCreate

Roughly summarized: getOrCreate first tries to obtain an existing SparkSession; if there is none, it creates a new SparkSession based on the options set in the builder.

The definition is as follows:

def getOrCreate(): SparkSession = synchronized {
      // SparkSession should only be created and accessed on the driver
      assertOnDriver()
      // First, check whether there is a valid thread-local SparkSession.
      // activeThreadSession is an InheritableThreadLocal[SparkSession]; InheritableThreadLocal extends ThreadLocal.
      // Get the session from current thread's active session.
      var session = activeThreadSession.get()
      // If the session is non-null and its SparkContext has not been stopped, return the existing session
      if ((session ne null) && !session.sparkContext.isStopped) {
        options.foreach { case (k, v) => session.sessionState.conf.setConfString(k, v) }
        if (options.nonEmpty) {
          logWarning("Using an existing SparkSession; some configuration may not take effect.")
        }
        return session
      }

      // Next, check whether there is a valid global default SparkSession
      // Global synchronization so we will only set the default session once.
      SparkSession.synchronized {
        // If the current thread does not have an active session, get it from the global session.
        session = defaultSession.get()
        if ((session ne null) && !session.sparkContext.isStopped) {
          options.foreach { case (k, v) => session.sessionState.conf.setConfString(k, v) }
          if (options.nonEmpty) {
            logWarning("Using an existing SparkSession; some configuration may not take effect.")
          }
          return session
        }

        // Create a new session; by default userSuppliedContext contains no user-supplied SparkContext,
        // so the SparkContext is initialized here
        // No active nor global default session. Create a new one.
        val sparkContext = userSuppliedContext.getOrElse {
          val sparkConf = new SparkConf()
          options.foreach { case (k, v) => sparkConf.set(k, v) }

          // set a random app name if not given.
          if (!sparkConf.contains("spark.app.name")) {
            sparkConf.setAppName(java.util.UUID.randomUUID().toString)
          }

          SparkContext.getOrCreate(sparkConf)
          // Do not update `SparkConf` for existing `SparkContext`, as it's shared by all sessions.
        }

        // Read the extension configurator class from the static SQL conf
        // Initialize extensions if the user has defined a configurator class.
        val extensionConfOption = sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
        if (extensionConfOption.isDefined) {
          ......

The remainder of getOrCreate (not shown here) loads any configured extensions, instantiates the new SparkSession with this SparkContext and the collected extensions, registers it as the default and active session, and finally returns it.
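
The reuse behavior implemented above is easy to observe (a minimal sketch; spark.some.option is an arbitrary key used only for illustration):

import org.apache.spark.sql.SparkSession

val first = SparkSession.builder()
  .master("local[*]")
  .appName("first")
  .getOrCreate()

// A second getOrCreate() on the same thread returns the existing session;
// new options are applied to its runtime conf and a warning is logged
val second = SparkSession.builder()
  .appName("second")
  .config("spark.some.option", "value")
  .getOrCreate()

println(first eq second)                       // true: the same instance is reused
println(second.conf.get("spark.some.option"))  // value: option applied to the existing session
println(second.sparkContext.appName)           // still "first": the shared SparkContext is not reconfigured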