spark.version=2.4.4
Apache Spark 2.0 introduced SparkSession, which gives users a single, unified entry point to Spark's functionality: there is no longer any need to explicitly create a SparkConf, SparkContext, or SQLContext, because these objects are encapsulated inside SparkSession. SparkSession also exposes the DataFrame and Dataset APIs for writing Spark programs.
SparkSession is built via the Builder design pattern: if no SparkSession object exists yet, a new SparkSession and its associated contexts are instantiated.
The SparkSession companion object
Let's walk through the main pieces involved in initializing a SparkSession. In most cases a SparkSession is created like this:
val spark = SparkSession.builder().config(ConfigUtil.getSparkConf(className)).enableHiveSupport().getOrCreate()
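For readers without the project-specific ConfigUtil helper used above (it presumably just returns a SparkConf), a minimal self-contained sketch looks like this; the app name and master are placeholder values:
import org.apache.spark.sql.SparkSession

// Minimal sketch: "spark-session-demo" and "local[*]" are placeholders.
val spark = SparkSession.builder()
  .appName("spark-session-demo")
  .master("local[*]")
  // .enableHiveSupport()  // optional; requires the spark-hive classes on the classpath
  .getOrCreate()

// The DataFrame/Dataset APIs are reached through the session.
import spark.implicits._
Seq((1, "a"), (2, "b")).toDF("id", "name").show()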
Builder
A SparkSession is created through builder(), so what does builder() actually do?
/**
* Creates a [[SparkSession.Builder]] for constructing a [[SparkSession]].
*/
def builder(): Builder = new Builder
Builder can be viewed as an inner class of the SparkSession companion object. Its main members are:
SparkSessionExtensions: an extension point for SparkSession that allows users to add analyzer rules, optimizer rules, planning strategies, or a customized parser.
/**
 * Builder for [[SparkSession]].
 */
@InterfaceStability.Stable
class Builder extends Logging {

  private[this] val options = new scala.collection.mutable.HashMap[String, String]

  private[this] val extensions = new SparkSessionExtensions

  private[this] var userSuppliedContext: Option[SparkContext] = None

  private[spark] def sparkContext(sparkContext: SparkContext): Builder = synchronized {
    userSuppliedContext = Option(sparkContext)
    this
  }

  /**
   * Inject extensions into the [[SparkSession]].
   * This allows a user to add Analyzer rules,
   * Optimizer rules, Planning Strategies or a customized parser.
   */
  def withExtensions(f: SparkSessionExtensions => Unit): Builder = synchronized {
    f(extensions)
    this
  }
  ......
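The config(...) calls used throughout this post also end up in this Builder: conceptually they only record key/value pairs into the options map shown above, and nothing is applied until getOrCreate() runs. A rough sketch of the string variant (a paraphrase, not necessarily the literal Spark source):
  // Rough sketch of Builder.config(key, value): the pair is only recorded here;
  // getOrCreate() later copies it into the SparkConf / session configuration.
  def config(key: String, value: String): Builder = synchronized {
    options += key -> value
    this
  }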
Now let's look at the definition of SparkSessionExtensions:
/**
 * This current provides the following extension points:
 * Analyzer Rules, Check Analysis Rules, Optimizer Rules,
 * Planning Strategies, Customized Parser, (External) Catalog listeners.
 *
 * The extensions can be used by calling withExtension on the [[SparkSession.Builder]], for
 * example:
 * {{{
 *   SparkSession.builder()
 *     .master("...")
 *     .conf("...", true)
 *     .withExtensions { extensions =>
 *       extensions.injectResolutionRule { session =>
 *         ...
 *       }
 *       extensions.injectParser { (session, parser) =>
 *         ...
 *       }
 *     }
 *     .getOrCreate()
 * }}}
 *
 * Note that none of the injected builders should assume that the [[SparkSession]] is fully
 * initialized and should not touch the session's internals (e.g. the SessionState).
 */
@DeveloperApi
@Experimental
@InterfaceStability.Unstable
class SparkSessionExtensions {
  type RuleBuilder = SparkSession => Rule[LogicalPlan]
  type CheckRuleBuilder = SparkSession => LogicalPlan => Unit
  type StrategyBuilder = SparkSession => Strategy
  type ParserBuilder = (SparkSession, ParserInterface) => ParserInterface

  private[this] val resolutionRuleBuilders = mutable.Buffer.empty[RuleBuilder]

  /**
   * Build the analyzer resolution `Rule`s using the given [[SparkSession]].
   */
  private[sql] def buildResolutionRules(session: SparkSession): Seq[Rule[LogicalPlan]] = {
    resolutionRuleBuilders.map(_.apply(session))
  }

  /**
   * Inject an analyzer resolution `Rule` builder into the [[SparkSession]]. These analyzer
   * rules will be executed as part of the resolution phase of analysis.
   */
  def injectResolutionRule(builder: RuleBuilder): Unit = {
    resolutionRuleBuilders += builder
  }
  ......
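To make the extension mechanism concrete, here is a small sketch of injecting an analyzer rule through withExtensions. The rule itself (NoopRule) is made up for illustration: it simply logs the plan and returns it unchanged.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A hypothetical resolution rule that logs the plan and leaves it untouched.
case class NoopRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"NoopRule saw plan:\n$plan")
    plan
  }
}

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions { extensions =>
    // injectResolutionRule expects a SparkSession => Rule[LogicalPlan] builder.
    extensions.injectResolutionRule(session => NoopRule(session))
  }
  .getOrCreate()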
enableHiveSupport
Since Spark 2.0.0, SparkSession.builder.enableHiveSupport is the recommended replacement for the now-deprecated HiveContext:
/**
* An instance of the Spark SQL execution engine that integrates with data stored in Hive.
* Configuration for Hive is read from hive-site.xml on the classpath.
*/
@deprecated("Use SparkSession.builder.enableHiveSupport instead", "2.0.0")
class HiveContext private[hive](_sparkSession: SparkSession)
  extends SQLContext(_sparkSession) with Logging {
  ......
Now let's look at the concrete implementation of this method in SparkSession. In short, it sets spark.sql.catalogImplementation=hive (the default is in-memory):
/**
 * Enables Hive support, including connectivity to a persistent Hive metastore, support for
 * Hive serdes, and Hive user-defined functions.
 */
def enableHiveSupport(): Builder = synchronized {
  if (hiveClassesArePresent) {
    config(CATALOG_IMPLEMENTATION.key, "hive")
  } else {
    throw new IllegalArgumentException(
      "Unable to instantiate SparkSession with Hive support because " +
      "Hive classes are not found.")
  }
}
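A short usage sketch: with Hive support enabled (this assumes the spark-hive dependency and a hive-site.xml are on the classpath), the session can query Hive tables directly; the database and table names below are placeholders:
val spark = SparkSession.builder()
  .appName("hive-demo")
  .enableHiveSupport()
  .getOrCreate()

// Check which catalog implementation is in effect ("hive" vs "in-memory").
println(spark.conf.get("spark.sql.catalogImplementation"))

// Query a Hive table; "mydb.my_table" is a placeholder name.
spark.sql("SELECT COUNT(*) FROM mydb.my_table").show()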
Static SQL configuration is a cross-session, immutable Spark configuration:
/**
 * Static SQL configuration is a cross-session, immutable Spark configuration. External users can
 * see the static sql configs via `SparkSession.conf`, but can NOT set/unset them.
 */
object StaticSQLConf {

  import SQLConf.buildStaticConf

  val WAREHOUSE_PATH = buildStaticConf("spark.sql.warehouse.dir")
    .doc("The default location for managed databases and tables.")
    .stringConf
    .createWithDefault(Utils.resolveURI("spark-warehouse").toString)

  val CATALOG_IMPLEMENTATION = buildStaticConf("spark.sql.catalogImplementation")
    .internal()
    .stringConf
    .checkValues(Set("hive", "in-memory"))
    .createWithDefault("in-memory")

  val GLOBAL_TEMP_DATABASE = buildStaticConf("spark.sql.globalTempDatabase")
    .internal()
    .stringConf
    .createWithDefault("global_temp")
  ......
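The "cross-session, immutable" part matters in practice: static configs must be supplied before the first SparkSession is created, and changing them afterwards through spark.conf.set is rejected. A small sketch of the expected behaviour, assuming an existing session named spark:
// Static SQL configs can be read through the session's RuntimeConfig...
println(spark.conf.get("spark.sql.catalogImplementation"))

// ...but attempting to set one at runtime fails
// (Spark reports that a static config cannot be modified).
try {
  spark.conf.set("spark.sql.catalogImplementation", "in-memory")
} catch {
  case e: org.apache.spark.sql.AnalysisException =>
    println(s"Cannot modify a static config at runtime: ${e.getMessage}")
}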
getOrCreate
In short, getOrCreate first tries to return an existing SparkSession; if none exists, it creates a new one based on the options set in the builder.
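Before reading the source, the behaviour can be illustrated with a small sketch (assuming a local session for demonstration):
// The first call creates a new session.
val s1 = SparkSession.builder().master("local[*]").getOrCreate()

// The second call reuses the existing session; the extra option is applied to the
// session's conf, but Spark warns that some configuration may not take effect.
val s2 = SparkSession.builder()
  .config("spark.sql.shuffle.partitions", "10")
  .getOrCreate()

println(s1 eq s2)  // true: the same SparkSession instance is returned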
The definition is as follows:
def getOrCreate(): SparkSession = synchronized {
  // SparkSession should only be created and accessed on the driver.
  assertOnDriver()
  // First, check whether there is a valid thread-local SparkSession.
  // activeThreadSession is a new InheritableThreadLocal[SparkSession];
  // InheritableThreadLocal extends ThreadLocal.
  // Get the session from current thread's active session.
  var session = activeThreadSession.get()
  // If the session is not null and its SparkContext has not been stopped,
  // return the existing session.
  if ((session ne null) && !session.sparkContext.isStopped) {
    options.foreach { case (k, v) => session.sessionState.conf.setConfString(k, v) }
    if (options.nonEmpty) {
      logWarning("Using an existing SparkSession; some configuration may not take effect.")
    }
    return session
  }

  // Next, check whether there is a valid global default SparkSession.
  // Global synchronization so we will only set the default session once.
  SparkSession.synchronized {
    // If the current thread does not have an active session, get it from the global session.
    session = defaultSession.get()
    if ((session ne null) && !session.sparkContext.isStopped) {
      options.foreach { case (k, v) => session.sessionState.conf.setConfString(k, v) }
      if (options.nonEmpty) {
        logWarning("Using an existing SparkSession; some configuration may not take effect.")
      }
      return session
    }

    // Create a new session. By default userSuppliedContext holds no SparkContext,
    // so a SparkContext is initialized here.
    // No active nor global default session. Create a new one.
    val sparkContext = userSuppliedContext.getOrElse {
      val sparkConf = new SparkConf()
      options.foreach { case (k, v) => sparkConf.set(k, v) }
      // set a random app name if not given.
      if (!sparkConf.contains("spark.app.name")) {
        sparkConf.setAppName(java.util.UUID.randomUUID().toString)
      }
      SparkContext.getOrCreate(sparkConf)
      // Do not update `SparkConf` for existing `SparkContext`, as it's shared by all sessions.
    }

    // Pick up extensions configured via the static SQLConf entry StaticSQLConf.SPARK_SESSION_EXTENSIONS.
    // Initialize extensions if the user has defined a configurator class.
    val extensionConfOption = sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
    if (extensionConfOption.isDefined) {