Spark: Environment Update and Application Start

Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license; please include the original link and this notice when reposting.
Original link: https://blog.csdn.net/qq_27639777/article/details/89434774

Overview

During SparkContext initialization the environment may have been modified, so the environment is updated first and the application is started afterwards:

postEnvironmentUpdate()
postApplicationStart()

Environment Update

Before updating the environment, SparkContext first checks whether the taskScheduler has been created; if it has, the update proceeds as follows:

  1. Parse the JAR and file configuration parameters;

  2. Collect the detailed environment configuration: JVM parameters, Spark properties, system properties, the classpath, and so on;

  3. Build a SparkListenerEnvironmentUpdate event and post it via the listenerBus.

     /** Post the environment update event once the task scheduler is ready */
     private def postEnvironmentUpdate() {
       if (taskScheduler != null) {
         val schedulingMode = getSchedulingMode.toString
         val addedJarPaths = addedJars.keys.toSeq
         val addedFilePaths = addedFiles.keys.toSeq
         val environmentDetails = SparkEnv.environmentDetails(conf, schedulingMode, addedJarPaths,
           addedFilePaths)
         val environmentUpdate = SparkListenerEnvironmentUpdate(environmentDetails)
         listenerBus.post(environmentUpdate)
       }
     }
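The guard-then-post pattern above can be sketched in plain Scala. This is a minimal stand-in, not Spark's API: `EnvironmentUpdate`, `ListenerBusSketch`, and `postEnvironmentUpdateSketch` are all hypothetical names, and the readiness flag stands in for the `taskScheduler != null` check.

```scala
// Minimal sketch (hypothetical names, not Spark's API) of the guard-then-post
// pattern: the event is only posted once the task scheduler is ready.
case class EnvironmentUpdate(details: Map[String, Seq[(String, String)]])

class ListenerBusSketch {
  private var posted = List.empty[EnvironmentUpdate]
  def post(event: EnvironmentUpdate): Unit = { posted = event :: posted }
  def events: List[EnvironmentUpdate] = posted.reverse
}

def postEnvironmentUpdateSketch(taskSchedulerReady: Boolean,
    bus: ListenerBusSketch): Unit = {
  if (taskSchedulerReady) {
    // In the real code this map comes from SparkEnv.environmentDetails.
    val details = Map("Spark Properties" -> Seq(("spark.scheduler.mode", "FIFO")))
    bus.post(EnvironmentUpdate(details))
  }
}
```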
    

Parsing JARs and Files

During SparkContext initialization, if the spark.jars property is set, the JARs it lists are added by the addJar method under the path given by httpFileServer's jarDir variable, and the files listed in spark.files are added by the addFile method under the path given by httpFileServer's fileDir variable.

Utils.scala
def getUserJars(conf: SparkConf): Seq[String] = {
  val sparkJars = conf.getOption("spark.jars")
  sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
}

_jars = Utils.getUserJars(_conf)
_files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.nonEmpty))
  .toSeq.flatten

def jars: Seq[String] = _jars
// Add each JAR given through the constructor
if (jars != null) {
  jars.foreach(addJar)
}

def files: Seq[String] = _files
if (files != null) {
  files.foreach(addFile)
}
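The comma-separated parsing that getUserJars and the spark.files handling perform can be tried out in isolation. This sketch uses a hypothetical helper (not Spark code) with the same `split`/`filter(_.nonEmpty)` logic, showing how empty segments are dropped:

```scala
// Hypothetical helper mirroring the parsing in getUserJars: split on commas
// and drop empty segments, e.g. from "a.jar,,b.jar" or a trailing comma.
def parseCommaSeparated(value: Option[String]): Seq[String] =
  value.map(_.split(",").toSeq).getOrElse(Seq.empty).filter(_.nonEmpty)
```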

Fetching the Environment Configuration

The static method SparkEnv.environmentDetails collects the JVM parameters, Spark properties, system properties, classpath entries, and other settings that affect the environment.

/**
 * Return a map representing the JVM information, Spark properties, system properties,
 * and classpath. Each map key denotes a category of information, and each map value is
 * a sequence of the corresponding (key, value) property pairs. This is used mainly for
 * the SparkListenerEnvironmentUpdate event.
 */
private[spark]
def environmentDetails(
    conf: SparkConf,
    schedulingMode: String,
    addedJars: Seq[String],
    addedFiles: Seq[String]): Map[String, Seq[(String, String)]] = {

  import Properties._
  val jvmInformation = Seq(
    ("Java Version", s"$javaVersion ($javaVendor)"),
    ("Java Home", javaHome),
    ("Scala Version", versionString)
  ).sorted

  // Spark properties
  // This includes the scheduling mode whether or not it is configured (used by SparkUI)
  val schedulerMode =
    if (!conf.contains("spark.scheduler.mode")) {
      Seq(("spark.scheduler.mode", schedulingMode))
    } else {
      Seq[(String, String)]()
    }
  val sparkProperties = (conf.getAll ++ schedulerMode).sorted

  // System properties that are not java classpaths
  val systemProperties = Utils.getSystemProperties.toSeq
  val otherProperties = systemProperties.filter { case (k, _) =>
    k != "java.class.path" && !k.startsWith("spark.")
  }.sorted

  // Class paths, including all the added jars and files
  val classPathEntries = javaClassPath
    .split(File.pathSeparator)
    .filterNot(_.isEmpty)
    .map((_, "System Classpath"))
  val addedJarsAndFiles = (addedJars ++ addedFiles).map((_, "Added By User"))
  val classPaths = (addedJarsAndFiles ++ classPathEntries).sorted
  // Assemble the result map
  Map[String, Seq[(String, String)]](
    "JVM Information" -> jvmInformation,
    "Spark Properties" -> sparkProperties,
    "System Properties" -> otherProperties,
    "Classpath Entries" -> classPaths)
}
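A simplified, self-contained sketch of the same assembly (covering only the JVM-information, Spark-properties, and system-properties parts; the function name and parameters are hypothetical) illustrates the category map and the classpath/Spark-property filtering:

```scala
import scala.util.Properties

// Simplified sketch of SparkEnv.environmentDetails: build a category map and
// filter out the classpath entry and Spark properties from the system properties.
def environmentDetailsSketch(
    sparkConf: Map[String, String],
    systemProps: Seq[(String, String)]): Map[String, Seq[(String, String)]] = {
  val jvmInformation = Seq(
    ("Java Version", Properties.javaVersion),
    ("Scala Version", Properties.versionString)
  ).sorted
  val otherProps = systemProps.filter { case (k, _) =>
    k != "java.class.path" && !k.startsWith("spark.")
  }.sorted
  Map(
    "JVM Information" -> jvmInformation,
    "Spark Properties" -> sparkConf.toSeq.sorted,
    "System Properties" -> otherProps)
}
```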

Posting the Environment Update Event

The SparkListenerEnvironmentUpdate event is built and posted via the listenerBus. It is received by EnvironmentListener and JobProgressListener, and ultimately affects what is shown on the SparkUI.

EnvironmentListener#onEnvironmentUpdate
// Update the information shown on the EnvironmentTab
override def onEnvironmentUpdate(environmentUpdate: SparkListenerEnvironmentUpdate) {
  synchronized {
    val environmentDetails = environmentUpdate.environmentDetails
    jvmInformation = environmentDetails("JVM Information")
    sparkProperties = environmentDetails("Spark Properties")
    systemProperties = environmentDetails("System Properties")
    classpathEntries = environmentDetails("Classpath Entries")
  }
}

JobProgressListener#onEnvironmentUpdate
// Update the schedulingMode information shown on the SparkUI
override def onEnvironmentUpdate(environmentUpdate: SparkListenerEnvironmentUpdate) {
  synchronized {
    schedulingMode = environmentUpdate
      .environmentDetails("Spark Properties").toMap
      .get("spark.scheduler.mode")
      .map(SchedulingMode.withName)
  }
}
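The way JobProgressListener recovers an enum value from the string property can be sketched with a stand-in Enumeration. `SchedulingModeSketch` and `extractSchedulingMode` are hypothetical names; Spark's real enum is SchedulingMode.

```scala
// Stand-in for Spark's SchedulingMode enum.
object SchedulingModeSketch extends Enumeration {
  val FAIR, FIFO, NONE = Value
}

// Mirrors JobProgressListener#onEnvironmentUpdate: look up the property in the
// posted Spark properties and map the string back to the enum value.
def extractSchedulingMode(
    sparkProperties: Seq[(String, String)]): Option[SchedulingModeSketch.Value] =
  sparkProperties.toMap.get("spark.scheduler.mode").map(SchedulingModeSketch.withName)
```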

Application Start

Starting the Spark application is simple: it just posts a SparkListenerApplicationStart event on the listenerBus, as the following code shows:

/** Post the application start event */
private def postApplicationStart() {
  // Note: this code assumes the task scheduler has been initialized and has already
  // contacted the cluster manager to obtain an application ID.
  listenerBus.post(SparkListenerApplicationStart(appName, Some(applicationId),
    startTime, sparkUser, applicationAttemptId, schedulerBackend.getDriverLogUrls))
}

The SparkListenerApplicationStart event is received by ApplicationEventListener, ExecutorsListener, and JobProgressListener at the same time.
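The fan-out of a single event to several listeners can be sketched as follows. All names here are hypothetical stand-ins; Spark's real bus delivers SparkListenerApplicationStart to its registered listeners in the same one-event-many-receivers fashion.

```scala
// Hypothetical event and listeners sketching the fan-out of an application
// start event: each listener picks out the fields it cares about.
case class AppStart(appName: String, time: Long)

trait ListenerSketch { def onApplicationStart(event: AppStart): Unit }

class NameListener extends ListenerSketch {
  var appName: Option[String] = None
  def onApplicationStart(e: AppStart): Unit = { appName = Some(e.appName) }
}

class TimeListener extends ListenerSketch {
  var startTime: Long = -1L
  def onApplicationStart(e: AppStart): Unit = { startTime = e.time }
}

// Deliver one event to every registered listener.
def postToAll(listeners: Seq[ListenerSketch], event: AppStart): Unit =
  listeners.foreach(_.onApplicationStart(event))
```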

ApplicationEventListener#onApplicationStart
// Capture the application-level configuration information
override def onApplicationStart(applicationStart: SparkListenerApplicationStart) {
  appName = Some(applicationStart.appName)
  appId = applicationStart.appId
  appAttemptId = applicationStart.appAttemptId
  startTime = Some(applicationStart.time)
  sparkUser = Some(applicationStart.sparkUser)
}

ExecutorsListener#onApplicationStart
// Find the executorId of the driver (the driver acts as a special executor) and
// attach the executor log URLs used when tasks run
override def onApplicationStart(
    applicationStart: SparkListenerApplicationStart): Unit = {
  applicationStart.driverLogs.foreach { logs =>
    val storageStatus = activeStorageStatusList.find { s =>
      s.blockManagerId.executorId == SparkContext.LEGACY_DRIVER_IDENTIFIER ||
      s.blockManagerId.executorId == SparkContext.DRIVER_IDENTIFIER
    }
    storageStatus.foreach { s =>
      val eid = s.blockManagerId.executorId
      val taskSummary = executorToTaskSummary.getOrElseUpdate(eid, ExecutorTaskSummary(eid))
      // Note: taskSummary is only a local reference, but getOrElseUpdate returns the
      // object stored in executorToTaskSummary, so the assignment below does update the map.
      taskSummary.executorLogs = logs.toMap
    }
  }
}
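A subtle point in the code above: taskSummary is only a local reference, yet the assignment to executorLogs still takes effect, because getOrElseUpdate returns the very object stored in the map. A minimal sketch (hypothetical names) demonstrates this:

```scala
import scala.collection.mutable

// Stand-in for ExecutorTaskSummary: a case class with a mutable field.
case class TaskSummarySketch(executorId: String,
    var executorLogs: Map[String, String] = Map.empty)

// Mirrors ExecutorsListener#onApplicationStart: getOrElseUpdate inserts a fresh
// summary if absent and returns the stored object, so mutating it through the
// local reference is visible through the map afterwards.
def recordDriverLogs(
    summaries: mutable.HashMap[String, TaskSummarySketch],
    eid: String,
    logs: Map[String, String]): Unit = {
  val taskSummary = summaries.getOrElseUpdate(eid, TaskSummarySketch(eid))
  taskSummary.executorLogs = logs
}
```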

JobProgressListener#onApplicationStart
// Record the application's start time
override def onApplicationStart(appStarted: SparkListenerApplicationStart) {
  startTime = appStarted.time
}