Spark Streaming example HdfsWordCount: tracing the full path from DStream to RDD

水中舟_luyl

Categories: spark, Streaming. Tags: DStream, RDD

Article link: https://blog.csdn.net/luyllyl/article/details/78973718

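The previous article, "SparkStream example HdfsWordCount: how InputDStream and OutputDStream enter the DStreamGraph", analysed how the InputDStream and the OutputDStream are registered in the DStreamGraph. This part analyses how FileInputDStream generates its RDDs.

III. How does FileInputDStream generate RDDs?

1. The entry point is the call to StreamingContext.start() in our example.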

def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      try {
          validate()

          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
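          // Start a child thread, partly so that thread-local setup stays off the caller's thread and partly so that the main thread is not blocked.
          // A receiver-based Spark Streaming application needs at least two threads at run time: one to receive data and one to process it. The thread started here belongs to the scheduling layer and is unrelated to those two.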
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {

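2. Step into the JobScheduler.start() method:

The onReceive method of the EventLoop inside JobScheduler is triggered after JobGenerator calls generateJobs(); the article "SparkStream source analysis: how JobScheduler's JobStarted and JobCompleted get called" covers this in detail.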

def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  // Posted JobSchedulerEvents are placed in a LinkedBlockingDeque
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)
  // ReceiverTracker handles ReceiverInputDStream sources such as SocketInputDStream, FlumePollingInputDStream and FlumeInputDStream; see the subclasses of ReceiverInputDStream
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}

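# FileInputDStream does not need a Receiver, so the receiverTracker.start() code is left for a later article.

3. Next we step into jobGenerator.start(). First, a brief look at how JobGenerator schedules work periodically, once per batch Duration.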

private[streaming]
class JobGenerator(jobScheduler: JobScheduler) extends Logging {

  private val ssc = jobScheduler.ssc
  private val conf = ssc.conf
  private val graph = ssc.graph

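  // When JobGenerator is initialised it obtains Spark's own Clock, which is used to sleep periodically according to the batch Duration.
  // a. The block below returns Spark's SystemClock by default; you can also implement your own Clock and put its class name in SparkConf under "spark.streaming.clock".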
  val clock = {
    val clockClass = ssc.sc.conf.get(
      "spark.streaming.clock", "org.apache.spark.util.SystemClock")
    try {
      Utils.classForName(clockClass).newInstance().asInstanceOf[Clock]
    } catch {
     …..  }
  // b. Once JobGenerator.start() runs it calls startFirstTime(), which calls timer.start(); the timer then posts GenerateJobs once per batch Duration so that the EventLoop runs its onReceive method.
  private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

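==> The eventLoop.post(GenerateJobs(...)) call above is what causes the EventLoop to run its onReceive method.

(The EventLoop source is quite simple. It uses a concurrent BlockingQueue; when take() is called on an empty queue, the calling thread blocks there. So once RecurringTimer posts a GenerateJobs event, the loop continues and calls the onReceive method shown below.)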

/** Start generation of jobs */
def start(): Unit = synchronized {
  if (eventLoop != null) return // generator has already been started

  checkpointWriter
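  // JobGeneratorEvent has several implementations (GenerateJobs, ClearMetadata, DoCheckpoint, ClearCheckpointData); GenerateJobs, seen above, means "generate the jobs"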
  eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
    override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)
    // ...
  }
  eventLoop.start()

  if (ssc.isCheckpointPresent) {
    restart()
  } else {
    startFirstTime()
  }
}
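The blocking-queue behaviour described above can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions: SimpleEventLoop, the Event trait and the run helper are hypothetical names, and a simple sleep loop stands in for RecurringTimer. It is not Spark's actual EventLoop class.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingDeque, TimeUnit}
import scala.collection.mutable.ListBuffer

object EventLoopSketch {
  // Hypothetical event type standing in for JobGeneratorEvent.
  sealed trait Event
  final case class GenerateJobs(time: Long) extends Event

  // Minimal event loop: a daemon thread blocks on take() until an event is posted.
  class SimpleEventLoop(name: String)(onReceive: Event => Unit) {
    private val queue = new LinkedBlockingDeque[Event]()
    @volatile private var stopped = false

    private val thread = new Thread(name) {
      setDaemon(true)
      override def run(): Unit = {
        try {
          while (!stopped) {
            val event = queue.take() // blocks while the queue is empty
            onReceive(event)
          }
        } catch {
          case _: InterruptedException => // stop() interrupts a blocked take()
        }
      }
    }

    def start(): Unit = thread.start()
    def post(event: Event): Unit = queue.put(event)
    def stop(): Unit = { stopped = true; thread.interrupt() }
  }

  // Posts `batches` GenerateJobs events, one per `periodMs`, and returns the
  // batch times in the order the event loop handled them.
  def run(batches: Int, periodMs: Long): List[Long] = {
    val handled = ListBuffer.empty[Long]
    val done = new CountDownLatch(batches)
    val loop = new SimpleEventLoop("JobGenerator-sketch")({
      case GenerateJobs(t) => handled.synchronized { handled += t }; done.countDown()
    })
    loop.start()
    // Stand-in for RecurringTimer: one post per period.
    (1 to batches).foreach { i =>
      Thread.sleep(periodMs)
      loop.post(GenerateJobs(i * periodMs))
    }
    done.await(5, TimeUnit.SECONDS)
    loop.stop()
    handled.synchronized { handled.toList }
  }
}
```

Calling EventLoopSketch.run(3, 10L) posts three events; because a single thread drains the queue in FIFO order, they are handled in the order they were posted.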

4. Because the first event taken from the BlockingQueue is the GenerateJobs case class, the generateJobs(time) method runs first.

private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    // ...
  }
}

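5. Inside this generateJobs method, DStreamGraph.generateJobs puts the text data newly added during each batch Duration into a UnionRDD, wraps it in a Job, and hands the Job over for the Executors to run. The detailed steps follow:

# From Spark's own comment: this method generates jobs and performs a checkpoint for the given time.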

/** Generate jobs and perform checkpoint for the given `time`.  */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {

    // This call is for ReceiverInputDStream; FileInputDStream is not of that kind
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    // Call the graph's generateJobs method; the surrounding Try yields Success(jobs) or Failure(e),
    // where jobs is the collection of Job objects it returned. If the jobs were created, JobScheduler.submitJobSet then submits them to the cluster.
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
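      // streamIdToInputInfos is the metadata of the data received for this batch
      // A JobSet represents the batch of jobs for one batch duration. It is a plain object holding the jobs not yet submitted, the submission time, the processing start and end times, and so on.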
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  // Post a DoCheckpoint event; one is posted per streaming batch interval
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
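The Try { ... } match { case Success / case Failure } shape used above can be tried out in isolation. The sketch below is an illustration only: Job, build and the Either result are hypothetical stand-ins, where the real code calls jobScheduler.submitJobSet and jobScheduler.reportError.

```scala
import scala.util.{Failure, Success, Try}

object GenerateJobsSketch {
  // Hypothetical stand-in for a streaming Job: a batch time plus a deferred body.
  final case class Job(time: Long, body: () => Unit)

  // Wrap job generation in Try so that an exception becomes a Failure value
  // instead of escaping from the event-loop thread.
  def generateJobs(time: Long)(build: Long => Seq[Job]): Either[String, Int] =
    Try(build(time)) match {
      case Success(jobs) => Right(jobs.size) // the real code submits a JobSet here
      case Failure(e)    => Left("Error generating jobs for time " + time + ": " + e.getMessage)
    }
}
```

A build function that throws ends up in the Failure branch, so one bad batch is reported rather than stopping the generator.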

6. Step into DStreamGraph.generateJobs() and look at the execution flow:

# In this example there is only one outputStream, a ForEachDStream (only ForEachDStream overrides the generateJob method of DStream)

def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      // jobOption is Some(new Job(time, jobFunc)); if print() was called, jobFunc wraps the function declared inside the print method
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}
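The outputStreams.flatMap call relies on generateJob returning an Option: Some values are kept and None values are dropped. A minimal sketch of that idiom, assuming hypothetical OutputStream and Job case classes rather than Spark's own types:

```scala
object OutputStreamsSketch {
  final case class Job(time: Long, name: String)

  // Hypothetical output stream: yields a job only when it has data for the batch.
  final case class OutputStream(name: String, hasData: Long => Boolean) {
    def generateJob(time: Long): Option[Job] =
      if (hasData(time)) Some(Job(time, name)) else None
  }

  // flatMap over Option-returning calls keeps the Some values and drops the Nones.
  def generateJobs(streams: Seq[OutputStream], time: Long): Seq[Job] =
    streams.flatMap(s => s.generateJob(time))
}
```

With one stream that has data and one that does not, only a single Job comes back, which is why a batch with no RDD produces no job.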

7. Now let us look at how generateJob(time) in ForEachDStream gets the files newly added to the directory watched by val lines = ssc.textFileStream(args(0)) into an RDD.

private[streaming]
class ForEachDStream[T: ClassTag] (
    parent: DStream[T],
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean
  ) extends DStream[Unit](parent.ssc) {
  override def dependencies: List[DStream[_]] = List(parent)
  override def slideDuration: Duration = parent.slideDuration
  override def compute(validTime: Time): Option[RDD[Unit]] = None

  override def generateJob(time: Time): Option[Job] = {
    // The parent backtracks through every DStream operator of the app; the RDD it returns holds the data for the current batch interval
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        // createRDDWithLocalProperties runs the body { foreachFunc(rdd, time) }; the leading arguments (time, displayInnerRDDOps) control what is shown in the UI
        val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
          foreachFunc(rdd, time) // foreachFunc is the inner function declared in the print method
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }
}

===> In this example the parent of the ForEachDStream is a ShuffledDStream, produced by the reduceByKey operator below.

val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()

===> Note that ShuffledDStream does not override getOrCompute(time) itself (the method is final in DStream); it only implements compute:

private[streaming]
class ShuffledDStream[K: ClassTag, V: ClassTag, C: ClassTag](
    parent: DStream[(K, V)],
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiner: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true
  ) extends DStream[(K, C)] (parent.ssc) {

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[(K, C)]] = {
    parent.getOrCompute(validTime) match {
      case Some(rdd) => Some(rdd.combineByKey[C](
          createCombiner, mergeValue, mergeCombiner, partitioner, mapSideCombine))
      case None => None
    }
  }
}
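To see what the three combiner functions do, here is a local analogue of combineByKey on an ordinary Seq. It is a sketch only: it runs on one machine with no partitioner or shuffle, so mergeCombiner is not needed and is left out. For reduceByKey(_ + _), createCombiner is the identity and mergeValue is the reduce function.

```scala
object CombineByKeySketch {
  // Local, single-machine analogue of combineByKey; no partitioning or shuffle.
  def combineByKey[K, V, C](records: Seq[(K, V)])(
      createCombiner: V => C,
      mergeValue: (C, V) => C): Map[K, C] =
    records.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      val combined = acc.get(k) match {
        case Some(c) => mergeValue(c, v)  // key seen before: fold the value in
        case None    => createCombiner(v) // first value for this key
      }
      acc.updated(k, combined)
    }

  // Word count expressed the way reduceByKey(_ + _) maps onto combineByKey.
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    val pairs = lines.flatMap(_.split(" ")).filter(_.nonEmpty).map(w => (w, 1))
    combineByKey[String, Int, Int](pairs)(v => v, _ + _)
  }
}
```

In the distributed version, mergeCombiner plays the role of mergeValue across partitions: it merges two partial results for the same key after the shuffle.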

8. The getOrCompute method in DStream ends up calling the compute method of the current DStream instance.

/**
 * Get the RDD corresponding to the given time; either retrieve it from cache
 * or compute-and-cache it.
 */

private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // If RDD was already generated, then retrieve it from HashMap,
  // or else compute the RDD
  generatedRDDs.get(time).orElse {
    // Compute the RDD if time is valid (e.g. correct time in a sliding window)
    // of RDD generation, else generate nothing.
    if (isTimeValid(time)) {
      // createRDDWithLocalProperties runs the curried body that follows, { PairRDDFunctions.disableOutputSpecValidation.withValue(true) { ... } }; the leading arguments (time, displayInnerRDDOps) control what is shown in the UI
      val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          compute(time) // the concrete DStream creates the RDD here, e.g. the compute method of FileInputDStream
        }
      }

...

===> The calls backtrack upward in the order below until they reach the compute method of FileInputDStream

ShuffledDStream

          MappedDStream

                   FlatMappedDStream

                             FileInputDStream
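The backtracking through the chain above, together with the per-time cache in getOrCompute, can be modelled with a few lines of plain Scala. This is a toy model under stated assumptions: MiniDStream, Source and Transformed are hypothetical names, Seq[T] stands in for RDD[T], and the time-validity check and checkpointing are omitted.

```scala
import scala.collection.mutable

object LineageSketch {
  // Hypothetical mini-DStream: Seq[T] stands in for RDD[T].
  abstract class MiniDStream[T] {
    private val generated = mutable.HashMap.empty[Long, Seq[T]]
    var computeCalls = 0

    protected def compute(time: Long): Option[Seq[T]]

    // Return the cached batch for `time`, or compute it once and cache it.
    final def getOrCompute(time: Long): Option[Seq[T]] =
      generated.get(time).orElse {
        computeCalls += 1
        val result = compute(time)
        result.foreach(r => generated.put(time, r))
        result
      }
  }

  // Source stream: plays the role of FileInputDStream.
  class Source(batches: Map[Long, Seq[String]]) extends MiniDStream[String] {
    protected def compute(time: Long): Option[Seq[String]] = batches.get(time)
  }

  // Derived stream: asks its parent first, then applies f to the parent's batch.
  class Transformed[T, U](parent: MiniDStream[T], f: Seq[T] => Seq[U]) extends MiniDStream[U] {
    protected def compute(time: Long): Option[Seq[U]] = parent.getOrCompute(time).map(f)
  }
}
```

Chaining two Transformed streams on top of a Source and calling getOrCompute on the last one walks up to the source, exactly once per batch time; a second call for the same time is served from the cache.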

9. Step into the compute method of FileInputDStream. Once findNewFiles has found the new files, filesToRDD is called to turn the data in those files into an RDD.

override def compute(validTime: Time): Option[RDD[(K, V)]] = {
  // Find new files
  val newFiles = findNewFiles(validTime.milliseconds)
  logInfo("New files at time " + validTime + ":\n" + newFiles.mkString("\n"))
  batchTimeToSelectedFiles += ((validTime, newFiles))
  recentlySelectedFiles ++= newFiles
  val rdds = Some(filesToRDD(newFiles))
  // Copy newFiles to immutable.List to prevent from being modified by the user
  val metadata = Map(
    "files" -> newFiles.toList,
    StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n"))
  val inputInfo = StreamInputInfo(id, 0, metadata)
  ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
  rdds
}

===> newAPIHadoopFile produces a NewHadoopRDD for each file ===> the per-file RDDs are finally combined into a UnionRDD

/** Generate one RDD from an array of files */
private def filesToRDD(files: Seq[String]): RDD[(K, V)] = {
  val fileRDDs = files.map { file =>
    val rdd = serializableConfOpt.map(_.value) match {
      case Some(config) => context.sparkContext.newAPIHadoopFile(
        file,
        fm.runtimeClass.asInstanceOf[Class[F]],
        km.runtimeClass.asInstanceOf[Class[K]],
        vm.runtimeClass.asInstanceOf[Class[V]],
        config)
      case None => context.sparkContext.newAPIHadoopFile[K, V, F](file)
    }
    if (rdd.partitions.size == 0) {
      logError("File " + file + " has no data in it. Spark Streaming can only ingest " +
        "files that have been \"moved\" to the directory assigned to the file stream. " +
        "Refer to the streaming programming guide for more details.")
    }
    rdd
  }
  new UnionRDD(context.sparkContext, fileRDDs)
}
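The shape of filesToRDD, one dataset per file followed by a union, can be shown on ordinary collections. This is a rough sketch, not Spark code: filesToBatch and the read function are hypothetical, and a Seq of lines stands in for each per-file RDD.

```scala
object UnionSketch {
  // Returns the concatenated lines of all files plus the names of files that
  // produced no data (the case the real code logs as an error).
  def filesToBatch(files: Seq[String], read: String => Seq[String]): (Seq[String], Seq[String]) = {
    val perFile = files.map(f => f -> read(f)) // one "RDD" per file
    val emptyFiles = perFile.collect { case (f, lines) if lines.isEmpty => f }
    (perFile.flatMap(_._2), emptyFiles)        // the union keeps file order
  }
}
```

An empty file list yields an empty batch rather than no batch, which matches compute above always returning Some(...).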

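10. So we obtain the UnionRDD for the current Duration and return to the generateJob() method of ForEachDStream. It is now clear that in each batch interval all of that interval's data is first put into a single RDD, which is then wrapped in the corresponding streaming Job.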

override def generateJob(time: Time): Option[Job] = {
  // The parent backtracks through every DStream operator of the app; the RDD it returns holds the data for the current batch interval
  parent.getOrCompute(time) match {
    case Some(rdd) =>
      // createRDDWithLocalProperties runs the body { foreachFunc(rdd, time) }; the leading arguments (time, displayInnerRDDOps) control what is shown in the UI
      val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
        foreachFunc(rdd, time) // if print() was called, foreachFunc is the function declared inside print
      }
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}
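One point worth making explicit is that generateJob only builds a closure; nothing is executed until the scheduler later runs the Job. A minimal sketch of that deferral, assuming a hypothetical Job class with a run() method and Seq[T] in place of RDD[T]:

```scala
object DeferredJobSketch {
  // Hypothetical Job: constructing it runs nothing; run() invokes the captured function.
  final class Job(val time: Long, func: () => Unit) {
    def run(): Unit = func()
  }

  // Mirrors the shape of generateJob: capture the batch and foreachFunc in a closure.
  def generateJob[T](batch: Option[Seq[T]], time: Long)(
      foreachFunc: (Seq[T], Long) => Unit): Option[Job] =
    batch.map(data => new Job(time, () => foreachFunc(data, time)))
}
```

The output action, such as the function behind print(), therefore fires only when run() is called, and a batch with no data yields None and no job at all.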