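StreamingContext startup flow
- Start and stop: context.start()/context.stop()
- Ways to initialize:
  - Master URL + app name
  - A SparkConf
  - An existing SparkContext
- SparkContext: passed in from outside or recreated from a checkpoint
- DStreamGraph: a new DStreamGraph, or the graph restored from the checkpoint
- scheduler: a new JobScheduler
- state: initialized to INITIALIZED
- Starting the StreamingContext: start the JobScheduler in a new thread, then set state to ACTIVE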
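
The lifecycle described above can be sketched as a small state machine. This is a toy model in Python, not Spark code: `ToyStreamingContext` and its two callables (stand-ins for starting and stopping the JobScheduler) are hypothetical names. The STOPPED state and the behavior on repeated `start()` calls are not covered in these notes; they are filled in here as assumptions about how Spark behaves.

```python
import threading
from enum import Enum


class State(Enum):
    INITIALIZED = 1
    ACTIVE = 2
    STOPPED = 3  # assumption: not mentioned in the notes above


class ToyStreamingContext:
    """Toy model of the context lifecycle only; not the Spark class."""

    def __init__(self, start_scheduler, stop_scheduler):
        # Hypothetical callables standing in for JobScheduler start/stop.
        self._start_scheduler = start_scheduler
        self._stop_scheduler = stop_scheduler
        self._lock = threading.Lock()
        self.state = State.INITIALIZED

    def start(self):
        with self._lock:
            if self.state is State.ACTIVE:
                return  # already started, nothing to do
            if self.state is State.STOPPED:
                raise RuntimeError("context has already been stopped")
            # Run the scheduler start-up on its own thread and wait for it.
            # Errors raised inside the thread are not propagated in this sketch.
            worker = threading.Thread(target=self._start_scheduler,
                                      name="streaming-start")
            worker.start()
            worker.join()
            self.state = State.ACTIVE

    def stop(self):
        with self._lock:
            if self.state is State.ACTIVE:
                self._stop_scheduler()
            self.state = State.STOPPED
```

The state only flips to ACTIVE after the scheduler start-up has returned, which mirrors the order given in the last bullet.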
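JobScheduler flow
- Start the eventloop thread that processes JobSchedulerEvent messages
- Register with the Spark listener bus
- Create the ReceiverTracker and the InputInfoTracker
- Create the ExecutorAllocationManager and register it with the listener bus
- Start the receiverTracker
  - Start the ReceiverTracker RPC endpoint
  - launchReceivers: obtain the receivers by calling getReceiver() on each receiverInputStream, then send a StartAllReceivers message to the ReceiverTracker endpoint to start all of them
- Start the jobGenerator
- Start the executorAllocationManager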
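
The event loop that both JobScheduler and JobGenerator rely on is essentially a queue drained by one dedicated thread. Below is a minimal sketch of that pattern in Python. `ToyEventLoop` is a hypothetical name, and the stop-marker shutdown is a simplification chosen for this sketch rather than a description of how Spark stops its loop.

```python
import queue
import threading


class ToyEventLoop:
    """Single-threaded event loop: post() enqueues, one thread handles."""

    def __init__(self, name, on_receive):
        self._on_receive = on_receive      # handler called once per event
        self._queue = queue.Queue()
        self._stop_marker = object()       # sentinel used only by this sketch
        self._thread = threading.Thread(target=self._run, name=name,
                                        daemon=True)

    def start(self):
        self._thread.start()

    def post(self, event):
        # Callers never block on event handling; they only enqueue.
        self._queue.put(event)

    def stop(self):
        # Events posted before stop() are still handled, in order.
        self._queue.put(self._stop_marker)
        self._thread.join()

    def _run(self):
        while True:
            event = self._queue.get()
            if event is self._stop_marker:
                return
            try:
                self._on_receive(event)
            except Exception as exc:
                # A failing handler must not kill the loop thread.
                print(f"error while handling {event!r}: {exc}")


if __name__ == "__main__":
    handled = []
    loop = ToyEventLoop("JobScheduler", handled.append)
    loop.start()
    loop.post("JobStarted")      # event names are plain strings here
    loop.post("JobCompleted")
    loop.stop()
    print(handled)
```

Because a single thread handles every event, the handler needs no locking of its own state, which is the main appeal of this design.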
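ReceiverTracker RPC endpoint flow
On receiving a StartAllReceivers message:
- Get the executor list from the BlockManager (excluding the driver) and run scheduleReceiver over those executors. Scheduling honors each receiver's preferredLocation first, then spreads the receivers across the executors as evenly as it can.
- Call startReceiver(receiver, executors) to start the receiver on an executor
  - Create the receiverRDD, taking the scheduledLocations into account
  - Submit a job through the SparkContext that starts the receiver's ReceiverSupervisor on a worker
  - If that job ends while the ReceiverTracker has not been stopped, send a RestartReceiver message to restart the receiver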
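
The two-step scheduling rule (preferred location first, then even spreading) can be illustrated with a simplified model. The function below is an assumption-laden sketch, not Spark's policy: `schedule_receivers` and its data shapes are hypothetical, it picks exactly one executor per receiver, and a receiver whose preferred host has no executor simply falls through to the balancing step.

```python
def schedule_receivers(receivers, executors):
    """Assign each receiver to one executor.

    receivers: dict of receiver_id -> preferred host name, or None
    executors: non-empty list of (host, executor_id) tuples, driver excluded
    returns:   dict of receiver_id -> (host, executor_id)
    """
    load = {ex: 0 for ex in executors}   # receivers assigned per executor
    assignment = {}

    def assign(receiver_id, candidates):
        # Pick the least-loaded candidate; ties go to the earliest one.
        chosen = min(candidates, key=lambda ex: load[ex])
        assignment[receiver_id] = chosen
        load[chosen] += 1

    # Step 1: receivers with a preferred host that actually has executors.
    for receiver_id, host in receivers.items():
        if host is None:
            continue
        on_host = [ex for ex in executors if ex[0] == host]
        if on_host:
            assign(receiver_id, on_host)

    # Step 2: everything else is spread over the least-loaded executors.
    for receiver_id in receivers:
        if receiver_id not in assignment:
            assign(receiver_id, executors)

    return assignment


if __name__ == "__main__":
    executors = [("host1", "e1"), ("host2", "e2"), ("host3", "e3")]
    receivers = {0: "host2", 1: None, 2: None, 3: None}
    for rid, ex in schedule_receivers(receivers, executors).items():
        print(rid, ex)
```

Handling the constrained receivers first matters: if the unconstrained ones were placed first, they could fill up the very executor a preferred location points to.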
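ReceiverSupervisor flow
The ReceiverSupervisor handles all data received by the receiver.
- Creation.
  - ReceivedBlockHandler: one of two types, WAL-based or BlockManager-based.
  - Create the BlockGenerator and bind its BlockGeneratorListener
- Start.
  - Call the supervisor's onStart() method, which starts all BlockGenerators
  - Call the supervisor's startReceiver() method, which starts the Receiver
    - Set the receiver state to Started
    - Call the receiver's onStart() method to start the receiver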
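BlockGenerator flow
- Five states: Initialized, Active, StoppedAddingData, StoppedGeneratingBlocks, StoppedAll
- Creation.
  - Create blockIntervalTimer
  - Create blockPushingThread
  - Create currentBuffer
  - Create blocksForPushing
- Start.
  - The state moves from Initialized to Active
  - Start blockIntervalTimer, which repeatedly triggers updateCurrentBuffer(time: Long) to turn the contents of the buffer into a block
  - Start blockPushingThread, which runs keepPushingBlocks()
    - Main loop: while blocks are still being generated, poll a block from blocksForPushing and push it
    - pushBlock hands the block to the listener's onPushBlock() method. By default this goes through pushArrayBuffer, which stores the block in memory (receivedBlockHandler.storeBlock, backed by the BlockManager) and reports it to the driver by sending an AddBlock message. A Receiver can also supply its own listener to handle blocks.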
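
The interplay of currentBuffer, blocksForPushing, and the pushing thread is a producer/consumer hand-off, sketched below as a toy model. Everything here is an assumption made for illustration: `ToyBlockGenerator` is a hypothetical class, the timer is replaced by explicit calls to `update_current_buffer` so the example stays deterministic, block ids are made-up strings, and the shutdown order (final flush, then drain) is how I understand the state names rather than something these notes spell out.

```python
import queue
import threading


class ToyBlockGenerator:
    def __init__(self, on_push_block, queue_size=10):
        self._on_push_block = on_push_block   # stands in for onPushBlock()
        self._lock = threading.Lock()
        self.state = "Initialized"
        self._current_buffer = []
        self._blocks_for_pushing = queue.Queue(maxsize=queue_size)
        self._block_pushing_thread = threading.Thread(
            target=self._keep_pushing_blocks, daemon=True)

    def start(self):
        with self._lock:
            if self.state != "Initialized":
                raise RuntimeError("generator was already started")
            self.state = "Active"
        self._block_pushing_thread.start()

    def add_data(self, item):
        with self._lock:
            if self.state != "Active":
                raise RuntimeError("generator is not accepting data")
            self._current_buffer.append(item)

    def update_current_buffer(self, time_ms):
        # In the real flow a timer calls this once per block interval.
        with self._lock:
            if not self._current_buffer:
                return
            items, self._current_buffer = self._current_buffer, []
        # put() blocks when the queue is full, which throttles generation.
        self._blocks_for_pushing.put((f"block-{time_ms}", items))

    def stop(self, time_ms):
        with self._lock:
            self.state = "StoppedAddingData"
        self.update_current_buffer(time_ms)   # flush what is left
        with self._lock:
            self.state = "StoppedGeneratingBlocks"
        self._block_pushing_thread.join()     # wait for the drain to finish
        with self._lock:
            self.state = "StoppedAll"

    def _blocks_being_generated(self):
        with self._lock:
            return self.state != "StoppedGeneratingBlocks"

    def _keep_pushing_blocks(self):
        # Main loop: poll with a timeout so a state change is noticed.
        while self._blocks_being_generated():
            try:
                block_id, items = self._blocks_for_pushing.get(timeout=0.01)
            except queue.Empty:
                continue
            self._on_push_block(block_id, items)
        # No new blocks can arrive now; push whatever is still queued.
        while True:
            try:
                block_id, items = self._blocks_for_pushing.get_nowait()
            except queue.Empty:
                return
            self._on_push_block(block_id, items)
```

Swapping the buffer under the lock and enqueuing outside it keeps `add_data` from stalling while the pushing side is slow.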
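JobGenerator flow
Responsibilities: create jobs from DStreams, plus checkpointing and cleaning up DStream metadata.
- Creation.
  - Create the eventloop that processes JobGeneratorEvent messages
  - Create a RecurringTimer that repeatedly posts a GenerateJobs JobGeneratorEvent to the eventloop.
- Start. The eventloop starts; when it receives a JobGeneratorEvent it runs processEvent:
  - generateJobs()
    - receiverTracker.allocateBlocksToBatch(time): group the received blocks into a batch
    - graph.generateJobs(time): generate the jobs
    - If job generation succeeds, call jobScheduler.submitJobSet to submit the job set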
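
The three steps inside generateJobs() can be sketched as follows. This is a toy model, not Spark code: `ToyReceiverTracker`, `generate_jobs`, and the callback parameters are hypothetical names, a "job" is just a zero-argument callable, and routing failures to an error callback is an assumption about what happens when generation does not succeed.

```python
class ToyReceiverTracker:
    """Keeps received block ids until a batch claims them."""

    def __init__(self):
        self._unallocated = []
        self._allocated = {}   # batch time -> list of block ids

    def add_block(self, block_id):
        self._unallocated.append(block_id)

    def allocate_blocks_to_batch(self, batch_time):
        # Everything received since the previous batch goes to this one.
        self._allocated[batch_time] = self._unallocated
        self._unallocated = []

    def blocks_of_batch(self, batch_time):
        return self._allocated.get(batch_time, [])


def generate_jobs(batch_time, tracker, output_operations,
                  submit_job_set, report_error):
    try:
        # Step 1: freeze the set of blocks that belongs to this batch.
        tracker.allocate_blocks_to_batch(batch_time)
        blocks = tracker.blocks_of_batch(batch_time)
        # Step 2: each output operation yields one job for the batch.
        jobs = [op(batch_time, blocks) for op in output_operations]
    except Exception as exc:
        report_error(f"error generating jobs for time {batch_time}", exc)
        return
    # Step 3: only a fully generated job set is submitted.
    submit_job_set(batch_time, jobs)


if __name__ == "__main__":
    tracker = ToyReceiverTracker()
    tracker.add_block("b1")
    tracker.add_block("b2")

    def count_blocks(batch_time, blocks):
        return lambda: (batch_time, len(blocks))

    generate_jobs(1000, tracker, [count_blocks],
                  lambda t, jobs: print(t, [job() for job in jobs]),
                  lambda msg, exc: print(msg, exc))
```

Allocating blocks before generating jobs fixes the batch boundary: a block that arrives during job generation waits for the next batch instead of being counted twice or lost.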