I recently looked into how integrating Spark Streaming with Kafka differs across Kafka versions; my notes are below.
1- Kafka 0.8.2 and above, below 0.10
There are two approaches: one is receiver-based, the other has no receiver (the direct approach). The two differ in programming model, performance characteristics, and semantic guarantees.
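To make the contrast concrete, here is a minimal sketch of the two entry points in the spark-streaming-kafka-0-8 artifact; the topic name, group id, broker and ZooKeeper addresses are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def sketch(ssc: StreamingContext): Unit = {
  // Receiver-based: consumes via ZooKeeper with the high-level consumer API;
  // the Map value (2) is the number of consumer threads for the topic
  val receiverStream = KafkaUtils.createStream(
    ssc, "zk1:2181", "my-group", Map("my-topic" -> 2))

  // Direct (receiver-less): reads offset ranges straight from the brokers
  val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("my-topic"))
}
```

The rest of this section walks through the receiver-based path.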
1-1 The receiver-based approach
This approach uses Kafka's high-level consumer API. Messages received from Kafka are stored in Spark executors, and a Spark Streaming job is then launched to process the data. Source code analysis follows.
1-1-1 Overriding Receiver's onStart method
Initialize the block generator for storing Kafka messages.
1-1-1-1 When a BlockGenerator is constructed, it creates a timer:
private val blockIntervalTimer = new RecurringTimer(clock, blockIntervalMs,
updateCurrentBuffer, "BlockGenerator")
This timer thread periodically packs the messages received so far into a block. The rough flow is:
1. Construct the timer thread
private val thread = new Thread("RecurringTimer - " + name)
2. Call loop to start the polling cycle
try {
while (!stopped) {
triggerActionForNextInterval()
}
3. Call triggerActionForNextInterval, which invokes a callback passed in as a higher-order function: updateCurrentBuffer
4. The callback turns the current buffer into a block and installs a new empty buffer to receive the next batch of data
try {
var newBlock: Block = null
synchronized {
if (currentBuffer.nonEmpty) { // the timer has fired and the buffer holds data:
// swap in a new empty buffer for the next batch, and turn the old one into a block
val newBlockBuffer = currentBuffer
currentBuffer = new ArrayBuffer[Any]
val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
listener.onGenerateBlock(blockId)
newBlock = new Block(blockId, newBlockBuffer)
}
}
if (newBlock != null) {
blocksForPushing.put(newBlock) // put is blocking when queue is full
}
In onStart, the receiver creates this block generator via:
blockGenerator = supervisor.createBlockGenerator(new GeneratedBlockHandler)
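The timer mechanism above can be sketched as follows. This is a simplified illustration of the pattern, not Spark's actual RecurringTimer; the class and parameter names are made up:

```scala
// Simplified sketch of the RecurringTimer pattern: a daemon thread wakes up
// once per period and invokes a callback with the trigger time.
class SimpleRecurringTimer(periodMs: Long, callback: Long => Unit, name: String) {
  @volatile private var stopped = false

  private val thread = new Thread(name) {
    override def run(): Unit = {
      // align the first trigger to a period boundary
      var nextTime = (System.currentTimeMillis() / periodMs + 1) * periodMs
      while (!stopped) {
        val sleepMs = nextTime - System.currentTimeMillis()
        if (sleepMs > 0) Thread.sleep(sleepMs)
        callback(nextTime) // e.g. updateCurrentBuffer: cut the buffer into a block
        nextTime += periodMs
      }
    }
  }
  thread.setDaemon(true)

  def start(): Unit = thread.start()
  def stop(): Unit = stopped = true
}
```

With the default block interval of 200 ms (`spark.streaming.blockInterval`), the callback fires five times per second, each call cutting a new block.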
1-1-1-2 The other thread inside BlockGenerator is
private val blockPushingThread = new Thread() {
  override def run() {
    keepPushingBlocks()
  }
}
Its main job is to call the keepPushingBlocks method, which pushes blocks to the BlockManager.
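The hand-off between the timer thread and the pushing thread is a classic bounded producer/consumer queue. A minimal sketch of the pattern, where Block and the queue capacity are illustrative:

```scala
import java.util.concurrent.ArrayBlockingQueue
import scala.collection.mutable.ArrayBuffer

case class Block(id: Long, buffer: ArrayBuffer[Any])

// Bounded queue: the timer thread blocks on put() when the pusher falls behind,
// applying back-pressure instead of growing memory without limit.
val blocksForPushing = new ArrayBlockingQueue[Block](10)

val blockPushingThread = new Thread("block-pusher") {
  override def run(): Unit = {
    while (!Thread.currentThread().isInterrupted) {
      val block = blocksForPushing.take() // waits until a block is available
      // here Spark would hand the block to the BlockManager for storage
      println(s"pushed block ${block.id} with ${block.buffer.size} records")
    }
  }
}
blockPushingThread.setDaemon(true)
```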
1-1-1-3 The thread pool
The pool size is the sum of the consumer thread counts over all topics:
messageHandlerThreadPool = ThreadUtils.newDaemonFixedThreadPool(
topics.values.sum, "KafkaMessageHandler")
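For example, with a hypothetical topic map, the pool gets one slot per consumer thread:

```scala
// topic name -> number of consumer threads for that topic (illustrative values)
val topics = Map("orders" -> 2, "clicks" -> 3)
val poolSize = topics.values.sum // 5: one pool thread per KafkaStream
```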
The key and value decoders are instantiated via reflection; for example, the value decoder:
val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
  .newInstance(consumerConfig.props)
  .asInstanceOf[Decoder[V]]
1-1-1-4 Construct the Kafka message streams, returning Map[String, List[KafkaStream[K,V]]]
val topicMessageStreams = consumerConnector.createMessageStreams(topics, keyDecoder, valueDecoder)
方法原型是:
def createMessageStreams[K,V](topicCountMap: Map[String,Int],
keyDecoder: Decoder[K],
valueDecoder: Decoder[V])
: Map[String,List[KafkaStream[K,V]]]
For a detailed walkthrough of this method, see: the createMessageStreams analysis.
1-1-1-5 Submit to the thread pool and start consuming messages
A List[KafkaStream] holds the multiple message streams of one topic; consumption is multi-threaded, one stream per consumer thread, hence a list.
topicMessageStreams.values.foreach { streams => // one List[KafkaStream] per topic
  streams.foreach { stream => // iterate over that topic's streams
    messageHandlerThreadPool.submit(new MessageHandler(stream)) // one handler thread per consumer stream
  }
}
Let's look at what the MessageHandler thread does.
For every message received, it calls the storeMessageAndMetadata method once:
private final class MessageHandler(stream: KafkaStream[K, V]) extends Runnable {
override def run(): Unit = {
while (!isStopped) {
try {
val streamIterator = stream.iterator()
while (streamIterator.hasNext) {
storeMessageAndMetadata(streamIterator.next)
}
} catch {
case e: Exception =>
reportError("Error handling message", e)
}
}
}
}
The storeMessageAndMetadata method:
private def storeMessageAndMetadata(
    msgAndMetadata: MessageAndMetadata[K, V]): Unit = {
  val topicAndPartition = TopicAndPartition(msgAndMetadata.topic, msgAndMetadata.partition)
  val data = (msgAndMetadata.key, msgAndMetadata.message)
  val metadata = (topicAndPartition, msgAndMetadata.offset)
  blockGenerator.addDataWithCallback(data, metadata)
}