Source analysis: how Spark Streaming implements DirectStream via the Kafka simple API

1- General steps for using the Kafka simple API (a sketch follows the list)

  • Find an active Broker and find out which Broker is the leader for your topic and partition
  • Determine who the replica Brokers are for your topic and partition
  • Build the request defining what data you are interested in
  • Fetch the data
  • Identify and recover from leader changes

The main reason to use the Low Level Consumer (Simple Consumer) is that it gives the application finer control over consumption than the Consumer Group API does, for example:

  • Reading the same message multiple times
  • Reading only some of a topic's partitions
  • Managing transactions, so that each message is processed exactly once

Compared with the Consumer Group API, the Low Level Consumer requires the application to do a fair amount of extra work:

  • It must track offsets itself in order to know which message to consume next
  • It must discover programmatically which broker is the leader of each partition
  • It must handle leader changes

Java examples of the low-level consumer API can be found at:
http://www.cnblogs.com/fxjwind/p/3794255.html
http://zqhxuyuan.github.io/2016/02/20/Kafka-Consumer-New/

2- Source code analysis

2-1 Computing offsets

Two methods are involved: one computes the latest offsets and the other the earliest, but both are implemented by calling getLeaderOffsets:

  def getLatestLeaderOffsets(
      topicAndPartitions: Set[TopicAndPartition]
    ): Either[Err, Map[TopicAndPartition, LeaderOffset]] =
    getLeaderOffsets(topicAndPartitions, OffsetRequest.LatestTime)

  def getEarliestLeaderOffsets(
      topicAndPartitions: Set[TopicAndPartition]
    ): Either[Err, Map[TopicAndPartition, LeaderOffset]] =
    getLeaderOffsets(topicAndPartitions, OffsetRequest.EarliestTime)

  def getLeaderOffsets(
      topicAndPartitions: Set[TopicAndPartition],
      before: Long
    ): Either[Err, Map[TopicAndPartition, LeaderOffset]] = {
    getLeaderOffsets(topicAndPartitions, before, 1).right.map { r =>
      r.map { kv =>
        // mapValues isn't serializable, see SI-7005
        kv._1 -> kv._2.head
      }
    }
  }
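
A hedged usage sketch (not code from Spark itself): a caller can resolve per-partition starting offsets from these methods roughly as below, assuming a Spark version where KafkaCluster is public and using placeholder broker and topic names.

import kafka.common.TopicAndPartition
import org.apache.spark.SparkException
import org.apache.spark.streaming.kafka.KafkaCluster

val kc = new KafkaCluster(Map("metadata.broker.list" -> "broker1:9092"))
val tps = Set(TopicAndPartition("mytopic", 0))
// pick the earliest available offset per partition (auto.offset.reset=smallest)
val fromOffsets: Map[TopicAndPartition, Long] =
  kc.getEarliestLeaderOffsets(tps) match {
    case Right(offs) => offs.map { case (tp, lo) => tp -> lo.offset }
    case Left(errs)  => throw new SparkException(errs.mkString("\n"))
  }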

The overloaded getLeaderOffsets method:

def getLeaderOffsets(
      topicAndPartitions: Set[TopicAndPartition],
      before: Long,
      maxNumOffsets: Int
    ): Either[Err, Map[TopicAndPartition, Seq[LeaderOffset]]] = {
    findLeaders(topicAndPartitions).right.flatMap { tpToLeader =>
      val leaderToTp: Map[(String, Int), Seq[TopicAndPartition]] = flip(tpToLeader)
      val leaders = leaderToTp.keys
      var result = Map[TopicAndPartition, Seq[LeaderOffset]]()
      val errs = new Err
      withBrokers(leaders, errs) { consumer => // connect to each leader in turn and fetch offsets for the partitions it leads
        val partitionsToGetOffsets: Seq[TopicAndPartition] =
          leaderToTp((consumer.host, consumer.port))
        val reqMap = partitionsToGetOffsets.map { tp: TopicAndPartition =>
          tp -> PartitionOffsetRequestInfo(before, maxNumOffsets)
        }.toMap
        val req = OffsetRequest(reqMap)
        val resp = consumer.getOffsetsBefore(req)
        val respMap = resp.partitionErrorAndOffsets
        partitionsToGetOffsets.foreach { tp: TopicAndPartition =>
          respMap.get(tp).foreach { por: PartitionOffsetsResponse =>
            if (por.error == ErrorMapping.NoError) {
              if (por.offsets.nonEmpty) {
                result += tp -> por.offsets.map { off =>
                  LeaderOffset(consumer.host, consumer.port, off)
                }
              } else {
                errs += new SparkException(
                  s"Empty offsets for ${tp}, is ${before} before log beginning?")
              }
            } else {
              errs += ErrorMapping.exceptionFor(por.error)
            }
          }
        }
        if (result.keys.size == topicAndPartitions.size) {
          return Right(result)
        }
      }
      val missing = topicAndPartitions.diff(result.keySet)
      errs += new SparkException(s"Couldn't find leader offsets for ${missing}")
      Left(errs)
    }
  }
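
Two helpers used above do not appear in this post. withBrokers connects a SimpleConsumer to each given (host, port) in turn, runs the supplied block, and accumulates any exceptions into errs; flip inverts the partition-to-leader map so that all partitions led by the same broker can be queried in a single request. A minimal equivalent of flip (the version in KafkaCluster is essentially this):

  private def flip[K, V](m: Map[K, V]): Map[V, Seq[K]] =
    m.groupBy(_._2).map { case (v, kvs) => v -> kvs.keys.toSeq }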

The findLeaders method implements the first step listed above, "Find an active Broker and find out which Broker is the leader for your topic and partition":

  def findLeaders(
      topicAndPartitions: Set[TopicAndPartition]
    ): Either[Err, Map[TopicAndPartition, (String, Int)]] = {
    val topics = topicAndPartitions.map(_.topic)
    val response = getPartitionMetadata(topics).right // getPartitionMetadata returns the partitions' metadata, really a Set[TopicMetadata]
    val answer = response.flatMap { tms: Set[TopicMetadata] => // iterate over that Set
      val leaderMap = tms.flatMap { tm: TopicMetadata => // each TopicMetadata: the topic name plus PartitionMetadata for all of its partitions
        tm.partitionsMetadata.flatMap { pm: PartitionMetadata => // iterate over each partition's metadata
          val tp = TopicAndPartition(tm.topic, pm.partitionId)
          if (topicAndPartitions(tp)) { // if this (topic, partition) is one we asked about, record its leader
            pm.leader.map { l =>
              tp -> (l.host -> l.port)
            }
          } else {
            None
          }
        }
      }.toMap

      if (leaderMap.keys.size == topicAndPartitions.size) {
        Right(leaderMap)
      } else {
        val missing = topicAndPartitions.diff(leaderMap.keySet)
        val err = new Err
        err += new SparkException(s"Couldn't find leaders for ${missing}")
        Left(err)
      }
    }
    answer
  }
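
getPartitionMetadata itself is not shown in this post; in KafkaCluster it boils down to sending a TopicMetadataRequest to the configured seed brokers. A simplified sketch (error reporting abridged):

  def getPartitionMetadata(topics: Set[String]): Either[Err, Set[TopicMetadata]] = {
    val req = TopicMetadataRequest(
      TopicMetadataRequest.CurrentVersion, 0, config.clientId, topics.toSeq)
    val errs = new Err
    withBrokers(Random.shuffle(config.seedBrokers), errs) { consumer =>
      val resp: TopicMetadataResponse = consumer.send(req)
      // take the first broker whose response carries no per-topic error
      if (resp.topicsMetadata.forall(_.errorCode == ErrorMapping.NoError)) {
        return Right(resp.topicsMetadata.toSet)
      }
    }
    Left(errs)
  }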

The definition of TopicMetadata shows that a topic's metadata is essentially the topic name plus the metadata of each of its partitions:

case class TopicMetadata(topic: String,
                         partitionsMetadata: Seq[PartitionMetadata],
                         errorCode: Short = ErrorMapping.NoError)

The definition of PartitionMetadata shows that a partition's metadata consists of the partition id, the leader broker, the replica brokers, and the ISR (the in-sync replicas, i.e. the replicas that are fully caught up with the leader's log):

case class PartitionMetadata(partitionId: Int, 
                             val leader: Option[Broker], 
                             replicas: Seq[Broker], 
                             isr: Seq[Broker] = Seq.empty,
                             errorCode: Short = ErrorMapping.NoError) 
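
As an illustration, these fields can be inspected straight from a metadata response; a small sketch with placeholder broker and topic names:

import kafka.api.TopicMetadataRequest
import kafka.cluster.Broker
import kafka.consumer.SimpleConsumer

val consumer = new SimpleConsumer("broker1", 9092, 10000, 64 * 1024, "meta-inspect")
val resp = consumer.send(new TopicMetadataRequest(Seq("mytopic"), 0))
for (tm <- resp.topicsMetadata; pm <- tm.partitionsMetadata) {
  def fmt(bs: Seq[Broker]) = bs.map(b => s"${b.host}:${b.port}").mkString(",")
  val leader = pm.leader.map(b => s"${b.host}:${b.port}").getOrElse("none")
  println(s"${tm.topic}/${pm.partitionId} leader=$leader " +
    s"replicas=[${fmt(pm.replicas)}] isr=[${fmt(pm.isr)}]")
}
consumer.close()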

Reference:
http://www.infoq.com/cn/articles/kafka-analysis-part-4
