Source analysis: how Spark Streaming implements DirectStream via the Kafka simple API

1- General steps for using the Kafka simple API (a sketch follows the list)

  • Find an active Broker and find out which Broker is the leader for your topic and partition
  • Determine who the replica Brokers are for your topic and partition
  • Build the request defining what data you are interested in
  • Fetch the data
  • Identify and recover from leader changes

The main reason to use the Low Level Consumer (Simple Consumer) is that it gives the application finer control over consumption than the Consumer Group API does, for example:

  • Reading the same message multiple times
  • Reading only some of a topic's partitions
  • Managing transactions, so that each message is processed exactly once

Compared with the Consumer Group API, the Low Level Consumer requires the application to do a fair amount of extra work:

  • It must track offsets itself in order to know which message to consume next
  • It must discover programmatically which broker is the leader of each partition
  • It must handle leader changes

Java examples of the low-level consumer API can be found at:
http://www.cnblogs.com/fxjwind/p/3794255.html
http://zqhxuyuan.github.io/2016/02/20/Kafka-Consumer-New/

2- Source code analysis

2-1 Computing offsets

Two methods are involved: one computes the latest offsets and the other the earliest, but both are implemented by calling getLeaderOffsets:

  def getLatestLeaderOffsets(
      topicAndPartitions: Set[TopicAndPartition]
    ): Either[Err, Map[TopicAndPartition, LeaderOffset]] =
    getLeaderOffsets(topicAndPartitions, OffsetRequest.LatestTime)

  def getEarliestLeaderOffsets(
      topicAndPartitions: Set[TopicAndPartition]
    ): Either[Err, Map[TopicAndPartition, LeaderOffset]] =
    getLeaderOffsets(topicAndPartitions, OffsetRequest.EarliestTime)

  def getLeaderOffsets(
      topicAndPartitions: Set[TopicAndPartition],
      before: Long
    ): Either[Err, Map[TopicAndPartition, LeaderOffset]] = {
    getLeaderOffsets(topicAndPartitions, before, 1).right.map { r =>
      r.map { kv =>
        // mapValues isn't serializable, see SI-7005
        kv._1 -> kv._2.head
      }
    }
  }
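
A hedged usage sketch (not code from Spark itself): a caller can resolve per-partition starting offsets from these methods roughly as below, assuming a Spark version where KafkaCluster is public and using placeholder broker and topic names.

import kafka.common.TopicAndPartition
import org.apache.spark.SparkException
import org.apache.spark.streaming.kafka.KafkaCluster

val kc = new KafkaCluster(Map("metadata.broker.list" -> "broker1:9092"))
val tps = Set(TopicAndPartition("mytopic", 0))
// pick the earliest available offset per partition (auto.offset.reset=smallest)
val fromOffsets: Map[TopicAndPartition, Long] =
  kc.getEarliestLeaderOffsets(tps) match {
    case Right(offs) => offs.map { case (tp, lo) => tp -> lo.offset }
    case Left(errs)  => throw new SparkException(errs.mkString("\n"))
  }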

The overloaded getLeaderOffsets method:

def getLeaderOffsets(
      topicAndPartitions: Set[TopicAndPartition],
      before: Long,
      maxNumOffsets: Int
    ): Either[Err, Map[TopicAndPartition, Seq[LeaderOffset]]] = {
    findLeaders(topicAndPartitions).right.flatMap { tpToLeader =>
      val leaderToTp: Map[(String, Int), Seq[TopicAndPartition]] = flip(tpToLeader)
      val leaders = leaderToTp.keys
      var result = Map[TopicAndPartition, Seq[LeaderOffset]]()
      val errs = new Err
      withBrokers(leaders, errs) { consumer => // connect to each leader in turn and fetch offsets for the partitions it leads
        val partitionsToGetOffsets: Seq[TopicAndPartition] =
          leaderToTp((consumer.host, consumer.port))
        val reqMap = partitionsToGetOffsets.map { tp: TopicAndPartition =>
          tp -> PartitionOffsetRequestInfo(before, maxNumOffsets)
        }.toMap
        val req = OffsetRequest(reqMap)
        val resp = consumer.getOffsetsBefore(req)
        val respMap = resp.partitionErrorAndOffsets
        partitionsToGetOffsets.foreach { tp: TopicAndPartition =>
          respMap.get(tp).foreach { por: PartitionOffsetsResponse =>
            if (por.error == ErrorMapping.NoError) {
              if (por.offsets.nonEmpty) {
                result += tp -> por.offsets.map { off =>
                  LeaderOffset(consumer.host, consumer.port, off)
                }
              } else {
                errs += new SparkException(
                  s"Empty offsets for ${tp}, is ${before} before log beginning?")
              }
            } else {
              errs += ErrorMapping.exceptionFor(por.error)
            }
          }
        }
        if (result.keys.size == topicAndPartitions.size) {
          return Right(result)
        }
      }
      val missing = topicAndPartitions.diff(result.keySet)
      errs += new SparkException(s"Couldn't find leader offsets for ${missing}")
      Left(errs)
    }
  }
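
Two helpers used above do not appear in this post. withBrokers connects a SimpleConsumer to each given (host, port) in turn, runs the supplied block, and accumulates any exceptions into errs; flip inverts the partition-to-leader map so that all partitions led by the same broker can be queried in a single request. A minimal equivalent of flip (the version in KafkaCluster is essentially this):

  private def flip[K, V](m: Map[K, V]): Map[V, Seq[K]] =
    m.groupBy(_._2).map { case (v, kvs) => v -> kvs.keys.toSeq }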

The findLeaders method implements the first step listed above, "Find an active Broker and find out which Broker is the leader for your topic and partition":

  def findLeaders(
      topicAndPartitions: Set[TopicAndPartition]
    ): Either[Err, Map[TopicAndPartition, (String, Int)]] = {
    val topics = topicAndPartitions.map(_.topic)
    val response = getPartitionMetadata(topics).right // getPartitionMetadata returns the partitions' metadata, really a Set[TopicMetadata]
    val answer = response.flatMap { tms: Set[TopicMetadata] => // iterate over that Set
      val leaderMap = tms.flatMap { tm: TopicMetadata => // each TopicMetadata: the topic name plus PartitionMetadata for all of its partitions
        tm.partitionsMetadata.flatMap { pm: PartitionMetadata => // iterate over each partition's metadata
          val tp = TopicAndPartition(tm.topic, pm.partitionId)
          if (topicAndPartitions(tp)) { // if this (topic, partition) is one we asked about, record its leader
            pm.leader.map { l =>
              tp -> (l.host -> l.port)
            }
          } else {
            None
          }
        }
      }.toMap

      if (leaderMap.keys.size == topicAndPartitions.size) {
        Right(leaderMap)
      } else {
        val missing = topicAndPartitions.diff(leaderMap.keySet)
        val err = new Err
        err += new SparkException(s"Couldn't find leaders for ${missing}")
        Left(err)
      }
    }
    answer
  }
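
getPartitionMetadata itself is not shown in this post; in KafkaCluster it boils down to sending a TopicMetadataRequest to the configured seed brokers. A simplified sketch (error reporting abridged):

  def getPartitionMetadata(topics: Set[String]): Either[Err, Set[TopicMetadata]] = {
    val req = TopicMetadataRequest(
      TopicMetadataRequest.CurrentVersion, 0, config.clientId, topics.toSeq)
    val errs = new Err
    withBrokers(Random.shuffle(config.seedBrokers), errs) { consumer =>
      val resp: TopicMetadataResponse = consumer.send(req)
      // take the first broker whose response carries no per-topic error
      if (resp.topicsMetadata.forall(_.errorCode == ErrorMapping.NoError)) {
        return Right(resp.topicsMetadata.toSet)
      }
    }
    Left(errs)
  }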

The definition of TopicMetadata shows that a topic's metadata is essentially the topic name plus the metadata of each of its partitions:

case class TopicMetadata(topic: String,
                         partitionsMetadata: Seq[PartitionMetadata],
                         errorCode: Short = ErrorMapping.NoError)

The definition of PartitionMetadata shows that a partition's metadata consists of the partition id, the leader broker, the replica brokers, and the ISR (the in-sync replicas, i.e. the replicas that are fully caught up with the leader's log):

case class PartitionMetadata(partitionId: Int, 
                             val leader: Option[Broker], 
                             replicas: Seq[Broker], 
                             isr: Seq[Broker] = Seq.empty,
                             errorCode: Short = ErrorMapping.NoError) 
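
As an illustration, these fields can be inspected straight from a metadata response; a small sketch with placeholder broker and topic names:

import kafka.api.TopicMetadataRequest
import kafka.cluster.Broker
import kafka.consumer.SimpleConsumer

val consumer = new SimpleConsumer("broker1", 9092, 10000, 64 * 1024, "meta-inspect")
val resp = consumer.send(new TopicMetadataRequest(Seq("mytopic"), 0))
for (tm <- resp.topicsMetadata; pm <- tm.partitionsMetadata) {
  def fmt(bs: Seq[Broker]) = bs.map(b => s"${b.host}:${b.port}").mkString(",")
  val leader = pm.leader.map(b => s"${b.host}:${b.port}").getOrElse("none")
  println(s"${tm.topic}/${pm.partitionId} leader=$leader " +
    s"replicas=[${fmt(pm.replicas)}] isr=[${fmt(pm.isr)}]")
}
consumer.close()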

Reference:
http://www.infoq.com/cn/articles/kafka-analysis-part-4
