How Kafka Provides Its Distributed Guarantees

As everyone knows, Kafka runs as a cluster. So how does Kafka guarantee data consistency, and how do the brokers in the cluster interact with one another and with consumers?
First, let's go over a few terms:

AR: all the replicas of a partition, collectively called the AR (Assigned Replicas).

ISR: all the replicas that stay in sync with the leader to a certain degree (including the leader itself) form the ISR (In-Sync Replicas). The ISR is a subset of the AR.
replica.lag.time.max.ms is the maximum time a follower may lag behind the leader, 30 seconds by default. As long as a follower keeps sending FetchRequests to the leader and catches up to the leader's log end offset within each 30-second window, it will not be marked as dead and kicked out of the ISR (a minimal sketch of this rule follows the definitions below).

OSR: replicas that lag too far behind the leader (excluding the leader itself) form the OSR (Out-of-Sync Replicas). It follows that AR = ISR + OSR.
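
A minimal, self-contained sketch of the lag rule above (the names here are illustrative, not Kafka's internal classes): a follower counts as out of sync once it has not caught up for longer than replica.lag.time.max.ms.

object IsrLagCheck {
  // Illustrative follower state: when this replica last caught up to the leader's LEO.
  final case class FollowerState(brokerId: Int, lastCaughtUpTimeMs: Long)

  def isOutOfSync(f: FollowerState, nowMs: Long, replicaLagTimeMaxMs: Long): Boolean =
    nowMs - f.lastCaughtUpTimeMs > replicaLagTimeMaxMs

  def main(args: Array[String]): Unit = {
    val now = System.currentTimeMillis()
    val laggard = FollowerState(brokerId = 2, lastCaughtUpTimeMs = now - 40000L)
    // 40 s without catching up exceeds the 30 s default, so this replica leaves the ISR.
    println(isOutOfSync(laggard, now, replicaLagTimeMaxMs = 30000L)) // true
  }
}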

How the ISR shrinks and grows:
When Kafka starts up, it schedules two ISR-related periodic tasks, named "isr-expiration" and "isr-change-propagation":

scheduler.schedule("isr-expiration", maybeShrinkIsr _, period = config.replicaLagTimeMaxMs / 2, unit = TimeUnit.MILLISECONDS)
scheduler.schedule("isr-change-propagation", maybePropagateIsrChanges _, period = 2500L, unit = TimeUnit.MILLISECONDS)
scheduler.schedule("shutdown-idle-replica-alter-log-dirs-thread", shutdownIdleReplicaAlterLogDirsThread _, period = 10000L, unit = TimeUnit.MILLISECONDS)

The isr-expiration task periodically checks whether each partition needs to shrink its ISR. Its period is tied to the replica.lag.time.max.ms parameter: it runs at half that value. When it detects a failed replica in the ISR, it shrinks the ISR, and whenever a partition's ISR changes, the updated data is recorded in the corresponding ZooKeeper node /brokers/topics/<topic>/partitions/<partition>/state. Sample node contents:

{"controller_epoch":26,"leader":0,"version":1,"leader_epoch":2,"isr":[0,1]}

The znode paths and the codecs for this data are defined in kafka.zk.ZkData:
object BrokersZNode {
  def path = "/brokers"
}

object TopicsZNode {
  def path = s"${BrokersZNode.path}/topics"
}

object TopicZNode {
  def path(topic: String) = s"${TopicsZNode.path}/$topic"
  def encode(assignment: collection.Map[TopicPartition, ReplicaAssignment]): Array[Byte] = {
    val replicaAssignmentJson = mutable.Map[String, util.List[Int]]()
    val addingReplicasAssignmentJson = mutable.Map[String, util.List[Int]]()
    val removingReplicasAssignmentJson = mutable.Map[String, util.List[Int]]()

    for ((partition, replicaAssignment) <- assignment) {
      replicaAssignmentJson += (partition.partition.toString -> replicaAssignment.replicas.asJava)
      if (replicaAssignment.addingReplicas.nonEmpty)
        addingReplicasAssignmentJson += (partition.partition.toString -> replicaAssignment.addingReplicas.asJava)
      if (replicaAssignment.removingReplicas.nonEmpty)
        removingReplicasAssignmentJson += (partition.partition.toString -> replicaAssignment.removingReplicas.asJava)
    }

    Json.encodeAsBytes(Map(
      "version" -> 2,
      "partitions" -> replicaAssignmentJson.asJava,
      "adding_replicas" -> addingReplicasAssignmentJson.asJava,
      "removing_replicas" -> removingReplicasAssignmentJson.asJava
    ).asJava)
  }
  def decode(topic: String, bytes: Array[Byte]): Map[TopicPartition, ReplicaAssignment] = {
    def getReplicas(replicasJsonOpt: Option[JsonObject], partition: String): Seq[Int] = {
      replicasJsonOpt match {
        case Some(replicasJson) => replicasJson.get(partition) match {
          case Some(ar) => ar.to[Seq[Int]]
          case None => Seq.empty[Int]
        }
        case None => Seq.empty[Int]
      }
    }

    Json.parseBytes(bytes).flatMap { js =>
      val assignmentJson = js.asJsonObject
      val partitionsJsonOpt = assignmentJson.get("partitions").map(_.asJsonObject)
      val addingReplicasJsonOpt = assignmentJson.get("adding_replicas").map(_.asJsonObject)
      val removingReplicasJsonOpt = assignmentJson.get("removing_replicas").map(_.asJsonObject)
      partitionsJsonOpt.map { partitionsJson =>
        partitionsJson.iterator.map { case (partition, replicas) =>
          new TopicPartition(topic, partition.toInt) -> ReplicaAssignment(
            replicas.to[Seq[Int]],
            getReplicas(addingReplicasJsonOpt, partition),
            getReplicas(removingReplicasJsonOpt, partition)
          )
        }
      }
    }.map(_.toMap).getOrElse(Map.empty)
  }
}


object TopicPartitionsZNode {
  def path(topic: String) = s"${TopicZNode.path(topic)}/partitions"
}

object TopicPartitionZNode {
  def path(partition: TopicPartition) = s"${TopicPartitionsZNode.path(partition.topic)}/${partition.partition}"
}

object TopicPartitionStateZNode {
  def path(partition: TopicPartition) = s"${TopicPartitionZNode.path(partition)}/state"
  def encode(leaderIsrAndControllerEpoch: LeaderIsrAndControllerEpoch): Array[Byte] = {
    val leaderAndIsr = leaderIsrAndControllerEpoch.leaderAndIsr
    val controllerEpoch = leaderIsrAndControllerEpoch.controllerEpoch
    Json.encodeAsBytes(Map("version" -> 1, "leader" -> leaderAndIsr.leader, "leader_epoch" -> leaderAndIsr.leaderEpoch,
      "controller_epoch" -> controllerEpoch, "isr" -> leaderAndIsr.isr.asJava).asJava)
  }
  def decode(bytes: Array[Byte], stat: Stat): Option[LeaderIsrAndControllerEpoch] = {
    Json.parseBytes(bytes).map { js =>
      val leaderIsrAndEpochInfo = js.asJsonObject
      val leader = leaderIsrAndEpochInfo("leader").to[Int]
      val epoch = leaderIsrAndEpochInfo("leader_epoch").to[Int]
      val isr = leaderIsrAndEpochInfo("isr").to[List[Int]]
      val controllerEpoch = leaderIsrAndEpochInfo("controller_epoch").to[Int]
      val zkPathVersion = stat.getVersion
      LeaderIsrAndControllerEpoch(LeaderAndIsr(leader, epoch, isr, zkPathVersion), controllerEpoch)
    }
  }
}
// update ISR in zk and in cache
private[cluster] def shrinkIsr(newIsr: Set[Int]): Unit = {
  val newLeaderAndIsr = new LeaderAndIsr(localBrokerId, leaderEpoch, newIsr.toList, zkVersion)
  val zkVersionOpt = stateStore.shrinkIsr(controllerEpoch, newLeaderAndIsr)
  maybeUpdateIsrAndVersion(newIsr, zkVersionOpt)
}
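
To tie the pieces together, here is a simplified, self-contained sketch of what the isr-expiration task effectively does for one partition (an illustration of the logic above, not the actual Kafka source):

object IsrShrinkSketch {
  // Hypothetical per-partition shrink pass: drop every follower that has not
  // caught up within replica.lag.time.max.ms, then persist the smaller ISR.
  def shrinkIsr(isr: Set[Int], leaderId: Int,
                lastCaughtUpTimeMs: Map[Int, Long], replicaLagTimeMaxMs: Long): Set[Int] = {
    val now = System.currentTimeMillis()
    isr.filterNot(id => id != leaderId && now - lastCaughtUpTimeMs(id) > replicaLagTimeMaxMs)
  }

  def main(args: Array[String]): Unit = {
    val now = System.currentTimeMillis()
    val caughtUp = Map(0 -> now, 1 -> now, 2 -> (now - 60000L)) // broker 2 stalled for 60 s
    // Broker 2 exceeds the 30 s default lag, so the ISR shrinks from {0,1,2} to {0,1};
    // the real code would then write the new ISR to the partition's state znode.
    println(shrinkIsr(Set(0, 1, 2), leaderId = 0, caughtUp, replicaLagTimeMaxMs = 30000L))
  }
}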

In that state-node payload, controller_epoch is the epoch of the current Kafka controller, leader is the id of the broker hosting the partition's leader replica, version is the schema version (fixed at 1 in the current version), leader_epoch is the partition's leader epoch, and isr is the ISR list after the change.

Besides that, whenever the ISR changes, the change record is also cached in isrChangeSet. The isr-change-propagation task periodically (at a fixed 2500 ms) checks isrChangeSet; if it finds ISR change records there, it creates a persistent sequential node under ZooKeeper's /isr_change_notification path, named with the isr_change_ prefix, for example:

/isr_change_notification/isr_change_0000000000

object IsrChangeNotificationZNode {
  def path = "/isr_change_notification"
}

object IsrChangeNotificationSequenceZNode {
  val SequenceNumberPrefix = "isr_change_"
  def path(sequenceNumber: String = "") = s"${IsrChangeNotificationZNode.path}/$SequenceNumberPrefix$sequenceNumber"
  def encode(partitions: collection.Set[TopicPartition]): Array[Byte] = {
    val partitionsJson = partitions.map(partition => Map("topic" -> partition.topic, "partition" -> partition.partition).asJava)
    Json.encodeAsBytes(Map("version" -> IsrChangeNotificationHandler.Version, "partitions" -> partitionsJson.asJava).asJava)
  }

  def decode(bytes: Array[Byte]): Set[TopicPartition] = {
    Json.parseBytes(bytes).map { js =>
      val partitionsJson = js.asJsonObject("partitions").asJsonArray
      partitionsJson.iterator.map { partitionsJson =>
        val partitionJson = partitionsJson.asJsonObject
        val topic = partitionJson("topic").to[String]
        val partition = partitionJson("partition").to[Int]
        new TopicPartition(topic, partition)
      }
    }.map(_.toSet).getOrElse(Set.empty)
  }
  def sequenceNumber(path: String) = path.substring(path.lastIndexOf(SequenceNumberPrefix) + SequenceNumberPrefix.length)
}


The information in isrChangeSet is saved into this node. The Kafka controller registers a Watcher for /isr_change_notification; when the children of this node change, the Watcher fires, notifying the controller to refresh the relevant metadata and send metadata-update requests to the brokers it manages. Finally, the already-processed nodes under /isr_change_notification are deleted. Triggering this Watcher too often would hurt the performance of the Kafka controller, ZooKeeper, and even the other brokers. To avoid that, Kafka adds a guard: after detecting an ISR change, it also checks the following two conditions (either one suffices, as the code below shows):
1. The last ISR change happened more than 5 seconds ago.
2. The last write to ZooKeeper happened more than 60 seconds ago.

  /**
   * Gets the isr change notifications as strings. These strings are the znode names and not the absolute znode path.
   * @return sequence of znode names and not the absolute znode path.
   */
  def getAllIsrChangeNotifications: Seq[String] = {
    val getChildrenResponse = retryRequestUntilConnected(GetChildrenRequest(IsrChangeNotificationZNode.path, registerWatch = true))
    getChildrenResponse.resultCode match {
      case Code.OK => getChildrenResponse.children.map(IsrChangeNotificationSequenceZNode.sequenceNumber)
      case Code.NONODE => Seq.empty
      case _ => throw getChildrenResponse.resultException.get
    }
  }
  
  private val lastIsrChangeMs = new AtomicLong(System.currentTimeMillis())
  private val lastIsrPropagationMs = new AtomicLong(System.currentTimeMillis())
  
  val HighWatermarkFilename = "replication-offset-checkpoint"
  val IsrChangePropagationBlackOut = 5000L
  val IsrChangePropagationInterval = 60000L
  
  def maybePropagateIsrChanges(): Unit = {
    val now = System.currentTimeMillis()
    isrChangeSet synchronized {
      if (isrChangeSet.nonEmpty &&
        (lastIsrChangeMs.get() + ReplicaManager.IsrChangePropagationBlackOut < now ||
          lastIsrPropagationMs.get() + ReplicaManager.IsrChangePropagationInterval < now)) {
        zkClient.propagateIsrChanges(isrChangeSet)
        isrChangeSet.clear()
        lastIsrPropagationMs.set(now)
      }
    }
  }


Where there is shrinking there must also be growing, so when does Kafka expand the ISR?

As a follower keeps synchronizing messages, its LEO steadily moves forward and eventually catches up with the leader, at which point the follower becomes eligible to enter the ISR. The criterion for having caught up is whether the follower's LEO is no longer less than the leader's HW; it is not a comparison against the leader's LEO. After the ISR expands, the ZooKeeper node /brokers/topics/<topic>/partitions/<partition>/state and isrChangeSet are updated in the same way, and the subsequent steps are identical to those for ISR shrinkage.
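
In sketch form (illustrative only; LEO and HW are defined just below), the re-admission test is a single comparison:

// The follower becomes eligible to rejoin the ISR once its LEO has reached the leader's HW.
def caughtUp(followerLeo: Long, leaderHw: Long): Boolean = followerLeo >= leaderHw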
 
LEO and HW were just mentioned:
 
HW: the High Watermark. It marks a specific message offset; consumers can only pull messages before this offset.
 
LEO: the Log End Offset. It marks the offset of the next message to be written into the current log file; in other words, the LEO equals the offset of the last message in the partition's log plus 1.
 
The ISR is closely related to HW and LEO, as the figure below shows:

[Figure: a partition's replicas with their individual LEOs; the HW sits at the smallest LEO in the ISR]

The HW is the minimum LEO within the ISR; in the figure, if follower2 were removed from the ISR, the HW would become 5.
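
Numerically, that relationship can be sketched like this (a toy illustration, with values chosen so that evicting follower2 moves the HW to 5, as described above):

object HighWatermarkDemo {
  // The partition HW is the minimum LEO among the replicas currently in the ISR.
  def highWatermark(leoByBroker: Map[Int, Long], isr: Set[Int]): Long =
    isr.map(leoByBroker).min

  def main(args: Array[String]): Unit = {
    val leos = Map(0 -> 9L, 1 -> 5L, 2 -> 3L) // broker id -> LEO; broker 0 is the leader
    println(highWatermark(leos, Set(0, 1, 2))) // 3: held back by the slow follower2
    println(highWatermark(leos, Set(0, 1)))    // 5: follower2 evicted, the HW advances
  }
}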
 
Next, let's look at the flow of a producer storing messages:

[Figure: the produce request's path from the client through the network layer, API layer, ReplicaManager, and the log subsystem]

So what exactly happens, step by step?

1. The producer sends a request to append messages to certain partitions.
2. The ProduceRequest travels through the network layer and the API layer to the ReplicaManager, which hands the messages to the log subsystem to be appended to the corresponding logs; at the same time it checks the DelayedFetch operations registered in delayedFetchPurgatory under the relevant keys, completing any whose conditions are now met.
3. The log subsystem returns the result of the append.
4. The ReplicaManager creates a DelayedProduce for the ProduceRequest and hands it to delayedProducePurgatory to manage.
5. delayedProducePurgatory uses a SystemTimer to track whether the DelayedProduce times out.
6. Followers in the ISR send FetchRequests to synchronize messages from the leader; each fetch also checks whether the DelayedProduce can now complete (see the sketch after this list).
7. When the DelayedProduce executes, its callback builds a ProduceResponse and adds it to the RequestChannel.
8. The network layer returns the ProduceResponse to the client.
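
Here is a toy sketch of the DelayedProduce idea from steps 4 to 7 (illustrative only, far simpler than the real purgatory): the client's response is held until the HW has passed the end of the appended batch, which is driven by the follower fetches in step 6.

object DelayedProduceSketch {
  // Holds the client's response until the required offset is fully replicated.
  final case class DelayedProduce(requiredOffset: Long, respond: () => Unit) {
    def tryComplete(highWatermark: Long): Boolean =
      if (highWatermark >= requiredOffset) { respond(); true } else false
  }

  def main(args: Array[String]): Unit = {
    val op = DelayedProduce(42L, () => println("ProduceResponse queued"))
    println(op.tryComplete(highWatermark = 40L)) // false: ISR followers not caught up yet
    println(op.tryComplete(highWatermark = 42L)) // true: HW advanced, client gets its answer
  }
}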

What we need to focus on this time are steps 5 and 6: how does the leader node notify the followers?

 def appendRecordsToLeader(records: MemoryRecords, origin: AppendOrigin, requiredAcks: Int): LogAppendInfo = {
    val (info, leaderHWIncremented) = inReadLock(leaderIsrUpdateLock) {
      leaderLogIfLocal match {
        case Some(leaderLog) =>
          val minIsr = leaderLog.config.minInSyncReplicas
          val inSyncSize = inSyncReplicaIds.size

          // Avoid writing to leader if there are not enough insync replicas to make it safe
          if (inSyncSize < minIsr && requiredAcks == -1) {
            throw new NotEnoughReplicasException(s"The size of the current ISR $inSyncReplicaIds " +
              s"is insufficient to satisfy the min.isr requirement of $minIsr for partition $topicPartition")
          }

          val info = leaderLog.appendAsLeader(records, leaderEpoch = this.leaderEpoch, origin,
            interBrokerProtocolVersion)

          // we may need to increment the high watermark since the ISR could be down to 1
          (info, maybeIncrementLeaderHW(leaderLog))

        case None =>
          throw new NotLeaderOrFollowerException("Leader not local for partition %s on broker %d"
            .format(topicPartition, localBrokerId))
      }
    }

    // some delayed operations may be unblocked after HW changed
    if (leaderHWIncremented)
      tryCompleteDelayedRequests()
    else {
      // probably unblock some follower fetch requests since log end offset has been updated
      delayedOperations.checkAndCompleteFetch()
    }

    info
  }
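
From the client's perspective, the min-ISR guard at the top of appendRecordsToLeader is what a producer with acks=all runs into when the ISR has shrunk below min.insync.replicas: the send fails with NotEnoughReplicasException. A minimal producer setup (the broker address and topic name are placeholders):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object AcksAllProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder address
    props.put("acks", "all")                         // arrives at the broker as requiredAcks == -1
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    try {
      // get() surfaces NotEnoughReplicasException when |ISR| < min.insync.replicas.
      producer.send(new ProducerRecord("demo-topic", "key", "value")).get()
    } finally {
      producer.close()
    }
  }
}

Back on the broker, completed appends are released through the purgatory's checkAndComplete: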
  
   /**
   * Check if some delayed operations can be completed with the given watch key,
   * and if yes complete them.
   *
   * @return the number of completed operations during this process
   */
  def checkAndComplete(key: Any): Int = {
    val wl = watcherList(key)
    val watchers = inLock(wl.watchersLock) { wl.watchersByKey.get(key) }
    val numCompleted = if (watchers == null)
      0
    else
      watchers.tryCompleteWatched()
    debug(s"Request key $key unblocked $numCompleted $purgatoryName operations")
    numCompleted
  }


This is what notifies each waiting follower; the follower-side processing logic is:

override def run(): Unit = {
    isStarted = true
    info("Starting")
    try {
      while (isRunning)
        doWork()
    } catch {
      case e: FatalExitError =>
        shutdownInitiated.countDown()
        shutdownComplete.countDown()
        info("Stopped")
        Exit.exit(e.statusCode())
      case e: Throwable =>
        if (isRunning)
          error("Error due to", e)
    } finally {
      shutdownComplete.countDown()
    }
    info("Stopped")
  }
  
 override def doWork(): Unit = {
    maybeTruncate()
    maybeFetch()
  }
  
private def maybeFetch(): Unit = {
    val fetchRequestOpt = inLock(partitionMapLock) {
      val ResultWithPartitions(fetchRequestOpt, partitionsWithError) = buildFetch(partitionStates.partitionStateMap.asScala)

      handlePartitionsWithErrors(partitionsWithError, "maybeFetch")

      if (fetchRequestOpt.isEmpty) {
        trace(s"There are no active partitions. Back off for $fetchBackOffMs ms before sending a fetch request")
        partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
      }

      fetchRequestOpt
    }

    fetchRequestOpt.foreach { case ReplicaFetch(sessionPartitions, fetchRequest) =>
      processFetchRequest(sessionPartitions, fetchRequest)
    }
  }
  

LSO: short for LastStableOffset, which is specifically tied to Kafka transactions.

The consumer-side parameter isolation.level configures the consumer's transaction isolation level. It is a string with two values, "read_uncommitted" and "read_committed", which determine how far the consumer may read. Set to "read_committed", the consumer ignores messages from uncommitted transactions, i.e. it can only consume up to the LSO (LastStableOffset). The default, "read_uncommitted", allows it to consume up to the HW (High Watermark).

Note: follower replicas fetch with an isolation level of "read_uncommitted", and this cannot be changed.
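
On the consumer side, switching the isolation level is a one-line configuration change (a sketch; the broker address, group id, and topic are placeholders):

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object ReadCommittedConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder address
    props.put("group.id", "lso-demo")                // placeholder group id
    props.put("isolation.level", "read_committed")   // read only up to the LSO
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("demo-topic"))
    // Messages from uncommitted transactions are simply not returned here.
    val records = consumer.poll(Duration.ofSeconds(1))
    records.forEach(r => println(s"${r.offset}: ${r.value}"))
    consumer.close()
  }
}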

Suppose a producer, inside an open Kafka transaction, has sent a few messages (msg1, msg2, ...) to the broker. As long as the producer has not committed the transaction (commitTransaction), a consumer with isolation.level=read_committed cannot see these messages, while one with isolation.level=read_uncommitted can. The position of the first message in the transaction is marked as the firstUnstableOffset (here, msg1's position).

For each partition, its Lag equals HW - ConsumerOffset, where ConsumerOffset is the current consumed position. That only covers the ordinary case, though: once transactions are introduced, Lag is computed differently.

For a transaction that has not yet completed, the LSO equals the position of the transaction's first message (the firstUnstableOffset).

For a transaction that has completed, the LSO is equal to the HW. So we can conclude: LSO ≤ HW ≤ LEO.

For a partition with an unfinished transaction, and a consumer configured with isolation.level = "read_committed", the Lag equals LSO - ConsumerOffset.
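
The two cases collapse into one small formula (a sketch of the rule just stated):

object PartitionLag {
  // Under read_committed the lag is measured against the LSO, otherwise against the HW.
  def lag(hw: Long, lso: Long, consumerOffset: Long, readCommitted: Boolean): Long =
    (if (readCommitted) lso else hw) - consumerOffset

  def main(args: Array[String]): Unit = {
    // An open transaction: the LSO (10) trails the HW (15); the consumer sits at offset 8.
    println(lag(hw = 15L, lso = 10L, consumerOffset = 8L, readCommitted = true))  // 2
    println(lag(hw = 15L, lso = 10L, consumerOffset = 8L, readCommitted = false)) // 7
  }
}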

OK, that covers the key points of Kafka's distributed guarantees.


Appendix: the Kafka log append flow

[Figure: Kafka log append flow]
