How Kafka Provides Its Distributed Guarantees

As everyone knows, Kafka runs as a cluster. So how does Kafka guarantee data consistency, and how do the brokers in the cluster interact with one another and with consumers?
First, let's go over a few terms:

AR: all the replicas of a partition, collectively called the AR (Assigned Replicas).

ISR: all the replicas that stay in sync with the leader to a certain degree (including the leader itself) form the ISR (In-Sync Replicas). The ISR is a subset of the AR.
replica.lag.time.max.ms is the maximum time a follower may lag behind the leader, 30 seconds by default. As long as a follower keeps sending FetchRequests to the leader and catches up to the leader's log end offset within each 30-second window, it will not be marked as dead and kicked out of the ISR (a minimal sketch of this rule follows the definitions below).

OSR: replicas that lag too far behind the leader (excluding the leader itself) form the OSR (Out-of-Sync Replicas). It follows that AR = ISR + OSR.
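
A minimal, self-contained sketch of the lag rule above (the names here are illustrative, not Kafka's internal classes): a follower counts as out of sync once it has not caught up for longer than replica.lag.time.max.ms.

object IsrLagCheck {
  // Illustrative follower state: when this replica last caught up to the leader's LEO.
  final case class FollowerState(brokerId: Int, lastCaughtUpTimeMs: Long)

  def isOutOfSync(f: FollowerState, nowMs: Long, replicaLagTimeMaxMs: Long): Boolean =
    nowMs - f.lastCaughtUpTimeMs > replicaLagTimeMaxMs

  def main(args: Array[String]): Unit = {
    val now = System.currentTimeMillis()
    val laggard = FollowerState(brokerId = 2, lastCaughtUpTimeMs = now - 40000L)
    // 40 s without catching up exceeds the 30 s default, so this replica leaves the ISR.
    println(isOutOfSync(laggard, now, replicaLagTimeMaxMs = 30000L)) // true
  }
}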

How the ISR shrinks and grows:
When Kafka starts up, it schedules two ISR-related periodic tasks, named "isr-expiration" and "isr-change-propagation":

scheduler.schedule("isr-expiration", maybeShrinkIsr _, period = config.replicaLagTimeMaxMs / 2, unit = TimeUnit.MILLISECONDS)
scheduler.schedule("isr-change-propagation", maybePropagateIsrChanges _, period = 2500L, unit = TimeUnit.MILLISECONDS)
scheduler.schedule("shutdown-idle-replica-alter-log-dirs-thread", shutdownIdleReplicaAlterLogDirsThread _, period = 10000L, unit = TimeUnit.MILLISECONDS)

The isr-expiration task periodically checks whether each partition needs to shrink its ISR. Its period is tied to the replica.lag.time.max.ms parameter: it runs at half that value. When it detects a failed replica in the ISR, it shrinks the ISR, and whenever a partition's ISR changes, the updated data is recorded in the corresponding ZooKeeper node /brokers/topics/<topic>/partitions/<partition>/state. Sample node contents:

{"controller_epoch":26,"leader":0,"version":1,"leader_epoch":2,"isr":[0,1]}

The znode paths and the codecs for this data are defined in kafka.zk.ZkData:
object BrokersZNode {
  def path = "/brokers"
}

object TopicsZNode {
  def path = s"${BrokersZNode.path}/topics"
}

object TopicZNode {
  def path(topic: String) = s"${TopicsZNode.path}/$topic"
  def encode(assignment: collection.Map[TopicPartition, ReplicaAssignment]): Array[Byte] = {
    val replicaAssignmentJson = mutable.Map[String, util.List[Int]]()
    val addingReplicasAssignmentJson = mutable.Map[String, util.List[Int]]()
    val removingReplicasAssignmentJson = mutable.Map[String, util.List[Int]]()

    for ((partition, replicaAssignment) <- assignment) {
      replicaAssignmentJson += (partition.partition.toString -> replicaAssignment.replicas.asJava)
      if (replicaAssignment.addingReplicas.nonEmpty)
        addingReplicasAssignmentJson += (partition.partition.toString -> replicaAssignment.addingReplicas.asJava)
      if (replicaAssignment.removingReplicas.nonEmpty)
        removingReplicasAssignmentJson += (partition.partition.toString -> replicaAssignment.removingReplicas.asJava)
    }

    Json.encodeAsBytes(Map(
      "version" -> 2,
      "partitions" -> replicaAssignmentJson.asJava,
      "adding_replicas" -> addingReplicasAssignmentJson.asJava,
      "removing_replicas" -> removingReplicasAssignmentJson.asJava
    ).asJava)
  }
  def decode(topic: String, bytes: Array[Byte]): Map[TopicPartition, ReplicaAssignment] = {
    def getReplicas(replicasJsonOpt: Option[JsonObject], partition: String): Seq[Int] = {
      replicasJsonOpt match {
        case Some(replicasJson) => replicasJson.get(partition) match {
          case Some(ar) => ar.to[Seq[Int]]
          case None => Seq.empty[Int]
        }
        case None => Seq.empty[Int]
      }
    }

    Json.parseBytes(bytes).flatMap { js =>
      val assignmentJson = js.asJsonObject
      val partitionsJsonOpt = assignmentJson.get("partitions").map(_.asJsonObject)
      val addingReplicasJsonOpt = assignmentJson.get("adding_replicas").map(_.asJsonObject)
      val removingReplicasJsonOpt = assignmentJson.get("removing_replicas").map(_.asJsonObject)
      partitionsJsonOpt.map { partitionsJson =>
        partitionsJson.iterator.map { case (partition, replicas) =>
          new TopicPartition(topic, partition.toInt) -> ReplicaAssignment(
            replicas.to[Seq[Int]],
            getReplicas(addingReplicasJsonOpt, partition),
            getReplicas(removingReplicasJsonOpt, partition)
          )
        }
      }
    }.map(_.toMap).getOrElse(Map.empty)
  }
}


object TopicPartitionsZNode {
  def path(topic: String) = s"${TopicZNode.path(topic)}/partitions"
}

object TopicPartitionZNode {
  def path(partition: TopicPartition) = s"${TopicPartitionsZNode.path(partition.topic)}/${partition.partition}"
}

object TopicPartitionStateZNode {
  def path(partition: TopicPartition) = s"${TopicPartitionZNode.path(partition)}/state"
  def encode(leaderIsrAndControllerEpoch: LeaderIsrAndControllerEpoch): Array[Byte] = {
    val leaderAndIsr = leaderIsrAndControllerEpoch.leaderAndIsr
    val controllerEpoch = leaderIsrAndControllerEpoch.controllerEpoch
    Json.encodeAsBytes(Map("version" -> 1, "leader" -> leaderAndIsr.leader, "leader_epoch" -> leaderAndIsr.leaderEpoch,
      "controller_epoch" -> controllerEpoch, "isr" -> leaderAndIsr.isr.asJava).asJava)
  }
  def decode(bytes: Array[Byte], stat: Stat): Option[LeaderIsrAndControllerEpoch] = {
    Json.parseBytes(bytes).map { js =>
      val leaderIsrAndEpochInfo = js.asJsonObject
      val leader = leaderIsrAndEpochInfo("leader").to[Int]
      val epoch = leaderIsrAndEpochInfo("leader_epoch").to[Int]
      val isr = leaderIsrAndEpochInfo("isr").to[List[Int]]
      val controllerEpoch = leaderIsrAndEpochInfo("controller_epoch").to[Int]
      val zkPathVersion = stat.getVersion
      LeaderIsrAndControllerEpoch(LeaderAndIsr(leader, epoch, isr, zkPathVersion), controllerEpoch)
    }
  }
}
// update ISR in zk and in cache
private[cluster] def shrinkIsr(newIsr: Set[Int]): Unit = {
  val newLeaderAndIsr = new LeaderAndIsr(localBrokerId, leaderEpoch, newIsr.toList, zkVersion)
  val zkVersionOpt = stateStore.shrinkIsr(controllerEpoch, newLeaderAndIsr)
  maybeUpdateIsrAndVersion(newIsr, zkVersionOpt)
}
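
To tie the pieces together, here is a simplified, self-contained sketch of what the isr-expiration task effectively does for one partition (an illustration of the logic above, not the actual Kafka source):

object IsrShrinkSketch {
  // Hypothetical per-partition shrink pass: drop every follower that has not
  // caught up within replica.lag.time.max.ms, then persist the smaller ISR.
  def shrinkIsr(isr: Set[Int], leaderId: Int,
                lastCaughtUpTimeMs: Map[Int, Long], replicaLagTimeMaxMs: Long): Set[Int] = {
    val now = System.currentTimeMillis()
    isr.filterNot(id => id != leaderId && now - lastCaughtUpTimeMs(id) > replicaLagTimeMaxMs)
  }

  def main(args: Array[String]): Unit = {
    val now = System.currentTimeMillis()
    val caughtUp = Map(0 -> now, 1 -> now, 2 -> (now - 60000L)) // broker 2 stalled for 60 s
    // Broker 2 exceeds the 30 s default lag, so the ISR shrinks from {0,1,2} to {0,1};
    // the real code would then write the new ISR to the partition's state znode.
    println(shrinkIsr(Set(0, 1, 2), leaderId = 0, caughtUp, replicaLagTimeMaxMs = 30000L))
  }
}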

In that state-node payload, controller_epoch is the epoch of the current Kafka controller, leader is the id of the broker hosting the partition's leader replica, version is the schema version (fixed at 1 in the current version), leader_epoch is the partition's leader epoch, and isr is the ISR list after the change.

Besides that, whenever the ISR changes, the change record is also cached in isrChangeSet. The isr-change-propagation task periodically (at a fixed 2500 ms) checks isrChangeSet; if it finds ISR change records there, it creates a persistent sequential node under ZooKeeper's /isr_change_notification path, named with the isr_change_ prefix, for example:

/isr_change_notification/isr_change_0000000000

object IsrChangeNotificationZNode {
  def path = "/isr_change_notification"
}

object IsrChangeNotificationSequenceZNode {
  val SequenceNumberPrefix = "isr_change_"
  def path(sequenceNumber: String = "") = s"${IsrChangeNotificationZNode.path}/$SequenceNumberPrefix$sequenceNumber"
  def encode(partitions: collection.Set[TopicPartition]): Array[Byte] = {
    val partitionsJson = partitions.map(partition => Map("topic" -> partition.topic, "partition" -> partition.partition).asJava)
    Json.encodeAsBytes(Map("version" -> IsrChangeNotificationHandler.Version, "partitions" -> partitionsJson.asJava).asJava)
  }

  def decode(bytes: Array[Byte]): Set[TopicPartition] = {
    Json.parseBytes(bytes).map { js =>
      val partitionsJson = js.asJsonObject("partitions").asJsonArray
      partitionsJson.iterator.map { partitionsJson =>
        val partitionJson = partitionsJson.asJsonObject
        val topic = partitionJson("topic").to[String]
        val partition = partitionJson("partition").to[Int]
        new TopicPartition(topic, partition)
      }
    }.map(_.toSet).getOrElse(Set.empty)
  }
  def sequenceNumber(path: String) = path.substring(path.lastIndexOf(SequenceNumberPrefix) + SequenceNumberPrefix.length)
}


The information in isrChangeSet is saved into this node. The Kafka controller registers a Watcher for /isr_change_notification; when the children of this node change, the Watcher fires, notifying the controller to refresh the relevant metadata and send metadata-update requests to the brokers it manages. Finally, the already-processed nodes under /isr_change_notification are deleted. Triggering this Watcher too often would hurt the performance of the Kafka controller, ZooKeeper, and even the other brokers. To avoid that, Kafka adds a guard: after detecting an ISR change, it also checks the following two conditions (either one suffices, as the code below shows):
1. The last ISR change happened more than 5 seconds ago.
2. The last write to ZooKeeper happened more than 60 seconds ago.

  /**
   * Gets the isr change notifications as strings. These strings are the znode names and not the absolute znode path.
   * @return sequence of znode names and not the absolute znode path.
   */
  def getAllIsrChangeNotifications: Seq[String] = {
    val getChildrenResponse = retryRequestUntilConnected(GetChildrenRequest(IsrChangeNotificationZNode.path, registerWatch = true))
    getChildrenResponse.resultCode match {
      case Code.OK => getChildrenResponse.children.map(IsrChangeNotificationSequenceZNode.sequenceNumber)
      case Code.NONODE => Seq.empty
      case _ => throw getChildrenResponse.resultException.get
    }
  }
  
  private val lastIsrChangeMs = new AtomicLong(System.currentTimeMillis())
  private val lastIsrPropagationMs = new AtomicLong(System.currentTimeMillis())
  
  val HighWatermarkFilename = "replication-offset-checkpoint"
  val IsrChangePropagationBlackOut = 5000L
  val IsrChangePropagationInterval = 60000L
  
  def maybePropagateIsrChanges(): Unit = {
    val now = System.currentTimeMillis()
    isrChangeSet synchronized {
      if (isrChangeSet.nonEmpty &&
        (lastIsrChangeMs.get() + ReplicaManager.IsrChangePropagationBlackOut < now ||
          lastIsrPropagationMs.get() + ReplicaManager.IsrChangePropagationInterval < now)) {
        zkClient.propagateIsrChanges(isrChangeSet)
        isrChangeSet.clear()
        lastIsrPropagationMs.set(now)
      }
    }
  }


Where there is shrinking there must also be growing, so when does Kafka expand the ISR?

As a follower keeps synchronizing messages, its LEO steadily moves forward and eventually catches up with the leader, at which point the follower becomes eligible to enter the ISR. The criterion for having caught up is whether the follower's LEO is no longer less than the leader's HW; it is not a comparison against the leader's LEO. After the ISR expands, the ZooKeeper node /brokers/topics/<topic>/partitions/<partition>/state and isrChangeSet are updated in the same way, and the subsequent steps are identical to those for ISR shrinkage.
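
In sketch form (illustrative only; LEO and HW are defined just below), the re-admission test is a single comparison:

// The follower becomes eligible to rejoin the ISR once its LEO has reached the leader's HW.
def caughtUp(followerLeo: Long, leaderHw: Long): Boolean = followerLeo >= leaderHw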
 
LEO and HW were just mentioned:
 
HW: the High Watermark. It marks a specific message offset; consumers can only pull messages before this offset.
 
LEO: the Log End Offset. It marks the offset of the next message to be written into the current log file; in other words, the LEO equals the offset of the last message in the partition's log plus 1.
 
The ISR is closely related to HW and LEO, as the figure below shows:

[Figure: a partition's replicas with their individual LEOs; the HW sits at the smallest LEO in the ISR]

The HW is the minimum LEO within the ISR; in the figure, if follower2 were removed from the ISR, the HW would become 5.
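
Numerically, that relationship can be sketched like this (a toy illustration, with values chosen so that evicting follower2 moves the HW to 5, as described above):

object HighWatermarkDemo {
  // The partition HW is the minimum LEO among the replicas currently in the ISR.
  def highWatermark(leoByBroker: Map[Int, Long], isr: Set[Int]): Long =
    isr.map(leoByBroker).min

  def main(args: Array[String]): Unit = {
    val leos = Map(0 -> 9L, 1 -> 5L, 2 -> 3L) // broker id -> LEO; broker 0 is the leader
    println(highWatermark(leos, Set(0, 1, 2))) // 3: held back by the slow follower2
    println(highWatermark(leos, Set(0, 1)))    // 5: follower2 evicted, the HW advances
  }
}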
 
Next, let's look at the flow of a producer storing messages:

[Figure: the produce request's path from the client through the network layer, API layer, ReplicaManager, and the log subsystem]

So what exactly happens, step by step?

1. The producer sends a request to append messages to certain partitions.
2. The ProduceRequest travels through the network layer and the API layer to the ReplicaManager, which hands the messages to the log subsystem to be appended to the corresponding logs; at the same time it checks the DelayedFetch operations registered in delayedFetchPurgatory under the relevant keys, completing any whose conditions are now met.
3. The log subsystem returns the result of the append.
4. The ReplicaManager creates a DelayedProduce for the ProduceRequest and hands it to delayedProducePurgatory to manage.
5. delayedProducePurgatory uses a SystemTimer to track whether the DelayedProduce times out.
6. Followers in the ISR send FetchRequests to synchronize messages from the leader; each fetch also checks whether the DelayedProduce can now complete (see the sketch after this list).
7. When the DelayedProduce executes, its callback builds a ProduceResponse and adds it to the RequestChannel.
8. The network layer returns the ProduceResponse to the client.
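
Here is a toy sketch of the DelayedProduce idea from steps 4 to 7 (illustrative only, far simpler than the real purgatory): the client's response is held until the HW has passed the end of the appended batch, which is driven by the follower fetches in step 6.

object DelayedProduceSketch {
  // Holds the client's response until the required offset is fully replicated.
  final case class DelayedProduce(requiredOffset: Long, respond: () => Unit) {
    def tryComplete(highWatermark: Long): Boolean =
      if (highWatermark >= requiredOffset) { respond(); true } else false
  }

  def main(args: Array[String]): Unit = {
    val op = DelayedProduce(42L, () => println("ProduceResponse queued"))
    println(op.tryComplete(highWatermark = 40L)) // false: ISR followers not caught up yet
    println(op.tryComplete(highWatermark = 42L)) // true: HW advanced, client gets its answer
  }
}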

What we need to focus on this time are steps 5 and 6: how does the leader node notify the followers?

 def appendRecordsToLeader(records: MemoryRecords, origin: AppendOrigin, requiredAcks: Int): LogAppendInfo = {
    val (info, leaderHWIncremented) = inReadLock(leaderIsrUpdateLock) {
      leaderLogIfLocal match {
        case Some(leaderLog) =>
          val minIsr = leaderLog.config.minInSyncReplicas
          val inSyncSize = inSyncReplicaIds.size

          // Avoid writing to leader if there are not enough insync replicas to make it safe
          if (inSyncSize < minIsr && requiredAcks == -1) {
            throw new NotEnoughReplicasException(s"The size of the current ISR $inSyncReplicaIds " +
              s"is insufficient to satisfy the min.isr requirement of $minIsr for partition $topicPartition")
          }

          val info = leaderLog.appendAsLeader(records, leaderEpoch = this.leaderEpoch, origin,
            interBrokerProtocolVersion)

          // we may need to increment the high watermark since the ISR could be down to 1
          (info, maybeIncrementLeaderHW(leaderLog))

        case None =>
          throw new NotLeaderOrFollowerException("Leader not local for partition %s on broker %d"
            .format(topicPartition, localBrokerId))
      }
    }

    // some delayed operations may be unblocked after HW changed
    if (leaderHWIncremented)
      tryCompleteDelayedRequests()
    else {
      // probably unblock some follower fetch requests since log end offset has been updated
      delayedOperations.checkAndCompleteFetch()
    }

    info
  }
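
From the client's perspective, the min-ISR guard at the top of appendRecordsToLeader is what a producer with acks=all runs into when the ISR has shrunk below min.insync.replicas: the send fails with NotEnoughReplicasException. A minimal producer setup (the broker address and topic name are placeholders):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object AcksAllProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder address
    props.put("acks", "all")                         // arrives at the broker as requiredAcks == -1
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    try {
      // get() surfaces NotEnoughReplicasException when |ISR| < min.insync.replicas.
      producer.send(new ProducerRecord("demo-topic", "key", "value")).get()
    } finally {
      producer.close()
    }
  }
}

Back on the broker, completed appends are released through the purgatory's checkAndComplete: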
  
   /**
   * Check if some delayed operations can be completed with the given watch key,
   * and if yes complete them.
   *
   * @return the number of completed operations during this process
   */
  def checkAndComplete(key: Any): Int = {
    val wl = watcherList(key)
    val watchers = inLock(wl.watchersLock) { wl.watchersByKey.get(key) }
    val numCompleted = if (watchers == null)
      0
    else
      watchers.tryCompleteWatched()
    debug(s"Request key $key unblocked $numCompleted $purgatoryName operations")
    numCompleted
  }


This is what notifies each waiting follower; the follower-side processing logic is:

override def run(): Unit = {
    isStarted = true
    info("Starting")
    try {
      while (isRunning)
        doWork()
    } catch {
      case e: FatalExitError =>
        shutdownInitiated.countDown()
        shutdownComplete.countDown()
        info("Stopped")
        Exit.exit(e.statusCode())
      case e: Throwable =>
        if (isRunning)
          error("Error due to", e)
    } finally {
      shutdownComplete.countDown()
    }
    info("Stopped")
  }
  
 override def doWork(): Unit = {
    maybeTruncate()
    maybeFetch()
  }
  
private def maybeFetch(): Unit = {
    val fetchRequestOpt = inLock(partitionMapLock) {
      val ResultWithPartitions(fetchRequestOpt, partitionsWithError) = buildFetch(partitionStates.partitionStateMap.asScala)

      handlePartitionsWithErrors(partitionsWithError, "maybeFetch")

      if (fetchRequestOpt.isEmpty) {
        trace(s"There are no active partitions. Back off for $fetchBackOffMs ms before sending a fetch request")
        partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
      }

      fetchRequestOpt
    }

    fetchRequestOpt.foreach { case ReplicaFetch(sessionPartitions, fetchRequest) =>
      processFetchRequest(sessionPartitions, fetchRequest)
    }
  }
  

LSO: short for LastStableOffset, which is specifically tied to Kafka transactions.

The consumer-side parameter isolation.level configures the consumer's transaction isolation level. It is a string with two values, "read_uncommitted" and "read_committed", which determine how far the consumer may read. Set to "read_committed", the consumer ignores messages from uncommitted transactions, i.e. it can only consume up to the LSO (LastStableOffset). The default, "read_uncommitted", allows it to consume up to the HW (High Watermark).

Note: follower replicas fetch with an isolation level of "read_uncommitted", and this cannot be changed.
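
On the consumer side, switching the isolation level is a one-line configuration change (a sketch; the broker address, group id, and topic are placeholders):

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object ReadCommittedConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder address
    props.put("group.id", "lso-demo")                // placeholder group id
    props.put("isolation.level", "read_committed")   // read only up to the LSO
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("demo-topic"))
    // Messages from uncommitted transactions are simply not returned here.
    val records = consumer.poll(Duration.ofSeconds(1))
    records.forEach(r => println(s"${r.offset}: ${r.value}"))
    consumer.close()
  }
}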

Suppose a producer, inside an open Kafka transaction, has sent a few messages (msg1, msg2, ...) to the broker. As long as the producer has not committed the transaction (commitTransaction), a consumer with isolation.level=read_committed cannot see these messages, while one with isolation.level=read_uncommitted can. The position of the first message in the transaction is marked as the firstUnstableOffset (here, msg1's position).

For each partition, its Lag equals HW - ConsumerOffset, where ConsumerOffset is the current consumed position. That only covers the ordinary case, though: once transactions are introduced, Lag is computed differently.

For a transaction that has not yet completed, the LSO equals the position of the transaction's first message (the firstUnstableOffset).

For a transaction that has completed, the LSO is equal to the HW. So we can conclude: LSO ≤ HW ≤ LEO.

For a partition with an unfinished transaction, and a consumer configured with isolation.level = "read_committed", the Lag equals LSO - ConsumerOffset.
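
The two cases collapse into one small formula (a sketch of the rule just stated):

object PartitionLag {
  // Under read_committed the lag is measured against the LSO, otherwise against the HW.
  def lag(hw: Long, lso: Long, consumerOffset: Long, readCommitted: Boolean): Long =
    (if (readCommitted) lso else hw) - consumerOffset

  def main(args: Array[String]): Unit = {
    // An open transaction: the LSO (10) trails the HW (15); the consumer sits at offset 8.
    println(lag(hw = 15L, lso = 10L, consumerOffset = 8L, readCommitted = true))  // 2
    println(lag(hw = 15L, lso = 10L, consumerOffset = 8L, readCommitted = false)) // 7
  }
}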

OK, that covers the key points of Kafka's distributed guarantees.


Appendix: the Kafka log append flow

[Figure: Kafka log append flow]
