The key portion of the error log from this incident is shown below; most of the similar, repeated error content in the middle has been omitted.
[2017-12-27 18:26:09,267] ERROR [KafkaApi-2] Error when handling request Name: FetchRequest; Version: 2; CorrelationId: 44771537; ClientId: ReplicaFetcherThread-2-2; ReplicaId: 4; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [test-topic02-rrt,12] -> PartitionFetchInfo(8085219,1048576),[test-topic01-ursr,22] -> PartitionFetchInfo(0,1048576),[test-topic02-rd,13] -> PartitionFetchInfo(787543,1048576),[test-topic02-st,12] -> PartitionFetchInfo(14804029,1048576),[test-topic04-ursr,7] -> PartitionFetchInfo(8,1048576),[test-topic04-rd,15] -> PartitionFetchInfo(2380092,1048576),[test-topic04-rrt,18] -> PartitionFetchInfo(27246143,1048576),[test-topic03-rrt,12] -> PartitionFetchInfo(12853720,1048576),[test-topic04-st,18] -> PartitionFetchInfo(25335299,1048576),[test-topic03-srt,11] -> PartitionFetchInfo(3750134,1048576),[test-topic05-ursd,17] -> PartitionFetchInfo(0,1048576),[test-topic05-srt,22] -> PartitionFetchInfo(33136074,1048576),[test-topic01-sd,1] -> PartitionFetchInfo(14361,1048576),[test-topic03-rd,21] -> PartitionFetchInfo(96366,1048576),[test-topic04-ursd,10] -> PartitionFetchInfo(0,1048576),[my-working-topic,15] -> PartitionFetchInfo(0,1048576),[test-topic02-ts_st,12] -> PartitionFetchInfo(0,1048576),[test-topic03-ursr,9] -> PartitionFetchInfo(1,1048576) (kafka.server.KafkaApis)
kafka.common.NotAssignedReplicaException: Leader 2 failed to record follower 4's position -1 since the replica is not recognized to be one of the assigned replicas for partition [my-working-topic,15].
From it we can extract at least two key pieces of information:
- Error when handling request Name: FetchRequest;
- kafka.common.NotAssignedReplicaException: Leader 2 failed to record follower 4's position -1 since the replica is not recognized to be one of the assigned replicas for partition [my-working-topic,15]
1. Analysis of error message 1: Error when handling request Name: FetchRequest
This message tells us the error occurred while Kafka was handling replica fetching, i.e. data synchronization between partition replicas.
Two more log lines appear just above this one. The following line shows that broker 2 stopped the replica fetcher threads for 4 partitions of my-working-topic: 21, 15, 3, and 9.
[2017-12-27 18:26:09,219] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions [my-working-topic,21],[my-working-topic,15],[my-working-topic,3],[my-working-topic,9] (kafka.server.ReplicaFetcherManager)
A relevant piece of background here is the ReplicaFetcherManager class.
This class manages all data synchronization from leaders on the current broker. Its main methods, such as addPartitions and removePartitions, start or remove the fetcher threads that replicate leader data for the specified partitions.
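To make the role of those two methods concrete, here is a minimal, purely illustrative sketch (not Kafka's actual Scala code): the manager keeps a mapping from partition to the leader it is fetching from, and addPartitions/removePartitions mutate that mapping, which in the real implementation corresponds to starting and stopping fetcher threads.

```python
# Illustrative sketch of ReplicaFetcherManager's bookkeeping; class and
# method names mirror the Kafka ones, but the internals are simplified.
class ReplicaFetcherManager:
    def __init__(self, broker_id):
        self.broker_id = broker_id
        # (topic, partition) -> leader broker id currently fetched from
        self.fetchers = {}

    def add_partitions(self, partition_to_leader):
        """Start (or reassign) replica fetching for the given partitions."""
        self.fetchers.update(partition_to_leader)

    def remove_partitions(self, partitions):
        """Stop replica fetching for the given partitions."""
        for p in partitions:
            self.fetchers.pop(p, None)


mgr = ReplicaFetcherManager(broker_id=2)
mgr.add_partitions({("my-working-topic", 15): 2, ("my-working-topic", 21): 2})
mgr.remove_partitions([("my-working-topic", 15), ("my-working-topic", 21)])
print(mgr.fetchers)  # {} -- both partitions removed, as in the log line above
```

The "Removed fetcher for partitions ..." INFO line in the log corresponds to the remove_partitions step here.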
Reading further down, the log contains this entry:
[my-working-topic,15] -> PartitionFetchInfo(0,1048576)
Here, PartitionFetchInfo is defined as `class PartitionFetchInfo(offset: Long, fetchSize: Int)`. It describes a partition's replica fetch: the two parameters are the starting offset of the fetch and the number of bytes to fetch.
In other words, this FetchRequest asked for a chunk of data from partition 15 of my-working-topic, starting at offset 0 with a fetch size of 1048576 (1024*1024) bytes, hence (0,1048576).
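When a FetchRequest log line lists dozens of these pairs, it helps to parse them programmatically. A small sketch (the regex is tailored to the log format shown above) that extracts each `[topic,partition] -> PartitionFetchInfo(offset,fetchSize)` entry:

```python
import re


def parse_fetch_info(request_info: str):
    """Extract (topic, partition, offset, fetch_size) tuples from the
    RequestInfo portion of a FetchRequest log line."""
    pattern = re.compile(r"\[([^,\]]+),(\d+)\] -> PartitionFetchInfo\((\d+),(\d+)\)")
    return [(t, int(p), int(off), int(size))
            for t, p, off, size in pattern.findall(request_info)]


line = ("[my-working-topic,15] -> PartitionFetchInfo(0,1048576),"
        "[test-topic03-ursr,9] -> PartitionFetchInfo(1,1048576)")
print(parse_fetch_info(line))
# [('my-working-topic', 15, 0, 1048576), ('test-topic03-ursr', 9, 1, 1048576)]
```

This makes it easy to spot the (0,1048576) entry for my-working-topic partition 15 among all the healthy topics in the same request.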
But as we just analyzed, the fetcher threads for partitions 21, 15, 3, and 9 of my-working-topic had only just been removed by the manager, yet a fetch request arriving 48 ms later still asked to synchronize those partitions of the topic. In the log below we kept only the part for partition 15; the identical errors for the other three partitions are omitted.
Naturally, such an inconsistent request has to fail. The resulting error message is the second log entry, which we analyze next.
2. kafka.common.NotAssignedReplicaException: Leader 2 failed to record follower 4's position -1 since the replica is not recognized to be one of the assigned replicas for partition [my-working-topic,15]
This message means the following: broker 2 is the leader replica of partition 15 of my-working-topic. When broker 2 tried to record the log position of one of its follower replicas, namely the replica on broker 4, it failed, because the broker 4 replica was not at that moment recognized as a valid assigned replica of partition 15 of my-working-topic.
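The check that throws this exception (Partition.updateReplicaLogReadResult in the stack trace below) can be sketched roughly as follows; this is an illustrative simplification, not Kafka's actual code, and the function and variable names are our own:

```python
# Hedged sketch of the leader-side check behind NotAssignedReplicaException:
# the leader only records a follower's fetch position if that follower is
# currently in the partition's assigned replica set.
class NotAssignedReplicaError(Exception):
    pass


def update_replica_log_read_result(leader_id, follower_id, position,
                                   assigned_replicas, positions):
    if follower_id not in assigned_replicas:
        raise NotAssignedReplicaError(
            f"Leader {leader_id} failed to record follower {follower_id}'s "
            f"position {position} since the replica is not recognized to be "
            f"one of the assigned replicas")
    positions[follower_id] = position


positions = {}
# At the moment of the error, the leader's view of the assigned replicas for
# [my-working-topic,15] did not yet include broker 4, so its fetch is rejected:
try:
    update_replica_log_read_result(2, 4, -1, assigned_replicas={2},
                                   positions=positions)
except NotAssignedReplicaError as e:
    print(e)
```

The position of -1 in the real log simply reflects that broker 4's replica had no recorded offset yet.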
The log below is only an excerpt covering about one second; in reality these errors appeared in a brief, concentrated burst.
Why does this error occur?
What impact do these errors have on our production workload?
The story starts with the topic my-working-topic, which is what triggered these errors. In this release a new feature was added, and a new topic named my-working-topic was created for it.
Kafka is a distributed cluster system, and every broker runs threads of several kinds that are responsible for keeping the data of each node, partition, and replica consistent and complete. Our conjecture is that when a brand-new topic is created with a relatively large number of partitions and replicas, e.g. 24 partitions with 3 replicas, transient data-consistency conflicts like the ones above can occur. A broker that hits the error pauses processing of the affected data and waits for the Kafka controller broker to push down the correct partition-replica assignment; it then processes its local log files according to that information and starts the replica fetcher threads for each partition of the topic.
So as long as this error is not continuously flooding the logs, it is just a transient intermediate state: keep an eye on it, and it has no impact.
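One rough way to tell a transient burst from an ongoing problem is to bucket the NotAssignedReplicaException occurrences by minute and confirm they stop shortly after topic creation. This assumes the server.log timestamp format shown above; it is a monitoring sketch, not an official tool:

```python
from collections import Counter


def errors_per_minute(log_lines):
    """Count NotAssignedReplicaException occurrences per minute, keyed by
    the leading '[YYYY-MM-DD HH:MM' prefix of each log line."""
    counts = Counter()
    for line in log_lines:
        if "NotAssignedReplicaException" in line:
            counts[line[:17]] += 1  # "[2017-12-27 18:26" -> one-minute bucket
    return counts


logs = [
    "[2017-12-27 18:26:09,267] ERROR ... kafka.common.NotAssignedReplicaException: ...",
    "[2017-12-27 18:26:09,301] ERROR ... kafka.common.NotAssignedReplicaException: ...",
]
print(errors_per_minute(logs))  # Counter({'[2017-12-27 18:26': 2})
```

If the counts are confined to the minute or two around topic creation, the errors match the benign pattern described above; sustained counts would warrant a closer look.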
[2017-12-27 18:26:09,219] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions [my-working-topic,21],[my-working-topic,15],[my-working-topic,3],[my-working-topic,9] (kafka.server.ReplicaFetcherManager)
[2017-12-27 18:26:09,248] INFO Completed load of log my-working-topic01 with log end offset 0 (kafka.log.Log)
[2017-12-27 18:26:09,267] ERROR [KafkaApi-2] Error when handling request Name: FetchRequest; Version: 2; CorrelationId: 44771537; ClientId: ReplicaFetcherThread-2-2; ReplicaId: 4; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [test-topic02-rrt,12] -> PartitionFetchInfo(8085219,1048576),[test-topic01-ursr,22] -> PartitionFetchInfo(0,1048576),[test-topic02-rd,13] -> PartitionFetchInfo(787543,1048576),[test-topic02-st,12] -> PartitionFetchInfo(14804029,1048576),[test-topic04-ursr,7] -> PartitionFetchInfo(8,1048576),[test-topic04-rd,15] -> PartitionFetchInfo(2380092,1048576),[test-topic04-rrt,18] -> PartitionFetchInfo(27246143,1048576),[test-topic03-rrt,12] -> PartitionFetchInfo(12853720,1048576),[test-topic04-st,18] -> PartitionFetchInfo(25335299,1048576),[test-topic03-srt,11] -> PartitionFetchInfo(3750134,1048576),[test-topic05-ursd,17] -> PartitionFetchInfo(0,1048576),[test-topic05-srt,22] -> PartitionFetchInfo(33136074,1048576),[test-topic01-sd,1] -> PartitionFetchInfo(14361,1048576),[test-topic03-rd,21] -> PartitionFetchInfo(96366,1048576),[test-topic04-ursd,10] -> PartitionFetchInfo(0,1048576),[my-working-topic,15] -> PartitionFetchInfo(0,1048576),[test-topic02-ts_st,12] -> PartitionFetchInfo(0,1048576),[test-topic03-ursr,9] -> PartitionFetchInfo(1,1048576) (kafka.server.KafkaApis)
kafka.common.NotAssignedReplicaException: Leader 2 failed to record follower 4's position -1 since the replica is not recognized to be one of the assigned replicas for partition [my-working-topic,15].
at kafka.cluster.Partition.updateReplicaLogReadResult(Partition.scala:251)
at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:864)
at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:861)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
at kafka.server.ReplicaManager.updateFollowerLogReadResults(ReplicaManager.scala:861)
at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:470)
at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:496)
at kafka.server.KafkaApis.handle(KafkaApis.scala:77)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
at java.lang.Thread.run(Thread.java:745)
.............................................................
[2017-12-27 18:26:09,387] INFO Partition [my-working-topic,13] on broker 2: No checkpointed highwatermark is found for partition [my-working-topic,13] (kafka.cluster.Partition)
[2017-12-27 18:26:09,388] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions [my-working-topic,5],[my-working-topic,10],[my-working-topic,8],[my-working-topic,13],[my-working-topic,18],[my-working-topic,4],[my-working-topic,19],[my-working-topic,14] (kafka.server.ReplicaFetcherManager)
[2017-12-27 18:26:09,388] INFO Truncating log my-working-topic-8 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,388] INFO Truncating log my-working-topic-5 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,388] INFO Truncating log my-working-topic-13 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,388] INFO Truncating log my-working-topic-14 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,388] INFO Truncating log my-working-topic-18 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,388] INFO Truncating log my-working-topic-4 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,389] INFO Truncating log my-working-topic-19 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,389] INFO Truncating log my-working-topic-10 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,393] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions List([[my-working-topic,8], initOffset 0 to broker BrokerEndPoint(1,172.17.140.91,9092)] , [[my-working-topic,5], initOffset 0 to broker BrokerEndPoint(4,172.17.140.95,9092)] , [[my-working-topic,13], initOffset 0 to broker BrokerEndPoint(6,172.17.140.42,9092)] , [[my-working-topic,14], initOffset 0 to broker BrokerEndPoint(1,172.17.140.91,9092)] , [[my-working-topic,18], initOffset 0 to broker BrokerEndPoint(5,172.17.140.41,9092)] , [[my-working-topic,4], initOffset 0 to broker BrokerEndPoint(3,172.17.140.93,9092)] , [[my-working-topic,19], initOffset 0 to broker BrokerEndPoint(6,172.17.140.42,9092)] , [[my-working-topic,10], initOffset 0 to broker BrokerEndPoint(3,172.17.140.93,9092)] ) (kafka.server.ReplicaFetcherManager)
3. About INFO Partition [my-working-topic,13] on broker 2: No checkpointed highwatermark is found for partition [my-working-topic,13] (kafka.cluster.Partition)
This log line mentions the high watermark, which records the log offset position up to which the current leader replica's data has been replicated to the other follower replicas and flushed to the log files.
So why does it say No checkpointed highwatermark is found? Because my-working-topic has only just been created: its log files contain no data at all, so there is no log offset and no high watermark yet. That is why this line is only at INFO level; it is a normal event.
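The behavior can be sketched like this (an illustrative simplification under our reading of the log, not Kafka's actual code): the broker looks the partition up in its high-watermark checkpoint file, and when no entry exists, as for a freshly created topic, it logs the INFO line and falls back to 0.

```python
# Illustrative sketch: resolving a partition's initial high watermark from
# the checkpoint map, defaulting to 0 for a brand-new partition.
def initial_high_watermark(checkpoints: dict, topic: str, partition: int) -> int:
    key = (topic, partition)
    if key not in checkpoints:
        print(f"No checkpointed highwatermark is found for partition "
              f"[{topic},{partition}]")
        return 0
    return checkpoints[key]


print(initial_high_watermark({}, "my-working-topic", 13))
# prints the INFO-style message, then 0
```

Starting from 0 is exactly what we then see in the "Truncating log ... to offset 0" lines below.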
In the subsequent log we can see that my-working-topic-13 first has its log file truncated to empty, and then the replica fetcher thread for that partition is started (Added fetcher for partitions ...).
Finally, checking the state of the topic shows everything is normal:
$ ./kafka-topics.sh --describe --zookeeper 172.17.140.91:2181 --topic my-working-topic
Topic:my-working-topic PartitionCount:24 ReplicationFactor:3 Configs:
Topic: my-working-topic Partition: 0 Leader: 5 Replicas: 5,3,4 Isr: 5,3,4
Topic: my-working-topic Partition: 1 Leader: 6 Replicas: 6,4,5 Isr: 6,4,5
Topic: my-working-topic Partition: 2 Leader: 1 Replicas: 1,5,6 Isr: 1,5,6
Topic: my-working-topic Partition: 3 Leader: 2 Replicas: 2,6,1 Isr: 2,6,1
Topic: my-working-topic Partition: 4 Leader: 3 Replicas: 3,1,2 Isr: 3,1,2
Topic: my-working-topic Partition: 5 Leader: 4 Replicas: 4,2,3 Isr: 4,2,3
Topic: my-working-topic Partition: 6 Leader: 5 Replicas: 5,4,6 Isr: 5,4,6
Topic: my-working-topic Partition: 7 Leader: 6 Replicas: 6,5,1 Isr: 6,5,1
Topic: my-working-topic Partition: 8 Leader: 1 Replicas: 1,6,2 Isr: 1,6,2
Topic: my-working-topic Partition: 9 Leader: 2 Replicas: 2,1,3 Isr: 2,1,3
Topic: my-working-topic Partition: 10 Leader: 3 Replicas: 3,2,4 Isr: 3,2,4
Topic: my-working-topic Partition: 11 Leader: 4 Replicas: 4,3,5 Isr: 4,3,5
Topic: my-working-topic Partition: 12 Leader: 5 Replicas: 5,6,1 Isr: 5,6,1
Topic: my-working-topic Partition: 13 Leader: 6 Replicas: 6,1,2 Isr: 6,1,2
Topic: my-working-topic Partition: 14 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: my-working-topic Partition: 15 Leader: 2 Replicas: 2,3,4 Isr: 2,3,4
Topic: my-working-topic Partition: 16 Leader: 3 Replicas: 3,4,5 Isr: 3,4,5
Topic: my-working-topic Partition: 17 Leader: 4 Replicas: 4,5,6 Isr: 4,5,6
Topic: my-working-topic Partition: 18 Leader: 5 Replicas: 5,1,2 Isr: 5,1,2
Topic: my-working-topic Partition: 19 Leader: 6 Replicas: 6,2,3 Isr: 6,2,3
Topic: my-working-topic Partition: 20 Leader: 1 Replicas: 1,3,4 Isr: 1,3,4
Topic: my-working-topic Partition: 21 Leader: 2 Replicas: 2,4,5 Isr: 2,4,5
Topic: my-working-topic Partition: 22 Leader: 3 Replicas: 3,5,6 Isr: 3,5,6
Topic: my-working-topic Partition: 23 Leader: 4 Replicas: 4,6,1 Isr: 4,6,1
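For a 24-partition topic, eyeballing this output is error-prone. A small sketch that checks the describe output mechanically: a partition is fully in sync when its Isr list matches its Replicas list (the regex is tailored to the kafka-topics.sh output format shown above).

```python
import re


def unhealthy_partitions(describe_output: str):
    """Return partition ids whose Isr set differs from the Replicas set."""
    bad = []
    for m in re.finditer(
            r"Partition: (\d+)\s+Leader: \S+\s+Replicas: (\S+)\s+Isr: (\S+)",
            describe_output):
        part, replicas, isr = m.groups()
        if sorted(replicas.split(",")) != sorted(isr.split(",")):
            bad.append(int(part))
    return bad


sample = """Topic: my-working-topic Partition: 15 Leader: 2 Replicas: 2,3,4 Isr: 2,3,4
Topic: my-working-topic Partition: 16 Leader: 3 Replicas: 3,4,5 Isr: 3,4"""
print(unhealthy_partitions(sample))  # [16] -- partition 16 is missing replica 5 from Isr
```

Running this over the describe output above returns an empty list, confirming that every partition's Isr matches its Replicas and the cluster fully recovered from the transient errors.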