Kafka: a broker that stayed down for a long time reports errors on restart, and consumers deadlock (resolution)
- Error messages:
2019-04-23 17:22:42,423 WARN kafka.controller.KafkaController: [Controller 782]: Partition [hjw_test8,6] failed to complete preferred replica leader election. Leader is 201
2019-04-23 17:22:42,423 ERROR kafka.server.ReplicaFetcherThread: [ReplicaFetcherThread-5-203], Error for partition [alarm_callback_topic,2] to broker 203:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
2019-04-23 17:22:42,426 ERROR state.change.logger: Controller 782 epoch 227 encountered error while electing leader for partition [004_8,3] due to: Preferred replica 203 for partition [004_8,3] is either not alive or not in the isr. Current leader and ISR: [{"leader":759,"leader_epoch":3,"isr":[759]}].
2019-04-23 17:22:42,426 ERROR state.change.logger: Controller 782 epoch 227 initiated state change for partition [004_8,3] from OfflinePartition to OnlinePartition failed
kafka.common.StateChangeFailedException: encountered error while electing leader for partition [004_8,3] due to: Preferred replica 203 for partition [004_8,3] is either not alive or not in the isr. Current leader and ISR: [{"leader":759,"leader_epoch":3,"isr":[759]}].
2019-04-23 17:22:42,430 ERROR kafka.server.ReplicaFetcherThread: [ReplicaFetcherThread-1-203], Error for partition [kjTest,8] to broker 203:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
2019-04-23 17:22:42,430 ERROR kafka.server.ReplicaFetcherThread: [ReplicaFetcherThread-1-203], Error for partition [app_error_log,2] to broker 203:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
2019-04-23 17:22:42,430 ERROR kafka.server.ReplicaFetcherThread: [ReplicaFetcherThread-1-203], Error for partition [kjTest2,8] to broker 203:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
2019-04-23 17:22:42,430 ERROR kafka.server.ReplicaFetcherThread: [ReplicaFetcherThread-1-203], Error for partition [fj1001,9] to broker 203:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
2019-04-23 17:22:42,430 ERROR kafka.server.ReplicaFetcherThread: [ReplicaFetcherThread-1-203], Error for partition [lzsw_alarm_topic,1] to broker 203:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
[Kafka Server 782], Proceeding to do an unclean shutdown as all the controlled shutdown attempts failed
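- Cause: possibly a replica's offset is ahead of the leader's, which prevents the broker from starting.
- Solution: run a preferred replica election to rebalance partition leadership across all topics.
- Steps: go to the Kafka directory that contains the script and run the following command (in a cluster, running it on any one node is enough)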
./kafka-preferred-replica-election.sh --zookeeper localhost:2181
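After the election, it can help to confirm that each partition's leader is back on its preferred replica (the first broker in the Replicas list). The sketch below uses a hypothetical helper, `non_preferred_leaders`, which is not part of Kafka. It assumes the `kafka-topics.sh --describe` output format with `Leader:` and `Replicas:` fields, and prints every partition whose current leader differs from its preferred replica.

```shell
# Hypothetical helper (not shipped with Kafka): reads the output of
# "kafka-topics.sh --describe" on stdin and prints the partitions whose
# current leader is not the first broker in the Replicas list.
non_preferred_leaders() {
  awk '
    {
      topic = ""; part = ""; leader = ""; replicas = ""
      # Pick up the value that follows each known label on the line.
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Leader:")    leader = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
      }
      # Skip the per-topic summary line and anything else without partition data.
      if (part == "" || replicas == "") next
      split(replicas, r, ",")
      if (leader != r[1]) print topic, part, "leader=" leader, "preferred=" r[1]
    }'
}

# Assumed usage, with the same ZooKeeper address as the command above:
#   ./kafka-topics.sh --zookeeper localhost:2181 --describe | non_preferred_leaders
```

An empty result suggests leadership is balanced. Any remaining lines point at partitions whose preferred replica is still down or not yet back in the ISR, which matches the "either not alive or not in the isr" errors in the log.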
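Single point of failure in a Kafka cluster
- Problem: the Kafka cluster has three nodes, but after one of them is stopped the whole cluster stops working properly.
- Cause: investigation showed that all partitions of the __consumer_offsets topic were stored on a single Kafka server; with only one replica, that server is a single point of failure. Note: __consumer_offsets is created automatically by Kafka, with 50 partitions by default.
- Solution:
- First, adjust the following parameters in the broker configuration file: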
num.partitions=3 (default number of partitions: 3)
auto.create.topics.enable=true (create topics automatically)
default.replication.factor=3 (default replication factor: 3)
- Once every node has been updated, delete __consumer_offsets in ZooKeeper.
Go to the zookeeper/bin directory and run ./zkCli.sh
ls /brokers/topics
rmr /brokers/topics/__consumer_offsets (on ZooKeeper 3.5+, use deleteall instead of rmr)
ls /brokers/topics
- Finally, restart ZooKeeper and Kafka.
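To confirm the diagnosis before making changes, and to verify the result after the restart, the replica list of each __consumer_offsets partition can be inspected. The sketch below uses a hypothetical helper, `partitions_below_replication`, which is not part of Kafka. It assumes the `kafka-topics.sh --describe` output format with a `Replicas:` field, and prints every partition that has fewer replicas than the given minimum (default 2).

```shell
# Hypothetical helper (not shipped with Kafka): reads the output of
# "kafka-topics.sh --describe" on stdin and prints the partitions that have
# fewer replicas than the minimum given as $1 (default: 2).
partitions_below_replication() {
  min="${1:-2}"
  awk -v min="$min" '
    {
      topic = ""; part = ""; replicas = ""
      # Pick up the value that follows each known label on the line.
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
      }
      # Skip the per-topic summary line and anything else without partition data.
      if (part == "" || replicas == "") next
      n = split(replicas, r, ",")
      if (n < min + 0) print topic, part, "replicas=" replicas
    }'
}

# Assumed usage (the ZooKeeper address is an assumption):
#   ./kafka-topics.sh --zookeeper localhost:2181 --describe \
#     --topic __consumer_offsets | partitions_below_replication 3
```

If every partition is listed with the same single broker, the topic has the single point of failure described above. One point to verify against the Kafka version in use: the replication factor of __consumer_offsets is normally governed by the broker setting offsets.topic.replication.factor rather than default.replication.factor, so it may need to be set to 3 as well before the topic is recreated.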