Analysis and Resolution of a Kafka Cluster High-Availability Problem

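1. Problem description:

The production Kafka cluster currently has three nodes, each deployed on a different server. When one node goes down, the whole cluster becomes unavailable and messages can be neither sent nor received.

2. Reproducing the problem:

By manually stopping and starting brokers in the test-environment Kafka cluster, the cluster unavailability was reproduced.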

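3. Problem analysis:

1)

Using the command

./kafka-topics.sh --describe --topic do.doc.v1.Document.weight.delete --zookeeper 10.86.51.74:2181

to view the topic's details showed that after the topic's leader node was shut down, the leader did not change: Kafka did not run a leader election automatically.

Further analysis showed that the topic's replication factor was 1, not 3, the number of nodes in the cluster. With a single replica there is no other broker that can be elected leader.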
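
To check many topics at once, the summary line printed by `kafka-topics.sh --describe` can be scanned for a low replication factor. The following is a minimal Python sketch; the function name `under_replicated` and the sample text are illustrative assumptions (the sample is not output captured from this cluster), and the pattern tolerates optional spaces after the colons because the spacing differs between Kafka versions.

```python
import re

# Matches the per-topic summary line of `kafka-topics.sh --describe`.
# Optional whitespace after each colon covers differently spaced output.
SUMMARY = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+PartitionCount:\s*(?P<partitions>\d+)"
    r"\s+ReplicationFactor:\s*(?P<rf>\d+)"
)


def under_replicated(describe_output, expected_rf):
    """Return {topic: replication_factor} for topics below expected_rf."""
    found = {}
    for line in describe_output.splitlines():
        match = SUMMARY.search(line)
        if match and int(match.group("rf")) < expected_rf:
            found[match.group("topic")] = int(match.group("rf"))
    return found


# Illustrative sample only, not captured from the cluster in this article.
SAMPLE = (
    "Topic:do.doc.v1.Document.weight.delete\tPartitionCount:1\t"
    "ReplicationFactor:1\tConfigs:\n"
    "\tTopic: do.doc.v1.Document.weight.delete\tPartition: 0\t"
    "Leader: 1\tReplicas: 1\tIsr: 1\n"
)

print(under_replicated(SAMPLE, expected_rf=3))
```

With the sample above, the topic is reported with a replication factor of 1; the per-partition detail line is skipped because it has no `PartitionCount` field.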

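2)

Reading the source code showed that the topics are created automatically, with a default replication factor of 1.

The source code does not expose any method that accepts a replication factor.

3)

Inspecting the following properties in the Kafka configuration file server.properties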

offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1

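showed that the replication factor of the offsets topic and the replication factor of the transaction state log were both set to 1; both were changed to 3.
The minimum ISR of the transaction state log was changed to 2, so that at least two replicas must be in sync, which guards against data loss.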

offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2
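
As a quick check of these three settings, the following minimal Python sketch reads the text of a server.properties file and reports which of them are still below the target values used in this article. The names `TARGET`, `parse_properties`, and `below_target` are illustrative assumptions, not part of Kafka, and keys missing from the file are ignored rather than guessed.

```python
# Target values taken from the modified configuration in this article.
TARGET = {
    "offsets.topic.replication.factor": 3,
    "transaction.state.log.replication.factor": 3,
    "transaction.state.log.min.isr": 2,
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def below_target(text, target=TARGET):
    """Return {key: current_value} for keys present but below target."""
    props = parse_properties(text)
    return {
        key: int(props[key])
        for key, wanted in target.items()
        if key in props and int(props[key]) < wanted
    }


ORIGINAL = """\
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
"""

print(below_target(ORIGINAL))
```

Run against the original three lines, all three keys are reported with the value 1; run against the modified values, the result is empty.
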
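
4)

Testing after the change showed that the topic's replication factor was still 1 and the leader still could not be elected automatically. This is expected: these three settings govern only Kafka's internal topics (`__consumer_offsets` and `__transaction_state`), and only when those topics are first created; they do not change the replicas of existing business topics.

5)

Replicas were then assigned to the topics explicitly, using the following command and a prepared JSON file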

./kafka-reassign-partitions.sh --zookeeper 10.86.51.74:2181 --reassignment-json-file /root/modify.topic.json --execute

The contents of modify.topic.json are:

{"version":1,
"partitions":[
    {"topic":"CalendarUpdate","partition":0,"replicas":[1,2,3]},
    {"topic":"ChangeActualToHistory","partition":0,"replicas":[1,2,3]},
    {"topic":"ChangeActualToLatest","partition":0,"replicas":[1,2,3]},
    {"topic":"DeletePlanAuth","partition":0,"replicas":[1,2,3]},
    {"topic":"MeasurementReportAuth","partition":0,"replicas":[1,2,3]},
    {"topic":"PlanCenterCalendarUpdate","partition":0,"replicas":[1,2,3]},
    {"topic":"WorkCalendarDelete","partition":0,"replicas":[1,2,3]},
    {"topic":"createProjectTpb","partition":0,"replicas":[1,2,3]},
    {"topic":"createProjectTpb1","partition":0,"replicas":[1,2,3]},
    {"topic":"db.v1.doc.document","partition":0,"replicas":[1,2,3]},
    {"topic":"db.v1.doc.file","partition":0,"replicas":[1,2,3]},
    {"topic":"db.v1.doc.folder","partition":0,"replicas":[1,2,3]},
    {"topic":"db.v1.org.tr_org_user","partition":0,"replicas":[1,2,3]},
    {"topic":"db.v1.org.ts_user","partition":0,"replicas":[1,2,3]},
    {"topic":"db.v1.tag.tag","partition":0,"replicas":[1,2,3]},
    {"topic":"db.v1.task.project","partition":0,"replicas":[1,2,3]},
    {"topic":"db.v1.task.task","partition":0,"replicas":[1,2,3]},
    {"topic":"db.v1.task.task_group","partition":0,"replicas":[1,2,3]},
    {"topic":"db.v1.task.task_link","partition":0,"replicas":[1,2,3]},
    {"topic":"db.v1.task.task_project_member","partition":0,"replicas":[1,2,3]},
    {"topic":"db.v1.task.task_stage","partition":0,"replicas":[1,2,3]},
    {"topic":"db.v1.task.task_user","partition":0,"replicas":[1,2,3]},
    {"topic":"db.v1.task.ts_enterprise_role","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.clear-plancenter","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.clear-project","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.clear-question","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.clear-task","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.clear-vehiclestate","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.clear-weight","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.update-epg","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.update-epgcost","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.update-plancenter","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.update-project","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.update-question","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.update-task","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.update-tpb","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.update-vehiclestate","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.update-vehicletest","partition":0,"replicas":[1,2,3]},
    {"topic":"de.task.v1.Task.update-weight","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Document.endurancetest.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Document.epg.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Document.epgcost.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Document.measurement.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Document.organization.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Document.plancenter.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Document.project.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Document.question.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Document.task.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Document.vehicletest.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Document.weight.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Folder.clear","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Folder.clear-task.v1.ProjectDm","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Folder.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Folder.insert","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Folder.insert-task.v1.ProjectDm","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Folder.insert-task.v1.TaskDm","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Folder.insert.chlid","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Folder.insert.root","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Folder.insert.root-task.v1.ProjectDm","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Folder.insert.root-task.v1.TaskDm","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Folder.restore","partition":0,"replicas":[1,2,3]},
    {"topic":"do.doc.v1.Folder.update","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatGroup.clear","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatGroup.clear-task.v1.ProjectDm","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatGroup.clearGroupAndMember","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatGroup.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatGroup.insert","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatGroup.insert-task.v1.ProjectDm","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatGroup.insert-task.v1.TaskDm","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatGroup.update","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatMember.clear","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatMember.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatMember.deleteByGroup","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatMember.insert","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatMember.update","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatMessage.clear","partition":0,"replicas":[1,2,3]},
    {"topic":"do.msg.v1.ChatMessage.send","partition":0,"replicas":[1,2,3]},
    {"topic":"do.tag.v1.TagGroup.clear","partition":0,"replicas":[1,2,3]},
    {"topic":"do.tag.v1.TagGroup.clear-task.v1.ProjectDm","partition":0,"replicas":[1,2,3]},
    {"topic":"do.tag.v1.TagGroup.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.tag.v1.TagGroup.insert","partition":0,"replicas":[1,2,3]},
    {"topic":"do.tag.v1.TagGroup.insert-task.v1.ProjectDm","partition":0,"replicas":[1,2,3]},
    {"topic":"do.tag.v1.TagGroup.update","partition":0,"replicas":[1,2,3]},
    {"topic":"do.tag.v1.TagLink.clear","partition":0,"replicas":[1,2,3]},
    {"topic":"do.tag.v1.TagLink.clearByUserCondition","partition":0,"replicas":[1,2,3]},
    {"topic":"do.tag.v1.TagLink.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"do.tag.v1.TagLink.insert","partition":0,"replicas":[1,2,3]},
    {"topic":"do.tag.v1.TagLink.insert-task.v1.TaskDm","partition":0,"replicas":[1,2,3]},
    {"topic":"do.tag.v1.TagLink.update","partition":0,"replicas":[1,2,3]},
    {"topic":"do.tag.v1.TagLink.updateBusDeletedByUserCondition","partition":0,"replicas":[1,2,3]},
    {"topic":"doc.v1.Folder.delete","partition":0,"replicas":[1,2,3]},
    {"topic":"doc.v1.Folder.insert","partition":0,"replicas":[1,2,3]},
    {"topic":"doc.v1.Folder.update","partition":0,"replicas":[1,2,3]},
    {"topic":"log","partition":0,"replicas":[1,2,3]},
    {"topic":"monitorOperationLog","partition":0,"replicas":[1,2,3]},
    {"topic":"msg.v1.ChatGroup.insert-task.v1.TaskDm","partition":0,"replicas":[1,2,3]},
    {"topic":"mysqlcdc","partition":0,"replicas":[1,2,3]},
    {"topic":"projectCardConfigInit-dev","partition":0,"replicas":[1,2,3]},
    {"topic":"projectInfoChange","partition":0,"replicas":[1,2,3]},
    {"topic":"tag.v1.TagLink.insert-task.v1.TaskDm","partition":0,"replicas":[1,2,3]}
    ]
}

Here [1,2,3] are the broker.id values of the three nodes.
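
Writing a file of this size by hand is error-prone, so it can be generated from a list of topic names. The following is a minimal Python sketch; the function name `build_reassignment` and its `rotate` option are illustrative assumptions. It assumes every topic has a single partition (partition 0), as in the file above. With `rotate=False` it reproduces the same `[1,2,3]` assignment for every topic; `rotate=True` is an optional variation that rotates the list so the first replica, which Kafka treats as the preferred leader, is not always the same broker.

```python
import json


def build_reassignment(topics, broker_ids, rotate=False):
    """Build the JSON document read by kafka-reassign-partitions.sh.

    Assumes one partition (partition 0) per topic.
    """
    partitions = []
    for index, topic in enumerate(topics):
        # Optionally rotate so the first replica differs between topics.
        shift = index % len(broker_ids) if rotate else 0
        replicas = list(broker_ids[shift:]) + list(broker_ids[:shift])
        partitions.append(
            {"topic": topic, "partition": 0, "replicas": replicas}
        )
    return {"version": 1, "partitions": partitions}


# A few topic names taken from the file above.
topics = ["CalendarUpdate", "ChangeActualToHistory", "ChangeActualToLatest"]

plan = build_reassignment(topics, [1, 2, 3])
rotated = build_reassignment(topics, [1, 2, 3], rotate=True)
print(json.dumps(plan, indent=2))
```

The printed document can be saved as modify.topic.json and passed to the command above through `--reassignment-json-file`.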

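6)

After the change, topic leaders are elected automatically: when the Kafka broker hosting a topic's leader goes down, a new leader is elected. However, the application keeps looking for the broker whose broker.id is 1; once that broker goes down, the Kafka cluster becomes unavailable and the middle-platform services cannot run normally.

7)

Suspecting a problem with how the cluster had been set up, a cluster of 3 ZooKeeper nodes and 3 Kafka nodes was built with Docker on a server, to reproduce the behavior seen in testing.

The Kafka configuration properties were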

      KAFKA_BROKER_ID: 2
      KAFKA_ADVERTISED_HOST_NAME: 172.20.30.61                 ## change to: host machine IP
      KAFKA_ADVERTISED_PORT: 9094                               ## change to: port mapped on the host
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://172.20.30.61:9094   ## change to: host machine IP
      KAFKA_ZOOKEEPER_CONNECT: "zoo1:2181,zoo2:2181,zoo3:2181"
      KAFKA_DEFAULT_REPLICATION_FACTOR: 3
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 3
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_MIN_INSYNC_REPLICAS: 2

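The KAFKA_DEFAULT_REPLICATION_FACTOR: 3 property can take the place of the JSON file in step 5 for setting the topic replication factor. It applies to topics that are auto-created after it is set, so topics that already exist still need the reassignment from step 5.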
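
How the replication factor and KAFKA_MIN_INSYNC_REPLICAS in this configuration combine can be sketched with a little arithmetic. The following minimal Python example uses an illustrative function name, `can_accept_writes`, and rests on two stated assumptions: each partition has one replica on every broker (replication factor equal to the 3 brokers), and producers send with acks=all, which is the mode in which min.insync.replicas is enforced.

```python
def can_accept_writes(replication_factor, min_insync_replicas, brokers_down):
    """True if an acks=all producer can still write to a partition.

    Assumes one replica per broker, so each broker that is down removes
    exactly one replica from the in-sync set.
    """
    in_sync = replication_factor - brokers_down
    return in_sync >= 1 and in_sync >= min_insync_replicas


# Values from the configuration above: replication factor 3, min ISR 2.
for down in range(4):
    print(down, can_accept_writes(3, 2, down))
```

Under these assumptions, writes continue with one broker down and are rejected once two brokers are down.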

8)

Following the article https://www.jdon.com/53953

the KAFKA_MIN_INSYNC_REPLICAS: 2 property was set

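```text
**Set min.insync.replicas to at least 2 for safety**
The default value is 1, which is dangerous and easy to forget to change.
The min.insync.replicas value configured on the broker becomes the default for all new topics (it can also be set per topic).
Note that the transaction topic does not use this setting; it has its own: transaction.state.log.min.isr.

**Unclean leader election**
Setting unclean.leader.election.enable to false prevents a broker that is not in the ISR list from becoming leader.
Why? Consider this scenario:
1. You have 3 Kafka brokers, and broker_1 is the leader.
2. broker_3 goes offline for some reason.
3. broker_1 removes it from the ISR list.
4. Producers keep working and write some messages.
5. broker_1 and broker_2 go offline at the same time.
6. broker_3 recovers, comes back online, and becomes the leader.
7. broker_2 recovers and starts following broker_3.
What does this mean? All messages that broker_1 stored and acknowledged while broker_3 was offline are lost.
```

9)

After these adjustments, the Kafka cluster runs stably as long as at least two nodes remain available.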