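Troubleshooting Node Crashes in a Production Kafka Cluster
Host and process information
Host: 6 cores, 64G memory, 5.3T disk
Kafka process: 4G of memory, about 1K partitions, 3.7T of message data
This morning one Kafka node was found to be down. The logs on that host showed the following errors: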
Java HotSpot(TM) 64-Bit Server VM warning: Attempt to deallocate stack guard pages failed.
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f04249ed000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/appweb/hs_err_pid5566.log
I backed up the logs and restarted the broker.
During startup, the log kept printing WARN entries like this one:
server.log.2021-07-22-11:[2021-07-22 11:37:18,003] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/data/kafka_2/ai_jl_simple-4/00000000000000000082.index) has non-zero size but the last offset is 82 which is no larger than the base offset 82.}. deleting /data/kafka_2/ai_jl_simple-4/00000000000000000082.timeindex, /data/kafka_2/ai_jl_simple-4/00000000000000000082.index and rebuilding index... (kafka.log.Log)
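These warnings continued until the broker finally came up. Judging by the timestamp of the last such WARN entry, the whole startup took 48 minutes.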
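The 48-minute figure comes from comparing timestamps of the index-rebuild warnings. A minimal sketch of that measurement follows, assuming the log entries look like the WARN line quoted above; the function name `rebuild_duration` and the regular expression are illustrative and not part of Kafka.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Timestamp of an index-rebuild warning, e.g.
# "[2021-07-22 11:37:18,003] WARN Found a corrupted index file ...".
# search() is used below, so a grep-style "server.log.2021-07-22-11:" prefix is tolerated.
WARN_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] WARN Found a corrupted index file"
)


def rebuild_duration(lines: Iterable[str]) -> Optional[timedelta]:
    """Return the time between the first and last index-rebuild warning.

    Returns None when no such warning is present in the input.
    """
    first = last = None
    for line in lines:
        match = WARN_RE.search(line)
        if match is None:
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if first is None:
            first = ts
        last = ts
    if first is None:
        return None
    return last - first
```

It could be used as `rebuild_duration(open("server.log"))`, where the file name is a placeholder. The result only covers the span between the first and the last warning, so it approximates the total startup time rather than measuring it exactly.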
In the afternoon, the Kafka processes on the other two hosts went down one after another.
Error log from one of them:
[2021-07-22 16:29:47,593] INFO Rolled new log segment for 'schumann-biz.app.log-0' in 10 ms. (kafka.log.Log)
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f3646dd0000, 65536, 1) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 65536 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/appweb/hs_err_pid18240.log
#
# Compiler replay data is saved as:
# /home/appweb/replay_pid18240.log
Error log from the other one:
[2021-07-22 16:31:17,899] FATAL [Replica Manager on Broker 2]: Halting due to unrecoverable I/O error while handling produce request: (kafka.server.ReplicaManager)
kafka.common.KafkaStorageException: I/O exception in append to log 'wdp-monitor.app.log-0'
at kafka.log.Log.append(Log.scala:362)
at kafka.cluster.Partition$$anonfun$11.apply(Partition.scala:451)
at kafka.cluster.Partition$$anonfun$11.apply(Partition.scala:439)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:213)
at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:219)
at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:438)
at kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:389)
at kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:375)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:375)
at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:312)
at kafka.server.KafkaApis.handleProducerRequest(KafkaApis.scala:427)
at kafka.server.KafkaApis.handle(KafkaApis.scala:80)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:62)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:907)
at kafka.log.AbstractIndex.<init>(AbstractIndex.scala:61)
at kafka.log.TimeIndex.<init>(TimeIndex.scala:55)
at kafka.log.LogSegment.<init>(LogSegment.scala:73)
at kafka.log.Log.roll(Log.scala:809)
at kafka.log.Log.maybeRoll(Log.scala:775)
at kafka.log.Log.append(Log.scala:419)
... 21 more
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:904)
... 27 more
Java HotSpot(TM) 64-Bit Server VM warning: Attempt to deallocate stack guard pages failed.
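All three crashes above are failed memory-mapping calls (`os::commit_memory` with errno=12, and `FileChannelImpl.map0` throwing "Map failed"). On Linux, `mmap` can return ENOMEM not only when memory is actually exhausted but also when the process reaches the kernel's `vm.max_map_count` limit on the number of mappings. The logs alone do not establish which of the two happened here, so the following is only a sketch of one check worth running, assuming a Linux host. The helper names `count_mappings` and `mapping_usage` are made up for illustration.

```python
from pathlib import Path
from typing import Tuple


def count_mappings(maps_text: str) -> int:
    """Count mappings in the contents of /proc/<pid>/maps (one per non-empty line)."""
    return sum(1 for line in maps_text.splitlines() if line.strip())


def mapping_usage(pid: int, proc_root: str = "/proc") -> Tuple[int, int]:
    """Return (mappings currently held by the process, vm.max_map_count)."""
    root = Path(proc_root)
    used = count_mappings((root / str(pid) / "maps").read_text())
    limit = int((root / "sys" / "vm" / "max_map_count").read_text())
    return used, limit
```

Calling `mapping_usage(pid)` with the PID of a running broker returns both numbers. If the first is close to the second, the limit is a plausible suspect. If it is far below, the cause is more likely elsewhere, for example a real shortage of memory on the host.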