Background: Kafka is deployed on Kubernetes; after a restart, one of the Kafka pods stayed in an abnormal state.
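The abnormal state shows up in the broker log. A minimal way to pull it, assuming the pod is named kafka-1 (inferred from the PVC datadir-kafka-1 seen further down) and that the pods carry an app=kafka label:

    kubectl get pods -l app=kafka
    kubectl logs kafka-1 --tail=200

The log repeats the following log-directory failures: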
[2023-02-17 15:44:01,546] ERROR Error while deleting the clean shutdown file in dir /opt/kafka/data/logs (kafka.server.LogDirFailureChannel)
java.nio.file.FileSystemException: /opt/kafka/data/logs/__consumer_offsets-17/leader-epoch-checkpoint: Transport endpoint is not connected
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.createFile(Files.java:632)
at kafka.server.checkpoints.CheckpointFile.<init>(CheckpointFile.scala:45)
at kafka.server.checkpoints.LeaderEpochCheckpointFile.<init>(LeaderEpochCheckpointFile.scala:62)
at kafka.log.Log.kafka$log$Log$$initializeLeaderEpochCache(Log.scala:302)
at kafka.log.Log.<init>(Log.scala:232)
at kafka.log.Log$.apply(Log.scala:1986)
at kafka.log.LogManager.kafka$log$LogManager$$loadLog(LogManager.scala:265)
at kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$11$$anonfun$apply$15$$anonfun$apply$2.apply$mcV$sp(LogManager.scala:345)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2023-02-17 15:44:01,546] ERROR Error while deleting the clean shutdown file in dir /opt/kafka/data/logs (kafka.server.LogDirFailureChannel)
java.io.IOException: Transport endpoint is not connected
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:1012)
at kafka.log.AbstractIndex.<init>(AbstractIndex.scala:54)
at kafka.log.OffsetIndex.<init>(OffsetIndex.scala:53)
at kafka.log.LogSegment$.open(LogSegment.scala:634)
at kafka.log.Log$$anonfun$kafka$log$Log$$loadSegmentFiles$3.apply(Log.scala:395)
at kafka.log.Log$$anonfun$kafka$log$Log$$loadSegmentFiles$3.apply(Log.scala:382)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at kafka.log.Log.kafka$log$Log$$loadSegmentFiles(Log.scala:382)
at kafka.log.Log$$anonfun$loadSegments$1.apply$mcV$sp(Log.scala:493)
at kafka.log.Log$$anonfun$loadSegments$1.apply(Log.scala:487)
at kafka.log.Log$$anonfun$loadSegments$1.apply(Log.scala:487)
at kafka.log.Log.retryOnOffsetOverflow(Log.scala:1853)
at kafka.log.Log.loadSegments(Log.scala:487)
at kafka.log.Log.<init>(Log.scala:237)
at kafka.log.Log$.apply(Log.scala:1986)
at kafka.log.LogManager.kafka$log$LogManager$$loadLog(LogManager.scala:265)
at kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$11$$anonfun$apply$15$$anonfun$apply$2.apply$mcV$sp(LogManager.scala:345)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
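"Transport endpoint is not connected" is the error a GlusterFS FUSE mount returns once the client has lost the volume, so the broker's data directory itself is suspect. To find the PV behind it, one can first resolve the pod's claim (claim name taken from the describe output below) and then inspect the PV:

    kubectl get pvc datadir-kafka-1 -o jsonpath='{.spec.volumeName}'
    kubectl describe pv <volume-name>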
[root@k8s-master-01 ~]# kubectl describe pv pvc-1f7e6c8d-ec7d-4735-8623-5e5f5c7a6e86
Name: pvc-1f7e6c8d-ec7d-4735-8623-5e5f5c7a6e86
Labels: <none>
Annotations: Description: Gluster-Internal: Dynamically provisioned PV
gluster.kubernetes.io/heketi-volume-id: 101afd6bdb20d6f39c88b56741e3f4f1
gluster.org/type: file
kubernetes.io/createdby: heketi-dynamic-provisioner
pv.beta.kubernetes.io/gid: 2004
pv.kubernetes.io/bound-by-controller: yes
pv.kubernetes.io/provisioned-by: kubernetes.io/glusterfs
Finalizers: [kubernetes.io/pv-protection]
StorageClass: gluster-heketi
Status: Bound
Claim: default/datadir-kafka-1
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 50Gi
Node Affinity: <none>
Message:
Source:
Type: Glusterfs (a Glusterfs mount on the host that shares a pod's lifetime)
EndpointsName: glusterfs-dynamic-1f7e6c8d-ec7d-4735-8623-5e5f5c7a6e86
EndpointsNamespace: default
Path: vol_101afd6bdb20d6f39c88b56741e3f4f1
ReadOnly: false
Events: <none>
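The PV is a Heketi-provisioned GlusterFS volume. Before touching the storage, it is worth confirming the failure from inside the pod; a hedged check, again assuming the pod name kafka-1:

    kubectl exec -it kafka-1 -- df -h /opt/kafka/data
    kubectl exec -it kafka-1 -- ls /opt/kafka/data/logs/__consumer_offsets-17

While the FUSE mount is broken, both are expected to fail with the same "Transport endpoint is not connected" error. On the storage side, the backing volume can be inspected directly: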
[root@k8s-storage-03 logs]# gluster volume info vol_101afd6bdb20d6f39c88b56741e3f4f1
Volume Name: vol_101afd6bdb20d6f39c88b56741e3f4f1
Type: Distributed-Replicate
Volume ID: ddfff8f4-5369-41ce-94f8-da2105f7fa3f
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.0.2.44:/var/lib/heketi/mounts/vg_f957b72fcf01d14abe40e4fc44126646/brick_1fba98cf0221789ee59887e30c10ffee/brick
Brick2: 10.0.2.45:/var/lib/heketi/mounts/vg_094f45cfcd0ff5815c3914723876c69e/brick_86551649dc16569a8cd02a5d291db69e/brick
Brick3: 10.0.2.46:/var/lib/heketi/mounts/vg_1f3f18f36dae1a1d968713e583b7972a/brick_46ea9c23f8ce0ff0dff4041d5d1540fd/brick
Brick4: 10.0.2.46:/var/lib/heketi/mounts/vg_1f3f18f36dae1a1d968713e583b7972a/brick_03fb64c2e50d2da7ba0d119db6cdbe6b/brick
Brick5: 10.0.2.44:/var/lib/heketi/mounts/vg_f957b72fcf01d14abe40e4fc44126646/brick_468ca0b1ed787df50165a72305608d25/brick
Brick6: 10.0.2.45:/var/lib/heketi/mounts/vg_094f45cfcd0ff5815c3914723876c69e/brick_284484f3c58fa7d25cf9a2233e8b3cca/brick
Brick7: 10.0.2.44:/var/lib/heketi/mounts/vg_f957b72fcf01d14abe40e4fc44126646/brick_993affb1fcc541b8adb7b674f462cfec/brick
Brick8: 10.0.2.45:/var/lib/heketi/mounts/vg_094f45cfcd0ff5815c3914723876c69e/brick_37a2d4cc20dae75828521ee366162946/brick
Brick9: 10.0.2.46:/var/lib/heketi/mounts/vg_1f3f18f36dae1a1d968713e583b7972a/brick_7a0000e375f3a254f65313f744fa8541/brick
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
user.heketi.id: 101afd6bdb20d6f39c88b56741e3f4f1
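The volume reports Started, so before deleting any data it is reasonable to check brick and self-heal state on a storage node; a sketch using standard gluster CLI commands:

    gluster volume status vol_101afd6bdb20d6f39c88b56741e3f4f1
    gluster volume heal vol_101afd6bdb20d6f39c88b56741e3f4f1 info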
Fix: delete the __consumer_offsets directories from every GlusterFS brick of the volume, then delete the Kafka pod and wait for it to be recreated and recover (sketched below).
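A sketch of that cleanup, with the caveats that the pod name kafka-1 is assumed, the brick paths must be taken from the gluster volume info output on each storage node (10.0.2.44 / .45 / .46), and removing __consumer_offsets-* discards this broker's local copy of those partitions:

    # on every storage node, for each brick path of this volume:
    rm -rf /var/lib/heketi/mounts/vg_<vg-id>/brick_<brick-id>/brick/__consumer_offsets-*
    # then delete the pod; the StatefulSet recreates it and the broker starts cleanly:
    kubectl delete pod kafka-1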