Error message:
Warning FailedMount 22m (x35 over 77m) kubelet MountVolume.MountDevice failed for volume "pvc-XXX" : rpc error: code = Internal desc = can not find diskId disk-rhtsqync by serial
KCM log:
0) KCM tried to attach the volume on 172.28.16.10 and got ResourceUnavailable.ZoneNotMatch; it kept retrying.
1) AttachVolume -> 172.28.1.72
I0113 17:50:56.616633 1 reconciler.go:304] attacherDetacher.AttachVolume started for volume "pvc-2708bf81-6678-4999-9ca2-9a8d2eae3c36" (UniqueName: "kubernetes.io/csi/com.tencent.cloud.csi.cbs^disk-rhtsqync") from node "172.28.1.72"
I0113 17:51:08.630091 1 operation_generator.go:370] AttachVolume.Attach succeeded for volume "pvc-2708bf81-6678-4999-9ca2-9a8d2eae3c36" (UniqueName: "kubernetes.io/csi/com.tencent.cloud.csi.cbs^disk-rhtsqync") from node "172.28.1.72"
I0113 17:51:08.630228 1 event.go:291] "Event occurred" object="db/redis-master-0" kind="Pod" apiVersion="v1" type="Normal" reason="SuccessfulAttachVolume" message="AttachVolume.Attach succeeded for volume \"pvc-2708bf81-6678-4999-9ca2-9a8d2eae3c36\" "
2) DetachVolume -> 172.28.16.10
I0113 17:51:08.733908 1 reconciler.go:221] attacherDetacher.DetachVolume started for volume "pvc-2708bf81-6678-4999-9ca2-9a8d2eae3c36" (UniqueName: "kubernetes.io/csi/com.tencent.cloud.csi.cbs^disk-rhtsqync") on node "172.28.16.10"
I0113 17:51:08.736979 1 operation_generator.go:1578] Verified volume is safe to detach for volume "pvc-2708bf81-6678-4999-9ca2-9a8d2eae3c36" (UniqueName: "kubernetes.io/csi/com.tencent.cloud.csi.cbs^disk-rhtsqync") on node "172.28.16.10"
I0113 17:51:15.767935 1 operation_generator.go:485] DetachVolume.Detach succeeded for volume "pvc-2708bf81-6678-4999-9ca2-9a8d2eae3c36" (UniqueName: "kubernetes.io/csi/com.tencent.cloud.csi.cbs^disk-rhtsqync") on node "172.28.16.10"
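Conclusion:
The Pod first ran on 172.28.16.10 and was later moved to 172.28.1.72.
The KCM log shows that 1) AttachVolume -> 172.28.1.72 and 2) DetachVolume -> 172.28.16.10 ran in the wrong order: the attach to the new node succeeded (17:51:08.630) before the detach from the old node even started (17:51:08.733).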
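The ordering can be checked directly from the KCM log by extracting the timestamp, operation, and node from each "succeeded" line. The following is a minimal sketch, assuming the klog line format shown above; `parse_events` and the regular expression are illustrative names written for this note, not part of any Kubernetes tooling.

```python
import re

# Matches lines such as:
#   I0113 17:51:08.630091 1 operation_generator.go:370] AttachVolume.Attach succeeded ... from node "172.28.1.72"
# The event.go line is skipped because it carries no node "..." field.
LINE_RE = re.compile(
    r'^I\d{4} (?P<ts>\d{2}:\d{2}:\d{2}\.\d+) .*?'
    r'(?P<op>Attach|Detach)Volume\.(?P=op) succeeded'
    r'.*?node "(?P<node>[^"]+)"'
)

def parse_events(lines):
    """Return (timestamp, operation, node) tuples sorted by timestamp.

    Timestamps are compared as strings, which is only valid within one day.
    """
    events = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            events.append((m.group("ts"), m.group("op"), m.group("node")))
    return sorted(events)

# Two lines copied from the KCM log above.
SAMPLE = [
    'I0113 17:51:15.767935 1 operation_generator.go:485] DetachVolume.Detach succeeded for volume "pvc-2708bf81-6678-4999-9ca2-9a8d2eae3c36" (UniqueName: "kubernetes.io/csi/com.tencent.cloud.csi.cbs^disk-rhtsqync") on node "172.28.16.10"',
    'I0113 17:51:08.630091 1 operation_generator.go:370] AttachVolume.Attach succeeded for volume "pvc-2708bf81-6678-4999-9ca2-9a8d2eae3c36" (UniqueName: "kubernetes.io/csi/com.tencent.cloud.csi.cbs^disk-rhtsqync") from node "172.28.1.72"',
]

for ts, op, node in parse_events(SAMPLE):
    print(ts, op, node)
```

If the Attach entry for the new node sorts before the Detach entry for the old node, the log shows the reversed order described in the conclusion.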
-----
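KCM keeps two data structures: the ASW (actual state of world), which records the volume-node pairs it considers attached, and the DSW (desired state of world), which records the volume-node pairs that should be attached.
KCM runs a reconcile loop at a fixed interval (100 ms by default). Each pass first detaches the volumes that are no longer needed, then attaches the volumes that are needed.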
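One reconcile pass can be pictured as two set differences between the ASW and the DSW. This is a simplified sketch, not the controller's actual code: it models both states as plain Python sets of (volume, node) pairs, and `reconcile_once` is a hypothetical helper name. The real controller keeps richer caches and performs additional safety checks.

```python
def reconcile_once(asw, dsw):
    """Compute the work for one pass: detach candidates first, then attach candidates."""
    to_detach = sorted(asw - dsw)  # recorded as attached, but no longer desired
    to_attach = sorted(dsw - asw)  # desired, but not yet recorded as attached
    return to_detach, to_attach

# State after the Pod has moved: the ASW still holds the old node, the DSW holds the new one.
asw = {("disk-rhtsqync", "172.28.16.10")}
dsw = {("disk-rhtsqync", "172.28.1.72")}
to_detach, to_attach = reconcile_once(asw, dsw)
print("detach:", to_detach)
print("attach:", to_attach)
```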
-----
16:27:15
The Pod is scheduled to node 172.28.16.10.
17:50:56
The Pod restarts and is scheduled to 172.28.1.72.
------
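KCM loop 1 (the Pod is still on the old node):
1) Detach logic
No action.
2) Attach logic
KCM calls the attacher, but the attach never succeeds (ResourceUnavailable.ZoneNotMatch).
KCM nevertheless treats the attach as successful and records (volume, old node) in the ASW. Reason: even though the call returned an error, the volume may in fact have been attached; assuming success ensures the volume is not forgotten when it later has to be detached.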
----
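KCM loop 2:
1) Detach logic (the Pod is still on the old node)
(volume, old node) is already in the ASW, but it is also in the DSW, so it is skipped.
2) Attach logic (by now the Pod has moved to the new node; (volume, old node) has been removed from the DSW and (volume, new node) has been added)
AttachVolume -> 172.28.1.72 succeeds.
(volume, new node) is recorded in the ASW.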
---
KCM loop 3:
1) Detach logic
(volume, old node) is in the ASW but no longer in the DSW,
so this triggers
DetachVolume -> 172.28.16.10
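The three loops can be replayed with a toy model to see why the attach to the new node is issued before the detach from the old node. This is a sketch under the assumptions stated in the analysis above: the ASW and DSW are plain sets, a failed attach is still recorded in the ASW, and the DSW changes between the detach and attach phases of loop 2. The names `detach_phase`, `attach_phase`, and `replay` are hypothetical, and the model omits the attach retries seen in the real log.

```python
VOLUME = "disk-rhtsqync"
OLD, NEW = "172.28.16.10", "172.28.1.72"

def detach_phase(asw, dsw, log):
    # Detach every pair recorded as attached that is no longer desired.
    for pair in sorted(asw - dsw):
        log.append(("detach", pair[1]))
        asw.discard(pair)

def attach_phase(asw, dsw, attach_succeeds, log):
    # Attach every desired pair that is not yet recorded as attached.
    for pair in sorted(dsw - asw):
        ok = attach_succeeds(pair)
        log.append(("attach" if ok else "attach-failed", pair[1]))
        # Assumption from the analysis: the pair is recorded even when the
        # call errors, so that a later detach is not forgotten.
        asw.add(pair)

def replay():
    asw, dsw, log = set(), {(VOLUME, OLD)}, []
    # The old node is in the wrong zone, so attaching there fails (ZoneNotMatch).
    succeeds = lambda pair: pair[1] != OLD

    # Loop 1: the Pod is still on the old node.
    detach_phase(asw, dsw, log)
    attach_phase(asw, dsw, succeeds, log)

    # Loop 2: the detach phase still sees the old DSW; the Pod then moves.
    detach_phase(asw, dsw, log)
    dsw = {(VOLUME, NEW)}
    attach_phase(asw, dsw, succeeds, log)

    # Loop 3: (volume, old node) is in the ASW but not in the DSW.
    detach_phase(asw, dsw, log)
    attach_phase(asw, dsw, succeeds, log)
    return log, asw

log, final_asw = replay()
for step in log:
    print(step)
```

In this model the detach for the old node can only be issued one loop after the DSW changes, while the attach to the new node is issued in the same loop as the change, which matches the order seen in the KCM log.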