Kubelet on a node keeps reporting: node "xxxxx" not found

11月 05 10:58:35 localhost.localdomain kubelet[974]: E1105 10:58:35.108952     974 kubelet.go:2412] "Error getting node" err="node \"localhost.localdomain\" not found"
11月 05 10:58:35 localhost.localdomain kubelet[974]: E1105 10:58:35.209293     974 kubelet.go:2412] "Error getting node" err="node \"localhost.localdomain\" not found"
11月 05 10:58:35 localhost.localdomain kubelet[974]: E1105 10:58:35.310543     974 kubelet.go:2412] "Error getting node" err="node \"localhost.localdomain\" not found"
11月 05 10:58:35 localhost.localdomain kubelet[974]: E1105 10:58:35.411121     974 kubelet.go:2412] "Error getting node" err="node \"localhost.localdomain\" not found"
11月 05 10:58:35 localhost.localdomain kubelet[974]: E1105 10:58:35.511949     974 kubelet.go:2412] "Error getting node" err="node \"localhost.localdomain\" not found"
11月 05 10:58:35 localhost.localdomain kubelet[974]: E1105 10:58:35.612822     974 kubelet.go:2412] "Error getting node" err="node \"localhost.localdomain\" not found"
11月 05 10:58:35 localhost.localdomain kubelet[974]: E1105 10:58:35.713249     974 kubelet.go:2412] "Error getting node" err="node \"localhost.localdomain\" not found"
11月 05 10:58:35 localhost.localdomain kubelet[974]: E1105 10:58:35.781263     974 controller.go:144] failed to ensure lease exists, will retry in 7s, error: leases.coordination.k8s.io "localhost.localdomain" is forbidden: User "system:node:k8s222" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-node-lease": can only access node lease with the same name as the requesting node
11月 05 10:58:35 localhost.localdomain kubelet[974]: E1105 10:58:35.813355     974 kubelet.go:2412] "Error getting node" err="node \"localhost.localdomain\" not found"
11月 05 10:58:35 localhost.localdomain kubelet[974]: E1105 10:58:35.913495     974 kubelet.go:2412] "Error getting node" err="node \"localhost.localdomain\" not found"
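These errors repeat roughly every 100 ms. To watch them live on the node, something like the following works on a systemd host (a minimal sketch; kubelet is the unit name installed by kubeadm):

journalctl -u kubelet -f                      # follow new kubelet messages as they arrive
journalctl -u kubelet --since "10 min ago"    # or dump just the recent history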

1.1 The node stays NotReady

[root@crust-m01 ~]# kubectl get node
NAME     STATUS     ROLES                  AGE   VERSION
k8s220   NotReady   control-plane,master   44d   v1.21.3
k8s221   NotReady   <none>                 44d   v1.21.3
k8s222   NotReady   <none>                 44d   v1.21.3


1.2 Inspect the node's details

[root@localhost ~]# kubectl describe node k8s221
……
Unschedulable:      false
Lease:
  HolderIdentity:  k8s221
  AcquireTime:     <unset>
  RenewTime:       Tue, 28 Sep 2021 14:37:08 +0800
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Tue, 28 Sep 2021 14:32:16 +0800   Tue, 28 Sep 2021 14:38:17 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Tue, 28 Sep 2021 14:32:16 +0800   Tue, 28 Sep 2021 14:38:17 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Tue, 28 Sep 2021 14:32:16 +0800   Tue, 28 Sep 2021 14:38:17 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Tue, 28 Sep 2021 14:32:16 +0800   Tue, 28 Sep 2021 14:38:17 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
……
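The Lease block above, together with the "failed to ensure lease exists ... can only access node lease with the same name as the requesting node" error in the kubelet log, already hints at a naming problem: a kubelet may only renew the Lease object in kube-node-lease that carries its own node name. A quick way to see which lease names actually exist (a sketch, run from any machine with admin access to the cluster):

kubectl -n kube-node-lease get lease                    # one Lease per registered node
kubectl -n kube-node-lease get lease k8s221 -o yaml     # holderIdentity and renewTime for a single node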

1.3 Check the kubelet logs on that node

[root@crust-m2 ~]# service kubelet status -l
Redirecting to /bin/systemctl status  -l kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since 二 2021-09-28 14:51:57 CST; 4min 6s ago
     Docs: https://kubernetes.io/docs/
 Main PID: 21165 (kubelet)
    Tasks: 19
   Memory: 43.0M
   CGroup: /system.slice/kubelet.service
           └─21165 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.4.1

9月 28 14:56:03 crust-m2 kubelet[21165]: E0928 14:56:03.119645   21165 kubelet.go:2291] "Error getting node" err="node \"crust-m2\" not found"
9月 28 14:56:03 crust-m2 kubelet[21165]: E0928 14:56:03.220694   21165 kubelet.go:2291] "Error getting node" err="node \"crust-m2\" not found"
9月 28 14:56:03 crust-m2 kubelet[21165]: E0928 14:56:03.321635   21165 kubelet.go:2291] "Error getting node" err="node \"crust-m2\" not found"
9月 28 14:56:03 crust-m2 kubelet[21165]: E0928 14:56:03.385100   21165 eviction_manager.go:255] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"crust-m2\" not found"
9月 28 14:56:03 crust-m2 kubelet[21165]: E0928 14:56:03.422387   21165 kubelet.go:2291] "Error getting node" err="node \"crust-m2\" not found"
9月 28 14:56:03 crust-m2 kubelet[21165]: E0928 14:56:03.523341   21165 kubelet.go:2291] "Error getting node" err="node \"crust-m2\" not found"
9月 28 14:56:03 crust-m2 kubelet[21165]: E0928 14:56:03.624021   21165 kubelet.go:2291] "Error getting node" err="node \"crust-m2\" not found"
9月 28 14:56:03 crust-m2 kubelet[21165]: E0928 14:56:03.724418   21165 kubelet.go:2291] "Error getting node" err="node \"crust-m2\" not found"
9月 28 14:56:03 crust-m2 kubelet[21165]: E0928 14:56:03.825475   21165 kubelet.go:2291] "Error getting node" err="node \"crust-m2\" not found"
9月 28 14:56:03 crust-m2 kubelet[21165]: E0928 14:56:03.926199   21165 kubelet.go:2291] "Error getting node" err="node \"crust-m2\" not found"

2. [Troubleshooting]

  • The kubelet start command shown in the log output in 1.3 is as follows:
/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.4.1

 

Reviewing all the configuration files referenced in that command (bootstrap-kubelet.conf, kubelet.conf, config.yaml) turns up nothing wrong.
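One of those checks is worth spelling out: the identity the kubelet authenticates with is the CommonName of the client certificate referenced by /etc/kubernetes/kubelet.conf, and it has to match the node and lease name. A rough way to inspect it (a sketch, assuming the kubeadm default layout where the rotating client certificate lives under /var/lib/kubelet/pki/):

grep client-certificate /etc/kubernetes/kubelet.conf
openssl x509 -noout -subject -in /var/lib/kubelet/pki/kubelet-client-current.pem
# the subject should contain something like: CN = system:node:k8s222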

  • Notice that the error in 1.3 reads err="node \"localhost.localdomain\" not found",
    while kubectl get node on the master lists the three nodes as k8s220, k8s221 and k8s222
  • Conclusion
    When Kubernetes was installed, the master was named k8s220 and the workers k8s221 and k8s222, but /etc/hostname on this node was left at the default localhost.localdomain. The kubelet therefore registers and looks itself up under a name that has no Node object, so it keeps logging this error (a quick check of the mismatch is sketched below)
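The mismatch can be confirmed with a few commands (a sketch; run the first two on the broken node, the last one on the master):

hostname                      # prints localhost.localdomain on the broken node
cat /etc/hostname             # still contains the default localhost.localdomain
kubectl get nodes -o name     # node/k8s220, node/k8s221, node/k8s222 (no localhost.localdomain entry)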

3. [Fix]

  • Edit /etc/hostname and set the server name with the hostname command (or hostnamectl) so it matches the node name the cluster expects
  • Restart the kubelet (see the command sketch below)
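A minimal sketch of the two steps, assuming this node is meant to be k8s222 (that is the identity in the system:node:k8s222 lease error above); substitute the name your cluster expects. The kubelet derives its node name from the hostname unless --hostname-override is set:

hostnamectl set-hostname k8s222      # writes /etc/hostname and applies the name immediately
systemctl restart kubelet            # re-register under the new name
journalctl -u kubelet -f             # the "not found" errors should stop once registration succeeds
kubectl get nodes                    # on the master: the node should report Ready shortly afterwards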