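Troubleshooting a kubelet startup error on a Kubernetes node
Error scenario:
While installing a Kubernetes cluster and deploying the kubelet service on a node, running systemctl start kubelet and then tailf /var/log/messages showed a large number of certificate verification errors.
Error output: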
May 5 22:23:40 kubnode-01 kubelet: I0505 22:23:40.583305 5336 feature_gate.go:206] feature gates: &{map[]}
May 5 22:23:40 kubnode-01 kubelet: I0505 22:23:40.589637 5336 mount_linux.go:180] Detected OS with systemd
May 5 22:23:40 kubnode-01 kubelet: I0505 22:23:40.589680 5336 server.go:407] Version: v1.13.4
May 5 22:23:40 kubnode-01 kubelet: I0505 22:23:40.589732 5336 feature_gate.go:206] feature gates: &{map[]}
May 5 22:23:40 kubnode-01 kubelet: I0505 22:23:40.589825 5336 feature_gate.go:206] feature gates: &{map[]}
May 5 22:23:40 kubnode-01 kubelet: I0505 22:23:40.589899 5336 plugins.go:103] No cloud provider specified.
May 5 22:23:40 kubnode-01 kubelet: I0505 22:23:40.589916 5336 server.go:523] No cloud provider specified: "" from the config file: ""
May 5 22:23:40 kubnode-01 kubelet: I0505 22:23:40.589938 5336 bootstrap.go:65] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
May 5 22:23:40 kubnode-01 kubelet: I0505 22:23:40.593022 5336 bootstrap.go:96] No valid private key and/or certificate found, reusing existing private key or creating a new one
May 5 22:23:40 kubnode-01 kubelet: I0505 22:23:40.612493 5336 bootstrap.go:239] Failed to connect to apiserver: Get https://172.20.101.157:6443/healthz?timeout=1s: x509: certificate signed by unknown authority
May 5 22:23:42 kubnode-01 kubelet: I0505 22:23:42.909358 5336 bootstrap.go:239] Failed to connect to apiserver: Get https://172.20.101.157:6443/healthz?timeout=1s: x509: certificate signed by unknown authority
May 5 22:23:45 kubnode-01 kubelet: I0505 22:23:45.036663 5336 bootstrap.go:239] Failed to connect to apiserver: Get https://172.20.101.157:6443/healthz?timeout=1s: x509: certificate signed by unknown authority
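To check whether the repeated failures all come from the same certificate problem, the matching lines can be counted. This is a minimal sketch that assumes the log format shown above; the function name count_x509_failures is made up for illustration and is not part of any Kubernetes tooling.

```shell
# Hypothetical helper: count kubelet bootstrap failures caused by an untrusted CA.
# Reads log text on stdin, so it works with /var/log/messages or any saved excerpt.
count_x509_failures() {
    # grep -c prints the number of matching lines; it exits non-zero when the
    # count is 0, so "|| true" keeps the function's exit status clean.
    grep -c 'bootstrap.go.*x509: certificate signed by unknown authority' || true
}
```

On the node it could be used as: count_x509_failures < /var/log/messages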
The fix is as follows:
On the master node, bind the kubelet-bootstrap user to the system:node-bootstrapper cluster role
[root@k8s-node01 ~]# kubectl create clusterrolebinding kubelet-bootstrap --clusterrole=system:node-bootstrapper --user=kubelet-bootstrap
clusterrolebinding "kubelet-bootstrap" created
Start the service on the node
[root@k8s-node01 ~]# systemctl start kubelet
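Once kubelet has started on the node, it submits a certificate signing request (CSR) to the master, and the request has to be approved on the master.
On the master node, list the CSRs; the status is Pending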
[root@kubm-01 ~]# kubectl get csr
NAME AGE REQUESTOR CONDITION
node-csr-mgZK4Cqvb7kZA7tDqVmszNQYLq27Yydia5LCqKJnnEI 4m11s kubelet-bootstrap Pending
On the master node, approve the certificate request
[root@kubm-01 ~]# kubectl certificate approve node-csr-mgZK4Cqvb7kZA7tDqVmszNQYLq27Yydia5LCqKJnnEI
After the request has been approved, check the status again on the master node; it changes to Approved,Issued
[root@kubm-01 ~]# kubectl get csr
NAME AGE REQUESTOR CONDITION
node-csr-mgZK4Cqvb7kZA7tDqVmszNQYLq27Yydia5LCqKJnnEI 5m39s kubelet-bootstrap Approved,Issued
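When several nodes join at once, copying each CSR name by hand gets tedious. The sketch below pulls the Pending names out of the kubectl get csr table. It assumes the CONDITION column is the last column, as in the output above; pending_csr_names is a hypothetical helper name.

```shell
# Hypothetical helper: print the names of CSRs whose CONDITION column is "Pending".
# Expects the table printed by `kubectl get csr` on stdin.
pending_csr_names() {
    # NR > 1 skips the header row; $NF is the last column (CONDITION).
    awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Possible use on the master (left commented out so nothing is approved by accident;
# -r is the GNU xargs flag that skips the command when there is no input):
# kubectl get csr | pending_csr_names | xargs -r -n 1 kubectl certificate approve
```

Approving requests in bulk skips the manual review of each one, so it only makes sense when every pending request is known to come from a node you are deploying yourself.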
Verify on the node
In the ssl directory on the node, 4 new kubelet certificate files have appeared
[root@kubnode-02 kubernetes]# ls /kubernetes/ssl/kubelet*
/kubernetes/ssl/kubelet-client-2019-05-05-22-15-53.pem /kubernetes/ssl/kubelet-client-current.pem /kubernetes/ssl/kubelet.crt /kubernetes/ssl/kubelet.key
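The same check can be scripted. The sketch below reports which of the expected files are absent. It assumes the /kubernetes/ssl directory and the file names from the listing above (the dated client certificate is left out because its name changes with every issue); missing_kubelet_files is a hypothetical helper name.

```shell
# Hypothetical helper: print the expected kubelet files that are missing.
# The directory defaults to /kubernetes/ssl, the path used in this article.
missing_kubelet_files() {
    dir="${1:-/kubernetes/ssl}"
    for name in kubelet-client-current.pem kubelet.crt kubelet.key; do
        # -e follows symlinks, so a dangling kubelet-client-current.pem
        # link is also reported as missing.
        [ -e "$dir/$name" ] || echo "$name"
    done
}
```

Empty output from missing_kubelet_files means all three files are present.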
Delete the CSR (optional)
[root@kubm-01 ~]# kubectl delete csr node-csr-mgZK4Cqvb7kZA7tDqVmszNQYLq27Yydia5LCqKJnnEI
certificatesigningrequest.certificates.k8s.io "node-csr-mgZK4Cqvb7kZA7tDqVmszNQYLq27Yydia5LCqKJnnEI" deleted
Verify the deletion:
kubectl get csr
No CSRs are returned
Tracking this one down was full of pitfalls...