1.
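The entire cluster crashed for no apparent reason: no command could be executed, and every component failed to start (except controller-manager and scheduler, which were running normally). The relevant status output and errors are shown below: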
[root@k8s-master2 ~]# kubectl get cs
error: the server doesn't have a resource type "cs"
[root@k8s-master2 ~]# systemctl status flanneld
● flanneld.service - Flanneld overlay address etcd agent
Loaded: loaded (/etc/systemd/system/flanneld.service; enabled; vendor preset: disabled)
Active: activating (start) since Mon 2019-04-08 11:26:35 CST; 37s ago
Main PID: 6691 (flanneld)
Memory: 10.6M
CGroup: /system.slice/flanneld.service
└─6691 /opt/k8s/bin/flanneld -etcd-cafile=/etc/kubernetes/cert/ca.pem -etcd-certfile=/etc/flanneld/...
Apr 08 11:26:45 k8s-master2 flanneld[6691]: ; error #2: dial tcp 192.168.32.129:2379: getsockopt: connecti...used
Apr 08 11:26:46 k8s-master2 flanneld[6691]: timed out
Apr 08 11:26:56 k8s-master2 flanneld[6691]: E0408 11:26:56.506816 6691 main.go:349] Couldn't fetch netw...used
Apr 08 11:26:56 k8s-master2 flanneld[6691]: ; error #1: net/http: TLS handshake timeout
Apr 08 11:26:56 k8s-master2 flanneld[6691]: ; error #2: dial tcp 192.168.32.129:2379: getsockopt: connecti...used
Apr 08 11:26:57 k8s-master2 flanneld[6691]: timed out
Apr 08 11:27:07 k8s-master2 flanneld[6691]: E0408 11:27:07.511956 6691 main.go:349] Couldn't fetch netw...used
Apr 08 11:27:07 k8s-master2 flanneld[6691]: ; error #1: net/http: TLS handshake timeout
Apr 08 11:27:07 k8s-master2 flanneld[6691]: ; error #2: dial tcp 192.168.32.129:2379: getsockopt: connecti...used
Apr 08 11:27:08 k8s-master2 flanneld[6691]: timed out
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master2 ~]# systemctl status kube-apiserver
● kube-apiserver.service - Kubernetes API Server
Loaded: loaded (/etc/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Mon 2019-04-08 11:30:01 CST; 4s ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Process: 7348 ExecStart=/opt/k8s/bin/kube-apiserver --enable-admission-plugins=Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota --anonymous-auth=false --experimental-encryption-provider-config=/etc/kubernetes/encryption-config.yaml --advertise-address=192.168.32.129 --bind-address=192.168.32.129 --insecure-port=0 --authorization-mode=Node,RBAC --runtime-config=api/all --enable-bootstrap-token-auth --service-cluster-ip-range=10.254.0.0/16 --service-node-port-range=8400-9000 --tls-cert-file=/etc/kubernetes/cert/kubernetes.pem --tls-private-key-file=/etc/kubernetes/cert/kubernetes-key.pem --client-ca-file=/etc/kubernetes/cert/ca.pem --kubelet-client-certificate=/etc/kubernetes/cert/kubernetes.pem --kubelet-client-key=/etc/kubernetes/cert/kubernetes-key.pem --service-account-key-file=/etc/kubernetes/cert/ca-key.pem --etcd-cafile=/etc/kubernetes/cert/ca.pem --etcd-certfile=/etc/kubernetes/cert/kubernetes.pem --etcd-keyfile=/etc/kubernetes/cert/kubernetes-key.pem --etcd-servers=https://192.168.32.128:2379,https://192.168.32.129:2379,https://192.168.32.130:2379 --enable-swagger-ui=true --allow-privileged=true --apiserver-count=3 --audit-log-maxage=30 --audit-log-maxbackup=3 --audit-log-maxsize=100 --audit-log-path=/var/log/kube-apiserver-audit.log --event-ttl=1h --alsologtostderr=true --logtostderr=false --log-dir=/var/log/kubernetes --v=2 (code=exited, status=255)
Main PID: 7348 (code=exited, status=255)
Memory: 0B
CGroup: /system.slice/kube-apiserver.service
Apr 08 11:30:01 k8s-master2 systemd[1]: kube-apiserver.service: main process exited, code=exited, status=255/n/a
Apr 08 11:30:01 k8s-master2 systemd[1]: Failed to start Kubernetes API Server.
Apr 08 11:30:01 k8s-master2 systemd[1]: Unit kube-apiserver.service entered failed state.
Apr 08 11:30:01 k8s-master2 systemd[1]: kube-apiserver.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master2 ~]# systemctl status etcd
● etcd.service - Etcd Server
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Mon 2019-04-08 11:28:02 CST; 358ms ago
Docs: https://github.com/coreos
Process: 7001 ExecStart=/opt/k8s/bin/etcd --data-dir=/var/lib/etcd --name=k8s-master2 --cert-file=/etc/etcd/cert/etcd.pem --key-file=/etc/etcd/cert/etcd-key.pem --trusted-ca-file=/etc/kubernetes/cert/ca.pem --peer-cert-file=/etc/etcd/cert/etcd.pem --peer-key-file=/etc/etcd/cert/etcd-key.pem --peer-trusted-ca-file=/etc/kubernetes/cert/ca.pem --peer-client-cert-auth --client-cert-auth --listen-peer-urls=https://192.168.32.129:2380 --initial-advertise-peer-urls=https://192.168.32.129:2380 --listen-client-urls=https://192.168.32.129:2379,http://127.0.0.1:2379 --advertise-client-urls=https://192.168.32.129:2379 --initial-cluster-token=etcd-cluster-0 --initial-cluster=k8s-master1=https://192.168.32.128:2380,k8s-master2=https://192.168.32.129:2380,k8s-master3=https://192.168.32.130:2380 --initial-cluster-state=new (code=exited, status=2)
Main PID: 7001 (code=exited, status=2)
Apr 08 11:28:02 k8s-master2 systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 08 11:28:02 k8s-master2 systemd[1]: Failed to start Etcd Server.
Apr 08 11:28:02 k8s-master2 systemd[1]: Unit etcd.service entered failed state.
Apr 08 11:28:02 k8s-master2 systemd[1]: etcd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
2.
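Troubleshooting approach
The most critical data in the cluster lives in etcd. The flannel network data is stored in etcd, and all other cluster data is stored in etcd as well.
Cluster components read the etcd data through kube-apiserver.
So start with etcd and get it to start successfully.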
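The dependency order described above (etcd first, then the services that depend on it) can be sketched as a small script. This is only a minimal sketch, not the procedure actually used here: the unit names are taken from the `systemctl status` output in section 1, while the `restart_in_order` function and the `DRY_RUN` variable are hypothetical helpers introduced for illustration. With the default `DRY_RUN=1` it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restart units in dependency order, etcd first.
# Unit names come from the systemctl output in section 1.
# DRY_RUN is an assumption added here so the order can be reviewed
# without touching any service; set DRY_RUN=0 to really restart.
DRY_RUN="${DRY_RUN:-1}"

restart_in_order() {
    local unit
    for unit in etcd flanneld kube-apiserver; do
        if [ "$DRY_RUN" = "1" ]; then
            # Only show what would be executed.
            echo "systemctl restart $unit"
        else
            # Stop at the first failure: later units depend on earlier ones.
            systemctl restart "$unit" || {
                echo "restart of $unit failed; inspect: journalctl -u $unit --no-pager -l" >&2
                return 1
            }
        fi
    done
}

restart_in_order
```

Stopping at the first failure is deliberate: while etcd is down, restarting flanneld or kube-apiserver only produces more of the connection errors shown earlier. Note that in a multi-member etcd cluster, a single member may not report a successful start until enough peers are up, so the etcd step may need to be run on several masters at about the same time.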
Restarting etcd fails with the following error:
[root@k8s-master3 ~]# systemctl daemon-reload && systemctl restart etcd
Job for etcd.service failed because the control process exited with error code. See "systemctl status etcd.service" and "journalctl -xe" for details.
[root@k8s-master3 ~]# systemctl status etcd.service
● etcd.service - Etcd Server
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Mon 2019-04-08 13:47:20 CST; 1s ago
Docs: https://github.com/coreos
Process: 25019 ExecStart=/opt/k8s/bin/etcd --data-dir=/var/lib/etcd --name=k8s-master3 --cert-file=/etc/etcd/cert/etcd.pem --key-file=/etc/etcd/cert/etcd-key.pem --trusted-ca-file=/etc/kubernetes/cert/ca.pem --peer-cert-file=/etc/etcd/cert/etcd.pem --peer-key-file=/etc/etcd/cert/etcd-key.pem --peer-trusted-ca-file=/etc/kubernetes/cert/ca.pem --peer-client-cert-auth --client-cert-auth --listen-peer-urls=https://192.168.32.130:2380 --initial-advertise-peer-urls=https://192.168.32.130:2380 --listen-client-urls=https://192.168.32.130:2379,http://127.0.0.1:2379 --advertise-client-urls=https://192.168.32.130:2379 --initial-cluster-token=etcd-cluster-0 --initial-cluster=k8s-master1=https://192.168.32.128:2380,k8s-master2=https://192.168.32.129:2380,k8s-master3=https://192.168.32.130:2380 --initial-cluster-state=new (code=exited, status=2)
Main PID: 25019 (code=exited, status=2)
Apr 08 13:47:20 k8s-master3 systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 08 13:47:20 k8s-master3 systemd[1]: Failed to start Etcd Server.
Apr 08 13:47:20 k8s-master3 systemd[1]: Unit etcd.service entered failed state.
Apr 08 13:47:20 k8s-master3 systemd[1]: etcd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-