After restarting kubelet and checking its status again, the logs showed that kubelet could not find the node "localhost.localdomain":
localhost.localdomain kubelet[1360]: E0606 14:41:49.231598 1360 eviction_manager.go:258] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"localhost.localdomain\" not found"
Checking the kubelet logs revealed that it failed to load its config file:
[user@localhost kubernetes_files]$ sudo journalctl -u kubelet
>
"command failed" err="failed to load kubelet config file, path: /var/lib/kubelet/config.yaml, error: failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory"
Rewrite the config file with the following command:
sudo vim /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
- 10.96.0.10
clusterDomain: cluster.local
containerRuntimeEndpoint: ""
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging:
  flushFrequency: 0
  options:
    json:
      infoBufferSize: "0"
  verbosity: 0
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
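Writing the file by hand works, but on a node that was set up with kubeadm, kubeadm itself can regenerate this config. A guarded sketch (assumes a kubeadm-managed node; safe to run even where kubeadm is absent):

```shell
# Sketch: let kubeadm rewrite /var/lib/kubelet/config.yaml instead of editing by hand
if command -v kubeadm >/dev/null 2>&1; then
  KUBEADM=yes
  sudo kubeadm init phase kubelet-start   # writes /var/lib/kubelet/config.yaml
else
  KUBEADM=no
  echo "kubeadm not found; write the file manually as above"
fi
```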
After restarting kubelet, a new error appeared: the client CA file could not be loaded.
command failed" err="failed to construct kubelet dependencies: unable to load client CA file /etc/
Resetting and re-initializing with kubeadm regenerates the certificate files; kubeadm reset is covered in tutorial 5 of this series.
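Before resorting to a full reset, a quick look at which CA files are actually present can confirm the diagnosis. A small sketch assuming the standard kubeadm pki layout:

```shell
# Report which of the expected CA files exist (standard kubeadm layout assumed)
for f in /etc/kubernetes/pki/ca.crt /etc/kubernetes/pki/ca.key; do
  if [ -f "$f" ]; then
    echo "present: $f"
  else
    echo "missing: $f"
  fi
done
```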
Digging further, we found that the kube-apiserver container had failed to start. To verify the configuration and pinpoint the error, we started the kube-apiserver container manually:
sudo docker run --rm -it --name kube-apiserver \
--network host \
-v /etc/kubernetes/pki:/etc/kubernetes/pki \
registry.aliyuncs.com/google_containers/kube-apiserver:v1.28.2 \
kube-apiserver \
--advertise-address=100.64.158.0 \
--allow-privileged=true \
--authorization-mode=Node,RBAC \
--client-ca-file=/etc/kubernetes/pki/ca.crt \
--enable-admission-plugins=NodeRestriction \
--enable-bootstrap-token-auth=true \
--etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt \
--etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt \
--etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key \
--etcd-servers=https://127.0.0.1:2379 \
--kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt \
--kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key \
--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname \
--proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt \
--proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key \
--requestheader-allowed-names=front-proxy-client \
--requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt \
--requestheader-extra-headers-prefix=X-Remote-Extra- \
--requestheader-group-headers=X-Remote-Group \
--requestheader-username-headers=X-Remote-User \
--runtime-config=api/all=true \
--secure-port=6443 \
--service-account-issuer=https://kubernetes.default.svc.cluster.local \
--service-account-key-file=/etc/kubernetes/pki/sa.pub \
--service-account-signing-key-file=/etc/kubernetes/pki/sa.key \
--service-cluster-ip-range=10.96.0.0/12 \
--tls-cert-file=/etc/kubernetes/pki/apiserver.crt \
--tls-private-key-file=/etc/kubernetes/pki/apiserver.key
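Since the container only sees what is mounted under /etc/kubernetes/pki, a pre-flight check that the certificate files referenced by the flags exist can save a failed run. A sketch covering a few of the paths above (extend the list as needed):

```shell
# Pre-flight: count how many of the certificate files passed to kube-apiserver are missing
MISSING=0
for f in /etc/kubernetes/pki/ca.crt \
         /etc/kubernetes/pki/apiserver.crt \
         /etc/kubernetes/pki/apiserver.key \
         /etc/kubernetes/pki/etcd/ca.crt; do
  if [ ! -f "$f" ]; then
    echo "missing: $f"
    MISSING=$((MISSING+1))
  fi
done
echo "$MISSING file(s) missing"
```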
The run showed that kube-apiserver failed to start because it could not connect to the etcd service at 127.0.0.1:2379. This is a common problem, usually caused by etcd not running or being misconfigured; checking confirmed that etcd had not been started.
Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
F0606 12:41:16.958414 11 instance.go:291] Error creating leases: error creating storage factory: context deadline exceeded
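A "connection refused" means nothing is listening on 127.0.0.1:2379 at all. This can be confirmed directly from bash via its /dev/tcp pseudo-device before digging into etcd itself:

```shell
# Check whether anything is listening on etcd's client port
if (exec 3<>/dev/tcp/127.0.0.1/2379) 2>/dev/null; then
  STATUS=open
else
  STATUS=closed
fi
echo "etcd client port 2379 is $STATUS"
```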
Install and enable etcd:
sudo yum install etcd -y
sudo systemctl daemon-reload
sudo systemctl enable etcd
>Created symlink from /etc/systemd/system/multi-user.target.wants/etcd.service to /usr/lib/systemd/system/etcd.service.
Start etcd and check its status; it is running normally:
localhost:~$ sudo systemctl start etcd
localhost:~$ sudo systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2024-06-06 21:08:52 CST; 50s ago
 Main PID: 14051 (etcd)
    Tasks: 19
   Memory: 15.6M
   CGroup: /system.slice/etcd.service
           └─14051 /usr/bin/etcd --name=default --data-dir=/var/lib/etcd/default.etcd --listen-client-urls=http://localhost:2379
Jun 06 21:08:52 localhost.localdomain etcd[14051]: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 2
Jun 06 21:08:52 localhost.localdomain etcd[14051]: 8e9e05c52164694d became leader at term 2
Jun 06 21:08:52 localhost.localdomain etcd[14051]: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 2
Jun 06 21:08:52 localhost.localdomain etcd[14051]: published {Name:default ClientURLs:[http://localhost:2379]} to cluster cdf818194e3a8c32
Jun 06 21:08:52 localhost.localdomain etcd[14051]: setting up the initial cluster version to 3.3
Jun 06 21:08:52 localhost.localdomain etcd[14051]: ready to serve client requests
Jun 06 21:08:52 localhost.localdomain etcd[14051]: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
Jun 06 21:08:52 localhost.localdomain systemd[1]: Started Etcd Server.
Jun 06 21:08:52 localhost.localdomain etcd[14051]: set the initial cluster version to 3.3
Jun 06 21:08:52 localhost.localdomain etcd[14051]: enabled capabilities for version 3.3
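As a final sanity check, etcd exposes a /health endpoint on the same client URL. A sketch assuming curl is installed (plain HTTP, matching the insecure listener shown above):

```shell
# Query etcd's health endpoint; fall back to a message if it is unreachable
HEALTH=$(curl -s --max-time 2 http://127.0.0.1:2379/health 2>/dev/null || echo "unreachable")
[ -n "$HEALTH" ] || HEALTH="unreachable"
echo "etcd health: $HEALTH"
```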