Problems encountered running Kubernetes in production, and how to fix them

Author: 运维潇哥

Last modified: 2024-01-11 21:57:35

First published: 2024-01-11 18:09:12

Article link: https://blog.csdn.net/weixin_38924998/article/details/135535955

This document is updated on an ongoing basis.

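Calico problems

calico-node fails to start

Base versions: Kubernetes v1.29.0, Calico v3.27.0, CentOS 7.9, kernel 6.6.8

Problem: Calico fails to start. The pods report STATUS Running, but the READY column shows 0/1, so they are not actually serving.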

NAME                                       READY   STATUS    RESTARTS      AGE
calico-kube-controllers-78d68c6746-cmqqg   0/1     Running   2 (49s ago)   3m11s
calico-node-769qn                          0/1     Running   0             41h
calico-node-fl4jc                          0/1     Running   0             2m57s
calico-node-gbv4p                          1/1     Running   0             2m42s
calico-node-psk2b                          0/1     Running   0             2m29s

Error log:

kubectl logs -n calico-system calico-node-769qn --tail=10

# Log output follows
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
2024-01-05 03:55:48.494 [WARNING][11840] felix/ipsets.go 346: Failed to resync with dataplane error=exit status 1 family="inet"
2024-01-05 03:55:48.527 [INFO][11840] felix/ipsets.go 337: Retrying after an ipsets update failure... family="inet"
2024-01-05 03:55:48.534 [ERROR][11840] felix/ipsets.go 599: Bad return code from 'ipset list'. error=exit status 1 family="inet" stderr="ipset v7.11: Kernel and userspace incompatible: settype hash:ip,port with revision 7 not supported by userspace.\n"
2024-01-05 03:55:48.534 [WARNING][11840] felix/ipsets.go 346: Failed to resync with dataplane error=exit status 1 family="inet"
2024-01-05 03:55:48.599 [INFO][11840] felix/ipsets.go 337: Retrying after an ipsets update failure... family="inet"
2024-01-05 03:55:48.606 [ERROR][11840] felix/ipsets.go 599: Bad return code from 'ipset list'. error=exit status 1 family="inet" stderr="ipset v7.11: Kernel and userspace incompatible: settype hash:ip,port with revision 7 not supported by userspace.\n"

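Analysis:

The calico-node container fails while Felix resyncs with the dataplane, and the failure comes from ipset. The key message is "Kernel and userspace incompatible: settype hash:ip,port with revision 7 not supported by userspace": the running kernel offers revision 7 of the hash:ip,port set type, which the ipset v7.11 userspace tool invoked by calico-node does not support.

There are two ways out: downgrade the system kernel, or change the Calico version.

Resolution:

Kernel downgrade: with kernel 5.4.264, calico-node starts normally.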
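
When the log is long, it can help to reduce it to just the set type and revision being rejected. The helper below is a minimal sketch: the function name ipset_mismatch is hypothetical and not part of Calico, and it assumes the message wording matches the log shown above.

```shell
# Hypothetical helper (not part of Calico): read Felix log lines on stdin and
# print each distinct "<set type> <revision>" pair rejected by userspace ipset.
ipset_mismatch() {
  sed -n 's/.*settype \([^ ]*\) with revision \([0-9][0-9]*\) not supported by userspace.*/\1 \2/p' |
    sort -u
}
```

One possible use is kubectl logs -n calico-system calico-node-769qn --tail=50 | ipset_mismatch, followed by uname -r on the affected node to see which kernel it is running.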

calico-apiserver fails to start

Problem

calico-apiserver stays in the Pending state.

NAMESPACE          NAME                                                           READY   STATUS    RESTARTS   AGE
calico-apiserver   calico-apiserver-65fb845b45-t2vpj                              0/1     Pending   0          9m33s
calico-apiserver   calico-apiserver-65fb845b45-zb2q9                              0/1     Pending   0          9m33s

Investigation

kubectl describe pod calico-apiserver-65fb845b45-t2vpj -n calico-apiserver
###
Warning  FailedScheduling  4m47s (x2 over 10m)  default-scheduler  0/4 nodes are available: 4 node(s) had untolerated taint {node.kubernetes.io/network-unavailable: }. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling

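calico-apiserver stays Pending because calico-node failed to start: without a working network plugin, every node keeps the node.kubernetes.io/network-unavailable taint, so the scheduler cannot place the pods. Once calico-node is healthy, the problem goes away.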
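
To see which nodes still carry the taint, something like the following can be used. It is a sketch only: the function name nodes_with_taint is made up for illustration, and it relies on kubectl printing taint keys as a single comma-separated column.

```shell
# Hypothetical helper: print the names of nodes whose taint keys include $1.
nodes_with_taint() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key' |
    awk -v key="$1" 'index($2, key) { print $1 }'
}
```

For example, nodes_with_taint node.kubernetes.io/network-unavailable should print nothing once calico-node is ready on every node.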

calico-node reports a BIRD error

Base information: Calico version 3.27.0

Problem: calico-node fails to join the network, with BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl:

Check the pod status with kubectl describe pod calico-node-5pmzs -n calico-system:

Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused

Resolution: delete the Calico installation and create it again. The delete may hang, so be patient. If it stays stuck, consider deleting the namespace resources directly.

 kubectl delete -f tigera-operator.yaml
 kubectl delete -f custom-resources.yaml

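Add two lines to the configuration file, under spec.calicoNetwork, to pin the network interface used for node address autodetection:

    nodeAddressAutodetectionV4:
      interface: ens.*

The complete file then looks like this: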

# This section includes base Calico installation configuration.
# For more information, see: https://docs.tigera.io/calico/latest/reference/installation/api#operator.tigera.io/v1.Installation
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 26
      cidr: 30.244.0.0/16
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
    nodeAddressAutodetectionV4:
      interface: ens.*

---

# This section configures the Calico API server.
# For more information, see: https://docs.tigera.io/calico/latest/reference/installation/api#operator.tigera.io/v1.APIServer
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}

After changing the configuration file, create the Calico resources again.

 kubectl create -f tigera-operator.yaml
 kubectl create -f custom-resources.yaml
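
To get a rough idea of which interfaces on a node the ens.* pattern would pick up, the names can be filtered locally. This is only an approximation: the function name match_interfaces is hypothetical, and it assumes whole-name matching, which may differ from how Calico itself evaluates the regex.

```shell
# Hypothetical helper: read interface names on stdin (one per line) and print
# those whose whole name matches the extended regex given in $1.
match_interfaces() {
  grep -E -x -- "$1"
}
```

On a node, ip -o link show | awk -F': ' '{print $2}' | match_interfaces 'ens.*' lists the candidates. If nothing is printed, the pattern needs adjusting for that host.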

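Master node problems

Cluster initialization fails on the master node

When cluster initialization fails, reset the cluster first. The container runtime socket has to be specified, otherwise the command reports an error (typically because kubeadm finds more than one CRI endpoint on the host).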

kubeadm reset --cri-socket unix:///var/run/cri-dockerd.sock

Note

The reset process does not clean your kubeconfig files and you must remove them manually.

Please, check the contents of the $HOME/.kube/config file.

To initialize the cluster again, this configuration file has to be deleted first:

rm -rf $HOME/.kube

Worker node problems

Worker node cannot join the cluster

Problem: the worker node is being added to the cluster again after an earlier join attempt failed.

[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[kubelet-check] Initial timeout of 40s passed.
error execution phase kubelet-start: error uploading crisocket: Unauthorized
To see the stack trace of this error execute with --v=5 or higher

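Resolution: Docker is the container runtime here, so the CRI socket must be specified in the commands. Reset the node, then join it again. kubeadm reset is destructive, so run it with care.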

# Reset the node
kubeadm reset --cri-socket unix:///var/run/cri-dockerd.sock
rm -rf $HOME/.kube

# Join the node again
kubeadm join 192.168.13.133:6443 --token qsq414.hrw44xwjoxt2l15l \
	--discovery-token-ca-cert-hash sha256:75ae4d4d07420b61a3d9143c847ff74caxxxxxxxxxxxxxxxx1656bef98 --cri-socket unix:///var/run/cri-dockerd.sock
### The following output means the node joined the cluster successfully
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
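
As the last line of the output suggests, the join can be verified from the control plane. The sketch below lists nodes that are not yet Ready. The function name not_ready_nodes is hypothetical, and it assumes the default kubectl get nodes column order (NAME, STATUS, ...).

```shell
# Hypothetical helper: print the names of nodes whose STATUS does not start
# with "Ready" (so "Ready,SchedulingDisabled" still counts as Ready).
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}
```

A freshly joined worker may appear in this list for a short while until its network plugin pod is up.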

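Note:

Make sure the kubelet service is enabled to start at boot but is not currently running. If it is already running when the node is initialized, kubeadm reports that the port is in use.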
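
That precondition can be checked before running kubeadm. The following is a minimal sketch assuming a systemd host. The function name kubelet_ready_for_join is invented for this example.

```shell
# Hypothetical helper: succeed only if kubelet is enabled at boot and is not
# currently active.
kubelet_ready_for_join() {
  [ "$(systemctl is-enabled kubelet 2>/dev/null)" = "enabled" ] &&
    [ "$(systemctl is-active kubelet 2>/dev/null)" != "active" ]
}
```

For example, kubelet_ready_for_join || echo "fix the kubelet service state first" can be run on the node right before kubeadm init or kubeadm join.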
