Installing k8s 1.9.0 in Practice: A Collection of Problems

Copyright notice: this is the blogger's original article; please credit the source when reposting. https://blog.csdn.net/zhd930818/article/details/79644903

Differences between k8s 1.5 and k8s 1.9

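An earlier attempt to install kubernetes 1.5.2 failed because of a docker package conflict. Reading the install procedure for newer versions showed that newer kubernetes releases no longer bundle docker; the user has to install the docker service first.
Docker version 17.12.0-ce, build c97c6d6 was already installed on the machine.
Installing kubernetes (kubernetes.x86_64  1.5.2-0.7.git269f928.el7) on top of it then failed:
Error: docker-ce conflicts with 2:docker-1.12.6-71.git3e8e77d.el7.centos.1.x86_64
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest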

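Suspecting a version problem, I searched online for how to install a newer version and found this:
“After kubernetes 1.6, however, installation becomes rather tedious: it needs certificates and all kinds of authentication, which is very unfriendly to people new to kubernetes. If you follow the official documentation to install a ‘cluster’ locally, I think you will certainly not get it running, unless you have got past the GFW and also know how to keep adjusting parameters.”
In other words, installation from k8s 1.6 on may differ considerably from earlier versions, and because Google is blocked, many docker images have to be downloaded in advance.
The following three articles install k8s 1.7.5; following them, the installation failed for lack of the docker images.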
https://www.cnblogs.com/liangDream/p/7358847.html
http://www.bubuko.com/infodetail-2375091.html
https://www.kubernetes.org.cn/3063.html


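Docker installation problems

Choosing a docker version

kubernetes 1.9.0 supports docker up to 17.03. The installed 17.12 is too new and has to be downgraded.

Kubernetes support list for Docker versions: http://blog.csdn.net/csdn_duomaomao/article/details/79171027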
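
The version rule above can be checked mechanically before installing. A minimal sketch, assuming docker's YY.MM version scheme; the function name docker_version_ok is hypothetical, and the 17.03 ceiling is simply the figure stated above:

```shell
#!/bin/sh
# Hypothetical helper: succeed when a docker version string is no newer
# than a maximum "YY.MM" release (17.03 by default, per the text above).
docker_version_ok() {
    version=$1                  # e.g. "17.12.0-ce"
    max=${2:-17.03}             # highest supported release
    major=${version%%.*}        # "17"
    rest=${version#*.}          # "12.0-ce"
    minor=${rest%%.*}           # "12"
    max_major=${max%%.*}
    max_minor=${max#*.}
    # Older major number: always acceptable.
    [ "$major" -lt "$max_major" ] && return 0
    # Same major number: the minor part decides.
    [ "$major" -eq "$max_major" ] && [ "$minor" -le "$max_minor" ]
}

# Possible use on a real host:
#   docker_version_ok "$(docker version --format '{{.Server.Version}}')" \
#       || echo "docker is too new for kubernetes 1.9.0"
```

With the versions from this article, 17.12.0-ce is rejected and 17.03.x is accepted.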

Removing docker

[root@tensorflow0 hdzhou]# yum remove docker \
docker-common \
docker-selinux \
docker-engine


======================================================================================================================================================================================
Package                                      Arch                              Version                                               Repository                                 Size
======================================================================================================================================================================================
Removing:
container-selinux                            noarch                            2:2.36-1.gitff95335.el7                               @extras                                    34 k
Removing for dependencies:
docker-ce                                    x86_64                            17.12.0.ce-1.el7.centos                               installed                                 123 M
nvidia-docker2                               noarch                            2.0.2-1.docker17.12.0.ce                              @nvidia-docker                            2.3 k

Transaction Summary
======================================================================================================================================================================================
Remove  1 Package (+2 Dependent packages)

docker fails to start

2月 26 16:42:00 tensorflow0 dockerd[8717]: time="2018-02-26T16:42:00.315096986+08:00" level=info msg="libcontainerd: new containerd process, pid: 8725"
2月 26 16:42:01 tensorflow0 dockerd[8717]: time="2018-02-26T16:42:01.319051277+08:00" level=error msg="[graphdriver] prior storage driver overlay2 failed: driver not supported"
2月 26 16:42:01 tensorflow0 dockerd[8717]: Error starting daemon: error initializing graphdriver: driver not supported
2月 26 16:42:01 tensorflow0 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
2月 26 16:42:01 tensorflow0 systemd[1]: Failed to start Docker Application Container Engine.

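Fix: the old data directory was created with the overlay2 storage driver, which the downgraded docker does not support. Move it aside so the daemon can initialize a fresh one (existing images and containers stay in the .old directory):
sudo mv /var/lib/docker /var/lib/docker.old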











k8s installation problems

Installing via rpm

rpm -ivh socat-1.7.3.2-2.el7.x86_64.rpm
rpm -ivh kubernetes-cni-0.6.0-0.x86_64.rpm  kubelet-1.9.0-0.x86_64.rpm  kubectl-1.9.0-0.x86_64.rpm
rpm -ivh kubeadm-1.9.0-0.x86_64.rpm

Removing via rpm

rpm -e package-name --nodeps
e.g.:
rpm -e socat-1.7.3.2-2.el7.x86_64 --nodeps
rpm -e kubernetes-cni-0.6.0-0.x86_64 --nodeps
rpm -e kubelet-1.9.0-0.x86_64 --nodeps
rpm -e kubectl-1.9.0-0.x86_64 --nodeps
rpm -e kubeadm-1.9.0-0.x86_64 --nodeps
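
Installation takes .rpm file names while removal takes installed package names, so the suffix has to be dropped when turning an install list into a removal list. A minimal sketch; the helper name pkg_name_from_file is hypothetical:

```shell
#!/bin/sh
# rpm -ivh takes a package *file*; rpm -e takes the installed package *name*.
# Hypothetical helper: derive the name from a file path.
pkg_name_from_file() {
    name=${1##*/}                   # drop any directory part
    printf '%s\n' "${name%.rpm}"    # drop the .rpm suffix if present
}

# Possible use, removing everything that was installed from the current directory:
#   for f in *.rpm; do rpm -e "$(pkg_name_from_file "$f")" --nodeps; done
```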

Viewing error messages

cat /var/log/messages
journalctl -xeu kubelet
  • After kubelet starts, a missing ca file is normal; the ca file is generated later, when kubeadm init runs.

  • kubelet restarting over and over after it starts is normal!

The kubelet is now restarting every few seconds, as it waits in a crashloop for kubeadm to tell it what to do. This crashloop is expected and normal, please proceed with the next step and the kubelet will start running normally.




  • Initializing the cluster
kubeadm init --kubernetes-version=v1.9.0 --pod-network-cidr=10.244.0.0/16

Be sure to record the following output; it is different every time it is generated.
e.g.:
kubeadm join --token 5ce44e.47b6dc4e4b66980f 192.168.1.138:6443 --discovery-token-ca-cert-hash sha256:9d7eac82d66744405c783de5403e1f2bb7191b4c1b350d721b7b8570c62ff83a
To get a token again:
kubeadm token list
or
kubeadm token create
A token expires after 24 hours; after that a new one has to be created.

To get the sha256 hash, run on the master node:
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
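
Once the token and the hash have been recovered separately, they have to be put back into the join command. A minimal sketch of that assembly; the function names are hypothetical, and the token pattern (6 characters, a dot, 16 characters) is the kubeadm bootstrap token format:

```shell
#!/bin/sh
# Hypothetical helper: succeed when $1 looks like a kubeadm bootstrap token.
valid_token() {
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}

# Hypothetical helper: print the join command for a worker node.
#   $1 token, $2 master address as host:port, $3 CA cert hash without the "sha256:" prefix
build_join_cmd() {
    token=$1
    endpoint=$2
    hash=$3
    valid_token "$token" || { echo "bad token format: $token" >&2; return 1; }
    printf 'kubeadm join --token %s %s --discovery-token-ca-cert-hash sha256:%s\n' \
        "$token" "$endpoint" "$hash"
}
```

The hash argument is what the openssl pipeline above prints, and the token is one listed by kubeadm token list.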


  • kubeadm init

[root@tensorflow0 etc]# kubeadm init --kubernetes-version=v1.9.0 --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.9.0
[init] Using Authorization modes: [Node RBAC]
[preflight] Running pre-flight checks.
    [WARNING FileExisting-crictl]: crictl not found in system path
[preflight] Some fatal errors occurred:
    [ERROR Swap]: running with swap on is not supported. Please disable swap
    [ERROR Port-2379]: Port 2379 is in use
    [ERROR DirAvailable--var-lib-etcd]: /var/lib/etcd is not empty
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
[root@tensorflow0 etc]#


[ERROR DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty

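Appending --ignore-preflight-errors 'Swap' or --ignore-preflight-errors all to the command gets past these checks (not a good idea).
Port 2379 is in use: this happened because kubeadm reset had not been run.
[ERROR Swap]: running with swap on is not supported. Please disable swap: see the section on turning off swap below.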
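
When several checks fail at once, it helps to pull out just the check names and work through them one by one. A minimal sketch that parses preflight output in the format shown above; the function name preflight_errors is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read kubeadm preflight output on stdin and print
# the name of every fatal check, one per line (e.g. "Swap", "Port-2379").
preflight_errors() {
    sed -n 's/^[[:space:]]*\[ERROR \([^]]*\)\]:.*/\1/p'
}

# Possible use:
#   kubeadm init --kubernetes-version=v1.9.0 --pod-network-cidr=10.244.0.0/16 2>&1 | preflight_errors
```

These names are the values that --ignore-preflight-errors expects, though fixing the cause is preferable to ignoring the check.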



  • Viewing errors:
kubectl get pod kube-proxy-d2p7p -o wide --namespace=kube-system
kubectl describe pod kube-proxy-d2p7p --namespace=kube-system
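  • Modify the kubelet configuration and start kubelet (all nodes)
Note: keep watching the log output in /var/log/messages; you will see kubelet failing to start over and over.
cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
Edit 10-kubeadm.conf and change the cgroup-driver setting: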
[root@centos7-base-ok]# cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--kubeconfig=/etc/kubernetes/kubelet.conf --require-kubeconfig=true"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=0"
Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CGROUP_ARGS $KUBELET_EXTRA_ARGS
Environment="KUBELET_SWAP_ARGS=--fail-swap-on=false"
Since 1.8, kubelet fails to start if swap is enabled on the machine, because --fail-swap-on defaults to true. Either set it to false in the kubelet configuration with the last line above (for it to take effect, $KUBELET_SWAP_ARGS must also be added to the ExecStart line), or turn swap off on the machine as described below.

Change “--cgroup-driver=systemd” to “--cgroup-driver=cgroupfs”.
Note that docker's cgroup driver and --cgroup-driver must be the same. Check with docker info |grep Cgroup; the value may be systemd or cgroupfs.

Reload systemd so the edited drop-in is picked up, then restart kubelet
[root@centos7-base-ok]# systemctl daemon-reload && systemctl restart kubelet
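
The requirement that both cgroup drivers match can be checked with two small parsers before restarting anything. A minimal sketch; the function names are hypothetical, and the input is assumed to be the docker info output and the 10-kubeadm.conf file shown above:

```shell
#!/bin/sh
# Hypothetical helper: read `docker info` output on stdin, print docker's cgroup driver.
docker_cgroup_driver() {
    sed -n 's/^[[:space:]]*Cgroup Driver:[[:space:]]*//p'
}

# Hypothetical helper: print the --cgroup-driver value found in a kubelet drop-in file ($1).
kubelet_cgroup_driver() {
    sed -n 's/.*--cgroup-driver=\([A-Za-z]*\).*/\1/p' "$1"
}

# Possible use:
#   d=$(docker info 2>/dev/null | docker_cgroup_driver)
#   k=$(kubelet_cgroup_driver /etc/systemd/system/kubelet.service.d/10-kubeadm.conf)
#   [ "$d" = "$k" ] || echo "mismatch: docker=$d kubelet=$k"
```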



  • [ERROR Swap]: running with swap on is not supported. Please disable swap

[preflight] Running pre-flight checks.
    [WARNING Service-Kubelet]: kubelet service is not enabled, please run 'systemctl enable kubelet.service'
    [WARNING FileExisting-crictl]: crictl not found in system path
[preflight] Some fatal errors occurred:
    [ERROR Swap]: running with swap on is not supported. Please disable swap
Turn off swap:
swapoff -a
To turn swap off permanently:
edit /etc/fstab and comment out the swap line with #.
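
The fstab edit can be previewed rather than made by hand. A minimal sketch that prints a copy of an fstab file with its swap entries commented out and leaves the original untouched; the function name comment_out_swap is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the fstab file $1 with every active swap entry
# (third field equal to "swap") prefixed by "#". The file itself is not modified.
comment_out_swap() {
    awk '$1 !~ /^#/ && $3 == "swap" { print "#" $0; next } { print }' "$1"
}

# Possible use: review the result before replacing the real file.
#   comment_out_swap /etc/fstab > /tmp/fstab.new
#   diff /etc/fstab /tmp/fstab.new
```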


  • Removing etcd
yum erase etcd
Move the etcd directory out of the way: mv /var/lib/etcd /var/lib/etcd.bak

  • The connection to the server localhost:8080 was refused - did you specify the right host or port?

export KUBECONFIG=/etc/kubernetes/admin.conf
The API server listens on port 6443, not 8080. Without KUBECONFIG set, kubectl falls back to localhost:8080.
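
The export only lasts for the current shell, and a later section of these notes traces a broken kubectl back to this setting going missing from ~/.bash_profile. A minimal sketch that adds the line once and does nothing on later runs; the function name ensure_kubeconfig_export is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: append the KUBECONFIG export to a profile file
# (default ~/.bash_profile) unless that exact line is already present.
ensure_kubeconfig_export() {
    profile=${1:-$HOME/.bash_profile}
    line='export KUBECONFIG=/etc/kubernetes/admin.conf'
    grep -qxF "$line" "$profile" 2>/dev/null || printf '%s\n' "$line" >> "$profile"
}
```

Calling it repeatedly is safe because the grep check keeps the file from collecting duplicate lines.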


  • runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

  • kube-dns fails to start
kube-system   po/kube-dns-6f4fd4bdf-p5x4k              0/3       Pending   0          14m
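One suggested fix is to edit /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
and remove $KUBELET_NETWORK_ARGS. Don't do this.
When dns is abnormal, run kubeadm reset and start over: initialize the master first, then configure the flannel network, and only once that is ok join the other machines.
Restart: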
systemctl daemon-reload && systemctl restart kubelet
kubeadm reset
kubeadm init --kubernetes-version=v1.9.0 --pod-network-cidr=10.244.0.0/16


  • kube-proxy fails to start
Same cause as "no IP addresses available".
The cluster was started many times and the virtual IPs ran out.

  • no IP addresses available
E1216 23:50:16.116098 28152 pod_workers.go:186] Error syncing pod 6f5b9673-e2b5-11e7-a0f5-001e67d35991 ("kube-dns-6f4fd4bdf-xrj4w_kube-system(6f5b9673-e2b5-11e7-a0f5-001e67d35991)"), skipping: failed to "CreatePodSandbox" for "kube-dns-6f4fd4bdf-xrj4w_kube-system(6f5b9673-e2b5-11e7-a0f5-001e67d35991)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kube-dns-6f4fd4bdf-xrj4w_kube-system(6f5b9673-e2b5-11e7-a0f5-001e67d35991)\" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod \"kube-dns-6f4fd4bdf-xrj4w_kube-system\" network: failed to allocate for range 0: no IP addresses available in range set: 10.244.0.1-10.244.0.254"
The cluster was started many times and the virtual IPs ran out.
kubeadm reset 
rm -rf /var/lib/cni/flannel/* 
rm -rf /var/lib/cni/networks/cbr0/* 
ip link delete cni0
ip link delete flannel.1
Reboot!!!! After many kubeadm reset runs, some leftover network state may remain; a reboot clears it.

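Both problems are network related: a broken virtual network keeps the services from starting properly. The cause is repeated kubeadm reset runs and repeated restarts of flannel (or another network add-on). reset may not clean up completely, so after several resets problems such as IP exhaustion appear. The fix is to reset first, then delete the directories and configuration, and reboot the machine (possibly unnecessary). Usually it is enough to do this on the machine that reports the error, though it can also be done on every machine. Then re-initialize the k8s cluster.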
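
The cleanup sequence described above can be collected into one function. A minimal sketch using the paths and interface names from these notes; the function names and the DRY_RUN switch are hypothetical, and the default is to print the commands rather than run them:

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN is 1 (the default),
# execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: reset kubeadm and remove leftover CNI state on this machine.
cleanup_cni() {
    run kubeadm reset
    for dir in /var/lib/cni/flannel /var/lib/cni/networks/cbr0; do
        # Delete the directory contents but keep the directory, like rm -rf DIR/*
        run find "$dir" -mindepth 1 -delete
    done
    # ip link delete takes one device per call.
    for dev in cni0 flannel.1; do
        run ip link delete "$dev"
    done
}
```

Running cleanup_cni prints the five commands for review. Running it as root with DRY_RUN=0 executes them, after which the cluster is re-initialized as described.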

  • pod ContainerCreating

Checking the pods shows that a pod does not come up:

default       po/httpd-68f9d7648d-5f9gt                0/1       ContainerCreating   0          1m        <none>          tensorflow0
describe on the pod says that sandbox creation failed.

  Warning  FailedCreatePodSandBox  20s (x12 over 54s)  kubelet, tensorflow0  Failed create pod sandbox.
  Normal   SandboxChanged          20s (x12 over 53s)  kubelet, tensorflow0  Pod sandbox changed, it will be killed and re-created.




Go to the machine where the pod does not come up and look at the kubelet status.
It shows:
Error while adding to cni network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.2.1/24
This is the same problem as:
Error while adding to cni network: failed to allocate for range 0: no IP addresses available in range set: 10.244.2.1-10.244.2.254
[root@tensorflow0 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since 四 2018-03-22 14:49:29 CST; 4min 12s ago
     Docs: http://kubernetes.io/docs/
Main PID: 3873 (kubelet)
   Memory: 45.0M
   CGroup: /system.slice/kubelet.service
           ├─ 3873 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --feature-gates=DevicePlugins=true --pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --network-plugin=cni -...
           ├─11665 /opt/cni/bin/flannel
           └─11670 /opt/cni/bin/bridge

3月 22 14:53:35 tensorflow0 kubelet[3873]: E0322 14:53:35.990200    3873 kuberuntime_manager.go:647] createPodSandbox for pod "httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to ...
3月 22 14:53:35 tensorflow0 kubelet[3873]: E0322 14:53:35.990287    3873 pod_workers.go:186] Error syncing pod 39f66066-2d9d-11e8-bf17-98eecb73f4db ("httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)"), skipping: failed to "Cre...bf17-98eecb73f4db)"
3月 22 14:53:37 tensorflow0 kubelet[3873]: W0322 14:53:37.041536    3873 pod_container_deletor.go:77] Container "73c43b8766686c64d31bdd0533604d1d349ebe08f95d7463d23ebdffe377113e" not found in pod's containers
3月 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.621047    3873 cni.go:259] Error adding network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.2.1/24
3月 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.621083    3873 cni.go:227] Error while adding to cni network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.2.1/24
3月 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.809286    3873 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "httpd-68f9d7648d-5f9gt_default" net...t from 10.244.2.1/24
3月 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.809337    3873 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to s...
3月 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.809360    3873 kuberuntime_manager.go:647] createPodSandbox for pod "httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to ...
3月 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.809424    3873 pod_workers.go:186] Error syncing pod 39f66066-2d9d-11e8-bf17-98eecb73f4db ("httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)"), skipping: failed to "Cre...bf17-98eecb73f4db)"
3月 22 14:53:40 tensorflow0 kubelet[3873]: W0322 14:53:40.063548    3873 pod_container_deletor.go:77] Container "f1b063e5245c7a5c8527d1426858781c6554bcb06d987c7f472cfd0c41290110" not found in pod's containers
Hint: Some lines were ellipsized, use -l to show in full.



Fix:
Remove cni-flannel, stop the cluster, and clean up the environment.

rm -rf /var/lib/cni/flannel/* && rm -rf /var/lib/cni/networks/cbr0/* && ip link delete cni0
rm -rf /var/lib/cni/networks/cni0/*
Cleaning up only the machine that reports the error is enough.
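
Whether a machine is in this state can be checked by comparing the address on cni0 with the subnet flannel assigned to the node. A minimal sketch; the function names are hypothetical, and it assumes flannel writes its lease to /run/flannel/subnet.env (its default location) with a FLANNEL_SUBNET= line:

```shell
#!/bin/sh
# Hypothetical helper: read `ip -o -4 addr show dev cni0` output on stdin
# and print the first IPv4 address with its prefix length.
bridge_addr() {
    awk '$3 == "inet" { print $4; exit }'
}

# Hypothetical helper: print the FLANNEL_SUBNET value from flannel's subnet file ($1).
flannel_subnet() {
    sed -n 's/^FLANNEL_SUBNET=//p' "$1"
}

# Possible use (the path is an assumption, see above):
#   have=$(ip -o -4 addr show dev cni0 | bridge_addr)
#   want=$(flannel_subnet /run/flannel/subnet.env)
#   [ "$have" = "$want" ] || echo "cni0 has $have but flannel expects $want"
```

A mismatch corresponds to the "already has an IP address different from" error above, and the cleanup applies to that machine.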







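  • Joining a node


The node joined without errors but did not show up on the master, because kubelet had failed to start on the node. cgroup-driver has to be changed there too.
Restart kubelet.
Run kubeadm join xxx again.
This reports an error: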
[preflight] Running pre-flight checks.
    [WARNING FileExisting-crictl]: crictl not found in system path
[preflight] Some fatal errors occurred:
    [ERROR FileAvailable--etc-kubernetes-pki-ca.crt]: /etc/kubernetes/pki/ca.crt already exists
    [ERROR FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists
Deleting the existing files is enough.
Run kubeadm reset before kubeadm join.

  • Removing a node
On the master run kubectl delete node {nodename}
e.g.:
kubectl delete node tensorflow0
On the node run kubeadm reset

The delete-node operation was executed on the master.








k8s + gpu

Note: default-runtime must be set.
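
The notes do not show the setting itself. A minimal sketch of what it typically looks like, assuming nvidia-docker2 is installed (it appears in the yum output earlier) and that its runtime is registered under the name nvidia at /usr/bin/nvidia-container-runtime; the function name is hypothetical, and it overwrites the target file:

```shell
#!/bin/sh
# Hypothetical helper: write a docker daemon configuration that makes the
# nvidia runtime the default. The runtime name and path are assumptions
# based on a standard nvidia-docker2 install. The target file is overwritten.
write_nvidia_daemon_json() {
    target=${1:?usage: write_nvidia_daemon_json FILE}
    cat > "$target" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}

# Possible use: write to a scratch file, compare it with /etc/docker/daemon.json,
# merge by hand, then restart docker.
#   write_nvidia_daemon_json /tmp/daemon.json
```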



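  • Containers are not started on the virtual network

The virtual network segment was set to 10.244.0.0/16.
Containers should start on the virtual network segment, one IP per container, but in environment 2 that is not the case. Without it, IPs cannot be specified accurately and the distributed tf job cannot run.

Environment 1 is normal:



Environment 2 is abnormal:

The fix is the same as for "kube-dns fails to start".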







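  • Enabling the following is recommended. I saw no problems with it off, but sometimes the value inexplicably becomes 0 and errors appear, so it is better to configure it.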
echo 'net.bridge.bridge-nf-call-iptables=1' >> /etc/sysctl.conf
sysctl -p
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
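
Since the value can silently return to 0, a quick check of the three settings is useful. A minimal sketch that reads sysctl output in the "key = value" format shown above and lists the keys that are not 1; the function name sysctl_not_enabled is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read "key = value" lines on stdin (as printed by
# sysctl) and print each of the three networking keys whose value is not 1.
sysctl_not_enabled() {
    awk -F' *= *' '
        $1 ~ /^net\.(ipv4\.ip_forward|bridge\.bridge-nf-call-ip6?tables)$/ && $2 != "1" { print $1 }
    '
}

# Possible use:
#   sysctl net.ipv4.ip_forward net.bridge.bridge-nf-call-iptables \
#          net.bridge.bridge-nf-call-ip6tables | sysctl_not_enabled
```

Empty output means all three settings are enabled.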

  • The cluster can no longer be reached
[root@tensorflow1 influxdb]# kubectl get all -o wide -n kube-system
error: {batch  cronjobs} matches multiple kinds [batch/v1beta1, Kind=CronJob batch/v2alpha1, Kind=CronJob]

The cause was that the k8s settings configured in ~/.bash_profile had been lost.


nvidia-device-plugin-daemonset fails to start

Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "process_linux.go:337: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=12545 /data1/docker/overlay/10be1d599f91da020b7bfced8058533bb6129b637871ea61e0547ecb8758b3a2/merged]\\n*** Error in `/usr/bin/nvidia-container-cli': double free or corruption (!prev): 0x000055c6961daa10 ***\\n======= Backtrace: =========\\n/lib64/libc.so.6(+0x7c619)[0x7f5aa0af0619]\\n/usr/lib64/nvidia/libcuda.so.1(+0x2edd7c)[0x7f5a9fb77d7c]\\n/usr/lib64/nvidia/libcuda.so.1(+0x2eddc3)[0x7f5a9fb77dc3]\\n/usr/lib64/nvidia/libcuda.so.1
It turned out the gpu was already in use. After cleaning that up first, starting it again worked without problems.
