Debugging a kubelet that hangs forever during node registration

The problem

After migrating k8s from virtual machines to the "physical" machine last time, I ran into another problem: when kubelet restarted, it reported that the imageFs garbage collector needed to evict pods, and then it made no further progress. Running kubectl get node hostubuntu -o yaml showed the node hostubuntu in the DiskPressure condition. The main reason is that hostubuntu is my everyday workstation: it runs a registry, all my software gets compiled on it, and the old VMs were hosted on it too, so disk space really is tight.

This post records the process of working through the problem.

Adjusting the eviction parameters

I set the eviction-hard and eviction-soft thresholds very low, but kubelet still reported the same error. Suspecting this might be historical state persisted in etcd from last time, I tried kubectl patch node hostubuntu -p '{"status":{"conditions":[{"type":"DiskPressure","status":"False"}]}}', but patch only modifies spec (node conditions live behind the status subresource), so that failed.
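For the record, since the conditions sit behind the nodes/status subresource, one could in principle patch them there directly. Below is a minimal client-go sketch, not something I actually ran: the kubeconfig path is an assumption, client-go of this era takes no context argument, and the kubelet would presumably overwrite the condition again on its next status sync.

package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// The admin kubeconfig path is an assumption; adjust for your cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/admin.conf")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	// PatchStatus sends a strategic-merge patch to the nodes/status subresource;
	// node conditions merge on their "type" key, so only DiskPressure is touched.
	patch := []byte(`{"status":{"conditions":[{"type":"DiskPressure","status":"False"}]}}`)
	if _, err := client.CoreV1().Nodes().PatchStatus("hostubuntu", patch); err != nil {
		panic(err)
	}
	fmt.Println("DiskPressure condition patched")
}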

In the end, of course, it was tuning these two flag sets that solved the eviction problem; note the nodefs.inodesFree<1Ki part:

--eviction-hard=memory.available<100Mi,nodefs.available<10Mi,nodefs.inodesFree<1Ki,imagefs.available<10Mi 
--eviction-soft=memory.available<100Mi,nodefs.available<10Mi,nodefs.inodesFree<1Ki,imagefs.available<10Mi 
--eviction-soft-grace-period=memory.available=5m,nodefs.available=5m,nodefs.inodesFree=5m,imagefs.available=5m

Deleting the node and re-registering

With the patch a dead end, I decided to re-register: delete the node and the local config files by hand, then bootstrap afresh:

kubectl delete node hostubuntu
rm /etc/kubernetes/kubelet.conf
rm -rf /var/lib/kubelet/pki/*

I restarted kubelet, but then the real problem appeared: kubelet ran for a moment and then went quiet. The thing is, etcd, kube-apiserver and friends are static pods whose yaml files live under /etc/kubernetes/manifests and are launched by kubelet. I tried starting etcd and the apiserver manually and then restarting kubelet, but the result was the same.

You can see a CSR has been generated:

echyong@hostubuntu:/cloud/k8s$ kubectl get csr
NAME                                                   AGE    REQUESTOR                 CONDITION
csr-m9lsn                                              357d   system:node:vm1           Approved,Issued
csr-stgxg                                              232d   system:node:vm2           Approved,Issued
csr-txkxw                                              273d   system:node:vm1           Approved,Issued
node-csr-AcOR9jhHcIySVuFx-yzGIyTfgN2EXHXYhhHz40GDO28   8s     system:bootstrap:07401b   Pending

echyong@hostubuntu:/cloud/k8s$ kubectl certificate approve node-csr-AcOR9jhHcIySVuFx-yzGIyTfgN2EXHXYhhHz40GDO28
certificatesigningrequest.certificates.k8s.io/node-csr-AcOR9jhHcIySVuFx-yzGIyTfgN2EXHXYhhHz40GDO28 approved
echyong@hostubuntu:/cloud/k8s$ kubectl get csr
NAME                                                   AGE    REQUESTOR                 CONDITION
csr-m9lsn                                              357d   system:node:vm1           Approved,Issued
csr-stgxg                                              232d   system:node:vm2           Approved,Issued
csr-txkxw                                              273d   system:node:vm1           Approved,Issued
node-csr-AcOR9jhHcIySVuFx-yzGIyTfgN2EXHXYhhHz40GDO28   34s    system:bootstrap:07401b   Approved

But even after the approve, kubelet still made no progress, and there was not a single error in its log.
The last log lines:

I0530 10:38:17.473060    2410 plugins.go:103] No cloud provider specified.
I0530 10:38:17.473086    2410 server.go:523] No cloud provider specified: "" from the config file: ""
I0530 10:38:17.473120    2410 bootstrap.go:65] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
I0530 10:38:17.478351    2410 loader.go:359] Config loaded from file /etc/kubernetes/bootstrap-kubelet.conf
I0530 10:38:17.479151    2410 bootstrap.go:96] No valid private key and/or certificate found, reusing existing private key or creating a new one

And the very last messages:

I0530 10:38:17.512813    2410 round_trippers.go:419] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubelet/v1.13.1 (linux/amd64) kubernetes/eec55b9" -H "Authorization: Bearer 07401b.f39331af8e370fc2" 'https://172.16.137.128:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?fieldSelector=metadata.name%3Dnode-csr-AcOR9jhHcIySVuFx-yzGIyTfgN2EXHXYhhHz40GDO28&limit=500'
I0530 10:38:17.516731    2410 round_trippers.go:438] GET https://172.16.137.128:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?fieldSelector=metadata.name%3Dnode-csr-AcOR9jhHcIySVuFx-yzGIyTfgN2EXHXYhhHz40GDO28&limit=500 200 OK in 3 milliseconds
I0530 10:38:17.516771    2410 round_trippers.go:444] Response Headers:
I0530 10:38:17.516802    2410 round_trippers.go:447]     Content-Type: application/json
I0530 10:38:17.516820    2410 round_trippers.go:447]     Content-Length: 1301
I0530 10:38:17.516837    2410 round_trippers.go:447]     Date: Thu, 30 May 2019 02:38:17 GMT
I0530 10:38:17.516887    2410 request.go:942] Response Body: {"kind":"CertificateSigningRequestList","apiVersion":"certificates.k8s.io/v1beta1","metadata":{"selfLink":"/apis/certificates.k8s.io/v1beta1/certificatesigningrequests","resourceVersion":"33431658"},"items":[{"metadata":{"name":"node-csr-AcOR9jhHcIySVuFx-yzGIyTfgN2EXHXYhhHz40GDO28","selfLink":"/apis/certificates.k8s.io/v1beta1/certificatesigningrequests/node-csr-AcOR9jhHcIySVuFx-yzGIyTfgN2EXHXYhhHz40GDO28","uid":"fb0263c7-8283-11e9-97a7-000c295160ff","resourceVersion":"33431658","creationTimestamp":"2019-05-30T02:38:17Z"},"spec":{"request":"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlIME1JR2FBZ0VBTURneEZUQVRCZ05WQkFvVERITjVjM1JsYlRwdWIyUmxjekVmTUIwR0ExVUVBeE1XYzNsegpkR1Z0T201dlpHVTZhRzl6ZEhWaWRXNTBkVEJaTUJNR0J5cUdTTTQ5QWdFR0NDcUdTTTQ5QXdFSEEwSUFCSFQ0ClNJREFFdGZvRjBuRmVtREpCS245VUhRQ2hjK3ptekhManJvbEJJaDMyb1plNmlQUzVrejNuMDZ3UkppeE9WLzcKK3NzN2ZMOGROZXNlMVNsa0lqcWdBREFLQmdncWhrak9QUVFEQWdOSkFEQkdBaUVBK0NUNTBiSEgyZDRTZ1crLwpVL2FLdVJPMG9jcFVIaEQzS1o3a2VMaWNFMllDSVFDdlFvKzdMcUM2VUJkTlp2eUpwRzdNbzQ5dHR3dUt0aThnCm04dHpzZy9LdFE9PQotLS0tLUVORCBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0K","usages":["digital signature","key encipherment","client auth"],"username":"system:bootstrap:07401b","groups":["system:bootstrappers","system:bootstrappers:manualnode","system:authenticated"]},"status":{}}]}
I0530 10:38:17.518206    2410 round_trippers.go:419] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubelet/v1.13.1 (linux/amd64) kubernetes/eec55b9" -H "Authorization: Bearer 07401b.f39331af8e370fc2" 'https://172.16.137.128:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?fieldSelector=metadata.name%3Dnode-csr-AcOR9jhHcIySVuFx-yzGIyTfgN2EXHXYhhHz40GDO28&resourceVersion=33431658&watch=true'
I0530 10:38:17.520058    2410 round_trippers.go:438] GET https://172.16.137.128:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?fieldSelector=metadata.name%3Dnode-csr-AcOR9jhHcIySVuFx-yzGIyTfgN2EXHXYhhHz40GDO28&resourceVersion=33431658&watch=true 200 OK in 1 milliseconds
I0530 10:38:17.520122    2410 round_trippers.go:444] Response Headers:
I0530 10:38:17.520142    2410 round_trippers.go:447]     Content-Type: application/json
I0530 10:38:17.520175    2410 round_trippers.go:447]     Date: Thu, 30 May 2019 02:38:17 GMT

This roughly pinpoints where the code is stuck:
k8s.io/client-go/util/certificate/csr/csr.go: WaitForCertificate

func WaitForCertificate(client certificatesclient.CertificateSigningRequestInterface, req *certificates.CertificateSigningRequest, timeout time.Duration) (certData []byte, err error) {
	fieldSelector := fields.OneTermEqualSelector("metadata.name", req.Name).String()

	event, err := watchtools.ListWatchUntil(
		timeout,
		&cache.ListWatch{
			ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
				options.FieldSelector = fieldSelector
				return client.List(options)
			},
			WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
				options.FieldSelector = fieldSelector
				return client.Watch(options)
			},
		},
		func(event watch.Event) (bool, error) {
			switch event.Type {
			case watch.Modified, watch.Added:
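				// Added/Modified events fall through to the checks below.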
			case watch.Deleted:
				return false, fmt.Errorf("csr %q was deleted", req.Name)
			default:
				return false, nil
			}
			csr := event.Object.(*certificates.CertificateSigningRequest)
			if csr.UID != req.UID {
				return false, fmt.Errorf("csr %q changed UIDs", csr.Name)
			}
			for _, c := range csr.Status.Conditions {
				if c.Type == certificates.CertificateDenied {
					return false, fmt.Errorf("certificate signing request is not approved, reason: %v, message: %v", c.Reason, c.Message)
				}
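				// Approved alone is not enough: status.certificate must also be
				// populated, and that is done by kube-controller-manager's signer.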
				if c.Type == certificates.CertificateApproved && csr.Status.Certificate != nil {
					return true, nil
				}
			}
			return false, nil
		},
	)
	if err == wait.ErrWaitTimeout {
		return nil, wait.ErrWaitTimeout
	}
	if err != nil {
		return nil, formatError("cannot watch on the certificate signing request: %v", err)
	}

	return event.Object.(*certificates.CertificateSigningRequest).Status.Certificate, nil
}

Normally everything should proceed once the CSR is approved; the list-watch ought to pick up the change, so why no progress? The problem is probably the check if c.Type == certificates.CertificateApproved && csr.Status.Certificate != nil: the condition is already CertificateApproved, so could it be that csr.Status.Certificate is empty, i.e. no certificate was ever issued?

echyong@hostubuntu:/cloud/k8s$ kubectl get csr node-csr-AcOR9jhHcIySVuFx-yzGIyTfgN2EXHXYhhHz40GDO28 -o yaml
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  creationTimestamp: "2019-05-30T02:38:17Z"
  name: node-csr-AcOR9jhHcIySVuFx-yzGIyTfgN2EXHXYhhHz40GDO28
  resourceVersion: "33431662"
  selfLink: /apis/certificates.k8s.io/v1beta1/certificatesigningrequests/node-csr-AcOR9jhHcIySVuFx-yzGIyTfgN2EXHXYhhHz40GDO28
  uid: fb0263c7-8283-11e9-97a7-000c295160ff
spec:
  groups:
  - system:bootstrappers
  - system:bootstrappers:manualnode
  - system:authenticated
  request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlIME1JR2FBZ0VBTURneEZUQVRCZ05WQkFvVERITjVjM1JsYlRwdWIyUmxjekVmTUIwR0ExVUVBeE1XYzNsegpkR1Z0T201dlpHVTZhRzl6ZEhWaWRXNTBkVEJaTUJNR0J5cUdTTTQ5QWdFR0NDcUdTTTQ5QXdFSEEwSUFCSFQ0ClNJREFFdGZvRjBuRmVtREpCS245VUhRQ2hjK3ptekhManJvbEJJaDMyb1plNmlQUzVrejNuMDZ3UkppeE9WLzcKK3NzN2ZMOGROZXNlMVNsa0lqcWdBREFLQmdncWhrak9QUVFEQWdOSkFEQkdBaUVBK0NUNTBiSEgyZDRTZ1crLwpVL2FLdVJPMG9jcFVIaEQzS1o3a2VMaWNFMllDSVFDdlFvKzdMcUM2VUJkTlp2eUpwRzdNbzQ5dHR3dUt0aThnCm04dHpzZy9LdFE9PQotLS0tLUVORCBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0K
  usages:
  - digital signature
  - key encipherment
  - client auth
  username: system:bootstrap:07401b
status:
  conditions:
  - lastUpdateTime: "2019-05-30T02:38:46Z"
    message: This CSR was approved by kubectl certificate approve.
    reason: KubectlApprove
    type: Approved

Sure enough, it is empty. Facepalm: kube-controller-manager was never started, which is hardly surprising, since kubelet had not finished starting and therefore could not launch /etc/kubernetes/manifests/kube-controller-manager.yaml. It is the controller-manager's CSR-signing controller that actually signs approved requests and fills in status.certificate.
After I started kube-controller-manager by hand, kubelet came alive shortly afterwards.

So deploying k8s's own components as containers feels somewhat fragile; you can run into chicken-and-egg problems. In particular, when registering the master node itself, you have to start etcd, kube-apiserver and kube-controller-manager manually first, and only after registration is done switch back to letting kubelet launch the containers from /etc/kubernetes/manifests.
kubeadm takes all of this into account.

StatefulSet MongoDB ReplicaSet fails to start

Next I ran into a mongo replicaset StatefulSet that failed to start. Three replicas were supposed to be deployed, but the first instance got stuck at Init:2/3 and went no further:

echyong@hostubuntu:/etc/dnsmasq.d$ kubectl get all -n ai-cloud
NAME                                     READY   STATUS             RESTARTS   AGE
pod/aicloud-api-server-bbf8bd67d-5ktwn   1/2     CrashLoopBackOff   25         152m
pod/mongors-mongodb-replicaset-0         0/1     Init:2/3          4          64m

mongors starts up with three initContainers, which respectively:

  • copy mongod.conf from the configmap
  • install the /work-dir/peer-finder binary
  • run the command "/work-dir/peer-finder -on-start=/init/on-start.sh -service=mongors-mongodb-replicaset"

The core of peer-finder is its lookup function, which resolves the DNS SRV records of mongors-mongodb-replicaset:

func lookup(svcName string) (sets.String, error) {
	endpoints := sets.NewString()
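	// Resolve the headless service's SRV records; each record is one ready peer pod.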
	_, srvRecords, err := net.LookupSRV("", "", svcName)
	if err != nil {
		return endpoints, err
	}
	for _, srvRecord := range srvRecords {
		// The SRV records ends in a "." for the root domain
		ep := fmt.Sprintf("%v", srvRecord.Target[:len(srvRecord.Target)-1])
		endpoints.Insert(ep)
	}
	return endpoints, nil
}

mongo rs decides which member it is from the number of existing records: the first one to join creates the ReplicaSet and makes itself the master; latecomers join the existing RS. Why would this logic fail?
The coredns log shows errors like these:

 [ERROR] plugin/errors: 2 mongors-mongodb-replicaset-1.mongors-mongodb-replicaset.ai-cloud.svc.cluster.local.localdomain. A: unreachable backend: read udp 192.168.1.25:41821->172.16.137.2:53: i/o timeout
 [ERROR] plugin/errors: 2 mongors-mongodb-replicaset-1.mongors-mongodb-replicaset.ai-cloud.svc.cluster.local.localdomain. A: unreachable backend: read udp 192.168.1.25:56364->172.16.137.128:53: i/o timeout
 [ERROR] plugin/errors: 2 mongors-mongodb-replicaset-2.mongors-mongodb-replicaset.ai-cloud.svc.cluster.local.localdomain. A: unreachable backend: read udp 192.168.1.25:49054->172.16.137.2:53: i/o timeout

So the net.LookupSRV call in the function above kept returning an error, which made mongo's startup fail.
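To double-check, the same SRV query can be reproduced from any pod in the cluster. A small sketch; the fully-qualified service name is inferred from the namespace above:

package main

import (
	"fmt"
	"net"
)

func main() {
	// The same query peer-finder issues; it only resolves through the cluster DNS.
	_, srvs, err := net.LookupSRV("", "", "mongors-mongodb-replicaset.ai-cloud.svc.cluster.local")
	if err != nil {
		// This is the failure peer-finder kept hitting while DNS was broken.
		fmt.Println("SRV lookup failed:", err)
		return
	}
	for _, srv := range srvs {
		// Targets come back with a trailing dot for the root domain.
		fmt.Printf("%s:%d\n", srv.Target, srv.Port)
	}
}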

But why is it talking to 172.16.137.2 at all?

The pods' dnsPolicy is set to ClusterFirst, and coredns is configured like this:

echyong@hostubuntu:/cloud/k8s$ kubectl get cm coredns -n kube-system -o yaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            upstream
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        proxy . 172.16.137.128:53 /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap

If coredns cannot resolve a name itself, the query falls through and is proxied to 172.16.137.128:53 and the nameservers from /etc/resolv.conf; this is what lets ordinary pods resolve external domain names.

echyong@hostubuntu:/cloud/k8s$ cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
# 127.0.0.53 is the systemd-resolved stub resolver.
# run "systemd-resolve --status" to see details about the actual nameservers.

nameserver 127.0.0.1
search localdomain
options edns0


And 127.0.0.1 on the host is a local dnsmasq, which forwards to the upstream servers listed in /run/dnsmasq/resolv.conf (note the -r flag):

dnsmasq   2686     1  0 May29 ?        00:00:01 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -r /run/dnsmasq/resolv.conf -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new --local-service --trust-anchor=.,19036,8,2,49aac11d7b6f6446702e54a1607371607a1a41855200fd2ce1cdde32f24e8fb5 --trust-anchor=.,20326,8,2,e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d

As an aside, hostubuntu (172.16.137.128) is itself actually a VM inside VMware Workstation. Its network layout:

(figure: hostubuntu's network configuration)

The diagram shows the gateway is 172.16.137.2, and vmnet8 is a NAT network that reaches the outside through the Windows host's NIC. But my Windows machine was not connected to the internet.

Once the cause was known the fix was easy: turn on and connect to my phone's hotspot, and a short while later mongors was up:

echyong@hostubuntu:/etc/dnsmasq.d$ kubectl get all -n ai-cloud 
NAME                                     READY   STATUS             RESTARTS   AGE
pod/aicloud-api-server-bbf8bd67d-5ktwn   1/2     CrashLoopBackOff   31         179m
pod/mongors-mongodb-replicaset-0         1/1     Running            0          25m
pod/mongors-mongodb-replicaset-1         1/1     Running            0          25m
pod/mongors-mongodb-replicaset-2         1/1     Running            0          25m

Cross-namespace networking blocked

aicloud-api-server-bbf8bd67d-5ktwn kept going into CrashLoopBackOff:

I0530 06:07:35.711182       1 reflector.go:240] Listing and watching *v1.Pod from ******/cloud/vendor/k8s.io/client-go/informers/factory.go:73
E0530 06:08:05.710251       1 reflector.go:205] ******/cloud/vendor/k8s.io/client-go/informers/factory.go:73: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0530 06:08:05.710836       1 reflector.go:205] ******/cloud/vendor/k8s.io/client-go/informers/factory.go:73: Failed to list *v1.Job: Get https://10.96.0.1:443/apis/batch/v1/jobs?resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0530 06:08:05.711155       1 reflector.go:205] ******/cloud/vendor/github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/factory.go:59: Failed to list *v1alpha1.TFJob: Get https://10.96.0.1:443/apis/kubeflow.org/v1alpha1/tfjobs?resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0530 06:08:05.711187       1 reflector.go:205] ******/cloud/vendor/k8s.io/client-go/informers/factory.go:73: Failed to list *v1beta1.Deployment: Get https://10.96.0.1:443/apis/extensions/v1beta1/deployments?resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0530 06:08:05.711706       1 reflector.go:205] ******/cloud/vendor/k8s.io/client-go/informers/factory.go:73: Failed to list *v1.Pod: Get https://10.96.0.1:443/api/v1/pods?resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout

Connections to 10.96.0.1:443 fail, yet the Service itself is fine:

echyong@hostubuntu:/cloud/k8s$ kubectl get svc kubernetes
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   533d
echyong@hostubuntu:/cloud/k8s$ kubectl get ep kubernetes
NAME         ENDPOINTS             AGE
kubernetes   172.16.137.128:6443   533d
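A tiny probe run inside the failing pod reproduces the symptom; a sketch, with the ClusterIP taken from the output above:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Attempt the same TCP connection the informers keep failing on.
	conn, err := net.DialTimeout("tcp", "10.96.0.1:443", 5*time.Second)
	if err != nil {
		// With namespace isolation in effect this times out, matching the log.
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("dial ok")
}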

The cause is the ovs multitenant network plugin I use: by default, different namespaces are isolated from each other. Lifting the restriction on ai-cloud fixed it.

/ # etcdctl --endpoints=http://0.0.0.0:6666 set /k8s.ovs.com/ovs/network/k8ssdn/netnamespaces/ai-cloud '{"NetName":"ai-cloud","NetID":0,"Action":"","Namespace":""}'
{"NetName":"ai-cloud","NetID":0,"Action":"","Namespace":""}