traefik ingress troubleshooting

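Understanding Ingress

Normally, services and pods are reachable only by IP address from inside the cluster network. All traffic that arrives at the edge router is either dropped or forwarded elsewhere. Conceptually, it looks like this: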

    internet
        |
  ------------
  [ Services ]

An Ingress is a collection of rules that allow inbound connections to reach cluster services:

    internet
        |
   [ Ingress ]
   --|-----|--
   [ Services ]

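Put simply, an Ingress is the entry point through which clients outside the cluster reach services inside it. For convenience, think of it as a gateway for web applications, such as an nginx instance, together with rule definitions, that is, URL routing information. The Ingress controller keeps that routing information up to date.

Understanding traefik

An Ingress controller is essentially a watcher. By continuously talking to the kubernetes API, it senses changes to backend services and pods in real time, such as pods or services being added and removed. When it receives such a change, the Ingress controller combines it with the Ingress rules to generate configuration, then updates the reverse-proxy load balancer and reloads its configuration, which is how service discovery is achieved. traefik plays this role in our cluster.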
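
The watch behaviour described above can be observed by hand. The sketch below is not how traefik is implemented; it only wraps `kubectl get --watch`, which streams changes from the same API, in a helper whose name (`watch_ingresses`) is made up for illustration. It assumes kubectl is already configured for the cluster.

```shell
#!/bin/sh
# Hypothetical helper: stream Ingress changes, roughly what a controller reacts to.
# Assumption: kubectl is installed and points at the cluster under investigation.
watch_ingresses() {
    # With no argument, watch every namespace; otherwise only the given one.
    if [ -z "$1" ]; then
        kubectl get ingress --all-namespaces --watch
    else
        kubectl get ingress --namespace "$1" --watch
    fi
}
```

Running `watch_ingresses` in one terminal while creating or deleting an Ingress in another should print one line per change until interrupted with Ctrl-C.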

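What happened

A customer reported that their service was confirmed healthy and reachable from inside the cluster, but could not be reached through the ingress. So I investigated on the cluster.

  • The ingress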

    $ kubectl get ing --all-namespaces | sed -n '1p;/rule-service.sze-sf-paihe-platform-stg-ac0d1270.sze-sf-caas.paic.com.cn/Ip'
      NAMESPACE                            NAME                                                    HOSTS                                                                         ADDRESS   PORTS     AGE
      sze-sf-paihe-platform-stg-ac0d1270   rule-service-19969-pafa-cloud-microservice-ingress      rule-service.sze-sf-paihe-platform-stg-ac0d1270.sze-sf-caas.paic.com.cn                 80        8d
    $ kubectl describe ing rule-service-19969-pafa-cloud-microservice-ingress -n sze-sf-paihe-platform-stg-ac0d1270
      Name:             rule-service-19969-pafa-cloud-microservice-ingress
      Namespace:        sze-sf-paihe-platform-stg-ac0d1270
      Address:
      Default backend:  default-http-backend:80 (<none>)
      Rules:
        Host                                                                     Path  Backends
        ----                                                                     ----  --------
        rule-service.sze-sf-paihe-platform-stg-ac0d1270.sze-sf-caas.paic.com.cn
                                                                                 /   rule-service-19969-pafa-cloud-microservice-svc:8080 (<none>)
      Annotations:
      Events:  <none>

    This shows that the backend of the ingress is the service rule-service-19969-pafa-cloud-microservice-svc:8080.

  • The service

    $  kubectl describe svc rule-service-19969-pafa-cloud-microservice-svc -n sze-sf-paihe-platform-stg-ac0d1270
       Name:              rule-service-19969-pafa-cloud-microservice-svc
       Namespace:         sze-sf-paihe-platform-stg-ac0d1270
       Labels:            app=pafa-cloud-microservice
                          chart=pafa-cloud-microservice-1.2.0
                          component=microservice
                          group=pafa-cloud
                          heritage=Tiller
                          release=rule-service-19969
       Annotations:       <none>
       Selector:          app=pafa-cloud-microservice,component=microservice,release=rule-service-19969
       Type:              ClusterIP
       IP:                172.254.8.110
       Port:              http  8080/TCP
       TargetPort:        8080/TCP
       Endpoints:         172.1.23.10:8080,172.1.23.9:8080
       Session Affinity:  None
       Events:            <none>

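    This gives us the service IP, and testing confirmed that accessing the service IP from inside the cluster works without problems. So why does access through the ingress return 404? Our cluster has a wildcard domain configured, as shown in the figure below:

    [Figure: wildcard domain configuration]

    The ELB VIP behind the wildcard domain is bound to the cluster's system-nodes, and traefik runs on the system-nodes as the gateway for traffic from outside the cluster to the inside. So the next step is to check the traefik logs on the system-nodes.

  • traefik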

    $ ping rule-service.sze-sf-paihe-platform-stg-ac0d1270.sze-sf-caas.paic.com.cn
      PING rule-service.sze-sf-paihe-platform-stg-ac0d1270.sze-sf-caas.paic.com.cn (30.99.3.130) 56(84) bytes of data.
      64 bytes from 30.99.3.130 (30.99.3.130): icmp_seq=1 ttl=255 time=0.128 ms
      64 bytes from 30.99.3.130 (30.99.3.130): icmp_seq=2 ttl=255 time=0.118 ms
      64 bytes from 30.99.3.130 (30.99.3.130): icmp_seq=3 ttl=255 time=0.109 ms
    $ kubectl get ds -n kube-system -o wide | sed -n '1p;/ingress/Ip'
      NAME                      DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                                           AGE         CONTAINERS                IMAGES                                       SELECTOR  
      traefik-ingress-lb        2         2         2         2            2             caas_cluster=kube-system                                70d         traefik-ingress-lb        hub.yun.paic.com.cn/official/traefik:1.5.3   k8s-app=traefik-ingress-lb,name=traefik-ingress-lb  
      traefik-ingress-lb-user   2         2         2         0            2           caas_cluster=sze-sf-cif2-sync-prd-95e8e362,lb=traefik   39d         traefik-ingress-lb-user   hub.yun.paic.com.cn/official/traefik:1.5.3   k8s-app=traefik-ingress-lb-user,name=traefik-ingress-lb-user

    The kube-system namespace contains two traefik-related DaemonSets. The nodeSelector caas_cluster=kube-system makes the pods of traefik-ingress-lb run on the system-nodes.

    $ kubectl -n kube-system get po -o wide -l k8s-app=traefik-ingress-lb,name=traefik-ingress-lb
    NAME                       READY     STATUS    RESTARTS   AGE       IP           NODE
    traefik-ingress-lb-p4cjp   1/1       Running   0          6d        30.99.3.12   30.99.3.12
    traefik-ingress-lb-xjj2n   1/1       Running   0          6d        30.99.3.15   30.99.3.15
    $ kubectl describe ds traefik-ingress-lb -n kube-system
    Name:           traefik-ingress-lb
    Selector:       k8s-app=traefik-ingress-lb,name=traefik-ingress-lb
    Node-Selector:  caas_cluster=kube-system
    Labels:         k8s-app=traefik-ingress-lb
    Annotations:    kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"extensions/v1beta1","kind":"DaemonSet","metadata":{"annotations":{},"creationTimestamp":"2018-12-19T08:23:53Z","generation":1,"labels":{...
    Desired Number of Nodes Scheduled: 2
    Current Number of Nodes Scheduled: 2
    Number of Nodes Scheduled with Up-to-date Pods: 2
    Number of Nodes Scheduled with Available Pods: 2
    Number of Nodes Misscheduled: 0
    Pods Status:  2 Running / 0 Waiting / 0 Succeeded / 0 Failed
    Pod Template:
    Labels:           k8s-app=traefik-ingress-lb
                      name=traefik-ingress-lb
    Service Account:  ingress
    Containers:
     traefik-ingress-lb:
      Image:  hub.yun.paic.com.cn/official/traefik:1.5.3
      Ports:  80/TCP, 8580/TCP
      Args:
        --web
        --web.address=:8580
        --kubernetes
        --accesslog.filepath=/wls/logs/traefik/access.log
        --traefiklog.filepath=/wls/logs/traefik/traefik.log
      Limits:
        cpu:     1
        memory:  200Mi
      Requests:
        cpu:        100m
        memory:     20Mi
      Environment:  <none>
      Mounts:
        /wls/logs from logs (rw)
      Volumes:
       logs:
        Type:          HostPath (bare host directory volume)
        Path:          /wls/logs
        HostPathType:
    Events:            <none>

    30.99.3.12 and 30.99.3.15 are the system-nodes in our system. Looking further at the traefik arguments, its logs are written to the /wls/logs/ directory on the host, so I logged in to a node to check them.

    $ ifconfig eth0| grep -i "inet .*[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+"| sed -e 's@\s\+@ @g' -e 's@addr:@@g'| cut -d' ' -f3
    30.99.3.15
    $ ls -lh /wls/logs/traefik/access.log
    -rw-r--r-- 1 root root 158M Apr  3 12:41 /wls/logs/traefik/access.log
    $ grep -i 'rule-service.sze-sf-paihe-platform-stg-ac0d1270.sze-sf-caas.paic.com.cn' /wls/logs/traefik/access.log
    30.99.237.17 - - [26/Mar/2019:03:00:36 +0000] "GET /favicon.ico HTTP/1.1" - - "http://rule-service.sze-sf-paihe-platform-stg-ac0d1270.sze-sf-caas.paic.com.cn/" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36" 8855 - - 0ms
    30.99.237.27 - - [27/Mar/2019:04:23:19 +0000] "GET /favicon.ico HTTP/1.1" - - "http://rule-service.sze-sf-paihe-platform-stg-ac0d1270.sze-sf-caas.paic.com.cn/" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36" 9902 - - 0ms
    30.99.237.18 - - [27/Mar/2019:06:11:54 +0000] "GET /favicon.ico HTTP/1.1" - - "http://rule-service.sze-sf-paihe-platform-stg-ac0d1270.sze-sf-caas.paic.com.cn/" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36" 9991 - - 0ms
    30.99.237.25 - - [27/Mar/2019:14:18:33 +0000] "GET /favicon.ico HTTP/1.1" - - "http://rule-service.sze-sf-paihe-platform-stg-ac0d1270.sze-sf-caas.paic.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36" 10392 - - 0ms
    30.99.237.18 - - [27/Mar/2019:14:29:29 +0000] "GET /favicon.ico HTTP/1.1" - - "http://rule-service.sze-sf-paihe-platform-stg-ac0d1270.sze-sf-caas.paic.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36" 10430 - - 0ms

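    On one of the system-nodes (30.99.3.15), the traefik access log shows that requests for rule-service.sze-sf-paihe-platform-stg-ac0d1270.sze-sf-caas.paic.com.cn were indeed received, but no status code was returned and no backend was resolved. When traefik serves as the external gateway of a k8s cluster, the traefik pod needs permission to access the k8s API, and three objects are involved in granting it: a ServiceAccount, a ClusterRoleBinding and a ClusterRole.

    [Figure: pod_access-to_k8s-api]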
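
The symptom found in the access log (requests received, but with `-` where the status code should be) can be filtered for mechanically. The following is a minimal sketch, assuming the log layout seen in the excerpt above, where the status and size fields come right after the quoted request line. The helper name `requests_without_status` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print access-log lines for a host that carry no status code.
# Assumption: the status field is the first token after the quoted request line,
# as in the excerpt above.
# Arguments: host name, path to the access log.
requests_without_status() {
    host="$1"
    logfile="$2"
    # -F treats the host as a fixed string, so its dots are not regex wildcards.
    grep -F -- "$host" "$logfile" |
        # Split on double quotes: field 3 holds " <status> <size> ".
        awk -F'"' '{ split($3, f, " "); if (f[1] == "-") print }'
}
```

For example, `requests_without_status rule-service.sze-sf-paihe-platform-stg-ac0d1270.sze-sf-caas.paic.com.cn /wls/logs/traefik/access.log` would list only the requests that traefik received but could not route.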

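How traefik is used in the current CaaS k8s cluster architecture

For readability, unrelated information is omitted.

  • The traefik DaemonSet uses a serviceAccount named ingress, which belongs to the kube-system namespace.
  • The serviceAccount is associated with a ClusterRoleBinding that is also named ingress. The binding's subjects attribute is a list that holds serviceAccounts, and the binding references a ClusterRole, here cluster-admin, which grants cluster-administrator-level API access.
  • Finally, the traefik DaemonSet sets serviceAccountName to ingress. (If no serviceAccount is configured, every pod still gets a default serviceAccount that k8s sets up automatically, but with far fewer permissions.)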
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: ingress
      namespace: kube-system
    ---
    kind: ClusterRoleBinding
    apiVersion: rbac.authorization.k8s.io/v1beta1
    metadata:
      name: ingress
    subjects:
      - kind: ServiceAccount
        name: ingress
        namespace: kube-system
    roleRef:
      kind: ClusterRole
      name: cluster-admin
      apiGroup: rbac.authorization.k8s.io
    ---
    apiVersion: extensions/v1beta1
    kind: DaemonSet
    metadata:
      name: traefik-ingress-lb
      namespace: kube-system
      labels:
        k8s-app: traefik-ingress-lb
    spec:
      template:
        metadata:
          labels:
            k8s-app: traefik-ingress-lb
            name: traefik-ingress-lb
        spec:
          # ... (other fields omitted)
          serviceAccountName: ingress
          nodeSelector:
              caas_cluster: kube-system

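Note that a ClusterRoleBinding has no namespace attribute, and that is the root of this incident.

    $ kubectl get clusterrolebinding -o wide | sed -n '1p;/ingress/Ip'
    NAME                  AGE    ROLE                        USERS   GROUPS   SERVICEACCOUNTS
    ingress               106d   ClusterRole/cluster-admin                    sze-sf-ehis-renewal-prd-4fc00b7a/ingress
    ingress-kube-system   7d     ClusterRole/cluster-admin                    kube-system/ingress

The ClusterRoleBinding named ingress-kube-system was newly created to grant the role to the kube-system/ingress ServiceAccount that traefik uses. The output also shows that two namespaces each have a serviceAccount named ingress. In other words, someone accidentally modified the ClusterRoleBinding that originally bound kube-system/ingress (laughs). Because the binding is cluster-scoped, there is only one object named ingress, so once its subject pointed at sze-sf-ehis-renewal-prd-4fc00b7a/ingress, kube-system/ingress was left without a role. Only with the new binding in place could traefik access the k8s API normally again.

Note: adding a second subject for kube-system/ingress to the ClusterRoleBinding named ingress would also have worked, but to keep the two easy to tell apart, a separate object was created instead.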
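
The post does not show which commands were used to create the new binding, so the following is only a sketch of one way to do it and to confirm the result. It uses `kubectl create clusterrolebinding` and `kubectl auth can-i` with impersonation. The helper name `bind_and_verify` is hypothetical, and the caller is assumed to be allowed to create bindings and to impersonate service accounts.

```shell
#!/bin/sh
# Hypothetical helper: bind cluster-admin to one service account through a
# dedicated ClusterRoleBinding, then check that the account can list Ingresses.
# Arguments: binding name, namespace, service account name.
bind_and_verify() {
    binding="$1"
    ns="$2"
    sa="$3"
    kubectl create clusterrolebinding "$binding" \
        --clusterrole=cluster-admin \
        --serviceaccount="$ns:$sa" || return 1
    # Impersonate the service account; kubectl prints "yes" or "no".
    kubectl auth can-i list ingresses --as="system:serviceaccount:$ns:$sa"
}
```

With the names from this incident, the call would be `bind_and_verify ingress-kube-system kube-system ingress`. Running only the `kubectl auth can-i` line before the fix is also a quick way to confirm that missing permissions are the cause.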
