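k8s-netchecker-server network checking component

Use cases

k8s-netchecker-server checks network connectivity in a k8s cluster, covering both the host (physical) network and the virtual network. Combined with the open-source prometheus-operator monitoring and alerting component, it can raise an alert promptly when the k8s network misbehaves. Network faults are one of the main causes of service outages on k8s and have many possible origins, such as subnet conflicts, flushed iptables rules, an installed NetworkManager package, DNS ACL restrictions, CoreDNS resolution problems, docker network interface configuration, host parameter settings, calico or flanneld failures, firewall ports that are not open, and security group restrictions.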

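Usage

On a k8s cluster that already has the prometheus-operator monitoring component installed, run deploy.sh to deploy the k8s-netchecker-server network checking component. The script applies netchecker-rbac.yml and mon.yml from KUBE_DIR (the current directory by default) but does not generate them, so place both files there first. To check the k8s network, run curl 'http://127.0.0.1:31081/api/v1/connectivity_check' on a cluster node. The script is intended for k8s clusters deployed on CentOS 7; on other operating systems, adjust it to the actual environment.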
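The curl check above can be wrapped in a small function for use from cron or a health probe. This is a minimal sketch, not part of the upstream project: the NETCHECKER_HOST variable and both function names are hypothetical, and it assumes the server answers with a non-2xx HTTP status when agents are absent or outdated, so that curl -f fails.

```shell
# Assumed defaults: the node-local address and the NodePort used by deploy.sh.
NETCHECKER_HOST=${NETCHECKER_HOST:-127.0.0.1}   # hypothetical variable
NODE_PORT=${NODE_PORT:-31081}

# Build the URL of a netchecker-server API path, e.g. "connectivity_check".
netchecker_url() {
  printf 'http://%s:%s/api/v1/%s\n' "${NETCHECKER_HOST}" "${NODE_PORT}" "$1"
}

# Return 0 when the connectivity check succeeds, 1 otherwise.
# Assumption: a failed check is reported with an HTTP error status,
# which makes "curl -f" exit non-zero.
check_connectivity() {
  if curl -fsS --max-time 10 "$(netchecker_url connectivity_check)"; then
    echo
    echo "netchecker: connectivity OK"
  else
    echo "netchecker: connectivity check FAILED" >&2
    return 1
  fi
}

# Usage (on a cluster node):
#   check_connectivity || exit 1
```

Keeping the URL construction separate makes it easy to point the same check at another node or NodePort by setting NETCHECKER_HOST or NODE_PORT.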

Installation scripts

deploy.sh

#!/bin/bash
# Copyright 2017 Mirantis
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# env: NS - a namespace name (also as $1)
# env: KUBE_DIR - manifests directory, e.g. /etc/kubernetes
# env: KUBE_USER - a user to own the manifests directory
# env: NODE_PORT - a node port for the server app to listen on
# env: PURGE - if true, will only erase applications
# env: AGENT_REPORT_INTERVAL - an interval for agents to report

set -o xtrace
set -o pipefail
set -o errexit
set -o nounset


#NS=${NS:-default}
NS=${NS:-netchecker}
REAL_NS="--namespace=${1:-$NS}"
KUBE_DIR=${KUBE_DIR:-.}
KUBE_USER=${KUBE_USER:-}
NODE_PORT=${NODE_PORT:-31081}
PURGE=${PURGE:-false}
SERVER_IMAGE_NAME=${SERVER_IMAGE_NAME:-mirantis/k8s-netchecker-server}
AGENT_IMAGE_NAME=${AGENT_IMAGE_NAME:-mirantis/k8s-netchecker-agent}
#IMAGE_TAG=${IMAGE_TAG:-stable}
IMAGE_TAG=${IMAGE_TAG:-v1.2.2}
SERVER_IMAGE_TAG=${SERVER_IMAGE_TAG:-$IMAGE_TAG}
AGENT_IMAGE_TAG=${AGENT_IMAGE_TAG:-$IMAGE_TAG}
SERVER_PORT=${SERVER_PORT:-8081}

# added by X.L.Xia
USE_ETCD_ENDPOINT=""

if [ -z "${USE_ETCD_ENDPOINT}" ] ; then
  # use 3rd party resources (TPR) API to store agent reports
  SERVER_ENV_TAIL="-kubeproxyinit"
else
  # use ETCD to store agent reports
  ETCD_ENDPOINT=${ETCD_ENDPOINT:-"https://localhost:2379"}
  echo "[etcd_endpoint] information of etcd: ${ETCD_ENDPOINT}"
  EEPS=$(etcdctl --endpoints=${ETCD_ENDPOINT} member list | awk '{print $4}' | awk -F'=' '{print $2}' | paste -sd "," -)
  SERVER_ENV_TAIL="-etcd-endpoints=${EEPS}"
fi


if [ "${KUBE_DIR}" != "." ] && [ -n "${KUBE_USER}" ]; then
  mkdir -p "${KUBE_DIR}"
fi

# check there are nodes in the cluster
kubectl get nodes

echo "Deploying netchecker server and agents"
cat << EOF > "${KUBE_DIR}"/netchecker-server-dep.yml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: netchecker-server
spec:
  replicas: 1
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "${SERVER_PORT}"
      name: netchecker-server
      labels:
        app: netchecker-server
    spec:
      containers:
        - name: netchecker-server
          image: ${SERVER_IMAGE_NAME}:${SERVER_IMAGE_TAG}
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: ${SERVER_PORT}
          args:
            - "-v=5"
            - "-logtostderr"
            - "-endpoint=0.0.0.0:${SERVER_PORT}"
            - "${SERVER_ENV_TAIL}"
EOF

cat << EOF > "${KUBE_DIR}"/netchecker-server-svc.yml
apiVersion: v1
kind: "Service"
metadata:
  name: netchecker-service
  labels:
    app: netchecker-server
spec:
  selector:
    app: netchecker-server
  ports:
    - name: http-metrics
      protocol: TCP
      port: ${SERVER_PORT}
      targetPort: ${SERVER_PORT}
      nodePort: ${NODE_PORT}
  type: NodePort
EOF

cat << EOF > "${KUBE_DIR}"/netchecker-agent-ds.yml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: netchecker-agent
  name: netchecker-agent
spec:
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      name: netchecker-agent
      labels:
        app: netchecker-agent
    spec:
      containers:
        - name: netchecker-agent
          image: ${AGENT_IMAGE_NAME}:${AGENT_IMAGE_TAG}
          env:
            - name: MY_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          args:
            - "-v=5"
            - "-logtostderr"
            - "-serverendpoint=netchecker-service:${SERVER_PORT}"
            - "-reportinterval=60"
          imagePullPolicy: IfNotPresent
EOF

cat << EOF > "${KUBE_DIR}"/netchecker-agent-hostnet-ds.yml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: netchecker-agent-hostnet
  name: netchecker-agent-hostnet
spec:
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      name: netchecker-agent-hostnet
      labels:
        app: netchecker-agent-hostnet
    spec:
      hostNetwork: True
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: netchecker-agent
          image: ${AGENT_IMAGE_NAME}:${AGENT_IMAGE_TAG}
          env:
            - name: MY_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          args:
            - "-v=5"
            - "-logtostderr"
            - "-serverendpoint=netchecker-service:${SERVER_PORT}"
            - "-reportinterval=60"
          imagePullPolicy: IfNotPresent
EOF

if [ "${KUBE_DIR}" != "." ] && [ -n "${KUBE_USER}" ]; then
  chown -R "${KUBE_USER}" "${KUBE_DIR}"
fi

# netchecker-rbac.yml is added by X.L.Xia
echo "delete ns netchecker, then sleep 10s"
kubectl delete namespace ${NS} && sleep 10
kubectl delete --grace-period=1 -f "${KUBE_DIR}"/netchecker-rbac.yml "${REAL_NS}" || true
kubectl delete --grace-period=1 -f "${KUBE_DIR}"/netchecker-agent-ds.yml "${REAL_NS}" || true
kubectl delete --grace-period=1 -f "${KUBE_DIR}"/netchecker-agent-hostnet-ds.yml "${REAL_NS}" || true
kubectl delete --grace-period=1 -f "${KUBE_DIR}"/netchecker-server-svc.yml "${REAL_NS}" || true
echo "sleep 10s"
(kubectl delete --grace-period=1 -f "${KUBE_DIR}"/netchecker-server-dep.yml "${REAL_NS}" && sleep 10) || true

if [ "${PURGE}" != "true" ]; then
  # netchecker-rbac.yml is added by X.L.Xia
  kubectl create namespace ${NS}
  kubectl create -f "${KUBE_DIR}"/netchecker-rbac.yml "${REAL_NS}"
  kubectl create -f "${KUBE_DIR}"/netchecker-server-dep.yml  "${REAL_NS}"
  kubectl create -f "${KUBE_DIR}"/netchecker-server-svc.yml "${REAL_NS}"
  kubectl create -f "${KUBE_DIR}"/netchecker-agent-ds.yml "${REAL_NS}"
  kubectl create -f "${KUBE_DIR}"/netchecker-agent-hostnet-ds.yml "${REAL_NS}"
  echo "restart prometheus-k8s-0 prometheus-k8s-1"
  kubectl apply -f "${KUBE_DIR}"/mon.yml -n monitoring
  kubectl delete pod prometheus-k8s-0 prometheus-k8s-1 --namespace monitoring
fi

set +o xtrace
echo "DONE"

if [ "${PURGE}" != "true" ]; then
  echo "Use the following commands to:"
  echo "- get latest agents reports:"
  #echo "  curl -s -X GET 'http://localhost:${NODE_PORT}/api/v1/agents/' | python -mjson.tool"
  echo "  curl 'http://127.0.0.1:${NODE_PORT}/api/v1/agents/' | python -mjson.tool"
  echo "- check connectivity with agents:"
  #echo "  curl -X GET 'http://localhost:${NODE_PORT}/api/v1/connectivity_check'"
  echo "  curl 'http://127.0.0.1:${NODE_PORT}/api/v1/connectivity_check'"
  echo "- get agents metrics:"
  #echo "  curl -X GET 'http://localhost:${NODE_PORT}/metrics'"
  echo "  curl 'http://127.0.0.1:${NODE_PORT}/metrics'"
fi
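deploy.sh writes the Deployment, Service and DaemonSet manifests itself, but it only reads netchecker-rbac.yml and mon.yml from KUBE_DIR. If either file is missing, the script fails after it has already deleted and recreated the namespace. A small pre-flight check avoids that; the following is a sketch under that reading of the script, and the preflight function name is hypothetical.

```shell
# Same default as deploy.sh: manifests live in the current directory.
KUBE_DIR=${KUBE_DIR:-.}

# Verify that the manifests deploy.sh expects, but does not generate, exist.
# Prints every missing file to stderr and returns 1 if any is absent.
preflight() {
  missing=0
  for f in netchecker-rbac.yml mon.yml; do
    if [ ! -f "${KUBE_DIR}/${f}" ]; then
      echo "missing ${KUBE_DIR}/${f}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Usage:
#   preflight && ./deploy.sh
```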

netchecker-rbac.yml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: rds-admin-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: system:serviceaccount:netchecker:default
  #name: system:serviceaccount:default:default
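After the binding is created, it can be useful to confirm that the service account really has the permissions the server needs. The sketch below uses kubectl auth can-i with impersonation; the sa_can helper is hypothetical, and the resource name in the usage comment is taken from the error message quoted in the first reference, so treat it as an assumption about what the server reads.

```shell
# Ask the API server whether a service account may perform an action.
# $1: namespace, $2: service account name, remaining args: verb and resource.
# kubectl prints "yes" or "no" and exits non-zero when the answer is "no".
sa_can() {
  ns=$1
  sa=$2
  shift 2
  kubectl auth can-i "$@" --as="system:serviceaccount:${ns}:${sa}"
}

# Usage (resource name assumed from the error message in reference 1):
#   sa_can netchecker default get agents.network-checker.ext
#   sa_can netchecker default list pods
```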

mon.yml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: netchecker-server
  name: netchecker-server
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s
    port: http-metrics
  namespaceSelector:
    matchNames:
    - netchecker
  selector:
    matchLabels:
      app: netchecker-server
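The ServiceMonitor only produces a scrape target when a Service in the netchecker namespace carries the label app=netchecker-server and exposes a port named http-metrics, which is what netchecker-server-svc.yml defines. If the target does not appear in Prometheus, a quick check of those two conditions can help. This is a sketch; both function names are hypothetical.

```shell
# List the port names of all Services matching a label selector.
# $1: namespace, $2: label selector
service_port_names() {
  kubectl get service --namespace "$1" --selector "$2" \
    -o jsonpath='{.items[*].spec.ports[*].name}'
}

# Return 0 if the space-separated list in $2 contains the word in $1.
has_port() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   ports=$(service_port_names netchecker app=netchecker-server)
#   has_port http-metrics "${ports}" || echo "no Service port matches the ServiceMonitor" >&2
```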

k8s-netchecker alerting rules

# Other alerting rules omitted
# kubectl apply -f test-prometheus-k8s-rules.yaml -n monitoring

  - name: netchecker
    rules:
    - alert: NetCheckerAgentErrors
      expr: absent(ncagent_error_count_total) OR increase(ncagent_error_count_total[1h]) > 10
      for: 5m
      labels:
        severity: warning
      annotations:
        description: "{{ $value }} errors have been registered within the last hour for Netchecker Agent {{ $labels.instance }}"
        summary: "A high number of errors in Netchecker is happening"
    - alert: NetCheckerReportsMissing
      expr: absent(ncagent_report_count_total)
      #expr: absent(ncagent_report_count_total) OR increase(ncagent_report_count_total[5m]) < 15
      for: 5m
      labels:
        severity: warning
      annotations:
        description: "Netchecker Agent {{ $labels.instance }} has reported only {{ $value }} times for the last 5 minutes"
        summary: "The number of agent reports is lower than expected"
    - alert: NetCheckerTCPServerDelay
      expr: absent(ncagent_http_probe_tcp_connection_time_ms) OR delta(ncagent_http_probe_tcp_connection_time_ms{url="http://netchecker-service:8081/api/v1/ping"}[5m]) > 100
      for: 5m
      labels:
        severity: warning
      annotations:
        description: "Netchecker Agent {{ $labels.instance }} TCP connection time to Netchecker server has increased by {{ $value }} within the last 5 minutes"
        summary: "TCP connection to Netchecker server takes too much time"
    - alert: NetCheckerDNSSlow
      expr: absent(ncagent_http_probe_dns_lookup_time_ms) OR delta(ncagent_http_probe_dns_lookup_time_ms[5m]) > 300
      for: 5m
      labels:
        severity: warning
      annotations:
        description:  "DNS lookup time on Netchecker Agent {{ $labels.instance }} has increased by {{ $value }} within the last 5 minutes"
        summary: "DNS lookup time is too high" 
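Before relying on these rules, it helps to confirm that the metrics they reference are actually exposed on the /metrics endpoint. The sketch below sums all samples of one metric from Prometheus text-format output and returns a non-zero status when the metric is missing, which is the situation the absent() clauses alert on. The sum_metric name is hypothetical, and the parsing assumes the sample value is the last field on each line (no trailing timestamp).

```shell
# Sum every sample of metric $1 read from stdin (Prometheus text format).
# Prints the total, or returns 1 if the metric does not appear at all.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)         # strip the label set, keep the metric name
      if (metric == name) { total += $NF; found = 1 }
    }
    END { if (found) print total; else exit 1 }
  '
}

# Usage (on a cluster node, with the default NodePort):
#   curl -s 'http://127.0.0.1:31081/metrics' | sum_metric ncagent_error_count_total
```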

References

1. netchecker: Error occurred while checking the agents. Details: unknown (get agents.network-checker.ext netchecker-agent-xxxxx)
2. Mirantis/k8s-netchecker-server
