Monitoring the k8s controller manager and scheduler: creating ServiceMonitors and Rules

Contents

1. Modify the kube-controller-manager and kube-scheduler manifests

2. Create the Service, Endpoints, and ServiceMonitor

3. Verify in Prometheus

4. Create the PrometheusRule resources

5. Verify in Prometheus


Straight to the practical steps.

1. Modify the kube-controller-manager and kube-scheduler manifests

Note: modify one node at a time, and wait until the component on the previous node has restarted and is healthy before moving on to the next node.

kube-controller-manager manifest path: /etc/kubernetes/manifests/kube-controller-manager.yaml
kube-scheduler manifest path: /etc/kubernetes/manifests/kube-scheduler.yaml

vim /etc/kubernetes/manifests/kube-controller-manager.yaml
vim /etc/kubernetes/manifests/kube-scheduler.yaml

After the change, both components listen on 0.0.0.0 instead of 127.0.0.1, so their metrics ports (10257 for the controller manager, 10259 for the scheduler) become reachable from other nodes.
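
The exact edit is not shown above. On a typical kubeadm cluster it is the --bind-address flag in each static-pod manifest; a minimal sketch, assuming kubeadm's default flags (adjust to whatever your manifests actually contain):

# /etc/kubernetes/manifests/kube-controller-manager.yaml (same idea for kube-scheduler.yaml)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --bind-address=0.0.0.0      # was: --bind-address=127.0.0.1
    # (all other flags unchanged)

# the kubelet restarts a static pod automatically when its manifest changes; verify the listener:
ss -tlnp | grep -E '10257|10259'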

2. Create the Service, Endpoints, and ServiceMonitor

kube-controller-monitor.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manage-monitor
  namespace: kube-system
spec:
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
    targetPort: 10257
  sessionAffinity: None
  type: ClusterIP
--- 
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manage-monitor
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.50.238.191
  - ip: 10.50.107.48
  - ip: 10.50.140.151
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kube-controller-manager
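
Note that the Service above deliberately has no selector: the control-plane pods run in the host network and carry no matching pod labels, so the Endpoints object pins the Service to the master node IPs by hand. After applying these manifests (see the kubectl apply output further below) you can sanity-check the wiring; the IP is one of the masters from the Endpoints, and an authorization error from curl is expected because the secure port requires a token:

kubectl -n kube-system get endpoints kube-controller-manage-monitor
curl -k https://10.50.238.191:10257/metrics    # 403 Forbidden without a bearer token is normal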

kube-scheduler-monitor.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler-monitor
  namespace: kube-system
spec:
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
    targetPort: 10259
  sessionAffinity: None
  type: ClusterIP
--- 
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler-monitor
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.50.238.191
  - ip: 10.50.107.48
  - ip: 10.50.140.151
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kube-scheduler

root@10-50-238-191:/home/sunwenbo/prometheus-serviceMonitor/serviceMonitor/kubernetes-cluster# kubectl  apply -f ./
service/kube-controller-manage-monitor created
endpoints/kube-controller-manage-monitor created
servicemonitor.monitoring.coreos.com/kube-controller-manager created
service/kube-scheduler-monitor created
endpoints/kube-scheduler-monitor created
servicemonitor.monitoring.coreos.com/kube-scheduler created
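
If a target later shows as down, it helps to test an endpoint directly with a service-account token. A sketch with two assumptions: kubectl create token requires kubectl >= 1.24, and rancher-monitoring-prometheus is assumed to be the Prometheus ServiceAccount name in the Rancher monitoring stack that the rules below target (yours may differ):

TOKEN=$(kubectl -n cattle-monitoring-system create token rancher-monitoring-prometheus)   # assumed SA name
curl -sk -H "Authorization: Bearer $TOKEN" https://10.50.238.191:10257/metrics | head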

3. Verify in Prometheus
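
The new targets can be checked in the Prometheus web UI under Status > Targets, or from the command line; a sketch, assuming the Rancher monitoring install that the PrometheusRule manifests below belong to (substitute your own Prometheus Service name and namespace):

kubectl -n cattle-monitoring-system port-forward svc/rancher-monitoring-prometheus 9090:9090 &
sleep 2    # give the port-forward a moment to establish
curl -s 'http://localhost:9090/api/v1/targets?state=active' \
  | grep -oE '"job":"kube-(controller-manager|scheduler)"' | sort | uniq -c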

4. Create the PrometheusRule resources

kube-controller-rules.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-namespace: cattle-monitoring-system
    prometheus-operator-validated: "true"
  labels:
    app: rancher-monitoring
    app.kubernetes.io/instance: rancher-monitoring
    app.kubernetes.io/part-of: rancher-monitoring
  name: kube-controller-manager
  namespace: cattle-monitoring-system
spec:
  groups:
  - name: kube-controller-manager.rule
    rules:
    - alert: K8SControllerManagerDown
      expr: absent(up{job="kube-controller-manager"} == 1)
      for: 1m
      labels:
        severity: critical
        cluster: manage-prod
      annotations:
        description: There is no running K8S controller manager. Deployments and replication controllers are not making progress.
        summary: No Kubernetes controller manager is reachable
    - alert: K8SControllerManagerDown
      expr: up{job="kube-controller-manager"} == 0
      for: 1m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        description: Kubernetes controller manager {{ $labels.instance }} is down and unreachable.
        summary: Kubernetes controller manager {{ $labels.instance }} is down
    - alert: K8SControllerManagerUserCPU
      expr: sum(rate(container_cpu_user_seconds_total{pod=~"kube-controller-manager.*",container!="POD"}[5m]))by(pod) > 5
      for: 5m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        description: Kubernetes controller manager {{ $labels.instance }} user CPU time rate is above 5s per second.
        summary: kube-controller-manager CPU load is high (user CPU > 5s per second)
   
    - alert: K8SControllerManagerUseMemory
      expr: sum(container_memory_usage_bytes{pod=~"kube-controller-manager.*",container!="POD"})by(pod)/1024/1024 > 20
      for: 5m
      labels:
        severity: info
        cluster: manage-prod
      annotations:
        description: Kubernetes controller manager {{ $labels.instance }} is using more than 20MB of memory
        summary: kube-controller-manager memory usage exceeds 20MB
   
    - alert: K8SControllerManagerQueueTimedelay
      expr: histogram_quantile(0.99, sum(rate(workqueue_queue_duration_seconds_bucket{job="kube-controller-manager"}[5m])) by(le)) > 10
      for: 5m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        description: Kubernetes controller manager {{ $labels.instance }} 99th percentile workqueue latency is above 10s.
        summary: kube-controller-manager workqueue items wait more than 10 seconds; check the controller manager
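
Before trusting a rule, it is worth evaluating its expression by hand against the live Prometheus (reusing the port-forward from section 3; an empty result from the absent() query means the controller managers are being scraped):

curl -s --data-urlencode 'query=absent(up{job="kube-controller-manager"} == 1)' \
  http://localhost:9090/api/v1/query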

kube-scheduler-rules.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-namespace: cattle-monitoring-system
    prometheus-operator-validated: "true"
  labels:
    app: rancher-monitoring
    app.kubernetes.io/instance: rancher-monitoring
    app.kubernetes.io/part-of: rancher-monitoring
  name: kube-scheduler
  namespace: cattle-monitoring-system
spec:
  groups:
  - name: kube-scheduler.rule
    rules:
    - alert: K8SSchedulerDown
      expr: absent(up{job="kube-scheduler"} == 1)
      for: 1m
      labels:
        severity: critical
        cluster: manage-prod
      annotations:
        description: "There is no running K8S scheduler. New pods are not being assigned to nodes."
        summary: "all k8s scheduler is down"
    - alert: K8SSchedulerDown
      expr: up{job="kube-scheduler"} == 0
      for: 1m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        description: "K8S scheduler {{ $labels.instance }} is no running. New pods are not being assigned to nodes."
        summary: "k8s scheduler {{ $labels.instance }} is down"
    - alert: K8SSchedulerUserCPU
      expr: sum(rate(container_cpu_user_seconds_total{pod=~"kube-scheduler.*",container!="POD"}[5m]))by(pod) > 1
      for: 5m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        current_value: '{{$value}}'
        description: "kubernetes scheduler {{ $labels.instance }} is user cpu time > 1s. {{ $labels.instance }} isn't reachable"
        summary: "kubernetes scheduler 负载较高超过1s,当前值为{{$value}}"
   
    - alert: K8SSchedulerUseMemory
      expr: sum(container_memory_usage_bytes{pod=~"kube-scheduler.*",container!="POD"})by(pod)/1024/1024 > 20
      for: 5m
      labels:
        severity: info
        cluster: manage-prod
      annotations:
        current_value: '{{$value}}'
        description: "kubernetess scheduler {{ $labels.instance }} is use memory More than 20MB"
        summary: "kubernetes scheduler 使用内存超过20MB,当前值为{{$value}}MB"
   
    - alert: K8SSchedulerPodPending
      expr: sum(scheduler_pending_pods{job="kube-scheduler"})by(queue) > 5
      for: 5m
      labels:
        severity: info
        cluster: manage-prod
      annotations:
        current_value: '{{$value}}'
        description: "kubernetess scheduler {{ $labels.instance }} is Pending pod More than 5"
        summary: "kubernetes scheduler pod无法调度 > 5,当前值为{{$value}}"
   
    - alert: K8SSchedulerPodPending
      expr: sum(scheduler_pending_pods{job="kube-scheduler"})by(queue) > 10
      for: 5m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        current_value: '{{$value}}'
        description: "Kubernetes scheduler {{ $labels.instance }} has more than 10 pending pods"
        summary: "kube-scheduler has more than 10 unschedulable pods; current value: {{$value}}"
   
    - alert: K8SSchedulerBindingProblem
      expr: sum(rate(scheduler_binding_duration_seconds_count{job="kube-scheduler"}[5m])) > 1
      for: 5m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        current_value: '{{$value}}'
        description: "Kubernetes scheduler {{ $labels.instance }} pod binding rate is above 1 per second"
        summary: "kube-scheduler pod binding may be having problems; current value: {{$value}}"
   
    - alert: K8SSchedulerVolumeSpeed
      expr: sum(rate(scheduler_volume_scheduling_duration_seconds_count{job="kube-scheduler"}[5m])) > 1
      for: 5m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        current_value: '{{$value}}'
        description: "Kubernetes scheduler {{ $labels.instance }} volume scheduling rate is above 1 per second"
        summary: "kube-scheduler volume scheduling is delayed; current value: {{$value}}"
   
    - alert: K8SSchedulerClientRequestSlow
      expr: histogram_quantile(0.99, sum(rate(rest_client_request_duration_seconds_bucket{job="kube-scheduler"}[5m])) by (verb, url, le)) > 1
      for: 5m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        current_value: '{{$value}}'
        description: "Kubernetes scheduler {{ $labels.instance }} 99th percentile client request latency is above 1s"
        summary: "kube-scheduler client requests to the apiserver are slow; current value: {{$value}}"
root@10-50-238-191:/home/sunwenbo/prometheus-serviceMonitor/rules# kubectl  apply -f kube-controller-rules.yaml 
prometheusrule.monitoring.coreos.com/kube-controller-manager configured
root@10-50-238-191:/home/sunwenbo/prometheus-serviceMonitor/rules# kubectl  apply -f kube-scheduler-rules.yaml 
prometheusrule.monitoring.coreos.com/kube-scheduler configured
root@10-50-238-191:/home/sunwenbo/prometheus-serviceMonitor/rules# 

5. Verify in Prometheus
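
As in section 3, the result can be confirmed in the web UI (Status > Rules, and the Alerts page), or from the command line with the same port-forward:

kubectl -n cattle-monitoring-system get prometheusrule kube-controller-manager kube-scheduler
curl -s --data-urlencode 'query=ALERTS{alertname=~"K8S(ControllerManager|Scheduler).*"}' \
  http://localhost:9090/api/v1/query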
