Which alerting Rules does an enterprise need
PrometheusRules is an important Prometheus component for defining and managing alerting rules. It lets users define rules based on metric values or other conditions, and fires an alert whenever a rule's condition is met. A real enterprise should configure alerting rules along the following dimensions (one way to lay these out as rule files is sketched after the list):
- Business dimension: different businesses have different metrics and alerting rules. For a ToC platform, for example, you need to monitor order volume, inventory, payment success rate, and similar metrics to make sure the business keeps running normally.
- Environment dimension: an enterprise usually runs several environments, such as development, testing, staging, and production. Each environment has its own characteristics, so each needs its own alerting rules.
- Application dimension: different applications have different metrics and alerting rules. When monitoring a web application, for example, watch the HTTP request failure rate, response time, and memory usage.
- Infrastructure dimension: enterprise infrastructure covers servers, network devices, storage devices, and so on. When monitoring infrastructure, watch CPU usage, disk space, network bandwidth, and similar metrics.
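A common way to reflect these dimensions on disk is to keep one rule file per dimension and load them all from prometheus.yml. A minimal sketch; all file names here are assumptions:

# prometheus.yml (fragment) -- file names below are hypothetical
rule_files:
  - "rules/business.rules.yml"    # business dimension (orders, payments, ...)
  - "rules/node.rules.yml"        # infrastructure dimension
  - "rules/pod.rules.yml"         # application dimension
  - "rules/prod/*.rules.yml"      # environment dimension: rules loaded only on the prod server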
Analyzing a typical alerting rule
groups:
- name: general.rules
  rules:
  - alert: InstanceDown
    expr: |
      up{job=~"other-ECS|k8s-nodes|prometheus"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} has been down for more than 1 minute."
In a rule file, a set of related rules can be grouped under one group, and each group can contain multiple alerting rules (rule). An alerting rule consists of the following parts:
- alert: the name of the alerting rule.
- expr: the trigger condition, a PromQL expression evaluated to determine whether any time series currently satisfies it.
- for: the evaluation wait time (optional); the alert only fires after the condition has held for this duration, and while waiting, a newly raised alert is in the pending state.
- labels: custom labels, letting the user attach an extra set of labels to the alert.
- annotations: a set of additional information, such as text describing the alert in detail; annotations are sent to Alertmanager along with the alert when it fires.
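Before Prometheus loads a rule file it can be syntax-checked and even unit-tested with promtool. Below is a sketch of a unit test for the InstanceDown rule above; the file names are assumptions, and the test runs with: promtool test rules instance_down_test.yml

# instance_down_test.yml -- a promtool unit-test sketch (file names are assumptions)
rule_files:
  - general.rules.yml        # assumed file holding the general.rules group above
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # simulate a prometheus target that stays down for three scrapes
      - series: 'up{job="prometheus", instance="localhost:9090"}'
        values: '0 0 0'
    alert_rule_test:
      - eval_time: 2m        # the condition has held for more than 1m, so the alert fires
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: prometheus
              instance: localhost:9090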
Enterprise alerting Rules
Tailor the rules to your company's business scenarios; a good reference: https://samber.github.io/awesome-prometheus-alerts/rules#kubernetes
NodeRules
groups:
- name: node.rules
  rules:
  - alert: NodeFilesystemUsage
    expr: |
      100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }}: partition {{ $labels.mountpoint }} usage too high"
      description: "{{ $labels.instance }} hostname: {{ $labels.hostname }}: partition {{ $labels.mountpoint }} usage is above 85% (current value: {{ $value }})"
  - alert: NodeMemoryUsage
    expr: |
      100 - (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} memory usage too high"
      description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} memory usage is above 85% (current value: {{ $value }})"
  - alert: NodeCPUUsage
    expr: |
      100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance, hostname) * 100) > 85
    for: 10m
    labels:
      hostname: '{{ $labels.hostname }}'
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage too high"
      description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} CPU usage is above 85% (current value: {{ $value }})"
  - alert: TCP_Estab
    expr: |
      node_netstat_Tcp_CurrEstab > 5500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} has too many established TCP connections"
      description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} has too many established TCP connections! (current value: {{ $value }})"
  - alert: TCP_TIME_WAIT
    expr: |
      node_sockstat_TCP_tw > 3000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} has too many TCP TIME_WAIT sockets"
      description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} has too many TCP TIME_WAIT sockets! (current value: {{ $value }})"
  - alert: TCP_Sockets
    expr: |
      node_sockstat_sockets_used > 10000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} has too many TCP sockets in use"
      description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} has too many TCP sockets in use! (current value: {{ $value }})"
  - alert: KubeNodeNotReady
    expr: |
      kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      description: '{{ $labels.node }} has been NotReady for 1 minute.'
  - alert: KubernetesMemoryPressure
    expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes memory pressure (instance {{ $labels.instance }})
      description: "{{ $labels.node }} has MemoryPressure condition VALUE = {{ $value }}"
  - alert: KubernetesDiskPressure
    expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes disk pressure (instance {{ $labels.instance }})
      description: "{{ $labels.node }} has DiskPressure condition."
  - alert: KubernetesContainerOomKiller
    expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes container oom killer (instance {{ $labels.instance }})
      description: "{{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes."
  - alert: KubernetesJobFailed
    expr: kube_job_status_failed > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes Job failed (instance {{ $labels.instance }})
      description: "Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete."
  - alert: UnusualDiskReadRate
    expr: |
      sum by (job, instance, hostname) (irate(node_disk_read_bytes_total[5m])) / 1024 / 1024 > 140
    for: 5m
    labels:
      severity: critical
      hostname: '{{ $labels.hostname }}'
    annotations:
      description: '{{ $labels.instance }} hostname: {{ $labels.hostname }} disk read rate has exceeded 140 MB/s for 5 minutes (current value: {{ $value }}); Alibaba Cloud ESSD max throughput: PL0 180 MB/s, PL1 350 MB/s'
  - alert: UnusualDiskWriteRate
    expr: |
      sum by (job, instance, hostname) (irate(node_disk_written_bytes_total[5m])) / 1024 / 1024 > 140
    for: 5m
    labels:
      severity: critical
      hostname: '{{ $labels.hostname }}'
    annotations:
      description: '{{ $labels.instance }} hostname: {{ $labels.hostname }} disk write rate has exceeded 140 MB/s for 5 minutes (current value: {{ $value }}); Alibaba Cloud ESSD max throughput: PL0 180 MB/s, PL1 350 MB/s'
  - alert: UnusualNetworkThroughputIn
    expr: |
      sum by (job, instance, hostname) (irate(node_network_receive_bytes_total{job=~"aws-hk-monitor|k8s-nodes"}[5m])) / 1024 / 1024 > 80
    for: 5m
    labels:
      severity: critical
    annotations:
      description: '{{ $labels.instance }} hostname: {{ $labels.hostname }} network receive rate has exceeded 80 MB/s for 5 minutes (current value: {{ $value }})'
  - alert: UnusualNetworkThroughputOut
    expr: |
      sum by (job, instance, hostname) (irate(node_network_transmit_bytes_total{job=~"aws-hk-monitor|k8s-nodes"}[5m])) / 1024 / 1024 > 80
    for: 5m
    labels:
      severity: critical
    annotations:
      description: '{{ $labels.instance }} hostname: {{ $labels.hostname }} network transmit rate has exceeded 80 MB/s for 5 minutes (current value: {{ $value }})'
  - alert: SystemdServiceCrashed
    expr: |
      node_systemd_unit_state{state="failed"} == 1
    for: 5m
    labels:
      severity: warning
    annotations:
      description: '{{ $labels.instance }} hostname: {{ $labels.hostname }}: the {{ $labels.name }} unit has been in failed state for 5 minutes; please handle it promptly'
  - alert: HostDiskWillFillIn24Hours
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host disk will fill in 24 hours (instance {{ $labels.instance }})
      description: "{{ $labels.instance }} hostname: {{ $labels.hostname }}: at the current write rate the filesystem is predicted to run out of space within 24 hours!"
  - alert: HostOutOfInodes
    expr: node_filesystem_files_free / node_filesystem_files * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of inodes (instance {{ $labels.instance }})
      description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} has less than 10% of its inodes left!\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostOomKillDetected
    expr: increase(node_vmstat_oom_kill[1m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host OOM kill detected (instance {{ $labels.instance }})
      description: "{{ $labels.instance }} hostname: {{ $labels.hostname }}: an OOM kill event was detected on this host!"
Notes: apptype can be attached as an extra label to distinguish application types; device refers to a partition (block device) that a filesystem metric belongs to.
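Also note that the {{ $labels.hostname }} references in these rules only resolve if a hostname label was attached at scrape time, for example through static target labels or relabeling. A minimal sketch; the job name, address, and label values are assumptions:

scrape_configs:
- job_name: "k8s-nodes"            # assumed job name, matching the rules above
  static_configs:
  - targets: ["10.0.0.11:9100"]    # hypothetical node-exporter endpoint
    labels:
      hostname: "node-01"          # attached to every series scraped from this target
      apptype: "web"               # optional extra dimension, per the note above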
PrometheusRules
groups:
- name: prometheus.rules
  rules:
  - alert: PrometheusErrorSendingAlertsToAnyAlertmanagers
    expr: |
      (rate(prometheus_notifications_errors_total{instance="localhost:9090", job="prometheus"}[5m]) / rate(prometheus_notifications_sent_total{instance="localhost:9090", job="prometheus"}[5m])) * 100 > 3
    for: 5m
    labels:
      severity: warning
    annotations:
      description: '{{ printf "%.1f" $value }}% minimum errors while sending alerts from Prometheus {{ $labels.namespace }}/{{ $labels.pod }} to any Alertmanager.'
  - alert: PrometheusNotConnectedToAlertmanagers
    expr: |
      max_over_time(prometheus_notifications_alertmanagers_discovered{instance="localhost:9090", job="prometheus"}[5m]) != 1
    for: 5m
    labels:
      severity: critical
    annotations:
      description: "Prometheus {{ $labels.namespace }}/{{ $labels.pod }} cannot connect to Alertmanager!"
  - alert: PrometheusRuleFailures
    expr: |
      increase(prometheus_rule_evaluation_failures_total{instance="localhost:9090", job="prometheus"}[5m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      description: 'Prometheus {{ $labels.namespace }}/{{ $labels.pod }} had {{ printf "%.0f" $value }} rule evaluation failures in the last 5 minutes'
  - alert: PrometheusRuleEvaluationFailures
    expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} rule evaluation failures; please investigate promptly."
  - alert: PrometheusTsdbReloadFailures
    expr: increase(prometheus_tsdb_reloads_failures_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
      description: "Prometheus had {{ $value }} TSDB reload failures!"
  - alert: PrometheusTsdbWalCorruptions
    expr: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
      description: "Prometheus detected {{ $value }} TSDB WAL corruptions!"
website.rules
groups:
- name: website.rules
  rules:
  - alert: SSLCertExpiringSoon
    expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
    for: 1h
    labels:
      severity: warning
    annotations:
      description: 'The certificate for domain {{ $labels.instance }} expires in {{ printf "%.1f" $value }} days; please renew it as soon as possible'
      summary: "SSL certificate expiry warning"
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m
    labels:
      severity: critical
      pod: '{{ $labels.instance }}'
      namespace: '{{ $labels.kubernetes_namespace }}'
    annotations:
      summary: "Endpoint/host/port/domain {{ $labels.instance }} is unreachable"
      description: "Endpoint/host/port/domain {{ $labels.instance }} is unreachable; please investigate as soon as possible!"
  - alert: curlHttpStatus
    expr: probe_http_status_code{job="blackbox-http"} >= 422 and probe_success{job="blackbox-http"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: 'Business alert: website is unreachable'
      description: '{{ $labels.instance }} is unreachable; please check promptly. Current status code: {{ $value }}'
PodRules
groups:
- name: pod.rules
  rules:
  - alert: PodCPUUsage
    expr: |
      sum(rate(container_cpu_usage_seconds_total{image!=""}[5m]) * 100) by (pod, namespace) > 90
    for: 5m
    labels:
      severity: warning
      pod: '{{ $labels.pod }}'
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} CPU usage is above 90% (current value: {{ $value }})"
  - alert: PodMemoryUsage
    expr: |
      sum(container_memory_rss{image!=""}) by (pod, namespace) / sum(container_spec_memory_limit_bytes{image!=""}) by (pod, namespace) * 100 != +inf > 85
    for: 5m
    labels:
      severity: critical
      pod: '{{ $labels.pod }}'
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} memory usage is above 85% (current value: {{ $value }})"
  - alert: KubeDeploymentError
    expr: |
      kube_deployment_spec_replicas{job="kubernetes-service-endpoints"} != kube_deployment_status_replicas_available{job="kubernetes-service-endpoints"}
    for: 3m
    labels:
      severity: warning
      pod: '{{ $labels.deployment }}'
    annotations:
      description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }}: desired replicas do not match available replicas (current value: {{ $value }})"
  - alert: coreDnsError
    expr: |
      kube_pod_container_status_running{container="coredns"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} coredns service is down (current value: {{ $value }})"
  - alert: kubeProxyError
    expr: |
      kube_pod_container_status_running{container="kube-proxy"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} kube-proxy service is down (current value: {{ $value }})"
  - alert: filebeatError
    expr: |
      kube_pod_container_status_running{container="filebeat"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} filebeat service is down (current value: {{ $value }})"
  - alert: PodNetworkReceive
    expr: |
      sum(rate(container_network_receive_bytes_total{image!="",name=~"^k8s_.*"}[5m]) / 1000) by (pod, namespace) > 60000
    for: 5m
    labels:
      severity: warning
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} inbound traffic is above 60 MB/s (current value: {{ $value }} KB/s)"
  - alert: PodNetworkTransmit
    expr: |
      sum(rate(container_network_transmit_bytes_total{image!="",name=~"^k8s_.*"}[5m]) / 1000) by (pod, namespace) > 60000
    for: 5m
    labels:
      severity: warning
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} outbound traffic is above 60 MB/s (current value: {{ $value }} KB/s)"
  - alert: PodRestart
    expr: |
      sum(changes(kube_pod_container_status_restarts_total[1m])) by (pod, namespace) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} restarted (current value: {{ $value }})"
  - alert: PodFailed
    expr: |
      sum(kube_pod_status_phase{phase="Failed"}) by (pod, namespace) > 0
    for: 5s
    labels:
      severity: critical
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is in Failed state (current value: {{ $value }})"
  - alert: PodPending
    expr: |
      sum(kube_pod_status_phase{phase="Pending"}) by (pod, namespace) > 0
    for: 30s
    labels:
      severity: critical
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is in Pending state (current value: {{ $value }})"
  - alert: PodErrImagePull
    expr: |
      sum by (namespace, pod) (kube_pod_container_status_waiting_reason{reason="ErrImagePull"}) == 1
    for: 1m
    labels:
      severity: warning
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is in ErrImagePull state (current value: {{ $value }})"
  - alert: PodImagePullBackOff
    expr: |
      sum by (namespace, pod) (kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"}) == 1
    for: 1m
    labels:
      severity: warning
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is in ImagePullBackOff state (current value: {{ $value }})"
  - alert: PodCrashLoopBackOff
    expr: |
      sum by (namespace, pod) (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) == 1
    for: 1m
    labels:
      severity: warning
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is in CrashLoopBackOff state (current value: {{ $value }})"
  - alert: PodInvalidImageName
    expr: |
      sum by (namespace, pod) (kube_pod_container_status_waiting_reason{reason="InvalidImageName"}) == 1
    for: 1m
    labels:
      severity: warning
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is in InvalidImageName state (current value: {{ $value }})"
  - alert: PodCreateContainerConfigError
    expr: |
      sum by (namespace, pod) (kube_pod_container_status_waiting_reason{reason="CreateContainerConfigError"}) == 1
    for: 1m
    labels:
      severity: warning
    annotations:
      description: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is in CreateContainerConfigError state (current value: {{ $value }})"
  - alert: KubernetesContainerOomKiller
    expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes container oom killer (instance {{ $labels.instance }})
      description: "{{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes!"
  - alert: KubernetesPersistentvolumeError
    expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending", job="kube-state-metrics"} > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes PersistentVolume error (instance {{ $labels.instance }})
      description: "{{ $labels.instance }} Persistent volume is in bad state!"
  - alert: KubernetesStatefulsetDown
    expr: (kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes StatefulSet down (instance {{ $labels.instance }})
      description: "{{ $labels.statefulset }} A StatefulSet went down!"
  - alert: KubernetesStatefulsetReplicasMismatch
    expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }})
      description: "{{ $labels.statefulset }} A StatefulSet does not match the expected number of replicas."
VolumeRules
groups:
- name: volume.rules
  rules:
  - alert: PersistentVolumeClaimLost
    expr: |
      sum by (namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_status_phase{phase="Lost"}) == 1
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is lost!"
  - alert: PersistentVolumeClaimPending
    expr: |
      sum by (namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_status_phase{phase="Pending"}) == 1
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending!"
  - alert: PersistentVolumeFailed
    expr: |
      sum(kube_persistentvolume_status_phase{phase="Failed",job="kubernetes-service-endpoints"}) by (persistentvolume) == 1
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "Persistent volume is in Failed state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PersistentVolumePending
    expr: |
      sum(kube_persistentvolume_status_phase{phase="Pending",job="kubernetes-service-endpoints"}) by (persistentvolume) == 1
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "Persistent volume is in Pending state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
process.rules
groups:
- name: process.rules
  rules:
  - alert: SparkxtaskProcessDown
    expr: |
      (namedprocess_namegroup_num_procs{groupname="map[:sparkxtask]"}) < 4
    for: 1m
    labels:
      severity: warning
      pod: sparkxads-process
    annotations:
      description: "Task: sparkxtask | expected process count: 4 | current value: {{ $value }}; Robot, please handle this promptly!"
- Prometheus rules are a PromQL-based mechanism for generating alerts and recordings; by computing and aggregating over metrics, they can also produce new time series.
- Defining rules along different dimensions lets Prometheus monitor and alert on metrics at different layers and levels of detail, giving a clearer picture of application state and performance.
- To keep the alerting strategy simple and effective, choose deliberately which metrics should trigger alerts; avoiding over-alerting and noise improves the reliability and accuracy of monitoring and alerting.
Grafana
Grafana is an open-source metrics analytics and visualization tool. It provides querying, visualization, alerting, and metric display, and makes it easy to build charts, dashboards, and other visual interfaces.
Key features
- Visualization: many chart types to choose from, flexible styling, and a large number of plugins.
- Dynamic dashboards: templates and variables make dashboards dynamic and reusable, and easy to adjust.
- Metric exploration: display data through instant queries and dynamic views, and split views across different time ranges.
- Alerting: define alert rules visually on the metrics that matter. Grafana evaluates them continuously and sends notifications to Slack, email, instant messaging, and other systems.
- Mixed data sources: mix different data sources in the same graph, with a data source specified per query, e.g. Prometheus, ELK, relational databases.
Deploying Grafana
Since Grafana's data needs persistent storage, first create a PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-data-pvc
  namespace: monitor
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: "nfs-storage"
  resources:
    requests:
      storage: 10Gi
Grafana ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
  namespace: monitor
data:
  grafana.ini: |
    [server]
    root_url = http://grafana.kubernets.cn
    [smtp]
    enabled = true
    host = smtp.exmail.qq.com:465
    user = devops@xxxx.com
    password = aDhUcxxxxyecE
    skip_verify = true
    from_address = devops@xxxx.com
    [alerting]
    enabled = true
    execute_alerts = true
Exposing Grafana with a Service
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitor
  labels:
    app: grafana
    component: core
spec:
  type: ClusterIP
  ports:
    - port: 3000
  selector:
    app: grafana
    component: core
Grafana Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-core
  namespace: monitor
  labels:
    app: grafana
    component: core
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
        component: core
    spec:
      containers:
      - name: grafana-core
        image: grafana/grafana:latest
        imagePullPolicy: IfNotPresent
        resources:
          # keep request = limit to keep this container in guaranteed class
          limits:
            cpu: 500m
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 500Mi
        env: # environment variables that set up Grafana's default admin authentication
          # The following env variables set up basic auth with the default admin user and admin password.
          - name: GF_AUTH_BASIC_ENABLED
            value: "true"
          - name: GF_AUTH_ANONYMOUS_ENABLED
            value: "false"
          # - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          #   value: Admin
          # does not really work, because of template variables in exported dashboards:
          # - name: GF_DASHBOARDS_JSON_ENABLED
          #   value: "true"
        readinessProbe:
          httpGet:
            path: /login
            port: 3000
          # initialDelaySeconds: 30
          # timeoutSeconds: 1
        volumeMounts:
          - name: data
            subPath: grafana
            mountPath: /var/lib/grafana
          - name: grafana-config
            mountPath: /etc/grafana
            readOnly: true
      securityContext: # security context: the group and user the container runs as
        fsGroup: 472
        runAsUser: 472
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: grafana-data-pvc
        - name: grafana-config
          configMap:
            name: grafana-config
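Assuming the four manifests above were saved under the (hypothetical) file names below, they can be applied and checked like this:

# hypothetical file names for the manifests above
kubectl apply -f grafana-pvc.yaml -f grafana-config.yaml -f grafana-svc.yaml -f grafana-deploy.yaml
kubectl get pods -n monitor -l app=grafana    # wait until the pod shows 1/1 Running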
Exposing Grafana externally with an Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitor
  annotations:
    prometheus.io/http_probe: "true"
spec:
  ingressClassName: nginx
  rules:
  - host: grafana.kubernets.cn
    http:
      paths:
      - pathType: Prefix
        path: /
        backend:
          service:
            name: grafana
            port:
              number: 3000
Verify with curl
The first login forces a password change.
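For example, assuming grafana.kubernets.cn resolves to the ingress controller:

# expect an HTTP 200 from the Grafana login page
curl -I http://grafana.kubernets.cn/login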
Installing the Grafana pie chart plugin
# kubectl exec -it -n monitor grafana-58ffb4db5d-c4wlz -- bash
bash-5.0$ grafana-cli plugins install grafana-piechart-panel
bash-5.0$ grafana-cli plugins install camptocamp-prometheus-alertmanager-datasource
Grafana does not hot-reload plugins, so the pod must be restarted; after the restart, verify with curl again.
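One way to do that, using the Deployment name from above:

kubectl rollout restart deployment grafana-core -n monitor
kubectl rollout status deployment grafana-core -n monitor
curl -I http://grafana.kubernets.cn/login    # should return 200 again once the new pod is ready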
Configuring data sources
Grafana officially supports Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, and CloudWatch as data sources.
Add a data source: Configuration --> Data Sources --> Prometheus
Since Grafana and Prometheus are both deployed in Kubernetes, use the Prometheus Service address directly.
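Instead of clicking through the UI, the data source can also be provisioned from a file. A sketch, assuming Prometheus is exposed by a Service named prometheus in the monitor namespace:

# /etc/grafana/provisioning/datasources/prometheus.yaml -- Service name/namespace are assumptions
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  url: http://prometheus.monitor.svc.cluster.local:9090
  isDefault: true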
Enterprise dashboards
Create a separate dashboard folder for each monitoring dimension.
Create dashboards per dimension: Create --> New dashboard folder --> e.g. "Cluster level"
Official dashboard catalog: https://grafana.com/grafana/dashboards/
Monitoring metric reference: https://v2-1.docs.kubesphere.io/docs/zh-CN/api-reference/monitoring-metrics/
The links above show the many ready-made dashboards and metrics Grafana provides, which can be referenced and imported directly.
How to use an official dashboard
Search for a Kubernetes dashboard on the official site
Pick the dashboard you like
Copy the dashboard ID
Back in Grafana, open the dashboard import page
Enter the ID and click Load
Choose the dashboard folder and data source