k8s Study Notes — (Ops) Chapter 10: k8s Cluster Monitoring
※ Key knowledge points per section
1 Monitoring Solutions
1.1 Heapster
Heapster is a monitoring and performance-analysis tool for container clusters, with native support for Kubernetes and CoreOS. (Heapster has since been deprecated and retired in favor of metrics-server and full monitoring stacks such as Prometheus.)
Kubernetes has a well-known monitoring agent, cAdvisor. It runs on every Kubernetes node and collects monitoring data about the host and its containers (CPU, memory, filesystem, network, uptime).
In newer versions, Kubernetes has folded cAdvisor's functionality into the kubelet itself, so each node's container metrics can be queried directly; a sketch follows.
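A quick way to confirm this, assuming kubectl access to the cluster (the node name below is a placeholder), is to pull the kubelet's cAdvisor metrics through the apiserver proxy:
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics/cadvisor | head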
1.2 Weave Scope
Weave Scope can monitor the state and resource usage of many kinds of resources in a Kubernetes cluster, show the application topology and scale, and even lets you debug inside a container straight from the browser. Its features include (an install sketch follows the list):
- Interactive topology view
- Graph and table modes
- Filtering
- Search
- Real-time metrics
- Container troubleshooting
- Plugin extensions
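As a sketch, the install command documented by Weave Scope at the time of writing looked like this (Weaveworks has since wound down, so verify the URL still resolves before relying on it):
kubectl apply -f "https://cloud.weave.works/k8s/scope.yaml?k8s-version=$(kubectl version | base64 | tr -d '\n')"
The UI is then reachable by port-forwarding the weave-scope-app service on port 4040.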
1.3 Prometheus
Prometheus is an open-source monitoring and alerting toolkit built around a time-series database. Originally developed at SoundCloud, it was spun out as an independent open-source project as more companies started using it; since then many companies and organizations have adopted Prometheus as their monitoring and alerting tool, and it is now a graduated CNCF project.
2 Monitoring k8s with Prometheus
2.1 Custom configuration
2.1.1 Create the ConfigMap
Create prometheus-config.yml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
Create the ConfigMap:
kubectl create -f prometheus-config.yml
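Optionally, if promtool (shipped alongside Prometheus releases) is available locally, the embedded configuration can be validated first; save the data."prometheus.yml" block to a plain file, then:
promtool check config prometheus.yml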
2.1.2 Deploy Prometheus
Create prometheus-deploy.yml:
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  labels:
    name: prometheus
spec:
  ports:
  - name: prometheus
    protocol: TCP
    port: 9090
    targetPort: 9090
  selector:
    app: prometheus
  type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    name: prometheus
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.1
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: "/etc/prometheus"
          name: prometheus-config
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
Create the deployment objects:
kubectl create -f prometheus-deploy.yml
Check that the pod is running:
kubectl get pods -l app=prometheus
Get the service info:
kubectl get svc -l name=prometheus
Open http://<node-ip>:<node-port> in a browser to reach the UI.
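A sketch for finding the NodePort and checking liveness from the shell; /-/healthy is Prometheus's built-in health endpoint, and <node-ip> is a placeholder:
NODE_PORT=$(kubectl get svc prometheus -o jsonpath='{.spec.ports[0].nodePort}')
curl http://<node-ip>:$NODE_PORT/-/healthy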
2.1.3 Configure access permissions
Create prometheus-rbac-setup.yml:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default
Create the RBAC objects:
kubectl create -f prometheus-rbac-setup.yml
Update prometheus-deploy.yml so the pod runs under the new ServiceAccount:
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus  # the older serviceAccount field is deprecated; serviceAccountName alone is enough
Upgrade the prometheus deployment:
kubectl apply -f prometheus-deploy.yml
Check the pod:
kubectl get pods -l app=prometheus
Inspect the ServiceAccount credentials mounted into the pod:
kubectl exec -it <pod name> -- ls /var/run/secrets/kubernetes.io/serviceaccount/
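To confirm the ClusterRole actually grants what Prometheus needs, kubectl can evaluate access as the ServiceAccount (a quick sanity check; both commands should print yes):
kubectl auth can-i list pods --as=system:serviceaccount:default:prometheus
kubectl auth can-i watch nodes --as=system:serviceaccount:default:prometheus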
2.1.4 Service discovery configuration
# Configure jobs so Prometheus can discover all cluster resources; update prometheus-config.yml to the following
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'kubernetes-nodes'
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
    - job_name: 'kubernetes-service'
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: service
    - job_name: 'kubernetes-endpoints'
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: endpoints
    - job_name: 'kubernetes-ingress'
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: ingress
    - job_name: 'kubernetes-pods'
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: pod
Apply the updated config:
kubectl apply -f prometheus-config.yml
Get the prometheus pod:
kubectl get pods -l app=prometheus
Delete the pod (the Deployment recreates it with the new ConfigMap mounted):
kubectl delete pods <pod name>
Check the pod status:
kubectl get pods
Revisit the web UI; the discovered targets appear under Status > Targets.
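The same discovery result can be read from the HTTP API instead of the UI (a sketch; assumes jq is installed and uses the NodePort found earlier):
curl -s http://<node-ip>:<node-port>/api/v1/targets | jq '.data.activeTargets[].labels.job'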
2.1.5 Synchronize system time
Check the system time:
date
Sync against a public NTP server:
ntpdate cn.pool.ntp.org
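A one-off ntpdate run drifts again over time. A minimal sketch that repeats the sync hourly via cron (the ntpdate path is an assumption and may differ per distro):
(crontab -l 2>/dev/null; echo "0 * * * * /usr/sbin/ntpdate cn.pool.ntp.org") | crontab -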
2.1.6 Monitoring the k8s cluster
# Append the following job to prometheus-config.yml
- job_name: 'kubernetes-kubelet'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
Apply the updated resources:
kubectl apply -f prometheus-config.yml
Recreate the pod so it picks up the config:
kubectl delete pods <pod name>
Use this metric to query the 99th-percentile pod startup latency on each node:
kubelet_pod_start_latency_microseconds{quantile="0.99"}
Compute the average startup latency:
kubelet_pod_start_latency_microseconds_sum / kubelet_pod_start_latency_microseconds_count
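Note that this summary metric exists only on older kubelets; newer releases expose the histogram kubelet_pod_start_duration_seconds instead, for which the rough equivalents are:
histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le))
sum(rate(kubelet_pod_start_duration_seconds_sum[5m])) / sum(rate(kubelet_pod_start_duration_seconds_count[5m]))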
2.1.6.1 Fetching per-node container resource usage from the kubelet
# Add the following job to the config file, then update the resources
- job_name: 'kubernetes-cadvisor'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
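After updating, typical cAdvisor queries look like the following (standard cAdvisor metric names; narrow the label matchers to fit your cluster):
sum(rate(container_cpu_usage_seconds_total{image!=""}[5m])) by (pod)
sum(container_memory_working_set_bytes{image!=""}) by (pod)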
2.1.6.2 Monitoring resource usage with node-exporter
# Create node-exporter-daemonset.yml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  labels:
    app: node-exporter
spec:
  selector:  # apps/v1 DaemonSets require an explicit selector
    matchLabels:
      app: node-exporter
  template:
    metadata:
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9100'
        prometheus.io/path: 'metrics'
      labels:
        app: node-exporter
      name: node-exporter
    spec:
      containers:
      - image: prom/node-exporter
        imagePullPolicy: IfNotPresent
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: scrape
      hostNetwork: true
      hostPID: true
Create the DaemonSet:
kubectl create -f node-exporter-daemonset.yml
Check the DaemonSet status:
kubectl get daemonsets -l app=node-exporter
Check the pod status:
kubectl get pods -l app=node-exporter
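Because the DaemonSet runs with hostNetwork and hostPort 9100, each node serves the exporter directly; a quick smoke test (node IP is a placeholder):
curl -s http://<node-ip>:9100/metrics | head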
# Modify the config file, replacing the earlier plain 'kubernetes-pods' job with this annotation-filtered version (Prometheus rejects duplicate job_name values)
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
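With this job scraping the annotated node-exporter pods, node-level queries become possible; two common sketches using standard node-exporter metric names:
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes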
# Monitor all incoming API requests by scraping the apiserver; add the apiserver job
- job_name: 'kubernetes-apiservers'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https
  - target_label: __address__
    replacement: kubernetes.default.svc:443
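Once the job is up, typical queries look like the following; apiserver_request_total is the metric name on recent apiserver releases (very old releases used apiserver_request_count instead):
sum(rate(apiserver_request_total[5m])) by (code)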
2.1.6.3 Network probing for Ingress and Service resources
# Create blackbox-exporter.yaml for network probing
apiVersion: v1
kind: Service
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
spec:
  ports:
  - name: blackbox
    port: 9115
    protocol: TCP
  selector:
    app: blackbox-exporter
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blackbox-exporter
  template:
    metadata:
      labels:
        app: blackbox-exporter
    spec:
      containers:
      - image: prom/blackbox-exporter
        imagePullPolicy: IfNotPresent
        name: blackbox-exporter
Create the resources:
kubectl apply -f blackbox-exporter.yaml
# Configure probing of annotated service/ingress objects; add these jobs to the config file
- job_name: 'kubernetes-services'
  metrics_path: /probe
  params:
    module: [http_2xx]
  kubernetes_sd_configs:
  - role: service
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
    action: keep
    regex: true
  - source_labels: [__address__]
    target_label: __param_target
  - target_label: __address__
    replacement: blackbox-exporter.default.svc.cluster.local:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    target_label: kubernetes_name
- job_name: 'kubernetes-ingresses'
  metrics_path: /probe
  params:
    module: [http_2xx]
  kubernetes_sd_configs:
  - role: ingress
  relabel_configs:
  - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
    regex: (.+);(.+);(.+)
    replacement: ${1}://${2}${3}
    target_label: __param_target
  - target_label: __address__
    replacement: blackbox-exporter.default.svc.cluster.local:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_ingress_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_ingress_name]
    target_label: kubernetes_name
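Both jobs keep only targets carrying the prometheus.io/probe annotation (the keep rules above), so each Service or Ingress to be probed must be annotated, for example:
kubectl annotate service <service-name> prometheus.io/probe=true
kubectl annotate ingress <ingress-name> prometheus.io/probe=true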
2.1.7 Visualization with Grafana
Grafana is a general-purpose visualization tool. "General-purpose" means it is not limited to displaying Prometheus monitoring data; it suits many other data-visualization needs as well. Before using Grafana, it helps to understand a few of its basic concepts.
2.1.7.1 Basic concepts
Data Source
From Grafana's perspective, anything that supplies it with data, Prometheus included, is a data source. Grafana officially supports Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, and CloudWatch, among others. An administrator only has to register one of these systems as a data source, and Grafana can then visualize its data.
Dashboard
Once data sources define where the data comes from, the user's main task is to visualize it. In Grafana, dashboards are how visualization panels are organized and managed; official dashboard templates can also be imported as a starting point.
Organizations and users
Beyond flexible visualization, Grafana offers enterprise-oriented, organization-level management. Every dashboard belongs to an Organization, which lets Grafana scale: a company can create several organizations, a User can belong to one or more of them, and the same user can hold different permissions in different organizations, so the management model can mirror the company's structure.
2.1.7.2 Integrating Grafana
Deploy Grafana:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-core
  namespace: kube-system
  labels:
    app: grafana
    component: core
spec:
  selector:
    matchLabels:
      app: grafana
  replicas: 1
  template:
    metadata:
      labels:
        app: grafana
        component: core
    spec:
      containers:
      - image: grafana/grafana:6.5.3
        name: grafana-core
        imagePullPolicy: IfNotPresent
        env:
        # The following env variables set up basic auth with the default admin user and admin password.
        - name: GF_AUTH_BASIC_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "false"
        # - name: GF_AUTH_ANONYMOUS_ORG_ROLE
        #   value: Admin
        # does not really work, because of template variables in exported dashboards:
        # - name: GF_DASHBOARDS_JSON_ENABLED
        #   value: "true"
        readinessProbe:
          httpGet:
            path: /login
            port: 3000
          # initialDelaySeconds: 30
          # timeoutSeconds: 1
        volumeMounts:
        - name: grafana-persistent-storage
          mountPath: /var
      volumes:
      - name: grafana-persistent-storage
        hostPath:
          # the directory must already exist on the node (type: Directory does not create it)
          path: /data/devops/grafana
          type: Directory
Expose Grafana with a NodePort Service:
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-system
  labels:
    app: grafana
    component: core
spec:
  type: NodePort
  ports:
  - port: 3000
    nodePort: 30011
  selector:
    app: grafana
    component: core
Configure Grafana dashboards
Add Prometheus as a data source.
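For this setup the data source URL must cross namespaces: Grafana runs in kube-system while the prometheus Service created earlier lives in default. A minimal data source provisioning sketch (standard Grafana provisioning format; wiring the file into the container via a ConfigMap is omitted here):
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  url: http://prometheus.default.svc.cluster.local:9090
Alternatively, enter the same URL by hand under Configuration > Data Sources in the UI.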
2.2 kube-prometheus
2.2.1 Replace images with mirrors reachable from China
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' prometheusOperator-deployment.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' prometheus-prometheus.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' alertmanager-alertmanager.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' kubeStateMetrics-deployment.yaml
sed -i 's/k8s.gcr.io/lank8s.cn/g' kubeStateMetrics-deployment.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' nodeExporter-daemonset.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' prometheusAdapter-deployment.yaml
sed -i 's/k8s.gcr.io/lank8s.cn/g' prometheusAdapter-deployment.yaml
# Check whether any images still point at unreachable foreign registries
grep "image: " * -r
2.2.2 Change the access type
Change spec.type to NodePort in the prometheus, alertmanager, and grafana Services to simplify test access; a patch sketch follows.
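A sketch of doing this non-interactively with kubectl patch; the Service names below are the ones kube-prometheus ships in the monitoring namespace (they also appear as Ingress backends later in this section):
kubectl -n monitoring patch svc prometheus-k8s -p '{"spec": {"type": "NodePort"}}'
kubectl -n monitoring patch svc alertmanager-main -p '{"spec": {"type": "NodePort"}}'
kubectl -n monitoring patch svc grafana -p '{"spec": {"type": "NodePort"}}'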
2.2.3 Install
kubectl apply --server-side -f manifests/setup
kubectl wait \
--for condition=Established \
--all CustomResourceDefinition \
--namespace=monitoring
kubectl apply -f manifests/
2.2.4 Configure Ingress
# Access via domain names (without real DNS, add entries like these to your hosts file)
192.168.113.121 grafana.wolfcode.cn
192.168.113.121 prometheus.wolfcode.cn
192.168.113.121 alertmanager.wolfcode.cn
# Create prometheus-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: monitoring
  name: prometheus-ingress
spec:
  ingressClassName: nginx
  rules:
  - host: grafana.wolfcode.cn  # Grafana domain
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grafana
            port:
              number: 3000
  - host: prometheus.wolfcode.cn  # Prometheus domain
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090
  - host: alertmanager.wolfcode.cn  # Alertmanager domain
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: alertmanager-main
            port:
              number: 9093
Create the Ingress:
kubectl apply -f prometheus-ingress.yaml
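Without editing hosts files you can verify the routing by sending the Host header explicitly (IP and domains taken from the hosts example above):
curl -H "Host: grafana.wolfcode.cn" http://192.168.113.121/
curl -H "Host: prometheus.wolfcode.cn" http://192.168.113.121/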
2.2.5 Uninstall
kubectl delete --ignore-not-found=true -f manifests/ -f manifests/setup