[夜莺监控系列3]夜莺生产环境部署-1

Rock-Wang

已于 2023-08-15 21:16:40 修改

阅读量629

点赞数 1

文章标签：运维开发

于 2023-08-01 21:40:40 首次发布

本文链接：https://blog.csdn.net/weixin_42924568/article/details/132051070

版权

#0 前言
下面n9e的理解图和架构图，后面的安装会以架构图的方式进行。

理解图：
架构图：

1 helm服务安装（主服务端）

这里说的主服务端，就是整个监控系统的最核心的那个n9e，所以相关的用户数据、配置数据等都在这个n9e上了。

1.1 chart包下载

A.根据官网(https://n9e.flashcat.cloud/docs/content/flashcat-monitor/nightingale/install/helm/)文档提示，
  到 https://github.com/flashcatcloud/n9e-helm 中拿包

B.下载官网chart包
# git clone https://github.com/flashcatcloud/n9e-helm.git
# cd n9e-helm

C.pull/tag/push image
docker pull docker.io/library/nginx:stable-alpine
docker pull docker.io/library/mysql:5.7
docker pull docker.io/library/redis:6.2
docker pull docker.io/prom/prometheus:latest
docker pull flashcatcloud/categraf:latest
docker pull flashcatcloud/nightingale:5.14.5

1.2 配置

webapi支持nodePort(按需开放)

# vim templates/nwebapi/service.yaml
*/}}
{{- if eq .Values.nwebapi.type "internal" -}}
apiVersion: v1
kind: Service
metadata:
  name: "{{ template "nightingale.nwebapi" . }}"
  labels:
{{ include "nightingale.labels" . | indent 4 }}
spec:
  type: NodePort  # type为NodePort,也可以在values.yaml里控制，我这里懒的配置了
  ports:
    - name: "nwebapi-port"
      port: {{ template "nightingale.nwebapi.servicePort" . }}
      targetPort: {{ template "nightingale.nwebapi.port" . }}
      nodePort: {{ .Values.expose.nodePort.ports.http.nodePort }}  # 并指定端口
  selector:
{{ include "nightingale.matchLabels" . | indent 4 }}
    component: nwebapi
{{- end -}}

values.yaml主要配置

# cat values.yaml
expose:
  type: ingress
  tls:
    enabled: false
    certSource: auto
    auto:
      commonName: ""
    secret:
      secretName: ""
  ingress:
    hosts:
      web: n9e.example.com
    controller: default
  clusterIP:
    name: nightingale
    ports:
      httpPort: 80
      httpsPort: 443
  nodePort:
    name: nightingale
    ports:
      http:
        port: 80
        nodePort: 30007
      https:
        port: 443
        nodePort: 30009
  loadBalancer:
    name: nightingale
    IP: ""
    ports:
      httpPort: 80
      httpsPort: 443

externalURL: http://hello.n9e.info

persistence:  # 这里是持久化存储的配置，有storageClass的话直接配上就行，没有的话可以用hostpath方式
  enabled: true
  resourcePolicy: "keep"
  persistentVolumeClaim:
    database:
      existingClaim: ""
      storageClass: "nfs-storage"
      subPath: ""
      accessMode: ReadWriteOnce
      size: 40Gi
    redis:
      existingClaim: ""
      storageClass: "nfs-storage"
      subPath: ""
      accessMode: ReadWriteOnce
      size: 10Gi
    prometheus:
      existingClaim: ""
      storageClass: "nfs-storage"
      subPath: ""
      accessMode: ReadWriteOnce
      size: 100Gi

imagePullSecrets:

caSecretName: ""

secretKey: "not-a-secure-key"

nginx:
  image:
    repository: docker.io/library/nginx
    tag: stable-alpine
  replicas: 1
  # resources:
  #  requests:
  #    memory: 256Mi
  #    cpu: 100m

database:
  type: internal  # 这个表达的是使用n9e chart启动的mysql还是外部的mysql
  internal:
    image:
      repository: docker.io/library/mysql
      tag: 5.7
    username: "root"
    password: "n9e_v5"
    shmSizeLimit: 512Mi
    resources: {}
  external:
    host: "192.168.0.1"  # 如果有外部的mysql，可以直接配置到这里，并将上面的type改为: external就行了
    port: "3306"
    name: "n9e_v5"
    username: "user"
    password: "password"
    sslmode: "disable"

redis:
  type: internal
  internal:
    image:
      repository: docker.io/library/redis
      tag: 6.2
  external:
    addr: "192.168.0.2:6379"
    sentinelMasterSet: "admin"
    password: ""
    mode: "standalone" # standalone/sentinel

prometheus:
  type: internal
  internal:
    username: ""
    password: ""
    image:
      repository: docker.io/prom/prometheus
      tag: latest
  external:
    host: "192.168.0.2"
    port: "9090"
    username: ""
    password: ""
  nodePort: 30090

categraf:
  type: internal
  internal:
    image:
      repository: flashcatcloud/categraf
      tag: latest
  external:
    host: "192.168.0.3"
    port: "8094"
    password: ""

nwebapi:
  type: internal
  internal:
    replicas: 1
    image:
      repository: flashcatcloud/nightingale
      tag: 5.14.5
  external:
    host: "172.20.0.11"
    port: "5040"

nserver:
  type: internal
  internal:
    replicas: 1
    image:
      repository: flashcatcloud/nightingale
      tag: 5.14.5
  external:
    host: "172.20.0.11"
    port: "19000"

cluster name配置

n9e前端服务中配置:
# vim templates/nwebapi/conf-cm.yaml
    [[Clusters]]
    Name = "Ops-k8s-cluster"  # 修改自己的cluster name
    Prom = "http://{{ template "nightingale.prometheus.host" . }}:{{ template "nightingale.prometheus.servicePort" .}}"
    BasicAuthUser = ""
    BasicAuthPass = ""
    Timeout = 30000      # 如果大盘图有timeout报错，可以在这里配置超时时间来解决。例如改为 90000
    DialTimeout = 3000   # 如果大盘图有timeout报错，可以在这里配置超时时间来解决。例如改为 90000
    MaxIdleConnsPerHost = 100
    Headers = ["X-From", "n9e"]
    [[Clusters]]
    Name = "Ops-repository"    # 衔接其他的n9e-server 配置对应的cluster name
    Prom = "http://172.20.0.11:9090"
    BasicAuthUser = ""
    BasicAuthPass = ""
    Timeout = 30000      # 如果大盘图有timeout报错，可以在这里配置超时时间来解决。例如改为 90000
    DialTimeout = 3000   # 如果大盘图有timeout报错，可以在这里配置超时时间来解决。例如改为 90000
    MaxIdleConnsPerHost = 100

n9e后端
服务中配置:
# vim templates/nserver/conf-cm.yaml
data:
  server.conf: |-
    RunMode = "release"
    ClusterName = "Ops-k8s-cluster"  # Default 改为自己需要的名称
... ...
    [AnonymousAccess]
    PromQuerier = true  # 打开匿名访问
    AlertDetail = false

告警模板配置

# vim tpl/feishu.tpl
级别状态: S{{.Severity}} {{if .IsRecovered}}Recovered{{else}}Triggered{{end}}
规则名称: {{.RuleName}}{{if .RuleNote}}
规则备注: {{.RuleNote}}{{end}}
告警集群: {{ .Cluster }}
告警组: {{ .GroupName }}
监控指标: {{range $index, $value := .TagsJSON}}
  {{$value}}{{end}}
{{if .IsRecovered}}恢复时间: {{timeformat .LastEvalTime}}{{else}}触发时间: {{timeformat .TriggerTime}}
触发值: {{.TriggerValue}}{{end}}
发送时间: {{timestamp}}


告警样例展示：
级别状态: S3 Triggered
规则名称: 测试告警
告警集群: ops-k8s-cluster
告警组: devops
监控指标: 
  alert_type=test
  daemonset=nvidia-gpu-exporter
  instance=kube-state-metrics:8080
  job=kube-state-metrics
  namespace=monitoring
  rulename=测试告警
恢复时间：2023-03-16 11:33:00
触发值: 1
发送时间: 2023-03-16 11:33:00

1.3 安装及登录

1.3.1 安装

# helm upgrade -i nightingale --namespace=monitoring -f override.yaml --create-namespace .
# kubectl get pods -n monitoring
NAME                                                READY   STATUS             RESTARTS      AGE
nightingale-categraf-57st8                          1/1     Running            0             19h
nightingale-categraf-hncl6                          1/1     Running            0             19h
nightingale-categraf-pxrl7                          1/1     Running            0             19h
nightingale-database-0                              1/1     Running            0             19h
nightingale-nserver-96557c66f-9wtf6                 1/1     Running            2 (19h ago)   19h
nightingale-nwebapi-6d6bc8f964-m9n8v                1/1     Running            0             166m
nightingale-prometheus-0                            1/1     Running            0             19h
nightingale-redis-0                                 1/1     Running            0             19h

# kubectl get svc -n monitoring
NAME                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
nightingale-database     ClusterIP   10.233.16.151   <none>        3306/TCP       19h
nightingale-nserver      ClusterIP   10.233.42.48    <none>        80/TCP         19h
nightingale-nwebapi      NodePort    10.233.17.185   <none>        80:30007/TCP   170m
nightingale-prometheus   ClusterIP   10.233.12.40    <none>        9090/TCP       19h
nightingale-redis        ClusterIP   10.233.45.201   <none>        6379/TCP       19h

1.3.2 安全组放行

放行 30007 tcp端口，用于n9e-web的nodeport

1.3.3 登录访问

http://ip:30007/dashboards

默认密码: root/root.2020

2 n9e web页面配置

2.1 用户配置

在飞书APP上拉一个群，添加机器人，然后生成webhook的url。
添加用户，并将飞书的webhook的url添加到个人信息的联系方式中，用于后续告警使用。

2.2 监控大盘配置

官网的大盘图地址：

categraf的inputs中找对应的json文件：https://github.com/flashcatcloud/categraf/tree/main/inputs/kubernetes
categraf的k8s中找对应的json文件：https://github.com/flashcatcloud/categraf/blob/main/k8s/pod-dash.json

2.2.1 当前k8s集群大盘图

直接导入此模板即可
在这里插入图片描述

2.2.2 当前集群pod大盘

通过下面的json模板导入即可

https://github.com/flashcatcloud/categraf/blob/main/k8s/pod-dash.json

如需我优化后的大盘图，可以留言找我要哦~

以上主n9e的服务就部署好了，当然还有一些exportor还需要安装一下，后面会一一讲到的。

下面就开始进行其他集群资源监控了。

3 其他集群资源监控

3.1 k8s集群安装n9e

对于k8s集群，就直接安装上面的方式安装好n9e就可以了。

3.2 二进制安装n9e

3.2.1 下载n9e包

# mkdir /usr/local/n9e && cd /usr/local/n9e
# wget https://github.com/ccfos/nightingale/releases/download/v5.14.5/n9e-v5.14.5-linux-amd64.tar.gz
# tar xf n9e-v5.14.5-linux-amd64.tar.gz
# ln -s /usr/local/n9e/n9e /usr/local/bin/

3.2.2 配置

# vim etc/server.conf
ClusterName = "Ops-repository"
... ...
[DB]
DSN="root:zelos@n9e@tcp(ip:3306)/n9e_v5?charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true"  # 数据库配置
... ...
[Reader]
# prometheus base url
Url = "http://127.0.0.1:9095"
... ...
[[Writers]]
Url = "http://127.0.0.1:9095/api/v1/write"


# vim etc/webapi.conf
[[Clusters]]
# Prometheus cluster name
Name = "Ops-repository"
# Prometheus APIs base url
Prom = "http://127.0.0.1:9095"  # 改为本机Prometheus端口

3.2.3 redis安装

# docker run -it -d --name=n9e-redis -p 6379:6379 redis:6.2

3.2.4 prometheus安装

# wget https://s3-gz01.didistatic.com/n9e-pub/prome/prometheus-2.28.0.linux-amd64.tar.gz -O prometheus-2.28.0.linux-amd64.tar.gz
# tar xf prometheus-2.28.0.linux-amd64.tar.gz
# mv prometheus-2.28.0.linux-amd64 /usr/local/prometheus/

# cat <<EOF >/etc/systemd/system/prometheus.service
[Unit]
Description="prometheus"
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple

ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/usr/local/prometheus/data --web.enable-lifecycle --enable-feature=remote-write-receiver --query.lookback-delta=2m 

Restart=always
SuccessExitStatus=0
LimitNOFILE=65536
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=prometheus

[Install]
WantedBy=multi-user.target
EOF


# systemctl daemon-reload
# systemctl enable prometheus
# systemctl start prometheus
# systemctl status prometheus

测试Prometheus服务状态：
http://IP:9090/targets

3.2.5 启动n9e

普通方式：

端安装:
# nohup ./n9e server & > server.log &

前端不用安装，这里只是告诉大家怎么运行的:
# nohup ./n9e webapi & > webapi.log &

守护进程方式

# vim /lib/systemd/system/n9e-server.service
[Unit]
Description=n9e Service
After=syslog.target network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/n9e server
RestartSec=3
Restart=always

[Install]
WantedBy=multi-user.target


# systemctl daemon-reload
# systemctl start n9e-server
# systemctl status n9e-server
# systemctl enable n9e-server

3.3 categraf安装

3.3.1 下载

二进制下载页面: https://github.com/flashcatcloud/categraf/releases

# cd /opt/monitoring/3.categraf
# wget https://github.com/flashcatcloud/categraf/releases/download/v0.2.35/categraf-v0.2.35-linux-amd64.tar.gz
# tar xf categraf-v0.2.35-linux-amd64.tar.gz
# cd categraf-v0.2.35-linux-amd64
# ll
total 117064
-rwxr-xr-x  1 1001 docker 119869440 Mar 16 07:18 categraf
drwxr-xr-x 59 root root        4096 Mar 22 02:27 conf

# mv categraf-v0.2.35-linux-amd64 /usr/local/categraf
# ln -s categraf-v0.2.35-linux-amd64 .

3.3.2 配置

# vim conf/config.toml
[[writers]]
url = "http://127.0.0.1:19000/prometheus/v1/write"

3.3.3 启动

配置service
# cat <<EOF >/etc/systemd/system/categraf.service
[Unit]
Description="categraf"
Documentation=https://github.com/flashcatcloud/categraf
After=network.target

[Service]
Type=simple

ExecStart=/usr/local/categraf/categraf

Restart=always
SuccessExitStatus=0
LimitNOFILE=65536
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=categraf

[Install]
WantedBy=multi-user.target
EOF

启动服务
# systemctl daemon-reload
# systemctl enable categraf
# systemctl start categraf
# systemctl status categraf


注意：如果是多节点的环境，还需要将categraf的文件和配置分发到其他环境去，并启动起来哦~

3.4 主n9e添加其他集群的配置进来

最后还需要在主n9e集群中配置一下新增的集群，用于被监控，具体如下：

在主n9e的配置中，新增其他集群的prometheus进来，这里示例新的集群名称为：Ops-repository
# vim templates/nwebapi/conf-cm.yaml
    [[Clusters]]
    Name = "Ops-k8s-cluster"
    Prom = "http://{{ template "nightingale.prometheus.host" . }}:{{ template "nightingale.prometheus.servicePort" .}}"
    BasicAuthUser = ""
    BasicAuthPass = ""
    Timeout = 30000
    DialTimeout = 3000
    MaxIdleConnsPerHost = 100
    Headers = ["X-From", "n9e"]
    [[Clusters]]            # ================> 增加此段配置
    Name = "Ops-repository"
    Prom = "http://172.20.0.11:9090"
    BasicAuthUser = ""
    BasicAuthPass = ""
    Timeout = 30000      # 如果大盘图有timeout报错，可以在这里配置超时时间来解决。例如改为 90000
    DialTimeout = 3000   # 如果大盘图有timeout报错，可以在这里配置超时时间来解决。例如改为 90000
    MaxIdleConnsPerHost = 100


在[Reader]下面新增[[Readers]]，将当前主集群和其他接入的集群的prometheus挂载进来，用于告警引擎的展示，注意有个s结尾哦~
# templates/nserver/conf-cm.yaml
    [[Readers]]
    ClusterName = "Ops-k8s-cluster"
    #prometheus base url
    Url = "http://nightingale-prometheus:9090"
    #Basic auth username
    BasicAuthUser = ""
    #Basic auth password
    BasicAuthPass = ""
    #timeout settings, unit: ms
    Timeout = 30000      # 如果大盘图有timeout报错，可以在这里配置超时时间来解决。例如改为 90000
    DialTimeout = 3000   # 如果大盘图有timeout报错，可以在这里配置超时时间来解决。例如改为 90000
    MaxIdleConnsPerHost = 100

    [[Readers]]            # ================> 增加此段配置
    ClusterName = "Ops-repository"
    #prometheus base url
    Url = "http://172.20.0.11:9090"
    #Basic auth username
    BasicAuthUser = ""
    #Basic auth password
    BasicAuthPass = ""
    #timeout settings, unit: ms
    Timeout = 30000      # 如果大盘图有timeout报错，可以在这里配置超时时间来解决。例如改为 90000
    DialTimeout = 3000   # 如果大盘图有timeout报错，可以在这里配置超时时间来解决。例如改为 90000
    MaxIdleConnsPerHost = 100


然后主n9e服务重新部署下（更新configmap），然后重启前端和后端的pod，否则不生效。

好了，这一章节的内容还是挺多的，关于告警、出图、各exporter的配置与使用，请看后续章节，感谢大家的关注，么么哒~