[Ops Knowledge, Expert Series] Monitoring an ES Cluster via Its API with Zabbix in Practice (Ports + Nodes + Health Status)

Author: koten

Article link: https://blog.csdn.net/qq_37510195/article/details/130904623


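This article shows how to monitor an Elasticsearch (ES) cluster through its API with Zabbix. Monitoring with Zabbix is a long-term effort and hard to make complete: it is more than watching a few numbers, and the deeper topics include troubleshooting, alert suppression, and tracing. Here we use the ES cluster API to monitor the ES ports, nodes, and health status, so that problems raise an alert promptly.

Environment Preparation

We prepared one Zabbix server and three ES hosts. The approach is to create a monitoring template first, monitor one ES host, and add the other two once testing succeeds. Adding them is optional, because cluster information can be collected from a single host; monitoring several hosts guards against a single point of failure.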

Hostname	IP
Zabbix	10.0.0.71
ELK101	10.0.0.101
ELK102	10.0.0.102
ELK103	10.0.0.103

Zabbix Agent (Client)

ELK101 is used as the example; the other two hosts are set up the same way.

I. Extracting the health status

[root@ELK101 ~]# yum -y install jq    # install the JSON parsing tool, which makes extracting values easier later
[root@ELK101 ~]# curl 10.0.0.101:19200/_cluster/health    # view the health status directly
{"cluster_name":"koten-es7","status":"green","timed_out":false,"number_of_nodes":3,"number_of_data_nodes":3,"active_primary_shards":16,"active_shards":26,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
[root@ELK101 ~]# curl 10.0.0.101:19200/_cluster/health|jq    # pretty-print with jq; the lines at the top are curl's progress meter on stderr (not errors) and are not needed
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- 100   387  100   387    0     0  14468      0 --:--:-- --:--:-- --:--:-- 14884
{
  "cluster_name": "koten-es7",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 16,
  "active_shards": 26,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}
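[root@ELK101 ~]# curl 10.0.0.101:19200/_cluster/health 2> /dev/null|jq    # send curl's progress output (stderr) to /dev/null
{
  "cluster_name": "koten-es7",              # cluster name
  "status": "green",                        # cluster health status (green, yellow, or red)
  "timed_out": false,                       # false means the response returned within the period set by the timeout parameter (30s by default)
  "number_of_nodes": 3,                     # number of nodes in the cluster
  "number_of_data_nodes": 3,                # number of nodes that are dedicated data nodes
  "active_primary_shards": 16,              # number of active primary shards
  "active_shards": 26,                      # total number of active primary and replica shards
  "relocating_shards": 0,                   # number of shards being relocated
  "initializing_shards": 0,                 # number of shards being initialized
  "unassigned_shards": 0,                   # number of unassigned shards
  "delayed_unassigned_shards": 0,           # number of shards whose allocation is delayed by the timeout settings
  "number_of_pending_tasks": 0,             # number of cluster-level changes not yet executed
  "number_of_in_flight_fetch": 0,           # number of unfinished fetches
  "task_max_waiting_in_queue_millis": 0,    # time in milliseconds that the earliest queued task has been waiting to run
  "active_shards_percent_as_number": 100    # ratio of active shards in the cluster, as a percentage
}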
[root@ELK101 ~]# curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq .number_of_nodes    # any field of the health status can be extracted in this form
3
[root@ELK102 ~]# curl 10.0.0.101:19200/_cluster/health 2> /dev/null|jq .cluster_name|sed 's#\"##g'    # unwanted quotes can be stripped with sed
koten-es7
[root@ELK101 ~]# curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .cluster_name    # or use jq's -r option to print the raw string without quotes
koten-es7

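II. Checking the ports

Whether a port is open can be judged from the number of matching lines in the netstat output; alternatively, use the template built into Zabbix.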

[root@ELK101 ~]# netstat -tnulp|grep 19200
tcp6       0      0 :::19200                :::*                    LISTEN      8608/java           
[root@ELK101 ~]# netstat -tnulp|grep 19200|wc -l    
1

[root@ELK101 ~]# netstat -tnulp|grep 19300
tcp6       0      0 :::19300                :::*                    LISTEN      8608/java           
[root@ELK101 ~]# netstat -tnulp|grep 19300|wc -l
1
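
The line count above can be wrapped into a small helper that prints 1 or 0, which is a convenient shape for a Zabbix item. This is only a minimal sketch: the function name `port_listening` is hypothetical, and it reads `netstat -tnl`-style text from stdin so it can be tried against the sample line from this section. It anchors the port at the end of the local-address column, so 19200 cannot accidentally match a longer port number or a PID.

```shell
# Hypothetical helper: print 1 if a LISTEN socket exists on the given port, else 0.
# Reads `netstat -tnl`-style output from stdin.
port_listening() {
    awk -v port="$1" '
        # Column 4 is the local address (e.g. :::19200 or 0.0.0.0:19200);
        # column 6 is the socket state on TCP lines.
        $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { print found + 0 }
    '
}

# Sample line copied from the netstat output above.
sample='tcp6       0      0 :::19200                :::*                    LISTEN      8608/java'

printf '%s\n' "$sample" | port_listening 19200
printf '%s\n' "$sample" | port_listening 19300
```

On a live host the same function would be fed from the real command, for example `netstat -tnl | port_listening 19200`.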

III. Getting node information

[root@ELK101 ~]# curl 10.0.0.101:19200/_cat/nodes
10.0.0.101 66 91 8 0.16 0.08 0.12 cdfhilmrstw - ELK101
10.0.0.102 75 89 1 0.01 0.03 0.05 cdfhilmrstw * ELK102
10.0.0.103 51 91 0 0.01 0.03 0.05 cdfhilmrstw - ELK103
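
The raw `_cat/nodes` text is hard to alert on directly, so it can help to reduce it to single values. The sketch below assumes the default column layout shown above, where the ninth column marks the elected master with `*` and the tenth column is the node name; the function names `count_nodes` and `master_node` are hypothetical. Both read the output from stdin, so they can be tried against the sample rows from this section.

```shell
# Hypothetical helpers; both read default-format `_cat/nodes` output from stdin.

# Count the nodes: the API prints one line per node.
count_nodes() {
    awk 'NF > 0 { n++ } END { print n + 0 }'
}

# Print the name of the elected master: the row whose 9th column is "*".
master_node() {
    awk '$9 == "*" { print $10 }'
}

# Sample rows copied from the output above.
nodes='10.0.0.101 66 91 8 0.16 0.08 0.12 cdfhilmrstw - ELK101
10.0.0.102 75 89 1 0.01 0.03 0.05 cdfhilmrstw * ELK102
10.0.0.103 51 91 0 0.01 0.03 0.05 cdfhilmrstw - ELK103'

printf '%s\n' "$nodes" | count_nodes
printf '%s\n' "$nodes" | master_node
```

Because the helpers depend on column positions, they would need adjusting if the request selects a different set of columns.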

IV. Installing and configuring the Zabbix agent

1. Configure the yum repository

[root@ELK101 ~]# rpm -Uvh https://repo.zabbix.com/zabbix/5.0/rhel/7/x86_64/zabbix-release-5.0-1.el7.noarch.rpm
Retrieving https://repo.zabbix.com/zabbix/5.0/rhel/7/x86_64/zabbix-release-5.0-1.el7.noarch.rpm
warning: /var/tmp/rpm-tmp.LL8cND: Header V4 RSA/SHA512 Signature, key ID a14fe591: NOKEY
Preparing...                          ################################# [100%]
Updating / installing...
   1:zabbix-release-5.0-1.el7         ################################# [100%]
[root@ELK101 ~]# yum clean all
Loaded plugins: fastestmirror
Cleaning repos: base epel extras updates zabbix
              : zabbix-non-supported
Cleaning up list of fastest mirrors

2. Install the package

[root@ELK101 ~]# yum -y install zabbix-agent

3. Set the Zabbix server IP on the agent

[root@ELK101 ~]# cat /etc/zabbix/zabbix_agentd.conf
Server=10.0.0.71

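V. Configuring the Zabbix monitoring file

Add everything extracted earlier to the agent's item configuration file as custom items: the node list and every field of the health status. For the ports, the items built into Zabbix are enough.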

[root@ELK101 ~]# cat /etc/zabbix/zabbix_agentd.d/ELK.conf
UserParameter=cat_nodes,curl 10.0.0.101:19200/_cat/nodes 2>/dev/null
UserParameter=cluster_name,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .cluster_name
UserParameter=status,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .status
UserParameter=timed_out,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .timed_out
UserParameter=number_of_nodes,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .number_of_nodes
UserParameter=number_of_data_nodes,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .number_of_data_nodes
UserParameter=active_primary_shards,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .active_primary_shards
UserParameter=active_shards,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .active_shards
UserParameter=relocating_shards,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .relocating_shards
UserParameter=initializing_shards,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .initializing_shards
UserParameter=unassigned_shards,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .unassigned_shards
UserParameter=delayed_unassigned_shards,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .delayed_unassigned_shards
UserParameter=number_of_pending_tasks,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .number_of_pending_tasks
UserParameter=number_of_in_flight_fetch,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .number_of_in_flight_fetch
UserParameter=task_max_waiting_in_queue_millis,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .task_max_waiting_in_queue_millis
UserParameter=active_shards_percent_as_number,curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .active_shards_percent_as_number
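
The health entries above differ only in the field name, so they could also be written as one flexible user parameter. The Zabbix agent supports `key[*]` definitions in which `$1` is replaced by the first parameter of the item key. The following is a sketch under assumptions, not the configuration used in this article: the key name `es.health` is hypothetical, `curl -s` stands in for the `2>/dev/null` redirect, and `--max-time 2` is an assumed bound chosen to stay under the agent's default 3-second Timeout.

```
# /etc/zabbix/zabbix_agentd.d/ELK.conf (alternative sketch; the key name es.health is hypothetical)
# $1 is substituted by the agent with the first parameter of the item key.
UserParameter=es.health[*],curl -s --max-time 2 10.0.0.101:19200/_cluster/health | jq -r ".$1"
```

With this form the item keys would become `es.health[status]`, `es.health[number_of_nodes]`, and so on, and a single line covers every field of the health response.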

Check the item

[root@ELK101 ~]# zabbix_agentd -p|grep number_of_nodes
number_of_nodes                                 [t|3]

Finally, restart the Zabbix agent

[root@ELK101 ~]# systemctl restart zabbix-agent

Zabbix Server

Test from the server whether the ELK data can be fetched

[root@Zabbix ~]# zabbix_get -s 10.0.0.101 -p 10050 -k number_of_nodes
3

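I. Create a template

II. Add a host

Add ELK101 first

III. Add items

When adding items, pay attention to the type of information and the unit: use % as the unit for ratios expressed as percentages, set string values to the Text type, and note that boolean values can also be stored as Text.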
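
Storing `status` as Text works, but a numeric value is often easier to graph and to compare in triggers. One option, which is not part of the setup in this article, is to map the three status values to numbers before handing them to Zabbix. The function name `status_to_code` and the number assignment are assumptions made for illustration.

```shell
# Hypothetical mapping of the cluster status to a number for a numeric Zabbix item.
# The codes (0/1/2, and 3 for anything unexpected) are an arbitrary choice.
status_to_code() {
    case "$1" in
        green)  echo 0 ;;
        yellow) echo 1 ;;
        red)    echo 2 ;;
        *)      echo 3 ;;   # empty or unknown value, e.g. ES did not answer
    esac
}

status_to_code green
status_to_code red
status_to_code ""
```

On the agent, the argument would come from the health call used earlier, for example `status_to_code "$(curl 10.0.0.101:19200/_cluster/health 2>/dev/null | jq -r .status)"`.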

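Then add the remaining custom items one by one

Final state of the items

IV. View the latest data

Everything works; all values are visible

V. Add triggers

Add triggers for the other custom items one by one, considering under which conditions each should raise an alert. Because this is a test, I set the severity to Information; in production, choose the severity according to the actual situation. This gives the final trigger list.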
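
As a rough illustration of what such triggers can look like, here are a few expressions in the Zabbix 5.0 trigger syntax. They are a sketch, not the triggers used in this article: the host name `ELK101` inside the braces is an assumption (with a template, the template name goes there instead), the thresholds are examples, and the `#` lines are explanations that do not belong in the expression field.

```
# Zabbix 5.0 form: {<host or template>:<item key>.<function>}<operator><constant>

# Fewer than the expected 3 nodes in the cluster
{ELK101:number_of_nodes.last()}<3

# The status item (stored as Text) contains "red"
{ELK101:status.str(red)}=1

# At least one unassigned shard
{ELK101:unassigned_shards.last()}>0

# ES HTTP port not listening (built-in agent item net.tcp.listen returns 0 or 1)
{ELK101:net.tcp.listen[19200].last()}=0
```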

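VI. Add graphs

VII. Enable the trigger action

VIII. Add alerting

Once configured, you can click Test manually to check email delivery

IX. Test alerting and email delivery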

[root@ELK101 ~]# systemctl stop es7

The alert fires and is delivered to the mailbox

[root@ELK101 ~]# systemctl start es7

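X. Test WeCom (Enterprise WeChat) alerts

Prepare the script and the Python environment on the server; for details, see this post on my blog:

[Ops Knowledge, Advanced Series] Zabbix 5.0 Stable Release Explained, Part 2 (Custom Monitoring + Alerting + Graphs + Templates)

Add the alert media type

Add an action

Test the alert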

[root@ELK101 ~]# systemctl stop es7

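WeCom receives the alert normally; no problems

I am koten, with 10 years of operations experience. I keep sharing practical ops content. Thanks for reading and following!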
