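Red marks key metrics that indicate an abnormal cluster state.
Blue marks performance metrics that need close attention.
All alert thresholds are defined as macros and can be customized per cluster; the values in the tables are the defaults.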
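A minimal sketch of how macro-based thresholds with per-cluster overrides can be modeled. The macro names below are hypothetical and only follow the Zabbix `{$NAME}` user-macro syntax; the default values are taken from the `heap_used_percent` and `search_latency` rows of the node template.

```python
# Hypothetical macro names; only the {$NAME} syntax comes from Zabbix.
DEFAULT_MACROS = {
    "{$ES_HEAP_USED_PERCENT_MAX}": 80,   # default for heap_used_percent
    "{$ES_SEARCH_LATENCY_MS_MAX}": 10,   # default for search_latency (business cluster)
}


def resolve_macros(defaults: dict, overrides: dict) -> dict:
    """Return the defaults with cluster-specific overrides applied.

    Overriding a macro that has no default is treated as a mistake,
    because it usually means the macro name was misspelled.
    """
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown macros: {sorted(unknown)}")
    return {**defaults, **overrides}


# A cluster with a large heap might tolerate a higher usage threshold.
resolved = resolve_macros(DEFAULT_MACROS, {"{$ES_HEAP_USED_PERCENT_MAX}": 90})
```

Rejecting unknown names is a design choice of this sketch, not something the templates are stated to do.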
ES process monitoring template
Metric | Meaning | Interval | Warning | High | Disaster | Remarks |
proc.num[,,,bootstrap.Elasticsearch] | Checks whether the ES process is alive | 30s | <1, and previous value >0 |
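The process-liveness condition above (count below 1 while the previous value was above 0) can be sketched as a small predicate. The function name and parameters are illustrative assumptions, not part of the template.

```python
def process_went_down(previous_count: int, current_count: int) -> bool:
    """Return True when the ES process count dropped to zero after being positive.

    Requiring a positive previous value means the alert fires on the
    transition, not on every check while the process stays down.
    """
    return current_count < 1 and previous_count > 0


# Example: the process was running at the last check and is now gone.
alert = process_went_down(previous_count=1, current_count=0)
```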
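ES node monitoring template
Metric | Meaning | Interval | Warning | High | Disaster | Remarks |
Cluster summary metrics | ||||||
cluster_status | Cluster status (0 = green, 1 = yellow, 2 = red) | 1m | yellow (value = 1) | red (value = 2) | ||
cluster_nodes_count | Total number of nodes in the cluster | 1m | A node has left the cluster (current value < previous value) | |||
cluster_indices_count | Number of open indices in the cluster | 1m | ||||
cluster_indices_indexing_index_total | Total cluster write TPS | 1m | Business cluster: current value is 20% above or below the average of 5 minutes / 1 day ago; total writes <20 | Converted to a rate in Zabbix; the same applies to every total value below | ||
cluster_indices_search_query_total | Total cluster query QPS | 1m | Business cluster: current value is 20% above or below the average of 5 minutes / 1 day ago; total queries <20 | |||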
Per-node metrics | ||||||
es_roles | ES node roles | 1m | ||||
heap_committed_in_bytes | Committed JVM heap size, in bytes | 1m | ||||
heap_used_percent | JVM heap usage percentage | 1m | >80% | |||
http_current_open | Number of currently open HTTP connections | 1m | ||||
http_total_opened | Total number of HTTP connections opened | 1m | ||||
indices_indexing_flush_total | Number of flushes | 1m | ||||
indices_indexing_flush_total_time_in_millis | Total flush time | 1m | ||||
indices_indexing_index_current | Number of indexing operations currently in progress | 1m | ||||
indices_indexing_index_time_in_millis | Total indexing time | 1m | ||||
indices_indexing_index_total | Indexing operations (TPS) | 1m | Business cluster >5000; log cluster >20000 | |||
indexing_latency | Indexing latency | 1m | Business cluster >10ms | Total indexing time / indexing operations | ||
indices_indexing_refresh_total | Total number of refreshes run after writes to the index | 1m | ||||
indices_indexing_refresh_total_time_in_millis | Total time of refreshes run after writes to the index | 1m | ||||
indices_search_fetch_current | Number of search fetch-phase operations currently in progress | 1m | ||||
indices_search_fetch_time_in_millis | Total time spent in the search fetch phase | 1m | ||||
indices_search_fetch_total | Total number of search fetch-phase operations | 1m | ||||
indices_search_query_current | Number of search query-phase operations currently in progress | 1m | ||||
indices_search_query_time_in_millis | Total query time | 1m | ||||
indices_search_query_total | Queries (QPS) | 1m | Log cluster: none; business cluster >700 | |||
search_latency | Query latency | 1m | Business cluster >10ms | Total query time / queries | ||
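old_collection_count | Number of old-generation GCs | 1m | Log cluster >100 | Business cluster >0 | ||
old_collection_time_in_millis | Old-generation GC time | 1m | ||||
thread_pool_bulk_queue | Length of the bulk write request queue | 1m | Log cluster >100 | Business cluster >10 | Available in ES5 | |
thread_pool_bulk_rejected | Number of rejected bulk write requests | 1m | Log cluster >0 | Business cluster >0 | Available in ES5 | |
thread_pool_write_queue | Length of the write request queue | 1m | Log cluster >100 | Business cluster >10 | Available in ES6 and later | |
thread_pool_write_rejected | Number of rejected write requests | 1m | Log cluster >0 | Business cluster >0 | Available in ES6 and later | |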
thread_pool_get_completed | Number of completed get requests | 1m | ||||
thread_pool_index_queue | Length of the index write request queue | 1m | ||||
thread_pool_index_rejected | Number of rejected index write requests | 1m | ||||
thread_pool_search_completed | Number of completed search requests | 1m | ||||
thread_pool_search_queue | Length of the search request queue | 1m | Log cluster >100 | Business cluster >0 | ||
thread_pool_search_rejected | Number of rejected search requests | 1m | Log cluster >0 | Business cluster >0 | ||
young_collection_count | Number of young-generation GCs | 1m | ||||
young_collection_time_in_millis | Young-generation GC time | 2m |
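The node template relies on three derived quantities: cumulative totals converted to rates, latency computed as total time divided by operation count, and a 20% deviation check against an earlier average. The sketch below shows one way to compute them from two consecutive samples. The `Sample` type, the function names, and the handling of counter resets and zero baselines are assumptions made for illustration; the actual conversion happens inside Zabbix.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float      # collection time, in seconds
    total: int            # cumulative count, e.g. indices_indexing_index_total
    time_in_millis: int   # cumulative time, e.g. indices_indexing_index_time_in_millis


def rate_per_second(prev: Sample, curr: Sample) -> float:
    """Convert a cumulative counter into operations per second."""
    elapsed = curr.timestamp - prev.timestamp
    delta = curr.total - prev.total
    if elapsed <= 0 or delta < 0:  # assumption: treat a counter reset as no data
        return 0.0
    return delta / elapsed


def latency_ms(prev: Sample, curr: Sample) -> float:
    """Average time per operation over the interval: time spent / operations."""
    ops = curr.total - prev.total
    spent = curr.time_in_millis - prev.time_in_millis
    if ops <= 0 or spent < 0:
        return 0.0
    return spent / ops


def deviates(current: float, baseline: float, fraction: float = 0.2) -> bool:
    """Return True when current differs from baseline by at least `fraction`."""
    if baseline == 0:  # assumption: any traffic over a zero baseline counts
        return current != 0
    return abs(current - baseline) / baseline >= fraction


prev = Sample(timestamp=0.0, total=1000, time_in_millis=5000)
curr = Sample(timestamp=60.0, total=7000, time_in_millis=35000)
tps = rate_per_second(prev, curr)    # 6000 operations over 60 s
latency = latency_ms(prev, curr)     # 30000 ms over 6000 operations
```

Computing latency from deltas gives the average over the last interval; dividing the raw cumulative values would give a lifetime average that reacts slowly to recent changes.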
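ES index monitoring template
Metric | Meaning | Interval | Warning | High | Disaster | Remarks |
Cluster summary metrics | ||||||
cluster_no_hidden_indices_count | Total number of indices, excluding those whose names start with . | 1m | ||||
cluster_primaries_xxx | Every per-index metric has a corresponding cluster summary metric | 1m | ||||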
Per-index metrics | ||||||
index_type | Index type (index or alias) | 1m | ||||
primaries_docs_count | Number of documents in the index | 1m | ||||
primaries_size_in_bytes | Index size, in bytes | 1m | ||||
primaries_segments_count | Number of segments | 1m | ||||
primaries_segments_memory_in_bytes | Memory used by segments, in bytes | 1m | ||||
primaries_indexing_index_total | Indexing rate | 1m | ||||
primaries_indexing_index_time_in_millis | Total indexing time | 1m | ||||
indexing_latency | Indexing latency | 1m | Total indexing time / indexing rate | |||
primaries_search_query_total | Query rate | 1m | ||||
primaries_search_query_time_in_millis | Total query time | 1m | ||||
search_latency | Query latency | 1m | Total query time / query rate | |||
primaries_search_fetch_total | Fetch rate | 1m | ||||
primaries_search_fetch_time_in_millis | Total fetch time | 1m | ||||
primaries_search_scroll_total | Scroll rate | 1m | ||||
primaries_search_scroll_time_in_millis | Total scroll time | 1m | ||||
primaries_indexing_delete_total | Delete rate | 1m | ||||
primaries_indexing_delete_time_in_millis | Total delete time | 1m | ||||
primaries_merges_total | Merge rate | 1m | ||||
primaries_merges_total_time_in_millis | Total merge time | 1m | ||||
primaries_refresh_total | Refresh rate | 1m | ||||
primaries_refresh_total_time_in_millis | Total refresh time | 1m |
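As an illustration of `cluster_no_hidden_indices_count`, the sketch below counts the indices whose names do not start with a dot. The function name and the sample index names are hypothetical; how the template actually collects the list of indices is not described here.

```python
from typing import Iterable


def count_visible_indices(index_names: Iterable[str]) -> int:
    """Count indices whose names do not start with '.'."""
    return sum(1 for name in index_names if not name.startswith("."))


# Hypothetical index names: two system indices and two user indices.
names = [".kibana", "orders", ".security-7", "logs-2024"]
visible = count_visible_indices(names)
```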