I. HMaster Monitoring Metrics
Metric | Type (GAUGE, COUNTER) | Description | Business meaning | Notes |
---|---|---|---|---|
averageLoad | GAUGE | Average number of regions served by each region server | | |
numRegionServers | GAUGE | Number of live region servers | Region server count | |
numDeadRegionServers | GAUGE | Number of dead region servers | | |
clusterRequests | COUNTER | Total number of requests from all region servers to the cluster | | |
ritCount | GAUGE | Number of regions in transition (RIT) | RIT status | |
ritOldestAge | GAUGE | Age of the oldest region in transition, in milliseconds | | Unit: ms |
ritCountOverThreshold | GAUGE | Number of regions that have been in transition longer than a threshold time (default: 60 seconds) | | |
The metrics below are non-core | | | | |
HlogSplitTime_num_ops | COUNTER | Number of write-ahead log (WAL) split operations that were timed | | |
HlogSplitTime_mean | GAUGE | Average time to split write-ahead log files | | |
MetaHlogSplitSize_num_ops | COUNTER | | | |
MetaHlogSplitTime_mean | GAUGE | | | |
HlogSplitSize_num_ops | COUNTER | Number of write-ahead log split operations whose size was recorded | | |
HlogSplitSize_mean | GAUGE | Average size of the write-ahead log files that were split | | |
BulkAssign_num_ops | COUNTER | | | |
BulkAssign_mean | GAUGE | | | |
Assign_num_ops | COUNTER | | | |
Assign_mean | GAUGE | | | |
BalancerCluster_num_ops | COUNTER | | | |
BalancerCluster_mean | GAUGE | | | |
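
As a rough illustration of how these metrics might be collected, the sketch below pulls the core HMaster metrics out of a JMX-style JSON document of the form `{"beans": [...]}`, which is the shape served by the Hadoop/HBase `/jmx` servlet. The sample payload, its values, and the helper name `extract_metrics` are made up for illustration. Bean names can differ between HBase versions, so the helper matches on attribute names only.

```python
import json

# Core HMaster metrics listed in the table above.
CORE_METRICS = (
    "averageLoad",
    "numRegionServers",
    "numDeadRegionServers",
    "clusterRequests",
    "ritCount",
    "ritOldestAge",
    "ritCountOverThreshold",
)


def extract_metrics(jmx_payload, wanted=CORE_METRICS):
    """Collect the wanted attributes from every bean in a JMX JSON payload.

    Matching is done on attribute names only, so the result does not depend
    on which bean a given HBase version reports the metric under.
    """
    found = {}
    for bean in jmx_payload.get("beans", []):
        for name in wanted:
            if name in bean:
                found[name] = bean[name]
    return found


# Hypothetical sample payload; bean names and values are illustrative only.
SAMPLE_JMX = json.loads("""
{
  "beans": [
    {
      "name": "Hadoop:service=HBase,name=Master,sub=Server",
      "averageLoad": 120.5,
      "numRegionServers": 8,
      "numDeadRegionServers": 0,
      "clusterRequests": 4521337
    },
    {
      "name": "Hadoop:service=HBase,name=Master,sub=AssignmentManager",
      "ritCount": 2,
      "ritOldestAge": 15000,
      "ritCountOverThreshold": 0
    }
  ]
}
""")

metrics = extract_metrics(SAMPLE_JMX)
for key in CORE_METRICS:
    print(key, metrics.get(key))
```

In a live setup the payload would typically be fetched from the HMaster info port (for example `http://<master-host>:16010/jmx` on a default install), but the host and port are assumptions about the deployment and should be checked against the cluster configuration.
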
II. Alerting Policy
Metric | Alert rule | Alert level | Notes |
---|---|---|---|
averageLoad | all(#3) > 300 | P1 | Average number of regions per RegionServer |
numDeadRegionServers | all(#3) >= 1 | P1 | At least one RegionServer is dead |
clusterRequests | all(#10) >= 1000000; all(#10) <= 10000 | P1 | Cluster load is above 1,000,000 requests; cluster load is below 10,000 requests (may indicate a problem) |
ritCount | all(#3) >= 1 | P1 | Regions in transition exist |
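
The rules above can be sketched in code. This example assumes that `all(#N) op value` means "the last N collected samples all satisfy the comparison", which is how the notation is read in Open-Falcon style strategies. The source does not state the semantics, so treat that reading as an assumption. The names `Rule` and `fires` and the sample histories are hypothetical, and the samples are assumed to already be in the form the thresholds are compared against.

```python
import operator
from dataclasses import dataclass

# Comparison operators used in the alert table.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass(frozen=True)
class Rule:
    metric: str
    window: int       # N in all(#N)
    op: str           # comparison operator as written in the table
    threshold: float
    level: str = "P1"


def fires(rule, samples):
    """Return True if the last `rule.window` samples all satisfy the rule.

    Assumed semantics of all(#N): with fewer than N samples the rule
    does not fire.
    """
    if len(samples) < rule.window:
        return False
    compare = OPS[rule.op]
    recent = samples[-rule.window:]
    return all(compare(value, rule.threshold) for value in recent)


# Rules transcribed from the alert table.
RULES = [
    Rule("averageLoad", 3, ">", 300),
    Rule("numDeadRegionServers", 3, ">=", 1),
    Rule("clusterRequests", 10, ">=", 1_000_000),
    Rule("clusterRequests", 10, "<=", 10_000),
    Rule("ritCount", 3, ">=", 1),
]

# Hypothetical sample history per metric, oldest first.
HISTORY = {
    "averageLoad": [280, 310, 320, 305],
    "numDeadRegionServers": [0, 0, 1],
    "clusterRequests": [50_000] * 10,
    "ritCount": [1, 2, 1],
}

alerts = [r for r in RULES if fires(r, HISTORY.get(r.metric, []))]
for r in alerts:
    print(f"{r.level} {r.metric}: all(#{r.window}) {r.op} {r.threshold}")
```

With this sample history only the `averageLoad` and `ritCount` rules fire: a single dead-server sample is not enough for `all(#3)`, and the `clusterRequests` values sit between the two thresholds.
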