I. RegionServer-level monitoring
Metric | Type (GAUGE, COUNTER) | Unit | Description | Notes |
---|---|---|---|---|
regionCount | GAUGE | | The number of regions hosted by the regionserver | Objects hosted by the RegionServer |
storeCount | GAUGE | | | |
storeFileCount | GAUGE | | The number of store files on disk currently managed by the regionserver | |
storeFileSize | GAUGE | | Aggregate size of the store files on disk | |
hlogFileCount | GAUGE | | The number of write ahead logs not yet archived | |
totalRequestCount | COUNTER | | The total number of requests received | Load |
readRequestCount | COUNTER | | The number of read requests received | |
writeRequestCount | COUNTER | | The number of write requests received | |
numOpenConnections | GAUGE | | The number of open connections at the RPC layer | Connections and queues |
numActiveHandler | GAUGE | | The number of RPC handlers actively servicing requests | |
numCallsInGeneralQueue | GAUGE | | The number of currently enqueued user requests | |
numCallsInReplicationQueue | GAUGE | | The number of currently enqueued operations received from replication | |
numCallsInPriorityQueue | GAUGE | | The number of currently enqueued priority (internal housekeeping) requests | |
flushQueueLength | GAUGE | | Current depth of the memstore flush queue. If increasing, we are falling behind with clearing memstores out to HDFS. | |
compactionQueueLength | GAUGE | | Current depth of the compaction request queue. If increasing, we are falling behind with storefile compaction. | |
updatesBlockedTime | COUNTER | ms | Number of milliseconds updates have been blocked so the memstore can be flushed | |
blockCacheHitCount | COUNTER | | The number of block cache hits | Block cache usage |
blockCacheMissCount | COUNTER | | The number of block cache misses | |
blockCacheExpressHitPercent | GAUGE | percent | The percent of the time that requests with the cache turned on hit the cache | |
percentFilesLocal | GAUGE | percent | Percent of store file data that can be read from the local DataNode, 0-100 | File locality ratio |
<op>_<measure> | GAUGE | | Operation latencies, where <op> is one of Append, Delete, Mutate, Get, Replay, Increment; and where <measure> is one of min, max, mean, median, 75th_percentile, 95th_percentile, 99th_percentile | Detailed per-operation counters |
slow<op>Count | COUNTER | | The number of operations we thought were slow, where <op> is one of the list above | |
GcTimeMillis | COUNTER | ms | Time spent in garbage collection, in milliseconds | GC time |
GcTimeMillisParNew | COUNTER | ms | Time spent in garbage collection of the young generation, in milliseconds | |
GcTimeMillisConcurrentMarkSweep | COUNTER | ms | Time spent in garbage collection of the old generation, in milliseconds | |
authenticationSuccesses | COUNTER | | Number of client connections where authentication succeeded | ACL module statistics |
authenticationFailures | COUNTER | | Number of client connection authentication failures | |
mutationsWithoutWALCount | COUNTER | | Count of writes submitted with a flag indicating they should bypass the write ahead log | |
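The following metrics are non-core and not yet implemented | | | | |
compactedCellsCount | COUNTER | | Number of cells compacted | Cell statistics |
majorCompactedCellsCount | COUNTER | | Number of cells compacted in major compactions | |
flushedCellsSize | COUNTER | | Size of data flushed to disk | |
blockedRequestCount | COUNTER | | Number of flushes triggered because the memstore exceeded its threshold | |
splitRequestCount | COUNTER | | Number of region split requests | Region splits |
splitSuccessCount | COUNTER | | Number of successful region splits | |
receivedBytes | COUNTER | bytes | Amount of data received | Bandwidth |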
sentBytes | COUNTER | bytes | Amount of data sent | |
SyncTime_mean | | | | |
compactionQueueSize | GAUGE | | Size of the compaction queue | Compaction statistics |
compactionSize_avg_time | GAUGE | bytes | Average amount of data processed per compaction | |
compactionSize_num_ops | COUNTER | | Number of compactions performed | |
compactionTime_avg_time | GAUGE | ms | Average time taken per compaction | |
compactionTime_num_ops | COUNTER | | Number of compactions performed | |
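
The COUNTER metrics in the table above only ever grow, so dashboards and alerts usually work on the change between two samples, not on the raw value. The sketch below shows one way to turn counter samples into per-second rates and to derive a block cache hit ratio from the deltas of `blockCacheHitCount` and `blockCacheMissCount`. The function names, the sample layout, and the reset handling are illustrative assumptions, not part of HBase or of any particular monitoring system.

```python
from typing import List, Tuple


def counter_rate(samples: List[Tuple[float, int]]) -> List[float]:
    """Convert (timestamp_seconds, counter_value) samples into per-second rates.

    Assumption: samples are ordered by time. A drop in the counter is treated
    as a reset (for example a RegionServer restart) and that interval is skipped.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0 or v1 < v0:
            continue  # out-of-order sample or counter reset
        rates.append((v1 - v0) / dt)
    return rates


def hit_ratio(hits_delta: int, misses_delta: int) -> float:
    """Block cache hit ratio in percent over one interval (0.0 if no lookups)."""
    total = hits_delta + misses_delta
    return 100.0 * hits_delta / total if total else 0.0
```

For example, `totalRequestCount` sampled once a minute as 100, 700, 1300 corresponds to a steady 10 requests per second. GAUGE metrics such as `compactionQueueLength` need no such conversion and can be compared against a threshold directly.
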
II. RegionServer alert settings
Metric | Alert rule | Severity | Notes |
---|---|---|---|
totalRequestCount | all(#3) > 50000 | P1 | Load too high |
compactionQueueLength | all(#3) > 100 | P1 | Compaction queue too long |
percentFilesLocal | all(#3) <= 90 | P1 | File locality at or below 90% |
blockCacheExpressHitPercent | all(#3) <= 90 | P1 | Block cache hit ratio at or below 90% |
GcTimeMillisConcurrentMarkSweep | all(#3) > 200 | P1 | GC time too long |
storeFileCount | all(#3) > 1000 | P1 | Too many StoreFiles; consider running a compaction |
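
The alert rules above can be read as "fire when every one of the last 3 samples satisfies the comparison". The sketch below evaluates rules of that shape. It assumes that `all(#N)` has exactly this meaning (as in Open-Falcon style strategy expressions) and that samples are listed oldest first. `rule_fires` is a hypothetical helper, not an API of any alerting system.

```python
import operator
import re
from typing import Sequence

_OPS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}
# Matches rules such as "all(#3) > 50000" or "all(#3) <= 90".
_RULE = re.compile(r"^all\(#(\d+)\)\s*(>=|<=|==|!=|>|<)\s*(-?\d+(?:\.\d+)?)$")


def rule_fires(rule: str, values: Sequence[float]) -> bool:
    """Return True if the last N values all satisfy the rule's comparison."""
    m = _RULE.match(rule.strip())
    if m is None:
        raise ValueError(f"unsupported rule: {rule!r}")
    n, compare, threshold = int(m.group(1)), _OPS[m.group(2)], float(m.group(3))
    if n < 1:
        raise ValueError("window size must be at least 1")
    if len(values) < n:
        return False  # assumption: not enough samples means no alert
    return all(compare(v, threshold) for v in values[-n:])
```

With `percentFilesLocal` samples of 95, 89, 90, 88, the rule `all(#3) <= 90` fires because the last three samples are all at or below 90. A single dip surrounded by healthy samples does not fire, which is the point of requiring several consecutive samples.
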
III. Table (region) level monitoring on the RegionServer
Metric | Type (GAUGE, COUNTER) | Unit | Description | Notes |
---|---|---|---|---|
appendCount | COUNTER | | | Region-level operation counters |
deleteCount | COUNTER | | | |
mutateCount | COUNTER | | | |
incrementCount | COUNTER | | | |
scanNext_num_ops | COUNTER | | | |
get_num_ops | COUNTER | | | |
numBytesCompactedCount | COUNTER | bytes | Total size of files compacted | Compaction |
numFilesCompactedCount | COUNTER | | Number of files compacted | |
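
Region-level metrics are reported once per region, so a table-level view requires summing them across the regions of each table. The sketch below assumes the full metric names follow the pattern `Namespace_<namespace>_table_<table>_region_<encodedRegionName>_metric_<metric>`, which is how recent HBase versions expose region metrics over JMX. Verify the pattern against your own RegionServer before relying on it. The helper names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Optional, Tuple

# Assumed naming scheme; adjust the pattern if your deployment differs.
_REGION_METRIC = re.compile(
    r"^Namespace_(?P<namespace>.+?)_table_(?P<table>.+?)"
    r"_region_(?P<region>[0-9a-f]+)_metric_(?P<metric>.+)$"
)


def parse_region_metric(name: str) -> Optional[Dict[str, str]]:
    """Split a region-level metric name into its parts, or return None."""
    m = _REGION_METRIC.match(name)
    return m.groupdict() if m else None


def sum_by_table(samples: Dict[str, float]) -> Dict[Tuple[str, str, str], float]:
    """Sum region-level values into (namespace, table, metric) totals."""
    totals: Dict[Tuple[str, str, str], float] = defaultdict(float)
    for name, value in samples.items():
        parsed = parse_region_metric(name)
        if parsed is None:
            continue  # not a region-level metric, e.g. regionCount
        key = (parsed["namespace"], parsed["table"], parsed["metric"])
        totals[key] += value
    return dict(totals)
```

Summing is appropriate for the COUNTER metrics listed in the table above. Names that do not match the pattern are skipped, so RegionServer-level metrics can pass through the same function without affecting the totals.
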