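HBase emits metrics through the Hadoop metrics API, with a default sampling period of 10 seconds. These metrics can be used together with Ganglia, and you can also filter which metrics are emitted or extend the metrics framework with your own.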
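1 Metric setup
Since HBase 0.95, HBase ships with a default metrics configuration, or sink. To configure metrics for a region server, edit the file conf/hadoop-metrics2-hbase.properties, then restart the affected region server for the change to take effect.
To change the default sampling rate, edit the line that begins with *.period.
To filter metrics or extend the metrics framework, see http://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html.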
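The edit to the `*.period` line can be sketched as a small script. This is only an illustration under stated assumptions: the sample file contents are hypothetical, the helper name `set_sampling_period` is made up for this example, and the value is assumed to be a number of seconds.

```python
import re

def set_sampling_period(properties_text: str, seconds: int) -> str:
    """Return the properties text with the `*.period` line set to `seconds`."""
    if seconds <= 0:
        raise ValueError("sampling period must be a positive number of seconds")
    # Match an uncommented line that begins with "*.period".
    pattern = re.compile(r"^\*\.period\s*=.*$", flags=re.MULTILINE)
    new_line = f"*.period={seconds}"
    if pattern.search(properties_text):
        return pattern.sub(new_line, properties_text, count=1)
    # No such line yet: append one and leave the rest of the file unchanged.
    separator = "" if (not properties_text or properties_text.endswith("\n")) else "\n"
    return properties_text + separator + new_line + "\n"

# Hypothetical file contents, used only to exercise the helper.
sample = "# hypothetical file contents\n*.period=10\n"
updated = set_sampling_period(sample, 30)
print(updated)
```

The helper only transforms text; writing the result back to conf/hadoop-metrics2-hbase.properties and restarting the region server remain manual steps.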
HBase Metrics and Ganglia
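By default, HBase emits a large number of metrics for each region server, and Ganglia may have difficulty processing all of them. Either increase the capacity of the Ganglia server or reduce the number of metrics that HBase emits; see Metrics Filtering.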
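2 Disabling metrics
To disable metrics for a region server, edit conf/hadoop-metrics2-hbase.properties and comment out the relevant lines, then restart the affected region server for the change to take effect.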
3 Viewing available metrics
- Web UI: the Metrics Dump link
- JMX tools such as jconsole
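A metrics dump can also be processed by a script. The sketch below assumes the dump is JSON with a top-level `beans` array, which is the layout produced by Hadoop's JMX JSON servlet. The sample values, the bean name, and the helper `metrics_for` are illustrative assumptions, so check the names against the dump of your own cluster.

```python
import json

# Hypothetical excerpt of a metrics dump; real dumps contain many more beans.
SAMPLE_DUMP = """
{"beans": [
  {"name": "Hadoop:service=HBase,name=RegionServer,sub=Server",
   "regionCount": 42, "storeFileCount": 130, "flushQueueLength": 0},
  {"name": "java.lang:type=Memory", "ObjectPendingFinalizationCount": 0}
]}
"""

def metrics_for(dump: dict, name_fragment: str) -> dict:
    """Return the attributes of the first bean whose name contains name_fragment."""
    for bean in dump.get("beans", []):
        if name_fragment in bean.get("name", ""):
            # Drop the bean name itself so only metric attributes remain.
            return {key: value for key, value in bean.items() if key != "name"}
    return {}

dump = json.loads(SAMPLE_DUMP)
server = metrics_for(dump, "sub=Server")
for metric, value in sorted(server.items()):
    print(f"{metric} = {value}")
```

Parsing a saved dump keeps the example independent of any host name or port; fetching the dump over HTTP is left out on purpose.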
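4 Units of measure for metrics
Different metrics use different units of measure. Common examples:
- Points in time are expressed as timestamps
- Ages (such as ageOfLastShippedOp) are expressed in milliseconds
- Memory sizes are expressed in bytes
- Queue sizes (such as sizeOfLogQueue) are expressed as a number of items
- Counts of an operation (such as logEditsRead) are expressed as integers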
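When printing metrics for people to read, it can help to convert each value according to its unit. The following is a minimal sketch: the unit labels and the function `describe` are invented for this example, and timestamps are assumed to be milliseconds since the Unix epoch.

```python
from datetime import datetime, timezone

def describe(value: float, unit: str) -> str:
    """Render a raw metric value in a human-readable form for the given unit."""
    if unit == "timestamp_ms":
        # Assumption: the timestamp counts milliseconds since the Unix epoch.
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc).isoformat()
    if unit == "ms":
        return f"{value / 1000:.1f} s"
    if unit == "bytes":
        return f"{value / (1024 * 1024):.1f} MiB"
    # Queue sizes and operation counts are plain integers.
    return str(int(value))

print(describe(1500, "ms"))               # an age such as ageOfLastShippedOp
print(describe(3 * 1024 * 1024, "bytes")) # a memory size
print(describe(7, "items"))               # a queue size such as sizeOfLogQueue
```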
5 Important Master metrics
hbase.master.numRegionServers:
Number of live regionservers
hbase.master.numDeadRegionServers:
Number of dead regionservers
hbase.master.ritCount:
The number of regions in transition
hbase.master.ritCountOverThreshold:
The number of regions that have been in transition longer than a threshold time (default: 60 seconds)
hbase.master.ritOldestAge:
The age of the longest region in transition, in milliseconds
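As an illustration of how these Master metrics might be watched, the sketch below turns a snapshot of values into warning messages. The snapshot, the function `master_warnings`, and the 60-second age limit are assumptions made for the example, not recommended thresholds.

```python
def master_warnings(metrics: dict, max_rit_age_ms: int = 60_000) -> list:
    """Return human-readable warnings for a snapshot of Master metrics."""
    warnings = []
    dead = metrics.get("numDeadRegionServers", 0)
    if dead > 0:
        warnings.append(f"{dead} dead region server(s)")
    stuck = metrics.get("ritCountOverThreshold", 0)
    if stuck > 0:
        warnings.append(f"{stuck} region(s) in transition longer than the threshold")
    oldest = metrics.get("ritOldestAge", 0)
    if oldest > max_rit_age_ms:
        warnings.append(f"oldest region in transition is {oldest} ms old")
    return warnings

# Hypothetical snapshot; the keys follow the metric names without the prefix.
snapshot = {
    "numRegionServers": 5,
    "numDeadRegionServers": 1,
    "ritCount": 2,
    "ritCountOverThreshold": 0,
    "ritOldestAge": 75_000,
}
for message in master_warnings(snapshot):
    print("WARNING:", message)
```

A non-zero ritCount on its own is not flagged here, because regions pass through transition during normal operation.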
6 Important RegionServer metrics
hbase.regionserver.regionCount:
The number of regions hosted by the regionserver
hbase.regionserver.storeFileCount:
The number of store files on disk currently managed by the regionserver
hbase.regionserver.storeFileSize:
Aggregate size of the store files on disk
hbase.regionserver.hlogFileCount:
The number of write ahead logs not yet archived
hbase.regionserver.totalRequestCount:
The total number of requests received
hbase.regionserver.readRequestCount:
The number of read requests received
hbase.regionserver.writeRequestCount:
The number of write requests received
hbase.regionserver.numOpenConnections:
The number of open connections at the RPC layer
hbase.regionserver.numActiveHandler:
The number of RPC handlers actively servicing requests
hbase.regionserver.numCallsInGeneralQueue:
The number of currently enqueued user requests
hbase.regionserver.numCallsInReplicationQueue:
The number of currently enqueued operations received from replication
hbase.regionserver.numCallsInPriorityQueue:
The number of currently enqueued priority (internal housekeeping) requests
hbase.regionserver.flushQueueLength:
Current depth of the memstore flush queue. If increasing, we are falling behind with clearing memstores out to HDFS.
hbase.regionserver.updatesBlockedTime:
Number of milliseconds updates have been blocked so the memstore can be flushed
hbase.regionserver.compactionQueueLength:
Current depth of the compaction request queue. If increasing, we are falling behind with storefile compaction.
hbase.regionserver.blockCacheHitCount:
The number of block cache hits
hbase.regionserver.blockCacheMissCount:
The number of block cache misses
hbase.regionserver.blockCacheExpressHitPercent:
The percent of the time that requests with the cache turned on hit the cache
hbase.regionserver.percentFilesLocal:
Percent of store file data that can be read from the local DataNode, 0-100
hbase.regionserver.<op>_<measure>:
Operation latencies, where <op> is one of Append, Delete, Mutate, Get, Replay, Increment; and where <measure> is one of min, max, mean, median, 75th_percentile, 95th_percentile, 99th_percentile
hbase.regionserver.slow<op>Count:
The number of operations we thought were slow, where <op> is one of the list above
hbase.regionserver.GcTimeMillis:
Time spent in garbage collection, in milliseconds
hbase.regionserver.GcTimeMillisParNew:
Time spent in garbage collection of the young generation, in milliseconds
hbase.regionserver.GcTimeMillisConcurrentMarkSweep:
Time spent in garbage collection of the old generation, in milliseconds
hbase.regionserver.authenticationSuccesses:
Number of client connections where authentication succeeded
hbase.regionserver.authenticationFailures:
Number of client connection authentication failures
hbase.regionserver.mutationsWithoutWALCount:
Count of writes submitted with a flag indicating they should bypass the write ahead log
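Some of these metrics are most useful when combined or compared over time. The sketch below derives a block cache hit ratio from blockCacheHitCount and blockCacheMissCount, and checks whether a queue length such as flushQueueLength or compactionQueueLength keeps increasing. It assumes the hit and miss counts are cumulative counters; the function names and sample values are hypothetical.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of block cache lookups that were hits (0.0 when there were none)."""
    total = hits + misses
    return hits / total if total else 0.0

def interval_hit_ratio(previous: dict, current: dict) -> float:
    """Hit ratio between two samples, assuming the counters are cumulative."""
    hits = current["blockCacheHitCount"] - previous["blockCacheHitCount"]
    misses = current["blockCacheMissCount"] - previous["blockCacheMissCount"]
    return hit_ratio(hits, misses)

def is_growing(samples: list) -> bool:
    """True when every sample is larger than the one before it."""
    return len(samples) > 1 and all(a < b for a, b in zip(samples, samples[1:]))

# Hypothetical samples taken at two consecutive polling intervals.
earlier = {"blockCacheHitCount": 100, "blockCacheMissCount": 50}
later = {"blockCacheHitCount": 190, "blockCacheMissCount": 60}
print(f"interval hit ratio: {interval_hit_ratio(earlier, later):.2f}")
print("flush queue growing:", is_growing([1, 3, 7]))
```

Comparing deltas between samples, rather than lifetime totals, makes a recent drop in cache effectiveness visible even on a long-running region server.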