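1. Context monitoring implementations:
GangliaContext: pushes metrics to Ganglia
FileContext: writes metrics to a file
TimeStampingFileContext: writes metrics to a file, with timestamps
CompositeContext: combines multiple implementations
NullContext: no monitoring
NullContextWithUpdateThread: no monitoring, but starts the thread that aggregates statistics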
2. HMaster metrics
cluster requests: number of requests across the cluster
split time: time spent splitting write-ahead logs
split size: size of the write-ahead logs being split
3. HRegionServer metrics
block cache: count, size, free, evicted
compaction: size, time, request size
memstore: size, flush queue size, flush size, flush time
stores: store files, stores, file index
I/O: fs read latency, fs write latency, fs sync latency
Other: read request count, write request count
4. RPC monitoring
RPC Processing Time
RPC Queue Time
5. JVM monitoring
Heap
GC
Thread
System event
6. Info monitoring
date, version, revision, url, user, hdfsDate, hdfsVersion, hdfsRevision, hdfsUrl, hdfsUser
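7. Ganglia architecture
gmond: collects data on every monitored node
gmetad: runs on a single node and pulls the data for the whole cluster from the gmond daemons
Web page: displays the data
After installation, edit hadoop-metrics.properties or hadoop-metrics2.properties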
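For hadoop-metrics2.properties, attaching a Ganglia sink comes down to a few key/value lines. The sketch below renders them in Python. It assumes Ganglia 3.1 or later (hence the `GangliaSink31` class), a hypothetical gmond address `gmond-host:8649`, and a 10-second reporting period; the function name and the prefix list are illustrative, not part of Hadoop.

```python
def ganglia_sink_lines(prefixes, servers, period_seconds=10):
    """Render hadoop-metrics2.properties lines that attach a Ganglia sink.

    prefixes: metrics-system prefixes such as "namenode" or "datanode".
    servers:  gmond address as "host:port" (assumed value, adjust per cluster).
    """
    lines = [
        # Sink class for Ganglia 3.1+; older Ganglia releases use GangliaSink30.
        "*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31",
        f"*.sink.ganglia.period={period_seconds}",
    ]
    for prefix in prefixes:
        lines.append(f"{prefix}.sink.ganglia.servers={servers}")
    return lines


if __name__ == "__main__":
    print("\n".join(ganglia_sink_lines(["namenode", "datanode"], "gmond-host:8649")))
```

The printed lines would be appended to hadoop-metrics2.properties on each node, and the daemons restarted so that the sink is loaded.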
8. JMX monitoring configuration:
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote.port=10101 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote.port=10102 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false $HADOOP_DATANODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote.port=10103 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false $HADOOP_SECONDARYNAMENODE_OPTS"
export HBASE_MASTER_OPTS="-Dcom.sun.management.jmxremote.port=11101 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false $HBASE_MASTER_OPTS"
export HBASE_REGIONSERVER_OPTS="-Dcom.sun.management.jmxremote.port=11102 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false $HBASE_REGIONSERVER_OPTS"
export HBASE_ZOOKEEPER_OPTS="-Dcom.sun.management.jmxremote.port=11103 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false $HBASE_ZOOKEEPER_OPTS"
export HBASE_THRIFT_OPTS="-Dcom.sun.management.jmxremote.port=11104 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false $HBASE_THRIFT_OPTS"
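The seven exports above follow one pattern and differ only in the variable name and the port. As a minimal sketch, the Python below regenerates them from a table, which makes a duplicated port easy to catch. The helper names are hypothetical; the variables, flags, and ports are the ones listed above.

```python
# Daemon option variables and the JMX ports used in the exports above.
JMX_PORTS = {
    "HADOOP_NAMENODE_OPTS": 10101,
    "HADOOP_DATANODE_OPTS": 10102,
    "HADOOP_SECONDARYNAMENODE_OPTS": 10103,
    "HBASE_MASTER_OPTS": 11101,
    "HBASE_REGIONSERVER_OPTS": 11102,
    "HBASE_ZOOKEEPER_OPTS": 11103,
    "HBASE_THRIFT_OPTS": 11104,
}


def jmx_export_line(var, port):
    """Build one export line that prepends the JMX flags to an existing variable."""
    flags = (
        f"-Dcom.sun.management.jmxremote.port={port} "
        "-Dcom.sun.management.jmxremote.ssl=false "
        "-Dcom.sun.management.jmxremote.authenticate=false"
    )
    return f'export {var}="{flags} ${var}"'


def all_export_lines(ports=JMX_PORTS):
    """Return every export line, refusing tables that reuse a port."""
    # Daemons that share a host cannot bind the same JMX port.
    if len(set(ports.values())) != len(ports):
        raise ValueError("each daemon needs its own JMX port")
    return [jmx_export_line(var, port) for var, port in ports.items()]


if __name__ == "__main__":
    print("\n".join(all_export_lines()))
```

With `ssl=false` and `authenticate=false`, any host that can reach these ports can read the MBeans, so this configuration suits only a trusted network.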
9. JVM monitoring:
ClassLoading: LoadedClassCount, TotalLoadedClassCount, UnloadedClassCount
Compilation: Name, CompilationTimeMonitoringSupported, TotalCompilationTime
GarbageCollector -> PS MarkSweep: Name, CollectionCount, CollectionTime, LastGcInfo, MemoryPoolNames, Valid
GarbageCollector -> PS Scavenge: Name, CollectionCount, CollectionTime, LastGcInfo, MemoryPoolNames, Valid
Memory: HeapMemoryUsage (init, max, committed, used), NonHeapMemoryUsage (init, max, committed, used), ObjectPendingFinalizationCount
MemoryManager -> CodeCacheManager: Name, MemoryPoolNames
MemoryPool -> Code Cache: Name, Type, UsageThresholdSupported, CollectionUsageThresholdSupported, MemoryManagerNames, Usage (init, max, committed, used), PeakUsage (init, max, committed, used), UsageThreshold, UsageThresholdCount, UsageThresholdExceeded, CollectionUsage (init, max, committed, used), CollectionUsageThreshold, CollectionUsageThresholdCount, CollectionUsageThresholdExceeded
MemoryPool -> PS Eden Space: Name, Type, UsageThresholdSupported, CollectionUsageThresholdSupported, MemoryManagerNames, Usage (init, max, committed, used), PeakUsage (init, max, committed, used), UsageThreshold, UsageThresholdCount, UsageThresholdExceeded, CollectionUsage (init, max, committed, used), CollectionUsageThreshold, CollectionUsageThresholdCount, CollectionUsageThresholdExceeded
MemoryPool -> PS Survivor Space: Name, Type, UsageThresholdSupported, CollectionUsageThresholdSupported, MemoryManagerNames, Usage (init, max, committed, used), PeakUsage (init, max, committed, used), UsageThreshold, UsageThresholdCount, UsageThresholdExceeded, CollectionUsage (init, max, committed, used), CollectionUsageThreshold, CollectionUsageThresholdCount, CollectionUsageThresholdExceeded
MemoryPool -> PS Old Gen: Name, Type, UsageThresholdSupported, CollectionUsageThresholdSupported, MemoryManagerNames, Usage (init, max, committed, used), PeakUsage (init, max, committed, used), UsageThreshold, UsageThresholdCount, UsageThresholdExceeded, CollectionUsage (init, max, committed, used), CollectionUsageThreshold, CollectionUsageThresholdCount, CollectionUsageThresholdExceeded
MemoryPool -> PS Perm Gen: Name, Type, UsageThresholdSupported, CollectionUsageThresholdSupported, MemoryManagerNames, Usage (init, max, committed, used), PeakUsage (init, max, committed, used), UsageThreshold, UsageThresholdCount, UsageThresholdExceeded, CollectionUsage (init, max, committed, used), CollectionUsageThreshold, CollectionUsageThresholdCount, CollectionUsageThresholdExceeded
OperatingSystem: Name, Arch, AvailableProcessors, CommittedVirtualMemorySize, FreePhysicalMemorySize, FreeSwapSpaceSize, MaxFileDescriptorCount, OpenFileDescriptorCount, ProcessCpuLoad, ProcessCpuTime, SystemCpuLoad, SystemLoadAverage, TotalPhysicalMemorySize, TotalSwapSpaceSize, Version
Runtime: Name, BootClassPathSupported, BootClassPath, ClassPath, InputArguments, LibraryPath, ManagementSpecVersion, SpecName, SpecVendor, SpecVersion, StartTime, SystemProperties, Uptime, VmName, VmVendor, VmVersion
Threading: CurrentThreadCpuTimeSupported, AllThreadIds, CurrentThreadCpuTime, CurrentThreadUserTime, ObjectMonitorUsageSupported, PeakThreadCount, SynchronizerUsageSupported, ThreadAllocatedMemoryEnabled, ThreadAllocatedMemorySupported, ThreadContentionMonitoringEnabled, ThreadContentionMonitoringSupported, ThreadCount, ThreadCpuTimeEnabled, ThreadCpuTimeSupported, TotalStartedThreadCount
java.nio.BufferPool -> direct: Name, TotalCapacity, Count, MemoryUsed
java.nio.BufferPool -> mapped: Name, TotalCapacity, Count, MemoryUsed
10. Attributes common to all Hadoop processes
JvmMetrics: GcCount, GcCountPS MarkSweep, GcCountPS Scavenge, GcTimeMillis, GcTimeMillisPS MarkSweep, GcTimeMillisPS Scavenge, LogError, LogFatal, LogInfo, LogWarn, MemHeapCommittedM, MemHeapUsedM, MemMaxM, MemNonHeapCommittedM, MemNonHeapUsedM, ThreadsBlocked, ThreadsNew, ThreadsRunnable, ThreadsTerminated, ThreadsTimedWaiting, ThreadsWaiting, tag.Context, tag.Hostname, tag.ProcessName, tag.SessionId
MetricsSystemStats: DroppedPubAll, NumActiveSinks, NumActiveSources, NumAllSinks, NumAllSources, PublishAvgTime, PublishNumOps, SnapshotAvgTime, SnapshotNumOps, tag.Context, tag.Hostname
StartupProgress: ElapsedTime, LoadingEditsCount, LoadingEditsElapsedTime, LoadingEditsPercentComplete, LoadingEditsTotal, LoadingFsImageCount, LoadingFsImageElapsedTime, LoadingFsImagePercentComplete, LoadingFsImageTotal, PercentComplete, SafeModeCount, SafeModeElapsedTime, SafeModePercentComplete, SafeModeTotal, SavingCheckpointCount, SavingCheckpointElapsedTime, SavingCheckpointPercentComplete, SavingCheckpointTotal, tag.Hostname
UgiMetrics (User and group): LoginFailureAvgTime, LoginFailureNumOps, LoginSuccessAvgTime, LoginSuccessNumOps, tag.Context, tag.Hostname
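Two JvmMetrics attributes are usually turned into ratios before they are graphed or alerted on. The sketch below is one way to do that in Python. It assumes that MemMaxM is a reasonable stand-in for the maximum heap size and that GcTimeMillis is a cumulative total since JVM start; the function names and the sample dictionaries are hypothetical.

```python
def heap_utilization(mem_heap_used_m, mem_max_m):
    """Fraction of the maximum memory currently used by the heap (both in MB)."""
    if mem_max_m <= 0:
        raise ValueError("MemMaxM must be positive")
    return mem_heap_used_m / mem_max_m


def gc_overhead(prev_sample, curr_sample, interval_millis):
    """Fraction of wall-clock time spent in GC between two JvmMetrics samples.

    Assumes GcTimeMillis only ever grows while the process is running.
    """
    if interval_millis <= 0:
        raise ValueError("interval must be positive")
    gc_millis = curr_sample["GcTimeMillis"] - prev_sample["GcTimeMillis"]
    return gc_millis / interval_millis


if __name__ == "__main__":
    before = {"GcTimeMillis": 1000}
    after = {"GcTimeMillis": 1600}
    print(heap_utilization(512, 2048))
    print(gc_overhead(before, after, 60_000))
```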
11. NameNode monitoring:
FSNamesystem: BlockCapacity, BlocksTotal, CapacityRemaining, CapacityTotal, CapacityUsed, CapacityUsedNonDFS, CorruptBlocks, ExcessBlocks, ExpiredHeartbeats, FilesTotal, LastCheckpointTime, LastWrittenTransactionId, MillisSinceLastLoadedEdits, MissingBlocks, PendingDataNodeMessageCount, PendingDeletionBlocks, PendingReplicationBlocks, PostponedMisreplicatedBlocks, ScheduledReplicationBlocks, Snapshots, SnapshottableDirectories, StaleDataNodes, TotalFiles, TotalLoad, TransactionsSinceLastCheckpoint, TransactionsSinceLastLogRoll, UnderReplicatedBlocks, tag.Context, tag.HAState, tag.Hostname
FSNamesystemState: BlocksTotal, CapacityRemaining, CapacityTotal, CapacityUsed, FSState, FilesTotal, NumDeadDataNodes, NumStaleDataNodes, ScheduledReplicationBlocks, TotalLoad, UnderReplicatedBlocks
NameNodeActivity: AddBlockOps, AllowSnapshotOps, BlockReportAvgTime, BlockReportNumOps, CreateFileOps, CreateSnapshotOps, CreateSymlinkOps, DeleteFileOps, DeleteSnapshotOps, DisallowSnapshotOps, FileInfoOps, FilesAppended, FilesCreated, FilesDeleted, FilesInGetListingOps, FilesRenamed, FsImageLoadTime, GetAdditionalDatanodeOps, GetBlockLocations, GetLinkTargetOps, GetListingOps, ListSnapshottableDirOps, RenameSnapshotOps, SafeModeTime, SnapshotDiffReportOps, SyncsAvgTime, TransactionsAvgTime, TransactionsBatchedInSync, TransactionsNumOps, tag.Context, tag.Hostname, tag.ProcessName
NameNodeInfo: BlockPoolId, BlockPoolUsedSpace, ClusterId, DeadNodes, DecomNodes, DistinctVersionCount, DistinctVersions, Free, JournalTransactionInfo, LiveNodes, NameDirStatuses, NonDfsUsedSpace, NumberOfMissingBlocks, PercentBlockPoolUsed, PercentRemaining, PercentUsed, Safemode, Threads, Total, TotalBlocks, TotalFiles, UpgradeFinalized, Used, Version
RpcActivityForPort9000: CallQueueLength, NumOpenConnections, ReceivedBytes, RpcAuthenticationFailures, RpcAuthenticationSuccesses, RpcAuthorizationFailures, RpcAuthorizationSuccesses, RpcProcessingTimeAvgTime, RpcProcessingTimeNumOps, RpcQueueTimeAvgTime, RpcQueueTimeNumOps, SentBytes, tag.Context, tag.Hostname, tag.port
RpcDetailedActivityForPort9000: AddBlockAvgTime, AddBlockNumOps, BlockReceivedAndDeletedAvgTime, BlockReceivedAndDeletedNumOps, BlockReportAvgTime, BlockReportNumOps, CommitBlockSynchronizationAvgTime, CommitBlockSynchronizationNumOps, CompleteAvgTime, CompleteNumOps, CreateAvgTime, CreateNumOps, DeleteAvgTime, DeleteNumOps, FsyncAvgTime, FsyncNumOps, GetBlockLocationsAvgTime, GetBlockLocationsNumOps, GetEditLogManifestAvgTime, GetEditLogManifestNumOps, GetFileInfoAvgTime, GetFileInfoNumOps, GetListingAvgTime, GetListingNumOps, GetServerDefaultsAvgTime, GetServerDefaultsNumOps, GetTransactionIdAvgTime, GetTransactionIdNumOps, MkdirsAvgTime, MkdirsNumOps, RecoverLeaseAvgTime, RecoverLeaseNumOps, RegisterDatanodeAvgTime, RegisterDatanodeNumOps, RenameAvgTime, RenameNumOps, RenewLeaseAvgTime, RenewLeaseNumOps, RollEditLogAvgTime, RollEditLogNumOps, SendHeartbeatAvgTime, SendHeartbeatNumOps, SetSafeModeAvgTime, SetSafeModeNumOps, SetTimesAvgTime, SetTimesNumOps, UpdateBlockForPipelineAvgTime, UpdateBlockForPipelineNumOps, UpdatePipelineAvgTime, UpdatePipelineNumOps, VersionRequestAvgTime, VersionRequestNumOps, tag.Context, tag.Hostname, tag.port
JvmMetrics:
MetricsSystemStats :
StartupProgress:
UgiMetrics (User and group):
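The `...NumOps` attributes in RpcActivityForPort9000 become a throughput figure once two samples are compared. The sketch below assumes RpcProcessingTimeNumOps is a cumulative count of handled calls since the NameNode started, and treats a drop in the counter as a restart; the function name is hypothetical.

```python
def rpc_ops_per_second(prev_num_ops, curr_num_ops, interval_seconds):
    """Average RPC calls per second between two samples of a NumOps counter."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    delta = curr_num_ops - prev_num_ops
    if delta < 0:
        # The counter went backwards, most likely because the daemon restarted,
        # so count only the calls seen since the restart.
        delta = curr_num_ops
    return delta / interval_seconds


if __name__ == "__main__":
    # Two samples of RpcProcessingTimeNumOps taken 60 seconds apart.
    print(rpc_ops_per_second(1000, 1600, 60))
```

The same calculation applies to any `...NumOps` attribute in RpcDetailedActivityForPort9000, for example SendHeartbeatNumOps.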
12. DataNode monitoring:
DataNodeActivity: BlockChecksumOpAvgTime, BlockChecksumOpNumOps, BlockReportsAvgTime, BlockReportsNumOps, BlockVerificationFailures, BlocksGetLocalPathInfo, BlocksRead, BlocksRemoved, BlocksReplicated, BlocksVerified, BlocksWritten, BytesRead, BytesWritten, CopyBlockOpAvgTime, CopyBlockOpNumOps, FlushNanosAvgTime, FlushNanosNumOps, FsyncCount, FsyncNanosAvgTime, FsyncNanosNumOps, PacketAckRoundTripTimeNanosAvgTime, PacketAckRoundTripTimeNanosNumOps, ReadBlockOpAvgTime, ReadBlockOpNumOps
DataNodeInfo: ClusterId, HttpPort, NamenodeAddresses, RpcPort, Version, VolumeInfo, XceiverCount
FSDatasetState: Capacity, DfsUsed, NumFailedVolumes, Remaining, StorageInfo
RpcActivityForPort50020: CallQueueLength, NumOpenConnections, ReceivedBytes, RpcAuthenticationFailures, RpcAuthenticationSuccesses, RpcAuthorizationFailures, RpcAuthorizationSuccesses, RpcProcessingTimeAvgTime, RpcProcessingTimeNumOps, RpcQueueTimeAvgTime, RpcQueueTimeNumOps, SentBytes, tag.Context, tag.Hostname, tag.port
RpcDetailedActivityForPort50020: tag.Context, tag.Hostname, tag.port
JvmMetrics:
MetricsSystemStats :
StartupProgress:
UgiMetrics (User and group):
13. SecondaryNameNode monitoring:
JvmMetrics:
MetricsSystemStats :
StartupProgress:
UgiMetrics (User and group):
14. HMaster monitoring:
IPC: ProcessCallTime, QueueCallTime, authenticationFailures, authenticationSuccesses, authorizationFailures, authorizationSuccesses, numActiveHandler, numCallsInGeneralQueue, numCallsInPriorityQueue, numCallsInReplicationQueue, numOpenConnections, queueSize, receivedBytes, sentBytes, tag.Context, tag.Hostname
AssignmentManger: Assign, BulkAssign, ritCount, ritCountOverThreshold, ritOldestAge, tag.Context, tag.Hostname
Balancer: BalancerCluster, miscInvocationCount, tag.Context, tag.Hostname
FileSystem: HlogSplitSize, HlogSplitTime, MetaHlogSplitSize, MetaHlogSplitTime, tag.Context, tag.Hostname
Server: averageLoad, clusterRequests, masterActiveTime, masterStartTime, numDeadRegionServers, numRegionServers, tag.Context, tag.Hostname, tag.clusterId, tag.deadRegionServers, tag.isActiveMaster, tag.liveRegionServers, tag.serverName, tag.zookeeperQuorum
JvmMetrics:
MetricsSystemStats :
StartupProgress:
UgiMetrics (User and group):
15. HRegionServer monitoring:
IPC: ProcessCallTime, QueueCallTime, authenticationFailures, authenticationSuccesses, authorizationFailures, authorizationSuccesses, numActiveHandler, numCallsInGeneralQueue, numCallsInPriorityQueue, numCallsInReplicationQueue, numOpenConnections, queueSize, receivedBytes, sentBytes, tag.Context, tag.Hostname
Regions: tablename_get(75th_percentile, 95th_percentile, 99th_percentile, max, mean, median, min, num_ops), tablename_scanNext(75th_percentile, 95th_percentile, 99th_percentile, max, mean, median, min, num_ops), coprocessorExecutionStatistics, region_appendCount, region_compactionsCompletedCount, region_deleteCount, region_incrementCount, region_memStoreSize, region_mutateCount, region_numBytesCompactedCount, region_numFilesCompactedCount, region_storeCount, region_storeFileCount, region_storeFileSize
Replication: tag.Context, tag.Hostname
Server: Append, Delete, Get, Increment, Mutate, Replay, blockCacheCount, blockCacheEvictionCount, blockCacheExpressHitPercent, blockCacheFreeSize, blockCacheHitCount, blockCacheMissCount, blockCacheSize, blockCountHitPercent, checkMutateFailedCount, checkMutatePassedCount, compactedCellsCount, compactedCellsSize, compactionQueueLength, flushQueueLength, flushedCellsCount, flushedCellsSize, hlogFileCount, hlogFileSize, majorCompactedCellsCount, majorCompactedCellsSize, memStoreSize, mutationsWithoutWALCount, mutationsWithoutWALSize, percentFilesLocal, readRequestCount, regionCount, regionServerStartTime, slowAppendCount, slowDeleteCount, slowGetCount, slowIncrementCount, slowPutCount, staticBloomSize, staticIndexSize, storeCount, storeFileCount, storeFileIndexSize, storeFileSize, totalRequestCount, updatesBlockedTime, writeRequestCount, tag.Context, tag.Hostname, tag.clusterId, tag.serverName, tag.zookeeperQuorum
WAL: AppendSize, AppendTime, SyncTime, appendCount, slowAppendCount, tag.Context, tag.Hostname
JvmMetrics:
MetricsSystemStats :
StartupProgress:
UgiMetrics (User and group):
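The Server bean exposes blockCacheHitCount and blockCacheMissCount, from which a hit ratio for a recent window can be derived. The sketch below is a minimal Python version, assuming both attributes are cumulative counters since the RegionServer started; the function names and the sample dictionaries are hypothetical.

```python
def block_cache_hit_ratio(hit_count, miss_count):
    """Hits divided by total lookups; 0.0 when there were no lookups."""
    total = hit_count + miss_count
    return hit_count / total if total else 0.0


def interval_hit_ratio(prev_sample, curr_sample):
    """Hit ratio for the window between two samples of the Server bean."""
    hits = curr_sample["blockCacheHitCount"] - prev_sample["blockCacheHitCount"]
    misses = curr_sample["blockCacheMissCount"] - prev_sample["blockCacheMissCount"]
    return block_cache_hit_ratio(hits, misses)


if __name__ == "__main__":
    before = {"blockCacheHitCount": 1000, "blockCacheMissCount": 500}
    after = {"blockCacheHitCount": 1900, "blockCacheMissCount": 600}
    print(interval_hit_ratio(before, after))
```

A windowed ratio reacts to recent workload changes, whereas a ratio over the lifetime totals is dominated by history on a long-running server.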
16. ZooKeeper monitoring:
ReplicatedServer_id1: Name, QuorumSize
replica.0: Name, QuorumAddress
replica.1: Name, QuorumAddress
replica.2: Name, QuorumAddress
Leader: AvgRequestLatency, ClientPort, CurrentZxid, MaxClientCnxnsPerHost, MaxRequestLatency, MaxSessionTimeout, MinRequestLatency, MinSessionTimeout, NumAliveConnections, OutstandingRequests, PacketsReceived, PacketsSent, StartTime, TickTime, Version
InMemoryDataTree: LastZxid, NodeCount, WatchCount
Connection: AvgLatency, EphemeralNodes, LastCxid, LastLatency, LastOperation, LastResponseTime, LastZxid, MaxLatency, MinLatency, OutstandingRequests, PacketsReceived, PacketsSent, SessionId, SessionTimeout, SourceIP, StartedTime
17. Thrift Server monitoring:
ThriftOne: BatchGet, BatchMutate, SlowThriftCall, ThriftCall, TimeInQueue, callQueueLen, tag.Hostname, tag.Context
ThriftTwo: same as ThriftOne
JvmMetrics:
MetricsSystemStats :
UgiMetrics (User and group):