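CHM (Cluster Health Monitor) is a tool from Oracle that automatically collects operating system resource statistics (CPU, memory, swap, processes, I/O, network, and so on). Starting with version 11.2.0.2, CHM runs as the init resource ora.crf on every node of the cluster. CHM collects data once per second (every 5 seconds in 11.2.0.3).
The system resource data that CHM gathers is very helpful for diagnosing node reboots, hangs, instance evictions, and performance problems in a cluster. CHM can also be installed as a standalone tool on Linux and Windows.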
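CHM components:
1: CHM repository (Repository): by default it lives under <gi_home>/crf/db/<node name> and takes up 1 GB of disk space. The longest retention CHM allows is 3 days, and the statistics collected on each node take up roughly 500 MB per day. It is a Berkeley DB database that stores the system information collected from each node.
2: System Monitor Service: runs as the osysmond.bin daemon on every node. osysmond.bin periodically collects the operating system statistics of the local node and sends them to the Cluster Logger Service on the master node.
3: Cluster Logger Service: runs as the ologgerd daemon on the CHM master node (Master Node) and on the replica node (Replication Node).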
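The sizing figures quoted for the repository can be turned into a rough disk-space estimate. This is only a minimal sketch: it assumes the figure of about 500 MB per node per day scales linearly with node count and retention, and the function name chm_estimate_mb is hypothetical, not part of any Oracle tool.

```shell
#!/usr/bin/env bash
# Rough estimate of CHM repository space, in MB.
# Assumption: ~500 MB per node per day (the figure quoted in the text),
# scaled linearly with the number of nodes and the retention in seconds.
chm_estimate_mb() {
    local nodes=$1 retention_secs=$2
    local mb_per_node_day=500
    # Integer arithmetic: MB = nodes * 500 * seconds / 86400
    echo $(( nodes * mb_per_node_day * retention_secs / 86400 ))
}

# Example: two nodes, one day (86400 seconds) of retention
chm_estimate_mb 2 86400
```

Real usage depends on the workload, so treat the result as an order-of-magnitude guide when choosing a value for changesize.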
CHM data files
-rw-r-----. 1 root root 6373376 Mar 27 09:11 crfalert.bdb
-rw-r-----. 1 root root 168624128 Mar 27 09:11 crfclust.bdb
-rw-r-----. 1 root root 8192 Jan 31 19:31 crfconn.bdb
-rw-r-----. 1 root root 9863168 Mar 27 09:11 crfcpu.bdb
-rw-r-----. 1 root root 4186112 Mar 27 09:11 crfhosts.bdb
-rw-r-----. 1 root root 3964928 Mar 27 09:11 crfloclts.bdb
-rw-r-----. 1 root root 5058560 Mar 27 09:11 crfts.bdb
-rw-r-----. 1 root root 24576 Oct 16 09:46 __db.001
-rw-r-----. 1 root root 401408 Mar 27 09:11 __db.002
-rw-r-----. 1 root root 2629632 Mar 27 09:11 __db.003
-rw-r-----. 1 root root 2162688 Mar 27 09:11 __db.004
-rw-r-----. 1 root root 1187840 Mar 27 09:11 __db.005
-rw-r-----. 1 root root 57344 Mar 27 09:11 __db.006
-rw-r--r--. 1 root root 120000000 Jan 31 19:26 jhdb01.ldb
-rw-r-----. 1 root root 16777216 Mar 27 09:06 log.0000042022
-rw-r-----. 1 root root 16777216 Mar 27 09:11 log.0000042023
-rw-r-----. 1 root root 8192 Oct 16 09:46 repdhosts.bdb
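To see how much space a listing like the one above adds up to, the file sizes can be summed. The helper below is a small sketch; dir_total_bytes is a hypothetical name, and because the repository files are readable only by root, it would have to be run as root against the repository directory.

```shell
#!/usr/bin/env bash
# Sum the sizes, in bytes, of all regular files directly inside a directory.
# Subdirectories are skipped; nothing here is specific to CHM.
dir_total_bytes() {
    local dir=$1 total=0 size f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip directories and unmatched globs
        size=$(wc -c < "$f")             # byte count of this file
        total=$(( total + size ))
    done
    echo "$total"
}

# Usage (as root), with the repository path reported by
# "oclumon manage -get reppath":
#   dir_total_bytes /u01/app/11.2.0/grid/crf/db/host01
```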
Basic operations:
Oracle provides two tools for accessing CHM data: oclumon and the CHM/OS Graphical User Interface (CHMOSG).
jhdb01-> oclumon manage -h
MANAGE verb usage
=================
manage [[-repos {resize <time>|changesize <memsize>|reploc <new_loc> [[-maxtime <time>]|
[-maxspace <memsize>]] }]|[-get <key1> <key2>..]]
*Where
-repos = Required to specify Cluster Health Monitor repository related options
-get = Fetch manage information for one or more named keys
<key1> <key2>= <key> can be repsize, reppath, master, and replica
resize = Option for resizing Cluster Health Monitor repository
<time> = Size of Cluster Health Monitor repository in number of seconds
must be more than 3600 (1 hour) and less than 259200 (3 days)
changesize = Option for Change Cluster Health Monitor repository space limit
<memsize> = Size of Cluster Health Monitor repository in megabytes
reploc = Option for Change Repository Location
<new_loc> = Path to new directory e.g.: /opt/db
-maxtime = Option to specify Cluster Health Monitor repository size in terms of elapsed seconds of data capture for new location
-maxspace = Option to specify space limit for new Cluster Health Monitor repository location
*Requirements
The local system monitor service must be running to resize the Cluster Health Monitor repository.
The Cluster Logger Service must be running to resize Cluster Health Monitor repository.
*Example :
manage -get MASTER REPLICA
manage -repos resize 86400
manage -repos changesize 6000
manage -repos reploc /opt/oracrfdb
manage -repos reploc /opt/oracrfdb -maxtime 86400
manage -repos reploc /opt/oracrfdb -maxspace 6000
1. Check the current repository path
[oracle@host01$]oclumon manage -get reppath
CHM Repository Path = /u01/app/11.2.0/grid/crf/db/host01
Done
2. Check the current repository size (per the help text above, expressed in seconds of retention)
[oracle@host01$]oclumon manage -get repsize
CHM Repository Size = 61646
Done
3. Change the path
[oracle@host01$]oclumon manage -repos reploc /u01/app/11.2.0/grid/crf/
host01 --> Ready to commit new location
host02 --> Ready to commit new location
New retention is 61646 and will use 1073725369 bytes of disk space
CRS-9113-Cluster Health Monitor repository location change completed on all nodes. Restarting Loggerd.
Done
After changing the path, ora.crf has to be restarted for the change to take effect.
4. Change the size
oclumon manage -repos resize 60000
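The help text shown earlier states that the resize value must be more than 3600 and less than 259200 seconds. A value can be checked against that range before it is passed to oclumon. This is a minimal sketch; chm_valid_retention is a hypothetical helper name, and the bounds are taken from the help text rather than verified against every release.

```shell
#!/usr/bin/env bash
# Return 0 if the argument is a whole number of seconds strictly between
# 3600 (1 hour) and 259200 (3 days), the range given in the oclumon help text.
chm_valid_retention() {
    local secs=$1
    case $secs in
        ''|*[!0-9]*) return 1 ;;   # empty or not a non-negative integer
    esac
    [ "$secs" -gt 3600 ] && [ "$secs" -lt 259200 ]
}

# Usage:
#   chm_valid_retention 60000 && oclumon manage -repos resize 60000
```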
Ways to obtain CHM data:
1. Use Grid_home/bin/diagcollection.pl
First, identify the master node of the cluster logger service
oclumon manage -get master
Run the following command as root on the master node
[root@host02 ~]# /u01/app/11.2.0/grid/bin/diagcollection.pl -collect -shmos -incidenttime inc_time -incidentduration duration
Production Copyright 2004, 2010, Oracle. All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
Unknown option: shmos
The following CRS diagnostic archives will be created in the local directory.
crsData_host02_20180327_1516.tar.gz -> logs,traces and cores from CRS home. Note: core files will be packaged only with the --core option.
ocrData_host02_20180327_1516.tar.gz -> ocrdump, ocrcheck etc
coreData_host02_20180327_1516.tar.gz -> contents of CRS core files in text format
osData_host02_20180327_1516.tar.gz -> logs from Operating System
Collecting crs data
/bin/tar: log/host02/ctssd/octssd.log: file changed as we read it
Collecting OCR data
Collecting information from core files
No corefiles found
Collecting OS logs
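Note that the run above misspells the option as -shmos, which is why the tool reports "Unknown option: shmos" and no CHM archive appears in the list; the correct option is -chmos, and inc_time and duration are placeholders for real values. incidenttime is the time from which to start collecting data, in the format MM/DD/YYYY24HH:MM:SS; incidentduration is how long a period of data to collect from that start time.
For example: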
diagcollection.pl -collect -crshome /u01/app/11.2.0/grid -chmoshome /u01/app/11.2.0/grid -chmos -incidenttime 06/15/201412:30:00 -incidentduration 00:05
After this command runs, the CHM data is written to the file chmosData_rac2_20140615_1237.tar.gz.
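The -incidenttime value has no separator between the date and the time, which is easy to get wrong by hand. The sketch below builds it from a conventional timestamp. It assumes GNU date (the -d option, as on Linux), and chm_incident_time is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Convert "YYYY-MM-DD HH:MM:SS" into the MM/DD/YYYYHH:MM:SS form
# used by the -incidenttime option in the example above.
# Assumption: GNU date, which accepts -d "<timestamp>".
chm_incident_time() {
    date -d "$1" '+%m/%d/%Y%H:%M:%S'
}

# Usage (as root on the master node):
#   diagcollection.pl -collect -chmos \
#       -incidenttime "$(chm_incident_time '2014-06-15 12:30:00')" \
#       -incidentduration 00:05
```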
[root@host02 host02]# ll
total 120
drwxr-xr-x 2 root root 4096 Mar 27 15:19 acfs
drwxr-xr-x 4 root root 4096 Mar 27 15:19 agent
-rw-rw-r-- 1 oracle oinstall 12199 Mar 27 15:05 alerthost02.log
drwxr-xr-x 2 root root 4096 Mar 27 15:19 client
drwxr-xr-x 2 root root 4096 Mar 27 15:19 crflogd
drwxr-xr-x 2 root root 4096 Mar 27 15:19 crfmond
drwxr-xr-x 2 root root 4096 Mar 27 15:19 crsd
drwxr-xr-x 2 root root 4096 Mar 27 15:19 cssd
drwxr-xr-x 2 root root 4096 Mar 27 15:19 ctssd
drwxr-xr-x 2 root root 4096 Mar 27 15:19 evmd
drwxr-xr-x 2 root root 4096 Mar 27 15:19 gipcd
drwxr-xr-x 2 root root 4096 Mar 27 15:19 gpnpd
drwxr-xr-x 2 root root 4096 Mar 27 15:19 mdnsd
drwxr-xr-x 2 root root 4096 Mar 27 15:19 ohasd
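As this shows, CHM collects information on each node and sends it to the cluster logger service, which in turn stores it in the Berkeley DB repository.
2. Use oclumon to obtain CHM data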
oclumon dumpnodeview [[-allnodes] | [-n node1 node2] [-last "duration"] | [-s "time_stamp" -e "time_stamp"] [-v] [-warning]] [-h]
(-s is the start time, -e is the end time)
For example:
oclumon dumpnodeview -allnodes -v -s "2012-06-15 07:40:00" -e "2012-06-15 07:57:00" > /tmp/chm1.txt
oclumon dumpnodeview -n node1 node2 -last "12:00:00" > /tmp/chm1.txt
oclumon dumpnodeview -allnodes -last "00:15:00" > /tmp/chm1.txt
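When investigating a specific incident, the -s and -e values usually bracket the incident time. The following sketch computes such a window. It assumes GNU date (the -d option and @epoch input), and chm_window is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print "start|end" timestamps covering N minutes before and after an incident,
# in the "YYYY-MM-DD HH:MM:SS" form used by dumpnodeview -s and -e above.
# Assumption: GNU date (-d with a timestamp and with @epoch).
chm_window() {
    local incident=$1 minutes=$2 epoch fmt='+%Y-%m-%d %H:%M:%S'
    epoch=$(date -d "$incident" +%s) || return 1   # incident time as epoch seconds
    printf '%s|%s\n' \
        "$(date -d "@$(( epoch - minutes * 60 ))" "$fmt")" \
        "$(date -d "@$(( epoch + minutes * 60 ))" "$fmt")"
}

# Usage:
#   IFS='|' read -r start end <<< "$(chm_window '2012-06-15 07:48:00' 10)"
#   oclumon dumpnodeview -allnodes -v -s "$start" -e "$end" > /tmp/chm1.txt
```

Going through epoch seconds avoids relying on relative-date strings, which date implementations parse differently.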
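Stopping and starting CHM (run as the grid user on each node)
Stopping CHM this way is similar to service stop on Linux: it starts again automatically after a reboot. System I/O drops somewhat once it is stopped.
Stop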
crsctl stop res ora.crf -init
Start
crsctl start res ora.crf -init
Disabling and enabling CHM (similar to chkconfig)
Run as the root user
Disable
crsctl modify resource ora.crf -attr "AUTO_START=never" -init
Enable
crsctl modify resource ora.crf -attr "AUTO_START=always" -init
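Since the two commands differ only in the AUTO_START value, a small dry-run helper can make the intent explicit in scripts. This is just a sketch: chm_autostart_cmd is a hypothetical name, and it only prints the command shown above rather than running it.

```shell
#!/usr/bin/env bash
# Print (do not run) the crsctl command that enables or disables
# automatic start of ora.crf, using the AUTO_START values shown above.
chm_autostart_cmd() {
    local value
    case $1 in
        disable) value=never ;;
        enable)  value=always ;;
        *) echo "usage: chm_autostart_cmd enable|disable" >&2; return 1 ;;
    esac
    printf 'crsctl modify resource ora.crf -attr "AUTO_START=%s" -init\n' "$value"
}

# Usage: review the printed command, then run it as root.
#   chm_autostart_cmd disable
```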