[Oracle 11g R2 (11.2.0.4.0)] Introduction to the cluster CHM tool, and how CHM node data collection fills up the Oracle root directory

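CHM (Cluster Health Monitor) is a tool provided by Oracle that automatically collects statistics on operating system resources (CPU, memory, swap, processes, I/O, network, and so on). Starting with version 11.2.0.2, CHM exists on every node of the cluster as the init resource ora.crf. CHM collects data once per second (once every 5 seconds in 11.2.0.3).
The system resource data that CHM collects is very helpful for diagnosing node reboots, hangs, instance evictions, and performance problems in a cluster. CHM can also be installed as a standalone tool on Linux and Windows.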
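CHM components:
1: CHM repository: by default it lives under <gi_home>/crf/db/<node name> and takes up 1 GB of disk space. CHM keeps data for at most 3 days, and the statistics collected for each node take up roughly 500 MB per day. The repository is a Berkeley DB database that stores the system information collected from every node.
2: System Monitor Service: runs as the osysmond.bin daemon on every node. osysmond.bin periodically collects operating system statistics on the local node and sends them to the Cluster Logger Service on the master node.
3: Cluster Logger Service: runs as the ologgerd daemon on the CHM master node and on the replica node (Replication Node).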
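Since the repository can end up filling the file system it lives on, a back-of-the-envelope estimate is useful before changing the retention. The sketch below simply multiplies the figures quoted above (about 500 MB per node per day, at most 3 days). The function name and the assumption that space grows linearly with node count and retention are mine, not Oracle's.

```python
# Rough estimate only: figures are the ones quoted in the text above.
MB_PER_NODE_PER_DAY = 500   # approximate daily volume per node
MAX_RETENTION_DAYS = 3      # maximum retention CHM allows


def estimate_repository_mb(nodes, retention_days):
    """Return the approximate repository size in MB.

    Assumes (hypothetically) that size scales linearly with the
    number of nodes and with the retention period.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if not 0 < retention_days <= MAX_RETENTION_DAYS:
        raise ValueError("retention_days must be in the range (0, 3]")
    return nodes * retention_days * MB_PER_NODE_PER_DAY


if __name__ == "__main__":
    # Two-node cluster keeping the full 3 days of data.
    print(estimate_repository_mb(nodes=2, retention_days=3), "MB")
```

Comparing the result with the free space on the file system that holds <gi_home> gives an early warning before the repository is resized.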
CHM data files

-rw-r-----. 1 root root   6373376 Mar 27 09:11 crfalert.bdb
-rw-r-----. 1 root root 168624128 Mar 27 09:11 crfclust.bdb
-rw-r-----. 1 root root      8192 Jan 31 19:31 crfconn.bdb
-rw-r-----. 1 root root   9863168 Mar 27 09:11 crfcpu.bdb
-rw-r-----. 1 root root   4186112 Mar 27 09:11 crfhosts.bdb
-rw-r-----. 1 root root   3964928 Mar 27 09:11 crfloclts.bdb
-rw-r-----. 1 root root   5058560 Mar 27 09:11 crfts.bdb
-rw-r-----. 1 root root     24576 Oct 16 09:46 __db.001
-rw-r-----. 1 root root    401408 Mar 27 09:11 __db.002
-rw-r-----. 1 root root   2629632 Mar 27 09:11 __db.003
-rw-r-----. 1 root root   2162688 Mar 27 09:11 __db.004
-rw-r-----. 1 root root   1187840 Mar 27 09:11 __db.005
-rw-r-----. 1 root root     57344 Mar 27 09:11 __db.006
-rw-r--r--. 1 root root 120000000 Jan 31 19:26 jhdb01.ldb
-rw-r-----. 1 root root  16777216 Mar 27 09:06 log.0000042022
-rw-r-----. 1 root root  16777216 Mar 27 09:11 log.0000042023
-rw-r-----. 1 root root      8192 Oct 16 09:46 repdhosts.bdb
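To see how much space files like these occupy in total, the sizes in the listing can be summed. The following is a minimal sketch, not part of CHM: the function name is hypothetical, and the path in the usage example is the default repository location shown later in this article, so adjust it for your environment.

```python
import os
import sys


def directory_size_bytes(path):
    """Sum the sizes of the regular files directly inside `path`.

    Subdirectories and symbolic links are skipped.
    """
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
    return total


if __name__ == "__main__":
    # Example path only; pass your own repository directory as an argument.
    repo = sys.argv[1] if len(sys.argv) > 1 else "/u01/app/11.2.0/grid/crf/db/host01"
    size = directory_size_bytes(repo)
    print(f"{repo}: {size / 1024 ** 2:.1f} MB")
```

Running it periodically (for example from cron) and comparing the result with the configured limit is one way to notice a repository that keeps growing.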

Basic operations:
Oracle provides two tools for accessing CHM data: oclumon and the CHM/OS Graphical User Interface (CHMOSG).

jhdb01-> oclumon manage -h 

MANAGE verb usage
=================
  manage [[-repos {resize <time>|changesize <memsize>|reploc <new_loc> [[-maxtime <time>]|
         [-maxspace <memsize>]] }]|[-get <key1> <key2>..]]

*Where
  -repos       = Required to specify Cluster Health Monitor repository related options 
  -get         = Fetch manage information for one or more named keys
  <key1> <key2>= <key> can be repsize, reppath, master, and replica
  resize       = Option for resizing Cluster Health Monitor repository
  <time>       = Size of Cluster Health Monitor repository in number of seconds
                 must be more than 3600 (1 hour) and less than 259200 (3 days)
  changesize   = Option for Change Cluster Health Monitor repository space limit
  <memsize>    = Size of Cluster Health Monitor repository in megabytes
  reploc       = Option for Change Repository Location
  <new_loc>    = Path to new directory e.g.: /opt/db
  -maxtime     = Option to specify Cluster Health Monitor repository size in terms of elapsed seconds of data capture for new location
  -maxspace    = Option to specify space limit for new Cluster Health Monitor repository location

*Requirements
  The local system monitor service must be running to resize the Cluster Health Monitor repository.
  The Cluster Logger Service must be running to resize Cluster Health Monitor repository.

*Example :
  manage -get MASTER REPLICA
  manage -repos resize 86400
  manage -repos changesize 6000
  manage -repos reploc /opt/oracrfdb
  manage -repos reploc /opt/oracrfdb -maxtime 86400
  manage -repos reploc /opt/oracrfdb -maxspace 6000

1. View the current repository path

 [oracle@host01$]oclumon manage -get reppath

CHM Repository Path = /u01/app/11.2.0/grid/crf/db/host01

 Done

2. View the current repository size (the retention, in seconds)

[oracle@host01$]oclumon manage -get repsize 

CHM Repository Size = 61646

 Done

3. Change the path

[oracle@host01$]oclumon manage -repos reploc /u01/app/11.2.0/grid/crf/
host01 --> Ready to commit new location
host02 --> Ready to commit new location
New retention is 61646 and will use 1073725369 bytes of disk space

CRS-9113-Cluster Health Monitor repository location change completed on all nodes. Restarting Loggerd.

 Done

After changing the path, ora.crf has to be restarted for the change to take effect.

4. Change the size

 oclumon manage -repos resize 60000
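The value passed to resize is a retention in seconds, which the help text above limits to between 3600 (1 hour) and 259200 (3 days). A small sketch for checking a value and converting it to hours before running the command follows. The function name is hypothetical, and accepting the two boundary values is my assumption, since the help text does not make clear whether they are allowed.

```python
# Limits taken from the `oclumon manage -h` output shown earlier.
MIN_RETENTION_SECONDS = 3600      # 1 hour
MAX_RETENTION_SECONDS = 259200    # 3 days


def retention_hours(seconds):
    """Validate a resize value and return it expressed in hours.

    Assumption: the boundary values themselves are accepted.
    """
    if not MIN_RETENTION_SECONDS <= seconds <= MAX_RETENTION_SECONDS:
        raise ValueError(
            f"retention must be between {MIN_RETENTION_SECONDS} "
            f"and {MAX_RETENTION_SECONDS} seconds, got {seconds}"
        )
    return seconds / 3600


if __name__ == "__main__":
    # 60000 is the value used in step 4, 61646 the value reported in step 2.
    for value in (60000, 61646):
        print(f"{value} s = {retention_hours(value):.1f} h")
```

Both values used in this article come out at roughly 17 hours, well below the 3-day maximum.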

Ways to get CHM data:

1. Use Grid_home/bin/diagcollection.pl

First, determine the master node of the cluster logger service:
oclumon manage -get master
Then run the following command as root on the master node:

[root@host02 ~]# /u01/app/11.2.0/grid/bin/diagcollection.pl -collect -shmos -incidenttime inc_time -incidentduration duration 
Production Copyright 2004, 2010, Oracle.  All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
Unknown option: shmos
The following CRS diagnostic archives will be created in the local directory.
crsData_host02_20180327_1516.tar.gz -> logs,traces and cores from CRS home. Note: core files will be packaged only with the --core option. 
ocrData_host02_20180327_1516.tar.gz -> ocrdump, ocrcheck etc 
coreData_host02_20180327_1516.tar.gz -> contents of CRS core files in text format

osData_host02_20180327_1516.tar.gz -> logs from Operating System
Collecting crs data
/bin/tar: log/host02/ctssd/octssd.log: file changed as we read it
Collecting OCR data 
Collecting information from core files
No corefiles found 
Collecting OS logs

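In the run above the option was mistyped as -shmos, so the tool reported "Unknown option: shmos" and no CHM archive appears in the list of files it created. The correct option is -chmos. incidenttime is the point in time from which data is collected, in the format MM/DD/YYYY24HH:MM:SS, and incidentduration is how much data to collect after that start time.
For example: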

diagcollection.pl -collect -crshome /u01/app/11.2.0/grid -chmoshome /u01/app/11.2.0/grid -chmos -incidenttime 06/15/201412:30:00 -incidentduration 00:05

After this command runs, the CHM data is written to the file chmosData_rac2_20140615_1237.tar.gz.
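The incident time has no separator between the date and the time, which is easy to get wrong by hand. The sketch below builds the two arguments from a Python datetime and timedelta. The function name is hypothetical, and the HH:MM form of the duration is inferred from the 00:05 example above rather than taken from Oracle documentation.

```python
from datetime import datetime, timedelta
from typing import List


def incident_args(start: datetime, duration: timedelta) -> List[str]:
    """Build the -incidenttime / -incidentduration arguments.

    The date and time are joined without a separator, matching the
    06/15/201412:30:00 example. Duration is rendered as HH:MM
    (assumed format, based on the 00:05 example).
    """
    incident_time = start.strftime("%m/%d/%Y%H:%M:%S")
    total_minutes = int(duration.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return [
        "-incidenttime", incident_time,
        "-incidentduration", f"{hours:02d}:{minutes:02d}",
    ]


if __name__ == "__main__":
    args = incident_args(datetime(2014, 6, 15, 12, 30, 0), timedelta(minutes=5))
    print(" ".join(args))
```

The resulting arguments can be appended to the diagcollection.pl command line in place of the hand-typed values.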

[root@host02 host02]# ll
total 120
drwxr-xr-x 2 root   root      4096 Mar 27 15:19 acfs
drwxr-xr-x 4 root   root      4096 Mar 27 15:19 agent
-rw-rw-r-- 1 oracle oinstall 12199 Mar 27 15:05 alerthost02.log
drwxr-xr-x 2 root   root      4096 Mar 27 15:19 client
drwxr-xr-x 2 root   root      4096 Mar 27 15:19 crflogd
drwxr-xr-x 2 root   root      4096 Mar 27 15:19 crfmond
drwxr-xr-x 2 root   root      4096 Mar 27 15:19 crsd
drwxr-xr-x 2 root   root      4096 Mar 27 15:19 cssd
drwxr-xr-x 2 root   root      4096 Mar 27 15:19 ctssd
drwxr-xr-x 2 root   root      4096 Mar 27 15:19 evmd
drwxr-xr-x 2 root   root      4096 Mar 27 15:19 gipcd
drwxr-xr-x 2 root   root      4096 Mar 27 15:19 gpnpd
drwxr-xr-x 2 root   root      4096 Mar 27 15:19 mdnsd
drwxr-xr-x 2 root   root      4096 Mar 27 15:19 ohasd

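As this shows, CHM collects node information and sends it to the cluster logger service, which in turn writes it to the Berkeley DB repository.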

2. Use oclumon to get CHM data
oclumon dumpnodeview [[-allnodes] | [-n node1 node2] [-last "duration"] | [-s "time_stamp" -e "time_stamp"] [-v] [-warning]] [-h]
(-s is the start time, -e is the end time)
For example:

oclumon dumpnodeview -allnodes -v -s "2012-06-15 07:40:00" -e "2012-06-15 07:57:00" > /tmp/chm1.txt
oclumon dumpnodeview -n node1 node2 -last "12:00:00" >/tmp/chm1.txt
oclumon dumpnodeview -allnodes -last "00:15:00" >/tmp/chm1.txt

Stopping and starting CHM (run as the grid user on each node)

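Stopping CHM this way is similar to a Linux service stop: it starts again automatically after a reboot. System I/O drops somewhat once it is stopped.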