11gR2 New Feature: Introduction to Oracle Cluster Health Monitor (CHM)
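The system resource data that CHM collects is very helpful for diagnosing node reboots, hangs, instance evictions, and performance problems in a cluster. Users can also use CHM to detect problems such as high system load or abnormal memory usage early, before they turn into something more serious.
CHM is installed automatically with the following software:
11.2.0.2 and later Oracle Grid Infrastructure for Linux (excluding Linux Itanium) and Solaris (Sparc 64 and x86-64)
11.2.0.3 and later Oracle Grid Infrastructure for AIX and Windows (excluding Windows Itanium).
In a cluster, the following command shows the status of the CHM resource (ora.crf):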
$ crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
Cluster Resources
ora.crf ONLINE ONLINE rac1
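CHM consists mainly of two services:
1). System Monitor Service (osysmond): this service runs on every node. osysmond sends each node's resource usage to the cluster logger service, which receives the data from all nodes and saves it to the CHM repository.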
$ ps -ef|grep osysmond
root 7984 1 0 Jun05 ? 01:16:14 /u01/app/11.2.0/grid/bin/osysmond.bin
2). Cluster Logger Service (ologgerd): within a cluster, ologgerd has one master node and one standby node. If ologgerd runs into a problem on the current node and cannot start, it is started on the standby node.
Master node:
$ ps -ef|grep ologgerd
root 8257 1 0 Jun05 ? 00:38:26 /u01/app/11.2.0/grid/bin/ologgerd -M -d /u01/app/11.2.0/grid/crf/db/rac2
Standby node:
$ ps -ef|grep ologgerd
root 8353 1 0 Jun05 ? 00:18:47 /u01/app/11.2.0/grid/bin/ologgerd -m rac2 -r -d /u01/app/11.2.0/grid/crf/db/rac1
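CHM Repository: stores the collected data. By default it lives under the Grid Infrastructure home and needs 1 GB of disk space; each node uses roughly 0.5 GB per day. You can use OCLUMON to change its location and the amount of space it may use (at most 3 days of data can be kept).
The following commands show the current settings: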
$ oclumon manage -get reppath
CHM Repository Path = /u01/app/11.2.0/grid/crf/db/rac2
Done
$ oclumon manage -get repsize
CHM Repository Size = 68082 <==== the unit is seconds
Done
Change the location:
$ oclumon manage -repos reploc /shared/oracle/chm
Change the size:
$ oclumon manage -repos resize 68083 <== between 3600 (1 hour) and 259200 (3 days)
rac1 --> retention check successful
New retention is 68083 and will use 1073750609 bytes of disk space
CRS-9115-Cluster Health Monitor repository size change completed on all nodes.
Done
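Because the repository "size" is really a retention time in seconds, it helps to convert it before picking a new value. The following is only a small sketch of that conversion; the function name and constants are made up for illustration, and the bounds are the 3600 and 259200 second limits quoted above.

```python
# Bounds quoted in the text above: 3600 s (1 hour) to 259200 s (3 days).
MIN_RETENTION_S = 3600
MAX_RETENTION_S = 259200


def describe_retention(seconds: int) -> str:
    """Return a human-readable form of a CHM retention value given in seconds."""
    if not MIN_RETENTION_S <= seconds <= MAX_RETENTION_S:
        raise ValueError(
            f"retention must be between {MIN_RETENTION_S} and {MAX_RETENTION_S} seconds"
        )
    hours = seconds / 3600
    return f"{seconds} s = {hours:.1f} h = {hours / 24:.2f} days"


# The value reported by "oclumon manage -get repsize" in the example above.
print(describe_retention(68082))
```

So the 68082 seconds shown earlier is a little under 19 hours of data, well short of the 3-day maximum.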
There are two ways to get the data that CHM generates:
1. The first is to use Grid_home/bin/diagcollection.pl:
1). First, identify the master node of the cluster logger service:
$ oclumon manage -get master
Master = rac2
2). As root, run the following command on the master node rac2:
# /bin/diagcollection.pl -collect -chmos -incidenttime inc_time -incidentduration duration
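inc_time is the time from which to collect data, in the format MM/DD/YYYYHH:MM:SS (24-hour clock, with no space between the date and the time); duration is how long a period after that start time to collect.
For example: # diagcollection.pl -collect -crshome /u01/app/11.2.0/grid -chmoshome /u01/app/11.2.0/grid -chmos -incidenttime 06/15/201215:30:00 -incidentduration 00:05
3). After the command runs, the CHM data is written to the file chmosData_rac2_20120615_1537.tar.gz.
2. The other way to get the data that CHM generates is oclumon:
$ oclumon dumpnodeview [[-allnodes] | [-n node1 node2] [-last "duration"] | [-s "time_stamp" -e "time_stamp"] [-v] [-warning]] [-h]
-s is the start time and -e is the end time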
$ oclumon dumpnodeview -allnodes -v -s "2012-06-15 07:40:00" -e "2012-06-15 07:57:00" > /tmp/chm1.txt
$ oclumon dumpnodeview -n node1 node2 node3 -last "12:00:00" >/tmp/chm1.txt
$ oclumon dumpnodeview -allnodes -last "00:15:00" >/tmp/chm1.txt
Below is part of the contents of /tmp/chm1.txt:
----------------------------------------
Node: rac1 Clock: '06-15-12 07.40.01' SerialNo:168880
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 17.96 cpuq: 5 physmemfree: 32240 physmemtotal: 2065856 mcache: 1064024 swapfree: 3988376 swaptotal: 4192956 ior: 57 iow: 59 ios: 10 swpin: 0 swpout: 0 pgin: 57 pgout: 59 netr: 65.767 netw: 34.871 procs: 183 rtprocs: 10 #fds: 4902 #sysfdlimit: 6815744 #disks: 4 #nics: 3 nicErrors: 0
TOP CONSUMERS:
topcpu: 'mrtg(32385) 64.70' topprivmem: 'ologgerd(8353) 84068' topshm: 'oracle(8760) 329452' topfd: 'ohasd.bin(6627) 720' topthread: 'crsd.bin(8235) 44'
PROCESSES:
name: 'mrtg' pid: 32385 #procfdlimit: 65536 cpuusage: 64.70 privmem: 1160 shm: 1584 #fd: 5 #threads: 1 priority: 20 nice: 0
name: 'oracle' pid: 32381 #procfdlimit: 65536 cpuusage: 0.29 privmem: 1456 shm: 12444 #fd: 32 #threads: 1 priority: 15 nice: 0
...
name: 'oracle' pid: 8756 #procfdlimit: 65536 cpuusage: 0.0 privmem: 2892 shm: 24356 #fd: 47 #threads: 1 priority: 16 nice: 0
----------------------------------------
Node: rac2 Clock: '06-15-12 07.40.02' SerialNo:168878
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 40.72 cpuq: 8 physmemfree: 34072 physmemtotal: 2065856 mcache: 1005636 swapfree: 3991808 swaptotal: 4192956 ior: 54 iow: 104 ios: 11 swpin: 0 swpout: 0 pgin: 54 pgout: 104 netr: 77.817 netw: 33.008 procs: 178 rtprocs: 10 #fds: 4948 #sysfdlimit: 6815744 #disks: 4 #nics: 4 nicErrors: 0
TOP CONSUMERS:
topcpu: 'orarootagent.bi(8490) 1.59' topprivmem: 'ologgerd(8257) 83108' topshm: 'oracle(8873) 324868' topfd: 'ohasd.bin(6744) 720' topthread: 'crsd.bin(8362) 47'
PROCESSES:
name: 'oracle' pid: 9040 #procfdlimit: 65536 cpuusage: 0.19 privmem: 6040 shm: 121712 #fd: 33 #threads: 1 priority: 16 nice: 0
...
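A dump like the one above is easier to scan once the "name: value" pairs on the SYSTEM line are turned into numbers. The sketch below is one possible way to do that, not an Oracle-provided tool; the regular expression and the function name are assumptions, and the sample string is a shortened copy of the rac1 SYSTEM line shown above.

```python
import re

# Matches "name: value" pairs such as "cpu: 17.96" or "#cpus: 1".
PAIR = re.compile(r"(#?\w+):\s+(-?\d+(?:\.\d+)?)")


def parse_system_line(line: str) -> dict:
    """Turn one SYSTEM line of dumpnodeview output into {metric name: float}."""
    return {name: float(value) for name, value in PAIR.findall(line)}


# Shortened copy of the rac1 SYSTEM line from the dump above.
sample = (
    "#cpus: 1 cpu: 17.96 cpuq: 5 physmemfree: 32240 physmemtotal: 2065856 "
    "swapfree: 3988376 swaptotal: 4192956"
)
metrics = parse_system_line(sample)
free_pct = 100 * metrics["physmemfree"] / metrics["physmemtotal"]
print(f"cpu={metrics['cpu']} cpuq={metrics['cpuq']:.0f} physmemfree={free_pct:.1f}%")
```

Computing the free-memory percentage this way makes it obvious that rac1 had very little free physical memory at that sample.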
For more about CHM, see the official Oracle documentation:
http://docs.oracle.com/cd/E11882_01/rac.112/e16794/troubleshoot.htm#CWADD92242
Oracle® Clusterware Administration and Deployment Guide
11g Release 2 (11.2)
Part Number E16794-17
or the My Oracle Support document:
Cluster Health Monitor (CHM) FAQ (Doc ID 1328466.1)
APPLIES TO:
Oracle Database - Enterprise Edition - Version 10.1.0.2 to 12.1.0.2 [Release 10.1 to 12.1]
Information in this document applies to any platform.
PURPOSE
The Cluster Health Monitor FAQ is an evolving document that answers common questions about the Cluster Health Monitor.
QUESTIONS AND ANSWERS
What is the Cluster Health Monitor?
What is the purpose of the Cluster Health Monitor?
By monitoring the data constantly, users can use the Cluster Health Monitor to detect potential problem areas such as CPU load, memory constraints, and spinning processes before the problem causes an unwanted outage.
What platform does Cluster Health Monitor support and where can I get the Cluster Health Monitor?
The Cluster Health Monitor is an integrated part of 11.2.0.2 Oracle Grid Infrastructure for Linux (not on Linux Itanium and IBM Linux Z) and Solaris (Sparc 64 and x86-64 only), so installing 11.2.0.2 Oracle Grid Infrastructure on those platforms will automatically install the Cluster Health Monitor. AIX will have the Cluster Health Monitor starting from 11.2.0.3. The Cluster Health Monitor is also enabled for Windows (except Windows Itanium) in 11.2.0.3.
Prior to 11.2.0.2 on Linux (not on Linux Itanium and IBM Linux Z), the Cluster Health Monitor can be downloaded from OTN.
http://www-content.oracle.com/technetwork/products/clustering/downloads/ipd-download-homepage-087212.html
The OTN version for Windows is not available. Please upgrade to 11.2.0.3 if you need CHM for Windows.
What is the resource name for Cluster Health Monitor in 11.2.0.2 or higher?
Is stop/start ora.crf affecting clusterware function or cluster database function?
Can the Cluster Health Monitor be installed on a single node, non-RAC server?
Do Engineered Systems like Exadata have a default usage with CHM and if so, any specific version??
Where is oclumon?
If the CHM is manually installed using the CHM file from OTN, then the location of oclumon is in:
Linux : /usr/lib/oracrf/bin
Windows : C:\Program Files\oracrf\bin
How do I collect the Cluster Health Monitor data?
For example, issue “/bin/diagcollection.pl --collect --crshome $ORA_CRS_HOME --chmos --incidenttime --incidentduration 05:00”
The above outputs the report that covers 5 hours from the time specified by incidenttime.
The incidenttime must be in MM/DD/YYYYHH:MN:SS where MM is month, DD is date, YYYY is year, HH is hour in 24 hour format, MN is minute, and SS is second. For example, if you want to put the incident time to start from 10:15 PM on June 01, 2011, the incident time is 06/01/201122:15:00. The incidenttime and incidentduration can be changed to capture more data.
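The unusual part of this format is that the date and the time are joined with no separator. As a small illustration (the helper name is hypothetical), Python's strftime can produce the string directly:

```python
from datetime import datetime


def incident_time_arg(moment: datetime) -> str:
    """Format a datetime as MM/DD/YYYYHH:MN:SS, with no separator between date and time."""
    return moment.strftime("%m/%d/%Y%H:%M:%S")


# 10:15 PM on June 01, 2011, the example used in the text above.
print(incident_time_arg(datetime(2011, 6, 1, 22, 15, 0)))  # prints 06/01/201122:15:00
```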
Alternatively, use 'oclumon dumpnodeview -allnodes -v -last "11:59:59" > your-filename' if diagcollection.pl fails for any reason. This will generate a report from the repository covering up to the last 12 hours. The -last value can be changed to get more or less data.
Another example of using oclumon is 'oclumon dumpnodeview -allnodes -v -s "2012-06-01 22:15:00" -e "2012-06-02 03:15:00" > /tmp/chm.log '. The difference in this command is that it specifies the start (-s flag) and end time (-e flag).
In this case, the time format used is "YYYY-MM-DD HH24:MI:SS" like "2007-11-12 23:05:00".
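When the incident time and the length of the window are known, the -s and -e values can be derived instead of typed by hand. This is only a sketch; dump_window_args is a made-up helper, and the final list simply mirrors the oclumon command quoted above without running it.

```python
from datetime import datetime, timedelta


def dump_window_args(start: datetime, length: timedelta) -> list:
    """Build the -s/-e arguments ("YYYY-MM-DD HH24:MI:SS") for a window of the given length."""
    fmt = "%Y-%m-%d %H:%M:%S"
    end = start + length
    return ["-s", start.strftime(fmt), "-e", end.strftime(fmt)]


# Five hours starting at 22:15 on 2012-06-01, as in the example above.
args = dump_window_args(datetime(2012, 6, 1, 22, 15, 0), timedelta(hours=5))
print(["oclumon", "dumpnodeview", "-allnodes", "-v"] + args)
```

Building the arguments as a list keeps the space inside each timestamp intact, which matters if the list is later passed to something like subprocess.run.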
Why does “diagcollection.pl --collect --chmos” return “Cannot parse master from output: ERROR : in reading init file” error?
The workaround for this is to issue
oclumon dumpnodeview -allnodes -v -last "amount of data needed"
For example, oclumon dumpnodeview -allnodes -v -last "01:00:00"
will provide the last one hour of data from all nodes.
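The -last value is a duration written as "HH:MM:SS". If the amount of data needed is computed in a script, a helper along these lines (the name is hypothetical) can format it:

```python
from datetime import timedelta


def last_arg(duration: timedelta) -> str:
    """Format a duration as the "HH:MM:SS" string used with -last."""
    total = int(duration.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(last_arg(timedelta(hours=1)))  # prints 01:00:00
```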
How do you get the syntax of different options and explanations for those options for diagcollection.pl and oclumon?
What is IPD/OS?
How is the Cluster Health Monitor different from OSWatcher?
Is the Cluster Health Monitor replacing OSWatcher?
On the other hand, if only one of the tools can be used, then Oracle recommends that the Cluster Health Monitor is used.
How much of overhead does the Cluster Health Monitor cause?
Does CHM on Multiple Node configurations (e.g. 4 to 8 nodes) have scaling concerns?
Will CDB and PDB result in any new information or special conditions using CHM?
How much of disk space is needed for the Cluster Health Monitor?
How do I find out the size of data collected and saved by the Cluster Health Monitor in my system?
To estimate the space required, use the following formula:
# of nodes * 720MB * 3 = Size required for 3 days retention
e.g. for a 4-node cluster: 4 * 720 * 3 = 8,640MB (8.4GB)
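The same estimate for a few cluster sizes, as a quick sketch. The 720 MB per node per day figure comes straight from the formula above; the function name is made up.

```python
MB_PER_NODE_PER_DAY = 720  # figure taken from the formula above


def repository_mb(nodes: int, retention_days: int = 3) -> int:
    """Estimate the CHM repository size in MB for a given node count and retention."""
    return nodes * MB_PER_NODE_PER_DAY * retention_days


for nodes in (2, 4, 8):
    mb = repository_mb(nodes)
    print(f"{nodes} nodes: {mb} MB ({mb / 1024:.1f} GB)")
```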
How can I increase the size of the Cluster Health Monitor repository ?
What platforms can I run the Cluster Health Monitor?
11.2.0.2: Solaris (Sparc 64 and x86-64 only), and Linux.
11.2.0.3: AIX, Solaris (Sparc 64 and x86-64 only), Linux, and Windows.
Cluster Health Monitor is NOT available for any Itanium platform such as Linux Itanium and Windows Itanium.
What steps are needed to install 11.2.0.2 when the Cluster Health Monitor from OTN is already running?
Where is the Cluster Health Monitor from OTN installed on Linux?
What logs and data should I gather before logging a SR for the Cluster Health Monitor error?
2) output of strace -v for osysmond.bin about 2 minutes.
3) strace -cp for about 2 min
4) oclumon dumpnodeview -v output for that node for 2 min.
5) output of "uname -a"
6) output of "ps -eLf | grep osysmond.bin"
7) The ologgerd and sysmond log files in the CRS_HOME/log/ directory from all nodes
How do I increase the trace level of the Cluster Health Monitor?
oclumon debug log all allcomp:
The higher the trace level, the more detailed the tracing, so do not forget to reset the trace level back to 1 (the trace level when the CHM is first installed) by issuing "oclumon debug log all allcomp:1".
Can I use procwatcher to get the pstack of the Cluster Health Monitor regularly?
What are the processes and components for the Cluster Health Monitor?
System Monitor Service (Sysmond) – the sysmond process collects the system statistics of the local node and sends the data to the master ologgerd. A sysmond process runs on every node and collects the system statistics including CPU, memory usage, platform info, disk info, nic info, process info, and filesystem info.
To find the master ologgerd, one can use the following command:
oclumon manage -get master
What is oclumon?
You can also use oclumon to query and print the durations and the states for a resource on a node during a specified time period. These states are based on predefined thresholds for each resource metric and are denoted as red, orange, yellow, and green, indicating decreasing order of criticality.
What is definition of some of the files like *.bdb, _db.* , *.ldb , log.* files created by tool in the BDB (Berkeley Database) location directory ?
log.* - These are Berkeley DB log files, which record changes before they are applied to the db files. Checkpointing is set up, so the log files are reused.
*.ldb - This is the local logging file and MUST be present on all servers.
Do not delete the above files unless you are trying to reduce the size of a bdb file that has grown too large. To reduce the size of the bdb file, refer to the question "How can you reduce the size of bdb file that became big for any reason?" in this document.
Because it takes many days / weeks to resolve a problem like the node reboot or performance degradation, is there any way to keep the Cluster Health Monitor data for that long so that it can be replayed any time later when needed ?
Before 12.1.0.2, another way is to archive the whole BDB regularly (like every day) by making a copy of BDB file in the BDB location directory.
The way that CHMOS reads archived BDB is to start it in debug mode. It starts by using
ologdbg -d
After it starts, issue the oclumon dumpnodeview to get the data from the archived BDB.
For example, issue
oclumon dumpnodeview -n -s -e -v
Where is the location for the log files for the Cluster Health Monitor from OTN (pre 11.2.0.2)?
How do I fix the problem that the time in the oclumon report is in UTC time zone instead of the time zone of my server?
Can I install CHM from OTN on 11.2.0.2? What if I stop and disable CHM resource (ora.crf) on 11.2.0.2?
Where is the trace file for client like oclumon? How do I increase the trace level for oclumon?
Generally it is not generated because, at log level 0, there is no log data.
To see logs at a higher log level, do the following:
1. oclumon [Enter the interactive mode]
2. query> debug log all allcomp:3
After this, any command execution will produce finer logs in oclumon.log
Can the Directory path to the CHM Repository be same on all nodes if shared storage is used?
How much of data (how long in time) does the node store CHM data locally when it cannot communicate with the master?
With a sampling interval of 1 second, ideally it will be around 1 hour of data. With 11.2.0.3, we have moved to sampling interval of 5 seconds, hence, in that case the data that can be retained is 4-5 hours of data.
How often does CHM collect the system metric data? Can this be changed?
Currently, the collection interval cannot be changed.
What is the CHM retention time?
In 11.2.0.2, the retention time is determined by the size. The size has changed to 1GB. Depending on how large the cluster is, the retention time is different. For example, it is usually 6.9 hours for a one-node cluster when sampling interval is 1 second. Please issue "oclumon manage -get repsize" to find out the retention time of your cluster. The output is in seconds.
With sampling interval moving to 5 seconds in 11.2.0.3, the retention time becomes 5 times retention time with sampling interval 1 second.
It is recommended to set a 72-hour retention time.
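To make the arithmetic concrete: with the repository size held fixed, the text says retention grows in proportion to the sampling interval. The sketch below just applies that ratio to the 6.9-hour one-node figure quoted above; the function name is hypothetical, and the linear scaling is taken from the statement above rather than measured.

```python
def scaled_retention_hours(base_hours: float, base_interval_s: int, new_interval_s: int) -> float:
    """Scale a retention time by the ratio of sampling intervals (repository size unchanged)."""
    return base_hours * new_interval_s / base_interval_s


# 6.9 hours at a 1-second interval, moved to the 5-second interval of 11.2.0.3.
hours = scaled_retention_hours(6.9, 1, 5)
print(f"{hours:.1f} hours")
```

Even after the change to 5-second sampling, the result for the default 1GB repository is still well below 72 hours, which is why resizing is recommended.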
How can you reduce the size of bdb file that became big for any reason?
oclumon manage -repos changesize .
As a temporary workaround, you can kill ologgerd and delete the contents of the BDB directory. osysmond should respawn ologgerd and a new bdb file will be created. The past data is lost when this is done.
Please note the minimum size must be >= 1024 MB (1 GB), otherwise CRS-9100 "Error setting Cluster Health Monitor repository size" will be reported.
Can you set up CHM to run locally on each node?
The Cluster Health Monitor that comes with the Grid Infrastructure install image must run with only one master ologgerd, so it cannot be set up to run locally on each node.
Can CHM be used on a single node non-RAC server?
How to start and stop CHM that is installed as a part of GI in 11.2 and higher?
To stop CHM (or ora.crf resource managed by ohasd)
$GRID_HOME/bin/crsctl stop res ora.crf -init
To start CHM (or ora.crf resource managed by ohasd)
$GRID_HOME/bin/crsctl start res ora.crf -init
Database - RAC/Scalability Community
To discuss this topic further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Database - RAC/Scalability Community.
How to relocate CHM repository and increase retention time (Doc ID 2062234.1)
APPLIES TO:
Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.
GOAL
CHM data often ages out if it is not collected in time. This note provides steps to increase the retention time, which is strongly recommended.
SOLUTION
11.2
In 11.2, the CHM repository is in the Grid home. To change the retention time:
$ /bin/oclumon manage -repos resize 259200
racnode1 --> retention check successful
racnode2 --> retention check successful
New retention is 259200 and will use 4525424640 bytes of disk space
CRS-9115-Cluster Health Monitor repository size change completed on all nodes.
Done
Note: the command line specifies for how many seconds to retain the data and it's recommended to be at least 259200 which is 3 days.
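Before committing to a new retention, it is worth checking how much disk the reported byte count really is. The sketch below pulls the two numbers out of the "New retention is ..." line shown above; the regular expression and function name are illustrative assumptions, not part of oclumon.

```python
import re

# Pattern for the status line printed by the resize command shown above.
LINE = re.compile(r"New retention is (\d+) and will use (\d+) bytes of disk space")


def parse_resize_line(text: str):
    """Return (retention_seconds, bytes_needed), or None if the line does not match."""
    match = LINE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


seconds, size = parse_resize_line(
    "New retention is 259200 and will use 4525424640 bytes of disk space"
)
print(f"{seconds / 86400:.0f} days, {size / 2**20:.0f} MiB")
```

For the two-node example above this works out to roughly 4.2 GiB, which is the space that must be free in the Grid home or in the relocated directory.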
In case there's insufficient amount of space in Grid home, relocate CHM data with the following command:
$ /bin/oclumon manage -repos reploc /home/grid/chm
racnode1 --> Ready to commit new location
racnode2 --> Ready to commit new location
New retention is 259200 and will use 4525424640 bytes of disk space
CRS-9113-Cluster Health Monitor repository location change completed on all nodes. Restarting Loggerd.
Done
12.1
In 12c, the CHM repository is the GIMR, which is a database, so only the retention time can be changed. To change the retention time:
1. Check how much space is needed for the expected retention time:
The Cluster Health Monitor repository is too small for the desired retention. Please first resize the repository to 3896 MB
Note: the command line specifies for how many seconds to retain the data and it's recommended to be at least 259200 which is 3 days. The output tells that the repository needs to be at least 3896 MB for 3 days.
2. Change the repository size:
The Cluster Health Monitor repository was successfully resized. The new retention is 259200 seconds.
REFERENCES
NOTE:1589394.1 - How to Move/Recreate GI Management Repository to Different Shared Storage (Diskgroup, CFS or NFS etc)
About this article
● Compiled from online sources and from the study notes of the author (小麦苗, xiaomaimiaolhr).
● Also published on itpub (http://blog.itpub.net/26736162) and cnblogs (http://www.cnblogs.com/lhrbest).
● Written between 2017-06-02 and 2017-06-30.
● Source: ITPUB blog, http://blog.itpub.net/26736162/viewspace-2132364/. Please credit the source when reposting.