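Today the business side reported that they could not log in to one of the hosts. The environment is Oracle RAC; only one node had rebooted, and the other node was running normally. A colleague had already been working on the problem for a while before I joined, and as a result the relevant part of ocssd.log was not saved. (Logs such as ohasd.log and ocssd.log grow quickly and, unlike the database alert log, ocssd.log is rotated and overwritten. If the cluster goes down at 9:00 and you only look at the log at 10:00, the 9:00 entries may already be gone and only recent entries remain. That is why I have no ocssd.log entries from the time of the failure.) My colleague did confirm that, around the time the host rebooted, ocssd.log contained no messages about a network problem leading to node eviction or split-brain.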
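(1) First, log in to the host and check when it went down and rebooted.
[root@zjhzbjwgzhzg01 cssd]# last reboot | head -1 -- this shows a reboot time of 08:52, but that is when the host came back up; the host actually went down some minutes before 08:52.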
reboot system boot 2.6.32-431.el6.x Wed Sep 12 08:52 - 10:30 (01:38)
(2) Because the host rebooted, check the operating system log next.
Operating system log:
Sep 12 08:24:19 zjhzbjwgzhzg01 kernel: __ratelimit: 16 callbacks suppressed
Sep 12 08:24:19 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged -- well before the reboot, the Red Hat Linux host was already reporting hardware errors.
Sep 12 08:24:23 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:25:27 zjhzbjwgzhzg01 kernel: __ratelimit: 38 callbacks suppressed
Sep 12 08:25:27 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:25:27 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:26:00 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 595c01f000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:26:00 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:26:00 zjhzbjwgzhzg01 mcelog: Offlining page 595c01f000
Sep 12 08:26:28 zjhzbjwgzhzg01 kernel: __ratelimit: 39 callbacks suppressed
Sep 12 08:26:28 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:26:28 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:27:31 zjhzbjwgzhzg01 kernel: __ratelimit: 88 callbacks suppressed
Sep 12 08:27:31 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:27:31 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:27:42 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 58dc5bf000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:42 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:27:42 zjhzbjwgzhzg01 mcelog: Offlining page 58dc5bf000
Sep 12 08:27:42 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 591dd3f000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:42 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:27:42 zjhzbjwgzhzg01 mcelog: Offlining page 591dd3f000
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 599c859000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Offlining page 599c859000
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 599c858000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Offlining page 599c858000
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Offlining page 599c858000 failed: Device or resource busy
Sep 12 08:27:43 zjhzbjwgzhzg01 kernel: soft offline: 0x599c858 page already poisoned
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 599c815000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Offlining page 599c815000
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Offlining page 599c815000 failed: Device or resource busy
Sep 12 08:27:43 zjhzbjwgzhzg01 kernel: soft offline: 0x599c815 page already poisoned
Sep 12 08:27:44 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 599cc34000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:44 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:27:44 zjhzbjwgzhzg01 mcelog: Offlining page 599cc34000
Sep 12 08:27:46 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 589d81d000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:46 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:27:46 zjhzbjwgzhzg01 mcelog: Offlining page 589d81d000
Sep 12 08:27:53 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 599dd34000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:53 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:35:06 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:35:06 zjhzbjwgzhzg01 mcelog: Offlining page 58ddd9c000
Sep 12 08:36:10 zjhzbjwgzhzg01 kernel: __ratelimit: 50 callbacks suppressed
Sep 12 08:36:10 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:36:11 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:36:45 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 591d45f000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:36:45 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:36:45 zjhzbjwgzhzg01 mcelog: Offlining page 591d45f000
Sep 12 08:37:07 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 595c4df000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:37:07 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:37:07 zjhzbjwgzhzg01 mcelog: Offlining page 595c4df000
Sep 12 08:37:14 zjhzbjwgzhzg01 kernel: __ratelimit: 61 callbacks suppressed
Sep 12 08:37:14 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:37:14 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:37:25 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 589d0ff000 exceed threshold 10 in 24h: 10 in 24h -- these are memory errors
Sep 12 08:37:25 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:37:25 zjhzbjwgzhzg01 mcelog: Offlining page 589d0ff000
Sep 12 08:38:14 zjhzbjwgzhzg01 kernel: __ratelimit: 55 callbacks suppressed
Sep 12 08:38:14 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:38:14 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:39:15 zjhzbjwgzhzg01 kernel: __ratelimit: 72 callbacks suppressed
Sep 12 08:39:15 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:39:16 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:39:56 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 589c93f000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:39:56 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:39:56 zjhzbjwgzhzg01 mcelog: Offlining page 589c93f000
Sep 12 08:40:19 zjhzbjwgzhzg01 kernel: __ratelimit: 56 callbacks suppressed
Sep 12 08:40:19 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:40:19 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:40:22 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 589ccff000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:40:22 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:40:22 zjhzbjwgzhzg01 mcelog: Offlining page 589ccff000
Sep 12 08:40:25 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 595cd9f000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:40:25 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:40:25 zjhzbjwgzhzg01 mcelog: Offlining page 595cd9f000
Sep 12 08:41:00 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 5c9cdb6000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:41:00 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:41:00 zjhzbjwgzhzg01 mcelog: Offlining page 5c9cdb6000 -- the log stops here: there are no entries from 08:42 through 08:51, so the host was down during that window. The failure time can therefore be placed at about 08:40. When reading ohasd.log, crsd.log and ocssd.log, focus on the period just before 08:40 instead of wading through most of the log at random.
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Sep 12 08:52:53 zjhzbjwgzhzg01 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="7988" x-info="http://www.rsyslog.com"] start
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: Initializing cgroup subsys cpuset
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: Initializing cgroup subsys cpu
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: Linux version 2.6.32-431.el6.x86_64 (mockbuild@x86-023.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Sun Nov 10 22:19:54 EST 2013
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: Command line: ro root=UUID=40b7f075-d922-4995-aebe-de61630e7037 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet nohz=0ff intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll transparent_hugepage=never
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: KERNEL supported cpus:
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: Intel GenuineIntel
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: AMD AuthenticAMD
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: Centaur CentaurHauls
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: BIOS-provided physical RAM map:
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: BIOS-e820: 0000000000000000 - 000000000009a000 (usable)
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: BIOS-e820: 000000000009a000 - 00000000000a0000 (reserved)
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: BIOS-e820: 0000000000100000 - 0000000075507000 (usable)
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: BIOS-e820: 0000000075507000 - 0000000075cbb000 (reserved)
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: BIOS-e820: 0000000075cbb000 - 0000000075dbb000 (ACPI data)
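The "find where the log goes silent" step can also be done mechanically instead of by paging through the file. The following is only a minimal sketch, not part of the original investigation: it assumes the classic syslog timestamp layout seen in the excerpt (`Sep 12 08:41:00 ...`), an English locale for month names, and that the year (which syslog does not record) is passed in by hand. The function names `parse_syslog_time` and `largest_gap` are made up for illustration.

```python
from datetime import datetime

def parse_syslog_time(line, year=2018):
    """Parse the leading 'Sep 12 08:41:00' stamp; return None if the line has none."""
    try:
        # Syslog lines carry no year, so it is supplied by the caller (assumption).
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def largest_gap(lines, year=2018):
    """Return (seconds, last_entry_before, first_entry_after) for the longest silence."""
    best = None
    prev = None
    for line in lines:
        ts = parse_syslog_time(line, year)
        if ts is None:
            continue  # e.g. a wrapped continuation line without a timestamp
        if prev is not None:
            seconds = (ts - prev).total_seconds()
            if best is None or seconds > best[0]:
                best = (seconds, prev, ts)
        prev = ts
    return best

# Lines taken from the excerpt above.
sample = [
    "Sep 12 08:40:25 zjhzbjwgzhzg01 mcelog: Offlining page 595cd9f000",
    "Sep 12 08:41:00 zjhzbjwgzhzg01 mcelog: Offlining page 5c9cdb6000",
    "Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: imklog 5.8.10, log source = /proc/kmsg started.",
    "Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: Initializing cgroup subsys cpuset",
]
gap = largest_gap(sample)
print(gap)
```

On the sample lines the longest silence runs from 08:41:00 to 08:52:53, which matches the outage window read off by eye. For a real file, the lines could be fed in with `open("/var/log/messages")`, assuming that is where the host writes its syslog.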
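(3) Check the database alert log on the node. (Based on my colleague's review of ocssd.log, this was not a network-induced split-brain reboot, so the network can be ruled out as the cause of the host reboot. The OS log above shows hardware and memory errors, which may be the cause, but the problem could also have come from the database layer, so the DB log is worth checking.)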
Errors in file /opt/oracle/oracle/diag/rdbms/sdh/sdh1/trace/sdh1_j003_91488.trc:
ORA-12012: error on auto execute of job 90926
ORA-01403: no data found
ORA-06512: at "IRM.PKG_DATACOMPARE_TASKEXECUTE", line 11
ORA-06512: at line 1
Errors in file /opt/oracle/oracle/diag/rdbms/sdh/sdh1/trace/sdh1_j002_91432.trc:
ORA-12012: error on auto execute of job 90927
ORA-01403: no data found
ORA-06512: at "IRM.PKG_DATACOMPARE_TASKEXECUTE", line 11
ORA-06512: at line 1
Wed Sep 12 08:31:39 2018
Errors in file /opt/oracle/oracle/diag/rdbms/sdh/sdh1/trace/sdh1_j004_92891.trc:
ORA-12012: error on auto execute of job 90929
ORA-01403: no data found
ORA-06512: at "IRM.PKG_DATACOMPARE_TASKEXECUTE", line 11
ORA-06512: at line 1
Wed Sep 12 08:38:45 2018
Thread 1 advanced to log sequence 338086 (LGWR switch)
Current log# 1 seq# 338086 mem# 0: +SDH_SYS_DG/sdh/onlinelog/group_1.329.879523497
Wed Sep 12 08:38:45 2018
LNS: Standby redo logfile selected for thread 1 sequence 338086 for destination LOG_ARCHIVE_DEST_2
Wed Sep 12 08:38:50 2018 -- there are no alert log entries between 08:38 and 09:14; the database was down during this period
Archived Log entry 1126324 added for thread 1 sequence 338085 ID 0xd26cc373 dest 1:
Wed Sep 12 09:14:56 2018
Starting ORACLE instance (normal) -- my colleague started the database manually
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED
Large page usage restricted to processor group "sys"
Total Shared Global Region in Large Pages = 200 GB (100%)
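WARNING: -- a memory-related warning appears again at database startup; note that this one concerns the large-page configuration (_linux_prepage_large_pages), whereas the host earlier reported hardware memory errors.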
The parameter _linux_prepage_large_pages is explicitly disabled.
Oracle strongly recommends setting the _linux_prepage_large_pages
parameter since the instance is running in a Processor Group. If there is
insufficient large page memory, instance may encounter SIGBUS error
and may terminate abnormally.
Large Pages used by this instance: 102401 (200 GB)
Large Pages unused in Processor Group sys = 145081 (283 GB)
Large Pages configured in Processor Group sys = 153496 (300 GB)
Large Page size = 2048 KB
********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 88
Number of processor cores in the system is 44
Number of processor sockets in the system is 4
Private Interface 'Bond1:1' configured from GPnP for use as a private interconnect.
[name='Bond1:1', type=1, ip=169.254.185.193, mac=a0-00-01-00-fe-80-00-00-00-00-00-00-64-3e-00-00-00-00-00-00, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'vlan304' configured from GPnP for use as a public interface.
[name='vlan304', type=1, ip=10.212.252.84, mac=e0-97-96-06-e7-a5, net=10.212.252.80/28, mask=255.255.255.240, use=public/1]
Public Interface 'vlan304:1' configured from GPnP for use as a public interface.
[name='vlan304:1', type=1, ip=10.212.252.87, mac=e0-97-96-06-e7-a5, net=10.212.252.80/28, mask=255.255.255.240, use=public/1]
Picked latch-free SCN scheme 3
Autotune of undo retention is turned off.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NUMA system with 4 nodes detected
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /opt/oracle/oracle/product/11.2.0/db_1
System name: Linux
Node name: zjhzbjwgzhzg01
Release: 2.6.32-431.el6.x86_64
Version: #1 SMP Sun Nov 10 22:19:54 EST 2013
Machine: x86_64
Using parameter settings in server-side pfile /opt/oracle/oracle/product/11.2.0/db_1/dbs/initsdh1.ora
System parameters with non-default values:
processes = 3000
sessions = 4576
event = "10949 trace name context forever, level 1"
sga_max_size = 200G
shared_pool_size = 20G
large_pool_size = 3584M
java_pool_size = 3584M
streams_pool_size = 512M
spfile = "+SDH_SYS_DG/sdh/spfilesdh.ora
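The same gap check works for the alert log, whose timestamps sit on their own lines in the form `Wed Sep 12 08:38:45 2018`. The sketch below is an illustration under that assumption (11g-style alert log, English locale); the helper names `alert_time` and `find_silences` and the 10-minute threshold are hypothetical choices, not anything Oracle provides.

```python
from datetime import datetime, timedelta

def alert_time(line):
    """Parse an 11g-style alert log timestamp line; return None for other lines."""
    try:
        # Only the first 24 characters are used, so trailing annotations are ignored.
        return datetime.strptime(line[:24], "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None

def find_silences(lines, threshold):
    """Return (last_seen, next_seen) pairs where the log was silent longer than threshold."""
    silences = []
    prev = None
    for line in lines:
        ts = alert_time(line)
        if ts is None:
            continue  # message text, not a timestamp line
        if prev is not None and ts - prev > threshold:
            silences.append((prev, ts))
        prev = ts
    return silences

# Lines taken from the alert log excerpt above.
sample = [
    "Wed Sep 12 08:31:39 2018",
    "ORA-12012: error on auto execute of job 90929",
    "Wed Sep 12 08:38:45 2018",
    "Thread 1 advanced to log sequence 338086 (LGWR switch)",
    "Wed Sep 12 08:38:50 2018",
    "Archived Log entry 1126324 added for thread 1 sequence 338085 ID 0xd26cc373 dest 1:",
    "Wed Sep 12 09:14:56 2018",
    "Starting ORACLE instance (normal)",
]
gaps = find_silences(sample, timedelta(minutes=10))
for before, after in gaps:
    print(before, "->", after)
```

A silence that ends in `Starting ORACLE instance` is a strong hint that the instance was down for that interval rather than merely idle.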
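(4) Check the ASM log. An ASM disk problem would not normally bring the cluster down and reboot the host; this step is only for completeness.
ASM log:
Wed Sep 12 08:56:59 2018
NOTE: No asm libraries found in the system -- the ASM instance starts here: after the host rebooted, the cluster stack started automatically and brought ASM up, as expected. Next, look at crsd.log.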
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 88
Number of processor cores in the system is 44
Number of processor sockets in the system is 4
Private Interface 'Bond1:1' configured from GPnP for use as a private interconnect.
[name='Bond1:1', type=1, ip=169.254.185.193, mac=a0-00-01-00-fe-80-00-00-00-00-00-00-64-3e-00-00-00-00-00-00, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'vlan304' configured from GPnP for use as a public interface.
[name='vlan304', type=1, ip=10.212.252.84, mac=e0-97-96-06-e7-a5, net=10.212.252.80/28, mask=255.255.255.240, use=public/1]
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /opt/oracle/grid/11.2.0/grid/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NOTE: Volume support enabled
NUMA system with 4 nodes detected
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
ORACLE_HOME = /opt/oracle/grid/11.2.0/grid
System name: Linux
Node name: zjhzbjwgzhzg01
Release: 2.6.32-431.el6.x86_64
Version: #1 SMP Sun Nov 10 22:19:54 EST 2013
Machine: x86_64
Using parameter settings in server-side spfile +OCR_VOTE/zjhzbjw-cluster/asmparameterfile/registry.253.876056333
System parameters with non-default values:
large_pool_size = 12M
instance_type = "asm"
remote_login_passwordfile= "EXCLUSIVE"
asm_diskstring = "/dev/asmdisk/*"
asm_diskgroups = "OCR_VOTE"
asm_diskgroups = "SDH_SYS_DG"
asm_diskgroups = "ARCHIVE_DG"
asm_diskgroups = "WPS_SYS_DG"
asm_power_limit = 8
diagnostic_dest = "/opt/oracle/grid/grid"
Cluster communication is configured to use the following interface(s) for this instance
169.254.185.193
cluster interconnect IPC version:Oracle UDP/IP (generic)
(5) Check the CRS log
crsd log: the log shows that the CRS daemon restarted at 08:57.
2018-09-12 08:37:37.898: [UiServer][1156998912]{1:47993:28042} Done for ctx=0x7fcfbc0088a0
2018-09-12 08:57:08.163: [ CRSMAIN][1714153248] First attempt: init CSS context succeeded.
[ clsdmt][1707702016]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=zjhzbjwgzhzg01DBG_CRSD))
2018-09-12 08:57:08.165: [ clsdmt][1707702016]PID for the Process [35434], connkey 1
2018-09-12 08:57:08.165: [ clsdmt][1707702016]Creating PID [35434] file for home /opt/oracle/grid/11.2.0/grid host zjhzbjwgzhzg01 bin crs to /opt/oracle/grid/11.2.0/grid/crs/init/
2018-09-12 08:57:08.165: [ clsdmt][1707702016]Writing PID [35434] to the file [/opt/oracle/grid/11.2.0/grid/crs/init/zjhzbjwgzhzg01.pid]
2018-09-12 08:57:08.607: [ CRSMAIN][1707702016] Policy Engine is not initialized yet!
2018-09-12 08:57:08.607: [ CRSMAIN][1714153248] CRS Daemon Starting
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: allcomp 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: default 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: COMMCRS 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: COMMNS 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: CSSCLNT 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCLIB 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCXBAD 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCLXPT 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCUNDE 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPC 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCGEN 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCTRAC 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCWAIT 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCXCPT 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCOSD 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCBASE 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCCLSA 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCCLSC 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCEXMP 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCGMOD 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCHEAD 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCMUX 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCNET 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCNULL 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCPKT 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCSMEM 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCHAUP 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCHALO 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCHTHR 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCHGEN 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCHLCK 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCHDEM 0
2018-09-12 08:57:08.608: [ CRSD][1714153248] Logging level for Module: GIPCHWRK 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSMAIN 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: clsdmt 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: clsdms 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSUI 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSCOMM 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSRTI 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSPLACE 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSAPP 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSRES 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSTIMER 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSEVT 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSD 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CLUCLS 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CLSVER 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CLSFRAME 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSPE 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSSE 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSRPT 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSOCR 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: UiServer 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: AGFW 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: SuiteTes 1
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSSHARE 1
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSSEC 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSCCL 0
2018-09-12 08:57:08.609: [ CRSMAIN][1707702016] Policy Engine is not initialized yet!
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: CRSCEVT 0
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: AGENT 1
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: OCRAPI 1
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: OCRCLI 1
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: OCRSRV 1
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: OCRMAS 1
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: OCRMSG 1
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: OCRCAC 1
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: OCRRAW 1
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: OCRUTL 1
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: OCROSD 1
2018-09-12 08:57:08.609: [ CRSD][1714153248] Logging level for Module: OCRASM 1
2018-09-12 08:57:08.609: [ CRSMAIN][1714153248] Checking the OCR device
2018-09-12 08:57:08.610: [ CRSMAIN][1714153248] Sync-up with OCR
2018-09-12 08:57:08.610: [ CRSMAIN][1714153248] Connecting to the CSS Daemon
2018-09-12 08:57:08.610: [ CRSMAIN][1714153248] Getting local node number
2018-09-12 08:57:08.610: [ CRSMAIN][1714153248] Initializing OCR
[ CLWAL][1714153248]clsw_Initialize: OLR initlevel [70000]
2018-09-12 08:57:09.198: [ OCRRAW][1714153248]proprioo: for disk 0 (+OCR_VOTE), id match (1), total id sets, (2) need recover (0), my votes (2), total votes (2), commit_lsn (1469), lsn (1469)
2018-09-12 08:57:09.198: [ OCRRAW][1714153248]proprioo: my id set: (760227868, 1028247821, 0, 0, 0)
2018-09-12 08:57:09.198: [ OCRRAW][1714153248]proprioo: 1st set: (340549372, 760227868, 0, 0, 0)
2018-09-12 08:57:09.198: [ OCRRAW][1714153248]proprioo: 2nd set: (760227868, 1028247821, 0, 0, 0)
2018-09-12 08:57:09.207: [ OCRSRV][1714153248]th_init: Successfully retrieved CSS misscount [31].
2018-09-12 08:57:09.207: [ OCRSRV][1714153248]th_init: Successfully query CLSS mode [3].
2018-09-12 08:57:09.208: [ OCRSRV][1714153248]th_init:1: FROM PUBDATA Node num [2]Remote Listening Port [0] Cache invalidation port [0]
2018-09-12 08:57:09.208: [ OCRSRV][1714153248]th_init:1.1: FROM PUBDATA Node num [2]CLSC Private IP or GIPC connect string [gipcha<zjhzbjwgzhzg02><467b-167f-bbd0-bd81><b6b7-5893-3f00-7370>]
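From the above, the host reboot was most likely caused by memory and hardware problems. After contacting the host team, it turned out that a faulty memory module had indeed caused the host to reboot, which in turn restarted the cluster stack on that node.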
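When handing a case like this to the hardware team, it helps to summarize the mcelog messages rather than forward the raw log. The sketch below is a rough illustration that assumes the message wording seen in the OS log excerpt (`Corrected memory errors on page ...`, `Location SOCKET:... CHANNEL:... DIMM:...`, `Offlining page ... failed`); the function name `summarize_mcelog` is hypothetical. In the excerpt every location line reads `SOCKET:2 CHANNEL:0 DIMM:?`, which suggests one memory channel even though the DIMM slot itself was not resolved.

```python
import re
from collections import Counter

PAGE_RE = re.compile(r"mcelog: Corrected memory errors on page ([0-9a-f]+)")
LOC_RE = re.compile(r"mcelog: Location (SOCKET:\S+ CHANNEL:\S+ DIMM:\S+)")
FAIL_RE = re.compile(r"mcelog: Offlining page ([0-9a-f]+) failed")

def summarize_mcelog(lines):
    """Count threshold-exceeded pages per location and collect pages that failed to offline."""
    per_location = Counter()
    failed_pages = []
    pending_page = None
    for line in lines:
        m = PAGE_RE.search(line)
        if m:
            pending_page = m.group(1)  # the Location line for this page follows
            continue
        m = LOC_RE.search(line)
        if m and pending_page is not None:
            per_location[m.group(1)] += 1
            pending_page = None
            continue
        m = FAIL_RE.search(line)
        if m:
            failed_pages.append(m.group(1))
    return per_location, failed_pages

# Lines taken from the OS log excerpt earlier in this post.
sample = [
    "Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 599c858000 exceed threshold 10 in 24h: 10 in 24h",
    "Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []",
    "Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Offlining page 599c858000",
    "Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Offlining page 599c858000 failed: Device or resource busy",
    "Sep 12 08:27:44 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 599cc34000 exceed threshold 10 in 24h: 10 in 24h",
    "Sep 12 08:27:44 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []",
    "Sep 12 08:27:44 zjhzbjwgzhzg01 mcelog: Offlining page 599cc34000",
    "Sep 12 08:27:46 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 589d81d000 exceed threshold 10 in 24h: 10 in 24h",
    "Sep 12 08:27:46 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []",
    "Sep 12 08:27:46 zjhzbjwgzhzg01 mcelog: Offlining page 589d81d000",
]
per_location, failed_pages = summarize_mcelog(sample)
print(per_location)
print(failed_pages)
```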
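Summary: a cluster failure can have a hardware cause; it can be caused by wait events driving CPU usage so high that the cluster restarts; or disks can drop out so that the database goes down while the cluster stays up. Only the logs can tell these cases apart. The key is to pin down the time of the failure first and then read the logs around that point in time.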
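Once the failure time is known, pulling out only the entries shortly before it keeps the reading focused. This is a minimal sketch, assuming the crsd.log timestamp prefix seen above (`2018-09-12 08:37:37.898:`) and treating lines without a timestamp as continuations of the previous entry; `entry_time`, `lines_before` and the 10-minute default are illustrative choices, not an Oracle tool.

```python
from datetime import datetime, timedelta

def entry_time(line):
    """Parse a leading 'YYYY-MM-DD HH:MM:SS.mmm' stamp; return None if absent."""
    try:
        return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None

def lines_before(lines, failure_time, minutes=10):
    """Keep entries stamped within `minutes` before failure_time, plus their continuation lines."""
    start = failure_time - timedelta(minutes=minutes)
    keep = False
    selected = []
    for line in lines:
        ts = entry_time(line)
        if ts is not None:
            keep = start <= ts <= failure_time
        if keep:
            selected.append(line)  # untimestamped lines inherit the previous decision
    return selected

# Lines taken from the crsd log excerpt above.
sample = [
    "2018-09-12 08:37:37.898: [UiServer][1156998912]{1:47993:28042} Done for ctx=0x7fcfbc0088a0",
    "2018-09-12 08:57:08.163: [ CRSMAIN][1714153248] First attempt: init CSS context succeeded.",
    "[ clsdmt][1707702016]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=zjhzbjwgzhzg01DBG_CRSD))",
    "2018-09-12 08:57:08.607: [ CRSMAIN][1714153248] CRS Daemon Starting",
]
selected = lines_before(sample, datetime(2018, 9, 12, 8, 41), minutes=10)
for line in selected:
    print(line)
```

With the failure placed at about 08:41, only the 08:37:37 entry survives the filter; the 08:57 startup messages are left out.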