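This afternoon a colleague called to say that the database on the second node of a database machine (RAC) could not be reached, and asked me to take a look. I logged in and checked the cluster resource status from the first node: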
- [grid@pwjkdb01 ~]$ crs_stat -t
- Name Type Target State Host
- ------------------------------------------------------------
- ora....PWJK.dg ora....up.type ONLINE ONLINE pwjkdb01
- ora.DBFS_DG.dg ora....up.type ONLINE ONLINE pwjkdb01
- ora....ER.lsnr ora....er.type ONLINE ONLINE pwjkdb01
- ora....N1.lsnr ora....er.type ONLINE OFFLINE
- ora....PWJK.dg ora....up.type ONLINE ONLINE pwjkdb01
- ora.asm ora.asm.type ONLINE ONLINE pwjkdb01
- ora.cvu ora.cvu.type ONLINE OFFLINE
- ora.gsd ora.gsd.type OFFLINE OFFLINE
- ora....network ora....rk.type ONLINE ONLINE pwjkdb01
- ora.oc4j ora.oc4j.type ONLINE OFFLINE
- ora.ons ora.ons.type ONLINE ONLINE pwjkdb01
- ora.pwdata.db ora....se.type ONLINE ONLINE pwjkdb01
- ora....rv1.svc ora....ce.type ONLINE ONLINE pwjkdb01
- ora....rv2.svc ora....ce.type ONLINE ONLINE pwjkdb01
- ora....SM1.asm application ONLINE ONLINE pwjkdb01
- ora....01.lsnr application ONLINE ONLINE pwjkdb01
- ora....b01.gsd application OFFLINE OFFLINE
- ora....b01.ons application ONLINE ONLINE pwjkdb01
- ora....b01.vip ora....t1.type ONLINE ONLINE pwjkdb01
- ora....b02.vip ora....t1.type ONLINE OFFLINE
- ora.scan1.vip ora....ip.type ONLINE OFFLINE
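The rows that matter in a listing like this are the resources whose Target is ONLINE but whose State is OFFLINE. The following is a minimal sketch of that filter, not part of the original troubleshooting session: the function name wanted_but_offline is hypothetical, and SAMPLE is a subset of the rows shown above with the truncated resource names kept exactly as crs_stat printed them.

```python
# Sketch: flag resources that should be running (Target ONLINE) but are not.
# SAMPLE is a subset of the crs_stat -t output from node 1 shown above.
SAMPLE = """\
Name Type Target State Host
------------------------------------------------------------
ora....ER.lsnr ora....er.type ONLINE ONLINE pwjkdb01
ora....N1.lsnr ora....er.type ONLINE OFFLINE
ora.asm ora.asm.type ONLINE ONLINE pwjkdb01
ora.cvu ora.cvu.type ONLINE OFFLINE
ora.gsd ora.gsd.type OFFLINE OFFLINE
ora.oc4j ora.oc4j.type ONLINE OFFLINE
ora....b01.vip ora....t1.type ONLINE ONLINE pwjkdb01
ora....b02.vip ora....t1.type ONLINE OFFLINE
ora.scan1.vip ora....ip.type ONLINE OFFLINE
"""


def wanted_but_offline(crs_stat_text):
    """Return the names of resources with Target ONLINE and State OFFLINE."""
    flagged = []
    for line in crs_stat_text.splitlines():
        fields = line.split()
        # The separator line and blank lines have fewer than four fields.
        if len(fields) < 4:
            continue
        name, _resource_type, target, state = fields[:4]
        if target == "ONLINE" and state == "OFFLINE":
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    for name in wanted_but_offline(SAMPLE):
        print(name)
```

Resources such as ora.gsd are skipped on purpose: their Target is OFFLINE, so being down is their expected state.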
On the second node, no Oracle PMON process was running:
- [root@pwjkdb02 ~]# ps -ef |grep pmon
- root 6679 1 0 2015 ? 00:01:42 /usr/bin/perl -w /opt/oracle.cellos/compmon/exadata_mon_hw_asr.pl -server
- root 63055 62491 0 16:45 pts/1 00:00:00 grep pmon
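Because the business system depended on this database, I checked the system time and the uptime, and then tried to start the cluster resources on the second node: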
- [root@pwjkdb02 bin]# ./crsctl start crs
- CRS-4640: Oracle High Availability Services is already active
- CRS-4000: Command Start failed, or completed with errors.
- [root@pwjkdb02 bin]# ./crsctl start cluster
- CRS-2672: Attempting to start 'ora.asm' on 'pwjkdb02'
- CRS-2676: Start of 'ora.asm' on 'pwjkdb02' succeeded
- CRS-2672: Attempting to start 'ora.crsd' on 'pwjkdb02'
- CRS-2676: Start of 'ora.crsd' on 'pwjkdb02' succeeded
- [root@pwjkdb02 bin]#
The second node started normally:
- [grid@pwjkdb02 ~]$ crs_stat -t
- Name Type Target State Host
- ------------------------------------------------------------
- ora....PWJK.dg ora....up.type ONLINE ONLINE pwjkdb01
- ora.DBFS_DG.dg ora....up.type ONLINE ONLINE pwjkdb01
- ora....ER.lsnr ora....er.type ONLINE ONLINE pwjkdb01
- ora....N1.lsnr ora....er.type ONLINE ONLINE pwjkdb02
- ora....PWJK.dg ora....up.type ONLINE ONLINE pwjkdb01
- ora.asm ora.asm.type ONLINE ONLINE pwjkdb01
- ora.cvu ora.cvu.type ONLINE ONLINE pwjkdb02
- ora.gsd ora.gsd.type OFFLINE OFFLINE
- ora....network ora....rk.type ONLINE ONLINE pwjkdb01
- ora.oc4j ora.oc4j.type ONLINE ONLINE pwjkdb02
- ora.ons ora.ons.type ONLINE ONLINE pwjkdb01
- ora.pwdata.db ora....se.type ONLINE ONLINE pwjkdb01
- ora....rv1.svc ora....ce.type ONLINE ONLINE pwjkdb01
- ora....rv2.svc ora....ce.type ONLINE ONLINE pwjkdb01
- ora....SM1.asm application ONLINE ONLINE pwjkdb01
- ora....01.lsnr application ONLINE ONLINE pwjkdb01
- ora....b01.gsd application OFFLINE OFFLINE
- ora....b01.ons application ONLINE ONLINE pwjkdb01
- ora....b01.vip ora....t1.type ONLINE ONLINE pwjkdb01
- ora....SM2.asm application ONLINE ONLINE pwjkdb02
- ora....02.lsnr application ONLINE ONLINE pwjkdb02
- ora....b02.gsd application OFFLINE OFFLINE
- ora....b02.ons application ONLINE ONLINE pwjkdb02
- ora....b02.vip ora....t1.type ONLINE ONLINE pwjkdb02
- ora.scan1.vip ora....ip.type ONLINE ONLINE pwjkdb02
After the startup I reviewed some of the logs.
Clusterware alert log (alertpwjkdb02.log):
- tail -100f alertpwjkdb02.log
-
- 4016-01-02 16:26:00.736
- [/u01/app/11.2.0.3/grid/bin/oraagent.bin(48051)]CRS-5011:Check of resource "pwdata" failed: details at "(:CLSN00007:)" in "/u01/app/11.2.0.3/grid/log/pwjkdb02/agent/crsd/oraagent_oracle/oraagent_oracle.log"
- 4016-01-02 16:26:01.262
- [crsd(9981)]CRS-2765:Resource 'ora.pwdata.db' has failed on server 'pwjkdb02'.
- 4016-01-02 16:26:01.329
- [crsd(9981)]CRS-2765:Resource 'ora.pwdata.pwdatasrv2.svc' has failed on server 'pwjkdb02'.
- 4016-01-02 16:26:01.329
- [crsd(9981)]CRS-2771:Maximum restart attempts reached for resource 'ora.pwdata.pwdatasrv2.svc'; will not restart.
- 4016-01-02 16:26:01.510
- [/u01/app/11.2.0.3/grid/bin/oraagent.bin(8988)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0.3/grid/log/pwjkdb02/agent/ohasd/oraagent_grid/oraagent_grid.log"
- 4016-01-02 16:26:01.614
- [ohasd(6722)]CRS-2765:Resource 'ora.asm' has failed on server 'pwjkdb02'.
- 4016-01-02 16:26:01.663
- [/u01/app/11.2.0.3/grid/bin/oraagent.bin(8988)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0.3/grid/log/pwjkdb02/agent/ohasd/oraagent_grid/oraagent_grid.log"
- ……………………………………
ASM instance alert log:
- tail -600f alert_+ASM2.log |more
-
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Tue Jan 12 00:39:25 2016
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Sat Jan 02 16:26:00 4016
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Sat Jan 02 16:26:00 4016
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_pmon_9913.trc:
ORA-01513: invalid current time returned by operating system
PMON (ospid: 9913): terminating the instance due to error 1513
Sat Jan 02 16:26:01 4016
System state dump requested by (instance=2, osid=9913 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_diag_9923.trc
Dumping diagnostic data in directory=[cdmp_19740622152201], requested by (instance=2, osid=9913 (PMON)), summary=[abnormal instance termination].
Sat Jan 02 16:26:01 4016
ORA-1092 : opitsk aborting process
Sat Jan 02 16:26:01 4016
License high water mark = 24
Instance terminated by PMON, pid = 9913
USER (ospid: 49268): terminating the instance
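The jump from 2016 to 4016 between consecutive entries is the key signal in this alert log. Below is a rough sketch of how such a jump could be spotted automatically. It assumes the timestamp lines use the format seen above (for example "Tue Jan 12 00:39:25 2016") and an English or C locale for the day and month names. The names find_time_jumps and MAX_GAP and the one-year threshold are illustrative choices, not an Oracle tool.

```python
from datetime import datetime, timedelta

# Timestamp layout of the alert log lines shown above (assumed English locale).
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"
# Arbitrary threshold for this sketch: a gap above one year is suspicious.
MAX_GAP = timedelta(days=365)


def parse_stamp(line):
    """Return a datetime if the line is a timestamp line, otherwise None."""
    try:
        return datetime.strptime(line.strip(), STAMP_FORMAT)
    except ValueError:
        return None


def find_time_jumps(lines, max_gap=MAX_GAP):
    """Return (previous, current) pairs of consecutive stamps that differ by more than max_gap."""
    jumps = []
    previous = None
    for line in lines:
        current = parse_stamp(line)
        if current is None:
            continue  # ordinary message line
        if previous is not None and abs(current - previous) > max_gap:
            jumps.append((previous, current))
        previous = current
    return jumps


SAMPLE_LINES = [
    "Tue Jan 12 00:39:25 2016",
    "Warning: VKTM detected a time drift.",
    "Sat Jan 02 16:26:00 4016",
    "Warning: VKTM detected a time drift.",
    "Sat Jan 02 16:26:00 4016",
    "ORA-01513: invalid current time returned by operating system",
]

if __name__ == "__main__":
    for before, after in find_time_jumps(SAMPLE_LINES):
        print("time jump:", before.isoformat(), "->", after.isoformat())
```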
I then checked the shell command history on that node:
- vi .bash_profile
-
- su - oracle
- #1454306838
- sar 1 5
- #1454315175
- date
- #1454315199
- date 010216264016.00
- #64565627161
- date
- #64565627177
- date 0102162716.00
- #1451723220
- date
- #1451723223
- date
- #1451723225
- date
- #1451723227
- date
- #1451723233
- date
- #1451723250
- date
- #1451723302
- date
- #1451723310
- date
- #1451723315
- date
- #1451723323
- su - oracle
- #1451723446
- date
- #1451723534
- date 0201163016.00
- #1454315401
- date
- #1454315505
- date 0201163416.00
- #1454315642
- date
- #1454315647
- date
- #1454315648
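The lines starting with # are the epoch timestamps that bash records before each command when history timestamps are enabled. Decoding them shows what the system clock read when each command ran. The sketch below assumes the server clock was in China Standard Time (UTC+8), which the post does not state but which is consistent with the 16:26 times in the logs. The helper name decode_stamp is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the server ran on China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_stamp(stamp_line, tz=CST):
    """Convert a bash history line such as '#1454315199' to an aware datetime."""
    seconds = int(stamp_line.strip().lstrip("#"))
    # Plain epoch arithmetic avoids platform limits on far-future timestamps.
    return (EPOCH + timedelta(seconds=seconds)).astimezone(tz)


if __name__ == "__main__":
    # Stamps taken from the history above, in the order they appear.
    for stamp in ("#1454315199", "#64565627161", "#1451723220", "#1454315401"):
        print(stamp, "->", decode_stamp(stamp).strftime("%Y-%m-%d %H:%M:%S"))
```

Under that time zone assumption, #1454315199 decodes to 2016-02-01 16:26:39, the moment the faulty date command was entered, and the very next stamp, #64565627161, decodes to 4016-01-02 16:26:01, matching the timestamps in the alert logs.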
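In the history above we find the entry date 010216264016.00. Together with the alert logs and the other cluster logs, this confirms that changing the operating system time brought down the cluster stack on RAC node 2. Talking it over on the phone, the colleague explained that he had noticed the system clock was 5 minutes slow and changed it directly on the operating system. Being unfamiliar with the command, he set the date to the year 4016, which caused the problems above. (Note: change the operating system time with great care, especially while a database is running, so that business applications are not affected.)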
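To see why that argument lands in the year 4016, recall that date reads a bare numeric argument as MMDDhhmm[[CC]YY][.ss]. The sketch below decodes an argument under that layout and adds a simple sanity check that could be run before touching a live clock. The function names, the 30-minute limit, and the two-digit-year pivot (69 to 99 mapped to 19xx, 00 to 68 to 20xx, following the POSIX convention) are assumptions of this example, not something the post prescribes.

```python
from datetime import datetime, timedelta


def parse_date_arg(arg, current_year):
    """Decode a date argument of the form MMDDhhmm[[CC]YY][.ss]."""
    digits, _, seconds = arg.partition(".")
    if not digits.isdigit() or len(digits) not in (8, 10, 12):
        raise ValueError("expected MMDDhhmm[[CC]YY][.ss], got %r" % arg)
    if seconds and not (seconds.isdigit() and len(seconds) == 2):
        raise ValueError("seconds must be two digits, got %r" % seconds)
    month, day, hour, minute = (int(digits[i:i + 2]) for i in (0, 2, 4, 6))
    year_part = digits[8:]
    if len(year_part) == 4:          # CCYY given in full
        year = int(year_part)
    elif len(year_part) == 2:        # YY only; pivot is an assumption, see lead-in
        yy = int(year_part)
        year = (1900 if yy >= 69 else 2000) + yy
    else:                            # no year given
        year = current_year
    return datetime(year, month, day, hour, minute, int(seconds or 0))


def shift_is_plausible(target, now, max_shift=timedelta(minutes=30)):
    """Return True if the requested clock change stays within max_shift."""
    return abs(target - now) <= max_shift


if __name__ == "__main__":
    now = datetime(2016, 2, 1, 16, 26, 39)   # wall-clock time when the command was typed
    for arg in ("010216264016.00", "0102162716.00", "0201163016.00"):
        target = parse_date_arg(arg, now.year)
        print(arg, "->", target, "plausible:", shift_is_plausible(target, now))
```

Of the three arguments taken from the history, only the last one, 0201163016.00, describes a change of a few minutes. The first sets the year 4016 and the second swaps month and day.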
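Every operation carries risk. Before acting, we should prepare a plan, an operating procedure, a contingency plan, and a risk assessment, and never change a live system on assumption.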
From "ITPUB博客", link: http://blog.itpub.net/29487349/viewspace-1991263/. Please credit the source when reposting; otherwise legal liability will be pursued.