RAC Cluster Outage Case Study 1

 

After an abnormal shutdown of an Oracle RAC cluster (the database instances could not be released cleanly; my guess is that the front-end applications were not stopped before the database was shut down), one of the nodes failed to start.

 

First, I logged in to the host that failed to start and found that the ASM instance was not up, so the database could not be opened.
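
Two quick checks make this visible (a sketch; the grid home path comes from the log below, everything else is standard 11.2 tooling):

# No asm_pmon process means the ASM instance is not running
ps -ef | grep '[a]sm_pmon'
# Ask the clusterware for the state of the ASM resource
/u01/app/11.2.0/grid/bin/srvctl status asm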

1. Check the ASM alert log
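
In the 11.2 ADR layout the ASM alert log sits under diagnostic_dest; a minimal way to pull up its tail (path assembled from the diagnostic_dest and instance name shown in the log below):

tail -n 200 /u01/app/grid_base/diag/asm/+asm/+ASM2/trace/alert_+ASM2.log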

Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production

With the Real Application Clusters and Automatic Storage Management options.

ORACLE_HOME = /u01/app/11.2.0/grid

System name: Linux

Node name:   posdb2

Release: 3.8.13-68.3.4.el6uek.x86_64

Version: #2 SMP Tue Jul 14 15:03:36 PDT 2015

Machine: x86_64

Using parameter settings in server-side spfile +RACOCR/posdb-scan/asmparameterfile/registry.253.939810311

System parameters with non-default values:

  large_pool_size          = 12M

  instance_type            = "asm"

  remote_login_passwordfile= "EXCLUSIVE"

  remote_listener          = "posdbscan.jinianri.com:1521"

  asm_diskstring           = "/dev/oracleasm/disks"

  asm_diskgroups           = "RACDATA"

  asm_power_limit          = 1

  diagnostic_dest          = "/u01/app/grid_base"

Cluster communication is configured to use the following interface(s) for this instance

  169.254.184.182

cluster interconnect IPC version:Oracle UDP/IP (generic)

IPC Vendor 1 proto 2

Mon Apr 22 17:15:16 2019

PMON started with pid=2, OS id=35859

Mon Apr 22 17:15:16 2019

PSP0 started with pid=3, OS id=35861

Mon Apr 22 17:15:17 2019

VKTM started with pid=4, OS id=35863 at elevated priority

VKTM running at (1)millisec precision with DBRM quantum (100)ms

Mon Apr 22 17:15:17 2019

GEN0 started with pid=5, OS id=35867

Mon Apr 22 17:15:17 2019

DIAG started with pid=6, OS id=35869

Mon Apr 22 17:15:17 2019

PING started with pid=7, OS id=35871

Mon Apr 22 17:15:17 2019

DIA0 started with pid=8, OS id=35873

Mon Apr 22 17:15:17 2019

LMON started with pid=9, OS id=35875

Mon Apr 22 17:15:17 2019

LMD0 started with pid=10, OS id=35877

* Load Monitor used for high load check

* New Low - High Load Threshold Range = [76800 - 102400]

Mon Apr 22 17:15:17 2019

LMS0 started with pid=11, OS id=35879 at elevated priority

Mon Apr 22 17:15:17 2019

LMHB started with pid=12, OS id=35883

Mon Apr 22 17:15:17 2019

MMAN started with pid=13, OS id=35885

Mon Apr 22 17:15:17 2019

DBW0 started with pid=14, OS id=35887

Mon Apr 22 17:15:17 2019

LGWR started with pid=15, OS id=35889

Mon Apr 22 17:15:17 2019

CKPT started with pid=16, OS id=35891

Mon Apr 22 17:15:17 2019

SMON started with pid=17, OS id=35893

Mon Apr 22 17:15:17 2019

RBAL started with pid=18, OS id=35895

Mon Apr 22 17:15:17 2019

GMON started with pid=19, OS id=35897

Mon Apr 22 17:15:17 2019

MMON started with pid=20, OS id=35899

Mon Apr 22 17:15:17 2019

MMNL started with pid=21, OS id=35901

lmon registered with NM - instance number 2 (internal mem no 1)

Mon Apr 22 17:17:14 2019

PMON (ospid: 35859): terminating the instance due to error 481

Mon Apr 22 17:17:14 2019

ORA-1092 : opitsk aborting process

Mon Apr 22 17:17:15 2019

System state dump requested by (instance=2, osid=35859 (PMON)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/grid_base/diag/asm/+asm/+ASM2/trace/+ASM2_diag_35869.trc

Mon Apr 22 17:17:15 2019

ORA-1092 : opitsk aborting process

Dumping diagnostic data in directory=[cdmp_20190422171715], requested by (instance=2, osid=35859 (PMON)), summary=[abnormal instance termination].

Instance terminated by PMON, pid = 35859

Mon Apr 22 17:17:15 2019

License high water mark = 1

USER (ospid: 35925): terminating the instance

Instance terminated by USER, pid = 35925

 

 

Running oracleasm listdisks confirmed that the ASM disks were still visible to the node.
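
A slightly fuller visibility check (a sketch; oracleasm is the ASMLib tool already used above, and the disk label DISK1 is hypothetical):

# List all ASMLib disk labels the kernel driver can see
oracleasm listdisks
# Verify that an individual label still maps to a device
oracleasm querydisk -d DISK1
# The device files should exist under the asm_diskstring from the spfile
ls -l /dev/oracleasm/disks/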

 

Continuing with the clusterware alert log on posdb2, we can see that the posdb2 instance was killed at the request of posdb1 and evicted from the cluster. (The "error 481" above maps to ORA-00481, LMON process terminated with error, i.e. the instance failed to establish cluster membership.)

Put simply, posdb1 refused to let posdb2 start.
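
From the surviving node you can confirm the membership view (a sketch using standard clusterware tools):

# Node status as seen by the cluster
/u01/app/11.2.0/grid/bin/olsnodes -s
# State of the lower-stack (ohasd-managed) resources, including ora.asm
/u01/app/11.2.0/grid/bin/crsctl stat res -t -init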

 

 

 

 

[cssd(34506)]CRS-1662:Member kill requested by node posdb1 for member number 1, group DB+ASM

2019-04-22 10:13:42.476:

[/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5019:All OCR locations are on ASM disk groups [RACOCR], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".

2019-04-22 10:13:42.476:

[/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log"

2019-04-22 10:13:44.403:

[/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5019:All OCR locations are on ASM disk groups [RACOCR], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".

2019-04-22 10:13:44.606:

[/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5019:All OCR locations are on ASM disk groups [RACOCR], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".

2019-04-22 10:13:45.061:

[/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5019:All OCR locations are on ASM disk groups [RACOCR], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".

2019-04-22 10:13:45.235:

[ohasd(34059)]CRS-2807:Resource 'ora.asm' failed to start automatically.

2019-04-22 10:13:45.235:

[ohasd(34059)]CRS-2807:Resource 'ora.crsd' failed to start automatically.

2019-04-22 10:48:03.585:

 

 

After roughly three restart attempts, posdb2 killed its own processes and gave up on auto-starting.

 

We hit this problem during the day. Based on the errors, at 2 a.m. we stopped all front-end applications, killed the leftover PL/SQL sessions, and shut down the whole cluster. The RAC nodes were then started one by one, as sketched below, and service returned to normal.
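
The restart followed the usual stop-everything-then-start-one-node-at-a-time pattern (a sketch; crsctl is run as root from the grid home, and the session identifiers are placeholders to be looked up in gv$session):

# Kill leftover application sessions first (run in SQL*Plus on a surviving instance;
# sid, serial# and inst_id below are placeholders):
#   ALTER SYSTEM KILL SESSION '123,45,@2' IMMEDIATE;

# Then stop the clusterware stack on every node
/u01/app/11.2.0/grid/bin/crsctl stop crs
# Start it back up, one node at a time, verifying between nodes
/u01/app/11.2.0/grid/bin/crsctl start crs
/u01/app/11.2.0/grid/bin/crsctl stat res -t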

 

 

 

At another point in time:

 

 

 

 

Here we can see node 2 being evicted from the cluster again:

2019-04-23 21:46:31.302:

[cssd(34550)]CRS-1632:Node posdb2 is being removed from the cluster in cluster incarnation 451452533

2019-04-23 21:46:31.855:

[ohasd(36684)]CRS-8011:reboot advisory message from host: posdb2, component: cssagent, with time stamp: L-2019-04-23-21:46:31.803

[ohasd(36684)]CRS-8013:reboot advisory message text: Rebooting after limit 28230 exceeded; disk timeout 28230, network timeout 0, last heartbeat from CSSD at epoch seconds 1556027163.497, 28311 milliseconds ago based on invariant clock value of 68696263

2019-04-23 21:46:31.856:

[ohasd(36684)]CRS-8011:reboot advisory message from host: posdb2, component: cssmonit, with time stamp: L-2019-04-23-21:46:31.803

[ohasd(36684)]CRS-8013:reboot advisory message text: Rebooting after limit 28230 exceeded; disk timeout 28230, network timeout 0, last heartbeat from CSSD at epoch seconds 1556027163.497, 28311 milliseconds ago based on invariant clock value of 68696263

2019-04-23 21:46:34.116:

[cssd(34550)]CRS-1601:CSSD Reconfiguration complete. Active nodes are posdb1 .

2019-04-23 21:46:34.121:

[crsd(35129)]CRS-5504:Node down event reported for node 'posdb2'.

2019-04-23 21:46:36.708:

[crsd(35129)]CRS-2773:Server 'posdb2' has been removed from pool 'Generic'.

2019-04-23 21:46:36.709:

[crsd(35129)]CRS-2773:Server 'posdb2' has been removed from pool 'ora.orcl'.
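
The reboot advisory says the cssagent/cssmonit processes had not received a heartbeat from CSSD for about 28 seconds, past the configured limit, so the node fenced itself. The related CSS thresholds can be read back like this (standard crsctl queries; values vary per cluster):

/u01/app/11.2.0/grid/bin/crsctl get css misscount
/u01/app/11.2.0/grid/bin/crsctl get css disktimeout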

                                                                             

 

 

After rebooting the server, we checked the database alert log:

 

Non critical error ORA-48181 caught while writing to trace file "/u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_67596.trc"

Error message: Linux-x86_64 Error: 28: No space left on device

Additional information: 1

Writing to the above trace file is disabled for now on...

Tue Apr 23 21:45:49 2019

Non critical error ORA-48113 caught while writing to trace file "/u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmd0_35801.trc"

Error message:

Writing to the above trace file is disabled for now on...

Tue Apr 23 21:45:50 2019

Non critical error ORA-48113 caught while writing to trace file "/u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lms0_35803.trc"

Error message:

Writing to the above trace file is disabled for now on...

Tue Apr 23 21:45:53 2019

Non critical error ORA-48113 caught while writing to trace file "/u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmon_35799.trc"

Error message:

 

This time it was a bit embarrassing: the local disk had simply filled up. Fine. We cleared out the surplus backup files, restarted the cluster services, and the problem was solved.
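
A minimal cleanup sketch for the full-disk case (the paths come from the trace messages above; the purge age is illustrative):

# Confirm which filesystem is full
df -h /u01
# Find the space hogs under the diagnostic destination
du -sh /u01/app/oracle/diag/rdbms/orcl/orcl2/*
# Purge old trace and incident files with adrci (age is in minutes; 10080 = 7 days)
adrci exec="set home diag/rdbms/orcl/orcl2; purge -age 10080"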

The benefit of Oracle RAC is that when one node cannot provide service for one reason or another, the other node keeps running normally.

 

Post-mortem write-ups always look simple, but the process is winding: you wait for a maintenance window, you carry all kinds of pressure, and you have to keep analyzing the problem calmly under that pressure.

Without solid database expertise, you may simply have no idea where to start.

Most of the time, we solve problems with imagination.
