After an abnormal shutdown of an Oracle RAC cluster (the database instances were not released cleanly; my guess is that the front-end business applications were not stopped before the database was shut down), one of the nodes failed to start normally.
First, I logged in to the host that would not start properly and found that the ASM instance was down, so the database could not be opened.
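Before digging into logs, it is worth confirming what is actually running. A minimal sketch of the checks, assuming the grid home /u01/app/11.2.0/grid from this installation:

ps -ef | grep asm_pmon | grep -v grep               # is there an ASM pmon process at all?
/u01/app/11.2.0/grid/bin/crsctl stat res -t -init   # lower-stack resource states (ora.asm, ora.cssd, ...)
/u01/app/11.2.0/grid/bin/srvctl status asm          # Clusterware's view of the ASM resource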
1. Check the ASM alert log
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
ORACLE_HOME = /u01/app/11.2.0/grid
System name:    Linux
Node name:      posdb2
Release:        3.8.13-68.3.4.el6uek.x86_64
Version:        #2 SMP Tue Jul 14 15:03:36 PDT 2015
Machine:        x86_64
Using parameter settings in server-side spfile +RACOCR/posdb-scan/asmparameterfile/registry.253.939810311
System parameters with non-default values:
  large_pool_size          = 12M
  instance_type            = "asm"
  remote_login_passwordfile= "EXCLUSIVE"
  remote_listener          = "posdbscan.jinianri.com:1521"
  asm_diskstring           = "/dev/oracleasm/disks"
  asm_diskgroups           = "RACDATA"
  asm_power_limit          = 1
  diagnostic_dest          = "/u01/app/grid_base"
Cluster communication is configured to use the following interface(s) for this instance
  169.254.184.182
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Mon Apr 22 17:15:16 2019
PMON started with pid=2, OS id=35859
Mon Apr 22 17:15:16 2019
PSP0 started with pid=3, OS id=35861
Mon Apr 22 17:15:17 2019
VKTM started with pid=4, OS id=35863 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms
Mon Apr 22 17:15:17 2019
GEN0 started with pid=5, OS id=35867
Mon Apr 22 17:15:17 2019
DIAG started with pid=6, OS id=35869
Mon Apr 22 17:15:17 2019
PING started with pid=7, OS id=35871
Mon Apr 22 17:15:17 2019
DIA0 started with pid=8, OS id=35873
Mon Apr 22 17:15:17 2019
LMON started with pid=9, OS id=35875
Mon Apr 22 17:15:17 2019
LMD0 started with pid=10, OS id=35877
* Load Monitor used for high load check
* New Low - High Load Threshold Range = [76800 - 102400]
Mon Apr 22 17:15:17 2019
LMS0 started with pid=11, OS id=35879 at elevated priority
Mon Apr 22 17:15:17 2019
LMHB started with pid=12, OS id=35883
Mon Apr 22 17:15:17 2019
MMAN started with pid=13, OS id=35885
Mon Apr 22 17:15:17 2019
DBW0 started with pid=14, OS id=35887
Mon Apr 22 17:15:17 2019
LGWR started with pid=15, OS id=35889
Mon Apr 22 17:15:17 2019
CKPT started with pid=16, OS id=35891
Mon Apr 22 17:15:17 2019
SMON started with pid=17, OS id=35893
Mon Apr 22 17:15:17 2019
RBAL started with pid=18, OS id=35895
Mon Apr 22 17:15:17 2019
GMON started with pid=19, OS id=35897
Mon Apr 22 17:15:17 2019
MMON started with pid=20, OS id=35899
Mon Apr 22 17:15:17 2019
MMNL started with pid=21, OS id=35901
lmon registered with NM - instance number 2 (internal mem no 1)
Mon Apr 22 17:17:14 2019
PMON (ospid: 35859): terminating the instance due to error 481
Mon Apr 22 17:17:14 2019
ORA-1092 : opitsk aborting process
Mon Apr 22 17:17:15 2019
System state dump requested by (instance=2, osid=35859 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/grid_base/diag/asm/+asm/+ASM2/trace/+ASM2_diag_35869.trc
Mon Apr 22 17:17:15 2019
ORA-1092 : opitsk aborting process
Dumping diagnostic data in directory=[cdmp_20190422171715], requested by (instance=2, osid=35859 (PMON)), summary=[abnormal instance termination].
Instance terminated by PMON, pid = 35859
Mon Apr 22 17:17:15 2019
License high water mark = 1
USER (ospid: 35925): terminating the instance
Instance terminated by USER, pid = 35925
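For reference, the location of this alert log follows from the diagnostic_dest parameter in the dump above; a sketch of how to read it (the adrci home path is inferred from the trace file path, not stated in the post):

tail -100 /u01/app/grid_base/diag/asm/+asm/+ASM2/trace/alert_+ASM2.log
adrci exec="set home diag/asm/+asm/+ASM2; show alert -tail 100"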
Running oracleasm listdisks showed that the ASM disks were still visible to the node.
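A sketch of that check, plus a rescan in case the disk headers were not re-read after the reboot:

oracleasm listdisks            # disk labels known to ASMLib on this node
oracleasm scandisks            # re-scan devices in case anything changed across the reboot
ls -l /dev/oracleasm/disks     # the path referenced by asm_diskstring above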
Continuing with the posdb2 cluster alert log, it shows that the posdb2 instance was killed at the request of posdb1 and evicted from the cluster.
Put simply, posdb1 refused to let posdb2 start.
[cssd(34506)]CRS-1662:Member kill requested by node posdb1 for member number 1, group DB+ASM
2019-04-22 10:13:42.476: [/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5019:All OCR locations are on ASM disk groups [RACOCR], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2019-04-22 10:13:42.476: [/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2019-04-22 10:13:44.403: [/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5019:All OCR locations are on ASM disk groups [RACOCR], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2019-04-22 10:13:44.606: [/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5019:All OCR locations are on ASM disk groups [RACOCR], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2019-04-22 10:13:45.061: [/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5019:All OCR locations are on ASM disk groups [RACOCR], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2019-04-22 10:13:45.235: [ohasd(34059)]CRS-2807:Resource 'ora.asm' failed to start automatically.
2019-04-22 10:13:45.235: [ohasd(34059)]CRS-2807:Resource 'ora.crsd' failed to start automatically.
2019-04-22 10:48:03.585:
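The repeated CRS-5019 means the RACOCR disk group, which holds the OCR, never mounted, so crsd could not start. Once an ASM instance is up, the mount states can be inspected and fixed by hand; a sketch, run as the grid user:

export ORACLE_SID=+ASM2
export ORACLE_HOME=/u01/app/11.2.0/grid
$ORACLE_HOME/bin/sqlplus / as sysasm <<'EOF'
select name, state from v$asm_diskgroup;
-- if RACOCR shows DISMOUNTED, mount it manually:
-- alter diskgroup RACOCR mount;
EOF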
After retrying the restart roughly three times, posdb2 killed its own processes and stopped trying to auto-start.
Having hit the problem during the day, and going by the errors, at two in the morning we shut down all front-end business applications, killed the leftover PL/SQL sessions, and stopped the whole cluster.
The RAC nodes were then started one by one and service returned to normal (a sketch of the procedure follows).
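A minimal sketch of that procedure; the session-filtering username APPUSER is hypothetical, and the kill statement would be issued once per leftover session:

# 1. with the front-end stopped, find and kill leftover sessions (as sysdba)
sqlplus / as sysdba <<'EOF'
select inst_id, sid, serial#, username, program from gv$session where username = 'APPUSER';
-- alter system kill session 'sid,serial#,@inst_id' immediate;
EOF

# 2. stop Clusterware on every node (as root)
/u01/app/11.2.0/grid/bin/crsctl stop crs

# 3. bring the nodes back one at a time, verifying before moving on
/u01/app/11.2.0/grid/bin/crsctl start crs
/u01/app/11.2.0/grid/bin/crsctl stat res -t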
At another point in time, node 2 was once again removed from the cluster:
2019-04-23 21:46:31.302: [cssd(34550)]CRS-1632:Node posdb2 is being removed from the cluster in cluster incarnation 451452533
2019-04-23 21:46:31.855: [ohasd(36684)]CRS-8011:reboot advisory message from host: posdb2, component: cssagent, with time stamp: L-2019-04-23-21:46:31.803
[ohasd(36684)]CRS-8013:reboot advisory message text: Rebooting after limit 28230 exceeded; disk timeout 28230, network timeout 0, last heartbeat from CSSD at epoch seconds 1556027163.497, 28311 milliseconds ago based on invariant clock value of 68696263
2019-04-23 21:46:31.856: [ohasd(36684)]CRS-8011:reboot advisory message from host: posdb2, component: cssmonit, with time stamp: L-2019-04-23-21:46:31.803
[ohasd(36684)]CRS-8013:reboot advisory message text: Rebooting after limit 28230 exceeded; disk timeout 28230, network timeout 0, last heartbeat from CSSD at epoch seconds 1556027163.497, 28311 milliseconds ago based on invariant clock value of 68696263
2019-04-23 21:46:34.116: [cssd(34550)]CRS-1601:CSSD Reconfiguration complete. Active nodes are posdb1 .
2019-04-23 21:46:34.121: [crsd(35129)]CRS-5504:Node down event reported for node 'posdb2'.
2019-04-23 21:46:36.708: [crsd(35129)]CRS-2773:Server 'posdb2' has been removed from pool 'Generic'.
2019-04-23 21:46:36.709: [crsd(35129)]CRS-2773:Server 'posdb2' has been removed from pool 'ora.orcl'.
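The reboot advisory above is the CSSD heartbeat watchdog firing: the last heartbeat was about 28 seconds earlier, past the configured limit. The CSS timeouts actually in force can be read back with crsctl:

/u01/app/11.2.0/grid/bin/crsctl get css misscount     # network heartbeat limit (seconds)
/u01/app/11.2.0/grid/bin/crsctl get css disktimeout   # voting-disk I/O limit (seconds)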
After rebooting the server, I checked the database alert log:
Non critical error ORA-48181 caught while writing to trace file "/u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_67596.trc"
Error message: Linux-x86_64 Error: 28: No space left on device
Additional information: 1
Writing to the above trace file is disabled for now on...
Tue Apr 23 21:45:49 2019
Non critical error ORA-48113 caught while writing to trace file "/u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmd0_35801.trc"
Error message:
Writing to the above trace file is disabled for now on...
Tue Apr 23 21:45:50 2019
Non critical error ORA-48113 caught while writing to trace file "/u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lms0_35803.trc"
Error message:
Writing to the above trace file is disabled for now on...
Tue Apr 23 21:45:53 2019
Non critical error ORA-48113 caught while writing to trace file "/u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmon_35799.trc"
Error message:
This one was a little embarrassing: the local disk had simply filled up. Fine. Clearing out the surplus backup files and restarting the cluster services resolved the problem (a cleanup sketch follows).
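A sketch of the cleanup; the paths follow the trace files shown above, and purging through adrci is an alternative to deleting trace files by hand:

df -h /u01                                                        # confirm which filesystem is full
du -sh /u01/app/oracle/diag/rdbms/orcl/orcl2/* | sort -h | tail   # find the biggest offenders
adrci exec="set home diag/rdbms/orcl/orcl2; purge -age 4320"      # purge ADR contents older than 3 days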
The benefit of Oracle RAC is that when one node cannot provide service because of some problem or other, the other node keeps running normally.
A post-mortem write-up always looks simple, but the process is tortuous: you have to wait for the right time window, carry pressure from every direction, and still be able to analyze the problem properly under that pressure.
Without a solid command of the database, you might have no idea where to start.
Most of the time, we solve problems by imagination.