After an abnormal shutdown of an Oracle RAC cluster (the database instances were not released cleanly; my guess is that the front-end business applications were not stopped before the database was shut down), one of the nodes failed to start normally.
First, I logged in to the host that would not start properly and found that the ASM instance was down, so the database could not be opened.
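Before digging into logs, it is worth confirming what is actually running. A minimal sketch of the checks, assuming the grid home /u01/app/11.2.0/grid from this installation:

ps -ef | grep asm_pmon | grep -v grep               # is there an ASM pmon process at all?
/u01/app/11.2.0/grid/bin/crsctl stat res -t -init   # lower-stack resource states (ora.asm, ora.cssd, ...)
/u01/app/11.2.0/grid/bin/srvctl status asm          # Clusterware's view of the ASM resource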
1. Check the ASM alert log
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
ORACLE_HOME = /u01/app/11.2.0/grid
System name:    Linux
Node name:      posdb2
Release:        3.8.13-68.3.4.el6uek.x86_64
Version:        #2 SMP Tue Jul 14 15:03:36 PDT 2015
Machine:        x86_64
Using parameter settings in server-side spfile +RACOCR/posdb-scan/asmparameterfile/registry.253.939810311
System parameters with non-default values:
  large_pool_size          = 12M
  instance_type            = "asm"
  remote_login_passwordfile= "EXCLUSIVE"
  remote_listener          = "posdbscan.jinianri.com:1521"
  asm_diskstring           = "/dev/oracleasm/disks"
  asm_diskgroups           = "RACDATA"
  asm_power_limit          = 1
  diagnostic_dest          = "/u01/app/grid_base"
Cluster communication is configured to use the following interface(s) for this instance
  169.254.184.182
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Mon Apr 22 17:15:16 2019
PMON started with pid=2, OS id=35859
Mon Apr 22 17:15:16 2019
PSP0 started with pid=3, OS id=35861
Mon Apr 22 17:15:17 2019
VKTM started with pid=4, OS id=35863 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms
Mon Apr 22 17:15:17 2019
GEN0 started with pid=5, OS id=35867
Mon Apr 22 17:15:17 2019
DIAG started with pid=6, OS id=35869
Mon Apr 22 17:15:17 2019
PING started with pid=7, OS id=35871
Mon Apr 22 17:15:17 2019
DIA0 started with pid=8, OS id=35873
Mon Apr 22 17:15:17 2019
LMON started with pid=9, OS id=35875
Mon Apr 22 17:15:17 2019
LMD0 started with pid=10, OS id=35877
* Load Monitor used for high load check
* New Low - High Load Threshold Range = [76800 - 102400]
Mon Apr 22 17:15:17 2019
LMS0 started with pid=11, OS id=35879 at elevated priority
Mon Apr 22 17:15:17 2019
LMHB started with pid=12, OS id=35883
Mon Apr 22 17:15:17 2019
MMAN started with pid=13, OS id=35885
Mon Apr 22 17:15:17 2019
DBW0 started with pid=14, OS id=35887
Mon Apr 22 17:15:17 2019
LGWR started with pid=15, OS id=35889
Mon Apr 22 17:15:17 2019
CKPT started with pid=16, OS id=35891
Mon Apr 22 17:15:17 2019
SMON started with pid=17, OS id=35893
Mon Apr 22 17:15:17 2019
RBAL started with pid=18, OS id=35895
Mon Apr 22 17:15:17 2019
GMON started with pid=19, OS id=35897
Mon Apr 22 17:15:17 2019
MMON started with pid=20, OS id=35899
Mon Apr 22 17:15:17 2019
MMNL started with pid=21, OS id=35901
lmon registered with NM - instance number 2 (internal mem no 1)
Mon Apr 22 17:17:14 2019
PMON (ospid: 35859): terminating the instance due to error 481
Mon Apr 22 17:17:14 2019
ORA-1092 : opitsk aborting process
Mon Apr 22 17:17:15 2019
System state dump requested by (instance=2, osid=35859 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/grid_base/diag/asm/+asm/+ASM2/trace/+ASM2_diag_35869.trc
Mon Apr 22 17:17:15 2019
ORA-1092 : opitsk aborting process
Dumping diagnostic data in directory=[cdmp_20190422171715], requested by (instance=2, osid=35859 (PMON)), summary=[abnormal instance termination].
Instance terminated by PMON, pid = 35859
Mon Apr 22 17:17:15 2019
License high water mark = 1
USER (ospid: 35925): terminating the instance
Instance terminated by USER, pid = 35925
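For reference, the location of this alert log follows from the diagnostic_dest parameter in the dump above; a sketch of how to read it (the adrci home path is inferred from the trace file path, not stated in the post):

tail -100 /u01/app/grid_base/diag/asm/+asm/+ASM2/trace/alert_+ASM2.log
adrci exec="set home diag/asm/+asm/+ASM2; show alert -tail 100"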
Running oracleasm listdisks showed that the ASM disks were still visible to the node.
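A sketch of that check, plus a rescan in case the disk headers were not re-read after the reboot:

oracleasm listdisks            # disk labels known to ASMLib on this node
oracleasm scandisks            # re-scan devices in case anything changed across the reboot
ls -l /dev/oracleasm/disks     # the path referenced by asm_diskstring above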
Continuing with the posdb2 cluster alert log, it shows that the posdb2 instance was killed at the request of posdb1 and evicted from the cluster.
Put simply, posdb1 refused to let posdb2 start.
[cssd(34506)]CRS-1662:Member kill requested by node posdb1 for member number 1, group DB+ASM
2019-04-22 10:13:42.476: [/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5019:All OCR locations are on ASM disk groups [RACOCR], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2019-04-22 10:13:42.476: [/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2019-04-22 10:13:44.403: [/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5019:All OCR locations are on ASM disk groups [RACOCR], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2019-04-22 10:13:44.606: [/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5019:All OCR locations are on ASM disk groups [RACOCR], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2019-04-22 10:13:45.061: [/u01/app/11.2.0/grid/bin/oraagent.bin(34407)]CRS-5019:All OCR locations are on ASM disk groups [RACOCR], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0/grid/log/posdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2019-04-22 10:13:45.235: [ohasd(34059)]CRS-2807:Resource 'ora.asm' failed to start automatically.
2019-04-22 10:13:45.235: [ohasd(34059)]CRS-2807:Resource 'ora.crsd' failed to start automatically.
2019-04-22 10:48:03.585:
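The repeated CRS-5019 means the RACOCR disk group, which holds the OCR, never mounted, so crsd could not start. Once an ASM instance is up, the mount states can be inspected and fixed by hand; a sketch, run as the grid user:

export ORACLE_SID=+ASM2
export ORACLE_HOME=/u01/app/11.2.0/grid
$ORACLE_HOME/bin/sqlplus / as sysasm <<'EOF'
select name, state from v$asm_diskgroup;
-- if RACOCR shows DISMOUNTED, mount it manually:
-- alter diskgroup RACOCR mount;
EOF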
After retrying the restart roughly three times, posdb2 killed its own processes and stopped trying to auto-start.
Having hit the problem during the day, and going by the errors, at two in the morning we shut down all front-end business applications, killed the leftover PL/SQL sessions, and stopped the whole cluster.
The RAC nodes were then started one by one and service returned to normal (a sketch of the procedure follows).
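A minimal sketch of that procedure; the session-filtering username APPUSER is hypothetical, and the kill statement would be issued once per leftover session:

# 1. with the front-end stopped, find and kill leftover sessions (as sysdba)
sqlplus / as sysdba <<'EOF'
select inst_id, sid, serial#, username, program from gv$session where username = 'APPUSER';
-- alter system kill session 'sid,serial#,@inst_id' immediate;
EOF

# 2. stop Clusterware on every node (as root)
/u01/app/11.2.0/grid/bin/crsctl stop crs

# 3. bring the nodes back one at a time, verifying before moving on
/u01/app/11.2.0/grid/bin/crsctl start crs
/u01/app/11.2.0/grid/bin/crsctl stat res -t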
At another point in time, node 2 was once again removed from the cluster:
2019-04-23 21:46:31.302: [cssd(34550)]CRS-1632:Node posdb2 is being removed from the cluster in cluster incarnation 451452533
2019-04-23 21:46:31.855: [ohasd(36684)]CRS-8011:reboot advisory message from host: posdb2, component: cssagent, with time stamp: L-2019-04-23-21:46:31.803
[ohasd(36684)]CRS-8013:reboot advisory message text: Rebooting after limit 28230 exceeded; disk timeout 28230, network timeout 0, last heartbeat from CSSD at epoch seconds 1556027163.497, 28311 milliseconds ago based on invariant clock value of 68696263
2019-04-23 21:46:31.856: [ohasd(36684)]CRS-8011:reboot advisory message from host: posdb2, component: cssmonit, with time stamp: L-2019-04-23-21:46:31.803
[ohasd(36684)]CRS-8013:reboot advisory message text: Rebooting after limit 28230 exceeded; disk timeout 28230, network timeout 0, last heartbeat from CSSD at epoch seconds 1556027163.497, 28311 milliseconds ago based on invariant clock value of 68696263
2019-04-23 21:46:34.116: [cssd(34550)]CRS-1601:CSSD Reconfiguration complete. Active nodes are posdb1 .
2019-04-23 21:46:34.121: [crsd(35129)]CRS-5504:Node down event reported for node 'posdb2'.
2019-04-23 21:46:36.708: [crsd(35129)]CRS-2773:Server 'posdb2' has been removed from pool 'Generic'.
2019-04-23 21:46:36.709: [crsd(35129)]CRS-2773:Server 'posdb2' has been removed from pool 'ora.orcl'.
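The reboot advisory above is the CSSD heartbeat watchdog firing: the last heartbeat was about 28 seconds earlier, past the configured limit. The CSS timeouts actually in force can be read back with crsctl:

/u01/app/11.2.0/grid/bin/crsctl get css misscount     # network heartbeat limit (seconds)
/u01/app/11.2.0/grid/bin/crsctl get css disktimeout   # voting-disk I/O limit (seconds)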
After rebooting the server, I checked the database alert log:
Non critical error ORA-48181 caught while writing to trace file "/u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_67596.trc"
Error message: Linux-x86_64 Error: 28: No space left on device
Additional information: 1
Writing to the above trace file is disabled for now on...
Tue Apr 23 21:45:49 2019
Non critical error ORA-48113 caught while writing to trace file "/u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmd0_35801.trc"
Error message:
Writing to the above trace file is disabled for now on...
Tue Apr 23 21:45:50 2019
Non critical error ORA-48113 caught while writing to trace file "/u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lms0_35803.trc"
Error message:
Writing to the above trace file is disabled for now on...
Tue Apr 23 21:45:53 2019
Non critical error ORA-48113 caught while writing to trace file "/u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmon_35799.trc"
Error message:
This one was a little embarrassing: the local disk had simply filled up. Fine. Clearing out the surplus backup files and restarting the cluster services resolved the problem (a cleanup sketch follows).
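A sketch of the cleanup; the paths follow the trace files shown above, and purging through adrci is an alternative to deleting trace files by hand:

df -h /u01                                                        # confirm which filesystem is full
du -sh /u01/app/oracle/diag/rdbms/orcl/orcl2/* | sort -h | tail   # find the biggest offenders
adrci exec="set home diag/rdbms/orcl/orcl2; purge -age 4320"      # purge ADR contents older than 3 days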
The benefit of Oracle RAC is that when one node cannot provide service because of some problem or other, the other node keeps running normally.
A post-mortem write-up always looks simple, but the process is tortuous: you have to wait for the right time window, carry pressure from every direction, and still be able to analyze the problem properly under that pressure.
Without a solid command of the database, you might have no idea where to start.
Most of the time, we solve problems by imagination.