RAC Failure Caused by Deleting the /tmp/.oracle Directory

Latest recommended article published on 2023-08-10 18:04:28

congjiong3047

Article tags: database, operating system, operations

Problem Description

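The customer's technical staff phoned to report that one table in the application could not be accessed (in fact, the database could no longer accept writes). They were also unable to log in remotely to check the database logs or other information. With what was known at that point, it was essentially impossible to determine what was making the table inaccessible.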

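As a first step, we advised the customer's technical staff to restart the database. The database could not be shut down normally, so it was forced down with shutdown abort and then restarted.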

After arriving on site and talking with the on-site engineer, we learned that temporary files under /tmp had been deleted at 2:00 a.m.

Analysis confirmed that this database failure was caused by deleting the /tmp/.oracle directory.

Fault Analysis

1. Confirming the problem from the database alert log:

Sun Nov 15 04:42:48 2015

WARNING: ASM communication error: op 11 state 0x50 (3113)

ERROR: slave communication error with ASM

Unable to create archive log file '+ARDATA/arch/1_15180_854962843.dbf'

ARC0: Error 19504 Creating archive log file to

'+ARDATA/arch/1_15180_854962843.dbf'

ARCH: Archival stopped, error occurred. Will continue retrying

ORACLE Instance szps1 - Archival Error

ORA-16038: log 3 sequence# 15180 cannot be archived

ORA-19504: failed to create file ""

ORA-00312: online log 3 thread 1:

'+DATA/szonline/onlinelog/group_3.263.854962845'

Sun Nov 15 04:42:48 2015

ARCH: Archival stopped, error occurred. Will continue retrying

ORACLE Instance szps1 - Archival Error

ORA-16014: log 3 sequence# 15180 not archived, no available destinations

ORA-00312: online log 3 thread 1:

'+DATA/szonline/onlinelog/group_3.263.854962845'

Sun Nov 15 04:47:49 2015

WARNING: ASM communication error: op 11 state 0x50 (3113)

ERROR: slave communication error with ASM

Unable to create archive log file '+ARDATA/arch/1_15180_854962843.dbf'

ARC3: Error 19504 Creating archive log file to

'+ARDATA/arch/1_15180_854962843.dbf'

ARCH: Archival stopped, error occurred. Will continue retrying

ORACLE Instance szps1 - Archival Error

ORA-16038: log 3 sequence# 15180 cannot be archived

ORA-19504: failed to create file ""

ORA-00312: online log 3 thread 1:

'+DATA/szonline/onlinelog/group_3.263.854962845'

ARCH: Archival stopped, error occurred. Will continue retrying

ORACLE Instance szps1 - Archival Error

ORA-16014: log 3 sequence# 15180 not archived, no available destinations

ORA-00312: online log 3 thread 1:

'+DATA/szonline/onlinelog/group_3.263.854962845'

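The log above shows that from 04:42 on November 15, 2015 until the database was restarted, the instance kept reporting that it could not communicate with the ASM instance, so the online redo log could not be archived and the archive log file could not be created.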
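To locate the point where the errors begin in a long alert log, it can help to record the first timestamp at which each ORA- code appears. The following is a minimal sketch, assuming the log uses the timestamp layout seen in the excerpt above; the function name first_seen_ora_errors and the sample variable are illustrative and not part of any Oracle tool.

```python
import re
from datetime import datetime

# Timestamp lines in the excerpt look like "Sun Nov 15 04:42:48 2015".
# Assumption: English day and month names, which strptime parses in the default C locale.
TS_FORMAT = "%a %b %d %H:%M:%S %Y"
ORA_CODE = re.compile(r"\bORA-\d{5}\b")


def first_seen_ora_errors(lines):
    """Map each ORA- code to the timestamp of the log block where it first appears."""
    current_ts = None
    first_seen = {}
    for raw in lines:
        line = raw.strip()
        try:
            current_ts = datetime.strptime(line, TS_FORMAT)
            continue
        except ValueError:
            pass  # not a timestamp line, so treat it as a message line
        for code in ORA_CODE.findall(line):
            first_seen.setdefault(code, current_ts)
    return first_seen


# A few lines copied from the alert log excerpt above.
sample = """\
Sun Nov 15 04:42:48 2015
WARNING: ASM communication error: op 11 state 0x50 (3113)
ORA-16038: log 3 sequence# 15180 cannot be archived
ORA-19504: failed to create file ""
Sun Nov 15 04:47:49 2015
ORA-16014: log 3 sequence# 15180 not archived, no available destinations
""".splitlines()

for code, ts in sorted(first_seen_ora_errors(sample).items()):
    print(code, ts)
```

In practice the lines would come from the real alert log file rather than an inline sample.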

2. Errors in the ASM alert log

Sun Nov 15 04:42:48 2015

ERROR: unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 15204714

Sun Nov 15 04:47:49 2015

ERROR: unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 5832778

Sun Nov 15 04:52:50 2015

ERROR: unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 5701650

Sun Nov 15 04:57:50 2015

ERROR: unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 14221586

Sun Nov 15 05:02:50 2015

ERROR: unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 15204858

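The log above shows that ASM also began reporting errors on November 15, 2015. Combined with the redo and archive write failures seen earlier, we can basically confirm that the database's abnormal state was triggered by a problem on the ASM side.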
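The ASM errors recur at a steady interval, and their timestamps line up with the archiver retries in the database alert log. A small sketch of that check, using only the five timestamps from the ASM excerpt above (the helper name gaps_in_seconds is illustrative):

```python
from datetime import datetime

# Same timestamp layout as the ASM alert log excerpt.
TS_FORMAT = "%a %b %d %H:%M:%S %Y"

# Timestamps of the five ORA-29701 entries quoted above.
asm_error_times = [
    "Sun Nov 15 04:42:48 2015",
    "Sun Nov 15 04:47:49 2015",
    "Sun Nov 15 04:52:50 2015",
    "Sun Nov 15 04:57:50 2015",
    "Sun Nov 15 05:02:50 2015",
]


def gaps_in_seconds(stamps):
    """Return the number of seconds between each pair of consecutive timestamps."""
    parsed = [datetime.strptime(s, TS_FORMAT) for s in stamps]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(parsed, parsed[1:])]


print(gaps_in_seconds(asm_error_times))  # [301, 301, 300, 300]
```

The gaps of roughly five minutes match the two archival attempts at 04:42:48 and 04:47:49 in the database alert log, which suggests each retry by the archiver produced one failed process on the ASM side.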

3. Grid error log:

2015-11-15 02:00:51.685:

[/oracle/grid/bin/oraagent.bin(4325770)]CRS-5016:Process "/oracle/grid/bin/lsnrctl" spawned by agent "/oracle/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in

"/oracle/grid/log/szdb01/agent/crsd/oraagent_grid/oraagent_grid.log"

2015-11-15 02:03:51.823:

[/oracle/grid/bin/oraagent.bin(4325770)]CRS-5818:Aborted command 'start' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {0:1:8} in

/oracle/grid/log/szdb01/agent/crsd/oraagent_grid/oraagent_grid.log.

2015-11-15 02:05:55.825:

[/oracle/grid/bin/oraagent.bin(4325770)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {0:1:8} in

/oracle/grid/log/szdb01/agent/crsd/oraagent_grid/oraagent_grid.log.

2015-11-15 02:06:01.721:

[/oracle/grid/bin/oraagent.bin(4325770)]CRS-5016:Process "/oracle/grid/bin/lsnrctl" spawned by agent "/oracle/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in

"/oracle/grid/log/szdb01/agent/crsd/oraagent_grid/oraagent_grid.log"

2015-11-15 02:15:37.519:

[ctssd(5505096)]CRS-2409:The clock on host szdb01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.

2015-11-15 02:27:12.538:

[ohasd(5177356)]CRS-2765:Resource 'ora.crsd' has failed on server 'szdb01'.

2015-11-15 02:27:12.621:

[/oracle/grid/bin/orarootagent.bin(5374128)]CRS-5822:Agent '/oracle/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:5:41827} in

/oracle/grid/log/szdb01/agent/crsd/orarootagent_root/orarootagent_root.log.

2015-11-15 02:27:12.648:

[/oracle/grid/bin/oraagent.bin(4325770)]CRS-5822:Agent '/oracle/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:68} in

/oracle/grid/log/szdb01/agent/crsd/oraagent_grid/oraagent_grid.log.

2015-11-15 02:27:12.649:

[/oracle/grid/bin/oraagent.bin(6160520)]CRS-5822:Agent '/oracle/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:9:52067} in

/oracle/grid/log/szdb01/agent/crsd/oraagent_oracle/oraagent_oracle.log.

2015-11-15 02:27:14.712:

[crsd(7012528)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in

/oracle/grid/log/szdb01/crsd/crsd.log.

2015-11-15 02:27:15.718:

[ohasd(5177356)]CRS-2765:Resource 'ora.crsd' has failed on server 'szdb01'.

2015-11-15 02:27:16.818:

[crsd(6160522)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in

/oracle/grid/log/szdb01/crsd/crsd.log.

2015-11-15 02:27:17.817:

[ohasd(5177356)]CRS-2765:Resource 'ora.crsd' has failed on server 'szdb01'.

2015-11-15 02:27:18.920:

[crsd(15794626)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in

/oracle/grid/log/szdb01/crsd/crsd.log.

2015-11-15 02:27:19.922:

[ohasd(5177356)]CRS-2765:Resource 'ora.crsd' has failed on server 'szdb01'.

2015-11-15 02:27:21.009:

[crsd(6094920)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in

/oracle/grid/log/szdb01/crsd/crsd.log.

2015-11-15 02:27:22.009:

[ohasd(5177356)]CRS-2765:Resource 'ora.crsd' has failed on server 'szdb01'.

2015-11-15 02:27:23.091:

[crsd(6160526)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in

/oracle/grid/log/szdb01/crsd/crsd.log.

2015-11-15 02:27:24.099:

[ohasd(5177356)]CRS-2765:Resource 'ora.crsd' has failed on server 'szdb01'.

2015-11-15 02:27:25.196:

[crsd(6160528)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in

/oracle/grid/log/szdb01/crsd/crsd.log.

2015-11-15 02:27:26.196:

[ohasd(5177356)]CRS-2765:Resource 'ora.crsd' has failed on server 'szdb01'.

2015-11-15 02:27:27.288:

[crsd(6160530)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in

/oracle/grid/log/szdb01/crsd/crsd.log.

2015-11-15 02:27:28.291:

[ohasd(5177356)]CRS-2765:Resource 'ora.crsd' has failed on server 'szdb01'.

2015-11-15 02:27:29.366:

[crsd(12779904)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in

/oracle/grid/log/szdb01/crsd/crsd.log.

2015-11-15 02:27:30.371:

[ohasd(5177356)]CRS-2765:Resource 'ora.crsd' has failed on server 'szdb01'.

2015-11-15 02:27:31.465:

[crsd(5832832)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in

/oracle/grid/log/szdb01/crsd/crsd.log.

2015-11-15 02:27:32.466:

[ohasd(5177356)]CRS-2765:Resource 'ora.crsd' has failed on server 'szdb01'.

2015-11-15 02:27:33.545:

[crsd(5832834)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in

/oracle/grid/log/szdb01/crsd/crsd.log.

2015-11-15 02:27:34.548:

[ohasd(5177356)]CRS-2765:Resource 'ora.crsd' has failed on server 'szdb01'.

2015-11-15 02:27:34.548:

[ohasd(5177356)]CRS-2771:Maximum restart attempts reached for resource 'ora.crsd'; will not restart.

2015-11-15 02:27:34.555:

[ohasd(5177356)]CRS-2769:Unable to failover resource 'ora.crsd'.

From the Grid error messages, we can basically infer the following:

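1) The Grid log records a listener failure at 02:00 on November 15 and a CRS failure at 02:27. The /tmp/.oracle directory was deleted at 02:00; the listener failure appeared in the Grid log immediately afterwards, and the CRS failure followed.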

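2) The Oracle database instance connects to the ASM instance through the listener. Connections established before the listener failed remained usable afterwards. However, when the database instance started an archiver process to archive a log, that process had to open a new connection to the ASM instance. Because the listener had already failed, the new connection could not reach the ASM instance, and archiving failed.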

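3) The database instance has several redo log groups. At first only one group was full and could not be archived. As time passed, every group filled up without being archived, so no group was left to receive redo entries, and the application's SQL was blocked.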
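The timeline in these three points can be rebuilt from the Grid log by recording when each CRS- code first appears and how often it repeats. The following is a rough sketch, assuming the two-line layout seen in the excerpt above (a timestamp line ending in a colon, followed by the message); summarize_grid_log and sample_log are illustrative names.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp lines look like "2015-11-15 02:00:51.685:" in the Grid log excerpt.
TS_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):$")
CRS_CODE = re.compile(r"\bCRS-\d{4}\b")


def summarize_grid_log(lines):
    """Return (first_seen, counts) for the CRS- codes found in the log lines."""
    current_ts = None
    first_seen = {}
    counts = Counter()
    for raw in lines:
        line = raw.strip()
        match = TS_LINE.match(line)
        if match:
            current_ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            continue
        for code in CRS_CODE.findall(line):
            first_seen.setdefault(code, current_ts)
            counts[code] += 1
    return first_seen, counts


# Four entries copied from the Grid log excerpt above.
sample_log = """\
2015-11-15 02:00:51.685:
[/oracle/grid/bin/oraagent.bin(4325770)]CRS-5016:Process "/oracle/grid/bin/lsnrctl" spawned by agent "/oracle/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in
2015-11-15 02:27:12.538:
[ohasd(5177356)]CRS-2765:Resource 'ora.crsd' has failed on server 'szdb01'.
2015-11-15 02:27:14.712:
[crsd(7012528)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in
2015-11-15 02:27:34.548:
[ohasd(5177356)]CRS-2771:Maximum restart attempts reached for resource 'ora.crsd'; will not restart.
""".splitlines()

first_seen, counts = summarize_grid_log(sample_log)
for code in sorted(first_seen, key=first_seen.get):
    print(first_seen[code], code, "x", counts[code])
```

Sorting by first appearance puts the listener check failure (CRS-5016) at 02:00 ahead of the crsd failures at 02:27, which is the ordering the analysis relies on.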

4. A documented case of failure caused by deleting the /tmp/.oracle directory (taken from Oracle MetaLink Doc ID 370605.1)

Clusterware Intermittently Hangs And Commands Fail With CRS-184 as Network Socket Files in /tmp/.oracle or /var/tmp/.oracle Gets Deleted (Doc ID 370605.1)

In this Document

Symptoms

Cause

Solution

Applies to:

Oracle Database - Enterprise Edition - Version 10.1.0.2 to 11.1.0.7

[Release 10.1 to 11.1]
Information in this document applies to any platform.

SYMPTOMS

CRS hangs intermittently

crs_stat -t returns

CRS-0184: Cannot communicate with the CRS daemon.

node1 [crs]> crsctl check crsd
Cannot communicate with CRS
node1 [crs]> crsctl check css
Failure 1 contacting CSS daemon

ps -ef |grep d.bin will give you the pid of the process

for example

ps -ef |grep d.bin
oracle 19703 19281 0 Apr10 ? 00:01:03 /home/oracle/oracle/product/10.2.0/crs/bin/evmd.bin
oracle 19976 19950 0 Apr10 ? 00:06:47 /home/oracle/oracle/product/10.2.0/crs/bin/ocssd.bin
root 19323 1 0 Apr10 ? 00:08:47 /home/oracle/oracle/product/10.2.0/crs/bin/crsd.bin

CAUSE

This is caused by a cron job that cleans up the /tmp directory which also removes the Oracle socket files in /tmp/.oracle

SOLUTION

Do not remove /tmp/.oracle or /var/tmp/.oracle or its files while Oracle Clusterware is up.
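The following is not part of the Oracle note. It is a hedged sketch of a monitoring check that reports whether the two socket directories named in the note still exist and how many socket files they hold. The function names count_sockets and report are illustrative, and which directory a given platform uses is an assumption to verify locally.

```python
import os
import stat

# Locations named in the Oracle note; which one is used depends on the platform.
SOCKET_DIRS = ("/tmp/.oracle", "/var/tmp/.oracle")


def count_sockets(directory):
    """Return the number of socket files in directory, or None if it does not exist."""
    try:
        entries = os.listdir(directory)
    except FileNotFoundError:
        return None
    total = 0
    for name in entries:
        try:
            mode = os.lstat(os.path.join(directory, name)).st_mode
        except OSError:
            continue  # entry vanished between listdir and lstat
        if stat.S_ISSOCK(mode):
            total += 1
    return total


def report(directories=SOCKET_DIRS):
    """Map each directory to its socket count (None means the directory is missing)."""
    return {directory: count_sockets(directory) for directory in directories}


print(report())
```

On a node where Clusterware is running, a result of None or 0 for the directory the platform actually uses would be worth an alert, since it may indicate that the socket files have been removed.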

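Conclusion: Based on the situation on site and the documented case from Oracle's official MetaLink, we finally confirmed that the database suddenly stopped working normally because operating system operations staff, during routine maintenance, ran a cleanup that removed /tmp/.oracle. The database itself was not at fault. To prevent similar problems from happening again, we recommend closer consultation and cooperation between database administrators and operating system administrators on any OS-level maintenance of database servers.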
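One way to act on this recommendation is to have any /tmp cleanup job skip the Oracle socket directory explicitly. The sketch below only lists candidates and deletes nothing. The function name stale_files, the seven-day default, and the PROTECTED_DIR_NAMES set are illustrative assumptions, not an Oracle-provided procedure.

```python
import os
import stat
import time

# Directory names that a cleanup job must never descend into.
PROTECTED_DIR_NAMES = {".oracle"}


def stale_files(root, max_age_days=7, now=None):
    """Yield regular files under root older than max_age_days, skipping protected dirs."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never enters a protected directory.
        dirnames[:] = [d for d in dirnames if d not in PROTECTED_DIR_NAMES]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # file disappeared while walking
            # Only regular files qualify; sockets and other special files are left alone.
            if stat.S_ISREG(info.st_mode) and info.st_mtime < cutoff:
                yield path
```

Calling list(stale_files("/tmp")) gives the candidates for review before anything is removed. Restricting the result to regular files adds a second safeguard, because socket files are never returned even if they sit outside a protected directory.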

From "ITPUB Blog", link: http://blog.itpub.net/21582653/viewspace-2128357/. If you repost this article, please credit the source; otherwise legal liability will be pursued.

Reposted from: http://blog.itpub.net/21582653/viewspace-2128357/