Oracle RAC Database Instance Termination Caused by a Manual Time Adjustment

Latest recommended article published on 2021-05-26 05:06:14

Yushan Bai

Article link: https://blog.csdn.net/haibusuanyun/article/details/114455028

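Node 1 of a database cluster terminated abnormally because of a time adjustment, with a time drift warning and an "ASMB is stuck" error. Investigation showed that the time difference exceeded the permissible range and that the time had been changed manually. The problem was related to an NTP misconfiguration; the solution includes fixing the NTP settings, avoiding large time adjustments, and resolving parameter inconsistency within the cluster. The database was eventually restarted successfully and the time synchronization service was repaired.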

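Recently a customer reported that the database instance on node 1 of their cluster had terminated abnormally and that restarting the database returned an error. The troubleshooting went as follows: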

1. First, the cluster resource status was checked with crsctl stat res -t, which confirmed that only the database instance on node 1 was abnormal.

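2. The database alert log on node 1 was checked. At the time of the problem it recorded: Warning: VKTM detected a backward time drift. ... ERROR: terminating instance because ASMB is stuck for 557 seconds ... GEN0 (ospid: ): terminating the instance due to ORA error. The initial assessment was that the problem might be related to a time adjustment.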

3. The Grid cluster log was checked. It reports that the time difference is too large, which makes the problem clearer: [OCTSSD(6305)]CRS-2419: The clock on host erp-db1 differs from mean cluster time by 28897581407 microseconds. The Cluster Time Synchronization Service will not perform time synchronization because the time difference is beyond the permissible offset of 600 seconds.

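4. The operating system was checked for time-change commands. history |grep date shows that commands changing the time were run when the problem occurred. After talking with the customer and contacting the host operations staff, it was confirmed that an NTP failure had left the time zone wrong and that the time was then corrected manually, which in turn caused the problem.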

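5. Why was the database instance not restarted by the cluster after it terminated abnormally? The logs show that the cause was a Data Guard (DG) related convert parameter whose value differed between instances. This parameter has caused problems for quite a few customers in RAC environments. In fact, when the RAC database is the DG primary, the parameter has no practical effect. Because it only takes effect after a database restart, it is better not to set it if the database cannot be restarted right after the change. Otherwise the parameter ends up inconsistent between nodes and the database nodes can no longer be restarted one at a time; at that point the instance can only be started after the parameter is set to empty or the surviving node is shut down as well. The error reported was:
19:59:07.930+ORA-01105: mount is incompatible with mounts by other instances
ORA-01677: standby file name conversion parameters differ from other instance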
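
One way to spot this kind of mismatch before a restart is to compare the parameter values instance by instance. The following is a minimal sketch, not the author's procedure. It assumes the rows come from a query such as SELECT inst_id, name, value FROM gv$parameter, and that the parameters involved are db_file_name_convert and log_file_name_convert; the article only calls them "convert parameters", so the parameter names and all sample values below are hypothetical.

```python
def find_inconsistent_parameters(rows):
    """Group (inst_id, name, value) rows by parameter name and return
    only the parameters whose value is not identical on every instance."""
    by_name = {}
    for inst_id, name, value in rows:
        by_name.setdefault(name, {})[inst_id] = value
    return {
        name: values
        for name, values in by_name.items()
        if len(set(values.values())) > 1
    }


# Hypothetical rows for a two-node cluster (illustrative values only).
sample_rows = [
    (1, "db_file_name_convert", "+DATA, +DATA_STBY"),
    (2, "db_file_name_convert", None),
    (1, "log_file_name_convert", None),
    (2, "log_file_name_convert", None),
]

for name, values in find_inconsistent_parameters(sample_rows).items():
    print(name, values)
```

Any parameter reported here would need to be aligned across instances before the stopped instance is started again.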

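Oracle MOS also has documents covering this type of problem. Terminating Instance Because ASMB is Stuck for x Seconds (Doc ID 2278744.1) and Database Instance Crashes With ORA-15064 ORA-03113: Possible Causes and Solution (Doc ID 2378963.1) both point to this issue, and the first thing they say to check is NTP or other time-related problems. Both recommend not making large adjustments to the system time. In addition, the RAC installation best practices document, RAC and Oracle Clusterware Best Practices and Starter Kit (Linux) (Doc ID 1525820.1), explicitly states that NTP must be run with the -x slew option so that the time is never adjusted in large steps.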
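
The -x recommendation can be verified with a small check. The sketch below assumes that ntpd reads its startup flags from an OPTIONS= line, as in /etc/sysconfig/ntpd on many Linux distributions; that file path and the sample lines are assumptions and do not come from the customer system.

```python
def has_slew_option(config_text):
    """Return True if an uncommented OPTIONS= line contains the -x flag."""
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or not line.startswith("OPTIONS="):
            continue
        # Drop the surrounding quotes, then split the flags on whitespace.
        flags = line[len("OPTIONS="):].strip("\"'").split()
        if "-x" in flags:
            return True
    return False


# Hypothetical file contents, for illustration only.
good = 'OPTIONS="-x -u ntp:ntp -p /var/run/ntpd.pid"'
bad = 'OPTIONS="-g"'
print(has_slew_option(good), has_slew_option(bad))
```

With -x, ntpd corrects offsets gradually (slewing) instead of jumping the clock, which is the behavior the best practices document asks for.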

The relevant logs are as follows:

1. Database alert log on the failed node
2021-02-27T08:49:19.803840+08:00
ARC0 (PID:74760): Archived Log entry 6640 added for T-1.S-2013 ID 0x84ed3940 LAD:1
2021-02-27T11:46:47.593056+08:00
Warning: VKTM detected a backward time drift.  ====>>>>>
Time drifts can result in unexpected behavior such as time-outs. 
Please see the VKTM trace file for more details:
/opt/app/oracle/diag/rdbms/test/test1/trace/test1_vktm_74436.trc
2021-02-27T19:57:40.195323+08:00
ERROR: terminating instance because ASMB is stuck for 557 seconds  ====>>>>>
2021-02-27T19:57:40.335713+08:00
System state dump requested by (instance=1, osid=74440 (GEN0)), summary=[abnormal instance termination]. error - 'Instance is terminating.
'
System State dumped to trace file /opt/app/oracle/diag/rdbms/test/test1/trace/test1_diag_74449.trc
GEN0 (ospid: ): terminating the instance due to ORA error 
2021-02-27T19:57:41.386728+08:00
ORA-1092 : opitsk aborting process
2021-02-27T19:57:44.108585+08:00
License high water mark = 54
2021-02-27T19:57:46.527225+08:00
Instance terminated by GEN0, pid = 74440
2021-02-27T19:57:46.691517+08:00
Warning: 2 processes are still attacheded to shmid 917510:
 (size: 49152 bytes, creator pid: 74145, last attach/detach pid: 74472)
2021-02-27T19:57:47.110412+08:00
USER(prelim) (ospid: 48839): terminating the instance
2021-02-27T19:57:47.112558+08:00
Instance terminated by USER(prelim), pid = 48839
2021-02-27T19:57:52.548577+08:00
Starting ORACLE instance (normal) (OS id: 49014)
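
When the alert log is long, the key messages above can be pulled out together with their timestamps. This is a minimal sketch that assumes the layout shown in the excerpt, where each ISO 8601 timestamp line is followed by its message lines; the keyword list is an illustrative choice, not an Oracle-defined set.

```python
import re
from datetime import datetime

# Timestamp lines look like 2021-02-27T11:46:47.593056+08:00
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}[+-]\d{2}:\d{2}$")
KEYWORDS = ("VKTM detected", "ASMB is stuck", "Instance terminated by")


def find_events(lines):
    """Return (timestamp, message) pairs for lines containing a keyword."""
    events = []
    current = None
    for raw in lines:
        line = raw.strip()
        if TIMESTAMP.match(line):
            current = datetime.fromisoformat(line)
        elif any(keyword in line for keyword in KEYWORDS):
            events.append((current, line))
    return events


sample = """\
2021-02-27T11:46:47.593056+08:00
Warning: VKTM detected a backward time drift.
Time drifts can result in unexpected behavior such as time-outs.
2021-02-27T19:57:40.195323+08:00
ERROR: terminating instance because ASMB is stuck for 557 seconds
2021-02-27T19:57:46.527225+08:00
Instance terminated by GEN0, pid = 74440
""".splitlines()

for stamp, message in find_events(sample):
    print(stamp.isoformat(), message)
```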

2. Grid cluster alert log

2020-03-15 00:52:18.319 [OCTSSD(6305)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /opt/app/grid/diag/crs/erp-db1/crs/trace/octssd.trc.
2020-03-15 01:22:19.516 [OCTSSD(6305)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /opt/app/grid/diag/crs/erp-db1/crs/trace/octssd.trc.
2020-03-15 02:21:28.342 [ORAAGENT(67341)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 67341
2020-03-15 02:32:03.441 [ORAAGENT(74586)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 74586
2021-02-27 11:46:47.597 [OCTSSD(6305)]CRS-2419: The clock on host erp-db1 differs from mean cluster time by 28897581407 microseconds. The Cluster Time Synchronization Service will not perform time synchronization because the time difference is beyond the permissible offset of 600 seconds. Details in /opt/app/grid/diag/crs/erp-db1/crs/trace/octssd.trc.     ====>>>>>
2021-02-27 11:46:48.567 [OCTSSD(6305)]CRS-2402: The Cluster Time Synchronization Service aborted on host erp-db1. Details at (:ctsselect_msm3:) in /opt/app/grid/diag/crs/erp-db1/crs/trace/octssd.trc.
2021-02-27 19:57:41.089 [ORAAGENT(74586)]CRS-5011: Check of resource "test" failed: details at "(:CLSN00007:)" in "/opt/app/grid/diag/crs/erp-db1/crs/trace/crsd_oraagent_oracle.trc"
2021-02-27 19:57:42.758 [ORAAGENT(48812)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 48812
2021-02-27 19:57:48.580 [ORAAGENT(48979)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 48979
2021-02-27 19:59:07.930 [ORAAGENT(48979)]CRS-5017: The resource action "ora.test.db start" encountered the following error: 
2021-02-27 19:59:07.930+ORA-01105: mount is incompatible with mounts by other instances
ORA-01677: standby file name conversion parameters differ from other instance ====>>>>>
. For details refer to "(:CLSN00107:)" in "/opt/app/grid/diag/crs/erp-db1/crs/trace/crsd_oraagent_oracle.trc".
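
The microsecond figure in the CRS-2419 message is easier to interpret once it is converted to hours and minutes. The sketch below parses the message text as it appears in this log and does the conversion with integer arithmetic; the function names are illustrative.

```python
import re

# Matches the CRS-2419 message text as it appears in the log excerpt.
CRS_2419 = re.compile(
    r"CRS-2419: The clock on host (?P<host>\S+) differs from mean cluster time "
    r"by (?P<usec>\d+) microseconds"
)


def parse_clock_offset(line):
    """Return (host, offset_in_microseconds) for a CRS-2419 line, else None."""
    match = CRS_2419.search(line)
    if match is None:
        return None
    return match.group("host"), int(match.group("usec"))


def format_offset(usec):
    """Render a microsecond offset as hours, minutes and seconds."""
    hours, rest = divmod(usec, 3_600_000_000)
    minutes, rest = divmod(rest, 60_000_000)
    return f"{hours}h {minutes}m {rest / 1_000_000:.2f}s"


sample = (
    "2021-02-27 11:46:47.597 [OCTSSD(6305)]CRS-2419: The clock on host erp-db1 "
    "differs from mean cluster time by 28897581407 microseconds."
)
host, usec = parse_clock_offset(sample)
# 600 seconds is the permissible offset quoted in the message itself.
print(host, format_offset(usec), usec > 600 * 1_000_000)
# prints: erp-db1 8h 1m 37.58s True
```

The offset works out to roughly eight hours, the same size as the +08:00 offset in the alert log timestamps, which is consistent with the time zone error reported by the operations staff.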

3. Check the operating system for time-change commands

[root@erp-db1 ~]# history |grep date 
^^
  902  date -s 20210227 19:57:40
  903  date -s "20210227 19:57:40"
  904  date
  905  date -s "20210227 19:58:40"
  906  date
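
The same check can be scripted when the shell history is long. This is a minimal sketch that assumes the history output format shown above (an entry number followed by the command) and only looks for date invoked with -s or --set; other ways of changing the clock are not covered.

```python
import re
import shlex


def find_clock_set_commands(history_lines):
    """Return (entry_number, command) pairs where date is run with -s/--set."""
    hits = []
    for line in history_lines:
        match = re.match(r"\s*(\d+)\s+(.*)", line)
        if match is None:
            continue
        number, command = int(match.group(1)), match.group(2)
        try:
            tokens = shlex.split(command)
        except ValueError:
            # Unbalanced quotes: fall back to a plain whitespace split.
            tokens = command.split()
        if tokens and tokens[0] == "date" and any(
            token == "-s" or token.startswith("--set") for token in tokens[1:]
        ):
            hits.append((number, command))
    return hits


sample = [
    "  902  date -s 20210227 19:57:40",
    '  903  date -s "20210227 19:57:40"',
    "  904  date",
    '  905  date -s "20210227 19:58:40"',
    "  906  date",
]
for number, command in find_clock_set_commands(sample):
    print(number, command)
```

On the sample taken from the history above, entries 902, 903 and 905 are reported, while the plain date calls are ignored.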