oracle trace alert cleanup: Tales from years of adjusting the time on Oracle database hosts

Latest recommended article published on 2023-08-22 14:39:17

weixin_39524147

Article tags: oracle trace alert cleanup

Article link: https://blog.csdn.net/weixin_39524147/article/details/113627086

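Recently at work I once again ran into a failure caused by a time problem, which reminded me of a case from some years ago, so today I am writing it up to share. Back then, the application team reported that the host time was more than 8 minutes off from the correct time, which was affecting normal business. After logging in I found that the host's NTP service was running, so I checked the NTP synchronization status:

The offset was 0.051s, so the host was essentially in step with its NTP server. That pointed to the NTP server's own time being inaccurate, and a check from the host side confirmed that the server was indeed lagging.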
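For reference, the status check described above can be scripted. The sketch below is my own illustration, not the command output from the original incident. It assumes the standard `ntpq -p` peer table, in which the peer currently selected for synchronization is marked with a leading `*` and the ninth column holds the offset in milliseconds. The function name `selected_peer_offset_ms` is hypothetical.

```shell
# Print the offset (in milliseconds) of the peer that ntpd has currently
# selected for synchronization, reading `ntpq -p` style output from stdin.
# Returns a non-zero status when no peer is marked as selected.
# Usage on a live host (assumes ntpq is installed):
#   ntpq -p | selected_peer_offset_ms
selected_peer_offset_ms() {
    # Columns: remote refid st t when poll reach delay offset jitter
    # The selected peer's "remote" field starts with '*'.
    awk '$1 ~ /^\*/ { print $9; found = 1 } END { exit !found }'
}
```

A small value here only says that the host agrees with its configured server. As this case shows, it says nothing about whether that server itself is correct.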

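To restore business as quickly as possible, the lag was handled as follows: stop the NTP service, point it at a healthy NTP server, and then, without shutting down the database, nudge the clock forward by hand in small steps until the accumulated lag is made up. The steps were:

1. Stop the NTP service and change the server address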

# /etc/init.d/ntpd stop

# vi /etc/ntp.conf

# Enable writing of statistics records.

#statistics clockstats cryptostats loopstats peerstats

#server 172.72.20.131 prefer minpoll 6 maxpoll 6

server 10.19.244.52 prefer minpoll 6 maxpoll 6

logfile /var/log/dsware_ntp.log.0

2. Adjust once every half minute: set the time, wait half a minute, then set it again

date -s "10:41:00 2017-01-06"; clock -w

date -s "10:42:00 2017-01-06"; clock -w

date -s "10:43:00 2017-01-06"; clock -w

date -s "10:44:00 2017-01-06"; clock -w

date -s "10:45:00 2017-01-06"; clock -w

date -s "10:46:00 2017-01-06"; clock -w

date -s "10:47:00 2017-01-06"; clock -w

date -s "10:48:00 2017-01-06"; clock -w

date -s "10:49:00 2017-01-06"; clock -w

date -s "10:50:00 2017-01-06"; clock -w

date -s "10:51:00 2017-01-06"; clock -w

date -s "10:52:00 2017-01-06"; clock -w

date -s "10:53:00 2017-01-06"; clock -w

date -s "10:54:00 2017-01-06"; clock -w

date -s "10:55:00 2017-01-06"; clock -w

date -s "10:56:00 2017-01-06"; clock -w

date -s "10:57:00 2017-01-06"; clock -w

date -s "10:58:00 2017-01-06"; clock -w

3. Start the NTP service

# /etc/init.d/ntpd start

After these steps were carried out on one database host, the database showed no abnormal behavior at all.
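The command list in step 2 does not have to be typed by hand. Here is a minimal sketch that only prints the sequence, so it can be reviewed before anything touches the clock. It assumes GNU `date` (for `-d` and `@epoch`), and the function name `print_step_commands` is hypothetical. If I read the procedure correctly, the targets are one minute apart while the wait between commands is half a minute, so each step gains roughly 30 seconds on real time.

```shell
# Print (do not run) the clock-setting commands for a stepwise catch-up.
# Arguments: date (YYYY-MM-DD), first target time (HH:MM:SS), number of steps.
# Each target is one minute after the previous one, matching the
# 10:41 ... 10:58 sequence used in step 2. Assumes GNU date.
print_step_commands() {
    day=$1
    first=$2
    steps=$3
    # Convert the first target to epoch seconds once, then add 60 s per step.
    base=$(date -d "$day $first" +%s) || return 1
    i=0
    while [ "$i" -lt "$steps" ]; do
        target=$(date -d "@$((base + i * 60))" '+%H:%M:%S %Y-%m-%d')
        printf 'date -s "%s"; clock -w\n' "$target"
        i=$((i + 1))
    done
}

# Example: reproduce the 18 commands from step 2.
# print_step_commands 2017-01-06 10:41:00 18
```

Printing rather than executing is a deliberate choice: the operator still runs each line manually, about half a minute apart, and can stop as soon as the host has caught up.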

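For reasons best left unsaid, the time on another database host was handled differently: the host engineer manually set the NTP server to the correct time and then used ntpdate to bring the database server in line with it. As a result the database host's clock moved by more than 8 minutes in a single step, and once the adjustment completed, disaster struck. The database went down, as shown below:

Errors in the ALERT log: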

Fri Jan 06 11:33:30 2017

Errors in file /oracle_log/diag/rdbms/orcl/orcl2/trace/orcl2_asmb_67035.trc:

ORA-15064: communication failure with ASM instance

ORA-03113: end-of-file on communication channel

Process ID:

Session ID: 90 Serial number: 56760

Fri Jan 06 11:33:30 2017

Errors in file /oracle_log/diag/rdbms/orcl/orcl2/trace/orcl2_asmb_67035.trc:

ORA-15064: communication failure with ASM instance

ORA-03113: end-of-file on communication channel

Process ID:

Session ID: 90 Serial number: 56760

USER (ospid: 67035): terminating the instance due to error 15064

Fri Jan 06 11:33:30 2017

opiodr aborting process unknown ospid (22340) as a result of ORA-1092

Fri Jan 06 11:33:30 2017

ORA-1092 : opitsk aborting process

The errors say the database could not communicate with the ASM instance, so next we look at the ASM ALERT log.

2016-12-27 23:05:53.756000 +08:00

Warning: VKTM detected a time drift.

Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.

2017-01-06 11:33:30.143000 +08:00

WARNING: client [+ASM1:+ASM:c5ogx2-cluster] not responsive for 494s; state=0x1. pid 121601

NOTE: umbilicus traces dumped to /oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc

WARNING: client [orcl2:orcl:c5ogx2-cluster] not responsive for 494s; state=0x1. killing pid 67039

NOTE: umbilicus traces dumped to /oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc

WARNING: fencing client [orcl2:orcl:c5ogx2-cluster] after 494 seconds (mbr 1)

WARNING: ASMB has not responded for 494 seconds

NOTE: ASM umbilicus running slower than expected, ASMB diagnostic requested after 494 seconds

NOTE: ASMB process state dumped to trace file /oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc

ERROR: terminating instance because ASMB is stuck for 494 seconds

System State dumped to trace file /oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc

2017-01-06 11:33:32.261000 +08:00

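The log reports that the cluster clients had been unresponsive for 494 s, so ASMB was judged to be stuck and the ASM instance was terminated. Naturally, the DB instance could then no longer connect to the ASM instance and went down as well.

The contents of the referenced trace file are as follows: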

*** 2017-01-06 11:33:32.261

GEN0 (ospid: 97907): terminating the instance due to error 15082

ksuitm: waiting up to [5] seconds before killing DIAG(97913)

The official explanation of the error:

[/home/oracle] oerr ora 15082

15082, 00000, "ASM failed to communicate with client"

// *Cause:  There was a failure or time out when ASM tried to communicate with

//          a connected RDBMS or Oracle ASM Dynamic Volume Manager

//          (Oracle ADVM) client.

// *Action: Check the accompanying error messages and alert logs

//          for more information on the reason for the failure.

//          Check system specific logs (/var/log/messages on Linux,

//          Event Log on Windows) for Oracle ADVM messages.

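The error text indicates that ASM either could not communicate with its client or timed out, and it advises checking the related logs, including those at the network and OS levels.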

Jan 6 11:21:34 c5ogx2b ntpd[18672]: ntpd 4.2.6p5@1.2349-o Fri Oct 11 03:18:05 UTC 2013 (1)

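This is, of course, the ntpdate operation performed by the host engineer.

The timeout in the log is 494 s, which is about 8.23 minutes and matches the size of the time change. That essentially confirms that the outage was caused by changing the host time in one large step. As I understand it, the normal response time here should be under 1 second. Because of the time adjustment, the difference between two successive readings of the OS clock exceeded the allowed timeout, so ASM wrongly concluded that something was wrong and, to protect data consistency among other considerations, chose to shut down.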
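The conversion from the logged stall to minutes is easy to check mechanically. The following is a small sketch of my own, assuming the warning wording shown in the ASM ALERT log above ("not responsive for <N>s"). The function name `stall_durations` is hypothetical.

```shell
# Read alert-log text on stdin and, for every
# "not responsive for <N>s" warning, print N in seconds and in minutes.
stall_durations() {
    sed -n 's/.*not responsive for \([0-9][0-9]*\)s.*/\1/p' |
        awk '{ printf "%d s = %.2f min\n", $1, $1 / 60 }'
}

# Example with one of the warnings from the ASM ALERT log:
echo 'WARNING: client [orcl2:orcl:c5ogx2-cluster] not responsive for 494s; state=0x1. killing pid 67039' |
    stall_durations
# prints: 494 s = 8.23 min
```

Comparing that figure with the size of the clock change is what ties the outage to the time step rather than to a real communication failure.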

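So when the time on a database host needs to be adjusted, the advice is still to make small incremental changes and never to jump a large span in one go. The case above shows that adjusting in half-minute steps is one reasonable way to do it.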
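One way to make that rule harder to forget is a guard that refuses a one-shot correction when the offset is too large. This is only a sketch under my own assumptions: the function name `offset_within_step` is hypothetical, and the default limit of 30 seconds simply mirrors the half-minute step used in this article. It is not an Oracle-documented threshold.

```shell
# Succeed (status 0) only if the absolute clock offset, in seconds, does not
# exceed the largest step we are willing to apply in one go.
# $1: measured offset in seconds (may be negative or fractional)
# $2: optional limit in seconds; defaults to 30, an illustrative value
#     taken from the half-minute steps above, not an Oracle limit.
offset_within_step() {
    awk -v off="$1" -v max="${2:-30}" 'BEGIN {
        if (off < 0) off = -off          # compare magnitudes only
        exit (off <= max) ? 0 : 1
    }'
}

# Example: an offset of 494 s is rejected, so the caller falls back to
# the stepwise procedure instead of a single jump.
if offset_within_step 494; then
    echo "offset small enough for a single correction"
else
    echo "offset too large: adjust in small steps"
fi
```

The offset itself would have to come from a read-only query, for example `ntpdate -q` against the intended server, before any command that actually sets the clock is run.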
