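Background:
On May 22, 2019, shortly after 12:00, another vendor's calls to our application interface suddenly started failing; in addition, the menu functions of our own business system also raised errors when opened.
The front-end error was as follows:
Analysis of the application logs and the data source shown in the middleware console revealed that the database node the application was connected to had gone down.
Step 1: run srvctl status database -d orcl to check the database status;
Step 2: run srvctl start instance -d orcl -i orcl1 to try to start this node's instance; it reported a problem with the clusterware;
Step 3: run /u01/11.2.0/grid/bin/crsctl check cluster -all, which showed that the clusterware was down;
Step 4: as root, run /u01/11.2.0/grid/bin/crsctl stop crs -f to force-stop the CRS resources;
Step 5: as root, run /u01/11.2.0/grid/bin/crsctl start crs to start the CRS resources on this node;
Step 6: as the grid user, start node 1 with srvctl start instance -d orcl -i orcl1; the node 1 instance started and the business returned to normal.
The routine restart above brought database node 1 back to normal, which can only be called lucky. The actual cause still has to be tracked down.
If the routine approach does not bring the node up, the logs need to be analyzed further.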
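The recovery sequence above can be written down as a small dry-run plan, so that the order and the OS account for each command are explicit before anyone touches a production node. This is only a sketch: the names Step, recovery_plan and render are made up for illustration, the "grid" account for the first three steps is an assumption (the steps above only state root for the crsctl stop/start and grid for the final start), and srvctl start instance is used for the two start steps. Nothing is executed; the commands are only printed.

```python
from dataclasses import dataclass
from typing import List

# Grid home path as it appears in the steps above.
GRID_BIN = "/u01/11.2.0/grid/bin"


@dataclass(frozen=True)
class Step:
    user: str      # OS account the command is run as
    command: str   # command line to run
    purpose: str   # why the step is there


def recovery_plan(db: str = "orcl", instance: str = "orcl1") -> List[Step]:
    """Return the ordered recovery steps; nothing is executed here."""
    start_instance = f"srvctl start instance -d {db} -i {instance}"
    return [
        # The account for the first three steps is an assumption.
        Step("grid", f"srvctl status database -d {db}", "check database status"),
        Step("grid", start_instance, "try to start the failed instance"),
        Step("grid", f"{GRID_BIN}/crsctl check cluster -all", "check clusterware"),
        Step("root", f"{GRID_BIN}/crsctl stop crs -f", "force-stop CRS"),
        Step("root", f"{GRID_BIN}/crsctl start crs", "start CRS on this node"),
        Step("grid", start_instance, "start the instance again"),
    ]


def render(plan: List[Step]) -> List[str]:
    """Format the plan as numbered lines for review."""
    return [f"[{i}] ({s.user}) {s.command}  # {s.purpose}"
            for i, s in enumerate(plan, 1)]


if __name__ == "__main__":
    print("\n".join(render(recovery_plan())))
```

Keeping the plan as data rather than running it directly makes it easy to review the sequence with a colleague first, which matters for a forced CRS stop.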
Root cause investigation:
Step 1, check the alert log: alert_orcl1.log
Tue May 21 12:33:44 2019
WARNING: Write Failed. group:1 disk:7 AU:4389 offset:901120 size:131072
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_dbw4_88627.trc:
ORA-15080: synchronous I/O operation to a disk failed
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 131072
WARNING: failed to write mirror side 1 of virtual extent 2183 logical extent 0 of file 264 in group 1 on disk 7 allocation unit 4389
KCF: read, write or open error, block=0x443ee online=1
file=3 '+DATA/orcl/datafile/undotbs1.264.997101273'
error=15081 txt: ''
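The line "Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_dbw4_88627.trc:" points to this trace file for details. From the alert log alone we can only tell that a disk in the DATA disk group seems to have a problem; the specifics are not visible.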
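When the alert log is long, pulling out the "Write Failed" warnings and the ORA- codes makes it quicker to see which disk group and disk number are involved. Below is a minimal sketch, assuming the alert log lines have exactly the format shown above; the function name summarize_alert and the SAMPLE text are only for illustration.

```python
import re
from collections import Counter

# Matches the ASM write-failure warning as it appears in the alert log.
WRITE_FAILED = re.compile(
    r"WARNING: Write Failed\. group:(\d+) disk:(\d+) AU:(\d+) "
    r"offset:(\d+) size:(\d+)"
)
# Matches Oracle error codes such as ORA-15080.
ORA_CODE = re.compile(r"\bORA-\d{5}\b")

FIELDS = ("group", "disk", "au", "offset", "size")


def summarize_alert(text):
    """Return (list of write failures as dicts, Counter of ORA- codes)."""
    failures = [
        dict(zip(FIELDS, map(int, m.groups())))
        for m in WRITE_FAILED.finditer(text)
    ]
    ora_codes = Counter(ORA_CODE.findall(text))
    return failures, ora_codes


# A few lines copied from the alert log excerpt above.
SAMPLE = """\
WARNING: Write Failed. group:1 disk:7 AU:4389 offset:901120 size:131072
ORA-15080: synchronous I/O operation to a disk failed
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
"""

if __name__ == "__main__":
    failures, ora_codes = summarize_alert(SAMPLE)
    for f in failures:
        print(f)
    for code, count in ora_codes.items():
        print(code, count)
```

On a real system the text would come from reading alert_orcl1.log; the sketch keeps the sample inline so it runs anywhere.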
Step 2, check orcl1_dbw4_88627.trc
WARNING: Write Failed. group:1 disk:5 AU:4389 offset:393216 size:131072
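path:/dev/asm-diskk -- here we can see that the corresponding disk really does have a problem, which caused reads and writes to the disk to fail with I/O errors.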
incarnation:0xd622c620 asynchronous result:'I/O error'
subsys:System iop:0x7ffff4440148 bufp:0x1ddbd7f000 osderr:0x0 osderr1:0x0
ORA-15080: synchronous I/O operation to a disk failed
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 131072
WARNING: failed to write mirror side 1 of virtual extent 2184 logical extent 0 of file 264 in group 1 on disk 5 allocation unit 4389
KCF: read, write or open error, block=0x44430 online=1
file=3 '+DATA/orcl/datafile/undotbs1.264.997101273'
error=15081 txt: ''
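Encountered write error
The trace above confirms that the disk has a problem, but it does not show what exactly is wrong with the disk; for that we still need the operating system log.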
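The useful extra piece of information in the trace, compared with the alert log, is the device path that follows each "Write Failed" warning. The sketch below pairs each warning with the next path: line, assuming the trace layout shown above; failed_disk_paths is a name invented for this example. How /dev/asm-diskk maps to a kernel device such as sdi depends on the host's udev rules, which are not shown here, so that mapping is left out.

```python
import re

# "Write Failed" warning carrying the ASM group and disk number.
FAILED = re.compile(r"Write Failed\. group:(\d+) disk:(\d+)")
# The device path line that follows the warning in the trace file.
PATH = re.compile(r"^\s*path:(\S+)")


def failed_disk_paths(lines):
    """Map (group, disk) to the device path reported right after the warning."""
    result = {}
    pending = None  # (group, disk) still waiting for its path: line
    for line in lines:
        failed = FAILED.search(line)
        if failed:
            pending = (int(failed.group(1)), int(failed.group(2)))
            continue
        path = PATH.match(line)
        if path and pending is not None:
            result[pending] = path.group(1)
            pending = None
    return result


# Lines copied from the trace excerpt above.
SAMPLE = [
    "WARNING: Write Failed. group:1 disk:5 AU:4389 offset:393216 size:131072",
    "path:/dev/asm-diskk",
    "incarnation:0xd622c620 asynchronous result:'I/O error'",
]

if __name__ == "__main__":
    for (group, disk), path in failed_disk_paths(SAMPLE).items():
        print(f"group {group} disk {disk} -> {path}")
```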
Step 3, check the operating system log: the /var/log/messages file
May 21 03:28:45 localhost auditd[1926]: Audit daemon rotating log files
May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi] CDB: Write(10): 2a 00 34 14 5d 00 00 00 10 00
May 21 12:33:38 localhost kernel: end_request: I/O error, dev sdi, sector 873749760
May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi] CDB: Write(10): 2a 00 34 14 5c 00 00 00 10 00
May 21 12:33:38 localhost kernel: end_request: I/O error, dev sdi, sector 873749504 -- the proverbial bad disk sector.
May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi] CDB: Read(10): 28 00 2d 3c f8 40 00 04 00 00
May 21 12:33:38 localhost kernel: end_request: I/O error, dev sdi, sector 758970432
May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi] CDB: Read(10): 28 00 2d 3c fc 40 00 03 c0 00
May 21 12:33:38 localhost kernel: end_request: I/O error, dev sdi, sector 758971456
May 21 12:33:42 localhost kernel: sd 33:0:6:0: [sdi] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 21 12:33:42 localhost kernel: sd 33:0:6:0: [sdi] CDB: Read(10): 28 00 2d cb 17 c0 00 00 10 00
May 21 12:33:42 localhost kernel: end_request: I/O error, dev sdi, sector 768284608
May 21 12:33:43 localhost kernel: sd 33:0:8:0: [sdj] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 21 12:33:43 localhost kernel: sd 33:0:8:0: [sdj] CDB: Read(10): 28 00 11 9d a0 60 00 00 20 00
May 21 12:33:43 localhost kernel: end_request: I/O error, dev sdj, sector 295542880
May 21 12:33:43 localhost kernel: sd 33:0:8:0: [sdj] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 21 12:33:43 localhost kernel: sd 33:0:8:0: [sdj] CDB: Write(10): 2a 00 0f 94 c0 14 00 00 02 00
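May 21 12:33:43 localhost kernel: end_request: I/O error, dev sdj, sector 261406740
Finally, after consulting related material:
Having traced the problem this far, some advice: if you are responsible for hardware operations, keep day-to-day monitoring in place; if you are responsible for application operations, report the incident to the customer and have the customer coordinate with the hardware vendor to deal with it.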
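As a starting point for that kind of day-to-day monitoring, the kernel messages shown in step 3 can be scanned automatically. The sketch below counts "end_request: I/O error" lines per device and decodes the 10-byte Read(10)/Write(10) CDB, whose bytes 2 to 5 hold the logical block address and bytes 7 to 8 the transfer length. It assumes the message format shown above; the function names and the sample lines are for illustration only, and a real monitor would read /var/log/messages and send an alert instead of printing.

```python
import re
from collections import Counter

# Kernel block-layer error line, e.g. "... I/O error, dev sdi, sector 873749760".
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
# SCSI command line, e.g. "[sdi] CDB: Write(10): 2a 00 34 14 5d 00 00 00 10 00".
CDB_LINE = re.compile(
    r"\[(\w+)\] CDB: (Read|Write)\(10\): ((?:[0-9a-f]{2} ?){10})"
)


def decode_cdb10(hex_bytes):
    """Return (lba, block_count) from a 10-byte READ(10)/WRITE(10) CDB."""
    cdb = bytes.fromhex(hex_bytes.strip())
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length
    return lba, blocks


def io_errors_by_device(lines):
    """Count block-layer I/O errors per device name."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def failed_commands(lines):
    """Yield (device, operation, lba, blocks) for each logged CDB line."""
    for line in lines:
        match = CDB_LINE.search(line)
        if match:
            lba, blocks = decode_cdb10(match.group(3))
            yield match.group(1), match.group(2), lba, blocks


# Lines copied from the /var/log/messages excerpt above.
SAMPLE = [
    "May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi] CDB: Write(10): 2a 00 34 14 5d 00 00 00 10 00",
    "May 21 12:33:38 localhost kernel: end_request: I/O error, dev sdi, sector 873749760",
    "May 21 12:33:38 localhost kernel: end_request: I/O error, dev sdi, sector 873749504",
    "May 21 12:33:43 localhost kernel: sd 33:0:8:0: [sdj] CDB: Read(10): 28 00 11 9d a0 60 00 00 20 00",
    "May 21 12:33:43 localhost kernel: end_request: I/O error, dev sdj, sector 295542880",
]

if __name__ == "__main__":
    for device, count in io_errors_by_device(SAMPLE).items():
        print(f"{device}: {count} I/O error(s)")
    for device, op, lba, blocks in failed_commands(SAMPLE):
        print(f"{device} {op} lba={lba} blocks={blocks}")
```

In the sample, the address decoded from each CDB equals the sector number in the I/O error line that follows it, which is a quick way to confirm that the two messages describe the same failed request.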
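Conclusion: after a period of monitoring, this kind of disk failure has not recurred so far; we will keep monitoring.
Shared for everyone's study and reference.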