Today the CRM team reported that one of their batch jobs failed with an error. Sure enough, the alert log on node 2 of the Exadata showed errors at the same point in time:
Wed Oct 17 06:22:51 2012
Errors in file /u01/app/oracle/diag/rdbms/srcbfin/SRCBFIN2/trace/SRCBFIN2_arc2_72209.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/172.11.211.9/DATA_DM01_CD_01_dm01cel01 at offset 464519168 for data length 1048576
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:1 disk:25 AU:110 offset:3145728 size:1048576
WARNING: failed to read mirror side 1 of virtual extent 29 logical extent 0 of file 266 in group [1.2063103479] from disk DATA_DM01_CD_01_DM01CEL01 allocation unit 110 reason error; if possible, will try another mirror side
NOTE: successfully read mirror side 2 of virtual extent 29 logical extent 1 of file 266 in group [1.2063103479] from disk DATA_DM01_CD_10_DM01CEL03 allocation unit 113
The corresponding trace file likewise contains only the same few lines:
*** 2012-10-17 06:22:51.655
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:1 disk:25 AU:110 offset:3145728 size:1048576
path:o/172.11.211.9/DATA_DM01_CD_01_dm01cel01
incarnation:0xe9688586 asynchronous result:'I/O error'
subsys:OSS iop:0x2b58752a2680 bufp:0x2b5879517000 osderr:0xc9 osderr1:0x0
Exadata error:'Generic I/O error'
IO elapsed time: 12334426 usec Time waited on I/O: 12334426 usec
WARNING: failed to read mirror side 1 of virtual extent 29 logical extent 0 of file 266 in group [1.2063103479] from disk DATA_DM01_CD_01_DM01CEL01 allocation unit 110 reason error; if possible, will try another mirror side
NOTE: successfully read mirror side 2 of virtual extent 29 logical extent 1 of file 266 in group [1.2063103479] from disk DATA_DM01_CD_10_DM01CEL03 allocation unit 113
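The excerpts show ASM falling back to the second mirror copy each time. As a quick sanity check, you can pair up the WARNING and NOTE lines to confirm that every failed primary read was actually recovered. A minimal sketch (the function name is mine; the patterns come from the log lines above, and in practice you would pipe in the real alert log):

```shell
# Count failed primary reads vs. reads recovered from another mirror,
# for alert-log text supplied on stdin. Patterns follow the
# WARNING/NOTE lines shown above.
count_mirror_recovery() {
  log=$(cat)
  failed=$(printf '%s\n' "$log" | grep -c 'failed to read mirror side')
  recovered=$(printf '%s\n' "$log" | grep -c 'successfully read mirror side')
  echo "failed=$failed recovered=$recovered"
  if [ "$failed" -eq "$recovered" ]; then
    echo "every failed read was recovered from a mirror"
  fi
}
```

For example, `count_mirror_recovery < alert_SRCBFIN2.log`. If the counts ever diverge, some read could not be satisfied from any mirror and the diskgroup needs immediate attention.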
A search on MOS turned up what looks like an Exadata bug. Speechless: two months after going live we have hit one bug and one outage after another, and yet Oracle hypes this platform so hard!!!
Bug 8782572  ARCH fenced in ASM disks causing internal errors at shutdown
This note gives a brief overview of bug 8782572.
The content was last updated on: 17-JUN-2011
Affects:
  Product (Component): Oracle Server (Rdbms)
  Range of versions believed to be affected: Versions BELOW 12.1
  Versions confirmed as being affected:
  Platforms affected: Generic (all / most platforms affected)
Fixed:
  This issue is fixed in:
Description
This bug causes ARCH processes to keep issuing IOs (on archive logs)
even after an RDBMS instance has been dismounted. Such IOs are
fenced off in ASM disks after the instance is no longer part of the cluster
(i.e. after dismount), and the ASM diskgroup can be dismounted
as a result of these IOs.
Here is an example excerpt from alert log showing the IO errors due to fence:
ORA-27603: Cell storage I/O error, I/O failed on disk o/<IP Address>/<ASM Disk> at offset <offset#> for data length <length>
WARNING: IO Failed. group:<group#> disk(number.incarnation):<number.inc> disk_path:o/<IP Address>/<ASM disk>
AU:<AU> disk_offset(bytes):<bytes> io_size:<IO size> operation:Read type:asynchronous
result:I/O error process_id:<pid>
Exadata error:221 (I/O request fenced)
Another example from a cell alert log:
Information: Cellsrv canceling OSSMSG_COMMAND_BREAD request from host
xxxx[pid:<pid>] for fencing, send port <port#> open fd 2
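When several cell errors are mixed together in the traces, it helps to tally them by code, since 201 (Generic I/O error) and 221 (I/O request fenced) point at different problems. A rough sketch, assuming trace text is piped in on stdin (the function name and the sed pattern are mine, modeled on the lines above):

```shell
# Tally the distinct "Exadata error" codes found in trace text on stdin,
# e.g. 201 (Generic I/O error) vs. 221 (I/O request fenced).
classify_cell_errors() {
  sed -n 's/.*Exadata error: *\([0-9][0-9]*\).*/\1/p' | sort | uniq -c
}
```

For example, `cat /u01/app/oracle/diag/rdbms/*/*/trace/*.trc | classify_cell_errors` gives a count per error code, so fencing-related incidents stand out from plain I/O failures.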
Please note: The above is a summary description only. Actual symptoms can vary. Matching to any symptoms here does not confirm that you are encountering this problem. For questions about this bug please consult Oracle Support.
References
Bug:8782572 (This link will only work for PUBLISHED bugs)
Note:245840.1 Information on the sections in this article
The corresponding patch for this bug can be found:
Here is another article. It doesn't seem to be quite the same issue (it should be a separate bug), but I'm attaching it anyway:
Exadata/Rac - Ora-27603: Cell Storage I/O Error [ID 1445223.1]
Last Updated: 2012-5-30  Type: PROBLEM  Status: PUBLISHED  Priority: 3
Applies to:
Oracle Exadata Hardware - Version 11.2.0.1 to 11.2.0.1 [Release 11.2]
Information in this document applies to any platform.
Symptoms
Getting the following errors in an Exadata environment running an 11.2.0.1 database:
Errors in file /u01/app/oracle/diag/rdbms/test/test2/trace/test2_ora_22485.trc (incident=242913):
ORA-00600: internal error code, arguments: [kssadpm1], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/test/test2/incident/incdir_242913/test2_ora_22485_i242913.trc
Errors in file /u01/app/oracle/diag/rdbms/test/test2/trace/test2_ora_22485.trc (incident=242914):
ORA-00600: internal error code, arguments: [kfddsGet03], [56861], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/test/test2/incident/incdir_242914/test2_ora_22485_i242914.trc
Changes
No recent changes
Cause
IOs are fenced even before the txn state object gets deleted; that object needs to perform IOs in order to do the txn rollback. This is what causes the error.
ORA-600[kssadpm1] is raised because of bug 9750033
The second error, ORA-600[kfddsGet03], is caused by ORA-600[kssadpm1].
We can match this to bug 9750033 based on the following criteria:
1. The call stack matches as follows:
kssadpm
ksz_gen_reid
kfddsGet
kfioTranslateIO
kfioRqSetPrepare
2. Problematic state object is 'ksz parent'
This can be verified from the trace file.
Example:
SO: 0x6cbc55f40, type: 22, owner: (nil), flag: -/FLST/-/0x00 if: 0x0 c: 0x0
proc=(nil), name=ksz parent, file=ksz2.h LINE:394, pg=0
Dump of memory from 0x00000006CBC55F40 to 0x00000006CBC55F98
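Both criteria can be checked mechanically against an incident trace file. A hedged sketch (the function name is mine; the strings come from the call stack and the state-object dump shown above):

```shell
# Return success only if a trace file shows both fingerprints of
# bug 9750033: the kssadpm frame in the call stack and the
# 'ksz parent' state object.
matches_bug_9750033() {
  trc=$1
  grep -q 'kssadpm' "$trc" && grep -q 'name=ksz parent' "$trc"
}
```

For example, `matches_bug_9750033 /u01/app/oracle/diag/rdbms/test/test2/incident/incdir_242913/test2_ora_22485_i242913.trc && echo "likely bug 9750033"`.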
Solution
The impact of the bug is a process failure. As the error comes from a server process (not a background process), there is no effect at the instance level and no corruption either.
Only the process encountering the error is terminated, and the error occurs during normal server process exit, so the impact is minimal.
The fix is included in 11.2.0.1 BP12.
The other option is to upgrade to 11.2.0.2 or above, which includes the fix.
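Before scheduling the BP12 apply or the upgrade, it is worth confirming the fix isn't already in the Oracle home. A sketch, assuming you feed it the fixed-bug list from `opatch lsinventory -bugs_fixed` (the helper name is mine):

```shell
# Given a list of fixed bug numbers on stdin, report whether the fix
# for the bug number passed as $1 is already applied.
fix_needed() {
  if grep -qw "$1"; then
    echo "fix for $1 already applied"
  else
    echo "fix for $1 missing: apply 11.2.0.1 BP12 or upgrade to 11.2.0.2+"
  fi
}
```

For example: `$ORACLE_HOME/OPatch/opatch lsinventory -bugs_fixed | fix_needed 9750033`.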