Recently I ran into several cases where Linux asynchronous I/O was exhausted, causing ASM instances to throw errors continuously. In severe cases the ASM instance crashed and could not be restarted, leaving the cluster node in an abnormal state.
The errors in the ASM alert log look like this, and the message is quite clear: ORA-27090, insufficient kernel resources for asynchronous disk I/O.
Wed Apr 07 20:00:39 2021
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_17619.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Additional information: 3
Additional information: 128
Additional information: 8
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_17619.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Additional information: 128
Additional information: 180464224
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_17619.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Additional information: 128
Additional information: -1514856272
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_17619.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Additional information: 128
Additional information: -1514856272
Checking the aio-max-nr and aio-nr kernel parameters confirmed that asynchronous I/O was indeed exhausted:
[grid@node01 trace]$ cat /proc/sys/fs/aio-max-nr
1048576
[grid@node01 trace]$ cat /proc/sys/fs/aio-nr
1048493
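The two values can be turned into a utilization percentage as a quick sanity check. A minimal sketch (`aio_pct` is a hypothetical helper, fed the values observed above):

```shell
# Hypothetical helper: what fraction of the AIO budget is in use?
aio_pct() {
  # $1 = fs.aio-nr (current), $2 = fs.aio-max-nr (limit)
  awk -v nr="$1" -v max="$2" 'BEGIN { printf "%.1f\n", 100 * nr / max }'
}

# With the values observed above, the budget is effectively exhausted:
aio_pct 1048493 1048576   # prints 100.0

# On a live box:
# aio_pct "$(cat /proc/sys/fs/aio-nr)" "$(cat /proc/sys/fs/aio-max-nr)"
```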
Of course, the temporary fix is simple:
Set fs.aio-max-nr=3145728 in /etc/sysctl.conf so the change persists, then apply it at runtime:
sysctl -w fs.aio-max-nr=3145728
So which process had opened this many AIO contexts? I combed through /proc/&lt;pid&gt; and found no aio-related values. At first I suspected monitoring queries hitting the ASM instance, but on reflection: the error shows a request for 128 events failing, meaning each request is small, and those are short-lived connections that release quickly. The symptom here is clearly allocations that are never released.
Fortunately, while there is no direct way to query per-process aio usage, I found a script that monitors aio allocation, in an article about this very ORA-27090 async I/O shortage:
https://blog.pythian.com/troubleshooting-ora-27090-async-io-errors/
I installed the required packages as needed.
These two are on the install ISO:
systemtap
kernel-devel
The following two can be downloaded by kernel version from:
http://debuginfo.centos.org/6/x86_64/
kernel-debuginfo-common-x86_64
kernel-debuginfo
Once installed, I wrapped the script and ran it in the background.
vi aio
#!/bin/sh
stap -ve '
global allocated, allocatedctx, freed
probe syscall.io_setup {
allocatedctx[pid()] += maxevents; allocated[pid()]++;
printf("%d AIO events requested by PID %d (%s)\n",
maxevents, pid(), cmdline_str());
}
probe syscall.io_destroy {freed[pid()]++}
probe kprocess.exit {
if (allocated[pid()]) {
printf("PID %d exited\n", pid());
delete allocated[pid()];
delete allocatedctx[pid()];
delete freed[pid()];
}
}
probe end {
foreach (pid in allocated) {
printf("PID %d allocated=%d allocated events=%d freed=%d\n",
pid, allocated[pid], allocatedctx[pid], freed[pid]);
}
}
'
nohup sh aio > aio.out &
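When the stap session is eventually stopped, its probe end block writes a per-PID summary into aio.out. A quick way to rank the offenders, assuming the exact `allocated events=` line format printed by the script above:

```shell
# Sort the probe-end summary lines by allocated AIO events, largest first.
# Splitting on '=' makes field 3 begin with the "allocated events" number,
# so a numeric sort on that field orders by events allocated.
grep '^PID .* allocated=' aio.out 2>/dev/null | sort -t= -k3,3nr | head
```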
I didn't sit idle while the background collection ran. I happened to have an environment with high aio usage that could be restarted, so I planned to shut processes down one by one and see whether I could catch the "thief" stealing the aio resources.
The procedure: session 1 performs the operations, while session 2 repeatedly runs cat /proc/sys/fs/aio-nr.
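Session 2's polling can be scripted instead of typed by hand. A minimal sketch (`poll_aio` is a hypothetical helper taking a sample count, one timestamped sample per second):

```shell
# Hypothetical helper: print N timestamped samples of fs.aio-nr,
# one per second, so drops can be matched against session 1's actions.
poll_aio() {
  for _ in $(seq 1 "$1"); do
    printf '%s %s\n' "$(date +%Y%m%d_%H:%M:%S)" "$(cat /proc/sys/fs/aio-nr)"
    sleep 1
  done
}

poll_aio 3   # three samples, in the same timestamp format as the logs below
```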
Test 1) Shut down the database instance
Session 1:
20210412_13:24:01
20210412_13:24:07 13:24:01 sys@MYDB > alter system switch logfile;
20210412_13:24:07
20210412_13:24:07 System altered.
20210412_13:24:07
20210412_13:24:07 Elapsed: 00:00:00.32
20210412_13:24:11 13:24:07 sys@MYDB > alter system checkpoint;
20210412_13:24:11
20210412_13:24:11 System altered.
20210412_13:24:11
20210412_13:24:11 Elapsed: 00:00:00.21
20210412_13:24:12 13:24:11 sys@MYDB >
20210412_13:24:12 13:24:12 sys@MYDB >
20210412_13:25:05 13:24:12 sys@MYDB > shut immediate;
20210412_13:25:20 Database closed.
20210412_13:25:20 Database dismounted.
20210412_13:26:37 ORACLE instance shut down.
20210412_13:26:41 13:26:37 sys@MYDB > exit
Session 2:
20210412_13:23:55 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:23:55 1063499
20210412_13:25:03 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:25:03 1063627
20210412_13:25:08 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:25:08 1062896
20210412_13:27:31 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:31 1050752
20210412_13:27:32 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:32 1050752
20210412_13:27:33 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:33 1050752
Clearly, the shutdown test rules out the database instance: aio-nr dropped only from 1063499 to 1050752, leaving over a million allocations still outstanding.
Test 2) Shut down the cluster node
Session 1:
20210412_13:27:26 [root@node01 ~]# /u01/app/11.2.0/grid/bin/crsctl stop crs
20210412_13:27:27 CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.crsd' on 'node01'
20210412_13:27:27 CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.cvu' on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.oc4j' on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.CRS.dg' on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.registry.acfs' on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.ARCH.dg' on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.DATA.dg' on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop <