[Oracle]&[Linux] A Linux Async I/O Mystery: Who Stole the AIO?

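Recently I ran into several cases where Linux asynchronous I/O was exhausted, causing the ASM instance to report errors continuously. In the worst cases the ASM instance crashed and could not be restarted, which left the cluster node in an abnormal state.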

The errors in the ASM alert log look roughly like this. The message is clear enough: ORA-27090, not enough kernel resources for asynchronous disk I/O.

Wed Apr 07 20:00:39 2021
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_17619.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Additional information: 3
Additional information: 128
Additional information: 8
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_17619.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Additional information: 128
Additional information: 180464224
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_17619.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Additional information: 128
Additional information: -1514856272
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_17619.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Additional information: 128
Additional information: -1514856272

Checking the values of aio-max-nr and aio-nr confirmed that the asynchronous I/O slots were indeed used up.

 [grid@node01 trace]$ cat /proc/sys/fs/aio-max-nr
 1048576

 [grid@node01 trace]$ cat /proc/sys/fs/aio-nr
 1048493
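To make the gap explicit, here is a minimal sketch that turns the two readings into remaining headroom. The function name aio_usage is my own, and reading the 128 in the alert log as the number of events requested is the interpretation used later in this post, not something the log states outright.

```shell
#!/bin/sh
# aio_usage USED MAX
# Print how much of the system-wide AIO budget is consumed.
aio_usage() {
    used=$1
    max=$2
    free=$((max - used))          # slots still available to io_setup()
    pct=$((used * 100 / max))     # integer percentage, rounded down
    echo "used=$used max=$max free=$free pct=$pct"
}

# Readings from the node above. On a live system, substitute
# "$(cat /proc/sys/fs/aio-nr)" and "$(cat /proc/sys/fs/aio-max-nr)".
aio_usage 1048493 1048576
```

With only 83 slots left, a request for 128 events cannot be satisfied, which fits the error.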

Of course, the temporary workaround is simple:

 Set fs.aio-max-nr=3145728 in /etc/sysctl.conf,
 then apply it to the running kernel:
 sysctl -w fs.aio-max-nr=3145728
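If you would rather not edit the file by hand, the change can be scripted. This is only a sketch: set_sysctl_conf is a hypothetical helper of mine, it relies on GNU sed's -i option, and the demo runs against a temporary file with made-up starting content rather than the real /etc/sysctl.conf.

```shell
#!/bin/sh
# set_sysctl_conf FILE KEY VALUE
# Replace an existing "KEY=..." line in FILE, or append one if it is absent.
set_sysctl_conf() {
    file=$1 key=$2 value=$3
    # Escape dots so the key is matched literally in the regex.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${pattern}[[:space:]]*=" "$file"; then
        sed -i "s|^[[:space:]]*${pattern}[[:space:]]*=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Demo on a scratch copy (hypothetical starting content).
tmp=$(mktemp)
printf 'fs.file-max = 6815744\nfs.aio-max-nr = 1048576\n' > "$tmp"
set_sysctl_conf "$tmp" fs.aio-max-nr 3145728
cat "$tmp"
rm -f "$tmp"
```

On a real node you would run it as root against /etc/sysctl.conf and then load the file with sysctl -p, or apply the value directly with the sysctl -w command shown above.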

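So which process had opened so many AIO contexts? I went through everything under /proc/pid and found no AIO-related value. At first I suspected the monitoring statements that query the ASM instance. On reflection, though, the error shows a request for only 128 events failing, so each request is small, and those are short-lived connections that release their resources quickly. What we were seeing was clearly something that never releases.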

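There is no way to read per-process AIO usage directly, but fortunately I found a script that monitors AIO allocation. I remembered this article, which also deals with ORA-27090 and insufficient asynchronous I/O: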

https://blog.pythian.com/troubleshooting-ora-27090-async-io-errors/

I installed the packages it needs.
These two are on the ISO:

systemtap  
kernel-devel 

The next two can be downloaded here; pick the ones that match your kernel version:
http://debuginfo.centos.org/6/x86_64/

kernel-debuginfo-common-x86_64
kernel-debuginfo

Once they were installed, I put the probe into a script and ran it in the background.

vi aio
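#!/bin/sh

# Trace AIO context creation and teardown system-wide with SystemTap.
stap -ve '
global allocated, allocatedctx, freed

# io_setup(): a process asks the kernel to reserve maxevents AIO slots.
probe syscall.io_setup {
  allocatedctx[pid()] += maxevents; allocated[pid()]++;
  printf("%d AIO events requested by PID %d (%s)\n",
      maxevents, pid(), cmdline_str());
}

# io_destroy(): count the contexts a process releases.
probe syscall.io_destroy {freed[pid()]++}

# Drop the counters of a tracked process once it exits.
probe kprocess.exit {
  if (allocated[pid()]) {
     printf("PID %d exited\n", pid());
     delete allocated[pid()];
     delete allocatedctx[pid()];
     delete freed[pid()];
  }
}

# When stap stops, report every tracked process that has not exited.
probe end {
  foreach (pid in allocated) {
     printf("PID %d allocated=%d allocated events=%d freed=%d\n",
        pid, allocated[pid], allocatedctx[pid], freed[pid]);
  }
}
'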

nohup sh aio > aio.out & 
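While the probe is still running, aio.out contains only the per-call "requested" lines and the "exited" lines, so it helps to aggregate them. The sketch below is my own addition, not part of the Pythian script: summarize_aio is a hypothetical helper, and the sample input is invented purely to show the line format the probe prints.

```shell
#!/bin/sh
# summarize_aio FILE
# Sum the AIO events requested per PID, ignoring PIDs that have since exited.
# Output columns: total_events  io_setup_calls  pid  (largest first)
summarize_aio() {
    awk '/ AIO events requested by PID / {
             events[$7] += $1; calls[$7]++
         }
         /^PID [0-9]+ exited$/ {
             delete events[$2]; delete calls[$2]
         }
         END {
             for (p in events)
                 printf "%d %d %s\n", events[p], calls[p], p
         }' "$1" | sort -rn
}

# Invented sample in the format printed by the probe above.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
128 AIO events requested by PID 4101 (hypothetical_cmd_a)
128 AIO events requested by PID 4101 (hypothetical_cmd_a)
4096 AIO events requested by PID 4202 (hypothetical_cmd_b)
128 AIO events requested by PID 4303 (hypothetical_cmd_c)
PID 4303 exited
EOF
summarize_aio "$tmp"
rm -f "$tmp"
```

This counts requests only. Contexts released through io_destroy are reported by the probe's end block when stap stops, so a large total here is a lead to follow up, not proof of a leak.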

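I did not sit idle while the background collection was running. There happened to be an environment with a high aio-nr that could be restarted, so I decided to shut its processes down one at a time and see whether that would catch the "thief" stealing the AIO.
The procedure: perform the shutdown steps in session 1, and keep running cat /proc/sys/fs/aio-nr in session 2.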
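Typing the cat command over and over works, but a small loop gives evenly spaced, timestamped readings. This is a sketch under my own assumptions: sample_aio is a hypothetical helper, and the timestamp format simply mimics the one in the session logs in this post.

```shell
#!/bin/sh
# sample_aio FILE COUNT INTERVAL
# Print COUNT timestamped readings of FILE, INTERVAL seconds apart.
sample_aio() {
    file=$1 count=$2 interval=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s  %s\n' "$(date +%Y%m%d_%H:%M:%S)" "$(cat "$file")"
        i=$((i + 1))
        # Skip the pause after the final reading.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Short demo: two readings, one second apart, if the file is available.
if [ -r /proc/sys/fs/aio-nr ]; then
    sample_aio /proc/sys/fs/aio-nr 2 1
fi
```

For a real shutdown test you would raise COUNT, for example sample_aio /proc/sys/fs/aio-nr 600 1 > aio-nr.log, and line the timestamps up against the session 1 log afterwards.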

Test 1: shut down the database instance

Session 1:
20210412_13:24:01  
20210412_13:24:07  13:24:01 sys@MYDB > alter system switch logfile;
20210412_13:24:07  
20210412_13:24:07  System altered.
20210412_13:24:07  
20210412_13:24:07  Elapsed: 00:00:00.32
20210412_13:24:11  13:24:07 sys@MYDB > alter system checkpoint;
20210412_13:24:11  
20210412_13:24:11  System altered.
20210412_13:24:11  
20210412_13:24:11  Elapsed: 00:00:00.21
20210412_13:24:12  13:24:11 sys@MYDB > 
20210412_13:24:12  13:24:12 sys@MYDB > 
20210412_13:25:05  13:24:12 sys@MYDB > shut immediate; 
20210412_13:25:20  Database closed.
20210412_13:25:20  Database dismounted.
20210412_13:26:37  ORACLE instance shut down.
20210412_13:26:41  13:26:37 sys@MYDB > exit
Session 2:
20210412_13:23:55  [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr 
20210412_13:23:55  1063499
20210412_13:25:03  [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr 
20210412_13:25:03  1063627
20210412_13:25:08  [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr 
20210412_13:25:08  1062896
20210412_13:27:31  [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr 
20210412_13:27:31  1050752
20210412_13:27:32  [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr 
20210412_13:27:32  1050752
20210412_13:27:33  [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr 
20210412_13:27:33  1050752

Clearly, the database instance is not what is holding the AIO: shutting it down only moved aio-nr from 1063499 to 1050752.
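The same comparison can be pulled straight out of a session 2 style log. A minimal sketch, assuming value lines consist of exactly a timestamp and a number (as in the log above); aio_drop is a hypothetical helper name, and the sample lines are copied from the first and last readings of that log.

```shell
#!/bin/sh
# aio_drop FILE
# From a log of "timestamp  value" lines, report the first and last
# aio-nr readings and how many slots were released in between.
aio_drop() {
    awk 'NF == 2 && $2 ~ /^[0-9]+$/ {
             if (!n++) first = $2
             last = $2
         }
         END {
             if (n)
                 printf "first=%d last=%d released=%d\n",
                        first, last, first - last
         }' "$1"
}

# First and last readings from the session 2 log above; prompt lines are
# skipped because they have more than two fields.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
20210412_13:23:55  [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:23:55  1063499
20210412_13:27:33  [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:33  1050752
EOF
aio_drop "$tmp"
rm -f "$tmp"
```

The instance gave back 12747 slots, roughly 1% of the total, so more than a million are still held by something else.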

Test 2: stop the cluster stack on the node

Session 1:
20210412_13:27:26  [root@node01 ~]# /u01/app/11.2.0/grid/bin/crsctl stop crs 
20210412_13:27:27  CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'node01'
20210412_13:27:27  CRS-2673: Attempting to stop 'ora.crsd' on 'node01'
20210412_13:27:27  CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'node01'
20210412_13:27:27  CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'node01'
20210412_13:27:27  CRS-2673: Attempting to stop 'ora.cvu' on 'node01'
20210412_13:27:27  CRS-2673: Attempting to stop 'ora.oc4j' on 'node01'
20210412_13:27:27  CRS-2673: Attempting to stop 'ora.CRS.dg' on 'node01'
20210412_13:27:27  CRS-2673: Attempting to stop 'ora.registry.acfs' on 'node01'
20210412_13:27:27  CRS-2673: Attempting to stop 'ora.ARCH.dg' on 'node01'
20210412_13:27:27  CRS-2673: Attempting to stop 'ora.DATA.dg' on 'node01'
20210412_13:27:27  CRS-2673: Attempting to stop <