记录一则因ORA-27302: failure occurred at: sskgxpsnd2 故障导致系统No buffer space available 排查及整理记录

 欢迎您关注我的公众号【尚雷的驿站】
****************************************************************************
公众号:尚雷的驿站
CSDN :https://blog.csdn.net/shlei5580
墨天轮:https://www.modb.pro/u/2436
PGFans:https://www.pgfans.cn/user/home?userId=4159
****************************************************************************

一、问题描述

问题的起因是这样的,本周周三晚上10点多收到一条告警短信,提示当前运维的一套生产Oracle 19C RAC库其中一个节点收到ASM告警信息,那天因为忙别的事情,因为没收到告警电话,所以没有太在意,告警的内容如下:

[xxx-xxx-xxx-DB-Oracle]Oracle ORA ASM Log Error(s) found on xxx-xxx-xxx-DB-Oracle-192.168.xxx.xxx-xxxcdb1: PROBLEM (Value: ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
opiodr aborting process unknown ospid (29344) as a result of ORA-603)2023.11.22 22:01:39

第二天早上接到上司电话,说早上收到这条告警短信,担心数据库有问题,让我排查下问题。
我们的生产数据库是配置了告警级别,如果是生产库实例宕了,会收到告警电话,但我从一直没收到告警电话。心里也担心,会不会是告警电话遗漏,奔着这想法,还是登录服务器看下放心。
我登录数据库,查看了两个节点的grid和Oracle日志,确定两个节点都没出现实例重启过的信息。该告警日志只是在该RAC的节点一ASM日志里出现,最早是在节点一ASM日志11月22日 21点时候出现过,而另一个节点ASM日志还停留在11月8日。

这套Oracle 19C RAC库是从11.2.0.3通过OGG升级的,已经运行了一年多,但出现这个报错还是第一次。

二、问题排查

登录数据库,排查下日志信息,ASM的alert日志第一次出现该报错的信息如下:

2023-11-22T21:16:29.917410+08:00
skgxpvfynet: mtype: 61 process 29344 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_29344.trc  (incident=73617):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM1/incident/incdir_73617/+ASM1_ora_29344_i73617.trc
opiodr aborting process unknown ospid (29344) as a result of ORA-603

提示因为buffer不足,导致操作系统29344进程宕。什么原因导致buffer不足呢,trc里有没有更详细的信息,于是获取了+ASM1_ora_29344.trc和+ASM1_ora_29344_i73617.trc的相关日志信息,+ASM1_ora_29344_i73617.trc信息内容更详细,日志信息过长,此处只展示部分内容:
+ASM1_ora_29344.trc日志信息如下:

[root@xxx-xxx-xxx-DB-oracle-db1 ~]# cat /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_29344.trc
Trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_29344.trc
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.18.0.0.0
Build label:    RDBMS_19.18.0.0.0DBRU_LINUX.X64_230111
ORACLE_HOME:    /u01/app/19.3.0/grid
System name:    Linux
Node name:      xxx-xxx-xxx-DB-oracle-db1
Release:        3.10.0-1160.el7.x86_64
Version:        #1 SMP Mon Oct 19 16:18:59 UTC 2020
Machine:        x86_64
Instance name: +ASM1
Redo thread mounted by this instance: 0 <none>
Oracle process number: 46
Unix process pid: 29344, image: oracle@xxx-xxx-xxx-DB-oracle-db1 (TNS V1-V3)


*** 2023-11-22T21:16:29.924984+08:00
*** CLIENT ID:() 2023-11-22T21:16:29.925039+08:00
*** SERVICE NAME:() 2023-11-22T21:16:29.925053+08:00
*** MODULE NAME:() 2023-11-22T21:16:29.925065+08:00
*** ACTION NAME:() 2023-11-22T21:16:29.925078+08:00
*** CLIENT DRIVER:() 2023-11-22T21:16:29.925090+08:00
 
SKGXP:[7ff10b325050.0]{0}: SKGXPVFYNET: Socket self-test could not verify successful transmission of 32768 bytes (mtype 61).
SKGXP:[7ff10b325050.1]{0}: The network is required to support UDP protocol sends of this size.  Socket is bound to 169.254.6.239.
SKGXP:[7ff10b325050.2]{0}: phase 'send', 0 tries, 100 loops, 17875 ms (last)
KSXP:ksxpsigosderr(): ose error msg[No buffer space available] err[105] OpName[sendmsg] Location[sskgxpsnd2]
struct ksxpp * ksxppg_ [0x7ff10b384d88, 0x7ff10b274c48) = 0x7ff10b274c40
Dump of memory from 0x00007FF10B274C40 to 0x00007FF10B278140
7FF10B274C40 0662F1A0 00007FF1 00000000 00000000  [..b.............]
7FF10B274C50 00000000 00000000 00000000 00000000  [................]
7FF10B274C60 0B325050 00007FF1 00000000 00000000  [PP2.............]
7FF10B274C70 00002DA8 00000000 00000000 00000000  [.-..............]
7FF10B274C80 00000000 00000000 00000000 00000000  [................]
        Repeat 6 times
7FF10B274CF0 00000000 00000000 0B274CF8 00007FF1  [.........L'.....]
7FF10B274D00 0B274CF8 00007FF1 00000000 00000000  [.L'.............]
7FF10B274D10 00000000 00000000 00000000 00000000  [................]
        Repeat 2 times
......
        Repeat 30 times
7FF10B327D90 00000002 00000000 00000000 00000000  [................]
7FF10B327DA0 00000000 00000000 00000000 00000000  [................]
        Repeat 4 times
7FF10B327DF0 00000000 00000000                    [........]        
2023-11-22T21:16:29.988489+08:00
Incident 73617 created, dump file: /u01/app/grid/diag/asm/+asm/+ASM1/incident/incdir_73617/+ASM1_ora_29344_i73617.trc
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2

kgefec: fatal error 0

另一个+ASM1_ora_29344_i73617.trc相关日志信息如下:

Dump file /u01/app/grid/diag/asm/+asm/+ASM1/incident/incdir_73617/+ASM1_ora_29344_i73617.trc
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.18.0.0.0
Build label:    RDBMS_19.18.0.0.0DBRU_LINUX.X64_230111
ORACLE_HOME:    /u01/app/19.3.0/grid
System name:	Linux
Node name:	xxx-xxx-xxx-DB-oracle-db1
Release:	3.10.0-1160.el7.x86_64
Version:	#1 SMP Mon Oct 19 16:18:59 UTC 2020
Machine:	x86_64
Instance name: +ASM1
Redo thread mounted by this instance: 0 <none>
Oracle process number: 46
Unix process pid: 29344, image: oracle@xxx-xxx-xxx-DB-oracle-db1 (TNS V1-V3)


*** 2023-11-22T21:16:29.990116+08:00
*** CLIENT ID:() 2023-11-22T21:16:29.990144+08:00
*** SERVICE NAME:() 2023-11-22T21:16:29.990154+08:00
*** MODULE NAME:() 2023-11-22T21:16:29.990162+08:00
*** ACTION NAME:() 2023-11-22T21:16:29.990171+08:00
*** CLIENT DRIVER:() 2023-11-22T21:16:29.990179+08:00
 
[TOC00000]
Jump to table of contents
Dump continued from file: /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_29344.trc
[TOC00001]
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2

[TOC00001-END]
[TOC00002]
========= Dump for incident 73617 (ORA 603) ========

*** 2023-11-22T21:16:29.991988+08:00
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
[TOC00003]
----- SQL Statement (None) -----
Current SQL information unavailable - no session.
[TOC00003-END]

[TOC00004]
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------

*** 2023-11-22T21:16:30.042181+08:00
ksedst1()+95         call     kgdsdst()            7FFCFFB3F980 000000002
                                                   7FFCFFB39CC0 ? 7FFCFFB39DD8 ?
                                                   000000000 000000082 ?
ksedst()+58          call     ksedst1()            000000000 000000001
                                                   7FFCFFB39CC0 ? 7FFCFFB39DD8 ?
                                                   000000000 ? 000000082 ?
dbkedDefDump()+2344  call     ksedst()             000000000 000000001 ?
8                                                  7FFCFFB39CC0 ? 7FFCFFB39DD8 ?
                                                   000000000 ? 000000082 ?
ksedmp()+577         call     dbkedDefDump()       000000003 000000002
......
| | TOC00047 - START DDE Action: 'dumpASMState' (Sync)
| | | TOC00048 - Trace Bucket Dump Begin: KFKIO
| | | TOC00049 - Trace Bucket Dump Begin: KFKSB
| | TOC00050 - START DDE Action: 'DUMP_PATCH' (Sync)
| | | TOC00051 - Applied Patch Information
| | TOC00052 - START DDE Action: 'dumpKSIPCState' (Sync)
| | TOC00053 - START DDE Action: 'FLUSH_DBREPLAY_CAPTURE_BUFFER' (Sync)
| | TOC00054 - START DDE Action: 'xdb_dump_buckets' (Sync)
| | | TOC00055 - Trace Bucket Dump Begin: xdb_bucket
| | | TOC00056 - Trace Bucket Dump Begin: xdb_bucket
| | | TOC00057 - Trace Bucket Dump Begin: xdb_bucket
| | | TOC00058 - Trace Bucket Dump Begin: xdb_bucket
| | TOC00059 - START DDE Action: 'dumpKGERing' (Sync)
| | TOC00060 - START DDE Action: 'dumpKGEIEParms' (Sync)
| | TOC00061 - START DDE Action: 'dumpKGEState' (Sync)
| | TOC00062 - START DDE Action: 'kpuActionDefault' (Sync)
| | | TOC00063 - Dumping 'Buffer dump info:' addr=0x7ff10661eb28 size=1520 bytes
End of TOC

这套Oracle 生产库运行在PC服务器,服务器内存为 128G,32颗逻辑CPU,采用CDB-PDB模式,查看当前内存占用情况如下:

[root@xxx-xxx-xxx-DB-oracle-db1 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:            125         102           6           4          16          15
Swap:            63           0          63

显示可用内存已剩余很少。
查看SGA和PGA,显示SGA(80G) + PGA(20G),两者总共占用100G,留给操作系统的只有28G。
目前节点一部署含xstream的databus挖掘进程,该应用目前存在bug,但Oracle目前未有很好的解决办法,即该进程经常会导致内存不足,出现0403告警。

通过trc的报错信息,登录MOS及谷歌进行了查询,提示因 min_free_kbytes 内核不足导致出现了操作系统端的“无可用缓冲区空间”的情况。
其中MOS上有篇文章 “ORA-27301: OS failure message: No buffer space available” occurs on OPC RAC (Doc ID 2397062.1),跟进该文章的描述,说出现该问题是一个bug,已在18.2.1修复。但目前出现问题的数据库是Oracle 19c,似乎有些出入。

三、解决方式

因为该问题是第一次出现,目前该报错暂未导致数据库hang或重启,计划对其先进行观察。根据谷歌和MOS上的一些建议是需要调整MTU或者min_free_kbytes参数。目前因为生产应用再连接,暂时计划不重启服务器。

根据 https://dbalifeeasy.com/tag/the-network-is-required-to-support-udp-protocol-sends-of-this-size/ 这篇文章对该问题的处理建议,有两种方式进行处理:
一是收缩SGA的大小,将一部分SGA内存分配给操作系统。
另一种处理方法是通过更改环回接口的MTU和修改内核参数min_free_kbytes解决,建议将vm.min_free_kbytes的值增加到服务器总物理内存的0.4%。

对min_free_kbytes的解释,官方解释如下:

min_free_kbytes:

This is used to force the Linux VM to keep a minimum number
of kilobytes free.  The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.

翻译过来即为:
min_free_kbytes参数用于强制Linux虚拟内存保持一定数量的可用空闲内存。虚拟内存使用此数字为系统中的每个低内存区域计算水印[WMARK_MIN]值。每个低内存区域根据其大小比例获得一定数量的保留空闲页。

系统需要一定量的内存来满足PF_MEMALLOC分配;如果将此设置为低于1024KB,您的系统将变得微妙地不稳定,并且在高负载下容易发生死锁。

设置得太高将立即导致系统OOM(Out of Memory)错误。

min_free_kbytes 代表系统所保留空闲内存的最低限,用于计算影响内存回收的三个参数 watermark[min/low/high]

当前这套Oracle 19C生产库设置的vm.min_free_kbytes大小为 524288
当前系统/etc/sysconfig/network-scripts/ifcfg-lo文件未设置MTU值。

==注意:生产调整参数需谨慎,再谨慎,如果有测试环境,最好在测试环境进行验证测试。

  • 25
    点赞
  • 15
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

尚雷5580

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值