A walkthrough of resolving the Oracle error ORA-27300: OS system dependent operation:fork failed with status: 11

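I first found the following errors in the alert log.
They report that forking a new process failed because a resource was temporarily unavailable.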

2023-06-18T23:04:43.231088-04:00
Errors in file /u01/log/main0612/diag/rdbms/cdb2/cdb21/trace/cdb21_psp0_305069.trc:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn5

Next, look at the trace file /u01/log/main0612/diag/rdbms/cdb2/cdb21/trace/cdb21_psp0_305069.trc

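Judging from the first entry, free memory is not low: about 135 GB (138727.59M) is still available, so this should not be a memory shortage.
The max user processes limit is 65536, which looks somewhat small, so I started the investigation from the max processes limit.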

*** MODULE NAME:(sys$abs_timeout) 2023-06-16T17:34:43.855385-04:00
*** ACTION NAME:(fork new processes) 2023-06-16T17:34:43.855431-04:00

*** 2023-06-16T17:34:43.855356-04:00 (CDB$ROOT(1))
Process startup failed, error stack:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn5
at 0x7ffccfd8c4d0 placed ksb.c@4373
OS - DIAGNOSTICS

loadavg : 221.40 165.09 131.34
Memory (Avail / Total) = 138727.59M / 776630.59M
Swap (Avail / Total) = 307200.00M / 307200.00M
Commit (AS / Limit) = 142356.42M / 506075.29M
Max user processes limits(s / h) = 65536 / 65536

#######################################################


*** MODULE NAME:(sys$abs_timeout) 2023-06-18T23:02:58.183872-04:00
*** ACTION NAME:(fork new processes) 2023-06-18T23:02:58.183910-04:00

*** 2023-06-18T23:02:58.183853-04:00 (CDB$ROOT(1))
Process startup failed, error stack:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn5
at 0x7ffccfd8c4d0 placed ksb.c@4373
OS - DIAGNOSTICS

loadavg : 41.86 34.83 29.18
Memory (Avail / Total) = 6769.54M / 776630.59M
Swap (Avail / Total) = 307100.50M / 307200.00M
Commit (AS / Limit) = 119815.82M / 506075.29M
Max user processes limits(s / h) = 65536 / 65536
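
To avoid reading through every OS - DIAGNOSTICS section by hand, the limit lines can be pulled out of the trace file with a small helper. This is only a sketch: the function name trace_nproc_limits is my own, and it assumes the line format shown in the trace above.

```shell
# Print the soft and hard "Max user processes" limits recorded in an
# Oracle trace file, one "soft hard" pair per OS - DIAGNOSTICS entry.
# Reads the file given as $1, or stdin when no file is given.
trace_nproc_limits() {
    awk -F'=' '/Max user processes limits/ {
        split($2, v, "/")                 # v[1] = soft, v[2] = hard
        gsub(/ /, "", v[1]); gsub(/ /, "", v[2])
        print v[1], v[2]
    }' "${1:--}"
}

# Usage (path taken from the alert log above):
#   trace_nproc_limits /u01/log/main0612/diag/rdbms/cdb2/cdb21/trace/cdb21_psp0_305069.trc
```

For the two entries above this prints 65536 65536 twice, which confirms that the limit seen by the database process was the same on both occasions.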

+++++++++++++++++++++++++++++++++++++++++++++++++++++
ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 3088297
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1000000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

According to the ulimit output, max user processes is unlimited, which does not match the 65536 shown in the Oracle trace file.
The settings in /etc/security/limits.conf are
oracle soft nproc unlimited
oracle hard nproc unlimited
so they are unlimited as well.

ps -ef | grep pmon_cdb23 | grep -v grep
oracle 149076 1 0 Jun18 ? 00:03:11 ora_pmon_cdb23
Let's look directly at the limits of the pmon process
cat /proc/149076/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 33554432 unlimited bytes
Max core file size unlimited unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 65536 65536 processes
Max open files 65536 65536 files
Max locked memory unlimited unlimited bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 3088297 3088297 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
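
Since /proc/<pid>/limits is what a running process really has, it is convenient to extract just the limits of interest. The helper below is a sketch; the name proc_limit is hypothetical, and it only assumes the layout of the limits file shown above.

```shell
# Print the soft and hard values of one limit (for example "Max processes")
# from a /proc/<pid>/limits style file.
#   $1 = path to the limits file, $2 = limit name exactly as shown in the file
proc_limit() {
    awk -v name="$2" 'index($0, name) == 1 {
        rest = substr($0, length(name) + 1)   # everything after the limit name
        split(rest, f, " ")                   # f[1] = soft, f[2] = hard
        print f[1], f[2]
    }' "$1"
}

# Usage: check every running pmon process
#   for pid in $(pgrep -f ora_pmon_); do
#       echo "$pid: $(proc_limit /proc/$pid/limits 'Max processes')"
#   done
```

For the pmon process above, proc_limit /proc/149076/limits 'Max processes' would print 65536 65536.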

Both the max processes and the max open files limits are set rather low here.
Now log in to this machine directly as the oracle user
ssh oracle@x73
oracle@x73's password:
X11 forwarding request failed on channel 0
Last login: Sun Jun 18 20:14:58 2023 from 10.91.68.10
[Mon Jun 19 01:39:42][354713][oracle@nshqae01adm03:~][0]$ echo $$
354713
Then check the limits of this shell process. Here max processes and max open files look normal

/proc/354713][0]# cat /proc/354713/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 10485760 unlimited bytes
Max core file size unlimited unlimited bytes
Max resident set unlimited unlimited bytes
Max processes unlimited unlimited processes
Max open files 1048576 1048576 files
Max locked memory unlimited unlimited bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 3088297 3088297 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
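
The two limits files can also be compared mechanically. Keep in mind that pam_limits applies /etc/security/limits.conf only to sessions opened through PAM (ssh, su, login); my assumption is that pmon is started by the Grid Infrastructure stack rather than by such a session, which would explain why the two outputs differ. The function name limits_diff is hypothetical.

```shell
# Show only the lines that differ between two /proc/<pid>/limits files.
# Lines starting with "<" come from the first file, ">" from the second.
limits_diff() {
    # diff exits with status 1 when the files differ, which is expected here.
    diff "$1" "$2" | grep '^[<>]' || true
}

# Usage: compare pmon with the current login shell
#   limits_diff /proc/149076/limits /proc/$$/limits
```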

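While working on this I googled a lot of material. Some posts suggest editing this file
/etc/security/limits.d/90-nproc.conf
and setting the max processes limit there.
That did not seem to have any effect: the limits of pmon were still wrong and had not changed.

Other posts suggest setting the following in ~oracle/.bashrc
ulimit -u unlimited
ulimit -n 1000000
That did not seem to work either, and during that time I inexplicably could no longer log in as, or su to, the oracle user.
After commenting out these two lines it worked again. I never found the cause.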

In man limits.conf I came across the following passage

       hard
           for enforcing hard resource limits. These limits are set by the superuser and enforced by the Kernel. The user cannot raise his requirement of system resources above
           such values.

       soft
           for enforcing soft resource limits. These limits are ones that the user can move up or down within the permitted range by any pre-existing hard limits. The values
           specified with this token can be thought of as default values, for normal system usage.

       -
           for enforcing both soft and hard resource limits together.

           Note, if you specify a type of '-' but neglect to supply the item and value fields then the module will never enforce any limits on the specified user/group etc. .

So I wanted to see what happens with a setting like this.
In /etc/security/limits.conf, set
oracle -
Would that remove all limits for the oracle user?

First I created a test user
/usr/sbin/useradd -u 94088 -s /bin/bash -g oinstall -G dba -d /u01/testuser1 testuser1
echo testuser1 | /usr/bin/passwd testuser1 --stdin

Then I added the following line to /etc/security/limits.conf
testuser1 -

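Logging in as testuser1 showed that the session was far from unlimited. Several limits were very small: max locked memory was only 65536 bytes, which is alarmingly low, and the open files limit was also too small (soft 1024, hard only 262144).
So the form "testuser1 -", with no item and value, should not be used for a user.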

cat /proc/151755/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 3088297 3088297 processes
Max open files 1024 262144 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 3088297 3088297 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us

Further searching showed that for a service managed by systemd, max open files can also be configured in the service's unit file
/usr/lib/systemd/system/rsyslog.service
LimitNOFILE=16384

However, the Oracle database is not a systemd-managed service, so this approach, which may work for MySQL, does not apply to Oracle DB.

18440251

Finally I learned that the max open files and max processes limits of the Oracle database can be configured in a GI configuration file
$CRS_HOME/crs/install/s_crsconfig__env.txt

#########################################################################
#This file can be used to set values for the NLS_LANG and TZ environment
#variables and to set resource limits for Oracle Clusterware and
#Database processes.
#1. The NLS_LANG environment variable determines the language and
#   characterset used for messages. For example, a new value can be
#   configured by setting NLS_LANG=JAPANESE_JAPAN.UTF8
#2. The Time zone setting can be changed by setting the TZ entry to
#   the appropriate time zone name. For example, TZ=America/New_York
#3. Resource limits for stack size, open files and number of processes
#   can be specified by modifying the appropriate entries.
#
#Do not modify this file except as documented above or under the
#direction of Oracle Support Services.
#########################################################################
TZ=America/New_York
NLS_LANG=AMERICAN_AMERICA.AL32UTF8
CRS_LIMIT_STACK=2048
CRS_LIMIT_OPENFILE=65536
CRS_LIMIT_NPROC=65536
TNS_ADMIN=

The parameter CRS_LIMIT_OPENFILE corresponds to max open files in ulimit. I tried commenting this line out, and with it commented out GI fails to start
./crsctl stat res -init -t
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Status failed, or completed with errors.

Setting this parameter to a number larger than 1048576, or to unlimited, makes max open files end up as 262144.

Setting it to 1048576 works. This is the maximum value, 2 to the power of 20.
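
1048576 is also the default value of the kernel parameter fs.nr_open, which caps the open files limit of a single process. I assume, without having confirmed it, that this is why larger values are not honored. A minimal sketch for checking a planned value against that ceiling (the function name nofile_fits is my own):

```shell
# Succeed if a requested open files value does not exceed the ceiling.
#   $1 = requested value
#   $2 = ceiling; defaults to the running kernel's fs.nr_open
nofile_fits() {
    local want="$1"
    local cap="${2:-$(cat /proc/sys/fs/nr_open)}"
    [ "$want" -le "$cap" ]
}

# 2 to the power of 20
echo $((1 << 20))

# Usage:
#   nofile_fits 1048576 && echo "ok" || echo "too large"
```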

CRS_LIMIT_NPROC corresponds to the max processes limit in ulimit. It can be set to a fairly large number, and that number takes effect. It can also be set to unlimited, and unlimited takes effect as well.
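
To check what is currently configured without opening the file, the CRS_LIMIT_* entries can be read with a small helper. This is a sketch; crs_limit is a hypothetical name, and it assumes the simple KEY=value layout shown above. An already running pmon keeps the limits it was started with, so the comparison with /proc/<pid>/limits is only meaningful for processes started after the change.

```shell
# Print the value of one KEY=value entry from the GI environment file.
# Comment lines start with "#", so they never match the key.
#   $1 = path to the file, $2 = key, e.g. CRS_LIMIT_NPROC
crs_limit() {
    awk -F= -v key="$2" '$1 == key { print $2 }' "$1"
}

# Usage (file name as given above):
#   crs_limit "$CRS_HOME/crs/install/s_crsconfig__env.txt" CRS_LIMIT_NPROC
#   crs_limit "$CRS_HOME/crs/install/s_crsconfig__env.txt" CRS_LIMIT_OPENFILE
```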

During testing I also noticed the following. I set fs.file-max to a large number, about 942 million,
cat sysctl.conf | grep -i file-max
fs.file-max = 942080000
After running sysctl -p to apply it, sysctl -a also showed 942080000, but after the node was rebooted the value became 13631488
sysctl -a | grep -i file-max
fs.file-max = 13631488

Looking at the files under /etc/sysctl.d, some of them are symbolic links to files under /opt/oracle.cellos/tmpl/.
I tried changing the value in /opt/oracle.cellos/tmpl/sysctl.d/99-zzz-exadata-sysctl-20-compute.conf, and a value set in that file is kept after a node reboot.

[Mon Jun 19 16:12:06][366174][root@nshqae01adm01:/etc/sysctl.d][0]# pwd
/etc/sysctl.d
[Mon Jun 19 16:12:08][366174][root@nshqae01adm01:/etc/sysctl.d][0]# ls -al
total 40
drwxr-xr-x 2 root root 4096 Jun 3 23:20 .
drwxr-xr-x 127 root root 12288 Jun 19 16:08 ..
-rw-r--r-- 1 root root 370 Jul 19 2022 90-rds.conf
lrwxrwxrwx 1 root root 14 May 16 20:09 99-sysctl.conf -> ../sysctl.conf
lrwxrwxrwx 1 root root 70 Jun 3 23:20 99-zzz-exadata-sysctl-10-generic.conf -> /opt/oracle.cellos/tmpl/sysctl.d/99-zzz-exadata-sysctl-10-generic.conf
lrwxrwxrwx 1 root root 70 Jun 3 23:20 99-zzz-exadata-sysctl-20-compute.conf -> /opt/oracle.cellos/tmpl/sysctl.d/99-zzz-exadata-sysctl-20-compute.conf
lrwxrwxrwx 1 root root 65 Jun 3 23:20 99-zzz-exadata-sysctl-30-ib.conf -> /opt/oracle.cellos/tmpl/sysctl.d/99-zzz-exadata-sysctl-30-ib.conf
lrwxrwxrwx 1 root root 74 Jun 3 23:20 99-zzz-exadata-sysctl-40-ib-2sockets.conf -> /opt/oracle.cellos/tmpl/sysctl.d/99-zzz-exadata-sysctl-40-ib-2sockets.conf
lrwxrwxrwx 1 root root 66 Jun 3 23:20 99-zzz-exadata-sysctl-50-ol8.conf -> /opt/oracle.cellos/tmpl/sysctl.d/99-zzz-exadata-sysctl-50-ol8.conf

The most reliable way to check the current file-max value is to read the file directly
cat /proc/sys/fs/file-max
13631488
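
A likely explanation for the value reverting after reboot: at boot the files in /etc/sysctl.d are applied in lexical order and a later file overrides an earlier one, and 99-sysctl.conf (the link to /etc/sysctl.conf) sorts before the 99-zzz-exadata-* files. The sketch below lists every file that sets a given key, in that order, so the last line printed is the one that wins. The function name sysctl_key_sources is hypothetical.

```shell
# List every assignment of a sysctl key found in the *.conf files of a
# directory, in lexical file name order (the last line shown wins).
#   $1 = directory, e.g. /etc/sysctl.d    $2 = key, e.g. fs.file-max
sysctl_key_sources() {
    local dir="$1" key="$2" f
    for f in "$dir"/*.conf; do            # the glob expands in sorted order
        [ -e "$f" ] || continue
        awk -F= -v key="$key" '{
            k = $1
            gsub(/[ \t]/, "", k)          # strip blanks around the key
            if (k == key) print FILENAME ": " $0
        }' "$f"
    done
}

# Usage:
#   sysctl_key_sources /etc/sysctl.d fs.file-max
```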

#####################################################
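Check the system resource usage at the time the problem occurred.
According to the file /opt/oracle.ExaWatcher/archive/Top.ExaWatcher/2023_06_18_23_01_15_TopExaWatcher_nshqae01adm01.us.oracle.com.dat.bz2, there were only 9096 threads at that time. That is not many and looks fairly normal, so it does not seem that the number of processes exceeded the configured ulimit.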

zzz <06/18/2023 23:04:41> subcount: 33
top - 23:04:42 up 3 days, 1:45, 3 users, load average: 29.64, 32.64, 29.00
Threads: 9096 total, 25 running, 9049 sleeping, 0 stopped, 22 zombie
%Cpu(s): 13.3 us, 5.5 sy, 0.0 ni, 80.0 id, 0.5 wa, 0.2 hi, 0.5 si, 0.0 st
MiB Mem : 776630.6 total, 6621.8 free, 469683.6 used, 300325.2 buff/cache
MiB Swap: 307200.0 total, 307100.5 free, 99.5 used. 280847.9 avail Mem
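
The 9096 in the top output is the system-wide thread count. On Linux the max processes limit counts threads per real user ID, so the number to compare against 65536 is the thread count of the oracle user. A sketch for counting that on a live system (count_tasks is a hypothetical name):

```shell
# Count the tasks (threads) owned by one user.
# Expects one user name per line on stdin, as printed by "ps -eLo user=".
#   $1 = user name
count_tasks() {
    awk -v u="$1" '$1 == u { n++ } END { print n + 0 }'
}

# Usage:
#   ps -eLo user= | count_tasks oracle
```

Since the whole system had 9096 threads in this snapshot, the oracle user could not have been anywhere near 65536 at that moment.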

/var/log/messages did not show any errors around that time either.

Could the system have run out of file descriptors?

cat /proc/sys/fs/file-nr
48224 0 942000001
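// column 1, 48224: number of allocated file descriptors
// column 2, 0: number of allocated but unused file descriptors
// column 3, 942000001: maximum number of file descriptors available system-wide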
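
The three columns can be turned into a usage figure directly. This is a minimal sketch; fd_usage is a hypothetical name, and it simply reads one line in the /proc/sys/fs/file-nr format from stdin.

```shell
# Summarize a /proc/sys/fs/file-nr line read from stdin:
# allocated handles, free (allocated but unused) handles, system maximum,
# and allocated as a percentage of the maximum.
fd_usage() {
    awk '{ printf "allocated=%s free=%s max=%s used_pct=%.4f\n", $1, $2, $3, 100 * $1 / $3 }'
}

# Usage:
#   fd_usage < /proc/sys/fs/file-nr
```

For the values above the usage is far below one percent, and 48224 would also be well below the post-reboot maximum of 13631488.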

Note that at the time the problem occurred, the maximum number of file descriptors was 13631488.
