现场完成该故障处理后,总结分析过程如下:
一、数据库无法连接的持续告警
------2020-05-13/17:37:52-------
umoni/....@192.168....:1521/pmark
ERROR:
ORA-12537: TNS:connection closed
SP2-0640: Not connected
二、监听日志大量报错.
13-MAY-2020 18:07:01 * (CONNECT_DATA=(FAILOVER_MODE=(TYPE=session)(METHOD=basic)(RETRIES=4)(DELAY=1))(INSTANCE_NAME=pmark1)(SERVICE_NAME=pmark)(CID=(PROGRAM=expora@...)(HOST=...)(USER=...))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.153.....)(PORT=51390)) * establish * pmark * 12518
TNS-12518: TNS:listener could not hand off client connection
TNS-12547: TNS:lost contact
TNS-12560: TNS:protocol adapter error
TNS-00517: Lost contact
Linux Error: 32: Broken pipe
13-MAY-2020 18:07:01 * (CONNECT_DATA=(FAILOVER_MODE=(TYPE=session)(METHOD=basic)(RETRIES=4)(DELAY=1))(INSTANCE_NAME=pmark1)(SERVICE_NAME=pmark)(CID=(PROGRAM=expora@....)(HOST=....)(USER=.....))) * (ADDRESS=(PROTOCOL=tcp)(HOST.....)(PORT=51391)) * establish * pmark * 12518
TNS-12518: TNS:listener could not hand off client connection
TNS-12547: TNS:lost contact
TNS-12560: TNS:protocol adapter error
TNS-00517: Lost contact
Linux Error: 32: Broken pipe
三、监听状态检查正常
LSNRCTL> status
Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
...
...
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168......)(PORT=1521)))
Services Summary...
Service "pmark" has 1 instance(s).
Instance "pmark1", status READY, has 1 handler(s) for this service...
The command completed successfully
四、数据库状态检查正常,alert无异常报错
[oracle@h5l2001rs ~]$ ora event
\n=============Thu Nov 12 10:29:06 CST 2015===================\n
INST_ID EVENT# EVENT COUNT(*)
---------- ---------- ---------------------------------------------------------------- ----------
1 147 db file sequential read 4
2 157 gc cr multi block request 1
[oracle@....~]$
May13 17:39:09 2020
Thread 1 advanced to log sequence 1971 (LGWR switch)
Current log# 7 seq# 1971 mem# 0: /oradata/redo1/pmark/redo7_01.log
Current log# 7 seq# 1971 mem# 1: /oradata/redo2/pmark/redo7_02.log
.......
五、长连接数量检查同样没有异常,连接数同样没有达到上限
SQL> select inst_id,count(1) from gv$process group by inst_id;
INST_ID COUNT(1)
---------- ----------
1 1119
2 129
4 50
3 49
SQL> select inst_id,name,value from gv$parameter where name = 'processes';
INST_ID NAME VALUE
-------------------- -------------------- --------------------
2 processes 12000
1 processes 12000
4 processes 12000
3 processes 12000
RESOURCE_NAME CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION LIMIT_VALUE
------------------------------ -------------------- -------------------- -------------------- --------------------
processes 140 144 12000 12000
sessions 152 157 18048 18048
enqueue_locks 187 193 209624 209624
enqueue_resources 95 119 84428 UNLIMITED
ges_procs 138 141 10430 10430
ges_ress 74667 74667 387307 UNLIMITED
ges_locks 47091 49439 561722 UNLIMITED
ges_cache_ress 4192 4192 0 UNLIMITED
ges_reg_msgs 226 1317 12852 UNLIMITED
ges_big_msgs 58 655 1888 UNLIMITED
ges_rsv_msgs 0 0 1000 1000
gcs_resources 1531712 1832552 UNLIMITED UNLIMITED
gcs_shadows 1224857 1229614 UNLIMITED UNLIMITED
六、系统nproc 参数核查发现 proc soft limit 2047,但hard limit足够大正常不会出现问题
[oracle@h5l2001rs trace]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1030799
max locked memory (kbytes, -l) 112167992
max memory size (kbytes, -m) unlimited
open files (-n) 2047
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 2047
cpu time (seconds, -t) unlimited
max user processes (-u) 2047
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
[oracle@h5l2001rs trace]$ tail -50 /etc/security/limits.conf
....
grid soft nproc 2047
grid hard nproc 16384
grid soft nofile 2047
grid hard nofile 65536
grid soft stack 2047
grid hard stack 32768
oracle soft nproc 2047
oracle hard nproc 16384
oracle soft nofile 2047
oracle hard nofile 65536
oracle soft stack 2047
oracle hard stack 32768
七、尝试调整soft limit参数值后重启监听,短连接仍然报错,初步判断与监听、proc limit无关。
尝试使用strace在本地使用tns连接观察
strace -to /tmp/strace1.out sqlplus umoni/....@192168..../pmark <------这里需要注意使用网络tns模拟程序客户端连接
write(1, "ERROR:\n", 7) = 7
write(1, "ORA-27140: attach to post/wait f"..., 47) = 47
write(1, "ORA-27300: OS system dependent o"..., 76) = 76
write(1, "ORA-27301: OS failure message: O"..., 55) = 55
write(1, "ORA-27302: failure occurred at: "..., 43) = 43
write(1, "ORA-27303: additional informatio"..., 79) = 79
write(1, "= 501 (oinstall)\n", 17) = 17
write(1, "\n", 1) = 1
ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_egid failed with status: 1
ORA-27301: OS failure message: Operation not permitted <--------------一般为权限问题
ORA-27302: failure occurred at: skgpwinit6
ORA-27303: additional information: startup egid = 503 (asmadmin), current egid
八、 核查发现故障节点丢失S权限
node1:
[oracle@......1 bin]$ ls -l oracle
-rwxr-x--x 1 oracle oinstall 239914191 Aug 19 21:44 oracle <----------权限异常,丢失S权限
[oracle@......1 bin]$
node2:
[oracle@.....2 bin]$ ls -l oracle
-rwsr-s--x 1 oracle oinstall 239914191 Aug 19 21:44 oracle
[oracle@....2 bin]$
九、修正权限后,恢复正常
chmod 6751 oracle
注:
1.这里由于是rac环境,监听一般启动在grid用户下,当grid用户fork的进程调用oracle 相关lib文件时,需要以文件所有者权限调用,这里对应的便是S权限,具体S权限可以查询相关linux文档
2.参考MOS《TNS-12518 Can Not Connect To RAC 11gR2 Database (文档 ID 1088754.1)》