项目场景:
用户应用厂商发现数据库服务器CPU的top使用率在95%~100%,经过相关查询发现是由ora_scm0进程所占用
问题描述:
提示:如下代码所示,cpu占用率超出正常范围达到了100%;但是我在查询时候发现top和sar查询出来的结果不一样,有懂的小伙伴可以科普一下。
[root@rac2 ~]# top
top - 20:22:01 up 667 days, 23:36, 1 user, load average: 2.25, 2.39, 2.61
Tasks: 987 total, 2 running, 985 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.1 us, 0.7 sy, 0.0 ni, 94.0 id, 2.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13097886+total, 4762016 free, 83879424 used, 42337424 buff/cache
KiB Swap: 13421772+total, 13290905+free, 1308672 used. 41708892 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
189061 oracle 20 0 40.4g 33260 26560 R 100.0 0.0 374940:59 ora_scm0_orcl12
58346 root 20 0 162812 2928 1540 R 15.0 0.0 0:00.06 top
217733 oracle 20 0 40.4g 41640 34344 S 5.0 0.0 8:07.40 oracle_217733_o
224310 oracle 20 0 40.4g 34384 26744 S 5.0 0.0 0:59.73 oracle_224310_o
1 root 20 0 191548 3884 2308 S 0.0 0.0 576:22.21 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:26.53 kthreadd
4 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 140:58.68 ksoftirqd/0
7 root rt 0 0 0 0 S 0.0 0.0 1:45.28 migration/0
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
9 root 20 0 0 0 0 S 0.0 0.0 3199:42 rcu_sched
10 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 lru-add-drain
11 root rt 0 0 0 0 S 0.0 0.0 2:06.79 watchdog/0
12 root rt 0 0 0 0 S 0.0 0.0 2:14.56 watchdog/1
13 root rt 0 0 0 0 S 0.0 0.0 1:53.11 migration/1
14 root 20 0 0 0 0 S 0.0 0.0 109:25.44 ksoftirqd/1
137989 root 20 0 899332 32748 22400 S 0.7 0.0 7618:28 orarootagent.bi
[root@rac2 ~]# sar 1 10
Linux 3.10.0-1062.el7.x86_64 (rac2) 2021骞?2?3妤 _x86_64_ (40 CPU)
20娑22?7绉 CPU %user %nice %system %iowait %steal %idle
20娑22?8绉 all 3.71 0.00 0.75 2.43 0.00 93.11
20娑22?9绉 all 4.02 0.00 0.35 2.49 0.00 93.15
20娑22?0绉 all 3.26 0.00 0.28 2.48 0.00 93.98
20娑22?1绉 all 3.56 0.00 0.43 2.43 0.00 93.59
20娑22?2绉 all 3.26 0.00 0.25 2.58 0.00 93.91
20娑22?3绉 all 4.09 0.00 0.33 2.36 0.00 93.23
20娑22?4绉 all 3.86 0.00 0.48 2.58 0.00 93.08
20娑22?5绉 all 4.16 0.00 0.33 2.41 0.00 93.11
20娑22?6绉 all 3.73 0.00 0.38 2.15 0.00 93.74
20娑22?7绉 all 3.23 0.00 0.33 2.21 0.00 94.23
骞冲娑榇: all 3.69 0.00 0.39 2.41 0.00 93.51
原因分析:
alert日志:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
经过分析,该报错是 SCM0进程收集GES、GCS的相关的统计信息时间超长后抛出异常,根据MOS文档ID 2373451.1说明,SCM0收集的信息在12.2版本无价值,
因此本次报错可忽略或者禁用DLM收集统计信息参数。
DLM描述说明:
分布式锁管理器(distributed lock management DLM),简单说对于RAC环境,所有数据的修改,都需要事先以节点为单位,去DLM申请节点对块的修改权限,DLM对块的资源进行多节点修改进行协调。
SCM0进程即DLM数据的采集以及管理进程
SCM0简短描述
收集和管理与全局排队服务(GES)和全局缓存服务(GCS)相关的统计信息
解决方案:
此次方案定为:修改_dlm_stats_collect参数,进行单节点停机,恢复后再次停机第二个节点。
SQL> alter system set "_dlm_stats_collect"=0 scope=spfile sid='*';
System altered.
SQL>
SQL>exit
[root@rac2 ~]$ ps -ef|grep scm
oracle 507879 1 0 May22 ? 00:02:28 ora_scm0_ANBOB1
[root@rac2 ~]# kill -9 507879
[root@rac2 ~]# su - oracle
涓涓娆$诲斤煎 12?23 20:26:13 CST 2021pts/0 涓
^[[?1;2c[oracle@rac2 ~]$
[oracle@rac2 ~]$
[oracle@rac2 ~]$ sqlplus / as sysdba
SQL*Plus: Release 12.2.0.1.0 Production on Thu Dec 23 20:46:59 2021
Copyright (c) 1982, 2016, Oracle. All rights reserved.
Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
Total System Global Area 4.2950E+10 bytes
Fixed Size 29865680 bytes
Variable Size 9261025584 bytes
Database Buffers 3.3554E+10 bytes
Redo Buffers 104349696 bytes
ORA-01105: mount is incompatible with mounts by other instances
ORA-01174: DB_FILES is 2000 buts needs to be 200 to be compatible
#到这里发现报错,db_files问题,遂判断节点一、二的db_files参数不一致。修改db_files参数需要两节点同时停机,才可以保证参数一致重启,因本次没有俩节点全部重启权限,遂将db_files参数修改一致后,再次重启
SQL> show parameter db_files
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
db_files integer 2000
SQL> alter system set db_files=200 scope=spfile sid='*';
System altered.
SQL>
SQL>
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
Total System Global Area 4.2950E+10 bytes
Fixed Size 29865680 bytes
Variable Size 9261025584 bytes
Database Buffers 3.3554E+10 bytes
Redo Buffers 104349696 bytes
Database mounted.
Database opened.
SQL>
#节点一按照同样方法进行重启,先kill掉scm0进程,再进行shutdown immediate