今天上午,广州一Oracle数据库 11g rac集群的节点1出现告警提示:Suspending MMON slave action kewrmrfsa_ for 82800 seconds
出现的现象是,普通用户登录系统挂起:
[oracle@gzbmi01 ~]$ sqlplus test/test
SQL*Plus: Release 11.2.0.4.0 Production on Fri May 19 11:18:33 2017
Copyright (c) 1982, 2013, Oracle. All rights reserved.
从操作系统看,系统的cpu和内存资源都很空闲,但是oracle有比较多的远程访问连接挂起
[oracle@gzbmi01 ~]$ps -ef|grep LOCAL=NO|wc -l
137
[oracle@gzbmi01 ~]$
节点1使用sys用户能正常登录,但是不能发起任何查询,节点1发起的查询全部挂起:
[oracle@gzbmi01 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Fri May 19 11:18:33 2017
Copyright (c) 1982, 2013, Oracle. All rights reserved.
SQL> select event,count(*) from v$session where wait_class<>'Idle' group by event;
但是rac的另外一个节点则是正常的,从节点2查询数据库的等待事件:
SQL> select event,count(*) from v$session where wait_class<>'Idle' group by event;
EVENT COUNT(*)
---------------------------------------------------------------- ----------
enq: HW - contention 1
SQL*Net message to client 1
SQL>select sid,username,sql_id,module,machine,program from v$session where event='enq: HW - contention' and sql_id is not null
SID USERNAME SQL_ID MODULE MACHINE PROGRAM
---------- ------------------------------ ------------- ---------------------------------------------------------------- ---------------------------------------------------------------- ------------------------------------------------
430 BMICNEW 4vs91dcv7u1p6 plsqldev.exe Q3RF184E plsqldev.exe
SQL> set long 100000
SQL> select sql_fulltext from v$sqlarea where sql_id='4vs91dcv7u1p6';
SQL_FULLTEXT
--------------------------------------------------------------------------------
insert into sys.aud$( sessionid,entryid,statement,ntimestamp#, userid,userhost,t
erminal,action#,returncode, obj$creator,obj$name,auth$privileges,auth$grantee, n
ew$owner,new$name,ses$actions,ses$tid,logoff$pread, logoff$lwrite,logoff$dead,co
mment$text,spare1,spare2, priv$used,clientid,sessioncpu,proxy$sid,user$guid, in
stance#,process#,xid,scn,auditid, sqlbind,sqltext,obj$edition,dbid) values(:1,
:2,:3,SYS_EXTRACT_UTC(SYSTIMESTAMP), :4,:5,:6,:7,:8, :9,:10,:11,:12,
:13,:14,:15,:16,:17, :18,:19,:20,:21,:22, :23,:24,:25,:26,:27, :28,
:29,:30,:31,:32, :33,:34,:35,:36)
SQL> select sid,spid, osuser, s.program from v$session s,v$process p where
s.paddr=p.addr and sid=&sid; 2
Enter value for sid: 430
old 2: s.paddr=p.addr and sid=&sid
new 2: s.paddr=p.addr and sid=430
SID SPID OSUSER PROGRAM
---------- ------------------------ ------------------------------ ------------------------------------------------
430 33139 Administrator plsqldev.exe
SQL>
查询到的sql是数据库自身的基表进行的Insert,于是尝试查杀该会话对应的系统进程:kill -9 33139
但是,查杀会话进程后问题依然存在。
有人说可能跟归档路径有关系,经查看数据库的归档路径是+ARCH,在节点一发起:alter system switch logfile;
能正常执行并且能在归档的asm磁盘组arch中查到该文件:
Fri May 19 10:43:11 2017
Thread 1 advanced to log sequence 21922 (LGWR switch)
Current log# 7 seq# 21922 mem# 0: +DATA/gzbmi/onlinelog/group_7.349.918726525
经过与地市技术人员沟通决定,重新启动节点1上的数据库实例:
[oracle@gzbmi01 ~]$ ps -ef|grep ora_
oracle 36354 1 0 Jan13 ? 01:19:43 ora_pmon_gzbmi1
oracle 36356 1 0 Jan13 ? 00:41:35 ora_psp0_gzbmi1
oracle 36358 1 0 Jan13 ? 1-05:44:24 ora_vktm_gzbmi1
oracle 36362 1 0 Jan13 ? 00:08:05 ora_gen0_gzbmi1
oracle 36364 1 0 Jan13 ? 02:44:09 ora_diag_gzbmi1
oracle 36366 1 0 Jan13 ? 00:33:55 ora_dbrm_gzbmi1
oracle 36368 1 0 Jan13 ? 00:17:22 ora_ping_gzbmi1
oracle 36370 1 0 Jan13 ? 00:06:58 ora_acms_gzbmi1
oracle 36372 1 0 Jan13 ? 12:43:37 ora_dia0_gzbmi1
oracle 36374 1 0 Jan13 ? 16:38:02 ora_lmon_gzbmi1
oracle 36376 1 0 Jan13 ? 05:14:26 ora_lmd0_gzbmi1
oracle 36378 1 0 Jan13 ? 13:07:21 ora_lms0_gzbmi1
oracle 36382 1 0 Jan13 ? 13:04:36 ora_lms1_gzbmi1
oracle 36386 1 0 Jan13 ? 13:06:21 ora_lms2_gzbmi1
oracle 36390 1 0 Jan13 ? 00:06:41 ora_rms0_gzbmi1
oracle 36392 1 0 Jan13 ? 00:17:44 ora_lmhb_gzbmi1
oracle 36394 1 0 Jan13 ? 00:57:45 ora_mman_gzbmi1
oracle 36396 1 0 Jan13 ? 01:37:48 ora_dbw0_gzbmi1
oracle 36398 1 0 Jan13 ? 01:39:29 ora_dbw1_gzbmi1
oracle 36400 1 0 Jan13 ? 01:39:10 ora_dbw2_gzbmi1
oracle 36402 1 0 Jan13 ? 01:38:37 ora_dbw3_gzbmi1
oracle 36404 1 0 Jan13 ? 03:27:51 ora_lgwr_gzbmi1
oracle 36406 1 0 Jan13 ? 04:29:21 ora_ckpt_gzbmi1
oracle 36408 1 0 Jan13 ? 01:26:17 ora_smon_gzbmi1
oracle 36410 1 0 Jan13 ? 00:04:03 ora_reco_gzbmi1
oracle 36412 1 0 Jan13 ? 00:09:02 ora_rbal_gzbmi1
oracle 36414 1 0 Jan13 ? 00:08:44 ora_asmb_gzbmi1
oracle 36422 1 0 Jan13 ? 00:03:50 ora_d000_gzbmi1
oracle 36424 1 0 Jan13 ? 00:03:48 ora_s000_gzbmi1
oracle 36426 1 0 Jan13 ? 00:12:38 ora_mark_gzbmi1
oracle 36432 1 0 Jan13 ? 06:48:47 ora_lck0_gzbmi1
oracle 36434 1 0 Jan13 ? 00:15:49 ora_rsmn_gzbmi1
oracle 36537 1 0 Jan13 ? 00:25:46 ora_arc0_gzbmi1
oracle 36539 1 0 Jan13 ? 00:29:15 ora_arc1_gzbmi1
oracle 36541 1 0 Jan13 ? 00:07:45 ora_arc2_gzbmi1
oracle 36543 1 0 Jan13 ? 00:25:44 ora_arc3_gzbmi1
oracle 36546 1 0 Jan13 ? 00:07:06 ora_gtx0_gzbmi1
oracle 36548 1 0 Jan13 ? 00:45:36 ora_rcbg_gzbmi1
oracle 41230 1 0 10:01 ? 00:00:03 ora_p000_gzbmi1
oracle 51435 47466 0 11:06 pts/2 00:00:00 grep ora_
[oracle@gzbmi01 ~]$ ps -ef|grep asm_
grid 22587 1 0 Jan13 ? 00:56:55 asm_pmon_+ASM1
grid 22589 1 0 Jan13 ? 00:40:03 asm_psp0_+ASM1
grid 22591 1 0 Jan13 ? 1-05:47:40 asm_vktm_+ASM1
grid 22595 1 0 Jan13 ? 00:07:35 asm_gen0_+ASM1
grid 22597 1 0 Jan13 ? 02:31:03 asm_diag_+ASM1
grid 22599 1 0 Jan13 ? 00:21:24 asm_ping_+ASM1
grid 22601 1 0 Jan13 ? 19:48:03 asm_dia0_+ASM1
grid 22603 1 0 Jan13 ? 06:51:21 asm_lmon_+ASM1
grid 22605 1 0 Jan13 ? 04:58:18 asm_lmd0_+ASM1
grid 22607 1 0 Jan13 ? 08:30:03 asm_lms0_+ASM1
grid 22611 1 0 Jan13 ? 00:15:48 asm_lmhb_+ASM1
grid 22613 1 0 Jan13 ? 00:08:13 asm_mman_+ASM1
grid 22615 1 0 Jan13 ? 00:09:25 asm_dbw0_+ASM1
grid 22617 1 0 Jan13 ? 00:09:44 asm_lgwr_+ASM1
grid 22619 1 0 Jan13 ? 00:23:57 asm_ckpt_+ASM1
grid 22621 1 0 Jan13 ? 00:06:58 asm_smon_+ASM1
grid 22623 1 0 Jan13 ? 01:34:49 asm_rbal_+ASM1
grid 22625 1 0 Jan13 ? 01:08:11 asm_gmon_+ASM1
grid 22627 1 0 Jan13 ? 00:23:08 asm_mmon_+ASM1
grid 22629 1 0 Jan13 ? 03:20:59 asm_mmnl_+ASM1
grid 22631 1 0 Jan13 ? 00:28:17 asm_lck0_+ASM1
grid 22715 1 0 Jan13 ? 00:06:40 asm_asmb_+ASM1
oracle 51438 47466 0 11:06 pts/2 00:00:00 grep asm_
[grid@gzbmi01 ~]$srvctl stop instance -d gzbmi -n gzbmi01
oracle用户的告警日志提示:
Fri May 19 11:01:46 2017
Shutting down instance (immediate)
Stopping background process SMCO
Shutting down instance: further logons disabled
Stopping background process QMNC
Stopping background process MMNL
License high water mark = 212
Stopping Job queue slave processes, flags = 7
Job queue slave processes stopped
Fri May 19 11:02:40 2017
Stopping background process CJQ0
Fri May 19 11:06:49 2017
Active call for process 41230 user 'oracle' program 'oracle@gzbmi01 (P000)'
SHUTDOWN: waiting for active calls to complete.
Fri May 19 11:07:34 2017
All dispatchers and shared servers shutdown
ALTER DATABASE CLOSE NORMAL /* db agent *//* {1:61387:21743} */
Fri May 19 11:09:59 2017
Reconfiguration started (old inc 4, new inc 6)
List of instances:
1 2 (myinst: 1)
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Fri May 19 11:09:59 2017
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri May 19 11:09:59 2017
LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri May 19 11:09:59 2017
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Fri May 19 11:11:48 2017
License high water mark = 212
USER (ospid: 52209): terminating the instance
Instance terminated by USER, pid = 52209
[oracle@gzbmi01 ~]$ ps -ef|grep ora_
[oracle@gzbmi01 ~]$
重新启动节点1上的实例:
[grid@gzbmi01 ~]$srvctl start instance -d gzbmi -n gzbmi01
oracle用户告警日志提示:
Fri May 19 11:14:12 2017
Adjusting the default value of parameter parallel_max_servers
from 1280 to 470 due to the value of parameter processes (500)
Starting ORACLE instance (normal)
......
Starting background process QMNC
Fri May 19 11:14:53 2017
QMNC started with pid=46, OS id=52912
Completed: ALTER DATABASE OPEN /* db agent *//* {1:61387:21775} */
Fri May 19 11:14:56 2017
Starting background process CJQ0
Fri May 19 11:14:56 2017
CJQ0 started with pid=51, OS id=52941
节点1上的实例重新启动后恢复正常:
[oracle@gzbmi01 ~]$ sqlplus bmiCnew/bmiCnew
SQL*Plus: Release 11.2.0.4.0 Production on Fri May 19 11:18:33 2017
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
SQL>
应用程序启动后功能测试恢复正常,具体的节点1实例hang的原因不太清楚,地市技术人员说前期做过数据抽取,
但是从数据库告警日志和集群日志没有找到资源紧张引发的告警信息。