某数据库 实例1出现监听不可用。原因:月初数据库压力大,一个高并发的查询sql(每秒达到216,表大小1.44G 数据量:3873863 没有用到索引 走全表扫描),导致整个数据库系统负载冲高,数据库一个节点的rac后台进程在系统高负载下没有轮询到网卡状态返回信息,重启网卡,同时重启监听,由于负载太高,系统尝试重启失败,部分登录失败,手工重启监听后解决;
现象:
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER)))
TNS-12541: TNS:no listener
TNS-12560: TNS:protocol adapter error
TNS-00511: No listener
Linux Error: 2: No such file or directory
15:12 dba反应个账数据库登录不上,dba登录主机查看实例1监听已经关闭
通知产品dba 查看并启动监听
问题分析:
开发dba侧:
建立临时表保存会话信息:
create table tmp_gt_0301 as SELECT * FROM dba_hist_active_sess_history
where to_char(sample_time, 'yyyymmdd hh24:mi') >= '20160301 14:45'
and to_char(sample_time, 'yyyymmdd hh24:mi') <= '20160301 14:59'
查找问题发生的时间点:
SELECT count(*), to_char(sample_time, 'yyyymmdd hh24:mi:ss') FROM tmp_gt_0301 group by to_char(sample_time, 'yyyymmdd hh24:mi:ss')
查看等待事件分类:
SELECT event, count(*),to_char(sample_time, 'yyyymmdd hh24:mi:ss')
FROM tmp_gt_0301 t group by t.event ,to_char(sample_time, 'yyyymmdd hh24:mi:ss') ;
buffer busy waits 1235 20160301 14:54:13buffer busy waits 1078 20160301 14:54:24
buffer busy waits 1064 20160301 14:54:03
buffer busy waits 901 20160301 14:54:34
gc buffer busy acquire 889 20160301 14:53:52
gc buffer busy acquire 883 20160301 14:53:42
buffer busy waits 854 20160301 14:53:52
buffer busy waits 854 20160301 14:53:42
buffer busy waits 851 20160301 14:53:10
gc buffer busy acquire 793 20160301 14:53:10
gc buffer busy acquire 659 20160301 14:53:31
gc buffer busy acquire 650 20160301 14:53:21
根据高峰期等待事件查看是否有阻塞:
select blocking_session,count(*) from tmp_gt_0301 where event='buffer busy waits'
and to_char(sample_time, 'yyyymmdd hh24:mi:ss') ='20160301 14:54:13'
group by blocking_session ;
2257 103
6077 102
4753 102
4576 101
267 98
根据阻塞的源头 查找阻塞的原因:
select blocking_session,count(*) from tmp_gt_0301 where to_char(sample_time, 'yyyymmdd hh24:mi:ss') ='20160301 14:50:55' group by blocking_session ;
1 461
82 7747 115
29 8077 23
26 4829 22
40 8240 16
61 3431 16
SELECT * FROM tmp_gt_0301 where to_char(sample_time, 'yyyymmdd hh24:mi:ss') ='20160301 14:50:55' and session_id='7747'
g9a9wp3mau6sy gc cr block 2-way
SELECT sql_id,event,count(*) FROM tmp_gt_0301 where to_char(sample_time, 'yyyymmdd hh24:mi:ss') ='20160301 14:50:55' and blocking_session='7747' group by sql_id,event;
1 g9a9wp3maxxxx enq: SQ - contention 115
select S_T_XXX_LOG.nextval from dual
SELECT * FROM dba_sequences where sequence_name='S_T_xxxxx_LOG';------cache:500
排查了这几个阻塞的session 发现都在做insert和select操作 没有长时间的固定的阻塞源头
select S_T_XXXXXV_LOG.nextval from dua
INSERT INTO T_XXXKET_XXXXX_DETAIL
insert into T_XXX_XXXXX_INFO
INSERT INTO T_MXXXXT_XXXXX_XXXXSS
都属于正常业务 sequence的cache 500 也可以
SELECT sql_id, to_char(sample_time, 'yyyymmdd hh24:mi:ss'),count(*) FROM tmp_gt_0301 where sql_opname='SELECT' group by sql_id, to_char(sample_time, 'yyyymmdd hh24:mi:ss') order by 3 desc ;
1 4b88vs5099ddk 20160301 14:55:06 216
2 4b88vs5099ddk 20160301 14:55:37 215
3 4b88vs5099ddk 20160301 14:55:16 213
4 4b88vs5099ddk 20160301 14:56:49 211
5 4b88vs5099ddk 20160301 14:54:24 198
6 4b88vs5099ddk 20160301 14:54:34 197
SELECT count(*),to_char(sample_time, 'yyyymmdd hh24:mi:ss') FROM tmp_gt_0301 where sql_id='4b88vs5099ddk'
group by to_char(sample_time, 'yyyymmdd hh24:mi:ss');
216 20160301 14:55:06
213 20160301 14:55:16
该sql信息:
4b88vs5099ddk:
select accept_trans_seq_no from t_log_bill_writeoff M_C_W where terminal_seq_no = :1
Plan hash value: 2759573899
------------------------------------------------------------------------------------------
| Id | Operation | Name | E-Rows |E-Bytes| Cost (%CPU)| E-Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 2 (100)| |
| 1 | TABLE ACCESS FULL| T_LOG_BILL_WRITEOFF | 1 | 54 | 2 (0)| 00:00:01 |
该sql在问题发生时间段 一个小时内2:00-3:00
实例1:执行8257次 平均每次执行4s
实例2:执行9413次 平均每次执行3s
平时该sql 单实例每小时执行不足100次
该表大小1.44G 数据量:3873863 termxxxxl_seq_no因此考虑在该列上建立索引
列terXxxxx_seq_no 选择性非常好,建完索引后 单条sql执行由原来的平均4s降到平均3ms 性能提高1000倍以上。
主机系统侧:
实例1:
![](http://img.blog.itpub.net/blog/attachment/201603/7/30109892_1457334169Z4ns.jpg?x-oss-process=style/bb)
集群侧:
Listener_log:
实例1:
No longer listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
No longer listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.xxx.xx.xxx)(PORT=1521)))
Listener completed notification to CRS on stop crs关闭监听
01-MAR-2016 14:54:21 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=xxxx)(USER=grid))(COMMAND=stop)(ARGUMENTS=64)(SERVICE=LISTENER)(VERSION=186647296)) * stop * 0
Tue Mar 01 15:18:57 2016
System parameter file is /u01/xxxxxx/network/admin/listener.ora
Log messages written to /u01/xxxxxx/tnslsnr/spayrxxxx/listener/alert/log.xml
Trace information written to /u01/apxxxxx/listener/trace/ora_19333160_1.trc
Trace level is currently 0
Crsd_log:
实例1:
2016-03-01 14:54:23.191: [UiServer][11567] {1:31649:5735} Done for ctx=111ad0670
2016-03-01 14:54:23.434: [ A**][10025] {0:2:9900} Received the reply to the message: RESOURCE_STOP[ora.LISTENER.lsnr spayrdb3 1] ID 4099:586286 from the agent /u01/app/11id/in/gent_grid
2016-03-01 14:54:23.436: [ A**][10025] {0:2:9900} A** Proxy Server sending the reply to PE for message:RESOURCE_STOP[ora.LISTENER.lsnr spayrdb3 1] ID 4099:1733429
2016-03-01 14:54:23.902: [ A**][10025] {0:2:9900} Received the reply to the message: RESOURCE_STOP[ora.LISTENER.lsnr spayr 1] ID 4099:586286 from the agent /u01/app2.0/grid/bn/ont_grid
2016-03-01 14:54:23.904: [ A**][10025] {0:2:9900} A** Proxy Server sending the last reply to PE for message:RESOURCE_STOP[ora.LISTENER.lsnr spayrdb3 1] ID 4099:1733429 关闭监听
实例2:
2016-03-01 14:54:23.920: [ CRSPE][11310] {0:2:9900} Received reply to action [Stop] message ID: 1733429
2016-03-01 14:54:23.921: [ CRSPE][11310] {0:2:9900} RI [ora.LISTENER.lsnr spayrdb3 1] new internal state: [STABLE] old value: [STOPPING]
2016-03-01 14:54:23.921: [ CRSPE][11310] {0:2:9900} RI [ora.LISTENER.lsnr spayrdb3 1] new external state [OFFLINE] old value: [ONLINE] label = []
2016-03-01 14:54:23.924: [ CRSPE][11310] {0:2:9900} CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'spa' succeeded crs从'spa' 发出关闭监听
Alert_log:
实例1:
***********************************************************************
Fatal NI connect error 12537, connecting to:
(LOCAL=NO)
Tue Mar 01 14:54:20 2016
Tue Mar 01 14:54:22 2016
***********************************************************************
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 – Production
Tue Mar 01 17:02:10 2016
***********************************************************************
Fatal NI connect error 12537, connecting to:
(LOCAL=NO)
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Time: 01-MAR-2016 17:02:10
Tracing not turned on.
Tns error struct:
ns main err code: 12537
处理问题总结:
月初数据库系统压力本来就很大,再加一个高并发的异常sql,平时该sql很少执行,突然高并发执行 再加系统比较繁忙,导致整个数据库系统负载冲高。因此月初月末需要加强对系统的监控。
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/30109892/viewspace-2050765/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/30109892/viewspace-2050765/