故障现象
---------- -----------------------------------
3 row cache lock 4364
3 cursor: pin S wait on X 92
3 latch: row cache objects 51
1 SQL*Net message to client 21
3 log file sync 20
3 direct path write temp 3
1 direct path write temp 2
4 gc cr request 2
1 gc cr request 1
1 gcs log flush sync 1
1 db file scattered read 1
1 db file scattered read 1
1 db file sequential read 1
1 log file parallel write 1
1 Disk file operations I/O 1
1 gc cr multi block request 1
1 latch: cache buffers chains 1
2 gc cr request 1
3 gc cr request 1
3 gc buffer busy acquire 1
3 db file sequential read 1
3 log file parallel write 1
3 latch: cache buffers chains 1
row cache lock非常高,还伴随着cursor: pin S wait on X,latch: row cache objects。业务已经跑不动,数据库hang住
select INSTANCE_NUMBER,p1,count(*) cnt from dba_hist_active_sess_history where event='row cache lock'
and SAMPLE_TIME>=to_date('2018-08-31 10:00:00','yyyy-mm-dd hh24:mi:ss') and SAMPLE_TIME <=to_date('2018-08-31 10:40:00','yyyy-mm-dd hh24:mi:ss')
group by INSTANCE_NUMBER,p1 order by cnt;
INSTANCE_NUMBER P1 CNT
--------------- ---------- ----------
1 8 1
1 7 24133
2 10 37774
3 10 39140
4 10 43463
ash中查找到row cache lock非常高,p1为cache id,然后我尝试通过cache id找dc
SQL> select INST_ID, CACHE#,TYPE,GETS,PARAMETER from gv$rowcache where CACHE# in (7,8,10) order by gets;
INST_ID CACHE# TYPE GETS PARAMETER
---------- ---------- ----------- ---------- ---------
1 7 SUBORDINATE 0 dc_users
4 7 SUBORDINATE 0 dc_users
4 7 SUBORDINATE 0 dc_users
3 7 SUBORDINATE 0 dc_users
3 7 SUBORDINATE 0 dc_users
2 7 SUBORDINATE 0 dc_users
2 7 SUBORDINATE 0 dc_users
1 7 SUBORDINATE 0 dc_users
1 8 SUBORDINATE 6222456 dc_object_grants
4 8 SUBORDINATE 6326296 dc_object_grants
3 8 SUBORDINATE 6525903 dc_object_grants
2 8 SUBORDINATE 6664793 dc_object_grants
4 7 SUBORDINATE 88523395 dc_users
2 7 SUBORDINATE 96842687 dc_users
4 8 PARENT 142839628 dc_objects
4 10 PARENT 334033324 dc_users
2 8 PARENT 344092524 dc_objects
3 10 PARENT 689860688 dc_users
2 10 PARENT 1000412160 dc_users
3 7 SUBORDINATE 1728570957 dc_users
1 8 PARENT 1807576364 dc_objects
1 10 PARENT 1889806171 dc_users
3 8 PARENT 2169044239 dc_objects
1 7 SUBORDINATE 3754684107 dc_users
dc有dc_users,dc_objects,dc_object_grants。其实这样搜索是不准确的,不应该用in,而应该单个的去查找
--可忽略cache8,实际问题在dc_users
SQL> select type,parameter from v$rowcache where cache#=8;
TYPE PARAMETER
----------- --------------------------------
PARENT dc_objects
SUBORDINATE dc_object_grants
SQL> select type,parameter from v$rowcache where cache#=7;
TYPE PARAMETER
----------- --------------------------------
SUBORDINATE dc_users
SUBORDINATE dc_users
SUBORDINATE dc_users
SQL> select type,parameter from v$rowcache where cache#=10;
TYPE PARAMETER
----------- --------------------------------
PARENT dc_users
问题出现在dc_users
ash分析:
SQL> select INSTANCE_NUMBER,USER_ID,count(*) cnt from dba_hist_active_sess_history
2 where SAMPLE_TIME>=to_date('2018-08-31 10:00:00','yyyy-mm-dd hh24:mi:ss') and SAMPLE_TIME <=to_date('2018-08-31 10:30:00','yyyy-mm-dd hh24:mi:ss')
3 group by INSTANCE_NUMBER,USER_ID order by 1,cnt;
INSTANCE_NUMBER USER_ID CNT
--------------- ---------- ----------
1 5 1
1 210 1
1 99 1
1 94 1
1 98 17
1 104 18
1 258 75
1 110 147
1 95 167
1 107 588
1 97 1078
1 0 24356
2 98 1
2 189 1
2 99 1
2 84 16
2 258 34
2 107 127
2 0 52562
3 97 4
3 104 10
3 258 88
3 110 138
3 107 572
3 96 760
3 98 14149
3 0 37858
4 97 1
4 209 3
4 258 16
4 104 86
4 107 90
4 0 50424
33 rows selected.
SQL> select user_id,username from dba_users where user_id =0;
USER_ID USERNAME
---------- ------------------------------
0 SYS
阻塞最多的是用户是sys用户。
使用mark html on把ash记录为html格式,去寻找源头
spool ash0831_opname.html
set mark html on
set pagesize 20000
select INSTANCE_NUMBER,SAMPLE_TIME,event,sql_opname,sql_id,count(*) cnt from dba_hist_active_sess_history
where SAMPLE_TIME>=to_date('2018-08-31 10:00:00','yyyy-mm-dd hh24:mi:ss') and SAMPLE_TIME <=to_date('2018-08-31 10:30:00','yyyy-mm-dd hh24:mi:ss')
group by INSTANCE_NUMBER,SAMPLE_TIME,event,sql_opname,sql_id order by 2,1;
spool off
第一次出现row cache lock在2节点10:18:04,会话执行了授权操作,sql_id= bcc9fs22hu2fb。后面2节点有大量的grant object
再次缩短时间,查找更详细的会话信息(由于会话量很大,html如果字段查太多会非常大)
select INSTANCE_NUMBER,SAMPLE_TIME,session_id,BLOCKING_SESSION,current_obj#,user_id,event,sql_id,P1 from dba_hist_active_sess_history
where SAMPLE_TIME>=to_date('2018-08-31 10:17:00','yyyy-mm-dd hh24:mi:ss') and SAMPLE_TIME <=to_date('2018-08-31 10:20:00','yyyy-mm-dd hh24:mi:ss')
and INSTANCE_NUMBER=2 and event='row cache lock';
这里有几个关键信息 :
user_id 84:
blocking session 1877:
user_id为104的只是一个普通会话,后面的会话有cursor:pin S wait on X的等待,但是时间发生在10:18:04之后,所有blocking session参考价值不高。
current_obj#:
sql_id bcc9fs22hu2fb:
hist sql没有抓取到
p1 10:
最后还是分析回来了,定位在dc_users上,umon用户执行过grant语句。很遗憾没有找到sql text。
在查找问题过程中,用户承认了用umon用户执行过grant语句
grant select any dictionary to uxxx1;
故障总结:
在业务高峰期时,用户信息、用户权限信息等都加载在rowcache dc_users中,用户就算频繁登陆和执行SQL也可以直接读取内存中的数据进行权限验证,但是如果一旦已经登陆到数据库的用户的权限发生变化,rowcache缓存的dc_uses也会发生变化,数据库在处理dc_users数据和该用户权限验证时就会产生等待, 这个等待就是基于dc_users的row cache lock等待。
所以在业务高峰期一定不要对已登录的用户授权。