墨墨导读:本文来自墨天轮读者“Anbob”供稿,分享利用12c渐进式DASH分析"ON CPU"的过程。
墨天轮主页:https://www.modb.pro/u/3895
背景
6日下午17:20接到告警CPU使用率高, 环境 Oracle 12cR2 3-nodes RAC on Linux。登录系统查看是实例1活动会话接近200。这是一个<90 cpu的主机。Session status 全是on cpu。当然从我们的监控上很直观定位2条select SQL 占用了50% 左右的db time。首先可能怀疑是SQL 执行量增长或执行计划改变等效率变差了, 这是一个1类系统,需要快速恢复应用,客户要求先kill部分会话, 查看了sql执行计划并未改变, 可是在KILL 部分会话后还在负载逐渐增长,没有多长时间,17:29 主机ping不通了,并且没有重启。
应用还好有配置TAF自动failover到了其它节点,并且运行良好,就当时的负载能把主机压死确实不多见,难道X86就如此脆弱? 下一步如何分析?我们的AWR 周期是30分钟,AWR中关于SQL stat的信息还没收集。当然事后会说为什么不收systemdump , 为什么不收hanganalyze, 为什么不手动收AWR, 为什么不把v$ash 物化下来, 可往往故障就是这么猝不及防,如何根据当前的信息进行分析?
如果此时有套监控采集信息或许能够提供部分信息, 如分析sysstat中sql executes是不是增长了?奇怪这该时间段我们的监控sysstat数据也没有采集到。
当时KILL 部分会话后收集
-- Session Snapper v4.14 BETA - by Tanel Poder ( http://blog.tanelpoder.com ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet! :)
----------------------------------------------------------------------------------------------------
Active% | INST | SQL_ID | SQL_CHILD | EVENT | WAIT_CLASS
----------------------------------------------------------------------------------------------------
3500% | 1 | 56twj7s93jaz4 | 0 | ON CPU | ON CPU
1700% | 1 | dp8fnkzqdt3km | 0 | ON CPU | ON CPU
1300% | 1 | 05qszn0ufs4ff | 0 | ON CPU | ON CPU
1200% | 1 | 1b9zsamawq6mh | 0 | ON CPU | ON CPU
1100% | 1 | a7kcm21nbngvx | 0 | ON CPU | ON CPU
1100% | 1 | 2v7njuhdw0pgm | 1 | ON CPU | ON CPU
1000% | 1 | 13nf2mwh3xmsh | 1 | ON CPU | ON CPU
900% | 1 | | 0 | ON CPU | ON CPU
700% | 1 | 60t0pum7f1pbm | 1 | ON CPU | ON CPU
600% | 1 | 60t0pum7f1pbm | 2 | ON CPU | ON CPU
-- End of ASH snap 1, end=2021-01-06 17:24:43, seconds=9, samples_taken=1
可以确认当时全时ON CPU, 从上图看负载是从16:58左右开始, 分析这类问题DASH是常用的分析工具,之前总结过ASH相关的BLOG《Know more about Oracle ASH》,前不久同事还问我ASH是何时从memory刷到disk中的, 1,默认AWR的快照间隔;2,如果ASH buffer 已满时MMNL进程都会负责这事。如果17:00 到 17:30的AWR 没有形成,那17:00前那2分钟的DASH(dba_hist_active_sess_history)可以做为分析的入口。在统计top SQL 历史快照时发现实例1其实16:30 – 17:00 AWR snapshot也没形成。问题时间段跨越的2个AWR snapshot都没生成,那DASH 真没有数据了吗?11G前这情况是没有DASH了,从12c开始不是的。
12c ASH 从memroy刷到disk形为改变
检查DASH数据是否存在?
SQL> select to_char(sample_time,'yyyymmdd hh24:mi'),count(*) --not * 10
from dba_hist_active_sess_history
where sample_time >to_date('2021-01-06 17','yyyy-mm-dd hh24') and sample_time <to_date('2021-01-06 18','yyyy-mm-dd hh24')
and instance_number=1
group by to_char(sample_time,'yyyymmdd hh24:mi') order by 1;
TO_CHAR(SAMPLE COUNT(*)
-------------- ----------
20210106 17:00 503
20210106 17:01 273
20210106 17:02 572
20210106 17:03 601
20210106 17:04 607
20210106 17:05 293
20210106 17:06 571
20210106 17:07 559
20210106 17:08 553
20210106 17:09 286
20210106 17:10 510
20210106 17:11 458
20210106 17:12 443
20210106 17:13 434
20210106 17:14 455
20210106 17:15 710
20210106 17:16 457
20210106 17:17 433
20210106 17:18 481
20210106 17:19 503
20210106 17:20 259
20210106 17:21 527
20210106 17:22 530
20210106 17:23 529
20210106 17:24 209
25 rows selected.
Note:
实例1 crash(17:28)之前的ASH数据基本上都在,这也正是oracle ASH即使在系统负载很高时也可以很好的工作。检查了db alert log没有提示ASH buffer 不足的问题, 那ASH是什么刷新频率刷到DISK上的呢?
SQL> select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1;
MAX(SAMPLE_TIME) SYSDATE
--------------------------------------------------------------------------- -----------------
14-JAN-21 10.54.32.234 PM 20210114 22:58:23
SQL> select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1;
MAX(SAMPLE_TIME) SYSDATE
--------------------------------------------------------------------------- -----------------
14-JAN-21 10.54.32.234 PM 20210114 22:59:52
SQL> r
1* select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1
MAX(SAMPLE_TIME) SYSDATE
--------------------------------------------------------------------------- -----------------
14-JAN-21 10.59.54.686 PM 20210114 23:00:56
SQL> r
1* select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1
MAX(SAMPLE_TIME) SYSDATE
--------------------------------------------------------------------------- -----------------
14-JAN-21 11.04.26.622 PM 20210114 23:07:58
Note:
非AWR snapshot flush时间,DASH也在逐渐的更新最新数据,基本上可以判断,当前的ASH是大概5分钟更新到DASH(dba_hist_active_sess_history)中, 那继续查找一下相关的ASH参数。
SQL> select
2 n.indx
3 , to_char(n.indx, 'XXXX') i_hex
4 , n.ksppinm pd_name
5 , c.ksppstvl pd_value
6 , n.ksppdesc pd_descr
7 from sys.x$ksppi n, sys.x$ksppcv c
8 where n.indx=c.indx
9 and
10 lower(n.ksppinm) || ' ' || lower(n.ksppdesc) like lower('%\_ash%')
11 escape '\' ;
INDX I_HEX NAME VALUE DESCRIPTION
---------- ----- ------------------------------------ ----------- ----------------------------------------------------------------------
4546 11C2 _ash_sampling_interval 1000 Time interval between two successive Active Session samples in
millisecs
4547 11C3 _ash_size 1048618 To set the size of the in-memory Active Session History buffers
4548 11C4 _ash_enable TRUE To enable or disable Active Session sampling and flushing
4549 11C5 _ash_disk_write_enable TRUE To enable or disable Active Session History flushing
4550 11C6 _ash_disk_filter_ratio 10 Ratio of the number of in-memory samples to the number of samples
actually written to disk
4551 11C7 _ash_eflush_trigger 66 The percentage above which if the in-memory ASH is full the emergency
flusher will be triggered
4552 11C8 _ash_sample_all FALSE To enable or disable sampling every connected session including ones
waiting for idle waits
4553 11C9 _ash_dummy_test_param 0 Oracle internal dummy ASH parameter used ONLY for testing!
4554 11CA _ash_min_mmnl_dump 90 Minimum Time interval passed to consider MMNL Dump