客户反映数据库时快时慢,数据库版本11.2.0.4.0,操作系统RHEL5U5,查看数据库的活动会话,发现基本100%的SQL都在同一条SQL,而且大多都是latch: row cache objects等待。
01 | SQL> select inst_id,username,sql_id,event from gv$session where username is not null and status='ACTIVE'; |
03 | INST_ID USERNAME SQL_ID EVENT |
04 | ---------- ------------------------------ ------------- --------------------------------- |
05 | 1 SYS 8s2qkvk056ugr PX Deq: Execution Msg |
06 | 1 SYS 8s2qkvk056ugr PX Deq: Execute Reply |
07 | 1 PRESSO 9rwzwamtgv6m6 gc cr request |
08 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
09 | 2 PRESSO dfftdnm7cu76f resmgr:cpu quantum |
10 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
11 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
12 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
13 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
14 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
15 | 2 PRESSO 4z9c5071vvaz5 resmgr:cpu quantum |
16 | 2 PRESSO dfftdnm7cu76f resmgr:cpu quantum |
17 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
18 | 2 SYS 8s2qkvk056ugr PX Deq: Execution Msg |
19 | 2 PRESSO dfftdnm7cu76f resmgr:cpu quantum |
20 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
21 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
22 | 2 PRESSO dfftdnm7cu76f latch: cache buffers chains |
23 | 2 PRESSO dfftdnm7cu76f resmgr:cpu quantum |
24 | 2 PRESSO dfftdnm7cu76f resmgr:cpu quantum |
25 | 2 PRESSO dfftdnm7cu76f enq: RC - Result Cache: Contention |
latch: row cache objects等待事件通常是在修改数据字典相关的等待,通常由于硬解析过高导致,查看这条SQL的SQL语句。
01 | SQL> select sql_text from v$sqltext where sql_id='dfftdnm7cu76f' order by piece |
04 | --------------------------------------------------------------------------------------------------------- |
06 | from (select t.apply_id, |
10 | (select count(bd.BAD_RECORD_ID) as n |
11 | from biz_badrecoed_info bd wherebd.BAD_RECORD_ID = 4 and bd.PUNISH_ORG = o.org_id) || '-' || |
12 | (select count(bd.BAD_RECORD_ID) as nll |
13 | from biz_badrecoed_info bd |
14 | where bd.BAD_RECORD_ID = 1 |
15 | and bd.ID_CARD = bb.card_code) || '-' || |
16 | (select count(bd.BAD_RECORD_ID) as nlllfrom biz_badrecoed_info bd where bd.BAD_RECORD_ID = 2 and bd.ID_CARD = bb.card_code) || '-' || |
17 | (select count(bd.BAD_RECORD_ID) as nllll |
18 | from biz_badrecoed_info bd |
19 | where bd.BAD_RECORD_ID = 3 |
20 | and bd.ID_CARD = bb.card_code) bdn, |
23 | trim(ps.cn_name) as psname, |
25 | lead(t.apply_id, 1) over(order by t.create_time, t.apply_id desc) as nextID, |
26 | lag(t.apply_id, 1) over(order by t.create_time, t.apply_id desc) as prevID, |
27 | t.create_time as create_time |
28 | from biz_presscard_application t, |
30 | biz_reporter_base_info bb, |
32 | biz_pressstation_info ps |
33 | where t.org_id = o.org_id |
34 | and bcode.code_sort_id = '1' |
35 | and t.reporter_uuid = bb.reporter_uuid |
36 | and to_char(t.apply_status) = bcode.code_id |
37 | andt.sat_id = ps.sat_id(+) |
38 | and t.apply_status = '2' |
39 | and t.next_orgid = '1' |
40 | ORDER BY T.create_time, t.apply_id DESC) |
可见这条SQL没有使用绑定变量,修改cursor_sharing参数,在数据库层打开绑定变量。
01 | SQL> show parameter cursor |
04 | ------------------------------------ ----------- ------------------------------ |
05 | cursor_bind_capture_destination string memory+disk |
06 | cursor_sharing string SIMILAR |
07 | cursor_space_for_time boolean FALSE |
08 | open_cursors integer 3000 |
09 | session_cached_cursors integer 50 |
10 | SQL> alter system set cursor_sharing=SIMILAR sid='*'; |
再次查询这个等待已经很少,但出现了大量的latch: cache buffers chains等待事件。
01 | SQL> select inst_id,username,sql_id,event from gv$session where username is not null and status='ACTIVE' and username !='SYS'; |
03 | INST_ID USERNAME SQL_ID EVENT |
04 | ---------- ------------------------------ ------------- ------------------------------------ |
05 | 1 PRESSO 1fsdcuajuxncg enq: RC - Result Cache: Contention |
06 | 2 PRESSO dfftdnm7cu76f latch: cache buffers chains |
07 | 2 PRESSO dfftdnm7cu76f latch: cache buffers chains |
08 | 2 PRESSO dfftdnm7cu76f resmgr:cpu quantum |
09 | 2 PRESSO dfftdnm7cu76f latch: cache buffers chains |
10 | 2 PRESSO dfftdnm7cu76f latch: cache buffers chains |
11 | 2 PRESSO dfftdnm7cu76f resmgr:cpu quantum |
12 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
13 | 2 PRESSO dfftdnm7cu76f latch: cache buffers chains |
14 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
15 | 2 PRESSO dfftdnm7cu76f latch: cache buffers chains |
16 | 2 PRESSO dfftdnm7cu76f latch: cache buffers chains |
17 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
18 | 2 PRESSO dfftdnm7cu76f latch: cache buffers chains |
19 | 2 PRESSO dfftdnm7cu76f latch: cache buffers chains |
20 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
21 | 2 PRESSO dfftdnm7cu76f latch: cache buffers chains |
22 | 2 PRESSO dfftdnm7cu76f resmgr:cpu quantum |
23 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
24 | 2 PRESSO dfftdnm7cu76f latch: row cache objects |
25 | 2 PRESSO dfftdnm7cu76f enq: RC - Result Cache: Contention |
这套应用代码写的相当那啥,SQL还不能动,只能从数据库角度去解决问题。对这套系统来说,热点块一直都是个问题,为了解决这个问题,数据库的block_size已经修改为4K。
1 | SQL> show parameter block |
4 | ------------------------------------ ----------- ------------------------------ |
5 | db_block_buffers integer 0 |
6 | db_block_checking string FALSE |
7 | db_block_checksum string TYPICAL |
8 | db_block_size integer 4096 |
9 | db_file_multiblock_read_count integer 136 |
现在热点块依旧很严重,那么就只能通过调整pctfree来减少热点块了,先查出热点块严重的表。
02 | 2 FROM (SELECT O.OWNER, O.OBJECT_NAME, O.OBJECT_TYPE, SUM(TCH) TOUCHTIME |
03 | 3 FROM X$BH B, DBA_OBJECTS O |
04 | 4 WHERE B.OBJ = O.DATA_OBJECT_ID |
06 | 6 GROUP BY O.OWNER, O.OBJECT_NAME, O.OBJECT_TYPE |
07 | 7 ORDER BY SUM(TCH) DESC) |
11 | OWNER OBJECT_NAME OBJECT_TYPE TOUCHTIME |
12 | ------------------------------ ---------------------------------------- ------------------- ---------- |
13 | PRESSO BIZ_REPORTER_BASE_INFO TABLE 3691280 |
14 | PRESSO BIZ_REPORTER_EDU_INFO TABLE 3547004 |
15 | PRESSO BIZ_REPORTER_ORG_INFO TABLE 2273524 |
16 | PRESSO BIZ_PRESSCARD_LOGOUT TABLE 2099499 |
17 | PRESSO BIZ_REPORTER_INFO TABLE 1619598 |
18 | PRESSO BIZ_PRESSCARD_APPLICATION TABLE 1191751 |
19 | PRESSO BIZ_SYSTEM_MESSAGE_INFO TABLE 730829 |
20 | PRESSO BIZ_REPORTER_EXTEND_INFO TABLE 610540 |
21 | SYS WRH$_SQLSTAT TABLE PARTITION 193465 |
22 | PRESSO IDX_REPORTER_UUID INDEX 190901 |
调整热点块较高的表的pctfree。
01 | SQL> ALTER TABLE PRESSO.BIZ_REPORTER_BASE_INFO PCTFREE 30; |
05 | SQL> ALTER TABLE PRESSO.BIZ_REPORTER_EDU_INFO PCTFREE 30; |
09 | SQL> ALTER TABLE PRESSO.BIZ_REPORTER_ORG_INFO PCTFREE 30; |
这条SQL同样伴随着resmgr:cpu quantum等待事件,这显然是和资源管理器相关的等待事件,告警日志也可以看到相关的信息。
1 | Setting Resource Manager plan SCHEDULER[0x32DB]:DEFAULT_MAINTENANCE_PLAN via scheduler window |
2 | Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter |
3 | Begin automatic SQL Tuning Advisor run for special tuning task "SYS_AUTO_SQL_TUNING_TASK" |
4 | End automatic SQL Tuning Advisor run for special tuning task "SYS_AUTO_SQL_TUNING_TASK" |
当前服务器CPU使用率在50%左右,对PC服务器来讲,CPU超过50%通常是很危险的,就当前系统而言,AUTO_SQL_TUNING和资源管理器是不想看到的,禁用这些。
01 | SQL> alter system set resource_manager_plan='' sid='*'; |
05 | SQL> execute dbms_scheduler.set_attribute('WEEKNIGHT_WINDOW','RESOURCE_PLAN',''); |
07 | PL/SQL procedure successfully completed. |
09 | SQL> execute dbms_scheduler.set_attribute('WEEKEND_WINDOW','RESOURCE_PLAN',''); |
11 | PL/SQL procedure successfully completed. |
13 | SQL> execute dbms_scheduler.set_attribute('MONDAY_WINDOW','RESOURCE_PLAN',''); |
15 | PL/SQL procedure successfully completed. |
17 | SQL> execute dbms_scheduler.set_attribute('TUESDAY_WINDOW','RESOURCE_PLAN',''); |
19 | PL/SQL procedure successfully completed. |
21 | SQL> execute dbms_scheduler.set_attribute('THURSDAY_WINDOW','RESOURCE_PLAN',''); |
23 | PL/SQL procedure successfully completed. |
25 | SQL> execute dbms_scheduler.set_attribute('FRIDAY_WINDOW','RESOURCE_PLAN',''); |
27 | PL/SQL procedure successfully completed. |
29 | SQL> execute dbms_scheduler.set_attribute('SATURDAY_WINDOW','RESOURCE_PLAN',''); |
31 | PL/SQL procedure successfully completed. |
33 | SQL> execute dbms_scheduler.set_attribute('SUNDAY_WINDOW','RESOURCE_PLAN',''); |
35 | PL/SQL procedure successfully completed. |
38 | 2 DBMS_AUTO_TASK_ADMIN.DISABLE( |
39 | 3 client_name => 'auto space advisor', |
41 | 5 window_name => NULL); |
45 | PL/SQL procedure successfully completed. |
过断时间再次查看,发现这条SQL的等待事件又出现了enq: RC – Result Cache: Contention。这受隐含参数_result_cache_timeout的影响,在11.2.0.4.0版本默认是10秒,在10g版本是60秒。
04 | 3 y.ksppstdf isdefault |
09 | 8 x.inst_id = userenv('Instance') and |
10 | 9 y.inst_id = userenv('Instance') and |
11 | 10 x.indx = y.indx and |
12 | 11 x.ksppinm like '%result_cache%' |
14 | 13* translate(x.ksppinm, ' _', ' ') |
18 | -------------------------------------------------------------------------------- ---------- --------- |
19 | _client_result_cache_bypass FALSE TRUE |
20 | client_result_cache_lag 3000 TRUE |
21 | client_result_cache_size 0 TRUE |
22 | _optimizer_ads_use_result_cache TRUE TRUE |
23 | _result_cache_auto_dml_monitoring_duration 15 TRUE |
24 | _result_cache_auto_dml_monitoring_slots 4 TRUE |
25 | _result_cache_auto_dml_threshold 16 TRUE |
26 | _result_cache_auto_dml_trend_threshold 20 TRUE |
27 | _result_cache_auto_execution_threshold 1 TRUE |
28 | _result_cache_auto_size_threshold 100 TRUE |
29 | _result_cache_auto_time_distance 300 TRUE |
30 | _result_cache_auto_time_threshold 1000 TRUE |
31 | _result_cache_block_size 1024 TRUE |
32 | _result_cache_copy_block_count 1 TRUE |
33 | _result_cache_deterministic_plsql FALSE TRUE |
34 | _result_cache_global TRUE TRUE |
35 | result_cache_max_result 100 TRUE |
36 | result_cache_max_size 2147483648 TRUE |
37 | result_cache_mode FORCE TRUE |
38 | result_cache_remote_expiration 0 TRUE |
39 | _result_cache_timeout 10 TRUE |
降低_result_cache_timeout参数的值可以减少enq: RC – Result Cache: Contention等待事件。
经过查询,发现这些SQL运行时间已经超过2小时,在这期间应用程序重启过,这些进程消耗了大量的资源,并且有释放。经开发人员确认,超过15分钟以上的连接都是不正常的,要回收掉,通过修改profile,设定会话连接时间15分钟。
01 | SQL> alter profile "DEFAULT" limit CONNECT_TIME 15; |
05 | SQL> select * from dba_profiles; |
07 | PROFILE RESOURCE_NAME RESOURCE LIMIT |
08 | ------------------------------ -------------------------------- -------- ---------------------------------------- |
09 | DEFAULT COMPOSITE_LIMIT KERNEL UNLIMITED |
10 | DEFAULT SESSIONS_PER_USER KERNEL UNLIMITED |
11 | DEFAULT CPU_PER_SESSION KERNEL UNLIMITED |
12 | DEFAULT CPU_PER_CALL KERNEL UNLIMITED |
13 | DEFAULT LOGICAL_READS_PER_SESSION KERNEL UNLIMITED |
14 | DEFAULT LOGICAL_READS_PER_CALL KERNEL UNLIMITED |
15 | DEFAULT IDLE_TIME KERNEL UNLIMITED |
16 | DEFAULT CONNECT_TIME KERNEL 15 |
17 | DEFAULT PRIVATE_SGA KERNEL UNLIMITED |
18 | DEFAULT FAILED_LOGIN_ATTEMPTS PASSWORD UNLIMITED |
19 | DEFAULT PASSWORD_LIFE_TIME PASSWORD UNLIMITED |
20 | DEFAULT PASSWORD_REUSE_TIME PASSWORD UNLIMITED |
21 | DEFAULT PASSWORD_REUSE_MAX PASSWORD UNLIMITED |
22 | DEFAULT PASSWORD_VERIFY_FUNCTION PASSWORD NULL |
23 | DEFAULT PASSWORD_LOCK_TIME PASSWORD UNLIMITED |
24 | DEFAULT PASSWORD_GRACE_TIME PASSWORD UNLIMITED |
杀掉这些进程后,问题解决,再次运行这些SQL,很快就有结果返回了。这个故障导致数据库响应时快时慢,原因是连到节点2的会话,由于服务器资源被这些没有回收的进程消耗和占用,速度就相对较慢,而节点1资源充足,分配到节点1的操作就相对较快。