Database Performance: How to reduce "library cache: mutex X" waits for scalability in the Oracle database
As the demand for database workload throughput increases along with server CPU core counts, maintaining scalability can become a challenge even for the most optimized workloads. In fact, in some scenarios following best practices can itself result in contention as more sessions running on more CPUs compete for access to the same SQL or data. In this blog post I will look at an extremely high throughput transactional Oracle database workload and the options available for improving concurrency and throughput without modifying the application.
The workload I will use is the HammerDB Oracle OLTP workload on a 4-socket Intel Xeon server. Importantly, the server CPU utilisation is over 95% and the throughput is already very high, so we need to be aware that contention may not be "true" contention: if the CPU is already fully utilised, apparent contention may be the result of a process waiting for CPU time as much as genuine contention on a particular object. Nevertheless, we are in the business of maximising our IT investments, and it is interesting to see that even at full CPU utilisation there are options to reduce contention and improve performance still further. This HammerDB workload is a good example of an optimized workload as it shares cursors, reduces parsing and makes use of bind variables. When a workload is run, a cursor is opened for each stored procedure and, as shown here in Oratcl, the statement that calls the stored procedure is parsed once only:
set curn_sl [oraopen $lda ]
set sql_sl "BEGIN slev(:st_w_id,:st_d_id,:threshold); END;"
oraparse $curn_sl $sql_sl
return $curn_sl
When the stored procedure is called, bind variables are used:
orabind $curn_sl :st_w_id $w_id :st_d_id $stock_level_d_id :THRESHOLD $threshold
oraexec $curn_sl
Using bind variables means that the same cursor can be shared between sessions, reducing the amount of memory needed for the shared pool and keeping hard parsing activity to a minimum. Running the workload and measuring the output with an AWR report shows the following top 10 foreground events:
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Total Wait Wait % DB
Event Waits Time (sec) Avg(ms) time Wait Class
------------------------------ ------------ ---------- ------- ------ ----------
DB CPU 34.2K 84.5
library cache: mutex X 1,443,752 2284 2 5.6 Concurrenc
cursor: pin S 1,060,588 2004.2 2 5.0 Concurrenc
latch: In memory undo latch 1,036,240 401 0 1.0 Concurrenc
latch: enqueue hash chains 103,783 331.7 3 .8 Other
enq: TX - row lock contention 171,164 266.9 2 .7 Applicatio
db file sequential read 249,612 151.8 1 .4 User I/O
log file sync 35,981 117.3 3 .3 Commit
db file scattered read 155,862 105.7 1 .3 User I/O
buffer busy waits 270,967 90 0 .2 Concurrenc
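That the cursor sharing described above is working as intended can be confirmed directly from the shared pool. As a sketch (column availability may vary by Oracle version), comparing executions with parse calls and version counts in V$SQLAREA shows whether cursors are being reused rather than re-parsed:

-- Sketch: high executions with comparatively low parse_calls and a low
-- version_count indicates effective cursor sharing and bind variable reuse.
SELECT sql_id, executions, parse_calls, version_count,
       SUBSTR(sql_text, 1, 40) AS sql_text
FROM   v$sqlarea
WHERE  parsing_schema_name = 'TPCC'
AND    sql_text LIKE 'BEGIN%'
ORDER  BY executions DESC;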
Performance is good and CPU utilisation is maximised; however, the top wait events of library cache: mutex X and cursor: pin S are related to the good practices of cursor sharing. Each of these events covers multiple mutexes, and drilling down into the AWR report shows more detailed information on each.
Wait
Mutex Type Location Sleeps Time (ms)
--------------------- -------------------------------- ------------ ------------
Cursor Pin kksfbc [KKSCHLFSP2] 634,827 933,897
Library Cache kglpin1 4 610,798 747,221
Library Cache kglpndl1 95 599,973 753,431
Library Cache kglpnal1 90 503,140 155,556
Cursor Pin kksLockDelete [KKSCHLPIN6] 454,516 701,271
Library Cache kglhbh1 63 2 1
Here the top mutexes are related to child cursor lookup activity (kksfbc) and library cache pinning and unpinning activity (kglpin/kglpndl) - for further information there is an excellent presentation on Library Cache internals here. Mutexes are extremely lightweight and use atomic instructions implemented at the CPU level with the LOCK CMPXCHG instruction; however, a session sleeping while waiting to acquire a mutex does so utilising CPU time. In this case, therefore, the high CPU utilisation and session count running the same stored procedures means that this optimization itself shows evidence of contention.
The tried and tested workaround is to modify the application to artificially duplicate the SQL by changing the text. One way to do this is by introducing 'dummy hints' that are not stripped out by the parser but are sufficient for Oracle to recognise the statements as different, thereby reducing contention by having different sessions use different 'versions' of the same statements. Modifying the application for a particular hardware platform is not ideal, however, and in some cases it is simply not an option. Instead, dbms_shared_pool.markhot can be used. Rather than duplicating the SQL ourselves, dbms_shared_pool.markhot serves this purpose by introducing multiple copies of the same packages, procedures and cursors into the library cache for different sessions to use - all we have to do is mark the objects in question hot. We already know which packages and procedures are used by this workload, so we can mark them hot at a SQL prompt as the SYS user as follows:
exec dbms_shared_pool.markhot(schema=>'SYS',objname=>'DBMS_RANDOM',NAMESPACE=>1);
exec dbms_shared_pool.markhot(schema=>'SYS',objname=>'DBMS_RANDOM',NAMESPACE=>2);
exec dbms_shared_pool.markhot(schema=>'SYS',objname=>'DBMS_OUTPUT',NAMESPACE=>1);
exec dbms_shared_pool.markhot(schema=>'SYS',objname=>'DBMS_OUTPUT',NAMESPACE=>2);
exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'NEWORD',NAMESPACE=>1);
exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'PAYMENT',NAMESPACE=>1);
exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'DELIVERY',NAMESPACE=>1);
exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'OSTAT',NAMESPACE=>1);
exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'SLEV',NAMESPACE=>1);
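That the objects have been marked hot can be checked back in the library cache. As a sketch (the contents of the PROPERTY column are version-dependent, so verify on your release), V$DB_OBJECT_CACHE can be queried for the HOT property:

-- Sketch: list library cache objects carrying the HOT property after
-- calling dbms_shared_pool.markhot (PROPERTY contents vary by version).
SELECT owner, name, namespace, property
FROM   v$db_object_cache
WHERE  property LIKE '%HOT%';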
Re-running the workload shows a significant impact on the library cache: mutex X event and, despite the already high CPU utilisation, throughput has also improved.
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Total Wait Wait % DB
Event Waits Time (sec) Avg(ms) time Wait Class
------------------------------ ------------ ---------- ------- ------ ----------
DB CPU 34.5K 88.6
cursor: pin S 1,071,740 2093.3 2 5.4 Concurrenc
enq: TX - row lock contention 373,472 548.9 1 1.4 Applicatio
latch: In memory undo latch 956,992 415.6 0 1.1 Concurrenc
latch: enqueue hash chains 103,312 355.7 3 .9 Other
db file sequential read 211,498 139.8 1 .4 User I/O
db file scattered read 168,610 123.5 1 .3 User I/O
log file sync 36,418 118.7 3 .3 Commit
buffer busy waits 257,958 95.9 0 .2 Concurrenc
library cache: mutex X 23,288 24.3 1 .1 Concurrenc
The detailed mutex information also shows the reduction in library cache mutex activity.
Wait
Mutex Type Location Sleeps Time (ms)
--------------------- -------------------------------- ------------ ------------
Cursor Pin kksfbc [KKSCHLFSP2] 644,771 973,310
Cursor Pin kksLockDelete [KKSCHLPIN6] 455,400 725,202
Library Cache kglpin1 4 10,879 11,115
Library Cache kglpndl1 95 8,330 10,836
Library Cache kglpnal1 90 6,959 1,857
Library Cache kglhbh1 63 1 0
dbms_shared_pool.markhot can also be used for cursors, and the most active ones can be seen in the view X$KGLOB. There is a SQL statement from Andrey Nikolaev that shows the hottest objects in the library cache here, so we can modify it to find the top 4 cursors for the TPCC user and mark them hot as follows:
SET ECHO OFF
SET SERVEROUTPUT ON
DECLARE
BEGIN
  -- Find the 4 most active TPCC cursors in the library cache and mark
  -- each one hot using its full 16-byte hash value (KGLNAHSV)
  FOR hash_id IN (
    SELECT KGLNAHSV FROM (
      SELECT
        CASE WHEN (kglhdadr = kglhdpar) THEN 'Parent' ELSE 'Child'||kglobt09 END,
        kglhdadr ADDRESS, SUBSTR(kglnaobj,1,20),
        parsing_schema_name, kglnahsh, KGLNAHSV, kglobtyd, kglobt23, kglobt24, kglhdexc, kglhdnsp
      FROM x$kglob, v$sql
      WHERE x$kglob.kglnahsh = v$sql.hash_value
      AND v$sql.parsing_schema_name = 'TPCC'
      AND kglhdexc > 1
      AND kglobt24 > 1
      ORDER BY kglobt24 DESC
    ) WHERE ROWNUM < 5
  ) LOOP
    DBMS_OUTPUT.PUT_LINE('hash=>'||hash_id.KGLNAHSV||',namespace=>0');
    SYS.DBMS_SHARED_POOL.MARKHOT(hash=>hash_id.KGLNAHSV, namespace=>0);
  END LOOP;
END;
/
An example output is as follows. It is important to note that the hash value required is the full 16-byte hash value KGLNAHSV from X$KGLOB: on some versions of Oracle no error is given if an incorrect hash value is used, but it has no effect, so you should ensure that you are using the correct full hash value as shown in the example:
SQL> @markhot4sql
hash=>b45763c79b8f4f0e8439f6a3190f1b8f,namespace=>0
hash=>554b1093c9cd87db8f6f55016b20198a,namespace=>0
hash=>15e9d1f93391ce5edb7be748a2ab9f80,namespace=>0
hash=>6508e1443bf2a2948165d5ad91803582,namespace=>0
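If the marking needs to be reversed, dbms_shared_pool also provides an unmarkhot procedure. As a sketch (the hash overload mirrors markhot; check the package specification on your release), the earlier examples can be undone as follows, reusing a hash value from the output above for illustration:

-- Reverse the hot marking for a procedure and, by full hash value, a cursor
exec dbms_shared_pool.unmarkhot(schema=>'TPCC',objname=>'SLEV',namespace=>1);
exec dbms_shared_pool.unmarkhot(hash=>'b45763c79b8f4f0e8439f6a3190f1b8f',namespace=>0);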
In this case cursor: pin S events have been reduced, however at the expense of increased library cache: mutex X events and lower throughput.
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Total Wait Wait % DB
Event Waits Time (sec) Avg(ms) time Wait Class
------------------------------ ------------ ---------- ------- ------ ----------
DB CPU 33.4K 79.7
library cache: mutex X 849,072 5296.8 6 12.6 Concurrenc
enq: TX - row lock contention 277,268 745.3 3 1.8 Applicatio
cursor: pin S 222,045 455.4 2 1.1 Concurrenc
...
The detail shows, however, that this is a different area of library cache: mutex X contention: referring to the presentation noted previously, kglhd is related to the object handle structure and therefore to the creation of new objects.
Mutex Type Location Sleeps Time (ms)
--------------------- -------------------------------- ------------ ------------
Library Cache kglhdgn1 62 1,032,273 2,763,927
Cursor Pin kksfbc [KKSCHLFSP2] 202,786 338,137
Library Cache kglhdgn2 106 186,681 496,834
Library Cache kglpin1 4 127,693 190,311
Library Cache kglpndl1 95 126,592 200,987
Library Cache kglpnal1 90 83,421 19,982
Cursor Pin kksLockDelete [KKSCHLPIN6] 56,208 100,403
As a consequence, with the system already at full CPU utilisation, further optimization has proved counterproductive. In fact, testing showed that marking only packages and procedures hot, and limiting the number of copies created with the following parameter, resulted in the optimal overall system performance.
SQL> show parameter kgl
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
_kgl_hot_object_copies integer 4
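As a sketch of how this value was set: _kgl_hot_object_copies is an underscore (hidden) parameter, so it should only be changed under guidance from Oracle Support, it must be quoted, and being static it takes effect after an instance restart:

-- Sketch: limit the number of hot object copies created by markhot.
-- Hidden parameter - set only under guidance from Oracle Support;
-- static, so a restart is required for it to take effect.
alter system set "_kgl_hot_object_copies"=4 scope=spfile;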
In summary, what we have seen is that if you are experiencing contention on library cache: mutex X or cursor: pin S events then dbms_shared_pool.markhot can be a highly effective tool. Whether looking to achieve a few extra percent on an extreme high performance workload or to achieve high scalability on systems with very high core counts, Oracle on Intel makes this possible, all without needing to modify the application.