GC Block Lost Wait Event
from: http://www.dba-oracle.com/t_rac_tuning_gc_block_lost_wait_event.htm
No network is perfect. Data transmitted from point A to point B may occasionally get lost. The same is true for global cache transfers along the Cluster Interconnect: global cache block transfers can get lost. If a requested block is not received by the instance within 0.5 seconds, the block is considered lost. Since most block transfers complete in milliseconds, too many lost global cache blocks can hamper application performance, because each lost block must be re-sent and the session must wait for the second transfer to complete.
Lost global cache block transfers surface in two different areas. The wait events gc cr block lost and gc current block lost are raised when a consistent read block transfer or a current block transfer is lost and the session must wait for the block to be resent. The other area is the Oracle statistic named gc blocks lost, which can be seen at the system or session level. Examples of these two metrics are shown below.
-- gc_blocks_lost.sql
select
   inst_id,
   event,
   total_waits,
   time_waited
from
   gv$system_event
where
   event in ('gc current block lost',
             'gc cr block lost')
order by
   event,
   inst_id;
INST_ID EVENT TOTAL_WAITS TIME_WAITED
---------- ------------------------------ ----------- -----------
1 gc cr block lost 50 3029
2 gc cr block lost 75 4516
1 gc current block lost 26 1467
2 gc current block lost 36 2060
select
   sn.inst_id,
   sn.name,
   ss.value
from
   gv$statname sn,
   gv$sysstat  ss
where
   sn.inst_id = ss.inst_id
and
   sn.statistic# = ss.statistic#
and
   sn.name = 'gc blocks lost'
order by
   sn.inst_id;
INST_ID NAME VALUE
---------- -------------------- ----------
1 gc blocks lost 90
2 gc blocks lost 164
The output above shows the metrics on a per-instance basis. One can certainly summarize the values across all instances if desired.
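For example, summing the statistic across all instances is a small change to the query above (a sketch; the gv$ views and statistic name are the same as before):

```sql
select
   sn.name,
   sum(ss.value) as total_blocks_lost
from
   gv$statname sn,
   gv$sysstat  ss
where
   sn.inst_id = ss.inst_id
and
   sn.statistic# = ss.statistic#
and
   sn.name = 'gc blocks lost'
group by
   sn.name;
```

With the sample values above (90 and 164), this query would report a cluster-wide total of 254.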
The presence of lost blocks in wait events or a system statistic is not, by itself, cause for great concern. As with any network, an occasional hiccup can lead to lost block transfers that then appear in the gv$sysstat view. And as with any wait event, the metric by itself is essentially meaningless because the output above provides no context. Is the wait event a "Top 5" wait event? Were the wait events generated over a 1-hour period or over a month? Since we do not know the answers to these questions, we cannot determine whether the metrics indicate a problem. More information is needed. An AWR report from a 1-hour snapshot window can better indicate whether a real problem exists.
Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Avg
wait % DB
Event Waits Time(s) (ms) time Wait Class
-------------------------- ------------ ----------- ------ ------ ----------
DB CPU 6,975 32.1
db file sequential read 3,831,277 5,809 2 26.8 User I/O
gc current block lost 3,819 942 247 4.3 Cluster
db file parallel read 145,588 854 6 3.9 User I/O
gc cr multi block request 535,685 498 1 2.3 Cluster
Above, the gc current block lost wait event appears in the Top 5 list, which now provides context for the wait event in question. This event contributes the second-longest total wait time for the instance during the one-hour period. However, even if the wait event were eliminated entirely, only 4.3% of the total processing time would be recovered. From a performance tuning perspective, where the end goal is often to reduce processing time, it would be better to focus on the db file sequential read wait event, which contributes 26.8% of the total database time, or to determine whether CPU utilization can be reduced, since CPU accounts for 32.1% of the total time. That said, it is never a good sign when lost global cache blocks appear as a top wait event.
The most common reason for lost global cache blocks is a faulty private network, i.e. one that is dropping packets. If global cache lost blocks are seen as a problem, then work with the network administrator to ensure the switch is valid, cables are secure and seated properly, firmware levels are up to date, and that other network configuration issues are not a problem. The network administrator should be able to use network tools like netstat and anything else in their arsenal to check for dropped packets on the private network.
[root@host01 ~]# netstat -su
IcmpMsg:
InType0: 91
InType3: 723
InType8: 23
OutType0: 23
OutType3: 928
OutType8: 103
Udp:
664034038 packets received
983 packets to unknown port received.
20080 packet receive errors
654621700 packets sent
UdpLite:
IpExt:
InMcastPkts: 18041
OutMcastPkts: 8745
InBcastPkts: 102377
OutBcastPkts: 119
InOctets: 4678332299675
OutOctets: 2652878623355
InMcastOctets: 1401313
OutMcastOctets: 636504
InBcastOctets: 19312376
OutBcastOctets: 49090
The netstat utility is reporting UDP packet receive errors, which can account for lost global cache block transfers on this node of the cluster. In addition to verifying the hardware, the network administrator should investigate the following:
Private network is truly private
Oversaturated bandwidth due to too much traffic on the network
Quality of Service (QoS) settings that may be downgrading performance
Incorrect Jumbo Frames configuration
Multiple hops between the nodes and the private network switch
Mismatched MTU settings between devices
Mismatch in duplex mode settings between devices
Incorrect bonding/teaming configuration
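Several of the items above can be spot-checked directly from each node. The sketch below assumes a Linux host and that the private interconnect NIC is named eth1 (adjust IFACE for your cluster); it only reads kernel counters, so it is safe to run on a live system.

```shell
# Assumed interconnect NIC name -- replace eth1 with your private interface.
IFACE=${IFACE:-eth1}

# MTU: must be identical on every node and on the switch ports; a
# half-configured Jumbo Frames setup shows up here as mismatched values.
cat /sys/class/net/"$IFACE"/mtu 2>/dev/null

# Kernel receive drop/error counters for the interface.
cat /sys/class/net/"$IFACE"/statistics/rx_dropped 2>/dev/null
cat /sys/class/net/"$IFACE"/statistics/rx_errors  2>/dev/null

# Speed and duplex: a duplex mismatch with the switch port causes drops.
# ethtool may not be installed everywhere, hence the guard.
command -v ethtool >/dev/null &&
  ethtool "$IFACE" 2>/dev/null | grep -E 'Speed|Duplex'

true  # keep a clean exit even if the interface name does not exist here
```

Run the same checks on every node; the MTU and duplex values must agree across all nodes and with the switch configuration.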
If everything on the network side checks out, then consider increasing the UDP socket buffer sizes, as discussed in the previous section of this chapter. Global cache lost blocks are not always a network issue. After the network has been verified and the UDP socket sizes are correct, check whether CPU resources are in short supply.
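On Linux, for example, the UDP socket buffer limits are kernel parameters set through sysctl. The fragment below shows the general shape; the specific values are only illustrative of commonly published RAC recommendations and should be verified against the installation guide for your Oracle version and platform.

```
# /etc/sysctl.conf fragment -- illustrative values, verify for your release
net.core.rmem_default = 262144
net.core.rmem_max     = 4194304
net.core.wmem_default = 262144
net.core.wmem_max     = 1048576
```

Apply with sysctl -p and confirm with sysctl net.core.rmem_max; note that running instances generally need a restart before their sockets are allocated with the larger sizes.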