Garbage-First Garbage Collection

David Detlefs, Christine Flood, Steve Heller, Tony Printezis

ABSTRACT

Garbage-First is a server-style garbage collector, targeted for multi-processors with large memories, that meets a soft real-time goal with high probability, while achieving high throughput. Whole-heap operations, such as global marking, are performed concurrently with mutation, to prevent interruptions proportional to heap or live-data size. Concurrent marking both provides collection "completeness" and identifies regions ripe for reclamation via compacting evacuation. This evacuation is performed in parallel on multi-processors, to increase throughput.

Categories and Subject Descriptors:

D.3.4 [Programming Languages]: Processors - Memory management (garbage collection)
General Terms: Languages, Management, Measurement, Performance
Keywords: concurrent garbage collection, garbage collection, garbage-first garbage collection, parallel garbage collection, soft real-time garbage collection

1. INTRODUCTION

The Java programming language is widely used in large server applications. These applications are characterized by large amounts of live heap data and considerable thread-level parallelism, and are often run on high-end multiprocessors. Throughput is clearly important for such applications, but they often also have moderately stringent (though soft) real-time constraints. For example, in telecommunications call-processing applications (several of which are now implemented in the Java language), delays of more than a fraction of a second in setting up calls are likely to annoy customers.
The Java language specification mandates some form of garbage collection to reclaim unused storage. Traditional "stop-world" collector implementations will affect an application's responsiveness, so some form of concurrent and/or incremental collector is necessary. In such collectors, lower pause times generally come at a cost in throughput. Therefore, we allow users to specify a soft real-time goal, stating their desire that collection consume no more than x ms of any y ms time slice. By making this goal explicit, the collector can try to keep collection pauses as small and infrequent as necessary for the application, but not so low as to decrease throughput or increase footprint unnecessarily. This paper describes the Garbage-First collection algorithm, which attempts to satisfy such a soft real-time goal while maintaining high throughput for programs with large heaps and high allocation rates, running on large multi-processor machines.
The Garbage-First collector achieves these goals via several techniques. The heap is partitioned into a set of equal-sized heap regions, much like the train cars of the Mature-Object Space collector of Hudson and Moss. However, whereas the remembered sets of the Mature-Object Space collector are unidirectional, recording pointers from older regions to younger but not vice versa, Garbage-First remembered sets record pointers from all regions (with some exceptions, described in sections 2.4 and 4.6). Recording all references allows an arbitrary set of heap regions to be chosen for collection. A concurrent thread processes log records created by special mutator write barriers to keep remembered sets up-to-date, allowing shorter collections.
Garbage-First uses a snapshot-at-the-beginning (henceforth SATB) concurrent marking algorithm. This provides periodic analysis of global reachability, providing completeness, the property that all garbage is eventually identified. The concurrent marker also counts the amount of live data in each heap region. This information informs the choice of which regions are collected: regions that have little live data and much garbage yield more efficient collection, hence the name "Garbage-First". The SATB marking algorithm also has very small pause times.
Garbage-First employs a novel mechanism to attempt to achieve the real-time goal. Recent hard real-time collectors have satisfied real-time constraints by making collection interruptible at the granularity of copying individual objects, at some time and space overhead. In contrast, Garbage-First copies objects at the coarser granularity of heap regions. The collector has a reasonably accurate model of the cost of collecting a particular heap region, as a function of quickly-measured properties of the region. Thus the collector can choose a set of regions that can be collected within a given pause time limit (with high probability). Further, collection is delayed if necessary (and possible) to avoid violating the real-time goal. Our belief is that abandoning hard real-time guarantees for this softer best-effort style may yield better throughput and space usage, an appropriate tradeoff for many applications.
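The idea of choosing regions under a pause budget can be sketched in a few lines. This is an illustrative sketch only: the region names, the per-region `garbage`/`cost_ms` figures, and the simple additive cost model are our own stand-ins, not the paper's actual model (section 3 describes the real one).

```python
# Hypothetical sketch: greedy collection-set selection under a pause budget.
# Regions are ranked by "GC efficiency" (reclaimable garbage per unit cost)
# and added while the predicted pause time stays within the budget.

def choose_collection_set(regions, pause_budget_ms):
    """Pick regions in descending garbage/cost order, stopping before the
    predicted pause time would exceed the budget."""
    ranked = sorted(regions, key=lambda r: r["garbage"] / r["cost_ms"], reverse=True)
    chosen, predicted = [], 0.0
    for r in ranked:
        if predicted + r["cost_ms"] > pause_budget_ms:
            break
        chosen.append(r)
        predicted += r["cost_ms"]
    return chosen, predicted

regions = [
    {"name": "R1", "garbage": 900, "cost_ms": 3.0},
    {"name": "R2", "garbage": 100, "cost_ms": 4.0},
    {"name": "R3", "garbage": 700, "cost_ms": 2.0},
]
chosen, predicted_ms = choose_collection_set(regions, pause_budget_ms=6.0)
```

With a 6 ms budget, the sketch takes R3 and R1 (5 ms predicted) and skips R2, whose low efficiency and cost would overrun the budget.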

2. DATA STRUCTURES/MECHANISMS

In this section, we describe the data structures and mechanisms used by the Garbage-First collector.

2.1 Heap Layout/Heap Regions/Allocation

The Garbage-First heap is divided into equal-sized heap regions, each a contiguous range of virtual memory. Allocation in a heap region consists of incrementing a boundary, top, between allocated and unallocated space. One region is the current allocation region from which storage is being allocated. Since we are mainly concerned with multi-processors, mutator threads allocate only thread-local allocation buffers, or TLABs, directly in this heap region, using a compare-and-swap, or CAS, operation. They then allocate objects privately within those buffers, to minimize allocation contention. When the current allocation region is filled, a new allocation region is chosen. Empty regions are organized into a linked list to make region allocation a constant-time operation.
Larger objects may be allocated directly in the current allocation region, outside of TLABs. Objects whose size exceeds 3/4 of the heap region size, however, are termed humongous. Humongous objects are allocated in dedicated (contiguous sequences of) heap regions; these regions contain only the humongous object.
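Bump-pointer allocation on the shared `top` boundary can be sketched as follows. This is a minimal sketch in Python, assuming a 1 MB region size for illustration; a lock stands in for the hardware CAS on `top`, and the class and method names are ours, not the collector's.

```python
import threading

# Illustrative sketch of bump-pointer allocation in the current allocation
# region. The real collector uses a hardware CAS on `top`; here a lock
# models that atomicity.

class HeapRegion:
    def __init__(self, size):
        self.size = size
        self.top = 0              # boundary between allocated and free space
        self._lock = threading.Lock()

    def par_allocate(self, nbytes):
        """CAS-style bump allocation; returns the old top, or None if full."""
        with self._lock:          # models compare-and-swap on `top`
            if self.top + nbytes > self.size:
                return None       # region filled: caller picks a new region
            old = self.top
            self.top += nbytes
            return old

REGION_SIZE = 1 << 20             # 1 MB regions (illustrative choice)
region = HeapRegion(REGION_SIZE)
tlab_start = region.par_allocate(64 * 1024)   # mutator claims a 64 KB TLAB
```

Subsequent object allocations then bump a private pointer within the claimed TLAB, with no synchronization at all.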

2.2 Remembered Set Maintenance

Each region has an associated remembered set, which indicates all locations that might contain pointers to (live) objects within the region. Maintaining these remembered sets requires that mutator threads inform the collector when they make pointer modifications that might create inter-region pointers. This notification uses a card table: every 512-byte card in the heap maps to a one-byte entry in the card table. Each thread has an associated remembered set log, a current buffer or sequence of modified cards. In addition, there is a global set of filled RS buffers.
The remembered sets themselves are sets (represented by hash tables) of cards. Actually, because of parallelism, each region has an associated array of several such hash tables, one per parallel GC thread, to allow these threads to update remembered sets without interference. The logical contents of the remembered set is the union of the sets represented by each of the component hash tables.
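The per-thread-table arrangement can be sketched directly. The class and method names below are ours; the point is only that each GC thread writes its own table without synchronization, while the logical remembered set is the union of all the component tables.

```python
# Sketch of a per-region remembered set as an array of per-GC-thread card
# sets, whose union is the logical remembered set.

class RememberedSet:
    def __init__(self, num_gc_threads):
        self._per_thread = [set() for _ in range(num_gc_threads)]

    def add_card(self, gc_thread_id, card_index):
        # Each GC thread writes only its own table: no interference.
        self._per_thread[gc_thread_id].add(card_index)

    def cards(self):
        # Logical contents: the union of the component sets.
        return set().union(*self._per_thread)

rs = RememberedSet(num_gc_threads=2)
rs.add_card(0, 17)
rs.add_card(1, 17)   # duplicates across threads collapse in the union
rs.add_card(1, 42)
```

The cost of this scheme is that a card may be recorded several times, once per thread; the union hides the duplication from queries.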
The remembered set write barrier is performed after the pointer write. If the code performs the pointer write x.f := y, and the registers rX and rY contain the object pointer values x and y respectively, then the pseudo-code for the barrier is:

1 rTmp := rX XOR rY
2 rTmp := rTmp >> LogOfHeapRegionSize
3 //Below is a conditional move instr
4 rTmp := if (rY == NULL) then 0 else rTmp
5 if(rTmp == 0) goto filtered
6 call rs_enqueue(rX)
7 filtered:

This barrier uses a filtering technique mentioned briefly by Stefanović et al. in [32]. If the write creates a pointer from an object to another object in the same heap region, a case we expect to be common, then it need not be recorded in a remembered set. The exclusive-or and shift of lines 1 and 2 mean that rTmp is zero after the second line if x and y are in the same heap region. Line 4 adds filtering of stores of null pointers. If the store passes these filtering checks, then it creates an out-of-region pointer. The rs_enqueue routine reads the card table entry for the object starting at rX. If that entry is already dirty, nothing is done. This reduces work for multiple stores to the same card, a common case because of initializing writes. If the card table entry is not dirty, then it is dirtied, and a pointer to the card is enqueued on the thread's remembered set log. If this enqueue fills the thread's current log buffer (which holds 256 elements by default), then that buffer is put in the global set of filled buffers, and a new empty buffer is allocated.
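The same filtering logic can be written out as an executable sketch. The region and card sizes below follow the paper's figures (512-byte cards; the 1 MB region size is our illustrative choice), but the data structures standing in for the card table and log buffer are simplifications.

```python
LOG_REGION_SIZE = 20          # 1 MB regions (illustrative)
CARD_SHIFT = 9                # 512-byte cards, as in the paper

card_table = {}               # card index -> "dirty"
log_buffer = []               # the thread's current remembered set log

def rs_write_barrier(x_addr, y_addr):
    """Filtering remembered-set barrier for the store x.f := y. Mirrors the
    XOR/shift pseudo-code: null and same-region stores are filtered out
    before any card-table work."""
    if y_addr == 0:                                   # null-store filter
        return
    if (x_addr ^ y_addr) >> LOG_REGION_SIZE == 0:     # same-region filter
        return
    rs_enqueue(x_addr)

def rs_enqueue(obj_addr):
    card = obj_addr >> CARD_SHIFT
    if card_table.get(card) == "dirty":               # already logged
        return
    card_table[card] = "dirty"
    log_buffer.append(card)

rs_write_barrier(0x100, 0x200)            # same region: filtered
rs_write_barrier(0x100, (1 << 20) + 4)    # cross-region: card 0 logged
rs_write_barrier(0x100, (1 << 20) + 4)    # card already dirty: skipped
```

Note how the dirty-card check makes repeated stores to one card cost only a table lookup after the first, which matters for initializing writes.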
The concurrent remembered set thread waits (on a condition variable) for the size of the filled RS buffer set to reach a configurable initiating threshold (the default is 5 buffers). The remembered set thread processes the filled buffers as a queue, until the length of the queue decreases to 1/4 of the initiating threshold. For each buffer, it processes each card table pointer entry. Some cards are hot: they contain locations that are written to frequently. To avoid processing hot cards repeatedly, we try to identify the hottest cards, and defer their processing until the next evacuation pause (see section 2.3 for a description of evacuation pauses). We accomplish this with a second card table that records the number of times the card has been dirtied since the last evacuation pause (during which this table, like the card table proper, is cleared). When we process a card we increment its count in this table. If the count exceeds a hotness threshold (default 4), then the card is added to a circular buffer called the hot queue (of default size 1K). This queue is processed like a log buffer at the start of each evacuation pause, so it is empty at the end. If the circular buffer is full, then a card is evicted from the other end and processed.
Thus, the concurrent remembered set thread processes a card if it has not yet reached the hotness threshold, or if it is evicted from the hot queue. To process a card, the thread first resets the corresponding card table entry to the clean value, so that any concurrent modifications to objects on the card will re-dirty and re-enqueue the card. It then examines the pointer fields of all the objects whose modification may have dirtied the card, looking for pointers outside the containing heap region. If such a pointer is found, the card is inserted into the remembered set of the referenced region.
We use only a single concurrent remembered set thread, to introduce parallelism when idle processors exist. However, if this thread is not sufficient to service the rate of mutation, the filled RS buffer set will grow too large. We limit the size of this set; mutator threads attempting to add further buffers perform the remembered set processing themselves.
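The hot-card deferral described above can be sketched as follows. The hotness threshold (4) and hot-queue capacity (1K) follow the paper's stated defaults; everything else, including the card number and the surrounding plumbing, is our own illustrative scaffolding.

```python
from collections import deque

HOTNESS_THRESHOLD = 4         # paper's default
HOT_QUEUE_SIZE = 1024         # paper's default (1K entries)

count_table = {}              # card -> times dirtied since last pause
hot_queue = deque()           # circular buffer of deferred hot cards
processed = []                # cards actually scanned concurrently

def concurrent_process_card(card):
    """Sketch of the RS thread's choice: scan a card now, or, once it
    exceeds the hotness threshold, defer it to the next evacuation pause."""
    count_table[card] = count_table.get(card, 0) + 1
    if count_table[card] > HOTNESS_THRESHOLD:
        if len(hot_queue) == HOT_QUEUE_SIZE:
            processed.append(hot_queue.popleft())   # evictee is processed now
        hot_queue.append(card)                      # defer until next pause
    else:
        processed.append(card)                      # scan for cross-region ptrs

for _ in range(6):            # the same card dirtied six times
    concurrent_process_card(7)
```

After six dirtyings of card 7, the first four are processed concurrently and the last two sit in the hot queue, to be drained at the start of the next evacuation pause.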

2.3 Evacuation Pauses

At appropriate points (described in section 3.4), we stop the mutator threads and perform an evacuation pause. Here we choose a collection set of regions, and evacuate the regions by copying all their live objects to other locations in the heap, thus freeing the collection set regions. Evacuation pauses exist to allow compaction: object movement must appear atomic to mutators. This atomicity is costly to achieve in truly concurrent systems, so we move objects during incremental stop-world pauses instead.
If a multithreaded program is running on a multiprocessor machine, using a sequential garbage collector can create a performance bottleneck. We therefore strive to parallelize the operations of an evacuation pause as much as possible.
The first step of an evacuation pause is to sequentially choose the collection set (section 3 details the mechanisms and heuristics of this choice). Next, the main parallel phase of the evacuation pause starts. GC threads compete to claim tasks such as scanning pending log buffers to update remembered sets, scanning remembered sets and other root groups for live objects, and evacuating the live objects. There is no explicit synchronization among tasks other than ensuring that each task is performed by only one thread.
The evacuation algorithm is similar to the one reported by Flood et al. To achieve fast parallel allocation we use GCLABs, i.e. thread-local GC allocation buffers (similar to mutator TLABs). Threads allocate an object copy in their GCLAB and compete to install a forwarding pointer in the old image. The winner is responsible for copying the object and scanning its contents. A technique based on work-stealing provides load balancing.
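The forwarding-pointer race can be sketched compactly. This is an illustrative sketch only: a lock models the CAS on the forwarding-pointer word, the field and function names are ours, and the "GCLAB" is just a Python list.

```python
import threading

# Sketch of claiming an object during parallel evacuation: GC threads race
# to install a forwarding pointer in the old image; only the winner copies
# the object and scans its contents.

class OldObject:
    def __init__(self, payload):
        self.payload = payload
        self.forwardee = None
        self._lock = threading.Lock()

def try_evacuate(obj, gclab):
    """Returns (new copy, whether this thread won the race)."""
    with obj._lock:                       # models CAS(forwardee, None, copy)
        if obj.forwardee is not None:
            return obj.forwardee, False   # lost: reuse the winner's copy
        copy = {"payload": obj.payload}   # allocate in this thread's GCLAB
        gclab.append(copy)
        obj.forwardee = copy
        return copy, True                 # winner also scans the copy

obj = OldObject(payload=42)
lab_a, lab_b = [], []
copy1, won1 = try_evacuate(obj, lab_a)    # first claimant wins
copy2, won2 = try_evacuate(obj, lab_b)    # second claimant finds the forwardee
```

Both claimants end up holding the same copy, so all pointers to the old image are updated consistently regardless of which thread reached it first.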

Figure 1 illustrates the operation of an evacuation pause. Step A shows the remembered set of a collection set region R1 being used to find pointers into the collection set. As will be discussed in section 2.6, pointers from objects identified as garbage via concurrent marking (object a in the figure) are not followed.

2.4 Generational Garbage-First

Generational garbage collection has several advantages, which a collection strategy ignores at its peril. Newly allocated objects are usually more likely to become garbage than older objects, and newly allocated objects are also more likely to be the target of pointer modifications, if only because of initialization. We can take advantage of both of these properties in Garbage-First in a flexible way. We can heuristically designate a region as young when it is chosen as a mutator allocation region. This commits the region to be a member of the next collection set. In return for this loss of heuristic flexibility, we gain an important benefit: remembered set processing is not required to consider modifications in young regions. Reachable young objects will be scanned after they are evacuated as a normal part of the next evacuation pause.
Note that a collection set can contain a mix of young and non-young regions. Other than the special treatment for remembered sets described above, both kinds of regions are treated uniformly.
Garbage-First runs in two modes: generational and pure garbage-first. Generational mode is the default, and is used for all performance results in this paper.
There are two further "submodes" of generational mode: evacuation pauses can be fully or partially young. A fully-young pause adds all (and only) the allocated young regions to the collection set. A partially-young pause chooses all the allocated young regions, and may add further non-young regions, as pause times allow (see section 3.2.1).
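The two submodes differ only in whether non-young regions may be appended after the mandatory young ones. A minimal sketch, in which the region names, costs, and pause budget are all illustrative inventions:

```python
# Sketch of the two generational submodes: a fully-young pause collects
# exactly the allocated young regions; a partially-young pause also adds
# non-young regions (ranked by GC efficiency) while pause time allows.

def build_collection_set(young, non_young_ranked, budget_ms, fully_young):
    cset = list(young)                      # young regions are always included
    predicted = sum(r["cost_ms"] for r in cset)
    if not fully_young:
        for r in non_young_ranked:
            if predicted + r["cost_ms"] > budget_ms:
                break
            cset.append(r)
            predicted += r["cost_ms"]
    return [r["name"] for r in cset]

young = [{"name": "Y1", "cost_ms": 2.0}, {"name": "Y2", "cost_ms": 2.0}]
old = [{"name": "O1", "cost_ms": 3.0}, {"name": "O2", "cost_ms": 5.0}]
full = build_collection_set(young, old, budget_ms=8.0, fully_young=True)
partial = build_collection_set(young, old, budget_ms=8.0, fully_young=False)
```

With an 8 ms budget, the partially-young pause fits one extra non-young region (O1) after the young regions; the fully-young pause takes the young regions alone.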

2.5 Concurrent Marking

Concurrent marking is an important component of the system. It provides collector completeness without imposing any order on region choice for collection sets (as, for example, the Train algorithm of Hudson and Moss does). Further, it provides the live data information that allows regions to be collected in "garbage-first" order. This section describes our concurrent marking algorithm.
We use a form of snapshot-at-the-beginning concurrent marking. In this style, marking is guaranteed to identify garbage objects that exist at the start of marking, by marking a logical "snapshot" of the object graph existing at that point. Objects allocated during marking are necessarily considered live. But while such objects must be considered marked, they need not be traced: they are not part of the object graph that exists at the start of marking. This greatly decreases concurrent marking costs, especially in a system like Garbage-First that has no physically separate young generation treated specially by marking.

2.5.1 Marking Data Structures

We maintain two marking bitmaps, labeled previous and next. The previous marking bitmap is the last bitmap in which marking has been completed. The next marking bitmap may be under construction. The two physical bitmaps swap logical roles as marking is completed. Each bitmap contains one bit for each address that can be the start of an object. With the default 8-byte object alignment, this means 1 bitmap bit for every 64 heap bits. We use a mark stack to hold (some of) the gray (marked but not yet recursively scanned) objects.
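The bitmap addressing arithmetic follows directly from the alignment: one bit per possible 8-byte-aligned object start, i.e. one bitmap bit per 64 heap bits. A minimal sketch (the class and method names are ours):

```python
OBJ_ALIGN = 8                 # default 8-byte object alignment

# Sketch of marking-bitmap addressing: one bit for each address that can
# be the start of an object, so one bitmap bit per 8 heap bytes.

class MarkBitmap:
    def __init__(self, heap_bytes):
        # heap_bytes / 8 possible object starts, packed 8 bits per byte
        self.bits = bytearray((heap_bytes // OBJ_ALIGN + 7) // 8)

    def _index(self, addr):
        return addr // OBJ_ALIGN          # bit index of this object start

    def mark(self, addr):
        i = self._index(addr)
        self.bits[i >> 3] |= 1 << (i & 7)

    def is_marked(self, addr):
        i = self._index(addr)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

bm = MarkBitmap(heap_bytes=1024)   # 1 KB heap -> 128 bits -> 16 bitmap bytes
bm.mark(64)
```

The 1/64 space ratio is what makes it affordable to keep two such bitmaps (previous and next) covering the whole heap.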

2.5.2 Initial Marking Pause/Concurrent Marking

The first phase of a marking cycle clears the next marking bitmap. This is performed concurrently. Next, the initial marking pause stops all mutator threads, and marks all objects directly reachable from the roots (in generational mode, initial marking is in fact piggy-backed on a fully-young evacuation pause). Each heap region contains two top at mark start (TAMS) variables, one for the previous marking and one for the next. We will refer to these as the previous and next TAMS variables. These variables are used to identify objects allocated during a marking phase. Objects above a TAMS value are considered implicitly marked with respect to the marking to which the TAMS variable corresponds, but allocation is not slowed down by marking bitmap updates. The initial marking pause iterates over all the regions in the heap, copying the current value of top in each region to the next TAMS of that region. Steps A and D of figure 2 illustrate this. Steps B and E of this figure show that objects allocated during concurrent marking are above the next TAMS value, and are thus considered live. (The bitmaps physically cover the entire heap, but are shown only for the portion of regions for which they are relevant.)
Now mutator threads are restarted, and the concurrent phase of marking begins. This phase is very similar to the concurrent marking phase of [29]: a "finger" pointer iterates over the marked bits. Objects higher than the finger are implicitly gray; gray objects below the finger are represented with a mark stack.
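The TAMS rule above amounts to a two-branch liveness test: objects at or above the region's next TAMS were allocated during marking and are implicitly marked; only objects below TAMS consult the bitmap. A minimal sketch, assuming illustrative addresses and representing the region and the bitmap with plain Python structures:

```python
# Sketch of the TAMS rule for the in-progress (next) marking.

def is_marked_wrt_next(region, obj_addr, next_marked_addrs):
    """Is obj_addr considered marked with respect to the next marking?"""
    if obj_addr >= region["next_tams"]:
        return True                           # allocated during marking: implicit
    return obj_addr in next_marked_addrs      # otherwise, consult the bitmap

def initial_mark_region(region):
    # Initial marking pause: copy the region's current top into next TAMS.
    region["next_tams"] = region["top"]

region = {"top": 4096, "next_tams": 0}
initial_mark_region(region)                   # next TAMS snapshots top at 4096
region["top"] = 8192                          # allocation continues during marking
marked = {1024}                               # one object below TAMS marked so far
```

This is why allocation is not slowed down by marking: new objects land above TAMS and never touch the bitmap.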

2.5.3 Concurrent Marking Write Barrier

The mutator may be updating the pointer graph as the collector is tracing it. This mutation may remove a pointer in the "snapshot" object graph, violating the guarantee on which SATB marking is based. Therefore, SATB marking requires mutator threads to record the values of pointer fields before they are overwritten. Below we show pseudo-code for the marking write barrier for a write of the value in rY to offset FieldOffset in an object whose address is in rX. Its operation is explained below.

1| rTmp := load(rThread+MarkingInProgressOffset)
2| if(!rTmp) goto filtered
3| rTmp := load(rX + FieldOffset)
4| if(rTmp == null) goto filtered
5| call satb_enqueue(rTmp)
6| filtered:

The actual pointer store [rX, FieldOffset] := rY would follow. The first two lines of the barrier skip the remainder if marking is not in progress; for many programs, this filters out a large majority of the dynamically executed barriers. Lines 3 and 4 load the value in the object field, and check whether it is null. It is only necessary to log non-null values. In many programs the majority of pointer writes are initializing writes to previously-null fields, so this further filtering is quite effective.
The satb_enqueue operation adds the pointer value to the thread's current marking buffer. As with remembered set buffers, if the enqueue fills the buffer, it then adds it to the global set of completed marking buffers. The concurrent marking thread checks the size of this set at regular intervals, interrupting its heap traversal to process filled buffers.
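The buffer hand-off can be sketched as follows. The 256-element capacity reuses the log-buffer default from section 2.2 for illustration; the class name and list-based buffers are ours.

```python
BUFFER_CAPACITY = 256         # illustrative: the log-buffer default of sec. 2.2

completed_buffers = []        # global set of filled marking buffers

class MutatorThread:
    def __init__(self):
        self.current_buffer = []

    def satb_enqueue(self, old_value):
        """Sketch of SATB logging: record the pre-write pointer value, and
        hand the buffer off to the global set when it fills."""
        self.current_buffer.append(old_value)
        if len(self.current_buffer) == BUFFER_CAPACITY:
            completed_buffers.append(self.current_buffer)
            self.current_buffer = []      # start a fresh, empty buffer

t = MutatorThread()
for addr in range(300):       # 300 logged pointers: one full buffer, 44 left over
    t.satb_enqueue(addr)
```

The 44 leftover entries illustrate why the final marking pause must also drain partially filled per-thread buffers: the mutator "owns" a buffer until it fills.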

2.5.4 Final Marking Pause

A marking phase is complete when concurrent marking has traversed all the marked objects and completely drained the mark stack, and when all logged updates have been processed. The former condition is easy to detect; the latter is harder, since mutator threads "own" log buffers until they fill them. The purpose of the stop-world final marking pause is to reach this termination condition reliably, while all mutator threads are stopped. It is very simple: any unprocessed completed log buffers are processed as above, and partially completed per-thread buffers are processed in the same way. This process is done in parallel, to guard against programs with many mutator threads with partially filled marking log buffers causing long pause times or parallel scaling issues.

2.5.5 Live Data Counting and Cleanup

Concurrent marking also counts the amount of marked data in each heap region. Originally, this was done as part of the marking process. However, evacuation pauses that move objects that are live must also update the per-region live data count. When evacuation pauses are performed in parallel, and several threads are evacuating objects to the same region, updating this count consistently can be a source of parallel contention. While a variety of techniques could have ameliorated this scaling problem, updating the count represented a significant portion of evacuation pause cost even with a single thread. Therefore, we opted to perform all live data counting concurrently. When final marking is complete, the GC thread re-examines each region, counting the bytes of marked data below the TAMS value associated with the marking. This is something like a sweeping phase, but note that we find live objects by examining the marking bitmap, rather than by traversing dead objects.
As will be discussed in section 2.6, evacuation pauses occurring during marking may increase the next TAMS value of some heap regions. So a final stop-world cleanup pause is necessary to reliably finish this counting process. This cleanup phase also completes marking in several other ways. It is here that the next and previous bitmaps swap roles: the newly completed bitmap becomes the previous bitmap, and the old one is available for use in the next marking. In addition, since the marking is complete, the value in the next TAMS field of each region is copied into the previous TAMS field, as shown in steps C and F of figure 2. Liveness queries rely on the previous marking bitmap and the previous TAMS, so the newly-completed marking information will now be used to determine object liveness. In figure 2, light gray indicates objects known to be dead. Steps D and E show how the results of a completed marking may be used while a new marking is in progress.
Finally, the cleanup phase sorts the heap regions by expected GC efficiency. This metric divides the marking's estimate of garbage reclaimable by collecting a region by the cost of collecting it. This cost is estimated based on a number of factors, including the estimated cost of evacuating the live data and the cost of traversing the region's remembered set (section 3.2.1 discusses our techniques for estimating heap region GC cost). The result of this sorting is an initial ranking of regions by desirability for inclusion in collection sets. As discussed in section 3.3, the cost estimate can change over time, so this estimate is only initial.
Regions containing no live data whatsoever are immediately reclaimed in this phase.For some programs,this method can reclaim a significant fraction of total garbage.
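The cleanup phase's two actions, immediate reclamation of wholly-dead regions and efficiency ranking of the rest, can be sketched together. The cost coefficients below are made-up stand-ins for the model of section 3.2.1, and the region data is invented for illustration.

```python
# Sketch of the cleanup phase: reclaim empty regions immediately, then
# rank the rest by expected GC efficiency (garbage reclaimed per unit of
# estimated collection cost). Coefficients are illustrative, not the
# paper's actual model.

RS_COST_PER_ENTRY = 0.001     # ms per remembered-set entry (assumed)
COPY_COST_PER_KB = 0.05       # ms per KB of live data evacuated (assumed)

def rank_and_reclaim(regions):
    empty = [r for r in regions if r["live_bytes"] == 0]
    for r in empty:
        regions.remove(r)                 # wholly-dead regions reclaimed now

    def efficiency(r):
        cost = (RS_COST_PER_ENTRY * r["rs_entries"]
                + COPY_COST_PER_KB * r["live_bytes"] / 1024)
        return r["garbage_bytes"] / cost

    regions.sort(key=efficiency, reverse=True)
    return [r["name"] for r in empty], [r["name"] for r in regions]

regions = [
    {"name": "A", "live_bytes": 0,     "garbage_bytes": 1 << 20, "rs_entries": 10},
    {"name": "B", "live_bytes": 4096,  "garbage_bytes": 1 << 19, "rs_entries": 100},
    {"name": "C", "live_bytes": 65536, "garbage_bytes": 1 << 19, "rs_entries": 100},
]
reclaimed, ranked = rank_and_reclaim(regions)
```

Region A costs nothing to reclaim; B and C hold equal garbage, but B's smaller live set and remembered set make it cheaper to collect, so it ranks first.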

2.6 Evacuation Pauses and Marking

In this section we discuss the two major interactions between evacuation pauses and concurrent marking.
First, an evacuation pause never evacuates an object that was proven dead in the last completed marking pass. Since the object is dead, it obviously is not referenced from the roots, but it might be referenced from other dead objects. References within the collection set are followed only if the referring object is found to be live. References from outside the collection set are identified by the remembered sets; objects identified by the remembered sets are ignored if they have been shown to be dead.
Second, when we evacuate an object during an evacuation pause, we need to ensure that it is marked correctly, if necessary, with respect to both the previous and next markings. It turns out that this is quite subtle and tricky. Unfortunately, due to space restrictions, we cannot give here all the details of this interaction.
We allow evacuation pauses to occur when the marking thread's stack is non-empty: if we did not, then marking could delay a desired evacuation pause by an arbitrary amount. The marking stack entries may refer to objects in the collection set. Since these objects are marked in the current marking, they are clearly live with respect to the previous marking, and may be evacuated by the evacuation pause. To ensure that marking stack entries are updated properly, we treat the marking stack as a source of roots.

2.7 Popular Object Handling

A popular object is one that is referenced from many locations. This section describes special handling for popular objects that achieves two goals: smaller remembered sets and a more efficient remembered set barrier.
We reserve a small prefix of the heap regions to contain popular objects. We attempt to identify popular objects quickly, and isolate them in this prefix, whose regions are never chosen for collection sets.
When we update region remembered sets concurrently, regions whose remembered set sizes have reached a given threshold are scheduled for processing in a popularity pause; such growth is often caused by popular objects. The popularity pause first constructs an approximate reference count for each object, then evacuates objects whose counts reach an individual object popularity threshold to the regions in the popular prefix; non-popular objects are evacuated to the normal portion of the heap. If no individual popular objects are found, no evacuation is performed, but the per-region threshold is doubled, to prevent a loop of such pauses.
There are two benefits to this treatment of popular objects. Since we do not relocate popular objects once they have been segregated, we do not have to maintain remembered sets for popular object regions. We show in section 4.6 that popular object handling eliminates a majority of remembered set entries for one of our benchmarks. We also save remembered set processing overhead by filtering out pointers to popular objects early on. We modify the step of the remembered set write barrier described in section 2.2 that filtered out null pointers to instead do:

if(rY<PopObjBoundary) goto filtered

This test filters out both pointers to popular objects and null pointers (using zero to represent null). Section 4.6 also measures the effectiveness of this filtering.
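The combined check can be sketched as a variant of the earlier barrier. The boundary address and region size are our illustrative assumptions; the key point is that because the popular prefix sits at the bottom of the address space and null is zero, one unsigned compare subsumes the null check.

```python
POP_OBJ_BOUNDARY = 1 << 20    # end of the popular-object prefix (assumed)
LOG_REGION_SIZE = 20          # 1 MB regions (illustrative)

def rs_barrier_with_popularity(x_addr, y_addr, enqueue):
    """Sketch of the modified remembered-set barrier: one compare filters
    both null stores (zero) and stores of popular-object pointers."""
    if y_addr < POP_OBJ_BOUNDARY:                  # null and popular targets
        return
    if (x_addr ^ y_addr) >> LOG_REGION_SIZE == 0:  # same-region filter
        return
    enqueue(x_addr)

logged = []
rs_barrier_with_popularity(5 << 20, 0, logged.append)        # null: filtered
rs_barrier_with_popularity(5 << 20, 1 << 18, logged.append)  # popular: filtered
rs_barrier_with_popularity(5 << 20, 9 << 20, logged.append)  # cross-region: logged
```

Since popular regions are never chosen for collection sets, pointers into them never need remembered set entries, so discarding them this early is safe.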
While popular object handling can be very beneficial, it is optional, and disabled in the performance measurements described in section 4, except for the portion of section 4.6 that explicitly investigates popularity. As discussed in that section, popular objects effectively decrease remembered set sizes for some applications, but not for all; this mechanism may be superseded in the future.
