Controlling GC pauses with the GarbageFirst Collector


Original article: http://blog.mgm-tp.com/2014/04/controlling-gc-pauses-with-g1-collector/

In the previous post I have shown that the GarbageFirst (G1) collector in Java 7 (and also 8ea) does a reasonable job but cannot reach the GC throughput of the “classic” collectors as soon as old generation collections come about. This article focuses on G1’s ability to control the duration of GC pauses. To this end, I refined my benchmark from the previous tests and also ran it with a huge heap size of 50 GB for which G1 was designed. I learnt that G1’s control of GC pauses is not only costly but, unfortunately, also weaker than expected.



The MixedRandomList benchmark presented in the previous post differed in at least two respects from a real-world high-traffic application: its execution resulted in a CPU consumption of 100% and in a much higher GC rate than observed even in high-traffic applications. Therefore I refined it with the following changes:

  • Add some latency by calling Thread.sleep() to bring CPU usage to a high but healthy level of about 50%
  • Introduce some CPU consumption outside object creation and garbage collection to bring the memory allocation and GC rate to a level of about 500 MB/s which is high but can be found in real high-traffic installations of reasonable applications



I named this slightly changed benchmark MixedRandomListRealWorld (source) and ran it in all the tests presented below, usually with parameters 100 (size of created objects) and 100000000 (number of live objects) to fill about half of the available heap with live objects.

While I had previously run my benchmarks on a notebook with the recommended minimum of 6 GB heap space, I now moved to a server with much more RAM and used a Java heap size comparable to that found in large production installations.

Unless explicitly mentioned otherwise, all the tests presented in this post have been executed on a server with an Intel Xeon E5-2620 chip (2.00 GHz, 6 cores, 15 MB L3 cache) and 64 GB of PC3-12800R RAM running Java 7u45 on the CentOS Linux operating system. Note the 6 cores: as discussed further below, hyper-threading doubles them to 12 logical processors. The total Java heap size was 50 GB: -Xms50g -Xmx50g.

The pause time problem with large heaps

Stop-the-world collectors may be robust and efficient, but with a heap of that size their GC pauses are in most cases prohibitive for any interactive application. As an example, the next figure shows how the default collector (which can also be selected explicitly using -XX:+UseParallelGC) operates with the parameters

-Xms50g -Xmx50g -XX:NewSize=2000m -XX:MaxNewSize=2000m





      Figure 1. Heap usage, GC pauses and CPU usage for the ParallelGC collector running the benchmark with a total heap of 50 GB and a new generation of about 2 GB. There are GC pauses of almost 30s!

It is obvious that from a throughput and efficiency standpoint this collector does a good job because it only uses about 5% of elapsed time to clean up roughly 450 MB of garbage per second which is quite a lot even in the league of high throughput web applications. New generation pause times average 170 milliseconds which is good and old generation pauses are infrequent enough.

There is only one nuisance here: the old generation pause duration of about 30 seconds. Note that the benchmark does not have a very complicated reference graph and also does not include nasty things like reference processing and similar complications. A more complicated and more “real-world” application could easily push the old generation pause duration beyond the one-minute barrier. That’s why such stop-the-world collectors are not an option for any interactive application with a heap of that size. Using a low-pause collector becomes mandatory in such cases.


The CMS collector’s performance on a large heap

The CMS collector started with the settings

-XX:NewSize=2000m -XX:MaxNewSize=2000m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=80

shows the following results when it runs the same load as above:




        Figure 2. The CMS collector running the same benchmark as in figure 1 delivers amazingly short pauses which never exceed 140 ms.

The CMS collector delivers excellent results, indeed. Throughput is only slightly lower than with the parallel collector, new generation pauses are shorter (on average 110 compared to 170 ms) and there aren’t any pauses longer than 140 ms, neither for new generation nor for old generation collections.

Keep in mind, however, that the benchmark is favorable for the CMS collector because, by construction, it works with objects of a single size. This avoids fragmentation, any overhead for free-list searching during new generation pauses, and similar complications. In real applications, CMS pauses, in particular ‘remark’ pauses, can take a second even with much smaller heaps.

Garbage First (G1)

Even if we know that the benchmark is favorable for the CMS collector, it remains interesting and instructive to learn how the G1 collector copes with a load that is as simple and well-defined as this one on a heap of that size.

As a first test, I applied G1 to the same load with only the following parameters:

-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=80

By default, this also implies a pause time target of 200 milliseconds:

-XX:MaxGCPauseMillis=200

The following figure shows the results that G1 delivered:




     Figure 3. The G1 collector running the same load as in figures 1 and 2 shows many new generation pauses of about 200 ms (=target) duration and fewer 'mixed' pauses which largely exceed the target.

This graphic and the statistics on the right show that the vast majority of new generation pauses meet the max pause target rather well and with little variation. The old generation pauses (which are in fact “mixed” pauses in the case of G1 because they are used to clean both new and old generation regions) can be spotted as vertical grey lines in regular intervals which miss that target and reach about 800 milliseconds after a rather extended warm-up phase. It can also be seen that the CPU usage with G1 is higher than with the other collectors: close to 60 instead of 50 percent.

We will see later that the setting -XX:InitiatingHeapOccupancyPercent=80 is not a good choice and rather increases the duration of mixed GC pauses. However, the tendency of mixed pauses to exceed the max pause target is a general problem, as will be shown in the next section.

Note: -XX:InitiatingHeapOccupancyPercent sets the Java heap occupancy threshold that triggers a marking cycle. The default is 45% of the entire Java heap (quoted from the official documentation).

Let’s try to tune G1

The results for G1 in figure 3 and for the CMS in figure 2 suggest that there should be a knob to tune G1’s performance. G1 tuning options and best practices from Oracle can be found in “Getting Started with the G1 Garbage Collector” and “Garbage First Garbage Collector Tuning”. I followed the best-practice advice not to touch generation sizing parameters such as NewRatio and SurvivorRatio and to let G1 adjust them to best fulfill the max pause target. Instead, I measured the impact of other parameters on the achieved pause durations.

Pause duration target: -XX:MaxGCPauseMillis

MaxGCPauseMillis is the prime tuning parameter for the G1 collector and is at the heart of its promise to control GC pause durations.

Have a look at the following figure to see how well G1 is able to deliver the configured target when run with a minimal configuration of

-Xms50g -Xmx50g -XX:+UseG1GC -XX:MaxGCPauseMillis=x




      Figure 4. Achieved pause duration vs. target for new gen (red), mixed (blue) and other (green) pauses.

The figure shows very nicely that pause durations follow the configured target to some limited extent:

  • Young generation pauses (red lines) obey the target down to a limiting value (which is about 170 ms in this case) in such a way that the 0.9 quantile is more or less on target or lower.
  • With a target value below 170 ms most young generation pauses still take 170 ms with very little variation (flat red curves are close together) except for very low target values below 40 ms where few pauses get longer again (“max young pause” line).
  • “mixed” pauses (blue lines) follow the pause time target much less because considerable numbers of pauses (see “0.9 quantile mixed pause” line) do not fall below about 500 ms and the average only gets down to about 350 ms.
  • There are other pauses (green lines) like “remark” pauses whose durations look completely independent of the configured target but are short enough anyway. Actually they are of similar nature as the CMS pauses and also in the same range as shown by the “CMS max pause” line which I took from the test shown in figure 2.
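
The statistics plotted in figure 4 (average, 0.9 quantile and maximum per pause type) can be recomputed from any list of pause durations. The following Python sketch is only an illustration: the function name, the nearest-rank definition of the quantile and the sample numbers are assumptions of mine and are not taken from the original measurements.

```python
def pause_stats(durations_ms, target_ms):
    """Summarize GC pause durations (in ms) against a pause time target."""
    if not durations_ms:
        raise ValueError("need at least one pause duration")
    ordered = sorted(durations_ms)
    n = len(ordered)
    # Nearest-rank 0.9 quantile: the smallest value such that at least
    # 90% of all pauses are at or below it (integer ceil of 0.9 * n).
    rank = (9 * n + 9) // 10
    over_target = sum(1 for d in ordered if d > target_ms)
    return {
        "mean": sum(ordered) / n,
        "q90": ordered[rank - 1],
        "max": ordered[-1],
        "share_over_target": over_target / n,
    }


# Made-up sample: young pauses near 170 ms plus two long mixed pauses.
sample = [165, 168, 170, 171, 172, 174, 175, 180, 520, 800]
stats = pause_stats(sample, target_ms=200)
print(stats)
```

A summary like this makes it easy to see when the average looks acceptable while the 0.9 quantile and the maximum miss the target, which is the pattern described for mixed pauses above.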


Note that the range where most GC pauses follow the pause time target depends on application, hardware and heap size. In our case, this range is 170-600 ms which by accident includes the default target value of 200 ms. With the same benchmark program but on different hardware (Oracle T3/SPARC, 16 cores x 8 hardware threads per core=128 processors at 1650 MHz) and with a smaller heap size (-Xms10g -Xmx10g) I observed the same big picture as described above. However, the range of effective control was much lower: 40-150 ms.


Concurrent operation threshold: -XX:InitiatingHeapOccupancyPercent

Very much like the CMS collector, G1 starts concurrent processing depending on how much of the heap is filled. The parameter which controls the threshold is -XX:InitiatingHeapOccupancyPercent and the default value is 45, which means that concurrent processing starts when 45% of the available heap is filled. In contrast to that, the CMS collector uses a much higher default of 80 for the corresponding threshold parameter -XX:CMSInitiatingOccupancyFraction, which is often a good choice and sometimes needs to be lowered only a little bit to reduce the risk of concurrent mode failures. Altogether, for people with CMS experience G1’s default of 45 looks very low, and it is tempting to push the threshold towards 80 in the hope that this could reduce G1’s CPU consumption and increase its throughput.
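
To make the threshold concrete, the sketch below converts an occupancy percentage into an absolute heap occupancy for the 50 GB heap used here. It is plain arithmetic under the simplifying assumption that the percentage applies to the whole heap; the function name is hypothetical.

```python
def marking_threshold_gb(heap_gb, ihop_percent):
    """Heap occupancy in GB at which concurrent marking would be triggered."""
    if not 0 <= ihop_percent <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return heap_gb * ihop_percent / 100.0


HEAP_GB = 50
for ihop in (45, 60, 80, 90):
    start = marking_threshold_gb(HEAP_GB, ihop)
    # Headroom: heap still free when concurrent work begins.
    print(f"IHOP={ihop}: marking starts at {start:.1f} GB, "
          f"headroom {HEAP_GB - start:.1f} GB")
```

The shrinking headroom at high percentages is one plausible way to read why a late start raises the risk of a concurrent mode failure.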

We have seen in figure 4 that mixed pauses are the most critical ones and often exceed the max pause target. The following figure shows how mixed pauses behave as a function of the max pause target for different values of -XX:InitiatingHeapOccupancyPercent (IHOP):

      Figure 5. Influence of the InitiatingHeapOccupancyPercent (IHOP) parameter on mixed pauses (orange curves in this plot show the same values as 3 of the blue ones in figure 4).

It is clearly visible that values of -XX:InitiatingHeapOccupancyPercent=45 (orange curves) and 60 (light blue curves) deliver almost the same results over the full range from -XX:MaxGCPauseMillis=10 to 5000. Values 80 and 90 each add about 100 ms to the average pause duration and the 0.9 quantiles. It can also be seen at -XX:MaxGCPauseMillis=5000 that a value of IHOP=90 leads to a concurrent mode failure and thus a Full GC stop, which pushes the average and the 0.9 quantile very high. Because, for both CMS and G1, we always want to avoid the risk of concurrent mode failures and prefer shorter pauses, we conclude that it makes sense to keep the value of -XX:InitiatingHeapOccupancyPercent below 80 and that G1’s default is not bad.

Is the cost of G1’s pause control so high?

How costly is it for G1 to control GC pause duration? How does the GC throughput depend on the pause time target? Is there a penalty if the pause time target is set to a lower value than G1 can deliver, i.e. below the 170 ms limit observed in figure 4? I hoped to answer these questions with the following figure:

     Figure 6. GC throughput as a function of the MaxGCPauseMillis parameter for different values of InitiatingHeapOccupancyPercent (IHOP). The effect shown is probably to a large extent a measuring artefact!

This looks like a highly interesting result: Increasing the pause time target from 170 to 600 ms increases throughput by up to 35% while it is flat at lower and higher target values.


Therefore, let’s first have a look at how this is measured and what the accuracy is. Look at some lines from the GC log from a measurement which corresponds to the blue line at a target value of 600 ms (-XX:MaxGCPauseMillis=600):


276.141: [GC pause (young) 44G->31G(50G), 0.4517380 secs]
288.056: [GC pause (young) 43G->32G(50G), 0.5088510 secs]
301.181: [GC pause (young) (initial-mark) 44G->32G(50G), 0.4453570 secs]

Note that above a heap size of 10 GB G1’s accuracy for logging heap sizes is 1 GB (no decimal digits). This means in the log lines above we have a rounding error for the difference (=cleaned heap) of about 1/12=8%. That is a lot but still much less than the effect we see in figure 6. But let’s also look at some log lines from a test with MaxGCPauseMillis=150:


258.827: [GC pause (mixed) 35G->32G(50G), 0.2036500 secs]
262.001: [GC pause (young) 34G->32G(50G), 0.1774750 secs]
265.279: [GC pause (young) (initial-mark) 34G->32G(50G), 0.1744120 secs]

We see that now each GC pause cleans much less, only about 2 GB of heap (because G1 has less time in each GC run and executes smaller chunks of work at a time), but log accuracy is still 1 GB, which gives us a statistical error of about 1/2 = 50%. But especially the lower value (after GC) does not vary a lot, which means that the 32G could constantly be at roughly the same value somewhere between 31.5 and 32.49. This results in a systematic error of up to 0.5 GB, or 25% of the roughly 2 GB we measure. This means that most of the change in throughput shown in the figure above could be accounted for by a systematic measuring error that arises when G1 reduces the amount of cleaned heap to meet the pause time target.
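
The error estimate can be reproduced directly from the six log lines quoted above. The following Python sketch is an illustration under assumptions: the regular expression only covers the simplified line format shown in this post (not G1's full log output), and the helper names are made up.

```python
import re

# Matches the simplified G1 pause lines quoted above, for example:
# "276.141: [GC pause (young) 44G->31G(50G), 0.4517380 secs]"
PAUSE_RE = re.compile(
    r"^(?P<ts>\d+\.\d+): \[GC pause \((?P<kind>young|mixed)\)"
    r"(?: \(initial-mark\))? "
    r"(?P<before>\d+)G->(?P<after>\d+)G\((?P<total>\d+)G\), "
    r"(?P<secs>\d+\.\d+) secs\]$"
)


def reclaimed_gb(line):
    """Logged heap reduction in whole GB, or None if the line does not match."""
    m = PAUSE_RE.match(line.strip())
    if m is None:
        return None
    return int(m.group("before")) - int(m.group("after"))


def relative_rounding_error(reclaimed, resolution_gb=1.0):
    """Estimate used in the text: log resolution divided by reclaimed amount."""
    return resolution_gb / reclaimed


LOG_600 = [
    "276.141: [GC pause (young) 44G->31G(50G), 0.4517380 secs]",
    "288.056: [GC pause (young) 43G->32G(50G), 0.5088510 secs]",
    "301.181: [GC pause (young) (initial-mark) 44G->32G(50G), 0.4453570 secs]",
]
LOG_150 = [
    "258.827: [GC pause (mixed) 35G->32G(50G), 0.2036500 secs]",
    "262.001: [GC pause (young) 34G->32G(50G), 0.1774750 secs]",
    "265.279: [GC pause (young) (initial-mark) 34G->32G(50G), 0.1744120 secs]",
]

for name, lines in (("target 600 ms", LOG_600), ("target 150 ms", LOG_150)):
    amounts = [reclaimed_gb(line) for line in lines]
    mean = sum(amounts) / len(amounts)
    print(name, amounts, f"error about {relative_rounding_error(mean):.0%}")
```

With a 1 GB log resolution the estimate grows quickly as the amount reclaimed per pause shrinks, which is exactly what happens when the pause time target is lowered.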


The plot is mostly useless due to the large systematic error, except as a hint to measure this again and for highlighting how insufficient G1’s logging accuracy is. This is a serious flaw in G1’s GC log and needs to be fixed. Until this has been done, such measurements of throughput for large heaps can only be done with an object creation counter in the benchmark’s code which, unfortunately, I haven’t used in those tests.

Tuning the number of threads: -XX:ParallelGCThreads

G1 shares this parameter with the “classic” collectors. It sets the number of threads used in parallel for processing during GC pauses. The following figure shows how pause durations for G1 and CMS depend on that parameter:

     Figure 7. GC pause duration as a function of -XX:ParallelGCThreads for G1 (light blue curves) and CMS (purple curves); G1 benefits more from many threads on multi-core machines.

The first thing to learn from this plot is that here the default (5/6 of the number of processors = 10 threads for G1 on the given hardware) is unnecessarily high because there is little to gain from using more than 6, the number of CPU cores. We have seen earlier in part 2 of this blog series that cores count for GC operations while hyper threads (which double the number of processors to 12) add little benefit. The JVM, however, knows only the number of processors.
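
As a rough illustration of where the default of 10 threads comes from, here is a Python sketch of the heuristic as I recall it from HotSpot on x86: all logical processors up to 8, plus 5/8 of those beyond 8. This formula is an assumption on my part and is not stated in this article; for 12 logical processors it happens to give the same result as the 5/6 figure quoted above. The function name is hypothetical.

```python
def default_parallel_gc_threads(logical_cpus, switch_point=8, num=5, den=8):
    """Assumed default for -XX:ParallelGCThreads given the processor count."""
    if logical_cpus < 1:
        raise ValueError("need at least one processor")
    if logical_cpus <= switch_point:
        return logical_cpus
    # Beyond the switch point only a fraction of the extra processors is used.
    return switch_point + (logical_cpus - switch_point) * num // den


# The JVM sees 12 logical processors (6 cores with hyper-threading).
print(default_parallel_gc_threads(12))
# Setting the value to the number of physical cores has to be done by hand.
print(default_parallel_gc_threads(6))
```

Since the JVM only sees logical processors, limiting the threads to the 6 physical cores means passing -XX:ParallelGCThreads=6 explicitly.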


Second, the new generation pauses (dotted lines) for both collectors show almost identical dependence on the number of threads. Both benefit most from adding a second thread and on a quickly diminishing scale from adding more.


The big difference, once again, is in the old generation/mixed collections. With a single GC thread G1’s mixed collections (solid light blue line) are excessively long and it takes at least 5 parallel GC threads to bring them down to the 500 ms for which 2 threads are enough in all other cases. As we have seen several times before, mixed collections are the troublesome task for the G1 collector which it tackles with plenty of CPU power (and at the expense of throughput). Starting from figure 7, we can speculate that G1 could use still more CPU cores (as are for example available on a SPARC T3 or T4 chip or in a multi-chip system) to further reduce GC pause duration while CMS would benefit less. It is not shown in figure 7 that GC throughput already reaches a maximum at 3 parallel GC threads.

Number of concurrent threads: -XX:ConcGCThreads

This parameter defines the number of GC threads working concurrently with mutator (= application) threads, and it depends on the number of parallel threads: the number of parallel GC threads is an upper bound for it and also determines its default value, which is approximately 1/4 of the number of parallel threads for G1.

To avoid concurrent mode failures (which lead to excessively long Full GC pauses) I found it necessary for G1 to use at least 2 concurrent GC threads whereas with CMS it was always safe to run the same benchmark with a single concurrent thread. Setting the number of concurrent threads to higher values or not setting it at all made very little difference. I conclude that setting this parameter is pushing things very far and in most cases is probably not worth the effort as the default behavior is very good as long as there are enough parallel GC threads.
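
The relation between the two thread settings can be sketched as follows. The exact rounding, (parallel + 2) / 4 with a minimum of 1, is my recollection of G1's ergonomics and should be treated as an assumption; the text above only says "approximately 1/4". The function name is hypothetical.

```python
def default_conc_gc_threads(parallel_gc_threads):
    """Assumed G1 default for -XX:ConcGCThreads: about a quarter, at least 1."""
    if parallel_gc_threads < 1:
        raise ValueError("need at least one parallel GC thread")
    return max((parallel_gc_threads + 2) // 4, 1)


for parallel in (1, 2, 5, 6, 10):
    print(f"ParallelGCThreads={parallel} -> "
          f"ConcGCThreads={default_conc_gc_threads(parallel)}")
```

Under this assumed formula the default falls below the 2 concurrent threads that were needed here as soon as fewer than 6 parallel threads are configured, which would be one reason to set -XX:ConcGCThreads explicitly when reducing -XX:ParallelGCThreads.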

Note on parallel versus concurrent: in this article, parallel GC threads are the threads that share the collection work during a stop-the-world GC pause, while concurrent GC threads do collection work at the same time as the application (mutator) threads keep running.


G1 region size: -XX:G1HeapRegionSize

This is the parameter at the heart of G1’s heap design. G1 uses a number of regions which are managed individually. One might expect that the region size has a certain impact on G1’s results, but I could not find any effect when I varied the region size from its minimum to maximum value:

     Figure 8. G1 pause durations and throughput as a function of the region size.

None of the monitored indicators depends systematically on the region size parameter. Remember, however, that all the objects created in the benchmark are of equal size and very small compared to even the smallest possible region size. Once an application handles much larger objects, e.g. PDF documents in huge byte arrays in or close to the MB range, we should see a different picture.
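
Why object size matters can be sketched with two facts from Oracle's G1 documentation as I understand it: region sizes are powers of two between 1 MB and 32 MB, aiming at roughly 2048 regions, and objects larger than half a region are treated specially as "humongous". The Python sketch below is an approximation of those rules, not the JVM's actual code, and the function names are hypothetical.

```python
MB = 1024 * 1024
MIN_REGION = 1 * MB
MAX_REGION = 32 * MB


def ergonomic_region_size(heap_bytes, target_regions=2048):
    """Approximate default region size: heap / 2048, rounded down to a
    power of two and clamped to the 1 MB .. 32 MB range."""
    raw = max(heap_bytes // target_regions, MIN_REGION)
    power_of_two = 1 << (raw.bit_length() - 1)
    return min(power_of_two, MAX_REGION)


def is_humongous(object_bytes, region_bytes):
    """Objects larger than half a region count as humongous."""
    return object_bytes > region_bytes // 2


region = ergonomic_region_size(50 * 1024 * MB)   # the 50 GB heap used here
print(region // MB, "MB regions")
print(is_humongous(100, MIN_REGION))        # a 100-byte benchmark object
print(is_humongous(600 * 1024, MIN_REGION)) # a 600 KB byte array, 1 MB regions
```

The 100-byte objects of the benchmark never come close to this limit at any region size, which fits the observation that the region size made no measurable difference.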


G1 has few parameters to tune it

There are more parameters described in “Getting Started with the G1 Garbage Collector” and “Garbage First Garbage Collector Tuning”, but as I understand them they are meant to make G1’s operation more robust and to reduce a specific risk of failure rather than to tune G1’s performance. A case in point is -XX:G1ReservePercent which in my experiments did not influence either throughput or GC pause duration.

From my measurements I conclude that -XX:MaxGCPauseMillis is by far the most important tuning parameter. The default value, 200 ms, is not bad. For most applications this is responsive enough and feasible at the same time. But it is important to keep in mind that there are cases (depending mainly on application, hardware and heap size) where a desired and reasonable pause time target simply cannot be reached, and I have not found a JVM setting to fix that once the application is in production.

For all the other parameters G1 uses rather good default values or even adjusts them dynamically to meet the pause time target, and it usually is not necessary and not beneficial to configure them explicitly.


Coexistence of several JVMs on a single OS instance


In a common system architecture several JVMs, for example clustered application server instances, run on a common server without explicit allocation of CPU resources. In old-fashioned installations they often run on a common operating system instance. But even in virtualized environments with one JVM per OS instance, mechanisms to allocate CPU resources are not always available or used. Therefore, I also examined how both garbage collectors, CMS and G1, operate under such circumstances where several JVMs freely compete for CPU power. For this I used the following test setup:

A JVM with 25 GB of total heap size executes a 30-minute test run using roughly 50% of the available CPU power. 15 minutes after this first JVM, a second JVM with exactly the same parameters and program is started. As both now compete for the available CPU power and other system resources, we expect a clear change after half the execution time.

The following figure shows the CMS collector started with the parameters

-Xms25g -Xmx25g -XX:NewSize=2000m -XX:MaxNewSize=2000m -XX:CMSInitiatingOccupancyFraction=80

doing GC in the first of the two JVMs:

     Figure 9. Operation of the CMS collector without (left half) and with (right half) competition from a second JVM of exactly the same configuration.

The competition for CPU power affects the CMS collector in three ways:

  • It increases the average duration of GC pauses by a factor of about 1.7 (from 83 to 145 ms)
  • It moderately increases the standard deviation of GC pause duration (from 3 to 13 ms)
  • It reduces the throughput (slope of the blue heap usage curve) by a factor of 1.9, such that the sum for both JVMs (2/1.9 ≈ 1.05 times the original value) is only about 5% higher than for a single one (745 MB/s), because at close to 50% CPU usage all cores are almost in full use and hyperthreads have little benefit.

Altogether, the CMS collector responds gracefully to the sudden onset of CPU competition, in the way I would have expected.

The following figure shows how the G1 started with -Xms25g -Xmx25g and its default max pause target of 200ms copes with exactly the same situation:

     Figure 10. Operation of the G1 collector without (left half) and with (right half) competition from a second JVM with exactly the same configuration.

G1 shows a very different reaction to the sudden change:

  • the average duration of new gen GC pauses is only slightly changed (from 180 to 208 ms)
  • the GC pauses vary much more after the change and the standard deviation grows from 22 to 69 milliseconds, the max value from 264 to 700 milliseconds
  • measured throughput is reduced by almost a factor of 2.7 (from 707 to 266 MB/s), such that the combined throughput of both instances (2 × 266 / 707 ≈ 0.75) is 25% lower than that of a single JVM. This detailed result, however, suffers from exactly the same kind of rounding error as described above and is therefore unreliable (unlike the corresponding result for the CMS collector, which has no issue with rounding).
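
The combined throughput figures for both collectors follow from simple arithmetic, sketched below in Python. The function name is hypothetical; the inputs are the numbers quoted in the two bullet lists (a factor of 1.9 for CMS, 707 and 266 MB/s for G1).

```python
def combined_throughput_change(single_before, single_after, jvms=2):
    """Relative change in total throughput when `jvms` JVMs each achieve
    `single_after`, compared with one JVM achieving `single_before`."""
    return jvms * single_after / single_before - 1.0


# CMS: per-JVM throughput drops by a factor of 1.9.
cms = combined_throughput_change(1.9, 1.0)
# G1: per-JVM throughput drops from 707 to 266 MB/s.
g1 = combined_throughput_change(707.0, 266.0)

print(f"CMS combined: {cms:+.1%}")
print(f"G1 combined:  {g1:+.1%}")
```

Keep in mind that the G1 input values themselves are affected by the log rounding problem, so the G1 result is only as reliable as those measurements.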

In contrast to the CMS collector, G1 does not keep working by the same rules when the change happens but tries to maintain the pause time target. To this end, it decreases the new generation from about 7 to about 2 GB, as can be seen from the blue line in the figure above. G1 is quite successful in reaching the target pause time on average. But there are more outliers, because one JVM finds it very difficult to predict how much work it can do in a certain time interval when the second JVM takes away CPU power in an unpredictable way by switching from normal processing to stop-the-world or concurrent GC.

We have learned that G1’s pause control mechanism is rather sensitive to CPU competition with a second JVM, but this topic and its many aspects need more investigation and, maybe, a separate blog post.

Summary and Conclusion

The results presented in this post show that G1’s control of GC pauses with the -XX:MaxGCPauseMillis parameter is effective only in a rather narrow range of target values and that it has no effect (fortunately also no negative effect) when the target is set too low for the given setup. Control is also incomplete because a significant share of (mixed) GC pauses is hardly affected by the target value. Given the high cost of G1’s GC scheme, which I have already pointed out in the previous post, I find these results disappointing.

When I started working with G1 I had a feeling I was missing a decisive parameter which would get things going. After searching and testing for a while I have come to the conclusion that there probably is none. Sometimes, parameters (like the number of parallel threads) can be changed to use less resources but I found no change that gave me significantly better results than what I got out-of-the-box using only a minimal configuration like:

-Xms50g -Xmx50g -XX:+UseG1GC -XX:MaxGCPauseMillis=500

My recommendation: start from here and set the heap size and pause time target to the values you really need for your application on your target platform. If that works out as expected you are done. If it doesn’t you probably can’t do much about it because G1’s default settings and auto-adjustments are good.

High hopes that the MaxGCPauseMillis parameter could be used to reduce GC pauses of real applications at will to any value in the 1000 to 10 ms range (be it at the expense of very high CPU consumption) look futile when this cannot be achieved with a rather simple benchmark as I have used for my tests.

Looking at figure 2, which shows that the CMS collector achieves clearly better results on that same benchmark, leads me to question G1’s design: Why does it use regions to manage all generations of objects? Why does it have a common MaxGCPauseMillis target for all kinds of GC pauses? Would it not be better to keep young objects in exactly the same kind of heap layout as the classic collectors do (1 Eden space + 2 survivor spaces) and to use many regions only for the old generation? The many new generation GC pauses should then in most cases stay clearly below the MaxGCPauseMillis target. And for the few old generation pauses a relaxed target of something like 1s would still be fine and hopefully cheaper to achieve.

Outlook

As I have pointed out before, the benchmark used for these tests is very favorable for the CMS collector. With real applications, the CMS can also produce GC pauses in the range of several seconds. In fact, I work with such an application that from time to time experiences remark and new gen pauses of more than 1 second although it only has a 2 GB heap. It would be interesting to create a synthetic benchmark which, first, is able to reproduce that result and, second, is one on which G1 can beat the CMS collector. That could be a nice challenge, whereas it is simple to provide a benchmark where the CMS outperforms the G1 collector.


