java warmup,Java Warmup

I gave a talk at JavaOne last week on flame graphs ( slides here ), and in Q&A was asked about Java warmup. I wanted to show some flame graphs that illustrated it, but didn’t have them on hand. Here they are.

These are 30 second flame graphs from a Java microservice warming up and beginning to process load. The following 30 flame graphs (3 rows of 10) have been shrunk too small to read the frames, but at this level you can see how processing changes in the first 10 minutes then settles down. Click to zoom (a little):

df8e8636c401faf1c82184305dce4cc5.png

The hues are: yellow is C++, green is Java methods, orange is kernel code, and red is other user-level code.

Now I’ll zoom in to the first two:

Time = 0 – 30s

This is the first 30 seconds of Java after launch.

On the left, the Java heap is being pre-allocated by the JVM code (C++, yellow). The os::pretouch_memory() function is triggering page faults (kernel code, orange). Click Universe::initialize_heap to get a breakdown of where time (CPU cycles) are spent. Neat, huh?

Because you’re probably wondering, here’s clear_page_c_c() from arch/x86/lib/clear_page_64.S:

ENTRY(clear_page_c_e)

movl $4096,%ecx

xorl %eax,%eax

rep stosb

ret

ENDPROC(clear_page_c_e)

It uses REP and XORL instructions to wipe a page (4096 bytes) of memory. And that’s a performance optimization added to Linux by Intel (Linux gets such micro-optimizations frequently). Now zoom back out using “Reset Zoom” (top left) or click a lower frame.

The tower on the right has many “Interpreter” frames in red (user-mode): these are Java methods that likely haven’t hit the CompileThreshold tunable yet (by default: 10000). If these methods keep being called, they’ll switch from being interpreted (executed by the JVM’s “Interpreter” function) to being complied and executed natively. At that point, the compiled methods will be running from their own addresses which can be profiled and translated: in the Java flame graph, they’ll be colored green.

The tower in the middle around C2Complier::compile_method() shows the C2 complier working on compiling Java just-in-time.

Time = 30 – 60s

We’re done initializing the heap (that tower is now gone), but we’re still compiling: about 55% of the CPU cycles are in the C2 complier, and lots of Interpreter frame are still present. Click CompileBroker::compiler_thread_loop to zoom in to the compiler. I can’t help but want to dig into these functions further, to look for optimizations in the compiler itself. Zooming back out…

There’s a new tower, on the left, containing ParNewGenTask::work. Yes, GC has kicked in by the 30-60s profile. (Not all frames there are colored yellow, but they should be, it’s just a bug in the flame graph software.)

See the orange frames in the top right? That’s the kernel deleting files, called from java/io/UnixFileSystem:::delete0(). It’s called so much that its CPU footprint is showing up in the profile. How interesting. So what’s going on? The answer should be down the stack, but it’s still Interpreter frames. I could do another warmup run and use my tracing tools ( perf-tools or bcc ). But right now I haven’t done that so I don’t know — a mystery. I’d guess log rotation.

The frames with “[perf-79379.map]”: these used to say “unknown”, but a recent flame graph change defaults to the library name if the function symbol couldn’t be found. In this case, it’s not a library, but a map file which should have Java methods. So why weren’t these symbols in it? The way I’m currently doing Java symbol translation is to take snapshots at the end of the profile, and it’s possible those symbols were recompiled and moved by that point.

See the “Interpreter” tower on the far left? Notice how they don’t begin with start_thread(), java_start(), etc? This is likely a truncated stack, due to the default setting of Linux perf to only capture the top 127 frames. Truncated stacks break flame graphs: it doesn’t know where to merge them. This will be tunable in Linux 4.8 (kernel.perf_event_max_stack). I’d bed this tower really belongs on as part of the tallest Interpreter tower on the right — if you can imagine moving the frames over there and making it wider to fit.

Time = 90+s

Going back to the montage, you can see different phases of warmup come and go, and different patterns of Java methods in green for each of them:

d5f9abf47b12e8b8be71c29dd4313239.png

I was hoping to document and include more of the warmup, as each stage can be clearly seen and inspected (and there are more interesting things to describe, including a lock contention issue in CFLS_LAB::alloc() that showed up), but this post would get very long. I’ll skip ahead to the warmed-up point:

Time = 600 – 630s

954b715770f9bb1c86464360ba7ca123.png

At the 10 minute mark, the microservice is running its production workload. Over time, the compiler tower shrinks further and we spend more time running Java code. (This image is a PNG, not a SVG.)

The small green frames on the left are more evidence of truncated stacks.

Generation

To generate these flame graphs, I ran the following the same moment I began load:

for i in `seq 100 130`; do perf record -F 99 -a -g -- sleep 30; ./jmaps

perf script > out.perf_warmup_$i.txt; echo $i done; done

The jmaps program is an unsupported helper script for automating a symbol dump using Java perf-map-agent.

Then I post processed with the FlameGraph software:

for i in `seq 100 130`; do ./FlameGraph/stackcollapse-perf.pl --kernel

warmup_$i.svg; echo $i done; done

This is using the older way of profiling with Linux perf: via a perf.data file and “perf script”. A newer way has been in development that uses BPF for efficient in-kernel summaries, but that’s a topic for another post…

注意:本文来自Brendan Gregg's Blog。本站无法对本文内容的真实性、完整性、及时性、原创性提供任何保证,请您自行验证核实并承担相关的风险与后果!

CoLaBug.com遵循[CC BY-SA 4.0]分享并保持客观立场,本站不承担此类作品侵权行为的直接责任及连带责任。您有版权、意见、投诉等问题,请通过[eMail]联系我们处理,如需商业授权请联系原作者/原网站。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值