What to do when sample buffer overflows are reported
OProfile is a system-wide profiler for Linux, capable of profiling all running code at low overhead.
As of OProfile release 0.9.6, when you run 'opcontrol --shutdown' after a profiling run, oprofile will check for lost samples and display a warning message if any samples have been dropped. The message recommends reviewing the oprofiled.log for more information on lost samples and trying to eliminate or minimize them.
But how does one do that? And how does one interpret the information about lost samples in the log?
Oprofiled.log layout
Let's start out with the interpretation of the log information. The oprofiled.log is located in:
<session_dir>/samples
where <session_dir> is /var/lib/oprofile by default.
At the end of a profiling run, a section titled "OProfile Statistics" is written to this log. The beginning (summary) part of this section includes several lines giving information about lost samples or events. For example:
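The exact set of lines varies somewhat by OProfile version; the snippet below is only illustrative (the numbers are made up), but it shows the kinds of lines to look for:

```
Nr. non-backtrace samples: 3672761
Nr. samples lost due to no mapping: 1342
Nr. samples lost due to no mm: 587
Nr. events lost due to buffer overflow: 250000
```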
Following that is a "Statistics for cpu <N>" section for each processor, with more information on lost samples. The oprofile kernel driver uses a two-level sample buffering scheme: one buffer per cpu, plus one main "event" buffer which periodically collects all the individual cpu buffer samples. The userspace oprofile daemon reads the samples from this main event buffer.
Getting back to the summary part of the "OProfile Statistics" section:
- The "Nr. non-backtrace samples" line shows the number of samples actually received.
- Recently discovered (Dec 2011) Bug Alert: The "Nr. non-backtrace samples" statistic up through OProfile 0.9.7 is not accurate. This value is less than the actual number of samples received.
- Looking at the "lost samples" lines, you'll almost always see some samples lost due to "no mapping" or "no mm". This can happen when samples are recorded by the oprofile kernel module for an application, but then that app ends before those samples can be processed by the userspace oprofile daemon. There's not much the user can do about these. Proportionally, the number of such lost samples is usually very low, so these can generally be safely ignored.
- On the other hand, the user often can (and should) take action to eliminate or minimize the number of "events lost due to buffer overflows", especially if that number is high (say, anything over 1%) compared to the total "Nr. non-backtrace samples". The same is true for "Nr. samples lost cpu buffer overflow" in the per-cpu stats sections. OK, so what action(s) should be taken?
Reducing/Eliminating Events lost due to buffer overflow
Note: The "events lost due to buffer overflow" statistic is mis-named; it should be named "samples lost due to buffer overflow". The following text uses the phrasing that the tool provides, but keep this in mind.
Generally speaking, if you're seeing "events lost due to buffer overflow" in the summary part of the "OProfile Statistics" section of the oprofiled.log, the easiest thing to try is lowering your sampling rate (i.e., using a higher 'count' value for the event(s) you're profiling). But since OProfile is a statistical profiler, you need to keep in mind that the fewer the samples over a given period of time, the less accurate your profile data will be. It's difficult to give a rule of thumb on what constitutes a statistically valid profile. But as a rough example, consider that gprof typically collects 100 samples per second. This is fine for the low-hanging fruit, but may not be fine-grained enough for detailed performance investigations. So, for time-based or cycles-based profiles, you should try to do at least an order of magnitude better than gprof.
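As a rough worked example (assuming a 3 GHz clock, as on the test system described below, and a fully busy CPU):

```
3,000,000,000 cycles/sec / count of 30,000,000 =   100 samples/sec per CPU (gprof-like)
3,000,000,000 cycles/sec / count of  3,000,000 = 1,000 samples/sec per CPU (10x gprof)
```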
Here is an example of how to change the 'count' value for the event PM_CYC_GRP1; the count value is the number that follows the colon.
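For instance, to take one sample every 500,000 occurrences of the event (500,000 is just an illustrative value here; use whatever count suits your event and system):

```
opcontrol --event=PM_CYC_GRP1:500000
```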
You can check the new count value by issuing:
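```
opcontrol --status
```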
OProfile provides the ability to tailor the sizes of the key working buffers it allocates.
If, after lowering your sampling rate, you're still getting too many overflows, try adjusting the buffer size along with the buffer-watershed (which should be around 1/4 the value of buffer size). Once you have eliminated the main buffer overflows, then go after CPU buffer overflows (if any) by adjusting the cpu-buffer-size. The default size for the main event buffer is 131072 samples, and the default size for the per-cpu buffers is 8192 samples.
See the 'opcontrol' man page for details on setting these parameters. The man page uses "num samples" to indicate that the value you enter sizes a buffer big enough to hold that many samples. The samples are unsigned longs, so each is 64 bits or 32 bits, depending on the bitness of the kernel (since these buffers are allocated by the kernel).
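As a rough sketch of the memory involved (assuming a 64-bit kernel, so 8 bytes per sample, and the 32-CPU test system described below):

```
--buffer-size=2097152      2M samples x 8 bytes           = 16 MB main event buffer
--cpu-buffer-size=16384   16K samples x 8 bytes x 32 CPUs =  4 MB of per-cpu buffers
```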
Here is a quick example:
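(The values shown here match the 2M / .5M / 16K rows in the tables below; all three sizes are in samples.)

```
opcontrol --buffer-size=2097152 --buffer-watershed=524288 --cpu-buffer-size=16384
```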
Run opcontrol --status again to verify the "non-default" settings.
To restore to the defaults, simply set each of the buffer values to zero (0).
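For example:

```
opcontrol --buffer-size=0 --buffer-watershed=0 --cpu-buffer-size=0
```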
Testing this on a POWER7 system
We did some testing on a POWER7 system running at 3 GHz, with 32 logical CPUs (i.e., 8 cores, with the default of SMT=4). The tables below show the results. You can see that, with a sampling interval of one sample every 25,000 PM_RUN_CYC events, we found no buffer size settings that prevented lost samples. But by adjusting the sampling interval and buffer parameters, we were able to collect profile data with very few (or no) lost samples.
Notes
- With a sampling interval of one sample for every 50,000 run cycles on a 3 GHz system, each CPU can generate up to 3,000,000,000 / 50,000 = 60,000 samples per second. The total possible number of samples per second is therefore 60,000 x num_cpus (in this case 32 CPUs), assuming all CPUs are 100% busy. Under that assumption, the nominal number of samples for this testing scenario is 1,920,000 per second.
- Without oprofile running, the runtime for a single test application is 23 seconds; with a copy of the app running on every CPU, the runtime is 53 seconds (we'll call this the "nominal runtime"). Thus, 32 apps should register 1,920,000 x 53 = 101,760,000 total samples over the collective runtime. This is a theoretical value, because when oprofile is running, the profiling overhead increases the apps' runtime.
- The data in the tables below show that, for this type of system, oprofile is indeed capable of processing 1,920,000 samples per second with appropriate profiling parameters.
Sampling = 25,000 events
All of the profiling parameter combinations tried in this case resulted in too many lost samples; even when the main event buffer no longer overflowed, the per-cpu buffers still lost millions of samples.
| Sampling interval PM_RUN_CYC | --buffer-size | --buffer-watershed | --cpu-buffer-size | Events lost from buffer overflow | Average events lost per-cpu buffer | oprofile overhead (ratio of nominal) | Summary |
|---|---|---|---|---|---|---|---|
| 25,000 | 2M | .5M | 16K | 96 million | 3.5 million | 1.33 | Not good |
| 25,000 | 4M | 1M | 16K | 45 million | 3.5 million | 1.33 | Not good |
| 25,000 | 8M | 2M | 16K | 10 million | 3.5 million | 1.34 | Not good |
| 25,000 | 16M | 4M | 16K | 0 | 3.5 million | 1.33 | Not good |
Sampling = 50,000 events
The last two parameter combinations did the best here; the first (all defaults) lost nearly all the samples.
| Sampling interval PM_RUN_CYC | --buffer-size | --buffer-watershed | --cpu-buffer-size | Events lost from buffer overflow | Average events lost per-cpu buffer | oprofile overhead (ratio of nominal) | Summary |
|---|---|---|---|---|---|---|---|
| 50,000 | default | default | default | 81.5 million | a few on some, several thousand on others | 1.21 | Not good |
| 50,000 | 1M | .25M | 16K | 800,000 | 4 | 1.22 | Not good |
| 50,000 | 2M | .5M | 16K | 0 | 3 | 1.21 | Good |
| 50,000 | 4M | 1M | 16K | 0 | 3 | 1.23 | Good |
In this example, the main buffer size didn't make a difference between 2M and 4M samples, so for this workload the 2M setting would have been fine. For other workloads, you may see more events lost.
Sampling = 100,000 events
This set of profiling parameters uses even fewer resources and should provide as statistically valid a profile as when using a sampling interval of 50,000. Note that the oprofile overhead is also substantially lower.
| Sampling interval PM_RUN_CYC | --buffer-size | --buffer-watershed | --cpu-buffer-size | Events lost from buffer overflow | Average events lost per-cpu buffer | oprofile overhead (ratio of nominal) | Summary |
|---|---|---|---|---|---|---|---|
| 100,000 | 1M | .25M | 16K | 0 | 1.5 | 1.07 | Very good |