Performance Analysis Methodology
A performance analysis methodology is a procedure that you can follow to analyze system or application performance. Methodologies generally provide a starting point and then guidance toward the root cause or causes. Different methodologies suit different classes of issues, and you may try more than one before accomplishing your goal.
These methodologies can help you solve issues quickly, and solve a wider range of issues. Analysis without a methodology can become a fishing expedition, where tools and metrics are examined ad hoc until the issue is found, if it is found at all.
Key Methodologies:
- The USE Method: for finding resource bottlenecks
- The TSA Method: for analyzing application time
- Off-CPU Analysis: for analyzing any type of thread wait latency
- Active Benchmarking: for accurate and successful benchmarking
This is my main page of performance analysis methodologies, and contains links and summaries of different methods.
Summaries
I first summarized various methodologies for my USENIX LISA 2012 talk "Performance Analysis Methodology" (PDF, slideshare, youtube, USENIX), then later documented them in my Systems Performance book. The following is my most up-to-date summary list, with methodologies enumerated.
These begin with anti-methods, which are included for comparison and are not meant to be followed. You can print them all out as a cheat sheet or reminder.
Blame-Someone-Else Anti-Method
- Find a system or environment component you are not responsible for
- Hypothesize that the issue is with that component
- Redirect the issue to the responsible team
- When proven wrong, go to 1
Streetlight Anti-Method
- Pick observability tools that are:
  - familiar
  - found on the Internet
  - found at random
- Run tools
- Look for obvious issues
Drunk Man Anti-Method
- Change things at random until the problem goes away
Random Change Anti-Method
- Measure a performance baseline
- Pick a random attribute to change (eg, a tunable)
- Change it in one direction
- Measure performance
- Change it in the other direction
- Measure performance
- Were the step 4 or 6 results better than the baseline? If so, keep the change; if not, revert
- Go to step 1
Passive Benchmarking Anti-Method
- Pick a benchmark tool
- Run it with a variety of options
- Make a slide deck of the results
- Hand the slides to management
Ad Hoc Checklist Method
- 1..N. Run A, if B, do C
Problem Statement Method
- What makes you think there is a performance problem?
- Has this system ever performed well?
- What has changed recently? (Software? Hardware? Load?)
- Can the performance degradation be expressed in terms of latency or run time?
- Does the problem affect other people or applications (or is it just you)?
- What is the environment? What software and hardware is used? Versions? Configuration?
Scientific Method
- Question
- Hypothesis
- Prediction
- Test
- Analysis
Workload Characterization Method
- Who is causing the load? PID, UID, IP addr, ...
- Why is the load called? code path
- What is the load? IOPS, throughput, type
- How is the load changing over time?
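As a rough illustration of the who, what, and over-time questions above, here is a minimal Python sketch that summarizes a list of I/O events. The event format, the sample values, and the `characterize` function are all hypothetical and only for illustration; the "why" question (code path) needs stack traces and is not covered here.

```python
from collections import Counter, defaultdict

# Hypothetical I/O event records: (timestamp_s, pid, io_type, size_bytes)
events = [
    (0.2, 101, "read", 4096),
    (0.7, 101, "read", 4096),
    (1.1, 202, "write", 65536),
    (1.5, 101, "read", 8192),
    (2.3, 202, "write", 65536),
]

def characterize(events):
    """Summarize load by who (PID), what (I/O type), and change over time."""
    by_pid = Counter()                 # who: operation count per PID
    by_type = Counter()                # what: operation count per I/O type
    bytes_per_sec = defaultdict(int)   # over time: bytes per one-second interval
    for ts, pid, io_type, size in events:
        by_pid[pid] += 1
        by_type[io_type] += 1
        bytes_per_sec[int(ts)] += size
    return by_pid, by_type, dict(bytes_per_sec)

by_pid, by_type, bytes_per_sec = characterize(events)
print(by_pid.most_common())
print(by_type.most_common())
print(sorted(bytes_per_sec.items()))
```

In practice the events would come from a tracing or logging tool rather than a hard-coded list.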
Drill-Down Analysis Method
- Start at highest level
- Examine next-level details
- Pick most interesting breakdown
- If problem unsolved, go to 2
By-Layer Method
Measure latency from:
- Dynamic languages
- Executable
- Libraries
- Syscalls
- Kernel: FS, network
- Device drivers
Latency Analysis Method
- Measure operation time (latency)
- Divide into logical synchronous components
- Continue division until latency origin is identified
- Quantify: estimate speedup if problem fixed
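The division and quantify steps can be sketched in Python as follows. The component names and timings are made up, and `breakdown` and `estimated_speedup` are hypothetical helpers, not part of any tool. The speedup estimate assumes the components run synchronously, so that removing time from one component removes the same time from the whole operation.

```python
def breakdown(total_ms, components):
    """Rank components by time; any unmeasured remainder is reported too."""
    parts = dict(components)
    parts["unaccounted"] = total_ms - sum(components.values())
    ranked = sorted(parts.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, ms, ms / total_ms) for name, ms in ranked]

def estimated_speedup(total_ms, component_ms, reduction=1.0):
    """Overall speedup if `component_ms` shrinks by the fraction `reduction`."""
    new_total = total_ms - component_ms * reduction
    if new_total <= 0:
        raise ValueError("reduction would remove all of the operation time")
    return total_ms / new_total

# Hypothetical measurement: a 100 ms request split into three components.
total_ms = 100.0
components = {"parse": 5.0, "db_query": 80.0, "render": 10.0}

for name, ms, share in breakdown(total_ms, components):
    print(f"{name:12s} {ms:6.1f} ms  {share:5.1%}")

# Estimated gain if the largest component were made twice as fast.
print(estimated_speedup(total_ms, components["db_query"], reduction=0.5))
```

If the largest component is itself opaque, the same breakdown can be repeated on its sub-components until the origin of the latency is identified.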
Tools Method
- List available performance tools (optionally add more)
- For each tool, list its useful metrics
- For each metric, list possible interpretation
- Run selected tools and interpret selected metrics
USE Method
For every resource, check:
- Utilization
- Saturation
- Errors
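A minimal sketch of that loop in Python, assuming the three metrics have already been collected for each resource. The metric values, the `use_check` function, and the 90% utilization limit are illustrative assumptions, not recommended thresholds.

```python
def use_check(resources, util_limit=0.9):
    """Return (resource, metric, value) findings worth investigating.

    `util_limit` is an assumed cutoff chosen for this example only.
    """
    findings = []
    for name, m in resources.items():
        if m["errors"] > 0:
            findings.append((name, "errors", m["errors"]))
        if m["saturation"] > 0:
            findings.append((name, "saturation", m["saturation"]))
        if m["utilization"] >= util_limit:
            findings.append((name, "utilization", m["utilization"]))
    return findings

# Hypothetical metrics: utilization as a fraction, saturation as queue
# length, errors as a count.
resources = {
    "cpu":     {"utilization": 0.95, "saturation": 3, "errors": 0},
    "memory":  {"utilization": 0.60, "saturation": 0, "errors": 0},
    "disk":    {"utilization": 0.40, "saturation": 0, "errors": 2},
    "network": {"utilization": 0.10, "saturation": 0, "errors": 0},
}

for finding in use_check(resources):
    print(finding)
```

The value of the method is in iterating over every resource, including the ones no familiar tool happens to report on.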
Stack Profile Method
- Profile thread stack traces, on- and off-CPU
- Coalesce
- Study stacks bottom-up
Off-CPU Analysis
- Profile scheduler per-thread off-CPU time with stack traces
- Coalesce times with like stacks
- Study stacks from largest to shortest time
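The coalesce and sort steps might look like the following Python sketch. The samples are invented, and capturing real off-CPU times with stack traces requires a scheduler tracing tool that is not shown here; `coalesce_off_cpu` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical samples: (stack frames, root first; off-CPU time in microseconds)
samples = [
    (("main", "handle_request", "read_file", "sys_read"), 1200),
    (("main", "handle_request", "lock_acquire", "futex_wait"), 300),
    (("main", "handle_request", "read_file", "sys_read"), 800),
    (("main", "idle_loop", "epoll_wait"), 5000),
]

def coalesce_off_cpu(samples):
    """Sum off-CPU time for identical stacks, sorted largest first."""
    totals = defaultdict(int)
    for stack, usecs in samples:
        totals[stack] += usecs
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

for stack, usecs in coalesce_off_cpu(samples):
    print(f"{usecs:6d} us  {';'.join(stack)}")
```

In this made-up data the largest total is an idle wait, so the next stack down is likely the more interesting one to study.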
TSA Method
- For each thread of interest, measure time in operating system thread states. Eg:
  - Executing
  - Runnable
  - Swapping
  - Sleeping
  - Lock
  - Idle
- Investigate states from most to least frequent, using appropriate tools
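One way to get the ordering for the second step is to sample a thread's state at a fixed interval and count the observations, as in this Python sketch. The sample list and `rank_states` are hypothetical, and how the states are actually sampled depends on the operating system and is not shown.

```python
from collections import Counter

# Hypothetical samples of one thread's state, taken at a fixed interval.
state_samples = [
    "Sleeping", "Executing", "Sleeping", "Lock",
    "Sleeping", "Runnable", "Executing", "Sleeping",
]

def rank_states(state_samples):
    """Return (state, count, fraction) tuples, most frequent state first."""
    total = len(state_samples)
    counts = Counter(state_samples)
    return [(state, n, n / total) for state, n in counts.most_common()]

for state, n, fraction in rank_states(state_samples):
    print(f"{state:10s} {n:3d}  {fraction:5.1%}")
```

With a fixed sampling interval, the fraction of samples in a state approximates the fraction of time the thread spent in it.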
Active Benchmarking Method
- Configure the benchmark to run for a long duration
- While running, analyze performance using other tools, and determine limiting factors