http://blogs.msdn.com/b/debuggingtoolbox/archive/2011/10/03/top-things-to-consider-when-troubleshooting-complex-application-issues.aspx
1- For reactive incidents: “Bring the engineer onsite because it is going to be easier to isolate the problem.”
This is the most common misconception I’ve heard. Let me explain: most complex problems require deep debugging sessions. Collecting the necessary information is the easy part and can be done remotely or by the customer. However, debugging the dump files may take several hours or even days. Being onsite can actually slow down the process, since we may not have access to our private symbols or be able to collaborate with colleagues who have specific technology knowledge.
Often, the great value of being onsite is acting as the eyes and ears for a remote engineer, or getting a better understanding of complex problems that are hard to grasp by e-mail or phone.
2- “We need a Code Review because our application has performance problems.”
Sometimes I receive requests for Code Review but what the customer needs, in reality, is Problem Isolation. So what is the difference and how do I know what I need?
The goal of a Code Review is to review the source code and point out the parts that don’t follow Best Practices, that represent security holes, or that can be optimized for speed.
The goal of Problem Isolation is to isolate problem(s) causing specific application symptoms. For example, crashes, hangs, memory leaks and performance bottlenecks.
Let me explain: imagine a scenario where an ASP.NET application is suffering from poor performance. If I code review the application I may find methods that can be optimized for speed. However, if the application has slow performance, because there is a bottleneck on the database side or network, the performance gain from the code review is not going to solve the problem. Worst case, it may not even be noticeable.
A Code Review is great when you want to make sure your application doesn’t have potential problems that could be avoided by implementing Best Practices, or if you think the application can be further optimized for speed. However, you’ll only be able to measure the speed gain if you have a baseline from when the application was not experiencing issues. More importantly, the performance gain is usually not as significant as the gain from removing the bottlenecks.
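To put rough numbers on the scenario above, here is a minimal sketch. The timings (100 ms in application code, 900 ms waiting on the database) and the 2x speedups are made-up values for illustration, not measurements from any real application.

```python
def request_time(code_ms, db_ms, code_speedup=1.0, db_speedup=1.0):
    """Total request time when each part is made `speedup` times faster."""
    return code_ms / code_speedup + db_ms / db_speedup


def gain_percent(before_ms, after_ms):
    """Improvement relative to the baseline, in percent."""
    return 100.0 * (before_ms - after_ms) / before_ms


# Hypothetical baseline: most of the request is spent waiting on the database.
baseline = request_time(100.0, 900.0)

# Code review makes the application code twice as fast.
after_code_review = request_time(100.0, 900.0, code_speedup=2.0)

# Problem isolation finds the database bottleneck and halves its cost.
after_db_fix = request_time(100.0, 900.0, db_speedup=2.0)

print(f"code review gain: {gain_percent(baseline, after_code_review):.1f}%")  # 5.0%
print(f"bottleneck gain:  {gain_percent(baseline, after_db_fix):.1f}%")       # 45.0%
```

With these assumed numbers, doubling the speed of the code barely moves the total, while the same effort spent on the bottleneck is very noticeable. It also shows why the baseline matters: without it there is nothing to compute the gain against.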
3- “So after fixing this problem the performance/memory issue is going to be normalized, right?”
The reality is that there can be different problems causing the same symptom(s) like slow performance, hangs or memory issues.
What does this mean? It means that after solving the most significant and visible problem(s) we need to keep monitoring the application, because other minor problems could be causing the same symptom. In addition, after fixing the main bottleneck(s), these other minor problems should become visible and easier to isolate. Identifying and resolving application issues is an iterative process. In case you would like to know more about it read this old post here.
4- “We’re using .NET so I don’t need to worry about memory management.”
If you have a pure .NET application, I would be inclined to agree. However, most commercial applications have some kind of interaction with the Native World, like C DLLs, COM objects or API calls.
The CLR is great at managing memory for pure .NET applications. If your application is interacting with native code, it is the developer’s responsibility to make sure that resources are released/closed.
5- What information do I need to collect? How much information do I need to collect?
There is a fine line between not enough information and too much information. What is most important for us is to get the right information. One dump file from your problematic application collected when experiencing the symptom is very valuable. Five dump files from the problematic application collected when it was running fine will most likely not be very helpful.
If your application is crashing you want to collect a dump file when the application crashes. If you collect a dump file any other time it won’t have information from the exception. If you force a huge Kernel Dump file to be collected you will end up with a huge dump file from all your machine’s processes but, again, that dump file won’t have information about the exception crashing your application.
6- “We need an Architecture Review because our application has performance problems.”
This is similar to item 2 above. An Architecture Review is not the best approach to solve immediate problems. Additionally, an Architecture Review may not even be the right approach to solve most application problems, because these problems are usually too granular. In other words, the customer’s application may be designed correctly from an Architectural point of view and still have problems that are unrelated to the way the architecture was designed.
Let me provide a few examples. Imagine that you haven’t installed an important update for the .NET Framework which is impacting your application. Or that your SharePoint application is not releasing the internal SharePoint objects it’s using. In these examples, an Architecture Review is not going to uncover these problems.
7- Sometimes finding the right place to start is the hardest part.
Imagine this scenario:
“We need an IIS Engineer because my W3WP.EXE is consuming too much memory. It may be an IIS bug.” How will users, administrators and developers experience the issue?
- End user: I think the browser has a problem, the application is slow.
- IIS Administrator: I think the problem is the ASP.NET application.
- Developers: The ASP.NET application is running fine; the problem is probably on the database side.
- DBA: The SQL Server is running fine; I think the bottleneck is network related.
- Network Administrator: The network doesn’t have problems.
Our goal as Developer PFEs is to help our customers to isolate the problem across disparate technologies and work to provide cross group collaboration between different teams while onsite or remotely.
8- What skillset do I need to help me debug an application?
If your application requires debugging you don’t need an engineer who knows how to administer or install the product. What you need is an engineer that knows the internals of the applications and how to debug them. The good news is that this knowledge is not application dependent.
A Microsoft Engineer can debug your application even if he/she has never seen your application before. The same applies to our own products.
If at some point we isolate the problem to one of our products then we involve an engineer from the Product Team because he/she has intimate knowledge of the problems and bugs from the product he/she supports.
9- “I just ran !clrstack and most threads running for a long time are trying to retrieve data from the database. The bottleneck is probably on the database side.”
Let me tell you something I used to say to our new engineers and to those who want to learn more about .NET Debugging: if you want to excel at .NET Debugging you must learn Native code debugging, which implies some knowledge of C/C++ programming too.
Don’t believe me? Ask your favorite bloggers that blog about .NET Debugging if they only know .NET debugging.
With that being said, !clrstack is the favorite command of people learning .NET Debugging. It’s cool; you can see the managed side of the call stack, which is usually higher level than the native side. However, sometimes you still need to see the native side to really understand what the thread is doing. If you focus just on the managed side, you may come to the wrong conclusion.
Bottom line is: If you want to improve your .NET Debugging skills learn more about Native debugging.
Here is a list of books about .NET Debugging, User Mode Debugging and Kernel Debugging.
10- “My two servers are identical but the issue happens just on server XYZ.”
When troubleshooting scenarios like that never assume the servers are identical. Instead, gather the data to prove it.
A great place to start is to run the MPSReport/SPSReport tool. This tool will collect all information from each server so you can compare them. On at least one occasion in which the servers really were identical, the underlying issue was that only one of the servers was being accessed by the application, so it was being overloaded.
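As a minimal sketch of "gather the data to prove it", the snippet below compares two sets of settings and reports only what differs. It assumes you have already turned each server’s collected data into key/value pairs; it does not read the MPSReport/SPSReport output format, and every key and value shown is hypothetical.

```python
MISSING = "<missing>"


def diff_inventories(server_a, server_b):
    """Return {setting: (value_on_a, value_on_b)} for every setting that differs."""
    differences = {}
    for key in sorted(server_a.keys() | server_b.keys()):
        value_a = server_a.get(key, MISSING)
        value_b = server_b.get(key, MISSING)
        if value_a != value_b:
            differences[key] = (value_a, value_b)
    return differences


# Hypothetical inventories for two "identical" servers.
server_abc = {"os_build": "X", "dotnet_patch_level": "A", "hotfix_count": 212}
server_xyz = {"os_build": "X", "dotnet_patch_level": "B", "hotfix_count": 212,
              "antivirus_filter": "on"}

for setting, (on_abc, on_xyz) in diff_inventories(server_abc, server_xyz).items():
    print(f"{setting}: ABC={on_abc} XYZ={on_xyz}")
```

An empty result is also useful information: it tells you to look beyond configuration, for example at how the load is distributed between the servers.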
11- “From the Event Log I can see the exception that crashed my application and the call stack is pointing to Windows. I think this is a Windows bug.”
This is related to item 7 above and is a common misconception. Sometimes call stacks from 2nd chance exceptions (exceptions not handled by your application, thus crashing the app) have DLLs from Windows as the top frames. This is normal and it doesn’t mean that Windows is causing the crash.
Example:
ChildEBP RetAddr
0013bcd0 7c90de7a ntdll!KiFastSystemCall+0x2
0013bdd0 7c81cdfe kernel32!_ExitProcess+0x62
0013bde4 79f944b0 kernel32!ExitProcess+0x14
0013c00c 79f2c09a mscorwks!SafeExitProcess+0x11b
0013c018 79eff585 mscorwks!DisableRuntime+0xd1
0013c0a8 79011628 mscorwks!CorExitProcess+0x242
0013c0b8 77c39d3c mscoree!CorExitProcess+0x46
0013c0c4 77c39e78 msvcrt!__crtExitProcess+0x29
0013c0d4 77c39e90 msvcrt!_cinit+0xee
0013c0e8 0e68d21e msvcrt!exit+0x12
0013c580 0e256834 testapp!FuTestInterface::init+0x34 <<< This is where you should start the investigation.
0013c5a4 0e1d8c01 testapp!WBNARiskReportInterface::getResults+0x442a
Therefore, don’t assume that ntdll or kernel32 caused the problem. The APIs from these Operating System DLLs are being called as a consequence of the exception, which was likely caused by the application. Try to identify the most recent application method call as your initial investigation point. In our example above this is testapp!FuTestInterface::init. Analyze it and, if necessary, analyze the previous frame and so on.
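The same rule of thumb can be sketched in a few lines of code: walk the stack from the top and stop at the first frame that doesn’t belong to an Operating System or runtime module. The list of modules to skip is an assumption made for this example, and the parsing only expects the three-column layout shown above.

```python
# Modules treated as OS/runtime code. This set is an assumption for the example.
SYSTEM_MODULES = {"ntdll", "kernel32", "mscorwks", "mscoree", "msvcrt"}

STACK = """\
ChildEBP RetAddr
0013bcd0 7c90de7a ntdll!KiFastSystemCall+0x2
0013bdd0 7c81cdfe kernel32!_ExitProcess+0x62
0013bde4 79f944b0 kernel32!ExitProcess+0x14
0013c00c 79f2c09a mscorwks!SafeExitProcess+0x11b
0013c0b8 77c39d3c mscoree!CorExitProcess+0x46
0013c0e8 0e68d21e msvcrt!exit+0x12
0013c580 0e256834 testapp!FuTestInterface::init+0x34
0013c5a4 0e1d8c01 testapp!WBNARiskReportInterface::getResults+0x442a
"""


def first_application_frame(stack_text):
    """Return the first 'module!symbol+offset' frame outside SYSTEM_MODULES, or None."""
    for line in stack_text.splitlines():
        parts = line.split()
        # A frame line looks like: ChildEBP RetAddr module!symbol+offset
        if len(parts) < 3 or "!" not in parts[2]:
            continue  # header line or anything that is not a frame
        module = parts[2].split("!", 1)[0]
        if module.lower() not in SYSTEM_MODULES:
            return parts[2]
    return None


print(first_application_frame(STACK))  # testapp!FuTestInterface::init+0x34
```

This only picks a starting point. You still need to read that method and, if necessary, the frames below it.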
12- “We collected dump files from that C++ application which is crashing. We think it is a heap corruption, so the call stack should indicate the culprit, right?”
Heap corruption is not as frequent as it used to be because .NET applications are more and more common. However, back in the days of COM objects and C DLLs, heap corruption was a typical problem.
A regular crash dump usually shows where the corruption was detected, not where it happened. To get call stacks from the method that actually corrupted the heap you need to enable Page Heap, restart the application so it can use the new Heap Manager settings, and collect a dump file. With this approach you can easily isolate the heap corruption problem.
Page Heap can be enabled using different tools like PageHeap.exe, GFlags.exe, Application Verifier and others. Some Page Heap settings, like Full Page Heap, place an inaccessible page right after each memory allocation, so whenever your application tries to write past the end of the buffer it hits that page and you get an Access Violation.
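To see why this puts the culprit on the call stack, here is a toy model. It is not how the Windows Heap Manager is implemented: the class, the one-cell "guard page" and the exception name are all invented for illustration. The only point is the difference in when the overrun is noticed.

```python
class AccessViolation(Exception):
    """Stands in for the Access Violation raised by a real guard page."""


class ToyHeap:
    def __init__(self, full_page_heap):
        self.full_page_heap = full_page_heap
        self.guards = set()     # addresses that must never be written
        self.next_addr = 0
        self.corrupted = False  # set when a write silently leaves its block

    def alloc(self, size):
        start = self.next_addr
        self.next_addr += size
        if self.full_page_heap:
            # One guard cell right after the block plays the role of the guard page.
            self.guards.add(self.next_addr)
            self.next_addr += 1
        return (start, size)

    def write(self, block, count):
        """Write `count` cells starting at the beginning of `block`."""
        start, size = block
        for addr in range(start, start + count):
            if addr in self.guards:
                raise AccessViolation(f"overrun caught at address {addr}")
            if addr >= start + size:
                self.corrupted = True  # tramples the neighbour, nobody notices yet


# Normal heap: the overrun goes unnoticed until something else trips over it later.
normal = ToyHeap(full_page_heap=False)
buf = normal.alloc(4)
normal.alloc(4)
normal.write(buf, 6)
print("normal heap corrupted silently:", normal.corrupted)

# Page heap: the very same write fails immediately, inside the offending call.
paged = ToyHeap(full_page_heap=True)
buf = paged.alloc(4)
paged.alloc(4)
try:
    paged.write(buf, 6)
except AccessViolation as error:
    print("page heap:", error)
```

In the first case the eventual crash would point at whoever touches the damaged neighbour. In the second, the failure happens inside the write that caused the overrun, which is exactly the call stack you want in the dump file.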
Here is a didactic explanation about using GFlags to isolate Heap Corruption problems.
Note: Windows Vista/Windows Server 2008 and Windows 7 have more mechanisms to easily detect heap corruptions and minimize them.
Note 2: In some situations even PageHeap won’t crash with the culprit on the call stack but those cases are rare. (Thanks to Mario Hewardt for the reminder)