
A .NET Developer's Life, or How To Do Production Debugging on the Fly

A case study in swift .NET application debugging using a variety of free tools that can help keep a client happy.

Over the years I've been involved in troubleshooting some nasty production issues in a number of projects. One of the main reasons these issues have been so tricky to troubleshoot is that they often weren't easily reproducible in-house. Bugs of this type usually escape the QA testing lifecycle and are discovered only after deployment to a production environment. At that stage in a product lifecycle, these bugs are revenue impacting, which means the investigating engineer is under extra pressure to resolve the problem swiftly. Depending on the type of problem, I use various production-grade tools and techniques to troubleshoot these issues and determine the root cause. In this article, I walk through one of these scenarios.

Naturally, the approaches I describe here aren't the only possible solutions; there are certainly other methodologies for solving these problems. The approach described here is what I used in a particular scenario based on my knowledge and real-world experience, and I welcome reader comments and suggestions for other solutions. To protect customer privacy and avoid any potential confidentiality conflicts, I've obscured some sections in the figures included here. Rest assured, the flow of troubleshooting steps remains unchanged.

What Is Production Debugging?
Production debugging is all about solving customer-facing issues that aren't easily reproducible. Take, for example, the common problem of a fast-food restaurant kiosk that goes offline with a blue screen of death. The restaurant loses its ability to accept orders from customers, of course, but the outage can also disrupt workflows and bring chaos to other parts of the business. If the problem can be traced to a hardware issue, the hardware can be replaced quickly. But in the case of a software issue, replacing hardware is of no help: the software vendor has to fix the issue, and that, in turn, requires being able to reproduce the scenario first.


Reproducing such issues in a lab environment is the key to finding the root cause; when an issue cannot be reproduced, it becomes really hard to solve. One methodology for solving this type of problem is the dump file-based approach. (If you're unfamiliar with it, a good starting point is this article I wrote for CODE Magazine.) At a high level, the dump file-based approach requires capturing a snapshot (aka dump file) of the problematic process first, and then analyzing it to find the root cause.
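As an example of the capture step, a full memory dump of a problematic process can usually be taken with the free Sysinternals ProcDump utility (the process name and output path below are placeholders, not details from the actual case):

  procdump -ma MyService.exe C:\dumps\MyService_full.dmp

The -ma switch requests a full memory dump, which is what the analysis tools discussed below need; Task Manager's "Create dump file" option is another common way to capture one.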

There are two main goals of production debugging:

  1. Find actionable information: Most dump file analysis tools can provide useful information, but that information may not lead to specific actions to fix the issue. As an example, a dump file from a typical production application almost always shows a large number of byte array and string allocations in memory. But that alone may not give you anything specific that you can take as an action item to fix in the code or in the environment.
  2. Find that actionable information fast: Because production issues can negatively impact customer revenues and SLAs, the longer an issue persists, the worse its effect on the customer's KPIs. Therefore, it's critical to find that actionable information as quickly as possible. The more time it takes to find and fix the issue, the more pressure builds on everyone in the troubleshooting team. Coming up with a resolution fast sometimes requires going beyond status quo methodologies.

Imagine the following scenario: A customer opened an urgent, critical support ticket about its Internet portal services. The problem was that the portal was running extremely slowly and preventing users from performing day-to-day tasks. The customer raised the concern that this was not only impacting the company's revenues, but also putting the company on the brink of missing its monthly SLA targets.

We use IIS to host these services. Our support technicians looked into the issue and found that IIS memory usage was significantly higher than normal. This application's typical memory consumption was between 400MB and 600MB, but in this case it was consuming almost 1.8GB. The support techs looked through the application and system logs but didn't find any obvious reason for the high memory consumption. At that point, they escalated the issue to a higher tier, and that's where I came into the picture. I recommended collecting a couple of dump files for the affected w3wp processes. After receiving the dump files, I started deciphering the reasons behind the process's high memory consumption. Let's go through the tools.
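As a side note, when an IIS server hosts several application pools, it helps to know which w3wp process belongs to which pool before capturing anything. The built-in appcmd utility lists the running worker processes along with their process IDs, which can then be passed to a capture tool such as ProcDump (shown earlier):

  %windir%\system32\inetsrv\appcmd list wp

This prints each running worker process with its PID and the application pool it serves, so you can be sure you're dumping the right w3wp instance.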

PerfView
In my first attempt at solving the issue, I used PerfView. Microsoft's powerful and free performance analysis tool can be used to analyze memory- and CPU-related problems. If you haven't used it before, Microsoft .NET Performance Architect Vance Morrison has published a series of PerfView video tutorials on the Channel 9 and Defrag Tools sites. PerfView is well suited for use in a production environment, as it comes as a single executable file that can easily be XCopy'd to the target machine. For memory analysis, PerfView can be used in two different modes:

  1. By capturing a heap snapshot against a target process
  2. By taking a snapshot from a dump file

I opted for the second option (see Figure 1) as we had already collected the process dump files from the production environment.
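If you prefer scripting this step over using the menu option shown in Figure 1, PerfView can also produce the heap snapshot from the command line; something along these lines should work (the file names are placeholders):

  PerfView HeapSnapshotFromProcessDump w3wp.dmp w3wp.gcdump
  PerfView w3wp.gcdump

The first command converts the process dump into a .gcdump heap snapshot, and the second opens that snapshot in the PerfView viewer.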

Figure 1. Taking a Heap Snapshot from a Dump File in PerfView

Figure 2 shows PerfView after the heap snapshot was generated from the dump file. It provides a list of object types and the memory consumed by instances of those types. The Exc% and Exc Count columns show the exclusive cost associated with a particular type, whereas the Inc% and Inc Count columns show the cost of that type plus all of its children, recursively (for example, the rows retained by a DataTable count toward the DataTable's inclusive cost but not its exclusive cost). This view shows the most memory-expensive objects at the top of the grid, and in a high-memory investigation the focus should be on those most costly objects. Here, the System.Data.DataRow objects are contributing about 51.1 percent of the memory cost. The other items in the grid play their part in memory consumption, but none of them comes close to System.Data.DataRow.

Figure 2. PerfView Snapshot from the w3wp Process Dump File

One of the cool features of PerfView is that clicking through the nodes shown in the grid walks you up the chain of objects that reference the object in question, eventually taking you to the parent object that's causing the child objects to remain in memory. So I double-clicked the System.Data.DataRow node and navigated to the Referred-From tab shown in Figure 3.

Figure 3. PerfView Referred From Tab

This tab shows that the System.Data.DataRow objects are referenced by the following objects:

  • System.Data.RecordManager
  • Dictionary<int, System.Data.DataRow>
  • System.Data.RBTree+TreePage<System.Data.DataRow>

Because RecordManager is at the top of the list in that grid, it holds the references to the most memory-consuming DataRow objects. To find out more about this reference chain, I just needed to click through the node. However, when I did that, PerfView popped up a dialog indicating that it had encountered an internal error due to an out-of-memory condition (Figure 4). I clicked on the other two nodes listed in the Referred-From tab for the DataRow objects but met the same fate.

Figure 4. PerfView Encountering Its Own OutOfMemory Error

Even though it's an extremely useful tool that I've used many times to troubleshoot production issues successfully, PerfView just wasn't able to help me this time. It provided some useful information about which objects were consuming the most memory, but that wasn't enough to take a specific action to fix the root cause of the problem.

DebugDiag
Next, I turned to DebugDiag, another free Microsoft tool. DebugDiag provides a rich set of rules that can be used to collect and analyze dump files. It has two components, one for dump file collection and another for analysis. Here, I'll focus on how I used the DebugDiag Analysis tool with an already collected dump file. Figure 5 shows the DebugDiag Analysis component, which comes with built-in rules for dump file analysis. You can add one or more dump files for analysis and then select the rules appropriate for the specific issue under investigation.

Figure 5. DebugDiag Analysis Application with Analysis Rules

DebugDiag Analysis outputs its results to an HTML report. Figure 6 shows the key parts of the analysis produced by running the DotNetMemoryAnalysis rule against the dump file. The top banner shows a summary of errors, warnings and information. The warning section describes the number of objects that are ready for finalization, and the information section shows that the GC heap size is about 1.2GB.

Figure 6. DebugDiag Analysis Report -- Error/Warning/Information/Notification

Scrolling down in the report reveals more information about how the overall memory breaks down across the GC heaps (see Figure 7). The next section lists the 40 most memory-consuming object types. Consistent with what can be seen in PerfView, System.Data.DataRow is at the top of the list.

Figure 7. DebugDiag Analysis Report -- .NET GC Heap Info and 40 Most Memory-Consuming Object Types

With the help of PerfView and DebugDiag, I determined that System.Data.DataRow objects were responsible for most of the memory consumption within the IIS process. All of us who have experience with data-centric applications know that DataRow objects are populated as a result of queries performed against the database. This, of course, is helpful information that I could use to build an initial theory about the problem, but there wasn't much in these reports to give me anything concrete to fix.
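To make that theory concrete, here's a hypothetical C# sketch (the class, query and connection names are invented for illustration and aren't from the customer's code) of the kind of data-access pattern that produces exactly this signature: an unbounded query filled into a long-lived DataSet, leaving a huge population of DataRow objects rooted in memory for the lifetime of the w3wp process.

  // Hypothetical example of a pattern that keeps DataRow objects alive:
  // an unbounded query cached in a static DataSet for the life of the process.
  using System.Data;
  using System.Data.SqlClient;

  public static class OrderCache
  {
      // A static field roots the DataSet (and every DataRow inside it)
      // for as long as the w3wp process runs.
      private static DataSet _cache;

      public static DataSet GetOrders(string connectionString)
      {
          if (_cache == null)
          {
              _cache = new DataSet();
              using (var connection = new SqlConnection(connectionString))
              using (var adapter = new SqlDataAdapter("SELECT * FROM Orders", connection))
              {
                  // No WHERE clause and no paging: every row in the table
                  // becomes a DataRow held by the cached DataSet.
                  adapter.Fill(_cache, "Orders");
              }
          }
          return _cache;
      }
  }

A pattern like this is invisible in the logs, but it shows up in a dump exactly as we saw it: DataRow instances dominating the heap, held alive through DataTable internals such as RecordManager.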

At this point my thought was that some database queries were probably retrieving a lot of data. The problem is that this application performs hundreds (if not thousands) of queries against the database. Without knowing more about the actual queries, it's really hard to predict which part of the code is responsible for the application's high memory consumption. Running SQL Profiler against the application to find all the running queries was a theoretical possibility, but that alone could be a mammoth task that could take a long time. We needed something fast to take this customer out of its misery; the customer was ready to abandon our product unless something was done immediately. With every minute that ticked by, the pressure kept building, so we looked to other tools for better answers.

Windbg
I then turned to another free Microsoft tool, Windbg, a native debugger that's part of the Debugging Tools for Windows. Given the firepower and flexibility that Windbg offers, it's a favorite of mine. Because Windbg is a native debugger, it doesn't understand managed code without the help of extension DLLs, which give it the ability to peer into the internal data structures of the managed heap. Son of Strike (SOS) is one commonly used extension that ships as part of the .NET Framework. Figure 8 shows Windbg after loading the collected dump file. I then ran the command to load SOS.
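For a .NET Framework 4.x dump, loading SOS from the CLR that's captured in the dump typically looks like this (for a .NET 2.0/3.5 dump, the module name would be mscorwks instead of clr):

  .loadby sos clr

From there, a typical memory investigation continues with commands along these lines, where the object address passed to !gcroot is a placeholder you'd copy from the !dumpheap output:

  !dumpheap -stat
  !dumpheap -type System.Data.DataRow
  !gcroot <object_address>

The first command summarizes the managed heap by type, the second lists the individual DataRow instances, and the third shows exactly which reference chain is keeping a given instance alive.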

