Investigating an InvalidProgramException from a memory dump (part 1 of 3)

Datadog automatic instrumentation for .NET works by rewriting the IL of interesting methods to emit traces that are then sent to the back-end. This is a complex piece of logic, written using the profiler API, and riddled with corner cases. And as always with complex code, bugs are bound to happen, and those can be very difficult to diagnose.

As it turns out, we had customer reports of applications throwing InvalidProgramException when using our instrumentation. This exception is thrown when the JIT encounters invalid IL code, most likely emitted by our profiler. The symptoms were always the same: upon starting, the application had a random probability of ending up in a state where the exception would be thrown every time one particular method was called. When that happened, restarting the application fixed the issue. The affected method changed from one occurrence to the next. The issue was bad enough that the customers felt the need to report it, and rare enough that it couldn’t be reproduced at will. Yikes.

Since we couldn’t reproduce the issue ourselves, I decided to ask for a memory dump, and received one a few weeks later (random issues are random). This was my first time debugging this kind of problem, and it proved to be quite tricky, so I figured it would make a nice subject for an article.

The article ended up being a lot longer than I thought, so I divided it into 3 different parts.

  • Part 1: Preliminary exploration
  • Part 2: Finding the generated IL
  • Part 3: Identifying the error and fixing the bug

Preliminary exploration

The memory dump had been captured on a Linux instance. I opened it with dotnet-dump, running on WSL2. The first step was to find out which method was throwing the exception. Usually, this kind of memory dump is captured when the first-chance exception is thrown, and that exception is visible in the last column when using the clrthreads command. But I couldn’t find any:

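With a placeholder for the file name, the session looked something like this (actual output omitted):

```
$ dotnet-dump analyze ./core_dump
> clrthreads
```

clrthreads lists the managed threads; an exception pending on a thread normally shows up in the rightmost column of its output.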
I then decided to have a look at the notes sent along with the dump (yeah, I know, I should have started there), and understood why I couldn’t see the exception: the customer confirmed that the issue was occurring and just captured the memory dump at a random point in time. Can’t blame them: I don’t even know how to capture a memory dump on first-chance exceptions on Linux, and it doesn’t seem to be supported by procdump. If somebody from the .NET diagnostics team is reading this…

That’s OK though. If no garbage collection happened since the exception was thrown, it should still be hanging around somewhere on the heap. To find out, I used the dumpheap -stat command:

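In the dotnet-dump session, that translates to something like this (output omitted; the -type argument filters the statistics down to the type we care about, which is handy on a big heap):

```
> dumpheap -stat -type InvalidProgramException
```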
Three of them, great. I used the dumpheap -mt command to get their addresses, with the value from the “MT” column. Then I used the printexception (pe) command to get the stack trace associated with the exception:

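With placeholders for the actual values, the commands were:

```
> dumpheap -mt <value of the "MT" column>
> printexception <address of one of the InvalidProgramException objects>
```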
Note: Since this is the output from an actual memory dump sent by a customer, I redacted the namespaces containing business code and replaced them with “Customer”.

We see that the exception was thrown from Customer.DataAccess.Generic.AsyncDataAccessBase`2+<>c__DisplayClass1_0.<CreateConnectionAsync>b__0. The <>c__ indicates a closure, so this was probably a lambda function declared in the CreateConnectionAsync method.

Since I didn’t have the source code, I used the ip2md command to convert the instruction pointer (second column of the stacktrace) to a MethodDescriptor, then fed it to the dumpil command to print the raw IL:

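Again with placeholders:

```
> ip2md <instruction pointer from the stack trace>
> dumpil <MethodDesc address returned by ip2md>
```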
I’m not going to show the full IL for obvious privacy reasons, but one thing struck me: the code was not calling any method that we instrument, so there was no reason the IL would have been rewritten (and indeed, there were no traces of rewriting).

Digging deeper, I got a hint of what was happening:

Translated into C#, this is pretty much equivalent to:

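The decompiled snippet isn’t reproduced here, but the pattern was roughly this (a sketch with invented names, logging details omitted):

```csharp
using System;
using System.Data.Common;
using System.Threading.Tasks;

internal static class ConnectionHelper
{
    // Invented name: stands in for the <CreateConnectionAsync>b__0 lambda
    public static DbConnection OnConnectionCreated(Task<DbConnection> task)
    {
        try
        {
            return task.Result;
        }
        catch (Exception ex)
        {
            // ... log a few things ...
            throw ex; // "throw ex;" rather than "throw;" — this resets the stack trace
        }
    }
}
```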
Basically, the method was there to log a few things (that I didn’t include in the snippet), then rethrow the exception. But since they didn’t wrap the exception before rethrowing it (and didn’t use ExceptionDispatchInfo), it overwrote the original call stack!

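For reference, there are two standard ways to rethrow without losing the original stack trace:

```csharp
using System;
using System.Runtime.ExceptionServices;

try
{
    ThrowingMethod();
}
catch (Exception ex)
{
    // A bare "throw;" would preserve the stack trace here.
    // ExceptionDispatchInfo preserves it too, and also works when the
    // exception is rethrown later, outside of the catch block:
    ExceptionDispatchInfo.Capture(ex).Throw();
}

static void ThrowingMethod() => throw new InvalidOperationException("oops");
```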
The caller code would have been something like:

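Something like this (a hedged reconstruction; apart from CreateConnectionAsync and ContinueWith, every name is invented, and SomethingAsync is a placeholder):

```csharp
using System;
using System.Data.Common;
using System.Threading.Tasks;

internal class DataAccess
{
    private Task<DbConnection> CreateConnectionAsync()
    {
        return SomethingAsync().ContinueWith(task =>
        {
            try
            {
                return task.Result;
            }
            catch (Exception ex)
            {
                // ... log a few things ...
                throw ex; // overwrites the original stack trace
            }
        });
    }

    // Placeholder for whatever async call the exception bubbled up from
    private Task<DbConnection> SomethingAsync() => throw new NotImplementedException();
}
```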
The usage of ContinueWith is confirmed by the presence of ContinuationResultTaskFromTask`1 in the callstack.

That was really bad for me, because it meant that the original exception could have been thrown anywhere in whatever methods were called by SomethingAsync.

By looking at the raw IL of the caller method, I figured out that SomethingAsync was System.Data.Common.DbConnection::OpenAsync. Since the application used PostgreSQL with the Npgsql library, and since our tracer automatically instruments that library, it made sense that the rewritten method would be somewhere in there. Normally, I could have checked our logs to quickly find which Npgsql method was rewritten, but the customer didn’t retrieve them before killing the instance, and waiting for the problem to happen again could have taken weeks (random issues are random). So I decided to bite the bullet and start the painstaking process of cross-checking the source code of Npgsql with the state of the objects in the memory dump, to find the exact place where the execution stopped and the exception was originally thrown.

For instance, at some point the method PostgresDatabaseInfoFactory.Load is called:

There were instances of PostgresDatabaseInfo on the heap, so I knew this method ran properly. From there, the LoadBackendTypes method is called, and the result is assigned to the _types field:

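Put together, the code path looks roughly like this (a simplified sketch; only the names Load, LoadBackendTypes, and the _types field come from the actual Npgsql source, the rest is invented):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Npgsql;
using Npgsql.PostgresTypes;

// Simplified sketch, not the exact Npgsql source
internal class PostgresDatabaseInfo
{
    private List<PostgresType> _types;

    internal async Task LoadTypes(NpgsqlConnection connection)
    {
        // If this assignment had completed, _types would have a value
        _types = await LoadBackendTypes(connection);
    }

    private Task<List<PostgresType>> LoadBackendTypes(NpgsqlConnection connection)
    {
        // Runs queries through NpgsqlCommand.ExecuteReader / ExecuteReaderAsync
        throw new NotImplementedException();
    }
}
```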
But when inspecting the instances of PostgresDatabaseInfo (using the dumpobj or do command with the address returned by the dumpheap -mt command), we can see that the _types field has no value:

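The inspection, with placeholders:

```
> dumpheap -mt <MT of PostgresDatabaseInfo>
> dumpobj <address of one of the instances>
```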
Therefore, the execution stopped somewhere in the LoadBackendTypes method. That method has calls to NpgsqlCommand.ExecuteReader and NpgsqlCommand.ExecuteReaderAsync, which are two methods that our tracer instruments, and is therefore a candidate for rewriting.

Great! At this point, I just had to dump the generated IL, find the error, and call it a day. Right? Well, it got more complicated than I anticipated, as we’ll see in the next article.

Translated from: https://medium.com/@kevingosse/investigating-an-invalidprogramexception-from-a-memory-dump-part-1-of-3-bce634460cc3
