HttpException "Request timed out"的发生原因及调试方法

问题简介

ASP.NET程序中我们经常遇到错误消息为"Request Timed Out"HttpException,浏览器中的出错信息如下图所示,

同时应用程序日志中也会出现一条警告消息如下



如何重现这种异常?

线程在处理逻辑的过程中是如何被打断而抛出超时异常?

如何调试这种异常?

 

这篇文章我们从CLR源代码级别来分析一下该问题发生的原因以及相应的调试方法。

 

重现问题

首先我们先尝试重现这个问题,在ASP.NET程序web.config中有一个executionTimeout的设置,默认值为110秒,也就是说一个请求的执行时间超过110秒就会timeout从而产生HttpException "Request Timed Out"。这个设置可以通过IIS Configuration Editor或者web.config进行更改。

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <appSettings />
    <connectionStrings />
    <system.web>
        <compilation>
        </compilation>        
        <authentication mode="Windows" />
        <httpRuntime executionTimeout="10" />        
    </system.web>

</configuration>

为了快速重现这个问题我们将executionTimeout默认值改为10秒。然后建立一个让请求超时的asp.net页面如下。

<%@ Page Language="C#" AutoEventWireup="true" Inherits="System.Web.UI.Page" %>
<script runat="server">
    protected void Page_Load(object sender, EventArgs e)
    {
        System.Threading.Thread.Sleep(120000);
    }

</script>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
    
    </div>
    </form>
</body>
</html>

将页面部署在IIS上,然后请求页面,等待10秒到25秒左右问题重现。

 

产生原理

这里有一个值得注意的地方,为什么我们页面逻辑在执行Sleep却抛出了HttpException,执行线程被谁打断了,是怎么打断的?

研究这个问题我们首先要用windbg来察看一下这个异常发生的调用栈信息。

通过windbg attachw3wp进程,然后输入命令'sxe clr'捕获clrfirst chance exception

CLR exception - code e0434f4d (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
KERNELBASE!RaiseException+0x68:
000007fb`bbb789cc 488b8c24c0000000 mov     rcx,qword ptr [rsp+0C0h] ss:000000da`6bf0da20=38a1bff3b7640000

0:023> KL
Child-SP          RetAddr           Call Site
000000da`6bf0d960 000007fb`a0da98ef KERNELBASE!RaiseException+0x68
000000da`6bf0da40 000007fb`a0d03e07 mscorwks!RaiseTheExceptionInternalOnly+0x2ff
000000da`6bf0db30 000007fb`a0cc58cf mscorwks!Thread::HandleThreadAbort+0x19b
000000da`6bf0dbd0 000007fb`a0cbe815 mscorwks!Thread::HandleThreadInterrupt+0x7b
000000da`6bf0dc30 000007fb`a1280769 mscorwks!Thread::UserSleep+0x89
000000da`6bf0dc90 000007fb`415d09ee mscorwks!ThreadNative::Sleep+0xf9

000000da`6bf0de40 000007fb`41630f39 App_Web_kng1ridv!ASP.sleep120_aspx.Page_Load(System.Object, System.EventArgs)+0xe

000000da`6bf0fae0 000007fb`beb53501 KERNEL32!BaseThreadInitThunk+0x1a
000000da`6bf0fb10 00000000`00000000 ntdll!RtlUserThreadStart+0x21


栈上面的first chanceexceptionThreadAbortException

0:023> !dso
OS Thread Id: 0x12e0 (23)
RSP/REG          Object           Name
000000da6bf0d968 000000da0aac3ea0 System.Threading.ThreadAbortException
000000da6bf0da30 000000da0aac3ea0 System.Threading.ThreadAbortException
000000da6bf0da48 000000da0aac3ea0 System.Threading.ThreadAbortException
000000da6bf0db30 000000da0aac3ea0 System.Threading.ThreadAbortException
000000da6bf0dc98 000000da0aa93c90 System.Web.HttpContext
000000da6bf0dca0 000000da0aac3498 System.ComponentModel.EventHandlerList
000000da6bf0dd70 000000da0aac3498 System.ComponentModel.EventHandlerList
000000da6bf0dd90 000000da0aa93c90 System.Web.HttpContext

从这个调用栈上我们可以看出来这个异常来自于Thread::UserSleep -> Thread::HandleThreadInterrupt ->Thread::HandleThreadAbort

 

由于这个调用栈是CLR非托管代码,我们需要一份.NETFrameworkCLR的源代码作为参考,可以到微软下载一份Share Source CLI。找到\sscli20_20060311\sscli20\clr\src\vm\threads.cpp这个源文件,这个类封装的一系列CLR线程相关的属性和方法。

 

因为我们调用了Sleep方法,首先看一下Sleep函数的实现。原来线程调用Sleep方法之后进入系统内核,内核将其置于等待状态,一旦CLRSleepEx返回WAIT_IO_COMPLETION并且通过m_State & TS_Interrupted两个标志位来判断是interrupt的话,系统将开始执行HandleThreadInterrupt方法,进而在HandleThreadInterrupt方法中抛出ThreadAbortException

// Implementation of Thread.Sleep().
void Thread::UserSleep(INT32 time)
{
    CONTRACTL {
        THROWS;
        GC_TRIGGERS;
    }
    CONTRACTL_END;

    INCONTRACT(_ASSERTE(!GetThread()->GCNoTrigger()));

    DWORD   res;

    GCX_PREEMP();

    // A word about ordering for Interrupt.  If someone tries to interrupt a thread
    // that's in the interruptible state, we queue an APC.  But if they try to interrupt
    // a thread that's not in the interruptible state, we just record that fact.  So
    // we have to set TS_Interruptible before we test to see whether someone wants to
    // interrupt us or else we have a race condition that causes us to skip the APC.
    FastInterlockOr((ULONG *) &m_State, TS_Interruptible);

    // If someone has interrupted us, we should not enter the wait.
    if (IsUserInterrupted())
    {
        HandleThreadInterrupt(FALSE);
    }

    ThreadStateHolder tsh(TRUE, TS_Interruptible | TS_Interrupted);

    FastInterlockAnd((ULONG *) &m_State, ~TS_Interrupted);

retry:

    res = ClrSleepEx (time, TRUE);

    if (res == WAIT_IO_COMPLETION)
    {
        // We could be woken by some spurious APC or an EE APC queued to
        // interrupt us. In the latter case the TS_Interrupted bit will be set
        // in the thread state bits. Otherwise we just go back to sleep again.
        if ((m_State & TS_Interrupted))
        {
            HandleThreadInterrupt(FALSE);
        }

        // Don't bother with accurate accounting here.  Just ensure we make progress.
        // Note that this is not us doing the APC.
        if (time == 0)
        {
            res = WAIT_TIMEOUT;
        }
        else
        {
            if (time != (INT32) INFINITE)
            {
                time--;
            }

            goto retry;
        }
    }
    _ASSERTE(res == WAIT_TIMEOUT || res == WAIT_OBJECT_0);
}

接下来一个问题是到底是什么使得ClrSleepEx方法返回了呢?根据注释我们可以看到APC很可能是打断等待线程的原因。接下来我们继续通过debug验证一下。根据msdn用户模式可以调用kernel32!QueueUserAPC方法发送一个APC到指定线程。我们在这个方法上设置一个断点然后重现这个问题。结果抓到了这个调用栈。

0:022> kL
Child-SP          RetAddr           Call Site
000000da`6bbfe808 000007fb`a10ce75b KERNEL32!QueueUserAPCStub
000000da`6bbfe810 000007fb`a0e37e88 mscorwks!Thread::Alert+0xdb
000000da`6bbfe880 000007fb`a103d0d4 mscorwks!Thread::UserInterrupt+0x30
000000da`6bbfe8b0 000007fb`a11f5699 mscorwks!Thread::UserAbort+0x3984d0
000000da`6bbfe9d0 000007fb`997c4519 mscorwks!ThreadNative::Abort+0x169
000000da`6bbfebd0 000007fb`988f835f mscorlib_ni!System.Threading.Thread.Abort(System.Object)+0x39
000000da`6bbfec20 000007fb`97e94547 System_Web_ni!System.Web.RequestTimeoutManager+RequestTimeoutEntry.TimeoutIfNeeded(System.DateTime)+0xa63d4f
000000da`6bbfec60 000007fb`99772bdb System_Web_ni!System.Web.RequestTimeoutManager.CancelTimedOutRequests(System.DateTime)+0x167

000000da`6bbff8a0 00000000`00000000 ntdll!RtlUserThreadStart+0x21

看一下栈上的信息,究竟是哪个线程正在被abort

0:022> !dso
OS Thread Id: 0x14e4 (22)
RSP/REG          Object           Name
000000da6bbfe9c0 000000da0aa84400 System.Threading.Thread
000000da6bbfea28 000000d9cab1efe8 System.Web.HttpApplication+CancelModuleException
000000da6bbfea38 000000da0aa84400 System.Threading.Thread
000000da6bbfea88 000000d9caa4c810 System.String

0:022> !do 000000da0aa84400
Name: System.Threading.Thread
MethodTable: 000007fb998b86c0
EEClass: 000007fb994bea18
Size: 104(0x68) bytes
 (C:\windows\assembly\GAC_64\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
000007fb998f01b0  4000690        8 ....Contexts.Context  0 instance 000000d9caa285c0 m_Context
000007fb998b2bd0  4000691       10 ....ExecutionContext  0 instance 000000da0aace570 m_ExecutionContext
000007fb998b7c20  4000692       18        System.String  0 instance 0000000000000000 m_Name
000007fb998b8438  4000693       20      System.Delegate  0 instance 0000000000000000 m_Delegate
000007fb998aefe0  4000694       28    System.Object[][]  0 instance 0000000000000000 m_ThreadStaticsBuckets
000007fb998bed40  4000695       30       System.Int32[]  0 instance 0000000000000000 m_ThreadStaticsBits
000007fb998c05a0  4000696       38 ...ation.CultureInfo  0 instance 0000000000000000 m_CurrentCulture
000007fb998c05a0  4000697       40 ...ation.CultureInfo  0 instance 0000000000000000 m_CurrentUICulture
000007fb998b7510  4000698       48        System.Object  0 instance 0000000000000000 m_ThreadStartArg
000007fb998ba628  4000699       50        System.IntPtr  1 instance       da6b789d60 DONT_USE_InternalThread
000007fb998bee90  400069a       58         System.Int32  1 instance                2 m_Priority
000007fb998bee90  400069b       5c         System.Int32  1 instance                7 m_ManagedThreadId
000007fb998ac8d8  400069c      2d8 ...LocalDataStoreMgr  0   shared           static s_LocalDataStoreMgr
                                 >> Domain:Value  000000d9c90512d0:0000000000000000 000000d9ca176610:0000000000000000 <<
000007fb998b7510  400069d      2e0        System.Object  0   shared           static s_SyncObject
                                 >> Domain:Value  000000d9c90512d0:000000d9caa23bf0 000000d9ca176610:000000da0aa2d410 <<

0:022> !threads
ThreadCount: 10
UnstartedThread: 0
BackgroundThread: 9
PendingThread: 0
DeadThread: 1
Hosted Runtime: no
                                              PreEmptive                                                Lock
       ID OSID        ThreadOBJ     State   GC     GC Alloc Context                  Domain           Count APT Exception
   3    1 150c 000000d9ca125a90      8220 Enabled  0000000000000000:0000000000000000 000000d9c90512d0     0 Ukn
  19    2 23c8 000000d9ca12f510      b220 Enabled  000000da0aa8a238:000000da0aa8c200 000000d9c90512d0     0 MTA (Finalizer)
  20    3 1a98 000000d9ca174ab0    80a220 Enabled  0000000000000000:0000000000000000 000000d9c90512d0     0 MTA (Threadpool Completion Port)
  21    4 23f4 000000d9ca175b60      1220 Enabled  0000000000000000:0000000000000000 000000d9c90512d0     0 Ukn
  22    5 14e4 000000da6b779c50   180b220 Disabled 000000d9cab1f000:000000d9cab20b18 000000d9ca176610     1 MTA (Threadpool Worker)
XXXX    6    0 000000da6b77d9d0     10820 Enabled  0000000000000000:0000000000000000 000000d9c90512d0     0 Ukn
  23    7 12e0 000000da6b789d60   3a0b221 Enabled  000000da0aafd860:000000da0aafe2a8 000000d9ca176610     1 MTA (Threadpool Worker)
  15    8 2180 000000da6b78a330   880a220 Enabled  0000000000000000:0000000000000000 000000d9c90512d0     0 MTA (Threadpool Completion Port)
  24    9 13a4 000000da6b7903a0       220 Enabled  000000d9caae2e48:000000d9caae4aa8 000000d9c90512d0     0 Ukn
  25    a  e5c 000000d9ca107180       220 Enabled  000000d9cab20c68:000000d9cab22b18 000000d9c90512d0     0 Ukn


结果我们看到是23号线程是被abort的目标。23号线程正在做什么?查看23号线程调用栈正好是我们的Sleep方法线程。

0:023> kL
Child-SP          RetAddr           Call Site
000000da`6bf0db08 000007fb`bbb611f2 ntdll!NtDelayExecution+0xa
000000da`6bf0db10 000007fb`a10c369d KERNELBASE!SleepEx+0xaa
000000da`6bf0dbb0 000007fb`a0cbe7fd mscorwks!EESleepEx+0x2d
000000da`6bf0dc30 000007fb`a1280769 mscorwks!Thread::UserSleep+0x71
000000da`6bf0dc90 000007fb`415d09ee mscorwks!ThreadNative::Sleep+0xf9
000000da`6bf0de40 000007fb`41630f39 App_Web_kng1ridv!ASP.sleep120_aspx.Page_Load(System.Object, System.EventArgs)+0xe

000000da`6bf0f750 000007fb`a0d1970a mscorwks!UnManagedPerAppDomainTPCount::DispatchWorkItem+0x157
000000da`6bf0f7f0 000007fb`a0dcf3a4 mscorwks!ThreadpoolMgr::WorkerThreadStart+0x1ba
000000da`6bf0f890 000007fb`bc53167e mscorwks!Thread::intermediateThreadProc+0x78
000000da`6bf0fae0 000007fb`beb53501 KERNEL32!BaseThreadInitThunk+0x1a
000000da`6bf0fb10 00000000`00000000 ntdll!RtlUserThreadStart+0x21

到此我们可以证明我们的想法了,原来ASP.NET启动了一个RequestTimeoutManager来负责查看超时线程,然后通过发送APC的方式打断超时线程使之抛出ThreadAbortException结束等待,最后ThreadAbortException被包装成HttpException抛出,记录在日志以及显示在客户端。

 

调试方法

现实过程中Timeout不一定是出在线程Sleep上,也不是所有的ThreadAbortException都代表Timeout的问题发生。我们还是要总结一下类似问题的调试方法。

 

第一,观察timeout的时间,是否是100秒左右,与配置中的数值是否相符(注意这里比一定是一秒不差,因为查看RequestTimeoutManager逻辑可以发现他是每隔15秒查看一下超时线程,所有存在15秒左右的误差。

第二,如果不是超时时间设置过小的原因,那就要通过抓dump的方式来查看具体的超时请求超时在什么操作上。因为我们最终得到的ThreadAbortException中没有调用栈信息。

 

如何抓到对的dump是个值得研究的问题,我们可以把断点设在以下调用栈上的某个点上,注意断点越接近栈顶,这个方法可能被调用的次数就越多,抓到没有用的dump的可能性越大。

0:022> kL
Child-SP          RetAddr           Call Site
000000da`6bbfe808 000007fb`a10ce75b KERNEL32!QueueUserAPCStub
000000da`6bbfe810 000007fb`a0e37e88 mscorwks!Thread::Alert+0xdb
000000da`6bbfe880 000007fb`a103d0d4 mscorwks!Thread::UserInterrupt+0x30
000000da`6bbfe8b0 000007fb`a11f5699 mscorwks!Thread::UserAbort+0x3984d0
000000da`6bbfe9d0 000007fb`997c4519 mscorwks!ThreadNative::Abort+0x169
000000da`6bbfebd0 000007fb`988f835f mscorlib_ni!System.Threading.Thread.Abort(System.Object)+0x39
000000da`6bbfec20 000007fb`97e94547 System_Web_ni!System.Web.RequestTimeoutManager+RequestTimeoutEntry.TimeoutIfNeeded(System.DateTime)+0xa63d4f
000000da`6bbfec60 000007fb`99772bdb System_Web_ni!System.Web.RequestTimeoutManager.CancelTimedOutRequests(System.DateTime)+0x167

000000da`6bbff8a0 00000000`00000000 ntdll!RtlUserThreadStart+0x21

通过ILSpy查看CancelTimedOutRequestsTimeoutIfNeeded源代码,可以看到进入这个方法并不一定代表有线程会被abort,所以我们要从下面一层抓。最终结论是如果我们想要研究TimeoutThreadAbortException,我们可以mscorlib_ni!System.Threading.Thread.Abort上面设置断点抓dump,然后查看堆栈上面的线程,最终找到正在被abort的线程以及相应执行的调用栈信息。

// System.Web.RequestTimeoutManager
private void CancelTimedOutRequests(DateTime now)
{
    if (Interlocked.CompareExchange(ref this._inProgressLock, 1, 0) != 0)
    {
        return;
    }
    ArrayList arrayList = new ArrayList(this._requestCount);
    for (int i = 0; i < this._lists.Length; i++)
    {
        DoubleLinkList obj;
        Monitor.Enter(obj = this._lists[i]);
        try
        {
            DoubleLinkListEnumerator enumerator = this._lists[i].GetEnumerator();
            while (enumerator.MoveNext())
            {
                arrayList.Add(enumerator.GetDoubleLink());
            }
        }
        finally
        {
            Monitor.Exit(obj);
        }
    }
    int count = arrayList.Count;
    for (int j = 0; j < count; j++)
    {
        ((RequestTimeoutManager.RequestTimeoutEntry)arrayList[j]).TimeoutIfNeeded(now);
    }
    Interlocked.Exchange(ref this._inProgressLock, 0);
}

// System.Web.RequestTimeoutManager.RequestTimeoutEntry
internal void TimeoutIfNeeded(DateTime now)
{
    Thread thread = this._context.MustTimeout(now);
    if (thread != null)
    {
        this.RemoveFromList();
        thread.Abort(new HttpApplication.CancelModuleException(true));
    }
}



展开阅读全文

没有更多推荐了,返回首页