Recently, I have seen a bunch of questions asking: "IIS6 runs fine for X amount of time [where X varies from days to weeks] and then all of a sudden, it stops accepting all connections. If I restart/reboot the server, then it is fine again... until X amount of time later, when I have to repeat the same procedure."

最近,我经常一堆问题:“IIS6 运行了一段时间都挺好的,然后突然就不接受连接了。重启服务器之后,就好了。如此反复。”


Sometimes, the user even notices that HTTP Error logs in %SYSTEMROOT%\System32\LogFiles\HTTPERR\*.log mention Connections_Refused for all those requests, but what the heck does that mean?

有时,我们检查%SYSTEMROOT%\System32\LogFiles\HTTPERR\*.log的日志会发现很多Connections_Refused,这是啥JB玩意?


Ok, ignoring the jokers in the back that are snickering "hmm, this is Microsoft software; what do you expect?" , this issue is actually very diagnosable and should be treatable, as I will shortly show...

好吧,丢掉那些“这就是微软的程序,你还能怎样”之类的玩笑。这个问题其实是很可以诊断并修复的。


Question:

Every four or five days the web site hoste on my machine will stop responding to HTTP requests on port 80.


IIS doesn't crash so I cannot get an IIS State log, just this:


问题:每隔4、5天,web服务器就停止接受HTTP请求了。IIS没有挂掉,所以没有IIS状态日志。只有HTTPErr.log如下所示:


Excerpt from the HTTPErr.Log:


(... working fine up until this ...)
2005-09-02 18:45:12 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:46:07 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:46:27 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:46:42 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:47:02 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:48:22 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:48:57 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:49:02 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:49:42 - - - - - - - - - 7_Connections_Refused -
2005-09-02 18:50:02 - - - - - - - - - 3_Connections_Refused -
2005-09-02 18:50:37 - - - - - - - - - 6_Connections_Refused -
2005-09-02 18:50:42 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:51:02 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:51:47 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:52:02 - - - - - - - - - 3_Connections_Refused -
2005-09-02 18:52:27 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:53:17 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:53:22 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:54:02 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:54:22 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:54:32 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:55:22 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:56:02 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:56:42 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:58:17 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:58:42 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:58:57 - - - - - - - - - 1_Connections_Refused -


I have to reboot the machine to get it to work again.


Does anyone have any ideas?

我需要重启机器才能让它重新工作。有啥想法吗?


Answer:

Connections_Refused is actually a very diagnosable failure pattern, and I am going to show you how to diagnose it and what is going on.

回复:Connections_Refused 是可以诊断滴。下面演示下诊断过程。


What does Connections_Refused Mean?

According to KB 820729, Connections_Refused means that the "Kernel Non Paged Pool memory has dropped below 20MB and HTTP.SYS has stopped receiving new connections". What all this means in English is this:


什么是Connections_Refused?根据KB 820729,Connections_Refused的意思就是说:“未分页核心内存低于20MB时,HTTP.SYS停止接受新的连接。”翻译成人类语言就是下面这个意思:


  • On Windows NT systems, there are many types of memory with different properties. Kernel Non Paged Pool (NPP) memory is one such type, and the important thing to know is that on x86 machines, the size of this memory is fixed and has a maximum size of 128MB. In other words, you may have an x86 machine with 4GB of RAM, but if you only use 128MB and that 128MB all comes from NPP memory, your system is still "out of memory" regardless of how much is actually in use.
    [1/4/2006 - Correction - maximum size on x86 is "around 250MB" and 128MB with /3GB. For Windows Vista, it will be bounded by physical RAM and not arbitrary limit.]

    在windows NT系统中,有很多种属性的内存。未分页核心内存(NPP)就是其中一个。有一点需要注意的:在X86机器上,NPP的最大值限定在128M。换句话说,就算你有4GB的X86机器,你也只有128M的NPP.你的这128M用光了,你的系统就内存溢出了。


  • HTTP.SYS, being a kernel-mode driver, uses NPP memory for every single connection that is active, and we made a conscious decision to have it stop accepting connections when NPP memory falls below a threshold, arbitrarily set at 20MB. This is reasonable because lots of bad things can occur if NPP memory is depleted.

    HTTP.SYS 是一个内核驱动,利用NPP来为来接收连接。我们决定如果NPP小于20M,HTTP.SYS就停止接受连接。这是有道理的,因为如果NPP内存泄漏了,那很多坏事就来了。


    For example: HTTP.SYS is likely the most active codebase in a web server, and if NPP memory gets depleted and HTTP.SYS needs NPP memory to accept a connection, the machine will blue screen. Now, regardless of the driver leaking NPP memory, HTTP.SYS looks like the culprit because it is most active. In any case, all of this is not good, so HTTP.SYS proactively refuses to get near the edge and plays it safe.

例如:HTTP.SYS在web服务器上应该是最活跃的codebase,如果NPP内存耗尽了,而HTTP.SYS需要NPP内存去接受连接,机器就会蓝屏。如果不考虑驱动泄漏NPP内存的话,HTTP.SYS应该是罪魁祸首因为它最活跃。无论如何,现在HTTP.SYS已经高瞻远瞩的提前停止了接受新请求,保持系统安全。


Hopefully, it is now clear what Connections_Refused means. HTTP.SYS is basically telling you "someone is using up a lot of NPP memory, and for protective reasons, I am going to stop accepting requests". We need to figure out what driver is using up all the NPP memory and address it, and the Connections_Refused should naturally go away because NPP memory will not be under pressure.


现在我们知道Connections_Refused的含义了。基本上HTTP.SYS是在说:“有人用了很多NPP内存,为安全起见,我就只好委屈求全停止接受请求了。”现在我们需要厘清是啥驱动用光了NPP内存,找出他,检举他,搞臭他,让Connections_Refused滚蛋,这样NPP内存就不会有压力了。


How to Diagnose this Issue

One of the nice properties of NPP memory is that a unique tag is associated with every piece of this memory, so we can always track down who is using what and how much of NPP memory. The user-mode tool to do this is called POOLMON.EXE (if you have a Kernel Debugger attached, you can use the !poolused command to get this info within the Kernel Debugger). Here are some additional bits of related information:


如何诊断:NPP内存有一个很好的特点是他的每一片内存都连接了一个tag,所以我们可以根据这个来追踪到谁用了多少NPP内存。我们用一个叫poolMon.exe的工具。如果你有Kernel Debugger,你可以用!poolused命令来得到获得相关信息。下面是一些额外的信息:



I am now going to step through the POOLMON output of a user who had Connections_Refused and comment on what is going on. Here is a sample snippet of running POOLMON.EXE -b :


现在我要开动了!运行poolmon.exe -b。结果如下:


 Memory: 1048016K Avail:  438396K  PageFlts:   219   InRam Krnl: 3604K P:42372K
 Commit: 538908K Limit:11245092K Peak: 609576K          Pool N:109056K P:43740K
 System pool information
 Tag  Type     Allocs            Frees            Diff   Bytes       Per Alloc
ISil Nonp  1408366 ( 56)  1327427 ( 58)  80939 36678632 ( -872)  453I100 Nonp  11048877 ( 217) 10967968 ( 219)  80909 14886928 ( -368)  183
PcNw Nonp     292427 (   0)     88967 (   0)   203460 12203296 (     0)    59
 SavE Nonp    1596066 (   0)   1595659 (   0)      407 11717016 (     0) 28788
 IAM  Nonp    1946164 (  49)   1904725 (  50)    41439 8519576 (  -208)    205
 tdLL Nonp      74748 (   0)     69830 (   4)     4918 2672832 ( -1776)    543
 LSwi Nonp          1 (   0)         0 (   0)        1 2576384 (     0) 2576384
 R200 Nonp         22 (   0)         1 (   0)       21 2297816 (     0) 109419
 MmCm Nonp        302 (   0)        48 (   0)      254 2236960 (     0)   8806
 TCPt Nonp     190486 (   4)    190462 (   4)       24 1392264 (     0)  58011
ULHP Nonp    3127 (  0)    41 (  0)   3086 1047216 (   0)  339
 PTrk Nonp     999750 (   0)    996561 (   0)     3189  956824 (     0)    300
 File Nonp    9628160 (  65)   9623252 (  65)     4908  750688 (     0)    152
 rg81 Nonp       6910 (   0)      5294 (   0)     1616  603864 (     0)    373
 Pool Nonp          6 (   0)         3 (   0)        3  602112 (     0) 200704
 Devi Nonp        680 (   0)       262 (   0)      418  558552 (     0)   1336
 Thre Nonp     417221 (   6)    416457 (   6)      764  476736 (     0)    624
 LSwr Nonp        128 (   0)         0 (   0)      128  416768 (     0)   3256
 Mm   Nonp         26 (   0)         2 (   0)       24  379880 (     0)  15828
 AfdC Nonp      21989 (   0)     19872 (   1)     2117  338720 (  -160)    160
....


Some observations:

观察到的结果:

  • Notice that this system has 1GB RAM but still only 128MB NPP memory (109MB has been used, hence tripping the 20MB marker of HTTP.SYS).

    我们看到系统有1G的内存,128M的NPP内存,其中109MB已经被用了,剩下的不到20MB,所以HTTP.SYS停止了工作。

  • The NPP memory tags of HTTP.SYS all start with "UL" (for the trivia-inclined: as with all software projects, HTTP.SYS has gone through some evolution of its own. It used to be called something else - Universal Listener - hence its tag names all started with "UL").

    HTTP.SYS在NPP内存中的tag是以UL开头的。UL貌似是全局监听器(Universal Listener)的头字母。


  • Notice that HTTP.SYS is not even in the top-10 NPP memory usage on the system, and its largest set of allocations is barely using 1MB total. As it turns out on this system, TOTAL NPP memory usage by HTTP.SYS was under 2MB, so clearly, it is not the aggressive user of NPP memory that is now causing issues.

    注意,HTTP.SYS甚至不在NNP内存使用的前10位。它最大的分配额只占了1Mb。也即是说,这个系统里面HTTP.SYS的全部使用空间小于2MB。所以很清楚,是其他原因导致了npp内存过低。


  • Meanwhile, focus your attention on the top four users of NPP memory, with tags "ISil", "I100", "PcNw", and "SavE". I looked up drivers for those tags, and I only find that "PcNw" comes from Microsoft and is associated with WDM audio. The others are all 3rd party drivers.

    同时,我们注意到用NPP内存最狠的,是几个叫“ISil”,"I100","PcNw","SavE"的货。我检查了这几个对应的驱动,只有“PcNw”来自微软,对应WDM声驱,其他的几个来自于第三方驱动。


  • Umm, this is a server, so is audio really that important? I checked on my system, and it was at 275KB, so clearly this user is/has done some heavy-duty audio stuff on this server. Thus, the memory may be legitimately used. You know, you might want to reconsider the effects of playing your MP3s on your Web Server's ability to accept connections...




真是蛋疼,一个服务器要声音干啥?我检查了我的系统,这个在我的系统上面是275kb,所以显然有人在服务器上弄了很耗资源的声音相关的东东,于是内存就被华丽丽的使用了。你可能需要考虑到你的web服务器上播放MP3的功能。。。


  • Meanwhile, Isil and I100 have suspiciously similar active allocations and ISIL is the stock ticker for a semiconductor company, so my guess is that I100 is related to a Fast Ethernet card... and since web servers totally rely on the network driver, having a memory-hog for a network driver cannot be a good thing and should be fixed. My guess for SavE is probably anti-virus.

    同时,ISil和I100有比较类似活跃配额,Isil一个半导体公司的股市行情收录器,所以我觉得I100可能跟以太网卡有关。。web服务器是基于网络驱动的,有一个网络驱动的内存吃货可不是什么好事。我猜savE可能是杀软。


  • Now, I may be totally wrong with those arbitrary tag names, but one thing is for certain - IIS6 and HTTP.SYS is running just fine on this system. There are some other memory-hungry drivers running on the system, and if you get fixes for them such that your system will stay running longer, IIS6 will likely keep running as well

我可能弄错了这个标签,不过有一件事是清楚的:IIS6和HTTP.SYS在系统上运行正常。是其他驱动占用了大量的内存。如果你把这些修复好了,你的系统就会活得久一些,IIS6也会保持运行。

Note: Don't shoot the messenger (HTTP.SYS)! Always strive to figure out the root of your problem and address that. The rest of the problems should naturally fall away.

注:别老是把枪口对着HTTP.SYS啊亲!找出问题的根源,其他问题就迎刃而解了。

//David

来自:

http://blogs.msdn.com/b/david.wang/archive/2005/09/21/howto-diagnose-iis6-failing-to-accept-connections-due-to-connections-refused.aspx