Connectivity troubleshooting in SQL Server 2008 with the Connectivity Ring Buffer

http://blogs.msdn.com/b/sql_protocols/archive/2008/05/20/connectivity-troubleshooting-in-sql-server-2008-with-the-connectivity-ring-buffer.aspx

SQL Server 2008 contains a new feature designed to help troubleshoot particularly tricky connectivity problems. This new feature, the Connectivity Ring Buffer , captures a record for each server-initiated connection closure, including each kill of a session or Login Failed event. To facilitate efficient troubleshooting, the ring buffer attempts to provide information to correlate between the client failure and the server’s closing action. Up to 1k records in the ring buffer are persisted for as long as the server is online, and after 1000 records, the buffer wraps around and begins replacing the oldest records. The Connectivity Ring Buffer’s records are exposed by a DMV which can be queried:

SELECT CAST ( record AS XML ) FROM sys . dm_os_ring_buffers
WHERE ring_buffer_type = 'RING_BUFFER_CONNECTIVITY'

Additional ring buffer tyep: from envykok
RING_BUFFER_RESOURCE_MONITOR - Resource Monitor activity like was physical memory pressure signaled or not.
RING_BUFFER_SCHEDULER_MONITOR  - What is the state of the logical schedulers, the health record type is very helpful.
RING_BUFFER_MEMORY_BROKER  - Actions the internal memory broker is taking to balance the memory between caches.
RING_BUFFER_SECURITY_ERROR  - Errors that occur during security operations.  Ex: login failed may have more details about the OS error code and such
RING_BUFFER_SCHEDULER - Actual scheduler activity such as context switching.  You can reconstruct the execution order from these entries.
RING_BUFFER_EXCEPTION - Any exceptions encountered in the server.  SQL uses throw internally for errors so you can see SQL errors as well.
RING_BUFFER_CONNECTIVITY - Core connectivity information - useful in tracking down connection failure information

This will select all records as XML types; in Management Studio, you can click the records to get a more readable version. If you want to use SQL queries on the XML records to locate your particular problem, you can also use SQL Server’s XML support to turn this into a temp table and query against the records.

A basic ring buffer entry: killed SPID

An easy way to cause a server-initiated connection closure is to open two connections to a SQL Server, find the SPID of one of the connections, and then kill that spid from the other connection:

C:/>osql -E
1> SELECT @@spid
2> go
 ------
     51
(1 row affected)

C:/>osql -E
1> kill 51
2> go
1>

If you try this and then query the ring buffer, you will get a result which will look like this:

< Record id = " 2 " type = " RING_BUFFER_CONNECTIVITY " time = " 110448275 " >

  < ConnectivityTraceRecord >

    < RecordType > ConnectionClose </ RecordType >

    < RecordSource > Tds </ RecordSource >

    < Spid > 55 </ Spid >

    < SniConnectionId > B7882F3C-3BA9-45A7-8D23-3C5C05F9BDF9 </ SniConnectionId >

    < SniProvider > 4 </ SniProvider >

    < RemoteHost > &lt; local machine &gt; </ RemoteHost >

    < RemotePort > 0 </ RemotePort >

    < LocalHost />

    < LocalPort > 0 </ LocalPort >

    < RecordTime > 5/6/2008 22:47:35.880 </ RecordTime >

    < TdsBuffersInformation >

      < TdsInputBufferError > 0 </ TdsInputBufferError >

      < TdsOutputBufferError > 0 </ TdsOutputBufferError >

      < TdsInputBufferBytes > 60 </ TdsInputBufferBytes >

    </ TdsBuffersInformation >

    < TdsDisconnectFlags >

      < PhysicalConnectionIsKilled > 0 </ PhysicalConnectionIsKilled >

      < DisconnectDueToReadError > 0 </ DisconnectDueToReadError >

      < NetworkErrorFoundInInputStream > 0 </ NetworkErrorFoundInInputStream >

      < ErrorFoundBeforeLogin > 0 </ ErrorFoundBeforeLogin >

      < SessionIsKilled > 1 </ SessionIsKilled >

      < NormalDisconnect > 0 </ NormalDisconnect >

      < NormalLogout > 0 </ NormalLogout >

    </ TdsDisconnectFlags >

  </ ConnectivityTraceRecord >

  < Stack >

    < frame id = " 0 " > 0X01CA0B00 </ frame >

    < frame id = " 1 " > 0X01CA0DB1 </ frame >

    < frame id = " 2 " > 0X01DF6162 </ frame >

    < frame id = " 3 " > 0X02E53C98 </ frame >

    < frame id = " 4 " > 0X02E54845 </ frame >

    < frame id = " 5 " > 0X02E57BE9 </ frame >

    < frame id = " 6 " > 0X02E38F57 </ frame >

    < frame id = " 7 " > 0X02E3B2C0 </ frame >

    < frame id = " 8 " > 0X02E3C832 </ frame >

    < frame id = " 9 " > 0X02E3D55E </ frame >

    < frame id = " 10 " > 0X781329BB </ frame >

    < frame id = " 11 " > 0X78132A47 </ frame >

  </ Stack >

</ Record >

Different record types contain some different information. The three record types exposed by the Connectivity Ring Buffer are: ConnectionClose, Error, and LoginTimers. This record type is a ConnectionClose since it is not a connection close during Login, either caused by timeout or some other login failure scenario :
    < RecordType > ConnectionClose </ RecordType >

We can see that it was the connection with SPID 55 that was closed:
    < Spid > 55 </ Spid >
We can see that the connection was local (the text in this tag is an XML-encoded string reading “<local machine>”, which indicates a local, shared memory connection):
    < RemoteHost > &lt; local machine &gt; </ RemoteHost >
More useful correlation information is available when the TCP protocol is used for connecting to SQL Server – then the local IP address and port, and remote IP address and port will be included in the record, allowing you to uniquely identify the client machine and application to which the record applies. Additionally, the ring buffer contains the timestamp and whenever possible the SPID of the client process to make the correlation seamless (since, as you probably know, SPIDs may be reused over time by different connections).

 

We can also see exactly how many bytes were in the TDS packet sent by the client , and whether there was any error found in the TDS:
      < TdsInputBufferError > 0 </ TdsInputBufferError >
      <
TdsOutputBufferError > 0 </ TdsOutputBufferError >
      <
TdsInputBufferBytes > 60 </ TdsInputBufferBytes >
The most relevant, easy-to-analyze piece of information for a ConnectionClose record is in the TdsDisconnectFlags , where there are a number of values telling the state of the closed connection. Here we see that no error was found, but that it was also not a normal disconnect or a normal logout for the session. What we do see is a flag indicating that the session was killed:
      < SessionIsKilled > 1 </ SessionIsKilled >

A more interesting example: DC connectivity troubles

Tracking killed SPIDs is fine, but the goal is to help out on tricky problems, before resorting to network traces. Here’s an example of a Login Timers record from the Connectivity Ring Buffer, for a scenario that has up until now been very hard to pinpoint without the costly process of reproducing the problem and taking and analyzing network captures:

< Record id = " 3 " type = " RING_BUFFER_CONNECTIVITY " time = " 112254962 " >

  < ConnectivityTraceRecord >

    < RecordType > LoginTimers </ RecordType >

    < Spid > 0 </ Spid >

    < SniConnectionId > B401B045-3C82-4AAC-A459-DB0520925431 </ SniConnectionId >

    < SniConsumerError > 17830 </ SniConsumerError >

    < SniProvider > 4 </ SniProvider >

    < State > 102 </ State >

    < RemoteHost > &lt; local machine &gt; </ RemoteHost >

    < RemotePort > 0 </ RemotePort >

    < LocalHost />

    < LocalPort > 0 </ LocalPort >

    < RecordTime > 5/6/2008 23:17:42.556 </ RecordTime >

    < TdsBuffersInformation >

      < TdsInputBufferError > 0 </ TdsInputBufferError >

      < TdsOutputBufferError > 232 </ TdsOutputBufferError >

      < TdsInputBufferBytes > 198 </ TdsInputBufferBytes >

    </ TdsBuffersInformation >

    < LoginTimers >

      < TotalLoginTimeInMilliseconds > 21837 </ TotalLoginTimeInMilliseconds >

      < LoginTaskEnqueuedInMilliseconds > 0 </ LoginTaskEnqueuedInMilliseconds >

      < NetworkWritesInMilliseconds > 2 </ NetworkWritesInMilliseconds >

      < NetworkReadsInMilliseconds > 77 </ NetworkReadsInMilliseconds >

      < SslProcessingInMilliseconds > 3 </ SslProcessingInMilliseconds >

      < SspiProcessingInMilliseconds > 21756 </ SspiProcessingInMilliseconds >

      < LoginTriggerAndResourceGovernorProcessingInMilliseconds > 0 </ LoginTriggerAndResourceGovernorProcessingInMilliseconds >

    </ LoginTimers >

  </ ConnectivityTraceRecord >

  < Stack >

    < frame id = " 0 " > 0X01CA0B00 </ frame >

   

    < frame id = " 15 " > 0X02E3C832 </ frame >

  </ Stack >

</ Record >

For this situation, on the client we see:
[SQL Server Native Client 10.0]Shared Memory Provider: Timeout error [258].
[SQL Server Native Client 10.0]Login timeout expired
[SQL Server Native Client 10.0]Unable to complete login process due to delay in login response

Getting the OS error message included doesn’t tell us anything:
C:/>net helpmsg 258
The wait operation timed out.

And in the server errorlogs, there is nothing at all. The record from the ring buffer, however, is very interesting. LoginTimers records show the overall processing time:
      < TotalLoginTimeInMilliseconds > 21837 </ TotalLoginTimeInMilliseconds >

This time is then broken into components based on what work was happening at the time (the numbers won’t always add up because of rounding; here they’re off by 1). The interesting component, of course, is the one or ones that took a long time. In this case, the SspiProcessingInMilliseconds looks interesting, at almost 22 seconds:
      < SspiProcessingInMilliseconds > 21756 </ SspiProcessingInMilliseconds >

SSPI , the Security Support Provider Interface , is the interface used by SQL Server for Windows Authentication. When the Windows login used is a domain account, SQL Server uses SSPI to interact with the Domain Controller to authenticate the user.  The indication in this record of a very long time spent in SSPI processing indicates that there are latency issues talking to the Domain Controller, likely caused by some faulty network hardware between the SQL Server and DC, or perhaps software problems with the DC. Without even taking network traces, and without having to reproduce the problem (since the Connectivity Ring Buffer is turned on by default), the problem is narrowed down to SQL Server’s interaction with the Domain Controller.

Trace Flags

The Connectivity Ring Buffer is on by default, and by default traces all server-initiated connection closures. If you experience an error on the client but see no entry in the ring buffer, this means that the server has seen a connection closure (reset) similar to normal connection closure from the client, or an erratic one coming from outside the server; for instance, from a malfunctioning piece of networking hardware. If this is the case, you should focus on potential networking issues. If you see an entry in the ring buffer, then it will contain information about why the server decided to close this connection, and it will hopefully be enough to facilitate further troubleshooting. For example, if you see that a connection is closed due to invalid information in the TDS packet, you can investigate for faulty network cards, routers and switches which might have corrupted the data. As you will see below, by using a trace flag you can enable logging of all connection closures and thus observe the exact client-initiated closure conditions and potential errors.

There are two trace flags to change the behavior of the Connectivity Ring Buffer. 

To completely disable the Connectivity Ring Buffer, globally enable trace flag 7826:

DBCC TRACEON ( 7826 , - 1 )

Client-initiated connection closures are not traced by default because they are a normal usage pattern, rather than always an error case; when a client is done with its session, it disconnects. It is recommended not to trace client-initiated connection closures since they may cause you to miss a real error case when it occurs, either by eventually writing over the error record when the buffer rolls around (which will happen faster if you have many correctly-behaving clients opening and closing connections), or just by hiding the error in a long list of normal connection closures. However, the option to trace them, in addition to the above events, is available by globally enabling trace flag 7827:

DBCC TRACEON ( 7827 , - 1 )

What are all those <frame> tags?

The sys.dm_os_ring_buffers DMV gives access to a variety of internal information, beyond just the Connectivity Ring Buffer, and as part of the infrastructure of this DMV, most of these ring buffers provide the stack trace where the event occurred. The </ Stack > tag indicates the beginning of a stack trace, and each < frame > gives the hexadecimal address of a function. These can be resolved into function names by taking a dump of the sqlservr.exe process, opening the dump in Windbg, and using the lm command on the function address.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值