Understanding SCSI Sense

Kinges

This page is about decoding and interpreting the SCSI sense buffer in order to troubleshoot a disk or storage device problem.

A SCSI sense buffer is the error reporting facility in SCSI. It reports the error code and possibly additional information that helps locate the source of the problem, so that the administrator or developer can help resolve the issue.

A SCSI sense has several top-level attributes that one would care about the most:

Sense type, either fixed or descriptor,
What command it relates to, current or previous,
Sense Key,
ASC/ASCQ — Additional Sense Code and Additional Sense Code Qualifier.

The easiest way to decode a sense buffer is to use a tool. I know of two, and one of them also has a web front end:

sg3_utils provides sg_decode_sense since version 1.31
libscsicmd implements it
a web tool, Decode the sense data, is also available; it is based on libscsicmd

The explanation below focuses on how to decode a sense buffer by hand and on what can be understood from it.

The sense type is important for decoding the sense buffer: you need to know whether it is in the fixed format or the descriptor format. The most common format is the fixed format, and most of the direct decoding instructions below are about it. The descriptor format is more complex and less frequent, but it’s worth being aware of its existence. The details that both formats provide are the same; only the decoding mechanics differ.

One important distinction about a sense buffer is whether the sense is about the command that failed with the sense or about a previous command. It is entirely possible that the command that returned with an error is not at fault at all and that everything is just fine with it, but that a previous command that was already acknowledged went bad in the end and the SCSI target has no other way to tell the user about the problem. In such a case some random other command will be failed with a sense buffer that indicates the problem was in a previous command.

The first byte to look at is byte 0. What matters there are the 7 lower bits, so if the number is at or above 80h (128 decimal) you need to subtract 80h to get the actual value. There are only 4 permitted values for these 7 lower bits:

70h — fixed format, current sense
71h — fixed format, previous sense
72h — descriptor format, current sense
73h — descriptor format, previous sense
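The byte 0 check above can be sketched in Python as follows. This is only a minimal illustration: the function name `decode_response_code` is made up for this page, and the labels follow the wording of the list above.

```python
# Permitted values of the 7 lower bits of byte 0, as listed above.
RESPONSE_CODES = {
    0x70: ("fixed", "current"),
    0x71: ("fixed", "previous"),
    0x72: ("descriptor", "current"),
    0x73: ("descriptor", "previous"),
}


def decode_response_code(byte0):
    """Return (format, which command) for byte 0 of a sense buffer."""
    code = byte0 & 0x7F  # drop the top bit, same as subtracting 80h when it is set
    if code not in RESPONSE_CODES:
        raise ValueError("unexpected response code %02Xh" % code)
    return RESPONSE_CODES[code]


print(decode_response_code(0xF1))
```

Masking with 7Fh instead of subtracting avoids having to test first whether the top bit is set.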

The next important information, in the fixed format, is in byte 2 (byte number 3 if counting from 1). Its four lowest bits are the sense key; since the sense buffer is given in hexadecimal, this is the second character of that byte. The sense key is the key to understanding the error: it tells you the high-level issue. The common sense keys and their meanings are detailed below.

The next part is the ASC and ASCQ, which are found in bytes 12 and 13 (13 and 14 if counting from 1). These explain the specifics of the problem in somewhat more detail.
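Pulling these three fields out of a fixed format buffer could look like the sketch below. The function name `parse_fixed_sense` and the sample buffer in the usage line are made up for illustration; the sample is not taken from a real device.

```python
def parse_fixed_sense(hex_text):
    """Extract sense key, ASC and ASCQ from a fixed format sense buffer.

    hex_text is the buffer written as space-separated hexadecimal bytes.
    """
    data = bytes.fromhex(hex_text)
    if len(data) < 14:
        raise ValueError("buffer too short to hold ASC/ASCQ")
    if data[0] & 0x7F not in (0x70, 0x71):
        raise ValueError("not a fixed format sense buffer")
    return {
        "sense_key": data[2] & 0x0F,  # four lowest bits of byte 2
        "asc": data[12],
        "ascq": data[13],
    }


# Hypothetical buffer, for illustration only.
print(parse_fixed_sense("70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00 00 00"))
```

The format check comes first because the descriptor format keeps these fields at different offsets.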

Take a look at the following example:

f1 00 03 02 DD 7E BF 18 00 00 00 00 0C 03 00 00 00 00 00 00 03 0C 03 00 00 0F 83 01 00 08 00 00

The first byte is F1h. We remember to remove the top bit and get 71h, which means “fixed format, previous sense”, so we can decode the rest according to our instructions and keep in mind that the IO that failed with this sense is not to blame: it was a previous IO that failed. Next we find 3h as the second nibble of the third byte, which tells us that this is a medium error: a disk tried to read or write and failed. The last part is the ASC and ASCQ, which are 0Ch/03h, and this translates to “WRITE ERROR – RECOMMEND REASSIGNMENT”. This tells us that it was a write that failed and that the disk suggests reassigning the sector. One part that is a bit harder to decode is the LBA that actually failed. The top bit of the first byte is set, which says that the information field (bytes 3 to 6, here 02 DD 7E BF) has meaning, and in the case of a medium error (sense key 3h) that meaning is the first LBA that failed.
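The LBA from this example can be recovered with a few lines of Python. This is a sketch rather than a full decoder: `failed_lba` is a name made up here, and it assumes a fixed format buffer in which the information field is the four bytes at offsets 3 to 6, stored big-endian.

```python
EXAMPLE = (
    "f1 00 03 02 DD 7E BF 18 00 00 00 00 0C 03 00 00 "
    "00 00 00 00 03 0C 03 00 00 0F 83 01 00 08 00 00"
)


def failed_lba(data):
    """Return the information field as an integer, or None if it is not valid."""
    if not data[0] & 0x80:  # top bit of byte 0 clear: information field has no meaning
        return None
    return int.from_bytes(data[3:7], "big")


sense = bytes.fromhex(EXAMPLE)
lba = failed_lba(sense)
print(lba, hex(lba))
```

For the buffer above this gives LBA 48070335 (02DD7EBFh). Byte 7 holds 18h, that is 24 more bytes after the first 8, which matches the 32 bytes of the example.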

You can see full parsing of this sense in the webapp: f1 00 03 02 DD 7E BF 18 00 00 00 00 0C 03 00 00 00 00 00 00 03 0C 03 00 00 0F 83 01 00 08 00 00

The sense keys are listed briefly at the T10 Sense Key page. The ASC/ASCQ are listed at the T10 ASC/ASCQ page.

Common sense keys are:

1h Recovered Error — informational only
2h Not ready — temporary error, need to wait it out
3h Medium Error — may work if retried, disappears after write or reassign
4h Hardware Error — usually permanent failure
5h Illegal Request — mostly a programming error, maybe the device handles an older standard with some bits unsupported
6h Unit Attention — a storage fabric problem, usually a notification and not a problem with the IO itself
7h Data Protect — The device cannot be read/written, needs to be unlocked (physically or logically)
Bh Aborted Task — Fabric problem, command may be retried but possibly a bad cable
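The list above translates naturally into a lookup table. The sketch below is one possible way to hold it in code; the name `describe_sense_key` and the short handling hints are this page's own summary, not text from the standard.

```python
# Sense key -> (name, short handling hint summarized from the list above).
SENSE_KEYS = {
    0x1: ("Recovered Error", "informational only"),
    0x2: ("Not Ready", "wait and retry"),
    0x3: ("Medium Error", "retry, rewrite or reassign"),
    0x4: ("Hardware Error", "treat as a permanent failure"),
    0x5: ("Illegal Request", "fix or avoid the request"),
    0x6: ("Unit Attention", "handle the notification, then retry"),
    0x7: ("Data Protect", "unlock the device"),
    0xB: ("Aborted Task", "retry and check the link"),
}


def describe_sense_key(sense_key):
    """Return (name, hint) for a sense key; keys not listed here get a placeholder."""
    return SENSE_KEYS.get(sense_key & 0x0F, ("Unlisted", "see the T10 Sense Key page"))


print(describe_sense_key(0x3))
```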

Recovered Error

A recovered error is the least problematic in one way, since it only says that there was a problem that the storage device managed to take care of. The device is just letting the user know in case they are curious or would like to dig deeper and find out what is going on.

There are two reasons why a recovered error would be returned:

SMART Trip
Medium Errors recovered

If a disk finds that it is about to fail according to its SMART logic (also known as Informational Exceptions in SCSI), it will report it in log page 2Fh, but the only way for it to tell the storage system that now is the time to start looking at this log page is to take one random IO (the first one that comes up after the SMART issue is detected) and return a recovered error with ASC/ASCQ of 5Dh/00h, which stands for “FAILURE PREDICTION THRESHOLD EXCEEDED”.

If a sector is having problems and it took a non-trivial amount of work to recover from them, a recovered error may be reported with an ASC of 11h, 17h or 18h, depending on the severity and the type of recovery needed.
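Telling the two reasons apart only needs the ASC and ASCQ. A minimal sketch, assuming the sense key has already been identified as 1h; the function name and the returned strings are made up for illustration:

```python
def recovered_error_cause(asc, ascq):
    """Classify a Recovered Error (sense key 1h) by the two reasons given above."""
    if (asc, ascq) == (0x5D, 0x00):
        # FAILURE PREDICTION THRESHOLD EXCEEDED: time to read log page 2Fh
        return "SMART trip"
    if asc in (0x11, 0x17, 0x18):
        return "medium error recovered by the device"
    return "other, look up the ASC/ASCQ"


print(recovered_error_cause(0x5D, 0x00))
print(recovered_error_cause(0x18, 0x00))
```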

Whether a specific device returns a recovered error sense is determined by parameters in the mode pages. You may want to peruse them to find out how to turn this behavior on or off.

The normal Linux kernel SCSI stack will ignore this sense and carry on, only reporting it in dmesg.

Not Ready

The device is not ready: wait and retry. The device will either become ready or fail, and if it fails it should time out by itself and switch to Hardware Error.

A Not Ready sense is returned when the device is powering up and not yet ready to respond to anything serious, such as when an HDD is still spinning up or when an SSD has not yet read its metadata tables from the flash.

Under some error conditions this may persist for a while, and if it persists for more than 30 seconds or so it is likely to be a failure already. In most cases the device will have a timeout of its own, after which it will transition to replying 4h Hardware Error instead of Not Ready.

A user can only wait a bit longer for the device to become ready, and fail it out if it takes too long to exit this state.
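The wait-then-give-up policy can be sketched as a small polling loop. Everything here is an assumption made for illustration: `probe` stands for whatever the caller uses to send a harmless command and get back the sense key (or `None` on success), and the 30 second default simply mirrors the rough figure above.

```python
import time

NOT_READY = 0x2


def wait_until_ready(probe, timeout_s=30.0, interval_s=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the device stops answering Not Ready.

    Returns the first result from probe() that is not the Not Ready sense key:
    None for success, or another sense key such as 4h.
    Raises TimeoutError if the device is still Not Ready after timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        sense_key = probe()
        if sense_key != NOT_READY:
            return sense_key
        if clock() >= deadline:
            raise TimeoutError("device still Not Ready after %.0f s" % timeout_s)
        sleep(interval_s)


# Simulated device that becomes ready on the third probe.
answers = iter([NOT_READY, NOT_READY, None])
print(wait_until_ready(lambda: next(answers), sleep=lambda s: None))
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.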

Medium Error

A medium error means that you tried to read or write data and the disk failed. It is also taken to mean that the problem is not permanent and doesn’t afflict the entire disk, only some area of it. A disk can reassign the affected sectors to solve the problem. If the disk is configured to auto-reassign, then a write to that area will cause the disk to reassign and the problem will be gone; if the disk is not configured to auto-reassign, then you need to use the REASSIGN BLOCKS command to get the same effect.

In some cases retrying the read may eventually get the data, but the likelihood is not high and it incurs a great penalty in time, since normally a medium error is declared only after a timeout is reached during the read operation.

In a SCSI disk there are two bits, AWRE (Automatic Write Reallocation Enabled) and ARRE (Automatic Read Reallocation Enabled), that control whether auto-reallocation is done by the disk.
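As a sketch of where to look for these bits: the code below assumes they sit in byte 2 of the Read-Write Error Recovery mode page (page code 01h), AWRE as bit 7 and ARRE as bit 6, and that `mode_page` starts at the page code byte, after the mode parameter header and any block descriptors. The function name and the sample page bytes are made up for illustration; check the layout against the standard your device follows.

```python
def reallocation_flags(mode_page):
    """Read AWRE and ARRE from a Read-Write Error Recovery mode page (01h)."""
    if len(mode_page) < 3 or mode_page[0] & 0x3F != 0x01:
        raise ValueError("not a Read-Write Error Recovery mode page (01h)")
    flags = mode_page[2]
    return {
        "AWRE": bool(flags & 0x80),  # assumed bit 7: reallocate on write errors
        "ARRE": bool(flags & 0x40),  # assumed bit 6: reallocate on read errors
    }


# Hypothetical page contents, for illustration only.
print(reallocation_flags(bytes.fromhex("81 0a c0 0b ff 00 00 00 05 00 ff ff")))
```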

Hardware Error

A hardware error is reported when the disk reaches a fatal state from which it will not recover. The disk can no longer be read or written. Not even a power cycle will help in this case.

Illegal Request

When a disk returns “Illegal Request” it means it failed to parse the command or the data you gave it. Either the command is invalid or it is unsupported by the disk. This can happen when the disk supports an older standard or doesn’t adhere to the standard completely.

When issuing MODE SELECT and LOG SELECT commands, you will also get an Illegal Request if the parameter you are trying to change cannot be changed. You can fetch the Changeable Mask for MODE SELECT with MODE SENSE to see if this is the case.

You will need to reformulate the request or avoid it altogether.

Unit Attention

A Unit Attention is the way for the device to tell you that its operational state or the fabric state has changed. Since SCSI is a client-server protocol, there is no other way for the device to tell you that something changed without piggy-backing on another request, which is exactly what happens here. The command that you performed is likely to be just fine, but there was some other condition in the device that requires the user’s attention. The attention needs to be taken care of, and the command that was unfortunate enough to be failed for this can be retried.

Examples of this are when MODE SELECT is used to change mode parameters, or when an initiator is lost and you then get an I_T_L NEXUS LOSS, among several other such cases.

Data Protect

Data Protect is received when the device is working but locked, either by a physical write lock or, with Data-at-Rest encryption, when the device or the band has not yet been unlocked.

This only means that an unlock needs to happen before the action is allowed.

Aborted Task

When a communication link fails or a command is aborted you can get this sense key. It cannot be directly attributed to a failure in the device; it is most likely a connectivity issue which will need to be resolved.

If there is a flaky link these errors can come and go from time to time, and it will be hard to communicate with the device. In most cases the command should just be retried several times and a problem flag raised if this continues.

Rare communication failures are of no importance and can be assumed to happen, but if the failure is common enough it should be reported so that the user can fix it. The normal BER for SCSI links is around 10^-15, so about 1 error per day at a full data rate of about 6 Gbps is perfectly acceptable; above that it really depends on the application and system.
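The arithmetic behind that estimate can be checked quickly. The sketch below assumes the link runs at its full rate for the whole day and counts every bit on the wire, which is a simplification; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def expected_bit_errors_per_day(ber, bits_per_second):
    """Expected number of bit errors in one day of continuous transfer."""
    return ber * bits_per_second * SECONDS_PER_DAY


print(expected_bit_errors_per_day(1e-15, 6e9))
```

With a BER of 10^-15 at 6 Gbps this comes to roughly half an error per day, the same order of magnitude as the figure quoted above.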

Posted by Baruch Even