Environment
- Red Hat Enterprise Linux (RHEL)
  - 5
  - 6
  - 7
- Red Hat's `qla2xxx` driver - QLogic FC HBAs
- Fibre Channel SAN
Issue
- We are noticing SAN errors across all our different systems (SuSE, Solaris, even Windows).
- Performance on our database servers is degraded and applications are slow to respond.
- Some systems are crashing after these errors.
- What do the "Abort command issued" error messages mean?
kernel: qla2xxx 0000:46:00.0: scsi(1:0:105): Abort command issued -- 1 3e7bbc46 2002.
kernel: qla2xxx 0000:46:00.0: scsi(1:0:101): Abort command issued -- 1 3e7c1ec0 2002.
kernel: qla2xxx 0000:46:00.0: scsi(1:0:103): Abort command issued -- 1 3e7d02b8 2002.
kernel: qla2xxx 0000:46:00.0: scsi(1:0:115): Abort command issued -- 1 3e7d37a9 2002.
kernel: qla2xxx 0000:46:00.0: scsi(1:0:109): Abort command issued -- 1 3e7d44cd 2002.
- What is the meaning of the `qla2xxx [0000:04:00.0]-801c:1: Abort command issued nexus` message?
kernel: qla2xxx [0000:04:00.0]-801c:1: Abort command issued nexus=1:0:0 -- 1 2002.
kernel: qla2xxx [0000:04:00.0]-801c:1: Abort command issued nexus=1:0:0 -- 1 2002.
kernel: qla2xxx [0000:04:00.0]-801c:1: Abort command issued nexus=1:0:0 -- 1 2002.
Resolution
- These errors indicate an error condition being returned from the SAN.
- Check the FC switch, FC cabling, zoning, and storage array for problems.
- Engage the storage vendor to review the FC switch logs for incrementing error counters, such as CRC errors.
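While the vendor reviews the switch logs, the host-side FC error counters can be checked as well. The following is a minimal sketch, assuming the HBA exposes the standard FC transport statistics under `/sys/class/fc_host/hostN/statistics/`; the function name `show_fc_error_counters` and the choice of counters are illustrative, not part of the original procedure.

```shell
# Hypothetical helper: print selected FC error counters for every FC host.
# Assumption: counters live in <base>/hostN/statistics/ (FC transport class sysfs layout).
show_fc_error_counters() {
    base=${1:-/sys/class/fc_host}
    for host in "$base"/host*; do
        # Skip when the glob did not match or the host has no statistics directory.
        [ -d "$host/statistics" ] || continue
        for counter in invalid_crc_count link_failure_count loss_of_sync_count loss_of_signal_count; do
            file="$host/statistics/$counter"
            [ -r "$file" ] && echo "${host##*/} $counter $(cat "$file")"
        done
    done
    return 0
}

show_fc_error_counters
```

Running it twice a few minutes apart shows whether a counter is still increasing, which is more informative than a single absolute value.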
Root Cause
- The error message `qla2xxx [0000:04:00.0]-801c:1: Abort command issued nexus=1:0:0 -- 1 2002` is explained below.
  - `qla2xxx` is the name of the driver or kernel module.
  - `[0000:04:00.0]` is the PCI bus information of the device.
  - `801c` is a hexadecimal id which uniquely identifies the part of the code from where the message originated.
  - `1` is the host number of the scsi target.
  - `Abort command issued nexus=1:0:0` means the driver aborted the command that was in progress to the scsi target 1:0:0.
  - the last `1` means the driver spent time waiting for the device to respond.
  - `2002` means the abort succeeded (`0x2002` is the SCSI layer's `SUCCESS` return code).
- Multiple underlying issues can cause abort messages and a slow SAN.
- Initial areas to investigate include SAN related components, such as the switches or storage targets.
- Command aborts are almost always caused by command timeouts. When a command times out, the first course of action is to abort it so that any references to it are erased. A command timeout can have many different causes: SAN congestion, a flaky target, bad hardware somewhere, or an overloaded target that might be dropping commands.
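The field breakdown of the newer-format abort message can be sketched as a small parser. This is an illustration only: the function name `parse_qla_abort` and the output labels are hypothetical, and the pattern is based solely on the sample message shown in this article.

```shell
# Hypothetical helper: split a newer-format qla2xxx abort message into named fields.
# Expected shape: qla2xxx [PCI]-MSGID:HOST: Abort command issued nexus=H:T:L -- WAIT RESULT.
parse_qla_abort() {
    printf '%s\n' "$1" | sed -n -E \
        's/.*qla2xxx \[([0-9a-fA-F:.]+)\]-([0-9a-fA-F]+):([0-9]+): Abort command issued nexus=([0-9]+:[0-9]+:[0-9]+) -- ([0-9]+) ([0-9a-fA-F]+).*/pci=\1 msg_id=\2 host=\3 nexus=\4 waited=\5 result=\6/p'
}

parse_qla_abort 'kernel: qla2xxx [0000:04:00.0]-801c:1: Abort command issued nexus=1:0:0 -- 1 2002.'
# prints: pci=0000:04:00.0 msg_id=801c host=1 nexus=1:0:0 waited=1 result=2002
```

Lines that do not match the pattern produce no output, so the function can be applied to a whole log without pre-filtering.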
Diagnostic Steps
- Enable extended logging on the qla2xxx driver
**CAUTION:** Turning on extended error logging under moderate to heavy IO loads can cause lockups! The debug code logs information to `/var/log/messages` about IO being processed. These debug messages cause additional IO, which in turn causes more logging. This can get to the point of essentially locking up the system. It is strongly suggested that the messages file be moved off any QLogic-controlled disks to a local disk or via the network to a remote logging point to avoid this issue.
- Enable extended logging for the qla2xxx driver to try to capture any additional error messages when the issue occurs:
$ chmod u+w /sys/module/qla2xxx/parameters/ql2xextended_error_logging
$ echo "1" > /sys/module/qla2xxx/parameters/ql2xextended_error_logging
- Check for additional error logging in /var/log/messages when the issue occurs:
Mar 14 00:04:51 hostname kernel: qla2xxx_eh_abort(1): aborting sp ffff8102c5614680 from RISC. pid=1048458952.
Mar 14 00:04:51 hostname kernel: scsi(1): ABORT status detected 0x5-0x0.
Mar 14 00:04:51 hostname kernel: qla2xxx 0000:46:00.0: scsi(1:0:109): Abort command issued -- 1 3e7e36c8 2002.
- Increase scsi extended event logging to get more information from the SCSI layer. It is possible to enable this without a reboot using sysctl in the following fashion:
$ sysctl -w dev.scsi.logging_level=0x1003
- Note: Don't use other values, especially larger values such as `0xffff`, unless you know exactly what each bit does. Turning on other values can flood the logs with so many messages that the important messages will be overwritten before ever being saved to disk and also cause huge log files to be created.
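To see what a given `logging_level` value turns on, it can be decoded into its per-facility levels. The sketch below assumes the layout used in the kernel's `scsi_logging.h` (ten facilities, 3 bits each, starting with `error` at bit 0); the function name `decode_scsi_logging_level` is hypothetical and the facility order should be checked against the running kernel's source.

```shell
# Hypothetical helper: decode a SCSI logging_level value into non-zero facility levels.
# Assumption: 3 bits per facility, in the order listed below (per scsi_logging.h).
decode_scsi_logging_level() {
    value=$(( $1 ))
    shift_bits=0
    for facility in error timeout scan mlqueue mlcomplete llqueue llcomplete hlqueue hlcomplete ioctl; do
        level=$(( (value >> shift_bits) & 7 ))
        [ "$level" -ne 0 ] && echo "$facility=$level"
        shift_bits=$(( shift_bits + 3 ))
    done
    return 0
}

decode_scsi_logging_level 0x1003
# prints:
# error=3
# mlcomplete=1
```

Under the same assumption, `0xffff` would set the first five facilities to their maximum level of 7, which illustrates why larger values can flood the logs.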
- Open cases with the SAN and fabric switch vendors involved.
- With scsi extended `logging_level` and `ql2xextended_error_logging` set, wait for a few events to occur and upload a fresh `/var/log/messages` file from the systems.
- Check how many HBAs are present and whether the errors are spread evenly across them or occur on only one HBA:
  - Check the HBA PCI IDs:
    $ lspci | grep -i qlogic
    02:00.0 Fibre Channel: QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express HBA (rev 02)
    46:00.0 Fibre Channel: QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express HBA (rev 02)
    - The `02:00.0` and `46:00.0` at the beginning of the output are the PCI address values for these cards.
  - Check the number of errors on each of the cards in /var/log/messages, based on the PCI addresses found above (`02:00.0` and `46:00.0` are sample PCI addresses for QLogic HBAs):
    $ grep 02:00.0 /var/log/messages | grep qla2xxx | wc -l
    4
    $ grep 46:00.0 /var/log/messages | grep qla2xxx | wc -l
    86
  - Check if there is something special on the fabric and paths for the device `46:00.0` (use the value that corresponds to your own environment).
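The per-HBA comparison above can also be done in one pass over the log. This is a minimal sketch, assuming the two message formats shown in this article; the function name `count_aborts_per_hba` is hypothetical.

```shell
# Hypothetical helper: count "Abort command issued" messages per PCI address.
# Handles both formats seen above: "qla2xxx 0000:46:00.0:" and "qla2xxx [0000:04:00.0]-801c".
count_aborts_per_hba() {
    logfile=${1:-/var/log/messages}
    grep 'qla2xxx' "$logfile" | grep 'Abort command issued' |
        sed -n -E 's/.*qla2xxx \[?([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\]?.*/\1/p' |
        sort | uniq -c |
        awk '{print $2, $1}'   # output: "<pci address> <count>"
}
```

Calling `count_aborts_per_hba /var/log/messages` prints one line per PCI address with its abort count. A large imbalance between HBAs points at the fabric or paths behind the card with the higher count.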