Troubleshooting ServeRAID

cuisou8847

To begin troubleshooting, check the following top issues. If your issue is listed, select the link; otherwise, proceed to step 2.

What are Bad Stripe Table entries?
What are Bad Stripe Table (BST) limitations?
What are the conditions conducive to the appearance of bad stripes?
How to minimize the risk of bad stripes
How to maintain ServeRAID
How does an operating system react when it tries to read or write to a bad stripe?
How can Bad Stripe Table entries be removed from an Array or Logical Drive?
How to mitigate the existence of bad stripes on a logical drive
Frequently asked questions (FAQs)

What are Bad Stripe Table entries?

The Bad Stripe Table (BST) tracks stripes across a logical drive that contain invalid or incomplete data. The table is stored in the area that ServeRAID reserves for configuration information on each physical disk in an array that hosts one or more logical drives. There is a separate table for each logical drive.

The failure concepts that apply to RAID logical drives also apply at the stripe level. The ServeRAID controller is designed to handle and correct a single stripe unit failure on a read or a write-verify. If two or more stripe units within the same horizontal stripe across the array fail at the same time for any reason, all stripe units within that stripe become blocked, creating a bad stripe table entry in the hosted logical drive's configuration. The error message "Multiple stripe unit failures within a single horizontal stripe" is a clear definition of a bad stripe at its most basic level. See ServeRAID stripes and stripe units for more information on stripes.

The most common cause of a bad stripe table entry is extended operation of a logical drive while in a critical state (one disk within the array is marked defunct). In this situation, one stripe unit in each stripe is already unavailable, so if the controller encounters another unrecoverable error, the second stripe unit failure increments the bad stripe count. A bad stripe is essentially a stripe-level RAID failure, although instead of taking the entire logical drive off-line, only the data within the stripe becomes unavailable. A bad stripe entry is then added to the Bad Stripe Table and becomes a part of the array and logical drive configuration.

Note: Bad stripe table entries on a logical drive are a symptom of a physical, procedural, or environmental problem within the ServeRAID subsystem. They can occur while the ServeRAID controller tries to keep data available under less than optimal circumstances. Correcting the bad stripes themselves may not prevent them from recurring later if the condition that led to their creation is not corrected first.

What are Bad Stripe Table (BST) limitations?

ServeRAID firmware allows a maximum of 128 entries in the BST for a logical drive before blocking that logical drive. If 128 entries already exist in the BST for a logical drive in Rebuild state and another uncorrectable read error occurs such that the firmware would normally add this stripe to the BST, the rebuild will be halted and the Logical Drive will become blocked. The state shown for the drive that had been rebuilding will continue to show as rebuilding, but no rebuild activity will occur.
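
The rules described so far (a single stripe unit failure is corrected, two or more failures in the same stripe create a BST entry, and a full table blocks the logical drive) can be illustrated with a small model. This is only a sketch under simplifying assumptions, not the firmware's actual logic: the class name, method name, and return strings are hypothetical, and the model ignores the distinction between rebuild and normal operation.

```python
from dataclasses import dataclass, field

BST_LIMIT = 128  # maximum BST entries per logical drive, as stated above


@dataclass
class LogicalDrive:
    """Toy model of one logical drive and its Bad Stripe Table (hypothetical)."""
    bad_stripes: set = field(default_factory=set)
    blocked: bool = False

    def report_stripe(self, stripe: int, failed_units: int) -> str:
        """Record how many stripe units failed in one horizontal stripe."""
        if self.blocked:
            return "blocked"
        if failed_units == 0:
            return "ok"
        if failed_units == 1:
            # A single stripe unit failure is recoverable from redundancy.
            return "corrected"
        if stripe in self.bad_stripes:
            return "bad stripe"
        if len(self.bad_stripes) >= BST_LIMIT:
            # The table is full: block the drive instead of adding an entry.
            self.blocked = True
            return "blocked"
        self.bad_stripes.add(stripe)
        return "bad stripe"
```

For a drive in a critical state, one stripe unit per stripe is already missing, so a single additional unrecoverable error corresponds to `failed_units=2` in this model.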

What are the conditions conducive to the appearance of bad stripes?

Physical Conditions:

Simultaneous disk drive problems with two or more drives
Hard system hang conditions
Failing components in the SCSI data path, including cables, backplanes, termination, and tray interposers
UPS failures, power fluctuations or unexpected power off situations
System NMI conditions
Poor seating of SCSI bus components, including the ServeRAID controller, cables, terminators, backplanes, drive or cable connections, hot-swap drive tray connections, and repeater options

Procedural issues:

Powering a server off improperly
Operating a server for an extended period of time with mismatched versions of ServeRAID software, including BIOS, firmware, drivers, or utilities
Operating a server for an extended period of time while a logical drive is in a critical state
Improper installation of replacement drives (can cause a poor seat against the backplane)
Failure to follow the recommended guidelines for UPS installation and redundant power connections, which are designed to prevent exposure to an unexpected power-off condition

Environmental issues:

Operating the server or disk drives outside of environmental specifications

How to minimize the risk of bad stripes

Ensure that RETAIN tip H09680, "RAID-5 Potential for Data Loss with ServeRAID Under Stress", is applied, if applicable.
Ensure that the server is always shut down and powered off properly.
Ensure that the ServeRAID subsystem is operating with matching software versions, as described in "ServeRAID mismatched software levels can result in system problems - Servers and IntelliStation".
Ensure that the hard disk drive firmware is current, as provided in "IBM Hard Disk Drive Update Program (DOS update package) - IBM IntelliStation and Servers".
Ensure that there is an available Hot Spare or Standby Hot Spare installed in the system.
Ensure that the disk subsystem is monitored for disk failures by installing appropriate software that provides alert automation.
Ensure that all hard disks are replaced according to IBM procedures and guidelines. See RETAIN tip H026485, "Removing ServeRAID hot-swap hard disk drives improperly may cause damage".
Ensure that hard disk drives are operating with IBM supported cache mode settings. See RETAIN tip H033734, "Hard disk drive Write-Cache mode defaults to Write-Through mode".
Ensure that the server is properly protected from unexpected power-off conditions by implementing UPS redundancy according to guidelines.
Ensure that the server is operating within environmental specifications.

How to maintain ServeRAID

Synchronize each logical drive on a regular basis. The time period between synchronizations depends on how dynamic the data is. When the data changes constantly, synchronize weekly. If the data is static with minimal changes, synchronize monthly.

Foreground syncs can be initiated in two ways: by using the ServeRAID Manager GUI or by using the IPSSEND command. The IPSSEND command can be used in a BAT or CMD file and then automated using almost any scheduling utility.
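
As an illustration of automating the schedule described above, the following Python sketch decides whether a synchronization is due and runs a command supplied by the operator. It rests on several assumptions: the function names are hypothetical, "monthly" is approximated as 30 days, and the exact IPSSEND arguments are not given in this article, so the command line is passed in rather than hard-coded.

```python
import subprocess
from datetime import datetime, timedelta

WEEKLY = timedelta(days=7)
MONTHLY = timedelta(days=30)  # assumption: "monthly" approximated as 30 days


def sync_due(last_sync: datetime, now: datetime, dynamic_data: bool) -> bool:
    """Return True when the recommended interval has elapsed."""
    interval = WEEKLY if dynamic_data else MONTHLY
    return now - last_sync >= interval


def run_sync(command: list[str]) -> int:
    """Run an operator-supplied synchronization command; return its exit code.

    The command is deliberately not hard-coded: take the exact IPSSEND
    arguments from the ServeRAID documentation for your controller.
    """
    return subprocess.run(command, check=False).returncode
```

A scheduler could call `sync_due` once a day and invoke `run_sync` only when it returns True, which keeps the weekly or monthly policy in one place.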

How does an operating system react when it tries to read or write to a bad stripe?

When the operating system attempts to write to a portion (a stripe) of a logical drive that has an entry in the BST, the write fails and an error code is returned to the operating system. Some operating systems can handle the error by using a write/reassign command that will write the data to another area of the logical drive. The new data will be stored at another location of the drive, but the BST is not changed.

When the operating system attempts to read from a portion (a stripe) of a logical drive that has an entry in the BST, the read fails and an error code is returned to the operating system. There is no operating system recovery, since the data is lost.

Each operating system reacts differently when bad stripe entries occur. ServeRAID cannot identify which files are located within a blocked stripe, so operating system behavior varies. If the blocked file is a data file, the operating system will likely report that it cannot find or cannot read the file. If the blocked file is an operating system or application file, the operating system will likely fail the application or may crash the system, depending on how important the file is.

How can Bad Stripe Table entries be removed from an Array or Logical Drive?

By design, bad stripe table entries only increment upward from zero. There are no tools or commands that can remove an entry from the table. The table itself is part of the ServeRAID configuration information stored in the reserved area of each physical disk associated with the array that hosts the affected logical drive.

There are only two suggested methods for clearing the reserved areas of the drives and both are destructive to the data stored on the physical disks. Backing up the good data on the drives is recommended before any changes to the configuration are made.

The first method is to remove or delete the existing array configuration from the physical disks associated with the array that hosts the affected logical drive, and then create an identical new configuration, which overwrites any previously existing configuration data. The BST will be rewritten and will start with zero entries.

The second method has one additional step. After the existing configuration is removed from the physical disks, perform a low-level format on each physical disk using the IPSSEND Format command, and then create an identical new configuration. This provides the additional benefit of verifying that each drive is error-free.

Any other methods have a high probability of recreating the same bad stripe table entry, or exposing the operating system to invalid data that may result in other unexpected problems.

How to mitigate the existence of bad stripes on a logical drive

Every situation is different, and the circumstances greatly affect how to proceed in resolving bad stripes on a logical drive. The first step is to identify the most likely cause of the bad stripe, which could be a physical problem, a procedural problem, or an environmental issue, and then take corrective action.

After the condition has been corrected, assess the damage done to the data.
What data is corrupt or missing?
Was this data critical or non-critical?

In a Windows Environment:

Use the CHKDSK command and the COPY filename NUL /B command to test the data in question. You can also examine logs from a recent backup to see which files may have failed to back up properly (when the problem has existed for a while). A file sitting on a bad stripe will usually fail to back up to tape. Most backup software logs the filenames of files that cannot be backed up.
CHKDSK will assess the overall health of the logical drive. If CHKDSK errors out, the corruption is likely extensive.
COPY filename NUL /B will force a binary read of the file and should result in a "1 File copied" message or an error. The file is not actually copied, as the output is NUL and the /B forces a file-length binary read. If the file is sitting on a bad stripe or is otherwise damaged, the command will error out. The filename can use wildcards like *.doc. Exclusive access to the files is required to run this command.

Neither of these commands can entirely assess the scope of corruption, as the copy command will only determine if the file is valid, not if the data stored in the file is valid. You are likely to need to use data integrity tools native to the applications accessing the files on the system to fully assess the scope of data loss.
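
The same full-length read test can be scripted so that many files are checked in one pass and the failures are collected in a list. The Python sketch below is an illustrative equivalent of the COPY-to-NUL approach, not a ServeRAID or IBM tool; the function name and chunk size are assumptions. Like the COPY command, it only shows whether each file can be read from end to end, not whether its contents are valid.

```python
import glob
import os


def unreadable_files(pattern, chunk_size=1024 * 1024):
    """Read every file matching pattern in full; return (path, error) pairs."""
    failures = []
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue  # skip directories that happen to match the pattern
        try:
            with open(path, "rb") as handle:
                # Read and discard the data, like copying to NUL.
                while handle.read(chunk_size):
                    pass
        except OSError as exc:
            # A read error here may indicate a file on a bad stripe.
            failures.append((path, str(exc)))
    return failures
```

Calling `unreadable_files(r"D:\data\*.doc")` (a hypothetical path) returns an empty list when every matching file reads cleanly.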

Based on all the evidence on hand, including the total number of bad stripes, the confidence level in the problem determination steps taken, and the corrective action plan, decide how to proceed with the recovery. When the problems are minor, one option is to restore the lost files from a recent backup. When corruption is catastrophic, a second option is to remove and recreate the array and logical drive, and then restore from backup. A third, more moderate option is to restore the lost data from backup now and schedule a planned outage to rebuild the logical drive and data later. This gets the data back on-line to users during production times.

A system can operate normally with bad stripes on the logical drive; however, it is very important to monitor the system to ensure the corrective action actually fixed the condition.

Frequently asked questions (FAQs)

Q: Will the existence of a bad stripe cause a Rebuild to fail?
A: No, a rebuild will complete normally, except as noted above under Bad Stripe Table (BST) Limitations. However, if the rebuild does fail, it is likely that the condition conducive to the appearance of bad stripes was not corrected or the corrective actions were not completely successful.

Q: Does the existence of one or more bad stripes cause additional bad stripes?
A: No. Bad stripes are symptoms of another problem, most commonly SCSI bus related, for example, cables, backplanes, termination, trays, improper seating of components, and so on. The controllers themselves very rarely contribute to new bad stripes. RETAIN tip H09680, "RAID-5 Potential for Data Loss with ServeRAID Under Stress", describes all known issues.

Q: If the condition conducive to the appearance of bad stripes is eliminated, will the system operate properly from then on?
A: Yes; however, there may be some residual effects under the OS. In a Windows environment, Event IDs 26, 50, or 51 can occur if some missing data is not accounted for by the operating system, for example, when a temporary file created to track the progress of another process goes missing. The software may continue to look for the missing file, resulting in further Event ID 26, 50, or 51 entries. In Windows, very small files are often saved in the Master File Table (MFT), and if a bad stripe crosses the MFT, Event IDs 26, 50, or 51 may also occur. These events occur infrequently, but sometimes they fill the System Event log. Running CHKDSK /F should correct these problems. Continued problems with new Lost Delayed Writes (Event IDs 26 and 50) indicate that the corrective actions were not fully successful. Check the Bad Stripe Table entry count regularly until you are sure it does not increment. Event ID 51 can still occur after CHKDSK /F successfully completes and corrects file system integrity.

Q: Is there a way to clear the bad stripe entries without removing and rebuilding the array/logical drives?
A: There are no tools or commands that can remove an entry from the bad stripe table.
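
The advice above to check the Bad Stripe Table entry count regularly can be reduced to a simple comparison of recorded counts. The following Python sketch assumes the counts have been noted by hand or by a monitoring tool in chronological order; the function name is hypothetical, and how the count is obtained from the controller is outside its scope.

```python
def bst_count_increased(samples):
    """Return True if any recorded BST count is higher than the one before it.

    Entries only increment upward, so comparing consecutive samples is enough
    to detect new bad stripes.
    """
    return any(later > earlier for earlier, later in zip(samples, samples[1:]))
```

A result of True after the corrective action suggests that the underlying condition has not been fully resolved.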