As more companies move to LAN-based client/server models, IBM servers and their
associated drive subsystems are becoming larger and storing more mission-critical information.
Because of this, the availability of these systems is more and more important. Protecting the data
stored is vital. This manual focuses on the actions necessary to properly maintain a RAID disk
array and how to recover from the most common types of failures in RAID disk arrays.
IBM provides management software, NetFinity Manager, to monitor the status of the hardware
and provide alerts when conditions are not optimal. IBM provides this software and upgrades at
no additional charge for all customers that have purchased an IBM server that ships with
ServerGuide so that customers can obtain all of the information necessary to confidently protect
their data. Installation of NetFinity Manager, or of similar tools, to monitor and track the health
of the disk subsystem is critical to the protection of the data stored. Without these tools, the
failures listed below, and other system warnings, such as Predictive Failure Analysis or SMART
alerts, cannot be communicated to the operator so that preventative action can be taken.
There are three types of drive failures that can typically occur in a RAID-5 or RAID-1 subsystem
that may threaten the protection of this data:
'Catastrophic' Drive Failures
When the data on a drive is completely inaccessible due to mechanical or electrical problems, we
define this as a catastrophic or complete drive failure. In these cases, all data stored on the drive,
including the ECC data written on the drive to protect information, is inaccessible. This is where
RAID-1 and RAID-5 level arrays provide the most common protection. A RAID-5 or RAID-1
array stores redundant, or 'parity' information within the array of drives. This parity
information can be used to recreate the data from the lost drive. The information will be
recalculated 'on the fly' in response to user requests and can also be used to rebuild the lost
drive's data either immediately to a hot spare drive or when the failed drive has been replaced.
RAID-1 and RAID-5 arrays protect from the loss of a single drive within the array. Failure of
more than one drive will require restoring information from a backup device.
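The recalculation described above can be illustrated with a minimal sketch. It assumes the parity is a byte-wise XOR across the data blocks of a stripe, which is the conventional RAID-5 scheme; the helper name xor_blocks and the sample block contents are hypothetical and are not taken from any IBM adapter or utility.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks belonging to one stripe (illustrative values only).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Simulate the loss of the drive that held data[1], then recreate its
# contents from the surviving data blocks plus the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block. If a second block were also lost, the survivors would no longer determine either missing block, which is why a second drive failure requires a restore from backup.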
Problem: The RAID-5 technology cannot reconstruct the data correctly unless the RAID-5
parity throughout the drives is correct. The RAID-1 logical drive does not reconstruct using
parity information. Therefore, RAID-1 logical drives are not affected.
Prevention: For the IBM ServeRAID and ServeRAID II Adapters, Synchronization is required
before installing an operating system or storing any customer data on a RAID-5 array to ensure
the parity correctly reflects the data. The RAID-5 arrays write data out to drives in stripe units.
The size of the stripe unit can be configured to 8KB, 16KB, 32KB, or 64KB. Synchronization
reads all the data bits in each stripe unit, calculates the parity for that data, compares the
calculated parity with the existing parity for all stripe units in the array, and updates the existing
parity for all stripe units that are inconsistent. Once the logical drive has been synchronized, the
RAID-5 parity will remain synchronized until the logical drive is redefined.
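The read, calculate, compare, and update steps of Synchronization can be sketched as follows. This is an illustration only, not the adapter's firmware logic: it assumes XOR parity, models each stripe as a Python dictionary, and uses the hypothetical names parity_of and synchronize.

```python
from functools import reduce

def parity_of(units):
    """Calculate XOR parity over the data stripe units of one stripe."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))

def synchronize(stripes):
    """Recalculate parity for every stripe and update any that is inconsistent.

    Each stripe is modeled as {"data": [bytes, ...], "parity": bytes}.
    Returns the number of stripes whose stored parity had to be updated.
    """
    updated = 0
    for stripe in stripes:
        expected = parity_of(stripe["data"])   # read data, calculate parity
        if stripe["parity"] != expected:       # compare with existing parity
            stripe["parity"] = expected        # update inconsistent parity
            updated += 1
    return updated

array = [
    {"data": [b"\x01\x02", b"\x03\x04"], "parity": b"\x02\x06"},  # consistent
    {"data": [b"\x0f\xf0", b"\xff\x00"], "parity": b"\x00\x00"},  # stale parity
]
fixed = synchronize(array)
```

In this model a second pass over the same array finds nothing to update, which mirrors the statement that parity remains synchronized once the logical drive has been synchronized.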
Grown Sector Media Errors
Sector media errors only affect a small area of the surface of the drive and do not constitute a
catastrophic drive failure. These errors are typically identified when the corresponding data is
requested by an application program. Often, the drive itself can repair these errors by
recalculating lost data from Error Correction Code (ECC) information stored within each data
sector on the drive. The drive then remaps this damaged sector to an unused area of the drive to
prevent data loss.
Problem: Media Sector Errors may not be detected in seldom used files or in non-data areas of
the disk. These errors will only be identified and corrected if a read or write request is made to
data that is stored within that location.
Prevention: Data Scrubbing forces all sectors in the logical drive to be accessed so that Media
Sector Errors are detected by the drive. Once detected, the drive's error recovery procedures will
be invoked to repair these errors by recalculating the lost data from the ECC information
described above. If the ECC information is not sufficient to recalculate the lost data, the
information may still be recovered if the drive is part of a RAID-5 or RAID-1 array. RAID-5
and RAID-1 arrays can provide their own redundant information (similar to the ECC data written
on the drive itself) which is stored on other drives in the array. The RAID adapter can recalculate
the lost data and remap the bad sector. Data Scrubbing can be performed in the background while allowing
concurrent user disk activity on RAID-5 and RAID-1 logical drives. With the IBM
ServeRAID II Adapter, Data Scrubbing is performed by the firmware of the adapter as a
background process. With all other IBM RAID Adapters, an easy tool used to accomplish Data
Scrubbing is Synchronization. Netfinity Manager 5.0 will allow you to automatically schedule
the synchronization from either the server or the remote manager. Netfinity Manager 5.0 can be
obtained at no additional charge by customers who have ServerGuide, which ships with every
IBM server. If the customer has another type of scheduler such as the AT scheduler built into
Windows NT or REXXWARE by Simware Corporation, then the IBM ServeRAID and
ServeRAID II adapter command line utilities may be used to allow the customer to schedule Data
Scrubbing without Netfinity Manager installed. Refer to the Data Scrubbing Utilities Available
via Array Synchronization section for the adapter and operating system compatibility matrix for
these Data Scrubbing utilities.
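The way Data Scrubbing finds and repairs media errors before they are needed can be sketched as a simple model. The assumptions are loud ones: each drive is a Python list of sectors, a sector that the drive's own ECC could not recover is represented as None, redundancy is XOR parity, and the function name scrub is hypothetical. It does not reflect the actual ServeRAID firmware or command line utilities.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def scrub(drives):
    """Access every sector of every drive and repair unreadable ones.

    drives: list of drives, each a list of sectors (bytes, or None when
    the sector is unreadable). Returns (repaired, unrecoverable), where
    repaired is a list of (drive, row) pairs and unrecoverable is a list
    of rows with more than one unreadable sector.
    """
    repaired, unrecoverable = [], []
    for row in range(len(drives[0])):
        bad = [d for d, drive in enumerate(drives) if drive[row] is None]
        if len(bad) == 1:
            d = bad[0]
            others = [drive[row] for i, drive in enumerate(drives) if i != d]
            drives[d][row] = xor_blocks(others)  # recalculate and rewrite
            repaired.append((d, row))
        elif len(bad) > 1:
            unrecoverable.append(row)
    return repaired, unrecoverable

drives = [
    [b"\x01", b"\x02"],  # drive 0
    [b"\x04", None],     # drive 1: latent media error in row 1
    [b"\x05", b"\x0a"],  # drive 2: parity for each row in this example
]
repaired, unrecoverable = scrub(drives)
```

The point of the sketch is that the error in row 1 is found only because the scrub touches every sector; ordinary user I/O might never have read it.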
Combination Failures
Problem: When a catastrophic drive failure occurs while there are still undetected and therefore
uncorrected sector media errors on the remaining drives in the array, the array will not be able to
rebuild all the data. Just as if two drives had failed, the array will be missing BOTH the
information stored on the lost drive and the information from the sector where there is a media
error. This constitutes a double failure and files will need to be restored from backup media.
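The double failure described above can be shown with the same kind of simplified model. It assumes XOR parity, represents an undetected media error on a surviving drive as None, and uses the hypothetical function name rebuild; it is a sketch of the principle, not of any adapter's rebuild procedure.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(drives, failed):
    """Attempt to rebuild every sector of the failed drive from the others.

    Returns (rebuilt, lost): rebuilt holds the recovered sectors (None
    where recovery was impossible) and lost lists the affected rows.
    """
    rebuilt, lost = [], []
    for row in range(len(drives[0])):
        others = [drive[row] for i, drive in enumerate(drives) if i != failed]
        if any(sector is None for sector in others):
            rebuilt.append(None)   # two pieces of the stripe are missing
            lost.append(row)
        else:
            rebuilt.append(xor_blocks(others))
    return rebuilt, lost

drives = [
    [b"\x01", b"\x02"],  # drive 0: suffers the catastrophic failure
    [b"\x04", None],     # drive 1: undetected media error in row 1
    [b"\x05", b"\x0a"],  # drive 2
]
rebuilt, lost = rebuild(drives, failed=0)
```

Row 0 is rebuilt normally, but row 1 cannot be, because both the failed drive's sector and the unreadable sector on drive 1 are missing from that stripe.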
Prevention: IBM recommends Data Scrubbing all RAID-5 and RAID-1 logical drives weekly to
minimize the risk of having any undetected sector media errors on the remaining drives of the
array when a drive failure occurs.