IBM-AUSTRIA - PC-HW-Support 30 Aug 1999
RAID-5 ADAPTER LEVEL PROTECTION
At the adapter level, RAID-5 has become an industry standard method to provide increased
availability for servers. RAID-5 and RAID-1 implementations allow servers to continue
operation even if there is a 'catastrophic' failure of a hard drive.
Normal Operations
During normal operations in a RAID-5 environment, redundant information is calculated and
written out to the drives as shown above. In an 'n'-disk environment, 'n-1' disks' worth of data
space is provided, with one disk's worth of space dedicated to redundant 'check sum' or 'parity'
information. As pictured above, three 2GB drives will provide 4GB of data space and 2GB of
redundancy. Notice that the redundant data is actually spread out over all the disks for
performance reasons.
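The arithmetic behind this layout can be sketched in a few lines of Python. This is only an illustration, not the adapter's firmware: it assumes the 'check sum' is a byte-wise XOR of the data blocks (the usual choice for RAID-5), and the names xor_blocks, layout_stripe and capacity_gb, as well as the particular rotation order of the parity block, are hypothetical.

```python
from functools import reduce
from typing import List, Tuple


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks (the assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def layout_stripe(stripe_no: int, data_blocks: List[bytes], n_drives: int) -> List[bytes]:
    """Return one block per drive: n-1 data blocks plus one parity block.

    The parity slot moves to a different drive on each stripe. The rotation
    order used here is an assumption; real adapters may rotate differently.
    """
    if len(data_blocks) != n_drives - 1:
        raise ValueError("a stripe needs exactly n-1 data blocks")
    parity_drive = (n_drives - 1 - stripe_no) % n_drives
    row = list(data_blocks)
    row.insert(parity_drive, xor_blocks(*data_blocks))
    return row


def capacity_gb(n_drives: int, drive_gb: int) -> Tuple[int, int]:
    """(usable data space, space consumed by redundancy) in GB."""
    return (n_drives - 1) * drive_gb, drive_gb


print(capacity_gb(3, 2))  # (4, 2): three 2GB drives give 4GB data, 2GB redundancy
```

Calling layout_stripe for stripes 0, 1 and 2 of a three-drive array places the parity block on a different drive each time, which is what spreads the redundant data over all the disks.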
Catastrophic Disk Failure
If a drive that is a member of a RAID-5 array fails, the remaining members of the array can use
their redundant information to recalculate the lost data - either to respond to user requests for
data or to rebuild the data stored on the lost drive when it is replaced with a new one.
In the case pictured here, information in Record 1 from Drive 1 will be combined with the check
sum information on Drive 3 to recreate information that is not available from Drive 2. As long as
the array controller can access the remaining 'n-1' drives, the rebuild will be successful.
Naturally, if a second disk failure were to suddenly occur, the array and its data would be lost.
RAID 5 can only protect against the loss of a single drive.
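The recalculation described above can be sketched as follows, again assuming XOR parity. The function name recover_block and the use of None to mark the block on the failed drive are hypothetical conventions for this example only.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks (the assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_block(stripe: List[Optional[bytes]]) -> bytes:
    """Recompute the single missing block (None) of a stripe from the survivors."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        # With zero missing blocks there is nothing to do; with two or more,
        # single parity does not carry enough information to recover them.
        raise ValueError("exactly one block must be missing, found %d" % len(missing))
    return xor_blocks(*(block for block in stripe if block is not None))


record_1 = b"\x0f\xf0"                      # on Drive 1
record_2 = b"\x33\x33"                      # on Drive 2 (about to fail)
check_sum = xor_blocks(record_1, record_2)  # on Drive 3

# Drive 2 has failed: Record 1 and the check sum are enough to recreate Record 2.
assert recover_block([record_1, None, check_sum]) == record_2
```

Passing a stripe with two missing blocks raises ValueError, mirroring the point that a second drive failure cannot be absorbed.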
Grown Sector Media Errors
Let's assume there is a read request for a file. As the drive attempts to read this data, it
determines there is a bad sector within Record 1 of Disk 1, as pictured below. If the media error
is minor, the information is corrected or remapped by the drive using the drive ECC information,
all of which is transparent to the RAID array. However, what if the disk cannot recreate the
information from the ECC information on the drive? Is the data then lost, as it would have been
without RAID support? In this case, IBM RAID adapters provide the additional capability to recognize
the fault and re-create the data from redundant information stored on other drives. For example,
Record 1 in the diagram will be corrected from data stored in Record 2 on Drive 2 and check sum
information on Drive 3. The RAID adapter then requests that Record 1 be rewritten; the drive
will remap the bad sector elsewhere on the drive, and Record 1 will have good data.
In this case, RAID 5 has increased the availability of the information by re-creating data that
otherwise would have been lost. However, the initial assumption is that this process has been
initiated by accessing this data on the drive. If this data is not accessed, this error will not be
detected. This can become a real problem if a catastrophic failure occurs before the data is
corrected.
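A minimal sketch of this read-and-repair path is shown below. It is a toy model under stated assumptions: XOR parity, one block per stripe per drive, and a hypothetical Drive class in which a rewrite stands in for the drive remapping the bad sector. None of these names come from the IBM adapters.

```python
from functools import reduce
from typing import List, Optional


class MediaError(Exception):
    """Raised when a drive cannot recover a sector even with its own ECC."""


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks (the assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class Drive:
    """Toy drive: one block per stripe; None marks an unreadable sector."""

    def __init__(self, blocks: List[bytes]) -> None:
        self.blocks: List[Optional[bytes]] = list(blocks)

    def mark_bad(self, stripe_no: int) -> None:
        self.blocks[stripe_no] = None

    def read(self, stripe_no: int) -> bytes:
        block = self.blocks[stripe_no]
        if block is None:
            raise MediaError("unrecoverable sector in stripe %d" % stripe_no)
        return block

    def write(self, stripe_no: int, data: bytes) -> None:
        # Rewriting models the drive remapping the bad sector elsewhere.
        self.blocks[stripe_no] = data


def read_with_repair(drives: List[Drive], drive_no: int, stripe_no: int) -> bytes:
    """Read a block; on a media error, rebuild it from the other drives and rewrite it."""
    try:
        return drives[drive_no].read(stripe_no)
    except MediaError:
        others = [d.read(stripe_no) for k, d in enumerate(drives) if k != drive_no]
        data = xor_blocks(*others)
        drives[drive_no].write(stripe_no, data)
        return data


array = [Drive([b"\x0f"]), Drive([b"\x33"]), Drive([b"\x3c"])]  # Record 1, Record 2, check sum
array[0].mark_bad(0)                                # grown media error in Record 1
assert read_with_repair(array, 0, 0) == b"\x0f"     # repaired only because it was read
```

Note that the repair is triggered only by the read itself. In this model, as in the text, a bad sector that nobody reads stays bad.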
Combination Failures
Consider the example shown above. Here, we have an undetected sector media error within
Record 1 of Disk 1. This error is undetected because it happens to have occurred within an
archived section of the user's database that is seldom accessed. Before this error is recognized
and corrected, we sustain a 'catastrophic' failure of Drive 2. So far, no data problems are
noticed. User requests for information other than Record 1 can still be serviced with RAID
protection and data recalculation. However, when Drive 2 is replaced and a rebuild is initiated,
the RAID controller will attempt to recalculate Record 2 from the failed Drive 2 by combining
Record 1 with the check sum data on Drive 3. At this point, the media sector error is discovered.
If the error is minor, the disk can re-create the missing information from its ECC data (as before)
and potentially remap the bad sector. However, if the error is too severe, the disk will not be able
to recover the data. The rebuild process cannot complete successfully because it does not have a
complete Record 1 to combine with the check sum data to rebuild the lost data on Drive 2. In this
case, the rebuild will skip that stripe and continue rebuilding the rest of the logical drive. Once
the rebuild has completed, a 'rebuild failed' message is displayed. The IBM ServeRAID and the
IBM ServeRAID II adapters will bring the rebuilt drive online and take the array out of 'critical'
mode. In order to protect data integrity, they will also block access to the damaged stripes of the
array. Data files covered by these damaged stripes will still report data errors and need to be
restored from a previous backup. This avoids the need for a full restore when a 'rebuild
failed' message is caused by only one or two bad stripes. In the case of non-ServeRAID
adapters, the customer will need to use the RAID configuration previously saved to a diskette in
order to bring the array back online.
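The skip-and-continue rebuild behaviour can be illustrated with a short sketch. It is a simplified model, not the ServeRAID algorithm: it assumes XOR parity, represents each surviving drive as a list of blocks in which None marks an unrecoverable sector, and the function name rebuild_drive is hypothetical.

```python
from functools import reduce
from typing import List, Optional, Tuple


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks (the assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_drive(
    survivors: List[List[Optional[bytes]]], n_stripes: int
) -> Tuple[List[Optional[bytes]], List[int]]:
    """Rebuild a replaced drive stripe by stripe from the surviving drives.

    A stripe with an unreadable survivor block cannot be recalculated; it is
    skipped and recorded so that access to it can be blocked afterwards.
    Returns (rebuilt blocks, stripe numbers that could not be rebuilt).
    """
    rebuilt: List[Optional[bytes]] = []
    blocked: List[int] = []
    for s in range(n_stripes):
        blocks = [drive[s] for drive in survivors]
        if any(block is None for block in blocks):
            rebuilt.append(None)   # nothing trustworthy to write for this stripe
            blocked.append(s)
            continue               # keep going with the rest of the logical drive
        rebuilt.append(xor_blocks(*blocks))
    return rebuilt, blocked


drive_1 = [b"\x01", None, b"\x03"]      # stripe 1 has the undetected media error
drive_3 = [b"\x10", b"\x20", b"\x30"]
new_drive_2, blocked = rebuild_drive([drive_1, drive_3], 3)

status = "rebuild failed" if blocked else "rebuild complete"
print(status, "blocked stripes:", blocked)
```

In this model only the files that touch the blocked stripes would need to be restored from backup; the remaining stripes of the replacement drive hold correctly recalculated data.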