The Mythical Hot-Spare - Tape/Disk/Optical Storage
A Hot Spare is a device that can be brought into a computer storage system while the system is running, without shutting the system down or interrupting service. This article discusses hard drives and RAID storage and, more importantly, the belief that Hot Spares keep your data safe.
A brief primer on RAID storage is needed to explain the concept of a Hot Spare and why RAID storage may not be as safe as you once thought!
RAID was originally defined as "Redundant Array of Inexpensive Disks" and is now more commonly expanded as "Redundant Array of Independent Disks". In either case, RAID is a combination of software algorithms and hardware devices that lets companies join multiple hard disk drives to gain capacity, performance, and safety. The balance among those three is set by choosing a RAID level; some typical levels are defined in Table 1.
I'm sure many of you have been through this dizzying matrix of choices before and have had to settle on one of these levels to manage your company's data. According to Salomon Smith Barney and Dataquest, 70% of the total RAID storage market runs on RAID 5. This is not surprising, since RAID 5 is the most cost-efficient, highest-capacity, reasonably safe RAID level available today.
Now the question: where does this Hot Spare thing fit into all of this? Hot Spares are combined with RAID systems to increase overall system reliability. This is done by adding one or more hard drives to an existing RAID system, but the added drive is never used until one of the active RAID drives fails. Of course, if we only have to purchase one more hard drive and we get double the safety, that's a great insurance policy, right? And if purchasing a couple of drives pushes this safety margin up even more ... that's great, right?
Most companies that store their data on RAID 5 systems appear to agree with this idea: 75% of the RAID 5 storage systems running today have one or more Hot Spares standing by to provide this insurance ...
The Myth: Hot Spares do not provide instant insurance! If a hard drive fails and the Hot Spare comes into action there is a rebuild time. And with today's drives this rebuild time represents a significant opportunity for disaster!
This was never much of a problem in the past; why now? Hard drives used to be much smaller, so it didn't take very long to copy the "safety data" back onto a new drive (the Hot Spare). But over the last six years, drive capacity has roughly doubled every year while transfer speed has grown far more slowly. As a result, each doubling of capacity has nearly doubled the time it takes to rewrite an entire drive.
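To make the rebuild concrete, here is a minimal Python sketch of how a single-parity (RAID 5 style) array could regenerate a failed drive onto a Hot Spare. It assumes plain XOR parity over equal-sized blocks; the function names (`xor_blocks`, `rebuild_failed_drive`) and the tiny three-drive layout are illustrative and not taken from any particular controller. The point is that every byte of every surviving drive has to be read to fill the spare, which is why rebuild time tracks drive capacity.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_failed_drive(surviving_drives):
    """Recompute the lost drive's contents from all surviving drives."""
    return xor_blocks(surviving_drives)


# Hypothetical stripe: two data drives plus one parity block.
data_a = b"HOT "
data_b = b"SPAR"
parity = xor_blocks([data_a, data_b])

# Drive B fails; the Hot Spare is filled from the survivors.
hot_spare = rebuild_failed_drive([data_a, parity])
```

Real controllers rotate parity across the drives and work stripe by stripe, but the amount of data that must be read and written is the same.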
Table 2 shows hard drive capacity versus performance, along with the average time it would take to rebuild each drive with new data. The performance data was retrieved from an excellent source, www.storagereview.com.
Table 2 shows that the rebuild time for a Hot Spare in 1996 was roughly two to nine hours, which was by no means perfect, but a company's data was at risk for less than a day. With today's drives, the same company's data would be at risk for up to 13 days, just short of two full weeks. The total amount of data at risk has also doubled every year, to the point that the RAID 5 array may actually contain every piece of data the company owns.
Imagine all of your company's data on 14 hard drives: 12 for actual storage, plus one each for parity and the Mythical Hot Spare. Using today's drives, that represents approximately two terabytes of capacity. This seems like a great system; most of the storage industry says this is the way to go, and because you bought that Hot Spare you have that extra-safe insurance, right? Well, not quite. If one morning at 10:00 a.m. you lost one hard drive under RAID 5, your data would still be intact and your employees would still be able to use the storage system.
However, the storage system is now running in degraded mode, meaning that if you lose any other drive before your Hot Spare finishes rebuilding, you will lose the entire two terabytes of data. Worse yet, the system will be running in degraded mode for up to thirteen days, depending on how much new data is written and how heavily the system is used during the rebuild.
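How risky is that window? The rough Python sketch below assumes drive failures are independent and follow a constant failure rate (an exponential model), and it assumes a purely hypothetical annualized failure rate of 3% per drive. Neither the model nor the 3% figure comes from this article. With 12 surviving drives, any one of which would take the array down, it estimates the chance of a second failure before the rebuild ends.

```python
import math


def second_failure_probability(surviving_drives, rebuild_days, annual_failure_rate):
    """Chance that at least one surviving drive fails before the rebuild ends."""
    # Convert the annualized failure probability to a constant hazard rate per day.
    daily_rate = -math.log(1.0 - annual_failure_rate) / 365.0
    return 1.0 - math.exp(-surviving_drives * daily_rate * rebuild_days)


ASSUMED_AFR = 0.03  # hypothetical 3% annualized failure rate per drive
SURVIVORS = 12      # 13 array drives minus the one that failed

risk_short = second_failure_probability(SURVIVORS, 0.36, ASSUMED_AFR)   # 1996 row of Table 2
risk_long = second_failure_probability(SURVIVORS, 13.26, ASSUMED_AFR)   # 2002 row of Table 2
print(f"0.36-day rebuild:  {risk_short:.3%}")
print(f"13.26-day rebuild: {risk_long:.3%}")
```

Under these assumptions the risk grows roughly in proportion to the length of the rebuild window, so a rebuild that takes about 37 times longer leaves the array exposed to about 37 times the risk.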
Hot Spares do not protect against more than one drive failing at the same time or within a short period of each other, nor do they protect against someone accidentally removing the wrong drive when they meant to remove the already dead drive.
Table 1
RAID      Basic         Total    Drive            Relative          Drives
Level     Description   Drives   Redundancy (1)   Performance (2)   Available
RAID 0    Striping      18       0                8                 18
RAID 1    Mirroring     2        1                1.5               1
RAID 5    Parity        18       1                16                17
RAID 10   RAID 1 + 0    18       1 *              13.5              9
RAID 15   RAID 1 + 5    18       3 *              11.5              8
(1) Drive Redundancy is the maximum number of random drive failures
before catastrophic data loss. Mirrored combinations can lose more,
as long as the failed drives are not in the same mirrored set.
(2) Relative Performance is the average read/write contribution
of all drives minus read/write/verify penalties.
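The "Drives Available" figures in Table 1 follow from simple arithmetic on the drive count. Here is a small Python sketch, assuming the usual capacity rules for each level (one drive's worth of parity for RAID 5, half the drives for mirroring, and RAID 15 as RAID 5 across mirrored pairs). The function name `drives_available` is illustrative.

```python
def drives_available(level, total_drives):
    """Number of drives' worth of usable capacity for a RAID level."""
    if level == "RAID 0":
        return total_drives              # striping only, no redundancy
    if level == "RAID 1":
        return total_drives // 2         # every drive has a mirror copy
    if level == "RAID 5":
        return total_drives - 1          # one drive's worth of parity
    if level == "RAID 10":
        return total_drives // 2         # striped mirrored pairs
    if level == "RAID 15":
        return total_drives // 2 - 1     # RAID 5 across mirrored pairs
    raise ValueError(f"unknown RAID level: {level}")


# Total drive counts as listed in Table 1.
table_1 = {"RAID 0": 18, "RAID 1": 2, "RAID 5": 18, "RAID 10": 18, "RAID 15": 18}
usable = {level: drives_available(level, n) for level, n in table_1.items()}
```

The same rule covers the 14-drive example in the text: 13 drives in the RAID 5 array leave 12 for data, and 12 drives of 180 GB each come to 2,160 GB, or approximately two terabytes.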
Table 2
                         Average   Rebuild Time (1)   Rebuild Time (2)
Year         Drive       Speed     No                 With
Introduced   Capacity    (MB/s)    Overhead           Overhead
1996         1 GB        6.7       2.11 hrs           0.36 days
1997         4 GB        8.7       6.51 hrs           1.12 days
1998         9 GB        12        10.63 hrs          1.82 days
1999         18 GB       22        11.59 hrs          1.99 days
2000         36 GB       30        17.00 hrs          2.92 days
2001         73 GB       44        23.50 hrs          4.03 days
2002         180 GB      33        77.27 hrs          13.26 days
(1) Write time with no overhead assumes the RAID controller is doing nothing
else but rebuilding the data to the Hot Spare.
(2) Write time with overhead assumes the RAID controller is handling
moderate to heavy user traffic while rebuilding.
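The raw time to write a full drive is simply capacity divided by sustained speed. The Python sketch below computes that figure for each row of Table 2 and compares it with the published no-overhead rebuild time. The published times come out consistently about 51 times larger than the raw write time. The article does not say what that multiplier represents, so it is treated here as an observation about the table, not a rule. The sketch assumes 1 GB = 1,000 MB.

```python
# (year, capacity in GB, speed in MB/s, published no-overhead rebuild hours)
TABLE_2 = [
    (1996, 1, 6.7, 2.11),
    (1997, 4, 8.7, 6.51),
    (1998, 9, 12.0, 10.63),
    (1999, 18, 22.0, 11.59),
    (2000, 36, 30.0, 17.00),
    (2001, 73, 44.0, 23.50),
    (2002, 180, 33.0, 77.27),
]


def raw_write_hours(capacity_gb, speed_mb_s):
    """Hours to stream a full drive at its sustained speed (1 GB = 1,000 MB)."""
    return capacity_gb * 1000 / speed_mb_s / 3600


ratios = []
for year, capacity_gb, speed, published_hours in TABLE_2:
    raw = raw_write_hours(capacity_gb, speed)
    ratios.append(published_hours / raw)
    print(f"{year}: raw {raw:6.2f} h, published {published_hours:6.2f} h, "
          f"ratio {ratios[-1]:.1f}")
```

Whatever the multiplier stands for, the trend is the point: across the table, capacity grew 180-fold while speed grew only about five-fold, so rebuild time had to climb.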
Kris Land is the CTO at Land-5 Corp. (San Diego, CA).