Insuring The Reliability Of Fibre Channel RAID Storage
A major benefit of Storage Area Networks is fast "any to any" server or client access to RAID storage. In a mission-critical environment, this places emphasis on ensuring high availability of not only the data access paths, but also the RAID storage system itself.
Fortunately, standardized Fibre Channel layers define media and interface characteristics, as well as specifying highly reliable transmission protocols with low bit-error rates. SAN fabrics have evolved to include redundancies among switches and access paths, providing failover insurance against hardware problems.
From a hardware perspective, RAID systems typically include such high-availability features as redundancies, hot-swappability, and thermal management to dissipate heat build-up. Fibre Channel RAID systems with dual-loop architectures even provide protection against internal disk channel failures. Alarm systems and remote management capabilities further contribute to the reliability of today's RAID storage systems.
The storage industry has embraced traditional RAID levels (1, 3, 5) and variations thereof (0+1, 1+5, 6, etc.) as means of protecting critical information against the likelihood of disk drive failures. Typically, however, this protection is limited to a single drive failure (RAID 3 or 5). At most, protection against three concurrent inoperable drives is achieved, but at the cost of expensive mirroring. Even exotic arrays of this nature have limitations on the conditions under which drive failures can be sustained.
LAND-5 has developed patented algorithms that allow a disk RAID array consisting of "N" drives to sustain operations even in the event of "M" drive failures, where 1 <= M < N. Called "eRAID," this breakthrough technology can be implemented with far fewer disk drives than mirroring while also yielding higher performance and enhanced reliability.
With the growth of mission-critical information requiring twenty-four hour access, the reliability of storage systems is paramount. Downtime is extremely costly. Customers, vendors, employees, and prospects can no longer conduct essential business or critical operations. There is a "lost opportunity" cost to storage failures, as well, in terms of business lost to competitors. Well-documented studies place the cost of downtime in the tens of thousands (or even millions) of dollars per hour.
Consider the recent problems with eBay, a major online auction Website with 2 million customers that suffered extended equipment crashes. The company, which saw its stock value slide by almost 20 percent, lost significant revenue over the three-day period--eBay warned that the latest 22-hour outage would knock between $3 million and $5 million off Q2 sales. However, the greater damage could be to eBay's reputation, especially if it continues to be plagued by outages. In a recent survey of consumers, Jupiter Communications found that 46 percent of online consumers leave a preferred site if they experience technical or performance problems.
The need for large amounts of reliable online storage is fueling demand for fault-tolerant technology. According to International Data Corporation, the market for disk storage systems last year grew by 12 percent, topping $27 billion. More telling than that figure, however, is the growth in capacity being shipped, which grew 103 percent in 1998. Much of this explosive growth can be attributed to the space-eating demands of endeavors such as year 2000 testing, installation of data-heavy enterprise resource planning applications, and the deployment of widespread Internet access.
The rising tide of Storage Area Networks (SAN) is fueled by the prospect of providing "any to any" high-performance access by networked servers and clients to critical information on a continuous basis. RAID storage is the underlying foundation of SAN technology, necessary to ensure that mission-critical data is available when needed. Access to online storage on a 24x7 basis is essential to most SAN configurations. Thus, the reliability of Fibre Channel RAID storage is, in a sense, the Achilles heel of a SAN fabric.
In examining SAN storage, attention is quickly focused on three elements that are essential to reliability:
* The error checking scheme inherent in the transmission protocol
* The reliability of the RAID storage unit itself
* The ability of the RAID storage system to withstand multiple drive failures
This article discusses each topic in turn. Industry answers exist for the first two subjects, but the storage community is still applying expensive "band-aids" in an attempt to overcome the inevitability of disk drive failures in large storage arrays.
Fibre Channel has five layers: FC-0 through FC-4. The FC-0 layer defines the media and interface characteristics of full-duplex serial links between points. It lets Fibre Channel scale its signaling rates and define conforming cabling and connectors without affecting upper level protocols. As such, the FC-0 layer facilitates high-performance availability to Fibre Channel storage systems.
The FC-1 layer defines transmission protocols. It defines how FC-0 signals are patterned to carry data and how port-to-port links are initialized and, if necessary, recovered from error conditions. Within a Fibre Channel network, the transmitter keeps track of the number of binary 0s and 1s. Likewise, the receiver also tracks the running disparity of 0s and 1s to detect any errors. Fibre Channel also uses a control character to synchronize word boundaries. With a specified bit-error rate of less than one bit error in 10^12 bits, the FC-1 layer provides low-cost, reliable transmit-and-receive circuits and a transmission protocol that is independent of media, distance, or data rate.
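The disparity tracking described above can be illustrated with a minimal sketch. It assumes the 8b/10b transmission code used at the FC-1 layer, in which every 10-bit code group carries either equal numbers of 1s and 0s or an imbalance of exactly two, and the running disparity alternates between -1 and +1. The function name, the use of bit strings, and the group-level check are simplifications chosen for illustration; a real receiver evaluates disparity per sub-block and also validates each group against the code tables.

```python
def first_disparity_violation(code_groups, initial_rd=-1):
    """Return the index of the first code group that breaks the
    running-disparity rule, or None if the stream looks consistent.

    code_groups: iterable of 10-character strings of '0'/'1'.
    initial_rd:  assumed starting running disparity (-1 or +1).
    """
    rd = initial_rd
    for index, group in enumerate(code_groups):
        if len(group) != 10 or set(group) - {"0", "1"}:
            return index                      # malformed group
        ones = group.count("1")
        disparity = ones - (10 - ones)        # 1s minus 0s
        if disparity == 0:
            continue                          # balanced group, RD unchanged
        if disparity not in (2, -2):
            return index                      # no valid group is this lopsided
        if (disparity > 0) == (rd > 0):
            return index                      # imbalance repeats in same direction
        rd = -rd                              # imbalance flips the running disparity
    return None


# The two forms of the K28.5 special character, commonly used for alignment.
K28_5_RD_NEG = "0011111010"   # sent when running disparity is negative
K28_5_RD_POS = "1100000101"   # sent when running disparity is positive

print(first_disparity_violation([K28_5_RD_NEG, K28_5_RD_POS]))
print(first_disparity_violation([K28_5_RD_NEG, K28_5_RD_NEG]))
```

The first stream alternates correctly and reports no violation. The second repeats a positively imbalanced group and is flagged at index 1, which is the kind of error a receiver can catch without any checksum.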
Together, the FC-0 and FC-1 layers provide a solid foundation for reliable, high-performance access to Fibre Channel storage systems configured in a switched fabric network. Along with the other layers, they also present a standard, open architecture for interfacing Fibre Channel storage systems, supporting a competitive atmosphere that benefits the consumer.
RAID SYSTEM RELIABILITY
Most enterprise-level storage is "mission critical" these days. Corporate Intranets are the lifeblood of employees, vendors, and contractors. Presenting an appealing Web site to customers and prospects on a 24x7 basis is essential to competitive survival. Online databases and consumer activities require storage systems that are impervious to normal fatigue or thermal failures.
Reliable storage systems are crucial for SANs. "High availability" is implemented through redundancy of critical components, hot-swappability in the event of component failure, and management of heat build-up.
Heat, or thermal energy, is transferred from one body to another by virtue of a temperature differential. In short, heat flows from a high-temperature area to a lower-temperature area. If there is no means of removing heat, then a steady state condition will eventually be reached wherein the internal temperature of a system enclosure equals that of its hottest element.
In general, there are three methods, or modes, of heat transfer: conduction (transfer of heat through a solid caused by molecular oscillations), convection (transfer of heat from the surface of a solid to the surrounding air), and radiation. LAND-5's PolAIRis, a thermal management system, focuses on removing this heat by using strategically placed conduits and fans to direct an optimized volume of airflow through its system enclosures.
Fast and highly integrated circuits generate large amounts of heat. Although a typical ECL gate dissipates less than 10 milliwatts, 10,000 of these gates integrated onto a chip can bring total power consumption easily up to 20-30W. At high temperatures, corrosion mechanisms accelerate and stresses are generated at the material interfaces because of different expansion coefficients. As a result, solder and wire bonds fail. In addition, CMOS switching speed degrades as the temperature increases. To eliminate negative temperature effects, heat must be removed rapidly from semiconductor devices.
In computer equipment, disk drives, processors, ASICs, and power supplies tend to be the hottest components. Disk drives operate at high Revolutions Per Minute (RPM) and quickly begin to generate considerable heat, the leading cause of disk drive failure. High-performance CPUs typically have a dedicated fan and a heat sink to dissipate heat build-up. However, most dedicated I/O subsystems now contain powerful processors, as well (such as Intel's i960), and these generate considerable heat that must be discharged by the enclosure's thermal management. Likewise, most ASICs become local hot spots within an enclosure, endangering surrounding components unless their heat is rapidly dissipated. Power supplies, critical to continuous system operation, also quickly fail without adequate cooling.
Thus, it is clear that disk drive failures are often related to heat build-up within an enclosure. Generally speaking, disk drive reliability drops sharply as internal enclosure temperature rises above 45°C (113°F). A reduction in temperature of five degrees Centigrade can significantly improve disk drive reliability, by 15% to 40%, depending on the actual inside cabinet air temperatures.
To address the requirement for extreme system uptime, better RAID storage systems implement sophisticated thermal management using a reverse air cooling process, multiple fans, and a chassis design that creates a "wind tunnel" effect, drawing cool air across heat-generating components. Airflow is controlled to conform to one direction, thereby maximizing the cooling effect of multiple fans. Heat dissipation is further aided by designing the system to reduce heat sources. As a final measure, temperature monitoring, along with visual and audio alarms, is required.
Even with excellent thermal management, hardware failures are inevitable in any storage system. Hence, it is essential to design a high-availability RAID system with redundancies in order to ensure that storage access is not interrupted whenever a failure occurs.
The most common redundancy is dual power supplies. If one power supply fails, the remaining power supply should be sufficient to allow continued system operation for an indefinite period. Added safety is achieved by designing the power system to include automatic load balancing, thereby prolonging the life cycle of each power supply. Having separate power cords allows each power supply to be plugged into a separate circuit, enhancing protection against the failure of an electrical system within the building. Adding a UPS buys time in the event of a complete power outage.
As discussed, aggressive system cooling is essential to continuous operation. Hence, redundant fans are critical. Many RAID systems have unfortunately not learned this lesson and their users suffer accordingly.
Redundant RAID controllers have two benefits. They provide a fail-over capability in case one fails. Moreover, in an "active-active" mode, the controllers can share the workload, thereby enhancing system performance.
Channel failures do occur. A truly mission-critical RAID system compensates for this possibility by having built-in redundancies in the form of A-B loops for each internal channel. If one loop fails, the remaining loop kicks in to ensure continued operations. In the future, RAID controllers will be able to take advantage of dual loop architectures to significantly increase transfer rates through "active-active" operation. For example, the new LAND-5 ICEbox FC 2500 RAID storage system has three disk channels, each supported by independent A-B loop access. Now providing up to an aggregate 300MB/sec transfer rate, the potential exists to double performance to 600MB/sec when controllers that support simultaneous dual-loop data access become available.
Having at least a global hot-spare disk drive is a universal requirement for a mission-critical RAID system. More sophisticated systems also support local hot spares.
Hot Swappability for Critical Components
Mission-critical storage systems demand the ability to perform repairs without interrupting operations. Thus, major system components that are the most likely to fail over time must be "hot swappable." Local personnel must be able to access and swap out a failed disk drive, fan, or power supply with minimal effort. Better systems with redundant RAID controllers also support replacement of a failed controller "on the fly."
RAID System Architecture Considerations
Two backplane architectures are available in commercial RAID systems--active and passive. Both support an A-B loop architecture. A passive backplane allows hot-swappability of controller and channel interface boards. However, its architecture increases the design complexity and cost. Active backplanes allow channel segmentation, a performance boost. They are also less costly to design and build. The downside is that if a channel fails, then the entire backplane must be replaced.
PROTECTION AGAINST MULTIPLE DISK DRIVE FAILURE
RAID storage configurations have proven to be the best hedge against the possibility of a single drive failure within an array. Each RAID level, however, has its pluses and minuses:
* While RAID 0 delivers high performance, it cannot sustain even a single drive failure because there is no parity information or data redundancy.
* Although it is the most costly option, mirroring data on separate drives (RAID 1) means that if one drive fails, critical information can still be accessed from the mirrored drive. Typically, RAID 1 involves replicating all data on two separate "stacks" of disk drives on separate SCSI channels, incurring the cost of twice as many disk drives. There is a performance impact, as well, since data must be written twice, consuming both RAID system and possibly server resources.
* RAID 3 and RAID 5 allow continued (albeit degraded) operation by reconstructing lost information "on the fly" through parity checksum calculations. Adding a global hot spare provides the ability to perform a background rebuild of lost data.
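The parity reconstruction used by RAID 3 and RAID 5 can be shown with a short sketch. It assumes a single stripe of four small data blocks plus one XOR parity block; the block contents and the helper names `xor_blocks` and `rebuild` are made up for illustration and do not come from any particular controller.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe, lost_index):
    """Recompute the block at lost_index from every surviving block."""
    survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Four data blocks (one per drive) and the parity block derived from them.
data = [b"DISK", b"RAID", b"FIVE", b"DEMO"]
parity = xor_blocks(data)
stripe = data + [parity]

# Lose any single drive, data or parity, and its block can be recomputed.
print(rebuild(stripe, 2))            # b'FIVE'
print(rebuild(stripe, 4) == parity)  # True
```

Because the parity block is the XOR of all data blocks, XORing the survivors cancels everything except the missing block. The same property explains the single-failure limit: with two blocks missing, one equation cannot recover two unknowns.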
With the exception of costly RAID 1 (or combinations of RAID 1 with RAID 0 or RAID 5) configurations, there has been no solution for recovering from a multiple drive failure within a RAID storage system. Even the exceptions sustain multiple drive failures only under very limited circumstances. For example, a RAID 1 configuration can obviously lose multiple (or all) drives in one mirrored stack, as long as not more than one disk fails in its mirrored partner. Combining striping and parity within mirrored stacks buys some additional capabilities, but is still subject to these drive-failure limitations.
Why would a system need protection against more than one drive failure at a time? Isn't the reliability of today's disk drives so high that the chances of a multiple drive failure are remote?
Disk drive manufacturers publish Mean Time Between Failure (MTBF) figures as high as 800,000 hours (91 years). Yet, as one examines these claims, disk drive manufacturers readily admit that such claims are unrealistic. In fact, the practical life of a disk drive is five to seven years of continuous use. Information Technology managers can painfully testify that disk drives fail with great frequency. That's why all companies place emphasis on storage backup and there is such a large market for tape systems.
It is clear that the likelihood of a drive failure increases as more drives are added to a disk RAID storage system. For example, a terabyte of RAID 5 storage consisting of fifty-eight 18GB disk drives can expect a drive to fail every 44 days! Moreover, when one drive fails, the statistical odds of a second drive failing increase dramatically and if two drives fail, the odds of a third failure jump again. In short, the more drives configured in a RAID storage system, the greater is its potential for suffering multiple drive failures.
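The arithmetic behind these figures can be sketched as follows. The sketch assumes independent drives with a constant failure rate, so the expected time between failures in an array is the per-drive figure divided by the number of drives. The article does not state which per-drive figure produced the 44-day result; using the seven-year practical life mentioned earlier is an assumption that happens to reproduce it. Function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def hours_to_years(hours):
    """Convert a duration in hours to years of continuous operation."""
    return hours / HOURS_PER_YEAR


def array_days_between_failures(drive_mtbf_hours, drive_count):
    """Expected days between drive failures somewhere in the array,
    assuming independent drives with a constant failure rate."""
    return drive_mtbf_hours / drive_count / 24


PUBLISHED_MTBF_HOURS = 800_000               # manufacturer figure quoted above
PRACTICAL_LIFE_HOURS = 7 * HOURS_PER_YEAR    # assumption: seven-year practical life
DRIVES = 58                                  # fifty-eight 18GB drives, about 1TB

print(round(hours_to_years(PUBLISHED_MTBF_HOURS), 1))                      # 91.3
print(round(array_days_between_failures(PRACTICAL_LIFE_HOURS, DRIVES)))    # 44
print(round(array_days_between_failures(PUBLISHED_MTBF_HOURS, DRIVES)))    # 575
```

Even with the optimistic published figure, a 58-drive array would expect a failure roughly every year and a half. With the practical figure, the interval shrinks to about six weeks.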
Also, disk drives configured within a RAID storage system can be of different ages, including a mixture of new and older drives. This profile increases the odds of a multiple drive failure.
The consequences of a multiple-drive failure can be devastating. Typically, if more than one drive fails, or a service person accidentally removes the wrong drive when attempting to replace a failed drive, the entire RAID storage system is out of commission. Access to critical information is not possible until the RAID system is re-configured, tested, and a backup copy restored. Transactions and information written since the last backup may be lost forever.
Extensive research and development by LAND-5 has resulted in a set of software and hardware algorithms that augments RAID storage by performing automatic, transparent recovery from multiple drive failures without interrupting ongoing operations. Called "eRAID," these patented algorithms allow users to select the degree of disk-loss insurance desired. Continued operations are possible even in the event of N-1 drive failures. Moreover, because these algorithms have exceptionally fast computational speeds, storage transfer rate performance actually increases under eRAID while adding virtually unlimited data protection.
eRAID consists of a series of software matrix array formulas. It involves breakthrough algorithms for accomplishing XOR calculations (which are the basis of RAID 5). eRAID dramatically alters the reliability of RAID storage by circumventing previous limitations on the number of permissible drive failures. With eRAID, all but one drive can fail (assuming sufficient capacity) and users will still have access to critical information.
HOW DOES ERAID DIFFER FROM TRADITIONAL RAID?
Today, the ultimate protection for critical information is accomplished through RAID 1 (mirroring), overlaying RAID 5 (striping with parity), and then adding a global hot spare. For example, if user data consumes four disk drives, then reliability is improved by replicating this data on a second "stack" of four drives. Within each stack, however, losing just one drive would make the whole database useless. To further enhance reliability, each mirrored stack can be configured as an individual RAID 5 system. Since implementing parity requires an additional drive, user data and parity information are now striped across five drives within each stack. This provides protection against the loss of a single drive within each stack. So, from an original database that required just four drives, this RAID configuration has grown to include:
* Four drives for the original data
* Four drives for the mirrored data
* One parity-drive (equivalent) for each stack (Two total)
* One global hot spare (standby drive on which data can be rebuilt if a drive fails)
This architecture now requires a total of eleven disk drives (Fig 1). Thus, seven drives have been added to protect data on the four (original) drives. This configuration can recover from a failed drive in either stack. Even if all the drives in one stack failed, the remaining drives in the surviving stack would still provide access to critical data. However, in this case, only one drive failure in the remaining stack could be tolerated. Overall, if multiple drive failures occur within each stack, access to the database is lost. Barring a total stack failure, its maximum protection is against the failure of three drives, but in a limited fashion (maximum of two failures in any one stack).
Looking at the same example using eRAID to achieve equal protection against multiple drive failure (Fig 2), protection against three-drive failure is achieved at less cost and overhead:
* Requires only eight disk drives compared to 11 for traditional RAID
* Requires less administrative overhead
Hence, if these disk drives cost $1,000 each, the eRAID solution saves $3,000 while providing better insurance, since any three random drives can fail and the system will continue to properly function. Many databases rely strictly upon RAID 5 with striping and parity for protection against drive failure because RAID 1 solutions are so costly. However, RAID 5 supports continued operation only in the event of a single inoperable drive at any one moment. Losing two or more drives under RAID 5 brings operations quickly to a halt. For the cost of adding just one more drive, eRAID mitigates the risk of data loss by providing the means to sustain up to two drive failures.
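The drive counts in this comparison can be tallied in a few lines. The sketch below simply restates the article's numbers: the eleven-drive figure is built from the components listed for the mirrored RAID 5 configuration, while the eight-drive eRAID figure is taken as given rather than derived. Variable and function names are illustrative.

```python
DRIVE_COST_DOLLARS = 1_000   # per-drive price used in the article's example
DATA_DRIVES = 4              # drives needed for the original database


def mirrored_raid5_drive_count(data_drives):
    """Drives for two mirrored stacks, each with one parity-drive
    equivalent, plus one global hot spare."""
    original = data_drives
    mirror = data_drives
    parity = 2               # one parity-drive equivalent per stack
    hot_spare = 1
    return original + mirror + parity + hot_spare


traditional_drives = mirrored_raid5_drive_count(DATA_DRIVES)
eraid_drives = 8             # figure stated in the article, not derived here

savings = (traditional_drives - eraid_drives) * DRIVE_COST_DOLLARS
print(traditional_drives, eraid_drives, savings)   # 11 8 3000
```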
LAND-5 eRAID, however, can support continuous operation even in the event several drives fail. Thus far, LAND-5 has successfully tested recovery when 50 percent of the disk drives fail. With eRAID, network administrators can manually assign the level of desired drive-failure protection. In short, eRAID allows the user the flexibility of selecting the level of drive-failure protection to fit specific needs.
The tangible cost of eRAID is that an additional parity drive equivalent is consumed for each incremental protection level. For instance, if a user desires to protect a 100-drive storage system against the possibility of two concurrent drive failures, then the equivalent of two disk drive capacities will be allocated for eRAID parity-related data. Thus, while users can still read from 100 drives, they can write to only 98 drives, reducing usable storage capacity by two percent. Hence, protection from (say) five concurrent drive failures reduces data storage capacity by only five percent. As any Information Technology Manager will testify, this is a small price to pay for dramatically enhanced storage reliability.
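The capacity trade-off described above reduces to a simple ratio. The following sketch assumes, as the text states, that each additional level of protection consumes one drive's worth of capacity; the function name is hypothetical and nothing here reflects how eRAID actually lays out its parity data.

```python
def capacity_overhead(total_drives, protected_failures):
    """Fraction of raw capacity consumed by parity-related data when
    one drive equivalent is set aside per tolerated failure."""
    if not 1 <= protected_failures < total_drives:
        raise ValueError("need 1 <= protected_failures < total_drives")
    return protected_failures / total_drives


for failures in (1, 2, 5):
    overhead = capacity_overhead(100, failures)
    usable = 100 - failures
    print(f"{failures} failure(s): {usable} usable drives, {overhead:.0%} overhead")
```

For comparison, the mirrored configuration discussed earlier spends seven of eleven drives on protection, an overhead of roughly 64 percent.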
Aside from protection against multiple drive failures, some significant benefits of eRAID are:
* eRAID supports continued operations even in the event of a total SCSI channel failure, whereas this would be catastrophic under traditional RAID 3 or 5.
* In a traditional RAID 1 (or 0+1, 5+1, RAID 6, etc.) storage configuration, with (say) data mirrored on two independent SCSI channels, all data could be lost in one channel and operation would continue. However, if more than one drive failure concurrently occurs in both mirrored channels, then the entire storage system becomes inoperable. With eRAID, on the other hand, random multiple drive failures are sustainable.
Kris Land is the chief technical officer of LAND-5 Corporation (San Diego, CA).