Tuesday, October 10, 2006

From: Computer Technology Review Date: June 1, 2003 Written by: Kris Land

The emergence of ATA drives as a serious alternative to enterprise storage holds the promise of significantly reducing storage acquisition costs. This is amplified by the advent of Serial ATA, which brings features like hot-pluggability, CRC for all communications (including data, commands and status), and thin flexible cabling to further narrow the gap between ATA and more expensive "server" class drives. However, in order to fully realize the advantages of the ATA platform for enterprise storage, new software technologies are required to guarantee the reliability and maximize the performance of the platform.

Specifically, RAID technologies currently used with SCSI and Fibre Channel storage implementations are ill-suited for use in the ATA arena. The pervasive use of write-back caching and the high cost of NVRAM-based board solutions negatively impact the reliability and price advantages of the ATA platform, introducing the possibility of corruption and data loss and negating much of the cost benefit for the enterprise user. Similarly, the clear attractiveness of RAID Level 5 for large capacity storage is all but eliminated because existing methods for implementing low-cost RAID-5 systems have severe limitations in performance or reliability. On the ATA platform, this results in the undesirable flight to RAID-10 for most types of workloads and directly reduces the cost benefit of the platform.

In order for ATA-based storage to achieve its full potential in the enterprise, it is necessary to understand the limitations of today's hardware-assisted RAID solutions as these attempt, imperfectly, to address the unique characteristics of the ATA drive platform. Particular attention is placed on RAID Level 5, which is the most promising RAID type given its natural application to the larger capacity storage applications that will dominate networked ATA adoption.

ATA Characteristics
ATA disk drives emerged in the late 1980s, as desktop computers began their ascent into the mainstream of the IT universe. ATA is an acronym that stands for "AT Attachment," a reference to the IBM PC/AT that served as the de facto reference specification for the desktop since its introduction in the early 1980s. Though synonymous with IDE (Integrated Drive Electronics), the ATA designation is the subject of various ANSI specifications that have evolved the platform over time and is generic to the category.

Since their initial shipments in 1986, ATA drives have grown substantially in volume. Today, ATA drive shipments outnumber SCSI drive shipments by a factor of 6 to 1. And they outnumber Fibre Channel drive shipments by a factor of 10 to 1. Their volume differences are accounted for by the continuing centrality of ATA's role in the highest volume segment of the PC universe, the desktop computer. Because of their substantial volume advantages, they are subject to far more significant price competition than higher end drive platforms, and on average cost between 3 and 4 times less than SCSI or FC drives. The result has been an increased desire by IT end users to employ ATA drives in enterprise data settings as opposed to using them exclusively in desktop PC devices and workstations.

As engineered products, ATA magnetic disk drives harness the same basic technologies found in higher-end drives that employ different interfaces, the most common of which are SCSI and Fibre Channel. They employ platters, actuators and a variety of micromotors. As such, ATA drives take advantage of the rapid advances in these component technologies that all disk drive manufacturers are continuously exploiting. Ranging from greater volumetric densities to enhancements in seek performance, ATA drives leverage the same basic technologies as SCSI and FC drives.

However, ATA drives do have significant differences from higher end drive platforms, and these differences must be addressed if the ATA platform is to emerge as an enterprise-class storage platform. The first major difference is that ATA drives are subject to different sorting criteria than higher end platforms. Quality control is relaxed because of the relative tradeoff in profitability and defect rates. Instead of the 1 percent component rejection levels seen in SCSI drives, ATA drives are typically subject to a less demanding 5 percent rejection rate. The other differences between ATA and SCSI flow from their different end use targets. Because they are intended for desktop computers, ATA drives use different motors that generate less heat and ambient noise than SCSI. They are also slower than their SCSI counterparts on an RPM basis, a consequence both of the desktop design goals of minimizing heat and noise and of the manufacturers' interest in preserving SCSI's performance advantage at similar capacity levels. That is, drive manufacturers frequently release similar capacity SCSI and ATA drives with higher RPMs available first in the SCSI device.

To compensate for decreased performance, ATA drive manufacturers have employed a variety of techniques to enhance the ATA platform. The most important of these techniques is called Write Back Caching. Write Back Caching involves the use of small memory chips contained in the drive electronics that buffer data transfers to the ATA disk. By using these memory modules, which are typically deployed in 2MB to 8MB configurations, the ATA drive can signal the completion of writes more quickly than if it had to wait until that data was completely transferred to the disk media. However, even as write back caching provides a performance boost, it introduces a series of reliability concerns that contribute to the failure of the drive platform to achieve enterprise-class acceptance. These and other obstacles to reliability in the ATA drive platform will be discussed in detail later.

One of the most significant developments in the ATA world has been the evolution of the platform from a parallel bus architecture to a serial one. This evolution was undertaken to accelerate the use of ATA in networked storage environments and it has proven to be a crucial step in raising the awareness of the platform in multi-drive configurations. Technically, the Serial ATA drive is a seven-wire replacement for the physical ribbon of parallel ATA with a variety of benefits for denser storage implementations. The most important of these include the cabling change (which facilitates better airflow and easier assembly) as well as the addition of capabilities like hot-pluggability and a point-to-point topology that enables full data-path switching. The first Serial ATA specification was completed in 2000 and drives supporting Serial ATA began initial production runs in the second half of 2002. Major research houses like IDC predict that Serial ATA will dominate the ATA platform within three years, rising to a 95-99 percent share of new drive shipments by the mid-2000s. In the area of networked storage, IDC further predicts the possibility of Serial ATA commanding at least 20 percent of entry-level servers by 2004.

Serial ATA Characteristics
Several years ago, the ANSI Parallel ATA specification was amended with the Ultra DMA protocol, which brought advanced CRC algorithms into the ATA world. These have been carried into the Serial product. While this inclusion has addressed low-level data transfer integrity issues, a new series of problems has surfaced that stands to pose the largest obstacle to ATA acceptance in the enterprise storage world. These problems center around the use of RAID technologies that have been largely tailored and refined through their application to multi-drive SCSI and Fibre Channel storage. As ATA begins to enter the multi-drive network storage world, enterprising vendors are attempting to apply legacy RAID strategies to multi-drive ATA installations but are achieving mixed results. Today, all hardware-assisted RAID technologies native to the ATA platform--as well as ascendant software RAID packages--fail to address key performance and reliability concerns that are unique to the ATA market. Unless these problems are addressed, it is unlikely that the ATA platform will break beyond the entry-level category that IDC and others envision for it.

RAID-5
RAID-5 is one of the methods for achieving higher performance and greater resilience to drive component failure that was originally developed by the U.C. Berkeley RAID team in the late 1980s and early 1990s under the auspices of principal investigators David Patterson, Randy Katz and their students. RAID is an acronym that refers to Redundant Array of Inexpensive Disks, and the original RAID project was conceived as a way to exploit the benefits of high-volume magnetic disk drives by using strings of lower cost drives together, in order to achieve the same benefits as more expensive storage configurations popular in the high-end systems of the day. The groundbreaking work of the RAID team and the industry acceptance that shortly followed have made RAID strategies and resultant technologies the ascendant paradigm for dealing with magnetic disk storage today.

RAID-5 specifically is a methodology for achieving redundancy of data on a group of drives without sacrificing half of the available capacity as mirroring (RAID-1) and its variations (i.e., RAID-10) do. RAID-5 achieves this storage efficiency by performing a parity calculation on the data written to disk and storing this parity information on an additional drive. Should a disk drive fail, the data can be recovered by computing the missing data using the parity and data blocks in the remaining drives. RAID-5 is an especially popular methodology for achieving redundancy because it is more economical than RAID-1 insofar as more disk-drive capacity can be rendered usable from a group of active drives. It has been estimated that RAID-5 accounts for 70 percent of all drive volumes shipped into RAID configurations (the actual percentage of RAID-5 per discrete RAID configuration is lower, given the popularity of striping and mirroring with OLTP). This would be sensible given that RAID-5 is typically associated with file serving and similar workloads, which account for significantly more capacity usage on a global basis than higher intensity OLTP workloads, for which RAID-5 is rarely used.
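To make the parity mechanism concrete, the following minimal Python sketch (not from the article) treats parity as the XOR of a stripe's data blocks and rebuilds a lost block from the survivors; the four-drive stripe and the block contents are made up for illustration.

# Minimal illustration of RAID-5 style parity: the parity block is the XOR
# of the data blocks in a stripe, and any single missing block can be
# rebuilt by XOR-ing the surviving blocks with the parity.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# One stripe spread across four data drives of a hypothetical 4+1 RAID-5 set.
data_blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data_blocks)              # stored on the fifth drive

# Simulate the failure of drive 2: rebuild its block from the survivors.
survivors = data_blocks[:2] + data_blocks[3:]
recovered = xor_blocks(survivors + [parity])
assert recovered == data_blocks[2]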

The attractiveness of RAID-5 to the ATA storage opportunity is even more pronounced. Given the great volumetric density advantages of the ATA platform versus SCSI and Fibre Channel, ATA is ideally suited for larger capacity storage installations. The capacity efficient RAID Level 5 is functionally allied with this focus on maximum capacity per dollar of storage cost. Though some have speculated that the high density advantage of the ATA platform will result in a willingness of end users to employ mirroring given a surplus of raw capacity, the fundamental laws of technology would seem to argue against this. The sharp and continuous rise in the processing power of the Intel chip, for instance, has not been accompanied by an increase in the sales of 4-way or 8-way servers--quite the reverse is true, with one- and two-way processor servers today dominating most application usage on the market. In the storage market, given its long evidenced storage elasticity, greater volumetric densities will be accompanied by a growth in the desire to maximize capacity as well as prevent disruption from drive failure. In this view, data protection based on parity strategies, as opposed to mirrored redundancy, will be maximally appealing--provided that it poses no crippling obstacles in its implementation.
Today, even for expensive solutions on SCSI and Fibre Channel platforms, there are obstacles to the universal ascendance of RAID Level 5, and the foremost among these is speed. For instance, one reason that RAID-5 is rarely used for OLTP application storage is because of its low performance for such workloads. As a tradeoff to its storage efficiency benefits, RAID-5 imposes additional computational as well as I/O burdens on the underlying magnetic disk storage. These additional burdens in many cases result in the general characterization that RAID-5 is slower than other types of RAID. And, in fact, with many commercial RAID controller technologies--both hardware and software--RAID-5 is often the slowest performing configuration, especially when compared to straight striping (RAID-0), mirroring (RAID-1) or striping + mirroring (RAID-10). In some cases--for instance, software RAID from vendors like VERITAS--the difference in performance between RAID-5 and RAID-0 is as much as 10x.

Conventional RAID-5 Performance Penalties
The reason that RAID-5 imposes performance penalties when compared to other methods of RAID comes down to two principal and related requirements. The first is the calculation of the parity itself, which requires computational resources and takes place in real time. This calculation can be accelerated by the use of specialized hardware such as an XOR engine, and most hardware RAID controllers employ this type of component to assist performance. The second performance cost, by far the more extensive, is due to the way that RAID-5 typically conducts its writes. This process is called Read-Modify-Write.

During the process of a sequential write, the RAID-5 implementation will attempt to write data in full stripes corresponding to the number of drives in the RAID group. However, at the end of any sequential write process and during any modification of data in place, it is not possible to write a complete stripe and the technique of Read-Modify-Write must be employed. The Read-Modify-Write process is the prototypical RAID-5 process and it is responsible for much of the performance limitations seen in most implementations of RAID-5.

In a typical Read-Modify-Write operation, multiple I/Os must be executed for each logical write request. The first I/O involves reading an existing block or sequence of blocks on the disk. The second I/O involves reading the parity associated with the block(s) that will be modified. The third I/O involves writing the new data blocks, and the fourth I/O involves updating the parity associated with the relevant block(s) corresponding to the new data that is being written. No matter how small the set of drives that comprise the RAID group, the minimum number of I/Os required in a single write operation that involves the standard Read-Modify-Write approach is four, with an even greater number of I/Os associated with multiple data block writes in larger RAID sets. Furthermore, certain approaches to ensuring reliability in RAID-5 implementations (see section below) involve additional I/O activity such as logging atomic parity updates separately, which increases the minimum number of Read-Modify-Write I/Os to six or higher. For example, suppose block D2 is to be updated with D2'; the parity P must also be updated to P'. Two reads are needed to obtain D2 and P, D2' and P' are then computed, and two writes are finally performed to place D2' and P' on the disks.
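The four I/Os can be sketched in a few lines of Python; the drive names, block size and read/write callbacks below are hypothetical stand-ins for illustration, not an actual controller interface.

# Sketch of the Read-Modify-Write sequence for a small write to one block.
# The new parity is old parity XOR old data XOR new data, so the stripe's
# other data blocks never have to be read.

def read_modify_write(read_block, write_block, data_drive, parity_drive,
                      offset, new_data):
    old_data = read_block(data_drive, offset)      # I/O 1: read old data
    old_parity = read_block(parity_drive, offset)  # I/O 2: read old parity
    new_parity = bytes(p ^ od ^ nd
                       for p, od, nd in zip(old_parity, old_data, new_data))
    write_block(data_drive, offset, new_data)      # I/O 3: write new data
    write_block(parity_drive, offset, new_parity)  # I/O 4: write new parity

# Tiny in-memory "drives" (4-byte blocks) to exercise the sequence.
drives = {"D2": bytearray(4), "P": bytearray(4)}
def read_block(drive, off): return bytes(drives[drive][off:off + 4])
def write_block(drive, off, data): drives[drive][off:off + 4] = data
read_modify_write(read_block, write_block, "D2", "P", 0, b"new!")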

Because of the multiple I/Os required in existing RAID-5 implementations, write performance is characteristically poor, often 5X-10X slower than mirroring or striping alternatives. There are hardware limits to the performance that is achievable given the amount of I/O activity that is generated upon each write.

In addition to low write performance, conventional RAID-5 implementations have other performance limitations that are unique to this RAID level. Two of the most common are RAID group initialization and RAID group rebuilding. In RAID-5 group initialization, the RAID solution needs to perform a scan of every data sector on each disk in the RAID set and initialize the corresponding parity. This initialization process is time consuming, and its duration is directly related to the size of the RAID set and the capacity of each drive in the group.
RAID-5 rebuilding is a process that must occur after a RAID-5 set experiences a disk failure. When a disk fails in a RAID-5 set, the missing data and parity contained on the failed drive must be regenerated on a replacement drive once the new working drive is inserted into the set or an existing hot spare is activated as the replacement drive target. Similar to initialization, the process of rebuilding requires that each data block on the system is read and the XOR computations are performed in order to obtain the absent data and parity blocks, which are then written onto the new disk. Often, during the process of reading all data from the disks to recompute the missing data and parity, bad sectors may be encountered, and it is no longer possible to rebuild the array. Depending on the size of the RAID group and the capacity of each drive, the rebuilding process is time consuming and may degrade the use of the drives in the RAID-5 set for normal activity. Both the initialization and the rebuild processes are additional performance and reliability penalties of conventional RAID-5 implementations that will occur as a matter of normal operation.
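As a rough sketch of the rebuild pass described above (the read/write callbacks and stripe-indexed layout are simplifying assumptions), every stripe on every surviving drive must be read once:

# Sketch of RAID-5 rebuild: for each stripe, the block that lived on the
# failed drive is recomputed as the XOR of the corresponding blocks on all
# surviving drives (data and parity alike) and written to the replacement.

def rebuild(surviving_drives, replacement, num_stripes, read, write):
    for stripe in range(num_stripes):                  # full pass over the set
        blocks = [read(drive, stripe) for drive in surviving_drives]
        missing = blocks[0]
        for block in blocks[1:]:
            missing = bytes(x ^ y for x, y in zip(missing, block))
        write(replacement, stripe, missing)
        # A read error (bad sector) anywhere in this loop leaves the
        # corresponding block unrecoverable.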

Conventional RAID-5 Reliability Penalties
Based on the dominant approach to implementing RAID-5 at present, there are several discrete reliability problems that arise in common implementations. Many of these reliability concerns are generated by events like power failure, which can often set in motion a cascade of correlated failures. For instance, a power failure not only interrupts active writes, which can invalidate any parity that is in the process of being updated, but can also burn out disks with aging components. As a result, power failures can often cause data loss in many types of RAID implementations by destroying both the parity and data associated with a "parity stripe." Part of this is due to characteristics of the ATA platform itself, such as differences in assembly line quality control processes that have more tolerance for production variability. However, a large part of the quality differential is due to ineffective strategies employed by the ATA RAID community using legacy RAID methodologies.

The most salient reliability problem in the ATA RAID arena is the nearly universal use of write back caching in all ATA implementations, even those driven by hardware RAID solutions. Write-back caching is a function that is enabled by the inclusion of small cache memory components within the disk drive electronics. By providing this additional memory, the drive is able to commit to write commands by buffering bursts of data in memory prior to the full completion of writing data onto the disk platter. When the drive signals that a write has been completed, the application moves on to its subsequent operation even if the data in question remains in the drive's write-back cache. Quicker completion of writes leads to faster application performance when disk latency is the primary performance limitation. Because of this, the logic behind making write-back caching a default strategy is straightforward: to increase the performance of the disk platform.

This performance enhancement is understandable given ATA's traditional role as a desktop device with most target implementations limited to one or two drives. Drive manufacturers have sought to differentiate the high-volume ATA offering from the higher margin SCSI and Fibre Channel drive business by limiting rotational speed thresholds on the platform. This creates pressure to optimize for performance gains like those offered by write-back caching, and for the most part the industry benchmarks the ATA platform with write-back caching enabled. It is possible that this will change in the future, but at the present moment this strategy is so pervasive that drive manufacturers presume write-back caching to be enabled when certifying their ATA products.

Though performance enhancement is helpful, the use of write-back caching in ATA RAID implementations presents at least two severe reliability drawbacks. The first involves the integrity of the data in the write-back cache during a power failure event. When power is suddenly lost in the drive bays, the data located in the cache memories of the drives is also lost. In fact, in addition to data loss, the drive may also have reordered any pending writes in its write-back cache. Because this data has been already committed as a write from the standpoint of the application, this may make it impossible for the application to perform consistent crash recovery. When this type of corruption occurs, it not only causes data loss to specific applications at specific places on the drive but can frequently corrupt file systems and effectively cause the loss of all data on the "damaged" disk.

This more global type of corruption occurs because of another problem with using a write-back cache. This second problem involves the sequencing of data that enters and exits the write-back cache. That is, ATA drives are free to reorder any pending writes in their write-back caches. This allows the write-back cache to obtain additional performance improvements. Instead of issuing sector commitments and then initiating rotational seeks for each sector in the exact sequence that commits were made, the drive places data on sectors that it encounters as platters rotate through an increasing or decreasing sector path. This reduces seek times and speeds up cache throughput. However, if a power or component failure occurs during a write process, the identity of sectors that make it to disk will not correspond to the sequence in which they were written. This causes corruption as applications are unable to recover from drive failures because they have no way of resolving the order in which data made it to the disk media versus which data was lost in cache. Even if individual drives did not reorder writes, there is no convenient way of preventing the reordering of writes that are striped across multiple drives that use write-back caching, since any individual drive is unaware of the writes being serviced by another drive.
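A toy Python simulation (not modeled on any real drive firmware) shows both failure modes at once: acknowledged writes that never reach the platter, and writes that reach it out of order.

import random

# Toy model of a drive with write-back caching: writes are acknowledged as
# soon as they land in the cache, flushes happen in whatever order suits the
# drive, and a power failure discards whatever is still cached.

class WriteBackDrive:
    def __init__(self):
        self.media = {}    # sector -> data actually on the platter
        self.cache = {}    # sector -> data acknowledged but not yet flushed

    def write(self, sector, data):
        self.cache[sector] = data                    # acknowledged immediately

    def flush_some(self):
        if self.cache:
            sector = random.choice(list(self.cache)) # reordered flush
            self.media[sector] = self.cache.pop(sector)

    def power_failure(self):
        self.cache.clear()                           # acknowledged writes lost

drive = WriteBackDrive()
for sector in range(5):
    drive.write(sector, f"journal record {sector}")
drive.flush_some()       # an arbitrary sector reaches the platter first...
drive.power_failure()    # ...and the rest vanish despite being acknowledged
print(sorted(drive.media))   # e.g. [3]: incomplete and out of order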

These write-back cache problems are a common cause of data corruption. In fact, the weakness of the write-back cache is even a relatively well understood problem, and in higher end drive platforms RAID devices and sophisticated storage administrators will default to a policy of prohibiting the use of the SCSI write back cache. However, in the ATA RAID arena, the write-back cache is usually enabled by default, and performance measurement is conducted with the caching enabled, which is misleading given that the reliability implicit in RAID is compromised by the use of write-back caching.

Deactivation of write-back caching prevents the most severe of the ATA RAID corruption problems. The tradeoff for RAID-5, however, involves even lower performance. As discussed in the previous section, the legacy methodologies for RAID-5 impose a significant performance limitation on this type of RAID, one that is partially addressed by vendors through the default use of write-back caching. Unfortunately, deactivating write-back caching usually has a dire effect on performance.

And yet, there is a further dilemma. Since ATA vendors are not currently certifying the recovery of drives that deactivate write-back caching, it is possible that drives operating without this function will have greater failure rates. So, while vendors do achieve the goal of preventing an obvious source of data corruption, they run the risk of increasing drive failure.
The other showstopper problem posed by disk failure in ATA RAID-5 solutions is the parity recalculation problem. If the system crashes during the middle of a write process, the parity calculation that applied to the active data write may be inconsistent. As a result, when the system is powered back on, it is necessary to regenerate this parity and write it to disk. Since the system will not be able to determine where the last active write was in progress, one solution is to recalculate all of the parity on the RAID-5 group. This recalculation process takes time and every sector of each participating RAID group must be scanned. Based on various leading system implementations currently available, the parity recalculation process can take between 45 minutes for a standard RAID-5 group of five or six drives to several hours for larger sets.
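In pseudocode terms, the recovery scan amounts to the following sketch (the drive callbacks and stripe addressing are assumptions for illustration): with no record of where the interrupted write was, every stripe in the group must be read back and its parity recomputed.

# Sketch of whole-group parity resynchronization after an unclean shutdown.

def resync_parity(data_drives, parity_drive, num_stripes, read, write):
    for stripe in range(num_stripes):                 # full scan of the group
        blocks = [read(drive, stripe) for drive in data_drives]
        expected = blocks[0]
        for block in blocks[1:]:
            expected = bytes(x ^ y for x, y in zip(expected, block))
        if read(parity_drive, stripe) != expected:
            write(parity_drive, stripe, expected)     # repair stale parity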
Currently, the parity recalculation problem is a significant drawback of software RAID-5 solutions. There is no easy way to avoid this penalty when using the traditional read-modify-write approach to RAID-5. Some RAID-5 solutions in the ATA universe do avoid this limitation, however, through the use of "pointers" that record the positions of the in-place updates. These pointers are stored either on another disk or within a small NVRAM component. This technique is called "dirty region logging." If the pointer is stored on another disk, it generates an additional I/O step that will further degrade performance. Nonetheless, it will deliver a performance benefit by avoiding the need to recalculate all parity upon power failure; however, it does not eliminate the associated reliability problem since, in the event of a crash, some parity will still be left in an inconsistent state until recovery can be performed. If dirty region logging is combined with write-back caching, the original reliability problem caused by a power failure or power spike event will result in inconsistent or corrupt data. Another solution is to log the data and parity to a separate portion of the disks before responding to the write request; the logged data and parity are then copied to the actual RAID stripe. In the event of a failure, the data and parity can be copied back to the RAID stripe. This approach, while much more reliable than dirty region logging, imposes additional disk latency and makes RAID-5 writes significantly slower.
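The dirty region logging idea can be sketched as follows; the set-based log and the helper names are illustrative only, and a real implementation would persist the log on a separate disk or in NVRAM.

# Sketch of dirty region logging: mark a region dirty (extra I/O) before its
# stripe is modified, clear the mark once data and parity are both written,
# and after a crash resynchronize only the regions still marked dirty.

def logged_write(dirty_log, region, do_stripe_write):
    dirty_log.add(region)       # extra I/O: persist the dirty mark first
    do_stripe_write()           # the usual read-modify-write of data + parity
    dirty_log.discard(region)   # extra I/O: clear the mark on completion

def recover(dirty_log, resync_region):
    for region in list(dirty_log):   # only dirty regions need a parity scan
        resync_region(region)
        dirty_log.discard(region)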
A complete, high-performance way around these parity update problems in RAID-5 is to use significant quantities of NVRAM with reliable battery backup. Unfortunately, the use of NVRAM will tend to degrade RAID-5 performance for streaming where throughput rather than latency is important. NVRAM is often employed in higher-end SCSI and Fibre Channel RAID controllers because it improves performance for many applications and confers reliability benefits in the face of power failure. Nevertheless, it is undesirable for the ATA world to move to this type of solution. One of the most important aspects of the ATA storage opportunity involves its cost savings over alternative drive platforms. Given this, vendors do not have the luxury to equip ATA RAID solutions with a lot of expensive hardware components. Moreover, there is some expectation within the ATA community that the widespread adoption of Serial ATA will result in an increase of drive counts within standard rack-mount servers. In many of these scenarios, the real estate required for additional board-level components will not be readily available on motherboards or easily addressable through the use of expansion boards. This means that the ATA world will continue to have relatively few options available for addressing reliability concerns associated with RAID-5 implementations simply by applying more hardware.

Conclusion
The advent of Serial ATA drive technology holds the promise of radically altering the economics of networked storage. However, the ATA drive platform is largely unsuitable for enterprise class storage because of severe reliability problems in the RAID solutions addressing the ATA universe. These reliability problems are exacerbated in the case of RAID Level 5, which amplifies susceptibility to drive failures and imposes crippling performance limitations. While RAID Level 5 has great popularity and should command the overwhelming bulk of drive shipments that address mass storage, it fails to confer these advantages in the ATA world, where expensive NVRAM-based hardware is economically infeasible and performance limitations make it impractical. As a result, data protection must be achieved through mirroring rather than parity, which is wasteful for many applications and reduces the cost savings advantage of the ATA platform.

A new methodology to conduct RAID-5 is required if its promise in an era of low-cost drive platforms is to be realized. Such a methodology would provide enterprise-class reliability without NVRAM and would deliver near-wire-speed write performance within existing ATA rotational speed frameworks. If this type of solution were available, RAID Level 5 ATA-based storage would achieve rapid and ready acceptance throughout the enterprise-class universe.
Boon Storage Technologies, Inc. has a breakthrough RAID-5 technology called SR5 that overcomes the limitations of existing ATA RAID-5 solutions. SR5 truly makes ATA drives enterprise quality; it delivers the ultimate cost benefit to ATA drives while delivering high reliability and high performance to ATA RAID-5.
www.sr5tech.com

Serial ATA Characteristics

Narrower Cabling
Supports Lower Power Requirements
Lower Pin Counts
10-year Roadmap
Higher Performance
Improved Connectivity (No Master-Slave)
Longer Cabling
PC Economies of Scale

RAID-5 Performance Limitations

Multiple I/Os in Read-Modify-Write
Parity Calculation Overhead
Fixed Stripe Size
RAID Group Initialization
RAID Group Rebuilding


ATA RAID-5 Reliability Penalties

Platform requires write-back-caching
Data loss on power failure
Data out-of-sequence failure
Parity recalculation failure
File system corruption
NVRAM is not an economic answer
Single drive failure problem

RELATED ARTICLE:
RAID (n) Implements MAID Technology.
Nobody talks about RAID storage anymore, yet virtually every computer system with two or more hard drives is running some form of RAID. The basic technology has not changed in over 10 years, but many versions of RAID have emerged, each with a myriad of options. What does this mean for the end user? As drive capacities double and systems grow in complexity, it is increasingly difficult for users to balance cost with usability, capacity, performance, and reliability.

RAID (n) offers a solution. This new, patented technology out-performs RAID-5 and RAID-10 systems by providing cost-efficient high performance across all drives while allowing any number of random drive failures with no data loss. In addition to enhanced performance and fail-safe reliability, RAID (n) offers superior usability with the option to dynamically adjust usable capacity and/or insurance against data loss while the system is running.

Roughly 80% of the RAID-5 arrays used in today's data centers have one or more "Hot Spares." These drives run all day while providing no direct benefit to the drive's associated array. The result is a potentially dangerous false sense of security. If a second drive fails, or if the wrong drive is removed accidentally while the array is rebuilding, all of the data contained on the system is lost. RAID (n) technology allows the user to set the insurance to 1 + (Hot Spare/s) without the need to purchase additional drives. As a result, the RAID (n) array can tolerate a higher total number of random simultaneous drive failures without data loss. As a further benefit, "Hot Spares" used in conjunction with RAID (n) add directly to the overall system performance.

Users of RAID-10 (or RAID 1+0) systems, where performance and safety are of foremost concern, know that these systems have one serious drawback: the cost of inverse capacity. Any system, such as this one, that contains mirroring will require twice the number of hard drives to achieve the same capacity as RAID-5 minus one drive. So, as an example, a 2-Terabyte (TB) RAID-5 system using 181GB drives would cost $10,128, assuming each drive was $844.00--today's price for this size drive. That same 2TB system using RAID-10 would cost $18,568, but it would not provide any write performance benefits with the extra ten drives needed under RAID-10.

With RAID (n), a 2TB array with one drive redundancy would cost the same as the RAID-5 system; however, if the user wanted the two-drive failure benefit of RAID-10, the cost would only increase by $844.00 or a total of $10,972 instead of the $18,568. Concurrently, three-drive redundancy would only cost $11,816. If, on the other hand, the RAID-10 system for $18,568 was already in place, the RAID (n) system could be implemented with three drives of insurance while providing 3.4TB of usable capacity.
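The drive-count arithmetic behind those figures works out as follows; the counts are inferred from the quoted totals (181GB drives at $844 each, with roughly eleven drives' worth of usable space counted as "2TB"), so treat them as an illustration rather than vendor data.

drive_price = 844                           # $ per 181GB drive (2003 pricing)
data_drives = 11                            # ~2TB of usable capacity

raid5   = (data_drives + 1) * drive_price   # 12 drives -> $10,128
raid10  = (data_drives * 2) * drive_price   # 22 drives -> $18,568
raidn_2 = (data_drives + 2) * drive_price   # 13 drives -> $10,972
raidn_3 = (data_drives + 3) * drive_price   # 14 drives -> $11,816
print(raid5, raid10, raidn_2, raidn_3)      # 10128 18568 10972 11816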

For companies looking for ways to decrease costs and increase usability, capacity, performance, and reliability, RAID (n) provides a cost-efficient solution. RAID (n) offers a means to improve the overall capacity of current RAID arrays while at the same time enhancing the overall performance and reliability of the total system.

Kris Land is chief technology officer at InoStor Corp. (Poway, Calif.)
www.inostor.com

Kris Land, Recent patents listed: 20050182992 - Method and apparatus for RAID conversion

A general RAID conversion method is described for converting between different RAID configurations. The method includes reading a unit of user data from the source devices according to the source RAID algorithm, writing the user data together with redundant data (if any) to the target devices according to the target RAID algorithm, and from time to time releasing portions of the source devices containing data that has been converted. The conversion may be used to expand or contract the array, to increase or decrease usable capacity, and to increase or decrease the device-loss insurance level. Conversion may be performed on line (dynamically) or off line. The flexibility of the method allows the implementation of manual and/or rule-based RAID reconfiguration that automatically adjusts system parameters based on user request and/or a set of rules and conditions respectively. It may also be used to perform self-healing after one or more devices in the array have failed.

[0001] This application is related to U.S. Pat. No. 6,557,123, issued Apr. 29, 2003 and U.S. patent application Ser. No. 10/371,628, filed Feb. 20, 2003, both of which are incorporated by reference herein in their entirety.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] This invention relates to RAID (Redundant Array of Inexpensive (or Independent) Disks (or Devices)) systems, and in particular, to method and apparatus for converting between different species of RAID's and rule-based RAID reconfiguration.

[0004] 2. Description of the Related Art

[0005] RAID is a data storage system that provides a certain level of redundancy so that a certain number of disks (devices) of the disk (device) array may be lost without any loss of user data stored thereon. Various species of RAID systems are known, including RAID0, RAID1, RAID3 and RAID5 (known as standard RAID), and RAID2, RAID4 and RAID6 (known as non-standard RAID). Methods and apparatus that provide conversion or migration between different conventional RAID species have been described. For example, U.S. Pat. No. 6,275,898 describes converting from RAID5 to RAID1 (a contraction, or reduction of the usable capacity of the system, referred to as "promotion" in that patent) and converting from RAID1 to RAID5 (an expansion, or increase of the usable capacity of the system, referred to as "demotion" in that patent). The conversion must be done off line, i.e. the system cannot take user request while performing the conversion. In the context of this patent "RAID1" includes the compound RAID, which we call "RAID10". U.S. Pat. No. 6,154,853 describes a special case of an "even" conversion (where the usable capacity in the system is unchanged), by converting an n-disk RAID5 to a 2(n-1) disk RAID10 and back. U.S. Pat. No. 5,524,204 and U.S. Pat. No. 5,615,352 describe a method for expanding a RAID5 to a bigger RAID5 with a larger number of disks. The conversion may be accomplished without interrupting service, i.e. while the system is online. These two patents do not describe an array contraction.

SUMMARY OF THE INVENTION

[0006] Accordingly, the present invention is directed to a method and apparatus for RAID conversion that substantially obviates one or more of the problems due to limitations and disadvantages of the related art.

[0007] An object of the present invention is to provide a flexible approach to RAID conversion and reconfiguration.
[0008] Additional features and advantages of the invention will be set forth in the descriptions that follow and in part will be apparent from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

[0009] To achieve these and other advantages and in accordance with the purpose of the present invention, as embodied and broadly described, the present invention provides a method for RAID conversion in a redundant array of inexpensive devices (RAID) comprising a controller and a plurality of storage devices for storing user data, the controller storing a plurality of RAID algorithms to be implemented for writing data to and reading data from the storage devices, the method includes storing in the controller one or more rules for selecting a desired one of the plurality of RAID algorithms based on one or more conditions of the array; detecting the one or more conditions of the array; selecting the desired RAID algorithm based on the detected conditions and the stored rules; and when the desired RAID algorithm is different from the RAID algorithm currently implemented in the array, automatically converting the array from the currently implemented RAID algorithm to the desired RAID algorithm.

[0010] In another aspect, the present invention provides a RAID system configured to carry out the above method steps. In yet another aspect, the invention provides a computer software product for implementing the above method steps in a RAID system.

[0011] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] FIGS. 1(a) and 1(b) are schematic diagrams showing a RAID system before and after an RAID conversion.

[0013] FIG. 2 is a flow chart illustrating a method for RAID conversion.

[0014] FIGS. 3(a) and 3(b) are a flow chart illustrating a method for off-line replication.

[0015] FIGS. 4(a) and 4(b) are a flow chart illustrating a method for on-line conversion.

[0016] FIG. 5 is a flow chart illustrating a rule-based RAID conversion method.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0017] A new species of RAID, hereinafter referred to as "RAIDn", is described in commonly assigned U.S. Pat. No. 6,557,123, entitled "Data redundancy methods and apparatus", issued Apr. 29, 2003. U.S. Pat. No. 6,557,123 describes a data storage apparatus having a plurality of n disks, where data comprising a plurality of data groupings are stored respectively across the plurality of n disks. Each one of the n data groupings comprises a data portion and a data redundancy portion. Advantageously, the n data portions are recoverable from any and all combinations of n-m data grouping(s) on n-m disk(s) when the other m data grouping(s) are unavailable, where 1 ≤ m < n.

[0018] As used in the present application, "RAIDn" is a RAID system according to the principles described in U.S. Pat. No. 6,557,123 and/or U.S. patent application Ser. No. 10/371,628, i.e., a RAID system where the level of redundancy is selectable or adjustable. "Conventional RAID", on the other hand, is used in the present application to refer to conventionally known RAID species such as RAID0, RAID1, RAID3, RAID5, RAID6, RAID2 and RAID4, and/or compound RAID's where any of the above RAID types are combined. "RAID" is used to generally refer to any RAID systems, including conventional RAID and RAIDn systems.
[0019] Although the term disk is used in the present application, the method and apparatus are not limited to disks, but the RAID may comprise any type of suitable devices for data storage, including but not limited to magnetic disks, magnetic tapes, optical discs, memory, any block devices, servers, NAS (network attached servers) systems, JBOD's (Just a Bunch of Disks), clustered servers, etc. This application uses the term "disk", "drive" and "device" interchangeably, unless otherwise specified, without affecting the scope of the description. At least the term "device" should be understood to encompass all suitable storage devices including but not limited to those listed above.
[0020] Embodiments of the present invention provide RAID conversion methods and apparatus for converting (or migrating) between a conventional RAID and a RAIDn system, and/or converting between two RAIDn systems. Other aspects of the invention include applications of rule-based RAID conversion where both RAID systems may be either a RAIDn or a conventional RAID. For convenience, the RAID system before a RAID conversion is referred to as the source RAID and the RAID system after the conversion is referred to as the target RAID.
[0021] According to embodiments of the present invention, the RAID conversion may be an expansion where the number of disks in the array increases, or a contraction where the number of disks in the array decreases. The conversion may either increase or decrease usable capacity, which is defined as the total capacity of the system usable for storing user data. The conversion may either increase or decrease the number of total disks in the array. The conversion may either increase or decrease device-loss insurance, which is defined as the maximum number of disks that may fail without the loss of user data. The conversion may translate between two different RAID/RAIDn species whose physical characteristics (i.e. Number of devices, device-loss and/or usable capacity) remain the same; this flexibility of the system allows implementation of rule-based RAID reconfiguration that automatically adjusts one or more system parameters based on a prescribed set of rules and conditions. In particular, RAID conversion may be used to perform self-healing after one or more devices in the array failed, in which situation the source array will be the remaining devices of the original RAID (from which all user data can be reconstructed), and the target array will be either a reconfigured RAID on the same remaining devices or an array that includes replacement devices for the failed devices. In addition, the conversion may be performed either in an on line fashion (i.e. dynamically), where the system will accept and process user I/O requests while performing the conversion, or in an off line fashion, where the system will not accept and process user I/O requests while performing the conversion.
[0022] Referring now to FIG. 1(a), a RAID system includes an array of n1 storage devices 14-1, 14-2, . . . 14-i, . . . 14-n1 connected to a controller 12. A controller useful in embodiments of this invention can be either a physical "Hard Ware" device or a virtual "Software loadable module" managing the RAID functions. FIG. 1(b) shows the system after a RAID conversion, where the array now comprises an array of n2 devices 16-1, 16-2, . . . 16-j, . . . 16-n2. The controller 12, which preferably includes a processor or logic circuits, implements a plurality of RAID algorithms, controls the read and write operations of the devices 14-i or 16-j, and carries out the RAID conversion. The controller 12 is also connected to a host device via any suitable interface device (not shown), for receiving read and write requests from the host, and transmitting or receiving user data to or from the host. The invention does not impose any requirement on the physical identity of the source devices 14-i and target devices 16-j. When the source array and the target array share some of the same physical devices, RAID conversion involves reading data from portions of some devices (as source devices) and writing data to unused portions of the same physical devices (as target devices). When the source array and the target array are separate and distinct physical devices, the RAID conversion may be referred to as replication, and involves copying of user data from the source array to the target array which may be configured as a different RAID.
[0023] RAID conversion methods according to embodiments of the present invention generally involve the following steps (FIG. 2): (1) reading a predefined amount of user data from the source devices according to the RAID algorithm implemented in the source RAID; (2) writing the user data together with redundant data (if any) to the target devices according to the RAID algorithm implemented in the target RAID; and (3) releasing portions of the source devices containing data that has been converted and making such portions available for use as target devices. The read step (1) includes, when appropriate, decoding the received data according to the source RAID algorithm to obtain user data. The write step (2) includes, when appropriate, calculating redundancy data from the user data according to the target RAID algorithm. The write step may include a step of verifying the data written onto the target RAID. During conversion, a watermark is maintained for the source array to indicate the conversion progress. This allows the read and write steps to be carried out for a unit of data at a time, so that user I/O requests can be handled during conversion. The read and write steps are repeated until all data is converted. The capacity release step (3) may be carried out from time to time or when necessary, depending on the amount of unused capacity in the physical device.
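A minimal sketch of that loop, assuming hypothetical source and target objects that expose read_unit, write_unit and release_up_to methods (none of these names come from the patent), might look like this:

# Sketch of the general conversion loop: read a unit from the source RAID,
# re-encode and write it to the target RAID, advance a persisted watermark,
# and periodically release converted source capacity for reuse.

def convert(source, target, total_units, release_every=64):
    watermark = 0                                 # conversion progress
    while watermark < total_units:
        user_data = source.read_unit(watermark)   # decode per source RAID
        target.write_unit(watermark, user_data)   # re-encode per target RAID
        watermark += 1                            # persisted in practice
        if watermark % release_every == 0:
            source.release_up_to(watermark)       # free converted capacity
    source.release_up_to(watermark)
    return watermark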
[0024] In the write step, the data may be optionally written to a scratch area to avoid "write holes". A known problem in RAID systems, "write holes" refer to possible interruptions of multi-step sequences that may cause data integrity problems. For example, during writing of a data stripe across a RAID5 array, data may be lost if a power failure occurs before sufficient data has been written to enable recovery of the entire stripe. Writing updates to a scratch area substantially eliminates the write hole problem.
[0025] FIGS. 3(a) and 3(b) illustrate an off-line replication method, and FIGS. 4(a) and 4(b) illustrate an on-line conversion method. Both methods are specific examples of the more general method described in FIG. 2.
[0026] The conversion method according to embodiments of the present invention is described in more detail below using a specific example. In this example, it is assumed that the number of bytes in any data chunk is a power of 2. (Generally, the data chunks, chunk sizes, chunk boundaries and byte offsets may be of any defined values and the present invention is not limited to the specifics of this example given here.) When chunk sizes are not fixed, it is assumed that a larger size chunk always starts on a chunk boundary of any smaller size chunk. In fact, the absolute byte offset of a chunk start is a multiple of the chunk size. It is also assumed that virtual stripes start at a multiple of their size in absolute byte offset.
[0027] Any virtual stripe size is an integer multiple of a chunk size, and therefore any two abstract RAID's (conventional RAID or RAIDn), have a least common multiple which is an exact integer multiple of both their chunk sizes. Watermarks at absolute byte offsets equal to integer multiples of this least common multiple are used as virtual stripe boundaries for both abstract RAID's. These are referred to herein as "shared stripe boundaries". For example, a virtual stripe on a 9:2 RAIDn is 63 chunks, while a virtual stripe on a 9-disk RAID5 is 8 chunks. The least common multiple will be 504 chunks, or about 2 megabytes with 4 Kbyte chunks. Conversion is preferably carried out in units of virtual stripes, as follows.
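The numbers in that example can be checked directly (a 4KB chunk size is assumed, as in the text):

from math import gcd

# Shared stripe boundaries: a 9:2 RAIDn stripe of 63 chunks and a 9-disk
# RAID5 stripe of 8 chunks align every lcm(63, 8) chunks.
def lcm(a, b):
    return a * b // gcd(a, b)

chunks = lcm(63, 8)               # 504 chunks
print(chunks, chunks * 4096)      # 504 chunks = 2,064,384 bytes (~2MB)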
[0028] First, a subset of possible shared stripe boundaries is defined as "step watermarks". The step watermarks should be spaced so that full conversion between neighboring step watermarks takes a desired amount of time, such as on the order of 1/10 second, or less. The controller 12 alternates (e.g. on the order of once a second) between a converting state and a user I/O state. When entering the converting state, the controller flushes all pending user requests to the array, with the cooperation of the upper level driver connected to the controller 12, so that no I/O to this array is issued while the state remains converting. Preferably, the upper level driver either sends a pause, which will not return until, or the driver can queue user requests until, the entire conversion to the next step watermark is completed. The controller then converts the data from the source array to the next step watermark. The new watermark is stored in the controller, the controller flushes watermark data and the controller enters the user I/O state. During user I/O state, normal user I/O takes place to the array with the watermark fixed at its new location. Since the watermark location indicates which portions of the data has been converted and hence exist on the target RAID, and which portions of the data have not yet been converted and hence exist on the source RAID, user read requests can be handled appropriately by reading data from either the source RAID or the target RAID. User write requests are preferably handled by writing data onto the target array with an appropriate watermark indicating the boundary of such data. The above steps are repeated until all the data is converted.
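The alternation between the converting state and the user I/O state can be sketched as below; the callback names are placeholders for whatever pause/flush mechanism the upper level driver actually provides, not functions defined by the patent.

# Sketch of the step-watermark alternation: pause user I/O, convert up to the
# next step watermark, persist the watermark, then resume user I/O with reads
# routed to the source or target side depending on the watermark.

def convert_with_user_io(step_watermarks, pause_user_io, resume_user_io,
                         convert_up_to, persist_watermark):
    for mark in step_watermarks:      # each step sized for ~1/10 s of copying
        pause_user_io()               # flush and queue pending requests
        convert_up_to(mark)           # copy source data up to this boundary
        persist_watermark(mark)       # crash-safe record of progress
        resume_user_io()              # normal I/O with the watermark fixed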
[0029] The above-described method may involve small pauses in data availability to the users, but is relatively easy to implement. Alternatively, if smoother data availability is to be maintained during conversion, a moving RAID0, RAID1, RAID10, or some other RAID section embracing at least two steps may be implemented, preferably on a separate storage device such as a solid-state disk or battery backed memory. Placing an intermediary RAID device and/or cache between new user I/O and the target array during the step watermark I/O operation substantially eliminates all potential user I/O pauses. Additionally, this would eliminate "write holes" even if there are pauses.
[0030] The RAID conversion method described above may be applied where the source and target RAID's may be any species of RAID, including conventional RAID's and RAIDn with any desirable n:m parameters. As a result, the RAID conversion method is flexible and general in that it can implement a contraction as well as an expansion, with increased or decreased usable capacity and increased or decreased device-loss insurance. Further, conversion may be carried out either on-line (dynamically) or off-line. This flexibility allows practical applications for reconfiguring RAID systems not offered by conventional conversion methods. One category of such applications is rule-based RAID reconfiguration. Rule-based reconfiguration may be implemented by storing a set of rules in the controller (or in an upper level user application), which causes automatic conversion (reconfiguration) of the RAID system when certain conditions are met (FIG. 5). Some examples of rule-based RAID conversion include:
[0031] Capacity utilization-based rules. Device-loss insurance level may be automatically adjusted, between a minimum and a maximum level set by the user, based on capacity utilization (i.e. amount of total device capacity that is utilized by user data). For example, a 20-drive array may be set to have a maximum insurance level of 5 disks and a minimum insurance level of 2 disks. If the utilization of available capacity of the array is at or below 50%, the RAID is configured as 20:5; if the capacity utilization is between 50% and 60%, the RAID is configured as 20:4; etc. Additionally, idle drives can be added to maintain both capacity and insurance by using a predetermined number of idle drives and/or idle drives known as global spares.
[0032] Performance requirement-based rules. Different species of RAID's have different performance in terms of read and write speeds. For example, RAID0 has the fastest performance for both reads and writes but no safety. The level of device-loss insurance in RAIDn affects write performance to a certain degree and affects read performance to a lesser degree. A rule may be defined to increase or decrease the insurance level based on performance requirements. If, for example, each added disk of insurance beyond RAID0 results in a write penalty of 10%, and if a performance level of 60% of the maximum performance is acceptable, then the device-loss insurance may be set as high as 4. The RAID may be automatically reconfigured when the performance requirement changes.
[0033] Self-healing fixed insurance. Rules may be set up so that the RAID will automatically add devices and/or borrow usable capacity from the array to maintain a certain level of device-loss insurance. For example, if an insurance level of 3 is always to be maintained, and one device in a 9-device array fails, the remaining 8 devices may be reconfigured into an 8:3 RAIDn (assuming total capacity is adequate). Alternatively, if a spare device is available, it may be added to the 8 remaining devices and reconfigured into a 9:3 RAID.
[0034] Self-healing minimal insurance. A RAID system may be supplied by a supplier and set to an initial high level of insurance. As devices fail, self-healing is performed to reconfigure the remaining devices, until a minimal insurance threshold is reached which triggers a maintenance call. This may be especially useful when a preventive maintenance contract is in place as it reduces the number of maintenance calls to the user site, and/or allows maintenance to be performed at a desired time during a window instead of at each device failure.
[0035] Data criticality-based rules. Device-loss insurance level may be automatically adjusted, between a minimum and a maximum level set by the user, based on the importance of the user data. Such rule-based settings will dynamically change from higher insurance (for more important data) to lower insurance (for less important data) and vice versa. Data criticality may be measured or defined by any suitable methods such as the class of user, the use of directories that are designated at higher insurance levels, files marked with higher priorities etc.
[0036] Data recency and repetition-based rules. Device-loss insurance level may be automatically adjusted, between a minimum and a maximum level set by the user, based on recency and repetition (R&R) of the user data. Such rule-based setting will dynamically change from higher insurance (for higher R&R) to lower insurance (for lower R&R) and vice versa. R&R may be measured or defined by any suitable methods such as the number of files R/W over a period of time and/or the number of accesses of one or more files over a period of time.
[0037] Device vulnerability-based rules. Device-loss insurance level may be automatically adjusted, between a minimum and a maximum level set by the user, based on the device type, vulnerability of the type of device, and/or location of the user data (for example, the location of user data may be in remote locations such as mobile offices, home offices, remote offices etc., or a managed data center). Such rule-based settings will dynamically change from higher insurance (for more vulnerable devices) to lower insurance (for less vulnerable devices) and vice versa.

[0038] In the above rule-based RAID conversion methods, each of the source and target RAIDs may be a conventional RAID or a RAIDn.
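As an illustration of the rules above, here is a small Python sketch of a capacity-utilization rule in the spirit of paragraph [0031]; the thresholds and levels are invented for the example, not taken from the patent.

# Map capacity utilization (0.0-1.0) to a device-loss insurance level,
# bounded by user-set minimum and maximum levels.

def desired_insurance(utilization, min_level=2, max_level=5):
    if utilization <= 0.50:
        level = max_level            # plenty of free space: maximum insurance
    elif utilization <= 0.60:
        level = max_level - 1
    elif utilization <= 0.70:
        level = max_level - 2
    else:
        level = min_level            # nearly full: fall back to the minimum
    return max(min_level, min(max_level, level))

# A 20-drive array at 55% utilization would be reconfigured as 20:4.
print(desired_insurance(0.55))       # -> 4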
[0039] It will be apparent to those skilled in the art that various modifications and variations can be made in the RAID conversion methods and apparatus of the present invention without departing from the spirit or scope of the invention. For example, although a set of possible rules are described, the invention is not limited to these rules and any suitable rules may be used. Thus, it is intended that the present invention cover modifications and variations that come within the scope of the appended claims and their equivalents.

What is claimed is:

1. In a redundant array of inexpensive devices (RAID) comprising a controller and a plurality of storage devices for storing user data, the controller storing a plurality of RAID algorithms to be implemented for writing data to and reading data from the storage devices, a method for RAID conversion comprising: storing in the controller one or more rules for selecting a desired one of the plurality of RAID algorithms based on one or more conditions of the array; detecting the one or more conditions of the array; selecting the desired RAID algorithm based on the detected conditions and the stored rules; and when the desired RAID algorithm is different from the RAID algorithm currently implemented in the array, converting the array from the currently implemented RAID algorithm to the desired RAID algorithm.

2. The method of claim 1, wherein the converting step comprises: (a) reading a unit of user data from the storage devices according to the currently implemented RAID algorithm; (b) defining a watermark indicating the position where the data is read from the current RAID; and (c) writing user data on appropriate storage devices according to the desired RAID algorithm.

3. The method of claim 2, further comprising: alternating between performing steps (a), (b) and (c), and processing user I/O requests.

4. The method of claim 1, wherein the desired RAID has fewer storage devices storing user data than the currently implemented RAID.

5. The method of claim 1, wherein the desired RAID has more storage devices storing user data than the currently implemented RAID.

6. The method of claim 1, wherein the conditions include the current capacity utilization of the array.

7. The method of claim 1, wherein the conditions include a performance requirement.

8. The method of claim 1, wherein the conditions include a change in the number of available storage devices in the array.

9. The method of claim 1, wherein the conditions include a decrease in the number of available storage devices in the array.

10. The method of claim 1, wherein the conditions include an increase in the number of available storage devices in the array.

11. The method of claim 1, wherein the conditions include a measure of data criticality of the user data.

12. The method of claim 1, wherein the conditions include a measure of recency and repetition of the user data.

13. The method of claim 1, wherein the conditions include a measure of vulnerability of the storage devices.

14. The method of claim 1, wherein the converting step is performed on line.

15. The method of claim 1, wherein the converting step is performed off line.

16. The method of claim 1, wherein at least some of the RAID algorithms stored in the controller are characterized by a number of storage devices in the array (n), and a device-loss insurance level (m) such that when up to m devices of the array are unavailable, user data is fully recoverable from the remaining n-m devices, where 1 ≤ m < n.

17. The method of claim 16, wherein the device-loss insurance level of the desired RAID is greater than the device-loss insurance level of the currently implemented RAID.
18. The method of claim 16, wherein the device-loss insurance level of the desired RAID is less than the device-loss insurance level of the currently implemented RAID.
19. The method of claim 16, wherein the condition is a decrease in the number of available storage devices in the array and the desired RAID after conversion has the same device-loss insurance level as the currently implemented RAID.
20. The method of claim 16, wherein the condition is a decrease in the number of available storage devices in the array and the desired RAID after conversion has a lower device-loss insurance level than the currently implemented RAID.
21. The method of claim 16, wherein the rules define a maximum device-loss insurance level and a minimum device-loss insurance level for a given n value, and one or more conditions based on which a desired device-loss insurance level is determined, the desired device-loss insurance level falling between the maximum and minimum device-loss insurance levels.
22. In a redundant array of inexpensive devices (RAID) comprising a controller and a plurality of storage devices for storing user data, the controller storing a plurality of RAID algorithms to be implemented for writing data to and reading data from the storage devices, wherein at least some of the RAID algorithms are characterized by a number of storage devices in the array (n), and a device-loss insurance level (m) such that when up to m devices of the array are unavailable, user data is fully recoverable from the remaining n-m devices, where 1≤m<n.

23. The method of claim 22, wherein the writing step includes writing updates to a semi-permanent cache.
24. A redundant array of inexpensive devices (RAID) system comprising: a plurality of n storage devices for storing user data thereon; and a controller connected to the storage devices for controlling writing and reading data to and from the storage devices according to a RAID algorithm, the controller storing a plurality of RAID algorithms to be implemented for writing data to and reading data from the storage devices, the controller further storing one or more rules for selecting a desired one of the plurality of RAID algorithms based on one or more conditions of the array, the controller having stored program instructions or a logic circuit operable to detect the one or more conditions of the array, to select the desired RAID algorithm based on the detected conditions and the stored rules, and when the desired RAID algorithm is different from the RAID algorithm currently implemented in the array, to convert the array from the currently implemented RAID algorithm to the desired RAID algorithm.
25. The system of claim 24, wherein the controller has stored program instructions or a logic circuit operable to convert the array by: (a) reading a unit of user data from the storage devices according to the currently implemented RAID algorithm, (b) defining a watermark indicating the position where the data is read from the current RAID, and (c) writing user data on appropriate storage devices according to the desired RAID algorithm.
26. The system of claim 25, wherein the controller has stored program instructions or a logic circuit operable to alternate between performing steps (a), (b) and (c), and processing user I/O requests.
27. The system of claim 24, wherein the desired RAID has fewer storage devices storing user data than the currently implemented RAID.
28. The system of claim 24, wherein the desired RAID has more storage devices storing user data than the currently implemented RAID.
29. The system of claim 24, wherein the conditions include the current capacity utilization of the array.
30. The system of claim 24, wherein the conditions include a performance requirement.
31. The system of claim 24, wherein the conditions include a change in the number of available storage devices in the array.
32. The system of claim 24, wherein the conditions include a decrease in the number of available storage devices in the array.
33. The system of claim 24, wherein the conditions include an increase in the number of available storage devices in the array.
34. The system of claim 24, wherein the conditions include a measure of data criticality of the user data.
35. The system of claim 24, wherein the conditions include a measure of recency and repetition of the user data.
36. The system of claim 24, wherein the conditions include a measure of vulnerability of the storage devices.
37. The system of claim 24, wherein the converting step is performed on line.
38. The system of claim 24, wherein the converting step is performed off line.
39. The system of claim 24, wherein at least some of the RAID algorithms stored in the controller are characterized by a number of storage devices in the array (n), and a device-loss insurance level (m) such that when up to m devices of the array are unavailable, user data is fully recoverable from the remaining n-m devices, where 1≤m<n.

40. The system of claim 39, wherein the device-loss insurance level of the desired RAID is greater than the device-loss insurance level of the currently implemented RAID.
41. The system of claim 39, wherein the device-loss insurance level of the desired RAID is less than the device-loss insurance level of the currently implemented RAID.
42. The system of claim 39, wherein the condition is a decrease in the number of available storage devices in the array and the desired RAID after conversion has the same device-loss insurance level as the currently implemented RAID.
43. The system of claim 39, wherein the condition is a decrease in the number of available storage devices in the array and the desired RAID after conversion has a lower device-loss insurance level than the currently implemented RAID.
44. The system of claim 39, wherein the rules define a maximum device-loss insurance level and a minimum device-loss insurance level for a given n value, and one or more conditions based on which a desired device-loss insurance level is determined, the desired device-loss insurance level falling between the maximum and minimum device-loss insurance levels.
45. A computer program product comprising a computer usable medium having a computer readable code embodied therein for controlling a redundant array of inexpensive devices (RAID), the RAID comprising a controller and a plurality of storage devices for storing user data, the controller storing a plurality of RAID algorithms to be implemented for writing data to and reading data from the storage devices, the computer program product comprising: first computer readable program code configured to cause the controller to store one or more rules for selecting a desired one of the plurality of RAID algorithms based on one or more conditions of the array; second computer readable program code configured to cause the controller to detect the one or more conditions of the array; third computer readable program code configured to cause the controller to select the desired RAID algorithm based on the detected conditions and the stored rules; and fourth computer readable program code configured to cause the controller to, when the desired RAID algorithm is different from the RAID algorithm currently implemented in the array, convert the array from the currently implemented RAID algorithm to the desired RAID algorithm.
46. The computer program product of claim 45, wherein the fourth computer readable program code comprises: fifth computer readable program code configured to cause the controller to read a unit of user data from the storage devices according to the currently implemented RAID algorithm; sixth computer readable program code configured to cause the controller to define a watermark indicating the position where the data is read from the current RAID; and seventh computer readable program code configured to cause the controller to write user data on appropriate storage devices according to the desired RAID algorithm.
47. The computer program product of claim 46, further comprising seventh computer readable program code configured to cause the controller to process user I/O requests; and eighth computer readable program code configured to cause the controller to alternate between executing the fifth, sixth and seventh program codes and executing the seventh program code.

48. The computer program product of claim 45, wherein the desired RAID has fewer storage devices storing user data than the currently implemented RAID.
49. The computer program product of claim 45, wherein the desired RAID has more storage devices storing user data than the currently implemented RAID.
50. The computer program product of claim 45, wherein the conditions include the current capacity utilization of the array.
51. The computer program product of claim 45, wherein the conditions include a performance requirement.
52. The computer program product of claim 45, wherein the conditions include a change in the number of available storage devices in the array.
53. The computer program product of claim 45, wherein the conditions include a decrease in the number of available storage devices in the array.
54. The computer program product of claim 45, wherein the conditions include an increase in the number of available storage devices in the array.
55. The computer program product of claim 45, wherein the conditions include a measure of data criticality of the user data.
56. The computer program product of claim 45, wherein the conditions include a measure of recency and repetition of the user data.
57. The computer program product of claim 45, wherein the conditions include a measure of vulnerability of the storage devices.
58. The computer program product of claim 45, wherein the converting step is performed on line.
59. The computer program product of claim 45, wherein the converting step is performed off line.
60. The computer program product of claim 45, wherein at least some of the RAID algorithms stored in the controller are characterized by a number of storage devices in the array (n), and a device-loss insurance level (m) such that when up to m devices of the array are unavailable, user data is fully recoverable from the remaining n-m devices, where 1≤m<n.

61. The computer program product of claim 60, wherein the device-loss insurance level of the desired RAID is greater than the device-loss insurance level of the currently implemented RAID.

62. The computer program product of claim 60, wherein the device-loss insurance level of the desired RAID is less than the device-loss insurance level of the currently implemented RAID.
63. The computer program product of claim 60, wherein the condition is a decrease in the number of available storage devices in the array and the desired RAID after conversion has the same device-loss insurance level as the currently implemented RAID.
64. The computer program product of claim 60, wherein the condition is a decrease in the number of available storage devices in the array and the desired RAID after conversion has a lower device-loss insurance level than the currently implemented RAID.
65. The computer program product of claim 60, wherein the rules define a maximum device-loss insurance level and a minimum device-loss insurance level for a given n value, and one or more conditions based on which a desired device-loss insurance level is determined, the desired device-loss insurance level falling between the maximum and minimum device-loss insurance levels.

Kris Land, Recent patents listed: 20050097388 - Data distributor

A data distribution system that includes a plurality of access ports (APs) connected by one or more crossbar switches. Each crossbar switch has a plurality of serial connections, and is dynamically configurable to form connection joins between serial connections. Each AP has one or more serial connections, a processor, memory, and a bus. A first subset of the APs are host and/or peripheral device APs which further include host and/or peripheral device adapters for connecting to hosts and/or peripheral devices. A second subset of the APs are CPU-only APs that are not connected to a host or peripheral device but perform data processing functions. The data distributor system accomplishes efficient data distribution by eliminating a central CPU that essentially processes every byte of data passing through the system. The data distributor system can be implemented in a storage system such as a RAID system.

BACKGROUND OF THE INVENTION
[0001] 1. Field of the Invention
[0002] This invention relates to data distribution, and in particular, to data distribution in a data storage system or other distributed data handling systems.
[0003] 2. Description of the Related Art
[0004] In peer-to-peer and mass storage development, data throughput has been a limiting factor, especially in applications such as movie downloads, pre- and post-film editing, virtualization of streaming media and other applications where large amounts of data must be moved on and off storage systems. One cause of this limitation is that in current systems, every byte of data passing through is handled by a central CPU, internal system buses and the associated main memory. In the following description, a RAID (Redundant Array of Inexpensive Disks) is used as an example of a data storage system, but the analysis is applicable to other systems. A general description of RAID and a description of specific species of RAID, referred to as RAIDn here, may be found in U.S. Pat. No. 6,557,123, issued Apr. 29, 2003 and assigned to the assignee of the present application.
[0005] FIG. 5 is an exemplary configuration for a conventional hardware RAID system. One or more hosts 502 and one or more disks 504 are connected to an EPCI bus or buses 508 (or other suitable bus or buses) via host bus adapters (HBAs) 506. The hosts may be CPUs or other computer systems. The EPCI bus 508 is connected to an EPCI bridge 510 (or other suitable bridge). A CPU 514 and a RAM 516 are connected via a front side bus 512 to the EPCI bridge 510. In a RAID system, the CPU 514 and the RAM 516 are typically local to the RAID hardware. In a RAID write operation, data flows from a host 502 to the CPU 514 via the HBA 506, the EPCI bus or buses 508, the EPCI bridge 510, and the front side bus 512; and flows back from the CPU to a disk 504 via the front side bus 512, the EPCI bridge 510, the EPCI bus 508, and the HBA 506. Data flows in the opposite direction in a read operation. All data processing, including RAID encoding and decoding, is handled by the CPU 514.

SUMMARY OF THE INVENTION
[0006] The present invention is directed to a data distribution system and method that substantially obviates one or more of the problems due to limitations and disadvantages of the related art.
[0007] An object of the present invention is to provide a data distribution system that is capable of moving large amounts of data among multiple hosts and devices efficiently by using a scheme of destination control and calculation.
[0008] Additional features and advantages of the invention will be set forth in the descriptions that follow and in part will be apparent from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
[0009] To achieve these and other advantages and in accordance with the purpose of the present invention, as embodied and broadly described, the present invention provides a data distribution system, which includes one or more crossbar switches and a plurality of access ports. Each crossbar switch has a plurality of serial connections, and is dynamically configurable to form connection joins between serial connections to direct serial transmissions from one or more incoming serial connections to one or more outgoing serial connections. Each access port has one or more serial connections for connecting to one or more crossbar switches, a processor, memory, and an internal bus. Each of a first subset of the plurality of access ports further includes one or more host adapters and/or peripheral device adapters for connecting to one or more hosts and/or peripheral devices, and each of the first subset of access ports is connected to at least one crossbar switch. Each of a second subset of the plurality of access ports has one or more input serial connections and one or more output serial connections connected to one or more crossbar switches, and is adapted to perform data processing functions.
[0010] Optionally, one of the crossbar switches is a control crossbar switch connected to all of the plurality of access ports for transmitting control signals among the plurality of access ports, and one of the plurality of access ports is an allocator CPU access port which is connected to the control crossbar switch via a serial connection, the allocator CPU access port being operable to control the other access ports to direct data transmissions between the other access ports connected via crossbar switches.
[0011] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS
[0012] FIG. 1 is a schematic diagram of the basic configuration of a data distribution system according to an embodiment of the present invention.
[0013] FIGS. 2(a)-2(f) schematically illustrate the structure of a data distributor according to embodiments of the present invention.
[0014] FIG. 3 shows an access port.
[0015] FIGS. 4(a) and 4(b) show a data crossbar with connections and join patterns for data write.
[0016] FIG. 5 shows a system configuration for a conventional hardware RAID.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0017] In the following description, a RAID (Redundant Array of Inexpensive Disks) system is used as an example of a data storage system, but the invention may be applied to other systems, such as storage networking, storage pooling, storage virtualization and management, distributed storage, data pathways, data switches and other applications where using multicast and broadcast with this invention allows for a highly efficient method of moving data.
[0018] FIG. 1 is a schematic diagram illustrating an overview of the basic configuration of a data distribution system according to an embodiment of the present invention. The basic design of the data distribution system includes four sets of parallel components interconnected to one another. Specifically, the system includes a plurality of crossbar switches (XBAR) 102 which connect a plurality of hosts 104, a plurality of peripheral devices 106, and a plurality of processors 108 together. When two or more crossbars 102 are present, each of the crossbars is preferably connected to each of the hosts 104, to each of the data storage devices 106, and to each of the processors 108. The arrows of the connection lines in FIG. 1 indicate the direction of data movement for a data write operation of a storage device (such as RAID); arrows in opposite directions would apply to a data read operation.
[0019] A host 104 is typically a local or remote computer capable of independent action as a master, and may include system threads, higher nested RAID or network components, etc. The plurality of hosts 104 make their demands in parallel and the timing of their demands is an external input to the data distribution system, not controlled by the rest of the system. A queuing mechanism is operated by a processor which may be a specialized one of the processors 108. Such queuing does not involve mass data passing, but only requests passing. A peripheral device 106 is typically a local or remote device capable of receiving or sending data under external control, and may be data storage devices (disks, tapes, network nodes) or any other devices depending on the specific system. The processors 108 may be microprocessors or standard CPUs, or specialized hardware such as wide XOR hardware. They perform required data processing functions for the data distribution system, including RAID encoding and decoding, data encryption and decryption or any related compression and decompression or redundancy algorithms that may relate to mass storage or distributed networks, etc. As described earlier, optionally, one or more processors 108 may be specialized in control functions and control the data flow and the operation of the entire system including other processors. Control of data flow will be described in more detail later with reference to FIGS. 2-4. The arrows in FIG. 1 indicate the basic way data would move for a write operation. Data movement would be different for a read or other operations, as will be described in more detail with reference to FIGS. 2-4.
[0020] By using the crossbars 102 to connect the other components 104, 106 and 108, each peripheral device 106 may serve any math processor 108 and any host 104, and each data math processor 108 may serve any host 104 and any data storage device 106. In addition, the multiple processors 108 may share among themselves the tasks required by heavy demand from the hosts. Data may flow directly between the peripheral devices 106 and the hosts 104, or through the processors 108, depending on the need of the data distribution scheme.
[0021] FIG. 2(a) shows a more specific example of a data distributor according to an embodiment of the present invention. As shown in FIG. 2(a), one or more hosts 202 are connected to one or more host access ports (APs) 206 via host connections 204, which may be a SCSI, Fibre Channel, HIPPI, Ethernet, one or more T1's or greater or other suitable types of connections. One or more peripheral devices (such as storage disks, other storage devices or block devices) 208 are connected to one or more peripheral device APs 212 via standard drive buses 210, such as SCSI buses, Fibre Channel, ATA, SATA, Ethernet, HIPPI, Serial SCSI and any other physical transport layer, or other suitable types of connections. The host APs 206, the peripheral device APs 212, and one or more CPU-only APs 220 are connected to crossbar switches (data XBARs) 214. The APs 206, 212 and 220 may optionally be connected to a crossbar switch (control XBAR) 218, which is also connected to an allocator CPU AP 222. All connections between a crossbar and an AP are fast serial connections 216. The host APs 206, peripheral device APs 212 and CPU-only APs 220 are connected to the allocator CPU AP 222 by interrupt lines 224.
[0022] To avoid overcrowding the drawings, only one component of each kind is shown in FIG. 2(a), and the labels "*1", "*2", etc. designate the number of the corresponding component present in the system (when not indicated, the number of components is one). In addition, each illustrated connection line (e.g. the host connections 204, the standard drive buses 210, the serial connections 216 and the interrupt lines 224) represents a group of connections, each connecting one device at one end of the line with one device at the other end of the line. For example, six (*6) peripheral devices 208 are connected to three (*3) peripheral device APs 212 via six (*6) standard drive buses 210. As another example, two (*2) data crossbars 214 are present, and each data crossbar is connected to each of the host APs 206, each of the peripheral device APs 212 and each of the CPU-only APs 220. Accordingly, six (*6) serial connections 216 are present between three (*3) peripheral device APs 212 and two (*2) data crossbars 214. Of course, the numbers of components shown in FIG. 2(a) are merely illustrative, and other numbers of components may be used in a data distributor system. For example, multiple (two) data crossbars 214 are shown in FIG. 2(a), but the data distributor may be implemented with a single crossbar (so long as the total number of required connection joins does not exceed the maximum for a crossbar). The system shown in FIG. 2(a) is a complex example of a data distributor, and not all components shown here are necessarily present in a data distributor, as will become clear later.
[0023] In the data distributor of FIG. 2(a), both data and control signals are transmitted through the nodes (the host APs 206, peripheral device APs 212, and CPU-only APs 220). Typically, control signals or commands refer to signals that reprogram the access ports or affect their actions through program branching other than by transmission monitoring, transmission volume counting or transmission error detection. Data typically refers to signals, often clocked in large blocks, that are transmitted and received under the control of the programs in the hosts, peripheral devices, and access ports, without changing those programs or affecting them except through transmission monitoring, transmission volume counting and transmission error detection.
[0024] In operation, data flows between nodes as directed by the paths in data crossbars 214. The allocator CPU AP 222, which is the master of the data distributor 200, controls the APs 206, 212 and 220 by transmitting control commands to these APs through the control crossbar 218, and receiving interrupt signals from the APs via the interrupt lines 224. The allocator CPU AP 222, under boot load or program control, transmits commands to other APs, receives interrupt or control signals from other APs as well as from hosts, peripheral devices, or other components (not shown) of the network system such as clocks, and synchronizes and controls the actions of all of these devices. The data crossbar 214 is controlled in-band by any sending AP, which is accomplished by preloading the data stream from the sending AP with crossbar in-path commands. For example, a data stream originating from the Host AP 206 may contain a command header in the data stream being sent to the Data Xbar 214 that instructs the Data Xbar 214 to "multi-cast" the data stream to a plurality of peripheral AP's 212. The Host AP may receive its instructions from the Allocator CPU AP 222. The receiving peripheral AP's 212 may receive instructions from the Allocator CPU AP on what to do with the data received from the data XBAR 214.
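The in-band control described in this paragraph can be pictured with a short sketch: the sending AP prepends a routing command to the data stream before it reaches the data XBAR. The header layout, constants and function name below are invented for illustration only; the patent does not specify a wire format.

```python
# Illustrative sketch of in-band crossbar control: a sending AP preloads the
# outgoing data stream with a routing command for the data XBAR.
import struct

UNICAST, MULTICAST, BROADCAST = 0, 1, 2   # hypothetical mode codes

def build_stream(mode: int, destinations: list[int], payload: bytes) -> bytes:
    """Prepend an in-path routing command (mode + destination list) to the data."""
    header = struct.pack("!BB", mode, len(destinations))
    header += b"".join(struct.pack("!B", d) for d in destinations)
    return header + payload

# Host AP multi-casting one 512-byte block to three peripheral device APs:
stream = build_stream(MULTICAST, destinations=[0, 1, 2], payload=b"\x00" * 512)
```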
[0025] The structure of an access port (AP) is schematically illustrated in FIG. 3. The AP 300 typically includes a bus 302, one or more CPUs 304, CPU RAM 314 (such as 128 MB of 70 ns DRAM), RAM 306 (such as 256K to 512K of fast column RAM), one or more interrupt lines 308, and one or more serial connections 310. The bus 302 may have a typical speed of 533 MB/sec or higher. The serial connections may be fast serial connections matched to the crossbar to which the AP is connected, and capable of communicating data and/or control commands in either direction. The AP 300 optionally contains a ROM (not shown) for an allocator or other coding applications. The AP 300 may also contain an IO adapter 312 capable of connecting to one or more hosts or peripheral devices, via SCSI, ATA, Ethernet, HIPPI, Fibre Channel, or any other useful hardware protocol. The IO adapter provides both physical transmission exchanges and transmission protocol management and translation. Because of the presence of the CPU RAM 314, the AP 300 is capable of running a program, accepting commands via a serial connection 310, sending notification via an interrupt line 308, etc. By using RAM 306, the AP 300 is capable of buffering data over delays caused by task granularity, switch delays, and irregular burdening. Of course, a single RAM or other RAM combination may be used in lieu of the CPU RAM 314 and the RAM 306 shown in FIG. 3 as the processor's internal memory; preferably, such RAM or RAM combination should allow the AP to perform the above described functions at a satisfactory speed. The AP 300, under program control, is capable of creating, altering, combining, deleting or otherwise processing data by programmed computation, in addition to transmitting or receiving data to or from hosts and peripheral devices. The programming on the APs is preferably multitasked to run control code in parallel with data transmissions.
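A rough way to summarize the access-port building blocks listed above is a simple record; the field names and default sizes below merely echo the examples in the paragraph and are not a specification.

```python
# Schematic model of an access port (AP); illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AccessPort:
    cpu_count: int = 1
    cpu_ram_mb: int = 128           # program/control memory (e.g. 128 MB DRAM)
    buffer_ram_kb: int = 512        # fast column RAM used to buffer data
    serial_connections: int = 1     # links to one or more crossbars
    interrupt_lines: int = 1        # to the allocator CPU AP (omitted in FIGS. 2(b)-2(e))
    io_adapter: Optional[str] = None  # e.g. "SCSI" or "HIPPI"; None for a CPU-only AP

host_ap = AccessPort(io_adapter="HIPPI")
cpu_only_ap = AccessPort(serial_connections=2, io_adapter=None)
```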
[0026] Depending on the presence or absence of the adapter and, if present, the type of the adapter, an AP may be (1) a peripheral device AP (such as devices 212 in FIG. 2(a)) adapted for connection to one or more peripheral devices, (2) a host AP (such as devices 206 in FIG. 2(a)) adapted for connection to one or more hosts, (3) a CPU-only AP (such as device 220 or 222 shown in FIG. 2(a)) lacking an adapter, or other suitable types of APs. For example, the adapter in a peripheral device AP 212 may be a SCSI HBA, and the adapter in a host AP 206 may be an HIPPI adapter. Alternatively, a single adapter suitable for connection to either hosts or peripheral devices may be used, and the AP may thus be a host/peripheral device AP. A host AP and peripheral device AP may be one-sided at any given time from the adapter and data serial line point of view. "One-sided" refers to certain interface types where the interface requires an initiator and a target that behave differently from each other; "one-sided" does not mean that data flows in one direction. However, both the host AP and the peripheral device AP are preferably capable of transmission in both directions over their lifetime. Preferably, the programming for a host AP 206 causes it to look like a slave (such as a SCSI "target") to the host(s) 202 on its adapter, while the programming for a peripheral device AP 212 causes it to look like a master (such as a SCSI "initiator") to the peripheral device(s) 208 on its adapter. A host/peripheral device AP switches between these two roles. The slave/master distinction is independent of which direction the data flows.
[0027] A CPU-only AP lacks a host or peripheral device adapter, and is typically used for heavy computational tasks such as those imposed by data compression and decompression, encryption and decryption, and RAID encoding and decoding. A CPU-only AP typically requires two serial connections 310, i.e., both input and output serial data connections simultaneously.
[0028] A special case of a CPU-only AP is the Allocator CPU AP (device 222 in FIG. 2(a)). Unlike the other APs, each of which has an output interrupt line, the Allocator CPU AP has several input interrupt lines. Also unlike other APs, it does not require serial data connections for transmitting data; it requires only serial control connections for transmitting control signals. It is typically supplied with a larger CPU RAM 314 to run the master control program, which may be placed on an onboard ROM, or transmitted in through an optional boot load connection.
[0029] As is clear from the above description, not all components shown in FIG. 3 are required for an AP. The minimum requirement for an AP is an internal bus 302, a CPU 304, a RAM 306 or 314, and a serial connection 310. As will be described later with reference to FIGS. 2(b)-2(e), the interrupt lines 308 (224 in FIG. 2(a)) may be omitted and their function may be performed by a serial connection 310.
[0030] A crossbar switch (XBAR) is a switching device that has N serial connections, and up to N(N-1)/2 possible connection joins each formed between two serial connections. A typical crossbar may have N=32 serial connections. It is understood that "serial connections" here refer to the ports or terminals in the crossbar that are adapted for fast serial connections, which ports or terminals may or may not be actually connected to other system components at any given time. In use, a subset of the N(N-1)/2 possible connection joins may be activated and connected to other system components, so long as the following conditions are satisfied. First, at a minimum, each activated connection join connects one device that transmits data and one device that receives data. Second, no two connection joins share a data receiving device. The access ports connected to the crossbars, under program control, control the crossbar switches by rearranging the serial transmission connections to be point to point (uni-cast), one to many (multi-cast) or one to all (broadcast). Preferably, rearrangement occurs when the previous transmissions through the switch are complete and new transmissions are ready. Thus, the crossbar can be configured dynamically, allowing the crossbar configurations to change whenever necessary as required by the data distribution scheme.
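The join constraints in this paragraph, one sender and one receiver per join and no receiver shared between joins, can be checked in a few lines. This is a minimal sketch under those stated rules; the function names are illustrative.

```python
# Sketch of the crossbar connection-join rules described above.

def max_joins(n: int) -> int:
    """Up to N(N-1)/2 possible joins between N serial connections."""
    return n * (n - 1) // 2

def validate_joins(joins: list[tuple[int, int]], n: int) -> bool:
    """joins is a list of (sender_port, receiver_port) pairs on an N-port crossbar."""
    receivers = [r for _, r in joins]
    ports_ok = all(0 <= s < n and 0 <= r < n and s != r for s, r in joins)
    no_shared_receiver = len(receivers) == len(set(receivers))
    return ports_ok and no_shared_receiver and len(joins) <= max_joins(n)

# Multi-cast is modeled as one sender appearing in several joins, each with a
# distinct receiver; two joins aimed at the same receiver are rejected.
assert validate_joins([(0, 3), (0, 4), (0, 5)], n=32)   # one-to-many is allowed
assert not validate_joins([(0, 3), (1, 3)], n=32)       # receiver 3 is shared
```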
[0031] FIGS. 4(a) and 4(b) illustrate two examples of connection join patterns of a data crossbar in normal host (uni-cast or point to point) and rapid host (multi-cast) setups, respectively, for data write. The configurations for data read may be suitably derived; for example, in the case of FIG. 4(a), data read may be illustrated by reversing the direction of the arrows. FIG. 4(a) is an example for a RAID5 or RAIDn write, at a time when the parity calculation for a previous stripe is completed, and the parity calculation for the next stripe is just starting. In this exemplary system, each of two data crossbars 404a and 404b is connected to a host AP 402, to each of two CPU-only APs 406a and 406b, and to each of three peripheral device APs 408a, 408b and 408c, via fast serial connections. In particular, the connections between each data crossbar and each CPU-only AP include a pair of fast serial connections as shown in the figure, for example, connection 410a from the first data crossbar 404a to the first CPU-only AP 406a, and connection 410b in the reverse direction. The two data crossbars 404a and 404b have identical configurations, and each may receive every second bit, byte, block or unit of the data, for instance, for increased throughput.
[0032] The dotted lines 412a, 412b and 412c shown within the data crossbars represent connection joins, i.e. the path of data movement between connections. In this particular example, data moves in a direction from left to right for data write (and reversed for data read, not shown). Specifically, at this stage of a RAID5 or RAIDn write, data is moving from the host AP 402 to the CPU-only AP 406a (via path 412a) to start the new parity calculation for the next stripe, as well as to the peripheral device AP 408c (via path 412b) for storage. The parity data calculated by the CPU-only AP 406b for the previous stripe is moving from that AP to another peripheral device AP 408a for storage. In the illustrated example, two CPU-only APs are employed, but other configurations are also possible.
[0033] The crossbar configuration in FIG. 4(b) is an example for a RAID0 write, or a RAID5 or RAIDn write of a non-parity block. This configuration is similar to that of FIG. 4(a), but only one data crossbar 404 is involved in the operation. Data moves from the host 402 to each of the three peripheral device APs 408a, 408b and 408c via paths 412d, 412e and 412f. This configuration is advantageous in a situation where the host AP 402 has approximately the same connection speed as the total throughput of all connected peripheral device APs. Using this configuration, each data packet from the host AP is broadcast to all peripheral device APs, and data received may be selectively used or thrown away by the peripheral devices, without sacrificing system speed.
[0034] In general, the APs, under program control, are capable of accumulating data in their RAMs and buffering the data as appropriate for the efficient interleaving and superimposing of transmissions through crossbar switches.
[0035] One specific application of the data distributors according to embodiments of the present invention is a RAID data storage system, where a plurality of disks are connected to the data crossbar via disk APs. Various RAID configurations include RAID0, RAID1, RAID10 (also referred to as combined RAID), RAID5, RAIDn (which is ideally tuned for this invention), etc. In a RAID0 configuration, each bit, byte or block of data is written once in one of the disks. In a RAID0 write operation in the conventional system (FIG. 5), the data goes from one host 502 sequentially to all of the disks 504. Each disk 504 receives a number of blocks (including one block) and then the next disk becomes active. In the conventional system (FIG. 5), the EPCI bus 508 is traversed seven times, assuming a write to all six disks 504. Thus, assuming a bus speed of 266 MB/sec, the maximum transfer rate would be 38 MB/sec (266 MB/sec divided by seven). Using the data distributor according to an embodiment of the present invention (FIG. 2(a)), data can be broadcast from the host APs to the disk APs simultaneously, and the disk APs selectively use or throw away the received data. Using the data distributor according to another embodiment (FIG. 4(b)), when the host 402 and disk 408 buses are identical in speed to the conventional system example above (Ultra SCSI 320), the maximum transfer rate will be 320 MB/sec, i.e., limited by the host bus speed only. Further, by using a subtractive logic approach, the disk APs would simply ignore or delete the received data that would not be sent to their respective disks.
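The arithmetic in this paragraph can be restated in a couple of lines; the traversal count of seven (one inbound pass plus six outbound passes, following the description above) and the bus speeds are the figures quoted in the text.

```python
# Back-of-the-envelope throughput comparison for the six-disk RAID0 write example.

bus_speed_mb_s = 266
traversals = 7                      # one host-to-RAM pass plus six disk passes
conventional_rate = bus_speed_mb_s / traversals
print(round(conventional_rate))     # ~38 MB/sec on the shared EPCI bus

host_bus_mb_s = 320                 # Ultra SCSI 320 in the example
distributor_rate = host_bus_mb_s    # broadcast: the host bus is the only bottleneck
print(distributor_rate)             # 320 MB/sec
```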
[0036] In a RAID10 configuration (using a six-disk RAID as an example), a RAID0 of three disks is mirrored by an identical RAID0 of three disks. The read of a RAID10 is equivalent to a RAID0 by alternating mirror selection stripe by stripe in the standard way. For RAID10 writes, two writes (to two disks) are performed for every read (from a host). In the conventional system (FIG. 5), each of the two HBAs gets its copy from the RAM 516. In the data distributor (FIG. 2), data flows once from the host 202 via the host AP 206 to the data crossbar 214, and two copies are sent by the crossbar to two disks 208 via two disk APs 212.
[0037] In a RAID5 storage system, parts of the data written to the disks are data provided by the user (user data) and parts of the data are redundancy data (parity) calculated from the user data. For example, a six-disk array may be configured so that six blocks of data are written for every five blocks of user data, with one block being parity. Data read for RAID5 is similar to the RAID0 and the RAID10 read in efficiency. Data write for RAID5 involves the steps of fetching five blocks of user data and calculating one block of parity, and storing the parity block, as follows:

a> fetch Block0
b> fetch/XOR Block1
c> fetch/XOR Block2
d> fetch/XOR Block3
e> fetch/XOR Block4
f> store Block5
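The fetch/XOR/store sequence above is ordinary byte-wise XOR parity. A minimal sketch, assuming five equal-sized user blocks and leaving out striping details:

```python
# Sketch of the parity step sequence listed above: fetch Block0, fetch/XOR
# Blocks 1-4, and store the result as the parity block (Block5).

def raid5_parity(blocks: list[bytes]) -> bytes:
    """XOR the five user-data blocks together into one parity block."""
    parity = bytearray(blocks[0])                 # a> fetch Block0
    for block in blocks[1:]:                      # b>-e> fetch/XOR Block1..Block4
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)                          # f> store Block5

user_blocks = [bytes([b] * 4096) for b in (1, 2, 3, 4, 5)]
parity_block = raid5_parity(user_blocks)
assert parity_block[0] == 1 ^ 2 ^ 3 ^ 4 ^ 5       # XOR of the five user bytes
```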
[0038] In the conventional system (FIG. 5), for every five blocks of user data, five blocks of data would move from the host 502 to the RAM 516, five from the RAM to the CPU 514 through the front side bus 512 in the fetching steps, one from the CPU to the RAM through the front side bus in the storing step, and six from the RAM to the disk 504. In the data distributor according to an embodiment of the present invention (FIG. 2(a)), the data crossbar 214 may be used to distribute data efficiently. For example, in the five fetching steps, the user data blocks may be directed simultaneously through the data crossbar to two destinations, the disk APs 212 and the CPU-only APs 220 (fetch). In addition, the block of parity data from the previous calculation may be directed to its parity disk for storage at the same time the current first fetching step a> is being performed, by an independent transmission through the data crossbar. It can be seen that the data distributor according to embodiments of the present invention increases the efficiency of data distribution in the RAID systems.
[0039] Referring back to FIG. 2(a), in the data distributor system shown therein, data flow of the entire system is controlled by the Allocator CPU AP 222. The Allocator AP 222 communicates with the nodes (the host APs 206, the peripheral device APs 212 and the CPU-only APs 220) via serial connections 216 (through the control XBAR 218) and the interrupt lines 224. In addition, by setup transmissions to the nodes, the Allocator CPU AP 222 notifies the nodes to perform particular actions according to the desired data distribution scheme. The command transmissions and setup transmissions are typically small, infrequent and quick. The nodes then control the XBARs by transmitting commands to the XBAR in-band. The data processing is primarily handled by the CPU-only APs 220; the Allocator CPU AP 222 merely directs large amounts of data to individual math processors after notifying them of their task, but does not actually process each byte of data. Thus, this data distribution system reduces the amount of data processing performed by any central processor, and increases the speed and efficiency of data distribution. For example, if data transmission is handled in packets of at least 512 bytes, and control by the CPU is managed by a single 4-byte word, an improvement of more than 100-fold may be achieved over a conventional system in which data processing (e.g. encoding and decoding) is performed by a central CPU.
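The more-than-100-fold figure follows from the ratio of data moved per control action; a quick check, assuming the 512-byte packets and 4-byte control word quoted above:

```python
# Ratio of data handled by the APs/crossbars to data handled by the allocator.
packet_bytes = 512
control_word_bytes = 4
print(packet_bytes // control_word_bytes)   # 128, i.e. a more-than-100-fold reduction
```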
[0040] FIGS. 2(b)-2(e) illustrate alternative structures of a data distributor according to other embodiments of the present invention. Like components in FIGS. 2(b)-2(e) are designated by like or identical reference symbols as in FIG. 2(a). In the structure of FIG. 2(b), the interrupt lines are eliminated, and the Allocator CPU AP 222 communicates with the nodes 206, 212 and 220 in both directions via serial connections 216 and the control XBAR 218.
[0041] In the structure of FIG. 2(b), the Allocator CPU AP 222 may be eliminated and its functions performed by one of the CPU-only APs 220, and/or the control XBAR 218 may be eliminated and its functions performed by one of the data XBARs 214. FIG. 2(c) illustrates a structure where the Allocator CPU AP 222 is eliminated and its functions performed by one of the CPU-only APs 220. This CPU-only AP 220 is connected to the host APs 206 and the peripheral device APs 212 via serial connections 216 and the control XBAR 218. FIG. 2(d) illustrates a structure where the control XBAR 218 is eliminated and its functions performed by one of the data XBARs 214. The Allocator CPU AP 222 is connected to this data XBAR 214 via a serial connection 216 and communicates with the nodes 206, 212 and 220 via this data XBAR 214. This is referred to as in-path communication. FIG. 2(e) illustrates a structure where the Allocator CPU AP 222 is eliminated and its functions performed by one of the CPU-only APs 220, and the control XBAR 218 is eliminated and its functions performed by one of the data XBARs 214. Comparing FIG. 2(e) with the overview illustration of FIG. 1, it may be seen that components 202 and 206 correspond to component 104, components 212 and 208 correspond to component 106, component 220 corresponds to component 108, and component 214 corresponds to component 102.
[0042] In the structures of FIGS. 2(b)-2(e), the APs have structures similar to that shown in FIG. 3 but without the interrupt line(s) 308. Other aspects of the alternative structures of FIGS. 2(b)-2(e) are identical or similar to those described above for the structure of FIG. 2(a).
[0043] It will be apparent to those skilled in the art that various modifications and variations can be made in a data distribution system and method of the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention cover modifications and variations that come within the scope of the appended claims and their equivalents.

What is claimed is:
1. A data distribution system for distributing data among components of a data processing system including hosts and peripheral devices, the system comprising: one or more crossbar switches each having a plurality of serial connections, each crossbar switch being dynamically configurable to form connection joins between serial connections to direct serial transmissions from one or more incoming serial connections to one or more outgoing serial connections; and a plurality of access ports each having one or more serial connections for connecting to one or more crossbar switches, a processor, a memory, and an internal bus, wherein each of a first subset of the plurality of access ports further includes one or more host adapters and/or peripheral device adapters for connecting to one or more hosts and/or peripheral devices, and each is connected to at least one crossbar switch, and wherein each of a second subset of the plurality of access ports includes one or more input serial connections and one or more output serial connections connected to one or more crossbar switches, and is adapted to perform data processing functions.
2. The data distribution system of claim 1, wherein at least one of the plurality of access ports is an allocator CPU access port which is connected to at least one crossbar switch via a serial connection, the allocator CPU access port being operable to control the other access ports to direct data transmissions between the other access ports connected via the crossbar switches.
3. The data distribution system of claim 2, further comprising interrupt lines connected between the allocator CPU access port and the other access ports.
4. The data distribution system of claim 1, wherein one of the crossbar switches is a control crossbar switch connected to all of the plurality of access ports for transmitting control signals among the plurality of access ports.
5. The data distribution system of claim 1, wherein the crossbar switches are dynamically configured in-band by one or more access ports connected thereto.
6. The data distribution system of claim 1, wherein each of the first subset of access ports is operable to transmit or receive data to or from hosts or peripheral devices.
7. The data distribution system of claim 1, wherein at least some of the access ports are operable to buffer data within the access ports and to transmit buffered data in an interleaving or superimposing manner through crossbar switches.
8. The data distribution system of claim 1, wherein a crossbar switch is configured to simultaneously direct an incoming serial transmission from one sending access port to a plurality of receiving access ports, wherein each receiving access port either discards the transmission or utilizes the transmission for further processing or transmission.
9. The data distribution system of claim 1, wherein the second subset of access ports operate to perform parallel computations and are connected with a plurality of crossbar switches.
10. The data distribution system of claim 1, wherein the second subset of access ports operate individually or in parallel to compute RAID and/or RAIDn parity encoding and decoding.
11. The data distribution system of claim 1, wherein the second subset of access ports operate individually or in parallel to compute data encryption and decryption.
12. The data distribution system of claim 1, wherein any of the first subset of access ports operate in a uni-cast mode, a multicast mode, and/or a broadcast mode.
13. The data distribution system of claim 1, wherein the host adapters and/or peripheral device adapters provide both physical transmission exchange and transmission protocol management and translation.
14. A data distribution system for distributing data among components of a data processing system including hosts and peripheral devices, the system comprising: one or more crossbar switches each having a plurality of serial connections, each crossbar switch being dynamically configurable to form connection joins between serial connections to direct serial transmissions from one or more incoming serial connections to one or more outgoing serial connections; and a plurality of access ports each having one or more serial connections for connecting to one or more crossbar switches, a processor, a memory, and an internal bus, wherein each of a first subset of the plurality of access ports further includes one or more host adapters and/or peripheral device adapters for connecting to one or more hosts and/or peripheral devices, and each is connected to at least one crossbar switch, wherein each of a second subset of the plurality of access ports includes one or more input serial connections and one or more output serial connections connected to one or more crossbar switches, and is adapted to perform data processing functions, wherein one of the crossbar switches is a control crossbar switch connected to all of the plurality of access ports for transmitting control signals among the plurality of access ports, and wherein at least one of the plurality of access ports is an allocator CPU access port which is connected to the control crossbar switch via a serial connection, the allocator CPU access port being operable to control the other access ports to direct data transmissions between the other access ports connected via crossbar switches.
15. The data distributor system of claim 14, further comprising interrupt lines connected between the allocator CPU access port and the other access ports.
