In Software-Composable Infrastructure (SCI), compute, storage, and networking resources are abstracted from their physical locations and are usually managed with software via a web-based interface. SCI makes data center resources as readily available as cloud services and is the foundation for private and hybrid cloud solutions. With the emergence of NVMe™ SSD and NVMe-oF™ technologies, SCI can disaggregate storage resources without sacrificing performance and latency. As NVMe SSD technology rapidly evolves, a significant performance bottleneck is introduced: RAID data protection.
In performing RAID computations, the user has historically had the following two options:
OS software RAID provides an independent solution that works with multiple media types (HDD or SSD) and protocols (SATA, SAS, NVMe). The challenge with OS software RAID is generally poor performance at a high cost in CPU resources. Sequential bandwidth, especially read bandwidth, can reach high performance levels, but sequential writes require protection (parity) computations. Small-block I/O patterns generally see even lower RAID performance, too low to make this option generally usable. In summary, this option has the protocol independence needed for network-attached storage devices but lacks the required performance.
Hardware RAID was convenient because a SAS adapter card could provide RAID in-line between the client and the storage housed in an external enclosure. In the HDD era, a simple ASIC on a RAID card was capable enough to handle all I/O: after all, even a SAS HDD delivered a maximum of only around 200 IOPS and 150 MB/s of throughput. However, a single NVMe SSD can now deliver around 1M IOPS and 7 GB/s of throughput.
Hardware RAID cards were slow to adapt from slower HDDs to higher-performing NVMe SSDs. That transition has now largely occurred, and these cards can provide higher performance levels when using SSDs. The challenge with these RAID adapters is that they can only be used with their native physical protocols: they cannot be used with network-attached devices and do not scale performance fully or efficiently. In summary, these adapters can potentially deliver the needed local performance but do not offer the protocol independence to work with network-attached devices, severely limiting their usefulness in modern Software-Composable Infrastructures or high-performance applications. These considerations also excluded them from testing in these benchmarks.
In this paper, we discuss and benchmark a third option: Hardware-Accelerated Software RAID. This option provides protocol independence and the high performance needed for network-attached Flash storage.
The challenge in implementing complex RAID levels such as 5 and 6 while maintaining high performance on NVMe drives is usually the parity calculations. Hardware RAID performs parity calculations in a hardware engine within the ASIC, while software RAID can only use the CPU's instruction set, whose performance is often limited.
Offloading and parallelizing the CPU-intensive parity calculations onto a hardware accelerator addresses this issue. There are a few potential hardware engines where these calculations can take place. The first option is to utilize CPU extensions (e.g., Vector/SIMD) to offload and parallelize the parity calculations and improve RAID performance. A second option is to offload and parallelize these calculations on dedicated hardware accelerators such as GPUs, DPUs, or FPGAs. Graid Technology Inc. provides the GPU-based RAID solution tested in this project, the SupremeRAID™ SR-1000. Figure 1 provides a block diagram of its implementation.
Figure 1: Block Diagram of Advanced RAID Solutions (Vector, GPU, FPGA)
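To make the offloaded work concrete, the sketch below (Python with NumPy, purely illustrative and not any vendor's implementation) shows the RAID 5 parity arithmetic described above: parity is a byte-wise XOR across the data blocks of a stripe, and a lost block is rebuilt by XOR-ing the surviving blocks with the parity. At NVMe rates this arithmetic must be performed for millions of stripes per second, which is why moving it onto a parallel engine such as a GPU, DPU, or FPGA matters.

```python
# Illustrative RAID 5 parity math only; not Graid's implementation.
# P parity is the XOR of the data blocks in a stripe; a lost block is
# recovered by XOR-ing the surviving blocks with P.
import numpy as np

STRIPE_WIDTH = 7          # 7 data blocks + 1 parity block (RAID 5, 7+1)
BLOCK_SIZE = 4096         # bytes per block

rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(STRIPE_WIDTH, BLOCK_SIZE), dtype=np.uint8)

# Parity generation: byte-wise XOR across the data blocks of the stripe.
parity = np.bitwise_xor.reduce(data, axis=0)

# Simulate losing block 3, then rebuild it from the survivors plus parity.
lost = 3
survivors = np.delete(data, lost, axis=0)
rebuilt = np.bitwise_xor.reduce(survivors, axis=0) ^ parity

assert np.array_equal(rebuilt, data[lost])
print("block", lost, "rebuilt correctly from parity")
```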
While GPU-based solutions are promising, each server requires a GPU, and at the time of writing, commercially available GPU-based solutions were limited, though several Vector and FPGA solutions are available.
For this project, we chose Graid Technology Inc.'s SupremeRAID SR-1000 NVMe-oF RAID card for performance benchmarking in conjunction with the Western Digital OpenFlex Data24 NVMe-oF Storage Platform.
Figure 2 provides a list of components used in this test.
| Load Generating Servers | Quantity | Description |
|---|---|---|
| Platform | 6 | Lenovo® ThinkSystem™ SR650 |
| Processor | 2 | Intel® Xeon® 6154 18-Core 3.0GHz (200W TDP) |
| Memory | 12 | 32GiB @ 2666MHz (384GiB) |
| Fabric | 1 | ConnectX®-5 100 Gb Ethernet HCA |
| RAID | 1 | Graid SupremeRAID SR-1000 |

| NVMe-oF Storage | Quantity | Description |
|---|---|---|
| Enclosure | 1 | Western Digital OpenFlex Data24 (FW4.0) |
| NVMe | 24 | Ultrastar® DC SN840 3.2TB (FW03) |

| Software | Quantity | Description |
|---|---|---|
| OS | 1 | RHEL 8.4 |
| Kernel | 1 | 4.18.0-305.25.1.el8_4.x86_64 |
| MOFED | 1 | In-Box Mellanox 5.6 |
Western Digital's OpenFlex Data24 NVMe-oF Storage Platform is similar to a 2.5″ SAS JBOD enclosure (Figure 3). It provides 24 slots for NVMe drives and a maximum capacity of 368 TB when using Western Digital Ultrastar DC SN840 15.36 TB devices. Unlike a SAS enclosure, the Data24's dual IO Modules use Western Digital RapidFlex™ C1000 NVMe-oF controllers, which allow full access to all 24 NVMe drives over up to six 100 Gb Ethernet ports.
The Data24 is a close replacement for the traditional SAS enclosures. However, the Data24 offers a significant benefit over these enclosures: the ability to integrate directly into Ethernet fabric, allowing for an Any-to-Any mapping of Object Storage Targets to Object Storage Servers. The OpenFlex Data24 design exposes the full performance of the NVMe SSDs to the network. With 24 Western Digital Ultrastar DC SN840 3.2 TB devices, the enclosure can achieve up to 71 GB/s of bandwidth and over 15 MIOPS at a 4K block size.
Graid Technology Inc.’s SupremeRAID SR-1000 PCIe 3.0 adapter (Figure 4) delivers SSD performance in AI-accelerated compute, All Flash Array (AFA), and High Performance Computing (HPC) applications. Designed for both Linux and Windows® operating systems, it supports RAID levels 0/1/10/5/6/JBOD, while the core software license supports up to 32 native NVMe drives.
The SupremeRAID SR-1000 enables NVMe/NVMe-oF, SAS, and SATA performance while increasing scalability, improving flexibility, and lowering TCO. This solution eliminates the traditional RAID bottleneck in mass storage to deliver maximum SSD performance for high-intensity workloads. Figure 5 shows Spec Sheet Data.
| Workload | SupremeRAID™ SR-1000 | Number of Drives | Performance per Drive |
|---|---|---|---|
| 4K Random Read | 16.00 MIOPS | 12 | 1.33 MIOPS |
| 4K Random Write | 0.75 MIOPS | 12 | 0.06 MIOPS |
| 512K Sequential Read | 110.00 GB/s | 20 | 5.50 GB/s |
| 512K Sequential Write | 11.00 GB/s | 20 | 0.55 GB/s |
| 4K Random Read in Rebuild | 3.00 MIOPS | 12 | 0.25 MIOPS |
The Data24 is available with one to three ports per IO Module (IOM), and there are two IOMs per Data24. The ratio of ports to IOMs influences the drive-to-port mapping options.
The configuration used for this benchmark is three ports per IO Module. In this configuration, a maximum of eight physical drives is accessible per IO Module port. Each physical device can present up to eight namespaces. In Figure 6, each device has a pair of ports, one per IO Module.
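For illustration, the arithmetic behind the 6x8n layout used later in this paper works out as follows (device and server names below are hypothetical; the actual namespace-to-server assignment is configured on the enclosure and fabric): 24 drives presenting two namespaces each yields 48 namespaces, or eight per load-generating server.

```python
# Illustrative arithmetic for the 6x8n layout; names are hypothetical and the
# actual namespace-to-server assignment is configured on the enclosure/fabric.
DRIVES = 24
NAMESPACES_PER_DRIVE = 2      # as used in this benchmark configuration
SERVERS = 6

namespaces = [f"drive{d:02d}-ns{n}"
              for d in range(DRIVES)
              for n in range(1, NAMESPACES_PER_DRIVE + 1)]
assert len(namespaces) == 48                       # 24 drives x 2 namespaces

per_server = len(namespaces) // SERVERS            # 8 namespaces per server
mapping = {f"server{s + 1}": namespaces[s * per_server:(s + 1) * per_server]
           for s in range(SERVERS)}

for server, ns_list in mapping.items():
    print(f"{server}: {len(ns_list)} namespaces ({ns_list[0]} ... {ns_list[-1]})")
```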
Flexible IO (FIO) is the workload generator. The SupremeRAID SR-1000 solution uses the standard OpenFlex Data24 Spec Sheet Process (Figure 7).
Fundamentally, the process has two phases – the sequential process using 128K blocks (to measure bandwidth) and the random process using 4K blocks (to measure IOPS).
We ran three instances of the tests and averaged the results.
We also checked for excessive variability using the Coefficient of Variation (COV). Any extreme variability is investigated and resolved; additional runs are performed if a clear cause, such as a test error or external interruption, is found.
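A minimal sketch of this run-three-times, average, and COV check is shown below. It assumes fio is installed, uses a placeholder target (/dev/nvme0n1) and illustrative job parameters rather than the actual Spec Sheet job files, and applies an example 5% COV threshold that is not taken from this paper.

```python
# Sketch of the run-three-times / average / COV check described above.
# Parameters and the target device are placeholders; the real benchmark
# targets the NVMe-oF namespaces (or the RAID volume) instead.
import json
import statistics
import subprocess

FIO_ARGS = [
    "fio", "--name=randread-4k", "--filename=/dev/nvme0n1",
    "--rw=randread", "--bs=4k", "--iodepth=32", "--numjobs=8",
    "--direct=1", "--ioengine=libaio", "--time_based", "--runtime=60",
    "--group_reporting", "--output-format=json",
]

def run_once() -> float:
    """Run one fio pass and return the aggregate read IOPS."""
    out = subprocess.run(FIO_ARGS, check=True, capture_output=True, text=True)
    return json.loads(out.stdout)["jobs"][0]["read"]["iops"]

results = [run_once() for _ in range(3)]          # three instances of the test
mean = statistics.mean(results)
cov = statistics.stdev(results) / mean            # coefficient of variation

print(f"mean IOPS: {mean:,.0f}  COV: {cov:.1%}")
if cov > 0.05:                                    # example threshold, not from the paper
    print("excessive run-to-run variability; investigate before accepting results")
```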
The first BASELINE tests (without simulated failures) were run with FIO and tested 4K Random Read, Random Mixed, and Random Write. Figure 8 shows the results.
Six servers, each connected to eight namespaces without RAID, established the performance BASELINE. Aggregate BASELINE results show 15.3 million IOPS for 4K Random Reads, 12 million IOPS for 4K Random Mixed, and 6.26 million IOPS for 4K Random Writes. These results are as expected and therefore serve as a good BASELINE for the Random IO tests. We compare all RAID results to the BASELINE.
We then ran the same tests with the SupremeRAID SR-1000 solution. We created a single eight-namespace RAID 5 (7+1) set on each server, with aggregate results showing 15.3 million IOPS for 4K Random Reads, 6.17 million IOPS for 4K Random Mixed, and 2.6 million IOPS for 4K Random Writes.
The Random Read IOPS match the OpenFlex Data24 BASELINE results, demonstrating the SupremeRAID non-blocking architecture under a 4K Random Read workload while validating the test infrastructure. Random Mixed and Random Write workloads showed the performance drops expected with RAID 5: each small write becomes a read-modify-write sequence (read old data and parity, compute new parity, write new data and parity), and those parity calculations carry an unavoidable compute cost and delay.
We also tested another third-party advanced software-based RAID solution (that exploits advanced CPU instruction set features).
Again, we created a RAID 5 (7+1) set on the six servers, with aggregate results showing 12.2 million IOPS for 4K Random Reads, 5.83 million IOPS for 4K Random Mixed, and 2.21 million IOPS for 4K Random Writes.
In all instances, the advanced software RAID solution was less performant than the SupremeRAID GPU solution.
Next, we ran the large-block (128K) Spec Sheet Sequential benchmark on the same configuration described above. The results are shown in Figure 9.
The Spec Sheet Sequential BASELINE achieved 71.5 GB/s for 128K Sequential Reads and 39.6 GB/s for 128K Sequential Writes. These results are as expected for the Data24 and therefore serve as a good BASELINE for the Sequential IO tests. We compare all RAID results to the BASELINE.
In this instance, the 128K Sequential Read result of 61.4 GB/s is lower than the 71.5 GB/s demonstrated in the BASELINE and about 12% slower than the advanced software RAID result below. This slowdown is not due to parity computation, as there is none for reads. The lower SupremeRAID Sequential Read performance occurs because all data flows between the GPU and the SSDs in 4K blocks, which requires de-blocking and re-blocking all non-4K I/O (a single 128K read becomes 32 separate 4K operations).
The 128K Sequential Write results are 30.3 GB/s, which outperform the advanced software RAID results (20.3 GB/s) by 49%, clearly demonstrating the advantages of offloading the compute (parity) calculations from the CPU to a software-enabled, GPU-based architecture.
Below we compare the performance of the three SupremeRAID adapter operational states (NOMINAL, DEVICE DOWN, and DEVICE REBUILD) to the BASELINE solution: a Data24 FW4.0 using 24 SN840 3.2TB devices (each with two namespaces) in a 6x8n configuration (six servers, each using eight namespaces). All comparisons are of the SupremeRAID state (NOMINAL, DEVICE DOWN, or DEVICE REBUILD) to the BASELINE.
The first panel in Figure 10 shows the gross performance of the BASELINE for 4K Random Writes (R.W.), Random Mixed (R.M.), and Random Reads (R.R.) at 6.26, 12.00, and 15.30 MIOPS, respectively.
The second panel shows SupremeRAID NOMINAL performance for 4K R.W., R.M., and R.R. at 2.60, 6.12, and 15.30 MIOPS, respectively.
As measured by WORK/CPU%, SupremeRAID NOMINAL is 23% more efficient than the BASELINE for the 4K R.R. workload. Otherwise, the BASELINE is ~22% and ~7% more efficient than SupremeRAID NOMINAL for the 4K R.W. and R.M. workloads, respectively.
The data also shows that the BASELINE outperforms SupremeRAID NOMINAL in Latency (i.e., RAID 5 increases the latency). Still, the COV (Coefficient of Variation) and CPU Percent (usr+sys) are better for SupremeRAID NOMINAL.
Observations:
In this test, we remove a device from the RAID Set and then run the Spec Sheet Random benchmark. Standard SupremeRAID control commands remove the device from the RAID Set to simulate a failed device. See Figure 11.
We continue to compare the SupremeRAID performance to the BASELINE performance. The reasons for this are:
The last panel shows that SupremeRAID DEVICE DOWN performance for 4K R.W., R.M., and R.R. is 62, 62, and 44 percent lower than the BASELINE, respectively. All three represent significant impacts relative to the BASELINE.
Efficiency per CPU%, shown in the rightmost column of the first two panels, is computed by dividing the work in IOPS by the CPU% required to generate that workload, yielding the number of IOPS delivered per CPU%.
BASELINE CPU efficiency is 23 and 16 percent better than SupremeRAID DEVICE DOWN for the 4K R.W. and R.M. workloads, respectively.
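As a worked illustration of this metric only (the CPU% figures below are hypothetical placeholders, not measurements from this study), efficiency is simply the delivered IOPS divided by the CPU% consumed.

```python
# Worked illustration of the efficiency metric (IOPS per CPU%).
# The CPU% values are hypothetical placeholders, not measured results.
def efficiency(iops: float, cpu_percent: float) -> float:
    """Work delivered per unit of CPU consumed."""
    return iops / cpu_percent

baseline_eff = efficiency(iops=15_300_000, cpu_percent=50.0)   # 4K R.R. BASELINE, CPU% assumed
nominal_eff = efficiency(iops=15_300_000, cpu_percent=40.0)    # 4K R.R. NOMINAL, CPU% assumed

print(f"BASELINE efficiency: {baseline_eff:,.0f} IOPS per CPU%")
print(f"NOMINAL  efficiency: {nominal_eff:,.0f} IOPS per CPU%")
print(f"NOMINAL vs BASELINE: {nominal_eff / baseline_eff - 1:+.0%}")  # with these placeholder values
```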
Here, we add back the failed device from the previous test. Then, the rebuild process starts, and the standard Spec Sheet Random Benchmark is run. See the results in Figure 12.
The standard workloads take 20 minutes for each of the three IO types (60 minutes total). The rebuild operates at its highest performance and completes in 85 minutes, 25 minutes longer than the standard workloads.
As previously shown, the MIOPS column shows the performance of the three workloads in Panels 1 and 2 for the BASELINE and SupremeRAID, respectively.
At a high level, there is not much performance difference between SupremeRAID DEVICE DOWN and SupremeRAID DEVICE REBUILD in terms of throughput or efficiency.
Those considering a RAID 5 solution must be able to meet their service levels at the lowest SupremeRAID performance state, or augment RAID 5 with workload shedding or deferral, fail-over, etc.
There are two essential Business Continuity and Disaster Recovery (BCDR) objectives: the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO).
RAID 5 essentially eliminates both recovery time and data loss (RTO and RPO) for a single failed device. Of course, there are many elements to a comprehensive BCDR plan, but a well-planned and well-sized RAID solution can manage single instances of device failure.
Figure 13 (Performance by Configuration Summary) and Figure 14 (Efficiency by Configuration Summary) provide 3D chart and table data summarizing SupremeRAID SR-1000 adapter RAID 5 performance on our standard Spec Sheet Random (SSR) benchmark. These figures show:
The BASELINE is a Data24 with 24 SN840 3.2TB devices (each with two namespaces) tested with the SSR benchmark, along with a SupremeRAID R5 (7+1) implementation on that same Data24 in each of the NOMINAL, DEVICE DOWN, and DEVICE REBUILD states.
The slopes of the chart elements are as expected, decreasing monotonically from the BASELINE through NOMINAL to DEVICE DOWN and DEVICE REBUILD.
The Random Read performance of the BASELINE and SupremeRAID NOMINAL is similar at ~15.30 MIOPS. But because the parity calculations move from the server CPUs to the GPU, the SupremeRAID solution is about 24% more efficient than the BASELINE.
This finding is unique in this study. It is achieved by offloading the RAID processing from the server CPUs to the GPU, freeing CPU cycles for the workload itself.
An NVMe-oF storage enclosure such as the OpenFlex Data24 allows for a broader degree of performance, flexibility, and cost savings not found with traditional hardware or OS-based software RAID.
This GPU architecture outperformed the Advanced Software RAID solution in all areas except large block sequential reads in these tests.
Consider the following:
GPU release cycles are regular, and it is fair to anticipate that performance should improve as GPU architectures are enhanced (along with server motherboard architecture – such as PCIe Gen 4). This regular product cycle, in turn, allows the consumer to balance performance requirements against the capabilities of the GPU – essentially driving a tighter cost versus performance model.
There are potential benefits to the server architecture when using this solution. Traditional hardware RAID is unlikely to meet the performance potential of NVMe devices; such cards scale poorly and require additional cables for device connectivity. RAID Add-In Cards (AIC) can add complexity and cost, consume extra PCIe slots, and disrupt airflow. A GPU-based RAID solution may reduce or eliminate these issues. Additionally, CPU cycles are freed to be assigned elsewhere or, if not required, allow a lower-specification (lower-cost) CPU to be considered.
The critical issue this solution addresses is making RAID 5, which has always been the most desired RAID configuration (adding one device eliminates a single point of failure), sufficiently performant for use across most general storage needs. The SupremeRAID does this with an elegantly simple implementation that requires no significant changes to the environment.
Figure 15 (SupremeRAID R5 Life Cycle Chart with BASELINE) shows the absolute and relative performance of the various workloads across the RAID life cycle (Initialization, Nominal, Device Down, and Device Rebuild). Potential consumers should understand this information to assess the applicability of Graid Technology, Inc.'s solution.
Figure 16 (SupremeRAID R5 Life Cycle Chart with RAID Set and Constituent Devices) provides a unique view of both the RAID Set used by the customer and the underlying constituent devices that make up the RAID Set.
Contributors
| Name | Company | Title |
|---|---|---|
| John Gatch | Western Digital | Technologist, Platforms Field Engineering |
| Calvin Falldorf | Western Digital | Principal Engineer, Platforms Field Engineering |
| Niall Macleod | Western Digital | Director, Platforms Field Engineering |
| Barrett Edwards | Western Digital | Sr. Director, Platforms Field Engineering |
Version History
| Contributor | Version | Date |
|---|---|---|
| Niall MacLeod | V0.1 | July 1, 2022 |
| Calvin Falldorf and John Gatch | V0.3 | July 8, 2022 |
| John Gatch | V0.5 | July 15, 2022 |
| Niall MacLeod, Calvin Falldorf, and John Gatch | V0.7 | July 22, 2022 |
| Scot Rives, Western Digital Legal, First Review | V0.91 | July 25, 2022 |
| John Gatch, Updated with Scot Rives Updates | V0.93 | July 28, 2022 |
| John Gatch, Updated with Scot Rives Updates | V1.0 | August 3, 2022 |
| John Gatch, Updated Title per Niall MacLeod | V1.1 | August 16, 2022 |
Document Feedback
For feedback, questions, and suggestions for improvements to this document, send an email to the Data Center Systems (DCS) Technical Marketing Engineering (TME) team distribution list at pdl-dcs-tm@wdc.com.
"*" indicates required fields
(LemondeInformatique.fr, Serge Leblal, October 11, 2022) Harnessing the computing power of entry-level GPU cards for workstations, the start-up Graid Technology…
Read More(Gartner Newsroom, July 14, 2022) “Organizations that do not invest in the short term will likely fall behind in the…
Read MoreDon’t miss our latest solution brief, exploring the fantastic performance of SupremeRAID™ NVME-oF RAID card partnered with the Western Digital®…
Read More