
📃 SupremeRAID™ with Western Digital OpenFlex Data24: White Paper

Abstract: This white paper explores the performance benchmarks of Graid Technology Inc.’s SupremeRAID™ SR-1000 NVMe-oF RAID card in conjunction with the Western Digital® OpenFlex™ Data24 NVMe-oF Storage Platform.

Introduction

In Software-Composable Infrastructure (SCI), compute, storage, and networking resources are abstracted from their physical locations and are usually managed with software via a web-based interface. SCI makes data center resources as readily available as cloud services and is the foundation for private and hybrid cloud solutions. With the emergence of NVMe™ SSD and NVMe-oF™ technologies, SCI can disaggregate storage resources without sacrificing performance and latency. As NVMe SSD technology rapidly evolves, a significant performance bottleneck is introduced: RAID data protection.

RAID Computations

In performing RAID computations, the user has historically had the following two options:

  • O.S. Software RAID (e.g., MDADM on Linux®)
  • Hardware RAID (e.g., a RAID Controller Card)

Software RAID

OS Software RAID provides an independent solution that can work with multiple media types (HDD or SSD) and protocols (SATA, SAS, NVMe). The challenge with OS Software RAID is generally poor performance with a high cost in CPU resources. Sequential bandwidth, especially Read bandwidth, can achieve high performance levels, but sequential writes require protection computations. Small-block I/O patterns generally have RAID performance levels low enough to render this option generally unusable. In summary, this option has the protocol independence needed on network-attached storage devices but lacks the required performance.

Hardware RAID

Hardware RAID was convenient because the SAS adapter card could provide it to the client, in line with the storage housed in an external enclosure. In the HDD era, a simple ASIC on a RAID card was capable enough to handle all I/O. After all, even with SAS HDD, maximum performance was only around 200 IOPS and 150 MB/s of throughput. However, a single NVMe SSD can now deliver around 1 M IOPS and 7 GB/s of throughput.

Hardware RAID cards were slow to adapt from slower HDDs to higher-performing NVMe SSDs. That transition has now largely occurred, and current cards can provide higher performance levels when using SSDs. The challenge with these RAID adapters is that they can only be used with their native physical protocols. They cannot be used with network-attached devices and don’t scale performance fully or efficiently. In summary, these adapters can potentially have the needed local performance but do not offer the protocol independence to work on network-attached devices, severely limiting their usefulness in modern Software-Composable Infrastructures or high-performance applications. These considerations also prevented their testing in these benchmarks.

In this paper, we discuss and benchmark a third option: Hardware-Accelerated Software RAID. This option provides protocol independence and the high performance needed for network-attached Flash storage.

GPU-based Hardware Accelerated Software RAID

The challenge of implementing complex RAID levels such as 5 and 6 while maintaining high performance on NVMe drives is usually parity calculations. Hardware RAID parity calculations use a hardware engine within the ASIC, while software RAID can only use the CPU’s instruction set, whose performance is often limited.
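To make the parity calculation concrete, the following is a minimal Python sketch of RAID 5 style XOR parity. It is not Graid's implementation; the 3+1 stripe, the 4-byte blocks, and the helper name xor_blocks are illustrative assumptions. It shows that parity is the byte-wise XOR of the data blocks and that any single lost block can be rebuilt from the survivors plus parity.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A hypothetical 3+1 stripe with tiny 4-byte blocks; real stripes are far larger.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Simulate losing the second block and rebuilding it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print("rebuilt block matches original:", rebuilt == data[1])
```

Every byte of every block takes part in the XOR, which is why this work is a natural candidate for SIMD units or a massively parallel accelerator.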

Offloading and parallelizing the CPU-intensive parity calculations onto a hardware accelerator often addresses this issue. There are a few potential hardware engines where these calculations can take place. The first option would be to utilize CPU extensions (e.g., Vector/SIMD) to offload and parallelize the parity calculations to improve RAID performance. A second option would be to offload and parallelize these calculations on dedicated hardware accelerators such as GPUs (DPUs) or FPGAs. Graid Technology Inc. provides the GPU-based RAID solution tested in this project, the SupremeRAID™ SR-1000. Figure 1 provides a block diagram of its implementation.

Block Diagram of Advanced RAID Solutions (Vector, GPU, FPGA)

Figure 1 RAID Hardware Elements

While GPU-based solutions are promising, each server requires a GPU. And at the time of writing, commercially available solutions of these technologies were limited, but several Vector and FPGA solutions are available.


Solution Components

For this project, we chose Graid Technology Inc.’s SupremeRAID SR-1000 NVMe-oF RAID card for performance benchmarking in conjunction with the Western Digital OpenFlex Data24 NVMe-oF Storage Platform.

Hardware and Software Summary

Figure 2 provides a list of components used in this test.

Load Generating Servers   Quantity   Description
Platform                  6          Lenovo® ThinkSystem™ SR650
Processor                 2          Intel® 6154 200TDP 18-Core 3.0GHz
Memory                    12         32GiB @ 2666MHz (384GiB)
Fabric                    1          ConnectX®-5 100 Gb Ethernet HCA
RAID                      1          Graid SupremeRAID SR-1000

NVMe-oF Storage           Quantity   Description
Enclosure                 1          Western Digital OpenFlex Data24 (FW4.0)
NVMe                      24         Ultrastar® DC SN840 3.2TB (FW03)

Software                  Quantity   Description
OS                        1          RHEL 8.4
Kernel                    1          4.18.0-305.25.1.el8_4.x86_64
MOFED                     1          In-Box Mellanox 5.6
Figure 2 Hardware and Software Summary
(One terabyte (TB) is equal to one trillion bytes. Actual user capacity may be less due to the operating environment.)

OpenFlex Data24 NVMe-oF Storage Platform

Western Digital’s OpenFlex Data24 NVMe-oF Storage Platform is similar to a 2.5″ SAS JBOD Enclosure (Figure 3). It provides 24 slots for NVMe drives and a maximum capacity of 368 TB when using Western Digital Ultrastar DC SN840 15.36 TB devices. Unlike a SAS enclosure, the Data24’s dual IO modules use Western Digital RapidFlex™ C1000 NVMe-oF Controllers. These controllers allow full access to all 24 NVMe drives over up to six ports of 100 Gb Ethernet.

The Data24 is a close replacement for the traditional SAS enclosures. However, the Data24 offers a significant benefit over these enclosures: the ability to integrate directly into Ethernet fabric, allowing for an Any-to-Any mapping of Object Storage Targets to Object Storage Servers. The OpenFlex Data24 design exposes the full performance of the NVMe SSDs to the network. With 24 Western Digital Ultrastar DC SN840 3.2 TB devices, the enclosure can achieve up to 71 GB/s of bandwidth and over 15 MIOPS at a 4K block size.
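The capacity and bandwidth figures above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: it treats 100 Gb Ethernet as a raw 100 Gb/s per port and ignores Ethernet and NVMe-oF protocol overhead, which is a simplifying assumption rather than a statement about the platform.

```python
SLOTS = 24
LARGEST_DRIVE_TB = 15.36      # Ultrastar DC SN840 15.36 TB
PORTS = 6
PORT_SPEED_GBIT = 100         # raw line rate per port; protocol overhead ignored

raw_capacity_tb = SLOTS * LARGEST_DRIVE_TB
fabric_ceiling_gbyte = PORTS * PORT_SPEED_GBIT / 8   # bits to bytes

quoted_bandwidth_gbyte = 71   # enclosure figure quoted above
share_of_ceiling = quoted_bandwidth_gbyte / fabric_ceiling_gbyte

print(f"raw capacity: {raw_capacity_tb:.2f} TB")
print(f"raw fabric ceiling: {fabric_ceiling_gbyte:.1f} GB/s")
print(f"quoted bandwidth as share of ceiling: {share_of_ceiling:.0%}")
```

Under these assumptions the quoted 71 GB/s sits close to what six 100 Gb ports can carry, which is consistent with the statement that the enclosure exposes the full performance of the SSDs to the network.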

Figure 3 OpenFlex Data24 NVMe-oF Enclosure

SupremeRAID SR-1000 NVMe-oF RAID

Graid Technology Inc.’s SupremeRAID SR-1000 PCIe 3.0 adapter (Figure 4) delivers SSD performance in AI-accelerated compute, All Flash Array (AFA), and High Performance Computing (HPC) applications. Designed for both Linux and Windows® operating systems, it supports RAID levels 0/1/10/5/6/JBOD, while the core software license supports up to 32 native NVMe drives.

The SupremeRAID SR-1000 enables NVMe/NVMe-oF, SAS, and SATA performance while increasing scalability, improving flexibility, and lowering TCO. This solution eliminates the traditional RAID bottleneck in mass storage to deliver maximum SSD performance for high-intensity workloads. Figure 5 shows Spec Sheet Data.

Figure 4 SupremeRAID SR-1000 PCIe 3.0

SupremeRAID SR-1000 Spec Sheet Data

Workload                    SupremeRAID™ SR-1000   Number of Drives   Performance per Drive
4K Random Read              16.00 MIOPS            12                 1.33 MIOPS
4K Random Write             0.75 MIOPS             12                 0.06 MIOPS
512K Sequential Read        110.00 GB/s            20                 5.50 GB/s
512K Sequential Write       11.00 GB/s             20                 0.55 GB/s
4K Random Read in Rebuild   3.00 MIOPS             12                 0.25 MIOPS
Figure 5 SupremeRAID SR-1000 Spec Sheet Data
Software: Linux Version: CentOS 8.5 | Hardware: CPU: Intel® Xeon® Gold 6338 CPU 32-Core with 2.0GHz x 2; Memory: SK Hynix HMA82GR7CJR8N-XN DIMM DDR4 3200 MHz 16GiB x 16; SSD: INTEL D7-P5510 SSDPF2KX038TZ 3.8TB x 20 | RAID Configuration: Random performance based on a drive group with 12 physical drives and 1 virtual drive; sequential performance based on a drive group with 20 physical drives and 1 virtual drive.
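The Performance per Drive column in Figure 5 is the card-level figure divided by the number of drives in the drive group. A short Python sketch, using only the values from the table, reproduces that column; the dictionary layout is an illustrative choice.

```python
# (card-level result, number of drives) taken from the spec sheet table.
spec_sheet = {
    "4K Random Read (MIOPS)": (16.00, 12),
    "4K Random Write (MIOPS)": (0.75, 12),
    "512K Sequential Read (GB/s)": (110.00, 20),
    "512K Sequential Write (GB/s)": (11.00, 20),
    "4K Random Read in Rebuild (MIOPS)": (3.00, 12),
}

per_drive = {name: round(total / drives, 2)
             for name, (total, drives) in spec_sheet.items()}

for name, value in per_drive.items():
    print(f"{name}: {value:.2f} per drive")
```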

Benchmarking Infrastructure

The OpenFlex Data24

The OpenFlex Data24 is available with one to three ports per IO Module, and there are two IO Modules per Data24. The ratio of ports to IO Modules influences the drive-to-port mapping options.

The configuration used for this benchmark is three ports per IO Module. In this configuration, a maximum of 8 physical drives are accessible per IO Module port. Each physical device can have up to 8 namespaces. In Figure 6 each device has a pair of ports, one port per IO Module.

Figure 6 Data24 6×8 with 24 SSDs with 2 Namespaces Each
  • The Western Digital Ultrastar DC SN840 devices are dual-ported NVMe drives. This architecture allows both paths to the device to be used, maximizing the performance potential of that device.
  • Each of the six servers includes a SupremeRAID SR-1000 and Mellanox® CX5 RDMA network interface card (RNIC).
  • This configuration allowed for a single path to a front-end IO Module port and the eight physical drives presented by that port. In this instance, each device has two namespaces. Each server in a pair that accesses a shared device is assigned one of the device’s two namespaces.
  • This configuration is considered non-HA; therefore, this benchmark used no redundant paths or multipathing.
  • The server connects directly or via a switch. There is minimal performance impact with either implementation.
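The 6×8 layout described above can be sketched as a small mapping. The numbering below is a hypothetical illustration, not the documented Data24 assignment: servers are numbered 0 to 5, consecutive servers form a pair that shares one group of eight drives, and the server attached to the first IO Module takes namespace 1 while its partner takes namespace 2.

```python
DRIVES = 24
PORTS_PER_IOM = 3
DRIVES_PER_PORT = DRIVES // PORTS_PER_IOM   # eight drives behind each port

def namespaces_for(server):
    """Return the (drive, namespace_id) pairs for one server (hypothetical numbering)."""
    port_group = server // 2    # servers 0-1 share group 0, 2-3 group 1, 4-5 group 2
    iom = server % 2            # 0 = first IO Module, 1 = second IO Module
    first = port_group * DRIVES_PER_PORT
    return [(drive, iom + 1) for drive in range(first, first + DRIVES_PER_PORT)]

layout = {server: namespaces_for(server) for server in range(6)}

for server, namespaces in layout.items():
    print(f"server {server}: drives {namespaces[0][0]}-{namespaces[-1][0]}, "
          f"namespace {namespaces[0][1]}")
```

With this numbering every server sees eight namespaces over a single path, and all 48 namespaces (24 drives with two namespaces each) are used exactly once.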

Benchmarking Methodology

Flexible IO (FIO) is the workload generator. The SupremeRAID SR-1000 solution uses the standard OpenFlex Data24 Spec Sheet Process (Figure 7).

Fundamentally, the process has two phases – the sequential process using 128K blocks (to measure bandwidth) and the random process using 4K blocks (to measure IOPS).
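As an illustration of the two phases, the Python sketch below assembles FIO command lines for the 128K sequential and 4K random workloads. Only the block sizes and workload types come from this paper. The device path, queue depth, job count, runtime, and the 70/30 read mix for Random Mixed are assumptions made for the example and are not the Spec Sheet Process settings.

```python
def fio_args(pattern, block_size, device="/dev/nvme0n1", runtime_s=1200):
    """Build an fio command line for one workload (all tuning values are assumed)."""
    args = [
        "fio",
        "--name=" + pattern + "-" + block_size,
        "--filename=" + device,        # hypothetical target device
        "--rw=" + pattern,
        "--bs=" + block_size,
        "--ioengine=libaio",
        "--direct=1",
        "--iodepth=32",                # assumed queue depth
        "--numjobs=8",                 # assumed job count
        "--time_based",
        "--runtime=" + str(runtime_s),
        "--group_reporting",
    ]
    if pattern == "randrw":
        args.append("--rwmixread=70")  # assumed read/write mix
    return args

sequential_phase = [fio_args(p, "128k") for p in ("read", "write")]
random_phase = [fio_args(p, "4k") for p in ("randread", "randrw", "randwrite")]

for command in sequential_phase + random_phase:
    print(" ".join(command))
```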

We ran three instances of the tests and averaged the results.

Also, we checked for excessive variability using the Coefficient of Variation (COV). Any extreme variability is investigated and resolved. Additional tests may be required if a clear cause exists, such as a test error, external interruption, etc.
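A minimal Python sketch of the averaging and variability check follows. The three run values are hypothetical placeholders, and the use of the sample standard deviation is an assumption, since the paper does not state which form was used.

```python
import statistics

def cov_percent(samples):
    """Coefficient of Variation: standard deviation as a percentage of the mean."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100

# Hypothetical MIOPS results from three runs of the same workload.
runs = [15.28, 15.31, 15.31]

average = statistics.mean(runs)
variability = cov_percent(runs)
print(f"average: {average:.2f} MIOPS, COV: {variability:.2f}%")
```

A low COV indicates that the three runs agree closely, so the average is a fair summary of the workload.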

Figure 7 Spec Sheet Process

Measured Performance – 4K Random IO

The first BASELINE tests (without simulated failures) were run with FIO and tested 4K Random Read, Random Mixed, and Random Write. Figure 8 shows the results.

BASELINE

Six servers with each connected to eight namespaces without RAID established a performance BASELINE. Aggregate results of the BASELINE show 15.3 million IOPS for 4K Random Reads, 12 million IOPS for 4K Random Mixed, and 6.26 million IOPS for Random Writes. These results are as expected and therefore serve as a good BASELINE for Random IO Tests. We compare all RAID Results to the BASELINE.

SupremeRAID RAID 5

We ran the same tests with the SupremeRAID SR-1000 Solution. We created a single eight namespace RAID 5 (7+1) set on each server, with the aggregate results showing 15.3 million IOPS for 4K Random Read, 6.17 million IOPS for 4K Random Mixed, and 2.6 million IOPS for 4K Random Writes.

The Random Read IOPS matches the OpenFlex Data24 BASELINE results and demonstrates the SupremeRAID non-blocking architecture with a 4K Random Read workload while validating the test infrastructure. Random Mixed and Random Write workloads showed the expected performance drops associated with RAID 5. The read-modify-write (parity) calculations have an unavoidable compute cost and delay.
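The read-modify-write cost can be illustrated with a short Python sketch. It is a textbook RAID 5 small-write update, not a description of SupremeRAID internals; the block contents and function names are illustrative. The new parity is the old parity XOR the old data XOR the new data, so one logical write turns into two reads and two writes.

```python
def xor(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(old_data, old_parity, new_data):
    """Return (new_parity, io_count) for a single-block RAID 5 update."""
    new_parity = xor(xor(old_parity, old_data), new_data)
    io_count = 4  # read old data, read old parity, write new data, write new parity
    return new_parity, io_count

# Hypothetical 3+1 stripe with 2-byte blocks.
stripe = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = xor(xor(stripe[0], stripe[1]), stripe[2])

new_block = b"\xde\xad"
new_parity, ios = small_write(stripe[1], parity, new_block)

# The shortcut must agree with recomputing parity over the whole updated stripe.
full_recompute = xor(xor(stripe[0], new_block), stripe[2])
print("parity matches full recompute:", new_parity == full_recompute, "| device IOs:", ios)
```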

Advanced Software RAID Solution

We also tested another third-party advanced software-based RAID solution (that exploits advanced CPU instruction set features).

Again, we created a RAID 5 (7+1) set on the six servers, with the aggregate results showing 12.2 million IOPS for 4K Random Reads, 5.83 million IOPS for 4K Random Mixed, and 2.21 million IOPS for Random Writes.

In all instances, the advanced software RAID solution was less performant than the SupremeRAID GPU solution.

Figure 8 4K Random IO Benchmarks

Measured Performance – 128K Sequential IO

Next, we ran the large block (128K) Spec Sheet Sequential Benchmark on the same configuration described above. The results are shown below in Figure 9.

BASELINE

The Spec Sheet Sequential BASELINE achieved 71.5 GB/s for 128K Sequential Reads and 39.6 GB/s for 128K sequential writes. These results are as expected for the Data24 and therefore serve as a good BASELINE for Sequential IO Tests. We compare all RAID Results to the BASELINE.

SupremeRAID RAID 5

In this instance, the 128K Sequential Read result is 61.4 GB/s, which is lower than the 71.5 GB/s demonstrated in the BASELINE and 12% slower than the advanced software RAID result below. This slowdown isn’t due to parity computation, as there is none for Reads. The lower SupremeRAID Sequential Read performance occurs because all data flows from the GPU to the SSDs in 4K blocks, which requires de-blocking and re-blocking all non-4K IO.

The 128K Sequential Write results are 30.3 GB/s, which outperform the advanced software RAID results (20.3 GB/s) by 49%, clearly demonstrating the advantages of offloading the compute (parity) calculations from the CPU to a software-enabled, GPU-based architecture.
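The percentages quoted for the sequential tests follow directly from the measured values. The Python sketch below uses only numbers stated in this section; the helper names are illustrative.

```python
def percent_faster(value, reference):
    """How much higher value is than reference, as a percentage of reference."""
    return (value / reference - 1) * 100

def percent_of(value, reference):
    """value expressed as a percentage of reference."""
    return value / reference * 100

# 128K Sequential Write: SupremeRAID RAID 5 vs. advanced software RAID (GB/s).
write_advantage = percent_faster(30.3, 20.3)

# SupremeRAID RAID 5 results as a share of the BASELINE (GB/s).
read_share = percent_of(61.4, 71.5)
write_share = percent_of(30.3, 39.6)

print(f"write advantage over software RAID: {write_advantage:.0f}%")
print(f"share of BASELINE retained: reads {read_share:.0f}%, writes {write_share:.0f}%")
```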

Figure 9 Sequential IO Benchmarks

PERFORMANCE AND EFFICIENCY

SupremeRAID NOMINAL vs. BASELINE

Below we compare the performance of the three SupremeRAID adapter operational states (NOMINAL, DEVICE DOWN, and DEVICE REBUILD) to the BASELINE solution: a Data24 FW4.0 using 24 SN840 3.2TB devices (each with two namespaces) in a 6x8n configuration (six servers, each using eight namespaces). Every comparison is of a SupremeRAID STATE (nominal, device down, or device rebuild) to the BASELINE.

The first panel in Figure 10 shows the gross performance of the BASELINE for the 4K Random Writes (R.W.), Random Mixed (R.M.), and Random Reads (R.R.) at 6.26, 12.00, and 15.30 MIOPS, respectively.

The second panel shows SupremeRAID NOMINAL performance for 4K R.W., R.M., and R.R. at 2.60, 6.12, and 15.30 MIOPS, respectively.

  • SupremeRAID NOMINAL matches the BASELINE of 15.30 MIOPS, and this test shows that SupremeRAID NOMINAL is transparent for this workload.
  • Otherwise, the BASELINE outperforms SupremeRAID NOMINAL by 58 and 49% for the 4K R.W. and R.M. workloads.
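The gaps quoted above are the SupremeRAID NOMINAL shortfall expressed as a percentage of the BASELINE. A short Python sketch using the MIOPS values from the two panels reproduces them; the dictionary layout is an illustrative choice.

```python
# MIOPS from the first two panels of Figure 10: (BASELINE, SupremeRAID NOMINAL).
panels = {
    "4K Random Write": (6.26, 2.60),
    "4K Random Mixed": (12.00, 6.12),
    "4K Random Read": (15.30, 15.30),
}

gap_percent = {workload: (1 - nominal / baseline) * 100
               for workload, (baseline, nominal) in panels.items()}

for workload, gap in gap_percent.items():
    print(f"{workload}: NOMINAL is {gap:.0f}% below BASELINE")
```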

As measured by WORK/CPU%, SupremeRAID NOMINAL is 23% more efficient than the BASELINE for the 4K R.R. workload. Otherwise, the BASELINE is ~22 and ~7% more efficient than SupremeRAID NOMINAL for the 4K R.W. and R.M. workloads, respectively.

The data also shows that the BASELINE outperforms SupremeRAID NOMINAL in Latency (i.e., RAID 5 increases the latency). Still, the COV (Coefficient of Variation) and CPU Percent (usr+sys) are better for SupremeRAID NOMINAL.

Observations:

  • In many environments, the protection RAID 5 provides against a single device failing in a RAID Set can offset its performance loss.
  • Without RAID 5, other data protection methods would have to be employed, which would likely be costlier than the RAID 5 solution, more complex, and could be more disruptive to production workloads.
Figure 10 SupremeRAID NOMINAL vs. BASELINE

SupremeRAID DEVICE DOWN vs. BASELINE

In this test, we remove a device from the RAID Set and then run the Spec Sheet Random benchmark. Common SupremeRAID control commands remove the device from the RAID Set to simulate a failed device. See Figure 11.

We continue to compare the SupremeRAID performance to the BASELINE performance. The reasons for this are:

  • The performance through the RAID Life Cycle (Nominal, Device Down, and Device rebuild) can vary with respect to the BASELINE.
  • It is sensible to compare to a well-known BASELINE – this would likely be a customer’s current solution, i.e., the BASELINE.

The last panel shows that SupremeRAID DEVICE DOWN performance for 4K R.W., R.M., and R.R. is 62, 62, and 44 percent lower than the BASELINE. All three of these are significant impacts from the BASELINE.

  • But we must remember that this solution has eliminated a single point of failure.
  • The cost and complexity of alternatives can be high and time-consuming.

SupremeRAID DEVICE DOWN has:

  • Higher latency but a lower COV, i.e., it is more stable.
  • Fifty percent lower CPU (usr+sys) for all three tests.

Efficiency per CPU% is computed by dividing the Work in IOPS by the CPU% required to generate this workload. This calculation provides the number of IOPS per CPU% shown in the rightmost column of the first two panels.

  • The third panel is the ratio of SupremeRAID and BASELINE for each workload, while the fourth panel converts the third panel to percent difference.
  • SupremeRAID CPU efficiency is 16 percent more than the BASELINE for the 4K R.R. workload.
  • BASELINE CPU efficiency is 23 and 16 percent better for the 4K R.W. and R.M. workloads.
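The efficiency calculation described above can be sketched in a few lines of Python. The CPU% values below are hypothetical placeholders chosen only to show the arithmetic; the measured values appear in the figure and are not reproduced here.

```python
def iops_per_cpu_percent(miops, cpu_percent):
    """Work in IOPS divided by the CPU% (usr+sys) needed to generate it."""
    return miops * 1_000_000 / cpu_percent

# Panels 1 and 2: efficiency for one workload (CPU% values are hypothetical).
baseline_eff = iops_per_cpu_percent(15.30, cpu_percent=30.0)
supremeraid_eff = iops_per_cpu_percent(15.30, cpu_percent=25.0)

# Panel 3: ratio of SupremeRAID to BASELINE. Panel 4: the same as a percent difference.
ratio = supremeraid_eff / baseline_eff
percent_difference = (ratio - 1) * 100

print(f"ratio: {ratio:.2f}, percent difference: {percent_difference:+.0f}%")
```

A positive percent difference means SupremeRAID delivers more IOPS per CPU% than the BASELINE for that workload, and a negative value means the BASELINE is more efficient.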

Figure 11 SupremeRAID DEVICE DOWN vs. BASELINE

SupremeRAID DEVICE REBUILD vs. BASELINE

Here, we add back the failed device from the previous test. Then, the rebuild process starts, and the standard Spec Sheet Random Benchmark is run. See the results in Figure 12.

The standard workloads take 20 minutes for each of the three IO Types. The rebuild operates at its highest performance and completes in 85 minutes (25 minutes longer than the standard workloads).

As previously shown, the MIOPS column shows the performance of the three workloads in Panels 1 and 2 for the BASELINE and SupremeRAID, respectively.

At a high level, there is not much performance difference between SupremeRAID DEVICE DOWN and SupremeRAID DEVICE REBUILD in terms of throughput or efficiency.

Figure 12 SupremeRAID DEVICE REBUILD vs. BASELINE

Those considering a RAID 5 solution must be able to meet their Service Levels with the lowest SupremeRAID performance or augment RAID 5 with workload shedding or deferring, fail-over, etc.

There are two essential Business Continuity and Disaster Recovery (BCDR) objectives:

  • Recovery Time Objective (RTO) and
  • Recovery Point Objective (RPO)

RAID 5 essentially addresses and eliminates RTO and RPO assuming just one failed device. Of course, there are many elements to a comprehensive BCDR, but a well-planned and well-sized RAID solution can manage single instances of device failure.


Performance by Configuration Summary

Figure 13 Performance by Configuration Summary and Figure 14 Efficiency by Configuration Summary provide 3D and table data that summarize SupremeRAID SR-1000 adapter R5 performance on our standard Spec Sheet Random (SSR) benchmark.

These figures show the BASELINE, a Data24 with 24 SN840 3.2TB devices (each with two namespaces) tested with the SSR benchmark, and a SupremeRAID R5 7+1 implementation on the same Data24 for:

  • SupremeRAID NOMINAL
  • SupremeRAID DEVICE DOWN
  • SupremeRAID DEVICE REBUILD

The slopes of the chart elements are as expected – monotonically decreasing from:

  • Left to right
  • Back to front
  • Left rear to right front
Figure 13 Performance by Configuration Summary

Efficiency by Configuration Summary

The Random Read performance for the BASELINE and the SupremeRAID NOMINAL is similar at ~15.30 MIOPS. But, because of the move of Parity Calculations from Server CPUs to the GPU, the SupremeRAID Solution is about 24% more efficient than the BASELINE.

This finding is unique in this study. It is achieved by:

  • Moving the CPU cycles to the GPU and
  • The efficient SupremeRAID 4K read pipeline.

SupremeRAID efficiency is similar for each workload (Random Reads, Random Mixed, and Random Writes) across the three life cycles (Nominal, Device Down, and Device Rebuild).
Figure 14 Efficiency by Configuration Summary

Conclusion

An NVMe-oF storage enclosure such as the OpenFlex Data24 allows for a broader degree of performance, flexibility, and cost savings not found with traditional hardware or OS-based software RAID.

This GPU architecture outperformed the Advanced Software RAID solution in all areas except large block sequential reads in these tests.

Consider the following:

  • SupremeRAID SR-1000 adapter is essentially a plug-and-play solution using a commercially available GPU.
  • SupremeRAID allows competitive pricing as the silicon architecture is not proprietary for this use.
  • The ability to separate the data path from the logic path adds value and flexibility.
  • A GPU upgrade or a GPU firmware upgrade could provide new features and performance improvements, possibly with low operational impacts.
  • Traditionally, the data path has presented itself as the bottleneck via an ASIC-based RAID controller or CPU computation. Direct IO between the CPU and GPU is efficient and allows the GPU’s massive computational capability to manage RAID calculations in the data path.

GPU release cycles are regular, and it is fair to anticipate that performance should improve as GPU architectures are enhanced (along with server motherboard architecture – such as PCIe Gen 4). This regular product cycle, in turn, allows the consumer to balance performance requirements against the capabilities of the GPU – essentially driving a tighter cost versus performance model.

There are potential benefits to be realized in the server architecture when using this solution. Traditional hardware RAID is unlikely to meet the performance potential of NVMe devices. Such cards scale poorly and require additional cables for device connectivity. RAID Add-In-Cards (AIC) can add complexity and cost, use extra PCIe slots, and disrupt airflow. A GPU-based RAID solution may reduce or eliminate these issues. Additionally, CPU cycles are freed to be assigned elsewhere or, if they are not required, a lower-specification (lower-cost) CPU can be considered.

The critical issue in this solution is making RAID 5, which has always been the most desired RAID configuration (add one device to eliminate a single point of failure), sufficiently performant for use across most general storage needs. The SupremeRAID does this in an elegantly simple implementation requiring no significant changes to the environment.

Figure 15 SupremeRAID R5 Life Cycle Chart with BASELINE shows the absolute and relative performance for the various workloads of the RAID Life Cycle (Initialization, Nominal, Device Down, and Device Rebuild). Potential consumers should understand this information to assess the applicability of Graid Technology, Inc.’s solution.

Figure 16 SupremeRAID R5 Life Cycle Chart with RAID Set and Constituent Devices provides a unique view of both the RAID Set used by the customer and the underlying Constituent Devices that make up the RAID Set.


Appendix 1: SupremeRAID R5 Life Cycle Chart with BASELINE

Figure 15 SupremeRAID R5 Life Cycle Chart with BASELINE

Appendix 2: SupremeRAID R5 Life Cycle Chart with RAID Set and Constituent Devices

Figure 16 SupremeRAID R5 Life Cycle Chart with RAID Set and Constituent Devices

Appendix 3: Document Details

Contributors

Name              Company           Title
John Gatch        Western Digital   Technologist, Platforms Field Engineering
Calvin Falldorf   Western Digital   Principal Engineer, Platforms Field Engineering
Niall Macleod     Western Digital   Director, Platforms Field Engineering
Barrett Edwards   Western Digital   Sr. Director, Platforms Field Engineering

References

Document Title                              Version   Date
SupremeRAID White Paper 2022-07-25-V0.91    V0.91     July 25, 2022
SupremeRAID White Paper 2022-07-25-V0.93    V0.93     July 28, 2022
SupremeRAID White Paper 2022-07-25-V1.0     V1.0      August 3, 2022
SupremeRAID White Paper 2022-08-16-V1.1     V1.1      August 16, 2022

Version History

Contributor                                         Version   Date
Niall MacLeod                                       V0.1      July 1, 2022
Calvin Falldorf and John Gatch                      V0.3      July 8, 2022
John Gatch                                          V0.5      July 15, 2022
Niall MacLeod, Calvin Falldorf, and John Gatch      V0.7      July 22, 2022
Scot Rives, Western Digital Legal, First Review     V0.91     July 25, 2022
John Gatch, Updated with Scot Rives Updates         V0.93     July 28, 2022
John Gatch, Updated with Scot Rives Updates         V1.0      August 3, 2022
John Gatch, Updated Title per Niall MacLeod         V1.1      August 16, 2022

Document Feedback

For feedback, questions, and suggestions for improvements to this document, send an email to the Data Center Systems (DCS) Technical Marketing Engineering (TME) team distribution list at pdl-dcs-tm@wdc.com.

Read the white paper results featured in Blocks & Files


About Graid Technology

Chosen by CRN as one of the Ten Hottest Data Storage Startups of 2021 and a 2022 Emerging Vendor in the Storage & Disaster Recovery category, Graid Technology Inc. has developed the world’s first NVMe and NVMe-oF RAID card to unlock the full potential of enterprise SSD performance. We’re headquartered in Silicon Valley, with an R&D center in Taiwan, and are led by a dedicated team of experts with decades of experience in the SDS, ASIC and storage industries. Graid Technology Inc. is redefining performance standards for enterprise data protection: a single SupremeRAID™ card delivers 19 million IOPS and 110GB/s of throughput. For more information on Graid Technology Inc., connect with us on Twitter or LinkedIn.
