The new build of StarWind SAN & NAS has just been released, bringing the long-awaited FC (Fibre Channel) support. StarWind SAN & NAS is designed to give new life to your existing hardware. Installed either on top of the hypervisor of your choice or on bare metal, it turns your server into a fast shared storage pool that can be accessed over iSCSI, SMB, or NFS, and, as of the new build, over FC as well. It uses the same SDS engine as StarWind vSAN, which means high performance, and it adds new features such as ZFS support for building highly resilient storage systems on commodity hardware.
This was a great chance to test how fast StarWind SAN & NAS can go over FC. The folks at StorageReview were kind enough to provide the testing infrastructure on which we performed the benchmark. Thanks again, StorageReview team!
Testing scope
We tested the performance of shared storage presented over FC to client nodes from a dedicated storage node packed with NVMe drives and running StarWind SAN & NAS. We decided to include only good old FCP (Fibre Channel Protocol) benchmark results in this article, since the NVMe/FC results were at the same level (on certain patterns, even lower than FCP). To collect the NVMe drives into a redundant storage array, we used MDRAID and GRAID and tested them separately. MDRAID is the Linux software RAID that ships as part of StarWind SAN & NAS and serves to collect drives into a redundant array. GRAID is an extremely fast NVMe/NVMe-oF RAID card designed to deliver the full potential of PCIe Gen4 systems.
It is worth mentioning that GRAID SupremeRAID is, as of now, the only NVMe RAID card capable of delivering close to the full performance of the underlying SSDs, largely removing the RAID performance bottleneck. What is the difference, you may wonder? GRAID SupremeRAID SR-1010 is built around an NVIDIA A2000 GPU. In most respects that doesn't make the solution anything special, but when it comes to NVMe RAID bottlenecks, the GPU gives it a head start over many alternatives. In particular, SupremeRAID offloads all RAID I/O processing to the GPU, and we don't need to tell you just how much this frees up CPU resources; standard RAID cards are simply no match for the computing potential of a GPU. Even though GRAID is a software RAID solution, the NVIDIA GPU is essential to many of the benefits it offers. Additionally, thanks to the specifics of the GRAID software architecture, data can flow directly between the CPU and the storage, bypassing the GRAID card.
Traditionally, NVIDIA cards serve various purposes: they are in demand for gaming, video acceleration, cryptocurrency mining, and professional tools such as VDI. NVIDIA also produces GPUs for vehicles. And now NVIDIA hardware powers storage appliances. This novelty marks nothing less than a breakthrough in applying the computing potential of the GPU to a whole new field.
Test bed
Here is the list of the hardware and software used in our benchmark.
Communication between the storage node and the client nodes was carried out over a 32GFC Fibre Channel fabric. The storage node had 4x Marvell® QLogic® 2772 Series Enhanced 32GFC Fibre Channel Adapters, while each client node had one. The storage and client nodes were connected through two Brocade G620 Fibre Channel switches to ensure resilience.
An interesting thing about the Marvell QLogic 2772 Fibre Channel adapters is that their ports are independently resourced, which adds another layer of resilience: complete port-level isolation across the FC controller architecture prevents errors and firmware crashes from propagating to other ports. More details on the high availability and reliability of the Marvell QLogic 2772 adapters are available here.
Storage connection diagram:
We combined the 8 NVMe drives on the storage node into a RAID5 array:
First, using MDRAID:
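A minimal sketch of building such an array with mdadm follows; the device names and chunk size are assumptions, not the exact values used in the test:

```bash
# Sketch: create a RAID5 array from 8 NVMe drives with mdadm.
# Device names and --chunk are illustrative assumptions.
mdadm --create /dev/md0 \
      --level=5 \
      --raid-devices=8 \
      --chunk=64K \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 \
      /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1

# Watch the initial resync progress
cat /proc/mdstat
```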
And then, with GRAID correspondingly:
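A hedged sketch of the equivalent GRAID setup with its graidctl CLI (syntax follows SupremeRAID documentation and may differ between versions):

```bash
# Sketch: build a GRAID RAID5 group from the same 8 NVMe drives.
# Register the NVMe devices as GRAID physical drives:
graidctl create physical_drive /dev/nvme0-7
# Build a RAID5 drive group from physical drives 0-7:
graidctl create drive_group raid5 0-7
# Expose the whole group as a single virtual drive:
graidctl create virtual_drive 0
```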
Once the RAID arrays were ready, we sliced them into 32 LUNs of 1TB each, distributed as 8 LUNs per client node. We did this because a single LUN has a performance limitation, and we wanted to squeeze the maximum out of our storage.
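StarWind SAN & NAS manages LUN creation through its own UI, so the following is only a hedged illustration of carving an array into 1TB volumes with LVM; all names are made up:

```bash
# Sketch: slice the array into 32 x 1TB logical volumes to be exported as LUNs.
pvcreate /dev/md0
vgcreate vg_san /dev/md0
for i in $(seq 1 32); do
    lvcreate -L 1T -n lun$i vg_san
done
```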
Here is an example of the 8 LUNs connected on one client node:
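On a Linux client, the connected FC LUNs would typically be verified with standard tools, for example:

```bash
# Sketch: inspect the FC LUNs visible on a client node.
lsblk            # block devices, including the 8 x 1TB LUNs
multipath -ll    # multipath topology across the two FC fabrics
lsscsi           # SCSI devices with transport info (requires lsscsi to be installed)
```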
Testing Methodology
The benchmark was performed using the fio utility. fio is a cross-platform, industry-standard benchmarking tool used to test both local and shared storage.
Testing patterns:
4k random read;
4k random write;
4k random read/write 70/30;
64k random read;
64k random write;
1M sequential read;
1M sequential write.
Test duration:
Single test duration = 600 seconds;
Before starting the write benchmarks, the storage was first warmed up for 2 hours (see the fio sketch below).
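For illustration, here is a hedged example of how one of these patterns could be run with fio; the exact job options used in the benchmark were not published, so the target device and several flags are assumptions:

```bash
# Sketch: 4k random read pattern, 600-second run, as described above.
# /dev/md0 is an assumed target; numjobs/iodepth were varied per test.
# For the 70/30 mixed pattern, --rw=randrw --rwmixread=70 would be used instead.
fio --name=4k-randread \
    --filename=/dev/md0 \
    --rw=randread \
    --bs=4k \
    --ioengine=libaio \
    --direct=1 \
    --numjobs=16 \
    --iodepth=32 \
    --runtime=600 \
    --time_based \
    --group_reporting
```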
Testing stages
1. Testing single NVMe drive performance to get reference numbers;
2. Testing MDRAID and GRAID RAID5 array performance locally;
3. Running the benchmark remotely from the client nodes.
1. Testing single NVMe drive performance:
Speaking of these NVMe SSDs, an interesting detail is that they support flexible power management from 10W to 35W, with a 25W power mode by default. Memblaze's NVMe drives can increase sequential write performance by consuming more power, which gives a flexible way to tune drive performance to a specific workload.
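As a hedged illustration, NVMe power states can generally be inspected and switched with standard nvme-cli (the feature ID is part of the NVMe spec, but the state-to-wattage mapping is drive-specific; check the drive's documentation first):

```bash
# Sketch: inspect and change the NVMe power state with nvme-cli.
nvme id-ctrl /dev/nvme0 | grep -A1 "ps "   # list supported power states and wattages
nvme get-feature /dev/nvme0 -f 0x02        # current power management setting
nvme set-feature /dev/nvme0 -f 0x02 -v 0   # switch to power state 0 (highest power)
```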
We obtained the optimal performance/latency results for a single NVMe drive at the following Numjobs and IOdepth values:
1x NVMe Memblaze PBlaze6 D6926 Series 12.8TB

| Pattern | Numjobs | IOdepth | IOPS | MiB/s | Latency (ms) |
|---|---|---|---|---|---|
| 4k random read | 16 | 32 | 1514000 | 5914 | 0.337 |
| 4k random write | 4 | 4 | 537000 | 2096 | 0.029 |
| 64k random read | 4 | 8 | 103000 | 6467 | 0.308 |
| 64k random write | 2 | 2 | 42000 | 2626 | 0.094 |
| 1M read | 1 | 4 | 6576 | 6576 | 0.607 |
| 1M write | 1 | 2 | 5393 | 5393 | 0.370 |
Before running the actual tests, we determined the time needed to warm up the NVMe drives to a steady state. The graph showed that the drives should be warmed up for around 2 hours.
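For reference, a hedged sketch of such a 2-hour warm-up (preconditioning) pass with fio; the parameters are assumptions, since the methodology only states the duration:

```bash
# Sketch: drive an SSD to steady state with a sustained sequential write pass.
fio --name=warmup \
    --filename=/dev/nvme0n1 \
    --rw=write \
    --bs=128k \
    --ioengine=libaio \
    --direct=1 \
    --iodepth=32 \
    --runtime=7200 \
    --time_based
```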
2. Testing MDRAID and GRAID RAID5 array performance locally:
Fewer words, more numbers: on to the MDRAID and GRAID local performance tests.
4k random read:
Table result (MD = MDRAID5, GR = GRAID5; the Comparison columns show the GRAID5 result as a percentage of the MDRAID5 result):

| Pattern | Numjobs | IOdepth | MD IOPS | MD MiB/s | MD lat. (ms) | MD CPU | GR IOPS | GR MiB/s | GR lat. (ms) | GR CPU | GR/MD IOPS | GR/MD MiB/s | GR/MD lat. | GR/MD CPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4k random read | 16 | 16 | 2670000 | 10430 | 0.095 | 7% | 1281000 | 5004 | 0.198 | 3% | 48% | 48% | 208% | 46% |
| 4k random read | 16 | 32 | 3591000 | 14027 | 0.141 | 10% | 2451000 | 9574 | 0.207 | 6% | 68% | 68% | 147% | 60% |
| 4k random read | 32 | 32 | 4049000 | 15816 | 0.250 | 20% | 4474000 | 17477 | 0.227 | 10% | 110% | 110% | 91% | 50% |
| 4k random read | 32 | 64 | 4032000 | 15750 | 0.504 | 30% | 7393000 | 28879 | 0.275 | 16% | 183% | 183% | 55% | 53% |
| 4k random read | 64 | 64 | 4061000 | 15863 | 0.998 | 40% | 10800000 | 42188 | 0.377 | 25% | 266% | 266% | 38% | 63% |
Graphs:
4k random write:
Note: to reach the maximum 4k random write performance of 1.5M IOPS with the GRAID SR-1010, PCIe Gen4 x16 is required; our server only had PCIe Gen4 x8 slots.
Graphs:
4k random read/write 70/30:
Table result:
| Pattern | Numjobs | IOdepth | MD IOPS | MD MiB/s | MD lat. (ms) | MD CPU | GR IOPS | GR MiB/s | GR lat. (ms) | GR CPU | GR/MD IOPS | GR/MD MiB/s | GR/MD lat. | GR/MD CPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4k random read/write 70/30 | 8 | 16 | 765000 | 2988 | 0.202 | 5% | 429000 | 1676 | 0.344 | 1% | 56% | 56% | 170% | 31% |
| 4k random read/write 70/30 | 16 | 16 | 1078000 | 4211 | 0.285 | 14% | 776000 | 3031 | 0.382 | 2% | 72% | 72% | 134% | 14% |
| 4k random read/write 70/30 | 16 | 32 | 1100000 | 4297 | 0.518 | 17% | 1253000 | 4895 | 0.470 | 3% | 114% | 114% | 91% | 18% |
| 4k random read/write 70/30 | 32 | 32 | 1147000 | 4480 | 0.960 | 30% | 1944000 | 7594 | 0.608 | 5% | 169% | 169% | 63% | 15% |
| 4k random read/write 70/30 | 32 | 64 | 1154000 | 4508 | 1.847 | 30% | 2686000 | 10492 | 0.882 | 6% | 233% | 233% | 48% | 20% |
| 4k random read/write 70/30 | 64 | 64 | 1193000 | 4660 | 5.298 | 49% | 3140000 | 12266 | 1.529 | 8% | 263% | 263% | 29% | 15% |
Graphs:
64k random read:
Table result:
| Pattern | Numjobs | IOdepth | MD IOPS | MD MiB/s | MD lat. (ms) | MD CPU | GR IOPS | GR MiB/s | GR lat. (ms) | GR CPU | GR/MD IOPS | GR/MD MiB/s | GR/MD lat. | GR/MD CPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64k random read | 8 | 8 | 186000 | 11625 | 0.343 | 5% | 175000 | 10938 | 0.364 | 1% | 94% | 94% | 106% | 16% |
| 64k random read | 8 | 16 | 188000 | 11750 | 0.679 | 5% | 292000 | 18250 | 0.438 | 2% | 155% | 155% | 65% | 30% |
| 64k random read | 16 | 16 | 196000 | 12250 | 1.309 | 10% | 461000 | 28813 | 0.554 | 2% | 235% | 235% | 42% | 20% |
| 64k random read | 16 | 32 | 195000 | 12188 | 2.624 | 10% | 646000 | 40375 | 0.792 | 3% | 331% | 331% | 30% | 30% |
| 64k random read | 32 | 32 | 195000 | 12188 | 5.242 | 20% | 740000 | 46250 | 1.382 | 3% | 379% | 379% | 26% | 15% |
Graphs:
64k random write:
Table result:
| Pattern | Numjobs | IOdepth | MD IOPS | MD MiB/s | MD lat. (ms) | MD CPU | GR IOPS | GR MiB/s | GR lat. (ms) | GR CPU | GR/MD IOPS | GR/MD MiB/s | GR/MD lat. | GR/MD CPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64k random write | 8 | 8 | 92200 | 5763 | 0.693 | 7% | 67400 | 4213 | 0.948 | 1% | 73% | 73% | 137% | 10% |
| 64k random write | 8 | 16 | 118000 | 7375 | 1.081 | 14% | 104000 | 6500 | 1.229 | 1% | 88% | 88% | 114% | 10% |
| 64k random write | 16 | 16 | 117000 | 7313 | 2.179 | 16% | 135000 | 8438 | 1.895 | 2% | 115% | 115% | 87% | 11% |
| 64k random write | 16 | 32 | 117000 | 7313 | 4.369 | 16% | 146000 | 9125 | 3.496 | 2% | 125% | 125% | 80% | 13% |
Graphs:
1M read:
Table result:
| Pattern | Numjobs | IOdepth | MD IOPS | MD MiB/s | MD lat. (ms) | MD CPU | GR IOPS | GR MiB/s | GR lat. (ms) | GR CPU | GR/MD IOPS | GR/MD MiB/s | GR/MD lat. | GR/MD CPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1M read | 4 | 4 | 10000 | 10000 | 1.592 | 3% | 18200 | 18200 | 0.880 | 0% | 182% | 182% | 55% | 12% |
| 1M read | 8 | 4 | 11000 | 11000 | 2.673 | 5% | 28600 | 28600 | 1.120 | 1% | 260% | 260% | 42% | 10% |
| 1M read | 8 | 8 | 11900 | 11900 | 5.393 | 5% | 39400 | 39400 | 1.623 | 1% | 331% | 331% | 30% | 10% |
| 1M read | 8 | 16 | 12100 | 12100 | 10.563 | 5% | 44700 | 44700 | 2.865 | 1% | 369% | 369% | 27% | 12% |
| 1M read | 16 | 16 | 12100 | 12100 | 21.156 | 10% | 47000 | 47000 | 5.442 | 1% | 388% | 388% | 26% | 6% |
Graphs:
1M write:
Table result:
| Pattern | Numjobs | IOdepth | MD IOPS | MD MiB/s | MD lat. (ms) | MD CPU | GR IOPS | GR MiB/s | GR lat. (ms) | GR CPU | GR/MD IOPS | GR/MD MiB/s | GR/MD lat. | GR/MD CPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1M write | 4 | 4 | 6938 | 6938 | 2.300 | 9% | 5363 | 5363 | 2.981 | 1% | 77% | 77% | 130% | 9% |
| 1M write | 8 | 4 | 6730 | 6730 | 4.753 | 11% | 8251 | 8251 | 3.876 | 1% | 123% | 123% | 82% | 12% |
| 1M write | 8 | 8 | 6782 | 6782 | 9.434 | 12% | 10100 | 10100 | 6.312 | 2% | 149% | 149% | 67% | 17% |
| 1M write | 8 | 16 | 6780 | 6780 | 18.870 | 12% | 11100 | 11100 | 11.530 | 2% | 164% | 164% | 61% | 17% |
| 1M write | 16 | 16 | 7071 | 7071 | 36.182 | 17% | 11400 | 11400 | 22.490 | 3% | 161% | 161% | 62% | 15% |
Graphs:
Results:
MDRAID shows decent performance at low Numjobs and IOdepth values, but as the workload increases, latency grows while performance stops scaling. GRAID, on the other hand, delivers better results at high Numjobs and IOdepth values: on the 4k random read pattern, we received an incredible 10.8M IOPS at a latency of just 0.377 ms, which is essentially the speed of 7 of the 8 NVMe drives. On large-block reads (64k/1M), GRAID reaches a throughput of 40/47 GiB/s, while MDRAID hits a ceiling at about 12 GiB/s.
3. Running the benchmark remotely from the client nodes:
With such impressive local storage results, we were fully ready to give FCP a try and see whether it could deliver comparable performance on the client nodes. In the results below, the Numjobs parameter is the total across all 32 LUNs.
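As with the local runs, here is a hedged sketch of driving all 8 LUNs on one client node with a single fio invocation; the multipath device names are assumptions:

```bash
# Sketch: run one pattern against a client node's 8 FC LUNs at once.
# With several colon-separated files, each fio job spreads I/O across them round-robin.
fio --name=4k-randread-remote \
    --filename=/dev/mapper/lun1:/dev/mapper/lun2:/dev/mapper/lun3:/dev/mapper/lun4:/dev/mapper/lun5:/dev/mapper/lun6:/dev/mapper/lun7:/dev/mapper/lun8 \
    --rw=randread --bs=4k --ioengine=libaio --direct=1 \
    --numjobs=16 --iodepth=16 \
    --runtime=600 --time_based --group_reporting
```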
4k random read:
Table result:
| Pattern | Numjobs | IOdepth | MD IOPS | MD MiB/s | MD lat. (ms) | GR IOPS | GR MiB/s | GR lat. (ms) | GR/MD IOPS | GR/MD MiB/s | GR/MD lat. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4k random read | 16 | 16 | 1664285 | 6501 | 0.132 | 1067226 | 4169 | 0.230 | 64% | 64% | 174% |
| 4k random read | 32 | 16 | 3184359 | 12439 | 0.141 | 2104438 | 8221 | 0.233 | 66% | 66% | 165% |
| 4k random read | 64 | 16 | 3531393 | 13795 | 0.274 | 3687970 | 14406 | 0.264 | 104% | 104% | 96% |
| 4k random read | 128 | 16 | 3544646 | 13847 | 0.563 | 4563635 | 17827 | 0.430 | 129% | 129% | 76% |
| 4k random read | 16 | 32 | 1783060 | 6965 | 0.199 | 1772981 | 6926 | 0.261 | 99% | 99% | 131% |
| 4k random read | 32 | 32 | 3500411 | 13674 | 0.253 | 3475477 | 13576 | 0.268 | 99% | 99% | 106% |
| 4k random read | 64 | 32 | 3532084 | 13797 | 0.563 | 4459783 | 17421 | 0.436 | 126% | 126% | 77% |
| 4k random read | 128 | 32 | 3549901 | 13867 | 1.139 | 4578663 | 17886 | 0.873 | 129% | 129% | 77% |
Graphs:
4k random write:
Table result:
| Pattern | Numjobs | IOdepth | MD IOPS | MD MiB/s | MD lat. (ms) | GR IOPS | GR MiB/s | GR lat. (ms) | GR/MD IOPS | GR/MD MiB/s | GR/MD lat. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4k random write | 16 | 16 | 204612 | 799 | 1.241 | 304228 | 1188 | 0.833 | 149% | 149% | 67% |
| 4k random write | 32 | 16 | 238109 | 930 | 2.143 | 513988 | 2008 | 0.988 | 216% | 216% | 46% |
| 4k random write | 64 | 16 | 271069 | 1059 | 3.769 | 514719 | 2011 | 1.980 | 190% | 190% | 53% |
| 4k random write | 128 | 16 | 331108 | 1294 | 6.176 | 511970 | 2000 | 3.991 | 155% | 155% | 65% |
| 4k random write | 16 | 32 | 247398 | 966 | 2.059 | 307504 | 1201 | 1.657 | 124% | 124% | 80% |
| 4k random write | 32 | 32 | 285527 | 1115 | 3.578 | 512118 | 2001 | 1.992 | 179% | 179% | 56% |
| 4k random write | 64 | 32 | 341017 | 1332 | 5.996 | 491534 | 1920 | 4.157 | 144% | 144% | 69% |
| 4k random write | 128 | 32 | 385361 | 1506 | 10.617 | 498065 | 1946 | 8.212 | 129% | 129% | 77% |
Graphs:
4k random read/write 70/30:
Table result:
| Pattern | Numjobs | IOdepth | MD IOPS | MD MiB/s | MD lat. (ms) | GR IOPS | GR MiB/s | GR lat. (ms) | GR/MD IOPS | GR/MD MiB/s | GR/MD lat. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4k random read/write 70/30 | 16 | 16 | 538622 | 2104 | 0.683 | 646787 | 2527 | 0.470 | 120% | 120% | 69% |
| 4k random read/write 70/30 | 32 | 16 | 670407 | 2619 | 1.136 | 1109071 | 4332 | 0.554 | 165% | 165% | 49% |
| 4k random read/write 70/30 | 64 | 16 | 805986 | 3149 | 1.955 | 1072219 | 4188 | 1.370 | 133% | 133% | 70% |
| 4k random read/write 70/30 | 128 | 16 | 927080 | 3622 | 3.493 | 1089414 | 4256 | 2.912 | 118% | 118% | 83% |
| 4k random read/write 70/30 | 16 | 32 | 700225 | 2735 | 1.065 | 644987 | 2520 | 1.133 | 92% | 92% | 106% |
| 4k random read/write 70/30 | 32 | 32 | 817516 | 3194 | 1.928 | 1103024 | 4309 | 1.329 | 135% | 135% | 69% |
| 4k random read/write 70/30 | 64 | 32 | 933090 | 3645 | 3.471 | 1098277 | 4290 | 2.888 | 118% | 118% | 83% |
| 4k random read/write 70/30 | 128 | 32 | 997943 | 3899 | 6.616 | 1061938 | 4149 | 6.202 | 106% | 106% | 94% |
Graphs:
64k random read:
Table result:
| Pattern | Numjobs | IOdepth | MD IOPS | MD MiB/s | MD lat. (ms) | GR IOPS | GR MiB/s | GR lat. (ms) | GR/MD IOPS | GR/MD MiB/s | GR/MD lat. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 64k random read | 8 | 8 | 192015 | 12001 | 0.326 | 149755 | 9360 | 0.420 | 78% | 78% | 129% |
| 64k random read | 8 | 16 | 193967 | 12123 | 0.652 | 260821 | 16302 | 0.483 | 134% | 134% | 74% |
| 64k random read | 8 | 32 | 194089 | 12131 | 1.311 | 397736 | 24859* | 0.634 | 205% | 205% | 48% |
* – throughput limit of our FC adapters (3200 MB/s × 8 ports = 25600 MB/s).
Graphs:
64k random write:
Table result:
| Pattern | Numjobs | IOdepth | MD IOPS | MD MiB/s | MD lat. (ms) | GR IOPS | GR MiB/s | GR lat. (ms) | GR/MD IOPS | GR/MD MiB/s | GR/MD lat. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 64k random write | 8 | 8 | 37343 | 2334 | 1.705 | 61839 | 3865 | 1.027 | 166% | 166% | 60% |
| 64k random write | 8 | 16 | 51048 | 3191 | 2.497 | 100093 | 6256 | 1.269 | 196% | 196% | 51% |
| 64k random write | 16 | 16 | 65517 | 4095 | 3.895 | 132669 | 8292 | 1.915 | 202% | 202% | 49% |
| 64k random write | 16 | 32 | 85255 | 5330 | 5.992 | 138609 | 8664 | 3.677 | 163% | 163% | 61% |
Graphs:
1M read:
Table result:
| Pattern | Numjobs | IOdepth | MD IOPS | MD MiB/s | MD lat. (ms) | GR IOPS | GR MiB/s | GR lat. (ms) | GR/MD IOPS | GR/MD MiB/s | GR/MD lat. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1M read | 4 | 2 | 9690 | 9690 | 0.803 | 8542 | 8542 | 0.915 | 88% | 88% | 114% |
| 1M read | 4 | 4 | 10495 | 10495 | 1.503 | 14799 | 14799 | 1.059 | 141% | 141% | 70% |
| 1M read | 4 | 8 | 11018 | 11018 | 2.874 | 19841 | 19841 | 1.584 | 180% | 180% | 55% |
| 1M read | 4 | 16 | 11713 | 11713 | 5.442 | 25150 | 25150* | 2.520 | 215% | 215% | 46% |
* – throughput limit of our FC adapters (3200 MB/s × 8 ports = 25600 MB/s).
Graphs:
1M write:
Table result:
| Pattern | Numjobs | IOdepth | MD IOPS | MD MiB/s | MD lat. (ms) | GR IOPS | GR MiB/s | GR lat. (ms) | GR/MD IOPS | GR/MD MiB/s | GR/MD lat. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1M write | 4 | 2 | 6028 | 6028 | 1.284 | 2991 | 2991 | 2.633 | 50% | 50% | 205% |
| 1M write | 4 | 4 | 7222 | 7222 | 2.167 | 4497 | 4497 | 3.509 | 62% | 62% | 162% |
| 1M write | 4 | 8 | 6992 | 6992 | 4.521 | 6748 | 6748 | 4.684 | 96% | 97% | 104% |
| 1M write | 4 | 16 | 6819 | 6819 | 9.310 | 8902 | 8902 | 7.125 | 131% | 131% | 77% |
| 1M write | 8 | 16 | 7144 | 7144 | 17.832 | 10493 | 10493 | 12.117 | 147% | 147% | 68% |
Graphs:
Comparing local and remote performance results:
In the tables below, we provide the best result from each test in terms of the performance/latency ratio; the full benchmark results are provided above. The Comparison columns show the FCP result as a percentage of the local result.
MDRAID:
| Pattern | Local IOPS | Local MiB/s | Local lat. (ms) | FCP IOPS | FCP MiB/s | FCP lat. (ms) | FCP/local IOPS | FCP/local MiB/s | FCP/local lat. |
|---|---|---|---|---|---|---|---|---|---|
| 4k random read | 4049000 | 15816 | 0.250 | 3531393 | 13795 | 0.274 | 87% | 87% | 110% |
| 4k random write | 478000 | 1867 | 0.535 | 341017 | 1332 | 5.996 | 71% | 71% | 1121% |
| 4k random read/write 70/30 | 1078000 | 4211 | 0.285 | 927080 | 3622 | 3.493 | 86% | 86% | 1226% |
| 64k random read | 186000 | 11625 | 0.343 | 192015 | 12001 | 0.326 | 103% | 103% | 95% |
| 64k random write | 118000 | 7375 | 1.081 | 85255 | 5330 | 5.992 | 72% | 72% | 554% |
| 1M read | 11900 | 11900 | 5.393 | 11709 | 11709 | 5.442 | 98% | 98% | 101% |
| 1M write | 6938 | 6938 | 2.300 | 7221 | 7221 | 2.167 | 104% | 104% | 94% |
GRAID:
| Pattern | Local IOPS | Local MiB/s | Local lat. (ms) | FCP IOPS | FCP MiB/s | FCP lat. (ms) | FCP/local IOPS | FCP/local MiB/s | FCP/local lat. |
|---|---|---|---|---|---|---|---|---|---|
| 4k random read | 10800000 | 42188 | 0.377 | 4563635 | 17827 | 0.430 | 42% | 42% | 114% |
| 4k random write | 975000 | 3809 | 2.100 | 514719 | 2011 | 1.980 | 53% | 53% | 94% |
| 4k random read/write 70/30 | 3140000 | 12266 | 1.529 | 1109071 | 4332 | 0.554 | 35% | 35% | 36% |
| 64k random read | 740000 | 46250 | 1.382 | 397736 | 24859 | 0.634 | 54% | 54% | 46% |
| 64k random write | 135000 | 8438 | 1.895 | 132669 | 8292 | 1.915 | 98% | 98% | 101% |
| 1M read | 47000 | 47000 | 5.442 | 25150 | 25150* | 2.520 | 54% | 54% | 46% |
| 1M write | 11100 | 11100 | 11.530 | 10493 | 10493 | 12.117 | 95% | 95% | 105% |
* – throughput limit of our FC adapters (3200 MB/s × 8 ports = 25600 MB/s).
Conclusions
Essentially, the most impressive shared storage performance was delivered by a redundant GRAID storage array full of PBlaze6 6920 Series NVMe SSDs with StarWind SAN & NAS on top, running over Fibre Channel to the client nodes with Marvell QLogic 2772 Fibre Channel adapters. As of now, GRAID is arguably the technology that delivers the highest performance software-defined shared storage can offer. Over FC, the GRAID build achieved around 50% of the local RAID array performance at approximately the same latency as local storage. The only reason the 64k/1M large-block read results differed is the natural limitation of running at or near the maximum bandwidth of a 32G Fibre Channel environment.
Locally, GRAID shows outstanding results under heavy workloads: it achieved the seemingly impossible figure of 10.8M IOPS at a latency of just 0.377 ms on the 4k random read pattern. Also, since GRAID offloads I/O processing to the GPU, CPU usage on the storage node is 2-10 times lower than with MDRAID, leaving free CPU resources for other tasks. With MDRAID, we practically achieved the full performance the RAID array could provide locally, but at the cost of significantly higher latency.
If you want to unleash GRAID's full performance potential over the network, we would advise looking into NVMe-oF and RDMA, which will be added in subsequent StarWind SAN & NAS builds. You can find more about NVMe-oF and StarWind NVMe-oF initiator performance in upcoming articles.
GRAID Technology™ is headquartered in Silicon Valley, California, with an office in Ontario, California, and an R&D center in Taipei, Taiwan. Named one of the Ten Hottest Data Storage Startups of 2021 by CRN, GRAID SupremeRAID is breaking world records as the first NVMe and NVMe-oF RAID card to unlock the full potential of your SSD performance: a single SupremeRAID card delivers 19 million IOPS and 110 GB/s of throughput. For more information on GRAID Technology, visit GRAID Technology or connect on Twitter or LinkedIn.