Kioxia Demonstrates RAID Offload Scheme for NVMe Drives (anandtech.com)
36 points by vanburen 10 months ago | hide | past | favorite | 53 comments



Seems like quite a bit of complexity for a dubious win. Today's CPUs are REALLY fast, even a single core of the 64-plus-core parts common in servers today:

  [  478.047970] raid6: avx2x2   gen() 60473 MB/s
  [  478.115971] raid6: avx512x1 gen() 53469 MB/s
  [  478.149971] raid6: avx512x2 gen() 57067 MB/s
Especially since the data is coming from the CPU anyways, so the caches are likely warm. It also means that you have to send a stripe of data to a single NVMe, which likely has much less than 60GB/sec of checksum speed, and then initiate transfers to every other drive in the stripe. Not to mention the NVMe drive likely doesn't have ECC memory, and any resulting memory errors are unlikely to be visible to the OS.

Just seems like hardware RAID with all the same problems: likely not as fast as software RAID, harder to manage, a unique set of tools per vendor, harder to have global spares, and it doesn't work with filesystems that do their own redundancy like ZFS.


The problem is not the CPU per se, but PCIe bandwidth congestion on the CPU<->PCIe lanes.

These NVMe drives can talk directly to each other for RAID, which means a much larger total bandwidth is available, and potentially improved latency as well.

RDMA means you might not be serving via CPU at all.


The difference isn't that big though. Sure, with software RAID writing 1GB to an 8-disk RAID6 means writing 8/6 x 1GB = 1.33GB. But with RAID offload you nearly double the NVMe bandwidth consumed: the master drive receives the full write and then fans the (n - 1) stripe shares back out to the other drives.
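Back-of-envelope version in Python (this is my reading of the arithmetic above; the 6-data/2-parity split and the master-drive fan-out model are assumptions, not anything from the article):

```python
# Rough bandwidth comparison for a 1 GB write to an 8-disk RAID6
# (6 data + 2 parity). The offload model here is an assumption: the
# host writes the full 1 GB to one "master" drive, which then fans
# the stripe shares out to the other 7 drives over PCIe P2P.
disks, parity = 8, 2
data = disks - parity
user_write = 1.0  # GB

# Software RAID: host computes parity, each share is written once.
sw_device_writes = user_write * disks / data                     # ~1.33 GB

# Offload: host -> master, then master -> 7 peers.
offload_nvme_bw = user_write + user_write * (disks - 1) / data   # ~2.17 GB

print(sw_device_writes, offload_nvme_bw)
```

So under that model the drives see roughly 1.33GB vs 2.17GB of traffic for the same 1GB of user data, which is where the "nearly double" comes from.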

I also wonder: if you have 8 NVMe drives and write a stripe to one, it does the RAID calc and sends each disk its share of the stripe. What happens if the master NVMe dies? It's not really a RAID if a single disk can kill the array.


PCIe P2P transfers go through the CPU so... no? I think what it's saving is main memory bandwidth.


> PCIe P2P transfers go through the CPU

PCIe P2P transfers can go direct through a downstream PCIe switch, such as those found on chipsets, without having to bump back up through the CPU.


And even with e.g. AMD Matisse, aka desktop Ryzen Zen 2 on AM4, it turns around in the PCIe root complex instead of consuming Infinity Fabric bandwidth.


Based on the numbers from the article, it seems the problem is less how fast a CPU core can crunch numbers and more how much extra memory bandwidth it consumes to do so. Testing the AVX throughput of a single core in a storage-only test skips that consideration, because there is no memory bandwidth contention or usage to account for.


Sure, but to write 1GB you stream 1GB from RAM -> CPU in either case. With software RAID you do the calcs (60GB/sec per core) and then write 1.33GB to the storage controllers. It just doesn't seem like much of a difference: the CPU overhead is near zero (actual I/O divided by 64 cores x 60GB/sec), and writing an extra 1/3rd for the redundancy data seems in the noise for normal server loads.

Not to mention I'd expect the parity calculations to be MUCH slower on the NVMe controllers.


Your assumption that, from a memory perspective, the stream goes from "1 GB RAM read -> write to disk" to "1 GB RAM read -> calculation -> write to disk" does not hold. There are intermediate forms of the data that end up being written back to RAM and then to disk. This is what the article is talking about here:

> upwards of 90% reduction in system DRAM utilization


My understanding is that it's something like:

      stripe = read_from_ram(ptr)           # usually between 128k and 256k
      blobs = do_raid_calc(stripe)          # blobs total 25% to 33% larger than stripe
      for i, drive in enumerate(drives):
          write(drive=drive, blob=blobs[i])
The above should be relatively cache friendly; my Zen 4 desktop (1 gen old) has 128MB of L3 cache, enough for 1000-ish stripes.

> upwards of 90% reduction in system DRAM utilization

That seems unbelievable; most RAM isn't spent on anything I/O related, let alone RAID related. Now if it's a 90% reduction in DRAM utilization by RAID, sure. But that seems like a very small fraction of all RAM.

Even if 10,000 stripes are in flight simultaneously to 100s of drives, that's only 2.5GB, or 1% of a server's RAM (256GB or more seems common). Especially since 2/3rds of that would be in RAM even with hardware RAID. It's not like the buffer/page cache, which might reach 50% of RAM, has the extra RAID data in it.
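Quick sanity check on that arithmetic, assuming 256K stripes at the top of the range quoted upthread:

```python
# 10,000 in-flight stripes of 256 KiB against a 256 GiB server
stripes_in_flight = 10_000
stripe_size = 256 * 1024                   # bytes
total = stripes_in_flight * stripe_size

print(total / 2**30)                       # ~2.4 GiB
print(100 * total / (256 * 2**30))         # ~1% of RAM
```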


> 128MB of L3 cache

Sure, if you use an X3D chip with the largest amount of L3 cache accessible to a single core of any option currently available, you can dedicate all 128 MB of it to the write buffer for your disk instead of letting the work be offloaded. Valid option, just as cool. I have a non-X3D 7950X, so jealous though ;).

You've also got the case of needing to read back from disk any sectors not cached by the system, so the CPU can perform the parity calc over the whole stripe and issue the appropriate writes. Particularly bad for non-sequential writes.

> if it's 90% reduction in system DRAM utilization by RAID

Yes, this - not the other. It's achieved by not writing things back to RAM again before they hit the flash pool.


> Sure, if you use X3D chip

Ah, sorry, lscpu shows: L3: 64 MiB (2 instances)

I originally thought that meant 64MB x 2, but it means 64MB total (32MB x 2). Still, 64MB is 500 times larger than a 128KB stripe, I/O normally happens across a wide variety of cores, and cache should only be required for stripes that are in flight. Servers (normally with 5x or more cores than my 12-core desktop) will have much more cache and much more memory bandwidth (24 channels instead of my 2).

> Yes, this - not the other. It's achieved by not writing things back to RAM again before they hit (comparatively slow to RAM) flash pool.

Why should the stripes be written to RAM? The write should enter kernel space (write is a system call), then the software RAID driver does the calculation and then writes to the device's memory space. The PCIe-connected NVMe controller is not cache coherent and can't safely read main memory, which might be cached.

I took a closer look at the original post; they seem to be considering the tiny write, which requires a read/modify/write. That operation is pretty inefficient, and Linux tries to avoid it with caching, but it certainly is needed sometimes. I've not seen any analysis of what fraction of I/O to production RAID systems is R/M/W instead of a normal read or write.

Even in the R/M/W case, a stripe is read by the software RAID driver, the write is masked onto the stripe, and a new checksum is calculated. Then the stripe is sent back to the I/O space of each involved NVMe controller. So a 4KB write (a common minimum size) requires reading 128-256KB, doing the checksum, and writing it back to the devices.
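For RAID5 at least, the small-write path doesn't even need the whole stripe: parity can be patched from just the old data block and the old parity. A minimal sketch (illustrative helper names, not the actual md driver code):

```python
import os

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def small_write_parity(old_parity, old_data, new_data):
    # new_parity = old_parity XOR old_data XOR new_data
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)

# 3-data-disk stripe plus parity
d0, d1, d2 = (os.urandom(4096) for _ in range(3))
parity = xor_blocks(xor_blocks(d0, d1), d2)

# Overwrite d1 with a 4K write: two reads (old data, old parity) and
# two writes (new data, new parity); no full-stripe read needed.
new_d1 = os.urandom(4096)
parity = small_write_parity(parity, d1, new_d1)
assert parity == xor_blocks(xor_blocks(d0, new_d1), d2)
```

RAID6's second (Reed-Solomon) parity doesn't reduce to a plain XOR like this, which is part of why the full-stripe path gets discussed.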

It does tip the scales more towards hardware RAID, but that's always been true for hardware RAID, which very often ends up slower than software RAID for the previously discussed reasons.


Say it were a 6-disk pool and you add an object to a database (with the goal of doing many of these as fast as possible, with fsync to the disks):

- Receive the new data

- Read the multiple disks to get the current stripe(s) associated with it.

- Calculate the new parity

- Issue the multiple writes

- Wait for completion, clear that from RAM

Looking at a single write it doesn't seem so bad. You take something like ~128k in from the disks per stripe (which will arrive at ever so slightly different times and be held as that thread stalls before the calc), issue a bunch of writes, and wait for those to clear while the result remains in memory (cache or RAM); then you're good to clear it out and that thread/coroutine task can process the next one. "Just" 3 GB/s is ~23,000/s of that: doing those multiple reads into RAM, parity writes into RAM (well, unless you can keep it all in massive L3 by keeping queue depths low), and caching until spat out onto the drives. On a normal non-parity setup your data to be written just sits and goes to disk, with no intermediate reads/writes.
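The steps above can be sketched as a toy simulation, with the disks as in-memory buffers (the names and the single-parity layout are illustrative, not any real API):

```python
import os

DISKS, CHUNK = 6, 4096          # 6-disk pool: 5 data + 1 parity, 4K chunks
disks = [bytearray(os.urandom(CHUNK)) for _ in range(DISKS)]

def xor_all(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def write_object(new_chunk, target=0):
    # 1. read the current stripe from the disks into RAM
    stripe = [bytes(d) for d in disks[:-1]]
    # 2. calculate the new parity over the modified stripe
    stripe[target] = new_chunk
    new_parity = xor_all(stripe)
    # 3. issue the writes (data chunk + parity); once the drives ack,
    #    the stripe can be dropped from RAM
    disks[target][:] = new_chunk
    disks[-1][:] = new_parity

# seed a consistent parity, then do one object write
disks[-1][:] = xor_all([bytes(d) for d in disks[:-1]])
write_object(os.urandom(CHUNK))
assert xor_all([bytes(d) for d in disks[:-1]]) == bytes(disks[-1])
```

Step 1 is the extra traffic that a non-parity setup never pays: every small update drags a stripe's worth of reads through RAM before any write can be issued.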

This may not make sense on a home box but consider the approach more an alternative to solutions like https://www.graidtech.com/product/sr-1000/ which are single cards that can get a million RAIDed IOPS written at near 100 GB/s in a single PCIe slot alone with no additional load to the CPU. Just writing 100 GB/s takes a CPU core and most of the RAM bandwidth from a raw data creation/parsing perspective before talking about writing it to disk at all, it's a different problem than e.g. what the bandwidth looks like on a home NAS pool. This type of approach tries to do something similar without the extra device in-between the cards and the server.

Sometimes you also want to take the above approach and scale it out over many 100G/400G ethernet ports so your flash storage pools are reachable over network separate from compute nodes. Here the goal is to make that storage solution as dense, fast, and efficient as possible where you might want to load as much possible storage as you can on a single node until it saturates the bandwidth to the CPU. If you can do that without doubling back data to the CPU you can scale it that much better.


I guess the interesting use case would be to combine this with other hardware accelerators and do DMA between devices, e.g. stream network data directly to a RAID without ever touching the main CPU, after some initial setup work.


> and doesn't work with filesystems that do their own redundancy like ZFS

Really? You somehow can't create a ZFS filesystem on a hardware RAID block device? That would mean the hardware RAID isn't the otherwise transparent block device it's supposed to be for the OS and whatever filesystems it cares to employ.

Your concerns about management, tools and spares are correct for many use cases. Some use cases, like cloud operators, don't suffer the burdens of long-term management at that level of detail (entire racks and generations of hardware are cycled in/out as a working unit, with ample spares at hand, under contract) and won't care about that. They'll care about the nice efficiency gain. When you operate like that you can accommodate sophisticated integration such as this for efficiency gains.


> Really? You somehow can't create a ZFS file system on a hardware RAID block device?

Sure, you can do it and have two layers of checksums and a volume manager on top of a volume manager. But ZFS is designed to talk directly to block devices and to detect and complain about the numerous failure modes, like a parity calc that goes awry because of a memory error.

For this and other reasons, even with hardware RAID it's recommended to configure it in JBOD mode.

I've also seen numerous cases where software RAID on top of hardware RAID running in JBOD mode is faster than just using hardware RAID.

> When you operate like that you can accommodate sophisticated integration such as this for efficiency gains.

Sure, if there are efficiency gains. If the write bottleneck to the controller is your limiting factor, you might get a 33% increase in I/O. But for that to be true you need:

  * The bottleneck not to be elsewhere
  * The controller inside a NVMe device (often passively cooled) to be faster than the one on the CPU
  * The bandwidth between the PCI controller or PCIe switch and the NVMe controller to not care about a 2x increase in needed bandwidth
Seems unlikely to me.


I imagine these kinds of schemes could be implemented as a sort of on-device eBPF filter (in layman's terms, CUDA but for storage). It would allow deeper integration with the system, for example hardware-accelerated/integrated LVM (obviously the speedup would depend on the use case: less win for thin volumes, more for RAID and so on). Or, from the other side, deeper integration with filesystems such as ZFS, btrfs, bcachefs.


We tried to standardize exactly this: eBPF programs offloaded onto the device. The NVMe standard now has a lot of infrastructure for this, including commands to discover device memory topology, transfer to/from that memory, and discover and upload programs. But one of the blockers is that eBPF itself isn't standardized. The other blockers are vendors ready and willing to build these devices and customers ready to buy them in volume. The extra compute ability will introduce some extra cost.

I'm still hopeful that we see it happen some day.


> The NVMe standard now has a lot of infrastructure for this standardized, including commands to discover device memory topology, transfer to/from that memory, and discover and upload programs.

On the other hand, Windows and Linux still cannot just upgrade the firmware on the vast majority of NVMe devices, least of all consumer ones, despite the process being completely and utterly standardized.

You have to wonder: if Samsung makes bullshit, and then this https://github.com/chrivers/samsung-firmware-magic becomes part of the ecosystem, why trust the vendors with anything else?


That repo was last updated 3 years ago, things seem to have changed since.

There seem to be a lot of ISOs for download at https://semiconductor.samsung.com/consumer-storage/support/t...

The ISO contains an EFI filesystem with grub etc and boots fine without needing Windows.


I think I remember upgrading the NVMe disk firmware in a work Dell laptop (Dell Latitude 7390) from 2019 using fwupd some years ago (not more than 3 years ago).

Also I think I remember fixing (upgrading?) the firmware on a Crucial SSD 5 or 6 years ago using some live Linux system (downloaded off the Crucial website, I think?).

Not sure about Windows, but Linux is getting incredibly better at this.


So, could you upload malware to the drive that way?


The eBPF programs are strictly bounded, and they're scoped to their own memory that you have to pre-load from the actual storage with separate commands issued from the CPU (presumably from the kernel driver, which is doing access control checks). It's no different than uploading a shader to a GPU. You can burn resources, but that's about the extent of the damage you can cause.


It was only a week ago that Google disclosed an exploit to get a root shell via eBPF.

https://bughunters.google.com/blog/6303226026131456/a-deep-d...

I wouldn't want random applications (or web pages) to be able to load eBPF modules in the same way they can send shaders to a GPU through a graphics driver.


What I don't get is that RAID5 is a simple XOR. It should be a trivial operation that would be equally trivial to hardware accelerate.

What I am most puzzled by is how parity (i.e. RAID5) is so bad in Windows Storage Spaces. A modern CPU should be able to XOR data at several gigabytes per second, and yet even after optimizing the block sizes, Storage Spaces parity caps at a couple hundred MB/s.
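The "several gigabytes per second" claim is easy to sanity-check. Even pure CPython gets into the GB/s range if you XOR large blocks as arbitrary-precision integers (a crude stand-in for what SIMD kernels do far faster):

```python
import os, time

size = 64 * 2**20                               # 64 MiB per block
a = int.from_bytes(os.urandom(size), "little")
b = int.from_bytes(os.urandom(size), "little")

t0 = time.perf_counter()
parity = a ^ b                                  # the whole of RAID5 parity math
dt = time.perf_counter() - t0

print(f"{size / dt / 1e9:.1f} GB/s")            # well above a couple hundred MB/s
assert parity ^ b == a                          # and it round-trips
```

So whatever Storage Spaces is bottlenecked on, it isn't the XOR itself.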


Is this similar to Graid's products that have been out for a while? They basically use a GPU as a RAID controller.


A GPU RAID card does give you some flexibility benefits. But it's also bottlenecked by the single slot.


When I first bought Kioxia flash I thought it was a random Chinese knockoff. Shame they ditched the Toshiba brand on these.


NVMe drives fail at a fairly low rate, so this is an optimization for a very small edge case. And since they are also very fast, it's not like you'll be doing a rebuild for 6+ hours like with HDDs.

It also doesn't change anything for distributed storage.


Recently had an 8TB NVMe drive die after a power outage. Such a mess.

Can't afford data recovery. It was a backup drive though, so I just need to redo the backups.

Thinking about buying more, smaller drives, maybe on a couple of mini PCs connected to the network.


Do note that more drives means a higher chance of seeing a failure...


A model without power loss protection I presume?


Sabrent Rocket Q


So a consumer drive, on a 22x80mm card that barely has enough physical space for 8TB of NAND (+controller and DRAM) and doesn't come anywhere close to having enough space for the capacitors needed to provide enterprise-level full power loss protection.

The drive still shouldn't fail entirely from a power outage, and should at most suffer data loss, but at the end of the day it's designed to be cheap rather than reliable.


Lesson learned.

I needed it for an experiment where I had about 6TB of small files to process and wanted to have them on a single drive. It did the job, and then I repurposed it as a backup/dump drive for stuff I didn't want to delete but also didn't know where else to put.

The drive shows up in the system but with 0TB capacity. I recall once or twice it reported 8TB, but I was unable to read anything.

I'll have a look one day; maybe it was something simple like a dead cap that I could replace (I have a microscope and rework station).


Are you using the drive in an external USB enclosure? Those sometimes have power delivery that cannot keep up with the demands of high capacity or high performance drives.


No, it is mounted on the motherboard (Gigabyte with X570 chipset) also has a thick heatsink.


Samsung and others (including WD IIRC) do internal journaling, so even though they don't have capacitors the drive shouldn't get bricked by a power outage.


SSDs read fast but write much slower for anything bigger than a few hundred megabytes.


Enterprise SSDs usually don't use SLC caching (especially not to the extent that consumer drives do), so their sequential write speed doesn't drop much for really large/sustained writes, and they don't have a short unsustainable burst of accepting writes quickly into a cache.


In high-end enterprise storage the drives do a form of caching (SLC to TLC, in the background, by the drive), and they also do compression and encryption. Look at the FlashCore FCM4 used in IBM FlashSystem: https://www.redbooks.ibm.com/redpapers/pdfs/redp5725.pdf (no affiliation, except that work recently acquired an IBM SAN and I am satisfied with this storage unit; it's not like a Pure Storage SAN but it's fast enough).


IBM's drives are exactly why I said "usually don't" rather than "never". SLC caching is still not normal for enterprise drives, whereas it is now universal for consumer SSDs.


I'm working mostly with enterprise drives and not consumer. These drives can write continuously at 1 to 4 GB/s depending on the specific type (mixed use vs read intensive vs very low writes).


> NVMe drives fail at a fairly low rate

But they still fail. Backups are great and all, but for hardware-failure nothing beats redundancy (while RAID1, RAID5, etc allow for faster reads - I don't know how-often NVMe SSDs saturate their PCIe links though...).

Granted, you don't need hardware RAID for that (and HostRAID is a joke, lol): we still want redundancy, but today you'd do it with ZFS or similar so you aren't locked in to some HW RAID vendor, or suffer the ironic consequence of a non-redundant HW RAID controller.


I have an NVMe device that very rarely literally (but figuratively) falls off the PCIe port and disappears [0]. It is one of several Physical Volumes (PVs) in a Logical Volume Management (LVM) Volume Group (VG) that backs several RAID-1 mirror Logical Volumes (LVs).

When it drops off file-systems writes to the LVs are blocked and reads can also fail but the system survives sufficiently to do a controlled power off/on that recovers it.

In some cases the LVs pair up a spinning disk with the NVME but due to how I've configured the LV the spinner is read-mostly and the NVME is write-mostly (RAID member syncing is delayed and in background). There isn't too much noticeable latency except for things like `git log -Sneedle` - and worth it for the resilience.

[0] The first time it happened it was spiders that had taken up residence around the M2 header and CPU (nice and warm!), causing dust trails that allowed current leakage between contacts (yes, I did do a microscopic examination because I could not identify any other cause); a simple blast with the air compressor resolved it. Later incidents turned out to be physical stress due to extreme thermal expansion and contraction, as best as I can tell: ambient air temperature can fluctuate from 14C to 40C and back over 18 hours. Re-seating the M2 adapter fixes it for a few months before it starts again! All NVMe SMART self-tests pass; the failure is of the link, not the storage, the device effectively being removed from the PCIe port. Firmware was at one stage suspected, although it had been fine for a couple of years on the same version, and updates haven't changed anything. ASPM is disabled.


You would now typically use a distributed system for redundancy. Then you don't necessarily need RAID.


How do I fit a distributed system into my laptop?


They do fail and you should have redundancy and backup but there isn't a real point to do the optimization that the article describes.


I’d much prefer NVMe colocated compute. Imagine a columnar storage engine able to filter and aggregate data during scans without reading it through PCIe, for example.


ScaleFlux https://scaleflux.com computational storage might offer some of what you are imagining. Their NVMe drives have onboard ARM cores and perform hardware compression and advanced flash management with no drivers beyond standard NVMe. I believe you can tap into the computational capabilities with additional code.


HW RAID is dead, they need to get over it.

We've had good experience with Xinnor, but it's a shame it's proprietary.

I'd love to see a high-performance open-source erasure coding solution for NVMe. The built-in offerings in Linux are not cutting it.


I wonder how (if?) this will interact and integrate with the current software stack and the various volume-managing filesystems (zfs but also btrfs).


It probably won't. These clever tricks usually don't come to market.



