The main problem with consumer drives is the missing power loss protection (plp)...

namibj · on Feb 21, 2024

Sadly the solution, a firmware variant with ZNS instead of the normal random write block device, just isn't on offer (please tell if I'm wrong; I'd love one!). Because with ZNS you can get away with tiny caps, large enough to complete the in-flight blocks (not the buffered ones, just those that are already at the flash chip itself), plus one metadata journal/ring buffer page to store write pointers and zone status for all zones touched since the last metadata write happened. Given that this should take about 100 μs, I don't see unannounced power loss really that problematic to deal with.

In theory the ATX PSU reports imminent power loss with a mandatory notice of no less than 1ms; this would easily be enough to finish in-flight writes and record the zone state.

jauntywundrkind · on Feb 23, 2024

It's fucking wild to me the trajectory here:

OpenChannelIO: stop making such absurd totalizing systems that hide the actual flash! We don't need your fancy controllers! Just give us a bunch of flash that we can individually talk to & control ourselves from the host! You leave so much performance on the table! There's so many access patterns that wouldn't suck & would involve random rewrites of blocks!

Then after years of squabbling, eventually: ok we have negotiated for years. We will build new standards to allow these better access patterns that require nearly no drive controller intermediation, that have exponentially higher degrees of mechanistic sympathy of what flash actually does. Then we will never release drives mainstream & only have some hard to get unbelievably expensive enterprise drives that support it.

You are so f-ing spot on. This ZNS would be perfect for lower reliability consumer drives. But: market segmentation.

The situation is so fucked. We are so impeded from excelling, as a civilization, case number nine billion three hundred forty two.

tumult · on Feb 21, 2024

My experience lately is that consumer drives will also lie and use a cache, but then drop your data on the floor if the power is lost or there’s a kernel panic / BSOD. (Samsung and others.)

bcaxis · on Feb 21, 2024

Rumors of that. I've never actually seen it myself.

tumult · on Feb 21, 2024

I can get it to happen easily. 970 Evo Plus. Write a text file and kill the power within 20 seconds or so, assuming not much other write activity. File will be zeroes or garbage, or not present on the filesystem, after reboot.

c0l0 · on Feb 21, 2024

This happens for you after you invoked an explicit sync() (et al.) before the power cut?

tumult · on Feb 21, 2024

c0l0 · on Feb 21, 2024

That is highly interesting and contrary to a number of reports I've read about the Samsung 970 EVO Plus Series (and experienced for myself) specifically! Can you share more details about your particular setup and methodology? (Specific model name/capacity, Firmware release, Kernel version, filesystem, mkfs and mount options, any relevant block layer funny business you are conciously setting would be of greatest interest.) Do you have more than one drive where this can happen?

tumult · on Feb 21, 2024

Yeah, it happens on two of the 970 EVO Plus models. One on the older revision, and one on the newer. (I think there are only two?) It happens on both Linux and Windows. Uhh, I'm not sure about the kernel versions. I don't remember what I had booted at the time. On Windows I've seen it happen as far back as 1607 and as recently as 21H2. I've also seen it happen on someone else's computer (laptop.)

It's really easy to reproduce (at least for me?) and I'm pretty sure anyone can do it if they try to on purpose.

hypercube33 · on Feb 21, 2024

Only thing I ever have seen is some cheap Samsung drives slow to a crawl when their buffer fills or those super old Intel ssds that power loss to 8mb due to some firmware bug.

dijit · on Feb 21, 2024

Eh, I've definitely seen it.

I buy Samsung drives relatively exclusively if that makes any difference.

All that to say though: this is why things like journalling and write-ahead systems exist. OS design is mostly about working around physical (often physics related) limitations of hardware and one of those is what to do if you get caught in a situation where something is incomplete.

The prevailing methodology is to paper over it with some atomic actions. For example: Copy-on-Write or POSIX move semantics (rename(2)).

Then some spiffy young dev comes along and turns off all of those guarantees and says they made something ultra fast (*cough*mongodb*cough*) then maybe claims those guarantees are somewhere up the stack instead. This is almost always a lie.

Also: Beware any database that only syncs to VFS.