The main problem with consumer drives is the missing power loss protection (PLP). M.2 drives just don't have space for the caps that an enterprise 2.5" U.2/U.3 drive has.
This matters when the DB issues a sync and expects the data to be safely on disk before the call returns.
A consumer drive basically stops everything until it can report success, and if that happens a lot your IOPS fall to something like 1/100th of what the drive is capable of.
An enterprise drive with PLP will just report success immediately, knowing it has the stored energy to finish the pending writes. Full speed ahead.
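For the curious, the sync the DB issues boils down to something like the following (a minimal sketch; the log filename and record contents are made up for illustration):

    /* Minimal sketch of the durability barrier a DB issues on commit. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char *rec = "commit record\n";
        if (write(fd, rec, strlen(rec)) < 0) { perror("write"); return 1; }

        /* The expensive part on a drive without PLP: the drive must flush its
         * volatile cache all the way to flash before this call may return. */
        if (fsync(fd) < 0) { perror("fsync"); return 1; }

        close(fd);
        return 0;
    }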
You can "lie" to the process at the VPS level by enabling an unsafe write-back cache, or at the OS level by launching the DB under "eatmydata". You will get the full performance of your SSD.
In the event of power loss you may well end up in an unrecoverable, corrupted state with these enabled.
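For reference, eatmydata is just an LD_PRELOAD library that swaps the sync family of calls for no-ops. A minimal sketch of the same idea (not the real libeatmydata source; the DB binary name below is a placeholder):

    /* fake_sync.c - LD_PRELOAD shim that turns durability barriers into no-ops.
     * The real libeatmydata also intercepts msync() and open() with O_SYNC. */
    int fsync(int fd)     { (void)fd; return 0; }
    int fdatasync(int fd) { (void)fd; return 0; }
    void sync(void)       { }

Build it as a shared object and run your DB under it, e.g. with a hypothetical some_db_server binary:

    cc -shared -fPIC -o fake_sync.so fake_sync.c
    LD_PRELOAD=./fake_sync.so some_db_server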
I believe that if you're otherwise buying all consumer parts, an enterprise drive is the single most profitable place to spend extra money on an enterprise component.
Sadly the solution, a firmware variant that exposes ZNS instead of the normal random-write block device, just isn't on offer (please tell me if I'm wrong; I'd love one!).
Because with ZNS you can get away with tiny caps, large enough to complete the in-flight blocks (not the buffered ones, just those that are already at the flash chip itself), plus one metadata journal/ring buffer page to store write pointers and zone status for all zones touched since the last metadata write happened.
Given that this should take about 100 μs, I don't see unannounced power loss as all that problematic to deal with.
In theory the ATX PSU reports imminent power loss with a mandatory notice of no less than 1 ms; that would easily be enough to finish the in-flight writes and record the zone state.
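To make "one metadata page" concrete, here's a hypothetical sketch of such a zone-state record; the field names and sizes are my invention, not anything from the NVMe ZNS spec:

    #include <stdint.h>

    /* Hypothetical entry in the zone-state journal/ring buffer described
     * above. Layout is illustrative only: 16 bytes per touched zone. */
    struct zone_state_entry {
        uint64_t write_pointer;   /* next writable LBA within the zone    */
        uint32_t zone_id;         /* which zone this entry describes      */
        uint8_t  zone_condition;  /* e.g. empty / open / closed / full    */
        uint8_t  pad[3];
    };

    /* One flash page worth of entries, flushed as a single journal record
     * when power drops: roughly 1000 touched zones fit in a 16 KiB page. */
    struct zone_journal_page {
        uint64_t sequence;                      /* monotonically increasing */
        struct zone_state_entry entries[1000];  /* 1000 * 16 B = 16,000 B   */
    };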
OpenChannelIO: stop making such absurd totalizing systems that hide the actual flash! We don't need your fancy controllers! Just give us a bunch of flash that we can individually talk to & control ourselves from the host! You leave so much performance on the table! There are so many access patterns that wouldn't suck & would involve random rewrites of blocks!
Then after years of squabbling, eventually: ok, we have negotiated for years. We will build new standards to allow these better access patterns that require nearly no drive-controller intermediation and have exponentially higher degrees of mechanistic sympathy with what flash actually does. Then we will never release mainstream drives & only offer some hard-to-get, unbelievably expensive enterprise drives that support it.
You are so f-ing spot on. This ZNS would be perfect for lower reliability consumer drives. But: market segmentation.
The situation is so fucked. We are so impeded from excelling, as a civilization, case number nine billion three hundred forty two.
My experience lately is that consumer drives will also lie and use a cache, but then drop your data on the floor if the power is lost or there’s a kernel panic / BSOD. (Samsung and others.)
I can get it to happen easily. 970 Evo Plus. Write a text file and kill the power within 20 seconds or so, assuming not much other write activity. File will be zeroes or garbage, or not present on the filesystem, after reboot.
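If anyone wants to try this themselves, the gist of the repro is roughly this (a rough sketch; the filename and the ~20 s window are arbitrary):

    /* Write a small file without forcing any flush, then physically cut power
     * within ~20 seconds (not a clean shutdown). After reboot, check whether
     * the file holds the text, zeroes/garbage, or is missing entirely. */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("powerloss-canary.txt", "w");
        if (!f) { perror("fopen"); return 1; }
        fputs("canary written just before pulling the plug\n", f);
        fclose(f);  /* flushes libc buffers, not the OS page cache or drive cache */
        return 0;
    }

For a stricter comparison, a second run could add fsync(fileno(f)) before fclose() to take the OS page cache out of the picture and test the drive's cache behaviour alone.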
That is highly interesting and contrary to a number of reports I've read about the Samsung 970 EVO Plus series (and experienced for myself) specifically! Can you share more details about your particular setup and methodology? (Specific model name/capacity, firmware release, kernel version, filesystem, mkfs and mount options, and any relevant block-layer funny business you are consciously setting would be of greatest interest.) Do you have more than one drive where this can happen?
Yeah, it happens on two of the 970 EVO Plus models. One on the older revision, and one on the newer. (I think there are only two?) It happens on both Linux and Windows. Uhh, I'm not sure about the kernel versions. I don't remember what I had booted at the time. On Windows I've seen it happen as far back as 1607 and as recently as 21H2. I've also seen it happen on someone else's computer (laptop.)
It's really easy to reproduce (at least for me?) and I'm pretty sure anyone can do it if they try to on purpose.
The only thing I've ever seen is some cheap Samsung drives slowing to a crawl when their buffer fills, or those super old Intel SSDs that would shrink to 8 MB after a power loss due to some firmware bug.
I buy Samsung drives almost exclusively, if that makes any difference.
All that to say, though: this is why things like journalling and write-ahead logging exist. OS design is mostly about working around physical (often physics-related) limitations of hardware, and one of those problems is what to do when you get caught with something half-completed.
The prevailing methodology is to paper over it with some atomic action. For example: copy-on-write, or POSIX move semantics (rename(2)).
Then some spiffy young dev comes along, turns off all of those guarantees, and says they made something ultra fast (*cough*mongodb*cough*), then maybe claims those guarantees live somewhere up the stack instead. This is almost always a lie.
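For concreteness, the rename(2) trick mentioned above looks roughly like this (a minimal sketch; the paths and contents are made up):

    /* Atomic file replacement: write a temp file, fsync it, rename() it over
     * the old one. rename() is atomic on POSIX filesystems, so readers see
     * either the old contents or the new, never a torn mix. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int atomic_replace(const char *path, const char *tmp_path,
                       const char *data, size_t len) {
        int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) < 0) { close(fd); return -1; }  /* data durable first */
        if (close(fd) < 0) return -1;
        if (rename(tmp_path, path) < 0) return -1;    /* atomic switch-over */
        /* Strictly, you'd also fsync the containing directory so the rename
         * itself survives a crash. */
        return 0;
    }

    int main(void) {
        const char *msg = "new contents\n";
        return atomic_replace("config.txt", "config.txt.tmp",
                              msg, strlen(msg)) ? 1 : 0;
    }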