Network-attached, replicated storage hedges against data loss but increases latency; most customers, however, prefer higher latency to data loss. As an example, see the highly upvoted fly.io thread[1] full of customers complaining about exactly this.
RAID rebuild times make local RAID an unviable option, and customers typically expect problematic VMs to be live-migrated to other hosts with their disks' data intact.
The self-hosted versions of this are GlusterFS and Ceph, which have the same dynamics as EBS and its equivalents in other cloud providers.
When you say RAID, what level? Software RAID or hardware RAID? What controller?
Let's take the best case:
RAID10 across many small NVMe drives, with a software RAID/volume manager like ZFS that is data-aware and so only rebuilds actual data: even then, a rebuild can degrade performance enough that your application becomes unavailable if you're already running at 70%+ of your maximum IOPS.
And that's the ideal scenario. If you use hardware RAID, which is not data-aware, your rebuild time depends entirely on the size of the drive being rebuilt, and it can punish IOPS even harder during the rebuild, though it will load your CPU less.
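To make that concrete, here's a rough back-of-the-envelope sketch in Python comparing a data-aware resilver with a whole-drive rebuild. The drive size, fill level and rebuild rate are assumed numbers for illustration, not benchmarks:

    # Rough comparison: a data-aware resilver (ZFS-style) copies only the
    # allocated data; a non-data-aware hardware RAID rebuild copies the
    # entire drive. All numbers below are assumptions for illustration.

    def rebuild_hours(bytes_to_copy: float, rate_mb_s: float) -> float:
        """Hours to copy bytes_to_copy at a sustained rate_mb_s."""
        return bytes_to_copy / (rate_mb_s * 1e6) / 3600

    DRIVE_TB = 8            # hypothetical 8 TB drive
    USED_FRACTION = 0.4     # pool assumed to be 40% full
    RATE_MB_S = 500         # assumed sustained rebuild throughput under load

    drive_bytes = DRIVE_TB * 1e12
    print(f"data-aware resilver: {rebuild_hours(drive_bytes * USED_FRACTION, RATE_MB_S):.1f} h")
    print(f"whole-drive rebuild: {rebuild_hours(drive_bytes, RATE_MB_S):.1f} h")

Either way that's hours of sustained extra I/O competing with your application, which is where the IOPS-headroom point above comes from.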
There's no panacea. Most people opt for higher-latency distributed storage, where the RAID is spread across an enormous number of drives, which makes rebuilds much less painful.
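The intuition, in a tiny sketch (illustrative numbers only, assuming the lost replicas are rebuilt in parallel by the surviving drives):

    # When redundancy is spread over many drives, each survivor only has to
    # re-replicate a small slice of the failed drive's data, so per-drive
    # rebuild work shrinks with cluster size. Numbers are assumptions.

    def per_drive_rebuild_hours(failed_tb: float, peers: int, rate_mb_s: float) -> float:
        """Hours for each peer to move its share of the lost data."""
        share_bytes = failed_tb * 1e12 / peers
        return share_bytes / (rate_mb_s * 1e6) / 3600

    FAILED_TB = 8        # hypothetical failed drive
    RATE_MB_S = 200      # assumed rebuild bandwidth budget per peer

    for peers in (1, 10, 100):
        print(f"{peers:>3} peers -> {per_drive_rebuild_hours(FAILED_TB, peers, RATE_MB_S):.2f} h each")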
What I used to do was swap over from the machine with failing disks to a live spare ("slave" in the old, now frowned-upon terminology), do the maintenance, and then replicate back from the now-live spare if I was confident it was all good.
Yes, it's costly to keep the hardware to do that: it mostly meant running several machines, since I always wanted to be able to rebuild one whilst keeping at least two online.