It's also important to consider how often disks will fail when you're operating hundreds of them - it's probably more often than you'd think, and if you don't have someone on staff near your colo provider, you're going to pay a lot in remote-hands fees.
Your colo facility will almost certainly have 24/7 staff on hand who can help you with tasks like swapping disks from a pile of spares, but expect to pay $300+ minimum just to get someone to walk over to your racks, even if the job is 10 mins.
With that said, the cost savings can still be enormous. But know what you're getting into.
Like another comment said, don't bother swapping out disks, just leave the dead ones in place and disable them in software. Then eventually either replace the whole server or get someone on site to do a mass swap of disks. At this scale redundancy needs to be spread between machines anyway, so there's no gain in replacing disks one by one as they die.
That also means you need extra spare disks in the system, which in turn means extra servers, extra racks, extra power feeds, extra cooling, etc.
If you do a 60-disk 4U setup you'll need 1 full rack of those just to get your 10PB, then you'll need yet another one for redundancy. And then a quarter of a rack for hot spares. At that point you have single-redundancy, no file history and no scaling. Is it possible? Sure. Is this something you can do 'on a side track with the people you already have'? Unlikely if you are a startup with no datacenter, no colocation yet, etc.
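Rough version of that capacity math, for reference (the drive size and usable rack units are assumptions, not figures from the thread):

    # Rough capacity math for the 60-disk 4U scenario
    DRIVE_TB = 18           # assumed per-drive capacity
    DISKS_PER_CHASSIS = 60  # 4U top-loader
    CHASSIS_U = 4
    RACK_U = 42             # in practice you lose a few U to switches/PDUs

    chassis_per_rack = RACK_U // CHASSIS_U                              # 10
    raw_tb_per_rack = chassis_per_rack * DISKS_PER_CHASSIS * DRIVE_TB   # 10,800 TB
    print(f"{raw_tb_per_rack / 1000:.1f} PB raw per rack")
    # ~10.8 PB raw per rack, before any redundancy, hot spares or file history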
You don't do redundancy that way at that scale, that's completely insane. You run Ceph or BeeGFS or Windows Storage Server and back up to tape with a tape library. If you've got big bucks (though still peanuts compared to S3) you replicate the entire setup 1:1 at a second site.
The author doesn't want a second site. And at that scale you do redundancy within the requested parameters.
If you set your object store to be resilient to single-partition loss per object (within CAP) you effectively duplicate everything once. If you want more than one, you get into sharding to spread the risk. We're not talking about RAID here, but about replicas or copies.
Windows Storage Server doesn't belong in a setup like this, and neither does tape, since the data needs to be accessible in under 1s. If higher latencies were fine the author would have been able to use something between S3 IA and Glacier. Heck, you could use cold HDD storage for that kind of access. The drives would need to spin up to collect the shards to assemble at least one replica to be able to read the file, but that's still multiple orders of magnitude faster than tape.
I have written a larger post with more numbers, and unless you seriously reduce the features you use, it's not really cheaper than S3 if you start off with no physical IT and no people to support it. It's not that it isn't possible, it's just that you need to spin up an entire business unit for it and at that point you're eating way more cost.
Regardless of the object store (or filesystem, if you want to go full-on legacy style), you still need at least a minimum number of physical bits on disk to be able to store the data. And pretty much no object store supports a 1:1 logical-to-physical storage ratio. It's almost always at least 1:1.66 in degraded mode or 1:2 in minimum operational mode.
>We're not talking about RAID here, but about replicas or copies.
Most distributed filesystems support some form of erasure coding. Ceph does, MinIO does, HDFS does, etc. So no, you don't need to duplicate everything.
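For a rough sense of what erasure coding buys you over straight replication (the k/m combinations below are illustrative examples, not a recommendation for any particular cluster):

    # Physical-to-logical overhead: replication vs. erasure coding
    def overhead(k, m):
        """k data shards + m parity shards -> physical bytes per logical byte."""
        return (k + m) / k

    print(overhead(1, 1))  # 2.0   -> plain 2x replication (the 1:2 case above)
    print(overhead(3, 2))  # 1.67  -> roughly the 1:1.66 case above
    print(overhead(8, 3))  # 1.375 -> survives 3 lost shards at ~1.4x raw storage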
> You're talking about data integrity, this is not the same as redundancy.
To be clear, you're talking about mitigating the risk of data corruption (eg. bits will flip randomly due to cosmic rays or what have you) over time, vs. the risk of outright data loss, yes?
Isn't there some overlap between the solutions?
No, I'm talking about mitigating system failure (be it a dead disk, PHY, entire server, single PDU, single rack or entire feed). I didn't even go down to the level of individual object durability yet (or web access to those objects, consistent access control and the like).
There is some overlap in the sense that having redundant copies makes it possible to replace a bad copy with a good copy if a checksum mismatches on one of them. That also allows for bringing the copy count back in spec if a single copy goes missing (regardless of the type of failure).
But no matter what methods are used, data is data and needs to be stored somewhere. If the bits constituting that data go missing, the data is gone. To prevent that, you need to make sure those bits exist in more than one place. The specific places come with differences in cost, mitigations and effort:
- Two copies on the same disk mitigates bit flips in one copy but not disk failure
- Two copies on two disks on the same HBA mitigates bit flips and disk failure but not HBA failure
The list goes on until you reach the requirement posted at the top of this Ask HN where it is stated that OneZone IA is used. That means it does not need multiple zones for zone-outage mitigation. Effectively that means the racks are allowed to be placed in the same datacenter. So that datacenter being unavailable or destroyed means the data is unavailable (temporarily or permanently), which appears to be the accepted risk.
But within that zone (or datacenter) you would still need all other mitigations offered by the durable object storage S3 provides (unless specified differently - if we just make up new requirements we can make it very cheap and just accept total system failure with 1 bit flip and be done with it).
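As a minimal sketch of what spreading copies across failure domains within a single site means in practice (the disk and rack labels here are made up for illustration):

    # Pick placement targets so no two copies share a failure domain (e.g. rack)
    def place_copies(disks, copies, domain):
        """disks: list of dicts like {"id": "d1", "rack": "r1", "hba": "h1"}.
        Returns `copies` disk ids, each in a distinct `domain` (e.g. "rack")."""
        chosen, used = [], set()
        for d in disks:
            if d[domain] not in used:
                chosen.append(d["id"])
                used.add(d[domain])
            if len(chosen) == copies:
                return chosen
        raise RuntimeError("not enough distinct failure domains for that copy count")

    disks = [{"id": "d1", "rack": "r1", "hba": "h1"},
             {"id": "d2", "rack": "r1", "hba": "h2"},
             {"id": "d3", "rack": "r2", "hba": "h3"}]
    print(place_copies(disks, 2, "rack"))  # ['d1', 'd3'] -- survives losing either rack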
I currently pay about $40 for a half hour of remote hands at a large data center. Modern disks rarely need to be swapped. You can look at Backblaze's published failure rates and do the math yourself if you don't believe me.
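Back-of-envelope version of that math (the annualized failure rate here is an assumption in the ballpark of Backblaze's published numbers, not an exact figure):

    # Expected drive failures per year for a fleet of a given size
    FLEET_SIZE = 600   # e.g. one rack of 60-disk chassis
    AFR = 0.014        # assumed ~1.4% annualized failure rate

    failures_per_year = FLEET_SIZE * AFR
    print(f"~{failures_per_year:.0f} drive failures/year")  # ~8, i.e. well under one a month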
I've used NetApp and Isilon in the past. We didn't change any disks; they did it as part of the maintenance. Not sure how the physical security worked, but they were let in by the data centre staff and did their thing. I think they came in weekly.
The whole solution wasn't cheap though, and all of these extras were baked into the cost. We were getting better-than-S3 costs on a straight per-TB basis, before even considering power, cooling and rack space costs. Network was significantly cheaper than AWS.
Not sure how far these NAS systems scale, but I would expect deep discounts for something of this size.
I've run the math on this for 1PB of similar data (all pictures), and for us it was about 1.5-2 orders of magnitude cheaper over the span of 10 years (our guess for depreciation on the hardware).
Note that we were getting significantly cheaper bandwidth than S3 and similar providers, which made up over half of our savings.
Upfront costs, with networking, racked and stacked, and wired, were far under $100/TB raw, around $40-$60, but this was quite a while ago and I don't know how it looks in the era of 10+TB drives. Also remember that once you are off S3 you are in the situation of doing your own backup, and the use case dictates the required availability when things fail... we didn't need anything online, but we mirrored to a second site. With erasure coding, you can get by with 1.5x copies at each site or so, with a performance hit. So properly backed up with a full double, it's about 3x raw...
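A rough, storage-only sketch of that comparison using the figures above (the S3 price is an assumed One Zone-IA-style rate; bandwidth, power, rent and staff are all left out):

    # Very rough 10-year storage-only comparison for 1 PB of logical data
    LOGICAL_TB = 1000
    RAW_MULTIPLIER = 3.0    # ~1.5x erasure-coded copies at each of two sites
    COST_PER_RAW_TB = 50    # USD, middle of the $40-$60 range quoted above
    S3_PER_GB_MONTH = 0.01  # assumed One Zone-IA-ish storage price
    YEARS = 10

    diy_capex = LOGICAL_TB * RAW_MULTIPLIER * COST_PER_RAW_TB       # ~$150k up front
    s3_storage = LOGICAL_TB * 1000 * S3_PER_GB_MONTH * 12 * YEARS   # ~$1.2M over 10 years
    print(f"DIY hardware: ${diy_capex:,.0f}  vs  S3 storage: ${s3_storage:,.0f}")
    # Opex (power, rent, bandwidth, people) comes on top of the DIY number, and
    # request/transfer charges on top of the S3 one, so this is capex-only.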
Opex will be power, data center rent, and internet access, which are hugely variable. And of course, the personnel will be at least one full-time person who's extremely competent.
Do you remember roughly the math on how much cheaper it was, or how you thought about upfront cost vs. ongoing? Just an order of magnitude would be great.