
> So the leading hypothesis seems to be that perhaps the SSDs were from the same manufacturing batch and shared some defect.

Really sorry that you had to learn the hard way, but this is unfortunately common knowledge :/ Way back (2004) when I was shadowing-eventually-replacing a mentor that handled infrastructure for a major institution, he gave me a rule I took to heart from then forward: Always diversify. Diversify across manufacturer, diversify across make/model, hell, if it's super important, diversify across _technology stacks_ if you can.

It was policy within our (infrastructure) group that /any/ new server or service must be buildable from at least 2 different sources of components before going live, and for mission critical things, 3 is better. Anything "production" had to be multihomed if it connected to the internet.

Need to build a new storage service? Get a Supermicro board _and_ a Tyan (or buy an assortment of Dell & IBM), then populate both with an assortment of drives picked randomly across 3 manufacturers, with purchases spread out across time (we used 3 months) as well as across resellers. Any RAID array with more than 4 drives had to include a hot spare. For even more peace of mind, add a crappy desktop PC with a ton of huge external drives and periodically sync to that.
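Those procurement rules are mechanical enough to lint before ordering. A minimal sketch, assuming a hypothetical drive-plan format (the field names and thresholds here are illustrative, not from the original policy docs):

```python
from collections import Counter
from datetime import date

def check_drive_plan(drives, min_manufacturers=3, min_spread_days=90):
    """Return a list of violations of the diversity rules described above."""
    problems = []
    # Rule: spread drives across at least 3 manufacturers.
    makers = Counter(d["manufacturer"] for d in drives)
    if len(makers) < min_manufacturers:
        problems.append(f"only {len(makers)} manufacturer(s); want >= {min_manufacturers}")
    # Rule: spread purchases out across time (we used ~3 months).
    dates = sorted(d["purchased"] for d in drives)
    span = (dates[-1] - dates[0]).days
    if span < min_spread_days:
        problems.append(f"purchases span {span} days; want >= {min_spread_days}")
    # Rule: any array with more than 4 drives needs a hot spare.
    if len(drives) > 4 and not any(d.get("hot_spare") for d in drives):
        problems.append("array has > 4 drives but no hot spare")
    return problems

plan = [
    {"manufacturer": "Seagate", "purchased": date(2024, 1, 10)},
    {"manufacturer": "WD",      "purchased": date(2024, 2, 20)},
    {"manufacturer": "Seagate", "purchased": date(2024, 3, 5)},
    {"manufacturer": "WD",      "purchased": date(2024, 4, 1), "hot_spare": True},
    {"manufacturer": "Toshiba", "purchased": date(2024, 4, 15)},
]
print(check_drive_plan(plan))  # -> [] (all three rules satisfied)
```

An empty list means the plan passes; otherwise each violated rule comes back as a human-readable string you can put in a pre-purchase checklist.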

He also taught me that it's not done until you do a few live "disaster tests" (yanking drives out of fully powered-up servers during heavy IO. Brutally ripping power cables out, quickly plugging them back in, then yanking them out again once you hear the machine doing something, then plugging back in...), without giving anyone advance notice. Then, and only then, is a service "done".

I thought "Wow, $MENTOR is really into overkill!!" at the time, but he was right.

I credit his "rules for building infrastructure" for having a zero loss track record when it comes to infra I maintain, my whole life.



> this is unfortunately common knowledge

This reminds me of Voltaire: "Common sense is not so common."

Thanks for the great comment—everything you say makes perfect sense and is even obvious in hindsight, but it's the kind of thing that tends to be known by grizzled infrastructure veterans who had good mentors in their chequered past—and not so much by the rest of us.

I fear getting karmically smacked for repeating this too often, but the more I think about it, the more I feel like 8 hours of downtime is not an unreasonable price to pay for this lesson. The opportunity cost of learning it beforehand would have been high as well.


> it's the kind of thing that tends to be known by grizzled infrastructure veterans who had good mentors in their chequered past

And thanks right back at you.

I hadn't noticed before your comment that while not in the customary way (I'm brown skinned and was born into a working class family) I've got TONS of "privilege" in other areas. :D

My life would probably be quite different if I didn't have active Debian and Linux kernel developers just randomly be the older friends helping me in my metaphorical "first steps" with Linux.

Looking back 20+ years ago, I lucked into an absurdly higher than average "floor" when I started getting serious about "computery stuff". Thanks for that. That's some genuine "life perspective" gift you just gave me. I'm smiling. :) I guess it really is hard to see your own privilege.

> 8 hours of downtime is not an unreasonable price to pay for this lesson. The opportunity cost of learning it beforehand would have been high as well.

100% agree.

I'd even say the opportunity cost would have been much higher. Additionally, 8 hours of downtime is still a great "score", depending on the size of the HN organization (a bad 'score' if it's >100 people; an amazing one if it's 1-5 people).


> it's the kind of thing that tends to be known by grizzled infrastructure veterans who had good mentors in their chequered past—and not so much by the rest of us

This is why your systems should be designed by grizzled infrastructure veterans.


That reminds me of Jerry Weinberg's dictum: whenever you hear the word "should" on a software project, replace it with "isn't".

https://news.ycombinator.com/item?id=590075


This is brilliant and I suspect generalises to "won't" when reading RFCs.


That goes along with "almost never" which is a synonym for "sometimes" and "maintenance-free" which is a synonym for "throw it out and buy a new one when it breaks".


"Almost never" -> "more often than you would want"


> Way back (2004) [...] he gave me a rule [...]: Always diversify.

Annoyingly, in 2000-2004, I was trying to get people to understand this and failing constantly because "it makes more sense if everything is the same - less to learn!" Hilariously, I also got the blame when things broke, even though none of them were my choice or design.

(Hell, even in 2020, I hit a similar issue with a single-line Ruby CLI - lots of "everything else uses Python, why is it not Python?" moaning. Because the Python was a lot faffier and less readable!)

edit: to fix the formatting


Didn't Intel grant AMD some kind of license because the US government refused to buy x86 CPUs that had only a single source?


I believe it was IBM, not the government.



